Due Sunday, February 09, 2014 at 23:59:59
Updates
- 2/03 @ 8:00PM - Updated Makefile posted
- 2/04 @ 11:00AM - Updated Makefile reposted
Goals
The objective of this assignment is to get you familiar and comfortable with string manipulation and algorithms in C. You'll also likely get some good experience debugging C code.
Background
grep is a UNIX utility that is used to search for patterns in text files. It's a powerful and versatile tool, and in this assignment you will implement a version that, while simplified, should still be useful.
Your assignment is to complete the implementation of rgrep, our simplified, restricted grep. rgrep is "restricted" in the sense that the patterns it matches support only a few regular expression operators (the easier ones). A pattern is specified on the command line; rgrep then reads lines from its standard input and prints them on its standard output if and only if the pattern "matches" the line. For example, we can use rgrep to search for lines that contain text file names that are at least 3 characters long (plus the extension) in a file like the following:
$ ~/src/hw2$ cat testin # so you can see what lines are in the file
1 fine.txt
2 reallylong.txt
3 abc.txt
4 s.txt
5 nope.pdf
$ ~/src/hw2$ ./rgrep ' ....{0,}\.txt' < testin # note the space in the pattern
1 fine.txt
2 reallylong.txt
3 abc.txt
What's going on here? rgrep was given the pattern " ....{0,}\.txt"; it printed only the lines from its standard input that matched this pattern. How can you tell if a line matches the pattern? A line matches a pattern iff the pattern "appears" somewhere inside the line. In the absence of any special operators, seeing if a line matches a pattern reduces to seeing if the pattern occurs as a substring anywhere in the line. So for most characters, their meaning in a pattern is just to match themselves in the target string. However, there are a few special clauses you must implement:
Operator | Meaning |
---|---|
. (period) | Matches any character (including newlines or spaces). |
{m,n} (bracket-clause) | The preceding character may appear between m and m+n times. If n is omitted, the preceding character may appear m or more times. You may assume that m and n are non-negative decimal integers of reasonable size. Note that this differs from most regexp implementations, which read {m,n} as matching between m and n appearances of the preceding character. |
\ (backslash) | "Escapes" the following character, nullifying any special meaning it has. |
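As a rough illustration of the bracket-clause rule, the sketch below reads one clause and turns it into a minimum and maximum repeat count. The name parse_clause and its parameters are hypothetical (they are not part of the provided skeleton), and the code relies on the assignment's promise that clauses are well formed.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the provided skeleton.
 * `p` must point at the '{' of a clause such as "{2,3}" or "{1,}".
 * On success, stores the minimum count in *min and the maximum in *max
 * (-1 meaning "no upper bound") and returns the number of pattern
 * characters the clause occupies. Returns 0 if `p` is not a clause. */
int parse_clause(const char *p, long *min, long *max)
{
    const char *start = p;
    char *end;

    if (*p != '{')
        return 0;
    p++;

    long m = strtol(p, &end, 10);
    if (end == p || *end != ',')    /* need digits followed by a comma */
        return 0;
    p = end + 1;

    if (*p == '}') {                /* n omitted: m or more */
        *min = m;
        *max = -1;
        return (int)(p - start) + 1;
    }

    long n = strtol(p, &end, 10);
    if (end == p || *end != '}')    /* need digits followed by '}' */
        return 0;

    *min = m;
    *max = m + n;                   /* this assignment's rule: m to m+n */
    return (int)(end - start) + 1;
}
```

Under these assumptions, "{2,3}" yields a minimum of 2 and a maximum of 5, whereas most regexp implementations would read it as 2 to 3.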
So, here are some examples of patterns and the kind of lines they match:
Pattern | Matches |
---|---|
\{ | Matches lines that contain an open brace. |
heyy{0,} | Matches lines that contain the string "hey" followed by any number of y's. |
h.d.{1,}n | Matches lines that contain substrings like "hidden", "hidin", "hbdwabcdefgn", "hadbn", etc. |
cu\.{0,2}t | Matches lines that contain the substring "cut", "cu.t", or "cu..t". |
These are the only special characters you have to handle. With the exception of the null char that terminates a string, you should not have to handle any other character (like newlines and spaces) in any special way. You may assume that your code will not be run against patterns that don't make sense, like "lol{0,}{0,}" and "{0,}oops". You must follow the spec strictly - so #including a regular expression library will likely turn out badly for you.
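To make the period and backslash rules concrete, here is a minimal sketch that handles only literal characters, '.', and '\' escapes. It deliberately does not handle bracket clauses, so it is not a solution to the assignment. Both function names are hypothetical.

```c
/* Hypothetical helper: does `pattern` match starting exactly at `line`?
 * Handles only literals, '.', and '\' escapes.
 * Bracket clauses are NOT handled in this sketch. */
static int matches_here_simple(const char *line, const char *pattern)
{
    while (*pattern != '\0') {
        int any = 0;                /* 1 if this element is an unescaped '.' */
        char want = *pattern;

        if (*pattern == '\\' && pattern[1] != '\0') {
            want = pattern[1];      /* escaped: take the next char literally */
            pattern += 2;
        } else {
            any = (*pattern == '.');
            pattern += 1;
        }

        if (*line == '\0')          /* input ended before the pattern did */
            return 0;
        if (!any && *line != want)
            return 0;
        line++;
    }
    return 1;                       /* whole pattern consumed */
}

/* Hypothetical helper: try every starting offset in the line,
 * including the empty tail so that an empty pattern matches. */
int contains_simple(const char *line, const char *pattern)
{
    for (;; line++) {
        if (matches_here_simple(line, pattern))
            return 1;
        if (*line == '\0')
            return 0;
    }
}
```

With this sketch, the pattern \.txt is found in "3 abc.txt" but not in "3 abcxtxt", while the unescaped pattern .txt is found in both.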
Getting started
Copy the framework files (Makefile, matcher.c, matcher.h, rgrep.c) into your working directory:
$ cp -R ~cs61c/hw/02 hw2
To compile, type:
$ make
To run against a particular pattern, use
$ ./rgrep pattern
The skeleton code handles reading lines from standard input and printing them out for you; you must implement the function int rgrep_matches(char *line, char *pattern) in matcher.c, which returns true if and only if the line matches the pattern. You may change matcher.c however you want, but please do not modify any other files. The autograder will overwrite the other files in your submission with reference versions, so you only need to submit matcher.c.
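For orientation, a driver loop of the kind described might look roughly like the sketch below. This is not the actual contents of rgrep.c: filter_lines and MAX_LINE are hypothetical, and rgrep_matches is replaced by a plain substring search so the example compiles on its own.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed limit; the real skeleton may differ */

/* Stand-in so this sketch compiles on its own: plain substring search.
 * In the assignment, this is the function you write in matcher.c. */
int rgrep_matches(char *line, char *pattern)
{
    return strstr(line, pattern) != NULL;
}

/* Hypothetical driver loop: copy matching lines from `in` to `out`.
 * Returns the number of lines printed. */
int filter_lines(FILE *in, FILE *out, char *pattern)
{
    char buf[MAX_LINE];
    int printed = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        if (rgrep_matches(buf, pattern)) {
            fputs(buf, out);
            printed++;
        }
    }
    return printed;
}
```

Because fgets keeps the trailing newline, a loop written this way passes lines that may end in '\n' to rgrep_matches.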
Testing
Be sure to test on a hive machine. For a quick sanity check to see if your solution is on the right track, type:
$ make check
Note that this doesn't mean your solution will receive full points, since the autograder will be running a much larger suite of test cases. You should test your code to make sure that it properly matches lines against patterns. One way to do this is to create a text file with the lines you want to test against, say test_input.txt, and then verify that running ./rgrep pattern < test_input.txt prints only the lines that you think should match the pattern, and no others. You can also just invoke ./rgrep pattern and type lines, and verify that rgrep repeats lines iff they match the pattern.
Note that your shell might interpret the backslash operator for you, which is not what you want. For example, when you type at your shell
$ ./rgrep \.hi < input.txt
your program might get the pattern ".hi" because the shell interpreted the backslash before it got passed to your program. The solution is to put the pattern in single quotes, so what you want to type is:
$ ./rgrep '\.hi' < input.txt
This should ensure that your pattern operators aren't expanded or consumed by the shell. You should also be aware that input lines may end with a newline character, which the 'period' character will match.
Grading
You will be autograded on a hive machine and your score will depend on the tests your code passes. You won't get any points if your code doesn't compile. Each feature will be worth the following:
Feature | Points |
---|---|
Patterns without special characters. | 5 |
Patterns with dots. | 5 |
Patterns with backslashes and dots. | 5 |
Patterns with bracket-clauses. | 5 |
Patterns with backslashes, dots, and bracket-clauses. | 5 |
Submission
When you are ready to submit, go into the directory that contains your matcher.c file. You should only submit matcher.c. From within that directory, run submit hw2. You can submit multiple times; we'll grade the latest submission.
Suggestions
Many people have found this assignment really challenging in the past. Here are some suggestions that may help you:
- Start early.
- Make use of gdb (you'll learn about it in lab2). At the very least, make use of its ability to tell you exactly where a segmentation fault is occurring.
- There are at least two reasonably simple approaches you can take. (If you want to solve this problem entirely on your own, you should stop reading.) One way is to use recursion (you've probably done similar problems in cs61a). The tricky part is handling all cases when you see a bracket clause. The second way (which I think may be simpler) is to keep track of all positions in the input string you could have matched up to as you read the pattern. For instance, after reading "aa" in the pattern, there are three possible positions in "aaabcaad" where you could be: just after the second, third, and seventh characters.
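The position-tracking idea can be sketched for the simplest case, a pattern made only of literal characters. The function track_positions and its reach array are hypothetical names, and the sketch leaves out '.', '\', and bracket clauses entirely, so it shows the bookkeeping only and is not a complete matcher.

```c
#include <string.h>

/* Hypothetical helper illustrating position tracking for a pattern made
 * only of literal characters (no '.', '\', or bracket clauses).
 * On return, reach[i] is 1 if the pattern can end at offset i of `line`,
 * i.e. some occurrence of the pattern ends just before line[i].
 * `reach` must have room for strlen(line) + 1 entries.
 * Returns how many positions are reachable. */
int track_positions(const char *line, const char *pattern, char *reach)
{
    size_t n = strlen(line);
    size_t i;
    int count = 0;

    /* Before reading any pattern character, a match may start anywhere. */
    for (i = 0; i <= n; i++)
        reach[i] = 1;

    for (; *pattern != '\0'; pattern++) {
        /* Walk right to left so reach[i - 1] still holds the old state. */
        for (i = n; i > 0; i--)
            reach[i] = reach[i - 1] && line[i - 1] == *pattern;
        reach[0] = 0;   /* offset 0 cannot follow a consumed character */
    }

    for (i = 0; i <= n; i++)
        count += reach[i];
    return count;
}
```

For the line "aaabcaad" and the pattern "aa", this marks offsets 2, 3, and 7, the three positions mentioned above. In this scheme the line matches if at least one position is still reachable after the whole pattern has been read.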