CS61C Fall 2018 HW2: restricted grep (rgrep)

HW 2 TAs: Jonathan Murata, Sruthi Veeragandham. Originally by Conor Hughes.

Due Friday, September 7, 2018 at 23:59:59

UPDATES

9/2 10:40pm: Changes to the rgrep.c starter code to ignore newlines. This only affects you if you were testing with a pattern that ended with a period.

9/3 1:30am: Changes to the rgrep.c starter code to print newlines for clarity in testing. Please pull the starter code again. This should not create any merge conflicts, since you shouldn't have modified rgrep.c.

Goals

The objective of this assignment is to get you familiar with string/pointer manipulation and C code. You'll also likely get lots of good experience debugging with cgdb (or enhance your ability to use print statements).

Background

grep is a UNIX utility that is used to search for patterns in text files. It's a powerful and versatile tool, and in this assignment you will implement a version that, while simplified, should still be useful.

Your assignment is to complete the implementation of rgrep, our simplified version of grep. rgrep is "restricted" in the sense that the patterns it matches only support a few regular operators (the easier ones). The way rgrep is used is that a pattern is specified on the command line. rgrep then reads lines from its standard input and prints them out on its standard output if and only if the pattern "matches" the line or some subsection of it. For example, we can use rgrep to search for lines that contain filenames that are at least 3 characters long (plus the ".txt") as follows:


# show the contents of test_file.in:
$ cat test_file.in

fine.txt
reallylong.txt
abc.txt
s.txt
nope.pdf

$ ./rgrep '...+\.txt' < test_file.in

fine.txt
reallylong.txt
abc.txt
    

What's going on here? rgrep was given the pattern "...+\.txt"; it printed only the lines from its standard input (the contents of test_file.in) that matched this pattern. How can you tell if a line matches the pattern? A line matches a pattern iff the pattern "appears" somewhere inside the line. In the absence of any special operators, seeing if a line matches a pattern reduces to seeing if the pattern occurs as a substring anywhere in the line. So for most characters, their meaning in a pattern is just to match themselves in the target string. However, there are a few special clauses you must implement:

. (period) Matches any character.
+ (plus sign) The preceding character should appear 1 or more times. + is not greedy. It should give back characters to check for a match.
? (question mark) The preceding character should appear either 0 or 1 times (i.e. the preceding character is optional).
\ (backslash) "Escapes" the following character, nullifying any special meaning it has.

So, here are some examples of patterns and the kind of lines they match:

( An open parenthesis must appear somewhere in the line.
.+ Match any non-empty string.
hey+ Matches a line that contains the string "hey" followed by any number (0 or more) of y's, such as heyyyyyy (hey followed by 5 y's).
str?ing Matches lines that contain the substrings "string" or "sting", since the "r" is optional.
z.z\.txt  Matches lines that contain the substring "zaz.txt", "zbz.txt", etc., where the character between the z's can be anything, including a period or space.

These are the only special characters you have to handle. With the exception of the null character that terminates a string, you should not have to handle other characters in any special way. You may assume that your code will not be run against patterns that don't make sense.

Getting Started

Fill out the Homework Repository Registration form. This form will help you create a private GitHub Classroom repository and link your student ID to your CS 61C login and GitHub repository, which will be required to identify your work and assign grades. This is the same process as the lab repository setup in lab 0.

Clone your homework repository into a desired location with

$ git clone <homework repository link here> 

Go to your repository and add the CS 61C homework starter repository with

$ git remote add hw-starter https://github.com/61c-teach/fa18-hw-starter.git

Pull the starter code (Makefile, matcher.c, matcher.h, rgrep.c) into your working directory

$ git pull hw-starter master

The skeleton code handles reading lines from standard input and printing them out for you; you must implement the function int rgrep_matches(char *line, char *pattern) in matcher.c, which returns true if and only if the string contains the pattern. You may also choose to implement matches_leading and scan, which are also in matcher.c. You may change matcher.c however you want, but do not modify any other files.

Testing & Debugging

You should test your code to make sure that it properly matches lines against patterns. Be sure to test on a hive machine.

To compile your code, type

$ make

After compiling your code, to test a single input "string" against 'pattern', type

$ echo -e "string" | ./rgrep 'pattern' 

For more comprehensive tests against 'pattern', you may want to put inputs in a text file and run

$ ./rgrep 'pattern' < test_file.txt

For both of the above, ensure that rgrep prints only the lines that you think should match the pattern, and no others.


If you want to manually input words to test a particular pattern, use

$ ./rgrep pattern

Once here, type in any string of characters (ending with a newline character), and rgrep will reprint your input if it matches the pattern.


Note that for any of the above operations, your shell might interpret the backslash operator for you. This is not what you want. For example, when you type at your shell

$ ./rgrep \.hi < input.txt

your program might get the pattern ".hi" because the shell interpreted the backslash before it got passed to your program. The solution is to put the pattern in single quotes, so what you want to type is

$ ./rgrep '\.hi' < input.txt

This should ensure that your pattern operators aren't expanded or consumed by the shell.

Make Check

For a quick sanity check to see if your solution is on the right track, type

$ make check

To interpret the output of this command, check the message at the very bottom. If it gives you an error, your code either failed to complete or did not produce the correct output. The bottommost line marked "test" is the test that you failed on. If you reach the message "Passed sanity check", Congratulations! You are on your way to finishing this homework! Note that this doesn't mean your solution will receive full points, since the autograder will be running a much larger suite of test cases.

If you wish to replicate the sample tests from make check, run them from the command line with

$ test "`echo -e "a\nb\nc" | ./rgrep 'a'`" = "a"

If you find yourself stuck on passing one of these tests, remember that cgdb is a useful tool in debugging C projects. Run cgdb on rgrep with

$ cgdb rgrep

Within cgdb, you can run tests on your files with

$ run 'pattern' < test_file.txt

and you can pass in the make check test inputs to find where the bug is. Since this homework spans multiple files, make sure you are setting your breakpoints in the right spots!

Grading & Submission

You will be autograded on Gradescope, which will produce the exact same results as if you ran it on a hive machine. Your score will depend on the tests your code passes. You won't get any points if your code doesn't compile. If we catch any instance of cheating (importing regex libraries, copying code, etc.), we will follow the Academic Dishonesty and Cheating procedure explained on the course website.

When you are ready to submit, commit and push your code to your github student repository. On gradescope, go to the Homework 2 assignment. Choose your github student repository and select the branch your finished assignment is on. If you don't know the branch you are on, you can check by running

$ git status

in your cloned repo. While you are submitting all your files in the commit, only matcher.c will be used in grading. You can submit multiple times; we'll grade the latest submission.