EE122 Project #2: Simple Web Crawler

Page Last Updated: Oct 30, 8:30PM
Initial checkpoint at 11PM, October 18
Full project due October 26, by 11PM

Updates and Announcements

  • Oct 30, 2006 8:30pm: Test cases and evaluation script used for final evaluation released.
  • Oct 25, 2006 6:55pm: Error in latest test cases corrected
  • Oct 25, 2006: Added latest evaluation scripts and test cases
  • Oct 15, 2006: More clarifications about the project specs
  • Oct 14, 2006: Added clarifications about checkpoint requirements
  • Oct 14, 2006: Added Checkpoint submission instructions
  • Oct 11, 2006: Added FAQ about usage of libraries
  • Oct 7, 2006 : Added exit code requirement
  • Oct 7, 2006 : Added test cases and evaluation script (version 0.1)
  • Oct 3, 2006 : Grading rubric for Project 2 changed. Added WriteToDisk feature. Reduced extra credit points.

    Introduction

    In Project 1, you wrote a client which sends data to a server that simply echoes it back. The goal of the project was to learn socket programming and to get exposed to the client-server networking paradigm. In this project, we will look at a more complex application that uses the client-server paradigm - the World Wide Web. You will leverage the socket programming skills you gained in Project 1 to write a program that will interact with Web servers via the HTTP protocol. This project will also illustrate how you develop a client that implements an existing text-based protocol. In the next project (Project 3), you will undertake designing a new protocol, as well as exploring peer-based network interactions rather than client/server.

    The browser (Firefox, Safari, Internet Explorer, etc.) you use to browse the Web is a common example of an HTTP client that interacts with Web servers (e.g., http://www.cnn.com) via the HTTP protocol. The goal of this project is to build a simple HTTP client that automatically crawls Web pages looking for a user-supplied keyword. Automated Web crawlers are often referred to as robots or spiders. For example, GoogleBot is the robot that crawls and indexes the Web pages that show up in Google search results. Of course, the robot you will build in this project will be much simpler than GoogleBot. In the remainder of the project description page, we will refer to our robot crawler as KeywordHunter.

    KeywordHunter - Specifications

    1. Input

      KeywordHunter must accept the following parameters from the command line:

      • StartURL : The URL of the page from which crawling starts. This URL will be of the form http://hostname:port/path_to_file (e.g., http://www.cnn.com:80/index.html) or http://hostname/path_to_file (e.g., http://www.cnn.com/index.html). When the port number is unspecified, it defaults to port 80.

      • SearchKeyword : The keyword we are looking for. It will be at most 100 characters long. The keyword will exist at a depth of 5 or less, starting at StartURL; otherwise it is deemed to be not found.

        The following example will clarify the meaning of depth: Let us refer to the page at StartURL as S. S is at depth 1. Let B1, B2 and B3 be pages linked to from inside S. B1, B2 and B3 are at depth 2. Pages linked to from inside B1, B2 and B3 (but not linked to from inside S) will be at depth 3 and so on.


      • OutputDir : If this parameter is present, then all successfully fetched pages must be written to OutputDir.
      To enable automated testing, the KeywordHunter executable must be called KeywordHunter (case sensitive). For example, your program must be invokable from the command line as follows:
      KeywordHunter http://www.cnn.com/index.html FindMeTxt OutputDir

      or

      KeywordHunter http://www.cnn.com/index.html FindMeTxt

      Exit Code : KeywordHunter must exit with exit code 0 if no error occurred. If some error occurred (e.g., invalid command-line arguments), the program must exit with code 1. You can set a program's exit code by calling exit(TheDesiredCode) or by specifying the return value of main(). Note that an unfound keyword is NOT an error.
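
      For illustration, here is a minimal sketch (one possible approach, not a required structure) of splitting StartURL into hostname, port and resource path, with the port defaulting to 80. Treating a missing path as /index.html mirrors the rule given later for links and is an assumption here.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Sketch: parse "http://hostname[:port][/path]" into its parts.
         * Returns 0 on success, -1 on a malformed URL. */
        static int parse_url(const char *url, char *host, size_t hostlen,
                             int *port, char *path, size_t pathlen)
        {
            const char *prefix = "http://";
            if (strncmp(url, prefix, strlen(prefix)) != 0)
                return -1;
            const char *p = url + strlen(prefix);

            /* The host ends at ':' (port follows), '/' (path follows) or end of string. */
            size_t hlen = strcspn(p, ":/");
            if (hlen == 0 || hlen >= hostlen)
                return -1;
            memcpy(host, p, hlen);
            host[hlen] = '\0';
            p += hlen;

            *port = 80;                        /* default when no port is given */
            if (*p == ':') {
                *port = atoi(p + 1);
                p = strchr(p, '/');            /* skip ahead to the path, if any */
            }
            /* Assumption: a missing path is treated as /index.html. */
            snprintf(path, pathlen, "%s", (p && *p) ? p : "/index.html");
            return 0;
        }

        int main(int argc, char *argv[])
        {
            char host[256], path[1024];
            int port;
            if (argc < 2 || parse_url(argv[1], host, sizeof host, &port,
                                      path, sizeof path) != 0) {
                fprintf(stderr, "usage: %s URL\n", argv[0]);
                return 1;                      /* exit code 1 on bad arguments */
            }
            printf("host=%s port=%d path=%s\n", host, port, path);
            return 0;                          /* exit code 0: no error */
        }

      For example, given http://www.cnn.com:80/index.html this prints host=www.cnn.com port=80 path=/index.html.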


    2. Functionality

      KeywordHunter must use the HTTP GET request to fetch StartURL. It searches the fetched page for SearchKeyword. If the keyword is found, KeywordHunter just reports the page URL and line (more about the output format later) and stops. If the keyword is not found, it fetches the pages linked to from the current page (more about how to detect these later) and searches them. This process is carried out recursively until you either find SearchKeyword or give up after crawling exhaustively to the maximum depth of 5.

      KeywordHunter must be able to fetch pages from existing HTTP servers (e.g., http://www.cs.berkeley.edu/index.html). KeywordHunter needs to implement the GET request of HTTP 1.1. RFC 2616 describes the complete HTTP 1.1 protocol. HTTP Made Really Easy is a simple tutorial that will teach you most of what you need for this project.
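
      To make the overall control flow concrete, here is a minimal sketch of a depth-limited crawl with a visited set. The fetching, keyword search and link extraction it mentions in comments are placeholders for code you will write yourself; the sketch only illustrates the depth bookkeeping and duplicate suppression, not a required program structure.

        #include <stdio.h>
        #include <string.h>

        #define MAX_DEPTH 5
        #define MAX_URLS  1024

        struct work_item { char url[1024]; int depth; };

        static struct work_item queue[MAX_URLS];   /* pages waiting to be fetched */
        static int head = 0, tail = 0;

        static char seen[MAX_URLS][1024];          /* URLs already visited */
        static int nseen = 0;

        static int already_seen(const char *url)
        {
            for (int i = 0; i < nseen; i++)
                if (strcmp(seen[i], url) == 0)
                    return 1;
            return 0;
        }

        static void enqueue(const char *url, int depth)
        {
            if (tail < MAX_URLS && depth <= MAX_DEPTH) {
                snprintf(queue[tail].url, sizeof queue[tail].url, "%s", url);
                queue[tail].depth = depth;
                tail++;
            }
        }

        int crawl(const char *start_url, const char *keyword)
        {
            (void)keyword;                    /* searched for in the stubbed-out part below */
            enqueue(start_url, 1);            /* the page at StartURL is at depth 1 */
            while (head < tail) {
                struct work_item *w = &queue[head++];
                if (already_seen(w->url))
                    continue;                 /* the real program decides this when the link is
                                                 discovered, printing an already_seen_ignore LINK line */
                snprintf(seen[nseen++], sizeof seen[0], "%s", w->url);

                /* Here the real program would check robots.txt, issue the GET,
                 * search each line of the body for the keyword (stopping if found),
                 * and call enqueue(link, w->depth + 1) for every link it extracts. */
            }
            return 0;                         /* keyword not found within MAX_DEPTH */
        }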

    3. To Implement or Not To Implement, That is the Question
      1. Subset of HTTP 1.1 : You do not have to implement the whole HTTP 1.1 client standard. KeywordHunter should be able to send an HTTP GET request and parse the response. You need to distinguish between only two types of replies: an HTTP response code of 200 is treated as successful page retrieval; any other response code is treated as an error. EXTRA CREDIT: Handle other types of responses, for example redirects.

      2. User-Agent : KeywordHunter must identify itself while making the GET request. This means that it must include the following header in the GET request:

        User-Agent: ee122-KeywordHunter
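
        For example, the full request might be assembled as in the sketch below (besides User-Agent, the Host header is required by HTTP 1.1, and Connection: close is discussed in item 8 below); the exact formatting is illustrative, not prescribed.

          #include <stdio.h>

          int main(void)
          {
              const char *host = "www.cnn.com";       /* example values */
              const char *path = "/index.html";
              char request[2048];

              snprintf(request, sizeof request,
                       "GET %s HTTP/1.1\r\n"
                       "Host: %s\r\n"
                       "User-Agent: ee122-KeywordHunter\r\n"
                       "Connection: close\r\n"
                       "\r\n", path, host);

              fputs(request, stdout);                 /* the real program send()s this over the socket */
              return 0;
          }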

      3. robots.txt : Many websites do not want to be crawled by automated programs like our KeywordHunter. The administrators of these sites put a file called robots.txt in the top level directory of the Web site to request robots to stay away (e.g., the ones used by the Wikipedia site: http://en.wikipedia.org/robots.txt). This file contains lines of the form:

        User-agent: WebCopier
        Disallow: /

        User-agent: *
        Disallow: /trap/
        Crawl-delay: 1

        The above lines (taken from http://en.wikipedia.org/robots.txt) indicate that the user agent named WebCopier must not access any page whose path begins with "/"; in effect, WebCopier may not access any files. The "*" matches all user agents: all user agents are prohibited from accessing anything under the subdirectory trap/, and all user agents are requested not to crawl faster than one page per second. If a Web site does not have a robots.txt entry, you are free to access any page. For more information about robots.txt, please visit The Web Robots Page.

        KeywordHunter must honor robots.txt. This means that before accessing a URL, you must fetch the robots.txt page of the site (using GET, similar to fetching any other Web page), parse it for entries that prevent your user-agent from accessing the page you are interested in, and not fetch the page if an entry is found. You must also honor the Crawl-delay, which you can do by calling the standard sleep() library function prior to issuing each new GET request.
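
        As a rough illustration, the following sketch checks a resource path against an already-fetched robots.txt body using simplified matching (exact agent name or "*", prefix-based Disallow). It is only one possible approach and glosses over details such as records that list several user agents.

          #include <stdio.h>
          #include <string.h>
          #include <strings.h>   /* strcasecmp, strncasecmp */

          /* Returns 1 if 'path' is disallowed for 'agent' by the robots.txt text in 'robots'. */
          static int path_disallowed(const char *robots, const char *agent, const char *path)
          {
              char body[8192];
              snprintf(body, sizeof body, "%s", robots);   /* copy: strtok modifies its input */

              int applies = 0;                             /* does the current record cover us? */
              for (char *line = strtok(body, "\r\n"); line; line = strtok(NULL, "\r\n")) {
                  if (strncasecmp(line, "User-agent:", 11) == 0) {
                      const char *name = line + 11;
                      while (*name == ' ') name++;
                      applies = (strcmp(name, "*") == 0 || strcasecmp(name, agent) == 0);
                  } else if (applies && strncasecmp(line, "Disallow:", 9) == 0) {
                      const char *prefix = line + 9;
                      while (*prefix == ' ') prefix++;
                      if (*prefix && strncmp(path, prefix, strlen(prefix)) == 0)
                          return 1;                        /* path starts with a disallowed prefix */
                  }
              }
              return 0;
          }

          int main(void)
          {
              const char *robots = "User-agent: *\nDisallow: /trap/\nCrawl-delay: 1\n";
              printf("%d\n", path_disallowed(robots, "ee122-KeywordHunter", "/trap/page.html")); /* 1 */
              printf("%d\n", path_disallowed(robots, "ee122-KeywordHunter", "/index.html"));     /* 0 */
              return 0;
          }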

      4. Parsing HTML:

        You do NOT have to implement full-fledged HTML parsing. You only need to follow links of the form <a href="xyz" ...> that appear entirely on one line. Hence you can use a standard string-matching function to find the pattern a href=" and interpret the text after it, up until the closing quote, as the link. Note that a link may be a full-fledged URL (e.g., http://www.google.com/index.html), a URL without the resource path (e.g., http://www.google.com), an absolute file path on the same server as the current page (e.g., /photos/january.html) or a relative file path from the current page (e.g., january.html, assuming the current page is located in the photos/ subdirectory). You must handle all of these different types of links. In the case of a URL without a resource path, assume that we are fetching the file /index.html.

        Sometimes the text "a href" will be inside an HTML comment. To make parsing simple, you do not have to keep track of HTML comments. If something matches the pattern mentioned earlier, it is considered a link, irrespective of whether it is inside a comment or not.

        In addition, please note that there might be multiple links per line.

        The keyword being searched for will always be on a single line. So you can again apply a simple string-matching function to each line in order to look for the keyword.
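
        A minimal sketch of this string-matching approach, handling several links on one line, is shown below; it just prints the raw href values, whereas the real program would resolve each one against the current page's URL and decide whether to follow it.

          #include <stdio.h>
          #include <string.h>

          static void extract_links(const char *line)
          {
              const char *p = line;
              while ((p = strstr(p, "a href=\"")) != NULL) {
                  p += strlen("a href=\"");
                  const char *end = strchr(p, '"');        /* the link runs to the closing quote */
                  if (end == NULL)
                      break;                               /* malformed: no closing quote */
                  printf("link: %.*s\n", (int)(end - p), p);
                  p = end + 1;                             /* keep scanning for more links */
              }
          }

          int main(void)
          {
              extract_links("<p><a href=\"february.html\">Feb</a> "
                            "<a href=\"http://www.cnn.com/world.html\">CNN</a></p>");
              return 0;
          }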

      5. Storing Pages to Disk

        All successfully fetched pages should be written to the OutputDir directory if the OutputDir command line argument is specified. For example, if you successfully fetch a page at the URL http://www.MyPhotos.com/photos/january.html, a file named january.html with the contents of the page must be created in OutputDir. All files should be created in the top-level directory OutputDir - you should NOT create the file january.html in a subdirectory called photos inside OutputDir.
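
        A minimal sketch of this step, assuming the page body is already held in memory and using the basename-into-OutputDir naming from the example above:

          #include <stdio.h>
          #include <string.h>

          /* Writes 'body' to OutputDir/<basename of respath>. Returns 0 on success. */
          static int save_page(const char *outdir, const char *respath,
                               const char *body, size_t len)
          {
              /* Keep only the part after the last '/', e.g. "/photos/january.html"
               * becomes "january.html". */
              const char *slash = strrchr(respath, '/');
              const char *name = slash ? slash + 1 : respath;

              char filename[1024];
              snprintf(filename, sizeof filename, "%s/%s", outdir, name);

              FILE *f = fopen(filename, "wb");
              if (f == NULL)
                  return -1;
              fwrite(body, 1, len, f);
              fclose(f);
              return 0;
          }

          int main(void)
          {
              const char *body = "<html><body>hello</body></html>\n";
              /* Example: writes OutputDir/january.html (the directory must already exist). */
              return save_page("OutputDir", "/photos/january.html", body, strlen(body)) ? 1 : 0;
          }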

      6. Page Encoding:

        The page sent back in response to a GET request can be encoded in different ways. You need to handle only the default (identity) encoding. Pages with other encodings may be ignored. EXTRA CREDIT: Implement parsing of pages with chunked encoding.

      7. Folded Headers:

        Usually a header is contained on a single line. Sometimes, however, a header is broken into multiple lines using a technique called folding. You do NOT have to implement support for folded headers. EXTRA CREDIT: Implement support for folded headers.

      8. Persistent Connections:

        You do not have to implement persistent connections. Send a Connection: close header as part of the GET request to tell the server to close the TCP connection after sending the complete page. EXTRA CREDIT: Implement persistent connections.

    4. Output

      To enable automated testing of your program, please follow the output instructions carefully.

      Link: For each link it finds, KeywordHunter should print (to stdout) a line of the form:

      LINK:host=<HOSTNAME>;port=<PORT>;respath=<RESOURCEPATH>;action=<ACTIONCODE>

      ACTIONCODE can take the following values:

      Value                 Meaning
      follow                This link was followed, i.e. you tried to fetch the page. The fetch may have succeeded or failed; either way, you attempted it.
      robots_ignore         This link was not followed because the site's robots.txt prohibited it.
      already_seen_ignore   This link was not followed because we have seen this page previously.

      Example: Suppose you are currently parsing the page http://www.MyPhotos.com/photos/january.html. You find the following four links in it, in the following order:

      february.html
      /help/OnlyForHumans.html (robots.txt prohibits access to this page)
      http://www.google.com (we have seen this page before)
      http://www.cnn.com/world.html

      You will output the following lines (in order):

      LINK:host=www.MyPhotos.com;port=80;respath=/photos/february.html;action=follow
      LINK:host=www.MyPhotos.com;port=80;respath=/help/OnlyForHumans.html;action=robots_ignore
      LINK:host=www.google.com;port=80;respath=/index.html;action=already_seen_ignore
      LINK:host=www.cnn.com;port=80;respath=/world.html;action=follow

      Pages Fetched/Failed: For each link it follows, KeywordHunter should print (to stdout) a line of the form:

      PAGE:host=<HOSTNAME>;port=<PORT>;respath=<RESOURCEPATH>;status=<STATUSCODE>

      The status code is 0 if there was some error sending the GET request to the server - for example, the server's IP address resolution failed, or a TCP connect to the server failed. If you were able to send the GET request, the status code is the HTTP response code. For example if the page was successfully retrieved, the status code will be 200.

      The LINK and PAGE lines should be printed in the order in which the corresponding pages are encountered during your processing. For example, if you follow link A before link B, the LINK line for A should come before the LINK line for B.

      Found Keyword: If the search keyword is found in a particular page, you should output a line of the form:

      FOUND_KEYWORD:host=<HOSTNAME>;port=<PORT>;respath=<RESOURCEPATH>;line=<LINE>

      The LINE is the line (the actual text, including HTML tags) on which the keyword was found. If the keyword is not found after searching all the pages up to the maximum depth, output the following line:

      NOT_FOUND_KEYWORD:

      Your own output lines: You may print any number of lines to stdout (say, for debugging). Just ensure that they do not start with LINK:, PAGE:, FOUND_KEYWORD: or NOT_FOUND_KEYWORD:.
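
      One simple way to keep these formats consistent is to route all required output through small helper functions, as in the sketch below (the values passed in main are just the example ones from above):

        #include <stdio.h>

        static void print_link(const char *host, int port, const char *respath, const char *action)
        {
            printf("LINK:host=%s;port=%d;respath=%s;action=%s\n", host, port, respath, action);
        }

        static void print_page(const char *host, int port, const char *respath, int status)
        {
            printf("PAGE:host=%s;port=%d;respath=%s;status=%d\n", host, port, respath, status);
        }

        static void print_found(const char *host, int port, const char *respath, const char *line)
        {
            printf("FOUND_KEYWORD:host=%s;port=%d;respath=%s;line=%s\n", host, port, respath, line);
        }

        int main(void)
        {
            print_link("www.MyPhotos.com", 80, "/photos/february.html", "follow");
            print_page("www.MyPhotos.com", 80, "/photos/february.html", 200);
            print_found("www.MyPhotos.com", 80, "/photos/february.html",
                        "<p>The keyword FindMeTxt appears on this line</p>");
            /* If no page contains the keyword: puts("NOT_FOUND_KEYWORD:"); */
            return 0;
        }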

    Testing

    Packet Traces

    1. You should include in your project submission packet traces recorded using tcpdump demonstrating your HTTP interactions with the server for a particular test case.
    2. Make sure your traces include the entire packet (use the -s 0 option; refer to the tcpdump tutorial covered in section). An example invocation is shown after this list.
    3. Submit the binary file generated using the -w option and NOT the ASCII tcpdump output.
    4. As always, DO NOT misuse tcpdump to spy on others' traffic or to compromise the security of the Berkeley computing infrastructure in any fashion.
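
    For example, assuming the server for your chosen test case runs on a host named testhost on port 80 (substitute the actual host), a trace could be recorded with a command along these lines; the trace file name is up to you:

      tcpdump -s 0 -w proj2trace.dump host testhost and port 80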

    General Instructions

    1. All code must be written in C or C++.
    2. You must supply a single Makefile for compiling your code.
    3. Although this is not a software engineering course, we emphasize that you need to write clean, well-formatted and well-documented code.
    4. The programs must compile and run on the UNIX instructional account machines. We will not be able to grade programs that work only on Windows.
    5. You must submit a single README.txt file that briefly describes the contents of each source code and tcpdump trace file you submit.
    6. Please do not hesitate to contact the TAs if you have any questions regarding the project, and/or send questions to the mailing list.

    Checkpoint : October 18, 11PM

    You must submit, by 11PM on October 18, an initial version of KeywordHunter that:

    1. conforms to the input/output formats.
    2. is able to GET a single page and search for the desired keyword (no need to follow any links).

    This checkpoint is worth 7 points towards the final score for this project. You can check whether your submission meets the checkpoint requirements by using sample test cases 1a and 1b.

    Submission Guidelines

    Checkpoint

    1. You will need to submit the source code for KeywordHunter, along with an appropriate Makefile to compile the program.
    2. Submission Steps:
      1. Log in to your instructional account.
      2. Create a directory called "proj2checkpoint" : mkdir proj2checkpoint
      3. Change to that directory: cd proj2checkpoint
      4. Copy all the files/subdirs that you wish to submit to this directory
      5. Run the submit program: submit proj2checkpoint
      6. Make sure that no error messages are displayed. If some error occurs, retry. If it still occurs, contact the TAs.

    Final

    1. You will need to submit the source code for KeywordHunter, along with an appropriate Makefile to compile the program.
    2. Submission Steps:
      1. Log in to your instructional account.
      2. Create a directory called "proj2" : mkdir proj2
      3. Change to that directory: cd proj2
      4. Copy all the files/subdirs that you wish to submit to this directory
      5. Run the submit program: submit proj2
      6. Make sure that no error messages are displayed. If some error occurs, retry. If it still occurs, contact the TAs.

    Tips and Caveats

    1. TCP recv():

      Remember that a single TCP read may not return all of the data available. In Project 1 this probably never occurred, but here, due to the larger amount of data being transferred, it almost certainly will. Make sure you do multiple reads to read in the entire response from the server. If you are not using persistent connections, the server closing the connection is your indication that all data has been received.
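
      A minimal sketch of such a read loop, assuming a connected socket and a Connection: close request so that recv() returning 0 marks the end of the response:

        #include <stdlib.h>
        #include <sys/types.h>
        #include <sys/socket.h>

        /* Reads until the peer closes the connection. Returns a NUL-terminated,
         * malloc'd buffer (caller frees) and sets *out_len, or returns NULL on error. */
        static char *recv_all(int sock, size_t *out_len)
        {
            size_t cap = 4096, len = 0;
            char *buf = malloc(cap);
            if (buf == NULL)
                return NULL;

            for (;;) {
                if (len + 1 >= cap) {                    /* grow the buffer as needed */
                    cap *= 2;
                    char *tmp = realloc(buf, cap);
                    if (tmp == NULL) { free(buf); return NULL; }
                    buf = tmp;
                }
                ssize_t n = recv(sock, buf + len, cap - len - 1, 0);
                if (n < 0) { free(buf); return NULL; }   /* socket error */
                if (n == 0) break;                       /* server closed: response complete */
                len += (size_t)n;
            }
            buf[len] = '\0';
            *out_len = len;
            return buf;
        }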

    2. Reading a line:

      The TCP recv() (or read()) call does not return individual lines. Multiple lines may be present in the bytes obtained in a single receive operation, or a single line may be spread across multiple receives. Since the HTTP protocol crucially depends on recognizing line boundaries, you must implement your own way to extract individual lines from the received data.
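
      For example, a simple splitter over a fully received, NUL-terminated buffer (handling both \n and \r\n terminators) could look like the sketch below; the real program would search or parse each line instead of printing it.

        #include <stdio.h>
        #include <string.h>

        static void for_each_line(char *buf)
        {
            char *line = buf;
            while (line && *line) {
                char *nl = strchr(line, '\n');
                if (nl) {
                    if (nl > line && nl[-1] == '\r')     /* strip "\r\n" or just "\n" */
                        nl[-1] = '\0';
                    *nl = '\0';
                }
                printf("line: %s\n", line);
                line = nl ? nl + 1 : NULL;
            }
        }

        int main(void)
        {
            char reply[] = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>body</html>\n";
            for_each_line(reply);
            return 0;
        }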

    3. Structuring your code:

      This project is substantially bigger than Project 1. Accordingly, make sure that you structure your code so that you can reuse components. For example, you can use the same code to fetch both robots.txt and the pages to do keyword searching. Split up your code into multiple files instead of having one huge file - that makes it easier to manage and debug.

    4. Writing fetched pages to disk:

      When testing your program with the write-to-disk feature turned on (i.e., OutputDir is specified on the command line), make sure you do not fetch thousands of files and fill up your disk quota. We will provide small test cases on which you can test this feature.

    FAQ and Clarifications

    1. What should I do with the HTTP headers in the reply from the server? Should I save them? Should I search for the keyword in them?
      Answer: You should not save the headers as part of the output file; only the body should be saved. You should search for the keyword only inside the body.

    2. If the keyword is not found, should I output NOT_FOUND_KEYWORD: or NOT_FOUND_KEYWORD: TheKeywordIAmLookingFor?
      Answer: As described in the specs, you should output only NOT_FOUND_KEYWORD:. There IS a colon at the end.

    3. What exactly is a "line"?
      Answer: Lines are groups of characters terminated by \n or \r\n. This corresponds to our familiar notion of a line: when you hit Enter while typing text, a \n or \r\n is inserted (depending on your editor and OS). You should parse the reply from the server line by line. Note that the data returned in a single recv() call can contain multiple lines, or a single line may be split across multiple recv() calls. It is your responsibility to parse line by line appropriately.

    4. For the checkpoint, do I need to handle robots.txt and links?
      Answer: You do not have to implement robots.txt support for the checkpoint. The test cases used for the checkpoint evaluation will not have any links inside them. So you do not have to worry about links.

    5. Can I use some existing libraries for parsing HTML or fetching web pages?
      Answer: You are NOT allowed to use existing libraries for fetching web pages. The goal of the project is to understand the HTTP protocol and hence, you must implement it yourself. You do not need to do complete HTML parsing. Hence the standard C/C++ string manipulation functions are sufficient and no external HTML parsing libraries may be used. If you are using an external library for some other purpose, please email Dilip and make sure that it is acceptable for this project.

    6. Do I have to write my own HTTP server?
      Answer: No, you only have to implement the HTTP client, i.e. KeywordHunter. KeywordHunter should be able to talk to existing HTTP servers (e.g., www.cnn.com).

    7. I am not sure about aspect X of the specifications. What should I do?
      Answer: Please contact the TAs via email, discussion section or office hours; or consider using the mailing list, as often your question will be of interest to other students too. Make sure you clarify the specifications of the project before implementing. You might save a lot of effort and avoid implementing something that was not required.

    Grading

    Points   Feature Description

    Required Features (Total 100 points)

    5 + 15   Being able to fetch at least one URL and look for the keyword inside it. To obtain the full 20 points, this feature needs to be ready by the checkpoint.
    15       Following simple links of the form http://hostname/filepath and being able to find the keyword whenever it is within the specified depth.
    15       Ability to follow links of all the different types (full URL, absolute file path, etc.) and ignore malformed links.
    5        Writing the successfully fetched pages to the specified output directory.
    20       Correctly honoring robots.txt.
    10       Ignoring already-seen pages, i.e. not fetching or parsing a page more than once.
    5        Submitting packet traces that capture only the interaction of your program with the server in the test case. No other traffic should be present in the trace.
    2 + 3    Adhering to the input and output format specs. Your submission needs to pass the input/output verification scripts we provide in order to earn these points. To obtain the full 5 points, this feature needs to be ready by the checkpoint.
    5        Writing clean code. Please avoid difficult-to-read or undocumented code.

    Extra Credit (Max 15 points)

    2        Handle additional HTTP responses, such as redirects. Some responses are quite easy to handle and will not earn you full extra credit. Email Dilip to enquire about points for the additional HTTP responses you plan to handle.
    7        Parse pages with chunked encoding.
    2        Handle folded headers.
    4        Use HTTP persistent connections while fetching multiple pages from the same site.