In Project 1, you wrote a client that sends data to a server, which simply echoes it back. The goal of that project was to learn socket programming and to gain exposure to the client-server networking paradigm. In this project, we will look at a more complex application that uses the client-server paradigm: the World Wide Web. You will leverage the socket programming skills you gained in Project 1 to write a program that interacts with Web servers via the HTTP protocol. This project will also illustrate how to develop a client that implements an existing text-based protocol. In the next project (Project 3), you will undertake designing a new protocol, as well as exploring peer-based network interactions rather than client-server ones.
The browser (Firefox, Safari, Internet Explorer, etc.) you use to browse the Web is a common example of an HTTP client that interacts with Web servers (e.g., http://www.cnn.com) via the HTTP protocol. The goal of this project is to build a simple HTTP client that automatically crawls Web pages looking for a user-supplied keyword. Automated Web crawlers are often referred to as robots or spiders. For example, GoogleBot is the robot that crawls and indexes the Web pages that show up in Google search results. Of course, the robot you will build in this project will be much simpler than GoogleBot. In the remainder of the project description page, we will refer to our robot crawler as KeywordHunter.
KeywordHunter must accept the following parameters from the command line:
The following example will clarify the meaning of depth: Let us refer to the page at StartURL as S. S is at depth 1. Let B1, B2 and B3 be pages linked to from inside S. B1, B2 and B3 are at depth 2. Pages linked to from inside B1, B2 and B3 (but not linked to from inside S) will be at depth 3 and so on.
Exit Code: KeywordHunter must exit with exit code 0 if no error occurred. If some error occurred (e.g., invalid command line arguments), the program must exit with code 1. You can set a program's exit code by calling exit(TheDesiredCode) or by specifying the return value of main(). Note that an unfound keyword is NOT an error.
KeywordHunter must use the HTTP GET request to fetch StartURL. It searches the fetched page for SearchKeyword. If the keyword is found, KeywordHunter just reports the page URL and line (more about the output format later) and stops. If the keyword is not found, it fetches the pages linked to in the current page (more about how to detect these later) and searches them. This process is carried out recursively until you either find SearchKeyword or give up searching after having crawled exhaustively to the maximum depth of 5.
KeywordHunter must be able to fetch pages from existing HTTP servers (e.g., http://www.cs.berkeley.edu/index.html). KeywordHunter needs to implement the GET request of HTTP 1.1. RFC 2616 describes the complete HTTP 1.1 protocol. HTTP Made Really Easy is a simple tutorial that will teach you most of what you need for this project.
Subset of HTTP 1.1: You do not have to implement the whole HTTP 1.1 client standard. KeywordHunter should be able to send an HTTP GET request and parse the response. You need to distinguish between only two types of replies: an HTTP response code of 200 is treated as successful page retrieval; any other response code is treated as an error. EXTRA CREDIT: Handle other types of responses, for example redirects.
User-Agent: KeywordHunter must identify itself while making the GET request. This means that it must include the following header in the GET request:
robots.txt : Many websites do not want to be crawled by automated programs like our KeywordHunter. The administrators of these sites put a file called robots.txt in the top level directory of the Web site to request robots to stay away (e.g., the ones used by the Wikipedia site: http://en.wikipedia.org/robots.txt). This file contains lines of the form:
    User-agent: WebCopier
    Disallow: /

    User-agent: *
    Disallow: /trap/
    Crawl-delay: 1

The above lines (taken from http://en.wikipedia.org/robots.txt) indicate that the user agent named WebCopier must not access any page whose path matches "/"; essentially no files are allowed to be accessed. The "*" matches all user agents. All user agents are prohibited from accessing the subdirectory trap/. In addition, all user agents are requested not to crawl faster than one page per second. If a Web site does not have a robots.txt entry, you are free to access any page. For more information about robots.txt, please visit The Web Robots Page.
KeywordHunter must honor robots.txt. This means that before accessing a URL, you must fetch the robots.txt page of the site (using GET, similar to fetching any other Web page), parse it for entries that prevent your user-agent from accessing the page you are interested in, and not fetch the page if an entry is found. You must also honor the Crawl-delay, which you can do by calling the standard sleep() library function prior to issuing each new GET request.
You do NOT have to implement full-fledged HTML parsing. You need to only follow links of the form <a href="xyz" ...> all on the same line. Hence you can use a standard string-matching function to find the pattern a href=" and interpret the text after it up until a closing quote as the link. Note that a link may be a full-fledged URL (e.g., http://www.google.com/index.html), a URL without the resource path (e.g., http://www.google.com), an absolute file path on the same server as the current page (e.g., /photos/january.html), or a relative file path from the current page (e.g., january.html, assuming the current page is located in the photos/ subdirectory). You must handle all of these different types of links. In the case of a URL without a resource path, assume that we are fetching the file /index.html.
Sometimes the text "a href" will be inside an HTML comment. To make parsing simple, you do not have to keep track of HTML comments. If something matches the pattern mentioned earlier, it is considered a link, irrespective of whether it is inside a comment or not.
In addition, please note that there might be multiple links per line.
The keyword being searched for will always be on a single line. So you can again apply a simple string-matching function to each line in order to look for the keyword.
All successfully fetched pages should be written to the OutputDir directory if the OutputDir command line argument is specified. For example, if you successfully fetch a page at the URL http://www.MyPhotos.com/photos/january.html, a file named january.html with the contents of the page must be created in OutputDir. All files should be created in the top-level directory OutputDir - you should NOT create the file january.html in a subdirectory called photos inside OutputDir.
The page sent back in response to a GET request can be encoded in different ways. You need to handle only the default (identity) encoding. Pages with other encodings may be ignored. EXTRA CREDIT: Implement parsing of pages with chunked encoding.
Usually a header is contained in a single line. Sometimes, however, a header is broken across multiple lines using a technique called folding. You do NOT have to implement support for folded headers. EXTRA CREDIT: Implement support for folded headers.
You do not have to implement persistent connections. Send a Connection: close header as part of the GET request to tell the server to close the TCP connection after sending the complete page. EXTRA CREDIT: Implement persistent connections.
To enable automated testing of your program, please follow the output instructions carefully.
Link: For each link it finds, KeywordHunter should print (to stdout) a line of the form:
LINK:host=<HOSTNAME>;port=<PORT>;respath=<RESOURCEPATH>;action=<ACTIONCODE>

ACTIONCODE can take the following values:
Value | Meaning
---|---
follow | This link was followed, i.e., you tried to fetch the page. The fetch may have succeeded or failed; nevertheless, you attempted it.
robots_ignore | This link was not followed since the robots.txt entry of the site prohibited it.
already_seen_ignore | This link was not followed because we have seen this page previously.
Example: Suppose you are currently parsing the page http://www.MyPhotos.com/photos/january.html. You find the following four links in it, in the following order:
february.html
/help/OnlyForHumans.html (robots.txt prohibits access to this page)
http://www.google.com (we have seen this page before)
http://www.cnn.com/world.html
You will output the following lines (in order):
LINK:host=www.MyPhotos.com;port=80;respath=/photos/february.html;action=follow
LINK:host=www.MyPhotos.com;port=80;respath=/help/OnlyForHumans.html;action=robots_ignore
LINK:host=www.google.com;port=80;respath=/index.html;action=already_seen_ignore
LINK:host=www.cnn.com;port=80;respath=/world.html;action=follow
Pages Fetched/Failed: For each link it follows, KeywordHunter should print (to stdout) a line of the form:
PAGE:host=<HOSTNAME>;port=<PORT>;respath=<RESOURCEPATH>;status=<STATUSCODE>
The status code is 0 if there was some error sending the GET request to the server - for example, the server's IP address resolution failed, or a TCP connect to the server failed. If you were able to send the GET request, the status code is the HTTP response code. For example if the page was successfully retrieved, the status code will be 200.
The LINK and PAGE lines should be printed out in the order in which the corresponding pages are encountered during your processing. For example, if you follow link A before link B, the LINK line for A should come before the LINK line for B.
Found Keyword: If the search keyword is found in a particular page, you should output a line of the form:
FOUND_KEYWORD:host=<HOSTNAME>;port=<PORT>;respath=<RESOURCEPATH>;line=<LINE>
The LINE is the line (the actual text, including HTML tags) on which the keyword was found. If the keyword is not found after searching all the pages up to the maximum depth, output the following line:
NOT_FOUND_KEYWORD:
Your own output lines: You may print any number of lines to stdout (say, for debugging). Just ensure that they do not start with LINK:, PAGE:, FOUND_KEYWORD:, or NOT_FOUND_KEYWORD:.
You must submit an initial version of KeywordHunter that:
Remember that TCP may not always return the full amount of data available in one read. In Project 1, it probably never occurred. But here, due to the larger amount of data being transferred, it will almost certainly occur. Make sure you do multiple reads to read in the entire response from the server. If you are not using persistent connections, the server closing the connection will be the indication that all data has been received.
The TCP recv (or read) function does not return individual lines. Multiple lines may be present in the bytes obtained in a single receive operation, or a single line may be spread across multiple receives. Since the HTTP protocol crucially depends on understanding line boundaries, you must implement your own way to extract individual lines from the received data.
This project is substantially bigger than Project 1. Accordingly, make sure that you structure your code so that you can reuse components. For example, you can use the same code to fetch both robots.txt and the pages to do keyword searching. Split up your code into multiple files instead of having one huge file - that makes it easier to manage and debug.
When testing your program with the write-pages-to-disk feature turned on (i.e., OutputDir is specified on the command line), make sure you do not fetch thousands of files and fill up your disk quota. We will provide you small test cases on which you can test this feature.
What should I do with the HTTP headers in the reply from the server? Should I save them? Should I search for the keyword in them?
Answer: You should not save the headers as part of the output file. Only the body should be saved. You should search for the keyword only inside the body.
If the keyword is not found, should I output NOT_FOUND_KEYWORD: or NOT_FOUND_KEYWORD: TheKeywordIAmLookingFor?
Answer: As described in the specs, you should output only NOT_FOUND_KEYWORD:. There IS a colon at the end.
What exactly is a "line"?
Answer: Lines are groups of characters terminated by \n or \r\n. This corresponds to our familiar notion of a line. When we hit enter when typing text, a \n or \r\n gets inserted (depending on your editor and OS). You should parse the reply from the server line by line. Note that the data returned in a single recv() call can contain multiple lines, or a single line may be split across multiple recv() calls. It is your responsibility to appropriately parse line by line.
For the checkpoint, do I need to handle robots.txt and links?
Answer: You do not have to implement robots.txt support for the checkpoint. The test cases used for the checkpoint evaluation will not have any links inside them. So you do not have to worry about links.
Can I use some existing libraries for parsing HTML or fetching web pages?
Answer: You are NOT allowed to use existing libraries for fetching web pages. The goal of the project is to understand the HTTP protocol and hence, you must implement it yourself. You do not need to do complete HTML parsing. Hence the standard C/C++ string manipulation functions are sufficient and no external HTML parsing libraries may be used. If you are using an external library for some other purpose, please email Dilip and make sure that it is acceptable for this project.
Do I have to write my own HTTP server?
Answer: No, you only have to implement the HTTP client, i.e., KeywordHunter. It should be able to talk to existing HTTP servers (e.g., www.cnn.com).
I am not sure about aspect X of the specifications. What should I do?
Answer: Please contact the TAs via email, discussion section, or office hours; or consider using the mailing list, as your question will often be of interest to other students too.
Make sure you clarify the specifications of the project before implementing. You might save a lot of effort and avoid implementing something that was not required.
We will use an automated script that tests your program against different test cases. Hence we will not be able to grade your submission if it does not meet the input/output specifications.
Points | Feature Description
---|---
Required Features (Total 100 points) |
5 + 15 | Being able to fetch at least one URL and look for the keyword inside it. To obtain the full 20 points, this feature needs to be ready by the checkpoint.
15 | Following simple links of the form http://hostname/filepath and being able to find the keyword whenever it is within the specified depth.
15 | Ability to follow links of all different types (full URL, absolute file path, etc.) and ignore malformed links.
5 | Writing the successfully fetched pages to the specified output directory.
20 | Correctly honoring robots.txt.
10 | Ignoring already-seen pages, i.e., not fetching or parsing a page more than once.
5 | Submitting packet traces that capture only the interaction of your program with the server in the test case. No other traffic should be present in the trace.
2 + 3 | Adhering to the input and output format specs. Your submission needs to pass the input/output verification scripts we provide for you to earn this. To obtain the full 5 points, this feature needs to be ready by the checkpoint.
5 | Writing clean code. Please avoid difficult-to-read or undocumented code.
Extra Credit (Max 15 points) |
2 | Handle additional HTTP responses, such as redirects. Some responses are quite easy to handle and will not earn you full extra credit. Email Dilip to enquire about points for the additional HTTP responses you plan to handle.
7 | Parse pages with chunked encoding.
2 | Handle folded headers.
4 | Use HTTP persistent connections while fetching multiple pages from the same site.