CompSci-61B Term Project 1
Email Address Parser

Instructor: Prof. Robert Burns

Write a program named Email.java in a package named compsci61b.project1 that opens and reads a text file and writes another text file. The purpose of the program is to extract email addresses embedded in the first (input) text file, and copy them to the second (output) text file. The application of the program is to make it easier to enter email addresses into an email message that is to be sent to a list of recipients, when those recipients are not already in a contacts list.

For example, the Electrical Engineering and Computer Sciences Home Page (www.eecs.berkeley.edu) has a "Directory" link. To send a single email message to everyone in the department, I have to copy/paste each email address individually from the directory into the "to", "cc", or "bcc" field of my email message. Using your new Email.java program, I could save the web page containing the directory listing to a file, run the program, open its output file, and copy/paste the entire list from the file into the "to", "cc", or "bcc" field.

Here's another example -- the computer science department provides me with a roster of students in my class. The roster is a table with names, email addresses, student IDs, etc. I would like to save that page to a file, run your Email.java program, open its output file, and copy/paste the entire list from the file into the "to", "cc", or "bcc" field of a message to be sent to all students in my class.

You can probably think of many other possible applications for such a program.

To solve this, you need to understand the difference between strings and characters. Your program will read strings from a file, and inspect them one character at a time. For this you will need to know how to extract a single character from a string, as a char value. Be sure to participate in the discussion groups after lecture for hints on how to do this.
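As a minimal sketch of the string-versus-character distinction, the example below walks a String one position at a time and pulls out each char with charAt. The class name CharDemo and the method name firstAt are hypothetical, chosen only for illustration; they are not part of the assignment.

```java
// Illustrative only: CharDemo and firstAt are hypothetical names.
class CharDemo {
    // Returns the index of the first '@' in line, or -1 if there is none.
    static int firstAt(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i); // a single character of the string, as a char
            if (c == '@') {          // char literals use single quotes and compare with ==
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstAt("write to rdb3@rdb3.com today"));
    }
}
```

Note that "@" (double quotes) is a String and '@' (single quotes) is a char; only the latter can be compared with == against the result of charAt.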

Here are the specifications for the program:

  1. There is to be a series of user options in the program, as explained below. For each option, there is to be a default value, so that the user can simply press ENTER to accept any default. (That means that String will be the best choice of data type for the console input for each option.)
  2. Two options are the names of the input and output files. The default filenames are to be fileContainingEmails.txt for the input file, and copyPasteMyEmails.txt for the output file. The default location for these files is the working folder of the program (so do not specify a drive or folder for the default filenames). The actual names and locations of the files can be any valid filename for the operating system being used, and any existing drive and folder.
  3. It is okay for the user to give the input and output files the same name. If the user enters a name other than the default for the input file, then the default for the output file should be that same name. If the input and output filenames are the same, the input file is replaced by the output file when the program is run.
  4. The output file should be overwritten, and not appended to. No warning is necessary when overwriting an already-existing file.
  5. Another user input option is the delimiter for separating the emails in the output file. The default option is a semicolon and a space (like rdb3@rdb3.com; tina@rdb3.com), designated the "email addressee option". The other option is a new-line (that is, \n), designated the "list option". You may include any additional delimiter options that you want, but include at least these two. Present the user with a menu for this option choice.
  6. Print each email address to the output file, separated by the chosen delimiter. Include no delimiter or other text before the first email address, and nothing after the last. Include nothing in the file besides email addresses and delimiters. If you do this right, the number of delimiters should be one less than the number of email addresses.
  7. Do not allow duplicate email addresses to appear in the output file. Ignore case when comparing two email addresses to determine if they are the same.
  8. Put the email addresses in lexicographical order, ignoring case.
  9. Count the number of unique (i.e., not counting any duplicates) email addresses found as the input file is processed, and list each on a separate line of console output. At the end of the list of unique email addresses on the console, print the total number of email addresses found. If the number of email addresses found is zero, do not create an output file. (That means, do not even open it.)
  10. In case an email address is split between two or more lines in the input file, ignore it. Valid email addresses must appear fully within one line. Also, note that each line of the input file may contain more than one email address.
  11. Include friendly, helpful labels on the console output. Instead of just printing the number of email addresses found, say something like "16 email addresses were found, and copied to the file output.txt". Or if none were found, say something like "Sorry, no email addresses were found in the file input.txt".
  12. Include a message in the console output explaining to the user to open the output file and copy/paste its contents into the "to", "cc", or "bcc" field of any email message. But explain that it is best to use the "bcc" field so that everyone's email address does not appear in the message, to protect their privacy.
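A few of the specifications above (defaults on ENTER, no duplicates, case-insensitive ordering, and delimiters only between addresses) can be sketched as follows. This is an illustration under assumptions, not a required design: it uses java.util.TreeSet, whereas the course may expect you to use the data structure classes from your own lab work, and the names OutputDemo, orDefault, uniqueSorted, and join are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: all class and method names here are hypothetical.
class OutputDemo {
    // Spec 1: an empty console entry (the user just pressed ENTER) selects the default.
    static String orDefault(String input, String defaultValue) {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? defaultValue : trimmed;
    }

    // Specs 7 and 8: drop duplicates and order the addresses, ignoring case.
    // Assumption: a TreeSet from java.util is acceptable for this purpose.
    static List<String> uniqueSorted(List<String> found) {
        TreeSet<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(found); // an address equal (ignoring case) to one already present is not added
        return new ArrayList<>(set);
    }

    // Spec 6: the delimiter goes only between addresses, never first or last.
    static String join(List<String> emails, String delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < emails.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(emails.get(i));
        }
        return sb.toString();
    }
}
```

With this arrangement, n addresses always produce n - 1 delimiters, and calling join with "; " or "\n" covers the two required delimiter options.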

To test your program, get the directory from the EECS website, and either save the web page as a file, or copy/paste the web page contents into a file. Be sure to test your program with at least one other data source -- search for one on the Internet, if you have to. You may share test input files with classmates, providing files to test each other's programs.

This project is worth 100 points, awarded as follows: 10 points for the final code's neatness and presentation, including alignment and indenting of code, and comments; and 90 points for the correctness of the solution. 10 points will be deducted for each listed specification that is not satisfied, with partial credit possible. Zero points will be awarded for a source file that does not compile.

A procedure for "parsing" text to find embedded email addresses will be developed in some of the post-lecture discussions. Email addresses consist of the characters A-Z, a-z, 0-9, underscore, dot, hyphen, and plus. Also, they must have exactly one '@' followed by at least one '.'.

The procedure basically involves reading a line from a file as a text variable, and traversing the line of text to find a '@' character. If one is found, its position is saved. Then traverse backwards until an invalid email character is found -- that is the position before the email address starts. Then traverse forwards from the '@' until an invalid email character is found -- that is the position after the email address ends. Also count the number of '.'s found as you traverse forwards from '@' -- if any are found, then you can extract a copy of the email address as a substring. Continue from the position after the extracted email address, until no more '@'s are found.
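The traversal described above might look roughly like the sketch below. It is one possible reading of the procedure, not the version developed in the discussions. The names ParseDemo, isValidEmailChar, and extractEmails are hypothetical. Two details are assumptions beyond the description: the backwards traversal stops at the end of the previously examined address, and a candidate with nothing before the '@' is skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all class and method names here are hypothetical.
class ParseDemo {
    // Valid characters: A-Z, a-z, 0-9, underscore, dot, hyphen, and plus.
    static boolean isValidEmailChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '_' || c == '.' || c == '-' || c == '+';
    }

    // Returns every email address found in one line of text, in order of appearance.
    static List<String> extractEmails(String line) {
        List<String> result = new ArrayList<>();
        int from = 0; // where the search for the next '@' resumes
        while (true) {
            int at = line.indexOf('@', from);
            if (at < 0) {
                break; // no more '@' characters in this line
            }
            // Traverse backwards to the first character of the address.
            // Assumption: never back up past the previously examined address.
            int start = at;
            while (start > from && isValidEmailChar(line.charAt(start - 1))) {
                start--;
            }
            // Traverse forwards to the position just after the address, counting dots.
            int end = at + 1;
            int dots = 0;
            while (end < line.length() && isValidEmailChar(line.charAt(end))) {
                if (line.charAt(end) == '.') {
                    dots++;
                }
                end++;
            }
            // Assumption: also require at least one character before the '@'.
            if (start < at && dots > 0) {
                result.add(line.substring(start, end));
            }
            from = end; // continue after this candidate
        }
        return result;
    }
}
```

Because dot is a valid email character, a period that ends a sentence immediately after an address would be captured as part of it under this reading; whether to trim such cases is left open here.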

Post your project1.jar file to your student UNIX account for grading and credit. The jar should contain all of the files from your lab work to date. This does not have to be complete through lab 5b -- just go as far as you need to go in order to have the data structure classes needed in the project. Do not move any lab files into the compsci61b.project1 package -- leave them in the compsci61b package!

