Homework 4: Scanners and Patterns

## A. Introduction

In this homework, you'll learn some things that we won't talk about in lecture: classes and methods dedicated to searching strings for selected patterns and to reading formatted input. These ideas will be relevant for projects, so it's in your best interest to learn them well.

Warning: There's a lot of jargon for this homework. Sorry! Focus on the experiments and writing code and it should all come together. We always suggest you read the spec, but you will be more confused than normal if you skip this reading! For this homework, we've also created an introductory video. It will be mostly parallel with the spec, and we strongly recommend you watch it. It even gives a generous "hint" for the second half of the homework...

As usual, you can obtain the skeleton with

$ git fetch shared
$ git merge shared/hw4

• The sequence (?m) always matches the empty string, but has a side effect of causing ^ and $ to match the beginnings and ends of lines as well as of entire strings.
• The two-character escape sequences \?, \*, \., \+, etc., match the character after the backslash, ignoring their special significance. Thus, the pattern who\? matches the string "who?", and would be written in a Java program as the string literal "who\\?".

### Experiment #2: Matching

Compile and run the Matching class. This class allows you to type in strings and patterns and see if the entire string matches the pattern. If you include any groups (read ahead if you're curious), it will also print those. You can run it like this:

$ java Matching
Alternately type strings to match and patterns to match against
them. Use \ at the end of line to enter multi-line strings or
patterns (\s are removed, leaving newlines).  The program
will indicate whether each pattern matches the ENTIRE
preceding string.  Enter QUIT to end the program.
String: 123456
Pattern: [0-9]{6}
Matches.
String: 123456
Pattern: [0-9]{5}
No match.
String: 12345
Pattern: [0-9]{6}
No match.
String: abdeffff
Pattern: ab(c|de)f+
Matches.
Group 1: 'de'
String: abbbbdefefgg*h
Pattern: a(b+)d(ef)+gg\*h
Matches.
Group 1: 'bbbb'
Group 2: 'ef'
String: QUIT
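
The session above can also be reproduced in code. The following is a minimal sketch, not the source of the Matching class (which is not shown here); the names MatchingSketch, describe, and countFinds are made up for illustration. It relies on Matcher.matches(), which succeeds only when the pattern covers the entire string, and it also exercises the (?m) flag and the backslash escapes described earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: MatchingSketch, describe, and countFinds are hypothetical names. */
class MatchingSketch {

    /** Reports whether pattern matches ALL of s, listing any groups. */
    static String describe(String s, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(s);
        if (!m.matches()) {
            return "No match.";
        }
        StringBuilder out = new StringBuilder("Matches.");
        for (int i = 1; i <= m.groupCount(); i += 1) {
            out.append(String.format(" Group %d: '%s'", i, m.group(i)));
        }
        return out.toString();
    }

    /** Counts how many times pattern is found anywhere inside s. */
    static int countFinds(String s, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(s);
        int n = 0;
        while (m.find()) {
            n += 1;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(describe("123456", "[0-9]{6}"));
        System.out.println(describe("12345", "[0-9]{6}"));
        System.out.println(describe("abdeffff", "ab(c|de)f+"));
        // The regex gg\*h is written "gg\\*h" in Java source.
        System.out.println(describe("abbbbdefefgg*h", "a(b+)d(ef)+gg\\*h"));
        System.out.println(describe("who?", "who\\?"));
        // Without (?m), ^ and $ refer to the whole two-line string.
        System.out.println(countFinds("ab\ncd", "^[a-z]+$"));
        // With (?m), ^ and $ also match at line boundaries.
        System.out.println(countFinds("ab\ncd", "(?m)^[a-z]+$"));
    }
}
```

Note that find() searches for the pattern anywhere in the string, while matches() requires the whole string to fit, which is the behavior the Matching program reports.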

Use this class to experiment with how patterns work. Try writing patterns that match the following. A sample answer is given for each problem.

• A single digit between 5 and 8. Answer: [5-8]
• Sequences of lower case letters. Answer: [a-z]+
• Sequences of lower case letters except the letter j. Answer: [a-ik-z]+
• Sequences of characters that start with the uppercase letter A and end with the letter f. Answer: A.*f
• Sequences of three words separated by spaces, where a word is defined as a sequence of lower case letters. Answer: [a-z]+ +[a-z]+ +[a-z]+
• Sequences of three words separated by spaces, where group 1 corresponds to the second word. Remember, a group is just a subpattern that you specify with parentheses, and it is helpful for extracting portions of a match. Groups are 1-indexed, unlike most of Java! Thus, group 1 is the first parenthesized group; group 0 is not one of your groups but refers to the entire match. Answer: [a-z]+ +([a-z]+) +[a-z]+
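
To see how a group is pulled out of a match in a program, here is a small sketch that reuses two of the sample answers above. The class name GroupSketch and the method secondWord are hypothetical and are not part of the homework skeleton.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: GroupSketch and secondWord are hypothetical names. */
class GroupSketch {

    /** Three lowercase words separated by spaces; group 1 is the second word. */
    static final Pattern THREE_WORDS =
        Pattern.compile("[a-z]+ +([a-z]+) +[a-z]+");

    /** Returns the second word if s is exactly three words, else null. */
    static String secondWord(String s) {
        Matcher m = THREE_WORDS.matcher(s);
        // group(1) may only be read after a successful match.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(secondWord("red green blue"));
        System.out.println(secondWord("red green"));
        // String.matches also requires the whole string to match.
        System.out.println("abc".matches("[a-ik-z]+"));
        System.out.println("jab".matches("[a-ik-z]+"));
    }
}
```

Calling group(1) before matches() (or after it returns false) throws an IllegalStateException, which is why the sketch checks the result first.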

To get more practice with writing regular expressions, check out RegExr or regular expressions 101. On the regex101 site, note that you should switch the "flavor" on the left to Java8. Additionally, this site doesn't require you to double-escape your backslashes, so be careful there. Overall, both sites differ slightly from the Java patterns we will be writing. They are still a great way to build familiarity with regular expressions, which, as we have mentioned, have many applications involving string matching across many programming languages.

### Programming task

In P2Pattern.java, you are given 5 String variables named P1, P2, P3, P4, and P5. For each one, write a regular expression following the directions below. You must complete 3 of the 5 patterns for full credit (though we recommend trying them all for practice). Don't forget to write the escape character twice (\\) wherever you need a backslash (\) in a regular expression.
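
As a quick illustration of the double-backslash rule, the sketch below stores the regular expression [0-9]+\.[0-9]+ (digits, a literal period, digits) in a Java string. The class name EscapeSketch and the constant DECIMAL are invented for this example and are unrelated to P1 through P5.

```java
import java.util.regex.Pattern;

/** Illustrative only: EscapeSketch and DECIMAL are hypothetical names. */
class EscapeSketch {

    /** The regex [0-9]+\.[0-9]+ written as a Java string literal. */
    static final String DECIMAL = "[0-9]+\\.[0-9]+";

    public static void main(String[] args) {
        // Each "\\" in the source is a single backslash in the String,
        // so printing it shows the regex as a regex tester would expect it.
        System.out.println(DECIMAL);
        System.out.println(DECIMAL.length());
        // Pattern.matches checks the pattern against the whole input.
        System.out.println(Pattern.matches(DECIMAL, "3.14"));
        System.out.println(Pattern.matches(DECIMAL, "3x14"));
    }
}
```

Printing the string is a handy way to check what the regex engine actually receives before pasting a pattern into an online tester.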

For all of the below, we STRONGLY recommend looking at TestP2Pattern.java to see what potential edge cases there are! Using a tool like an online regex tester can be helpful while building your patterns.

1. For P1:

• Define a pattern that matches valid dates of the form MM/DD/YYYY
• For single digit months, both 09/28/2021 and 9/28/2021 are valid. Similarly, both 10/05/2021 and 10/5/2021 are valid.
• For example, 12/25/2019 is a valid date but 25/12/2019 is not.
• Assume that MM ranges from 01-12, DD ranges from 01-31 and YYYY ranges from 1900 onwards (this means that technically 02/31/2021 is a valid date format for our purposes).
• If you're stuck on this, check out the intro video linked at the top of the spec.

2. For P2:

• Define a pattern that matches lists of non-negative numerals (e.g. (1, 2, 33, 1, 63)).
• The list cannot be empty.
• Each numeral but the last should be followed by a comma and one or more spaces.
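
One common way to express "every item but the last is followed by a separator" is to repeat an item-plus-separator group zero or more times and then require one final item. The sketch below shows that idiom on a deliberately different problem (lowercase words separated by semicolons), so it is not an answer to P2; ListSketch, WORD_LIST, and isWordList are hypothetical names.

```java
import java.util.regex.Pattern;

/** Illustrative only: ListSketch, WORD_LIST, and isWordList are hypothetical names. */
class ListSketch {

    /** Zero or more "word;" pairs, then a final word, so the list is never empty. */
    static final String WORD_LIST = "([a-z]+;)*[a-z]+";

    /** True if s is a non-empty, semicolon-separated list of lowercase words. */
    static boolean isWordList(String s) {
        return Pattern.matches(WORD_LIST, s);
    }

    public static void main(String[] args) {
        System.out.println(isWordList("a;bc;d"));
        System.out.println(isWordList("a"));
        System.out.println(isWordList(""));
        System.out.println(isWordList("a;"));
    }
}
```

Because the final item is outside the repeated group, a trailing separator and an empty list are both rejected without any special cases.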

3. For P3:

• Define a pattern that matches a valid domain name.
• For example, www.support.ucb-login.com is a valid domain name (even if it doesn't really exist!)
• A valid domain name contains letters (a-z, A-Z), digits (0-9), and dashes (-), in any combination, with its parts separated by periods (.).
• It does not begin or end with a dash (-) or a period (.).
• It does not contain whitespace ( ) or underscores (_).
• Assume that the top-level domain (the last part, after the final period) is between 2 and 6 characters in length.
• The characters - and . cannot be next to each other.

4. For P4:

• Define a pattern that matches a valid Java variable name
• For example, _myVariable$1 is a valid variable name in Java while 1stVariable is not.
• A variable name cannot start with a digit. It can consist of alphanumeric characters as well as _ and $.

5. For P5:

• Define a pattern that matches a valid IPv4 address.
• A valid IPv4 address consists of four non-negative integer parts separated by periods (.). Each integer part can range from 0 to 255.
• For example, 127.0.0.1 is a valid IP address whereas 299.10.10.1 is not.
• It might be helpful here to first define a String that captures a subpattern, because the four integers each follow the same pattern. We can then use the String.format method to show the subpattern. Thus your final answer for P5 might be something like: