Navigation
Introduction
In this homework, you'll learn some things that we won't talk about in lecture: classes and methods dedicated to searching strings for selected patterns and for reading formatted input.
Warning: There's a lot of jargon for this homework. Sorry! Focus on the experiments and writing code and it should all come together.
As usual, you can obtain the skeleton with
$ git fetch shared
$ git merge shared/hw4
$ git push
and submit it by committing all work, tagging, pushing, and pushing tags.
B. Scanners
Intro to Scanners
If you're already familiar with scanners, you can skip straight to the tasks below.
The class java.util.Scanner
gives you a way to read
substrings of text, also called tokens, sequentially
from a stream of text that is furnished to the Scanner
by its
constructor. Typically, the stream of text comes from a file or from
a terminal, but there are ways to convert any source of characters
into a stream that a Scanner
can process.
One of Scanner's constructors accepts an InputStream
--a
stream of bytes (8-bit characters). Since System.in
, which is
normally the standard input stream to your program, is a
kind of InputStream
(that is, its type is a subtype of
InputStream
), you can write
java.util.Scanner inp = new java.util.Scanner(System.in);
to get something that scans the input from your terminal. (Normally,
of course, you'd put import java.util.Scanner;
at the
beginning of your source file and just write Scanner
instead
of java.util.Scanner
).
One can also create Scanners from Strings (e.g. Scanner s = new
Scanner("hello");
), input files, and more.
The simplest way to use a Scanner
is to treat the input stream
as a sequence of tokens separated by text that matches a
delimiter pattern. By default, the delimiter pattern
matches stretches of whitespace (blanks, tabs, newlines, carriage
returns). For example, consider the input below:
hello i am a half human half
horse
In this case, the tokens separated by our delimiter pattern are
"hello"
, "i"
, "am"
, "a"
,
"half"
, "human"
, "half"
, and
"horse"
.
Delimiter patterns can be arbitrary. For example, if our delimiter pattern were stretches of ";" and "," characters then the string:
hello; i am just a, horse
Would yield the tokens "hello"
, " i am just a"
, and
" horse"
.
Look at ReadInts.java
and TestReadInts.java
. ReadInts
provides one
complete method printInts
and two incomplete methods
readInts
and smartReadInts
. TestReadInts
calls all three methods, testing as appropriate.
Experiment #1: TestReadInts
Try running TestReadInts
. You should see that testReadInts
and
testSmartReadInts
fail, but testPrintInts
will print out the
contents of the input String inp
as you would expect. Yay, the
provided code works.
Next take a look at ReadInts.java
and look at the source of the printInts
method. You'll see that it uses the hasNext()
and nextInt()
methods from
the Scanner s
object created.
These methods, along with their most common brethren are described below.
s.hasNext()
is true iff there is another token (that is, something other than a delimiter) before the end of the input.s.next()
returns the next token, and advancess
past it.s.hasNextInt()
is true iffs.hasNext()
and the next token has the syntax of a (possibly signed) decimal numeral.s.nextInt()
does ans.next()
and then parses the token into anint
.- There are several other similarly named methods for other types such as
s.hasNextDouble()
, which returns true if there is another double available to be read ands.nextDouble()
which returns adouble
parsed from a call tos.next()
. s.hasNextInt(RADIX)
is true iff the next token exists and has the syntax of a (possibly signed) base-RADIX numeral. For example this could be used to check for integers that are binary (RADIX = 2) or hexadecimal (RADIX = 16).s.nextInt(RADIX)
reads the next token as a base-RADIX numeral, throwing an exception if there is no next token, or it does not have the right form for a numeral with that radix.s.hasNextLine()
is true iff there is any more input.s.nextLine()
returns the next line of input (that is, everything up to, but not including, the next end-of-line character, or the end of the input if there isn't an end-of-line at the end). It then positionsinp
past the end-of-line character. Thus, it differs from the othernext
methods in that it uses a different delimiter (end of line instead of whitespace).
Programming Tasks
Task 1: readInts()
As above, when you run TestReadInts
, you'll see that your code fails
in readInts
. Head to the readInts
method, and you'll see that one
line of code is missing.
Using the documentation for the ArrayList
class,
figure out how to modify the code so that the readInts method works
correctly.
The testReadInts
test feeds your method a bad input, and actually expects an
exception. For those of you who are particularly keen on the idea of testing
exceptions, Junit supports a less unwieldy syntax based on the @Rule tag, which
you're free to use.
Task 2: smartReadInts()
Complete the smartReadInts
method so that it works as described in the
comments and TestReadInts
. Use the readInts
method as a guide.
C. Patterns
Your code in the previous part looked through input token by token,
accepting tokens that were integers. How does Java know if a string
represents an integer? As you might imagine, it looks for sequences
that contain only the digits 0-9, possibly preceded by a -
symbol.
What if we want to match, for example, numerals that only contain digits less than 5? Or five-digit numerals only?
Java provides a faculty for this known as Pattern Matching, and
supports a rich syntax for specifying patterns. For example, one-digit numerals
less than 5 could be expressed by the pattern "[01234]". Five-digit
numberals could be expressed by "[0-9][0-9][0-9][0-9][0-9]"
or "[0-9]{5}"
.
These patterns are sometimes referred to as regular expressions, though strictly speaking, they're more general than the formal notion of a regular expression that might be discussed, for example, in an upper-division CS class in your future.
The full pattern language is quite rich, and is documented under java.util.regex.Pattern in the on-line Java library documentation. Here are just a few constructs that you might find particularly useful. You are not expected to read this entire list for this homework. However, you may find it useful moving forward in 61B.
- Most characters (letters, digits, most punctuation) simply match themselves.
- A period (".") acts as a wildcard and matches any character other
than (usually) newline. To get "." to match newline as well, include
(?s)
at the beginning of your pattern. - A sequence such as "[abe]" denotes a character
class: in this case, "any of the (single) characters
a
,b
, ore
". As a shorthand, you can represent a range of characters with a hyphen, as in"[abd-qs-z]"
to meana
,b
,d
throughq
, ands
throughz
. - A character class beginning with a carat, such as
[^abe]
, matches any single character other than those listed. - There are several useful two-character shorthands for certain
character classes.
\d
is short for[0-9]
,\s
is short for[ \t\n\r]
(that is, for whitespace). Unfortunately, in order to put an actual\
in a string, you must double it. Thus, a pattern that matches any two-digit string would be written as the string literal"\\d\\d"
. (The Python language got around this problem with "raw strings" such asr"\s"
, and Perl got around it by having patterns be a part of the syntax, distinct from strings. The Java designers, however, apparently never saw the need.) - If
P
represents a pattern, thenP*
represents "0 or more repetitions of P". Thus,x*
matches the empty string, "x", "xx", "xxx", etc.[a-c]*
matches "", "a", "ab", "aa", "bac", "ccc", etc.*
,+
, and?
(see below) have the strongest binding of any character. That is,ab*
means "onea
followed by 0 or moreb
s". To get the effect of "0 or moreab
s", use(ab)*
- Similarly, P+ means "1 or more Ps."
P?
denotes an optional P (0 or 1 Ps).- If
P
andQ
denote patterns, thenP|Q
denotes "aP
or aQ
". For example the patterna|b
will match either "a" or "b". (P)
denotes the same thing as P. It also serves to define a group, a subpattern whose match you can retrieve later.(?:P)
also denotes the same thing asP
, but it does not define a group that you can retrieve later.- Following a
*
,+
, or?
with a?
creates a "non-greedy" version, meaning that it matches as few characters as possible to make the match work. This affects what part of a string gets matched, but usually not whether a string gets matched. For example, if you are matching the string "1, 2, 3, 4" against the pattern string"(\\d).*(\\d).*"
, the first group will match1
and second will match4
. But with the pattern string"(\\d).*?(\\d).*"
, the second group will match2
. - Boundary matchers match the empty string, but only
at certain places.
^
and$
match the beginning and end of a string, respectively (but see below).\G
matches the point at which the last match ended (this makes sense only inScanners
,Matchers
, or other kinds of things that have a notion of "the last thing that was matched").\b
matches a "word boundary", a place where a word begins or ends. (Again, when coding the pattern as aString
literal, you must write\\b
.) - The sequence
(?m)
always matches the empty string, but has a side effect of causing^
and$
to match the beginnings and ends of lines as well as of entire strings. - The two-character escape sequences
\?
,\*
,\.
,\+
, etc., match the character after the backslash, ignoring their special significance. Thus, the patternwho\?
matches the string "who?", and would be written in a program as the string literal"who\\?"
.
Experiment #2: Matching
Compile and run the Matching
class. This class allows you to
type in strings and patterns and see if the entire string matches the
pattern. If you include any groups (read ahead if you're
curious), it will also print those. For example:
$ java Matching
Alternately type strings to match and patterns to match against
them. Use \ at the end of line to enter multi-line strings or
patterns (\s are removed, leaving newlines). The program
will indicate whether each pattern matches the ENTIRE
preceding string. Enter QUIT to end the program.
String: 123456
Pattern: [0-9]{6}
Matches.
String: 123456
Pattern: [0-9]{5}
No match.
String: 12345
Pattern: [0-9]{6}
No match.
String: abdeffff
Pattern: ab(c|de)f+
Matches.
Group 1: 'de'
String: abbbbdefefgg*h
Pattern: a(b+)d(ef)+gg\*h
Matches.
Group 1: 'bbbb'
Group 2: 'ef'
String: QUIT
Use this class to experiment with how patterns work. Try writing patterns that match the following. Sample answers are given for each problem (drag the mouse over the white area after "Answer:" to see it).
- A single digit between 5 and 8. Answer: [5-8].
- Sequences of lower case letters. Answer: [a-z]+
- Sequences of lower case letters except the letter j. Answer: [a-ik-z]+
- Sequences of characters that start with the uppercase letter A and end with the letter f. Answer: A.*f
- Sequences of three words separated by spaces, where a word is defined as a sequence of lower case letters. Answer: [a-z]+ +[a-z]+ +[a-z]+
- Sequences of three words separated by spaces, and where group 1 corresponds to the second word. Answer: [a-z]+ +([a-z]+) +[a-z]+
To get more practice with writing regular expressions check out RegExr or regular expressions 101. These sites use plain regular expressions rather than Java patterns. which differ slightly as we have discussed above. They are still a great way to build more familiarity with regular expressions, which as we have mentioned, have many different applications involving string matching across multiple different programming languages.
Programming task
In P2Pattern.java
, fill in the string with a pattern that
matches lists of non-negative numerals in the notation we used
in homework
2 (e.g. (1, 2, 33, 1, 63)
). Each numeral but the last should be followed
by a comma and one or more spaces.
Run TestP2Pattern
to verify that your pattern is correct.