Homework 9: NFA simulation

A. Background

What is an NFA?

NFA stands for Non-deterministic Finite Automaton and is a simple model of computation. Models strip away the details of whatever they aim to abstract so that we can generalize proofs and algorithms. You may have heard of a Turing Machine, which is an example of a model of computation (and the strongest one theorized). We will not talk about Turing Machines here, but if you're interested feel free to check out this link for more information.

We are specifically interested in NFAs because they compute regular languages. A regular language is a subset of strings over some alphabet with strict rules. Regular languages can be described with a regular expression (also called a regex, regexp, or pattern). These are exactly like the regular expressions you've learned about in class. For example, the regex (ab)* defines the following subset of strings: {$\varepsilon$, ab, abab, ababab, etc} where in this case, our alphabet is just the letters a and b. An alphabet is just the symbols that you are using. In this assignment, our alphabet is all Java chars except for *, +, |, (, ), \. That is because they have special meaning in our patterns. A more advanced implementation would allow these characters in the alphabet by escaping them, but we won't worry about that and we won't be testing on that. In this spec our NFA diagrams use a smaller alphabet: just a and b to keep things simple. Having a larger alphabet doesn't complicate anything so don't worry about the spec oversimplifying the problem.

You may be wondering what $\varepsilon$ is: it's just a way to denote an empty character/string. From lecture, we know that the * operator matches 0 or more of the preceding pattern (in the above case ab). So ab $0$ times is just an empty string, so this regexp includes $\varepsilon$.

Let's look at what an NFA for the regexp a*ab looks like:

a-star-a-b

You'll notice that an NFA is essentially a graph, which you should be familiar with by now (review the relevant lecture if you aren't). Specifically, an NFA is a directed graph with labeled edges. In this case, each edge label is a symbol in the alphabet our NFA is defined over (for us, this is a valid Java char). The name of a state doesn't matter and can be omitted (as our implementation does), it's just something we add so we can reference states easily in this explanation. Every NFA has a single start state (denoted with the incoming edge labeled "start") and a single acceptance state (denoted with the double circles). It is possible that the start and accept states may be the same state. It turns out that this doesn't complicate anything and you shouldn't worry about this case as the algorithm you'll implement isn't affected by this at all.

There are many possible NFAs for the pattern a*ab, we chose this one somewhat arbitrarily. For a fun exercise, after reading this next bit, see if you can draw a different NFA that computes the same pattern.

Earlier we said that NFAs compute regular languages. That is because an NFA can take some string (from here referred to as the query string) and tell you whether or not that string matches the underlying regexp it computes. This is called NFA simulation and is what you will implement in this assignment. At a high level, the query string is accepted by the NFA if you can begin at the start state and consume one character of the query string at a time while taking edges with labels that match the character you just consumed and end up in the accept state of the NFA. You must consume the entire string and be at the accept state for the string to be accepted (for example, aba will pass through the accept state of the NFA but not end there). Verify for yourself that this computes the regular expression a*ab by trying some query strings like ab, aab, aaab, and some strings you know aren't in the language like $\varepsilon$ a, b, aba, abb, etc. Also, the edge from $q_a$ to $q_f$ means that you can take either a or b to transition states. It's a shorthand used instead of drawing two separate edges (in the code we don't do this, check out the _edges variable of State to see how we store the directed, labeled edges).

You probably noticed there are potentially different paths the query string may take: this is the source of the non-determinism. For example, the query string ab can either take the path $q_0, q_1, q_a$ or it could take the path $q_0, q_0, q_f$. Remember the query string starts off in $q_0$ so that is why it is the first state of both paths. The first path would cause the query string to be accepted and the second path would cause the query string to be rejected: so our definition of accepting is that there exists some path from the start state to the accepting state that consumes all the characters in the query string.

So if our query string is aaba, then we must first begin at the start state and take an edge with label a, then an edge with label a, then an edge with label b, then an edge with label a. The one thing that somewhat complicates this is that we can always take a $\varepsilon$-edge at any time without consuming any characters of the query string. This is discussed later, but just keep that in mind while reading. One imporant thing to note is that at each step of simulation you must take a valid edge (possibly an $\varepsilon$-edge if one is available, more on this later). If there are no valid transitions then the query string is rejected by the NFA. The whole query string must be accepted by the NFA, not just part of it. For this particular NFA, it can get "stuck" (meaning no valid transitions from a state) because the state $q_f$ has no outgoing edges labeled a or b (remember our alphabet for this spec is just these two letters). You'll see later in this spec how to account for this in the successors(char c) method, but for now just know that if you get stuck that you try a different path.

You'll see a lot of code for constructing the NFA from a pattern string, but you won't have to touch this at all. At this point, you should understand that an NFA is really a directed graph with labels on edges. The next section will include a brief explanation of what operations our NFA implementation supports.

B. Thompson's Construction

The algorithm we are using to construct an NFA from a pattern is called Thompson's Construction.

You aren't responsible for writing any of the construction code as it has been provided for you. If you are curious, read through the code as we've included javadocs for all the methods. If you don't care for this, you can skip this section. For this assignment, you only need to know how simulation works.

Here is a table describing the different operators this implementation supports:

Operator Symbol Type Description Example
Concat N/A Binary Matches if and only if the query string matches the concatenation of the sub-pattern on the left with the sub-pattern on the right (ab)(c*) matches all strings that begin with ab followed by any number of c's (including $0$)
Or | Binary Matches if and only if the query string matches the left sub-pattern or the right sub-pattern (ab)|(c*) matches either ab, or any number of c's (including $0$ c's)
Kleene Star * Unary Matches any number (including $0$) of the sub-pattern it is applied on. (bc)* matches any number of bc's
Kleene Plus + Unary Exactly the same as the Kleene Star, except must have a positive (not $0$) number of the sub-pattern. (bc)+ matches any positive number of bc's.

What is really amazing is every regexp is built from these basic operators. The regexps you've used in past homeworks can all be implemented using just these operations. But it helps to have shortcuts, which is why we add two escape sequences to help with testing. Escape sequences are really just patterns: as an excercise, ponder what the regular expression would be for the following two escape sequences (the answers are in the code).

The two escape sequences we support are \l and \d. The former is not a standard escape character, we just include it to make testing easier. The second is one you should recognize. You shouldn't have to deal with these escape chars unless you are writing tests. The code already parses the escape chars into their underlying patterns. So nowhere in your code should you be worrying about any escape chars.

Here is a table that describes what each one means.

Pattern Description Notes
\l matches any lowercase letter This is NOT a standard regexp escape pattern.
\d matches any digit 0-9 This is a standard regexp pattern.

C. NFA Simulation

Up until this point, there has been no non-determinism: the simulation of an NFA is where the non-determinism arises.

As described earlier, NFA simulation requires some query string as well as some NFA. At a high level, we will be iterating over the query string one character at a time and keeping track of what states the NFA may be in. It is states (plural) and not state (singular) since a single state may have multiple outgoing edges with the same symbol and because of $\varepsilon$-edges. So at any point, we may be in many states at once. This is the source of the non-determinism.

We say that the NFA accepts the query string if the set of states the NFA can be in once all of the characters have been iterated over contains at least one state that is accepting. Otherwise, the NFA rejects the query string.

Pseudocode

Note: we modified the pseudocode a little bit to clarify the bit about removing states, so don't be worried if this looks a little different than the last time you saw it.

In rough pseudo-code, here at the steps we take:

It's ok if you are confused after reading that. The imporant thing to know is what each of these sets mean: $S$ represents all the states we may be in at any point in the simulation. $S_{new}$ is just to help us in the code so we don't get a ConcurrentModificationException, so in the examples below we won't be referencing $S_{new}$ explicitly.

Let's illuminate with some examples.

Here is an NFA for the pattern (ab)|(aab). This pattern only matches the two strings ab and aab.

ab-or-aab-with-eps

First, our NFA may be in states $S = \{ q_0, q_1 \}$ since $q_0$ has an $\varepsilon$-edge to $q_1$. Remember that we add all states reachable from $q_0$ by any number of successive $\varepsilon$-edges. A simple graph algorithm for this should spring to mind.

For the NFA above and the query string ab, we begin iteration with the first character, a.

So after the first character of the query string, the set of states the NFA may be in is $S = \{ q_1, q_2 \}$.

Since none of these states have an outgoing $\varepsilon$-edge, we continue to the next character in the query string: b.

Now our set of possible states $S = \{ q_f, q_a \}$ and none of these have outgoing $\varepsilon$-edges, so the iteration is done. Since this set includes the accept state, we say that ab is accepted by the NFA.

You should have noticed that the set $S$ was tracking all possible states the NFA could be in at any step of the simulation. It might have been a bit confusing in the pseudocode, so go ahead and reread the pseudocode with the understanding that $S$ should have all the possible states the NFA may be in and the way we update it should make more sense to you, especially the bit about the $\varepsilon$-edges which might have confused you earlier.

Here is another NFA for the same pattern (ab)|(aab)

ab-or-aab

If we use the query string ab, you'll notice that $q_0$ has multiple valid transitions when we iterate over the first character of the query string, a. Let's see how that is handeled:

First, our set of possible start states is just $S = \{ q_0 \}$ since there are no outgoing $\varepsilon$-edges from the start state. Now we iterate over the first character of the query string, a:

So, the set of states the NFA may be in is now $S = \{ q_1, q_3 \}$. None of these states have any outgoing $\varepsilon$-edges, so we continue to the next character.

The next character is b:

Our set of possible states is now $S = \{ q_2 \}$. Since $q_3$ didn't have any outgoing edges with label b, the set $S$ shrunk to size 1. The state $q_2$ does have an outgoing $\varepsilon$-edge to $q_a$, so we add $q_a$ to $S$. $q_a$ has no outgoing $\varepsilon$-edges, so the algorithm terminates.

Our set of possible states is now $S = \{ q_2, q_a \}$. Since our set of possible states includes the accepting state, the query string is accepted.

As an example to see the NFA get "stuck", walk through the simulaiton algorithm on this NFA with query strings b, aaa, aba, aabb. You'll find yourself with $S$ being empty at some intermediate step of the algorithm. Need you continue any further? In the code, can you see how we can terminate earlier if we ever have an empty set $S$?

What you need to implement for this section

You are to write the matches(String s) method as well as the successors(char c) method which is in the inner State class. You shouldn't modify anything else in the State class at all, but you should definitely read through it. Specifically, you should look at the instance attributes of State to see if they'll be useful for the successors(char c) method.

Hints/Notes

  1. You shouldn't need to add any other methods, you only need to fill in the matches(String s) and the successors(char c) methods. You should read through the entirety of NFA.java to get an idea of what methods are available to you.
  2. The only attributes of the NFA class that you should need to use is the EPSILON variable (which we use to denote the $\varepsilon$ character), the _startState (denotes the starting state of the NFA), and (depending on implementation) _acceptState which is the accepting state of the NFA.
  3. If in IntelliJ you are passing everything but on the AG or make check you aren't, make sure you have the -ea VM option set in IntelliJ. The tests use Assertion statements to check validity and if you do not enable assertions everything will always pass.
  4. Your successors(char c) should have three cases similar to the javadoc: one case (the easy one) is if the character is not EPSILON. If it is EPSILON, then you should reread what we said in the pseudocode and (ab)|(aab) example about what you should do with EPSILON edges in the start of the algorithm and intermediary stages (remembering that a simple graph algorithm should spring to mind). The third case is if there is no outgoing edge with the right edge label. Read the javadoc to see what you should return in that case.
  5. The matches(String s) should look pretty similar to the pseudocode we gave above. Note that we wrote that you should add to/delete from the set $S$ as you iterate over it: you can already smell the ConcurrentModificationException. As this is an implementation detail, we'll leave the solution up to you, but a quick Google search will help those that are stuck here.
  6. In some implementations, there may be multiple acceptance states. That doesn't change anything since we can just add $\varepsilon$-edges from every accepting state to a new "dummy" accept state, and then make all the old accepting states no longer accepting. That has nothing to do with anything you are coding, but is noted just in case you read something online that (you think) contradicts what we've said.
  7. We're not asking you to write some novel algorithm. The desired functionality can be fulfilled with algorithms we've learned in lecture.

D. Submission

Complete the successors(char c) method in the State inner class as well as the matches(String s) method in the NFA class. You can write additional tests if you'd like. Read the NFATests class for an explanation on how to write tests for this homework. You don't need to write any tests, but if you'd like to you have that power.

Then you may submit the normal way via git tag and git push --tags