## Navigation

## A. Background

### What is an NFA?

NFA stands for Non-deterministic Finite Automaton and is a simple model of computation. Models strip away the details of whatever they aim to abstract so that we can generalize proofs and algorithms. You may have heard of a Turing Machine, which is an example of a model of computation (and the strongest one theorized). We will not talk about Turing Machines here, but if you're interested feel free to check out this link for more information.

We are specifically interested in NFAs because they compute *regular languages*. A regular language is
a subset of strings over some alphabet with strict rules. Regular languages can be described with a
*regular expression* (also called a regex, regexp, or pattern). These are exactly like the regular
expressions you've learned about in class. For example, the regex `(ab)*`

defines the following subset of
strings: {$\varepsilon$, `ab`

, `abab`

, `ababab`

, etc} where in this case, our alphabet is just the
letters `a`

and `b`

. An alphabet is just the symbols that you are using. In this assignment, our alphabet
is all Java chars except for `*`

, `+`

, `|`

, `(`

, `)`

, `\`

. That is because they have special meaning in
our patterns. A more advanced implementation would allow these characters in the alphabet by escaping
them, but we won't worry about that and we won't be testing on that. In this spec our NFA diagrams
use a smaller alphabet: just `a`

and `b`

to keep things simple. Having a larger alphabet doesn't
complicate anything so don't worry about the spec oversimplifying the problem.

You may be wondering what $\varepsilon$ is: it's just a way to denote an empty character/string. From
lecture, we know that the `*`

operator matches 0 or more of the preceding pattern (in the above case
`ab`

). So `ab`

$0$ times is just an empty string, so this regexp includes $\varepsilon$.

Let's look at what an NFA for the regexp `a*ab`

looks like:

You'll notice that an NFA is essentially a graph, which you should be familiar with by now (review the
relevant lecture if you aren't). Specifically, an NFA is a directed graph with
labeled edges. In this case, each edge label is a symbol in the alphabet our NFA is defined over (for us,
this is a valid Java char). The name of a state doesn't matter and can be omitted (as our implementation
does), it's just something we add so we can reference states easily in this explanation. Every NFA has a
single *start state* (denoted with the incoming edge labeled "start") and a single *acceptance state*
(denoted with the double circles). It is possible that the start and accept states may be the same state.
It turns out that this doesn't complicate anything and you shouldn't worry about this case as the
algorithm you'll implement isn't affected by this at all.

There are many possible NFAs for the pattern `a*ab`

, we chose this one somewhat arbitrarily. For a fun
exercise, after reading this next bit, see if you can draw a different NFA that computes the same
pattern.

Earlier we said that NFAs *compute* regular languages. That is because an NFA can take some string
(from here referred to as the *query string*) and tell you whether or not that string matches the
underlying regexp it computes. This is called *NFA simulation* and is what you will implement in this
assignment. At a high level, the query string is *accepted* by the NFA if you can begin at the start
state and consume one character of the query string at a time while taking edges with labels that match
the character you just consumed and end up in the accept state of the NFA. You must consume the entire
string and be at the accept state for the string to be accepted (for example, `aba`

will pass through
the accept state of the NFA but not end there). Verify for yourself that this computes the regular
expression `a*ab`

by trying some query strings like `ab`

, `aab`

, `aaab`

, and some strings you know aren't
in the language like $\varepsilon$ `a`

, `b`

, `aba`

, `abb`

, etc. Also, the edge from $q_a$ to $q_f$
means that you can take *either* `a`

or `b`

to transition states. It's a shorthand used instead of
drawing two separate edges (in the code we don't do this, check out the `_edges`

variable of `State`

to
see how we store the directed, labeled edges).

You probably noticed there are potentially different paths the query string may take: this is the source
of the non-determinism. For example, the query string `ab`

can either take the path $q_0, q_1, q_a$ or
it could take the path $q_0, q_0, q_f$. Remember the query string starts off in $q_0$ so that is why
it is the first state of both paths. The first path would cause the query string to be accepted and the
second path would cause the query string to be rejected: so our definition of accepting is that there
exists *some* path from the start state to the accepting state that consumes all the characters in the
query string.

So if our query string is `aaba`

, then we must first begin at the start state and take an edge with label
`a`

, then an edge with label `a`

, then an edge with label `b`

, then an edge with label `a`

. The one thing
that somewhat complicates this is that we can always take a $\varepsilon$-edge at any time *without
consuming any characters* of the query string. This is discussed later, but just keep that in mind while
reading. One imporant thing to note is that at each step of simulation you must take a valid edge
(possibly an $\varepsilon$-edge if one is available, more on this later). If there are no valid
transitions then the query string is *rejected* by the NFA. **The whole query string** must be accepted
by the NFA, not just part of it. For this particular NFA, it can get "stuck" (meaning no valid
transitions from a state) because the state $q_f$ has no outgoing edges labeled `a`

or `b`

(remember our
alphabet for this spec is just these two letters). You'll see later in this spec how to account for this
in the `successors(char c)`

method, but for now just know that if you get stuck that you try a different
path.

You'll see a lot of code for constructing the NFA from a pattern string, but you won't have to touch this at all. At this point, you should understand that an NFA is really a directed graph with labels on edges. The next section will include a brief explanation of what operations our NFA implementation supports.

## B. Thompson's Construction

The algorithm we are using to construct an NFA from a pattern is called Thompson's Construction.

You aren't responsible for writing any of the construction code as it has been provided for you. If you are curious, read through the code as we've included javadocs for all the methods. If you don't care for this, you can skip this section. For this assignment, you only need to know how simulation works.

Here is a table describing the different operators this implementation supports:

Operator | Symbol | Type | Description | Example |
---|---|---|---|---|

Concat | N/A | Binary | Matches if and only if the query string matches the concatenation of the sub-pattern on the left with the sub-pattern on the right | `(ab)(c*)` matches all strings that begin with `ab` followed by any number of `c` 's (including $0$) |

Or | `|` |
Binary | Matches if and only if the query string matches the left sub-pattern or the right sub-pattern | `(ab)|(c*)` matches either `ab` , or any number of `c` 's (including $0$ `c` 's) |

Kleene Star | `*` |
Unary | Matches any number (including $0$) of the sub-pattern it is applied on. | `(bc)*` matches any number of `bc` 's |

Kleene Plus | `+` |
Unary | Exactly the same as the Kleene Star, except must have a positive (not $0$) number of the sub-pattern. | `(bc)+` matches any positive number of `bc` 's. |

What is really amazing is every regexp is built from these basic operators. The regexps you've used in past homeworks can all be implemented using just these operations. But it helps to have shortcuts, which is why we add two escape sequences to help with testing. Escape sequences are really just patterns: as an excercise, ponder what the regular expression would be for the following two escape sequences (the answers are in the code).

The two escape sequences we support are `\l`

and `\d`

. The former is **not a standard escape character**,
we just include it to make testing easier. The second is one you should recognize. You shouldn't have to
deal with these escape chars unless you are writing tests. The code already parses the escape chars into
their underlying patterns. So **nowhere in your code should you be worrying about any escape chars**.

Here is a table that describes what each one means.

Pattern | Description | Notes |
---|---|---|

`\l` |
matches any lowercase letter | This is NOT a standard regexp escape pattern. |

`\d` |
matches any digit 0-9 | This is a standard regexp pattern. |

## C. NFA Simulation

Up until this point, there has been no non-determinism: the simulation of an NFA is where the non-determinism arises.

As described earlier, NFA simulation requires some query string as well as some NFA. At a high level, we will be iterating over the query string one character at a time and keeping track of what states the NFA may be in. It is states (plural) and not state (singular) since a single state may have multiple outgoing edges with the same symbol and because of $\varepsilon$-edges. So at any point, we may be in many states at once. This is the source of the non-determinism.

We say that the NFA *accepts* the query string if the set of states the NFA can be in once all of the
characters have been iterated over contains at least one state that is accepting. Otherwise, the NFA
*rejects* the query string.

## Pseudocode

Note: we modified the pseudocode a little bit to clarify the bit about removing states, so don't be worried if this looks a little different than the last time you saw it.

In rough pseudo-code, here at the steps we take:

- Instantiate our set of possible states $S$ as the start state $q_0$ as well as
**any states reachable from $q_0$ via only $\varepsilon$-edges**. This includes successive $\varepsilon$-edges: so if a state $q$ is $1, 2, 3, 4, ...$ succesive $\varepsilon$-edges away from $q_0$, they should be added to $S$. This is imporant and we'll see an example of this. For each character $c$ in the query string:

- Create a new empty set $S_{new}$. This will represent all the States we can be in after we consume character $c$.
For each $q \in S$:

- Update our set $S_{new}$ by adding all states $q_{new}$ that $q$ has an outgoing edge to with edge label $c$. If no edges are labeled $c$, then don't add anything for this state.

- Set $S = S_{new}$ and make $S_{new}$ empty again. $S$ is almost ready for the next iteration, we now need to account for $\varepsilon$-edges.
For each $q_{new} \in S$ (the set $S$ should now only contain the new states you added):

- Update our set $S_{new}$ with all states $q_{\varepsilon}$ that $q_{new}$ can reach by only taking edges with edge label $\varepsilon$. Just like the first step of this algorithm, this includes successive $\varepsilon$-edges.

- Add all of the states from $S_{new}$ to $S$. $S$ is now ready for the next iteration.

- If the final set $S$ (obtained after iterating over all the characters in the query string) contains the accepting state, return true. Else false.

It's ok if you are confused after reading that. The imporant thing to know is what each of these sets
mean: $S$ represents all the states we may be in at any point in the simulation. $S_{new}$ is just
to help us in the code so we don't get a `ConcurrentModificationException`

, so in the examples below we
won't be referencing $S_{new}$ explicitly.

Let's illuminate with some examples.

Here is an NFA for the pattern `(ab)|(aab)`

. This pattern only matches the two strings `ab`

and `aab`

.

First, our NFA may be in states $S = \{ q_0, q_1 \}$ since $q_0$ has an $\varepsilon$-edge to $q_1$. Remember that we add all states reachable from $q_0$ by any number of successive $\varepsilon$-edges. A simple graph algorithm for this should spring to mind.

For the NFA above and the query string `ab`

, we begin iteration with the first character, `a`

.

- $q_0$ can transition to $q_1$ by using an
`a`

-edge - $q_1$ can transition to the $q_2$ by using an
`a`

-edge

So after the first character of the query string, the set of states the NFA may be in is $S = \{ q_1, q_2 \}$.

Since none of these states have an outgoing $\varepsilon$-edge, we continue to the next character in the
query string: `b`

.

- $q_1$ can transition to $q_f$ by using a
`b`

-edge - $q_2$ can transition to $q_a$ by using a
`b`

-edge

Now our set of possible states $S = \{ q_f, q_a \}$ and none of these have outgoing
$\varepsilon$-edges, so the iteration is done. Since this set includes the accept state, we say that `ab`

*is accepted* by the NFA.

You should have noticed that the set $S$ was tracking all possible states the NFA could be in at any step of the simulation. It might have been a bit confusing in the pseudocode, so go ahead and reread the pseudocode with the understanding that $S$ should have all the possible states the NFA may be in and the way we update it should make more sense to you, especially the bit about the $\varepsilon$-edges which might have confused you earlier.

Here is another NFA for the same pattern `(ab)|(aab)`

If we use the query string `ab`

, you'll notice that $q_0$ has multiple valid transitions when we
iterate over the first character of the query string, `a`

. Let's see how that is handeled:

First, our set of possible start states is just $S = \{ q_0 \}$ since there are no outgoing
$\varepsilon$-edges from the start state. Now we iterate over the first character of the query string,
`a`

:

- $q_0$ can transition to $q_1$ and $q_3$ by using an
`a`

-edge

So, the set of states the NFA may be in is now $S = \{ q_1, q_3 \}$. None of these states have any outgoing $\varepsilon$-edges, so we continue to the next character.

The next character is `b`

:

- $q_1$ can transition to $q_2$ by using a
`b`

-edge - $q_3$ can't transition to any state by using a
`b`

-edge

Our set of possible states is now $S = \{ q_2 \}$. Since $q_3$ didn't have any outgoing edges with
label `b`

, the set $S$ shrunk to size 1. The state $q_2$ *does* have an outgoing $\varepsilon$-edge to
$q_a$, so we add $q_a$ to $S$. $q_a$ has no outgoing $\varepsilon$-edges, so the algorithm terminates.

Our set of possible states is now $S = \{ q_2, q_a \}$. Since our set of possible states includes the accepting state, the query string is accepted.

As an example to see the NFA get "stuck", walk through the simulaiton algorithm on this NFA with query
strings `b`

, `aaa`

, `aba`

, `aabb`

. You'll find yourself with $S$ being empty at some intermediate step of
the algorithm. Need you continue any further? In the code, can you see how we can terminate earlier
if we ever have an empty set $S$?

### What you need to implement for this section

You are to write the `matches(String s)`

method as well as the `successors(char c)`

method which is in
the inner `State`

class. You shouldn't modify anything else in the `State`

class at all, but you should
**definitely** read through it. Specifically, you should look at the instance attributes of `State`

to
see if they'll be useful for the `successors(char c)`

method.

### Hints/Notes

- You shouldn't need to add any other methods, you only need to fill in the
`matches(String s)`

and the`successors(char c)`

methods. You should read through the entirety of`NFA.java`

to get an idea of what methods are available to you. - The only attributes of the
`NFA`

class that you should need to use is the`EPSILON`

variable (which we use to denote the $\varepsilon$ character), the`_startState`

(denotes the starting state of the NFA), and (depending on implementation)`_acceptState`

which is the accepting state of the NFA. - If in IntelliJ you are passing everything but on the AG or
`make check`

you aren't, make sure you have the`-ea`

VM option set in IntelliJ. The tests use`Assertion`

statements to check validity and if you do not enable assertions everything will always pass. - Your
`successors(char c)`

should have three cases similar to the javadoc: one case (the easy one) is if the character is not`EPSILON`

. If it is`EPSILON`

, then you should reread what we said in the pseudocode and`(ab)|(aab)`

example about what you should do with`EPSILON`

edges in the start of the algorithm and intermediary stages (remembering that a simple graph algorithm should spring to mind). The third case is if there is no outgoing edge with the right edge label. Read the javadoc to see what you should return in that case. - The
`matches(String s)`

should look pretty similar to the pseudocode we gave above. Note that we wrote that you should add to/delete from the set $S$ as you iterate over it: you can already smell the`ConcurrentModificationException`

. As this is an implementation detail, we'll leave the solution up to you, but a quick Google search will help those that are stuck here. - In some implementations, there may be multiple acceptance states. That doesn't change anything since we can just add $\varepsilon$-edges from every accepting state to a new "dummy" accept state, and then make all the old accepting states no longer accepting. That has nothing to do with anything you are coding, but is noted just in case you read something online that (you think) contradicts what we've said.
- We're not asking you to write some novel algorithm. The desired functionality can be fulfilled with algorithms we've learned in lecture.

## D. Submission

Complete the `successors(char c)`

method in the `State`

inner class as well as the `matches(String s)`

method in the `NFA`

class. You can write additional tests if you'd like. Read the `NFATests`

class for an
explanation on how to write tests for this homework. You don't need to write any tests, but if you'd
like to you have that power.

Then you may submit the normal way via `git tag`

and `git push --tags`