Study Guide: BNF (Backus-Naur Form)

Instructions

This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.

Assignments

For solutions to these assignments once they have been released, find the links in the front page calendar.

Lectures

Readings

BNF is not covered in the 61A textbook, but here's a chapter from a UCI textbook: EBNF: A Notation to Describe Syntax. That chapter goes into more detail than strictly needed for our class, but may be helpful for those of you who like to learn from textbook readings.

Guides

BNF

Backus-Naur Form (BNF) is a syntax for describing a context-free grammar. It was invented for describing the syntax of programming languages, and is still commonly used in documentation and language parsers. EBNF is a dialect of BNF which contains some convenient shorthands.

An EBNF grammar contains symbols and a set of recursive production rules. In 61A, we are using the Python Lark library to write EBNF grammars, which has a few specific rules for grammar writing.

There are two types of symbols: Non-terminal symbols can expand into non-terminals (including themselves) or terminals. In the Python Lark library, non-terminal symbols are always lowercase. Terminal symbols can be strings or regular expressions. In Lark, terminals are always uppercase.

Consider these two production rules:

numbers: INTEGER | numbers "," INTEGER
INTEGER: /-?\d+/

The symbol numbers is a non-terminal with a recursive production rule. It corresponds to either an INTEGER terminal or to the numbers symbol (itself) plus a comma plus an INTEGER terminal. The INTEGER terminal is defined using a regular expression which matches any number of digits with an optional - sign in front.

This grammar can describe strings like:

10
10,-11
10,-11,12

And so on, with any number of integers in front.

A grammar should also specify a start symbol, which corresponds to the whole expression being parsed (or the whole sentence, for a spoken language).

For the simple example of comma-separated numbers, the start symbol could just be the numbers terminal itself:

?start: numbers
numbers: numbers "," INTEGER | INTEGER
INTEGER: /-?\d+/

EBNF grammars can use these shorthand notations for specifying how many symbols to match:

EBNF Notation Meaning Pure BNF Equivalent
item* Zero or more items items: | items item
item+ One or more items items: item | items item
[item]
item?
Optional item optitem: | item

Lark also includes a few handy features:

  • You can specify tokens to complete ignore by using the ignore directive at the bottom of a grammar. For example, %ignore /\s+/ ignores all whitespace (tabs/spaces/new lines).
  • You can import pre-defined terminals for common types of data to match. For example, %import common.NUMBER imports a terminal that matches any integer or decimal number.

Using all of that, here's an EBNF grammar that corresponds to the Calculator language:

start: calc_expr?
calc_expr: NUMBER | calc_op
calc_op: "(" OPERATOR calc_expr* ")"
OPERATOR: "+" | "-" | "*" | "/"

%ignore /\s+/
%import common.NUMBER

You can paste that into code.cs61a.org and then input Calculator expressions in the interpreter to see their parse trees. Try it!

Practice Problems

Q1: Toddler-ese

Consider this EBNF grammar for "Toddler-ese", a language for communicating wants and needs with toddlers.

?start: sentence
sentence: describe_wants | describe_feeling
describe_wants: TODDLER " wants " noun_phrase "!"
noun_phrase: (ARTICLE " ")? NOUN
describe_feeling: TODDLER " is " EMOTION "!"

TODDLER: "beverly" | "baggy" | "baby"
ARTICLE: "the" | "a" | "an" | "un" | "una"
NOUN: "ball" | "elmo" | "chalk" | "gusano"
EMOTION: "sad" | "mad" | "tired"

Which of these phrases cannot be parsed by this grammar?

  1. baby wants ball!
  2. baggy is mad!
  3. beverly is elmo!
  4. baggy wants un elmo!
Phrase #3, beverly is elmo!, cannot be parsed by this grammar. The "is" will be matched by the describe_feeling non-terminal, but "elmo" isn't matched by the EMOTION terminal, so there is no rule that matches the whole phrase.

Remember that you can paste the grammar in code.cs61a.org and then enter expressions in the terminal below to see the parse trees or errors for each expression!

Which of these phrases can be parsed by this grammar?

  1. baggy is mad!!
  2. beverly is un gusano!
  3. baggy wants a sad elmo!
  4. baby wants the ball!
Only phrase #4, baby wants the ball!, can be parsed by this grammar. Phrase #1 has too many exclamation points, phrase #2 is using a NOUN where an EMOTION is expected. Phrase #3 adds an adjective where none is allowed.

Which BNF line would we modify to add support for emotions in front of the NOUNs, like "mad elmo"?

  1. sentence: describe_wants | describe_feeling
  2. noun_phrase: (ARTICLE " ")? NOUN
  3. ARTICLE: "the" | "an" | "a" | "una" | "un"
  4. NOUN: "ball" | "elmo" | "chalk" | "gusano"
The best option in those lines is the noun_phrase non-terminal, modifying it like so:

noun_phrase: (ARTICLE " ")? (EMOTION " ")? NOUN

Q2: Python integers

Consider this EBNF grammar for allowed integer formats in the Python language, adopted from the Python documentation.

?start: integer
integer:  decinteger | bininteger | octinteger | hexinteger
decinteger:  NONZERODIGIT DIGIT*
bininteger:  "0" ("b" | "B") BINDIGIT+
octinteger:  "0" ("o" | "O") OCTDIGIT+
hexinteger:  "0" ("x" | "X") hexdigit+
NONZERODIGIT:  /[1-9]/
DIGIT:  /[0-9]/
BINDIGIT:  /[01]/
OCTDIGIT:  /[0-7]/
hexdigit:  DIGIT | /[a-f]/ | /[A-F]/

Which of these are valid Python integers, according to the grammar? Select all that apply

  1. 0xF9
  2. 0x55
  3. 0xf9
  4. 0xG9
  5. 0xFFF
They are all valid integers, matching the hexinteger rule, except for 0xG9. That one is not valid because it contains a "G" and hexdigit only matches [0-9], [a-F], [A-F].

Remember that you can paste the grammar in code.cs61a.org and then enter expressions in the terminal below to see the parse trees or errors for each expression!

Which of these is not a valid Python integer, according to the grammar? Select one

  1. 07543
  2. 0b0101
  3. 0o7543
  4. 0x1011
The first number 07543 is not valid since it does not match any of the rules. Decimal integers cannot start with a zero and the other integer types start with either "0b", "0o", or "0x".

Which of these can be parsed by the grammar? Select all that apply

  1. 0
  2. 0b
  3. 0x
  4. 0o
None of them! The zero by itself cannot be parsed, since a decinteger must start with a non-zero digit and the other integer types require an additional letter. The other options can't be parsed, since the bininteger, hexinteger, and octinteger rules require at least 1 digit after the "b"/"x"/"o".

Which of these symbols are considered a non-terminal symbol? Select one

  1. NONZERODIGIT
  2. DIGIT
  3. BINDIGIT
  4. OCTDIGIT
  5. hexdigit
The hexdigit symbol is a non-terminal symbol, since its rule can expand to another symbol, DIGIT. The other symbols in the list are all terminal symbols, since they are defined by regular expressions, not other symbols.

Q3: Scheme expressions

Consider this EBNF grammar for a subset of Scheme expressions, adopted from the Scheme documentation.

?start: expression
expression: constant | variable | "(if " expression expression expression? ")" | application
constant: BOOLEAN | NUMBER
variable: identifier
application: "(" expression expression* ")"

identifier: initial subsequent* | "+" | "-" | "..."
initial: LETTER | "!" | "$" | "%" | "&" | "*" | "/" | ":" | "<" | "=" | ">" | "?" | "~" | "_" | "^"
subsequent: initial | DIGIT | "." | "+" | "-"
LETTER: /[a-zA-z]/
DIGIT: /[0-9]/
BOOLEAN:  "#t" | "#f"

%import common.NUMBER
%ignore /\s+/

Which of these Scheme expressions can be parsed by that grammar? Select all that apply

  1. #t
  2. 193
  3. x
  4. -xy
The fourth expression -xy cannot be parsed by the grammar - it doesn't match variable since it starts with a "-" and doesn't match the other symbols since it doesn't start with a parentheses.

What would be equivalent to the current regular expression for DIGIT? Select all that apply

  1. DIGIT: /\d+/
  2. DIGIT: /\d*/
  3. DIGIT: /\d/
  4. DIGIT: /\d?/
The original regular expression matches a single digit from 0 to 9. Only the third option /\d/ would be equivalent. The other options have quantifiers, so they would match more or less than a single digit as well.

Which of these Scheme expressions can be parsed by that grammar? Select all that apply

  1. (define x 5)
  2. (if (= x 5) x y)
  3. (begin (print 1) (print 2) (print 3))
  4. (print (+ 1 2))
All of those expressions can be parsed by that grammar. However, the define and begin expressions would be parsed as an application, even though they are special forms. In the full Scheme EBNF grammar, there is a separate symbol for the special forms, so that the parser knows they aren't just standard procedure calls.

For example, to parse the define form, we could modify the grammar like so:

expression: constant | variable | "(if " expression expression expression? ")" | define | application
define: "(define" variable expression ")"

Which of these statements is true? Select all that apply

  1. If an expression parses with this grammar, then it is possible to run that code without error in a Scheme interpreter.
  2. This grammar contains all the information needed to build a working interpreter for this subset of Scheme.
  3. This grammar contains the information needed for the phase of the interpreter that parses an input string and turns it into an expression tree.
  4. If an expression parses with this grammar, it is considered syntactically correct Scheme code.
Options #3 and #4 are correct. Option #1 is not correct as some code is syntactically correct but still results in run-time errors, such as (1 2). Option #2 is not correct since a grammar does not contain information on how to evaluate the code.