Functional Programming with Datalog
- Discussion lead: Tyler, Jeremy
- Reading guide due Monday, February 19 at 2:30pm
- Paper link
Summary
We love Datalog. However, it is not easy to write and maintain large, complex Datalog programs. “Functional Programming with Datalog” presents a small functional language that compiles to Datalog rules; these rules can be evaluated efficiently by existing Datalog engines like Soufflé and IncA. The core language supports mutually recursive functions, if expressions, and variable bindings, and uses the standard demand transformation to “control” the flow of bottom-up evaluation. The language is extended first to support algebraic data types (ADTs); then to support first-class sets, and first-class functions (via defunctionalization). To evaluate their implementation, the authors write various program analyses and an interpreter for the untyped lambda calculus.
Suggested reading strategy
Section | Strategy | Summary / Notes | |
---|---|---|---|
1. | Introduction | Read thoroughly | Motivates the problem well; provides a good high level overview. |
2. | Datalog Frontends: State of the Art | Read | Contrasts the paper’s approach with existing work, and suggests that Datalog could become an internal representation (IR) for future languages to target. |
3.1 | Compilation by example | Read thoroughly | Presents the core language with a motivating example. See the notes for comments on the non-terminating factorial example. |
3.2 | Translating functional programs to Datalog, technically | Skim | Technical details on how to compile functional IncA to Datalog. See the notes for an explanation of Figure 5. |
3.3 | Demand-driven bottom-up evaluation | Read thoroughly | Overview of the demand transformation. The demand transformation was presented in the Slog class and a precursor was presented as Magic Sets in the Datalog lecture. |
4.1 | Compiling user-defined data types by example | Read thoroughly |
Overview of technique, and example with plus and twice for Peano
numbers encoded in an ADT (0 = Zero() , 1 = Succ(Zero()) , etc.).
|
4.2 | Extending functional IncA with ADTs | Skim | Technical details on how to extend functional IncA to support Algebraic Data Types. |
5 | Case study: Type Checking, Type Erasure, and Interpretation | Read | Description of type checker and interpreter written in functional IncA with ADTs. |
6 | Mixing functions and relations | Skim; read 6.3 (pg. 20) if you are interested in defunctionalization | Motivation, overview, and technical description on how to encode set objects in functional IncA as relations in Datalog. |
7 | Case Studies: Data-Flow Analyses and Clone Detection | Read | Description of flow-sensitive reaching definitions and interval analyses; clone detection of Java bytecode, written in functional IncA. |
8 | Implementation and Performance Evaluation | Read | |
9 | Related work | Skim | Comparisons to and background for Flix, Formulog, Datafun, Soufflé, and Dedalus/Bloom. |
10 | Conclusion | Skip |
Notes
Non-terminating queries: range-restriction vs “infinite” relations
Section 3.1 of the paper presents a factorial function that does not terminate:
fact(n, out) := n = 0, out = 1
fact(n, out) := n != 0, fact(n-1, out'), out = n * out' (1)
The authors say that the factorial function does not terminate because it is
not range-restricted. While this is true, we feel like the focus on
range-restriction is somewhat misleading. For example, in other Datalog
variants,
one could model n-1
as lookup into an “infinite” successor
relation:
fact(n, out) := n != 0, successor(k, n), fact(k, out'), out = n * out' (2)
In which case fact
would indeed be range-restricted. However, bottom-up
computation of e.g. fact(5)
would still not terminate because the successor
relation has infinite size. That is, lack of range-restriction is just one way
that a bottom-up evaluation might not terminate; another way that bottom-up
evaluation might not terminate is if rules include “infinite” relations in
their bodies (even if all rules are range-restricted).
The authors’ solution—the demand transformation—solves both cases of
non-termination. The observation is that a terminating functional program must
call fact(n)
a finite number of times, and therefore we can guard fact(n,
out)
with a demand relation fact_input(n)
that stores every call. In the
original example (1), the rule is range-restricted, and hence terminates. In
the successor
example (2), the join of fact_input(n), successor(k, n)
is
finite, and so it too terminates.
How might lack of range-restriction cause Datalog programs to not terminate?
Consider the model-theoretic semantics of Datalog. Recall that the Herbrand universe $\mathcal{U}$ of a Datalog program $P$ is the set all non-variable constants appearing in $P$, while the Herbrand base $\mathcal{B}$ is a set containing predicates with their variables replaced with all possible values in the Herbrand universe $\mathcal{U}$. For example, given the Datalog program
edge(1, 2) :-
edge(2, 3) :-
edge(3, 4) :-
P(x, y) :- edge(x, y)
P(x, y) :- P(x, y), edge(y, z)
The Herbrand universe $\mathcal{U}$ is $\mathcal{U} = \{1, 2, 3, 4\}$ and the Herbrand base $\mathcal{B}$ is:
An interpretation $I$ of a Datalog program $P$ is a subset $I$ of the Herbrand base, $I \subseteq \mathcal{B}$, and a model $M$ is an interpretation where every rule is satisfied over substitution of the rule’s variables with constants from $\mathcal{U}$. That is, every model of $P$ must satisfy the following (combinatorially many) sentences:
Note that most of the above sentences are trivially satisfied because their antecedent is false (for example, $\E(1, 1) = \texttt{false}$, which means the minimal model excludes $\P(1, 1)$, since its antecedent is never satisfied).
However, what happens if we add a rule whose head has an unbound variable, like
bad(x, y, n) :- edge(x, y)
? This injects even more sentences into the set
of sentences a model has to satisfy:
And in particular, since our model must satisfy the sentences with e.g. $\E(1, 2)$ in the antecedent, every model (and so therefore the minimal model) must contain many “useless” facts $\B(1, 2, 1), \B(1, 2, 2), \B(1, 2, 3), \ldots$, and so on. In addition, if the Herbrand universe is not finite (e.g. it contains all integers), the minimal model could be infinite, and a bottom up evaluation would never terminate!
This is the motivation behind range-restriction. A rule is range-restricted if every variable in the head is bound in the body. Intuitively, for every rule, the facts generated by that rule must have constants that appear in some relation in the body of the rule. Thus, the sizes of the body relations serve as a bound on the size of the head relation, and if the body relations have finite size, then the head relation also has finite size.
Figure 5
Translating an expression in functional IncA to Datalog results in a set containing tuples $(t, a)$. For each tuple $(t, a)$, the expression evaluates to value $t$ if the conditions in $a$ are satisfied. $a$ is a collection of Datalog antecedents, and can be roughly thought of encoding the “control-flow” paths (branches, function calls) the program must take in order to evaluate to $t$.
For example, the rewrite rule for if
is the union of two sets. The first set
denotes the possible values the entire if-expression could evaluate to if the
condition is true. This requires that $t_1$ evaluates to $\textsf{true}$,
guarded by its own guards $a_1$. The resulting value $t_2$ must also be guarded
by its own tuples $a_2$ (consider an if expression nested within another if
expression). The second set is analogous, except for the false case.
The union of both sets represents all possible values the entire expression
might evaluate to in either case.
Reading questions
- Compare and contrast functional IncA with Doop, Flix, and Slog.
- How does functional IncA encode control flow into Datalog?
- How does functional IncA support algebraic data types? How does its implementation differ from Slog’s?
- Explain the demand transformation as succinctly as you can.
- How do the authors add support for querying Datalog relations from functional IncA? How “clean” is their approach?
Discussion questions
- What are the advantages of targeting Datalog as a backend for a functional language? What are the disadvantages?
- Datalog engines like Soufflé might not support all features that a functional language might need, like user-defined aggregations, or negation. How might these limitations affect the power of functional IncA?
- What language features might be challenging to support in extensions to functional IncA?
See also:
- The demand transformation
- Magic sets
- Datafun, a system that takes a more formal approach to combine FP with Datalog
- Formulog, another Datalog with FP features (and SMT!)