Declarative Program Analysis and Optimization

CS294-260, Spring 2024

Datalog

Datalog is a declarative logic programming language that comes from the database theory and logic programming communities. While there are many perspectives on Datalog, we are going to focus on it from an operational perspective, and draw some comparisons to term rewriting systems.

Resources

There are a lot of great resources out there on Datalog. Here are just a few:

Simple Example

Let’s start by looking at a classic Datalog program to compute paths in a graph, and we’ll describe it in the typical, operational way:

edge(1, 2).
edge(2, 3).
edge(3, 4).

path(X, Y) :- edge(X, Y).
path(X, Z) :- edge(X, Y), path(Y, Z).

This program defines a database of facts and a set of rules that can be used to derive new facts. Each line of the form edge(1, 2). is a fact, stating that the binary relation edge contains the tuple (1, 2). The program begins evaluation with the following state:

time  edge    path
0     (1, 2)
      (2, 3)
      (3, 4)

So at timestep 0, the database only contains those tuples given by facts from the program. Then the program will apply rules to derive new facts. Let’s begin with the first rule, path(X, Y) :- edge(X, Y). Datalog rules can be read as a backward implication: “if the right-hand side is true, then the left-hand side is true”. So to apply one step of this rule, we look for tuples in the edge relation and add them to the path relation.

time  edge    path
0     (1, 2)
      (2, 3)
      (3, 4)
1             (1, 2)
              (2, 3)
              (3, 4)

At this point, it’s worth noting that the “base case” rule will not add anything else to the database. Relations are sets in Datalog, so adding the same tuple twice has no effect. So given that we never add anything else to the edge relation, the rule path(X, Y) :- edge(X, Y) is done at this point.

Now we move on to the “inductive” rule: path(X, Z) :- edge(X, Y), path(Y, Z). This rule epitomizes the recursive nature of Datalog rules. In plain English, it says “if there is an edge from X to Y and a path from Y to Z, then there is a path from X to Z”.

Let’s inspect the right-hand side of the rule (also called the body or the query) a little more. The body of a Datalog rule is a conjunction of atoms, so the comma is read as “and”. The capital letters (using the convention from Prolog) are variables, bound by the body and used in the head (the left-hand side). The first step of applying this rule is to search for substitutions that make the body true, i.e., mappings from variables to values such that both edge(X, Y) and path(Y, Z) are in the database. A reader familiar with databases might recognize this as a conjunctive query, which is typically answered by a join operation in a relational database.
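
Written out logically, the body of the inductive rule is the conjunctive query below, in which any variable that appears only in the body (here, Y) is implicitly existentially quantified:

$$ \exists Y.\ \left( \textsf{edge}(X, Y) \wedge \textsf{path}(Y, Z) \right) $$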

Now let’s apply the inductive rule. Looking at timestep 1, we find two substitutions that make the body true: $\{X \mapsto 1, Y \mapsto 2, Z \mapsto 3\}$ and $\{X \mapsto 2, Y \mapsto 3, Z \mapsto 4\}$. Applying the head to each substitution adds the paths (1, 3) and (2, 4) to the database in the next timestep:

time  edge    path
0     (1, 2)
      (2, 3)
      (3, 4)
1             (1, 2)
              (2, 3)
              (3, 4)
2             (1, 3)
              (2, 4)

And finally, the next time step finds a path from 1 to 4:

time  edge    path
0     (1, 2)
      (2, 3)
      (3, 4)
1             (1, 2)
              (2, 3)
              (3, 4)
2             (1, 3)
              (2, 4)
3             (1, 4)

At this point, the program has reached a fixed point: applying the rules no longer adds any new facts to the database.

Optimization

Datalog is a very declarative language, and as such it’s amenable to many kinds of optimization.

One big one that we won’t talk about too much is query optimization. A big part of running Datalog is evaluating the queries that form the bodies of the rules. These are “just” conjunctive queries (like SQL joins), and so there are decades of research on optimizing them and executing them efficiently. One of the big pros of Datalog is that it gets to borrow all of that work; in fact some Datalog engines are built on top of standard SQL engines.

Other optimizations can be framed in terms of syntactic transformations of the program, something that probably feels more at home to programming languages folks. We’ll talk about two of the most common: semi-naive evaluation, which incrementalizes Datalog, and magic sets, which blurs the boundary between top-down and bottom-up evaluation.

Semi-naive Evaluation

Let’s return to our path-finding example. After timestep 1, it’s obvious that the rule path(X, Y) :- edge(X, Y) will no longer add anything to the database. Why is that? Since the edge relation doesn’t change, the rule will never find any new tuples to add to the path relation.

The other rule is a bit more complicated, but will make the general case clear. The rule path(X, Z) :- edge(X, Y), path(Y, Z) involves a join between the edge and path relations. Let’s consider where “new” path tuples might come from. Well, the edge relation is fixed, so the only way to get new path tuples is if the path relation changes. In general, a join can only produce new tuples if one of the relations changes. Let’s make this more precise: for two relations $R$ and $S$, we’ll write $RS$ for their join. Let $\Delta R$ be the new tuples to be added to $R$, so that $R + \Delta R$ will be the “next” iteration of $R$. Then some algebraic manipulation shows that:

$$ \begin{align*} RS + \Delta(RS) &= (R + \Delta R)(S + \Delta S) \\ RS + \Delta(RS) &= RS + R\Delta S + S\Delta R + \Delta R \Delta S \\ \Delta(RS) &= \hspace{2.64em} R\Delta S + S\Delta R + \Delta R \Delta S \\ \end{align*} $$

The key step here is the recognition that we only actually need to compute $\Delta(RS)$, since we already have $RS$ from the previous iteration. This saves us from having to join (old) $R$ with (old) $S$, since we’ve already done that work. This is the essence of semi-naive evaluation: to compute the new tuples of a join, you need to join new-with-new, new-with-old, old-with-new, but not old-with-old.

Most Datalog implementations feature semi-naive evaluation. It can be implemented almost entirely as a syntactic transformation, rewriting a join over many relations into a union of many joins over the new/old parts. To support this, most Datalog engines explicitly maintain the “new” and “old” parts of each relation.
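
As a sketch of what that transformation might produce for the path program (the delta_ relations are hypothetical names for the “new” parts, and concrete syntax varies by engine; here edge and path denote the “old” parts, as of the previous iteration):

// base rule: only new edges can produce new paths
delta_path(X, Y) :- delta_edge(X, Y).

// inductive rule: at least one side of the join must be new
delta_path(X, Z) :- delta_edge(X, Y), path(Y, Z).
delta_path(X, Z) :- edge(X, Y), delta_path(Y, Z).
delta_path(X, Z) :- delta_edge(X, Y), delta_path(Y, Z).

Between iterations, the engine merges each delta into its full relation, discarding tuples that were already present; the genuinely new tuples become the next iteration’s delta.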

Magic Sets

Datalog is typically evaluated bottom-up, which means facts are computed in order of the size of their derivation tree. This allows for efficient execution in many contexts, but it can be wasteful. Consider our path-finding example again. Running the program computes all paths in the graph. In fact, the Datalog program to compute path reachability in a graph is very similar to the Floyd-Warshall algorithm for all-pairs shortest paths.

But what if we only want to ask about a specific path between, say, vertices 42 and 56? It would be wasteful to compute all paths in the graph if we only want to know about some of them. We could manually write such a Datalog program:

path_from_42(42, X) :- edge(42, X).
path_from_42(X, Z) :- path_from_42(X, Y), edge(Y, Z).

The path_from_42 relation will contain precisely the pairs (42, X) such that X is reachable from 42, rather than every path in the graph.

The magic set transformation is a way to automatically generate such a program from the more general one that we wrote earlier. Given a goal query to compute, say path(42, X), the magic set transformation transforms the original program into a new one that only computes the facts necessary to answer the goal query. This goal-directedness means that magic set gives Datalog a top-down flavor, without actually changing the (bottom-up) evaluation strategy.
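
To give a flavor of the output (a rough sketch; the real transformation is driven by “adornments” recording which arguments of each relation are bound), the program for the goal path(42, X) might become something like:

// the "magic" predicate records which start vertices matter for the goal
magic_path(42).
magic_path(Y) :- magic_path(X), edge(X, Y).

// the original rules, guarded by the magic predicate
path(X, Y) :- magic_path(X), edge(X, Y).
path(X, Z) :- magic_path(X), edge(X, Y), path(Y, Z).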

The details of how it works are beyond the scope of this lecture, but we’ll see a “souped-up” version of magic set in the paper on Functional Programming in Datalog.

Extensions

Datalog is a simple language, which makes it amenable to many extensions. There are many, many extensions to Datalog; we’ll only mention a few that are relatively standard in the literature and relevant to the themes of this course.

Negation

Negation in the head of a rule is tricky (but well studied in the literature), so we’ll focus on negation in the body of a rule. Consider the following extension to our path-finding example:

disconnected(X, Y) :- vertex(X), vertex(Y), not path(X, Y).

This rule computes the disconnected pairs of vertices in the graph. The positive vertex atoms (assuming a relation listing the graph’s vertices) are there to bind X and Y: most engines require every variable to appear in a positive atom, a condition known as safety. Intuitively, this rule makes sense, and the intuitive method of computing it is essentially how it’s done in practice: first compute path, then compute disconnected from path. The general version of this is called stratified negation.1 This approach splits the program into strata such that:

- if a rule uses a relation positively in its body, that relation is computed in the same stratum as the rule’s head or an earlier one; and
- if a rule negates a relation in its body, that relation is computed in a strictly earlier stratum, so it is fully known before the negation is evaluated.
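
For the disconnected program, a valid stratification might look like this (again assuming the vertex relation):

// stratum 1: no negation
path(X, Y) :- edge(X, Y).
path(X, Z) :- edge(X, Y), path(Y, Z).

// stratum 2: may negate relations fully computed in stratum 1
disconnected(X, Y) :- vertex(X), vertex(Y), not path(X, Y).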

Stratification basically splits a Datalog program into a sequence of programs that can be evaluated in order, so no special implementation is needed to support this flavor of negation. However, some Datalog programs cannot be stratified. Consider trying to compute nodes in a graph that are only reachable via an odd number of steps from a given source (42):

odd(X) :- edge(42, X).
odd(Y) :- not odd(X), edge(X, Y).

This program cannot be stratified: odd depends negatively on itself, so there is no ordering of strata that fully computes odd before it is negated. There are other models of negation in Datalog,1 but they are less frequently used than the relatively simple stratified negation.

From Booleans to Integers, Lattices, and Semi-rings

We typically think of the output (and state) of a Datalog program as a bunch of relations, where each relation is a set of tuples. But we can also think of the output as a bunch of functions, where each function maps the input tuple to a value. If we have our functions output booleans, then we can reconstruct the behavior of Datalog with relations:

$$ f_R(x, y, z) = \begin{cases} \textsf{true} &\text{ when } (x, y, z) \in R \\ \textsf{false} &\text{ otherwise} \end{cases} $$

What about other output types for those functions? This is an exciting area of research in Datalog: figuring out what kinds of values we can output from those functions and still preserve the nice properties of Datalog. Recall the semi-naive evaluation section, where we used some of the algebraic properties of sets to rearrange those formulas and derive a more efficient algorithm. The algebraic properties of the values you are working with in Datalog determine what kinds of optimizations you can do.

Typically, literature on this topic will refer to two kinds of operations on the output of these functions:

- an “or”-like operation that merges the values produced by different derivations of the same tuple (disjunction in the boolean case; the addition of a semi-ring); and
- an “and”-like operation that combines the values of the atoms within a single rule body (conjunction in the boolean case; the multiplication of a semi-ring).

The literature on this topic mostly centers around dealing with lattices and (more generally) semi-rings. We will see more about this in our discussion of the Flix Datalog system.

A great example is working in the “min-plus” semi-ring, where we use:2

- $\min$ to merge alternative derivations (the semi-ring addition, with identity $\infty$); and
- $+$ to combine the atoms within a rule body (the semi-ring multiplication, with identity $0$).

Just by changing our semi-ring from boolean to min-plus, our path reachability program becomes shortest path! This is easy to see if you write it all out. So this Datalog program:

path(X, Z) :- edge(X, Z).
path(X, Z) :- edge(X, Y), path(Y, Z).

in the boolean semi-ring corresponds to the following formula:

$$ \textsf{path}(x, z) = \textsf{edge}(x, z) \vee \exists y. \left( \textsf{edge}(x, y) \wedge \textsf{path}(y, z) \right) $$

and in the min-plus semi-ring:

$$ \textsf{path}(x, z) = \min\left( \textsf{edge}(x, z), \min_y \left( \textsf{edge}(x, y) + \textsf{path}(y, z) \right) \right) $$
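
For a concrete (made-up) example, take $\textsf{edge}(1, 2) = 3$, $\textsf{edge}(2, 3) = 4$, and $\textsf{edge}(1, 3) = 10$, with absent edges implicitly $\infty$. Then:

$$ \textsf{path}(1, 3) = \min\left( \textsf{edge}(1, 3), \textsf{edge}(1, 2) + \textsf{path}(2, 3) \right) = \min(10, 3 + 4) = 7 $$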

This idea greatly enriches the kinds of programs you can write in Datalog. The Provenance Semirings paper is a seminal reference on this topic, and it also demonstrated another cool use case: provenance. This infrastructure can be used to compute how a tuple was derived by a Datalog program. For example, in our shortest path program, instead of computing only the length of the shortest path, we could compute the path(s) themselves! What do we do if there are multiple paths? In its most general form, the provenance records all possible ways to derive the relevant tuple. This is expensive, and implementing it in practice is the subject of quite a few papers. One key idea is that adding different kinds of algebraic restrictions can yield more efficient algorithms; without such restrictions, provenance tracking can interfere with the semi-naive algorithm. A recent work presented Datalogo, a framework for working with semi-rings while preserving the ability to evaluate semi-naively.
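
To make this concrete (the annotations here are illustrative), tag each base fact with a variable, say $a$ for $\textsf{edge}(1, 2)$, $b$ for $\textsf{edge}(2, 3)$, and $c$ for $\textsf{edge}(1, 3)$. In the provenance semi-ring, a derived tuple’s value is a polynomial over these variables:

$$ \textsf{path}(1, 3) = c + a \cdot b $$

Addition records alternative derivations (directly via $c$, or through vertex 2), and multiplication records the joint use of facts within a single derivation.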

Existentials, ADTs, EGDs, and TGDs

Datalog is very restrictive about what kinds of things you can put in the head of a rule. Negation, for example, is rarely allowed in the head. We just saw how even plain old Datalog features (implicit) existential quantification in the body of a rule. But what if we want to put an existential in the head?

First of all, what does that even mean? Well, it makes sense from a logical perspective. Datalog rules model an implication, and we can easily think of formulas like:

$$ \forall x.\ \left( \textsf{person}(x) \implies \exists y.\ \textsf{parent}(x, y) \right) $$

This makes sense as a formula, but cannot be written in vanilla Datalog. The variable $y$ isn’t bound by the body of the rule, so it’s not clear what it means to have it in the head.

An algorithm called the chase from database theory is a sort of generalization of Datalog. It allows two kinds of “rules”, which it calls dependencies:

- tuple-generating dependencies (TGDs), which look like Datalog rules except that the head may existentially quantify fresh variables (like the parent formula above); and
- equality-generating dependencies (EGDs), which conclude that two values must be equal (functional dependencies are the classic example).

The chase is a complex topic with many flavors and variations that we won’t get into here. Many of the variations revolve around how to deal with the existential quantification in the head of a rule. In general, the chase does not terminate, since the existential quantification in the head allows new values to be introduced at each step.
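
For instance, pairing the parent dependency above with a plain rule stating that every parent is a person sends the chase into an infinite regress:

$$ \forall x, y.\ \left( \textsf{parent}(x, y) \implies \textsf{person}(y) \right) $$

Starting from a single person fact, the chase invents a fresh parent, who is then a person in need of a parent of their own, and so on forever.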

A restricted form of TGDs will look familiar to PL folks, however: they are essentially algebraic data types (ADTs). ADTs are obviously powerful for modeling all kinds of tree-like data, and there are several Datalog implementations (including Souffle) that support them. Their relation to TGDs is that both allow for the creation of new values in the head of a rule.
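
As a rough sketch of the flavor (Souffle-style syntax, quoted from memory; consult the documentation for exact details), a rule can construct brand-new values in its head using ADT constructors:

// an algebraic data type of arithmetic expressions
.type Expr = Num {value: number}
           | Add {left: Expr, right: Expr}

.decl expr(e: Expr)
expr($Num(1)).
expr($Num(2)).

// the head builds a value that need not already exist, much like a TGD
expr($Add(E1, E2)) :- expr(E1), expr(E2).

Note that, just like the chase, the last rule never saturates on its own: it keeps building larger and larger terms, so real programs must bound this kind of construction.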

  1. See these notes from Paris Koutris for a more detailed treatment.

  2. This is a bit confusing because numeric addition plays the role of the semi-ring multiplication (for joint use), which is typically notated with $\times$.