Declarative Program Analysis and Optimization

CS294-260, Spring 2024

Doop: Strictly Declarative Specification of Sophisticated Points-to Analyses

Wednesday, February 7, 2024
Discussion lead: Manish, Shaokai

Reading guide due Monday, February 5 at 2:30pm

Paper link

This paper presents Doop, a declarative framework for static analysis of Java programs centered around the points-to analysis. The declarative nature of Doop stems from its use of Datalog (more specifically, LogiQL, a Datalog dialect developed by LogicBlox) to specify an analysis. Once the Datalog specifications have been generated, they go through a novel optimization process developed by the authors, which produces a semantically equivalent Datalog program with full order-of-magnitude improvements in runtime. Doop is faster than the state-of-the-art points-to analyzers and achieves higher precision than the state-of-the-art tools.

What is points-to analysis? It asks, “What objects can a program variable point to?”. If you are familiar with this, skip Section 2, although it provides some good insights on why Datalog is a great fit for this. Particularly nice is how they explain program analysis as an “amalgamation of mutually recursive tasks.” For instance, consider this table for a small subset of inter-dependent program analyses:

Program Analysis	Program Analyses it depends on
Points-to	{Points-to, Call-graph, Reachability}
Call-graph	{Points-to}
Reachability	{Points-to, Call-graph}

Doing this at the operational level transforms these recursive definitions into complex imperative code. Instead, why not do it in a language that admits declarative specifications of this kind and also automatic optimization, i.e., Datalog.

How does Doop do this differently? The key elements of the Doop are:

Use declarative specifications (Datalog), not operational-level abstractions, for analyses.
Optimize the Datalog program for performance.

(1) Program analysis is generally expressed as a set of (recursively defined) program facts over which we compute a transitive closure. E.g., pointsTo(b, obj) <- Assign(a, b), pointsTo(a, obj). In a relational db sense, this transitive closure = a series of joins and projections over relations.

So, the high-level benefits of storing program facts in a DB and encoding static analysis in Datalog are:

the encoding works for a variety of problems and yields a uniform solution
the Datalog encoding separates the specification of analyses from how to compute the analyses
techniques for efficient eval of Datalog are techniques for efficient eval of your analysis

One can skip Sections 3.3 and 3.4 unless one is interested in the nitty-gritty of language and analysis features supported: a lot, basically! But the cool takeaway is that they support so much because:

extensions are well localized and do not affect basic rules
a rule’s logic is often very close to Java Language Specs and, hence, easy to write!

(2) The authors get into performance starting Section 4. Use two rules of thumb to crunch this:

When a relation is known to be small ⇒ optimizer will choose to join by exhaustively iterating.
Iteration will bind the join variables ⇒ keep the join variables innermost for efficient indexing.

That said, the two main optimizations are described in Section 4.2:

Variable reordering: Reorder relation vars such that joins and indexing are efficient.
Folding: Introduce temporary relations that hold the result of intermediate joins. The benefits are more variable re-ordering for future joins, caching, and query optimizer ordering joins.

Note: Using datalog for program analysis is not new. It is also not the optimization this work performs (variable reordering, folding, etc.) in datalog that is new. The key insight of this work is: “When to apply these optimizations?”! Section 4.2 is sprinkled with how the “semi-naive” algo and incremental evaluation (💡: deltas are small ⇒ should bind vars that will index into other relations) guide the answer to this question.

Results. The results are quite surprising: compared to the prior best-comparable system, Doop achieves speedups of an order of magnitude (10x or more). Interestingly, this performance improvement is not due to any significant algorithmic innovation; it primarily results from the optimization opportunities afforded by using a higher-level programming language (Datalog).

Reading Questions

How does Doop handle reflection “better”? The details were not that apparent.
How does Doop handle exceptions “better”? This is partly by building a call graph on-demand instead of precomputing, similar to how Reps used Datalog for demand-driven program slicing. But, beyond the phrase “on the fly,” the details were unclear. Thoughts?
The authors state that “A clear limitation, for instance, is that the context-depth used in the analysis has to be bounded.” Does this mean that Doop is incomplete, even though the authors say that Doop “offers full support for Java language semantics”?

Discussion Questions

How does one debug these complex inter-procedural analysis implementations? There’s rarely a small input program that helps, particularly for something like Java, which is so large.
Are there any downsides to using a declarative language like Datalog that are not mentioned in the paper?

References

[1] Codebase: https://bitbucket.org/yanniss/doop/src/master/

[2] Slides from the author Yannis Smaragdakis: slides

[3] Full detail talk by Yannis @ MSR: talk

[4] Porting Doop to Souffle: A Tale of Inter-Engine Portability for Datalog-Based Analyses