Today, a follow-up to Writing the Worst Datalog Ever in 26loc, maybe even the start of a series.🍿
Our 26-loc Datalog is naive. Nothing personal, it's a technical term: each iteration in saturate
rederives all the facts derived plus hopefully some new ones. The last iteration is guaranteed to be 100% redundant since by definition it's the one which derived nothing new!
Let's engineer a bad case (not that it's difficult given how purposefully unsophisticated our code is):
user=> (q
(set (for [i (range 50)] [:edge (inc i) i]))
'([i] [:edge+ 1 i])
'[([:edge+ i j] [:edge i j])
([:edge+ i j] [:edge+ i k] [:edge k j])])
#{[0]}
It runs in 10 seconds. Replace 50 by 100 and it runs in 4 minutes! 🐌
The solution to all these useless rederivations is well known and called semi-naive evaluation!
Don't forget: when we're not busy writing silly Datalog implementations, we are available to help you on your Clojure projects or working on ClojureDart or on our app Paktol (The positive spending tracker where money goes up!).
The idea behind semi-naive evaluation is to not keep rederiving from the same facts at each iteration. So the rule is to only consider facts which can be derived by using at least one fresh fact (derived during the previous iteration).
saturate
The first step is to split facts in two: fresh facts (dfacts
—the d
stands for diff or delta) and old news (facts
).
In the saturate
loop, we initialize dfacts
with the initial set of facts
because at the start of the computation everything is fresh. We keep looping while dfacts'
is not empty.
We will modify match-rule
to only return facts derived by using at least one fresh fact. However we'll still have to post-process its returned values with (remove facts')
just in case it accidentally rederives old facts.
(defn saturate [facts rules]
(loop [dfacts facts, facts #{}]
(let [facts' (into facts dfacts)
dfacts' (into #{} (comp (mapcat #(match-rule dfacts facts %)) (remove facts')) rules)]
(cond->> facts' (seq dfacts') (recur dfacts')))))
match-rule
Not much changes, we just pass an extra dfacts
argument through to match-patterns
.
(defn match-rule [dfacts facts [head & patterns]]
(for [env (second (match-patterns patterns dfacts facts))]
(into [] (map #(env % %)) head)))
Oh yes, that's true: we call second
on the value returned by match-patterns
and we explain why right below👇.
match-patterns
Here is usually where semi-naive gets gory in textbooks where a typical rule:
becomes a monster expression 🐲 such as:
👆And this is a simplified version since we lumped EDB and IDB together in the previous article.
It's tedious to implement and match_patterns
would end very different from its naive version.
However we can evolve the existing match_patterns
rather than replace it by looking at automatic differentiation with dual numbers for inspiration.
Automatic differentiation with dual numbers feels like magic: you compute the original function with special numbers and you get both the original result but also the value of the derivative and without knowing the expression of the derivative!
For example let's say you want to compute x^2+x
at 7, then you compute as usual but by writing 7+ε
instead of 7 and simplifying using the magic property that ε^2=0
. (A non-null number whose square is zero isn't more bizarre than a number whose square is -1...)
Here we have computed both the value (56) of x^2+x
for x=7
but we also computed the value (15) of the derivative of x^2+x
(2x+1
) without knowing the formula of the derivative!
The monster expression 🐲 above is a kind of derivative and we'd like to compute it without implementing it.
In the same way x
is replaced by x + ε
in dual numbers we are going to replace envs
by [envs denvs]
where envs
are environments created using strictly matches over old facts and denvs
environments where at least one fresh fact was matched against.
The original (for [fact facts env envs] ...)
is thus going to be declined in 4 versions: facts×envs
, facts×denvs
, dfacts×denvs
and dfacts×envs
. Only the first one contributes to the envs
component; all the others by having a d
on envs
or facts
contribute to the devs
component.
Last, we have to be careful to not eagerly compute envs
since it's exactly what we want to avoid: rediriving the old facts. We do so by delaying its actual computation with delay
.
(defn match-patterns [patterns dfacts facts]
(reduce
(fn [[envs denvs] pattern]
[(-> #{} (into (for [fact facts env @envs] (match pattern fact env))) (disj nil) delay)
(-> #{}
(into (for [fact facts env denvs] (match pattern fact env)))
(into (for [fact dfacts env denvs] (match pattern fact env)))
(into (for [fact dfacts env @envs] (match pattern fact env)))
(disj nil))])
[(delay #{{}}) #{}] patterns))
which keeps the same structure as the original naive version:
(defn match-patterns [patterns facts]
(reduce
(fn [envs pattern]
(-> (for [fact facts env envs] (match pattern fact env))
set (disj nil)))
#{{}} patterns))
The set
has been replaced by into #{}
to make envs
and denvs
computations more similar but otherwise the envs
component of the semi-naive version is the original envs
computation in the naive version.
q
Not much to say except we pass an empty set as the facts
parameter to match-rule
.
(defn q [facts query rules]
(-> facts (saturate rules) (match-rule #{} query) set))
🥳 Here it is: semi-naive datalog in 30 lines of code!
Remember the bad case which took 10s for 50 and 4 minutes for 100?
user=> (q
(set (for [i (range 50)] [:edge (inc i) i]))
'([i] [:edge+ 1 i])
'[([:edge+ i j] [:edge i j])
([:edge+ i j] [:edge+ i k] [:edge k j])])
#{[0]}
Now it takes 350ms for 50 and 5s for 100! It's respectively 30 and 50 times faster: progress! 🎉
This datalog is maybe vaguely better but it's still as minimal/useless as in our previous article: we have made the engine more efficient but we don't have made the language more powerful.
See, if we try to compute Bart's siblings (reusing edb
and rules
from 26-loc Datalog):
user=> (q edb '([c] [:parent :bart p] [:parent c p]) rules)
#{[:lisa] [:maggie] [:bart]}
Bart is returned as his own sibling! This is because we can't constraint a variable to have a different value than another.
This is what we'll fix in the next installment of this Datalog series, we'll add constraints to be able to write:
([:sibling a b] [:parent a p] [:parent b p] (not= a b))
and have:
user=> (q edb '([c] [:sibling :bart c]) rules)
#{[:lisa] [:maggie]}
Share online if you'd like to see more open ended exploratory code (including keeping to grow this datalog)!
(defn match [pattern fact env]
(when (= (count pattern) (count fact))
(reduce (fn [env [p v]]
(let [p-or-v (env p p)]
(cond
(= p '_) env
(= p-or-v v) env
(symbol? p-or-v) (assoc env p v)
:else (reduced nil))))
env (map vector pattern fact))))
(defn match-patterns [patterns dfacts facts]
(reduce
(fn [[envs denvs] pattern]
[(-> #{} (into (for [fact facts env @envs] (match pattern fact env))) (disj nil) delay)
(-> #{}
(into (for [fact facts env denvs] (match pattern fact env)))
(into (for [fact dfacts env denvs] (match pattern fact env)))
(into (for [fact dfacts env @envs] (match pattern fact env)))
(disj nil))])
[(delay #{{}}) #{}] patterns))
(defn match-rule [dfacts facts [head & patterns]]
(for [env (second (match-patterns patterns dfacts facts))]
(into [] (map #(env % %)) head)))
(defn saturate [facts rules]
(loop [dfacts facts, facts #{}]
(let [facts' (into facts dfacts)
dfacts' (into #{} (comp (mapcat #(match-rule dfacts facts %)) (remove facts')) rules)]
(cond->> facts' (seq dfacts') (recur dfacts')))))
(defn q [facts query rules]
(-> facts (saturate rules) (match-rule #{} query) set))