by cgrand (X 🦣)

Today, a follow-up to Writing the Worst Datalog Ever in 26loc, maybe even the start of a series.🍿

Our 26-loc Datalog is naive. Nothing personal, it's a technical term: each iteration in saturate rederives all the facts derived plus hopefully some new ones. The last iteration is guaranteed to be 100% redundant since by definition it's the one which derived nothing new!

Let's engineer a bad case (not that it's difficult given how purposefully unsophisticated our code is):

user=> (q 
         (set (for [i (range 50)] [:edge (inc i) i]))
         '([i] [:edge+ 1 i])
         '[([:edge+ i j] [:edge i j])
           ([:edge+ i j] [:edge+ i k] [:edge k j])])
#{[0]}

It runs in 10 seconds. Replace 50 by 100 and it runs in 4 minutes! 🐌

The solution to all these useless rederivations is well known and called semi-naive evaluation!

Still from the first Iron Man movie. Obadiah Stane addresses Tony Stark saying

Don't forget: when we're not busy writing silly Datalog implementations, we are available to help you on your Clojure projects or working on ClojureDart or on our app Paktol (The positive spending tracker where money goes up!).

Semi-naive evaluation

The idea behind semi-naive evaluation is to not keep rederiving from the same facts at each iteration. So the rule is to only consider facts which can be derived by using at least one fresh fact (derived during the previous iteration).

Changes to `saturate`

The first step is to split facts in two: fresh facts (dfacts—the d stands for diff or delta) and old news (facts).

In the saturate loop, we initialize dfacts with the initial set of facts because at the start of the computation everything is fresh. We keep looping while dfacts' is not empty.

We will modify match-rule to only return facts derived by using at least one fresh fact. However we'll still have to post-process its returned values with (remove facts') just in case it accidentally rederives old facts.

(defn saturate [facts rules]
  (loop [dfacts facts, facts #{}]
    (let [facts' (into facts dfacts)
          dfacts' (into #{} (comp (mapcat #(match-rule dfacts facts %)) (remove facts')) rules)]
      (cond->> facts' (seq dfacts') (recur dfacts')))))

Changes to `match-rule`

Not much changes, we just pass an extra dfacts argument through to match-patterns.

(defn match-rule [dfacts facts [head & patterns]]
  (for [env (second (match-patterns patterns dfacts facts))]
    (into [] (map #(env % %)) head)))

Oh yes, that's true: we call second on the value returned by match-patterns and we explain why right below👇.

Changes to `match-patterns`

Here is usually where semi-naive gets gory in textbooks where a typical rule:

becomes a monster expression 🐲 such as:

👆And this is a simplified version since we lumped EDB and IDB together in the previous article.

It's tedious to implement and match_patterns would end very different from its naive version.

However we can evolve the existing match_patterns rather than replace it by looking at automatic differentiation with dual numbers for inspiration.

Automatic differentiation with dual numbers feels like magic: you compute the original function with special numbers and you get both the original result but also the value of the derivative and without knowing the expression of the derivative!

For example let's say you want to compute x^2+x at 7, then you compute as usual but by writing 7+ε instead of 7 and simplifying using the magic property that ε^2=0. (A non-null number whose square is zero isn't more bizarre than a number whose square is -1...)

Here we have computed both the value (56) of x^2+x for x=7 but we also computed the value (15) of the derivative of x^2+x (2x+1) without knowing the formula of the derivative!

The monster expression 🐲 above is a kind of derivative and we'd like to compute it without implementing it.

In the same way x is replaced by x + ε in dual numbers we are going to replace envs by [envs denvs] where envs are environments created using strictly matches over old facts and denvs environments where at least one fresh fact was matched against.

visual decomposition of cartesian product

The original (for [fact facts env envs] ...) is thus going to be declined in 4 versions: facts×envs, facts×denvs, dfacts×denvs and dfacts×envs. Only the first one contributes to the envs component; all the others by having a d on envs or facts contribute to the devs component.

Last, we have to be careful to not eagerly compute envs since it's exactly what we want to avoid: rediriving the old facts. We do so by delaying its actual computation with delay.

(defn match-patterns [patterns dfacts facts]
  (reduce
    (fn [[envs denvs] pattern]
      [(-> #{} (into (for [fact facts env @envs] (match pattern fact env))) (disj nil) delay)
       (-> #{}
         (into (for [fact facts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env @envs] (match pattern fact env)))
         (disj nil))])
    [(delay #{{}}) #{}] patterns))

which keeps the same structure as the original naive version:

(defn match-patterns [patterns facts]
  (reduce
    (fn [envs pattern]
      (-> (for [fact facts env envs] (match pattern fact env))
        set (disj nil)))
    #{{}} patterns))

The set has been replaced by into #{} to make envs and denvs computations more similar but otherwise the envs component of the semi-naive version is the original envs computation in the naive version.

Changes to `q`

Not much to say except we pass an empty set as the facts parameter to match-rule.

(defn q [facts query rules]   
  (-> facts (saturate rules) (match-rule #{} query) set))

🥳 Here it is: semi-naive datalog in 30 lines of code!

Is it less bad yet?

Remember the bad case which took 10s for 50 and 4 minutes for 100?

user=> (q 
         (set (for [i (range 50)] [:edge (inc i) i]))
         '([i] [:edge+ 1 i])
         '[([:edge+ i j] [:edge i j])
           ([:edge+ i j] [:edge+ i k] [:edge k j])])
#{[0]}

Now it takes 350ms for 50 and 5s for 100! It's respectively 30 and 50 times faster: progress! 🎉

Conclusion

This datalog is maybe vaguely better but it's still as minimal/useless as in our previous article: we have made the engine more efficient but we don't have made the language more powerful.

See, if we try to compute Bart's siblings (reusing edb and rules from 26-loc Datalog):

user=> (q edb '([c] [:parent :bart p] [:parent c p]) rules)
#{[:lisa] [:maggie] [:bart]}

Bart is returned as his own sibling! This is because we can't constraint a variable to have a different value than another.

This is what we'll fix in the next installment of this Datalog series, we'll add constraints to be able to write:

([:sibling a b] [:parent a p] [:parent b p] (not= a b))

and have:

user=> (q edb '([c] [:sibling :bart c]) rules)
#{[:lisa] [:maggie]}

Share online if you'd like to see more open ended exploratory code (including keeping to grow this datalog)!

Appendix: 30 half-sophisticated locs at once

(defn match [pattern fact env]
   (when (= (count pattern) (count fact))
     (reduce (fn [env [p v]]
               (let [p-or-v (env p p)]
                 (cond
                   (= p '_) env
                   (= p-or-v v) env
                   (symbol? p-or-v) (assoc env p v)
                   :else (reduced nil))))
       env (map vector pattern fact))))

(defn match-patterns [patterns dfacts facts]
  (reduce
    (fn [[envs denvs] pattern]
      [(-> #{} (into (for [fact facts env @envs] (match pattern fact env))) (disj nil) delay)
       (-> #{}
         (into (for [fact facts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env @envs] (match pattern fact env)))
         (disj nil))])
    [(delay #{{}}) #{}] patterns))

(defn match-rule [dfacts facts [head & patterns]]
  (for [env (second (match-patterns patterns dfacts facts))]
    (into [] (map #(env % %)) head)))

(defn saturate [facts rules]
  (loop [dfacts facts, facts #{}]
    (let [facts' (into facts dfacts)
          dfacts' (into #{} (comp (mapcat #(match-rule dfacts facts %)) (remove facts')) rules)]
      (cond->> facts' (seq dfacts') (recur dfacts')))))

(defn q [facts query rules]   
  (-> facts (saturate rules) (match-rule #{} query) set))

Oct. 16, 2024, 1 p.m.

Half Dumb Datalog in 30 loc

Semi-naive evaluation

Changes to `saturate`

Changes to `match-rule`

Changes to `match-patterns`

Changes to `q`

Is it less bad yet?

Conclusion

Appendix: 30 half-sophisticated locs at once

Oct. 16, 2024, 1 p.m.

Half Dumb Datalog in 30 loc

Curiosities by Tensegritics

Semi-naive evaluation

Changes to saturate

Changes to match-rule

Changes to match-patterns

Changes to q

Is it less bad yet?

Conclusion

Appendix: 30 half-sophisticated locs at once

Changes to `saturate`

Changes to `match-rule`

Changes to `match-patterns`

Changes to `q`