Today, just some quick CLJD news and some things I found interesting over the last week:
To echo the title: at last, a video that made me understand why the direction in which you tie shoelaces matters:
We've expanded testing support in ClojureDart beyond unit testing, now integrating seamlessly with Dart's widget and integration testing tools. Read more here or ask on Clojurians' Slack #Clojuredart channel.
For companies seeking accountant-friendly options beyond GitHub Sponsors, we've introduced Open-Source Support Plans with Stripe payments and proper invoicing.
(via Null Bitmap)
The CVM algorithm elegantly computes an unbiased approximation of the number of distinct values in a stream with an upper bound on memory usage.
Here is an annotated Clojure version I wrote to understand the algorithm:
(defn approximate-count-distinct
"Returns an approximation of the number of distinct values in xs.
If this number is less than than threshold, the returned count is exact.
Threshold also acts as a bound on memory consumption.
Implement the CVM algorithm https://en.wikipedia.org/wiki/Count-distinct_problem#CVM_Algorithm"
[threshold xs]
(loop [multiplier 1 values #{} xs xs]
(if (<= (count values) threshold)
(if-some [[x & xs] (seq xs)]
(let [f (if (zero? (rand-int multiplier)) conj disj)]
; it looks like randomly conj or disj but it can
; be seen differently:
; an unconditional disj followed by a potential conj
(recur multiplier (f values x) xs))
(* multiplier (count values)))
; the sample is too big, we need to probabilistically halve it
; each time we halve the sample we double the multiplier
; ⚠️ you have to go through the (<= (count values) threshold) check after halving
; because if you are very unlucky you may not have shrunk the sample!
(recur (* 2 multiplier) (into #{} (filter (fn [_] (zero? (rand-int 2)))) values) xs))))
If you compare to the Knuth's version of the algorithm described in the Wikipedia's article, you may wonder why a map of values to random numbers is used while in the above Clojure code a plain set of values is used.
That's because we only conj in the set when the random integer is zero so we would only store 0. Quite useless.
The random values of the maps are only used when shrinking the samples. Here we do a coin flip ((rand-int 2)
) which can be seen as lazily computing further digits of the number we didn't store.
Thus it doesn't change the correctness of the implementation.
The algorithm works surprisingly well even for ridiculously small sample sizes: see below, with a sample size of 4, averaging over 10 runs to make up for the small sample.
=> (/ (reduce + (repeatedly 10 #(approximate-count-distinct 4 (range 2e6)))) 10.0)
1939865.6
=> (/ (reduce + (repeatedly 10 #(approximate-count-distinct 4 (concat (range 1e6) (range 1e6))))) 10.0)
1127219.2
A closed formula to fit a circle to a cloud of points.
Inigo Quiliez constantly blows my mind, this time generating a fractal landscape as a single SDF, a real SDF, one whose gradient has length 1.0.
This installment was a lighter update, but the Worst Datalog Ever series will be back soon with the next chapter—constraints!