Nov. 19, 2024, 1:10 a.m.

Loose Ends

Curiosities by Tensegritics

by cgrand (🦋 🦣 𝕏)

Today, just some quick CLJD news and some things I found interesting over the last week:

  • The CVM algorithm,
  • Least Squares Circle Fit,
  • Fractional Brownian Motion applied to Signed Distance Fields.

To echo the title: at last, a video that made me understand why the direction in which you tie shoelaces matters:

Some ClojureDart news

What's New: Broader Testing Support

We've expanded testing support in ClojureDart beyond unit testing, now integrating seamlessly with Dart's widget and integration testing tools. Read more here or ask on Clojurians' Slack #Clojuredart channel.

Open-Source Support Plans for Companies

For companies seeking accountant-friendly options beyond GitHub Sponsors, we've introduced Open-Source Support Plans with Stripe payments and proper invoicing.

The CVM algorithm

(via Null Bitmap)

The CVM algorithm elegantly computes an unbiased approximation of the number of distinct values in a stream with an upper bound on memory usage.

Here is an annotated Clojure version I wrote to understand the algorithm:

(defn approximate-count-distinct
  "Returns an approximation of the number of distinct values in xs.
  If this number is less than than threshold, the returned count is exact.
  Threshold also acts as a bound on memory consumption.
  Implement the CVM algorithm https://en.wikipedia.org/wiki/Count-distinct_problem#CVM_Algorithm"
  [threshold xs]
  (loop [multiplier 1 values #{} xs xs]
    (if (<= (count values) threshold)
      (if-some [[x & xs] (seq xs)]
        (let [f (if (zero? (rand-int multiplier)) conj disj)]
          ; it looks like randomly conj or disj but it can
          ; be seen differently:
          ; an unconditional disj followed by a potential conj  
          (recur multiplier (f values x) xs))
        (* multiplier (count values)))
      ; the sample is too big, we need to probabilistically halve it
      ; each time we halve the sample we double the multiplier
      ; ⚠️ you have to go through the (<= (count values) threshold) check after halving
      ; because if you are very unlucky you may not have shrunk the sample!
      (recur (* 2 multiplier) (into #{} (filter (fn [_] (zero? (rand-int 2)))) values) xs))))

If you compare to the Knuth's version of the algorithm described in the Wikipedia's article, you may wonder why a map of values to random numbers is used while in the above Clojure code a plain set of values is used.

That's because we only conj in the set when the random integer is zero so we would only store 0. Quite useless.

The random values of the maps are only used when shrinking the samples. Here we do a coin flip ((rand-int 2)) which can be seen as lazily computing further digits of the number we didn't store.

Thus it doesn't change the correctness of the implementation.

The algorithm works surprisingly well even for ridiculously small sample sizes: see below, with a sample size of 4, averaging over 10 runs to make up for the small sample.

=> (/ (reduce + (repeatedly 10 #(approximate-count-distinct 4 (range 2e6)))) 10.0)
1939865.6
=> (/ (reduce + (repeatedly 10 #(approximate-count-distinct 4 (concat (range 1e6) (range 1e6))))) 10.0)
1127219.2

Least Squares Circle Fit

A closed formula to fit a circle to a cloud of points.

Fractional Brownian Motion applied to Signed Distance Fields

image.png Inigo Quiliez constantly blows my mind, this time generating a fractal landscape as a single SDF, a real SDF, one whose gradient has length 1.0.

Stay Tuned!

This installment was a lighter update, but the Worst Datalog Ever series will be back soon with the next chapter—constraints!

You just read issue #11 of Curiosities by Tensegritics. You can also browse the full archives of this newsletter.

Share on Twitter Share on LinkedIn Share on Hacker News Share on Reddit Share on Mastodon
This email brought to you by Buttondown, the easiest way to start and grow your newsletter.