Halfspace

Subscribe
Archives
April 1, 2021

Tracking the progress of a distribution

Tracking the progress of a distribution

I've been super busy lately, with work, a baby, and house hunting. Fun projects will have to take a back seat for now.

But! I recently found an opportunity to exercise some basic linear algebra when working on a metrics problem. It's simple enough to write up in the moments between house showings and baby duties.

To anonymize the domain, let's say we're working with sheep colored white, green, or purple. You start with a flock of n sheep, each of which may be any of the three colors. You're given a target distribution (p_1, p_2, p_3) over colors (white, green, purple), and you want to make your flock have that distribution of colors. You can paint your existing sheep, and/or buy new pre-painted sheep in a desired color, or sell sheep.

However, the operations you make to improve the flock's distribution toward the target aren't your only task. You're a busy shepherd, after all, and you can't spend all your time painting sheep. Still, you want to track how much progress your making each day, and show a metric that describes how well you're directing the (possibly small) effort you're spending on improving the color distribution.

The problem is how to measure "progress" faithfully in this situation. On one hand, you could use a statistical measure like total variation, or KL-divergence, or Euclidean distance between the current distribution and the target distribution. But these measures are agnostic to the size of the underlying flock.

For example, if you used total variation distance, and started with a flock of (10, 15, 24) which has distribution (0.204, 0.306, 0.490), and your target distribution was (0.3, 0.6, 0.1), then the total variation distance is 0.39. If you add a white sheep and a green sheep to the flock, you get a distribution of (0.216, 0.314, 0.470), which has total variation distance 0.37 from the target, an improvement of 0.02. We spent effort perfectly: all the operations made had a maximal impact on improving the distribution.

Now scale the flock up by 100, and start with (1000, 1500, 2400). Add one white and one green sheep, and you only improve the total variation by 0.00019, even though your effort was still spent perfectly. The damn statistic makes us look bad!

All the statistical distance measures I'm aware of have the same problem. The reason is because these comparisons are of distributions, and so they are not allowed to know about the underlying populations that the distributions represent.

What we need instead is some way to measure progress toward the target distribution relative to the amount of progress that could have been made if we theoretically directed our effort as much as possible toward improving the ratio.

When I thought about this, at first I was stuck in "statistics" land, and came up with a bunch of crappy measurements that were various ratios of ratios. Some examples showed that they weren't easy to interpret.

Then I realized (embarrassingly late) that the population itself is a vector, and so I can use basic linear algebra to measure distances of population vectors. From there a sensible approach felt obvious.

Let R, S be two length-n tuples of positive integers representing the population vectors before and after some change is made. Let M_R and M_S be the sum of the entries in R and S, respectively, representing the total flock size before and after. Let T = (t_1, ..., t_n) be the target distribution.

Then define the relative improvement of S from R toward T as follows, where norm() denotes the usual Euclidean (L2) norm and proj_T denotes the vector projection onto T:

improvement = 
  (norm(proj_T(R) - R) - norm(proj_T(S) - S))
  / norm(S - R)

In lieu of a picture, imagine the target distribution vector spanning a line in n-dimensional Euclidean space. This line represents all population states—for all possible flock sizes (even non-integer)—that match the target distribution perfectly. Then proj_T(R) describes the closest ideal population state vector for a given state vector R. Note the projection proj_T(R) will always have norm at most norm(R), but may have a larger flock size (L1 norm). Finally, norm(proj_T(R) - R) is how far a given state is from ideal (error), and the improvement formula above subtracts the error of the "after" state from the error of the "before" state, and divides by the distance between the before and after states.

Geometry makes analyzing this formula easy. It must always be between -1 and 1, because at best you can place S on the line between R and proj_T(R), in which case the numerator is equal to norm(R-S) and the improvement is 1. Conversely, the worst you can do is move straight away from the target, and the improvement is -1.

This hints that the improvement formula actually computes the cosine of the angle between S-R and R - proj_T(R). It's easy to prove. View S-R as the hypotenuse of a right triangle. One base lies along R - proj_T(R) and has length norm(proj_T(R) - R) - norm(proj_T(S) - S). The improvement ratio computes the ratio of base length to hypotenuse length, i.e., the cosine of the desired angle.

My intended application of this metric is boring. It has to do with computers in data centers. But it could be useful in more situations. Imagine a recommendation system with a target distribution over recommendation categories like "revenue gaining" and "exploratory." Progress is relative to the number of recommendation opportunities. Or imagine a hiring system in which you want to ensure you're receiving resumes from enough applicants from historically disadvantaged groups. Progress is relative to the amount of money you spend on recruiting efforts.

I haven't used this metric in practice yet, so I can't say whether it works well even for my intended application. There's the possibility that for state vectors starting close to the ideal line, progress will look wonky because you'll have no choice but to overshoot. You can alleviate that by adding error bars around the ideal, inside of which you don't track progress at all.

But mostly it feels right compared to the less coherent formulations that preceded it. The geometry also provides a nice framework with which to reason about its behavior. This means I spend less effort justifying the metric to skeptical stakeholders.

Don't miss what's next. Subscribe to Halfspace:
This email brought to you by Buttondown, the easiest way to start and grow your newsletter.