Tracking the progress of a distribution
I've been super busy lately, with work, a baby, and house hunting. Fun projects will have to take a back seat for now.
But! I recently found an opportunity to exercise some basic linear algebra when working on a metrics problem. It's simple enough to write up in the moments between house showings and baby duties.
To anonymize the domain, let's say we're working with sheep colored white, green, or purple. You start with a flock of sheep, each of which may be any of the three colors. You're given a target distribution (p_1, p_2, p_3) over the colors (white, green, purple), and you want to make your flock match that distribution. To do so, you can paint your existing sheep, buy new pre-painted sheep in a desired color, or sell sheep.
However, the operations you make to improve the flock's distribution toward the target aren't your only task. You're a busy shepherd, after all, and you can't spend all your time painting sheep. Still, you want to track how much progress you're making each day, and report a metric that describes how well you're directing the (possibly small) effort you spend on improving the color distribution.
The problem is how to measure "progress" faithfully in this situation. You could use a statistical measure like total variation, KL-divergence, or Euclidean distance between the current distribution and the target distribution. But these measures are agnostic to the size of the underlying flock.
For example, say you use total variation distance, and you start with a flock of (10, 15, 24), which has distribution (0.204, 0.306, 0.490). If your target distribution is (0.3, 0.6, 0.1), then the total variation distance from the target is 0.39. If you add a white sheep and a green sheep to the flock, you get a distribution of (0.216, 0.314, 0.471), which has total variation distance 0.37 from the target, an improvement of 0.02.
We spent our effort perfectly: every operation had a maximal impact on improving the distribution. Now scale the flock up by 100 and start with (1000, 1500, 2400). Add one white and one green sheep, and you only improve the total variation by about 0.0002, even though your effort was still spent perfectly.
The damn statistic makes us look bad!
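To make the asymmetry concrete, here's a quick sketch in plain Python that reproduces the arithmetic above. The flock counts and target come from the example; the helper names are mine:

```python
def distribution(flock):
    """Normalize a tuple of counts into a probability distribution."""
    total = sum(flock)
    return [count / total for count in flock]

def total_variation(p, q):
    """Total variation distance: half the L1 distance between distributions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

target = (0.3, 0.6, 0.1)

# Small flock: add one white and one green sheep.
before = total_variation(distribution((10, 15, 24)), target)
after = total_variation(distribution((11, 16, 24)), target)
print(before - after)  # ~0.019: a visible improvement

# The same change on a flock 100x the size.
before = total_variation(distribution((1000, 1500, 2400)), target)
after = total_variation(distribution((1001, 1501, 2400)), target)
print(before - after)  # ~0.0002: nearly invisible, same perfect effort
```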
All the statistical distance measures I'm aware of have the same problem. The reason is that these measures compare distributions, and so they're not allowed to know about the underlying populations the distributions represent.
What we need instead is some way to measure progress toward the target distribution relative to the amount of progress that could have been made had we directed all of our effort toward improving the distribution.
When I thought about this, at first I was stuck in "statistics" land, and came up with a bunch of crappy measurements that were various ratios of ratios. Some examples showed that they weren't easy to interpret.
Then I realized (embarrassingly late) that the population itself is a vector, and so I can use basic linear algebra to measure distances of population vectors. From there a sensible approach felt obvious.
Let R, S be two length-n tuples of positive integers representing the population vectors before and after some change is made, and let M_R and M_S be the sums of the entries of R and S, respectively, representing the total flock sizes before and after. Let T = (t_1, ..., t_n) be the target distribution. Then define the relative improvement of S from R toward T as follows, where norm() denotes the usual Euclidean (L2) norm and proj_T denotes vector projection onto T:

improvement = (norm(proj_T(R) - R) - norm(proj_T(S) - S)) / norm(S - R)
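Translated directly into code, the metric is only a few lines. Here's a minimal Python sketch; the helper names (project_onto, improvement) are mine, not from any library:

```python
import math

def project_onto(v, t):
    """Vector projection of v onto the line spanned by t: ((v.t) / (t.t)) t."""
    scale = sum(vi * ti for vi, ti in zip(v, t)) / sum(ti * ti for ti in t)
    return [scale * ti for ti in t]

def norm(v):
    """Euclidean (L2) norm."""
    return math.sqrt(sum(x * x for x in v))

def sub(u, v):
    """Componentwise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def improvement(R, S, T):
    """Relative improvement of S from R toward the target distribution T."""
    error_before = norm(sub(project_onto(R, T), R))
    error_after = norm(sub(project_onto(S, T), S))
    return (error_before - error_after) / norm(sub(S, R))
```

On the sheep example, improvement((10, 15, 24), (11, 16, 24), (0.3, 0.6, 0.1)) comes out to about 0.102, and the same one-white-one-green step on the 100x flock scores about 0.106. The scores are nearly identical regardless of flock size, which is exactly the property total variation lacked.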
In lieu of a picture, imagine the target distribution vector T spanning a line in n-dimensional Euclidean space. This line represents every population state that matches the target distribution perfectly, one for each possible flock size (even non-integer sizes). Then proj_T(R) is the closest perfectly on-target population state to a given state vector R.
Note that the projection proj_T(R) always has norm at most norm(R), but it may represent a larger flock (a larger L1 norm). For example, with T = (0.3, 0.6, 0.1), the all-green flock (0, 10, 0) projects to roughly (3.91, 7.83, 1.30), an ideal flock of about 13 sheep.
Finally, norm(proj_T(R) - R) measures how far a given state is from ideal, i.e., its error. The improvement formula above subtracts the error of the "after" state from the error of the "before" state, then divides by the distance between the two states.
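Running the projection on the example flock makes these quantities concrete, reusing the hypothetical helpers from the sketch above:

```python
R = (10, 15, 24)
T = (0.3, 0.6, 0.1)

ideal = project_onto(R, T)
print(ideal)                # ~[9.39, 18.78, 3.13]: the nearest on-target flock state
print(norm(sub(ideal, R)))  # ~21.22: the current flock's error
```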
Geometry makes analyzing this formula easy. The improvement must always be between -1 and 1, since the distance to the ideal line can change by no more than the distance between R and S (distance to a fixed set is 1-Lipschitz). The extremes are achieved, too: at best you place S on the segment between R and proj_T(R), in which case the numerator equals norm(S - R) and the improvement is 1; at worst you move straight away from the line, and the improvement is -1.
This hints that the improvement formula computes the cosine of the angle between S - R and proj_T(R) - R, the direction pointing straight at the ideal line. That's exactly true whenever S lies in the plane spanned by R and the target line, on the same side of the line as R (the best and worst cases above both qualify). In that plane the proof is easy: view S - R as the hypotenuse of a right triangle, one of whose legs lies along the direction proj_T(R) - R and has length norm(proj_T(R) - R) - norm(proj_T(S) - S). The improvement computes the ratio of that leg's length to the hypotenuse's length, i.e., the cosine of the desired angle. For steps that leave the plane, the improvement is at most this cosine, though the two agree to first order for small steps.
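Here's a quick numerical check of the coplanar case, again with the hypothetical helpers from earlier. The step below mixes a nudge toward the line with a drift along it, so S stays in the plane spanned by R and the target line:

```python
R = (10, 15, 24)
T = (0.3, 0.6, 0.1)

toward = sub(project_onto(R, T), R)  # direction from R straight toward the ideal line

# S = R + 0.1 * toward + 2 * T stays in the plane containing R and the line.
S = [r + 0.1 * d + 2 * t for r, d, t in zip(R, toward, T)]

step = sub(S, R)
cos_angle = sum(a * b for a, b in zip(step, toward)) / (norm(step) * norm(toward))
print(improvement(R, S, T), cos_angle)  # both ~0.843: the two values agree
```

For comparison, the earlier step from (10, 15, 24) to (11, 16, 24) leaves the plane slightly: its improvement is about 0.102 while the corresponding cosine is about 0.106.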
My intended application of this metric is boring. It has to do with computers in data centers. But it could be useful in more situations. Imagine a recommendation system with a target distribution over recommendation categories like "revenue gaining" and "exploratory." Progress is relative to the number of recommendation opportunities. Or imagine a hiring system in which you want to ensure you're receiving resumes from enough applicants from historically disadvantaged groups. Progress is relative to the amount of money you spend on recruiting efforts.
I haven't used this metric in practice yet, so I can't say whether it works well even for my intended application. There's the possibility that for state vectors starting close to the ideal line, progress will look wonky because you'll have no choice but to overshoot. You can alleviate that by adding error bars around the ideal, inside of which you don't track progress at all.
But mostly it feels right compared to the less coherent formulations that preceded it. The geometry also provides a nice framework with which to reason about its behavior. This means I spend less effort justifying the metric to skeptical stakeholders.