How to Count What You Can't See

tristanisham

Yesterday, I stumbled upon an article on Hacker News, Computer Scientists Invent an Efficient New Way to Count, by Quanta Magazine. The article details the discovery of a new algorithm called CVM for the distinct elements problem.

The distinct elements problem, also known as the f0 problem, is essentially: "how can I count the number of unique elements in a stream without being able to hold the whole stream in memory?" Say you're Facebook and you want to count the number of unique logins on a specific day. That's billions of users logging in on multiple devices. No regular computer can hold all that info in memory, so CVM doesn't even try. It uses probability theory to estimate the number of unique elements in a stream without ever holding more than a fraction of them in memory at any given time.
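
For contrast, here's the exact approach CVM is designed to avoid: store every element you've ever seen. The function name and the login example are mine, just to make the memory cost concrete.

```go
package main

import "fmt"

// countDistinctExact counts unique elements exactly by remembering every
// one it has seen. Memory grows with the number of distinct elements,
// which is exactly what CVM avoids.
func countDistinctExact[T comparable](stream []T) int {
	seen := make(map[T]struct{})
	for _, item := range stream {
		seen[item] = struct{}{}
	}
	return len(seen)
}

func main() {
	logins := []string{"alice", "bob", "alice", "carol", "bob", "alice"}
	fmt.Println(countDistinctExact(logins)) // 3
}
```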

[Figure: Algorithm 2 (CVM) from the paper]

Strip away the mathematical symbolism, and you're left with this:

// Estimate uses the CVM algorithm to calculate the number of unique items in a slice.
// e (epsilon) represents the acceptable relative error.
// d (delta) represents the probability that the algorithm fails to produce an
// estimate within the e-bound.
func Estimate[T comparable](source []T, e, d float64) float64 {

    x := make(map[T]struct{})

    // p is the current sampling probability. It starts at 1 and halves every
    // time the buffer fills up, so it's essential to the final estimate.
    p := 1.0
    m := len(source)

    // Buffer threshold from the paper: ceil((12/e^2) * log2(8m/d)).
    thresh := math.Ceil(12.0 / (e * e) * math.Log2(8.0*float64(m)/d))

    for _, item := range source {

        // Remove the item if present, then re-insert it with probability p,
        // so the latest occurrence of each item decides its membership.
        // rand.Float64 returns a value in [0, 1), so when p == 1 the item
        // is always kept.
        delete(x, item)
        if rand.Float64() < p {
            x[item] = struct{}{}
        }

        // Buffer full: toss out roughly half the entries and halve p.
        if float64(len(x)) >= thresh {
            randClean(x)
            p = p / 2
        }

    }

    return float64(len(x)) / p
}

The coin function represents a 50/50 probability, and randClean just iterates through the map, tossing a coin for each entry and deleting it on heads. The rand package is math/rand/v2.
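
One way those helpers might look (the names match the listing above; mapping IntN(2) == 0 to heads is my own choice):

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// coin simulates a fair coin flip: true ("heads") with probability 1/2.
func coin() bool {
	return rand.IntN(2) == 0
}

// randClean tosses a coin for every key in the map and deletes the entry
// on heads, so each survivor was kept with probability 1/2. Deleting from
// a map while ranging over it is safe in Go.
func randClean[T comparable](x map[T]struct{}) {
	for item := range x {
		if coin() {
			delete(x, item)
		}
	}
}

func main() {
	x := map[int]struct{}{}
	for i := 0; i < 1000; i++ {
		x[i] = struct{}{}
	}
	randClean(x)
	fmt.Println(len(x)) // roughly 500
}
```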

This code takes any comparable type (anything you can shove in a Go map) and calculates a threshold for the number of items the map may hold, based on the slice's length and the error parameters. It then iterates through the source slice, sampling each item, and whenever the map's size reaches the threshold, it randomly discards about 50% of the values from the data structure and halves the sampling probability.

The secret to this algorithm is knowing our probability. After the first halving round, every unique item in the slice has a 50/50 chance of being in our map (and after k rounds, a 1-in-2^k chance). Using math I don't understand, the algorithm is able to use this probability to estimate how many unique elements are in the slice without ever holding all of them in memory.

As the Quanta article puts it:

The point of the exercise has been to ensure that every word, by virtue of the random selections you’ve made, has the same probability of being there: 1/2^k. If, for instance, you have 61 words on your list at the conclusion of Hamlet, and the process took six rounds, you can divide 61 by the probability, 1/2^6, to estimate the number of distinct words — which comes out to 3,904 in this case. (It’s easy to see how this procedure works: Suppose you start with 100 coins and flip each one individually, keeping only those that come up heads. You’ll end up with close to 50 coins, and if someone divides that number by the probability, ½, they can guess that there were about 100 coins originally.)

In the previous excerpt, k just represents the number of halving rounds the algorithm performed.
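
The Hamlet arithmetic from the excerpt is the same final division the Go code performs (len of the map divided by p):

```go
package main

import "fmt"

func main() {
	// Numbers from the Hamlet example: 61 words survive after k = 6
	// halving rounds, so each distinct word was kept with p = 1/2^6.
	survivors := 61.0
	k := 6
	p := 1.0 / float64(uint(1)<<k) // 1/64
	fmt.Println(survivors / p)     // 3904
}
```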

Speed

I've written some tests, and the algorithm appears to work: its estimates line up with a traditional map-based count. But there is a catch: the speed is unpredictable. It's not much slower than a traditional map, and in situations where a map wouldn't even fit in memory, I think it'll blow other approaches out of the water, but you're going to want to avoid f0 unless you're working on systems with small memory. Routers, floppy disks, and embedded devices running TinyGo are much better targets.

[Benchmark: map vs f0 results]

That test runs over The Bible (around 4 MB). If anyone has a larger system they'd be willing to test my implementation on, or would be willing to help contribute fuzzing, that would be much appreciated.

Conclusion

This project was more about figuring out how to implement algorithms from scientific papers than actually needing f0 in my day-to-day. I was pleasantly surprised at how easy it was to implement. The paper's authors state that they think it'd be a good algorithm to teach undergrads, as it doesn't involve any difficult math. I don't understand much about probability, but the algorithm was still easy for me to implement. You should try implementing it too. Use the paper and my code as a basis, but write it in your favorite language. It's new, so I doubt there are many existing libraries. Based on my cursory research, I think I'm the first person to implement it in Go. You could be the first in your language.

Don't forget to subscribe to my Polar newsletter, and consider supporting me financially. Check out ZVM if you're interested in Zig, or just need a better way to manage your Zig installs.