Analytics engine sampling and some more

I’m trying to better understand how sampling in Workers Analytics Engine works.
I’ve read the docs (Workers Analytics Engine SQL API · Cloudflare Analytics docs), and I understand that at high rates not all events will be stored for a given index key.
I have a few questions:

  1. What is considered “high rate”?
  2. What happens to all the blobs and doubles when data is sampled? How can they be used?
  3. How reliable is the _sample_interval? Let’s say I want to use analytics for metered billing. I charge my customers $x for every 1000 events with specific blobs which I use as “event dimensions”. Will that provide accurate billing?
  4. The docs say “Sampling is based on the index of your dataset so that only indexes that receive large numbers of events will be sampled”. Is this the only thing that the index field is used for? Can I run queries based on blobs and aggregate data across indices?
  5. I use one of the blob fields to store a reference id, which is basically a unique identifier that can be used later to troubleshoot and connect that event to some data in the database. I might rarely query by it, but mostly I just present it as a discrete value so events can be viewed as an “activity log” rather than aggregated data. Is this a good practice?

Maybe someone from Cloudflare could please answer these? Thanks!

Bump… Pretty-please?

The team wrote up a doc a while ago that should hopefully help explain sampling:

Thanks @Walshy, super interesting article, and clarifies a lot.
I think you should turn it into a blog post and link to it from the Analytics Engine docs.
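
For anyone landing here with the same questions 2 and 3, here is a minimal sketch of how I understand sampled rows are meant to be read. It assumes each stored row stands in for `_sample_interval` original events, so counts and sums are estimated by weighting each row by that value (in SQL terms, something like `SUM(_sample_interval)` rather than `COUNT()`). The `Row` class, the field values, and the helper names below are made up for illustration and are not part of any Cloudflare API.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for one stored (possibly sampled) event row."""
    blob1: str            # illustrative "event dimension"
    double1: float        # illustrative numeric value
    sample_interval: int  # assumed: how many original events this row represents


def estimated_count(rows):
    # Each stored row is weighted by the number of events it stands for.
    return sum(r.sample_interval for r in rows)


def estimated_sum(rows):
    # Numeric values are weighted the same way before summing.
    return sum(r.sample_interval * r.double1 for r in rows)


# Made-up data: one unsampled row and two rows stored at a 1-in-10 rate.
rows = [
    Row("signup", 2.0, 1),
    Row("signup", 3.0, 10),
    Row("login", 1.0, 10),
]

# Filtering by a blob still works on sampled rows; the weights carry over.
signups = [r for r in rows if r.blob1 == "signup"]

print(estimated_count(signups))
print(estimated_sum(signups))
```

If that reading is right, the results are estimates rather than exact tallies once sampling kicks in, which is worth keeping in mind for metered billing.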