
Feature Flag rollout %s are not accurate #8001

Closed · neilkakkar opened this issue Jan 12, 2022 · 5 comments
Labels: bug, feature/feature-flags

Comments

@neilkakkar
Contributor

Bug description

(I don't know if there's a way to solve this well.) When setting a FeatureFlag's rollout % to 50%, I'd expect half the users to see it.

This isn't the case. The split is closer to 45/55, and it gets worse over time.

Here's an example

Here's a multivariate example.

The cause has been explored well in #6043. This isn't a problem with just multivariate testing.

Now, this is mostly fine for rolling out new things, since the numbers don't have to be precise. But for running manual experiments this is terrible (it's also why our Experimentation product is built off multivariates only), and it gives false confidence to users who want to try running experiments, even internally. For example: our experiment for product-cues.
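To make the failure mode concrete, a toy calculation (illustrative numbers only, sketched in Python): even if the hash splits assigned users exactly 50/50, any slice of users who never end up with a flag value drags the visible percentages apart:

```python
# Illustrative only: how users who never receive a flag value skew the visible split.
total_users = 10_000
never_assigned = 1_000          # e.g. events captured before any flag value was known
assigned = total_users - never_assigned

flag_on = assigned * 0.50       # exact 50% split among assigned users
flag_off_or_none = assigned * 0.50 + never_assigned

print(f"visible split: {flag_on / total_users:.0%} on vs {flag_off_or_none / total_users:.0%} off/None")
# -> 45% vs 55%, even though the rollout itself was a perfect 50%
```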

Environment

  • PostHog Cloud
  • self-hosted PostHog (ClickHouse-based), version/commit: please provide
  • self-hosted PostHog (Postgres-based, legacy), version/commit: please provide

Additional context

I don't know if we can solve this at all, but we probably need some way to tell users not to run experiments with "normal" FFs.

Thank you for your bug report – we love squashing them!

@neilkakkar neilkakkar added bug Something isn't working right feature/feature-flags Feature Tag: Feature flags labels Jan 12, 2022
@paolodamico
Contributor

Related to #1610. Previously, what I would do is normalize experiments by taking the ratio of the metric I care about among active users with/without the feature flag, but that's not ideal.
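Roughly something like this (hypothetical numbers and metric, just to sketch the normalization):

```python
# Hypothetical numbers: compare per-group rates among active users instead of trusting the raw split.
active_with_flag, converted_with_flag = 4_500, 900
active_without_flag, converted_without_flag = 5_500, 990

print(f"with flag:    {converted_with_flag / active_with_flag:.1%}")        # 20.0%
print(f"without flag: {converted_without_flag / active_without_flag:.1%}")  # 18.0%
```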

@timgl
Collaborator

timgl commented Jan 12, 2022

This might just be the result of the hashing algorithm not giving a perfect distribution over smaller numbers. See code.

When running over <10k identifiers it starts off with quite a big gap that then closes, which I suspect is what's happening here as well.
[chart: the gap between groups narrowing as the number of identifiers grows]
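For intuition, a minimal sketch of that bucketing (modelled on the SHA1-based hashing in the flag matcher; the constant and helper names here are approximations, not the exact production code), plus a quick check of how the split looks at different identifier counts:

```python
# Sketch of SHA1-based flag bucketing, plus a small rollout simulation.
import hashlib
import uuid

LONG_SCALE = float(0xFFFFFFFFFFFFFFF)  # max value of a 15-hex-digit hash

def flag_hash(flag_key: str, distinct_id: str, salt: str = "") -> float:
    """Deterministically map (flag, user) to a float in [0, 1)."""
    hash_key = f"{flag_key}.{distinct_id}{salt}"
    hash_val = int(hashlib.sha1(hash_key.encode("utf-8")).hexdigest()[:15], 16)
    return hash_val / LONG_SCALE

def is_enabled(flag_key: str, distinct_id: str, rollout_percentage: float) -> bool:
    return flag_hash(flag_key, distinct_id) <= rollout_percentage / 100

# How uneven does a 50% rollout look over cohorts of different sizes?
for n in (100, 1_000, 10_000, 100_000):
    enabled = sum(is_enabled("my-flag", str(uuid.uuid4()), 50) for _ in range(n))
    print(f"n={n:>7,}: {enabled / n:.1%} enabled")
```

Since the hash is deterministic per (flag, distinct_id), a given user always lands in the same bucket; small cohorts will typically wobble by a few percentage points, while larger ones settle close to 50%.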

@neilkakkar
Contributor Author

I don't think this is the case. (As one data point, there are ~10k events in the examples above.) Take the multivariate example above: all 3 variants are very close to each other (same hashing algorithm), and then there's this pile of leftover Nones, which get created for whatever reason: the /decide request coming in too late / ???? / ????.
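As a rough sanity check (assuming ~10k independent users and an exact 50% rollout), pure hashing/sampling noise alone shouldn't get anywhere near a 45/55 split:

```python
# Rough sampling-noise bound for a 50% flag over ~10k users (illustrative).
n, p = 10_000, 0.5
std = (p * (1 - p) / n) ** 0.5  # standard deviation of the observed share
print(f"expected share: {p:.1%} +/- {3 * std:.1%} (3 sigma)")  # ~50.0% +/- 1.5%
```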

@neilkakkar
Contributor Author

neilkakkar commented Jan 13, 2022

To test the above hypothesis: here's the multivariate breakdown on an event that happens pretty late (so hopefully not influenced by the /decide response).

As expected, the number of Nones goes down, but a few still exist, which means there's at least one more problem with serving FFs that I'm not aware of.

And, kind of interesting: client_request_failure has a number of Nones comparable to the rest.
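A minimal sketch of the suspected race (illustrative only, not the actual client code; the property name is made up for the example): events captured before the /decide response arrives carry no flag value, so they show up as None in breakdowns, while later events carry the variant:

```python
# Illustrative sketch of the /decide race: early events have no flag value attached.
events = []
flag_value = None                # not known until the /decide response arrives

def capture(event_name):
    events.append({"event": event_name, "$feature/my-flag": flag_value})

capture("$pageview")             # fired before /decide responded -> breaks down as None
flag_value = "test-variant"      # /decide response arrives (sometimes it never does)
capture("purchased")             # later event -> carries the variant

for e in events:
    print(e)
```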

@neilkakkar
Contributor Author

Wherever precision is required, we ought to use multivariate flags, not simple flags.

Chalking up the Nones remaining in multivariates to /decide being late, i.e. the response arriving after people have already closed their browsers.
