
Feature Flag rollout %s are not accurate #8001

Closed · neilkakkar opened this issue Jan 12, 2022 · 5 comments
Labels: bug, feature/feature-flags

Comments

@neilkakkar
Contributor

Bug description

(I don't know if there's a way to solve this well.) When setting a FeatureFlag's rollout % to 50%, I'd expect half the users to see it.

This isn't the case. The split is closer to 45/55, and it gets worse over time.

Here's an example

Here's a multivariate example.

The cause has been explored well in #6043. This isn't a problem with just multivariate testing.

Now, this is mostly fine for rolling out new things, since the numbers don't have to be precise. But for running manual experiments this is terrible (it's also why our Experimentation product is built off multivariates only), and it gives false confidence to users who want to try running experiments, even internally. For example: our experiment for product-cues.
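To make the failure mode concrete, a toy calculation (illustrative numbers only, sketched in Python): even if the hash splits assigned users exactly 50/50, any slice of users who never end up with a flag value drags the visible percentages apart:

```python
# Illustrative only: how users who never receive a flag value skew the visible split.
total_users = 10_000
never_assigned = 1_000          # e.g. events captured before any flag value was known
assigned = total_users - never_assigned

flag_on = assigned * 0.50       # exact 50% split among assigned users
flag_off_or_none = assigned * 0.50 + never_assigned

print(f"visible split: {flag_on / total_users:.0%} on vs {flag_off_or_none / total_users:.0%} off/None")
# -> 45% vs 55%, even though the rollout itself was a perfect 50%
```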

Environment

  • PostHog Cloud
  • self-hosted PostHog (ClickHouse-based), version/commit: please provide
  • self-hosted PostHog (Postgres-based, legacy), version/commit: please provide

Additional context

I don't know if we can solve this at all, but we probably need some way to tell users not to run experiments with "normal" FFs.

Thank you for your bug report – we love squashing them!

@neilkakkar neilkakkar added bug Something isn't working right feature/feature-flags Feature Tag: Feature flags labels Jan 12, 2022
@paolodamico
Contributor

Related to #1610. Previously, what I would do is normalize experiments by taking the ratio of the metric I care about among active users with/without the feature flag, but that's not ideal.
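Roughly something like this (hypothetical numbers and metric, just to sketch the normalization):

```python
# Hypothetical numbers: compare per-group rates among active users instead of trusting the raw split.
active_with_flag, converted_with_flag = 4_500, 900
active_without_flag, converted_without_flag = 5_500, 990

print(f"with flag:    {converted_with_flag / active_with_flag:.1%}")        # 20.0%
print(f"without flag: {converted_without_flag / active_without_flag:.1%}")  # 18.0%
```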

@timgl
Collaborator

timgl commented Jan 12, 2022

This might just be the result of the hashing algorithm not giving a perfect distribution over smaller numbers. See code.

When running over <10k identifiers it starts off with quite a big gap that then closes, which I suspect is what's happening here as well.
[chart: the gap between groups narrowing as the number of identifiers grows]
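For intuition, a minimal sketch of that bucketing (modelled on the SHA1-based hashing in the flag matcher; the constant and helper names here are approximations, not the exact production code), plus a quick check of how the split looks at different identifier counts:

```python
# Sketch of SHA1-based flag bucketing, plus a small rollout simulation.
import hashlib
import uuid

LONG_SCALE = float(0xFFFFFFFFFFFFFFF)  # max value of a 15-hex-digit hash

def flag_hash(flag_key: str, distinct_id: str, salt: str = "") -> float:
    """Deterministically map (flag, user) to a float in [0, 1)."""
    hash_key = f"{flag_key}.{distinct_id}{salt}"
    hash_val = int(hashlib.sha1(hash_key.encode("utf-8")).hexdigest()[:15], 16)
    return hash_val / LONG_SCALE

def is_enabled(flag_key: str, distinct_id: str, rollout_percentage: float) -> bool:
    return flag_hash(flag_key, distinct_id) <= rollout_percentage / 100

# How uneven does a 50% rollout look over cohorts of different sizes?
for n in (100, 1_000, 10_000, 100_000):
    enabled = sum(is_enabled("my-flag", str(uuid.uuid4()), 50) for _ in range(n))
    print(f"n={n:>7,}: {enabled / n:.1%} enabled")
```

Since the hash is deterministic per (flag, distinct_id), a given user always lands in the same bucket; small cohorts will typically wobble by a few percentage points, while larger ones settle close to 50%.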

@neilkakkar
Contributor Author

I don't think this is the case. (As one data point, there are ~10k events in the examples above.) Take the multivariate example above: all 3 variants are very close to each other (same hashing algorithm), and then there's this pile of leftover Nones, which get created for whatever reason: the /decide request coming in too late / ???? / ????.
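As a rough sanity check (assuming ~10k independent users and an exact 50% rollout), pure hashing/sampling noise alone shouldn't get anywhere near a 45/55 split:

```python
# Rough sampling-noise bound for a 50% flag over ~10k users (illustrative).
n, p = 10_000, 0.5
std = (p * (1 - p) / n) ** 0.5  # standard deviation of the observed share
print(f"expected share: {p:.1%} +/- {3 * std:.1%} (3 sigma)")  # ~50.0% +/- 1.5%
```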

@neilkakkar
Contributor Author

neilkakkar commented Jan 13, 2022

To test the above hypothesis: here's the multivariate breakdown on an event that happens pretty late (so hopefully not influenced by the /decide response).

As expected, the number of Nones goes down, but a few still exist, which means there's at least one more problem with serving FFs that I'm not aware of.

And, kind of interesting: client_request_failure has a number of Nones comparable to the rest.
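A minimal sketch of the suspected race (illustrative only, not the actual client code; the property name is made up for the example): events captured before the /decide response arrives carry no flag value, so they show up as None in breakdowns, while later events carry the variant:

```python
# Illustrative sketch of the /decide race: early events have no flag value attached.
events = []
flag_value = None                # not known until the /decide response arrives

def capture(event_name):
    events.append({"event": event_name, "$feature/my-flag": flag_value})

capture("$pageview")             # fired before /decide responded -> breaks down as None
flag_value = "test-variant"      # /decide response arrives (sometimes it never does)
capture("purchased")             # later event -> carries the variant

for e in events:
    print(e)
```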

@neilkakkar
Contributor Author

Wherever precision is required, we ought to use multivariate flags, not simple flags.

Chalking up the Nones remaining in multivariates to /decide being late, i.e. the response arriving after people have already closed their browsers.
