Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Or property filtering API #8012

Merged
merged 30 commits into from
Feb 14, 2022
Merged

Or property filtering API #8012

merged 30 commits into from
Feb 14, 2022

Conversation

EDsCODE
Copy link
Member

@EDsCODE EDsCODE commented Jan 12, 2022

Changes

Please describe.

  • add support for or filtering

  • OR properties filtering #5035

  • This PR doesn't change how the frontend needs to be structured. It sets up the logic to handle grouped filters but is compatible with the properties array as is. Follow up PRs will create migrations such that filters are stored in the new format

Current property payload:

All properties in the array are AND'd by default

[
     {
          key: ...
          type: ...
          value: ...
          ...
     },
     {
          key: ...
          type: ...
          value: ...
          ...
     }
     ...
]

Proposed structure:

There will be a group operator that determines the relationship between properties within the list.
property_groups is a new accepted filter parameter. When the properties field is empty, property_groups will be parsed

{
      type: 'OR',
      groups: [
            {
                 type: 'AND'
                 groups: [
                     {
                           key: ...
                           type: ...
                           value: ...
                           ...
                     },
                     {
                           key: ...
                           type: ...
                           value: ...
                           ...
                     }
                     ...
                 ]
            },
            {
                 type: 'AND'
                 groups: [
                     {
                           key: ...
                           type: ...
                           value: ...
                           ...
                     },
                     {
                           key: ...
                           type: ...
                           value: ...
                           ...
                     }
                     ...
                 ]
            }
      ]
}

How did you test this code?

Please describe.
-add tests

@posthog-bot
Copy link
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@EDsCODE
Copy link
Member Author

EDsCODE commented Jan 20, 2022

If we add OR filters, this leaves a few of the current assumptions we make in ambiguous states.

  • What happens to entity.properties + filter.properties instances? Filter.properties could be multiple OR groups so how would entity.properties be merged in
  • What happens to test_account_filters? Right now we're treating it as a simple AND with the rest of the filters but if there are OR groups to filter, we won't be able to simplify together all the properties as we're doing

@@ -11,7 +11,7 @@

class PropertyMixin(BaseParamMixin):
@cached_property
def properties(self) -> List[Property]:
def properties(self) -> Union[List[Property], List[List[Property]]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR will come down to typing making sense IMO.

Let's rename current properties to something like VanillaProperty.

I think List[AnyProperty] is the proper value here, with AnyProperty = Union[VanillaProperty, OrProperty]

OrProperty should be a separate class from the property containing neccessary values.

How I imagine the class will look like:

class OrProperty:
  type = 'or'

  def __init__(properties: List[AnyProperty]): # ... 

From the frontend you'd pass a payload like properties: [{ type: 'or', properties: [...] }]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm i see, will give that a shot when i loop back around to this

@posthog-bot
Copy link
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@neilkakkar
Copy link
Contributor

(some context below to familiarise with the problem, more a stream of consciousness as I discover/remember high school boolean logic. I could simply write the final conclusion I arrive on, but that misses valuable context & tradeoffs imo, so bear with me)

I think this still isn't flexible enough for all use cases? We have enough feedback here to know what all to support (pending UX stuff, which might indeed change this decision, but there's still an argument to make around making the backend as flexible as possible, such that all future iterations on OR property filtering are frontend/UX bound only)

What we currently have is a default AND between members of a list.

A terrible way to introduce ORs is to keep a single list with a property type per property, like:

[OR: A, AND: B, OR: C]. This is terrible because it's not clear whether we mean: (A || B) & C or A & ( B || C).

The fundamental problem is that OR and AND are binary operators, not unary, so they don't make sense when assigned as a property of a single property filter.

The way around this, if we want to preserve the existing data model is to only allow everything to be AND'ed or OR'ed together. This sounds not-flexible-enough to me.

One step above this is the current solution in this PR: We have two lists, the outer list is implicit AND, and inner lists are implicit OR:

[[A1, A2], [B1, B2], [C1, C2]] means (A1 || A2) & (B1 || B2) & (C1 || C2).

This is thankfully deterministic, and reasonably flexible, but misses this entire other class of combinations: Outer ORs with inner ANDs.

We can convert the above into this by using De Morgan's Law, and the NOT operator. But going down this route doesn't seem reasonable to me: It adds a lot more complexity around ensuring transformations are correct, figuring out Negation of properties, and also raises the question of where and how this happens (how does UI tell us this, and if the UI is general enough, there's no good reason for the backend to restrict itself to a constrained data model)

But, there's another class of property filtering this can't(?) account for. Something like: ((A || B) & C) || (D & C). This specific example raises the concern if I'm trying to overengineer / be too generic: Is there a usecase where people want to do something this complicated? Consider a user like N, which has multiple domains. They want to track pageviews for project A, B, and D (specific domain names, say, in the same industry), all restricted to country C.


At this point, worth mentioning(remembering) a very important theorem from boolean logic: Any boolean function may be written using only two levels of logic and possible negation of variables.

What this means is that the conjunctive normal form, or the representation we have in this PR, is good enough to represent any boolean operation on any number of operands.

((A || B) & C) || (D & C) simply reduces to (A&C) || (B&C) || (D&C), which by deMorgan's above can be flipped into

~(~(A&C) & ~(B&C) & ~(D&C)), which is ~( (~A || ~C) & (~B || ~C) & (~D || ~C)). AND of ORs, with some negations mixed in.


The crux for the data model then is such:

AND of ORs is good enough, if we're okay with running transformations on what users input. But, the user input is also in our control, so what if the UI restricts filters to only AND of ORs, and negations? This is technically enough, but here's the contention:

It's terrible UX for users to do these transformations themselves. "You want to see a trend line of domain A, B, and D only for country C?" "Yeah, 100% possible, you just need to insert: Not of Not domain A or Not country C, AND, not domain B or not country C, AND not domain D or not Country C; et viola!".

I think users shouldn't have to do this. Which begs the question, do we have to do this internally? What if we let users say whatever, and instead of simplifying it, just parse out the SQL as it comes out? Our internal representation then is something like an AST: ORs and ANDs of whatever properties users select, created into a tree, and parsed to spit out an SQL that works, sans clever mods.

(I don't know yet what implications this has for query running time, or if it's better/worse than CNF form queries)

Another option is to allow both CNF and DNF forms (AND of ORs, and ORs of ANDs), which might help make a bit more intuitive sense without allowing any deeply nested combos, and still be generic enough for most use cases. For the example above, this means we stop before deMorgan's, at (A&C) || (B&C) || (D&C). This actually seems like the best: decent UX restrictions, generic enough backend, without the overhead of managing an indeterminate height AST.

Opening this up to discussion, I can be convinced that the premise is wrong and we don't want to have as flexible as this APIs, i.e. either CNF or DNF is good enough & we don't need both / something more generic like AST.


For @clarkus : UX wise, the theorem implies that you can do whatever you want with 2 levels of logical operators only. I think it's reasonable to restrict it to 2 levels.

@clarkus
Copy link
Contributor

clarkus commented Feb 8, 2022

Here is the latest idea for balancing simplicity of composing filters with the power of 2 levels of logical operators. The component here uses groups with group-level operators that apply to all conditions. These are and / or groups. When multiple groups exist, we'll expose a higher level control that allows for matching all / any for the entire set of filters. This should allow us to meet all the demonstrated cases so far.

Screen Shot 2022-02-07 at 7 39 28 PM

@EDsCODE
Copy link
Member Author

EDsCODE commented Feb 8, 2022

At this point, worth mentioning(remembering) a very important theorem from boolean logic: Any boolean function may be written using only two levels of logic and possible negation of variables.

🎉 yupp, this is what I was trying to get at with this PR because I was going to just draw a line that it would be sufficient but your following points are 100% correct

I think @macobo's point was onto something though. Just have an (AND, OR) along with a group of properties. Pair this with @clarkus's switchers and some method for creating groups 1 layer deep (just an ability to +property within a group or a new group) and you should have all the abilities listed above. The only pattern missing is adding a negation on an entire group but this would be covered per demorgans law with the switcher

EDIT: just was writing this as the previous comment got posted too.

@posthog-bot posthog-bot removed the stale label Feb 8, 2022
@neilkakkar
Copy link
Contributor

neilkakkar commented Feb 8, 2022

Makes sense, this appears the best of both worlds to me so far as well!

Thought it made sense to line out all usecases + tradeoffs here. And yep, AST is overkill, and what Karl mentioned sounds reasonable enough:

Property, and PropertyGroup (with type='and | or'), sent the same way from FE, data model copies it with no transformations.

@clarkus
Copy link
Contributor

clarkus commented Feb 8, 2022

Latest updates to improve affordance for operators. This also demonstrates the workflow for the updated component. You'll note that last group has different icon actions. I was wondering if there are any other group level utility actions for a filter group. We have duplication and removal now, but is there anything else that could add value? Also of note is that we need a default for the match any / all control at the global level. I would default towards being more inclusive (any). Please share thoughts here or in figma.

Screen Shot 2022-02-08 at 9 07 40 AM

params.update(filter_params)

return " ".join(final), params
# TODO: clean
joined = f"AND ({' '.join(final).replace(property_operator, '', 1)})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically want to AND whatever comes out of this function from now on, so other filters, like team_id & datetimes etc. etc. can be ANDed as expected.

Earlier this wasn't a problem since everything was ANDed anyway

@@ -83,17 +175,19 @@ def parse_prop_clauses(
"{}person".format(prepend),
prop_var="person_props" if is_direct_query else "properties",
allow_denormalized_props=allow_denormalized_props and is_direct_query,
property_operator="AND",
Copy link
Member Author

@EDsCODE EDsCODE Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super tricky. This is a solo condition that's always pushed into a subquery so it needs to be AND. It's unrelated from the larger logic for the overall condition it's forming (addressed in test_parse_groups_persons)

@EDsCODE
Copy link
Member Author

EDsCODE commented Feb 9, 2022

Made some more progress on the grouping logic + tests. It should be able to handle any depth, not that it's necessary.

TODO:

  • write a migration process
  • make sure filter objects handle new property format
  • change all locations of parse_prop_clauses to parse_prop_grouped_clauses

@neilkakkar

@neilkakkar
Copy link
Contributor

Thanks! I've been doing the filter parsing today. Not quite sure yet how to combine things (and whether to use properties & parse_prop_grouped_clauses everywhere, or property_groups and parse_prop_grouped_clauses. Will look into this next, and simplifying whatever's going around the property mixin, and if there's a good way to swap over easily.

Not yet sure either about the migration either: Do we want to do this change on the DB, or can we perhaps make properties backwards compatible? If old style, convert to PropertyGroup with AND. Hmm, this is tricky.

@neilkakkar
Copy link
Contributor

(oh and moved a few things around to prevent circular imports). Test currently fail because we removed that first AND from parse_prop_clauses , which is ok.

I'm also wondering about scope with the switch-over: This PR would become absolutely massive if we try and do everything in here. This is attracting me towards the partial approach: Make some things support the new stuff, and transform old props into new props for that piece of code. Break up the PRs. Continue. New insights being modded save properties as new type(? not sure how complicated this gets to split, idk yet). And then, once we know this works well, decide to migrate old props to new.

)
if is_direct_query:
final.append(filter_query)
params.update(filter_params)
else:
# Subquery filter here always should be blank as it's the first
filter_query = filter_query.replace(property_operator, "", 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neilkakkar I changed the above "AND" back to the variable so it's less confusing and made it remove the operator. GET_DISTINCT_IDS_BY_PROPERTY_SQL doesn't need an operator because i removed the dummy 1 = 1

@EDsCODE
Copy link
Member Author

EDsCODE commented Feb 13, 2022

I think this is ready for a lookover to ship @neilkakkar. since we don't need a migration, this should be able to support a new frontend directly. But, we need to do some checks on how the caching logic works to make sure everything is accounted for

@neilkakkar
Copy link
Contributor

Thanks for fixing this bit^! Clever, to not worry about generating the entire valid test case, and focusing only on the direct_query part. Fix looks reasonable to me.

Tested out running cache update, dashboards and saved insight refreshes, everything seems to work okay.

Only new change I made: renaming filter_group to property_group.

Agreed though, that we'll need new checks and tests for updating caching, etc. when we start saving new filters with property_groups, since the to_dict stuff looks kinda sus to me.

@posthog-contributions-bot
Copy link
Contributor

This issue has 2004 words at 15 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

@neilkakkar
Copy link
Contributor

Further, since we're making no db changes, this change, if things do go wrong, should be very easy to rollback.

@neilkakkar
Copy link
Contributor

Actually, just reminded of one thing (which you rightly mentioned 15 days ago): How does this now work with simplifying filters? It should be fine for now since default properties are ANDed, but when frontend starts sending property_groups, we'll be in bit of trouble, maybe.

@EDsCODE
Copy link
Member Author

EDsCODE commented Feb 14, 2022

Actually, just reminded of one thing (which you rightly mentioned 15 days ago): How does this now work with simplifying filters? It should be fine for now since default properties are ANDed, but when frontend starts sending property_groups, we'll be in bit of trouble, maybe.

The cool thing about our recursive structure now is that we could easily just add another nesting so that entity properties and test account properties are an AND group with the rest of the property_groups but this relationship between filters will need to be communicated clearly on the interface. It might be confusing to know if the entity specific filters are being AND'd or OR'd with the global filters.

@clarkus @liyiy

@liyiy
Copy link
Contributor

liyiy commented Feb 14, 2022

Actually, just reminded of one thing (which you rightly mentioned 15 days ago): How does this now work with simplifying filters? It should be fine for now since default properties are ANDed, but when frontend starts sending property_groups, we'll be in bit of trouble, maybe.

The cool thing about our recursive structure now is that we could easily just add another nesting so that entity properties and test account properties are an AND group with the rest of the property_groups but this relationship between filters will need to be communicated clearly on the interface. It might be confusing to know if the entity specific filters are being AND'd or OR'd with the global filters.

@clarkus @liyiy

That's neat! I think I'm in favor of leaving it as an AND with the global + entity filters for now. We should see how many users are engaging with the OR and more complex filters before adding even more complexity to our UI/UX

@neilkakkar
Copy link
Contributor

Yep, makes sense! Just meant to note that this behaviour isn't implemented here, but something we definitely would need to do before releasing groups.

This will probs change lots of simplify_filters logic.

Oh, also, a third place to keep in mind: ColumnOptimizer and https://github.com/PostHog/posthog/blob/master/ee/clickhouse/queries/column_optimizer.py#L78

@neilkakkar
Copy link
Contributor

That's neat! I think I'm in favor of leaving it as an AND with the global + entity filters for now. We should see how many users are engaging with the OR and more complex filters before adding even more complexity to our UI/UX

Definitely! I think the concern is that this can be confusing (just simply ANDing) when filters also have ORs in them. So need a way to tell users that test account filters are alwayss ANDed.

@EDsCODE EDsCODE merged commit e1476df into master Feb 14, 2022
@EDsCODE EDsCODE deleted the or-properties branch February 14, 2022 18:39


def is_property_group(group: Union[Property, "PropertyGroup"]):
if isinstance(group, PropertyGroup):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return isinstance(group, PropertyGroup)

Or even inline?

return final, final_params


def is_property_group(group: Union[Property, "PropertyGroup"]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a smell in that PropertyGroup just wants to be another kind of Property rather than it's own entity.

I suggested this in my initial review - not sure why a different approach was chosen but I strongly disagree with it.

if len(property_group.groups) == 0:
return "", {}

if isinstance(property_group.groups[0], PropertyGroup):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious why this is correct - why can you only send one PropertyGroup at a time?

This isn't verified anywhere -> leads to unexpected behavior. Note that these APIs are not completely private and will be more documented and public long-term

At a minimum, an assert should be here.

@macobo
Copy link
Contributor

macobo commented Feb 16, 2022

This PR is pretty broken:

  1. It breaks using materialized properties as column_optimizer does not walk the tree of property/propertygroups
  2. It breaks person property pushdowns - you can't just push all person predicates to the person join anymore.

I also think data structures need improvements - typing-wise propertygroups should just be a different kind of properties rather than it's own object.

This makes it easier to make sure that all code consuming property objects is correct and handles every case. The current implementation extends the API of filter objects, making it easy to miss cases like the above and introduce bugs long-term. Given how far and wide Filter objects and properties are used, the semantics should be as lean as possible, especially here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants