Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(join): validate group key #4958

Merged
merged 2 commits into from
Jul 5, 2022
Merged

fix(join): validate group key #4958

merged 2 commits into from
Jul 5, 2022

Conversation

scbrickley
Copy link
Contributor

This is a more complete fix for #4909 which was previously closed by #4933.

The previous fix only returned an error if a group key column was missing from the joined output record. This fix also makes sure that the value does not change.

@scbrickley scbrickley requested a review from a team as a code owner July 5, 2022 16:50
@scbrickley scbrickley requested review from Marwes and removed request for a team July 5, 2022 16:50
Copy link

@wolffcm wolffcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Figuring out if we can elide the group key validation sounds like a tricky problem. I guess we have the same issue with map though, I forget what we do there? I think we build the table and then slice it up into separate groups if needed. That's different than the situation here, where you need to produce an error.

If we ever want to vectorize the output function, we might considering deferring the check until after we've built an output chunk, and erroring slightly later.

@scbrickley
Copy link
Contributor Author

scbrickley commented Jul 5, 2022

Looks good. Figuring out if we can elide the group key validation sounds like a tricky problem. I guess we have the same issue with map though, I forget what we do there? I think we build the table and then slice it up into separate groups if needed. That's different than the situation here, where you need to produce an error.

Just thinking out loud here. Maybe the fix in this case would be to check the semantic graph and see if any group key columns are assigned a value other than l.<group key label> or r.<group key label>. Seems like we could do that check once at the beginning and error early. I'm having trouble thinking of an as function that does anything other than that while still having a valid group key for the output record. Thoughts?

@wolffcm
Copy link

wolffcm commented Jul 5, 2022

if any group key columns are assigned a value other than l.<group key label> or r.<group key label>. Seems like we could do that check once at the beginning and error early.

The trouble with that approach, in my opinion, is that it is hard to know what's allowed via static analysis, since you could have a function like:

emptyStringAppender = (str) => str + ""

(left, right) => {
  k0 = left.k0
  k0_ = emptyStringAppender(str: k0)
  return {
    k0: k0_, // the only group key column
    // other fields here...
  }
}

A naive check would probably produce an error early even though this is actually fine. This is obviously a contrived example, but I'm not sure we can analyze a function such that we can be 100% sure the group key will change or not. If we get it wrong, the user gets an error for something they reasonably expect could succeed.

I wonder if it would make sense just to produce the result with the new group key, the way that map does?

@scbrickley
Copy link
Contributor Author

I wonder if it would make sense just to produce the result with the new group key, the way that map does?

I considered it, but thought that might be surprising for users to see output records that don't match the record they defined in their as function. Reconsidering that now, though.

@scbrickley
Copy link
Contributor Author

Oh, I may have misunderstood your suggestion. Do you mean to regroup the results based on what the user sets as their new group key? I think the intention with the new join was to not allow users to modify the group key, just as experimental.join() does.

@wolffcm
Copy link

wolffcm commented Jul 5, 2022

Do you mean to regroup the results based on what the user sets as their new group key?

Yes, that is exactly what I mean. I know that it is a radical suggestion. This is exactly what map does though. It's not always as inefficient as it sounds since we can create a new table.Chunk when the group key changes. I believe that this does not do any memory allocations. @jsternberg might know better than I on this though.

I don't think we should do anything for now, what you have here is totally reasonable for a new function. I'm not sure if we went back and created a map function today that we would take the same path...

@scbrickley
Copy link
Contributor Author

Gotcha. I'll merge and maybe we can revisit this line of reasoning later.

@scbrickley scbrickley merged commit 7e1283d into master Jul 5, 2022
@jacobmarble jacobmarble deleted the join-validate-groupkey branch January 4, 2024 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants