Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stdlib): create a version of map that is columnar and supports vectorization #4329

Merged
merged 1 commit into from
Feb 9, 2022

Conversation

jsternberg
Copy link
Contributor

@jsternberg jsternberg commented Dec 8, 2021

The columnar version of map will run map over an entire table chunk
rather than running it and then appending per row. This version of map
should be more efficient in the normal cases. In addition, it also
supports running vectorized functions when vectorization is available.

Fixes #4186.

Done checklist

  • docs/SPEC.md updated
  • Test cases written

@jsternberg jsternberg force-pushed the feat/vectorized-map branch 3 times, most recently from 2af604f to a99745c Compare December 8, 2021 19:04
// It will return the new columns after the function was executed.
// These columns are not organized by group keys and may require regrouping
// if a group key column no longer contains consistent values.
func (m *mapTransformation2) execute(chunk table.Chunk, fn *execute.RowMapPreparedFn, mem memory.Allocator) ([]flux.ColMeta, []array.Interface, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the only function that has to get modified for the vectorized version. The return value is exactly what we need and the parts around this all rely on the same output, but this function would need to be updated to also execute the vectorized version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's value in putting this in as an XXX comment (or TODO or whatever; I have my preferences, but whatever you're good with).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the function that we have here has a type like (r: {a: v[int], b: v[int]}) => {c: v[int]}, could we expect that the above code in createSchema will just work?

Also, can createSchema be a stand-alone function rather than a method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kind of depends on the type that gets returned. We may have to teach create schema to look at the element type if the function is vectorized.

It would be possible for createSchema to be a standalone function, but I usually try to scope these by making them method calls so I don't have to worry about name conflicts and using overly generic names.

@jsternberg jsternberg requested review from a team and rockstar and removed request for a team December 8, 2021 19:09
@jsternberg jsternberg force-pushed the feat/vectorized-map branch 2 times, most recently from 2941a67 to 348928b Compare December 13, 2021 21:58
// It will return the new columns after the function was executed.
// These columns are not organized by group keys and may require regrouping
// if a group key column no longer contains consistent values.
func (m *mapTransformation2) execute(chunk table.Chunk, fn *execute.RowMapPreparedFn, mem memory.Allocator) ([]flux.ColMeta, []array.Interface, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's value in putting this in as an XXX comment (or TODO or whatever; I have my preferences, but whatever you're good with).

@jsternberg jsternberg force-pushed the feat/vectorized-map branch 3 times, most recently from ee833a8 to 899e919 Compare February 8, 2022 19:48
@jsternberg jsternberg marked this pull request as ready for review February 8, 2022 19:48
@jsternberg jsternberg requested review from a team as code owners February 8, 2022 19:48
@jsternberg jsternberg requested review from noramullen1, wolffcm and scbrickley and removed request for a team February 8, 2022 19:48
@jsternberg jsternberg changed the title feat(stdlib): create a version of map that is columnar feat(stdlib): create a version of map that is columnar and supports vectorization Feb 8, 2022
Copy link
Contributor

@scbrickley scbrickley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good. I had a few questions that I'd like to discuss before this merges.

Comment on lines 32 to 39
// if spec.Fn.Fn.Vectorized != nil {
// fn = &mapVectorFunc{
// fn: execute.NewVectorMapFn(
// spec.Fn.Fn.Vectorized,
// compiler.ToScope(spec.Fn.Scope),
// ),
// }
// } else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an important check. Is there a reason we're skipping it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops.

@@ -206,7 +214,7 @@ func NewTablePredicateFn(fn *semantic.FunctionExpression, scope compiler.Scope)
}

func (f *TablePredicateFn) Prepare(tbl flux.Table) (*TablePredicatePreparedFn, error) {
fn, err := f.prepare(tbl.Key().Cols(), nil)
fn, err := f.prepare(tbl.Key().Cols(), nil, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about why we're passing bool literals as arguments to prepare. I thought we would be checking whether the function expression has a non-nil value for the Vectorized field, and setting the argument for vectorized based on that check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is after the check for vectorization. We've now either passed the vectorized function or not. It's mostly because, if it's vectorized, then the input columns need to be wrapped in vector types. Rather than expose that code to the map transformation, I chose to just add a boolean to this private helper struct.

Copy link

@wolffcm wolffcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some comments and questions, but nothing to keep you from merging.

Really nice to see this come together!

@@ -81,6 +81,7 @@ t_meta_query_keys = (table=<-) => {
|> range(start: 2018-05-22T19:53:26Z)
|> filter(fn: (r) => r._measurement == "cpu")
|> keys()
|> sort()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this had to be added, can you explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old map ordered the columns by using the sorted properties. This isn't necessarily correct. It's really supposed to be the order that the type system tells us. The new map used what the type system tells us about the order. These tests produce values depending on the column order which, in my opinion, should be treated as non-deterministic since the only reason why the order differed was because a map was inserted.

}
return true

}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Can these be stand-alone functions instead of methods of mapTransformation2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could, but this is a very large package and I didn't want to add free functions to it. Maybe I should refactor this to arrowutil so it's in a common location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Map the return object to the expected order from type inference.
// The compiler should have done this by itself, but it doesn't at the moment.
// When the compiler gets refactored so it returns records in the same order
// as type inference, we can remove this and just do a copy by index.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to update the compiler in this PR if it's not hard to do? Otherwise, can you link to an issue here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be too difficult. I can make an issue, but my thought is that it's not really worth doing except as part of a larger overhaul of the compiler runtime.

…ectorization

The columnar version of map will run map over an entire table chunk
rather than running it and then appending per row. This version of map
should be more efficient in the normal cases. In addition, it also
supports running vectorized functions when vectorization is available.
@jsternberg jsternberg merged commit 72b7382 into master Feb 9, 2022
@jsternberg jsternberg deleted the feat/vectorized-map branch February 9, 2022 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add vectorization support to Flux map function
4 participants