Request: a way to copy a Row to Rows #4466

Closed
alamb opened this issue Jun 29, 2023 · 0 comments · Fixed by #4470
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

alamb commented Jun 29, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am implementing GroupByHash in DataFusion (apache/datafusion#4973).

We use the RowFormat to store grouping keys, which is awesome.

The grouping operation computes the Row format for each input row, checks whether that key has been seen previously, and, if not, stores the newly seen Row. The only way I know of today to store it is to copy each new row individually using owned():

┌──────────────────────────────────┐                                                            
│ ┌───────────────────────────────┐│                                                            
│ │               A               ││                                                            
│ ├───────────────────────────────┤│                                                            
│ │               B               │├────────────┐                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │           ┌──────────────────────────────────┐
│ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
│ │               C               ││            │           │ │               A               ││
│ ├───────────────────────────────┤│            │           │ └───────────────────────────────┘│
│ │               B               ││            │           │ ┌───────────────────────────────┐│
│ ├───────────────────────────────┤│            └───────────┼▶│               B               ││
│ │               A               ││                        │ └───────────────────────────────┘│
│ ├───────────────────────────────┤│  to add a new row, I   │                                  │
│ │               A               ││  currently do          │                                  │
│ └───────────────────────────────┘│  `Row::owned()` to     │                                  │
│  group keys for input batch      │  get a copy            │   distinct group keys seen in    │
│  often many repeated values      │                        │   previous batches               │
│                                  │                        │                                  │
└──────────────────────────────────┘                        └──────────────────────────────────┘
                                                                                                
     arrow_row::Rows                                         Vec<arrow_row::OwnedRow>           
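
To make the cost in the diagram concrete, here is a minimal, std-only sketch of the current pattern. It does not use the arrow_row API: `assign_groups` is a hypothetical helper, plain byte slices stand in for `Row`, and `Vec<Box<[u8]>>` stands in for `Vec<arrow_row::OwnedRow>`. The point is only that each newly seen key triggers its own heap allocation.

```rust
/// Returns the group index of each input row, storing unseen keys in `seen`.
///
/// Hypothetical model: `&[u8]` plays the role of `Row`, and `Box<[u8]>`
/// plays the role of `OwnedRow`.
pub fn assign_groups(batch: &[&[u8]], seen: &mut Vec<Box<[u8]>>) -> Vec<usize> {
    batch
        .iter()
        .map(|row| {
            // A linear scan keeps the sketch short; a real group-by operator
            // would look the key up in a hash table instead.
            match seen.iter().position(|key| &**key == *row) {
                Some(idx) => idx,
                None => {
                    // One separate heap allocation per newly seen key,
                    // analogous to calling `Row::owned()`.
                    seen.push(Box::from(*row));
                    seen.len() - 1
                }
            }
        })
        .collect()
}
```

For the batch in the diagram (A, B, A, A, C, B, A, A), this stores three keys and therefore performs three separate key allocations, one per distinct value.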
                                                                                                

Describe the solution you'd like
I would like to be able to append a Row directly to a Rows:

┌──────────────────────────────────┐                                                            
│ ┌───────────────────────────────┐│                                                            
│ │               A               ││                                                            
│ ├───────────────────────────────┤│                                                            
│ │               B               │├────────────┐                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │           ┌──────────────────────────────────┐
│ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
│ │               C               ││            │           │ │               A               ││
│ ├───────────────────────────────┤│            │           │ ├───────────────────────────────┤│
│ │               B               ││            └───────────┼▶│               B               ││
│ ├───────────────────────────────┤│                        │ └───────────────────────────────┘│
│ │               A               ││                        │                                  │
│ ├───────────────────────────────┤│  Copying a new Row     │                                  │
│ │               A               ││  would just copy       │                                  │
│ └───────────────────────────────┘│  some bytes to the     │                                  │
│  group keys for input batch      │  other Rows            │   distinct group keys seen in    │
│  often many repeated values      │                        │   previous batches               │
│                                  │                        │                                  │
└──────────────────────────────────┘                        └──────────────────────────────────┘
                                                                                                
   arrow_row::Rows                                            arrow_row::Rows                   
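
The requested behavior can be sketched in the same std-only style. `PackedKeys` below is a hypothetical model of a `Rows`-like container, not the arrow_row implementation: it assumes all keys share one byte buffer plus an offsets array, so appending a key is a byte copy onto the end of that buffer rather than a new allocation per key.

```rust
/// Hypothetical model of a `Rows`-like container: all keys share one byte
/// buffer, and `offsets[i]..offsets[i + 1]` delimits key `i`.
pub struct PackedKeys {
    buffer: Vec<u8>,
    offsets: Vec<usize>,
}

impl PackedKeys {
    /// Creates an empty container; `offsets` always starts with a leading 0.
    pub fn new() -> Self {
        Self {
            buffer: Vec::new(),
            offsets: vec![0],
        }
    }

    /// Appends a key by copying its bytes onto the end of the shared buffer.
    /// The buffer grows geometrically, so most pushes allocate nothing.
    pub fn push(&mut self, row: &[u8]) {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
    }

    /// Number of keys stored so far.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrows key `idx` as a slice of the shared buffer.
    pub fn get(&self, idx: usize) -> &[u8] {
        &self.buffer[self.offsets[idx]..self.offsets[idx + 1]]
    }
}
```

With this layout, pushing the keys A and B from the diagram leaves both in one contiguous buffer, and `get` hands back borrowed slices without copying.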
                                                                                                

Describe alternatives you've considered

Currently my POC code uses Vec<OwnedRow>, which adds an extra allocation for each row 😢

Additional context
apache/datafusion#4973

alamb added the enhancement and arrow labels Jun 29, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 30, 2023
tustvold added a commit that referenced this issue Jun 30, 2023
* Append Row to Rows (#4466)

* Tweak docs

* Pass slices to encode

* Clippy