xsv partition subcommand #52
Conversation
OK, I've implemented pretty much what we discussed in #51. This code is ready for review, though we might want to quickly revisit a design decision or two. I wasn't able to get rid of the two extra allocations in the inner loop because of how the
Hello, and happy new year! I'm getting back to work on map/reduce jobs with Pachyderm, and I'll be using `xsv`. Is there anything I can do to improve this PR?

Also, an interesting use case keeps coming up internally. People don't always want to partition on the full value of a field, but sometimes only on the first several characters. For example, they might want to partition data by the first 4 digits of the zip code, or partition words by the first letter. What people are suggesting looks something like this:

```
# Partition on first four digits of zip code.
xsv partition --prefix=4 zip byzipcode

# Partition words by first letter.
(echo "word" && cat /usr/share/dict/words) | xsv partition --prefix=1 word byletter

# Partition rows by year and month.
xsv partition --prefix=6 isoDate by_yyyymm
```

Now, this raises some interesting questions. Is this truly a common use case for partitioning? Does this belong in this tool, or should it be in a separate tool, for better composability?

Generally, partitioning using prefixes will produce reasonable results for many string-like data types, as in the examples above. Another advantage is that if you have longer fields, you can limit things to 100 or 1,000 output files instead of 10,000+ (which we already know will break). And furthermore, any other conceivable scheme for partitioning into fewer files will probably be harder to explain than "use the first N characters."

I'm divided on this feature. I think some of the particular examples (such as the
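For what it's worth, a prefix cut like this has one subtlety in Rust: keys are UTF-8, so truncation should happen on character boundaries rather than raw bytes. A minimal sketch of a hypothetical helper (`key_prefix` is my name for illustration, not anything in xsv):

```rust
// Hypothetical helper (not in xsv): truncate a partition key to its first
// `n` characters, respecting UTF-8 boundaries so multi-byte characters
// are never split.
fn key_prefix(key: &str, n: usize) -> &str {
    match key.char_indices().nth(n) {
        // `idx` is the byte offset of the (n+1)-th character, so slicing
        // up to it keeps exactly the first `n` characters.
        Some((idx, _)) => &key[..idx],
        None => key, // the key is already shorter than the prefix
    }
}

fn main() {
    assert_eq!(key_prefix("02134-1111", 4), "0213");
    assert_eq!(key_prefix("ox", 5), "ox");
    assert_eq!(key_prefix("économie", 1), "é"); // 'é' is two bytes in UTF-8
    println!("ok");
}
```

Slicing with a plain `&key[..n]` would panic on inputs like the last one, which is why the character-index dance is worth the extra line.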
@emk Hi! Sorry I haven't had a chance to review your PR. I haven't forgotten about you. I'll try to find time later today, but it might slip until the weekend. :-(
Totally understood! We all have a PR backlog. :-)
Hello again. :-) We're reaching a point where we're soon going to be using `xsv partition`. Please let me know if there are any changes or improvements I can make! And thank you for your advice on writing this code. (I'm really excited about the possibilities of Rust for high-speed data processing.)
Hello! I'm sorry to be a pest again. Is there anything I could do to improve this PR to make it easier for you to merge? Thank you very much for
Sorry for taking so long to look at this. I dropped the ball.
Overall, this looks like great work, thank you. :-) I just had a couple of nits but this should otherwise be good to go!
src/cmd/partition.rs (Outdated)

```
Usage:
    xsv partition [options] <column> <outdir> [<input>]
    xsv split --help
```
I think this should say `xsv partition --help`.
src/cmd/partition.rs (Outdated)

```rust
arg_input: Option<String>,
arg_outdir: String,
flag_filename: FilenameTemplate,
// This option is unused, both here and in the split command.
```
Ah, probably just a mishap on my part. No need to repeat the mistakes of the past. :-)
src/cmd/partition.rs (Outdated)

```rust
if select_cols.len() == 1 {
    Ok(select_cols[0])
} else {
    Err("can only partition on one column".to_owned().into())
}
```
Can you use the `fail!` macro in order to be consistent with the rest of the code? Thanks. Example: https://github.com/BurntSushi/xsv/blob/master/src/cmd/join.rs#L327
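For readers following along, the pattern being requested looks roughly like this self-contained sketch (modeled on the linked example; the exact `fail!` definition in xsv may differ, and `select_one` is an illustrative name):

```rust
// Sketch of a `fail!`-style macro: early-return an error converted
// from whatever expression is passed in.
macro_rules! fail {
    ($e:expr) => {
        return Err(::std::convert::From::from($e))
    };
}

// Illustrative function, not the actual xsv code: pick exactly one
// selected column, or fail with a message.
fn select_one(select_cols: &[usize]) -> Result<usize, String> {
    if select_cols.len() == 1 {
        Ok(select_cols[0])
    } else {
        fail!(format!(
            "can only partition on one column, got {}",
            select_cols.len()
        ));
    }
}

fn main() {
    assert_eq!(select_one(&[2]), Ok(2));
    assert!(select_one(&[1, 2]).is_err());
    println!("ok");
}
```

The macro buys a one-liner for the error path while keeping the error type conversion (`From::from`) in one place.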
src/cmd/partition.rs (Outdated)

```rust
// Decide what file to put this in.
let key = row[key_col].clone();
let mut entry = writers.entry(key.clone());
```
I think there is one more clone here than necessary. I didn't consult the compiler, but I think `let key = &row[key_col];` should work?
No, you need both as far as I can tell. We call `try!(wtr.write(row.into_iter()));` below, which consumes the underlying backing `row`, and that leads to borrow checker problems. The easiest workaround is to make our own copy of `key`. I also tried something clever with `let ref key` and `drop(key)`, but it wouldn't work without non-lexical lifetimes.
src/cmd/partition.rs (Outdated)

```rust
} else {
    // This will hang indefinitely if we somehow manage to use all
    // possible candidates, but to do this we would need to create
    // more than 2^64 open files.
```
Maybe set a reasonable upper bound and fail with an error message? (If there is something the user can do to increase their chances of success, the error message should mention what that is.)
This case is basically impossible to hit. You'd need a CSV file with 2^64+1 rows in the partition column, each of which had a unique key that mapped to the same filename after being made shell safe.
So:

- I don't think there's any principled reason why we should give up here before we run out of possible file names. If we do want to give up before hitting OS or memory limits (or whatever), we should do that somewhere else.
- We're never going to run out of file names. Among other things, you'd need to store more than 2^64 `String` values in 2^64 bytes of RAM. :-)
- But just in case, I'm going to add a `checked_add` and a panic here, so that we handle the case of 2^64+1 distinct but colliding partition key values correctly. I could replace the panic with an error.

I suppose we could give up after 20,000 collisions or so, but it's a pretty marginal code path, and it would take really contrived data to ever get more than 3 or 4 collisions.
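For illustration, the `checked_add` idea might look roughly like this standalone sketch (the real xsv code differs; `unique_name` and the `HashSet` of used names are stand-ins):

```rust
use std::collections::HashSet;

// Sketch of the collision-handling idea: append an increasing counter
// to a key until we find an unused name. `checked_add` makes the
// (purely theoretical) counter overflow panic instead of silently
// wrapping and looping over already-tried candidates.
fn unique_name(used: &HashSet<String>, key: &str) -> String {
    if !used.contains(key) {
        return key.to_string();
    }
    let mut counter: u64 = 1;
    loop {
        let candidate = format!("{}_{}", key, counter);
        if !used.contains(&candidate) {
            return candidate;
        }
        counter = counter
            .checked_add(1)
            .expect("ran out of file name candidates");
    }
}

fn main() {
    let mut used = HashSet::new();
    assert_eq!(unique_name(&used, "CA"), "CA");
    used.insert("CA".to_string());
    assert_eq!(unique_name(&used, "CA"), "CA_1");
    println!("ok");
}
```

The panic is effectively unreachable, as argued above, but it documents the invariant for free instead of leaving a wrapping-overflow hang lurking in debug-vs-release behavior.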
I think the important limit here is actually the maximum number of open files the OS will allow us. This can be quite low! As we discussed in #51, this limit can be as low as 512 on Windows, which would effectively limit us to 500-way partitions.
You suggested that I could ignore the cross-platform file descriptor limits here:

> I would be OK with taking reasonable shortcuts for some of those issues (like not doing multi-column partitions or handling the file descriptor limits) based on how much work/effort you want to put into this.

…and I'm just going along with this for the first version, because it's fairly fiddly to handle this well. We could certainly create a system for buffering data in memory and then opening and closing files as needed to stay under the limit.
We do actually have a use case for a 10,000-way partition (ZIP prefixes), but only on Linux. So I may revisit this issue.
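As a practical note for anyone hitting these limits, the per-process open-file limit can be inspected (and often raised) from the shell before running a wide partition:

```shell
# Show the current soft limit on open file descriptors for this shell.
ulimit -n

# Show the hard limit, which bounds how far the soft limit can be raised
# without privileges.
ulimit -Hn

# Raise the soft limit for the current session (uncomment to use; the
# value 4096 is just an example and must not exceed the hard limit):
# ulimit -n 4096
```

On Windows there is no `ulimit`; the C runtime's default handle limit is what produces the ~512-file ceiling mentioned above.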
OK, I've finally moved back onto ETL (Extract Transform Load) work again, and I have some cycles for `xsv`.
@emk No worries! I am also sorry for holding you up!
I've made some changes, but when I try to run the tests, I'm now getting errors on every(?) single test case:
The backtraces look like:
I'll poke at this some more tomorrow, but I'm wiped out for the moment.
OK, some more progress!
Anyway, thank you for all your help preparing this patch, and for your feedback! We're using `xsv`.
OK, good fun this morning: one of our Node.js CSV partitioners is behaving strangely (when partitioning ~260 GB of data from 1,500 input files), and I'm going to try replacing it with `xsv partition`.
(Name and short name subject to change.) Use cases:
I may also need to keep an eye on how we handle open files, because this is a 1,000-way partition, and Linux only allows 1,024 open files by default on my system.
@emk That extra feature sounds good and so do your other responses. :-) I'm happy to hear that.

Please do rebase and squash this whole PR down to a single commit. You can wait to do that until you think it's ready to merge if you like.
OK, I've made one final tweak to how the `--filename` templates work:

```
xsv partition --filename {}/cities.csv state . all-cities.csv
xsv partition --filename {}/monuments.csv state . all-monuments.csv
xsv partition --filename {}/parks.csv state . all-parks.csv
```

This also supports the case where we need to partition multiple files without accidentally overwriting things, which is a challenge I hit fairly often on our parallel data processing jobs:

```
# Possibly in parallel.
xsv partition --filename {}/$(uuidgen).csv . input1.csv
xsv partition --filename {}/$(uuidgen).csv . input2.csv
```

Does this seem like a reasonable feature to you? It's relatively unobtrusive, I hope. Note that, like all code using

Please let me know if you have any advice on how to improve this!
OK, I've just applied the final bit of polishing I wanted to apply, and squashed this branch into a single patch on master. This week, we're running this on hundreds of gigabytes of data a couple of times per day, and it hasn't complained so far. This is ready for your final review and merge, I hope. :-)
@emk Wow. I couldn't have done this better myself. This patch looks amazing and the work you put into this is first rate. I'm so happy to hear you folks are using it without any hiccups. :-) I have two very small requests:
This patch adds a new `xsv partition` subcommand, and makes a few minor modifications to `xsv split` for consistency. The new subcommand is used as follows:

```
xsv partition state out/ cities.csv
```

This will take the data in `cities.csv` and use the `state` column to create new file names, creating files like:

```
out/CA.csv
out/TX.csv
```

If the `states` column is empty, we'll put the data into:

```
out/empty.csv
```

There's an option for specifying the filename template:

```
xsv partition --filename=state-{}.csv state out/ cities.csv
```

This will create:

```
out/state-CA.csv
out/state-TX.csv
```

This `--filename` option may also be used for `xsv split`.

There's also a `--prefix-length` argument, which can be used to limit the number of files created:

```
xsv partition --prefix-length=3 zip out/ customers.csv
```

This will create files using the first three digits of the zip code:

```
out/000.csv
out/010.csv
out/011.csv
out/213.csv
```

Note that if you try to split into more than roughly 500 files on Windows or 1,000 files on Linux systems, you may need to adjust `ulimit` or the local equivalent. (This was discussed in the original proposal.)

You can also write `--filename={}/file.csv`. This supports two use cases:

```
xsv partition --filename {}/cities.csv state . all-cities.csv
xsv partition --filename {}/monuments.csv state . all-monuments.csv
xsv partition --filename {}/parks.csv state . all-parks.csv
```

Above, we want to partition our records by state into separate directories, but we have multiple kinds of data.

```
xsv partition --filename {}/$(uuidgen).csv . input1.csv
xsv partition --filename {}/$(uuidgen).csv . input2.csv
```

Above, we're running multiple (possibly parallel) copies of xsv and we want to partition the data into multiple directories without a filename clash.

Fixes BurntSushi#51.
Rebased, "Fixes #51" added, and all checks are green!
Thank you! It was a delightful learning exercise to try to "emulate" your coding style as closely as possible. I was particularly impressed by the

Sometime soon, I want to write some blog posts about using the open source Pachyderm to run `xsv` map/reduce jobs.
@emk That all sounds wonderful. You've actually motivated me to take a brief hiatus from SIMD and put some more attention into a refresh of the CSV crate. I have some ideas on how to make it even faster. :-) |
I have most of the parser rewritten now. The parser is now a proper DFA and can fit into 376 bytes on the stack. Benchmarks, before:
after:
There's also a special no-copy mode specifically designed for counting things without actually looking at the data:
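To illustrate the DFA idea for readers (this is a toy, not the actual csv crate internals): a byte-at-a-time state machine can count records without copying any field data, while still handling quoted fields and `""` escapes:

```rust
// Toy byte-at-a-time CSV DFA: counts newline-terminated records without
// allocating or copying field contents. Real-world parsers handle CRLF,
// final lines without a trailing newline, etc.; this sketch does not.
#[derive(Clone, Copy)]
enum State {
    Field,    // inside an unquoted field (or between fields)
    Quoted,   // inside a quoted field
    QuoteEnd, // just saw a '"' while inside a quoted field
}

fn count_records(data: &[u8]) -> u64 {
    let mut state = State::Field;
    let mut count = 0;
    for &b in data {
        state = match (state, b) {
            (State::Field, b'"') => State::Quoted,
            (State::Field, b'\n') => {
                count += 1;
                State::Field
            }
            (State::Field, _) => State::Field,
            (State::Quoted, b'"') => State::QuoteEnd,
            (State::Quoted, _) => State::Quoted,
            // A doubled quote ("") is an escaped quote; stay in the field.
            (State::QuoteEnd, b'"') => State::Quoted,
            (State::QuoteEnd, b'\n') => {
                count += 1;
                State::Field
            }
            (State::QuoteEnd, _) => State::Field,
        };
    }
    count
}

fn main() {
    let csv = b"name,state\n\"Albany, NY\",NY\nBoston,MA\n";
    assert_eq!(count_records(csv), 3); // header + two data rows
    println!("ok");
}
```

Because the transition table is fixed and tiny, a state machine like this fits in a handful of bytes, which gives a feel for how the rewritten parser can live in 376 bytes on the stack.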
Oh that is very, very cool. I'll have to port
Also, if you let me know when you release a new version of
@emk Hopefully you won't have to do that. I'm going to gut the existing `csv` crate
@BurntSushi (By the way, one of the things we need to handle is non-compliant escaping of quotes. Hence the
@emk Gotya. I do think my idea of the new
(But if you need to do extra tricky things with malformed CSV, then sure, that might necessitate hand-rolling something. Failing that, the CSV parser should always return a parse, even if it's wrong, and that process should be deterministic. So you could fix the parse itself by figuring out how invalid data maps into it. But it's a little hokey...)
Just wanted to drop here that
READY FOR MERGE (I hope).

I've tried to follow the standard `xsv` coding style and to re-use existing support code where possible. But I want to show my initial work and ask for any general feedback now.

One interesting wrinkle that I noticed: The `split` command has a `--output` flag, but it ignores it. I'm thinking that perhaps instead of having a `--filename TEMPLATE` argument, both `split` and `partition` should have an `--output` argument that defaults to `{}.csv` using my new `FilenameTemplate` type. Would this be a reasonable approach?

TODO

- `--filename` argument, including prefix
- `--filename` arguments, including no `{}` or two `{}`
- `partition` and `split` to use the same filename template system, possibly as `--output` instead of `--filename`?