xsv partition subcommand #52

Merged · 1 commit into BurntSushi:master · Mar 25, 2017

Conversation

@emk (Contributor) commented Dec 29, 2016

READY FOR MERGE (I hope).

I've tried to follow the standard xsv coding style and to re-use existing support code where possible.

But I wanted to show my initial work and ask for general feedback now.

One interesting wrinkle that I noticed: the split command has a --output flag, but it ignores it. I'm thinking that perhaps instead of having a --filename TEMPLATE argument, both split and partition should have an --output argument that defaults to {}.csv, using my new FilenameTemplate type. Would this be a reasonable approach?
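
Roughly what I have in mind for FilenameTemplate (a sketch only, not the final code in this PR): a pattern must contain exactly one {}, which gets replaced by the partition key.

    // Hypothetical sketch of the FilenameTemplate idea, not the PR's code.
    struct FilenameTemplate {
        prefix: String,
        suffix: String,
    }

    impl FilenameTemplate {
        fn parse(pattern: &str) -> Result<FilenameTemplate, String> {
            let parts: Vec<&str> = pattern.split("{}").collect();
            if parts.len() != 2 {
                // Rejects both "no {}" and "two or more {}".
                return Err("filename template must contain exactly one {}".to_owned());
            }
            Ok(FilenameTemplate {
                prefix: parts[0].to_owned(),
                suffix: parts[1].to_owned(),
            })
        }

        fn filename(&self, key: &str) -> String {
            format!("{}{}{}", self.prefix, key, self.suffix)
        }
    }

So, for example, parsing {}.csv would yield filenames like CA.csv, and parsing state-{}.csv would yield state-CA.csv.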

TODO

  • Empty strings in partition column
  • Create output directory if it does not exist
  • Sanitize filenames to contain only shell-safe characters
  • Collisions between sanitized field values
  • Test files with no headers & partitioning based on column number
  • Test --filename argument, including prefix
  • Invalid --filename arguments, including no {} or two {}
  • Modify both partition and split to use the same filename template system, possibly as --output instead of --filename?
  • More as I think of them

@emk (Contributor Author) commented Dec 29, 2016

OK, I've implemented pretty much what we discussed in #51. This code is ready for review, though we might want to quickly revisit a design decision or two.

I wasn't able to get rid of the two extra allocations in the inner loop, because of how the entry API works and because the borrow checker was paranoid. But the byte_records iterator is allocation-heavy anyway, so two more allocations won't make much difference.

@emk changed the title from "WIP: xsv partition subcommand" to "xsv partition subcommand" on Dec 29, 2016
@emk (Contributor Author) commented Jan 5, 2017

Hello, and happy new year! I'm getting back to work on map/reduce jobs with Pachyderm, and I'll be using xsv soon. Should be fun!

Is there anything I can do to improve this PR?

Also, an interesting use case keeps coming up internally. People don't always want to partition on the full value of a field, but sometimes only the first several characters. For example, they might want to partition data by the first 4 digits of the zip code, or partition words by the first letter. What people are suggesting looks something like this:

# Partition on first four digits of zip code.
xsv partition --prefix=4 zip byzipcode

# Partition words by first letter.
(echo "word" && cat /usr/share/dict/words) | xsv partition --prefix=1 word byletter

# Partition rows by year and month.
xsv partition --prefix=6 isoDate by_yyyymm

Now, this raises some interesting questions. Is this truly a common use case for partitioning? Does this belong in this tool? Or should it be in a separate tool, for better composability?

Generally, partitioning using prefixes will produce reasonable results for many string-like data types, as in the examples above. Another advantage is that if you have longer fields, you can limit things to 100 or 1,000 output files instead of 10,000+ (which we already know will break). And furthermore, any other conceivable scheme for partitioning into fewer files will probably be harder to explain than "use the first N characters."

I'm divided on this feature. I think some of the particular examples (such as the isoDate one) look very reasonable, but I'd only want to add --prefix if this option is conceptually fundamental to partitioning. What do you think?

@BurntSushi (Owner) commented:

@emk Hi! Sorry I haven't had a chance to review your PR. I haven't forgotten about you. I'll try to find time later today, but it might slip until the weekend. :-(

@emk (Contributor Author) commented Jan 5, 2017

Totally understood! We all have a PR backlog. :-)

@emk (Contributor Author) commented Jan 19, 2017

Hello again. :-) We're reaching the point where we'll soon be using xsv for big CSV Pachyderm jobs, and I thought I'd ping you. No rush: I won't reach this stage of the data processing pipeline for a bit yet, and we can always build xsv from our own branch or build a separate tool.

Please let me know if there are any changes or improvements I can make! And thank you for your advice on writing this code.

(I'm really excited about the possibilities of Rust for high-speed data processing.)

@emk (Contributor Author) commented Feb 17, 2017

Hello! I'm sorry to be a pest again. Is there anything I could do to improve this PR to make it easier for you to merge? Thank you very much for xsv and for your time!

@BurntSushi (Owner) left a review:

Sorry for taking so long to look at this. I dropped the ball.

Overall, this looks like great work, thank you. :-) I just had a couple of nits but this should otherwise be good to go!


Usage:
xsv partition [options] <column> <outdir> [<input>]
xsv split --help
@BurntSushi (Owner):

I think this should say xsv partition --help.

arg_input: Option<String>,
arg_outdir: String,
flag_filename: FilenameTemplate,
// This option is unused, both here and in the split command.
@BurntSushi (Owner):

Ah, probably just a mishap on my part. No need to repeat the mistakes of the past. :-)

if select_cols.len() == 1 {
    Ok(select_cols[0])
} else {
    Err("can only partition on one column".to_owned().into())
@BurntSushi (Owner):

Can you use the fail! macro in order to be consistent with the rest of the code? Thanks. Example: https://github.com/BurntSushi/xsv/blob/master/src/cmd/join.rs#L327
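
For illustration, the check might then read like this (a sketch; it assumes fail! accepts anything convertible into the command's error type, as in the join.rs example):

    if select_cols.len() == 1 {
        Ok(select_cols[0])
    } else {
        fail!("can only partition on one column")
    }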


// Decide what file to put this in.
let key = row[key_col].clone();
let mut entry = writers.entry(key.clone());
@BurntSushi (Owner):

I think there is one more clone here than necessary. I didn't consult the compiler, but I think let key = &row[key_col]; should work?

@emk (Contributor Author):

No, you need both as far as I can tell. We call try!(wtr.write(row.into_iter())); below, which consumes the underlying backing row, and which leads to borrow checker problems. The easiest workaround is to make our own copy of key. I also tried something clever with let ref key and drop(key), but it wouldn't work without non-lexical lifetimes.
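
Here's a minimal, self-contained sketch of the shape of the problem (hypothetical types; the real code writes to CSV writers rather than Vecs):

    use std::collections::hash_map::Entry;
    use std::collections::HashMap;

    // Stand-in for the routing loop: `writers` buffers rows by partition key.
    fn route(writers: &mut HashMap<String, Vec<Vec<String>>>,
             row: Vec<String>,
             key_col: usize) {
        // Clone #1: an owned key, so `row` isn't borrowed when we move it below.
        let key = row[key_col].clone();
        let buf = match writers.entry(key.clone()) {
            Entry::Vacant(v) => {
                // Clone #2: `entry` consumed its argument, but we still need
                // the key text here (e.g. to derive a filename).
                let _filename = format!("{}.csv", key);
                v.insert(Vec::new())
            }
            Entry::Occupied(o) => o.into_mut(),
        };
        // Moving `row` here is why `let key = &row[key_col];` fails under the
        // pre-NLL borrow checker: that borrow is still considered live.
        buf.push(row);
    }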

} else {
    // This will hang indefinitely if we somehow manage to use all
    // possible candidates, but to do this we would need to create
    // more than 2^64 open files.
@BurntSushi (Owner):

Maybe set a reasonable upper bound and fail with an error message? (If there is something the user can do to increase their chances of success, the error message should mention what that is.)

@emk (Contributor Author):

This case is basically impossible to hit. You'd need a CSV file with 2^64+1 rows in the partition column, each of which had a unique key that mapped to the same filename after being made shell safe.

So...

  1. I don't think there's any principled reason why we should give up here before we run out of possible file names. If we do want to give up before hitting OS or memory limits (or whatever), we should do that somewhere else.
  2. We're never going to run out of file names. Among other things, you'd need to store more than 2^64 String values in 2^64 bytes of RAM. :-)
  3. But just in case, I'm going to add a checked_add and a panic here, so that we handle the case of 2^64+1 distinct but colliding partition key values correctly (sketched below). I could replace the panic with an error.

I suppose we could give up after 20,000 collisions or so, but it's a pretty marginal code path and it would take really contrived data to ever get more than 3 or 4 collisions.
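
A sketch of the checked_add approach from point 3 (a hypothetical helper, not the exact PR code):

    use std::collections::HashSet;

    // Find an unused name by appending a counter; `checked_add` turns the
    // (practically unreachable) 2^64 wraparound into a clean panic instead
    // of silently cycling through the same candidates forever.
    fn unique_name(used: &HashSet<String>, base: &str) -> String {
        if !used.contains(base) {
            return base.to_owned();
        }
        let mut counter: u64 = 1;
        loop {
            let candidate = format!("{}_{}", base, counter);
            if !used.contains(&candidate) {
                return candidate;
            }
            counter = counter
                .checked_add(1)
                .expect("ran out of unique filename candidates");
        }
    }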

@emk (Contributor Author):

I think the important limit here is actually the maximum number of open files the OS will allow us. This can be quite low! As we discussed in #51, this limit can be as low as 512 on Windows, which would effectively limit us to 500-way partitions.

You suggested that I could ignore the cross-platform file descriptor limits here:

> I would be OK with taking reasonable shortcuts for some of those issues (like not doing multi-column partitions or handling the file descriptor limits) based on how much work/effort you want to put into this.

…and I'm just going along with this for the first version, because it's fairly fiddly to handle this well. We could certainly create a system for buffering data in memory and then opening and closing files as needed to stay under the limit.

We actually do have a use case for a 10,000-way partition (ZIP prefixes), but only on Linux, so I may revisit this issue.
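
If we ever do tackle it, one rough shape the idea could take (purely a sketch, nothing in this PR) is an LRU pool that caps the number of open files and reopens closed ones in append mode on demand:

    use std::collections::VecDeque;
    use std::fs::{File, OpenOptions};
    use std::io::{self, Write};

    const MAX_OPEN: usize = 500; // conservative cross-platform bound

    struct WriterPool {
        open: VecDeque<(String, File)>, // front = least recently used
    }

    impl WriterPool {
        fn new() -> WriterPool {
            WriterPool { open: VecDeque::new() }
        }

        fn write(&mut self, path: &str, line: &[u8]) -> io::Result<()> {
            if let Some(pos) = self.open.iter().position(|entry| entry.0 == path) {
                // Already open: move it to the back (most recently used).
                let entry = self.open.remove(pos).unwrap();
                self.open.push_back(entry);
            } else {
                if self.open.len() >= MAX_OPEN {
                    // Dropping the File closes it; we reopen on demand later.
                    self.open.pop_front();
                }
                let f = OpenOptions::new().create(true).append(true).open(path)?;
                self.open.push_back((path.to_owned(), f));
            }
            self.open.back_mut().unwrap().1.write_all(line)
        }
    }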

@emk (Contributor Author) commented Mar 15, 2017

OK, I've finally moved back onto ETL (Extract Transform Load) work again, and I have some cycles for xsv. Thank you for your feedback, my apologies for the delay, and I'll have something for you very shortly!

@BurntSushi (Owner) commented:

@emk No worries! I am also sorry for holding you up!

@emk (Contributor Author) commented Mar 15, 2017

I've made some changes, but when I try to run the tests, I'm now getting errors on every(?) single test case:

---- test_stats::stats_sum_mixed3::no_headers_no_index stdout ----
	thread 'test_stats::stats_sum_mixed3::no_headers_no_index' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 2, message: "No such file or directory" } }', /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/result.rs:837

---- test_stats::stats_zero_cardinality::headers_index stdout ----
	thread 'test_stats::stats_zero_cardinality::headers_index' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 2, message: "No such file or directory" } }', /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/result.rs:837

The backtraces look like:

   6:     0x55ceefbd4339 - std::panicking::begin_panic_fmt::h173eadd80ae64bec
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:499
   7:     0x55ceefbd42c7 - rust_begin_unwind
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:475
   8:     0x55ceefbff41d - core::panicking::panic_fmt::h3b2d1e30090844ff
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/panicking.rs:69
   9:     0x55ceef9d7458 - core::result::unwrap_failed::h822856a25f2ebc35
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/macros.rs:29
  10:     0x55ceef9e4726 - tests::workdir::Workdir::output::h770e37d410351393
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/result.rs:737
                        at /home/emk/w/clients/faraday/xsv/tests/workdir.rs:77
  11:     0x55ceef9e48dd - tests::workdir::Workdir::stdout::h5ebdbccaa288b5e9
                        at /home/emk/w/clients/faraday/xsv/tests/workdir.rs:97
  12:     0x55ceef9e3d5e - tests::workdir::Workdir::read_stdout::h611362861c60b942
                        at /home/emk/w/clients/faraday/xsv/tests/workdir.rs:64
  13:     0x55ceef9e9768 - tests::test_cat::prop_cat_cols::p::hd3e07fc7766080cd
                        at /home/emk/w/clients/faraday/xsv/tests/test_cat.rs:23
                        at /home/emk/w/clients/faraday/xsv/tests/test_cat.rs:76
  14:     0x55ceef9d4c58 - std::panicking::try::do_call::hcb225f1ac2d3bf89
                        at /home/emk/.cargo/registry/src/github.com-1ecc6299db9ec823/quickcheck-0.3.1/src/tester.rs:272
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panic.rs:295
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:458

I'll poke at this some more tomorrow, but I'm wiped out for the moment.

@emk (Contributor Author) commented Mar 16, 2017

OK, some more progress!

  1. I fixed the unit test failures by merging master back into this PR. If you'd prefer for me to rebase this PR onto the latest master, I can do that.
  2. I made most of the changes that you suggested, with two exceptions: (a) I didn't eliminate the double clone, because that involves a fairly tricky borrow-checker issue, and (b) I fixed the unique name generator to prevent overflows, but I didn't try to impose any hard-coded limits at that level. Any limits we impose need to happen far earlier, and they should involve the maximum number of open files. But see my comment for details.
  3. The new xsv partition command supports a --filename argument so that you can do things like --filename state-{}.csv. I generalized this to also work for xsv split, because I need to be able to do things like xsv split --filename "$(uuidgen)-{}.csv" in a parallel system. And also because it's less surprising if split and partition support similar arguments, of course. See the commit message for details.

Anyway, thank you for all your help preparing this patch, and for your feedback! We're using xsv heavily for all sorts of things at Faraday, and I'm in the process of setting up a new cluster-based data-cleaning system that will rely heavily on xsv and other Rust tools.

@emk (Contributor Author) commented Mar 22, 2017

OK, good fun this morning: One of our Node.js CSV partitioners is behaving strangely (when partitioning ~260 GB of data from 1,500 input files), and I'm going to try replacing it with xsv partition. This will probably involve adding one more flag:

-p, --prefix-length CHARS   When partitioning, use only the first N characters of the key column.

(Name and short name subject to change.)

Use cases:

  1. Partition using the first N digits of the zip code.
  2. Partition using the first couple letters of a word.
  3. Partition using the first N digits of a part number.

I may also need to keep an eye on how we handle open files, because this is a 1,000-way partition, and Linux only allows 1,024 open files by default on my system.
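
The truncation itself should be tiny. A sketch (hypothetical helper), counting characters rather than bytes so UTF-8 keys never get split mid-character:

    // Return at most the first `n` characters of `key`.
    fn key_prefix(key: &str, n: usize) -> &str {
        match key.char_indices().nth(n) {
            Some((idx, _)) => &key[..idx],
            None => key, // fewer than n characters: use the whole key
        }
    }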

@BurntSushi (Owner) commented:

@emk That extra feature sounds good and so do your other responses. :-) I'm happy to hear xsv is working for you folks! :-)

Please do rebase and squash this whole PR down to a single commit. You can wait to do that once you think it's ready to merge if you like.

@emk (Contributor Author) commented Mar 22, 2017

OK, I've made one final tweak to how the --filename argument works, allowing users to write things like:

xsv partition --filename {}/cities.csv state . all-cities.csv
xsv partition --filename {}/monuments.csv state . all-monuments.csv
xsv partition --filename {}/parks.csv state . all-parks.csv

This also supports the case where we need to partition multiple files without accidentally overwriting things, which is a challenge I hit fairly often on our parallel data processing jobs:

# Possibly in parallel.
xsv partition --filename {}/$(uuidgen).csv . input1.csv
xsv partition --filename {}/$(uuidgen).csv . input2.csv

Does this seem like a reasonable feature to you? It's relatively unobtrusive, I hope.

Note that like all code using create_dir_all, this is potentially affected by the race condition recently fixed by rust-lang/rust#39799. In order to support stable Rust (and also the older Rust versions targeted by xsv), I'm going to submit one final patch fixing this race condition. Then I'll squash this PR for final review.

Please let me know if you have any advice on how to improve this!

@emk (Contributor Author) commented Mar 23, 2017

OK, I've just applied the final bit of polishing I wanted to apply, and squashed this branch into a single patch on master.

This week, we're running this on hundreds of gigabytes of data a couple of times per day, and it hasn't complained so far.

This is ready for your final review and merge, I hope. :-)

@BurntSushi (Owner) commented:

@emk Wow. I couldn't have done this better myself. This patch looks amazing and the work you put into this is first rate. I'm so happy to hear you folks are using it without any hiccups. :-)

I have two very small requests:

  1. Could you add Fixes #51 to the bottom of your absolutely lovely commit message?
  2. Could you rebase this once more on top of master? I just pushed some stuff to fix CI (Travis changed something about their Rust install), and given the size of this PR, I would like to have it pass CI first before merging. :-)

The squashed commit message:

This patch adds a new `xsv partition` subcommand, and makes a few minor
modifications to `xsv split` for consistency.  The new subcommand is
used as follows:

    xsv partition state out/ cities.csv

This will take the data in `cities.csv` and use the `state` column to
create new file names, creating files like:

    out/CA.csv
    out/TX.csv

If the `state` column is empty, we'll put the data into:

    out/empty.csv

There's an option for specifying the filename template:

    xsv partition --filename=state-{}.csv state out/ cities.csv

This will create:

    out/state-CA.csv
    out/state-TX.csv

This `--filename` option may also be used for `xsv split`.

There's also a `--prefix-length` argument, which can be used to limit
the number of files created:

    xsv partition --prefix-length=3 zip out/ customers.csv

This will create files using the first three digits of the zip code:

    out/000.csv
    out/010.csv
    out/011.csv
    out/213.csv

Note that if you try to split into more than roughly 500 files on
Windows or 1000 files on Linux systems, you may need to adjust `ulimit`
or the local equivalent.  (This was discussed in the original proposal.)

You can also write `--filename={}/file.csv`.  This supports two use
cases:

    xsv partition --filename {}/cities.csv state . all-cities.csv
    xsv partition --filename {}/monuments.csv state . all-monuments.csv
    xsv partition --filename {}/parks.csv state . all-parks.csv

Above, we want to partition our records by state into separate
directories, but we have multiple kinds of data.

    xsv partition --filename {}/$(uuidgen).csv . input1.csv
    xsv partition --filename {}/$(uuidgen).csv . input2.csv

Above, we're running multiple (possibly parallel) copies of xsv and we
want to partition the data into multiple directories without a filename
clash.

Fixes BurntSushi#51.
@emk (Contributor Author) commented Mar 25, 2017

Rebased, "Fixes #51" added, and all checks are green!

> @emk Wow. I couldn't have done this better myself. This patch looks amazing and the work you put into this is first rate.

Thank you! It was a delightful learning exercise to try and "emulate" your coding style as closely as possible. I was particularly impressed by the WorkDir pattern you use to test the xsv and ripgrep CLIs, and so I spun the basic idea off into a cli_test_dir crate for use in my own programs and as part of our standard Rust style at Faraday in the future. We've also released scrubcsv, which uses your csv crate to aggressively normalize some of the weirder CSV files we need to ingest.

Sometime soon, I want to write some blog posts about using the open source Pachyderm to run xsv on a compute cluster. Pachyderm still has some pretty rough edges (especially in terms of setup), but it's based on Docker containers that read files (or named pipes) from /pfs/$INPUT and write files to /pfs/out. This turns out to be a great match for data processing in Rust, allowing us to use xsv on "medium data".

@BurntSushi (Owner) commented:

@emk That all sounds wonderful. You've actually motivated me to take a brief hiatus from SIMD and put some more attention into a refresh of the CSV crate. I have some ideas on how to make it even faster. :-)

@BurntSushi BurntSushi merged commit 378d64f into BurntSushi:master Mar 25, 2017
@BurntSushi (Owner) commented:

I have most of the parser rewritten now. The parser is now a proper DFA and can fit into 376 bytes on the stack. Benchmarks, before:

test raw_records_game ... bench:  11,663,862 ns/iter (+/- 406,948) = 222 MB/s
test raw_records_nfl  ... bench:   4,347,024 ns/iter (+/- 121,103) = 313 MB/s

after:

test raw_records_game_copy   ... bench:   6,242,855 ns/iter (+/- 341,646) = 416 MB/s
test raw_records_nfl_copy    ... bench:   3,338,591 ns/iter (+/- 167,221) = 408 MB/s

There's also a special no-copy mode specifically designed for counting things without actually looking at the data:

test raw_records_game_nocopy ... bench:   5,565,494 ns/iter (+/- 297,325) = 467 MB/s
test raw_records_nfl_nocopy  ... bench:   2,930,041 ns/iter (+/- 154,472) = 465 MB/s

@emk (Contributor Author) commented Mar 27, 2017

Oh that is very, very cool. I'll have to port scrubcsv to use the low-level parser API and a buffer, I'm thinking. :-)

@emk deleted the partition branch on Mar 27, 2017
@emk (Contributor Author) commented Mar 27, 2017

Also, if you let me know when you release a new version of xsv, I'll see if I can get you some numbers. Should be fun!

@BurntSushi (Owner) commented:

@emk Hopefully you won't have to do that. I'm going to gut the existing csv crate and build it on top of csv-core. Pretty much everyone should just use csv at that point. :-)

@emk (Contributor Author) commented Mar 27, 2017

@BurntSushi scrubcsv was written in a single morning while waiting for cloud jobs to finish. It uses ByteString and allocs in the inner loop. I want to port it to use the low-level parsing API of csv because we feed it ridiculous quantities of data.

(By the way, one of the things we need to handle is non-compliant escaping of quotes. Hence the scrub part of the name. There are some weird *.csv files out there in the wild. Right now we can sort of parse these using the current version of csv but I may switch to custom parsing. Gotta see how it works with the new csv first. :-) )

@BurntSushi (Owner) commented:

@emk Gotcha. I do think my idea of the new csv crate should at least enable you to do what you're doing today without needing to drop down to csv-core. csv-core is really annoying to use because the burden of dynamic memory allocation (even if it's amortized) is pushed onto the caller. With my idea for the new csv crate, dynamic memory allocation can be handled by Vec, but the allocation can still be amortized, which should wind up with similar perf. :-)
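
To make the amortization concrete, the kind of loop I have in mind looks roughly like this (a sketch only; it assumes a csv-crate API with a reusable record buffer, and the names are not final):

    use std::error::Error;

    // One record buffer reused across rows: after the first few rows the
    // inner loop does no per-row allocation, only buffer reuse.
    fn count_fields(path: &str) -> Result<u64, Box<dyn Error>> {
        let mut rdr = csv::Reader::from_path(path)?;
        let mut record = csv::ByteRecord::new();
        let mut total = 0u64;
        while rdr.read_byte_record(&mut record)? {
            total += record.len() as u64;
        }
        Ok(total)
    }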

@BurntSushi (Owner) commented:

(But if you need to do extra tricky things with malformed CSV, then sure, that might necessitate hand-rolling something. Failing that, the CSV parser should always return a parse, even if it's wrong, and that process should be deterministic. So you could fix the parse itself by figuring out how invalid data maps into it. But it's a little hokey...)

@BurntSushi (Owner) commented:

xsv 0.11.0 is out. :-) https://github.com/BurntSushi/xsv/releases

@jfmontanaro commented Jul 14, 2022

Just wanted to drop a note here that partition seems to be missing from the main list of subcommands (the list you get from xsv -h). This is with xsv 0.13.0 on Linux, downloaded directly from GitHub. It was a pleasant surprise to discover that it existed, though! I thought I was going to have to roll my own.
