xsv partition subcommand #52
Conversation
OK, I've implemented pretty much what we discussed in #51. This code is ready for review, though we might want to quickly revisit a design decision or two. I wasn't able to get rid of the two extra allocations in the inner loop because of how the
Hello, and happy new year! I'm getting back to work on map/reduce jobs with Pachyderm, and I'll be using `xsv`. Is there anything I can do to improve this PR?

Also, an interesting use case keeps coming up internally. People don't always want to partition on the full value of a field, but sometimes only on the first several characters. For example, they might want to partition data by the first 4 digits of the zip code, or partition words by the first letter. What people are suggesting looks something like this:

```
# Partition on first four digits of zip code.
xsv partition --prefix=4 zip byzipcode

# Partition words by first letter.
(echo "word" && cat /usr/share/dict/words) | xsv partition --prefix=1 word byletter

# Partition rows by year and month.
xsv partition --prefix=6 isoDate by_yyyymm
```

Now, this raises some interesting questions. Is this truly a common use case for partitioning? Does this belong in this tool, or should it be in a separate tool, for better composability?

Generally, partitioning using prefixes will produce reasonable results for many string-like data types, as in the examples above. Another advantage is that if you have longer fields, you can limit things to 100 or 1,000 output files instead of 10,000+ (which we already know will break). And furthermore, any other conceivable scheme for partitioning into fewer files will probably be harder to explain than "use the first N characters."

I'm divided on this feature. I think some of the particular examples (such as the
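For what it's worth, a prefix cut like this has one subtlety in Rust: keys are UTF-8, so truncation should happen on character boundaries rather than raw bytes. A minimal sketch of a hypothetical helper (`key_prefix` is my name for illustration, not anything in xsv):

```rust
// Hypothetical helper (not in xsv): truncate a partition key to its first
// `n` characters, respecting UTF-8 boundaries so multi-byte characters
// are never split.
fn key_prefix(key: &str, n: usize) -> &str {
    match key.char_indices().nth(n) {
        // `idx` is the byte offset of the (n+1)-th character, so slicing
        // up to it keeps exactly the first `n` characters.
        Some((idx, _)) => &key[..idx],
        None => key, // the key is already shorter than the prefix
    }
}

fn main() {
    assert_eq!(key_prefix("02134-1111", 4), "0213");
    assert_eq!(key_prefix("ox", 5), "ox");
    assert_eq!(key_prefix("économie", 1), "é"); // 'é' is two bytes in UTF-8
    println!("ok");
}
```

Slicing with a plain `&key[..n]` would panic on inputs like the last one, which is why the character-index dance is worth the extra line.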
@emk Hi! Sorry I haven't had a chance to review your PR. I haven't forgotten about you. I'll try to find time later today, but it might slip until the weekend. :-(
Totally understood! We all have a PR backlog. :-)
Hello again. :-) We're reaching a point where we're soon going to be using `xsv partition`. Please let me know if there are any changes or improvements I can make! And thank you for your advice on writing this code. (I'm really excited about the possibilities of Rust for high-speed data processing.)
Hello! I'm sorry to be a pest again. Is there anything I could do to improve this PR to make it easier for you to merge? Thank you very much for
Sorry for taking so long to look at this. I dropped the ball.
Overall, this looks like great work, thank you. :-) I just had a couple of nits but this should otherwise be good to go!
src/cmd/partition.rs (Outdated)

```
Usage:
    xsv partition [options] <column> <outdir> [<input>]
    xsv split --help
```
I think this should say `xsv partition --help`.
src/cmd/partition.rs (Outdated)

```rust
arg_input: Option<String>,
arg_outdir: String,
flag_filename: FilenameTemplate,
// This option is unused, both here and in the split command.
```
Ah, probably just a mishap on my part. No need to repeat the mistakes of the past. :-)
src/cmd/partition.rs (Outdated)

```rust
if select_cols.len() == 1 {
    Ok(select_cols[0])
} else {
    Err("can only partition on one column".to_owned().into())
}
```
Can you use the `fail!` macro in order to be consistent with the rest of the code? Thanks. Example: https://github.com/BurntSushi/xsv/blob/master/src/cmd/join.rs#L327
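For readers following along, the pattern being requested looks roughly like this self-contained sketch (modeled on the linked example; the exact `fail!` definition in xsv may differ, and `select_one` is an illustrative name):

```rust
// Sketch of a `fail!`-style macro: early-return an error converted
// from whatever expression is passed in.
macro_rules! fail {
    ($e:expr) => {
        return Err(::std::convert::From::from($e))
    };
}

// Illustrative function, not the actual xsv code: pick exactly one
// selected column, or fail with a message.
fn select_one(select_cols: &[usize]) -> Result<usize, String> {
    if select_cols.len() == 1 {
        Ok(select_cols[0])
    } else {
        fail!(format!(
            "can only partition on one column, got {}",
            select_cols.len()
        ));
    }
}

fn main() {
    assert_eq!(select_one(&[2]), Ok(2));
    assert!(select_one(&[1, 2]).is_err());
    println!("ok");
}
```

The macro buys a one-liner for the error path while keeping the error type conversion (`From::from`) in one place.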
src/cmd/partition.rs (Outdated)

```rust
// Decide what file to put this in.
let key = row[key_col].clone();
let mut entry = writers.entry(key.clone());
```
I think there is one more clone here than necessary. I didn't consult the compiler, but I think `let key = &row[key_col];` should work?
No, you need both as far as I can tell. We call `try!(wtr.write(row.into_iter()));` below, which consumes the underlying backing `row`, and that leads to borrow checker problems. The easiest workaround is to make our own copy of `key`. I also tried something clever with `let ref key` and `drop(key)`, but it wouldn't work without non-lexical lifetimes.
src/cmd/partition.rs (Outdated)

```rust
} else {
    // This will hang indefinitely if we somehow manage to use all
    // possible candidates, but to do this we would need to create
    // more than 2^64 open files.
```
Maybe set a reasonable upper bound and fail with an error message? (If there is something the user can do to increase their chances of success, the error message should mention what that is.)
This case is basically impossible to hit. You'd need a CSV file with 2^64+1 rows in the partition column, each of which had a unique key that mapped to the same filename after being made shell safe.
So:

- I don't think there's any principled reason why we should give up here before we run out of possible file names. If we do want to give up before hitting OS or memory limits (or whatever), we should do that somewhere else.
- We're never going to run out of file names. Among other things, you'd need to store more than 2^64 `String` values in 2^64 bytes of RAM. :-)
- But just in case, I'm going to add a `checked_add` and a panic here, so that we handle the case of 2^64+1 distinct but colliding partition key values correctly. I could replace the panic with an error.

I suppose we could give up after 20,000 collisions or so, but it's a pretty marginal code path, and it would take really contrived data to ever get more than 3 or 4 collisions.
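For illustration, the `checked_add` idea might look roughly like this standalone sketch (the real xsv code differs; `unique_name` and the `HashSet` of used names are stand-ins):

```rust
use std::collections::HashSet;

// Sketch of the collision-handling idea: append an increasing counter
// to a key until we find an unused name. `checked_add` makes the
// (purely theoretical) counter overflow panic instead of silently
// wrapping and looping over already-tried candidates.
fn unique_name(used: &HashSet<String>, key: &str) -> String {
    if !used.contains(key) {
        return key.to_string();
    }
    let mut counter: u64 = 1;
    loop {
        let candidate = format!("{}_{}", key, counter);
        if !used.contains(&candidate) {
            return candidate;
        }
        counter = counter
            .checked_add(1)
            .expect("ran out of file name candidates");
    }
}

fn main() {
    let mut used = HashSet::new();
    assert_eq!(unique_name(&used, "CA"), "CA");
    used.insert("CA".to_string());
    assert_eq!(unique_name(&used, "CA"), "CA_1");
    println!("ok");
}
```

The panic is effectively unreachable, as argued above, but it documents the invariant for free instead of leaving a wrapping-overflow hang lurking in debug-vs-release behavior.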
I think the important limit here is actually the maximum number of open files the OS will allow us. This can be quite low! As we discussed in #51, this limit can be as low as 512 on Windows, which would effectively limit us to 500-way partitions.
You suggested that I could ignore the cross-platform file descriptor limits here:

> I would be OK with taking reasonable shortcuts for some of those issues (like not doing multi-column partitions or handling the file descriptor limits) based on how much work/effort you want to put into this.

…and I'm just going along with this for the first version, because it's fairly fiddly to handle this well. We could certainly create a system for buffering data in memory and then opening and closing files as needed to stay under the limit.
We do actually have a use case for a 10,000-way partition (ZIP prefixes), but only on Linux. So I may revisit this issue.
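As a practical note for anyone hitting these limits, the per-process open-file limit can be inspected (and often raised) from the shell before running a wide partition:

```shell
# Show the current soft limit on open file descriptors for this shell.
ulimit -n

# Show the hard limit, which bounds how far the soft limit can be raised
# without privileges.
ulimit -Hn

# Raise the soft limit for the current session (uncomment to use; the
# value 4096 is just an example and must not exceed the hard limit):
# ulimit -n 4096
```

On Windows there is no `ulimit`; the C runtime's default handle limit is what produces the ~512-file ceiling mentioned above.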
OK, I've finally moved back onto ETL (Extract Transform Load) work again, and I have some cycles for `xsv`.
@emk No worries! I am also sorry for holding you up!
I've made some changes, but when I try to run the tests, I'm now getting errors on every(?) single test case:
The backtraces look like:
I'll poke at this some more tomorrow, but I'm wiped out for the moment.
OK, some more progress!
Anyway, thank you for all your help preparing this patch, and for your feedback! We're using `xsv`.
OK, good fun this morning: one of our Node.js CSV partitioners is behaving strangely (when partitioning ~260 GB of data from 1,500 input files), and I'm going to try replacing it with `xsv partition`.
(Name and short name subject to change.) Use cases:
I may also need to keep an eye on how we handle open files, because this is a 1,000-way partition, and Linux only allows 1,024 open files by default on my system.
@emk That extra feature sounds good and so do your other responses. :-) I'm happy to hear that.

Please do rebase and squash this whole PR down to a single commit. You can wait to do that until you think it's ready to merge if you like.
OK, I've made one final tweak to how the `--filename` templates work:

```
xsv partition --filename {}/cities.csv state . all-cities.csv
xsv partition --filename {}/monuments.csv state . all-monuments.csv
xsv partition --filename {}/parks.csv state . all-parks.csv
```

This also supports the case where we need to partition multiple files without accidentally overwriting things, which is a challenge I hit fairly often on our parallel data processing jobs:

```
# Possibly in parallel.
xsv partition --filename {}/$(uuidgen).csv . input1.csv
xsv partition --filename {}/$(uuidgen).csv . input2.csv
```

Does this seem like a reasonable feature to you? It's relatively unobtrusive, I hope. Note that, like all code using

Please let me know if you have any advice on how to improve this!
OK, I've just applied the final bit of polishing I wanted to apply, and squashed this branch into a single patch on master. This week, we're running this on hundreds of gigabytes of data a couple of times per day, and it hasn't complained so far. This is ready for your final review and merge, I hope. :-)
@emk Wow. I couldn't have done this better myself. This patch looks amazing and the work you put into this is first rate. I'm so happy to hear you folks are using it without any hiccups. :-) I have two very small requests:
This patch adds a new `xsv partition` subcommand, and makes a few minor modifications to `xsv split` for consistency. The new subcommand is used as follows:

```
xsv partition state out/ cities.csv
```

This will take the data in `cities.csv` and use the `state` column to create new file names, creating files like:

```
out/CA.csv
out/TX.csv
```

If the `states` column is empty, we'll put the data into:

```
out/empty.csv
```

There's an option for specifying the filename template:

```
xsv partition --filename=state-{}.csv state out/ cities.csv
```

This will create:

```
out/state-CA.csv
out/state-TX.csv
```

This `--filename` option may also be used for `xsv split`.

There's also a `--prefix-length` argument, which can be used to limit the number of files created:

```
xsv partition --prefix-length=3 zip out/ customers.csv
```

This will create files using the first three digits of the zip code:

```
out/000.csv
out/010.csv
out/011.csv
out/213.csv
```

Note that if you try to split into more than roughly 500 files on Windows or 1,000 files on Linux systems, you may need to adjust `ulimit` or the local equivalent. (This was discussed in the original proposal.)

You can also write `--filename={}/file.csv`. This supports two use cases:

```
xsv partition --filename {}/cities.csv state . all-cities.csv
xsv partition --filename {}/monuments.csv state . all-monuments.csv
xsv partition --filename {}/parks.csv state . all-parks.csv
```

Above, we want to partition our records by state into separate directories, but we have multiple kinds of data.

```
xsv partition --filename {}/$(uuidgen).csv . input1.csv
xsv partition --filename {}/$(uuidgen).csv . input2.csv
```

Above, we're running multiple (possibly parallel) copies of xsv and we want to partition the data into multiple directories without a filename clash.

Fixes BurntSushi#51.
Rebased, "Fixes #51" added, and all checks are green!
Thank you! It was a delightful learning exercise to try to "emulate" your coding style as closely as possible. I was particularly impressed by the

Sometime soon, I want to write some blog posts about using the open source Pachyderm to run `xsv` map/reduce jobs.
@emk That all sounds wonderful. You've actually motivated me to take a brief hiatus from SIMD and put some more attention into a refresh of the CSV crate. I have some ideas on how to make it even faster. :-) |
I have most of the parser rewritten now. The parser is now a proper DFA and can fit into 376 bytes on the stack. Benchmarks, before:
after:
There's also a special no-copy mode specifically designed for counting things without actually looking at the data:
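To illustrate the DFA idea for readers (this is a toy, not the actual csv crate internals): a byte-at-a-time state machine can count records without copying any field data, while still handling quoted fields and `""` escapes:

```rust
// Toy byte-at-a-time CSV DFA: counts newline-terminated records without
// allocating or copying field contents. Real-world parsers handle CRLF,
// final lines without a trailing newline, etc.; this sketch does not.
#[derive(Clone, Copy)]
enum State {
    Field,    // inside an unquoted field (or between fields)
    Quoted,   // inside a quoted field
    QuoteEnd, // just saw a '"' while inside a quoted field
}

fn count_records(data: &[u8]) -> u64 {
    let mut state = State::Field;
    let mut count = 0;
    for &b in data {
        state = match (state, b) {
            (State::Field, b'"') => State::Quoted,
            (State::Field, b'\n') => {
                count += 1;
                State::Field
            }
            (State::Field, _) => State::Field,
            (State::Quoted, b'"') => State::QuoteEnd,
            (State::Quoted, _) => State::Quoted,
            // A doubled quote ("") is an escaped quote; stay in the field.
            (State::QuoteEnd, b'"') => State::Quoted,
            (State::QuoteEnd, b'\n') => {
                count += 1;
                State::Field
            }
            (State::QuoteEnd, _) => State::Field,
        };
    }
    count
}

fn main() {
    let csv = b"name,state\n\"Albany, NY\",NY\nBoston,MA\n";
    assert_eq!(count_records(csv), 3); // header + two data rows
    println!("ok");
}
```

Because the transition table is fixed and tiny, a state machine like this fits in a handful of bytes, which gives a feel for how the rewritten parser can live in 376 bytes on the stack.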
Oh that is very, very cool. I'll have to port
Also, if you let me know when you release a new version of
@emk Hopefully you won't have to do that. I'm going to gut the existing `csv` crate
@BurntSushi (By the way, one of the things we need to handle is non-compliant escaping of quotes. Hence the
@emk Gotya. I do think my idea of the new
(But if you need to do extra tricky things with malformed CSV, then sure, that might necessitate hand-rolling something. Failing that, the CSV parser should always return a parse, even if it's wrong, and that process should be deterministic. So you could fix the parse itself by figuring out how invalid data maps into it. But it's a little hokey...)
Just wanted to drop here that
READY FOR MERGE (I hope).

I've tried to follow the standard `xsv` coding style and to re-use existing support code where possible. But I want to show my initial work and ask for any general feedback now.

One interesting wrinkle that I noticed: The `split` command has a `--output` flag, but it ignores it. I'm thinking that perhaps instead of having a `--filename TEMPLATE` argument, both `split` and `partition` should have an `--output` argument that defaults to `{}.csv` using my new `FilenameTemplate` type. Would this be a reasonable approach?

TODO

- `--filename` argument, including prefix
- `--filename` arguments, including no `{}` or two `{}`
- `partition` and `split` to use the same filename template system, possibly as `--output` instead of `--filename`?