how does resuming a batch output job work? #172

ctb · 2025-01-12T13:26:45Z

Something died with an OOM, and then resuming didn't work; it just restarted from scratch. Any tips or tricks?

bluegenes · 2025-01-13T17:35:16Z

Were you using --batch-size? Can you give me the output from the resumed run?

In short, if you are using batches you get {filename}.n.zip zipfiles, where n is the batch number and {filename}.zip is the specified output. If we find any {filename}.n.zip files on a subsequent run, we read all that we can, ignoring incomplete batches, and continue forward with batch n+1.
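The resume behavior described above can be sketched roughly as follows. This is a Python illustration only, not the plugin's code: `scan_batches` is a hypothetical helper, and it assumes that the batch base name is the output name minus its final `.zip` and that the next batch index is one more than the highest readable batch.

```python
import re
import zipfile
from pathlib import Path


def scan_batches(output_zip):
    """Return (readable_batch_paths, next_batch_index) for a given output path."""
    out = Path(output_zip)
    # Assumption: the batch base name is the output name minus its final ".zip".
    base = out.name[:-4] if out.name.endswith(".zip") else out.name
    pattern = re.compile(rf"^{re.escape(base)}\.(\d+)\.zip$")

    readable = {}
    if out.parent.is_dir():
        for entry in out.parent.iterdir():
            match = pattern.match(entry.name)
            if not match:
                continue
            try:
                # An incomplete zip lacks a central directory and fails to open.
                with zipfile.ZipFile(entry) as zf:
                    zf.namelist()
            except zipfile.BadZipFile:
                continue
            readable[int(match.group(1))] = entry

    # Assumption: continue after the highest readable batch.
    next_index = max(readable, default=0) + 1
    return [readable[i] for i in sorted(readable)], next_index
```

With `x.1.zip` and `x.2.zip` intact and a truncated `x.3.zip` on disk, `scan_batches("x.zip")` would return the first two paths and a next index of 3.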

If you are not using batches, we do not resume, because, as far as I know, Rust zip utilities can't append to zips, and incomplete zipfiles are not readable. With the current strategy, if we were to read {filename}.zip, we would count those sketches as 'done', but then we would overwrite {filename}.zip with the new sketches (meaning we lose the old ones). An alternative is that we could read that file and copy all the old sketches into memory before writing them all again into the same output.
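The alternative mentioned above (read the old sketches, then write everything out again) could look something like this, using Python's standard `zipfile` module. This is only an illustration, not the plugin's Rust code. `rewrite_with_new_entries` is a hypothetical helper, and writing to a temporary file before renaming is a choice made here rather than something from the discussion.

```python
import os
import zipfile


def rewrite_with_new_entries(zip_path, new_entries):
    """Copy every entry of zip_path, plus new_entries, into a fresh zip.

    new_entries maps archive member names to bytes. Raises
    zipfile.BadZipFile if the existing zip is incomplete.
    """
    tmp_path = zip_path + ".tmp"
    with zipfile.ZipFile(tmp_path, "w") as dst:
        if os.path.exists(zip_path):
            with zipfile.ZipFile(zip_path) as src:
                for name in src.namelist():
                    if name not in new_entries:
                        # Copy one old entry at a time into the new archive.
                        dst.writestr(name, src.read(name))
        for name, data in new_entries.items():
            dst.writestr(name, data)
    # Swap the new zip in only once it has been written completely.
    os.replace(tmp_path, zip_path)
```

Because the old file is replaced only after the new one is closed, a crash mid-write leaves the previous zip untouched. An incomplete zip is still unreadable with this approach, so it does not help after an OOM during the write itself.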

Happy to modify if I'm missing something about rust zip writing or you have other strategy suggestions.

ctb · 2025-01-13T17:49:26Z

I was using batch, but it didn't pick it up. Maybe I got something wrong. I'll give it a try again!

For the bigger databases, I'm also thinking of doing a manual split of the input CSV to get to a small chunk size and then using snakemake on that. Animal genomes are all really big!
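A manual split like that could be done with a few lines of Python. This is a minimal sketch: `split_csv` and the `_0001`-style chunk names are made up for illustration, and the only assumption about the input is that its first row is a header to repeat in every chunk.

```python
import csv
from pathlib import Path


def split_csv(input_csv, out_dir, rows_per_chunk):
    """Split input_csv into files of at most rows_per_chunk data rows each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(input_csv).stem

    with open(input_csv, newline="") as src:
        rows = list(csv.reader(src))
    if not rows:
        return []

    header, data = rows[0], rows[1:]
    chunk_paths = []
    for start in range(0, len(data), rows_per_chunk):
        path = out_dir / f"{stem}_{len(chunk_paths) + 1:04d}.csv"
        with open(path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)  # every chunk gets the header row
            writer.writerows(data[start:start + rows_per_chunk])
        chunk_paths.append(path)
    return chunk_paths
```

Each chunk can then be handed to its own workflow job, so a failure only costs one chunk's worth of work.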

bluegenes · 2025-01-13T18:18:26Z

I also think using the NCBI REST API links instead might help, especially since we could up the # of simultaneous downloads if providing an API key. I'll make an issue for that.

It is much faster with simultaneous downloads, especially since genome sizes vary and the biggest ones take a lot of time.

ctb · 2025-01-19T16:04:49Z

Here's an example (in ~ctbrown/scratch3/2025-ncbi-rest-api):

% ls sketches/bilateria-minus-vertebrates.*.zip | head
sketches/bilateria-minus-vertebrates.1.zip
sketches/bilateria-minus-vertebrates.10.zip
sketches/bilateria-minus-vertebrates.100.zip
sketches/bilateria-minus-vertebrates.101.zip
sketches/bilateria-minus-vertebrates.11.zip
sketches/bilateria-minus-vertebrates.12.zip
sketches/bilateria-minus-vertebrates.13.zip
% sourmash scripts gbsketch outputs/bilateria-minus-vertebrates-links.csv -n 9 -r 10 -p k=21,k=31,k=51,dna --failed sketches/bilateria-minus-vertebrates.gbsketch-fail.txt --checksum-fail sketches/bilateria-minus-vertebrates.gbsketch-check-fail.txt -o sketches/bilateria-minus-vertebrates.sig.zip -c 16 --batch 50
...

output:

Downloading and sketching all accessions in 'outputs/bilateria-minus-vertebrates-links.csv using 9 simultaneous downloads, 10 retries, and 16 threads.
No valid existing signature batches found; building all signatures.
Loaded 5044 rows
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.
Error: Failed to create file: "sketches/bilateria-minus-vertebrates.1.zip"

(Luckily I'd write-protected the zip files.)

But the concerning thing is:

No valid existing signature batches found; building all signatures.

It's not detecting the batches. Very odd.

Maybe the regexp is not matching? I don't grok ^{}(?:\.(\d+))?\.zip$ ;)
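For reference, the pattern reads as: `^` and `$` anchor the whole file name, `{}` is a placeholder filled with the base name, `(?:\.(\d+))?` is an optional dot followed by a captured batch number, and `\.zip` is the literal extension. Below is a small Python sketch for trying it out. What the plugin actually substitutes for `{}` is not shown in this thread, so the base name used here (the `-o` value minus `.zip`) is an assumption, as is the `batch_regex` helper.

```python
import re

TEMPLATE = r"^{}(?:\.(\d+))?\.zip$"


def batch_regex(base):
    """Fill the template's {} placeholder with a literal (escaped) base name."""
    return re.compile(TEMPLATE.format(re.escape(base)))


# Assumption: the base is the -o file name with only ".zip" removed.
pattern = batch_regex("bilateria-minus-vertebrates.sig")

for name in [
    "bilateria-minus-vertebrates.sig.zip",
    "bilateria-minus-vertebrates.sig.1.zip",
    "bilateria-minus-vertebrates.1.zip",
]:
    match = pattern.match(name)
    print(name, "->", f"batch {match.group(1)}" if match else "no match")
```

Under that assumption the first two names match (with batch `None` and `1`), while `bilateria-minus-vertebrates.1.zip`, the form seen in the listing above, does not.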

ctb · 2025-01-19T21:20:27Z

Again, this time with vertebrates, so a simpler filename. Same parameters as above. Maybe it's the directory?

=> sourmash_plugin_directsketch 0.4.1
params: ['k=21,k=31,k=51,dna']
Downloading and sketching all accessions in 'outputs/vertebrates-links.csv using 9 simultaneous downloads, 10 retries, and 16 threads.
No valid existing signature batches found; building all signatures.
WARNING: extra column 'taxid' in CSV file. Ignoring.
Loaded 4507 rows
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.

ctb · 2025-01-19T21:22:16Z

No, ran it in the subdirectory, still found no matches:

% sourmash scripts gbsketch ../outputs/vertebrates-links.csv -n 9 -r 10 -p k=21,k=31,k=51,dna --failed vertebrates.gbsketch-fail.txt --checksum-fail vertebrates.gbsketch-check-fail.txt -o vertebrates.sig.zip -c 16 --batch 50
== This is sourmash version 4.8.13. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_directsketch 0.4.1
params: ['k=21,k=31,k=51,dna']
Downloading and sketching all accessions in '../outputs/vertebrates-links.csv using 9 simultaneous downloads, 10 retries, and 16 threads.
No valid existing signature batches found; building all signatures.
WARNING: extra column 'taxid' in CSV file. Ignoring.
Loaded 4507 rows
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.

bluegenes · 2025-01-20T02:04:52Z

Maybe the dash?

ctb · 2025-01-21T12:35:15Z

hmm, I wonder if this is related to #191, oddly enough!

In the output, I just noticed that it said:

Sigs in 'sketches/eukaryotes-missing.sig.1.zip', etc

when I asked for output to go to sketches/eukaryotes-missing.sig.zip.

BUT, all of the batches were output to sketches/eukaryotes-missing.N.zip - not .sig.N.zip.

...ok, well, I looked at the code, and that's in Python, not in Rust...

But, OK, I think the regexp might not like the period in my output filenames.

bluegenes · 2025-01-29T23:20:39Z

OK, it sounds like the files were created; they just weren't named properly. We need to modify the batch naming so it doesn't drop the .sig (or, more generally, any other additional . in the output filename).
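To illustrate the kind of naming fix meant here, the sketch below contrasts a batch name built by replacing the extension with one built by stripping only the trailing `.zip`. Both functions are hypothetical Python stand-ins, and the lossy one is just one way a `.sig` can get dropped, not necessarily what the plugin does.

```python
from pathlib import PurePosixPath


def lossy_batch_name(output_zip, n):
    """Replace the extension twice; this loses any extra dotted part."""
    without_zip = PurePosixPath(output_zip).with_suffix("")  # strips ".zip"
    return str(without_zip.with_suffix(f".{n}.zip"))         # replaces ".sig" as well


def batch_name(output_zip, n):
    """Strip only a trailing '.zip' and keep every other dot in the name."""
    suffix = ".zip"
    base = output_zip[: -len(suffix)] if output_zip.endswith(suffix) else output_zip
    return f"{base}.{n}{suffix}"


out = "sketches/eukaryotes-missing.sig.zip"
print(lossy_batch_name(out, 1))  # sketches/eukaryotes-missing.1.zip
print(batch_name(out, 1))        # sketches/eukaryotes-missing.sig.1.zip
```

Whichever form is chosen, the code that writes batches and the code that looks for existing batches need to build the name the same way, otherwise a resumed run will not find them.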
