how does resuming a batch output job work? #172

Open
ctb opened this issue Jan 12, 2025 · 9 comments
Comments

@ctb
Contributor

ctb commented Jan 12, 2025

Something died with an OOM error, and then resuming didn't work: it just restarted from scratch. Any tips or tricks?

@bluegenes
Collaborator

Were you using --batch-size? Can you give me the output from the resumed run?

In short, if you are using batches you get {filename}.n.zip zipfiles, where n is the batch and {filename}.zip is the specified output. If we find any {filename}.n.zip files on a subsequent run, we read all that we can, ignoring incomplete batches, and continue forward with batch n+1.
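The batch-resume behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not the plugin's Rust code: the names `is_readable_zip` and `next_batch_index`, the rule for deriving the base name, and the choice to continue after the highest readable batch are all assumptions made for the example.

```python
import re
import zipfile
from pathlib import Path


def is_readable_zip(path: Path) -> bool:
    """Return True if the file opens as a complete zip archive."""
    try:
        with zipfile.ZipFile(path) as zf:
            zf.namelist()
        return True
    except (zipfile.BadZipFile, OSError):
        # Truncated or otherwise unreadable archives count as incomplete.
        return False


def next_batch_index(output: str) -> int:
    """Return the batch number to continue with for the requested output path."""
    out = Path(output)
    # Assumption: the base name is the output file name minus its final ".zip".
    base = out.name[: -len(".zip")] if out.name.endswith(".zip") else out.name
    pattern = re.compile(r"^{}\.(\d+)\.zip$".format(re.escape(base)))

    readable = []
    if out.parent.is_dir():
        for candidate in out.parent.iterdir():
            m = pattern.match(candidate.name)
            if m and is_readable_zip(candidate):
                readable.append(int(m.group(1)))

    # With no readable batches this starts at 1; otherwise it continues at n+1.
    return max(readable, default=0) + 1
```

In this sketch an incomplete batch file is simply skipped, so its number is reused if it was the last one written.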

If you are not using batches, we do not resume, b/c afaik, Rust zip utils can't append to zips and incomplete zipfiles are not readable. With the current strategy, if we were to read {filename}.zip, we would count those sketches as 'done', but then we would overwrite {filename}.zip with the new sketches (meaning we lose the old ones). An alternative is to read that file and copy all the old sketches into memory before writing them all again into the same output.
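The "copy the old sketches into memory, then write everything again" alternative could look something like the sketch below. It uses Python's `zipfile` purely for illustration; the function names `read_existing` and `rewrite_with_existing` are hypothetical and nothing here reflects how the plugin is actually implemented.

```python
import os
import zipfile


def read_existing(path: str) -> dict[str, bytes]:
    """Load every entry of a zip into memory; a missing or unreadable zip yields {}."""
    try:
        with zipfile.ZipFile(path) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
    except (FileNotFoundError, zipfile.BadZipFile):
        return {}


def rewrite_with_existing(path: str, new_entries: dict[str, bytes]) -> None:
    """Write old and new entries together, replacing the original file at the end."""
    entries = read_existing(path)
    entries.update(new_entries)

    # Write to a temporary file first so a crash mid-write does not
    # destroy the previous, still-readable archive.
    tmp = path + ".tmp"
    with zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    os.replace(tmp, path)
```

The trade-off is memory: every previously written sketch is held in RAM during the rewrite, which may matter for large outputs.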

Happy to modify if I'm missing something about rust zip writing or you have other strategy suggestions.

@ctb
Contributor Author

ctb commented Jan 13, 2025

I was using batch, but it didn't pick it up. Maybe I got something wrong. I'll give it a try again!

For the bigger databases, I'm also thinking of doing a manual split of the input CSV to get to a small chunk size and then using snakemake on that. Animal genomes are all really big!
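A manual split of the input CSV might be done along these lines. This is only a sketch: the function name `split_csv` and the `.chunkN.csv` naming are made up for the example, and the only assumption about the input is that its first row is a header that each chunk should repeat.

```python
import csv
from pathlib import Path


def split_csv(src: str, outdir: str, rows_per_chunk: int) -> list[Path]:
    """Split src into numbered CSV chunks, repeating the header row in each."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(src).stem
    written: list[Path] = []

    def flush(rows):
        # Chunk files are numbered from 1 in the order they are written.
        path = out / f"{stem}.chunk{len(written) + 1}.csv"
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        written.append(path)

    with open(src, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return written
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                flush(chunk)
                chunk = []
        if chunk:  # final, possibly shorter chunk
            flush(chunk)
    return written
```

Each chunk can then be sketched independently, so a failure only costs one chunk's worth of work.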

@bluegenes
Collaborator

bluegenes commented Jan 13, 2025

I also think using the NCBI REST API links instead might help, especially since we could increase the number of simultaneous downloads when an API key is provided. I'll make an issue for that.

It is much faster with simultaneous downloads, especially since genome sizes vary and the biggest ones take a lot of time.

@ctb
Contributor Author

ctb commented Jan 19, 2025

Here's an example (in ~ctbrown/scratch3/2025-ncbi-rest-api):

% ls sketches/bilateria-minus-vertebrates.*.zip | head
sketches/bilateria-minus-vertebrates.1.zip
sketches/bilateria-minus-vertebrates.10.zip
sketches/bilateria-minus-vertebrates.100.zip
sketches/bilateria-minus-vertebrates.101.zip
sketches/bilateria-minus-vertebrates.11.zip
sketches/bilateria-minus-vertebrates.12.zip
sketches/bilateria-minus-vertebrates.13.zip
% sourmash scripts gbsketch outputs/bilateria-minus-vertebrates-links.csv -n 9 -r 10 -p k=21,k=31,k=51,dna --failed sketches/bilateria-minus-vertebrates.gbsketch-fail.txt --checksum-fail sketches/bilateria-minus-vertebrates.gbsketch-check-fail.txt -o sketches/bilateria-minus-vertebrates.sig.zip -c 16 --batch 50
...

output:

Downloading and sketching all accessions in 'outputs/bilateria-minus-vertebrates-links.csv using 9 simultaneous downloads, 10 retries, and 16 threads.
No valid existing signature batches found; building all signatures.
Loaded 5044 rows
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.
Error: Failed to create file: "sketches/bilateria-minus-vertebrates.1.zip"

(Luckily I'd write-protected the zip files.)

But the concerning thing is:

No valid existing signature batches found; building all signatures.

It's not detecting the batches. Very odd.

Maybe the regexp is not matching? I don't grok ^{}(?:\.(\d+))?\.zip$ ;)
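For reference, the quoted pattern can be unpacked with a small Python sketch. It assumes that `{}` is a placeholder filled in with the escaped base name of the output file; that assumption, and the helper name `batch_regex`, are not taken from the plugin's code.

```python
import re


def batch_regex(base: str) -> "re.Pattern[str]":
    """Build the quoted pattern with `base` substituted for the {} placeholder."""
    # ^ ... $        anchor the match to the whole file name
    # {}             the base name (escaped so its own dots are literal)
    # (?:\.(\d+))?   optional ".<digits>"; the digits are capture group 1
    # \.zip          a literal ".zip" extension
    return re.compile(r"^{}(?:\.(\d+))?\.zip$".format(re.escape(base)))


pattern = batch_regex("vertebrates")

# The plain output file matches, with no batch number captured.
assert pattern.match("vertebrates.zip").group(1) is None

# A batch file matches, and group 1 holds the batch number.
assert pattern.match("vertebrates.12.zip").group(1) == "12"

# A name with an extra dotted component does not match this base at all.
assert pattern.match("vertebrates.sig.12.zip") is None

# Conversely, a base that keeps ".sig" does not match files written without it.
assert batch_regex("vertebrates.sig").match("vertebrates.12.zip") is None
```

So under this reading, detection only succeeds when the base name used to build the pattern is exactly the base name used when the batch files were written.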

@ctb
Contributor Author

ctb commented Jan 19, 2025

Again, this time with vertebrates, so a simpler filename. Same parameters as above. Maybe it's the directory?

=> sourmash_plugin_directsketch 0.4.1
params: ['k=21,k=31,k=51,dna']
Downloading and sketching all accessions in 'outputs/vertebrates-links.csv using 9 simultaneous downloads, 10 retries, and 16 threads.
No valid existing signature batches found; building all signatures.
WARNING: extra column 'taxid' in CSV file. Ignoring.
Loaded 4507 rows
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.

@ctb
Contributor Author

ctb commented Jan 19, 2025

No, ran it in the subdirectory, still found no matches:

% sourmash scripts gbsketch ../outputs/vertebrates-links.csv -n 9 -r 10 -p k=21,k=31,k=51,dna --failed vertebrates.gbsketch-fail.txt --checksum-fail vertebrates.gbsketch-check-fail.txt -o vertebrates.sig.zip -c 16 --batch 50
== This is sourmash version 4.8.13. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_directsketch 0.4.1
params: ['k=21,k=31,k=51,dna']
Downloading and sketching all accessions in '../outputs/vertebrates-links.csv using 9 simultaneous downloads, 10 retries, and 16 threads.
No valid existing signature batches found; building all signatures.
WARNING: extra column 'taxid' in CSV file. Ignoring.
Loaded 4507 rows
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.

@bluegenes
Collaborator

Maybe the dash?

@ctb
Contributor Author

ctb commented Jan 21, 2025

hmm, I wonder if this is related to #191, oddly enough!

In the output, I just noticed that it said:

Sigs in 'sketches/eukaryotes-missing.sig.1.zip', etc

when I asked for output to go to sketches/eukaryotes-missing.sig.zip.

BUT, all of the batches were output to sketches/eukaryotes-missing.N.zip - not .sig.N.zip.

...ok, well, I looked at the code, and that's in Python, not in Rust...

But, OK, I think the regexp might not like the period in my output filenames.

@bluegenes
Collaborator

bluegenes commented Jan 29, 2025

Ok, it sounds like the files were created, they just weren't named properly. We need to modify the batch naming to avoid dropping the .sig (and, more generally, any other additional . in the output filename).
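One way to keep extra dots when deriving batch names is to strip only the final ".zip" and insert the batch number before it. The Python sketch below illustrates that idea next to a lossy variant that cuts at the first dot. Both functions are hypothetical: `lossy_batch_path` is just one guess at how a ".sig" could get dropped, not a description of the actual bug.

```python
from pathlib import PurePosixPath


def lossy_batch_path(output: str, n: int) -> str:
    """Hypothetical lossy naming: keep only the text before the FIRST dot."""
    p = PurePosixPath(output)
    first = p.name.split(".", 1)[0]
    return str(p.with_name(f"{first}.{n}.zip"))


def batch_path(output: str, n: int) -> str:
    """Strip only the trailing ".zip", so components like ".sig" survive."""
    p = PurePosixPath(output)
    if not p.name.endswith(".zip"):
        raise ValueError(f"expected a .zip output path, got {output!r}")
    base = p.name[: -len(".zip")]
    return str(p.with_name(f"{base}.{n}.zip"))


out = "sketches/eukaryotes-missing.sig.zip"

# The lossy variant reproduces the naming seen in this thread.
assert lossy_batch_path(out, 1) == "sketches/eukaryotes-missing.1.zip"

# Stripping only the final extension keeps the ".sig" component.
assert batch_path(out, 1) == "sketches/eukaryotes-missing.sig.1.zip"
```

Whichever rule is used, the writer and the resume check need to derive the base name the same way, or existing batches will not be found.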
