Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a way to read .trees metadata without loading the whole file would be useful #1854

Closed
bhaller opened this issue Oct 25, 2021 · 21 comments
Closed
Labels
enhancement New feature or request Python API Issue is about the Python API

Comments

@bhaller
Copy link

bhaller commented Oct 25, 2021

This came up in SLiM (https://github.com/MesserLab/SLiM/discussions/233) but I think it has general utility, as I'll describe. What the user wants is a way to get the metadata from a .trees file without the big overhead of loading the whole .trees into memory (whether as tables or a tree sequence); just the metadata.

In SLiM they want this so they can read in parameters that they previously put into the metadata, before the point in their script where they would actually load in the .trees data.

This seems generally useful to me because someone might wish to, e.g., loop in Python over the .trees files in a directory that contains many of them, and do "something" with each .trees file that has metadata with a certain property. Like: process all the .trees files that come from msprime but not those from SLiM, or copy all those where their parameter XYZZY had a value of 15.5 into a different folder, or whatever.

So for SLiM it'd be great to have C API for this; for other uses Python API seems called for. Would be nice to have it in C API 1.0 – we would use it immediately in SLiM – but it's not a big deal if it isn't.

@grahamgower
Copy link
Member

This seems generally useful to me because someone might wish to, e.g., loop in Python over the .trees files in a directory that contains many of them, and do "something" with each .trees file that has metadata with a certain property.

I have done exactly this, but to get the provenance info (I bodged parameters in here before tskit had metadata). The kastore code is fairly minimal. However I agree that it makes sense to have a small kastore wrapper in tskit that that is specifically for metadata.

@jeromekelleher jeromekelleher added enhancement New feature or request Python API Issue is about the Python API labels Oct 26, 2021
@jeromekelleher
Copy link
Member

jeromekelleher commented Oct 26, 2021

I can definitely see the value of adding this to the Python API, and it would be simple enough to implement. Pulling the metadata out of the kastore and decoding it is definitely handy, and not trivial to implement. The only downside is a run-time dependency on the kastore Python package, but that's no big deal. I guess the simplest thing about be to call the function load_metadata or something? (Or maybe peek_metadata?)

I'm less enthusiastic about adding it to the C API, mainly because it really only is a few lines of kastore code to do (and anyone using the tskit C API already has kastore). Since we're not decoding the metadata in C, tskit would just give you back a bunch of bytes. Then, would you want to return just the metadata, or would you want the schema also? The only argument for keeping it in tskit I would imagine is that so client code wouldn't need to know about the internal file format, but it's pretty simple and I would be very surprised if the way of accessing the top level metadata ever changed. We probably haven't made any formal contracts about the file format, but the top level details won't change now.

I could be convinced otherwise though, of course, if anyone wants to suggest a C API.

@bhaller
Copy link
Author

bhaller commented Oct 26, 2021

I'm less enthusiastic about adding it to the C API, mainly because it really only is a few lines of kastore code to do (and anyone using the tskit C API already has kastore). Since we're not decoding the metadata in C, tskit would just give you back a bunch of bytes. Then, would you want to return just the metadata, or would you want the schema also?

Schema also, I suppose?

The only argument for keeping it in tskit I would imagine is that so client code wouldn't need to know about the internal file format, but it's pretty simple and I would be very surprised if the way of accessing the top level metadata ever changed. We probably haven't made any formal contracts about the file format, but the top level details won't change now.

I could be convinced otherwise though, of course, if anyone wants to suggest a C API.

Well, we can certainly do that "few lines of kastore code" ourselves in SLiM. Seems like a cleaner/better design for those lines to live inside tskit, and I personally would have no idea what those few lines of kastore code would actually look like so there's that :->, but with a helping hand I'm sure we can get it done. As for suggesting a C API, perhaps that would be best done by @petrelharp? I have no idea. :->

@jeromekelleher
Copy link
Member

jeromekelleher commented Oct 26, 2021

Thinking about this a bit more, I think the right approach is to add a new flag TSK_LOAD_SKIP_TABLES (or something) which is provided to table_collection_load. With this flag, only the top-level data for a tree sequence will be loaded in and set on the table collection instance. The tables will be ignored, and therefore have zero rows in the returned table collection/tree sequence.

This is the simplest approach because:

a) We'll nearly always want to look at things like the file format version anyway if peeking at the metadata
b) We don't need to come up with any new APIs or ways of packaging up just the metadata/schema. It'll just be there on an empty table collection/tree sequence.
c) It'll be easy to implement - just put a conditional around the x_table_load functions here

This will also work in Python, as we can add a boolean flag to the load function and it'll take care of itself.

We'd need some help to get this done if you'd like it before 1.0.0 @bhaller/@grahamgower - we're already snowed under with stuff to get wrapped up.

@bhaller
Copy link
Author

bhaller commented Oct 26, 2021

I'm happy to take a crack at it, but will probably flail. :-> It looks like tsk_table_collection_load() has an options already, so I can define a new flag for that, and I guess add a bool flag to tsk_table_collection_loadf_inited() that uses that passed in; that's not public API I guess, so I can add a bbol to its interface I suppose? Then obvs easy to if out the x_table_load functions in tsk_table_collection_loadf_inited() based on that flag.

I have absolutely no idea how to do the Python-side work at all, though – you say "add a boolean flag to the load function" but mysteries abound. :-> Wouldn't know how to test the python side (or the C side, for that matter) properly even if I did get it working. Also not sure what the protocol is for all of this – fork tskit, make a local branch in my fork, commit something to that local branch, push the commit to my GitHub fork, turn it into a PR in GitHub? And then do I need to do something to make it a PR for the "real" tskit, not just for my fork of tskit? You probably don't realize how clueless I am in Git, I'm very accustomed to working in my own fiefdom and I never do anything complicated. :-> Best might be if I could watch someone else like @grahamgower or @petrelharp fix this over zoom; I'd certainly be interested in learning how to do this kind of minor fix!

@jeromekelleher
Copy link
Member

Thanks @bhaller! We could leave out the Python side of things on the initial pass, it's an easy one to fill in later.

Re the workflow, the dev docs should answer your questions (if they don't, it's a bug). We've just done a format switch, so if you see any issues, please do let us know.

@bhaller
Copy link
Author

bhaller commented Oct 26, 2021

Wow, you guys are organized – dev docs! OK @jeromekelleher, I will take a crack at it in the next couple days, unless @clwgg (who started this whole thing! :->) wants to have a go. :->

@clwgg
Copy link
Contributor

clwgg commented Oct 26, 2021

I'm happy to give it a try!

@jeromekelleher
Copy link
Member

Great! Feel free to ping questions here if you hit any issues @clwgg!

@benjeffery
Copy link
Member

@clwgg @bhaller this is a great opportunity to debug the dev docs, so please don't hold back on corrections or suggestions.

@clwgg
Copy link
Contributor

clwgg commented Nov 2, 2021

@benjeffery thanks, the dev docs are pretty fantastic! They made it much much easier to get into the project from the outside, especially when touching the backend for the first time. I just opened a PR that fixes two broken links I found (#1880) -- otherwise I don't think they missed anything I felt I needed.

@clwgg
Copy link
Contributor

clwgg commented Nov 2, 2021

alright, I just opened a (very-much-a-)draft PR for this feature. looking forward to your feedback (:

@bhaller
Copy link
Author

bhaller commented Nov 17, 2021

Hi @clwgg, @benjeffery. Just wondering whether this is likely to make it into C API 1.0. @clwgg, do you care? (If it makes it into C API 1.0 then it will make it into SLiM 3.7; if not, probably not.)

@clwgg
Copy link
Contributor

clwgg commented Nov 17, 2021

Hi! So the C side is basically ready barring some additions to the documentation -- on the Python side (which is less relevant here but part of the PR) I still have some tests to add but the PR is essentially "feature complete". I hope to finish up the tests and documentation this coming weekend.
That's just my two cents -- I can't speak to the merge window of the API version bump (:

@jeromekelleher
Copy link
Member

It's basically ready, we'll get it in for 0.4.0/1.0

jeromekelleher pushed a commit to clwgg/tskit that referenced this issue Nov 26, 2021
@mergify mergify bot closed this as completed in 763e8db Nov 26, 2021
@bhaller
Copy link
Author

bhaller commented Nov 30, 2021

Hi all. Looking into using this now. I'm surprised to see that when TSK_LOAD_SKIP_TABLES is set, it loads the reference sequence. The goal here, as I see it, is to get the metadata; I don't really see what use the reference sequence would be in this context (what's a scenario where someone wants the reference sequence without actually wanting to load the tables too?). For a 1e10 length chromosome, the reference is ~10 GB, so we really don't want to go loading it in when we don't need it; this variant is supposed to be lightweight and fast, I think. I'm going to reopen this issue (or should I open a new issue?); of course if you really think this is correct functionality then go ahead and close it again :-> But my vote is for either (a) not loading the reference sequence with TSK_LOAD_SKIP_TABLES, or (b) adding another flag that lets me specify that desire as well. Thanks!

@bhaller bhaller reopened this Nov 30, 2021
@clwgg
Copy link
Contributor

clwgg commented Nov 30, 2021

Hi! I agree that there will be more use-cases of TSK_LOAD_SKIP_TABLES where we wouldn't want to load the reference than those where we do (if there are any). Still, I suppose the reference is more "top-level" than "table" information, so in the way the flag is currently documented it also doesn't feel totally right to just skip the reference as well?
So maybe a TSK_LOAD_SKIP_REF flag (that could even be implied by TSK_LOAD_SKIP_TABLES and documented as such) would be more clear?
FWIW, the current reference code got merged right before the skip_tables code, so this might have also just been a rebase mishap on my part in that I didn't catch the newly added tsk_table_collection_load_reference_sequence in the TSK_LOAD_SKIP_TABLES conditional -- I'm of course happy to fix that if that's indeed the desired solution.

@bhaller
Copy link
Author

bhaller commented Nov 30, 2021

Agreed, if loading the reference sequence is simply taken out then the doc should change, since the ref seq is "top level" not "tables".

@benjeffery
Copy link
Member

when TSK_LOAD_SKIP_TABLES is set, it loads the reference sequence

The ref seq behaviour is still under development, so we hadn't yet considered it's interaction with this feature. Thanks for the reminder though and sorry you are having to be on the bleeding edge!

@jeromekelleher
Copy link
Member

You've preempted a few issues that have cropped up while I've been working through the details in #1944 and didn't have time to lay out properly @bhaller. I'm going to close this issue and open some more so we can discuss in isolation.

@bhaller
Copy link
Author

bhaller commented Nov 30, 2021

Thanks @benjeffery and @jeromekelleher. No worries, I chose to walk out onto the bleeding edge here. :->

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Python API Issue is about the Python API
Projects
None yet
Development

No branches or pull requests

5 participants