a way to read .trees metadata without loading the whole file would be useful #1854
Comments
I have done exactly this, but to get the provenance info (I bodged parameters in here before tskit had metadata). The kastore code is fairly minimal. However, I agree that it makes sense to have a small kastore wrapper in tskit that is specifically for metadata.
I can definitely see the value of adding this to the Python API, and it would be simple enough to implement. Pulling the metadata out of the kastore and decoding it is definitely handy, and not trivial to implement. The only downside is a run-time dependency on the I'm less enthusiastic about adding it to the C API, mainly because it really only is a few lines of kastore code to do (and anyone using the tskit C API already has kastore). Since we're not decoding the metadata in C, tskit would just give you back a bunch of bytes. Then, would you want to return just the metadata, or would you want the schema also? The only argument for keeping it in tskit I would imagine is that so client code wouldn't need to know about the internal file format, but it's pretty simple and I would be very surprised if the way of accessing the top level metadata ever changed. We probably haven't made any formal contracts about the file format, but the top level details won't change now. I could be convinced otherwise though, of course, if anyone wants to suggest a C API. |
Schema also, I suppose?
Well, we can certainly do that "few lines of kastore code" ourselves in SLiM. Seems like a cleaner/better design for those lines to live inside tskit, and I personally would have no idea what those few lines of kastore code would actually look like so there's that :->, but with a helping hand I'm sure we can get it done. As for suggesting a C API, perhaps that would be best done by @petrelharp? I have no idea. :-> |
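For anyone else wondering what those "few lines of kastore code" might look like, here is a minimal Python-side sketch, not the tskit implementation. It assumes the top-level metadata and its schema are stored under the keys `metadata` and `metadata_schema` as byte arrays, and that the store passed in behaves like a mapping (for example, what `kastore.load(path)` returns). The function name `decode_top_level_metadata` is hypothetical, and only the JSON codec is decoded; anything else comes back as raw bytes.

```python
import json
from typing import Any, Mapping, Tuple


def decode_top_level_metadata(store: Mapping[str, Any]) -> Tuple[Any, str]:
    """Return (metadata, schema_text) from a kastore-like mapping.

    Assumption: the file keeps its top-level metadata under "metadata"
    and the schema under "metadata_schema", both as byte arrays.
    """
    schema_text = bytes(store["metadata_schema"]).decode("utf-8")
    raw = bytes(store["metadata"])
    if not schema_text:
        # No schema: the metadata is opaque bytes, so hand it back as-is.
        return raw, schema_text
    schema = json.loads(schema_text)
    if schema.get("codec") == "json":
        # Treat empty JSON metadata as an empty dict (a choice made here).
        return (json.loads(raw.decode("utf-8")) if raw else {}), schema_text
    # Other codecs (e.g. binary struct encodings) are left undecoded here.
    return raw, schema_text


# Hypothetical usage, assuming the kastore package is installed:
#   import kastore
#   metadata, schema = decode_top_level_metadata(kastore.load("example.trees"))
```

Taking a mapping rather than a file path keeps the decoding step separate from the file access, so the same function can be exercised with a plain dict.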
Thinking about this a bit more, I think the right approach is to add a new flag This is the simplest approach because: a) We'll nearly always want to look at things like the file format version anyway if peeking at the metadata This will also work in Python, as we can add a boolean flag to the We'd need some help to get this done if you'd like it before 1.0.0 @bhaller/@grahamgower - we're already snowed under with stuff to get wrapped up. |
I'm happy to take a crack at it, but will probably flail. :-> It looks like I have absolutely no idea how to do the Python-side work at all, though – you say "add a boolean flag to the |
Wow, you guys are organized – dev docs! OK @jeromekelleher, I will take a crack at it in the next couple days, unless @clwgg (who started this whole thing! :->) wants to have a go. :-> |
I'm happy to give it a try! |
Great! Feel free to ping questions here if you hit any issues @clwgg! |
@benjeffery thanks, the dev docs are pretty fantastic! They made it much much easier to get into the project from the outside, especially when touching the backend for the first time. I just opened a PR that fixes two broken links I found (#1880) -- otherwise I don't think they missed anything I felt I needed. |
alright, I just opened a (very-much-a-)draft PR for this feature. looking forward to your feedback (: |
Hi @clwgg, @benjeffery. Just wondering whether this is likely to make it into C API 1.0. @clwgg, do you care? (If it makes it into C API 1.0 then it will make it into SLiM 3.7; if not, probably not.) |
Hi! So the C side is basically ready barring some additions to the documentation -- on the Python side (which is less relevant here but part of the PR) I still have some tests to add but the PR is essentially "feature complete". I hope to finish up the tests and documentation this coming weekend. |
It's basically ready, we'll get it in for 0.4.0/1.0 |
Hi all. Looking into using this now. I'm surprised to see that when TSK_LOAD_SKIP_TABLES is set, it loads the reference sequence. The goal here, as I see it, is to get the metadata; I don't really see what use the reference sequence would be in this context (what's a scenario where someone wants the reference sequence without actually wanting to load the tables too?). For a 1e10 length chromosome, the reference is ~10 GB, so we really don't want to go loading it in when we don't need it; this variant is supposed to be lightweight and fast, I think. I'm going to reopen this issue (or should I open a new issue?); of course if you really think this is correct functionality then go ahead and close it again :-> But my vote is for either (a) not loading the reference sequence with TSK_LOAD_SKIP_TABLES, or (b) adding another flag that lets me specify that desire as well. Thanks! |
Hi! I agree that there will be more use-cases of |
Agreed, if loading the reference sequence is simply taken out then the doc should change, since the ref seq is "top level" not "tables". |
The ref seq behaviour is still under development, so we hadn't yet considered its interaction with this feature. Thanks for the reminder though and sorry you are having to be on the bleeding edge!
Thanks @benjeffery and @jeromekelleher. No worries, I chose to walk out onto the bleeding edge here. :-> |
This came up in SLiM (https://github.com/MesserLab/SLiM/discussions/233) but I think it has general utility, as I'll describe. What the user wants is a way to get the metadata from a .trees file without the big overhead of loading the whole .trees into memory (whether as tables or a tree sequence); just the metadata.
In SLiM they want this so they can read in parameters that they previously put into the metadata, before the point in their script where they would actually load in the .trees data.
This seems generally useful to me because someone might wish to, e.g., loop in Python over the .trees files in a directory that contains many of them, and do "something" with each .trees file that has metadata with a certain property. Like: process all the .trees files that come from msprime but not those from SLiM, or copy all those where their parameter XYZZY had a value of 15.5 into a different folder, or whatever.
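The directory-loop use case could be sketched roughly as follows. This is only an illustration: `select_trees_files` is a hypothetical helper, and the `read_metadata` callable stands in for whatever lightweight metadata reader ends up being available, so nothing here assumes a particular tskit function.

```python
from pathlib import Path
from typing import Any, Callable, Iterator


def select_trees_files(
    directory: str,
    read_metadata: Callable[[Path], Any],
    predicate: Callable[[Any], bool],
) -> Iterator[Path]:
    """Yield the .trees files in `directory` whose metadata satisfies `predicate`.

    `read_metadata` is supplied by the caller; it should return the decoded
    top-level metadata for one file without loading the tables.
    """
    for path in sorted(Path(directory).glob("*.trees")):
        if predicate(read_metadata(path)):
            yield path


# Hypothetical usage: pick the files whose parameter XYZZY was 15.5.
# The reader below assumes a Python-side boolean flag named skip_tables:
#   import tskit
#   reader = lambda p: tskit.load(str(p), skip_tables=True).metadata
#   for path in select_trees_files("runs", reader, lambda m: m.get("XYZZY") == 15.5):
#       print(path)
```

Passing the reader in as a parameter means the filtering logic does not depend on how the metadata is actually fetched.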
So for SLiM it'd be great to have C API for this; for other uses Python API seems called for. Would be nice to have it in C API 1.0 – we would use it immediately in SLiM – but it's not a big deal if it isn't.