-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
read metadata without loading tables #1882
Conversation
Please note that this is very much a draft -- I am of course happy to clean this up and add tests and everything once some of the decisions I outline above have been made. |
Codecov Report
@@ Coverage Diff @@
## main #1882 +/- ##
===========================================
- Coverage 93.32% 79.78% -13.55%
===========================================
Files 27 27
Lines 25186 24803 -383
Branches 1107 1092 -15
===========================================
- Hits 23505 19789 -3716
- Misses 1647 4953 +3306
- Partials 34 61 +27
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thanks @clwgg! I think you've done a great job and have basically got it right.
I think it would be better to build an empty index in the skip_tables
case rather than complicating the logic for load_indexes. It'll basically be a no-op, and I think the logic would be simpler.
Does this work for you?
c/tskit/tables.c
Outdated
} | ||
ret = tsk_table_collection_load_indexes(self, &store); | ||
ret = tsk_table_collection_load_indexes(self, &store, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well skip skip loading the indexes as well as the tables, since the indexes don't make any sense without the edge table. I would just put this into the same "if" block as the table loads.
python/tskit/tables.py
Outdated
@@ -3754,3 +3754,8 @@ def ibd_segments( | |||
store_pairs=store_pairs, | |||
store_segments=store_segments, | |||
) | |||
|
|||
|
|||
def load_metadata(file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's much point in providing the load_metadata
function to be honest - it's just as easy to call
metadata = tskit.load(path, skip_tables=True).metadata
It just doesn't seem worth having the extra function to document since it's already a one-liner and people might want to access other top-level info from the tree sequence too.
Great stuff, glad you could get going so quickly. FWIW I agree with @jeromekelleher on both having an empty index and that |
Thanks a lot for the quick replies! I just have two clarifying questions for how to go forward. Question 1: My understanding is that Question 2: |
Thanks, good questions. This is an advanced use case, I don't think we need to add it to It is fine for |
I've had a look at this @clwgg and I think we can make it work all right. The load function looks a bit like this: }
ret = tsk_table_collection_load_indexes(self, &store);
if (ret != 0) {
goto out;
}
} else {
ret = tsk_table_collection_build_index(self, 0);
if (ret != 0) {
goto out;
}
} Then, for testing, we can add something like this function to the static void
test_skip_tables(void)
{
int ret;
tsk_treeseq_t *ts1 = caterpillar_tree(5, 3, 3);
tsk_treeseq_t ts2;
tsk_table_collection_t tables;
ret = tsk_treeseq_dump(ts1, _tmp_file_name, 0);
CU_ASSERT_EQUAL_FATAL(ret, 0);
ret = tsk_table_collection_load(&tables, _tmp_file_name, TSK_LOAD_SKIP_TABLES);
CU_ASSERT_EQUAL_FATAL(ret, 0);
CU_ASSERT_TRUE(
tsk_table_collection_equals(&tables, ts1->tables, TSK_CMP_IGNORE_TABLES));
CU_ASSERT_EQUAL(tables.individuals.num_rows, 0);
CU_ASSERT_EQUAL(tables.nodes.num_rows, 0);
CU_ASSERT_EQUAL(tables.edges.num_rows, 0);
CU_ASSERT_EQUAL(tables.migrations.num_rows, 0);
CU_ASSERT_EQUAL(tables.sites.num_rows, 0);
CU_ASSERT_EQUAL(tables.mutations.num_rows, 0);
CU_ASSERT_EQUAL(tables.provenances.num_rows, 0);
/* We should be able to make a tree sequence */
ret = tsk_treeseq_init(&ts2, &tables, 0);
CU_ASSERT_EQUAL_FATAL(ret, 0);
tsk_treeseq_free(&ts2);
/* Do the same thing with treeseq API */
ret = tsk_treeseq_load(&ts2, _tmp_file_name, TSK_LOAD_SKIP_TABLES);
CU_ASSERT_EQUAL_FATAL(ret, 0);
CU_ASSERT_TRUE(tsk_table_collection_equals(&tables, ts2.tables, 0));
tsk_treeseq_free(&ts2);
tsk_table_collection_free(&tables);
tsk_treeseq_free(ts1);
free(ts1);
} This uses a new flag bool
tsk_table_collection_equals(const tsk_table_collection_t *self,
const tsk_table_collection_t *other, tsk_flags_t options)
{
bool ret = self->sequence_length == other->sequence_length
&& self->time_units_length == other->time_units_length
&& tsk_memcmp(self->time_units, other->time_units,
self->time_units_length * sizeof(char))
== 0;
if (!(options & TSK_CMP_IGNORE_TABLES)) {
ret = ret
&& tsk_individual_table_equals(
&self->individuals, &other->individuals, options)
&& tsk_node_table_equals(&self->nodes, &other->nodes, options)
&& tsk_edge_table_equals(&self->edges, &other->edges, options)
&& tsk_migration_table_equals(
&self->migrations, &other->migrations, options)
&& tsk_site_table_equals(&self->sites, &other->sites, options)
&& tsk_mutation_table_equals(&self->mutations, &other->mutations, options)
&& tsk_population_table_equals(
&self->populations, &other->populations, options);
/* TSK_CMP_IGNORE_TABLES impliest TSK_CMP_IGNORE_PROVENANCE */
if (!(options & TSK_CMP_IGNORE_PROVENANCE)) {
ret = ret
&& tsk_provenance_table_equals(
&self->provenances, &other->provenances, options);
}
}
/* TSK_CMP_IGNORE_TS_METADATA is implied by TSK_CMP_IGNORE_METADATA */
if (options & TSK_CMP_IGNORE_METADATA) {
options |= TSK_CMP_IGNORE_TS_METADATA;
}
if (!(options & TSK_CMP_IGNORE_TS_METADATA)) {
ret = ret
&& (self->metadata_length == other->metadata_length
&& self->metadata_schema_length == other->metadata_schema_length
&& tsk_memcmp(self->metadata, other->metadata,
self->metadata_length * sizeof(char))
== 0
&& tsk_memcmp(self->metadata_schema, other->metadata_schema,
self->metadata_schema_length * sizeof(char))
== 0);
}
return ret;
} Note the tests don't quite pass with these changes - there's one more update that needs to be made (which I'll leave you to figure, if you like!) |
Thanks for the great pointers @jeromekelleher! I had missed the |
I just implemented your suggestions and added the tests you wrote. Given that you included |
This is a random side note, but I can't figure out how to make clang-format use the indentation style that the test suite is complaining about when mixing multi-line parentheses and lines starting with logical operators. The |
Ok, I just added this for completeness' sake -- please let me know if you would like to have this kept in here, and I can start adding python tests for whatever aspects of this we'd like to expose to python. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks just right to me, thanks @clwgg! Some comments about testing above.
To test the overall functionality in Python, I think the best thing to do is add a new test class TestSkipTables
to the tests/test_file_format.py
file. We don't have to do anything exhaustive, just check the basic properties of the tables/tree sequences we're loading on a few examples.
This would be simplest if we added the ignore_tables
argument to assert_equals, but maybe this PR is getting big enough as it is and this should be done in a follow up. It's up to you.
c/tskit/tables.h
Outdated
/* Flags for table collection load */ | ||
/* This shares an interface with table collection init, so we have to | ||
* increment offset by one. | ||
* NOTE: Should these two be under the same header? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super annoying @clwgg but you have to use the exact version of clang-format |
So yes, just to be clear on your question above, I think we should support this on both the TableCollection.load and TreeSequence.load code paths. |
Ah, I realised there's a wrinkle here that I forgot about. We need to change the load flags to kastore to tell it READ_ALL only if we're not in the "skip tables" case. Otherwise, we just read the whole file anyway. This is easily done: int kas_flags = options & TSK_LOAD_SKIP_TABLES? 0: KAS_READ_ALL;
ret = kastore_openf(&store, file, "r", kas_flags); I started writing a few tests to check if this was working: def assert_tables_empty(tables):
for table in tables.name_map.values():
assert len(table) == 0
class TestSkipTables:
def test_ts_read_path_interface(self, tmp_path, ts_fixture):
# Check the fixture has metadata and a schema
assert ts_fixture.metadata_schema is not None
assert len(ts_fixture.metadata) > 0
save_path = tmp_path / "tmp.trees"
ts_fixture.dump(save_path)
ts_no_tables = tskit.load(save_path, skip_tables=True)
assert ts_no_tables.metadata == ts_fixture.metadata
assert ts_no_tables.sequence_length == ts_fixture.sequence_length
assert_tables_empty(ts_no_tables.tables)
def test_ts_read_one_stream(self, tmp_path, ts_fixture):
save_path = tmp_path / "tmp.trees"
ts_fixture.dump(save_path)
with open(save_path, "rb") as f:
ts_no_tables = tskit.load(f, skip_tables=True)
assert ts_no_tables.metadata == ts_fixture.metadata
assert ts_no_tables.sequence_length == ts_fixture.sequence_length
assert_tables_empty(ts_no_tables.tables)
def test_ts_read_multi_stream_fails(self, tmp_path, ts_fixture):
save_path = tmp_path / "tmp.trees"
ts_fixture.dump(save_path)
with open(save_path, "rb") as f:
ts_no_tables = tskit.load(f, skip_tables=True)
# The second load should fail because we've not read to the
# end of the file.
with pytest.raises(tskit.FileFormatError):
tskit.load(f) We should probably document this corner case (you can't read multiple tree sequences from a stream in |
Great catch on the |
While working on the tests it did indeed seem like a good idea to expose the |
Sorry for the mess with that last commit, but I was looking for places where the |
f18c5ce
to
a2470bb
Compare
Sorry @jeromekelleher I just re-read this after I had implemented some of those tests already. The last commit (cd6775b) now also added a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great, thanks @clwgg ! I think you just need to fill out the few remaining testing corner cases, add some documentation and we're done.
Thanks for following up with the ignore_tables
functionality here, you're going above and beyond!
python/tests/test_lowlevel.py
Outdated
@@ -164,6 +164,36 @@ class TestTableCollection(LowLevelTestCase): | |||
Tests for the low-level TableCollection class | |||
""" | |||
|
|||
def test_skip_tables(self, tmp_path): | |||
def tables_are_skipped(tc, tc_skip): | |||
return ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a little better to structure this like
assert not tc.equals(tc_skip)
assert tc.equals(tc_skip, ignore_tables=True)
... etc
The reasons is that if the tests every fail, the specific problem will be immediately apparent. This way, you'd have to put in some prints to try and figure out which one of the assertions was failing.
python/tests/test_tables.py
Outdated
@@ -4003,6 +4003,8 @@ def test_ignore_timestamps(self, t1, t2): | |||
t1.assert_equals(t2, ignore_provenance=True) | |||
t1.assert_equals(t2, ignore_timestamps=True) | |||
|
|||
# NOTE: test `ignore_tables` here too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please!
python/tskit/tables.py
Outdated
@@ -2850,6 +2856,10 @@ def assert_equals( | |||
f" differs: self={self.sequence_length} other={other.sequence_length}" | |||
) | |||
|
|||
# NOTE: Do we need to anything "extra" to account for `ignore_tables`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine - we don't keep any particular state to say that the tables were created by "ignore_tables" so failing on the number of edges being different is as good as any.
I think that's probably better all right. The low-level tests are really just to make sure we're not making any basic mistakes at the Python-C level and we're exercising all the code/error paths. Let's to the logic testing in |
Hi @jeromekelleher! I've been working on adding more tests, and I might be getting a little too much into the weeds of this... One thing I was wondering about the A - The test case you proposed def test_ts_read_multi_stream_fails(self, tmp_path, ts_fixture):
save_path = tmp_path / "tmp.trees"
ts_fixture.dump(save_path)
with open(save_path, "rb") as f:
ts_no_tables = tskit.load(f, skip_tables=True)
# The second load should fail because we've not read to the
# end of the file.
with pytest.raises(exceptions.FileFormatError):
tskit.load(f) My understanding is that this throws a B def test_if_this_works(self, tmp_path, ts_fixture):
save_path = tmp_path / "tmp.trees"
ts_fixture.dump(save_path)
with open(save_path, "rb") as f:
ts1 = tskit.load(f)
with pytest.raises(EOFError):
ts2 = tskit.load(f) So really what A is testing is whether we throw a C def test_double_file(self, tmp_path, ts_fixture):
save_path = tmp_path / "tmp.trees"
with open(save_path, "wb") as f:
ts_fixture.dump(f)
ts_fixture.dump(f)
with open(save_path, "rb") as f:
ts1 = tskit.load(f)
ts2 = tskit.load(f)
assert ts_fixture.equals(ts1)
assert ts_fixture.equals(ts2)
def test_double_file_fails(self, tmp_path, ts_fixture):
save_path = tmp_path / "tmp.trees"
with open(save_path, "wb") as f:
ts_fixture.dump(f)
ts_fixture.dump(f)
with open(save_path, "rb") as f:
ts1 = tskit.load(f, skip_tables=True)
with pytest.raises(exceptions.FileFormatError):
ts2 = tskit.load(f) Here, we use the concatenation of two tree sequence dumps in a single file to simulate reading from a non-seekable stream, which works fine without Anyway, I'm curious to hear what you think about this, and I'll try to get less distracted while adding the remaining tests... |
Yes, you've got it right @clwgg. I think your proposal of C is good. We are testing read-multiple-from-stream behaviour pretty thoroughly in test_fileobj.py but having this test nearby would help to make the behaviour on The base-case here is that the |
We're hoping to get 0.4.0 shipped in the coming week @clwgg - do you think this can be finished up by then? |
great! Yes, I think so. |
bbeda82
to
9652fc9
Compare
Alright, I added more tests and API documentation, so I think this is pretty much ready for final(ish) review! There is one new issue that CI is complaining about:
Apart from that I am of course happy to make any other changes based on your review, especially of these new additions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me @clwgg! I've made a batch of suggestions.
Once these are in, can you rebase and squash the commits please? I think it's ready to merge then.
91796a8
to
8506102
Compare
done! |
8506102
to
56779fd
Compare
oh, I just noticed that I can reproduce the failing test locally now that I rebased onto the current upstream main! |
Ah, that's good. What version of Python are you using? |
To me it does look like the issue you first reported in #1938 -- I get a |
I'm on |
it looks like exception chaining is only triggered on an |
Aha, that's exactly it @clwgg |
This needs another rerbase @clwgg, but should be good to merge then. |
56779fd
to
ca1c0e7
Compare
done! |
ca1c0e7
to
82a56e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic, thanks @clwgg! We got there 😄
I updated the commit to include some changelog information; merging.
82a56e7
to
763e8db
Compare
Description
This is a first stab at #1854. I think there are a couple of decisions to make here that I don't feel qualified to make, so I would greatly appreciate any feedback. I've tried to document the process in the commit log as best I could, but to avoid everyone having to stumble through those I thought I'd put a little write-up here as well.
The main decision point I hit early on is whether
tsk_table_collection_load_indexes
should be part of the conditional withintsk_table_collection_loadf_inited
or not. Without any further changes, either decision leads to errors, but ONLY if a table_collection loaded withTSK_LOAD_SKIP_TABLES
is ever attempted to be put into a tree sequence (which I implemented but ended up reverting for this PR since I don't know if its necessary for the use cases we envision -- more on this below).Let me walk you through each of these cases, assuming that
TSK_LOAD_SKIP_TABLES
is set:If
tsk_table_collection_load_indexes
is not called, the table collection won't have it's indexes set, so if we attempt to load this table collection into a tree sequence we encounter an error intsk_table_collection_check_integrity
.If
tsk_table_collection_load_indexes
is called and we attempt to add indexes to a table collection without having loaded its tables, we encounter an error within the..._load_indexes
function when checking whether(num_rows != self->edges.num_rows)
since we never actually loaded the edge table.What I ended up doing is to call the
..._load_indexes
function, but pass theTSK_LOAD_SKIP_TABLES
option to it to skip the(num_rows != self->edges.num_rows)
check in case the option is set. My motivation was that it feels cleaner to have an empty but indexed table collection, than an empty, un-indexed table collection that throws an error whenever it tries to touch a tree sequence. Please note though that I'm not totally sure if there's any unwanted consequences of skipping that check (it seems to just check if the edge table was loaded correctly or was malformed in some way), so I rely on your advice here.I then exposed this flag to the python
TableCollection
class through its C interface. I ended up reverting a change that allowed to also pass this option all the way through a TreeSequence since I don't really see the use case of that, but of course I'd be happy to re-revert this if needed. What I did instead is add a top-leveltskit.load_metadata
function which just callsTableCollection.load(file, skip_tables=True)
and returns the metadata. I added this mostly as a proof-of-principle of what I am envisioning the use-case of this flag to be.PR Checklist: