Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read metadata without loading tables #1882

Merged
merged 1 commit into from
Nov 26, 2021

Conversation

clwgg
Copy link
Contributor

@clwgg clwgg commented Nov 2, 2021

Description

This is a first stab at #1854. I think there are a couple of decisions to make here that I don't feel qualified to make, so I would greatly appreciate any feedback. I've tried to document the process in the commit log as best I could, but to avoid everyone having to stumble through those I thought I'd put a little write-up here as well.

The main decision point I hit early on is whether tsk_table_collection_load_indexes should be part of the conditional within tsk_table_collection_loadf_inited or not. Without any further changes, either decision leads to errors, but ONLY if a table_collection loaded with TSK_LOAD_SKIP_TABLES is ever attempted to be put into a tree sequence (which I implemented but ended up reverting for this PR since I don't know if its necessary for the use cases we envision -- more on this below).

Let me walk you through each of these cases, assuming that TSK_LOAD_SKIP_TABLES is set:

If tsk_table_collection_load_indexes is not called, the table collection won't have it's indexes set, so if we attempt to load this table collection into a tree sequence we encounter an error in tsk_table_collection_check_integrity.

If tsk_table_collection_load_indexes is called and we attempt to add indexes to a table collection without having loaded its tables, we encounter an error within the ..._load_indexes function when checking whether (num_rows != self->edges.num_rows) since we never actually loaded the edge table.

What I ended up doing is to call the ..._load_indexes function, but pass the TSK_LOAD_SKIP_TABLES option to it to skip the (num_rows != self->edges.num_rows) check in case the option is set. My motivation was that it feels cleaner to have an empty but indexed table collection, than an empty, un-indexed table collection that throws an error whenever it tries to touch a tree sequence. Please note though that I'm not totally sure if there's any unwanted consequences of skipping that check (it seems to just check if the edge table was loaded correctly or was malformed in some way), so I rely on your advice here.

I then exposed this flag to the python TableCollection class through its C interface. I ended up reverting a change that allowed to also pass this option all the way through a TreeSequence since I don't really see the use case of that, but of course I'd be happy to re-revert this if needed. What I did instead is add a top-level tskit.load_metadata function which just calls TableCollection.load(file, skip_tables=True) and returns the metadata. I added this mostly as a proof-of-principle of what I am envisioning the use-case of this flag to be.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 2, 2021

Please note that this is very much a draft -- I am of course happy to clean this up and add tests and everything once some of the decisions I outline above have been made.

@codecov
Copy link

codecov bot commented Nov 2, 2021

Codecov Report

Merging #1882 (91796a8) into main (c9cecb1) will decrease coverage by 13.54%.
The diff coverage is 96.25%.

❗ Current head 91796a8 differs from pull request most recent head 763e8db. Consider uploading reports for the commit 763e8db to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##             main    #1882       +/-   ##
===========================================
- Coverage   93.32%   79.78%   -13.55%     
===========================================
  Files          27       27               
  Lines       25186    24803      -383     
  Branches     1107     1092       -15     
===========================================
- Hits        23505    19789     -3716     
- Misses       1647     4953     +3306     
- Partials       34       61       +27     
Flag Coverage Δ
c-tests 92.20% <98.21%> (-0.13%) ⬇️
lwt-tests 89.14% <ø> (ø)
python-c-tests 68.63% <91.66%> (-2.99%) ⬇️
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/trees.py 43.80% <50.00%> (-53.99%) ⬇️
c/tskit/tables.c 90.28% <98.07%> (+0.09%) ⬆️
c/tskit/trees.c 94.96% <100.00%> (-0.01%) ⬇️
python/_tskitmodule.c 90.38% <100.00%> (-0.44%) ⬇️
python/tskit/tables.py 84.41% <100.00%> (-14.53%) ⬇️
python/tskit/cli.py 0.00% <0.00%> (-95.94%) ⬇️
python/tskit/formats.py 6.10% <0.00%> (-93.90%) ⬇️
python/tskit/vcf.py 9.78% <0.00%> (-90.22%) ⬇️
python/tskit/drawing.py 10.41% <0.00%> (-89.02%) ⬇️
python/tskit/combinatorics.py 13.70% <0.00%> (-85.66%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9cecb1...763e8db. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @clwgg! I think you've done a great job and have basically got it right.

I think it would be better to build an empty index in the skip_tables case rather than complicating the logic for load_indexes. It'll basically be a no-op, and I think the logic would be simpler.

Does this work for you?

c/tskit/tables.c Outdated
}
ret = tsk_table_collection_load_indexes(self, &store);
ret = tsk_table_collection_load_indexes(self, &store, options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well skip skip loading the indexes as well as the tables, since the indexes don't make any sense without the edge table. I would just put this into the same "if" block as the table loads.

@@ -3754,3 +3754,8 @@ def ibd_segments(
store_pairs=store_pairs,
store_segments=store_segments,
)


def load_metadata(file):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's much point in providing the load_metadata function to be honest - it's just as easy to call

metadata = tskit.load(path, skip_tables=True).metadata

It just doesn't seem worth having the extra function to document since it's already a one-liner and people might want to access other top-level info from the tree sequence too.

@benjeffery
Copy link
Member

Great stuff, glad you could get going so quickly. FWIW I agree with @jeromekelleher on both having an empty index and that load_metadata isn't needed.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 2, 2021

Thanks a lot for the quick replies! I just have two clarifying questions for how to go forward.

Question 1:
My motivation for load_metadata was that currently I expose the skip_tables option only to TableCollection.load. This is done completely through the TableCollection backend, so the TSK_LOAD_SKIP_TABLES flag never actually touches a tree sequence and exists only in "TableCollection land".

My understanding is that tskit.load calls TreeSequence.load however, so to make skip_tables available here, I would need to propagate it all the way from TreeSequence.load -> _tskit.TreeSequence_load -> tsk_treeseq_load until it hits tsk_table_collection_load. This was in fact my original plan (and I had implemented the backend part before reverting it here: 20fe73f), until I realized that I could just expose it through the TableCollection only. I am happy to implement the TreeSequence part if you prefer to expose the option in this way -- I just wanted to double check here.

Question 2:
If we do actually make a TreeSequence out of the TableCollection loaded with TSK_LOAD_SKIP_TABLES we can't completely skip loading the indexes, since this leads tsk_table_collection_check_integrity to complain (which is called from tsk_treeseq_init within tsk_treeseq_load). Are you suggesting to add a function that adds a "dummy" empty index if TSK_LOAD_SKIP_TABLES is set, instead of having tsk_table_collection_load_indexes do it?

@benjeffery
Copy link
Member

Thanks, good questions.

This is an advanced use case, I don't think we need to add it to TreeSequence or tskit.load.

It is fine for tsk_table_collection_check_integrity to complain, there is no use case where you need a TreeSequence with just metadata that isn't covered by the TableCollection API.

@jeromekelleher
Copy link
Member

I've had a look at this @clwgg and I think we can make it work all right. The load function looks a bit like this:

        }
        ret = tsk_table_collection_load_indexes(self, &store);
        if (ret != 0) {
            goto out;
        }
    } else {
        ret = tsk_table_collection_build_index(self, 0);
        if (ret != 0) {
            goto out;
        }
    }

Then, for testing, we can add something like this function to the tests/test_file_format.c file:

static void
test_skip_tables(void)
{
    int ret;
    tsk_treeseq_t *ts1 = caterpillar_tree(5, 3, 3);
    tsk_treeseq_t ts2;
    tsk_table_collection_t tables;

    ret = tsk_treeseq_dump(ts1, _tmp_file_name, 0);
    CU_ASSERT_EQUAL_FATAL(ret, 0);
    ret = tsk_table_collection_load(&tables, _tmp_file_name, TSK_LOAD_SKIP_TABLES);
    CU_ASSERT_EQUAL_FATAL(ret, 0);

    CU_ASSERT_TRUE(
        tsk_table_collection_equals(&tables, ts1->tables, TSK_CMP_IGNORE_TABLES));
    CU_ASSERT_EQUAL(tables.individuals.num_rows, 0);
    CU_ASSERT_EQUAL(tables.nodes.num_rows, 0);
    CU_ASSERT_EQUAL(tables.edges.num_rows, 0);
    CU_ASSERT_EQUAL(tables.migrations.num_rows, 0);
    CU_ASSERT_EQUAL(tables.sites.num_rows, 0);
    CU_ASSERT_EQUAL(tables.mutations.num_rows, 0);
    CU_ASSERT_EQUAL(tables.provenances.num_rows, 0);

    /* We should be able to make a tree sequence */
    ret = tsk_treeseq_init(&ts2, &tables, 0);
    CU_ASSERT_EQUAL_FATAL(ret, 0);
    tsk_treeseq_free(&ts2);
    
    /* Do the same thing with treeseq API */
    ret = tsk_treeseq_load(&ts2, _tmp_file_name, TSK_LOAD_SKIP_TABLES);
    CU_ASSERT_EQUAL_FATAL(ret, 0);
    CU_ASSERT_TRUE(tsk_table_collection_equals(&tables, ts2.tables, 0));
    tsk_treeseq_free(&ts2);

    tsk_table_collection_free(&tables);
    tsk_treeseq_free(ts1);
    free(ts1); 
}

This uses a new flag TSK_CMP_IGNORE_TABLES to table_collection_equals, which I think is worth adding, and the compare function would now look like:

bool
tsk_table_collection_equals(const tsk_table_collection_t *self,
    const tsk_table_collection_t *other, tsk_flags_t options)
{
    bool ret = self->sequence_length == other->sequence_length
               && self->time_units_length == other->time_units_length
               && tsk_memcmp(self->time_units, other->time_units,
                      self->time_units_length * sizeof(char))
                      == 0;
    if (!(options & TSK_CMP_IGNORE_TABLES)) {
        ret = ret
              && tsk_individual_table_equals(
                     &self->individuals, &other->individuals, options)
              && tsk_node_table_equals(&self->nodes, &other->nodes, options)
              && tsk_edge_table_equals(&self->edges, &other->edges, options)
              && tsk_migration_table_equals(
                     &self->migrations, &other->migrations, options)
              && tsk_site_table_equals(&self->sites, &other->sites, options)
              && tsk_mutation_table_equals(&self->mutations, &other->mutations, options)
              && tsk_population_table_equals(
                     &self->populations, &other->populations, options);
        /* TSK_CMP_IGNORE_TABLES impliest TSK_CMP_IGNORE_PROVENANCE */
        if (!(options & TSK_CMP_IGNORE_PROVENANCE)) {
            ret = ret
                  && tsk_provenance_table_equals(
                         &self->provenances, &other->provenances, options);
        }
    }
    /* TSK_CMP_IGNORE_TS_METADATA is implied by TSK_CMP_IGNORE_METADATA */
    if (options & TSK_CMP_IGNORE_METADATA) {
        options |= TSK_CMP_IGNORE_TS_METADATA;
    }
    if (!(options & TSK_CMP_IGNORE_TS_METADATA)) {
        ret = ret
              && (self->metadata_length == other->metadata_length
                     && self->metadata_schema_length == other->metadata_schema_length
                     && tsk_memcmp(self->metadata, other->metadata,
                            self->metadata_length * sizeof(char))
                            == 0
                     && tsk_memcmp(self->metadata_schema, other->metadata_schema,
                            self->metadata_schema_length * sizeof(char))
                            == 0);
    }
    return ret;
}

Note the tests don't quite pass with these changes - there's one more update that needs to be made (which I'll leave you to figure, if you like!)

@clwgg
Copy link
Contributor Author

clwgg commented Nov 6, 2021

Thanks for the great pointers @jeromekelleher! I had missed the tsk_table_collection_build_index function on my first pass, but of course it makes a lot more sense to use that one!

@clwgg
Copy link
Contributor Author

clwgg commented Nov 6, 2021

I just implemented your suggestions and added the tests you wrote. Given that you included tsk_treeseq_load(&ts2, _tmp_file_name, TSK_LOAD_SKIP_TABLES); in the test suite, I also re-added exposing the flag for tree sequence loading here: 65823c6. Does this mean I should also expose it to python TreeSequence.load? Sorry that I keep asking this, just trying to make sure since I'm getting some "mixed signals" ;).

@clwgg
Copy link
Contributor Author

clwgg commented Nov 6, 2021

This is a random side note, but I can't figure out how to make clang-format use the indentation style that the test suite is complaining about when mixing multi-line parentheses and lines starting with logical operators. The .clang-format file in the repo doesn't reproduce it for me, and neither do a bunch of other config options I've tried. I'm on clang version 12.0.1 -- please let me know if anyone has an idea (:

@clwgg
Copy link
Contributor Author

clwgg commented Nov 7, 2021

Does this mean I should also expose it to python TreeSequence.load?

Ok, I just added this for completeness' sake -- please let me know if you would like to have this kept in here, and I can start adding python tests for whatever aspects of this we'd like to expose to python.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks just right to me, thanks @clwgg! Some comments about testing above.

To test the overall functionality in Python, I think the best thing to do is add a new test class TestSkipTables to the tests/test_file_format.py file. We don't have to do anything exhaustive, just check the basic properties of the tables/tree sequences we're loading on a few examples.

This would be simplest if we added the ignore_tables argument to assert_equals, but maybe this PR is getting big enough as it is and this should be done in a follow up. It's up to you.

c/tskit/tables.h Outdated
/* Flags for table collection load */
/* This shares an interface with table collection init, so we have to
* increment offset by one.
* NOTE: Should these two be under the same header?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good questions - let's do this as part of #1720

Can you leave a comment here please, "TODO review as part of #1720"

python/_tskitmodule.c Show resolved Hide resolved
@jeromekelleher
Copy link
Member

This is a random side note, but I can't figure out how to make clang-format use the indentation style that the test suite is complaining about when mixing multi-line parentheses and lines starting with logical operators. The .clang-format file in the repo doesn't reproduce it for me, and neither do a bunch of other config options I've tried. I'm on clang version 12.0.1 -- please let me know if anyone has an idea (:

This is super annoying @clwgg but you have to use the exact version of clang-format ☹️ The docs should help.

@jeromekelleher
Copy link
Member

So yes, just to be clear on your question above, I think we should support this on both the TableCollection.load and TreeSequence.load code paths.

@jeromekelleher
Copy link
Member

jeromekelleher commented Nov 8, 2021

Ah, I realised there's a wrinkle here that I forgot about. We need to change the load flags to kastore to tell it READ_ALL only if we're not in the "skip tables" case. Otherwise, we just read the whole file anyway. This is easily done:

 int kas_flags = options & TSK_LOAD_SKIP_TABLES? 0: KAS_READ_ALL;

 ret = kastore_openf(&store, file, "r", kas_flags);

I started writing a few tests to check if this was working:

def assert_tables_empty(tables):
    for table in tables.name_map.values():
        assert len(table) == 0


class TestSkipTables:
    def test_ts_read_path_interface(self, tmp_path, ts_fixture):
        # Check the fixture has metadata and a schema
        assert ts_fixture.metadata_schema is not None
        assert len(ts_fixture.metadata) > 0
        save_path = tmp_path / "tmp.trees"
        ts_fixture.dump(save_path)
        ts_no_tables = tskit.load(save_path, skip_tables=True)
        assert ts_no_tables.metadata == ts_fixture.metadata
        assert ts_no_tables.sequence_length == ts_fixture.sequence_length
        assert_tables_empty(ts_no_tables.tables)

    def test_ts_read_one_stream(self, tmp_path, ts_fixture):
        save_path = tmp_path / "tmp.trees"
        ts_fixture.dump(save_path)
        with open(save_path, "rb") as f:
            ts_no_tables = tskit.load(f, skip_tables=True)
            assert ts_no_tables.metadata == ts_fixture.metadata
            assert ts_no_tables.sequence_length == ts_fixture.sequence_length
            assert_tables_empty(ts_no_tables.tables)

    def test_ts_read_multi_stream_fails(self, tmp_path, ts_fixture):
        save_path = tmp_path / "tmp.trees"
        ts_fixture.dump(save_path)
        with open(save_path, "rb") as f:
            ts_no_tables = tskit.load(f, skip_tables=True)
            # The second load should fail because we've not read to the
            # end of the file.
            with pytest.raises(tskit.FileFormatError):
                tskit.load(f)

We should probably document this corner case (you can't read multiple tree sequences from a stream in skip_tables mode). This also means that you can't skip_tables when reading from a non-seekable file descriptor like a socket. That's OK though, since this is basically nonsensical anyway.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 13, 2021

Great catch on the KAS_READ_ALL flag! I added that and am looking into the python tests now.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 13, 2021

While working on the tests it did indeed seem like a good idea to expose the TSK_CMP_IGNORE_TABLES flag we introduced as ignore_tables to python. I tried to hunt down all the places where this would need adding and hope I caught them all.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 13, 2021

Sorry for the mess with that last commit, but I was looking for places where the ignore_* flags to the .equals() methods get tested and just added a couple of in-line comments to not loose track of them -- any input on the notes in that commit (87974d8) would of course be very much appreciated!

@clwgg clwgg force-pushed the clwgg_load_metadata branch from f18c5ce to a2470bb Compare November 13, 2021 06:11
@clwgg
Copy link
Contributor Author

clwgg commented Nov 13, 2021

some very simple tests on the skip_tables argument (do we get a TypeError on bad types) and just checks that it runs ok with skip_tables=True. (No need to test any properties of the ts, we'll do that elsewhere)

Sorry @jeromekelleher I just re-read this after I had implemented some of those tests already. The last commit (cd6775b) now also added a bad_bool test that I think is closer to what you were envisioning. I'm happy to pull the tables_are_skipped stuff out of the lowlevel tests if you think this would be better tested in test_file_format.py.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great, thanks @clwgg ! I think you just need to fill out the few remaining testing corner cases, add some documentation and we're done.

Thanks for following up with the ignore_tables functionality here, you're going above and beyond!

@@ -164,6 +164,36 @@ class TestTableCollection(LowLevelTestCase):
Tests for the low-level TableCollection class
"""

def test_skip_tables(self, tmp_path):
def tables_are_skipped(tc, tc_skip):
return (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a little better to structure this like

          assert not tc.equals(tc_skip)
          assert tc.equals(tc_skip, ignore_tables=True) 
           ... etc

The reasons is that if the tests every fail, the specific problem will be immediately apparent. This way, you'd have to put in some prints to try and figure out which one of the assertions was failing.

@@ -4003,6 +4003,8 @@ def test_ignore_timestamps(self, t1, t2):
t1.assert_equals(t2, ignore_provenance=True)
t1.assert_equals(t2, ignore_timestamps=True)

# NOTE: test `ignore_tables` here too?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please!

@@ -2850,6 +2856,10 @@ def assert_equals(
f" differs: self={self.sequence_length} other={other.sequence_length}"
)

# NOTE: Do we need to anything "extra" to account for `ignore_tables`?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine - we don't keep any particular state to say that the tables were created by "ignore_tables" so failing on the number of edges being different is as good as any.

@jeromekelleher
Copy link
Member

I'm happy to pull the tables_are_skipped stuff out of the lowlevel tests if you think this would be better tested in test_file_format.py.

I think that's probably better all right. The low-level tests are really just to make sure we're not making any basic mistakes at the Python-C level and we're exercising all the code/error paths. Let's to the logic testing in test_file_formats like I suggested above. We can test ignore_tables in test_tables.py where you suggested.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 23, 2021

Hi @jeromekelleher! I've been working on adding more tests, and I might be getting a little too much into the weeds of this... One thing I was wondering about the multi_stream test you proposed above, is what the "base case" we're testing against is (i.e. without the flag we're testing for), but I'm not sure if that's the right way of thinking about it. Here are a few examples:

A - The test case you proposed

    def test_ts_read_multi_stream_fails(self, tmp_path, ts_fixture):
        save_path = tmp_path / "tmp.trees"
        ts_fixture.dump(save_path)
        with open(save_path, "rb") as f:
            ts_no_tables = tskit.load(f, skip_tables=True)
            # The second load should fail because we've not read to the
            # end of the file.
            with pytest.raises(exceptions.FileFormatError):
                tskit.load(f)

My understanding is that this throws a FileFormatError because we're trying to read a new tree sequence from a file handle that is now positioned somewhere inside of the file, so in this case doing something like f.seek(0) between the calls to tskit.load() would work to resolve the error.
If we don't set skip_tables on the first load, the equivalent test throws an EOFError instead, since we're now at the end of the file after the first read:

B

    def test_if_this_works(self, tmp_path, ts_fixture):
        save_path = tmp_path / "tmp.trees"
        ts_fixture.dump(save_path)
        with open(save_path, "rb") as f:
            ts1 = tskit.load(f)
            with pytest.raises(EOFError):
                ts2 = tskit.load(f)

So really what A is testing is whether we throw a FileFormatError instead of an EOFError by setting the skip_tables flag, correct?
With that, I was wondering if you think it would be worth it to test the "multistream" behavior more explicitly? One way to do that would be something like:

C

    def test_double_file(self, tmp_path, ts_fixture):
        save_path = tmp_path / "tmp.trees"
        with open(save_path, "wb") as f:
            ts_fixture.dump(f)
            ts_fixture.dump(f)
        with open(save_path, "rb") as f:
            ts1 = tskit.load(f)
            ts2 = tskit.load(f)
        assert ts_fixture.equals(ts1)
        assert ts_fixture.equals(ts2)

    def test_double_file_fails(self, tmp_path, ts_fixture):
        save_path = tmp_path / "tmp.trees"
        with open(save_path, "wb") as f:
            ts_fixture.dump(f)
            ts_fixture.dump(f)
        with open(save_path, "rb") as f:
            ts1 = tskit.load(f, skip_tables=True)
            with pytest.raises(exceptions.FileFormatError):
                ts2 = tskit.load(f)

Here, we use the concatenation of two tree sequence dumps in a single file to simulate reading from a non-seekable stream, which works fine without skip_tables, but throws the appropriate error when skip_tables is true. I think the concatenation into a single file looks a little ugly, but I thought it was easier to simulate it this way rather than having to set up sockets or something.

Anyway, I'm curious to hear what you think about this, and I'll try to get less distracted while adding the remaining tests...

@jeromekelleher
Copy link
Member

Yes, you've got it right @clwgg. I think your proposal of C is good. We are testing read-multiple-from-stream behaviour pretty thoroughly in test_fileobj.py but having this test nearby would help to make the behaviour on skip_tables clearer.

The base-case here is that the test_double_file_fails test will fail if we don't provide the correct flags to kastore.

@jeromekelleher
Copy link
Member

We're hoping to get 0.4.0 shipped in the coming week @clwgg - do you think this can be finished up by then?

@clwgg
Copy link
Contributor Author

clwgg commented Nov 23, 2021

great! Yes, I think so.

@jeromekelleher jeromekelleher added this to the Python 0.4.0 milestone Nov 23, 2021
@clwgg clwgg force-pushed the clwgg_load_metadata branch 2 times, most recently from bbeda82 to 9652fc9 Compare November 24, 2021 03:16
@clwgg
Copy link
Contributor Author

clwgg commented Nov 24, 2021

Alright, I added more tests and API documentation, so I think this is pretty much ready for final(ish) review! There is one new issue that CI is complaining about:

  • The python test case test_ts_twofile_stream_fails(), which I outlined in my "Option C" above as test_double_file_fails, is failing (and not in the way it's supposed to...). Unfortunately I haven't been able to reproduce that locally, and I'm not totally sure what is going wrong.

Apart from that I am of course happy to make any other changes based on your review, especially of these new additions!

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @clwgg! I've made a batch of suggestions.

Once these are in, can you rebase and squash the commits please? I think it's ready to merge then.

c/tskit/tables.h Outdated Show resolved Hide resolved
c/tskit/tables.h Outdated Show resolved Hide resolved
c/tskit/tables.h Outdated Show resolved Hide resolved
c/tskit/tables.h Outdated Show resolved Hide resolved
c/tskit/tables.h Outdated Show resolved Hide resolved
python/tskit/trees.py Outdated Show resolved Hide resolved
python/tskit/trees.py Outdated Show resolved Hide resolved
python/tskit/trees.py Outdated Show resolved Hide resolved
python/tskit/trees.py Outdated Show resolved Hide resolved
python/tests/test_file_format.py Show resolved Hide resolved
@clwgg
Copy link
Contributor Author

clwgg commented Nov 24, 2021

done!

@clwgg clwgg force-pushed the clwgg_load_metadata branch from 8506102 to 56779fd Compare November 24, 2021 17:26
@clwgg clwgg marked this pull request as ready for review November 24, 2021 17:27
@clwgg
Copy link
Contributor Author

clwgg commented Nov 24, 2021

oh, I just noticed that I can reproduce the failing test locally now that I rebased onto the current upstream main!

@jeromekelleher
Copy link
Member

oh, I just noticed that I can reproduce the failing test locally now that I rebased onto the current upstream main!

Ah, that's good. What version of Python are you using?

@clwgg
Copy link
Contributor Author

clwgg commented Nov 24, 2021

To me it does look like the issue you first reported in #1938 -- I get a TypeError: expected str, bytes or os.PathLike object, not BufferedReader from the open() call in raise_hdf5_format_error, which is triggered by the try: except: FileFormatError in TreeSequence.load()

@clwgg
Copy link
Contributor Author

clwgg commented Nov 24, 2021

I'm on python 3.9.7.

@clwgg
Copy link
Contributor Author

clwgg commented Nov 24, 2021

it looks like exception chaining is only triggered on an OSError in raise_hdf5_format_error, while the TypeError actually does throw an exception, right? I don't know, I might be totally off here.

@jeromekelleher
Copy link
Member

Aha, that's exactly it @clwgg TypeError is not an OSError. A quick fix would be to add TypeError to the expected exceptions.

@jeromekelleher
Copy link
Member

This needs another rerbase @clwgg, but should be good to merge then.

@clwgg clwgg force-pushed the clwgg_load_metadata branch from 56779fd to ca1c0e7 Compare November 25, 2021 17:43
@clwgg
Copy link
Contributor Author

clwgg commented Nov 25, 2021

done!

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, thanks @clwgg! We got there 😄

I updated the commit to include some changelog information; merging.

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 26, 2021
@mergify mergify bot merged commit 73b7ea6 into tskit-dev:main Nov 26, 2021
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants