Add zero-copy array access to ReferenceSequence #1989

jeromekelleher · 2021-12-02T15:53:37Z

The ReferenceSequence.data attribute returns the reference sequence data as a string. For large references we almost certainly don't want to do this, as it creates a new Python string and a copy of the data. So, it would be good to have a numpy array view of the data instead.

We should see first how we might use this, though. The only place we're using this at the moment is in the alignments method. In this case we can definitely sidestep the full Python string because we're immediately turning the data into a numpy array here. So, it'll be quite easy to have an internal API using something like data_array which is a view.
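To make the copy-versus-view distinction concrete, here is a minimal sketch in plain numpy. It does not use the tskit API: the `as_readonly_view` helper, the `bytearray` stand-in for the underlying storage, and the `int8` dtype are all assumptions made for illustration.

```python
import numpy as np


def as_readonly_view(buffer):
    """Return a read-only int8 array sharing memory with ``buffer``.

    Hypothetical helper: np.frombuffer does not copy, so the array is a
    view onto whatever object exposes the buffer protocol.
    """
    view = np.frombuffer(buffer, dtype=np.int8)
    view.flags.writeable = False
    return view


# Stand-in for the memory holding the reference sequence (an assumption).
storage = bytearray(b"ACGTACGTAC")

# Copying path: decoding builds a brand new Python string.
as_string = storage.decode("ascii")

# Zero-copy path: the array is a window onto the same memory.
data_array = as_readonly_view(storage)

# Changing the storage is visible through the view, but not in the string.
storage[0] = ord("T")
print(chr(data_array[0]), as_string[0])
```

The string keeps its old contents because it is an independent copy, whereas the array reflects the change to the underlying buffer.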

However, it might not be worth doing this because we'll have to implement alignments in C fairly soon anyway (#1589).

If it's easy I'll implement the data_array when we're in read-only mode for #1935, which is soon on the menu.

In general, I don't think we'll be accessing the data attribute directly much, as we'll want to present a higher-level interface in Python (for example, we implement __getitem__ to support pulling out a slice of a reference, which can operate on either the data or url - see #1988).
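As a rough illustration of the higher-level interface idea, the sketch below shows a `__getitem__` that only materialises the requested slice as a string. The `ReferenceSlicer` class is hypothetical and is not the tskit implementation; it covers only the in-memory data case, not the url case.

```python
class ReferenceSlicer:
    """Hypothetical wrapper returning slices of a reference as strings."""

    def __init__(self, data):
        # memoryview gives access to the bytes without copying them.
        self._view = memoryview(data)

    def __len__(self):
        return len(self._view)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Only the requested region is copied and decoded.
            return self._view[key].tobytes().decode("ascii")
        if isinstance(key, int):
            return chr(self._view[key])
        raise TypeError("index must be an int or a slice")


ref = ReferenceSlicer(b"ACGTACGT")
print(ref[2:5])
```

With this shape, callers never need to touch the full string, so whether the backing store is a copy or a view becomes an internal detail.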

jeromekelleher · 2021-12-02T20:30:01Z

Thinking about the alignments issue some more, I'm not sure this is worth optimising for. We're currently storing n copies of the reference sequence anyway which are the alignments, so avoiding one more copy would be a very minor optimisation.

jeromekelleher mentioned this issue Dec 2, 2021

Refseq update #1944

Merged

jeromekelleher added the Python API label Dec 2, 2021

jeromekelleher added this to the Python upcoming milestone Dec 2, 2021
