Concat #720

Merged
merged 16 commits into from
Nov 10, 2016

Conversation

wasade
Copy link
Member

@wasade wasade commented Nov 4, 2016

Resolves #716. This ended up growing in complexity to handle the general case in which the IDs on one axis are disjoint while the IDs on the other axis need not be. I'd be curious whether others see ways to reduce or simplify some of the code complexity here.

Now, why is this method useful? Often, when merging tables, one axis is disjoint (e.g., the sample IDs in a meta-analysis). For this case, the method offers an improvement of 80x or so over the existing merge method. In addition, it supports concatenating multiple tables, whereas merge operates only on pairs, so further performance gains are likely in real-world scenarios.
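
To make the disjoint-axis requirement concrete, here is a minimal pure-Python sketch of the idea. It is not the implementation in this PR: tables are modeled as plain dicts mapping a sample ID to `{observation ID: count}`, and `concat_samples` and this `DisjointIDError` class are hypothetical stand-ins used only for illustration.

```python
class DisjointIDError(Exception):
    """Hypothetical stand-in for the error raised on overlapping IDs."""


def concat_samples(tables):
    """Concatenate toy tables along the sample axis.

    Each table is a dict: sample ID -> {observation ID: count}.
    Sample IDs must be disjoint across tables; observation IDs may overlap.
    """
    seen_samples = set()
    all_obs = set()
    for table in tables:
        # Iterating a dict yields its keys, i.e. the sample IDs.
        if not seen_samples.isdisjoint(table):
            raise DisjointIDError("IDs are not disjoint")
        seen_samples.update(table)
        for counts in table.values():
            all_obs.update(counts)

    obs_order = sorted(all_obs)
    # Observations missing from a table are filled with zeros.
    return {
        sample: {obs: counts.get(obs, 0) for obs in obs_order}
        for table in tables
        for sample, counts in table.items()
    }


a = {"S1": {"O1": 1, "O2": 2}}
b = {"S2": {"O2": 3, "O3": 4}}
c = {"S3": {"O1": 5}}
merged = concat_samples([a, b, c])
```

Because the sample IDs never collide, no values have to be combined; each table only needs to be padded with zeros for the observations it lacks. That is the property that lets many tables be handled in one pass instead of pairwise.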

The first benchmark shows approximately a 20x improvement. When profiling, a hotspot was identified in sort_order, which was also rectified.

In [1]: import biom

In [2]: a = biom.load_table('twins.qiita-2014.biom')

In [3]: b = biom.load_table('ag.qiita-10317.biom')

In [4]: a
Out[4]: 8735 x 1046 <class 'biom.table.Table'> with 344079 nonzero entries (3% dense)

In [5]: b
Out[5]: 9128 x 9919 <class 'biom.table.Table'> with 1508178 nonzero entries (1% dense)

In [6]: %timeit c = a.concat([b])
1 loop, best of 3: 2.25 s per loop

In [7]: %timeit c = a.merge(b)
1 loop, best of 3: 43.9 s per loop

A quick profile suggested a hotspot in sort_order stemming from a) an implicit cast to dense (which also stood to bloat memory) and b) utilizing incremental indexing instead of fancy indexing. So I fixed that. New timing below:

In [1]: import biom

In [2]: a = biom.load_table('twins.qiita-2014.biom')

In [3]: b = biom.load_table('ag.qiita-10317.biom')

In [4]: %timeit c = a.concat([b])
1 loop, best of 3: 582 ms per loop
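
The `sort_order` change described above can be illustrated with a small sketch. This is an assumption-laden toy, not the code from the PR: it uses a SciPy CSR matrix directly, and the names `ids`, `order`, and `position` are made up for the example. The point is only that one integer index array reorders an axis in a single call without converting to dense.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Toy 2 x 3 matrix: rows are observations, columns are samples.
data = csr_matrix(np.array([[0, 1, 2],
                            [3, 0, 4]]))
ids = ["S1", "S2", "S3"]
order = ["S3", "S1", "S2"]

# Map each ID to its column position, then build one integer index array.
position = {id_: i for i, id_ in enumerate(ids)}
fancy = np.array([position[i] for i in order], dtype=int)

# A single fancy-indexing call reorders the columns; the result stays sparse.
reordered = data[:, fancy]
```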

@wasade
Copy link
Member Author

wasade commented Nov 4, 2016

@antgonza relevant for qiita

@gregcaporaso relevant for q2-feature-table on its merge

@coveralls
Copy link

coveralls commented Nov 4, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling ab2414d on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 4, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling 40a3c35 on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 4, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling 1d935a2 on wasade:concat into 1ebd2d1 on biocore:master.

@wasade
Copy link
Member Author

wasade commented Nov 5, 2016

This resolves #717 as well, adding a note and test in a second.

Copy link
Contributor

@antgonza antgonza left a comment

Couple of comments.

O5 0.0 0.0 0.0 0.0 0.0 0.0 15.0 16.0 17.0

"""
# should this be a staticmethod?
Copy link
Contributor

Good question, yes? No real preference but we should remove this comment.

tmp_ids = list(table.ids(axis=axis))
tmp_md = table.metadata(axis=axis)

# this sucks.
Copy link
Contributor

????????????????

@wasade wasade added this to the 2.1.6 milestone Nov 5, 2016
@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling 8409a43 on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling 05de544 on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling 7a4cb03 on wasade:concat into 1ebd2d1 on biocore:master.

@wasade
Copy link
Member Author

wasade commented Nov 5, 2016

Thanks @antgonza, is this good then? I'm not thrilled with the complexity but meh...

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling b4b0d1d on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.1%) to 86.28% when pulling 0c6dac7 on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.05%) to 86.185% when pulling 5221508 on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.05%) to 86.185% when pulling e7300ac on wasade:concat into 1ebd2d1 on biocore:master.

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.05%) to 86.185% when pulling fb824c8 on wasade:concat into 1ebd2d1 on biocore:master.

@ElDeveloper
Copy link
Member

This looks good, just a few minor comments. As for your original question, maybe making it a class method makes more sense. It really should just be a function somewhere, but a class method makes organizational sense 👍

@wasade
Copy link
Member Author

wasade commented Nov 5, 2016

@ElDeveloper, i didn't see any comments?

I'm tempted to defer doing this as a class or staticmethod because, while not good, it's more consistent for this to be a Table instance method...

@@ -15,6 +15,8 @@ New Features:
* `Table.rankdata` has been added to convert values to ranked abundances on
either axis. See [issue #645](https://github.com/biocore/biom-format/issues/639).
* Format of numbers in ``biom summarize-table`` output is now more readable and localized. See [issue #679](https://github.com/biocore/biom-format/issues/679).
* `Table.concat` has been added to the API and allows for concatenating multiple tables in which the IDs of one of the axes are known to be disjoint. This has substantial performance benefit over `Table.merge`.
Copy link
Member

benefit -> benefits

Copy link
Member Author

done

@@ -1795,39 +1797,23 @@ def sort_order(self, order, axis='sample'):
O2 1.0 0.0 4.0

"""
md = []
vals = []
fancy = np.array([self.index(i, axis=axis) for i in order], dtype=int)
Copy link
Member

If these were relative abundances, this line would fail, right?

----------
others : iterable of biom.Table
Tables to concatenate
axis : {'sample', 'observation'}
Copy link
Member

, optional is missing here.

Copy link
Member Author

done

others : iterable of biom.Table
Tables to concatenate
axis : {'sample', 'observation'}
The axis to concatenate on. i.e., if axis is 'sample', then tables
Copy link
Member

I am a bit unclear as to why this needs to be specified, what's the reason that you need the axis in order to concatenate the table?

Copy link
Member

Oh, I get it now.


# test we have disjoint IDs
if not axis_ids.isdisjoint(table_axis_ids):
raise DisjointIDError("IDs are not disjoint")
Copy link
Member

Maybe worth adding more information noting what the offending table is?

Copy link
Member Author

it would take a little bit of book keeping to know which two tables were in conflict, and it might be more than 2 in conflict. Blocking?
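
For reference, the bookkeeping discussed here could look roughly like the sketch below. It is a hypothetical helper, not part of the PR: `check_disjoint` and this `DisjointIDError` class are illustrative names, and the inputs are plain lists of IDs rather than tables. It records the first table in which each ID was seen, so every table involved in a conflict can be named, including cases with more than two.

```python
class DisjointIDError(Exception):
    """Hypothetical stand-in for the exception used in the diff."""


def check_disjoint(id_lists):
    """Raise if any ID occurs in more than one list, naming the overlaps."""
    first_seen = {}  # ID -> index of the first table that contained it
    conflicts = {}   # ID -> indices of every table containing it
    for idx, ids in enumerate(id_lists):
        for id_ in set(ids):
            if id_ in first_seen:
                conflicts.setdefault(id_, [first_seen[id_]]).append(idx)
            else:
                first_seen[id_] = idx
    if conflicts:
        detail = "; ".join(
            f"{id_!r} in tables {tables}"
            for id_, tables in sorted(conflicts.items())
        )
        raise DisjointIDError(f"IDs are not disjoint: {detail}")
```

The cost is one extra dict keyed by ID, which is the "little bit of bookkeeping" mentioned above.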

@@ -2935,6 +2921,161 @@ def _intersect_id_order(self, a, b):
idx += 1
return new_order

def concat(self, others, axis='sample'):
Copy link
Member

Not blocking, but just making an observation, this method is a little bit long, is there any way to break it down a bit?

Copy link
Member Author

Yea, I know. Not enthusiastic. A bunch of private methods perhaps...?


with self.assertRaises(DisjointIDError):
example_table.concat([example_table], axis='observation')

Copy link
Member

What happens with an empty list of tables?

Copy link
Member Author

works as expected?

In [2]: biom.example_table.concat([biom.Table([], [], [])])
Out[2]: 2 x 3 <class 'biom.table.Table'> with 5 nonzero entries (83% dense)

@@ -1620,6 +1712,15 @@ def test_update_ids_cache_bug(self):
exp_index = {'x': 0, 'y': 1}
self.assertEqual(obs._sample_index, exp_index)

def test_other_spmatrix_type(self):
# I dont actually remember what bug stemmed from this...
Copy link
Member

🤔

Copy link
Member Author

It came up in two scenarios. I could dig through the code but not really excited about that as it would necessitate a bunch of parallel exploration in an interpreter. I could just delete the comment? :)

@ElDeveloper
Copy link
Member

I'm tempted to defer doing this as a class or staticmethod because, while not good, it's more consistent for this to be a Table instance method...

Sounds good.

@ElDeveloper
Copy link
Member

Also, comments posted, that was my bad (GitHub's new UI).

@@ -1795,39 +1797,23 @@ def sort_order(self, order, axis='sample'):
O2 1.0 0.0 4.0

"""
md = []
vals = []
fancy = np.array([self.index(i, axis=axis) for i in order], dtype=int)
Copy link
Member Author

These are index positions so content of the matrix doesn't matter. Forcing int as numpy by default does float.
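
A short sketch of the NumPy behavior being referred to, as far as it can be stated with certainty: when the input list is empty there is nothing to infer a type from, so `np.array` falls back to `float64`, which is not a valid index dtype. The variable names below are illustrative only.

```python
import numpy as np

# With no elements to infer a type from, NumPy falls back to float64.
empty_default = np.array([])

# An explicit dtype keeps the result usable as an index array.
empty_int = np.array([], dtype=int)

values = np.array([10, 20, 30])
selected = values[empty_int]  # valid: selects nothing
```

For a non-empty list of Python ints the inferred dtype is already an integer type, so the explicit `dtype=int` mainly guards the empty case.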

@ElDeveloper
Copy link
Member

Ah, ok, this makes sense!

@ElDeveloper
Copy link
Member

Not blocking!

On (Nov-05-16|15:56), Daniel McDonald wrote:

wasade commented on this pull request.

    axis_ids = set()
    invaxis_ids = set()
    invaxis_metadata = {}
    all_tables = others[:]
    all_tables.insert(0, self)
    # verify disjoint, and fetch all ids from all tables
    for table in all_tables:
        table_axis_ids = table.ids(axis=axis)
        table_invaxis_order = table.ids(axis=invaxis)
        table_invaxis = set(table_invaxis_order)
        # test we have disjoint IDs
        if not axis_ids.isdisjoint(table_axis_ids):
            raise DisjointIDError("IDs are not disjoint")

it would take a little bit of book keeping to know which two tables were in conflict, and it might be more than 2 in conflict. Blocking?

@ElDeveloper
Copy link
Member

Ah, no not what I meant, I meant:

bt.concat([])

On (Nov-05-16|15:57), Daniel McDonald wrote:

wasade commented on this pull request.

                      [0, 0, 0, 3, 4, 5]]),
            ['O1', 'O2', 'O3'],
            ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
            exp_obs_md,
            list(example_table.metadata()) * 2)
    obs = example_table.concat([table2, ], axis='sample')
    self.assertEqual(obs, exp)

def test_concat_raise_overlap(self):
    with self.assertRaises(DisjointIDError):
        example_table.concat([example_table])
    with self.assertRaises(DisjointIDError):
        example_table.concat([example_table], axis='observation')

works as expected?

In [2]: biom.example_table.concat([biom.Table([], [], [])])
Out[2]: 2 x 3 <class 'biom.table.Table'> with 5 nonzero entries (83% dense)

@wasade
Copy link
Member Author

wasade commented Nov 5, 2016

Thanks!

@ElDeveloper
Copy link
Member

That sounds like a better solution haha.

On (Nov-05-16|15:59), Daniel McDonald wrote:

wasade commented on this pull request.

@@ -1620,6 +1712,15 @@ def test_update_ids_cache_bug(self):
exp_index = {'x': 0, 'y': 1}
self.assertEqual(obs._sample_index, exp_index)

def test_other_spmatrix_type(self):
    # I dont actually remember what bug stemmed from this...

It came up in two scenarios. I could dig through the code but not really excited about that as it would necessitate a bunch of parallel exploration in an interpreter. I could just delete the comment? :)

@ElDeveloper
Copy link
Member

Also, looks like GitHub does not link messages together :[

@coveralls
Copy link

coveralls commented Nov 5, 2016

Coverage Status

Coverage increased (+0.05%) to 86.185% when pulling 0b2160b on wasade:concat into 1ebd2d1 on biocore:master.

@wasade
Copy link
Member Author

wasade commented Nov 5, 2016

Thanks! Will test empty case and resolve remaining in about an hour

@wasade
Copy link
Member Author

wasade commented Nov 6, 2016

I think that's it. Decomposition would be nice but still not really sure what would be the best here.

@coveralls
Copy link

coveralls commented Nov 6, 2016

Coverage Status

Coverage increased (+0.05%) to 86.185% when pulling 9f7dc42 on wasade:concat into 1ebd2d1 on biocore:master.

@antgonza
Copy link
Contributor

antgonza commented Nov 6, 2016

This looks good to me. I will defer to @ElDeveloper to merge.

Copy link
Member

@josenavas josenavas left a comment

Thanks @wasade ! I made some comments, although I don't have a strong preference on them. It mainly depends on how we see the future of this functionality.

@@ -64,7 +64,10 @@ def subset_table(input_hdf5_fp, input_json_fp, axis, ids, output_fp):
input_json_fp = f.read()

with open(ids, 'U') as f:
ids = [line.strip() for line in f]
ids = []
Copy link
Member

Minor but isn't this the same?

ids = [line.strip().split('\t')[0] for line in f if not line.startswith('#')]

You know, to avoid appends (no need to change if you think it makes the code less readable).

Copy link
Member Author

yup, made for a long line and this isnt a performance critical block
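
For comparison, the two forms under discussion can be sketched side by side. The loop body is an assumption, since the diff excerpt above only shows `ids = []`; the sample text and the variable names `ids_loop` and `ids_comp` are made up for the example, and `io.StringIO` stands in for the opened file.

```python
import io

text = "#SampleID\tDescription\nS1\tgut\nS2\tskin\n"

# Loop form (assumed shape of the code under review).
ids_loop = []
for line in io.StringIO(text):
    if not line.startswith('#'):
        ids_loop.append(line.strip().split('\t')[0])

# Comprehension form suggested in the review, wrapped to avoid a long line.
ids_comp = [line.strip().split('\t')[0]
            for line in io.StringIO(text)
            if not line.startswith('#')]
```

Both skip comment lines and keep only the first tab-separated field, so the choice between them is one of readability here.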

all_tables.insert(0, self)

# verify disjoint, and fetch all ids from all tables
for table in all_tables:
Copy link
Member

I think this for loop is a good candidate for a small private function.

missing_ids = list(invaxis_ids - set(table.ids(axis=invaxis)))

if missing_ids:
# determine new shape
Copy link
Member

I think the contents of this if statement can also be broken in another small private function

tmp_ids = list(table.ids(axis=axis))
tmp_md = table.metadata(axis=axis)

# resolve construction based off axis. This really should be
Copy link
Member

Would it be too much pain to put it in a classmethod?

Copy link
Member Author

the resolution off the axis is something which definitely should be a class method and would be nice to do as it would remove this common pattern from the codebase

Copy link
Member

From your comment, it looks like this pattern is repeated in multiple parts of the code base, in such case I think it is ok to open an issue about it and fix all occurrences at once later. Do you agree?
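
As a rough idea of what factoring out that pattern might look like, here is a hypothetical helper. `resolve_axis` does not exist in the codebase as far as this thread shows; it only assumes that values computed "along the axis" and "along the inverse axis" must be put back into (observation, sample) order before constructing a table.

```python
def resolve_axis(axis, axis_value, invaxis_value):
    """Return (observation_value, sample_value) for axis-relative values.

    Hypothetical helper: `axis_value` belongs to `axis`, and
    `invaxis_value` belongs to the other axis.
    """
    if axis == 'sample':
        return invaxis_value, axis_value
    if axis == 'observation':
        return axis_value, invaxis_value
    raise ValueError(f"Unknown axis: {axis!r}")


# Usage sketch: the same call works regardless of which axis is concatenated.
obs_ids, samp_ids = resolve_axis('sample', ['S1', 'S2'], ['O1'])
```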

@@ -2935,6 +2921,161 @@ def _intersect_id_order(self, a, b):
idx += 1
return new_order

def concat(self, others, axis='sample'):
Copy link
Member

After reading the code in here I think it makes sense to break it in a couple of private functions and probably upgrade this to a class method, it will be easier to do this change now than later and break compatibility on the API.

Copy link
Member Author

i stepped back from wanting to make this a classmethod as, while not ideal, it would be inconsistent with the rest of the API

Copy link
Member Author

...could possibly decompose...

Copy link
Member

Consistency with the rest of the API is important so I think it is fine to leave as it is.

Just checking: do you think it is worth having an "inplace" parameter and doing the modifications in place, given that this is not a class method? I don't think it is necessary, but I wanted to ask what you think about it and how it looks from an API point of view.

@wasade
Copy link
Member Author

wasade commented Nov 9, 2016

+1

On Nov 8, 2016 4:56 PM, "Jose Navas" notifications@github.com wrote:

@josenavas commented on this pull request.

In biom/table.py #720:

    # resolve invert axis ids and metadata
    tmp_inv_ids = list(table.ids(axis=invaxis))
    tmp_inv_ids.extend(missing_ids)
    tmp_inv_md = table.metadata(axis=invaxis)
    if tmp_inv_md is None:
        tmp_inv_md = [None] * len(table.ids())
    else:
        tmp_inv_md = list(tmp_inv_md)
    tmp_inv_md.extend([invaxis_metadata[i] for i in missing_ids])
    # resolve axis ids and metadata
    tmp_ids = list(table.ids(axis=axis))
    tmp_md = table.metadata(axis=axis)
    # resolve construction based off axis. This really should be

From your comment, it looks like this pattern is repeated in multiple parts of the code base, in such case I think it is ok to open an issue about it and fix all occurrences at once later. Do you agree?

@josenavas
Copy link
Member

@ElDeveloper wanna merge if you don't have any further comments?

@ElDeveloper ElDeveloper merged commit 3a54af7 into biocore:master Nov 10, 2016
Successfully merging this pull request may close these issues.

Add Table.hstack and Table.vstack
6 participants