Concat #720

wasade · 2016-11-04T01:08:44Z

Resolves #716. This ended up growing in complexity to handle the general case in which the IDs on one axis are disjoint while the IDs on the other axis need not be. I'd be curious whether others see ways to reduce or simplify the code complexity here.

Now, why is this method useful? Often, when merging tables, one axis is disjoint (e.g., the sample IDs in a meta-analysis). For this case, the method is roughly 75x faster than the existing merge method (43.9 s versus 582 ms in the benchmarks below). In addition, it supports concatenating multiple tables in a single call, whereas merge operates only on pairs, so further performance gains are likely in real-world scenarios.
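To make the disjoint-axis case concrete, here is a minimal sketch in plain Python. It is not the code in this PR: each table is a hypothetical stand-in (a dict mapping sample ID to a dict of observation counts), and `concat_disjoint` is an illustrative name. It only shows the idea of checking that the sample IDs are disjoint, taking the union of the observation IDs, and filling absent observations with zeros.

```python
def concat_disjoint(tables):
    """Concatenate tables whose sample IDs are pairwise disjoint.

    Each table is a hypothetical stand-in for a BIOM table:
    {sample_id: {observation_id: count}}.
    Returns (observation_ids, {sample_id: [counts in observation order]}).
    """
    seen_samples = set()
    seen_obs = set()
    observations = []  # union of observation IDs, in first-seen order
    merged = {}

    for table in tables:
        # the concatenation axis must be disjoint across all tables
        overlap = seen_samples.intersection(table)
        if overlap:
            raise ValueError("sample IDs are not disjoint: %s" % sorted(overlap))
        seen_samples.update(table)

        for sample_id, counts in table.items():
            merged[sample_id] = counts
            # the other axis may overlap, so collect its union
            for obs_id in counts:
                if obs_id not in seen_obs:
                    seen_obs.add(obs_id)
                    observations.append(obs_id)

    # observations a sample never had are filled with 0.0
    rows = {s: [c.get(o, 0.0) for o in observations] for s, c in merged.items()}
    return observations, rows


a = {'S1': {'O1': 1.0, 'O2': 2.0}}
b = {'S2': {'O2': 3.0, 'O3': 4.0}}
obs_ids, rows = concat_disjoint([a, b])
```

Because every table is visited once and no pairwise merge of intermediate results is needed, any number of tables can be handled in one pass.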

The first benchmark shows approximately a 20x improvement. Profiling then identified a hotspot in sort_order, which was also fixed.

In [1]: import biom

In [2]: a = biom.load_table('twins.qiita-2014.biom')

In [3]: b = biom.load_table('ag.qiita-10317.biom')

In [4]: a
Out[4]: 8735 x 1046 <class 'biom.table.Table'> with 344079 nonzero entries (3% dense)

In [5]: b
Out[5]: 9128 x 9919 <class 'biom.table.Table'> with 1508178 nonzero entries (1% dense)

In [6]: %timeit c = a.concat([b])
1 loop, best of 3: 2.25 s per loop

In [7]: %timeit c = a.merge(b)
1 loop, best of 3: 43.9 s per loop

A quick profile suggested a hotspot in sort_order stemming from a) an implicit cast to dense (which also stood to bloat memory) and b) utilizing incremental indexing instead of fancy indexing. So I fixed that. New timing below:

In [1]: import biom

In [2]: a = biom.load_table('twins.qiita-2014.biom')

In [3]: b = biom.load_table('ag.qiita-10317.biom')

In [4]: %timeit c = a.concat([b])
1 loop, best of 3: 582 ms per loop
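The difference between incremental and fancy indexing can be sketched with NumPy alone. This is an illustration under assumptions, not the `sort_order` implementation: a small dense array stands in for the table's sparse matrix, and the `index` dict stands in for `Table.index`.

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)  # 3 observations x 4 samples
ids = ['S1', 'S2', 'S3', 'S4']
order = ['S3', 'S1', 'S4', 'S2']
index = {id_: i for i, id_ in enumerate(ids)}  # stand-in for Table.index

# incremental indexing: pull one column per ID, then reassemble
incremental = np.column_stack([data[:, index[i]] for i in order])

# fancy indexing: build the positions once, reorder in a single operation
fancy = np.array([index[i] for i in order], dtype=int)
reordered = data[:, fancy]
```

Both approaches produce the same column order; the fancy-indexing form does the reordering in one vectorized step instead of one Python-level slice per ID.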

wasade · 2016-11-04T01:20:54Z

@antgonza relevant for qiita

@gregcaporaso relevant for q2-feature-table on its merge

coveralls · 2016-11-04T01:24:21Z

Coverage increased (+0.1%) to 86.28% when pulling ab2414d on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-04T01:27:10Z

Coverage increased (+0.1%) to 86.28% when pulling 40a3c35 on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-04T01:31:33Z

Coverage increased (+0.1%) to 86.28% when pulling 1d935a2 on wasade:concat into 1ebd2d1 on biocore:master.

wasade · 2016-11-05T18:11:51Z

This resolves #717 as well, adding a note and test in a second.

antgonza

Couple of comments.

antgonza · 2016-11-05T18:21:39Z

biom/table.py

+        O5	0.0	0.0	0.0	0.0	0.0	0.0	15.0	16.0	17.0
+
+        """
+        # should this be a staticmethod?


Good question, yes? No real preference but we should remove this comment.

antgonza · 2016-11-05T18:23:25Z

biom/table.py

+                tmp_ids = list(table.ids(axis=axis))
+                tmp_md = table.metadata(axis=axis)
+
+                # this sucks.


????????????????

coveralls · 2016-11-05T18:26:51Z

Coverage increased (+0.1%) to 86.28% when pulling 8409a43 on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-05T20:08:12Z

Coverage increased (+0.1%) to 86.28% when pulling 05de544 on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-05T20:09:32Z

Coverage increased (+0.1%) to 86.28% when pulling 7a4cb03 on wasade:concat into 1ebd2d1 on biocore:master.

wasade · 2016-11-05T20:18:35Z

Thanks @antgonza, is this good then? I'm not thrilled with the complexity but meh...

coveralls · 2016-11-05T20:23:26Z

Coverage increased (+0.1%) to 86.28% when pulling b4b0d1d on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-05T20:39:33Z

Coverage increased (+0.1%) to 86.28% when pulling 0c6dac7 on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-05T20:46:19Z

Coverage increased (+0.05%) to 86.185% when pulling 5221508 on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-05T21:04:31Z

Coverage increased (+0.05%) to 86.185% when pulling e7300ac on wasade:concat into 1ebd2d1 on biocore:master.

coveralls · 2016-11-05T21:23:16Z

Coverage increased (+0.05%) to 86.185% when pulling fb824c8 on wasade:concat into 1ebd2d1 on biocore:master.

ElDeveloper · 2016-11-05T22:09:48Z

This looks good, just a few minor comments. As for your original question, maybe making it a class method makes more sense. It really should just be a function somewhere, but a class method makes organizational sense 👍

wasade · 2016-11-05T22:31:25Z

@ElDeveloper, I didn't see any comments?

I'm tempted to defer doing this as a class or staticmethod because, while not good, it's more consistent for this to be a Table instance method...

ElDeveloper · 2016-11-05T21:45:25Z

ChangeLog.md

@@ -15,6 +15,8 @@ New Features:
 * `Table.rankdata` has been added to convert values to ranked abundances on
  either axis. See [issue #645](https://github.com/biocore/biom-format/issues/639).
 * Format of numbers in ``biom summarize-table`` output is now more readable and localized. See [issue #679](https://github.com/biocore/biom-format/issues/679).
+* `Table.concat` has been added to the API and allows for concatenating multiple tables in which the IDs of one of the axes are known to be disjoint. This has substantial performance benefit over `Table.merge`.


benefit -> benefits

ElDeveloper · 2016-11-05T21:50:05Z

biom/table.py

@@ -1795,39 +1797,23 @@ def sort_order(self, order, axis='sample'):
        O2	1.0	0.0	4.0

        """
-        md = []
-        vals = []
+        fancy = np.array([self.index(i, axis=axis) for i in order], dtype=int)


If these were relative abundances, this line would fail, right?

ElDeveloper · 2016-11-05T21:51:21Z

biom/table.py

+        ----------
+        others : iterable of biom.Table
+            Tables to concatenate
+        axis : {'sample', 'observation'}


, optional is missing here.

ElDeveloper · 2016-11-05T21:58:59Z

biom/table.py

+        others : iterable of biom.Table
+            Tables to concatenate
+        axis : {'sample', 'observation'}
+            The axis to concatenate on. i.e., if axis is 'sample', then tables


I am a bit unclear as to why this needs to be specified, what's the reason that you need the axis in order to concatenate the table?

Oh, I get it now.

ElDeveloper · 2016-11-05T22:02:57Z

biom/table.py

+
+            # test we have disjoint IDs
+            if not axis_ids.isdisjoint(table_axis_ids):
+                raise DisjointIDError("IDs are not disjoint")


Maybe worth adding more information noting what the offending table is?

It would take a little bit of bookkeeping to know which two tables were in conflict, and more than two might be in conflict. Blocking?
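For reference, the bookkeeping being discussed could look roughly like the sketch below. It is a standalone illustration, not a proposed patch: `check_disjoint` is a hypothetical helper, it works on plain lists of IDs, and it raises `ValueError` rather than the library's `DisjointIDError` so that it runs on its own.

```python
def check_disjoint(id_lists):
    """Raise ValueError naming every ID found in more than one list.

    `id_lists` holds one list of IDs per table, in table order.
    """
    first_seen = {}  # ID -> position of the first table containing it
    conflicts = {}   # ID -> positions of all tables containing it
    for pos, ids in enumerate(id_lists):
        for id_ in set(ids):
            if id_ in first_seen:
                conflicts.setdefault(id_, [first_seen[id_]]).append(pos)
            else:
                first_seen[id_] = pos

    if conflicts:
        detail = "; ".join("%r in tables %s" % (id_, positions)
                           for id_, positions in sorted(conflicts.items()))
        raise ValueError("IDs are not disjoint: " + detail)


try:
    check_disjoint([['a', 'b'], ['b', 'c'], ['b']])
except ValueError as e:
    message = str(e)
```

This handles the case where more than two tables share an ID, at the cost of one extra dict per call.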

ElDeveloper · 2016-11-05T22:04:18Z

biom/table.py

@@ -2935,6 +2921,161 @@ def _intersect_id_order(self, a, b):
                idx += 1
        return new_order

+    def concat(self, others, axis='sample'):


Not blocking, but just making an observation, this method is a little bit long, is there any way to break it down a bit?

Yea, I know. Not enthusiastic. A bunch of private methods perhaps...?

ElDeveloper · 2016-11-05T22:08:03Z

tests/test_table.py

+
+        with self.assertRaises(DisjointIDError):
+            example_table.concat([example_table], axis='observation')
+


What happens with an empty list of tables?

works as expected?

In [2]: biom.example_table.concat([biom.Table([], [], [])])
Out[2]: 2 x 3 <class 'biom.table.Table'> with 5 nonzero entries (83% dense)

ElDeveloper · 2016-11-05T22:08:27Z

tests/test_table.py

@@ -1620,6 +1712,15 @@ def test_update_ids_cache_bug(self):
        exp_index = {'x': 0, 'y': 1}
        self.assertEqual(obs._sample_index, exp_index)

+    def test_other_spmatrix_type(self):
+        # I dont actually remember what bug stemmed from this...


It came up in two scenarios. I could dig through the code but not really excited about that as it would necessitate a bunch of parallel exploration in an interpreter. I could just delete the comment? :)

ElDeveloper · 2016-11-05T22:45:08Z

> I'm tempted to defer doing this as a class or staticmethod because, while not good, it's more consistent for this to be a Table instance method...

Sounds good.

ElDeveloper · 2016-11-05T22:45:27Z

Also, comments posted, that was my bad (GitHub's new UI).

wasade · 2016-11-05T22:53:13Z

biom/table.py

@@ -1795,39 +1797,23 @@ def sort_order(self, order, axis='sample'):
        O2	1.0	0.0	4.0

        """
-        md = []
-        vals = []
+        fancy = np.array([self.index(i, axis=axis) for i in order], dtype=int)


These are index positions, so the content of the matrix doesn't matter. The dtype is forced to int because numpy would otherwise default to float.
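One situation where the default matters is an empty `order`: NumPy infers an integer dtype from a non-empty list of Python ints, but an empty list becomes a `float64` array, which is not usable as an index. The snippet below is a small illustration with made-up data, not code from the PR.

```python
import numpy as np

data = np.arange(6.0).reshape(2, 3)

inferred = np.array([2, 0, 1])             # integer dtype inferred from the values
empty_default = np.array([])               # no values to infer from: float64
empty_forced = np.array([], dtype=int)     # explicit dtype keeps it a valid index

# an empty integer index selects zero columns and keeps the row count
selected = data[:, empty_forced]
```

Forcing `dtype=int` therefore keeps the index array valid regardless of how many IDs are passed.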

ElDeveloper · 2016-11-05T22:54:07Z

Ah, ok, this makes sense!

ElDeveloper · 2016-11-05T22:57:53Z

Not blocking!

ElDeveloper · 2016-11-05T22:58:37Z

Ah, no not what I meant, I meant:

bt.concat([])

wasade · 2016-11-05T22:59:25Z

Thanks!

ElDeveloper · 2016-11-05T23:02:16Z

That sounds like a better solution haha.

ElDeveloper · 2016-11-05T23:02:39Z

Also, looks like GitHub does not link messages together :[

coveralls · 2016-11-05T23:07:12Z

Coverage increased (+0.05%) to 86.185% when pulling 0b2160b on wasade:concat into 1ebd2d1 on biocore:master.

wasade · 2016-11-05T23:08:01Z

Thanks! Will test empty case and resolve remaining in about an hour

wasade · 2016-11-06T00:41:06Z

I think that's it. Decomposition would be nice but still not really sure
what would be the best here.

coveralls · 2016-11-06T00:45:06Z

Coverage increased (+0.05%) to 86.185% when pulling 9f7dc42 on wasade:concat into 1ebd2d1 on biocore:master.

antgonza · 2016-11-06T13:44:40Z

This looks good to me. I will defer to @ElDeveloper to merge.

josenavas

Thanks @wasade ! I made some comments, although I don't have a strong preference on them. It mainly depends on how we see the future of this functionality.

josenavas · 2016-11-08T00:48:57Z

biom/cli/table_subsetter.py

@@ -64,7 +64,10 @@ def subset_table(input_hdf5_fp, input_json_fp, axis, ids, output_fp):
            input_json_fp = f.read()

    with open(ids, 'U') as f:
-        ids = [line.strip() for line in f]
+        ids = []


Minor, but isn't this the same?

ids = [line.strip().split('\t')[0] for line in f if not line.startswith('#')]

You know, to avoid appends (no need to change if you think it makes the code less readable).

Yup, but it made for a long line and this isn't a performance-critical block.
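For readers without the full diff, the explicit-loop form being discussed presumably looks something like the sketch below. The exact code in the PR is not shown in this thread, so this is reconstructed from the one-liner above; `read_ids` is a hypothetical helper name.

```python
import io


def read_ids(lines):
    """Collect the first tab-separated field of every non-comment line."""
    ids = []
    for line in lines:
        # lines starting with '#' are treated as comments or headers
        if line.startswith('#'):
            continue
        ids.append(line.strip().split('\t')[0])
    return ids


# a file-like object stands in for the opened IDs file
example = io.StringIO("#SampleID\tdescription\nS1\tfirst sample\nS2\n")
ids = read_ids(example)
```

The loop and the comprehension are equivalent; the loop trades a few extra lines for shorter line lengths.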

josenavas · 2016-11-08T00:50:52Z

biom/table.py

+        all_tables.insert(0, self)
+
+        # verify disjoint, and fetch all ids from all tables
+        for table in all_tables:


I think this for loop is a good candidate for a small private function.

josenavas · 2016-11-08T00:51:40Z

biom/table.py

+            missing_ids = list(invaxis_ids - set(table.ids(axis=invaxis)))
+
+            if missing_ids:
+                # determine new shape


I think the contents of this if statement can also be broken out into another small private function.
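As a rough idea of what such a helper might cover, the sketch below pads a table with all-zero rows for IDs it is missing on the non-concatenation axis. It is an assumption-laden illustration: `pad_missing` is a hypothetical name, a dense NumPy array stands in for the sparse matrix, and metadata handling is left out.

```python
import numpy as np


def pad_missing(data, ids, all_ids):
    """Append all-zero rows for IDs in `all_ids` that `ids` lacks.

    Returns the padded matrix and the extended ID list.
    """
    present = set(ids)
    missing = [i for i in all_ids if i not in present]
    if not missing:
        return data, list(ids)

    # determine the new shape: one extra row per missing ID
    padding = np.zeros((len(missing), data.shape[1]), dtype=data.dtype)
    return np.vstack([data, padding]), list(ids) + missing


data = np.array([[1.0, 2.0], [3.0, 4.0]])
padded, new_ids = pad_missing(data, ['O1', 'O2'], ['O1', 'O2', 'O3'])
```

Isolating this step would let the main loop in `concat` read as "collect IDs, pad each table, stack".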

josenavas · 2016-11-08T00:52:50Z

biom/table.py

+                tmp_ids = list(table.ids(axis=axis))
+                tmp_md = table.metadata(axis=axis)
+
+                # resolve construction based off axis. This really should be


Would it be too much pain to put it in a classmethod?

The resolution off the axis is something which definitely should be a class method, and it would be nice to do as it would remove this common pattern from the codebase.

From your comment, it looks like this pattern is repeated in multiple parts of the code base. In that case, I think it is OK to open an issue about it and fix all occurrences at once later. Do you agree?

josenavas · 2016-11-08T00:54:48Z

biom/table.py

@@ -2935,6 +2921,161 @@ def _intersect_id_order(self, a, b):
                idx += 1
        return new_order

+    def concat(self, others, axis='sample'):


After reading the code in here, I think it makes sense to break it into a couple of private functions and probably upgrade this to a class method. It will be easier to make this change now than to do it later and break API compatibility.

I stepped back from wanting to make this a classmethod as, while not ideal, it would be inconsistent with the rest of the API.

...could possibly decompose...

Consistency with the rest of the API is important so I think it is fine to leave as it is.

Just checking: do you think it is worth having an "inplace" parameter and doing the modifications in place, given that this is not a class method? I don't think it is necessary, but I wanted to ask what you think and how it looks from an API point of view.

wasade · 2016-11-09T00:57:57Z

+1

josenavas · 2016-11-09T01:21:30Z

@ElDeveloper wanna merge if you don't have any further comments?

wasade added 4 commits November 2, 2016 17:12

TST: tests for concat

48d3693

API: added Table.concat

50463d1

PERF: speed boost and memory reduction on Table.sort_order

eaa5668

STY: flake8

ab2414d

DOC: changelog was linked wrong

40a3c35

DOC: improved a comment

1d935a2

BUG: fixes biocore#717

8409a43

antgonza reviewed Nov 5, 2016

wasade added this to the 2.1.6 milestone Nov 5, 2016

wasade added 2 commits November 5, 2016 13:00

DOC: resolving biocore#685

05de544

MAINT: resolves biocore#677

7a4cb03

Addressing @antgonza's comments

b4b0d1d

wasade mentioned this pull request Nov 5, 2016

Create method to add more data (observation and samples) #475

Closed

Resolves biocore#582

0c6dac7

Resolves biocore#587

5221508

Resolves biocore#380

e7300ac

Classifiers didn't mention py3

fb824c8

ElDeveloper reviewed Nov 5, 2016

wasade commented Nov 5, 2016

Addressing @ElDeveloper's comments

0b2160b

Addressing @ElDeveloper's comments

9f7dc42

josenavas reviewed Nov 8, 2016

ElDeveloper merged commit 3a54af7 into biocore:master Nov 10, 2016

