-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concat #720
Concat #720
Conversation
@antgonza relevant for qiita @gregcaporaso relevant for |
1 similar comment
This resolves #717 as well, adding a note and test in a second. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of comments.
O5 0.0 0.0 0.0 0.0 0.0 0.0 15.0 16.0 17.0 | ||
|
||
""" | ||
# should this be a staticmethod? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, yes? No real preference but we should remove this comment.
tmp_ids = list(table.ids(axis=axis)) | ||
tmp_md = table.metadata(axis=axis) | ||
|
||
# this sucks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
????????????????
Thanks @antgonza, is this good then? I'm not thrilled with the complexity but meh... |
2 similar comments
This looks good, just a few minor comments. As for your original question, maybe making it a class method makes more sense. It really should just be a function somewhere, but a class method makes organizational sense 👍 |
@ElDeveloper, i didn't see any comments? I'm tempted to defer doing this as a class or staticmethod because, while not good, its more consistent for this to be a |
@@ -15,6 +15,8 @@ New Features: | |||
* `Table.rankdata` has been added to convert values to ranked abundances on | |||
either axis. See [issue #645](https://github.com/biocore/biom-format/issues/639). | |||
* Format of numbers in ``biom summarize-table`` output is now more readable and localized. See [issue #679](https://github.com/biocore/biom-format/issues/679). | |||
* `Table.concat` has been added to the API and allows for concatenating multiple tables in which the IDs of one of the axes are known to be disjoint. This has substantial performance benefit over `Table.merge`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
benefit -> benefits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -1795,39 +1797,23 @@ def sort_order(self, order, axis='sample'): | |||
O2 1.0 0.0 4.0 | |||
|
|||
""" | |||
md = [] | |||
vals = [] | |||
fancy = np.array([self.index(i, axis=axis) for i in order], dtype=int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these were relative abundances, this line would fail, right?
---------- | ||
others : iterable of biom.Table | ||
Tables to concatenate | ||
axis : {'sample', 'observation'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, optional
is missing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
others : iterable of biom.Table | ||
Tables to concatenate | ||
axis : {'sample', 'observation'} | ||
The axis to concatenate on. i.e., if axis is 'sample', then tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit unclear as to why this needs to be specified, what's the reason that you need the axis in order to concatenate the table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I get it now.
|
||
# test we have disjoint IDs | ||
if not axis_ids.isdisjoint(table_axis_ids): | ||
raise DisjointIDError("IDs are not disjoint") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth adding more information noting what the offending table is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would take a little bit of book keeping to know which two tables were in conflict, and it might be more than 2 in conflict. Blocking?
@@ -2935,6 +2921,161 @@ def _intersect_id_order(self, a, b): | |||
idx += 1 | |||
return new_order | |||
|
|||
def concat(self, others, axis='sample'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking, but just making an observation, this method is a little bit long, is there any way to break it down a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I know. Not enthusiastic. A bunch of private methods perhaps...?
|
||
with self.assertRaises(DisjointIDError): | ||
example_table.concat([example_table], axis='observation') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with an empty list of tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works as expected?
In [2]: biom.example_table.concat([biom.Table([], [], [])])
Out[2]: 2 x 3 <class 'biom.table.Table'> with 5 nonzero entries (83% dense)
@@ -1620,6 +1712,15 @@ def test_update_ids_cache_bug(self): | |||
exp_index = {'x': 0, 'y': 1} | |||
self.assertEqual(obs._sample_index, exp_index) | |||
|
|||
def test_other_spmatrix_type(self): | |||
# I dont actually remember what bug stemmed from this... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It came up in two scenarios. I could dig through the code but not really excited about that as it would necessitate a bunch of parallel exploration in an interpreter. I could just delete the comment? :)
Sounds good. |
Also, comments posted, that was my bad (GitHub's new UI). |
@@ -1795,39 +1797,23 @@ def sort_order(self, order, axis='sample'): | |||
O2 1.0 0.0 4.0 | |||
|
|||
""" | |||
md = [] | |||
vals = [] | |||
fancy = np.array([self.index(i, axis=axis) for i in order], dtype=int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are index positions so content of the matrix doesn't matter. Forcing int
as numpy by default does float
.
Ah, ok, this makes sense! On (Nov-05-16|15:53), Daniel McDonald wrote:
|
Not blocking! On (Nov-05-16|15:56), Daniel McDonald wrote:
|
Ah, no not what I meant, I meant: bt.concat([]) On (Nov-05-16|15:57), Daniel McDonald wrote:
|
Thanks! |
That sounds like a better solution haha. On (Nov-05-16|15:59), Daniel McDonald wrote:
|
Also, looks like GitHub does not link messages together :[ |
Thanks! Will test empty case and resolve remaining in about an hour On Nov 5, 2016 4:02 PM, "Yoshiki Vázquez Baeza" notifications@github.com
|
I think that's it. Decomposition would be nice but still not really sure On Sat, Nov 5, 2016 at 4:07 PM, Daniel T. McDonald <
|
This looks good to me. I will defer to @ElDeveloper to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wasade ! I made some comments, although I don't have a strong preference on them. It mainly depends on how we see the future of this functionality.
@@ -64,7 +64,10 @@ def subset_table(input_hdf5_fp, input_json_fp, axis, ids, output_fp): | |||
input_json_fp = f.read() | |||
|
|||
with open(ids, 'U') as f: | |||
ids = [line.strip() for line in f] | |||
ids = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor but isn't this the same?
ids = [line.strip().split('\t')[0] for line in f if not line.startwith('#')
you know to avoid appends (no need to change if you think it makes the code less readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, made for a long line and this isnt a performance critical block
all_tables.insert(0, self) | ||
|
||
# verify disjoint, and fetch all ids from all tables | ||
for table in all_tables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this for loop is a good candidate for a small private function.
missing_ids = list(invaxis_ids - set(table.ids(axis=invaxis))) | ||
|
||
if missing_ids: | ||
# determine new shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the contents of this if statement can also be broken in another small private function
tmp_ids = list(table.ids(axis=axis)) | ||
tmp_md = table.metadata(axis=axis) | ||
|
||
# resolve construction based off axis. This really should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be too much pain to put it in a classmethod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the resolution off the axis is something which definitely should be a class method and would be nice to do as it would remove this common pattern from the codebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From your comment, it looks like this pattern is repeated in multiple parts of the code base, in such case I think it is ok to open an issue about it and fix all occurrences at once later. Do you agree?
@@ -2935,6 +2921,161 @@ def _intersect_id_order(self, a, b): | |||
idx += 1 | |||
return new_order | |||
|
|||
def concat(self, others, axis='sample'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the code in here I think it makes sense to break it in a couple of private functions and probably upgrade this to a class method, it will be easier to do this change now than later and break compatibility on the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i stepped back from wanting to make this a classmethod as, while not ideal, it would be inconsistent with the rest of the API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...could possibly decompose...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency with the rest of the API is important so I think it is fine to leave as it is.
Just checking, do you think it is worth having an "inplace" parameter and do the modifications in place given that this is not a class method? I don't think it is necessary but wanted to ask to see what you think about it and how does it looks from an API point of view.
+1 On Nov 8, 2016 4:56 PM, "Jose Navas" notifications@github.com wrote:
|
@ElDeveloper wanna merge if you don't have any further comments? |
Resolves #716. This ended up growing in complexity to handle the general case in which one set of IDs are disjoint while the other axis doesn't need to be. I'd be curious if others see ways to reduce or simplify some of the code complexity here.
Now, why is this method useful? Often, when merging tables, one axis is disjoint (e.g., the sample IDs in a meta-analysis). This method offers 80x or so improvement over the existing
merge
method for this case. In addition, it supports concatenating multiple tables whilemerge
must operate only on pairs thus offering likely further performance gains in real world scenarios.The first benchmark shows approximately a 20x improvement. When profiling, a hotspot was identified in
sort_order
, which was also rectified.A quick profile suggested a hotspot in
sort_order
stemming from a) an implicit cast to dense (which also stood to bloat memory) and b) utilizing incremental indexing instead of fancy indexing. So I fixed that. New timing below: