-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add ak.merge_union_of_lists
#2235
Conversation
ak.merge_union_of_lists
if not all(x.is_list for x in layout.contents): | ||
raise ak._errors.wrap_error(NotImplementedError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add support for all list types, add tests, etc.
Codecov Report
Additional details and impacted files
|
Will this work for arbitrarily many lists deep? Will it work in the presence of any option-types (keeping in mind that unions and options already interact)? Most importantly, does it solve @ivirshup's problem? |
Yes, in its current form :)
This is another question (hence the draft status)! My approach with these functions is to keep their scope small; operating only upon parent-child layouts where possible. This means that there are cases where multiple functions are needed: >>> x = ak.concatenate(([{'x': 1}], [{'y':2}]))
>>> ak.merge_union_of_records(x)
<Array [{x: 1, y: None}, {x: None, ...}] type='2 * {x: ?int64, y: ?int64}'> vs >>> y = ak.concatenate(([{'x': 1}], [None,{'y':2}]))
>>> ak.merge_union_of_records(y)
<Array [{x: 1}, None, {y: 2}] type='3 * union[{x: int64}, ?{y: int64}]'> This latter example needs the user to first call >>> y = ak.concatenate(([{'x': 1}], [None,{'y':2}]))
>>> ak.merge_option_of_records(y)
<Array [{x: 1}, {y: None}, {y: 2}] type='3 * union[{x: int64}, {y: ?int64}]'>
>>> ak.merge_union_of_records(ak.merge_option_of_records(y))
<Array [{x: 1, y: None}, ..., {x: None, ...}] type='3 * {x: ?int64, y: ?int64}'> My view is that the same should apply here, if possible. |
Well, if it works one list deep, then it can be applied repeatedly (at a specified I didn't want the scope to balloon in the direction of needing yet another function for merging options. This is starting to sound like something that would be cleaner to do some other way. @ivirshup, why are you concatenating different types? We should understand the use-case better. (I usually think of concatenation for same-type or what ought to be same-typed data, such as partitions of a large dataset. How is your use-case different?) |
This is an important question w.r.t the original motivation. I'm continuing this discussion given that these functions are intrinsically useful irrespective of this problem.
Yes, the problem is when you have mixed option — non-option. This is something that we can (and should) handle. The benefit of adding
I'm on the same page. And, although I took a wide route, I don't think we need a new function in this case. We should just be handling unions given that we know what the result should look like. |
The use case is really "concatenation of very similar types". E.g the likely use cases for I can give a toy example, but let me know if you'd like something more concrete (it will get a bit bio-y). Lets say a certain experiment results in data that looks like awkward will give a union type here. This is an issue for a number of reasons, including that they're a little difficult to reason about, to export to other libraries and it'll error if I try to access import awkward as ak, pandas as np, numpy as np
exp_a = ak.Array([
[{"measure_a": .42, "measure_b": .3, "additional_value": 3}, {"measure_a": .26, "measure_b": .03, "additional_value": 9}],
[{"measure_a": .29, "measure_b": .01, "additional_value": 0}, {"measure_a": .41, "measure_b": .62, "additional_value": 78}, {"measure_a": .31, "measure_b": .54, "additional_value": 62}],
])
exp_b = ak.Array([
[{"measure_a": .42, "measure_b": .3}],
[{"measure_a": .29, "measure_b": .01}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}],
[{"measure_a": .26, "measure_b": .03}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}],
])
ak.concatenate([exp_a, exp_b]).type.show()
# 5 * union[
# var * {
# measure_a: float64,
# measure_b: float64,
# additional_value: int64
# },
# var * {
# measure_a: float64,
# measure_b: float64
# }
# ] If I were to do this in pandas, I would have a dataframe like exp_a_df = ak.to_dataframe(exp_a).astype({"additional_value": pd.Int64Dtype()})
exp_b_df = ak.to_dataframe(exp_b)
pd.concat([exp_a_df, exp_b_df], join="outer")
# measure_a measure_b additional_value
# entry subentry
# 0 0 0.42 0.30 3
# 1 0.26 0.03 9
# 1 0 0.29 0.01 0
# 1 0.41 0.62 78
# 2 0.31 0.54 62
# 0 0 0.42 0.30 <NA>
# 1 0 0.29 0.01 <NA>
# 1 0.41 0.62 <NA>
# 2 0.31 0.54 <NA>
# 2 0 0.26 0.03 <NA>
# 1 0.41 0.62 <NA>
# 2 0.31 0.54 <NA>
# 3 0.41 0.62 <NA>
# 4 0.31 0.54 <NA> The This PR will get a little closer to the goal here. There are a few other tasks I would need to do:
This should be manageable, but "checking if the result it mergable" may be difficult, and each call to one of the |
Maybe we're going about this in too complicated a way, trying to fix a union after it has been created. The situation of having new data with more fields than old data is a familiar one, which I'd call schema evolution. If it always involves record types, such as >>> old_data = ak.Array([[{"a": 1}], [], [{"a": 2}]])
>>> old_data.show(type=True)
type: 3 * var * {
a: int64
}
[[{a: 1}],
[],
[{a: 2}]]
>>> new_data = ak.Array([[], [{"a": 3.3, "b": 3.3}, {"a": 4.4, "b": 4.4}]])
>>> new_data.show(type=True)
type: 2 * var * {
a: float64,
b: float64
}
[[],
[{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]] just concatenating them makes an undesirable union, >>> ak.concatenate((old_data, new_data)).show(type=True)
type: 5 * union[
var * {
a: int64
},
var * {
a: float64,
b: float64
}
]
[[{a: 1}],
[],
[{a: 2}],
[],
[{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]] but if we explicitly set missing values for the records that don't have them, >>> old_data["b"] = None
>>> old_data.show(type=True)
type: 3 * var * {
a: int64,
b: ?unknown
}
[[{a: 1, b: None}],
[],
[{a: 2, b: None}]] then they'll concatenate the way you want them without even needing the special function: >>> ak.concatenate((old_data, new_data)).show(type=True)
type: 5 * var * {
a: float64,
b: ?float64
}
[[{a: 1, b: None}],
[],
[{a: 2, b: None}],
[],
[{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]] This kind of procedure doesn't have to be manual: >>> newer_data = ak.Array([[{"b": 5, "c": []}, {"b": 6.6, "c": [1, 2]}]])
>>> all_fields = set(old_data.fields) | set(new_data.fields) | set(newer_data.fields)
>>> all_fields
{'a', 'c', 'b'}
>>> for field in all_fields:
... if field not in old_data.fields:
... old_data[field] = None
... if field not in new_data.fields:
... new_data[field] = None
... if field not in newer_data.fields:
... newer_data[field] = None
...
>>> ak.concatenate((old_data, new_data, newer_data)).show(type=True)
type: 6 * var * {
a: ?float64,
b: ?float64,
c: option[var * int64]
}
[[{a: 1, b: None, c: None}],
[],
[{a: 2, b: None, c: None}],
[],
[{a: 3.3, b: 3.3, c: None}, {a: 4.4, b: 4.4, c: None}],
[{a: None, b: 5, c: []}, {a: None, b: 6.6, c: [1, ...]}]] But it is strictly acting on records; there's no equivalent of this for unions of other types, such as Integers → strings doesn't promote, regardless of whether they're in records or not: >>> even_newer_data = ak.Array([[{"a": "hello"}]])
>>> ak.concatenate((old_data, new_data, newer_data, even_newer_data)).show(type=True)
type: 7 * union[
var * {
a: ?float64,
b: ?float64,
c: option[var * int64]
},
var * {
a: string
}
]
[[{a: 1, b: None, c: None}],
[],
[{a: 2, b: None, c: None}],
[],
[{a: 3.3, b: 3.3, c: None}, {a: 4.4, b: 4.4, c: None}],
[{a: None, b: 5, c: []}, {a: None, b: 6.6, c: [1, ...]}],
[{a: 'hello'}]] |
Yes, I think this was actually the approach I had been exploring. It starts to fall down if the user provided unions or you hit nested types like:
If you look at the example code for the relevant issue in AnnData (scverse/anndata#898), I was hoping to use the For simple record types, I'd been doing a similar union of records to achieve the above (as discussed: #1668 (comment)). |
It doesn't fall down all the way: >>> old_data = ak.Array([{"gene_id": "meow", "exons": [{"exon_start": 100}]}])
>>> new_data = ak.Array([{"gene_id": "woof", "exons": [{"exon_start": 200, "exon_end": 300}]}])
>>> old_data["exons", "exon_end"] = None
>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 2 * {
gene_id: string,
exons: var * {
exon_start: int64,
exon_end: ?int64
}
}
[{gene_id: 'meow', exons: [{exon_start: 100, ...}]},
{gene_id: 'woof', exons: [{exon_start: 200, ...}]}] It has become more complicated to find out what fields are needed and where they're needed, but it's still possible: >>> new_data.fields
['gene_id', 'exons']
>>> new_data["exons"].fields
['exon_start', 'exon_end']
>>> new_data["exons", "something_deeper"].fields
... (so I suppose there would be some tree-recursion needed there). All of these are working because string tuple-items in square brackets means nested record fields. It's starting to look like a complex function, but if we can generalize it and give it a good name, that might be the way to go. This method differs from
Maybe it should be a non-default option of |
@jpivarski and I discussed this in our meeting today. We proposed the addition of a new >>> arrays = ak.broadcast_fields([
[{'x': 1}],
[{'y': 2}],
[{'z': 2}],
])
>>> arrays[0].type.show()
3 * {
x: ?int64,
y: ?int64,
z: ?int64
} This function would find the common type of missing fields, and add them in the appropriate places. Where there is no common type, "common type" means "union of types". You could then use this function with ak.concatenate(
ak.broadcast_fields(...)
) By creating unions as deep as possible, this function would not require Does this sound like a good approach? |
sounds good to me! Happy to give it a try when you have a prototype. One question that I have from your example: When |
Yes, a mistake on my part! I've corrected the example. |
Closing this PR in favour of #2267 |
This PR adds a new
ak.merge_union_of_lists
, which pushes unions inside of list types, i.e.This is useful in combination with
ak.merge_union_of_records
, which cannot operate uponUnion[var * Record, ...]
.Fixes #2211
is_option
types inak.merge_union_of_lists
is_option
types inak.merge_union_of_options