feat: add `ak.merge_union_of_lists` #2235

agoose77 · 2023-02-13T14:18:20Z

This PR adds a new ak.merge_union_of_lists, which pushes unions inside of list types, i.e.

Union[List[X], List[Y]] → List[Union[X, Y]]

This is useful in combination with ak.merge_union_of_records, which cannot operate upon Union[var * Record, ...].

Fixes #2211

Support is_option types in ak.merge_union_of_lists
Support is_option types in ak.merge_union_of_options

agoose77 · 2023-02-13T14:20:20Z

src/awkward/operations/ak_merge_union_of_lists.py

+                if not all(x.is_list for x in layout.contents):
+                    raise ak._errors.wrap_error(NotImplementedError)


TODO: add support for all list types, add tests, etc.

codecov · 2023-02-13T14:26:31Z

Codecov Report

Merging #2235 (9b3b87e) into main (7d10007) will decrease coverage by 0.15%.
The diff coverage is 38.63%.

Additional details and impacted files

Impacted Files	Coverage Δ
src/awkward/_do.py	`84.79% <ø> (+0.40%)`	⬆️
src/awkward/operations/ak_concatenate.py	`96.11% <ø> (ø)`
src/awkward/operations/ak_merge_union_of_lists.py	`0.00% <0.00%> (ø)`
src/awkward/_nplikes/typetracer.py	`74.92% <33.33%> (ø)`
src/awkward/_nplikes/numpylike.py	`73.80% <66.66%> (-1.02%)`	⬇️
src/awkward/typing.py	`83.33% <66.66%> (-5.56%)`	⬇️
src/awkward/operations/ak_almost_equal.py	`92.72% <75.00%> (-1.62%)`	⬇️
src/awkward/_nplikes/array_module.py	`89.84% <100.00%> (ø)`
src/awkward/_nplikes/shape.py	`78.68% <0.00%> (+1.63%)`	⬆️

jpivarski · 2023-02-13T14:32:35Z

Will this work for arbitrarily many lists deep? Will it work in the presence of any option-types (keeping in mind that unions and options already interact)?

Most importantly, does it solve @ivirshup's problem?

agoose77 · 2023-02-13T14:39:52Z

Most importantly, does it solve @ivirshup's problem?

Yes, in its current form :)

Will this work for arbitrarily many lists deep? Will it work in the presence of any option-types (keeping in mind that unions and options already interact)?

This is another question (hence the draft status)! My approach with these functions is to keep their scope small; operating only upon parent-child layouts where possible. This means that there are cases where multiple functions are needed:

>>> x = ak.concatenate(([{'x': 1}], [{'y':2}]))
>>> ak.merge_union_of_records(x)
<Array [{x: 1, y: None}, {x: None, ...}] type='2 * {x: ?int64, y: ?int64}'>

vs

>>> y = ak.concatenate(([{'x': 1}], [None,{'y':2}]))
>>> ak.merge_union_of_records(y)
<Array [{x: 1}, None, {y: 2}] type='3 * union[{x: int64}, ?{y: int64}]'>

This latter example needs the user to first call ak.merge_option_of_records:

>>> y = ak.concatenate(([{'x': 1}], [None,{'y':2}]))
>>> ak.merge_option_of_records(y)
<Array [{x: 1}, {y: None}, {y: 2}] type='3 * union[{x: int64}, {y: ?int64}]'>
>>> ak.merge_union_of_records(ak.merge_option_of_records(y))
<Array [{x: 1, y: None}, ..., {x: None, ...}] type='3 * {x: ?int64, y: ?int64}'>

My view is that the same should apply here, if possible.

jpivarski · 2023-02-13T14:45:56Z

Well, if it works one list deep, then it can be applied repeatedly (at a specified axis).

I didn't want the scope to balloon in the direction of needing yet another function for merging options. This is starting to sound like something that would be cleaner to do some other way.

@ivirshup, why are you concatenating different types? We should understand the use-case better. (I usually think of concatenation for same-type or what ought to be same-typed data, such as partitions of a large dataset. How is your use-case different?)

agoose77 · 2023-02-13T14:52:26Z

We should understand the use-case better.

This is an important question w.r.t the original motivation. I'm continuing this discussion given that these functions are intrinsically useful irrespective of this problem.

Well, if it works one list deep, then it can be applied repeatedly (at a specified axis).

Yes, the problem is when you have mixed option — non-option. This is something that we can (and should) handle. The benefit of adding ak.merge_option_of_lists is therefore mostly tangential to this discussion. My discussion above was framed in terms of what we've currently implemented. It actually does make sense for us to support Union[Option[X], Y].

I didn't want the scope to balloon in the direction of needing yet another function for merging options. This is starting to sound like something that would be cleaner to do some other way.

I'm on the same page. And, although I took a wide route, I don't think we need a new function in this case. We should just be handling unions given that we know what the result should look like.

ivirshup · 2023-02-14T11:11:46Z

We should understand the use-case better. (I usually think of concatenation for same-type or what ought to be same-typed data, such as partitions of a large dataset. How is your use-case different?)

The use case is really "concatenation of very similar types". E.g the likely use cases for pd.concat and xr.concat.

I can give a toy example, but let me know if you'd like something more concrete (it will get a bit bio-y).

Lets say a certain experiment results in data that looks like n * var * {measure_a: float, measure_b: float}. I've annotated an experiment with some computed value per measurement (or maybe an annotation) so now it looks like: n * var * {measure_a: float, measure_b: float, additional_value: int}. Now I want to combine this with some other experiment that doesn't have the additional value, and has the datashape of the base case. I want to maintain the values of additional_value for the observations that have it.

awkward will give a union type here. This is an issue for a number of reasons, including that they're a little difficult to reason about, to export to other libraries and it'll error if I try to access "additional_value" on the combined dataset.

import awkward as ak, pandas as np, numpy as np

exp_a = ak.Array([
    [{"measure_a": .42, "measure_b": .3, "additional_value": 3}, {"measure_a": .26, "measure_b": .03, "additional_value": 9}],
    [{"measure_a": .29, "measure_b": .01, "additional_value": 0}, {"measure_a": .41, "measure_b": .62, "additional_value": 78}, {"measure_a": .31, "measure_b": .54, "additional_value": 62}],
])

exp_b = ak.Array([
    [{"measure_a": .42, "measure_b": .3}],
    [{"measure_a": .29, "measure_b": .01}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}],
    [{"measure_a": .26, "measure_b": .03}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}],
])

ak.concatenate([exp_a, exp_b]).type.show()
# 5 * union[
#     var * {
#         measure_a: float64,
#         measure_b: float64,
#         additional_value: int64
#     },
#     var * {
#         measure_a: float64,
#         measure_b: float64
#     }
# ]

If I were to do this in pandas, I would have a dataframe like n * {observation: int, measure_a: float, measure_b: float, additional_value: int} and n * {entry: int, sub_entry:int, measure_a: float, measure_b: float}. I could call pd.concat([experiment_a, experiment_b], join="outer") and end up with n * {observation: int, measure_a: float, measure_b: float, additional_value: ?int}

exp_a_df = ak.to_dataframe(exp_a).astype({"additional_value": pd.Int64Dtype()})
exp_b_df = ak.to_dataframe(exp_b)

pd.concat([exp_a_df, exp_b_df], join="outer")
#                 measure_a  measure_b  additional_value
# entry subentry                                        
# 0     0              0.42       0.30                 3
#       1              0.26       0.03                 9
# 1     0              0.29       0.01                 0
#       1              0.41       0.62                78
#       2              0.31       0.54                62
# 0     0              0.42       0.30              <NA>
# 1     0              0.29       0.01              <NA>
#       1              0.41       0.62              <NA>
#       2              0.31       0.54              <NA>
# 2     0              0.26       0.03              <NA>
#       1              0.41       0.62              <NA>
#       2              0.31       0.54              <NA>
#       3              0.41       0.62              <NA>
#       4              0.31       0.54              <NA>

The join argument from xr.concat allows very similar behavior.

This PR will get a little closer to the goal here. There are a few other tasks I would need to do:

Check if the result is mergable. I'm not sure I want to bother merging arrays where I can't eliminate unions.
Iteratively call either merge_unions_of_records, merge_unions_of_lists (based on current top level union) until unions are eliminated.

This should be manageable, but "checking if the result it mergable" may be difficult, and each call to one of the merge_unions functions may be expensive (e.g. I think Union[var * {a}, var * {b}] -> var * Union[{a}, {b}] -> var * {?a, ?b} allocates a lot more than a 1 step process would).

jpivarski · 2023-02-14T16:11:16Z

Maybe we're going about this in too complicated a way, trying to fix a union after it has been created. The situation of having new data with more fields than old data is a familiar one, which I'd call schema evolution. If it always involves record types, such as

>>> old_data = ak.Array([[{"a": 1}], [], [{"a": 2}]])
>>> old_data.show(type=True)
type: 3 * var * {
    a: int64
}
[[{a: 1}],
 [],
 [{a: 2}]]
>>> new_data = ak.Array([[], [{"a": 3.3, "b": 3.3}, {"a": 4.4, "b": 4.4}]])
>>> new_data.show(type=True)
type: 2 * var * {
    a: float64,
    b: float64
}
[[],
 [{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]]

just concatenating them makes an undesirable union,

>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 5 * union[
    var * {
        a: int64
    },
    var * {
        a: float64,
        b: float64
    }
]
[[{a: 1}],
 [],
 [{a: 2}],
 [],
 [{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]]

but if we explicitly set missing values for the records that don't have them,

>>> old_data["b"] = None
>>> old_data.show(type=True)
type: 3 * var * {
    a: int64,
    b: ?unknown
}
[[{a: 1, b: None}],
 [],
 [{a: 2, b: None}]]

then they'll concatenate the way you want them without even needing the special function:

>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 5 * var * {
    a: float64,
    b: ?float64
}
[[{a: 1, b: None}],
 [],
 [{a: 2, b: None}],
 [],
 [{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]]

This kind of procedure doesn't have to be manual:

>>> newer_data = ak.Array([[{"b": 5, "c": []}, {"b": 6.6, "c": [1, 2]}]])

>>> all_fields = set(old_data.fields) | set(new_data.fields) | set(newer_data.fields)
>>> all_fields
{'a', 'c', 'b'}
>>> for field in all_fields:
...     if field not in old_data.fields:
...         old_data[field] = None
...     if field not in new_data.fields:
...         new_data[field] = None
...     if field not in newer_data.fields:
...         newer_data[field] = None
... 
>>> ak.concatenate((old_data, new_data, newer_data)).show(type=True)
type: 6 * var * {
    a: ?float64,
    b: ?float64,
    c: option[var * int64]
}
[[{a: 1, b: None, c: None}],
 [],
 [{a: 2, b: None, c: None}],
 [],
 [{a: 3.3, b: 3.3, c: None}, {a: 4.4, b: 4.4, c: None}],
 [{a: None, b: 5, c: []}, {a: None, b: 6.6, c: [1, ...]}]]

But it is strictly acting on records; there's no equivalent of this for unions of other types, such as union[int, str]. But it sounds to me like the problem you're trying to solve is records getting or losing attributes, and they can change in type from, say, integers to floats, but they don't change from integers to strings.

Integers → strings doesn't promote, regardless of whether they're in records or not:

>>> even_newer_data = ak.Array([[{"a": "hello"}]])
>>> ak.concatenate((old_data, new_data, newer_data, even_newer_data)).show(type=True)
type: 7 * union[
    var * {
        a: ?float64,
        b: ?float64,
        c: option[var * int64]
    },
    var * {
        a: string
    }
]
[[{a: 1, b: None, c: None}],
 [],
 [{a: 2, b: None, c: None}],
 [],
 [{a: 3.3, b: 3.3, c: None}, {a: 4.4, b: 4.4, c: None}],
 [{a: None, b: 5, c: []}, {a: None, b: 6.6, c: [1, ...]}],
 [{a: 'hello'}]]

ivirshup · 2023-02-14T17:00:26Z

Maybe we're going about this in too complicated a way, trying to fix a union after it has been created.

Yes, I think this was actually the approach I had been exploring. It starts to fall down if the user provided unions or you hit nested types like:

n * {gene_id: str, exons: var * {exon_start: int, exon_end: int}}

If you look at the example code for the relevant issue in AnnData (scverse/anndata#898), I was hoping to use the merge_unions_* logic to figure out resultant types.

For simple record types, I'd been doing a similar union of records to achieve the above (as discussed: #1668 (comment)).

jpivarski · 2023-02-14T18:41:15Z

It doesn't fall down all the way:

>>> old_data = ak.Array([{"gene_id": "meow", "exons": [{"exon_start": 100}]}])
>>> new_data = ak.Array([{"gene_id": "woof", "exons": [{"exon_start": 200, "exon_end": 300}]}])

>>> old_data["exons", "exon_end"] = None

>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 2 * {
    gene_id: string,
    exons: var * {
        exon_start: int64,
        exon_end: ?int64
    }
}
[{gene_id: 'meow', exons: [{exon_start: 100, ...}]},
 {gene_id: 'woof', exons: [{exon_start: 200, ...}]}]

It has become more complicated to find out what fields are needed and where they're needed, but it's still possible:

>>> new_data.fields
['gene_id', 'exons']

>>> new_data["exons"].fields
['exon_start', 'exon_end']

>>> new_data["exons", "something_deeper"].fields
...

(so I suppose there would be some tree-recursion needed there). All of these are working because string tuple-items in square brackets means nested record fields.

It's starting to look like a complex function, but if we can generalize it and give it a good name, that might be the way to go. This method differs from ak.merge_union_of_records in that it:

works at all list depths; option-type data are automatically handled (no special cases)
would have to include the concatenation as part of the operation, whereas ak.merge_union_of_records could be separate.

Maybe it should be a non-default option of ak.concatenate? Maybe not; it sounds too special-purpose for a generic (and NumPy-extending) function like ak.concatenate. Maybe a longer name: ak.concatenate_merge_records?

agoose77 · 2023-02-15T17:11:48Z

@jpivarski and I discussed this in our meeting today. We proposed the addition of a new ak.broadcast_fields() (or similar name) function that broadcasts the field structure of multiple arrays, e.g.

>>> arrays = ak.broadcast_fields([
    [{'x': 1}],
    [{'y': 2}],
    [{'z': 2}],
])
>>> arrays[0].type.show()
3 * {
    x: ?int64,
    y: ?int64,
    z: ?int64
}

This function would find the common type of missing fields, and add them in the appropriate places. Where there is no common type, "common type" means "union of types".

You could then use this function with ak.concatenate, i.e.

ak.concatenate(
    ak.broadcast_fields(...)
)

By creating unions as deep as possible, this function would not require ak.merge_union_of_lists.

Does this sound like a good approach?

grst · 2023-02-15T19:02:17Z

sounds good to me! Happy to give it a try when you have a prototype.

One question that I have from your example: When broadcast_fields is called with a list of arrays, would it again return a list of arrays or a single array? Your first example looks like the former, your second like the latter...

agoose77 · 2023-02-15T21:04:46Z

Your first example looks like the former, your second like the latter...

Yes, a mistake on my part! I've corrected the example.

agoose77 · 2023-02-27T15:42:34Z

Closing this PR in favour of #2267

agoose77 added 3 commits February 13, 2023 14:17

feat: new merge_union_of_lists

4c7a56c

chore: add note about misuse of parameter

9d8eb01

chore: remove dead code

9b3b87e

agoose77 changed the title ~~feat: new merge_union_of_lists~~ feat: add ak.merge_union_of_lists Feb 13, 2023

agoose77 commented Feb 13, 2023

View reviewed changes

agoose77 mentioned this pull request Feb 13, 2023

ak.merge_union_of_records not merging Union[var * {a: int, b:int}, var * {a: int}] #2211

Closed

agoose77 temporarily deployed to docs-preview February 13, 2023 14:31 — with GitHub Actions Inactive

agoose77 mentioned this pull request Feb 13, 2023

fix: support options in ak.merge_union_of_records #2236

Merged

1 task

grst mentioned this pull request Feb 15, 2023

Can't concatenate more than 3 types into Union #2239

Closed

agoose77 closed this Feb 27, 2023

agoose77 deleted the agoose77/feat-merge-union-of-lists branch March 16, 2023 13:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add `ak.merge_union_of_lists` #2235

feat: add `ak.merge_union_of_lists` #2235

agoose77 commented Feb 13, 2023 •

edited

Loading

agoose77 Feb 13, 2023

codecov bot commented Feb 13, 2023

jpivarski commented Feb 13, 2023

agoose77 commented Feb 13, 2023 •

edited

Loading

jpivarski commented Feb 13, 2023

agoose77 commented Feb 13, 2023 •

edited

Loading

ivirshup commented Feb 14, 2023 •

edited

Loading

jpivarski commented Feb 14, 2023

ivirshup commented Feb 14, 2023 •

edited

Loading

jpivarski commented Feb 14, 2023

agoose77 commented Feb 15, 2023 •

edited

Loading

grst commented Feb 15, 2023

agoose77 commented Feb 15, 2023

agoose77 commented Feb 27, 2023

		if not all(x.is_list for x in layout.contents):
		raise ak._errors.wrap_error(NotImplementedError)

feat: add ak.merge_union_of_lists #2235

feat: add ak.merge_union_of_lists #2235

Conversation

agoose77 commented Feb 13, 2023 • edited Loading

agoose77 Feb 13, 2023

Choose a reason for hiding this comment

codecov bot commented Feb 13, 2023

Codecov Report

jpivarski commented Feb 13, 2023

agoose77 commented Feb 13, 2023 • edited Loading

jpivarski commented Feb 13, 2023

agoose77 commented Feb 13, 2023 • edited Loading

ivirshup commented Feb 14, 2023 • edited Loading

jpivarski commented Feb 14, 2023

ivirshup commented Feb 14, 2023 • edited Loading

jpivarski commented Feb 14, 2023

agoose77 commented Feb 15, 2023 • edited Loading

grst commented Feb 15, 2023

agoose77 commented Feb 15, 2023

agoose77 commented Feb 27, 2023

feat: add `ak.merge_union_of_lists` #2235

feat: add `ak.merge_union_of_lists` #2235

agoose77 commented Feb 13, 2023 •

edited

Loading

agoose77 commented Feb 13, 2023 •

edited

Loading

agoose77 commented Feb 13, 2023 •

edited

Loading

ivirshup commented Feb 14, 2023 •

edited

Loading

ivirshup commented Feb 14, 2023 •

edited

Loading

agoose77 commented Feb 15, 2023 •

edited

Loading