Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ak.merge_union_of_lists #2235

Closed
wants to merge 3 commits into from

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Feb 13, 2023

This PR adds a new ak.merge_union_of_lists, which pushes unions inside of list types, i.e.

Union[List[X], List[Y]] → List[Union[X, Y]]

This is useful in combination with ak.merge_union_of_records, which cannot operate upon Union[var * Record, ...].

Fixes #2211

  • Support is_option types in ak.merge_union_of_lists
  • Support is_option types in ak.merge_union_of_options

@agoose77 agoose77 changed the title feat: new merge_union_of_lists feat: add ak.merge_union_of_lists Feb 13, 2023
Comment on lines +52 to +53
if not all(x.is_list for x in layout.contents):
raise ak._errors.wrap_error(NotImplementedError)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add support for all list types, add tests, etc.

@codecov
Copy link

codecov bot commented Feb 13, 2023

Codecov Report

Merging #2235 (9b3b87e) into main (7d10007) will decrease coverage by 0.15%.
The diff coverage is 38.63%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_do.py 84.79% <ø> (+0.40%) ⬆️
src/awkward/operations/ak_concatenate.py 96.11% <ø> (ø)
src/awkward/operations/ak_merge_union_of_lists.py 0.00% <0.00%> (ø)
src/awkward/_nplikes/typetracer.py 74.92% <33.33%> (ø)
src/awkward/_nplikes/numpylike.py 73.80% <66.66%> (-1.02%) ⬇️
src/awkward/typing.py 83.33% <66.66%> (-5.56%) ⬇️
src/awkward/operations/ak_almost_equal.py 92.72% <75.00%> (-1.62%) ⬇️
src/awkward/_nplikes/array_module.py 89.84% <100.00%> (ø)
src/awkward/_nplikes/shape.py 78.68% <0.00%> (+1.63%) ⬆️

@agoose77 agoose77 temporarily deployed to docs-preview February 13, 2023 14:31 — with GitHub Actions Inactive
@jpivarski
Copy link
Member

Will this work for arbitrarily many lists deep? Will it work in the presence of any option-types (keeping in mind that unions and options already interact)?

Most importantly, does it solve @ivirshup's problem?

@agoose77
Copy link
Collaborator Author

agoose77 commented Feb 13, 2023

Most importantly, does it solve @ivirshup's problem?

Yes, in its current form :)

Will this work for arbitrarily many lists deep? Will it work in the presence of any option-types (keeping in mind that unions and options already interact)?

This is another question (hence the draft status)! My approach with these functions is to keep their scope small; operating only upon parent-child layouts where possible. This means that there are cases where multiple functions are needed:

>>> x = ak.concatenate(([{'x': 1}], [{'y':2}]))
>>> ak.merge_union_of_records(x)
<Array [{x: 1, y: None}, {x: None, ...}] type='2 * {x: ?int64, y: ?int64}'>

vs

>>> y = ak.concatenate(([{'x': 1}], [None,{'y':2}]))
>>> ak.merge_union_of_records(y)
<Array [{x: 1}, None, {y: 2}] type='3 * union[{x: int64}, ?{y: int64}]'>

This latter example needs the user to first call ak.merge_option_of_records:

>>> y = ak.concatenate(([{'x': 1}], [None,{'y':2}]))
>>> ak.merge_option_of_records(y)
<Array [{x: 1}, {y: None}, {y: 2}] type='3 * union[{x: int64}, {y: ?int64}]'>
>>> ak.merge_union_of_records(ak.merge_option_of_records(y))
<Array [{x: 1, y: None}, ..., {x: None, ...}] type='3 * {x: ?int64, y: ?int64}'>

My view is that the same should apply here, if possible.

@jpivarski
Copy link
Member

Well, if it works one list deep, then it can be applied repeatedly (at a specified axis).

I didn't want the scope to balloon in the direction of needing yet another function for merging options. This is starting to sound like something that would be cleaner to do some other way.

@ivirshup, why are you concatenating different types? We should understand the use-case better. (I usually think of concatenation for same-type or what ought to be same-typed data, such as partitions of a large dataset. How is your use-case different?)

@agoose77
Copy link
Collaborator Author

agoose77 commented Feb 13, 2023

We should understand the use-case better.

This is an important question w.r.t the original motivation. I'm continuing this discussion given that these functions are intrinsically useful irrespective of this problem.

Well, if it works one list deep, then it can be applied repeatedly (at a specified axis).

Yes, the problem is when you have mixed option — non-option. This is something that we can (and should) handle. The benefit of adding ak.merge_option_of_lists is therefore mostly tangential to this discussion. My discussion above was framed in terms of what we've currently implemented. It actually does make sense for us to support Union[Option[X], Y].

I didn't want the scope to balloon in the direction of needing yet another function for merging options. This is starting to sound like something that would be cleaner to do some other way.

I'm on the same page. And, although I took a wide route, I don't think we need a new function in this case. We should just be handling unions given that we know what the result should look like.

@ivirshup
Copy link

ivirshup commented Feb 14, 2023

We should understand the use-case better. (I usually think of concatenation for same-type or what ought to be same-typed data, such as partitions of a large dataset. How is your use-case different?)

The use case is really "concatenation of very similar types". E.g the likely use cases for pd.concat and xr.concat.

I can give a toy example, but let me know if you'd like something more concrete (it will get a bit bio-y).

Lets say a certain experiment results in data that looks like n * var * {measure_a: float, measure_b: float}. I've annotated an experiment with some computed value per measurement (or maybe an annotation) so now it looks like: n * var * {measure_a: float, measure_b: float, additional_value: int}. Now I want to combine this with some other experiment that doesn't have the additional value, and has the datashape of the base case. I want to maintain the values of additional_value for the observations that have it.

awkward will give a union type here. This is an issue for a number of reasons, including that they're a little difficult to reason about, to export to other libraries and it'll error if I try to access "additional_value" on the combined dataset.

import awkward as ak, pandas as np, numpy as np

exp_a = ak.Array([
    [{"measure_a": .42, "measure_b": .3, "additional_value": 3}, {"measure_a": .26, "measure_b": .03, "additional_value": 9}],
    [{"measure_a": .29, "measure_b": .01, "additional_value": 0}, {"measure_a": .41, "measure_b": .62, "additional_value": 78}, {"measure_a": .31, "measure_b": .54, "additional_value": 62}],
])

exp_b = ak.Array([
    [{"measure_a": .42, "measure_b": .3}],
    [{"measure_a": .29, "measure_b": .01}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}],
    [{"measure_a": .26, "measure_b": .03}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}, {"measure_a": .41, "measure_b": .62}, {"measure_a": .31, "measure_b": .54}],
])

ak.concatenate([exp_a, exp_b]).type.show()
# 5 * union[
#     var * {
#         measure_a: float64,
#         measure_b: float64,
#         additional_value: int64
#     },
#     var * {
#         measure_a: float64,
#         measure_b: float64
#     }
# ]

If I were to do this in pandas, I would have a dataframe like n * {observation: int, measure_a: float, measure_b: float, additional_value: int} and n * {entry: int, sub_entry:int, measure_a: float, measure_b: float}. I could call pd.concat([experiment_a, experiment_b], join="outer") and end up with n * {observation: int, measure_a: float, measure_b: float, additional_value: ?int}

exp_a_df = ak.to_dataframe(exp_a).astype({"additional_value": pd.Int64Dtype()})
exp_b_df = ak.to_dataframe(exp_b)

pd.concat([exp_a_df, exp_b_df], join="outer")
#                 measure_a  measure_b  additional_value
# entry subentry                                        
# 0     0              0.42       0.30                 3
#       1              0.26       0.03                 9
# 1     0              0.29       0.01                 0
#       1              0.41       0.62                78
#       2              0.31       0.54                62
# 0     0              0.42       0.30              <NA>
# 1     0              0.29       0.01              <NA>
#       1              0.41       0.62              <NA>
#       2              0.31       0.54              <NA>
# 2     0              0.26       0.03              <NA>
#       1              0.41       0.62              <NA>
#       2              0.31       0.54              <NA>
#       3              0.41       0.62              <NA>
#       4              0.31       0.54              <NA>

The join argument from xr.concat allows very similar behavior.


This PR will get a little closer to the goal here. There are a few other tasks I would need to do:

  • Check if the result is mergable. I'm not sure I want to bother merging arrays where I can't eliminate unions.
  • Iteratively call either merge_unions_of_records, merge_unions_of_lists (based on current top level union) until unions are eliminated.

This should be manageable, but "checking if the result it mergable" may be difficult, and each call to one of the merge_unions functions may be expensive (e.g. I think Union[var * {a}, var * {b}] -> var * Union[{a}, {b}] -> var * {?a, ?b} allocates a lot more than a 1 step process would).

@jpivarski
Copy link
Member

Maybe we're going about this in too complicated a way, trying to fix a union after it has been created. The situation of having new data with more fields than old data is a familiar one, which I'd call schema evolution. If it always involves record types, such as

>>> old_data = ak.Array([[{"a": 1}], [], [{"a": 2}]])
>>> old_data.show(type=True)
type: 3 * var * {
    a: int64
}
[[{a: 1}],
 [],
 [{a: 2}]]
>>> new_data = ak.Array([[], [{"a": 3.3, "b": 3.3}, {"a": 4.4, "b": 4.4}]])
>>> new_data.show(type=True)
type: 2 * var * {
    a: float64,
    b: float64
}
[[],
 [{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]]

just concatenating them makes an undesirable union,

>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 5 * union[
    var * {
        a: int64
    },
    var * {
        a: float64,
        b: float64
    }
]
[[{a: 1}],
 [],
 [{a: 2}],
 [],
 [{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]]

but if we explicitly set missing values for the records that don't have them,

>>> old_data["b"] = None
>>> old_data.show(type=True)
type: 3 * var * {
    a: int64,
    b: ?unknown
}
[[{a: 1, b: None}],
 [],
 [{a: 2, b: None}]]

then they'll concatenate the way you want them without even needing the special function:

>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 5 * var * {
    a: float64,
    b: ?float64
}
[[{a: 1, b: None}],
 [],
 [{a: 2, b: None}],
 [],
 [{a: 3.3, b: 3.3}, {a: 4.4, b: 4.4}]]

This kind of procedure doesn't have to be manual:

>>> newer_data = ak.Array([[{"b": 5, "c": []}, {"b": 6.6, "c": [1, 2]}]])

>>> all_fields = set(old_data.fields) | set(new_data.fields) | set(newer_data.fields)
>>> all_fields
{'a', 'c', 'b'}
>>> for field in all_fields:
...     if field not in old_data.fields:
...         old_data[field] = None
...     if field not in new_data.fields:
...         new_data[field] = None
...     if field not in newer_data.fields:
...         newer_data[field] = None
... 
>>> ak.concatenate((old_data, new_data, newer_data)).show(type=True)
type: 6 * var * {
    a: ?float64,
    b: ?float64,
    c: option[var * int64]
}
[[{a: 1, b: None, c: None}],
 [],
 [{a: 2, b: None, c: None}],
 [],
 [{a: 3.3, b: 3.3, c: None}, {a: 4.4, b: 4.4, c: None}],
 [{a: None, b: 5, c: []}, {a: None, b: 6.6, c: [1, ...]}]]

But it is strictly acting on records; there's no equivalent of this for unions of other types, such as union[int, str]. But it sounds to me like the problem you're trying to solve is records getting or losing attributes, and they can change in type from, say, integers to floats, but they don't change from integers to strings.

Integers → strings doesn't promote, regardless of whether they're in records or not:

>>> even_newer_data = ak.Array([[{"a": "hello"}]])
>>> ak.concatenate((old_data, new_data, newer_data, even_newer_data)).show(type=True)
type: 7 * union[
    var * {
        a: ?float64,
        b: ?float64,
        c: option[var * int64]
    },
    var * {
        a: string
    }
]
[[{a: 1, b: None, c: None}],
 [],
 [{a: 2, b: None, c: None}],
 [],
 [{a: 3.3, b: 3.3, c: None}, {a: 4.4, b: 4.4, c: None}],
 [{a: None, b: 5, c: []}, {a: None, b: 6.6, c: [1, ...]}],
 [{a: 'hello'}]]

@ivirshup
Copy link

ivirshup commented Feb 14, 2023

Maybe we're going about this in too complicated a way, trying to fix a union after it has been created.

Yes, I think this was actually the approach I had been exploring. It starts to fall down if the user provided unions or you hit nested types like:

n * {gene_id: str, exons: var * {exon_start: int, exon_end: int}}

If you look at the example code for the relevant issue in AnnData (scverse/anndata#898), I was hoping to use the merge_unions_* logic to figure out resultant types.

For simple record types, I'd been doing a similar union of records to achieve the above (as discussed: #1668 (comment)).

@jpivarski
Copy link
Member

It doesn't fall down all the way:

>>> old_data = ak.Array([{"gene_id": "meow", "exons": [{"exon_start": 100}]}])
>>> new_data = ak.Array([{"gene_id": "woof", "exons": [{"exon_start": 200, "exon_end": 300}]}])

>>> old_data["exons", "exon_end"] = None

>>> ak.concatenate((old_data, new_data)).show(type=True)
type: 2 * {
    gene_id: string,
    exons: var * {
        exon_start: int64,
        exon_end: ?int64
    }
}
[{gene_id: 'meow', exons: [{exon_start: 100, ...}]},
 {gene_id: 'woof', exons: [{exon_start: 200, ...}]}]

It has become more complicated to find out what fields are needed and where they're needed, but it's still possible:

>>> new_data.fields
['gene_id', 'exons']

>>> new_data["exons"].fields
['exon_start', 'exon_end']

>>> new_data["exons", "something_deeper"].fields
...

(so I suppose there would be some tree-recursion needed there). All of these are working because string tuple-items in square brackets means nested record fields.

It's starting to look like a complex function, but if we can generalize it and give it a good name, that might be the way to go. This method differs from ak.merge_union_of_records in that it:

  • works at all list depths; option-type data are automatically handled (no special cases)
  • would have to include the concatenation as part of the operation, whereas ak.merge_union_of_records could be separate.

Maybe it should be a non-default option of ak.concatenate? Maybe not; it sounds too special-purpose for a generic (and NumPy-extending) function like ak.concatenate. Maybe a longer name: ak.concatenate_merge_records?

@agoose77
Copy link
Collaborator Author

agoose77 commented Feb 15, 2023

@jpivarski and I discussed this in our meeting today. We proposed the addition of a new ak.broadcast_fields() (or similar name) function that broadcasts the field structure of multiple arrays, e.g.

>>> arrays = ak.broadcast_fields([
    [{'x': 1}],
    [{'y': 2}],
    [{'z': 2}],
])
>>> arrays[0].type.show()
3 * {
    x: ?int64,
    y: ?int64,
    z: ?int64
}

This function would find the common type of missing fields, and add them in the appropriate places. Where there is no common type, "common type" means "union of types".

You could then use this function with ak.concatenate, i.e.

ak.concatenate(
    ak.broadcast_fields(...)
)

By creating unions as deep as possible, this function would not require ak.merge_union_of_lists.

Does this sound like a good approach?

@grst
Copy link

grst commented Feb 15, 2023

sounds good to me! Happy to give it a try when you have a prototype.

One question that I have from your example: When broadcast_fields is called with a list of arrays, would it again return a list of arrays or a single array? Your first example looks like the former, your second like the latter...

@agoose77
Copy link
Collaborator Author

Your first example looks like the former, your second like the latter...

Yes, a mistake on my part! I've corrected the example.

@agoose77
Copy link
Collaborator Author

Closing this PR in favour of #2267

@agoose77 agoose77 closed this Feb 27, 2023
@agoose77 agoose77 deleted the agoose77/feat-merge-union-of-lists branch March 16, 2023 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ak.merge_union_of_records not merging Union[var * {a: int, b:int}, var * {a: int}]
4 participants