Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ak.unflatten does not always respect zero counts #905

Closed
agoose77 opened this issue Jun 9, 2021 · 7 comments · Fixed by #1178
Closed

ak.unflatten does not always respect zero counts #905

agoose77 opened this issue Jun 9, 2021 · 7 comments · Fixed by #1178
Labels
bug The problem described is something that must be fixed

Comments

@agoose77
Copy link
Collaborator

agoose77 commented Jun 9, 2021

>>> array = ak.Array([[], [0, 2, 2, 2, 2, 2]])
>>> lengths = ak.Array([0, 0, 0, 1, 0, 5])
>>> ak.unflatten(array, lengths, axis=1)
<Array [[], [[0], [], [2, 2, 2, 2, 2]]] type='2 * var * var * int64'>
>>> assert ak.unflatten(array, lengths, axis=1).to_list() == [[[], [], []], [[0], [], [2, 2, 2, 2, 2]]]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-284-684780e734b5> in <module>
      1 array = ak.Array([[], [0, 2, 2, 2, 2, 2]])
      2 lengths = ak.Array([0, 0, 0, 1, 0, 5])
----> 3 assert ak.unflatten(array, lengths, axis=1).to_list() == [
      4     [[], [], []],
      5     [[0], [], [2, 2, 2, 2, 2]],

AssertionError: 

I expect the function to produce the layout as shown on the RHS, [[[], [], []], [[0], [], [2, 2, 2, 2, 2]]].

I would like to support zeros like this, because I want the output to be a regular array. I am effectively using a modified run_lengths to count runs of a predefined list of keys.

I may be able to look at this today, I'll self-assign if so.

@agoose77 agoose77 added the bug (unverified) The problem described would be a bug, but needs to be triaged label Jun 9, 2021
@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 9, 2021

I think the cause here is the various searchsorted calls. https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/awkward/operations/structure.py#L2056-L2061
https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/awkward/operations/structure.py#L2108-L2110

There is another case with axis=0, where

array = ak.Array([0, 2, 2, 2, 2, 2])
lengths = ak.Array([1, 0, 4, 0, 0, 0])
ak.unflatten(array, lengths)

fails

@jpivarski
Copy link
Member

"Getting zeros right" was an active effort at one point—the fact that the counts can contain zeros was not ignored. The this that's supposed to do this is called current_offsets. Unfortunately, the implementation of this zeros-handling is stateful: you have to be aware of how the value of current_offsets changes through the algorithm. (Lots of print-outs!)

There are some ambiguities in where to put the zeros when partitioning is involved: the same zero-length list could just as well go at the end of one partition or the beginning of another. The current choice was made to make it easier to not lose any zero-length lists at the end of the array (zeros at the end of counts are easy to miss/have to be handled carefully).

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 9, 2021

Ah, I can now see that I didn't notice that the problem of arranging zero-length lists is degenerate in the nested axis case; as you say, without a size constraint the zero-length lists can be placed in either partition.

However, in this case, the meaning of [0 0 0] is lost during the unflattening. I don't think that is undefined - they could sit in either partition, e.g.

[
    [[], [], []],
    [[0], [], [2, 2, 2, 2, 2]],
]

or

[
    [],
    [[], [], [], [0], [], [2, 2, 2, 2, 2]],
]

or any intermediate configuration thereof, but there should still (in my opinion) be three lists somewhere?

@jpivarski
Copy link
Member

Oh yeah, it's not just an ambiguity between partitions, but also between nested lists if you're unflattering inside of existing lists, like with axis!=1. But they should appear somewhere, and you may have found a case in which the current_offsets mechanism is losing empties. It was probably only tested with a single empty list—is it a situation in which only multiple, consecutive empty lists are being lost?

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 9, 2021

@jpivarski it looks that way (for axis != 0).

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 9, 2021

Still, for my use case, perhaps I'll need to roll a custom version unflatten that has a predetermined set of rules for arranging these empty lists.

@jpivarski jpivarski linked a pull request Dec 7, 2021 that will close this issue
@jpivarski jpivarski added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Dec 7, 2021
@jpivarski
Copy link
Member

Though old, this was still a bug and had to be fixed! (I'm putting the same corrections into the commented out code that's waiting to be ported to v2, so that it doesn't regress. Also, there's a test and all tests will be ported, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants