fix: serialize dask_awkward.Array to None and store hard ref in closure in behavior #824
@jrueb can you please try this one out!

@nsmith- this feels mildly dirty, any idea?
Keeping |
@nsmith- it'll get called at program exit for sure (with the atexit flag set to true, judging from the docs); this is just a more controllable way of stuffing the reference somewhere that has known behavior. As far as behavior stickiness goes, it's not that. I think there are ways to get the list of active finalizers on an object, and we can percolate lifetime control through that?
So it does get called at program exit for sure; I added in a print statement to verify.

I'll add in a

Argh, multiple levels of confusion, but it works like I expect.
I'm going to push on behavior stickiness because it should be possible: after all, how can a dask-awkward array know its mixin without keeping a reference to the mixin class type somewhere? The only other option is a name in the parameters and a global lookup, but as far as I know, ak2 keeps a behavior namespace per object, as ak1 did.
@nsmith- dask-awkward arrays don't directly know about their behaviors; the behavior is stored in the `_meta`. The problem with what you want to do is:
The root issue here is: what is the ak2-blessed way of handling self-references that survive slicing? Stepping back to the idea in coffea 0.7/ak1: you could delete the original events and/or its factory, but with a handle to some sliced events array (which internally held a strong ref to the factory, which was serializable and could either recreate an events object or use its existing weakref to one), you could call any method that makes use of the original array. The method knew how to get the factory pointer (it was stored in the behavior metadata) and hence the unsliced events, either the original via the weakref or a re-made copy.
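A rough stdlib sketch of that 0.7-era pattern (all names here are hypothetical, not coffea's actual API): a serializable factory keeps only a weak reference to the events it built, so dropping the events frees them, yet any holder of the factory can recreate them on demand:

```python
import weakref

class Events(list):
    """Stand-in for an events array (subclassed so it supports weakrefs)."""

class EventsFactory:
    """Hypothetical sketch: a cheap, picklable description plus a weakref
    to the most recently materialized events."""

    def __init__(self, source):
        self.source = source        # serializable description of the input
        self._events_ref = None     # weakref to the last-built events, if any

    def events(self):
        # reuse the live object if it still exists, otherwise rebuild it
        events = self._events_ref() if self._events_ref is not None else None
        if events is None:
            events = Events(self.source)   # stand-in for expensive materialization
            self._events_ref = weakref.ref(events)
        return events

factory = EventsFactory((1, 2, 3))
ev = factory.events()
assert factory.events() is ev         # second call returns the cached object
del ev                                # drop the only strong reference...
assert factory.events() == [1, 2, 3]  # ...and the factory rebuilds on demand
```

The weakref means the factory never keeps the (large) events alive on its own; only user handles do.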
I've been staring at this, trying to understand the issue. I highly doubt that weak references will be needed in ak2. From what I understood in ak1, the only reason we needed them then was to break reference cycles through C++, and all of that is gone now. Reference cycles are fine; the garbage collector will find them.

The thing I'm trying to parse is "self-references that survive slicing." Does this have something to do with one particle type's record fields being global indexes into another particle type (e.g. muons have a list of closest jets, expressed as indexes, where an index of -1 means no such jet)?

If that's the issue, I don't see how weak references help. As long as we're talking about cousin cross-references, not cycles in the graph that would break most recursive functions (i.e. still a DAG), you can keep a reference to the original array by burying it deep in an `IndexedOptionArray`.

For instance, suppose you're given these data,

```python
import awkward as ak
import numpy as np

events = ak.Array([
    {"muon": [{"pt": 1}, {"pt": 2}, {"pt": 3}], "jet": [{"pt": 1.1}, {"pt": 2.2}]},
    {"muon": [{"pt": 4}, {"pt": 5}], "jet": [{"pt": 3.3}, {"pt": 4.4}]},
    {"muon": [{"pt": 6}], "jet": [{"pt": 5.5}, {"pt": 6.6}, {"pt": 7.7}]},
    {"muon": [{"pt": 7}, {"pt": 8}], "jet": []},
    {"muon": [{"pt": 9}], "jet": [{"pt": 8.8}, {"pt": 9.9}]},
])
```

and you annotate the types (and provide behaviors, but I don't need to do that in this example):

```python
events["muon"] = ak.with_name(events["muon"], "Muon")
events["jet"] = ak.with_name(events["jet"], "Jet")
```

Suppose that you're also given the global index of each muon's favorite jet,

```python
muons_favorite_jet_index = ak.Array([[0, 1, -1], [3, 2], [5], [-1, -1], [8]])
```

which can be assigned as a field of the muons:

```python
events["muon", "favorite_index"] = muons_favorite_jet_index
```

Similarly, you can add the whole jet objects in their original order by nesting them within an `IndexedOptionArray`:

```python
muons_favorite_jet = ak.Array(
    ak.contents.ListOffsetArray(
        events["muon"].layout.offsets,
        ak.contents.IndexedOptionArray(
            ak.index.Index64(muons_favorite_jet_index.layout.content.data),
            events["jet"].layout.content,
        ),
    )
)
events["muon", "favorite"] = muons_favorite_jet
```

and

```python
>>> events.type.show()
5 * {
    jet: var * Jet[
        pt: float64
    ],
    muon: var * Muon[
        pt: int64,
        favorite_index: int64,
        favorite: ?Jet[
            pt: float64
        ]
    ]
}
```

When you make a cut on the events,

```python
selected_events = events[[True, False, False, True, True]]
```

both the original and the selected array resolve the cross-reference correctly:

```python
>>> events["muon", "favorite", "pt"].show()
[[1.1, 2.2, None],
 [4.4, 3.3],
 [6.6],
 [None, None],
 [9.9]]
>>> selected_events["muon", "favorite", "pt"].show()
[[1.1, 2.2, None],
 [None, None],
 [9.9]]
```

So `selected_events` still has the right indexing (the last event, which passed the cut, still has a jet pt of 9.9). In fact,

```python
>>> selected_events["muon", "favorite"].layout.content.content.content is events["jet"].layout.content
True
```

This "keeps events alive for the duration of the user session" because even if a user cuts a lot of events and drops the reference to the original array, they're kept alive inside the `IndexedOptionArray`.

Also, you're asking about this in the context of dask-awkward, not awkward.
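The core trick above — an index where -1 means "missing," pointing into shared content — can be illustrated without awkward in plain NumPy (values taken from the flattened muon/jet example; the variable names are mine):

```python
import numpy as np

# shared "content": all jet pts, in their original order
jet_pt = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# per-muon favorite-jet indexes into jet_pt; -1 means "no favorite"
favorite_index = np.array([0, 1, -1, 3, 2, 5, -1, -1, 8])

# an IndexedOptionArray resolves like this: gather where the index is valid,
# mask (here: NaN) where it is -1
valid = favorite_index >= 0
favorite_pt = np.where(valid, jet_pt[np.clip(favorite_index, 0, None)], np.nan)
print(favorite_pt)  # [1.1 2.2 nan 4.4 3.3 6.6 nan nan 9.9]

# cutting muons just slices the index; the shared jet_pt content is untouched,
# so the cross-reference stays valid after the cut
cut = favorite_index[[3, 4, 8]]
print(np.where(cut >= 0, jet_pt[np.clip(cut, 0, None)], np.nan))  # [4.4 3.3 9.9]
```

Slicing the index array never copies or invalidates the content it points into, which is exactly why the original jets stay reachable (and alive) through any number of cuts.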
Oh, I think I read the old version of your comment, @jpivarski; just a second. Indeed, I'll give this a try with a form and typetracer first to see if I can demonstrate it from that end.

A major issue is being able to do this lazily without inducing unpleasant responses in users. :-)
If this thing is constructed without touching the data (again, I don't know whether we need a Dask-friendly way to build it), then a typetracer report can tell you exactly which fields get touched. See:

```python
def add_form_keys(form, hint=()):
    add_form_keys.num += 1
    num = add_form_keys.num
    if hasattr(form, "contents"):
        return form.copy(
            contents=[
                add_form_keys(v, hint + (k,))
                for k, v in zip(form.fields, form.contents)
            ],
            form_key=f"node-{num}-{'-'.join(hint)}",
        )
    elif hasattr(form, "content"):
        return form.copy(
            content=add_form_keys(form.content, hint + ("@",)),
            form_key=f"node-{num}-{'-'.join(hint)}",
        )
    else:
        return form.copy(
            form_key=f"node-{num}-{'-'.join(hint)}",
        )

add_form_keys.num = 0
form_with_keys = add_form_keys(events.layout.form)

typetracer, report = ak.typetracer.typetracer_with_report(form_with_keys)
events_tt = ak.Array(typetracer)
```

and then:

```python
>>> report.data_touched
[]
>>> events_tt.muon.pt + 1
<Array-typetracer [...] type='## * var * int64'>
>>> report.data_touched
['node-5-muon', 'node-7-muon-@-pt']
>>> events_tt.jet.pt + 1
<Array-typetracer [...] type='## * var * float64'>
>>> report.data_touched
['node-5-muon', 'node-7-muon-@-pt', 'node-2-jet', 'node-4-jet-@-pt']
>>> events_tt.muon.favorite.pt + 1
<Array-typetracer [...] type='## * var * ?float64'>
>>> report.data_touched
['node-5-muon', 'node-7-muon-@-pt', 'node-2-jet', 'node-4-jet-@-pt',
 'node-9-muon-@-favorite', 'node-11-muon-@-favorite-@-pt']
```

Or restart and do only the last operation:

```python
>>> typetracer, report = ak.typetracer.typetracer_with_report(form_with_keys)
>>> events_tt = ak.Array(typetracer)
>>> report.data_touched
[]
>>> events_tt.muon.favorite.pt + 1
<Array-typetracer [...] type='## * var * ?float64'>
>>> report.data_touched
['node-5-muon', 'node-9-muon-@-favorite', 'node-11-muon-@-favorite-@-pt']
```

Touching the jet fields is independent of touching the muon fields (though it does touch the muon offsets, because the favorite jets are nested under the muon's list structure).
Looking at how we've done things, I think you're correct that we can start everything off as an IndexedOptionArray and work from that, instead of the ListOffsetArrays that we currently do takes on to get the IndexedOptionArray. It'll take a bit to re-engineer, but it'll be far more correct. @nsmith- your thoughts?
Thanks, Jim, for also understanding what we were talking about; I should have included an example. Sadly, there are cycles, e.g. gen particle parents, so embedding the reference in the form is not an option. Actually, cross-references go both ways.

A minimal example, focusing on one "event" to not cloud the issue with the jagged array indexing mess that has to happen:

```python
import awkward as ak

behavior = {}

@ak.mixin_class(behavior)
class GenPart:
    @property
    def parent(self):
        return self.behavior["__original__"][self.parentIdx]

genpart = ak.zip(
    {
        "pt": ak.Array([0., 0., 13., 14., 15., 16., 17.]),
        "parentIdx": ak.Array([None, None, 0, 0, 1, 2, 3]),
    },
    with_name="GenPart",
    behavior=behavior,
)
# hide in a spot that is sticky to all downstream arrays
genpart.behavior["__original__"] = genpart
# slice
genpart = genpart[3:]
print(genpart.parent)
```

This works in ak1 and ak2. But for dask-awkward, I am not sure where to keep the reference.
```python
import awkward as ak

behavior = {}

@ak.mixin_class(behavior)
class GenPart:
    @property
    def parent(self):
        return self.behavior["__original__"][self.parentIdx]

parentIdx = ak.Array([None, None, 0, 0, 1, 2, 3])
pt = ak.Array([0., 0., 13., 14., 15., 16., 17.])

genpart = ak.zip(
    {
        "pt": pt,
        "parentIdx": parentIdx,
    },
    with_name="GenPart",
    behavior=behavior,
)

layout_parents = ak.contents.IndexedOptionArray(
    ak.index.Index64(ak.fill_none(parentIdx, -1)),
    genpart.layout,
)
parents = ak.Array(layout_parents)

# hide in a spot that is sticky to all downstream arrays
genpart.behavior["__original__"] = genpart
# slice
genpart = genpart[3:]
parents = parents[3:]
print(genpart.parent)
print(parents)
```

appears to work!
Ok, and |
Copying (and summarizing) from Slack: although ak2's Pythonic layout nodes can, in principle, support cycles, you'd have to build them by modifying private attributes, and we'd have a lot of trouble memoizing all of the recursive functions to make them cyclic-safe. In many cases, the problem would be conceptual—sure, we can stop recursing once we get to a node we've seen before, but what would the return value be? So we should consider ourselves restricted to no cycles. Two use-cases that break this are the gen particle parentage and bidirectional cross-references among cousins.

I'll come back to the genparticles case, because I don't think we need to make that structurally recursive. There's only one layout node involved—we should be able to sequester it somehow and do all of the iterative self-application in behaviors, rather than structure.

For bidirectional cross-references among cousins, there's a way out. It means users would be able to say `events.muon.favorite_jet.pt` and `events.jet.favorite_muon.pt` but not `events.muon.favorite_jet.favorite_muon.favorite_jet.favorite_muon.pt`. Does anybody actually need the last case? It seems eclectic. To do this, you can define only one mixin class, shared by both nestings.

Who needs to literally take advantage of bidirectional cousin cycles? If it's just a technicality, that technicality can be avoided.
Ah, apologies, I didn't read Jim's comment all the way. For muons and jets there should be ways to handle things, even if a little less convenient, but this is the simple case of something very similar in gen particles, where you'd lose a lot of convenience if you can't naturally traverse the parentage in multiple directions and find those cousins. That feels like bread and butter when you're, e.g., relating decay products of tops, including all those radiated hadrons.
appending to the above:

```python
layout_parents_parents = ak.contents.IndexedOptionArray(
    ak.index.Index64(ak.fill_none(parents.parentIdx, -1)),
    parents.layout.content,
)
parents_parents = ak.Array(layout_parents_parents)
print(genpart.parent.parent)
print(parents_parents)
```

yields:

so we can absolutely define a mixin that builds the parents on the fly (as a daskable operation, too); that should also be reasonably traceable, given Jim's indications above.
We should avoid stashing large data, like arrays, in the `behavior` dict. Since behaviors can add new methods and unary/binary operators to records, the way that a Numba function gets compiled depends on the behavior. Change the behavior and we need to recompile the function, and Numba only knows that it needs to recompile the function if the (string) type name changes. Oh wait—the string representation of an array is not large, so this is not a performance issue. But it would lead to Numba functions getting recompiled when decimal values of the few numbers that do appear in the string representation change.

So anyway, here is a sequestered-array implementation:

```python
import awkward as ak
import numpy as np

parentIdx = ak.Array([None, None, 0, 0, 1, 2, 3])
pt = ak.Array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

genpart = ak.zip(
    {
        "pt": pt,
        "parentIdx": ak.fill_none(parentIdx, -1),
    }
)

@ak.mixin_class(ak.behavior)
class GenPart:
    @property
    def pt(self):
        return self["_data", "pt"]

    @property
    def parent(self):
        return ak.Array(
            ak.contents.RecordArray(
                [
                    ak.contents.IndexedOptionArray(
                        ak.index.Index64(
                            ak.fill_none(self["_data", "parentIdx"], -1)
                        ),
                        self["_data"].layout.content,
                    ),
                ],
                ["_data"],
            ),
            with_name="GenPart",
        )

sequestered = ak.Array(
    ak.contents.RecordArray(
        [
            ak.contents.IndexedOptionArray(
                ak.index.Index64(np.arange(len(genpart), dtype=np.int64)),
                genpart.layout,
            ),
        ],
        ["_data"],
    ),
    with_name="GenPart",
)
```

With this definition,

```python
>>> sequestered.pt
<Array [0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6] type='7 * ?float64'>
>>> sequestered.parent.pt
<Array [None, None, 0, 0, 1.1, 2.2, 3.3] type='7 * ?float64'>
>>> sequestered.parent.parent.pt
<Array [None, None, None, None, None, 0, 0] type='7 * ?float64'>
```

and

```python
>>> sequestered[3:].pt
<Array [3.3, 4.4, 5.5, 6.6] type='4 * ?float64'>
>>> sequestered[3:].parent.pt
<Array [0, 1.1, 2.2, 3.3] type='4 * ?float64'>
>>> sequestered[3:].parent.parent.pt
<Array [None, None, 0, 0] type='4 * ?float64'>
```

and

```python
>>> sequestered[[3, 4, 5, 6]].pt
<Array [3.3, 4.4, 5.5, 6.6] type='4 * ?float64'>
>>> sequestered[[3, 4, 5, 6]].parent.pt
<Array [0, 1.1, 2.2, 3.3] type='4 * ?float64'>
>>> sequestered[[3, 4, 5, 6]].parent.parent.pt
<Array [None, None, 0, 0] type='4 * ?float64'>
```

This would usually work without the double-nesting (just an `IndexedOptionArray` without the `RecordArray` around it). The next step would be to make sure that these sorts of constructions will work in dask-awkward.
Just as something to think about—you probably don't want to introduce this complication: it's conceivably possible to make cross-referenced cousins work exactly two levels deep. What I described above made them work one level deep.

The events object would include `events.muon1.favorite_jet.favorite_muon.pt` but no deeper: `events.muon1.favorite_jet.favorite_muon.favorite_jet` is an error! Similarly, you can go `events.jet1.favorite_muon.favorite_jet.pt` but no deeper: `events.jet1.favorite_muon.favorite_jet.favorite_muon` is an error!

This is probably more complicated than you want to consider, but it's good to know that there's room to grow if you ever need to. Also, I'm going to assert that there's maybe a 1% chance that someone really needs the deeper chains.
So this exploration is interesting, but isn't allowing a stored reference back to the original array a bit cleaner at this point? It's much more clear, IMO, to a typical physicist what the ref-to-original trick is doing. The content munging takes a lot longer to understand and is harder to reproduce oneself (as we often like to do to convince ourselves we understand something).

I like the sequestered-array approach! Essentially you get recursion by holding onto a tail. The only downside is that each of the array's fields is now hidden in `_data`.

It will be nasty to deal with in terms of data versioning (though GenParticles don't move that much).
@jpivarski on the dask-awkward side of things, the way it's presently implemented shows that you don't even need the pointer-to-original in the task graph being executed; it's basically embedded in the graph itself. Right now with coffea, the weakref gets pickled to a None (yes, this is a bad override I would like to see go away) and is not used when being executed. Since dask forces the operation into a DAG (infinite recursion is not possible in an analysis you can actually execute), the need to pickle the ref-to-self disappears when dealing with the concrete awkward arrays on worker nodes.
Oh, but `ak.to_packed` or `ak.to_arrow` (and therefore anything that writes the arrays) would blow up memory use for either approach.
@nsmith- do you find that an acceptable risk? |
We're discussing on Slack the possibility of just having the self-reference in dask-awkward (some understanding was reached about what was going on that hadn't been there before). I'll just leave the above implementation as is; it might be moot. I'm glad we figured all this stuff out, though.
Okay, once more unto the breach. Here's a solution that's safe for saving (and continues to work after loading a saved file) and doesn't explode memory use, even after `ak.to_packed` or `ak.to_arrow`. Instead of keeping the original, uncut data, we allow the data to be cut and always translate the global indexes into positions in the current array.

But anyway, the implementation:

```python
import awkward as ak
import numpy as np

pt = ak.Array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6])
global_index = ak.Array([None, None, 0, 0, 1, 2, 3])

@ak.mixin_class(ak.behavior)
class GenPart:
    @property
    def parent(self):
        surrogate_index = ak.to_numpy(self["surrogate_index"])
        inverse_map = np.full(np.max(surrogate_index) + 1, -1, dtype=np.int64)
        inverse_map[surrogate_index] = np.arange(len(surrogate_index))
        slicer = ak.Array(
            ak.contents.ByteMaskedArray(
                ak.index.Index8(inverse_map >= 0),
                ak.contents.NumpyArray(inverse_map),
                valid_when=True,
            )
        )[self["global_index"]]
        return self[slicer]

genpart = ak.zip(
    {
        "pt": pt,
        "global_index": global_index,
        "surrogate_index": np.arange(len(global_index)),
    },
    with_name="GenPart",
)
```

Starting with our original `genpart` array,

```python
>>> genpart.show()
[{pt: 0, global_index: None, surrogate_index: 0},
 {pt: 1.1, global_index: None, surrogate_index: 1},
 {pt: 2.2, global_index: 0, surrogate_index: 2},
 {pt: 3.3, global_index: 0, surrogate_index: 3},
 {pt: 4.4, global_index: 1, surrogate_index: 4},
 {pt: 5.5, global_index: 2, surrogate_index: 5},
 {pt: 6.6, global_index: 3, surrogate_index: 6}]
```

it computes the parents correctly:

```python
>>> genpart.parent.show()
[None,
 None,
 {pt: 0, global_index: None, surrogate_index: 0},
 {pt: 0, global_index: None, surrogate_index: 0},
 {pt: 1.1, global_index: None, surrogate_index: 1},
 {pt: 2.2, global_index: 0, surrogate_index: 2},
 {pt: 3.3, global_index: 0, surrogate_index: 3}]
```

But then, suppose we had cut out the genparticle at (original, global) index 2:

```python
>>> selected_genpart = genpart[[True, True, False, True, True, True, True]]
>>> selected_genpart.show()
[{pt: 0, global_index: None, surrogate_index: 0},
 {pt: 1.1, global_index: None, surrogate_index: 1},
 {pt: 3.3, global_index: 0, surrogate_index: 3},
 {pt: 4.4, global_index: 1, surrogate_index: 4},
 {pt: 5.5, global_index: 2, surrogate_index: 5},
 {pt: 6.6, global_index: 3, surrogate_index: 6}]
```

Now, even though all surviving positions above 2 have shifted, the parents are still found correctly:

```python
>>> selected_genpart.parent.show()
[None,
 None,
 {pt: 0, global_index: None, surrogate_index: 0},
 {pt: 1.1, global_index: None, surrogate_index: 1},
 None,
 {pt: 3.3, global_index: 0, surrogate_index: 3}]
```

Since the particle at original index 2 was cut away, the particle whose parent it was now reports None. This is robust because we're always rebuilding the inverse map from whatever survives, so the technique keeps working after slicing, packing, or saving and loading.

Oh, and also, the particle attributes don't have to be modified. Maybe you want to hide such details as `surrogate_index` from users.
Actually, the size of the inverse map needs to be

```python
max(np.max(global_index), np.max(surrogate_index)) + 1
```

instead of just

```python
np.max(surrogate_index) + 1
```

In principle, it should be the size of the original table, but after slicing we might not know the size of the original table. For the technique to work, it only needs to be larger than every index that is going to be looked up in it, which involves `global_index` and not just `surrogate_index`.

Setting the size of an array by a value derived from another array (the `max`) might be a problem for dask-awkward's tracing. There might also be a problem with the mutation in place:

```python
inverse_map[surrogate_index] = np.arange(len(surrogate_index))
```

which is why I switched to NumPy. (Is it a problem in dask-awkward to switch between Awkward and NumPy? Is dask-awkward well connected to `dask.array`?)
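The sizing rule can be checked with a small NumPy-only sketch (values follow the example above, with -1 standing in for None; variable names are mine):

```python
import numpy as np

# positions that survived the cut, in original (uncut) numbering
surrogate_index = np.array([0, 1, 3, 4, 5, 6])
# each particle's parent, also in original numbering; -1 means "no parent"
global_index = np.array([-1, -1, 0, 1, 2, 3])

# size the inverse map by every index that will be looked up in it,
# not just surrogate_index -- hence the max over both arrays
size = max(np.max(global_index), np.max(surrogate_index)) + 1
inverse_map = np.full(size, -1, dtype=np.int64)
inverse_map[surrogate_index] = np.arange(len(surrogate_index))

# translate original-numbering parents into current (post-cut) positions;
# parents that were cut away (original index 2) or absent stay at -1
valid = global_index >= 0
local_parent = np.where(valid, inverse_map[np.clip(global_index, 0, None)], -1)
print(local_parent)   # [-1 -1  0  1 -1  2]
```

Note that sizing by `np.max(surrogate_index) + 1` alone happens to work here only because the largest surviving position exceeds the largest parent pointer; cutting the tail of the table would break that.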
How can we hide `surrogate_index` from users?
@jpivarski It's OK to use numpy so long as we hide it behind a single per-partition operation; the ending type is still easily determined.
Only by prepending it with an underscore and saying that Python rules apply here. There isn't a way to make it visible to a library like coffea and also not visible to users of that library. Or maybe it could be given a more meaningful name ("position in uncut genparticles") and not hidden; someone may be able to use it.
I think having the parent available even if it was filtered out is a hard requirement here.
@jrueb ok, a much more reasonable fix is now in place (and an even more reasonable fix than that to come). Please give it a try.
I think I have a possibly better solution for this: if I store (via some indirection) the original HLG layer that got generated, I can always reconstruct the dask array.
Merging this at the end of today (FNAL time) if there are no further issues.
Is the change to lookup_tools directly related to the NanoEvents cycles? It wasn't clear to me how they are related.

If we are not defaulting the weakref to serialize to None, then I have to control the lifecycle of the weakref manually, so they're actually related! (Otherwise using the dask distributed client fails, for instance, since it tries to pickle a weakref.)
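For context on why the pickling matters: a bare `weakref.ref` cannot be pickled at all, so anything holding one must either drop it or replace it during serialization. A minimal sketch of the serialize-to-None idea (the `OptionalRef` wrapper is hypothetical, not coffea's actual class):

```python
import pickle
import weakref

class Target:
    """Placeholder referent."""

t = Target()

# a bare weakref is not picklable -- this raises TypeError
try:
    pickle.dumps(weakref.ref(t))
except TypeError as err:
    print("cannot pickle:", err)

class OptionalRef:
    """Hypothetical wrapper: holds a weakref in memory, pickles to an empty ref."""

    def __init__(self, obj=None):
        self._ref = weakref.ref(obj) if obj is not None else None

    def get(self):
        return self._ref() if self._ref is not None else None

    def __reduce__(self):
        # discard the weakref on serialization; the copy deserializes empty
        return (OptionalRef, ())

r = OptionalRef(t)
assert r.get() is t                 # live reference before pickling
r2 = pickle.loads(pickle.dumps(r))
assert r2.get() is None             # deserialized copy holds no reference
```

This is the trade-off under discussion: serializing to an empty reference keeps distributed pickling working, but then someone has to manage the original object's lifetime explicitly.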
My issue indeed seems to be solved with this change. |
Fixes #823