accelerate plotly JSON encoder for numpy arrays without nans #2880
Conversation
Is the basic idea here that you're skipping the recursive encoding of the elements of the array? If so, that makes a lot of sense.
I understand (but maybe this is not correct) that there is an additional cycle of decoding-reencoding here in order to take care of nans and infs. If this is the case, we can check whether there are no nans and infs and skip these steps (the "recursive encoding"), I think. But maybe I don't understand completely the purpose of the recursive encoding. The CI failure of one of the builds is definitely linked to this PR; I need to investigate.
ohh, wow, I didn't realize (remember?) that we did that double encoding. That would definitely slow things down. It would be great to get rid of that altogether, but it would take some investigation to work through why it's there in the first place. What I meant by "recursive" is that JSON serialization handles lists of objects (like fig["data"]) by recursively encoding each element of the list. I'm not certain, but I think this recursive encoding happens for each element of numpy arrays as well, even though it wouldn't be strictly necessary.
It'd be nice to get rid of the double-encoding in general, but in the meantime, if there are special cases where we feel safe to skip it, e.g. arrays where every entry is a finite value, then that would be a short-term win for many cases too :)
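For reference, a rough sketch of the double-encoding pattern being discussed here (a paraphrase of the idea, not the exact plotly.py source):

```python
import json

class DoubleEncodingSketch(json.JSONEncoder):
    """Sketch of the encode -> decode -> re-encode pattern: the first encode()
    pass may emit the non-standard tokens NaN/Infinity, so the result is
    decoded (mapping those tokens to None) and dumped a second time so the
    output contains strict-JSON 'null' instead."""

    @staticmethod
    def coerce_to_strict(const):
        # json.loads calls this for the tokens "NaN", "Infinity", "-Infinity"
        return None

    def encode(self, o):
        encoded_o = super().encode(o)
        new_o = json.loads(encoded_o, parse_constant=self.coerce_to_strict)
        return json.dumps(new_o)
```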
I'll note that by playing string games with
obj.dtype, numpy.number
):
    if numpy.all(numpy.isfinite(obj)):
        self.unsafe = True
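For context, a sketch of what this early approach could look like in full (a reconstruction from the fragment above; the surrounding PlotlyJSONEncoder internals and the handling of the flag are assumptions):

```python
import json
import numpy

class FiniteCheckSketch(json.JSONEncoder):
    def default(self, obj):
        # Hypothetical reconstruction: mark the encoder when a numeric numpy
        # array contains only finite values, so re-encoding could be skipped.
        if isinstance(obj, numpy.ndarray) and numpy.issubdtype(
            obj.dtype, numpy.number
        ):
            if numpy.all(numpy.isfinite(obj)):
                self.unsafe = True  # flag name taken verbatim from the diff fragment
            return obj.tolist()
        return super().default(obj)
```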
Not sure if I am missing something. We can test all the numpy arrays that are in the dict being encoded, but we'd not catch nans and infs in any sublists, right?
Yes, this will work only for numpy arrays; for other kinds of objects the existing process (encoding - decoding - reencoding) will still be executed. Or at least this is the intended behaviour :-), maybe I am the one missing something! I need to write tests anyway.
but for other kinds of objects the existing process (encoding - decoding - reencoding) will still be executed
I don't think this is the case, since as I understand this class, encode() is not called recursively. Or is it?
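As an illustration of that point (a standalone sketch, not plotly.py code): the standard library calls encode() once at the top level, while default() is invoked for every object the encoder does not natively understand, however deeply nested it is.

```python
import json
import numpy as np

class TracingEncoder(json.JSONEncoder):
    def encode(self, o):
        print("encode(): called once, at the top level")
        return super().encode(o)

    def default(self, o):
        # called once per unknown object, e.g. each numpy array in the structure
        if isinstance(o, np.ndarray):
            print("default(): called for an ndarray")
            return o.tolist()
        return super().default(o)

json.dumps({"a": np.arange(3), "b": [np.arange(2)]}, cls=TracingEncoder)
# encode() prints once; default() prints twice
```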
@jonmmease I wonder if you could help me understand the py3.7-orca CI failure, which I cannot reproduce locally in a conda environment with (I think) the same packages installed and py3.7. The problem occurs in the doctest of the
FWIW - just in case anyone was looking to go down this route: last Friday I did a brief stab at making encoding faster. I figured we could overload the JSONEncoder.iterencode method, since it defines the encoding function internally. Added: I did a quick test to force using the Python (non-C) version of the builtin encoder, but that makes it twice as slow.
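A minimal sketch of one way to run that experiment (an illustration only, not the commenter's actual code): the stdlib only takes the fast C path when iterencode() is called with _one_shot=True, so always passing False forces the pure-Python generator.

```python
import json
import numpy as np

class PurePythonEncoder(json.JSONEncoder):
    def iterencode(self, o, _one_shot=False):
        # Never allow the one-shot C encoder; fall back to pure Python.
        return super().iterencode(o, _one_shot=False)

    def default(self, o):
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)

payload = {"data": np.random.rand(1000, 1000)}
json.dumps(payload, cls=PurePythonEncoder)  # noticeably slower than the C path
```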
Hi @emmanuelle, A while ago I profiled plotly/dash and found the numpy-to-JSON encoding to be one of the bottlenecks, and also found that by not doing the encoding twice the code can be sped up by approximately a factor of 2. The way I implemented it, I was able to handle numpy arrays with both +/- infs and nans without the need to re-encode. But some care is needed, as numpy has quite a lot of different variable types. Hope this helps.
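A sketch of that kind of single-pass handling (illustrative only; the helper name and dtype coverage are assumptions, not the commenter's actual implementation):

```python
import numpy as np

def ndarray_to_json_safe_list(arr):
    """Convert a numeric numpy array to a list with NaN/+-Inf replaced by None,
    so it can be serialized in one pass without a decode/re-encode step.
    Real code would need to cover more dtypes (datetimes, object arrays, ...)."""
    arr = np.asarray(arr)
    if np.issubdtype(arr.dtype, np.floating):
        out = arr.astype(object)
        out[~np.isfinite(arr)] = None  # NaN, Inf and -Inf become JSON null
        return out.tolist()
    return arr.tolist()
```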
@almarklein you were right of course, I had not understood in which order the different methods were called, so my workaround did not work. Therefore I resorted to a brute-force approach: testing whether the JSON string produced by the first encoding pass contains the substrings "NaN" or "Infinity", and if not, skipping the additional decoding and re-encoding step. For large data arrays the cost of checking for these two patterns is small compared to the encoding time (~2% of the time on a couple of examples I tested). This adds a slight cost for figures that do contain NaN or Infinity either in their data or in text strings (for example titles, labels etc.), but I think we can live with this, given the large speed-up (x2.5) for non-problematic cases.
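Pieced together from the diff fragments in this review, the resulting encode() presumably looks roughly like this (a sketch, not the exact merged code):

```python
import json

class EarlyReturnSketch(json.JSONEncoder):
    def encode(self, o):
        # this will raise errors in a normal-expected way
        encoded_o = super().encode(o)
        # If the first pass produced no non-standard tokens, the string is
        # already strict JSON and the decode/re-encode step can be skipped.
        # Titles or labels containing the literal text "NaN"/"Infinity" simply
        # fall through to the slower path, so correctness is preserved.
        if not ("Infinity" in encoded_o or "NaN" in encoded_o):
            return encoded_o
        new_o = json.loads(encoded_o, parse_constant=lambda const: None)
        return json.dumps(new_o)
```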
One question, but otherwise lgtm
# this will raise errors in a normal-expected way
self.hasinfnans = False
Are you using hasinfnans anywhere else now?
No, you were right :-). Thanks!
# but this is ok since the intention is to skip the decoding / reencoding
# step when it's completely safe
if not ("Infinity" in encoded_o or "NaN" in encoded_o):
    return encoded_o
I'm fine with this for now given your description of the performance characteristics.
maybe check for NaN first, as it feels like that'd appear more frequently?
I like this solution! I find it actually quite elegant given the limitations of the json encoder that we have to work with, and I'm pretty sure that any solution that somehow checks the input beforehand would be slower than this proposed approach. The only downside is that the speed of encoding depends on the presence of NaN and Inf in the object, which might be an unexpected effect. But these cases are hopefully rare, e.g. the plotly example about gaps in line charts uses
💃
@pfbuxton looks like you went down this path before; I'm sorry no one replied to you more than a year ago :( Thanks for bringing up your idea again in this context! I think that your idea would layer well onto this one, i.e. now that we only re-encode when we need to, modifying
This is a proof-of-concept PR; I need to add a bunch of tests to be sure it behaves well for numpy arrays of different types. The idea is to remove the encoding-decoding-reencoding process when JSON-serializing a numpy array without nans or infs (handling nans and infs is the reason this 3-step process exists in the first place).
For large data arrays (for example a 1000x1000 array) there is a large performance gain: the proposed change results in a 2.5x speed-up on my machine (0.11 s instead of 0.27 s). The performance gain is interesting in Dash apps in particular.
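A hypothetical timing snippet of the kind that produces numbers like these (the exact figure used in the original benchmark is not preserved here; go.Heatmap and the array size are assumptions):

```python
import time
import numpy as np
import plotly.graph_objects as go

z = np.random.rand(1000, 1000)  # large numeric array with no NaN/Inf values
fig = go.Figure(go.Heatmap(z=z))

start = time.perf_counter()
fig.to_json()  # exercises the Plotly JSON encoder
print(f"encoding took {time.perf_counter() - start:.2f} s")
```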
@jonmmease what do you think of this?