ENH: Optional dependencies for accelerating JSON serialization #2944
Comments
Some complement on the performance of orjson: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html#tables

I have also been digging into the JSON serialization performance in plotly, and noticed that, on a large plot built from a large DataFrame, the figure generation (so not related to JSON) is more than 13x faster (from 1.8s to 0.4s) when done like this:

```python
import plotly.express as px

# df is a wide pandas DataFrame with one column per line to plot.
# Build the figure skeleton from a single row, then fill in the full data.
fig = px.line(df.iloc[:1])
data = fig["data"]
traces = {trace["name"]: trace for trace in data}
x = df.index
for col, y in df.items():
    trace = traces[str(col)]
    trace["x"] = x
    trace["y"] = y
```

and in this case, we can also manage the NaN values more efficiently by removing them from the trace:

```python
fig = px.line(df.iloc[:1])
data = fig["data"]
traces = {trace["name"]: trace for trace in data}
x = df.index
for col, y in df.items():
    trace = traces[str(col)]
    # Drop NaN points instead of serializing them.
    notnan = ~y.isna()
    trace["x"] = x[notnan]
    trace["y"] = y[notnan]
```

I hope this information can help improve plotly's performance. I haven't tested with the change from #2880.
Thanks for sharing your observations here @sdementen.
On top of #2943, I investigated a couple of interesting libraries we could potentially use as optional dependencies to further accelerate JSON serialization.

**pybase64**

I played with `pybase64` a little, and it looks like an easy way to get a decent speedup over the built-in Python `base64` module for performing the numpy base64 encoding step being introduced in #2943. This wouldn't require any refactoring, and can drop the base64 encoding time (which is a substantial portion of the total JSON encoding time for figures that contain large numpy arrays) by something like 20% to 40%.
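As a rough illustration (the array size and repeat count below are arbitrary; the relevant point is that `pybase64.b64encode` mirrors the standard library's `base64.b64encode`), a quick sketch comparing the two on a large numpy buffer:

```python
import base64
import timeit

import numpy as np
import pybase64  # optional: pip install pybase64

# A large float array, similar to what a big scatter trace might carry.
buf = np.random.randn(1_000_000).tobytes()

# pybase64.b64encode is a drop-in replacement for base64.b64encode,
# so adopting it would not require restructuring the encoding step.
assert pybase64.b64encode(buf) == base64.b64encode(buf)

print("stdlib base64:", timeit.timeit(lambda: base64.b64encode(buf), number=20))
print("pybase64:     ", timeit.timeit(lambda: pybase64.b64encode(buf), number=20))
```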
**orjson**

`orjson` is a really impressive alternative JSON encoder that, in playing with it a little bit, I've seen be 2x to 5x faster than the built-in Python `json` encoder. `orjson` doesn't support custom JSON encoder classes (like `PlotlyJSONEncoder`), so supporting it as an optional dependency would require a total refactor of the current JSON encoding process. Basically, we would need to switch to an architecture where we preprocess the figure dictionary recursively to perform any conversions we need, and then feed that dictionary through the JSON encoder.

Another nice thing about `orjson` is that it automatically converts `nan` and `infinity` values to JSON `null` values, so the JSON re-encoding work we were doing in #2880 wouldn't be needed (cc @emmanuelle).