REF: remove Block access in the JSON writing code #41081

jorisvandenbossche · 2021-04-21T14:33:46Z

This is a small refactor of the JSON writing code (objToJSON.c) to remove the access to the Blocks. Serializing to JSON was already done column-by-column (also when accessing the blocks), but until now the C code iterated over the "columns" of each block's 2D array. With this PR, the C code iterates directly over all the columns instead of per block.
This is a less invasive change than the previous attempt in #27166.

To be able to iterate over the column arrays in C, I added a column_arrays attribute on the manager.
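To picture the difference, here is a minimal pure-Python sketch, not the actual pandas or C implementation. Plain lists stand in for NumPy arrays, and the `blocks` layout and both helper functions are hypothetical names made up for illustration. The point is only that the serializer consumes a flat list of 1D columns and never looks at the blocks.

```python
import json

# Hypothetical toy layout: each "block" is (column positions, 2D values),
# with plain lists standing in for NumPy arrays.
blocks = [
    ([0, 2], [[1, 2], [5, 6]]),   # int block holding columns 0 and 2
    ([1], [[0.5, 1.5]]),          # float block holding column 1
]


def column_arrays(blocks):
    """Flatten the blocks into one list of 1D columns, in frame order."""
    ncols = sum(len(locs) for locs, _ in blocks)
    result = [None] * ncols
    for locs, values in blocks:
        for i, loc in enumerate(locs):
            result[loc] = values[i]
    return result


def to_json_columns(names, index, columns):
    """Serialize column by column, without touching the blocks."""
    data = {
        name: {str(label): value for label, value in zip(index, col)}
        for name, col in zip(names, columns)
    }
    return json.dumps(data, separators=(",", ":"))


result = to_json_columns(["a", "b", "c"], [0, 1], column_arrays(blocks))
print(result)
```

The same loop works for any manager that can hand over a list of 1D arrays, which is why this approach also suits ArrayManager.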

I ran the JSON ASV benchmarks with this branch:

$ asv continuous -b json -f 1.01 upstream/master HEAD

       before           after         ratio
     [692bba57]       [4142a532]
     <master>         <refactor-json-blocks>
+         162±2ms          192±2ms     1.19  io.json.ToJSON.time_to_json_wide('values', 'df_td_int_ts')
+         164±2ms          191±2ms     1.16  io.json.ToJSON.time_to_json_wide('split', 'df_td_int_ts')
+      96.2±0.3ms        111±0.8ms     1.15  io.json.ToJSON.time_to_json('values', 'df')
+         192±3ms          221±3ms     1.15  io.json.ToJSON.time_to_json_wide('records', 'df_td_int_ts')
+       106±0.3ms          121±1ms     1.15  io.json.ToJSON.time_to_json('split', 'df')
+        96.4±2ms          110±2ms     1.15  io.json.ToJSON.time_to_json('values', 'df_date_idx')
+         192±4ms          219±4ms     1.14  io.json.ToJSON.time_to_json_wide('index', 'df_td_int_ts')
+         177±4ms          198±3ms     1.12  io.json.ToJSON.time_to_json_wide('columns', 'df_td_int_ts')
+         112±1ms          124±2ms     1.11  io.json.ToJSON.time_to_json('records', 'df')
+       129±0.6ms          143±2ms     1.11  io.json.ToJSON.time_to_json('index', 'df')
+         138±1ms          152±1ms     1.10  io.json.ToJSON.time_to_json('index', 'df_date_idx')
+       168±0.9ms          182±1ms     1.09  io.json.ToJSONLines.time_floats_with_int_idex_lines
+         168±2ms          182±1ms     1.08  io.json.ToJSONLines.time_floats_with_dt_index_lines
+       117±0.9ms          127±1ms     1.08  io.json.ToJSON.time_to_json_wide('split', 'df_date_idx')
+         157±2ms          165±2ms     1.05  io.json.ToJSON.time_to_json_wide('split', 'df_int_floats')
-            177M             175M     0.99  io.json.ToJSON.peakmem_to_json_wide('split', 'df_int_float_str')
-         185±5ms          172±3ms     0.93  io.json.ToJSON.time_to_json('index', 'df_int_floats')
-         127±8ms        110±0.6ms     0.87  io.json.ToJSON.time_to_json('split', 'df_td_int_ts')

There is a slight slowdown in some of the benchmarks for wide dataframes, but of at most about 20% (the largest ratio above is 1.19).
If we don't want this slowdown, we could keep the original block-iteration code for BlockManager for now, instead of fully replacing it as this PR currently does.
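For reference, the ratio column is simply after / before. The sketch below recomputes it for three rows copied from the table above; the `ratio` helper is a made-up name, and because asv works from unrounded timings, recomputing from the rounded figures may differ in the last digit for other rows.

```python
# (before_ms, after_ms) pairs copied from the ASV output above.
timings = {
    "time_to_json_wide('values', 'df_td_int_ts')": (162, 192),
    "time_to_json('values', 'df')": (96.2, 111),
    "time_to_json('split', 'df_td_int_ts')": (127, 110),
}


def ratio(before, after):
    """Slowdown factor: values above 1 mean the branch is slower."""
    return round(after / before, 2)


ratios = {name: ratio(*pair) for name, pair in timings.items()}
worst = max(ratios.values())
print(ratios, worst)
```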

This also fixes some remaining cases for ArrayManager (xref #39146)

jorisvandenbossche · 2021-04-21T14:41:38Z

A bit of optimization of BlockManager.column_arrays might reduce the slowdown.

WillAyd · 2021-04-21T14:57:34Z

pandas/_libs/src/ujson/python/objToJSON.c


-            blkCtxt->cindices[colIdx] = idx;


I think if you remove this assignment to cindices you can remove this struct member altogether - not aware of it being written to anywhere else. Might help reduce confusion over state management here

Yes, indeed, there are still some further things to clean up. And this cindices is now no longer used.

pandas/tests/io/json/test_normalize.py

jreback · 2021-04-21T16:14:19Z

+1 on deleting the block code

jorisvandenbossche · 2021-04-21T19:16:09Z

With the last commit (adding some extra Python code to more efficiently get a list of 1D ndarrays for BlockManager), I now get these ASV results (for the same asv call as above):

       before           after         ratio
     [692bba57]       [3c9aa994]
     <master>         <refactor-json-blocks>
+       114±0.6ms          122±2ms     1.07  io.json.ToJSON.time_to_json_wide('split', 'df_date_idx')
+         152±1ms          162±2ms     1.06  io.json.ToJSON.time_to_json_wide('records', 'df')
+         148±2ms          156±2ms     1.05  io.json.ToJSON.time_to_json_wide('columns', 'df')
-      79.8±0.2ms       78.8±0.2ms     0.99  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_floats')

So that looks quite good I think
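The idea behind that commit, as the docstring quoted later in the review puts it, is to convert each block's values to an ndarray only once up front instead of once per column. Below is a rough pure-Python sketch of that idea, not the pandas code: `Block`, `to_array` and the call counter are hypothetical stand-ins, with `to_array` playing the role of `np.asarray(blk.values)`.

```python
class Block:
    """Hypothetical stand-in for a pandas Block."""

    def __init__(self, mgr_locs, values):
        self.mgr_locs = mgr_locs  # positions of the columns in the frame
        self.values = values      # 2D data, one row per column


calls = {"n": 0}


def to_array(values):
    """Stand-in for np.asarray(blk.values); counts how often it runs."""
    calls["n"] += 1
    return [list(row) for row in values]


def column_arrays_per_column(blocks, ncols):
    """Naive variant: converts the whole block again for every column."""
    result = [None] * ncols
    for blk in blocks:
        for i, loc in enumerate(blk.mgr_locs):
            result[loc] = to_array(blk.values)[i]
    return result


def column_arrays_once(blocks, ncols):
    """Optimized variant: converts each block once, then slices rows."""
    arrays = [to_array(blk.values) for blk in blocks]
    result = [None] * ncols
    for blk, arr in zip(blocks, arrays):
        for i, loc in enumerate(blk.mgr_locs):
            result[loc] = arr[i]
    return result


blocks = [Block([0, 2], [[1, 2], [5, 6]]), Block([1], [[3.0, 4.0]])]

calls["n"] = 0
naive = column_arrays_per_column(blocks, 3)
naive_calls = calls["n"]

calls["n"] = 0
once = column_arrays_once(blocks, 3)
once_calls = calls["n"]
```

Both variants return the same columns; the naive one runs one conversion per column, the other one conversion per block, which matters most for wide frames with many columns per block.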

WillAyd

lgtm when green - nice simplification!

jorisvandenbossche · 2021-04-22T20:56:38Z

@jreback do you want to review this more in depth?

jreback · 2021-04-23T00:33:17Z

lgtm. need a rebase

jbrockmendel · 2021-04-23T01:21:00Z

pandas/core/internals/blocks.py

@@ -224,14 +224,6 @@ def get_values(self, dtype: DtypeObj | None = None) -> np.ndarray:
        # expected "ndarray")
        return self.values  # type: ignore[return-value]

-    @final
-    def get_block_values_for_json(self) -> np.ndarray:


i think the "final" is out of date; same method on DatetimeLikeBlock

There is only a single get_block_values_for_json (no Block subclass overrides it)

Or at least until a few hours ago, when your PR that changed this was merged ;) (~~You could have just as well said that you just changed this / pointed to #41082~~)

jbrockmendel · 2021-04-23T01:21:46Z

pandas/core/internals/managers.py

+        This optimizes compared to using `iget_values` by converting each
+        block.values to a np.ndarray only once up front
+        """
+        arrays = [np.asarray(blk.values) for blk in self.blocks]


i think this will change the behavior for dt64tz

This will not change behaviour, as it is the same as what get_block_values_for_json did until a few hours ago. The comment you added to the changed implementation in #41082 is "Not necessary to override, but helps perf", so I am assuming that change didn't alter behaviour either.

I would also assume it is covered by the tests, but checking explicitly:

Master:

In [1]: pd.DataFrame({'a': [1, 2], 'b': pd.date_range("2012", periods=2, tz="Europe/Brussels")}).to_json()
Out[1]: '{"a":{"0":1,"1":2},"b":{"0":1325372400000,"1":1325458800000}}'

In [2]: pd.DataFrame({'b': pd.date_range("2012", periods=2, tz="Europe/Brussels")}).to_json()
Out[2]: '{"b":{"0":1325372400000,"1":1325458800000}}'

PR:

In [1]: pd.DataFrame({'a': [1, 2], 'b': pd.date_range("2012", periods=2, tz="Europe/Brussels")}).to_json()
Out[1]: '{"a":{"0":1,"1":2},"b":{"0":1325372400000,"1":1325458800000}}'

In [2]: pd.DataFrame({'b': pd.date_range("2012", periods=2, tz="Europe/Brussels")}).to_json()
Out[2]: '{"b":{"0":1325372400000,"1":1325458800000}}'
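As a sanity check on the values themselves, the stdlib-only sketch below reproduces the two epoch-millisecond numbers without pandas. It assumes that Europe/Brussels is at UTC+1 in January 2012 and uses a fixed offset rather than a real time zone database; `epoch_ms` is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Europe/Brussels is UTC+1 (CET) in January 2012.
cet = timezone(timedelta(hours=1))


def epoch_ms(dt):
    """Milliseconds since the Unix epoch (UTC) for a tz-aware datetime."""
    return int(dt.timestamp() * 1000)


# Midnight local time on 2012-01-01 and 2012-01-02, as in date_range above.
values = [epoch_ms(datetime(2012, 1, day, tzinfo=cet)) for day in (1, 2)]
print(values)
```

So the serialized integers are the UTC instants of local midnight, both on master and with this PR.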

Now, to preserve the performance improvement of #41082, I can add a check for datetimetz.

hopefully more accurate statement: i think this will entail an object-dtype conversion for dt64tz. I'm not especially bothered by this bc i think this is a good cleanup

... which i now see you already addressed in a new commit. carry on

jorisvandenbossche · 2021-04-26T07:51:19Z

The CI failure is a flaky ResourceWarning.

jorisvandenbossche added the Refactor (Internal refactoring of code) and IO JSON (read_json, to_json, json_normalize) labels Apr 21, 2021

jorisvandenbossche requested review from WillAyd and jreback April 21, 2021 14:33

WillAyd reviewed Apr 21, 2021

jorisvandenbossche added 2 commits April 21, 2021 19:21

REF: remove Block access in the JSON writing code

df8d226

optimize BlockManager.column_arrays

e0f32f2

jorisvandenbossche force-pushed the refactor-json-blocks branch from 3c9aa99 to e0f32f2 Compare April 21, 2021 17:21

jorisvandenbossche added 2 commits April 21, 2021 21:20

clean: remove cindices

ce2f9d0

add docstring to column_arrays

14fa6d3

WillAyd approved these changes Apr 21, 2021

Merge remote-tracking branch 'upstream/master' into refactor-json-blocks

2ca894b

jreback added this to the 1.3 milestone Apr 23, 2021

jbrockmendel reviewed Apr 23, 2021

jorisvandenbossche added 5 commits April 23, 2021 08:39

Merge remote-tracking branch 'upstream/master' into refactor-json-blocks

14d1953

add check for datetimetz

03ade01

add comment about special case

aad5c32

fixup

b19fc1f

Merge remote-tracking branch 'upstream/master' into refactor-json-blocks

ba863f2

jorisvandenbossche merged commit 7c1f454 into pandas-dev:master Apr 26, 2021

jorisvandenbossche deleted the refactor-json-blocks branch April 26, 2021 07:51

This was referenced Apr 26, 2021

Refactor - ArrayManager overview issue #39146

Closed

[ArrayManager] TST: remove more ArrayManager skips for JSON IO #41164

Merged

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

REF: remove Block access in the JSON writing code (pandas-dev#41081)

de9d9c4

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

REF: remove Block access in the JSON writing code (pandas-dev#41081)

701613a
