Toplevel datasets #951

jakevdp · 2018-06-20T04:08:02Z

With this change, you can run:

alt.data_transformers.enable(consolidate_datasets=True)

and then in all rendered charts, inline data will be replaced with named data and added to the top-level datasets attribute. Identical datasets are automatically assigned the same name.

This addresses the issue that typical usage patterns end up embedding multiple copies of the dataset within compound charts. I think that this consolidation behavior should probably be the default, but I'd like to get it in as an option to start with.

As an example, consider the following chart, using a standard concatenation pattern:

import altair as alt
import pandas as pd

data = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52],
    'c': [43, 91, 81, 53, 19, 87, 52, 28, 55]
})

base = alt.Chart(data).mark_bar().encode(
    y='a'
).properties(width=200)

chart = base.encode(x='b') | base.encode(x='c')
chart

If we look at the spec produced by the chart, we see that this causes the dataset to be duplicated between the two charts:

>>> chart.to_dict()
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.5.2.json',
 'config': {'view': {'height': 300, 'width': 400}},
 'hconcat': [{'data': {'values': [{'a': 'A', 'b': 28, 'c': 43},
     {'a': 'B', 'b': 55, 'c': 91},
     {'a': 'C', 'b': 43, 'c': 81},
     {'a': 'D', 'b': 91, 'c': 53},
     {'a': 'E', 'b': 81, 'c': 19},
     {'a': 'F', 'b': 53, 'c': 87},
     {'a': 'G', 'b': 19, 'c': 52},
     {'a': 'H', 'b': 87, 'c': 28},
     {'a': 'I', 'b': 52, 'c': 55}]},
   'encoding': {'x': {'field': 'b', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200},
  {'data': {'values': [{'a': 'A', 'b': 28, 'c': 43},
     {'a': 'B', 'b': 55, 'c': 91},
     {'a': 'C', 'b': 43, 'c': 81},
     {'a': 'D', 'b': 91, 'c': 53},
     {'a': 'E', 'b': 81, 'c': 19},
     {'a': 'F', 'b': 53, 'c': 87},
     {'a': 'G', 'b': 19, 'c': 52},
     {'a': 'H', 'b': 87, 'c': 28},
     {'a': 'I', 'b': 52, 'c': 55}]},
   'encoding': {'x': {'field': 'c', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200}]}

This is not a big problem for small datasets, but with large datasets and charts with many layers or panels, this can lead to unnecessarily large specs.
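To get a rough feel for that growth, here is a minimal sketch that compares the serialized size of a spec with inline data in every panel against one that uses a single named dataset. It uses only plain dictionaries and the standard library; the helper names `inline_spec` and `named_spec`, the dataset name `data-example`, and the 1000-row table are made up for illustration and are not part of Altair.

```python
import json

# Hypothetical 1000-row dataset, standing in for a "large" DataFrame.
rows = [{'a': str(i), 'b': i, 'c': 2 * i} for i in range(1000)]


def inline_spec(n_panels):
    """Spec in which every panel embeds its own copy of the data."""
    return {'hconcat': [{'data': {'values': rows}, 'mark': 'bar'}
                        for _ in range(n_panels)]}


def named_spec(n_panels):
    """Spec in which every panel refers to one top-level dataset by name."""
    return {'datasets': {'data-example': rows},
            'hconcat': [{'data': {'name': 'data-example'}, 'mark': 'bar'}
                        for _ in range(n_panels)]}


for n in (1, 2, 8):
    inline_size = len(json.dumps(inline_spec(n)))
    named_size = len(json.dumps(named_spec(n)))
    print(n, inline_size, named_size)
```

The inline size grows roughly linearly with the number of panels, while the named version adds only a short reference per panel.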

With the mechanism in this PR, you can do the following:

>>> alt.data_transformers.enable(consolidate_datasets=True)
>>> chart.to_dict()
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.5.2.json',
 'config': {'view': {'height': 300, 'width': 400}},
 'datasets': {'data-ee7f01090b9e4fcef2554f6712660b80': [
   {'a': 'A', 'b': 28, 'c': 43},
   {'a': 'B', 'b': 55, 'c': 91},
   {'a': 'C', 'b': 43, 'c': 81},
   {'a': 'D', 'b': 91, 'c': 53},
   {'a': 'E', 'b': 81, 'c': 19},
   {'a': 'F', 'b': 53, 'c': 87},
   {'a': 'G', 'b': 19, 'c': 52},
   {'a': 'H', 'b': 87, 'c': 28},
   {'a': 'I', 'b': 52, 'c': 55}]},
 'hconcat': [{'data': {'name': 'data-ee7f01090b9e4fcef2554f6712660b80'},
   'encoding': {'x': {'field': 'b', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200},
  {'data': {'name': 'data-ee7f01090b9e4fcef2554f6712660b80'},
   'encoding': {'x': {'field': 'c', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200}]}

Notice that the data in each subchart is replaced by a named reference to a single dataset at the top level. This should all be transparent to the user, but result in more efficient chart specifications.
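The idea can be sketched on plain dictionaries as follows. This is not the code in this PR: the function name `consolidate_datasets`, the traversal, and the choice of an MD5 hash of the JSON-serialized values for the name are all assumptions made for illustration (the `data-` plus 32 hex characters format in the output above is merely consistent with such a hash).

```python
import copy
import hashlib
import json


def consolidate_datasets(spec):
    """Return a copy of `spec` with inline data moved to top-level 'datasets'.

    Illustrative sketch only; not Altair's actual implementation.
    """
    spec = copy.deepcopy(spec)
    found = {}

    def visit(node):
        if isinstance(node, dict):
            data = node.get('data')
            if isinstance(data, dict) and 'values' in data:
                values = data['values']
                # Identical values serialize identically, so they share a name.
                payload = json.dumps(values, sort_keys=True).encode('utf-8')
                name = 'data-' + hashlib.md5(payload).hexdigest()
                found[name] = values
                # Keep any other data properties (e.g. 'format') untouched.
                new_data = {k: v for k, v in data.items() if k != 'values'}
                new_data['name'] = name
                node['data'] = new_data
            for key, child in node.items():
                # Do not descend into data or already-collected datasets.
                if key not in ('data', 'datasets'):
                    visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(spec)
    if found:
        spec['datasets'] = {**spec.get('datasets', {}), **found}
    return spec


rows = [{'a': 'A', 'b': 28, 'c': 43}, {'a': 'B', 'b': 55, 'c': 91}]
spec = {'hconcat': [
    {'data': {'values': rows}, 'mark': 'bar'},
    {'data': {'values': list(rows)}, 'mark': 'bar'},
]}
result = consolidate_datasets(spec)
```

Because the name is derived from the content, two panels built from equal data end up pointing at the same entry without any bookkeeping about where the data came from.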

The only reason I hesitate in making this the default is that I'm worried there may be corner cases I'm not thinking about where this would break a working chart.
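One way to probe for such corner cases at the spec level is a round-trip check: expand every named reference back into inline values and compare the result with the original spec. The sketch below is a hypothetical helper, not something in Altair or in this PR; the name `inline_named_data` and the example dataset name `data-example` are assumptions. It only checks that the dictionaries match and says nothing about how Vega-Lite renders them.

```python
import copy


def inline_named_data(spec):
    """Return a copy of `spec` with named data replaced by inline values.

    Illustrative inverse of dataset consolidation; not part of Altair.
    """
    spec = copy.deepcopy(spec)
    datasets = spec.pop('datasets', {})

    def visit(node):
        if isinstance(node, dict):
            data = node.get('data')
            if isinstance(data, dict) and data.get('name') in datasets:
                # Swap the name for a fresh copy of the referenced values.
                new_data = {k: v for k, v in data.items() if k != 'name'}
                new_data['values'] = copy.deepcopy(datasets[data['name']])
                node['data'] = new_data
            for key, child in node.items():
                if key != 'data':
                    visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(spec)
    return spec


consolidated = {
    'datasets': {'data-example': [{'a': 'A', 'b': 28}]},
    'hconcat': [
        {'data': {'name': 'data-example'}, 'mark': 'bar'},
        {'data': {'name': 'data-example'}, 'mark': 'point'},
    ],
}
expanded = inline_named_data(consolidated)
```

If the expanded spec equals the spec produced without consolidation, the transformation at least lost no information for that chart.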


jakevdp · 2018-06-20T15:12:03Z

@kanitw @domoritz I'm curious if there's anything obvious I'm overlooking from the vega-lite side here... Is there any reason that replacing {data: {values: [...]}} with {data: {name: '...'}} might cause a problem that I'm not considering?

domoritz · 2018-06-20T15:14:09Z

This is great! It might even make sense to make this the default, no?

I don't think there are any issues with using named data sources instead of inline values, but let's check with @jheer. It is possible that Vega will make extra copies, but I don't think that's the case here because the data source we create only sets the values and has no transforms.

jakevdp · 2018-06-20T15:15:11Z

I think it would make sense as the default, I'm just worried that it might break the chart in situations that I'm not considering.

domoritz · 2018-06-20T15:18:51Z

I would be surprised and would consider that a bug in Vega-Lite since I designed the top level data sources specifically for what you're using it for.

jakevdp · 2018-06-27T20:15:12Z

I'm going to merge as is, and start an issue discussing whether we should make this behavior the default.

jakevdp added 2 commits June 19, 2018 13:23

ENH: add global settings capability to plugin registry

19c275a

ENH: allow optional automatic consolodation of inline datasets into top-level datasets

a4da4b0

jakevdp merged commit 9eb71cd into vega:master Jun 27, 2018

jakevdp deleted the toplevel-datasets branch June 27, 2018 20:16

jakevdp mentioned this pull request Jun 27, 2018

ENH: Move all datasets to top level by default? #981

Closed

This was referenced Jul 3, 2018

consolidate datasets within a spec vegawidget/vegawidget#9

Closed

idea: remove and distill data from/in charts vegawidget/altair#26

Closed

Keep an eye on vega_datasets being imported into altair vegawidget/altair#51

Closed

ijlyttle mentioned this pull request Jul 11, 2018

Feature request: top-level datasets to be more-expressive vega/vega-lite#4004

Open
