Toplevel datasets #951

Merged
merged 2 commits into from
Jun 27, 2018

Conversation

jakevdp
Collaborator

@jakevdp jakevdp commented Jun 20, 2018

With this change, you can run:

alt.data_transformers.enable(consolidate_datasets=True)

and then in all rendered charts, inline data will be moved to the top-level datasets attribute and replaced with a named reference. Identical datasets are automatically assigned the same name, so each one is stored only once.

This addresses the issue that typical usage patterns end up embedding multiple copies of the dataset within compound charts. I think that this consolidation behavior should probably be the default, but I'd like to get it in as an option to start with.

As an example, consider the following chart, using a standard concatenation pattern:

import altair as alt
import pandas as pd

data = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52],
    'c': [43, 91, 81, 53, 19, 87, 52, 28, 55]
})

base = alt.Chart(data).mark_bar().encode(
    y='a'
).properties(width=200)

chart = base.encode(x='b') | base.encode(x='c')
chart


If we look at the spec produced by the chart, we see that this causes the dataset to be duplicated between the two charts:

>>> chart.to_dict()
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.5.2.json',
 'config': {'view': {'height': 300, 'width': 400}},
 'hconcat': [{'data': {'values': [{'a': 'A', 'b': 28, 'c': 43},
     {'a': 'B', 'b': 55, 'c': 91},
     {'a': 'C', 'b': 43, 'c': 81},
     {'a': 'D', 'b': 91, 'c': 53},
     {'a': 'E', 'b': 81, 'c': 19},
     {'a': 'F', 'b': 53, 'c': 87},
     {'a': 'G', 'b': 19, 'c': 52},
     {'a': 'H', 'b': 87, 'c': 28},
     {'a': 'I', 'b': 52, 'c': 55}]},
   'encoding': {'x': {'field': 'b', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200},
  {'data': {'values': [{'a': 'A', 'b': 28, 'c': 43},
     {'a': 'B', 'b': 55, 'c': 91},
     {'a': 'C', 'b': 43, 'c': 81},
     {'a': 'D', 'b': 91, 'c': 53},
     {'a': 'E', 'b': 81, 'c': 19},
     {'a': 'F', 'b': 53, 'c': 87},
     {'a': 'G', 'b': 19, 'c': 52},
     {'a': 'H', 'b': 87, 'c': 28},
     {'a': 'I', 'b': 52, 'c': 55}]},
   'encoding': {'x': {'field': 'c', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200}]}

This is not a big problem for small datasets, but with large datasets and charts with many layers or panels, this can lead to unnecessarily large specs.
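To get a rough feel for how the duplication scales, here is a minimal sketch that uses plain dictionaries rather than Altair itself. The helper names (`inline_spec`, `named_spec`, `spec_size`) and the dataset name `data-example` are hypothetical and exist only for this illustration; the specs are stripped-down stand-ins, not complete Vega-Lite charts.

```python
import json


def inline_spec(values, n_panels):
    # Every panel embeds its own copy of the values, as in the spec above.
    return {
        'hconcat': [
            {'data': {'values': values}, 'mark': 'bar'}
            for _ in range(n_panels)
        ]
    }


def named_spec(values, n_panels, name='data-example'):
    # The values are stored once at the top level; panels refer to them by name.
    return {
        'datasets': {name: values},
        'hconcat': [
            {'data': {'name': name}, 'mark': 'bar'}
            for _ in range(n_panels)
        ],
    }


def spec_size(spec):
    # Length of the serialized JSON, as a rough proxy for spec size.
    return len(json.dumps(spec))


values = [{'a': i, 'b': i * 2} for i in range(1000)]
for n_panels in (1, 2, 8):
    print(n_panels,
          spec_size(inline_spec(values, n_panels)),
          spec_size(named_spec(values, n_panels)))
```

The inline size grows roughly in proportion to the number of panels, while the named version stays close to the size of a single copy of the data.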

With the mechanism in this PR, you can do the following:

>>> alt.data_transformers.enable(consolidate_datasets=True)
>>> chart.to_dict()
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.5.2.json',
 'config': {'view': {'height': 300, 'width': 400}},
 'datasets': {'data-ee7f01090b9e4fcef2554f6712660b80': [
   {'a': 'A', 'b': 28, 'c': 43},
   {'a': 'B', 'b': 55, 'c': 91},
   {'a': 'C', 'b': 43, 'c': 81},
   {'a': 'D', 'b': 91, 'c': 53},
   {'a': 'E', 'b': 81, 'c': 19},
   {'a': 'F', 'b': 53, 'c': 87},
   {'a': 'G', 'b': 19, 'c': 52},
   {'a': 'H', 'b': 87, 'c': 28},
   {'a': 'I', 'b': 52, 'c': 55}]},
 'hconcat': [{'data': {'name': 'data-ee7f01090b9e4fcef2554f6712660b80'},
   'encoding': {'x': {'field': 'b', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200},
  {'data': {'name': 'data-ee7f01090b9e4fcef2554f6712660b80'},
   'encoding': {'x': {'field': 'c', 'type': 'quantitative'},
    'y': {'field': 'a', 'type': 'nominal'}},
   'mark': 'bar',
   'width': 200}]}

Notice that the data in each subchart is replaced by a named reference to a single dataset at the top level. This should all be transparent to the user, but result in more efficient chart specifications.
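The idea can be sketched on a plain spec dictionary as follows. This is only an illustration of the approach, not the code in this PR: the function name `consolidate_inline_data` is hypothetical, and naming each dataset by an MD5 hash of its JSON serialization is an assumption chosen so that identical datasets end up with the same name.

```python
import copy
import hashlib
import json


def consolidate_inline_data(spec):
    """Return a copy of spec with inline data hoisted into top-level datasets."""
    spec = copy.deepcopy(spec)
    datasets = dict(spec.get('datasets', {}))

    def visit(node):
        if isinstance(node, dict):
            data = node.get('data')
            if isinstance(data, dict) and 'values' in data:
                values = data['values']
                # Assumed naming scheme: hash the serialized values so that
                # identical datasets map to the same name.
                digest = hashlib.md5(
                    json.dumps(values, sort_keys=True).encode('utf-8')
                ).hexdigest()
                name = 'data-' + digest
                datasets[name] = values
                # Keep any other keys on the data object; swap values for a name.
                new_data = {k: v for k, v in data.items() if k != 'values'}
                new_data['name'] = name
                node['data'] = new_data
            for key, value in node.items():
                if key != 'datasets':  # do not descend into stored records
                    visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(spec)
    if datasets:
        spec['datasets'] = datasets
    return spec


values = [{'a': 'A', 'b': 28, 'c': 43}, {'a': 'B', 'b': 55, 'c': 91}]
spec = {
    'hconcat': [
        {'data': {'values': values}, 'mark': 'bar'},
        {'data': {'values': list(values)}, 'mark': 'bar'},
    ]
}
consolidated = consolidate_inline_data(spec)
print(len(consolidated['datasets']))  # 1: both panels share one dataset
```

Because the name is derived from the content, two panels built from equal data collapse to one entry without any bookkeeping about where the data came from. The input spec is left unmodified.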

The only reason I hesitate in making this the default is that I'm worried there may be corner cases I'm not thinking about where this would break a working chart.

@jakevdp
Collaborator Author

jakevdp commented Jun 20, 2018

@kanitw @domoritz I'm curious if there's anything obvious I'm overlooking from the vega-lite side here... Is there any reason that replacing {data: {values: [...]}} with {data: {name: '...'}} might cause a problem that I'm not considering?

@domoritz
Member

domoritz commented Jun 20, 2018

This is great! It might even make sense to make this the default, no?

I don't think there are any issues with using named data sources instead of inline values, but let's check with @jheer. It is possible that Vega will make extra copies, but I don't think that's the case here because the data source we create only sets the values and has no transforms.

@jakevdp
Collaborator Author

jakevdp commented Jun 20, 2018

I think it would make sense as the default, I'm just worried that it might break the chart in situations that I'm not considering.

@domoritz
Member

I would be surprised and would consider that a bug in Vega-Lite since I designed the top level data sources specifically for what you're using it for.

@jakevdp
Collaborator Author

jakevdp commented Jun 27, 2018

I'm going to merge as is, and start an issue discussing whether we should make this behavior the default.
