unstructured grid #105

Closed
mathause opened this issue Oct 14, 2021 · 5 comments · Fixed by #217
Comments

@mathause
Member

mathause commented Oct 14, 2021

This issue is relevant for #65


Internally mesmer uses an unstructured grid. That is, the lat and lon coords are not 2D but lie along a vector. When we start using xarray internally we need to name the non-coordinate dimension of that vector. @leabeusch suggests to use "gp" (for gridpoint). The most likely alternative would be "cell" (see details).

This is what some other models do:

  • ICON: cell
  • MPAS: nCells
  • CAM with unstructured grid(?) (CAMSE?): ncol
  • CLM (used internally only): column and gridcell
  • AWI ocean model: ncells

Example:

import xarray as xr
xr.set_options(display_style="text")

lat = [0.5, 0.5, 1.5, 1.5]
lon = [0.5, 1.5, 0.5, 1.5]
data = [0.5, 0.7, 0.8, 0.2]

ds = xr.Dataset(
    data_vars=dict(data=("gp", data)),
    coords={"lon": ("gp", lon), "lat": ("gp", lat)}
)

and the repr of ds.data would look like

<xarray.DataArray 'data' (gp: 4)>
array([0.5, 0.7, 0.8, 0.2])
Coordinates:
    lon      (gp) float64 0.5 1.5 0.5 1.5
    lat      (gp) float64 0.5 0.5 1.5 1.5
Dimensions without coordinates: gp
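
For readers coming from regular lat/lon grids, here is a minimal sketch of how a 2D field could be flattened into the vector layout described above. It is not how mesmer does it; the names `lat_1d`, `lon_1d`, `field_2d` and `ds_flat` are made up for illustration, and the values are chosen to reproduce the example data.

```python
import numpy as np
import xarray as xr

# Regular 2D grid axes (illustrative values matching the example above).
lat_1d = np.array([0.5, 1.5])
lon_1d = np.array([0.5, 1.5])

# A 2D field on the (lat, lon) grid.
field_2d = np.array([[0.5, 0.7], [0.8, 0.2]])

# Expand the axes to 2D with "ij" indexing so that lat varies along rows
# and lon along columns, then flatten everything in the same (C) order.
lat_2d, lon_2d = np.meshgrid(lat_1d, lon_1d, indexing="ij")

ds_flat = xr.Dataset(
    data_vars={"data": ("gp", field_2d.ravel())},
    coords={
        "lon": ("gp", lon_2d.ravel()),
        "lat": ("gp", lat_2d.ravel()),
    },
)
```

Here lat and lon are ordinary (non-index) coordinates along "gp", so no MultiIndex is involved.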
  • Note: I would not go for a MultiIndex because (i) it brings its own set of problems, (ii) it should no longer be necessary after the index refactor of xarray (which is actually finally underway), and (iii) we probably seldom need to select individual grid points from the

  • Obviously the array has more dimensions, likely "time" and "member" (or "realization").

Dimensions:  (member, time, gp)
Coordinates:
    time (time)
    lon  (gp)
    lat  (gp)
Dimensions without coordinates: member, gp
  • xarray does not support two dimensions with the same name. Therefore we need new names for the geo distance matrix and the correlation matrix. We thought to subscript "gp", "lon", and "lat" with "_i", and "_j":
geodist = xr.Dataset(
    data_vars=dict(
        dist=(("gp_j", "gp_i"), [
            [0.5, 0.7, 0.8, 0.2],
            [0.5, 0.7, 0.8, 0.2],
            [0.5, 0.7, 0.8, 0.2],
            [0.5, 0.7, 0.8, 0.2],
        ])
    ),
    coords={
        "lon_j": ("gp_j", lon),
        "lat_j": ("gp_j", lat),
        "lon_i": ("gp_i", lon),
        "lat_i": ("gp_i", lat)
    }
)

I.e. the dataset would look like:

<xarray.Dataset>
Dimensions:  (gp_i: 4, gp_j: 4)
Coordinates:
    lon_j    (gp_j) float64 0.5 ...
    lat_j    (gp_j) float64 0.5 ...
    lon_i    (gp_i) float64 0.5 ...
    lat_i    (gp_i) float64 0.5 ...
Dimensions without coordinates: gp_i, gp_j
Data variables:
    dist     (gp_j, gp_i) float64 0.5 ...
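
The values in the geodist example above are placeholders. As a rough sketch of how such a matrix could be filled with actual distances under the "gp_i"/"gp_j" naming, the following uses the haversine formula on a sphere. The function name `geodist_matrix` and the radius constant are assumptions made for this illustration, not mesmer's implementation.

```python
import numpy as np
import xarray as xr

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def geodist_matrix(lon, lat):
    """Pairwise great-circle distances (km) between grid points (hypothetical helper)."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    lon_r = np.deg2rad(lon)
    lat_r = np.deg2rad(lat)

    # Broadcast to shape (gp_j, gp_i): rows are j, columns are i.
    dlon = lon_r[:, np.newaxis] - lon_r[np.newaxis, :]
    dlat = lat_r[:, np.newaxis] - lat_r[np.newaxis, :]

    # Haversine formula.
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(lat_r[:, np.newaxis])
        * np.cos(lat_r[np.newaxis, :])
        * np.sin(dlon / 2) ** 2
    )
    dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    return xr.Dataset(
        data_vars={"dist": (("gp_j", "gp_i"), dist)},
        coords={
            "lon_j": ("gp_j", lon),
            "lat_j": ("gp_j", lat),
            "lon_i": ("gp_i", lon),
            "lat_i": ("gp_i", lat),
        },
    )


geodist = geodist_matrix(lon=[0.5, 1.5, 0.5, 1.5], lat=[0.5, 0.5, 1.5, 1.5])
```

Because the two dimensions carry different names, xarray accepts the square matrix, and each axis keeps its own lon/lat coordinates.
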
@leabeusch
Collaborator

This is based on a discussion @mathause & I had last week (obviously very much driven by Mathias' actual knowledge on these things and me learning about them ;)) -> @znicholls maybe it would make sense for you to have a quick look at it before our meeting tomorrow? Especially the part about avoiding MultiIndex.

@leabeusch suggests to use "gp" (for gridpoint). The most likely alternative would be "cell" (see details).

I can live with "cell" too, if there is someone with a clear preference for it. In my head it was just always called "gp". ^^

Obviously the array has more dimensions, likely "time" and "member".

@mathause, I remember we talked about "member" vs "realization" but I cannot remember why we leaned towards "member" in the end? Currently "realization" seems more intuitive to me. But I'm sure I could be convinced otherwise again.

dist (gp_j, gp_i) float64 0.5 ...

Usually, we'd put i before j, no? (I know, extremely relevant point)

@znicholls
Collaborator

znicholls commented Oct 21, 2021

Especially the part about avoiding MultiIndex

Avoiding MultiIndex is totally fine for me. I was literally just hacking anything together which would sort of work, but I am glad we now have an actual xarray expert.

I can live with "cell" too, if there is someone with a clear preference for it. In my head it was just always called "gp"

I'm happy with whatever. I have always preferred longer names (so gridpoint rather than gp) because I find myself being like, "wtf is gp" for too long at the start of doing work and the extra characters are free. Given that, perhaps cell is better because it's shorter but not an abbreviation (but I really have no preference).

Currently "realization" seems more intuitive to me

Realisation (and if you want to use American spelling I will live) makes more sense to me too (given I think of emulations leading to realisations or draws). Member is also totally fine though, given CMIP always talks about members and member_id.

  • xarray does not support two dimensions with the same name. Therefore we need new names for the geo distance matrix and the correlation matrix. We thought to subscript "gp", "lon", and "lat" with "_i", and "_j"

I would make it a more explicit name e.g. "gridpoint_correlation_matrix". I know it starts to get long but I would have no idea what the difference between "gp" and "gp_i" was without stopping and thinking whereas "gridpoint" and "gridpoint_correlation_matrix" are immediately obvious to me (and given we use black the code will never look that horrendous anyway). "gridpoint_crossterms" would also work if we want a more general thing.

@mathause mathause mentioned this issue Oct 27, 2021
@yquilcaille
Collaborator

yquilcaille commented Oct 27, 2021

Note: I would not go for a MultiIndex

I agree, it would bring more problems over the long term than it would solve now.

I can live with "cell" too, if there is someone with a clear preference for it. In my head it was just always called "gp"

I have a slight preference for cell, for the same reasons that @znicholls mentions. "gp" is not very clear, not very user-friendly, and cell is shorter than gridpoint.

@mathause, I remember we talked about "member" vs "realization"

I have a preference for member. We are using the runs from ESMs on scenarios under different ensemble members. It is the term used in the climate community, so it would make more sense from my perspective. Plus, it is shorter than realization :)
One quick note on this point, we should refer to the members (or realizations if you prefer) using their full id and not number, to be sure that we use the same ones. For instance, when using tas and hfds in a training.
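
To illustrate the point about full ids, here is a small sketch (not mesmer code) where the "member" dimension is labelled with made-up CMIP6-style ids, so that two variables can be restricted to the members they share. The helper `make_da`, the dimension names "member" and "cell", and the ids are assumptions for the example; `xr.align` with `join="inner"` is the only xarray feature relied on.

```python
import numpy as np
import xarray as xr


def make_da(name, members, n_time=3, n_cell=4):
    """Build a dummy (member, time, cell) array labelled by member id (hypothetical helper)."""
    return xr.DataArray(
        np.zeros((len(members), n_time, n_cell)),
        dims=("member", "time", "cell"),
        coords={"member": members},
        name=name,
    )


# The two variables were not run for exactly the same members.
tas = make_da("tas", ["r1i1p1f1", "r2i1p1f1", "r3i1p1f1"])
hfds = make_da("hfds", ["r1i1p1f1", "r3i1p1f1", "r4i1p1f1"])

# Keep only the members present in both, matched by id rather than by position.
tas_common, hfds_common = xr.align(tas, hfds, join="inner")
```

With plain integer positions instead of ids, the second entry of tas and hfds would silently be paired even though they are different runs.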

@leabeusch
Collaborator

Just to follow up on this one more time: seems you all do have nice arguments for "cell" over "gp" -> consider me convinced of "cell" too.

Realisation (and if you want to use american spelling I will live)

Funny that this is coming up again already (@mathause & I had a moderately important discussion about American vs British English in the context of a comma a few weeks ago). I think we may actually have to make a decision on the type of English we use sometime soonish. ^^

& more importantly, on the "realization" vs "member" topic: I see @yquilcaille's point as long as it's the actual ESM output. But for the emulations we generate, I find "realization" a lot more descriptive for the individual emulations, as they are realizations of a stochastic process... & the part about the full id, I also see advantages for clear ESM run identification but it feels like a bit of a strange overhead for the emulations? & are these member ids even defined outside of CMIP experiments? On the other hand: it would probably also be very counterintuitive to have different naming conventions for the ESM simulations & the emulations.

@mathause
Member Author

mathause commented Nov 3, 2021

Yes, I am not always consistent in my choice of dialect.

I did just go check the CMIP6 definitions (see note 8). The whole thing is the variant_label = "r1i1p1f1" and

For a given experiment, the realization_index, initialization_index, physics_index, and forcing_index are used to uniquely identify each simulation of an ensemble of runs contributed by a single model.

but we don't have to adopt this nomenclature. E.g. we generally want to pool everything that is below the level of "model" (but see also #113).
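
For reference, a minimal sketch of splitting a variant_label such as "r1i1p1f1" into the four indices named in the quote above. The function name `parse_variant_label` and the returned dict layout are assumptions for illustration only; nothing here is part of mesmer.

```python
import re

# r<realization>i<initialization>p<physics>f<forcing>, each a positive integer
_VARIANT_LABEL_RE = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_variant_label(label):
    """Split a CMIP6 variant_label into its four indices (hypothetical helper)."""
    match = _VARIANT_LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a valid variant_label: {label!r}")
    realization, initialization, physics, forcing = (int(g) for g in match.groups())
    return {
        "realization_index": realization,
        "initialization_index": initialization,
        "physics_index": physics,
        "forcing_index": forcing,
    }


indices = parse_variant_label("r1i1p1f1")
```

Pooling everything below the level of "model" would then amount to ignoring these indices when grouping runs.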
