
CRS for load_collection #102

Closed
wants to merge 2 commits into from

Conversation

m-mohr
Member

@m-mohr m-mohr commented Dec 11, 2019

Here's a proposal for #98 to base discussions upon. Please comment and/or up/down-vote.

@m-mohr m-mohr added the help wanted Extra attention is needed label Dec 11, 2019
@m-mohr m-mohr added this to the v1.0 milestone Dec 11, 2019
@jdries
Contributor

jdries commented Dec 11, 2019

Not a big fan of this proposal; I prefer that users and process graphs do not enforce a CRS unless the use case strictly demands it.
There is also not really a concept of a 'multi-crs' collection, in the sense that it is not exposed in the metadata. So users might be tempted to start defining CRSs everywhere 'just in case', at the expense of performance and cost.

@m-mohr
Member Author

m-mohr commented Dec 11, 2019

There is also not really a concept of a 'multi-crs' collection, in the sense that it is not exposed in the metadata.

That's not correct, the API supports it since version 0.4.

@lforesta
Contributor

I agree with @jdries that performance may be badly affected. Many users will be tempted to specify the CRS without a proper understanding of the implications (whereas if they explicitly use the resample_spatial process at any point in the process graph, they are conscious of their choice).

If we go for it, the CRS should be null by default (as in the current PR). We should also consider that EO includes lots of data still in sensor geometry (e.g. S1 and S3 L1 data), and simply specifying a CRS at load_collection will not suffice in that case (relatively few users may need to work with this level of data, but many scientists do research on the L1-to-L2 processing step).

@m-mohr
Member Author

m-mohr commented Dec 16, 2019

I agree with @jdries that performance may be badly affected. Many users will be tempted to specify the CRS without a proper understanding of the implications (whereas if they explicitly use the resample_spatial process at any point in the process graph, they are conscious of their choice).

We can simply add a note to the description of the parameter so a user is aware of the issue.

@m-mohr m-mohr mentioned this pull request Dec 17, 2019
@edzer
Member

edzer commented Dec 20, 2019

@jdries and I had a long discussion on this issue in the train on Wednesday, and may have come to a possible resolution. The problem is that load_collection() shall return a data cube, that a data cube with x and y dimensions shall have an unambiguous mapping from index to coordinates, and that this is not the case when tiles in a collection span several UTM zones. Pragmatic solutions to this (picking the CRS at the center of the extent, or picking the single CRS when the extent only spans one) may sometimes work, but not always. Of course, for downloading a GeoTIFF or viewing a WMTS a CRS needs to be set, but the use case where someone wants to compute (and save) NDVI (or something much more expensive) in the native pixel/tile space spanning several CRSs is valid, and should not be cut off by requiring a single CRS to be set (and worked on) here.

The way out I propose here is to make CRS a dimension.

This way, x and y map into the same coordinates for different CRSs, and the CRS slice you're in determines how they map to world coordinates. You can then do everything on spectral and time dimensions, and simply loop over x, y and CRS with no need for resampling.

Certain spatial things that only work on the local (relative) context can also be done, think of a low-pass filter. Other spatial things, like aggregate_spatial with a set of polygons, need an unfolding of the CRS dimension into a mosaic with a single CRS before they make sense, or be done in an apply over the CRS dimension (in which case polygons crossing UTM zones will not result in a single result).

A similar idea of having CRS as a dimension would be having tile ID as a dimension: x and y indexes are the row/cols and then map into coordinates relative to the tile origin, with similar consequences.

@mkadunc what do you think?
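The crs-as-a-dimension idea can be illustrated outside openEO with a small sketch. This is not openEO client code; the array layout, CRS labels, and affine parameters below are hypothetical, assuming a cube whose first dimension enumerates two UTM zones:

```python
import numpy as np

# Hypothetical cube with dimensions (crs, time, y, x): both UTM zones share
# the same (row, col) index space; each CRS slice carries its own mapping
# from index to world coordinates.
crs_labels = ["EPSG:32631", "EPSG:32632"]
cube = np.empty((2, 3, 4, 4))  # (crs, t, y, x)
cube[0] = 0.5                  # values observed in UTM zone 31N
cube[1] = 0.7                  # values observed in UTM zone 32N

# Per-CRS affine parameters: (x origin, y origin, pixel size in metres)
transforms = {
    "EPSG:32631": (600000.0, 5700000.0, 10.0),
    "EPSG:32632": (300000.0, 5700000.0, 10.0),
}

def index_to_world(crs, row, col):
    """Map a (row, col) index to world coordinates within one CRS slice."""
    x0, y0, res = transforms[crs]
    return (x0 + col * res, y0 - row * res)

# Operations on non-spatial dimensions need no resampling at all; they just
# loop over the crs slices, e.g. a temporal mean:
temporal_mean = cube.mean(axis=1)  # shape (crs, y, x) -> (2, 4, 4)
```

Only operations that must combine cells across CRS slices (such as aggregate_spatial with polygons crossing a zone boundary) would require unfolding the crs dimension into a single-CRS mosaic first.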

@m-mohr
Member Author

m-mohr commented Dec 20, 2019

Hm, interesting idea. We need to think about how that would work with our processes. What would we need to change in resample_spatial / resample_spatial_cube, and how would, for example, a reduce over the crs dimension work?

@edzer
Member

edzer commented Dec 20, 2019

I'm planning to write a bit of a longer piece about this, looking into the topics you mention.

@mkadunc
Member

mkadunc commented Dec 20, 2019

Having CRS as a dimension makes a lot of sense, especially if it is treated as one of the spatial dimensions.

One concern I have is that adding a crs dimension might unnecessarily burden the user with this extra "singleton" dimension - just the fact that it's there is not a problem, but it can become one if you need to take it into account at almost every node of the process graph, even when the operations are not spatial. Maybe we should do some experiments with extra singleton dimensions and see how much they impact working with data cubes - if there's no impact, we're OK; if it's significant, we could find ways to implicitly take such extra dimensions into account without burdening the user.


Side note: There is another (a bit more involved) option, where CRS would not necessarily be a dimension - if we extended the concept of axis labels into a more general concept of "cell-attributes", then we could just require that every spatial cell has a non-null crs attribute: the value of this could come from:

  • the root of the cube (when the whole cube is a single CRS)
  • the separate CRS axis (when there are different CRSs and not aligned with other axes)
  • some other axis (when CRS can be derived from some other axis, as in the tileId-axis example)

In NetCDF, one would solve this with an extra variable (whose value is defined just once as a scalar or as N-D array along a subset of axes), but since we decided to only hold a single value in a cube, we can't use the same mechanism.

The cell-attributes concept has additional uses, e.g.:

  • absoluteOrbitId can be used as a dimension, and its labels can provide the well-known time attribute so it can function as a temporal axis
  • productId as a dimension could, in addition to time and crs, provide per-scene info about the platform (S2A / S2B), processing version and other scene-related metadata (sun angles, viewing angles etc.)

These are maybe all things that could be handled with extra dimensions, but I think it is more intuitive to think about dimensions, especially in the data-cube world, as "the minimal independent set of cell attributes that unambiguously address a single value" (in the relational world this would be called a "candidate key"). Adding extra dimensions just to provide some metadata to the cells feels wrong.
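The cell-attribute idea can be sketched in a few lines. This is purely an illustrative Python sketch, not an openEO API; the productId labels and attribute names are invented:

```python
# Hypothetical productId axis whose labels carry derived cell attributes
# (time, crs, platform), so crs need not be an independent dimension.
product_axis = {
    "S2A_31TCJ_20191211": {"time": "2019-12-11", "crs": "EPSG:32631", "platform": "S2A"},
    "S2B_32TLQ_20191212": {"time": "2019-12-12", "crs": "EPSG:32632", "platform": "S2B"},
}

def cell_attribute(label, name):
    """Resolve a cell attribute from the axis label it is attached to."""
    return product_axis[label][name]

# A single productId dimension then simultaneously provides a temporal axis
# and an unambiguous CRS per cell:
crs_of_first = cell_attribute("S2A_31TCJ_20191211", "crs")
```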

@m-mohr
Member Author

m-mohr commented Dec 20, 2019

We (@flahn and I) went a bit crazy discussing this proposal and have some more thoughts:

  1. The proposal with the crs parameter in load_collection is the simplest from the user perspective. This would be very similar to what GEE offers. There are obviously quite a few users who want this simplicity.
  2. The proposal with crs as a dimension is not easy to understand, I think. A GEE user will probably not want to think about how to reduce a crs dimension and will just stick with the simpler GEE approach. Scientists who strictly want to work on the original data/CRS will probably like this approach.

So could we combine these two approaches? Could we say that a user only gets a crs dimension if the collection has multiple CRSs and the crs parameter in load_collection is not set?
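The decision rule behind such a combination could look roughly like the following. This is a hypothetical sketch; the function and parameter names are invented and not part of the openEO specification:

```python
# Hypothetical rule: a crs dimension appears only when the requested data
# spans multiple CRSs AND the user did not pass a target crs to
# load_collection.
def needs_crs_dimension(collection_crss, user_crs=None):
    """Return True if the resulting data cube should get a crs dimension."""
    return user_crs is None and len(set(collection_crss)) > 1

needs_crs_dimension(["EPSG:32631", "EPSG:32632"])        # multi-CRS, no target: True
needs_crs_dimension(["EPSG:32631", "EPSG:32632"], 3857)  # target CRS given: False
needs_crs_dimension(["EPSG:32631"])                      # single CRS anyway: False
```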

The question really is: Who is the target audience? Whom do we cater to? Which audience would lead to the most uptake?


Another thought was to allow something like this for advanced users who want to work in the original CRS (in JS-like pseudo-code):

var epsgCodesArray = get_metadata(field = 'eo:epsg') // Get all EPSG codes from the metadata
var dataCubesArray = epsgCodesArray.map(epsgCode => {
  var dataCube = load_collection( // Load all data that is stored with this EPSG code
    id = "Sentinel-2",
    temporal_extent = null,
    spatial_extent = null,
    properties = eq('eo:epsg', epsgCode)
  )
  // further process the data cube here
  return dataCube
})
var mergedDataCube = dataCubesArray.reduce(process = resample_spatial_cube, binary = true)
save_result(mergedDataCube)

That's mostly possible already. It basically returns one data cube per CRS defined for the collection, each containing only the data available for that specific CRS. Afterwards, the cubes can be processed individually and merged using a binary reducer if desired.

@@ -268,13 +303,19 @@
}
}
}
}
},
"crs": 3857
Member

I wouldn't put the crs parameter in the example, as it would suggest that this is something that should always be used - users should rather be encouraged to leave the CRS choice to the backend.

"exceptions": {
"TargetCrsMissing": {
"message": "The data requested from the collection has multiple CRS assigned and thus a valid target CRS must be specified."
}
Member

Even when the data has multiple CRSs assigned, the backend can probably do a better job of selecting the most appropriate option than the user. Forcing the backend to fail when a CRS is not provided is IMO counter-productive.

@m-mohr m-mohr added the waiting label Jan 6, 2020
@m-mohr
Member Author

m-mohr commented Jan 13, 2020

Closing, we are going the route described in #98.

@m-mohr m-mohr closed this Jan 13, 2020
@m-mohr m-mohr deleted the issue-98 branch January 20, 2020 15:09