CRS for load_collection #102
Not a big fan of this proposal. I prefer that users and process graphs do not enforce a CRS unless the use case strictly demands it.
That's not correct, the API has supported it since version 0.4.
I agree with @jdries that performance may be badly affected. Many users will be tempted to specify the CRS without a proper understanding of the implications (whereas if they explicitly use the resample_spatial process at any point in the process graph, they are conscious of their choice). If we go for it, the CRS should be null by default (as in the current PR), because we should also consider that EO includes lots of data still in sensor geometry (e.g. S1 and S3 L1 data), and simply specifying a CRS at load_collection will not suffice in this case (maybe relatively few users will need to work with this level of data, but many scientists do research on the L1 to L2 processing step).
We can simply add a note to the description of the parameter so that users are aware of the issue.
@jdries and I had a long discussion on this issue in the train on Wednesday, and may have come to a possible resolution. The problem is that

The way out I propose here is to make CRS a dimension. This way, x and y map into the same coordinates for different CRS, and the CRS slice you're in determines how they map to world coordinates. This way, you can do everything on spectral and time dimensions, and simply loop over x, y and CRS with no need for resampling. Certain spatial things that only work on the local (relative) context can also be done, think of a low-pass filter. Other spatial things, like aggregate_spatial with a set of polygons, need an unfolding of the CRS dimension into a mosaic with a single CRS before they make sense, or be done in an apply over the CRS dimension (in which case polygons crossing UTM zones will not result in a single result).

A similar idea of having CRS as a dimension would be having tile ID as a dimension: x and y indexes are the row/cols and then map into coordinates relative to the tile origin, with similar consequences.

@mkadunc what do you think?
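To make the "CRS as a dimension" idea a bit more concrete, here is a minimal Python sketch. It is not an openEO implementation: the toy cube layout (one 2D grid per EPSG code), the EPSG codes, and the helper names `low_pass` and `apply_per_crs` are all assumptions made up for illustration. It only shows that a local operation such as a low-pass filter can run on each CRS slice independently, with no resampling.

```python
from statistics import mean

# Hypothetical toy cube: one 2D grid per CRS slice, keyed by EPSG code.
# Row/column indices are shared; the EPSG key decides how they map to world coordinates.
cube = {
    32631: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    32632: [[9.0, 9.0, 9.0], [9.0, 0.0, 9.0], [9.0, 9.0, 9.0]],
}


def low_pass(grid):
    """3x3 mean filter; edge cells average only the neighbours that exist."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(mean(window))
        out.append(out_row)
    return out


def apply_per_crs(cube, func):
    """Run a local operation on each CRS slice independently, without resampling."""
    return {epsg: func(grid) for epsg, grid in cube.items()}


smoothed = apply_per_crs(cube, low_pass)
```

Anything that needs absolute world coordinates across slices (for example polygons crossing UTM zones) would still require unfolding the CRS dimension first, which this sketch does not attempt.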
Hm, interesting idea. We need to think about how that would work with our processes. What do we need to change in resample_spatial / resample_spatial_cube and how would for example a reduce on the crs dimension work?
I'm planning to write a bit of a longer piece about this, looking into the topics you mention.
Having CRS as a dimension makes a lot of sense, especially if it is treated as one of the spatial dimensions. One concern I have is that adding CRS dimension might unnecessarily burden the user with this extra "singleton" dimension - just the fact that it's there is not a problem, but it can become one if you need to take it into account on almost each node of the process graph, even when the operations are not spatial. Maybe we should do some experiments with extra singleton dimensions and see how much they impact working with data cubes - if there's no impact we're OK, but if it's significant, we could find ways to implicitly take such extra dimensions into account without burdening the user. Side note: There is another (a bit more involved) option, where CRS would not necessarily be a dimension - if we extended the concept of axis labels into a more general concept of "cell-attributes", then we could just require that every spatial cell has a non-null
In NetCDF, one would solve this with an extra The cell-attributes concept has additional uses, e.g.:
These are maybe all things that could be handled with extra dimensions, but I think that it is more intuitive to think about dimensions, especially in the data cubes world, as "the minimal independent set of cell attributes that unambiguously address a single value" (in the relational world these would be called a "candidate key"). Adding extra dimensions just to provide some metadata to the cells feels wrong.
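The "candidate key" view of dimensions can be illustrated with a small Python sketch. The cell records, attribute names and helper functions below are hypothetical and only serve the illustration: x, y and t together address every cell unambiguously and none of them can be dropped, while crs adds nothing to the addressing and therefore behaves like a cell attribute rather than a dimension.

```python
# Hypothetical cells of a tiny cube. "x", "y", "t" are the dimensions;
# "crs" is carried along as a cell attribute rather than a dimension.
cells = [
    {"x": 0, "y": 0, "t": "2020-01-01", "crs": 32631, "value": 0.12},
    {"x": 1, "y": 0, "t": "2020-01-01", "crs": 32631, "value": 0.15},
    {"x": 0, "y": 1, "t": "2020-01-01", "crs": 32631, "value": 0.14},
    {"x": 0, "y": 0, "t": "2020-01-02", "crs": 32631, "value": 0.11},
]


def is_candidate_key(cells, names):
    """True if the named attributes address every cell unambiguously."""
    keys = [tuple(cell[n] for n in names) for cell in cells]
    return len(set(keys)) == len(keys)


def is_minimal_key(cells, names):
    """True if the key is unambiguous and no attribute can be dropped from it."""
    if not is_candidate_key(cells, names):
        return False
    return all(
        not is_candidate_key(cells, [n for n in names if n != dropped])
        for dropped in names
    )


print(is_minimal_key(cells, ["x", "y", "t"]))         # dimensions only
print(is_minimal_key(cells, ["x", "y", "t", "crs"]))  # crs is redundant for addressing
```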
We (@flahn and me) went a bit crazy discussing this proposal and have some more thoughts:
So could we combine these two approaches? Could we say that if crs is not required and only if the collection has multiple CRS and the crs parameter in load_collection is not set, then a user gets a crs dimension? The question really is: Who is the target audience? Who do we cater for? Which audience would lead to the most uptake? etc.

Another thought was to allow something like this for advanced users who want to work on the original CRS (in JS-like pseudocode):

    // Get all EPSG codes from the collection metadata
    var epsgCodesArray = get_metadata(field = 'eo:epsg')
    // Build one data cube per EPSG code
    var dataCubesArray = epsgCodesArray.apply(epsgCode => {
        // Load all data that is stored with this EPSG code
        var dataCube = load_collection(
            id = "Sentinel-2",
            temporal_extent = null,
            spatial_extent = null,
            properties = eq('eo:epsg', epsgCode)
        )
        // further process data cubes
        return dataCube
    })
    // Merge the per-CRS data cubes pairwise into a single data cube
    var mergedDataCube = dataCubesArray.reduce(process = resample_spatial_cube, binary = true)
    save_result(mergedDataCube)

That's mostly possible already. It basically returns one data cube per CRS defined for the collection, each containing only the data that is available for that specific CRS. Afterwards, process them individually and merge them using a binary reducer if wanted.
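The combined rule asked about above (a crs dimension appears only if the collection has multiple CRS and no crs parameter is given) could be sketched roughly as follows in Python. The function name `cube_dimensions` and the base dimension names are assumptions for illustration; nothing here is part of the openEO API.

```python
def cube_dimensions(collection_crs, crs=None):
    """Return the dimension names of the loaded cube under the combined rule.

    collection_crs: EPSG codes the collection's data is stored in.
    crs: the optional target CRS passed to load_collection (None = not set).
    """
    base = ["x", "y", "t", "bands"]  # hypothetical base dimensions
    distinct = set(collection_crs)
    if crs is None and len(distinct) > 1:
        # Multiple native CRS and no target given: expose them as a dimension.
        return base + ["crs"]
    # Either a single native CRS, or the user asked for one target CRS.
    return base


print(cube_dimensions([32631, 32632]))             # includes "crs"
print(cube_dimensions([32631, 32632], crs=3857))   # no "crs" dimension
print(cube_dimensions([32631]))                    # no "crs" dimension
```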
@@ -268,13 +303,19 @@
}
}
}
}
},
"crs": 3857
I wouldn't put the crs parameter in the example, as it would suggest that this is something that should always be used. Users should rather be encouraged to leave the CRS choice to the backend.
"exceptions": {
"TargetCrsMissing": {
"message": "The data requested from the collection has multiple CRS assigned and thus a valid target CRS must be specified."
}
Even when the data has multiple CRSs assigned, the backend can probably do a better job of selecting the more appropriate option than the user. Forcing the backend to fail when CRS is not provided is IMO counter-productive.
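As a rough sketch of the two behaviours being contrasted here (fail with TargetCrsMissing versus let the backend pick), consider the Python below. The selection heuristic (take the CRS that holds the most pixels, ties going to the lowest EPSG code) is purely an assumption for illustration; neither the PR nor this comment says how a backend would actually choose. The exception class only mirrors the name used in the diff.

```python
class TargetCrsMissing(Exception):
    """Illustrative stand-in for the exception named in the diff under review."""


def choose_target_crs(pixel_counts, requested=None, strict=False):
    """Pick a target CRS for data stored in several CRS.

    pixel_counts: hypothetical mapping of EPSG code -> number of pixels stored in it.
    requested: CRS explicitly given by the user, if any.
    strict: if True, mimic the proposed behaviour and fail when no CRS is given.
    """
    if not pixel_counts:
        raise ValueError("no data to choose a CRS for")
    if requested is not None:
        return requested
    if len(pixel_counts) == 1:
        return next(iter(pixel_counts))
    if strict:
        raise TargetCrsMissing(
            "The data requested from the collection has multiple CRS assigned "
            "and thus a valid target CRS must be specified."
        )
    # Assumed heuristic: the CRS holding the most pixels; ties go to the lowest code,
    # because max() keeps the first maximum in the sorted sequence.
    return max(sorted(pixel_counts), key=pixel_counts.get)
```

With `strict=False` the call always returns a CRS, so a process graph that omits crs still runs; with `strict=True` the same input raises the error.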
Closing, we are going the route described in #98.
Here's a proposal for #98 to base discussions upon. Please comment and/or up/down-vote.