-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add percentage valid points in get_stats()
#644
Conversation
2b4958b
to
5368572
Compare
Nice addition, and good catch on the redundancy of the RMSE in this case! 😄 Two small remarks:
Last thought: I didn't see any function to calculate the valid points in the changes, maybe it's not there yet! In that case I would recommend |
Good addition!
Regarding @rhugonnet's comment:
Not so easy to figure out which ones are the most useful between total count, valid count and fraction of valid pixels. It does not make so much sense to give all 3 as one can derive the 3rd from the 2 others... But I think the percentage of valid pixel is more useful than a count (which is generally gonna be large and not very convenient). Also, total count can be easily derived from the data shape... So I would go for either just the percentage of valid pixels, or percentage + total valid count. |
dc529a5
to
4b35217
Compare
@rhugonnet the function to calculate the valid point was already there, but I used @adehecq the stats are applied on the unmasked data : geoutils/geoutils/raster/raster.py Line 1891 in 514ea82
Following your feedback I have changed percentile to percentage , and I added the valid points stat 😃
|
get_stats()
get_stats()
8216872
to
30c6300
Compare
|
Perfect! 🙂 I have one final comment on those two lines of code: valid_points = np.count_nonzero(np.isfinite(compressed_data))
valid_points_all_data = np.count_nonzero(np.isfinite(data)) It's a bit delicate to explain, a lot of NumPy masked array technicalities 😅. Basically we want to ignore everything under the For the first line: The For the second line: The mask is not accounted for during So in short, we should replace by: valid_points = np.count_nonzero(~data.mask)
total_points = np.prod(data.shape) And "Valid points all data" by "Total count"/"Total points". See the correspondance with the example file of GeoUtils that has invalid values (but without NaNs under the mask like the slope):
and
|
Actually, this also reminds me that we should use So replacing |
Great work and remarks @rhugonnet @adebardo @vschaffn I would only say that the naming are not that explicit as you asked @adebardo :) :
Find something the most explicit with what does the code and really clear for the user. My proposition (not perfect...): Hope I understand well and that it helps. Anyway, be explicit in documentation and naming with what is done. |
I agree with @duboise-cnes. |
@rhugonnet @dubois Thank you for your feedback and for the clarification on how rasters handle masking. I have updated the code by replacing all instances of Additionally, I incorporated your suggestions regarding the final statistics ( For example, in areas like slopes, there is often a 1-pixel |
OK, perfect! I'm not sure I fully understand the point about the difference between
The Raster geoutils/geoutils/raster/raster.py Line 1687 in 5aab570
So it is impossible to create or derive a |
@rhugonnet Perhaps I misspoke, when I said no mask, I meant no additional mask (on top of the ‘classic’ mask for a raster which hides nan/inf values), for classification for example. Here is an example: |
OK I see now! Thanks a lot for the clarification 😄 So I think the underlying issue here is that we need to clearly differentiate nodata masking ( Sorry, I didn't realize this limitation when we framed the design (that you'd be using To solve this specific issue here, I think we could simply add an So instead of doing: rst.set_mask(mask1)
stats1 = rst.get_stats()
rst.set_mask(mask2)
stats2 = rst.get_stats() We'd do: all_stats = rst.get_stats(inlier_masks=[mask1, mask2]) What do you think? 😛 |
rst.set_mask(mask1)
stats1 = rst.get_stats() As we plan to develop a high-level classification in xDEM (internal tickets, comming on github soon), we will think about ways to recover statistics from each class easily in classification classes 😄 |
OK. Just to clarify: My proposition was not to add multiple classification classes (this is a bonus), but to solve the issue of ambiguity of the masked values raised above. It is impossible to define the |
87dfa4b
to
bc7385d
Compare
@rhugonnet ok I think I understood what you mean. I pushed some modifications, where I added the possibility to pass a mask in |
geoutils/raster/raster.py
Outdated
:returns: The requested statistic or a dictionary of statistics if multiple or all are requested. | ||
""" | ||
if not self.is_loaded: | ||
self.load() | ||
stats_dict = self._statistics(band=band) | ||
if mask is not None: | ||
total_count = np.count_nonzero(~self.get_mask()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this line, instead of ~self.get_mask()
, it should be mask
?
All the values we sample from (so the total_count
) is based on the mask
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the total count, we want to retrieve all the finite values (i.e. all the False values in the outlier mask). As mask
represents a classification mask for example (inlier mask), it's self.get_mask()
that we need here 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it looks like we have a different definition in mind for total_count
and/or valid_count
😅
Here's how I usually define those, so that we can identify the discrepancy between our definitions 😉:
- The total count represents all the values (valid and invalid) that are sampled from the raster to compute the statistic. When a classif mask is passed, the sampled raster values are limited to the inlier mask (equivalent to running all statistics on the smaller array
Raster[inlier_mask]
). So thetotal_count
isnp.count_nonzero(inlier_mask)
. With no inlier mask, on the entire raster, thetotal_count
is always all valuesnp.prod(self.shape)
. - The valid count represents the valid values among all those sampled to compute the statistic. When a classif mask is passed, these values are limited to the valid values among raster values subsampled by the inlier mask, so
np.count_nonzero(~Raster.mask[inlier_mask])
(which in this case is equivalent the way you compute it with justnp.isfinite(~self.mask)
, because you've usedset_mask()
before). With no inlier mask, on the entire raster, it's always all the valid valuesnp.count_nonzero(~Raster.mask)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To better show the link between the two with these definition:
Without a classif mask: total_count
is all values of the masked-array, valid_count
is all non-masked values of the masked-array.
With a classif mask: Exactly the same calculations as above, but on the masked-array subsampled on the inlier mask Raster.data[inlier_mask]
, instead of the full masked-array Raster.data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! All good for me after the small changes mentioned above 🙂
bc7385d
to
8192338
Compare
…f no inlier mask is passed
ff281d5
to
8f78710
Compare
@rhugonnet as discussed I disabled |
More specifically, I was thinking that we should add a And then we always keep |
@rhugonnet : I redefined the follwing stats:
If an inlier mask is passed:
|
c3045db
to
5531ef6
Compare
@rhugonnet :
|
5531ef6
to
27aa9bc
Compare
27aa9bc
to
8aebf0b
Compare
Thanks for the work and convergence ! @rhugonnet @vschaffn and @adebardo ! |
Resolves GlacioHack/xdem#679.
Resolves GlacioHack/xdem#685
Description
valid points
andpercentage valid points
statistic calculation, as the number of finite values divided by the number of values in the unmasked Raster.valid points
andpercentage valid points
in thedict
returned byget_stats
, add aliases.valid points
andpercentage valid points
intest_stats
method intest_raster
.valid points no mask
andpercentage valid points no mask
to compute the same stats as above but without maskingvalues.
LE90
instats.py
and inRaster.get_stats
, and test it intest_stats.py
.