
Add percentage valid points in get_stats() #644

Merged
14 commits merged into GlacioHack:main on Feb 26, 2025

Conversation

vschaffn
Contributor

@vschaffn vschaffn commented Jan 24, 2025

Resolves GlacioHack/xdem#679.
Resolves GlacioHack/xdem#685.

Description

  • Add "valid points" and "percentage valid points" statistics, computed as the number of finite values divided by the number of values in the unmasked Raster.
  • Add "valid points" and "percentage valid points" to the dict returned by get_stats, with aliases.
  • Add "valid points" and "percentage valid points" to the test_stats method in test_raster.
  • Add "valid points no mask" and "percentage valid points no mask" to compute the same statistics as above but without masking values.
  • Add LE90 in stats.py and in Raster.get_stats, and test it in test_stats.py.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 2b4958b to 5368572 on January 24, 2025 10:18
@rhugonnet
Member

Nice addition, and good catch on the redundancy of the RMSE in this case! 😄

Two small remarks:

Last thought: I didn't see any function to calculate the valid points in the changes, maybe it's not there yet! In that case I would recommend np.count_nonzero applied to np.isfinite, instead of ~np.isnan, as the latter considers only NaNs but not +/- infinity that are often unusable in stats (and I've had some misadventures with those being propagated in raster data before!).
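A minimal NumPy sketch of the difference between the two checks (the array values are illustrative):

```python
import numpy as np

# Hypothetical 1D sample containing both a NaN and an infinity.
arr = np.array([1.0, np.nan, np.inf, 4.0])

# ~np.isnan misses +/- inf: the inf is still counted as "valid".
count_not_nan = np.count_nonzero(~np.isnan(arr))   # counts 3 values

# np.isfinite excludes NaN and +/- inf alike.
count_finite = np.count_nonzero(np.isfinite(arr))  # counts 2 values
```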

@adehecq
Member

adehecq commented Jan 29, 2025

Good addition!
Two other thoughts:

  • I believe your implementation does not exclude masked values. Remember that self.data is a masked array, so invalid pixels are masked instead of being set to NaN. One quick way to get the total number of unmasked values is through self.data.compressed().size. Maybe it would be good to add 1-2 tests. For example, the "exploradores_aster_dem" example has data gaps.
  • The name of your variable is incorrect. It should not be "percentile" but "percentage".
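A quick sketch of the compressed()-based count on a toy masked array (values are illustrative, standing in for self.data):

```python
import numpy as np

# Hypothetical masked array standing in for Raster.data: two pixels masked.
data = np.ma.masked_array(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    mask=np.array([[False, True], [True, False]]),
)

# .compressed() drops masked pixels, so its size is the unmasked count.
n_unmasked = data.compressed().size  # 2
n_total = data.size                  # 4
```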

Regarding @rhugonnet's comment:

It might be useful to some users to know the total point count (NaNs included)? If we report a "total count" and "valid count", then users can derive the percent of valid points themselves by dividing the two.

Not so easy to figure out which ones are the most useful between total count, valid count and fraction of valid pixels. It does not make so much sense to give all 3 as one can derive the 3rd from the 2 others... But I think the percentage of valid pixel is more useful than a count (which is generally gonna be large and not very convenient). Also, total count can be easily derived from the data shape... So I would go for either just the percentage of valid pixels, or percentage + total valid count.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from dc529a5 to 4b35217 on February 4, 2025 10:48
@vschaffn
Contributor Author

vschaffn commented Feb 4, 2025

@rhugonnet the function to calculate the valid points was already there, but it used ~np.isnan; I have changed it to np.isfinite following your feedback.

@adehecq the stats are applied on the unmasked data:

data = data.compressed()

Following your feedback I have changed percentile to percentage, and I added the valid points stat 😃

@vschaffn vschaffn changed the title Add percentile valid points in get_stats() Add percentage valid points in get_stats() Feb 4, 2025
@vschaffn vschaffn force-pushed the 679-valid_points_stat branch 4 times, most recently from 8216872 to 30c6300 on February 11, 2025 13:40
@adebardo

[screenshot]
This PR is ready for me; I have no inspiration to improve the naming. @rhugonnet , @adehecq , @duboise-cnes , do you have any ideas?
Feel free to merge this PR during the week to facilitate progress.

@rhugonnet
Member

rhugonnet commented Feb 14, 2025

Perfect! 🙂

I have one final comment on those two lines of code:

        valid_points = np.count_nonzero(np.isfinite(compressed_data))
        valid_points_all_data = np.count_nonzero(np.isfinite(data))

It's a bit delicate to explain, a lot of NumPy masked array technicalities 😅. Basically we want to ignore everything under the data.mask, but the current functions don't.

For the first line: The compressed_data can never contain any non-finite data (in Raster we enforce it in @data.setter in case the user overwrote values with NaNs/inf, but even in typical behaviour of masked arrays it's rare), so valid_points could simply be len(compressed_data). But actually we don't even need to compress the data (which flattens it into a separate 1D array, and creates more memory usage), we can simply count the boolean mask data.mask.

For the second line: The mask is not accounted for during np.isfinite(data), so it's counting all pixels under the mask too, which we don't want. Most masked values actually have a real .data.data value, often equal to the .nodata value like -9999 for instance, so it'll often be the total count. But not always. This is the case in the "slope" example you show, some NaNs/inf values were likely created by division by zero during the operation, so there are some NaNs/inf under the mask. But, as these values are masked, they are ignored by all operations, so it doesn't matter (they will be converted back to the nodata value, like -9999, when the raster is written to disk).

So in short, we should replace by:

        valid_points = np.count_nonzero(~data.mask)
        total_points = np.prod(data.shape)

And replace "Valid points all data" with "Total count"/"Total points".

See the correspondence with the example file of GeoUtils that has invalid values (but without NaNs under the mask, unlike the slope):

aster_dem_path = examples.get_path("exploradores_aster_dem")
rst = gu.Raster(aster_dem_path)
np.count_nonzero(np.isfinite(rst.data))
Out[7]: 333102
np.prod(np.shape(rst.data))
Out[11]: 333102

and

np.count_nonzero(np.isfinite(rst.data.compressed()))
Out[8]: 324194
np.count_nonzero(~rst.data.mask)
Out[10]: 324194

@rhugonnet
Member

Actually, this also reminds me that we should use np.ma functions everywhere for computational performance: they can be more efficient than doing np.func(data.compressed()).

So replace np.nanmean(compressed_data) (note that the nanmean was not necessary!) with np.ma.mean(data) directly, and the same for all other functions 😉.
And normally our nmad should also support this.
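A small sketch of the suggested swap on a toy masked array (illustrative values; np.ma has no percentile function, so that statistic still needs the compressed array):

```python
import numpy as np

data = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])

# Equivalent results; np.ma.mean skips masked values without
# materializing a compressed 1D copy of the data.
mean_compressed = np.nanmean(data.compressed())
mean_ma = np.ma.mean(data)

# No np.ma.percentile exists, so the compressed array is still needed here.
p90 = np.percentile(data.compressed(), 90)
```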

@duboise-cnes
Member

Great work and remarks @rhugonnet @adebardo @vschaffn
I have reviewed it, see above.

I would only say that the namings are not that explicit, as you asked @adebardo :) :

  • At minimum, explain it precisely in the documentation (NaN, mask, ...) to have at least an RTFM possibility for users ;)
  • And for naming, "valid" is fuzzy: what is valid? Can you describe it to find a naming?
  • Also, "all data": what is the meaning? If all data, what is valid?

Find something as explicit as possible about what the code does, and really clear for the user.

My proposition (not perfect...):
"Number of non-masked (or unmasked) valid points" and "Percentage of non-masked points"
"Number of valid points" and "Percentage of valid points"
and explain in the documentation that valid means all finite values, excluding np.nan, np.inf and -np.inf (if correct?)
(if compressed is https://numpy.org/doc/stable/reference/generated/numpy.ma.MaskedArray.compressed.html it is explicit ;) )

Hope I understood well and that it helps. Anyway, be explicit in documentation and naming about what is done.

@rhugonnet
Member

I agree with @duboise-cnes.
My favourite namings would be "Valid count" and "Total count" (see my first comment on top of the thread on the naming "count" in NumPy and SciPy). And in the description/documentation, we can explain that "valid" means non-masked values, which are always finite in a Raster.

@vschaffn
Contributor Author

@rhugonnet @dubois Thank you for your feedback and for the clarification on how rasters handle masking.

I have updated the code by replacing all instances of np.nanstat with np.ma.stat, except for the 90th percentile, as np.ma.percentile does not exist.

Additionally, I incorporated your suggestions regarding the final statistics (valid count, total count, and percentage valid points) and provided more detailed explanations in the documentation. However, there is a distinction between the total count and the size of the raster.

For example, in areas like slopes, there is often a 1-pixel NaN padding, which we prefer to exclude from the total count. Otherwise, the percentage of valid points would not be 100% even when no mask is applied. To address this, I defined total count as the number of finite values in the raster and added an additional statistic: size, which returns the total size of the raster. These two values may be the same in some cases, but not always.
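A plain-NumPy sketch of the distinction described above, using a hypothetical slope-like array with a 1-pixel NaN border (the array and values are illustrative):

```python
import numpy as np

# Hypothetical slope-like raster: a 1-pixel NaN padding around real values.
slope = np.full((4, 4), np.nan)
slope[1:-1, 1:-1] = 10.0  # only interior pixels hold real data

size = slope.size                                   # total raster size: 16
total_count = np.count_nonzero(np.isfinite(slope))  # finite values only: 4

# With no extra mask applied, valid count equals total count,
# so the percentage of valid points is 100% despite the NaN border.
valid_count = total_count
percentage_valid = 100 * valid_count / total_count
```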

@rhugonnet
Member

rhugonnet commented Feb 20, 2025

OK, perfect!

I'm not sure I fully understand the point about the difference between total_count and size:

For example, in areas like slopes, there is often a 1-pixel NaN padding, which we prefer to exclude from the total count. Otherwise, the percentage of valid points would not be 100% even when no mask is applied

The Raster data.setter imposes that any NaN/inf value is masked, see:

5. Masks non-finite values that are unmasked, whether the input is a classic array or a masked_array. Note that

So it is impossible to create or derive a Raster with a NaN/inf value that is non-masked (except maybe by force-setting in Raster.data.data? Not sure there). Consequently, the case in the example above of "when no mask is applied" can never happen for a Raster with NaNs 😉.
So I'm not sure there can be a difference here between shape and total_count. It makes sense to me that any Raster which contains some NaNs (no matter where they originate from, such as for the slope) would then not have 100% valid points!

@vschaffn
Contributor Author

@rhugonnet Perhaps I misspoke: when I said no mask, I meant no additional mask (on top of the "classic" mask for a raster, which hides NaN/inf values), for classification for example.
The purpose of total_count is to separate the masked values coming from NaN/inf from those coming from other contributions (such as a classification mask). It then gives a percentage of valid points which does not take into account NaN/inf due to, for example, the calculation of slope.

Here is an example:

[screenshots]

@rhugonnet
Member

rhugonnet commented Feb 20, 2025

OK I see now! Thanks a lot for the clarification 😄

So I think the underlying issue here is that we need to clearly differentiate nodata masking (.mask and set_mask() to set/update valid raster values) and zonal masking/binning (classification = evaluating the raster on a mask).
Ideally, for evaluating the Raster on zonal/value bins and then computing associated statistics (especially counts), it's much better to do statistic(Raster[Mask]); otherwise using set_mask() makes it impossible to differentiate "invalid values" (NaNs/inf in the data) from "values we don't want to bin". It's also less practical for multiple calls (need to re-set the mask every time).

Sorry, I didn't realize this limitation when we framed the design (that you'd be using set_mask() in synergy with get_stats())!
It joins my earlier comment for get_stats() to support binning (zonal or by other values) in the functionality: GlacioHack/xdem#620 (comment).

To solve this specific issue here, I think we could simply add an inlier_masks argument to Raster.get_stats(), which would expect a list[Mask | np.ndarray[bool]] and evaluate the statistics on every mask passed from that list (optionally of length 1 for a single mask, and None by default which evaluates the whole raster).
This could later be extended to support any kind of binning, not just zonal.

So instead of doing:

rst.set_mask(mask1)
stats1 = rst.get_stats()
rst.set_mask(mask2)
stats2 = rst.get_stats()

We'd do:

all_stats = rst.get_stats(inlier_masks=[mask1, mask2])

What do you think? 😛

@vschaffn
Contributor Author

LE90 has been added (see GlacioHack/xdem#685)
@rhugonnet Thanks for this proposal, but for now I am happy with

rst.set_mask(mask1)
stats1 = rst.get_stats()

As we plan to develop a high-level classification in xDEM (internal tickets, coming on GitHub soon), we will think about ways to recover statistics from each class easily in the classification classes 😄

@rhugonnet
Member

rhugonnet commented Feb 21, 2025

OK. Just to clarify: My proposition was not to add multiple classification classes (this is a bonus), but to solve the issue of ambiguity of the masked values raised above.

It is impossible to define the total_count with the current approach of using set_mask(classif_mask) then get_stats(). It mixes two masked value concepts: invalid values and classif values, that once intertwined cannot be recovered.
To make this feasible, we would have to pass a classif_mask argument directly to get_stats(), hence my proposition.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 87dfa4b to bc7385d on February 24, 2025 14:04
@vschaffn
Contributor Author

vschaffn commented Feb 24, 2025

@rhugonnet ok I think I understood what you mean. I pushed some modifications, where I added the possibility to pass a mask in get_stats(), so that total_count can be computed before applying the mask to the raster. If no mask is passed, total_count has the same value as valid_count by default.
No list of masks for now, as the get_stats() method would need too many @overload signatures otherwise, but it is still possible to add that later.

@vschaffn
Contributor Author

[screenshots]

:returns: The requested statistic or a dictionary of statistics if multiple or all are requested.
"""
if not self.is_loaded:
self.load()
stats_dict = self._statistics(band=band)
if mask is not None:
total_count = np.count_nonzero(~self.get_mask())
Member

I think at this line, instead of ~self.get_mask(), it should be mask?
All the values we sample from (so the total_count) are determined by the mask.

Contributor Author

For the total count, we want to retrieve all the finite values (i.e. all the False values in the outlier mask). As mask represents a classification mask for example (inlier mask), it's self.get_mask() that we need here 😃

Member

Sorry it looks like we have a different definition in mind for total_count and/or valid_count 😅

Here's how I usually define those, so that we can identify the discrepancy between our definitions 😉:

  • The total count represents all the values (valid and invalid) that are sampled from the raster to compute the statistic. When a classif mask is passed, the sampled raster values are limited to the inlier mask (equivalent to running all statistics on the smaller array Raster[inlier_mask]). So the total_count is np.count_nonzero(inlier_mask). With no inlier mask, on the entire raster, the total_count is always all values np.prod(self.shape).
  • The valid count represents the valid values among all those sampled to compute the statistic. When a classif mask is passed, these values are limited to the valid values among the raster values subsampled by the inlier mask, so np.count_nonzero(~Raster.mask[inlier_mask]) (which in this case is equivalent to the way you compute it with just np.isfinite(~self.mask), because you've used set_mask() before). With no inlier mask, on the entire raster, it's always all the valid values np.count_nonzero(~Raster.mask).

Member

To better show the link between the two with these definition:
Without a classif mask: total_count is all values of the masked-array, valid_count is all non-masked values of the masked-array.
With a classif mask: Exactly the same calculations as above, but on the masked-array subsampled on the inlier mask Raster.data[inlier_mask], instead of the full masked-array Raster.data.
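These two definitions can be sketched on a toy masked array with a hypothetical inlier mask (all names and values are illustrative):

```python
import numpy as np

# Toy masked array standing in for Raster.data (mask marks invalid pixels).
data = np.ma.masked_array(
    np.arange(9.0).reshape(3, 3),
    mask=np.array([[True, False, False],
                   [False, False, False],
                   [False, False, True]]),
)

# Hypothetical zonal/classification inlier mask covering the top two rows.
inlier_mask = np.zeros((3, 3), dtype=bool)
inlier_mask[:2, :] = True

# Without a classif mask: all values vs. all non-masked values.
total_count = data.size                     # 9
valid_count = np.count_nonzero(~data.mask)  # 7 (two pixels are masked)

# With a classif mask: the same calculations on data[inlier_mask].
total_count_inlier = np.count_nonzero(inlier_mask)              # 6
valid_count_inlier = np.count_nonzero(~data.mask[inlier_mask])  # 5
```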

Member

@rhugonnet rhugonnet left a comment

Great! All good for me after the small changes mentioned above 🙂

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from bc7385d to 8192338 on February 25, 2025 08:59
@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from ff281d5 to 8f78710 on February 25, 2025 16:38
@vschaffn
Contributor Author

@rhugonnet as discussed I disabled total_count and percentage_valid_points if no inlier mask is passed 😃

@rhugonnet
Member

More specifically, I was thinking that we should add a total_inlier_points, valid_inlier_points and percentage_valid_inlier_points for when inlier_mask is passed (which would correspond to the calculation described here #644 (comment) with a classif mask).
And we disable those when no classif mask is passed, as you just said!

And then we always keep total_points, valid_points and percentage_valid_points that are the statistics on the full raster, even when an inlier mask is passed (calculations described in #644 (comment) without a classif mask).

@vschaffn
Contributor Author

@rhugonnet : I redefined the following stats:

  • Valid points: number of finite data points in the array.
  • Total points: total size of the raster.
  • Percentage valid points: ratio between Valid points and Total points.

If an inlier mask is passed:

  • Valid inlier points: number of unmasked data points in the array after applying the inlier mask.
  • Percentage valid inlier points: the ratio between Valid inlier points and Valid points. Useful for classification statistics.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch 2 times, most recently from c3045db to 5531ef6 on February 26, 2025 14:31
@vschaffn
Contributor Author

@rhugonnet :
If an inlier mask is passed:

  • Inlier points: number of unmasked data points in the inlier mask.
  • Valid inlier points: number of unmasked data points in the array after applying the inlier mask.
  • Percentage inlier points: ratio between Valid inlier points and Valid points. Useful for classification statistics.
  • Percentage valid inlier points: ratio between Valid inlier points and Inlier points.
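A toy sketch of these final statistics (the array, mask, and inlier mask are illustrative):

```python
import numpy as np

# Hypothetical raster: first row invalid (e.g. nodata), rest valid.
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True
data = np.ma.masked_array(np.ones((4, 4)), mask=mask)

# Hypothetical classification inlier mask: the left half of the raster.
inlier_mask = np.zeros((4, 4), dtype=bool)
inlier_mask[:, :2] = True

# Statistics on the full raster.
valid_points = np.count_nonzero(~data.mask)           # 12
total_points = data.size                              # 16
percentage_valid = 100 * valid_points / total_points  # 75.0

# Statistics once the inlier mask is passed.
inlier_points = np.count_nonzero(inlier_mask)                        # 8
valid_inlier_points = np.count_nonzero(~data.mask[inlier_mask])      # 6
percentage_inlier = 100 * valid_inlier_points / valid_points         # 50.0
percentage_valid_inlier = 100 * valid_inlier_points / inlier_points  # 75.0
```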

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 5531ef6 to 27aa9bc on February 26, 2025 14:35
@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 27aa9bc to 8aebf0b on February 26, 2025 15:34
@adebardo adebardo merged commit 0c18d9e into GlacioHack:main Feb 26, 2025
13 checks passed
@vschaffn vschaffn deleted the 679-valid_points_stat branch February 27, 2025 09:21
@duboise-cnes
Member

Thanks for the work and convergence! @rhugonnet @vschaffn and @adebardo!
