
Add percentage valid points in get_stats() #644

Merged
14 commits merged into GlacioHack:main on Feb 26, 2025

Conversation

vschaffn
Contributor

@vschaffn vschaffn commented Jan 24, 2025

Resolves GlacioHack/xdem#679.
Resolves GlacioHack/xdem#685.

Description

  • Add "valid points" and "percentage valid points" statistics, computed as the number of finite values divided by the number of values in the unmasked Raster.
  • Add "valid points" and "percentage valid points" to the dict returned by get_stats, with aliases.
  • Add "valid points" and "percentage valid points" to the test_stats method in test_raster.
  • Add "valid points no mask" and "percentage valid points no mask" to compute the same statistics as above but without masking values.
  • Add LE90 in stats.py and in Raster.get_stats, and test it in test_stats.py.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 2b4958b to 5368572 on January 24, 2025 10:18
@rhugonnet
Member

Nice addition, and good catch on the redundancy of the RMSE in this case! 😄

Two small remarks:

Last thought: I didn't see any function to calculate the valid points in the changes, maybe it's not there yet! In that case I would recommend np.count_nonzero applied to np.isfinite, instead of ~np.isnan, as the latter considers only NaNs but not +/- infinity that are often unusable in stats (and I've had some misadventures with those being propagated in raster data before!).
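A minimal NumPy sketch of the difference between the two checks (the array values are illustrative):

```python
import numpy as np

# Hypothetical 1D sample containing both a NaN and an infinity.
arr = np.array([1.0, np.nan, np.inf, 4.0])

# ~np.isnan misses +/- inf: the inf is still counted as "valid".
count_not_nan = np.count_nonzero(~np.isnan(arr))   # counts 3 values

# np.isfinite excludes NaN and +/- inf alike.
count_finite = np.count_nonzero(np.isfinite(arr))  # counts 2 values
```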

@adehecq
Member

adehecq commented Jan 29, 2025

Good addition!
Two other thoughts:

  • I believe your implementation does not exclude masked values. Remember that self.data is a masked array, so invalid pixels are masked instead of being set to NaN. One quick way to get the total number of unmasked values is through self.data.compressed().size. Maybe it would be good to add 1-2 tests. For example, the "exploradores_aster_dem" example has data gaps.
  • The name of your variable is incorrect. It should not be "percentile" but "percentage".
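A quick sketch of the compressed()-based count on a toy masked array (values are illustrative, standing in for self.data):

```python
import numpy as np

# Hypothetical masked array standing in for Raster.data: two pixels masked.
data = np.ma.masked_array(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    mask=np.array([[False, True], [True, False]]),
)

# .compressed() drops masked pixels, so its size is the unmasked count.
n_unmasked = data.compressed().size  # 2
n_total = data.size                  # 4
```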

Regarding @rhugonnet's comment:

It might be useful to some users to know the total point count (NaNs included)? If we report a "total count" and "valid count", then users can derive the percent of valid points themselves by dividing the two.

Not so easy to figure out which ones are the most useful between total count, valid count and fraction of valid pixels. It does not make so much sense to give all 3 as one can derive the 3rd from the 2 others... But I think the percentage of valid pixel is more useful than a count (which is generally gonna be large and not very convenient). Also, total count can be easily derived from the data shape... So I would go for either just the percentage of valid pixels, or percentage + total valid count.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from dc529a5 to 4b35217 on February 4, 2025 10:48
@vschaffn
Contributor Author

vschaffn commented Feb 4, 2025

@rhugonnet the function to calculate the valid points was already there, but it used ~np.isnan; I have changed it to np.isfinite following your feedback.

@adehecq the stats are applied on the unmasked data:

data = data.compressed()

Following your feedback I have changed percentile to percentage, and I added the valid points stat 😃

@vschaffn vschaffn changed the title Add percentile valid points in get_stats() Add percentage valid points in get_stats() Feb 4, 2025
@vschaffn vschaffn force-pushed the 679-valid_points_stat branch 4 times, most recently from 8216872 to 30c6300 on February 11, 2025 13:40
@adebardo

[screenshot]
This PR is ready for me; I have no inspiration to improve the naming. @rhugonnet , @adehecq , @duboise-cnes , do you have any ideas?
Feel free to merge this PR during the week to facilitate progress.

@rhugonnet
Member

rhugonnet commented Feb 14, 2025

Perfect! 🙂

I have one final comment on those two lines of code:

        valid_points = np.count_nonzero(np.isfinite(compressed_data))
        valid_points_all_data = np.count_nonzero(np.isfinite(data))

It's a bit delicate to explain, a lot of NumPy masked array technicalities 😅. Basically we want to ignore everything under the data.mask, but the current functions don't.

For the first line: The compressed_data can never contain any non-finite data (in Raster we enforce it in @data.setter in case the user overwrote values with NaNs/inf, but even in typical behaviour of masked arrays it's rare), so valid_points could simply be len(compressed_data). But actually we don't even need to compress the data (which flattens it into a separate 1D array, and creates more memory usage), we can simply count the boolean mask data.mask.

For the second line: The mask is not accounted for during np.isfinite(data), so it's counting all pixels under the mask too, which we don't want. Most masked values actually have a real .data.data value, often equal to the .nodata value like -9999 for instance, so it'll often be the total count. But not always. This is the case in the "slope" example you show, some NaNs/inf values were likely created by division by zero during the operation, so there are some NaNs/inf under the mask. But, as these values are masked, they are ignored by all operations, so it doesn't matter (they will be converted back to the nodata value, like -9999, when the raster is written to disk).

So in short, we should replace by:

        valid_points = np.count_nonzero(~data.mask)
        total_points = np.prod(data.shape)

And replace "Valid points all data" with "Total count"/"Total points".

See the correspondence with the example file of GeoUtils that has invalid values (but without NaNs under the mask, unlike the slope):

aster_dem_path = examples.get_path("exploradores_aster_dem")
rst = gu.Raster(aster_dem_path)
np.count_nonzero(np.isfinite(rst.data))
Out[7]: 333102
np.prod(np.shape(rst.data))
Out[11]: 333102

and

np.count_nonzero(np.isfinite(rst.data.compressed()))
Out[8]: 324194
np.count_nonzero(~rst.data.mask)
Out[10]: 324194

@rhugonnet
Member

Actually, this also reminds me that we should use np.ma functions everywhere for computational performance: they can be more efficient than doing np.func(data.compressed()).

So replace np.nanmean(compressed_data) (note that the nanmean was not necessary!) with np.ma.mean(data) directly, and the same for all other functions 😉.
And normally our nmad should also support this.
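A small sketch of the suggested swap on a toy masked array (illustrative values; np.ma has no percentile function, so that statistic still needs the compressed array):

```python
import numpy as np

data = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])

# Equivalent results; np.ma.mean skips masked values without
# materializing a compressed 1D copy of the data.
mean_compressed = np.nanmean(data.compressed())
mean_ma = np.ma.mean(data)

# No np.ma.percentile exists, so the compressed array is still needed here.
p90 = np.percentile(data.compressed(), 90)
```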

@duboise-cnes
Member

Great work and remarks @rhugonnet @adebardo @vschaffn
I have reviewed it, see above.

I would only say that the namings are not that explicit, as you asked @adebardo :) :

  • At minimum, explain it precisely in the documentation (NaN, mask, ...) to have at least an RTFM possibility for users ;)
  • And for naming, "valid" is fuzzy: what is valid? Can you describe it to find a naming?
  • Also, "all data": what is the meaning? If all data, what is valid?

Find something as explicit as possible about what the code does, and really clear for the user.

My proposition (not perfect...):
"Number of non-masked (or unmasked) valid points" and "Percentage of non-masked points"
"Number of valid points" and "Percentage of valid points"
and explain in the documentation that valid means all finite values, excluding np.nan, np.inf and -np.inf (if correct?)
(if compressed is https://numpy.org/doc/stable/reference/generated/numpy.ma.MaskedArray.compressed.html it is explicit ;) )

Hope I understood well and that it helps. Anyway, be explicit in documentation and naming about what is done.

@rhugonnet
Member

I agree with @duboise-cnes.
My favourite namings would be "Valid count" and "Total count" (see my first comment on top of the thread on the naming "count" in NumPy and SciPy). And in the description/documentation, we can explain that "valid" means non-masked values, which are always finite in a Raster.

@vschaffn
Contributor Author

@rhugonnet @dubois Thank you for your feedback and for the clarification on how rasters handle masking.

I have updated the code by replacing all instances of np.nanstat with np.ma.stat, except for the 90th percentile, as np.ma.percentile does not exist.

Additionally, I incorporated your suggestions regarding the final statistics (valid count, total count, and percentage valid points) and provided more detailed explanations in the documentation. However, there is a distinction between the total count and the size of the raster.

For example, in areas like slopes, there is often a 1-pixel NaN padding, which we prefer to exclude from the total count. Otherwise, the percentage of valid points would not be 100% even when no mask is applied. To address this, I defined total count as the number of finite values in the raster and added an additional statistic: size, which returns the total size of the raster. These two values may be the same in some cases, but not always.
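A plain-NumPy sketch of the distinction described above, using a hypothetical slope-like array with a 1-pixel NaN border (the array and values are illustrative):

```python
import numpy as np

# Hypothetical slope-like raster: a 1-pixel NaN padding around real values.
slope = np.full((4, 4), np.nan)
slope[1:-1, 1:-1] = 10.0  # only interior pixels hold real data

size = slope.size                                   # total raster size: 16
total_count = np.count_nonzero(np.isfinite(slope))  # finite values only: 4

# With no extra mask applied, valid count equals total count,
# so the percentage of valid points is 100% despite the NaN border.
valid_count = total_count
percentage_valid = 100 * valid_count / total_count
```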

@rhugonnet
Member

rhugonnet commented Feb 20, 2025

OK, perfect!

I'm not sure I fully understand the point about the difference between total_count and size:

For example, in areas like slopes, there is often a 1-pixel NaN padding, which we prefer to exclude from the total count. Otherwise, the percentage of valid points would not be 100% even when no mask is applied

The Raster data.setter imposes that any NaN/inf value is masked, see:

5. Masks non-finite values that are unmasked, whether the input is a classic array or a masked_array. Note that

So it is impossible to create or derive a Raster with a NaN/inf value that is non-masked (except maybe by force-setting in Raster.data.data? Not sure there). Consequently, the case in the example above of "when no mask is applied" can never happen for a Raster with NaNs 😉.
So I'm not sure there can be a difference here between shape and total_count. It makes sense to me that any Raster which contains some NaNs (no matter where they originate from, such as for the slope) would then not have 100% valid points!

@vschaffn
Contributor Author

@rhugonnet Perhaps I misspoke: when I said no mask, I meant no additional mask (on top of the "classic" mask for a raster, which hides NaN/inf values), for classification for example.
The purpose of total_count is to separate the masked values coming from NaN/inf from those coming from other contributions (such as a classification mask). It then gives a percentage of valid points which does not take into account NaN/inf due to, for example, the calculation of slope.

Here is an example:

[screenshots]

@rhugonnet
Member

rhugonnet commented Feb 20, 2025

OK I see now! Thanks a lot for the clarification 😄

So I think the underlying issue here is that we need to clearly differentiate nodata masking (.mask and set_mask() to set/update valid raster values) and zonal masking/binning (classification = evaluating the raster on a mask).
Ideally, for evaluating the Raster on zonal/value bins and then computing associated statistics (especially counts), it's much better to do statistic(Raster[Mask]); otherwise using set_mask() makes it impossible to differentiate "invalid values" (NaNs/inf in the data) from "values we don't want to bin". It's also less practical for multiple calls (need to re-set the mask every time).

Sorry, I didn't realize this limitation when we framed the design (that you'd be using set_mask() in synergy with get_stats())!
It joins my earlier comment for get_stats() to support binning (zonal or by other values) in the functionality: GlacioHack/xdem#620 (comment).

To solve this specific issue here, I think we could simply add an inlier_masks argument to Raster.get_stats(), which would expect a list[Mask | np.ndarray[bool]] and evaluate the statistics on every mask passed from that list (optionally of length 1 for a single mask, and None by default which evaluates the whole raster).
This could later be extended to support any kind of binning, not just zonal.

So instead of doing:

rst.set_mask(mask1)
stats1 = rst.get_stats()
rst.set_mask(mask2)
stats2 = rst.get_stats()

We'd do:

all_stats = rst.get_stats(inlier_masks=[mask1, mask2])

What do you think? 😛

@vschaffn
Contributor Author

LE90 has been added (see GlacioHack/xdem#685)
@rhugonnet Thanks for this proposal, but for now I am happy with

rst.set_mask(mask1)
stats1 = rst.get_stats()

As we plan to develop a high-level classification in xDEM (internal tickets, coming on GitHub soon), we will think about ways to recover statistics from each class easily in the classification classes 😄

@rhugonnet
Member

rhugonnet commented Feb 21, 2025

OK. Just to clarify: My proposition was not to add multiple classification classes (this is a bonus), but to solve the issue of ambiguity of the masked values raised above.

It is impossible to define the total_count with the current approach of using set_mask(classif_mask) then get_stats(). It mixes two masked value concepts: invalid values and classif values, that once intertwined cannot be recovered.
To make this feasible, we would have to pass a classif_mask argument directly to get_stats(), hence my proposition.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 87dfa4b to bc7385d on February 24, 2025 14:04
@vschaffn
Contributor Author

vschaffn commented Feb 24, 2025

@rhugonnet ok I think I understood what you mean. I pushed some modifications, where I added the possibility to pass a mask in get_stats(), so that total_count can be computed before applying the mask to the raster. If no mask is passed, total_count has the same value as valid_count by default.
No list of masks for now, as the get_stats() method would need too many @overload signatures otherwise, but it is still possible to add that later.

@vschaffn
Contributor Author

[screenshots]

:returns: The requested statistic or a dictionary of statistics if multiple or all are requested.
"""
if not self.is_loaded:
self.load()
stats_dict = self._statistics(band=band)
if mask is not None:
total_count = np.count_nonzero(~self.get_mask())
Member

I think at this line, instead of ~self.get_mask(), it should be mask?
All the values we sample from (so the total_count) are determined by the mask.

Contributor Author

For the total count, we want to retrieve all the finite values (i.e. all the False values in the outlier mask). As mask represents a classification mask for example (inlier mask), it's self.get_mask() that we need here 😃

Member

Sorry it looks like we have a different definition in mind for total_count and/or valid_count 😅

Here's how I usually define those, so that we can identify the discrepancy between our definitions 😉:

  • The total count represents all the values (valid and invalid) that are sampled from the raster to compute the statistic. When a classif mask is passed, the sampled raster values are limited to the inlier mask (equivalent to running all statistics on the smaller array Raster[inlier_mask]). So the total_count is np.count_nonzero(inlier_mask). With no inlier mask, on the entire raster, the total_count is always all values np.prod(self.shape).
  • The valid count represents the valid values among all those sampled to compute the statistic. When a classif mask is passed, these values are limited to the valid values among the raster values subsampled by the inlier mask, so np.count_nonzero(~Raster.mask[inlier_mask]) (which in this case is equivalent to the way you compute it with just np.isfinite(~self.mask), because you've used set_mask() before). With no inlier mask, on the entire raster, it's always all the valid values np.count_nonzero(~Raster.mask).

Member

To better show the link between the two with these definition:
Without a classif mask: total_count is all values of the masked-array, valid_count is all non-masked values of the masked-array.
With a classif mask: Exactly the same calculations as above, but on the masked-array subsampled on the inlier mask Raster.data[inlier_mask], instead of the full masked-array Raster.data.
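These two definitions can be sketched on a toy masked array with a hypothetical inlier mask (all names and values are illustrative):

```python
import numpy as np

# Toy masked array standing in for Raster.data (mask marks invalid pixels).
data = np.ma.masked_array(
    np.arange(9.0).reshape(3, 3),
    mask=np.array([[True, False, False],
                   [False, False, False],
                   [False, False, True]]),
)

# Hypothetical zonal/classification inlier mask covering the top two rows.
inlier_mask = np.zeros((3, 3), dtype=bool)
inlier_mask[:2, :] = True

# Without a classif mask: all values vs. all non-masked values.
total_count = data.size                     # 9
valid_count = np.count_nonzero(~data.mask)  # 7 (two pixels are masked)

# With a classif mask: the same calculations on data[inlier_mask].
total_count_inlier = np.count_nonzero(inlier_mask)              # 6
valid_count_inlier = np.count_nonzero(~data.mask[inlier_mask])  # 5
```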

Member

@rhugonnet rhugonnet left a comment

Great! All good for me after the small changes mentioned above 🙂

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from bc7385d to 8192338 on February 25, 2025 08:59
@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from ff281d5 to 8f78710 on February 25, 2025 16:38
@vschaffn
Contributor Author

@rhugonnet as discussed I disabled total_count and percentage_valid_points if no inlier mask is passed 😃

@rhugonnet
Member

More specifically, I was thinking that we should add a total_inlier_points, valid_inlier_points and percentage_valid_inlier_points for when inlier_mask is passed (which would correspond to the calculation described here #644 (comment) with a classif mask).
And we disable those when no classif mask is passed, as you just said!

And then we always keep total_points, valid_points and percentage_valid_points that are the statistics on the full raster, even when an inlier mask is passed (calculations described in #644 (comment) without a classif mask).

@vschaffn
Contributor Author

@rhugonnet : I redefined the following stats:

  • Valid points: number of finite data points in the array.
  • Total points: total size of the raster.
  • Percentage valid points: ratio between Valid points and Total points.

If an inlier mask is passed:

  • Valid inlier points: number of unmasked data points in the array after applying the inlier mask.
  • Percentage valid inlier points: the ratio between Valid inlier points and Valid points. Useful for classification statistics.

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch 2 times, most recently from c3045db to 5531ef6 on February 26, 2025 14:31
@vschaffn
Contributor Author

@rhugonnet :
If an inlier mask is passed:

  • Inlier points: number of unmasked data points in the inlier mask.
  • Valid inlier points: number of unmasked data points in the array after applying the inlier mask.
  • Percentage inlier points: ratio between Valid inlier points and Valid points. Useful for classification statistics.
  • Percentage valid inlier points: ratio between Valid inlier points and Inlier points.
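A toy sketch of these final statistics (the array, mask, and inlier mask are illustrative):

```python
import numpy as np

# Hypothetical raster: first row invalid (e.g. nodata), rest valid.
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True
data = np.ma.masked_array(np.ones((4, 4)), mask=mask)

# Hypothetical classification inlier mask: the left half of the raster.
inlier_mask = np.zeros((4, 4), dtype=bool)
inlier_mask[:, :2] = True

# Statistics on the full raster.
valid_points = np.count_nonzero(~data.mask)           # 12
total_points = data.size                              # 16
percentage_valid = 100 * valid_points / total_points  # 75.0

# Statistics once the inlier mask is passed.
inlier_points = np.count_nonzero(inlier_mask)                        # 8
valid_inlier_points = np.count_nonzero(~data.mask[inlier_mask])      # 6
percentage_inlier = 100 * valid_inlier_points / valid_points         # 50.0
percentage_valid_inlier = 100 * valid_inlier_points / inlier_points  # 75.0
```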

@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 5531ef6 to 27aa9bc on February 26, 2025 14:35
@vschaffn vschaffn force-pushed the 679-valid_points_stat branch from 27aa9bc to 8aebf0b on February 26, 2025 15:34
@adebardo adebardo merged commit 0c18d9e into GlacioHack:main Feb 26, 2025
13 checks passed
@vschaffn vschaffn deleted the 679-valid_points_stat branch February 27, 2025 09:21
@duboise-cnes
Member

Thanks for the work and convergence! @rhugonnet @vschaffn and @adebardo!
