ENH: Add default fill values for decode_cf #5680

gcaria · 2021-08-06T19:54:05Z

Closes Suggestion: Add option for default_fillvals to open_dataset #2374
Tests added
Passes pre-commit run --all-files
User visible changes (including notable bug fixes) are documented in whats-new.rst

This is a work in progress, mostly so that I can ask some clarifying questions.

I see that netCDF4 is an optional dependency for xarray, so probably import netCDF4 can't be used. Should xarray simply hard-code default fill values ?

From the issue's conversation, it wasn't clear to me whether an argument should control the use of the default fill value. Since some tests fail now I guess the answer is yes.

github-actions · 2021-08-06T20:18:42Z

Unit Test Results

        6 files         6 suites 56m 11s ⏱️
16 201 tests 14 433 ✔️ 1 735 💤   33 ❌
90 402 runs 82 037 ✔️ 8 175 💤 190 ❌

For more details on these failures, see this check.

Results for commit 26f1b32.

♻️ This comment has been updated with latest results.

dcherian · 2021-08-06T22:42:19Z

xarray/coding/variables.py

@@ -183,6 +184,8 @@ def decode(self, variable, name=None):
            pop_to(attrs, encoding, attr, name=name)
            for attr in ("missing_value", "_FillValue")
        ]
+
+        raw_fill_values.append(netCDF4.default_fillvals["f8"])


I think we should vendor this dictionary to avoid adding netcdf4 as a required dependency.

{'S1': '\x00', 'i1': -127, 'u1': 255, 'i2': -32767, 'u2': 65535, 'i4': -2147483647, 'u4': 4294967295, 'i8': -9223372036854775806, 'u8': 18446744073709551614, 'f4': 9.969209968386869e+36, 'f8': 9.969209968386869e+36}

We'll also have to pick the appropriate one based on dtype.

pep8speaks · 2021-08-08T16:02:37Z

Hello @gcaria! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-08 16:03:49 UTC

gcaria · 2021-08-08T16:07:54Z

I would have liked to add the dictionary in xarray.conventions but this causes circular imports.

It seems that a lot of test failures are caused by something like:

> xarray.core.dtypes.maybe_promote(dtype('int64'))

(dtype('float64'), nan

So the answer to this must really be yes, am I right?

From the issue's conversation, it wasn't clear to me whether an argument should control the use of the default fill value. Since some tests fail now I guess the answer is yes.

shoyer · 2021-08-09T07:54:22Z

Could you clarify where these default fill values come from?

Are they just an arbitrary choice by netCDF4-Python? Or are they part of some broader standard?

dcherian · 2021-08-09T15:50:46Z

It's in the standard (partly?): https://www.unidata.ucar.edu/software/netcdf/documentation/4.7.4-pre/file_format_specifications.html#atts_spec

                                  // Default fill values for each type, may be
                                  // overridden by variable attribute named
                                  // '_FillValue'. See "Note on fill values",
                                  // below.
     FILL_CHAR    = \x00                      // null byte
     FILL_BYTE    = \x81                      // (signed char) -127
     FILL_SHORT   = \x80 \x01                 // (short) -32767
     FILL_INT     = \x80 \x00 \x00 \x01       // (int) -2147483647
     FILL_FLOAT   = \x7C \xF0 \x00 \x00       // (float) 9.9692099683868690e+36
     FILL_DOUBLE  = \x47 \x9E \x00 \x00 \x00 \x00 \x00 \x00 //(double)9.9692099683868690e+36

and

Note on fill values: Because data variables may be created before their values are written, and because values need not be written sequentially in a netCDF file, default “fill values” are defined for each type, for initializing data values before they are explicitly written. This makes it possible to detect reading values that were never written. The variable attribute “_FillValue”, if present, overrides the default fill value for a variable. If _FillValue is defined then it should be scalar and of the same type as the variable.

Fill values are not required, however, because netCDF libraries have traditionally supported a “no fill” mode when writing, omitting the initialization of variable values with fill values. This makes the creation of large files faster, but also eliminates the possibility of detecting the inadvertent reading of values that haven't been written

EDIT: I remember reading some text about how the default _FillValues are "close to the largest or smallest number representable by a datatype", but I cannot find it now.

kmuehlbauer · 2021-08-09T16:38:13Z

AFAIK, these values are chosen, because their binary presentation is good for compression.

For instance the 32bit float 9.969209968386869e+36 is hex 0x7CF00000.

Unfortunately I can't find a link describing that.

shoyer · 2021-08-09T18:46:59Z

Right, so netCDF3 has a default value used for filling out variables before any data is written.

My concern is that there are two (overlapping) use-case for fill values:

The default array value used for variables on disk, e.g., before they are written
Truly missing values (with different semantics), which Xarray represents with NaN

Certainly these sometimes coincide, but that isn't necessarily the case.

shoyer · 2021-08-09T20:11:24Z

To follow up, from a practical perspective, there are two problems with assuming that there are always "truly missing values" (case 2):

It makes it impossible to represent the full range of values in a data type, e.g., 255 for uint8 now means "missing".
Due to unfortunately limited options for representing missing data in NumPy, Xarray represents truly missing values in its data model with "NaN". This is more or less OK for floating point data, but means that integer data gets converted into floats. For example, uint8 would now get automatically converted into float32.

Both of these issues are problematic for faithful "round tripping" of Xarray data into netCDF and back. For this reason, Xarray needs an unambiguous way to know if a netCDF variable could contain semantically missing values. So far, we've used the presence of missing_value and _FillValue attributes for that.

Add f8 default fill value and test for it.

4ebfb4d

gcaria changed the title ~~Add f8 default fill value and test for it.~~ ENH: Add default fill values for decode_cf Aug 6, 2021

dcherian reviewed Aug 6, 2021

View reviewed changes

Add built-in dictionary for dtype to default fill value

26f1b32

gcaria force-pushed the add_default_fill_value branch from 834aaba to 26f1b32 Compare August 8, 2021 16:03

dcherian mentioned this pull request Apr 6, 2023

default fill_value not masked when read from file #7723

Closed

4 tasks

kmuehlbauer mentioned this pull request Apr 28, 2023

masked_array write/read differences between xarray and netCDF4 #2478

Closed

This was referenced Sep 13, 2023

Unexpected type conversion in variables with _FillValue #6055

Closed

Add support for netCDF4.EnumType #8147

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add default fill values for decode_cf #5680

ENH: Add default fill values for decode_cf #5680

gcaria commented Aug 6, 2021 •

edited

Loading

github-actions bot commented Aug 6, 2021 •

edited

Loading

dcherian Aug 6, 2021

pep8speaks commented Aug 8, 2021 •

edited

Loading

gcaria commented Aug 8, 2021

shoyer commented Aug 9, 2021

dcherian commented Aug 9, 2021 •

edited

Loading

kmuehlbauer commented Aug 9, 2021

shoyer commented Aug 9, 2021

shoyer commented Aug 9, 2021

ENH: Add default fill values for decode_cf #5680

Are you sure you want to change the base?

ENH: Add default fill values for decode_cf #5680

Conversation

gcaria commented Aug 6, 2021 • edited Loading

github-actions bot commented Aug 6, 2021 • edited Loading

Unit Test Results

dcherian Aug 6, 2021

Choose a reason for hiding this comment

pep8speaks commented Aug 8, 2021 • edited Loading

Comment last updated at 2021-08-08 16:03:49 UTC

gcaria commented Aug 8, 2021

shoyer commented Aug 9, 2021

dcherian commented Aug 9, 2021 • edited Loading

kmuehlbauer commented Aug 9, 2021

shoyer commented Aug 9, 2021

shoyer commented Aug 9, 2021

gcaria commented Aug 6, 2021 •

edited

Loading

github-actions bot commented Aug 6, 2021 •

edited

Loading

pep8speaks commented Aug 8, 2021 •

edited

Loading

dcherian commented Aug 9, 2021 •

edited

Loading