This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Add a categorical field type [native values version] #62

Closed
wants to merge 13 commits into from
74 changes: 70 additions & 4 deletions content/docs/specifications/table-schema.md
@@ -127,18 +127,23 @@ A Table Schema descriptor `MAY` contain a property `fieldsMatch` that `MUST` be

Many datasets arrive with missing data values, either because a value was not collected or because it never existed. Missing values may be indicated simply by the value being empty; in other cases, a special value may have been used, e.g. `-`, `NaN`, `0`, `-9999`, etc.

`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
The `missingValues` property configures which native values `MUST` be treated as logical `null` values. If provided, the `missingValues` property `MUST` be an `array` of native values, or an array of `object`s.

`missingValues` `MUST` be an `array` where each entry is a `string`.
If an `array` of `object`s is provided, each `object` `MUST` have a `value` and optional `label` property. The `value` property `MUST` be a native value that represents a logical `null` in a field. The optional `label` property `MUST` be a `string` that provides a human-readable label for the missing value.

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.
The conversion to `null` is done before any other attempted type-specific conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to `null` will be done, on any value.

Examples:
Examples of the `missingValues` property:

```text
"missingValues": [""]
"missingValues": ["-"]
"missingValues": ["NaN", "-"]
"missingValues": [-9999]
"missingValues": [
  { "value": "", "label": "OMITTED" },
  { "value": -99, "label": "REFUSED" }
]
```
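
As a non-normative illustration, the following Python sketch shows one way an implementation might accept both forms of `missingValues` and apply the conversion to `null` before any type-specific conversion. The function names `parse_missing_values` and `to_logical` are hypothetical, and the strict type comparison (so that the string `"-99"` does not match the number `-99`) is an assumption of this sketch rather than something the text above prescribes.

```python
from typing import Any, List, Optional, Tuple

# Default taken from the text above: empty strings are treated as null.
DEFAULT_MISSING_VALUES = [""]


def parse_missing_values(
    missing_values: Optional[List[Any]] = None,
) -> List[Tuple[Any, Optional[str]]]:
    """Return (value, label) pairs from either accepted form of `missingValues`."""
    if missing_values is None:
        missing_values = DEFAULT_MISSING_VALUES
    pairs = []
    for entry in missing_values:
        if isinstance(entry, dict):
            # Object form: `value` is required, `label` is optional.
            pairs.append((entry["value"], entry.get("label")))
        else:
            # Plain native value form: no label available.
            pairs.append((entry, None))
    return pairs


def to_logical(native: Any, pairs: List[Tuple[Any, Optional[str]]]) -> Any:
    """Return None if `native` is a declared missing value, else return it unchanged."""
    for value, _label in pairs:
        # Assumption: compare the type as well as the value.
        if type(native) is type(value) and native == value:
            return None
    return native
```

With the last example above, `to_logical(-99, pairs)` would yield `None`, while `to_logical("-99", pairs)` would pass the string through untouched for later processing.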

#### `primaryKey`
@@ -469,6 +474,63 @@ The boolean field can be customised with these additional properties:
- **trueValues**: `[ "true", "True", "TRUE", "1" ]`
- **falseValues**: `[ "false", "False", "FALSE", "0" ]`

### `categorical`

The field contains categorical data, defined as data with a finite set of possible values that represent levels of a categorical variable.

**Native Representation**

If the data format supports categorical values natively, categorical values `MUST` be represented using that native type. In this case, the field `MAY` additionally include the `categories` property, as described below. If categorical values are not supported by the native format and are instead represented using other native types (e.g. native strings or numbers), the `categories` property `MUST` be provided.

The `categories` property `MUST` be an array of native values, or an array of objects.

When the `categories` property is an array of native values, the values `MUST` be unique and `MUST` match the native values of the field. For example:

```json
{
  "name": "fruit",
  "type": "categorical",
  "categories": ["apple", "orange", "banana"]
}
```

When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be the native value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the native values `0`, `1`, and `2` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:

```json
{
  "name": "fruit",
  "type": "categorical",
  "categories": [
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" },
    { "value": 2, "label": "banana" }
  ]
}
```
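
To make the two accepted shapes of `categories` concrete, here is a minimal Python sketch that normalizes either form into a mapping from native value to label. The helper name `category_labels` is hypothetical, and rejecting duplicate values in the object form as well as in the plain form is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional


def category_labels(categories: List[Any]) -> Dict[Any, Optional[str]]:
    """Map each native value to its label (None when no label is given)."""
    labels: Dict[Any, Optional[str]] = {}
    for entry in categories:
        if isinstance(entry, dict):
            # Object form: `value` is required, `label` is optional.
            value, label = entry["value"], entry.get("label")
        else:
            # Plain native value form.
            value, label = entry, None
        if value in labels:
            raise ValueError(f"duplicate category value: {value!r}")
        labels[value] = label
    return labels
```

Applied to the coded example above, the result maps `0`, `1`, and `2` to `apple`, `orange`, and `banana`; applied to the plain string form, every label is `None`.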

The `categorical` field type `MAY` additionally have the property `ordered` to indicate whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. In cases where the native values are numeric and `ordered` is `true`, the order of the levels `SHOULD` match the numerical order of the values (e.g., 1, 2, 3, ...) to avoid ambiguity. For example:

```json
{
  "name": "agreementLevel",
  "type": "categorical",
  "categories": [
    { "value": 1, "label": "Strongly Disagree" },
    { "value": 2 },
    { "value": 3 },
    { "value": 4 },
    { "value": 5, "label": "Strongly Agree" }
  ],
  "ordered": true
}
```

When the property `ordered` is `false` or not present, and no ordering information is provided by the native format, implementations `SHOULD` assume that the levels of the `categorical` do not have a natural order.
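
The ordering rules can be sketched in Python as follows. This is an illustration only: the function names `level_ranks` and `numeric_order_is_consistent` are hypothetical, and the sketch ignores any ordering information a native format might supply.

```python
from typing import Any, Dict, List, Optional


def _native_values(field: Dict[str, Any]) -> List[Any]:
    """Extract the native value of each level, in declaration order."""
    return [c["value"] if isinstance(c, dict) else c for c in field["categories"]]


def level_ranks(field: Dict[str, Any]) -> Optional[Dict[Any, int]]:
    """Return {native value: rank} in ascending order, or None when unordered."""
    if not field.get("ordered", False):
        # `ordered` is false or absent: assume no natural order.
        return None
    return {value: rank for rank, value in enumerate(_native_values(field))}


def numeric_order_is_consistent(field: Dict[str, Any]) -> bool:
    """Check that numeric native values are declared in ascending numerical order."""
    values = _native_values(field)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    # Non-numeric values carry no numerical order to compare against.
    return (not numeric) or values == sorted(values)
```

For the `agreementLevel` field above, the level with value `1` receives the lowest rank and the level with value `5` the highest, and the consistency check passes because the declaration order matches the numerical order.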

Although the `categorical` field type restricts a field to a finite set of possible values, similar to an `enum` constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types.

When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the logical values representing the levels of the categorical. Logical values of categorical levels are indicated by either a native value or an object matching the corresponding level definition in the `categories` property.
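
A rough Python sketch of the subset rule, under stated assumptions: the helper name `enum_is_subset` is hypothetical, and an object in the `enum` array is matched to a level by its `value` alone, which is a simplification of "matching the corresponding level definition".

```python
from typing import Any, List


def _native(entry: Any) -> Any:
    """Return the native value behind a plain entry or a {value, label} object."""
    return entry["value"] if isinstance(entry, dict) else entry


def enum_is_subset(enum: List[Any], categories: List[Any]) -> bool:
    """Check that every enum entry refers to a level declared in `categories`."""
    levels = [_native(c) for c in categories]
    return all(_native(e) in levels for e in enum)
```

For a field whose `categories` declare the values `0`, `1`, and `2`, an `enum` of `[0, 1]` would pass this check, while an `enum` containing `3` would not.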

### `object`

The field contains a valid JSON object.
@@ -684,6 +746,10 @@ A regular expression that can be used to test field values. If the regular expre

The value of the field `MUST` exactly match one of the values in the `enum` array.

:::note[Backward Compatibility]
Many `v1.0` implementations imported fields with `enum` constraints as categorical data types. Starting in `v2.0` this behavior is discouraged in favor of explicit use of the [`categorical`](#categorical) field type. In `v2.0`, an `enum` constraint `SHOULD` be interpreted by implementations as a validation rule on an existing field type, and `SHOULD NOT` change the imported data type of the field.
:::

:::note[Implementation Note]

- Implementations `SHOULD` report an error if an attempt is made to evaluate a value against an unsupported constraint.