Allow any type for `true/falseValues`? #1011

akariv · 2018-09-18T12:10:35Z

akariv
Sep 18, 2018
Collaborator

For some reason this has been limited to strings only, and there's no actual need for that.
There are quite a few cases where boolean values are represented with 0/1 (integers) in the source data.

I propose removing the type restriction here, so that you could specify (as an example):

{
   "name": "my_boolean",
   "type": "boolean",
   "trueValues": [1],
   "falseValues": [0]
}
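A minimal sketch of how a reader could apply such a descriptor, assuming a hypothetical helper named `cast_boolean` (it is not part of any existing library). It compares both type and value, because in Python `1 == True`, and a plain `in` test would blur the integer, string and boolean representations:

```python
def cast_boolean(value, true_values, false_values):
    """Map a physical value onto a logical boolean, or raise ValueError."""

    def matches(candidates):
        # Compare type and value so that 1, "1" and True stay distinct.
        return any(type(value) is type(c) and value == c for c in candidates)

    if matches(true_values):
        return True
    if matches(false_values):
        return False
    raise ValueError(f"{value!r} is not a declared boolean representation")


# Field descriptor from the proposal above, written as a Python dict.
field = {
    "name": "my_boolean",
    "type": "boolean",
    "trueValues": [1],
    "falseValues": [0],
}

rows = [1, 0, 1]
print([cast_boolean(v, field["trueValues"], field["falseValues"]) for v in rows])
# prints [True, False, True]
```

With this sketch, the string "1" would be rejected for the descriptor above, since only the integers 1 and 0 are declared.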

akariv · 2018-09-18T12:10:46Z

akariv
Sep 18, 2018
Collaborator Author

/cc @roll @zelima

roll · 2018-09-19T07:39:23Z

roll
Sep 19, 2018
Maintainer

In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below.

Also, this section seems bound to the string representation.

akariv · 2018-09-19T08:13:25Z

akariv
Sep 19, 2018
Collaborator Author

Yes, I know, except a physical representation is not necessarily a string...

The physical representation of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation may have some type information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).

I think this is an error in the spec - there's an issue I opened there.
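To illustrate the quoted definition, here is a small sketch using only the Python standard library. The sample data and the column name `my_boolean` are made up for the illustration. The same 0/1 column arrives as strings from CSV but as integers from JSON:

```python
import csv
import io
import json

# The same two rows, once as CSV text and once as JSON text.
csv_text = "my_boolean\n1\n0\n"
json_text = '[{"my_boolean": 1}, {"my_boolean": 0}]'

# CSV carries no type information: every cell is read as a string.
csv_values = [row["my_boolean"] for row in csv.DictReader(io.StringIO(csv_text))]

# JSON carries primitive types: the numbers are read as integers.
json_values = [row["my_boolean"] for row in json.loads(json_text)]

print(csv_values)   # prints ['1', '0']
print(json_values)  # prints [1, 0]
```

A string-only `trueValues` list such as `["1"]` matches the CSV values here but not the JSON ones.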

roll · 2024-01-03T15:59:14Z

roll
Jan 3, 2024
Maintainer

It took me a while to think about it, and based on the current Table Schema concept of separating physical and logical values (I'm not sure "physical" is a good word here; it's more like "textual"), it seems to me that this issue needs to be closed as wontfix.

Of course, there are use cases where boolean fields are represented with integers, but strictly speaking, if a logical value is an integer it must be marked invalid against a boolean field. By my understanding, value substitution is part of the data casting process, and we don't do data casting in general for already typed data.

If the above is wrong I think the change should affect missingValues as well.

I'll ask the WG to discuss this issue.
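The strict reading described in this comment could be sketched roughly as follows. The function name `read_boolean_strict` and the example value lists are assumptions made for illustration, not taken from any implementation: substitution through `trueValues`/`falseValues` applies only to string input, and already typed values are checked against the field type without casting.

```python
def read_boolean_strict(value, true_values, false_values):
    """Strict reading: substitute strings only, never cast typed values."""
    if isinstance(value, bool):
        # Already a logical boolean (e.g. a native JSON true/false).
        return value
    if isinstance(value, str):
        # Textual data: apply the trueValues / falseValues substitution.
        if value in true_values:
            return True
        if value in false_values:
            return False
    # Anything else, including the integers 0 and 1, is invalid here.
    raise ValueError(f"{value!r} is not valid for a boolean field")


true_values = ["true", "1"]    # illustrative lists, strings only
false_values = ["false", "0"]

print(read_boolean_strict("1", true_values, false_values))   # prints True
print(read_boolean_strict(False, true_values, false_values)) # prints False

try:
    read_boolean_strict(1, true_values, false_values)
except ValueError as error:
    print("rejected:", error)
```

Under this reading, an integer 1 read from a JSON file is reported as invalid rather than substituted.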

peterdesmet · 2024-01-08T14:51:31Z

peterdesmet
Jan 8, 2024
Collaborator

PR #5 allows non-string values. Do I understand correctly that this issue is solved, then?

roll · 2024-01-08T14:57:04Z

roll
Jan 8, 2024
Maintainer

Hi @peterdesmet,

Sorry for the confusion. #5 has been reverted and the issue has been returned to discussion (I was misled by #864).

peterdesmet · 2024-01-08T16:12:58Z

peterdesmet
Jan 8, 2024
Collaborator

If I understand correctly, the suggestion by @roll is to keep requiring the values provided in missingValues, trueValues and falseValues to be strings.

If so, I'm fine with that suggestion.

pwalsh · 2024-01-08T19:48:08Z

pwalsh
Jan 8, 2024
Collaborator

@roll I think @akariv's suggestion is preferable.

akariv · 2024-01-09T06:10:46Z

akariv
Jan 9, 2024
Collaborator Author

This is actually a very good example of why the distinction between logical and representation (physical, lexical...) values is so important, and how making that distinction would have made this issue very simple to resolve.

Generally we use two types of values in the spec, without distinction - on the one hand, we talk about logical values a lot. For example, a datetime value points to a specific point in time. A number will point to a specific point on the real number line. A boolean can have two distinct logical values, a truthy one and a falsey one. And so on and so forth.

The representation of these values in data files that may be described by a data package might also vary. While we're used to thinking about CSV files, where the issues are usually dates with various formats or the decimal character of a number, other data formats use different data types (i.e. not strings) to represent data. For example, Excel might use an integer to represent dates or booleans. JSON files have native boolean values, but still need help when dates need to be decoded.

In the spec we sometimes refer to the logical value: for example, when defining constraints for the max value of a number, we don't care how it was represented, only its logical value. In other cases, we refer to the representation of the value: for example, when defining the 'missingValues' field, we declare which representations of values should be ignored.

What we need to do is specify, in every location where a value is to be given, whether it's a logical value or a representation value, and have the same rules apply to all instances of the same kind. Obviously, this specific issue requires a representation value, so if we decide that we allow Excel files to be described by data packages, I think the conclusion should be pretty clear on what the correct solution should be here :)

khusmann · 2024-01-09T20:52:57Z

khusmann
Jan 9, 2024
Collaborator

@akariv I appreciate your distinctions & definitions here. I think making the distinction between logical and representation values is especially salient re: value labels in categorical / ordinal types (as discussed in #844).

Conceptually, a boolean logical type with true/falseValues is a special case of an ordinal / categorical logical type with value labels (i.e. a binary categorical variable with logical values "true" and "false").

So as we consider the approach & language to adopt here with boolean types in distinguishing logical vs representations (i.e. label vs underlying representation values), I think it would be good to be thinking about how the decisions here generalize to categorical / ordinal types and value labels so we can keep the approach / language there consistent.

Right now the categorical extension defines a mapping between representation and label in enumLabels, where the representation values are all strings (which is consistent with how strings are provided for missingValues, trueValues, and falseValues). Allowing any type for the keys of enumLabels opens a can of worms, because in a JSON object the keys need to be strings.
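The constraint on JSON object keys can be seen with Python's standard `json` module. This is only an illustration, and the label mapping below is made up. Integer keys are silently converted to strings on serialization, so the original type cannot be recovered on a round trip:

```python
import json

# Hypothetical enumLabels-style mapping with integer keys.
labels = {1: "yes", 0: "no"}

encoded = json.dumps(labels)
decoded = json.loads(encoded)

print(encoded)        # prints {"1": "yes", "0": "no"}
print(list(decoded))  # prints ['1', '0']
```

After the round trip, the keys are the strings "1" and "0", which is why a key-based mapping cannot distinguish an integer representation from a string one.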

Another potential point of logical / representation confusion is labels on missing values. If the logical missing label "PARTICIPANT_SKIPPED_ITEM" is represented in the data as -999, should we define missingValues = [-999] or missingValues = ["PARTICIPANT_SKIPPED_ITEM"]?

I'm 100% with you on the idea that making the distinction between logical & representation values from the get-go would have made this issue (and the value labels situation by extension) easy to solve. I think the challenge here is to balance conceptual correctness / internal consistency / historical compatibility. So rather than doing the surgery that would be required across the spec to represent Excel files or other binary formats "correctly", I think it might be easier to just let a limitation of the datapackage format be that it is designed for underlying textual data representations... That way, "representation values" (e.g. missingValues, trueValues, falseValues, enumLabels keys) are always strings.

Not ideal, I know (especially for representing floating point values!). But given that we're not allowing breaking changes in V2, it kind of limits our ability for deep surgery... I'm open to more thoughts / brainstorm though!
