Replies: 10 comments
-
Also, this section seems bound to the string representation.
-
Yes, I know, except a physical representation is not necessarily a string...
I think this is an error in the spec - there's an issue I opened there.
-
It took me a while to think about it, and based on the current Table Schema's concept of [...]

Of course, there are use cases where boolean fields are represented with integers, but strictly speaking, if a logical value is an integer, it must be marked invalid against a boolean field. As I understand it, value substitution is part of the data casting process, and we don't do data casting in general for already typed data.

If the above is wrong, I think the change should affect [...]

I'll ask the WG to discuss this issue.
-
PR #5 allows non-string values. Do I understand correctly that this issue is solved then?
-
Hi @peterdesmet, sorry for the confusion. #5 has been reverted and the issue has been moved back to discussion (I was misled by #864).
-
If I understand correctly, the suggestion by @roll is to keep requiring the values provided in [...] If so, I'm fine with that suggestion.
-
This is actually a very good example of why the distinction between logical and representation (physical, lexical...) values is so important, and how making that distinction would have made this issue very simple to resolve.

Generally we use two types of values in the spec, without distinction. On the one hand, we talk about logical values a lot. For example, a [...]

The representation of these values in data files that may be described by a data package might also vary. While we're used to thinking about CSV files, where the issues are usually dates with various formats or the decimal character of a number, other data formats use different data types (i.e. not strings) to represent data. For example, Excel might use an integer to represent dates or booleans. JSON files have native boolean values, but still need help when dates have to be decoded.

In the spec we sometimes refer to the logical value: for example, when defining a constraint for the maximum value of a number, we don't care how it was represented, only its logical value. In other cases, we refer to the representation of the value: for example, when defining the 'missingValues' field, we declare which representations of values should be ignored.

What we need to do is to specify, in every location where a value is to be given, whether it's a logical value or a representation value, and have the same rules apply for all instances of the same kind.

Obviously, this specific issue requires a representation value, so if we decide that we allow Excel files to be described by data packages, I think the conclusion on the correct solution here should be pretty clear :)
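As a rough illustration of the distinction drawn above, here is a minimal Python sketch in which `missingValues` is matched against the representation while the `maximum` constraint is checked against the logical value. The property names `decimalChar`, `constraints.maximum` and `missingValues` come from Table Schema; the `FIELD` and `MISSING_VALUES` contents and the `read_number` helper are hypothetical, made up for this example, and are not how any particular implementation works.

```python
from decimal import Decimal
from typing import Any, Optional

# Hypothetical schema fragments, for illustration only.
MISSING_VALUES = ["", "n/a"]            # representation values
FIELD = {
    "name": "price",
    "type": "number",
    "decimalChar": ",",                 # describes the representation
    "constraints": {"maximum": 10},     # expressed as a logical value
}


def read_number(raw: Any, field: dict = FIELD,
                missing_values: list = MISSING_VALUES) -> Optional[Decimal]:
    """Turn a representation value into a logical number (illustrative helper)."""
    # Step 1: missingValues is compared with the representation, before casting.
    if raw in missing_values:
        return None

    # Step 2: cast the representation to a logical value.
    if isinstance(raw, str):
        value = Decimal(raw.replace(field.get("decimalChar", "."), "."))
    else:
        # Already typed data (e.g. a float from a JSON or Excel source).
        value = Decimal(str(raw))

    # Step 3: constraints only look at the logical value.
    maximum = field.get("constraints", {}).get("maximum")
    if maximum is not None and value > Decimal(str(maximum)):
        raise ValueError(f"{value} exceeds maximum {maximum}")
    return value
```

With this sketch, `"9,5"` from a CSV file and `9.5` from a typed source end up as the same logical value, and the constraint never needs to know which representation was used.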
-
@akariv I appreciate your distinctions & definitions here. I think making the distinction between logical and representation values is especially salient re: value labels in categorical / ordinal types (as discussed in #844).

Conceptually, a boolean logical type with true/falseValues is a special case of an ordinal / categorical logical type with value labels (i.e. a binary categorical variable with logical values "true" and "false"). So as we consider the approach & language to adopt here with boolean types in distinguishing logical vs representation values (i.e. label vs underlying representation), I think it would be good to think about how the decisions here generalize to categorical / ordinal types and value labels, so we can keep the approach / language there consistent. Right now the categorical extension defines a mapping between representation and label in the [...]

Another potential point of logical / representation confusion is labels on missing values. If the logical missing label "PARTICIPANT_SKIPPED_ITEM" is represented in the data as -999, should we define missingValues = [-999] or missingValues = ["PARTICIPANT_SKIPPED_ITEM"]?

I'm 100% with you on the idea that making the distinction between logical & representation values from the get-go would have made this issue (and the value labels situation by extension) easy to solve. I think the challenge here is to balance conceptual correctness / internal consistency / historical compatibility. So rather than doing the surgery that would be required across the spec to represent Excel files or other binary formats "correctly", I think it might be easier to just let a limitation of the datapackage format be that it is designed for underlying textual data representations... That way, "representation values" (e.g. [...]

Not ideal, I know (especially for representing floating point values!). But given that we're not allowing breaking changes in V2, it kind of limits our ability for deep surgery...

I'm open to more thoughts / brainstorm though!
-
For some reason this has been limited to strings only, and there's no actual need for that.
There are quite a few cases where boolean values are represented with 0/1 (integers) in the source data.
I propose removing the type restriction here, so that you could specify (as an example):
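The example from the original post is not preserved in this copy of the thread. Purely as a hypothetical illustration of the proposal, the Python sketch below shows a boolean field whose `trueValues` / `falseValues` mix strings and integers, plus a small `cast_boolean` helper that maps representations to logical booleans. The descriptor contents and both helper functions are assumptions made up for this example; whether non-string entries should be allowed is exactly what the thread is debating.

```python
from typing import Any

# Hypothetical descriptor: mixing strings and integers here is the change
# proposed in this thread, not established spec behaviour.
FIELD = {
    "name": "is_active",
    "type": "boolean",
    "trueValues": ["true", "yes", 1],
    "falseValues": ["false", "no", 0],
}


def _matches(raw: Any, candidates: list) -> bool:
    # Compare type as well as value: in Python 1 == True, and the integer 1
    # should not silently match a native boolean or the string "1".
    return any(type(raw) is type(c) and raw == c for c in candidates)


def cast_boolean(raw: Any, field: dict = FIELD) -> bool:
    """Map a representation value to a logical boolean (illustrative helper)."""
    if _matches(raw, field["trueValues"]):
        return True
    if _matches(raw, field["falseValues"]):
        return False
    raise ValueError(f"{raw!r} is not a declared representation of a boolean")
```

The strict type comparison is a design choice of this sketch: it keeps the integer `1` and the string `"1"` as distinct representations, so a source that uses one does not accidentally validate against a descriptor that only declares the other.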