Add pattern supporting use of value labels, categoricals and factors #844
Conversation

This pull request adds a pattern for supporting the use of value labels and categoricals (sometimes called factors), as requested by @rufuspollock here. Comments and feedback are welcome.
Looks great!
The text and spec suggestions are all very clear and well-thought-out.
I looked it over from start to finish and only had a couple of questions/clarifications, but even these were very minor.
Thanks; I'll make the changes to resolve the final issue raised by @peterdesmet above, and then you can merge anytime.
Thanks a lot, everyone! Just let me know when it's ready; it's really exciting to see this collaborative effort happening ❤️
Ok @roll, I've completed all of the suggested changes and I think this is now ready to merge. Thanks!
"type": "integer", | ||
"enum": [1,2,3,4,5] | ||
"enumOrdered": true | ||
"enumLabels": { |
My only hesitation about this proposal is the use of enumLabels instead of meta:enum, which would be consistent with Adobe's implementation of jsonschema2md. I'm inclined to make things match in name when they match in definition (which they do).
Interesting to see it implemented the same. Personally I prefer keeping names consistent within the (Frictionless) schema and referring to concepts from other schemas using e.g. "skos:exactMatch": "jsonschema2md/enums/meta:enum".
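For instance, a field could keep the Frictionless property name while cross-referencing the external concept. A hypothetical sketch only; the field, the labels, and the placement of the mapping are all illustrative:

```json
{
  "name": "rating",
  "type": "integer",
  "constraints": { "enum": [1, 2, 3] },
  "enumLabels": { "1": "Low", "2": "Medium", "3": "High" },
  "skos:exactMatch": "jsonschema2md/enums/meta:enum"
}
```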
I think we need to clarify in the specs (if it's not clarified yet) that namespace:property is the recommended way for metadata enrichment, as in other specs like CSVW.
I don't think that is clarified anywhere yet. I guess that clarification should not be part of this proposal, but it can be recorded as a todo issue?
I created an issue - #845
If the external reference were to a standard or ontology, then I would definitely see the value of using that, but jsonschema2md is a software application, not explicitly intended to serve as a reference. Thus, I'm not sure what the added benefit would be over enumLabels. But please let me know if I'm missing something here; glad to defer to those with greater knowledge.
Thanks! Let's wait a few days for more comments and then merge.
I'm just realizing that the use of …
I must admit that my thinking on this had been restricted to use when specifying statistical models, so while I see your point I can't immediately think of a use case. If you'd like to suggest some text, I'd be glad to add it to the pattern.
@pschumm I have included a suggestion for the use in profiles at #844 (comment). Since I don't have edit rights to this branch, I had to make suggestions via a review. I also suggested some other minor changes and corrections to the text.
Hi all, I'm new to this party but wanted to add a +1 to the direction y'all are going. I recently stumbled on Frictionless and have been considering / wanting to use the standard in some of the social science & education research data collection efforts I'm a part of, but saw the lack of explicit support for value labels as a big barrier. So I'm excited to see the momentum in this thread. Thank you all for your work on this!

I want to add a couple of considerations here from a social science and edu research perspective that I haven't seen mentioned yet (but I'm new & still catching up, so sorry if this has already been discussed!).

Consideration 1: I think one of the big reasons platforms like REDCap, SAS, SPSS et al. prefer data in encoded form is that the numeric values of the ordinal scales are often substantively meaningful, especially for ordinal items designed to be combined into composite measures. In these cases, knowing only the ordering of the item labels does not "provide all of the information necessary" to use the item in practice. For example, say we have a Likert scale whose items are summed into a composite score: the ordering of the labels alone doesn't tell you which value each level contributes to the sum.

This means that on a day-to-day practical level the categorical/factor implementations in pandas & R end up being pretty awkward to use, because it's easy to lose the scale info. So I see social/edu researchers using these features a lot less than one might expect, in favor of keeping all their ordinal values numeric. But then, of course, you get a bunch of magic numbers in your code instead of labels, which isn't great either… Some of this can be helped with the R labelled package, which gives you value labels a la SAS/SPSS/Stata, but most often I see people just making do with numeric types.

In my work, I've found it useful to represent schema definitions of enum levels as triplets: a unique label (str), with an associated value (int) and text (str). For example:
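Something along these lines (the key names here are just how I sketch it, not a proposal for spec syntax):

```json
{
  "name": "agreement",
  "type": "integer",
  "levels": [
    { "label": "disagree", "value": 1, "text": "Strongly disagree" },
    { "label": "neutral",  "value": 2, "text": "Neither agree nor disagree" },
    { "label": "agree",    "value": 3, "text": "Strongly agree" }
  ]
}
```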
This gives me the most flexibility. I can convert to a values representation when I'm calculating a composite measure, and I can use the label representation as a unique human-readable identifier in scripts (e.g. …).

Consideration 2: In a perfect world, I'd want to save / archive / publish ordinal data with their labels rather than numeric values, because labels unambiguously reflect which level they represent even if the CSV becomes divorced from the schema, whereas numeric codes are opaque. How would we represent such a column with label->value codings in the current spec? Something like this?
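For instance (a sketch with made-up levels, using the proposal's enumLabels to map the stored labels to values):

```json
{
  "name": "pain_level",
  "type": "string",
  "constraints": {
    "enum": ["none", "mild", "moderate", "severe"]
  },
  "enumLabels": {
    "none": "0",
    "mild": "1",
    "moderate": "2",
    "severe": "3"
  }
}
```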
I suppose you can tell it's a label->value map because the core item type is string, so you know to expect mappings to values? ...That feels a little implicit, but I guess it works. (This is one of the reasons I like having the ability to explicitly define keys on enum levels like I described above -- the definition of … .)

That said, in practice I find I more often produce CSVs with the numeric values instead of labels (as in the current …).

...Anyway, sorry for the long-winded comment, but I just wanted to chime in with a perspective from the education / social sciences world, let you know I appreciate the movement in this direction, and share some examples of how I would use this extension in my context. Cheers!
Co-authored-by: Peter Desmet <peter.desmet.work@gmail.com>
Thanks very much @peterdesmet for the excellent edits; I have made all of them.
Great to hear from you @khusmann, and let me say at the outset that I think the concerns you raise are exactly on point. FWIW, in my work I use a combination of two strategies to address these: (a) demonstrate ways to avoid such problems by working differently; and (b) make sure that users have simple options to keep doing things exactly the way they've done before (if they don't want to change). I'll try to allude to these in a few specific comments below.
@pschumm Thanks for your thoughtful reply! It sounds like we have very similar approaches & philosophies. In the spirit of sharing different strategies, I want to highlight the similarities and differences of our workflows and needs, with an eye towards future extensions that might subsume our differences. As you say, for a lot of categorical and ordinal data, the underlying numeric codes do not have special significance. In these cases my approach is pretty much identical to yours, except I encode the levels in my CSV as short “labels”, instead of the level text. Then, in my processing scripts I can convert to values as you describe, but don’t have to use the long item text:
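A minimal sketch of that conversion in pandas (the map and column names are invented for illustration):

```python
import pandas as pd

# label -> value map for the item, e.g. pulled from custom schema properties
AGREEMENT_VALUES = {"disagree": 1, "neutral": 2, "agree": 3}

# The CSV stores short level labels rather than numeric codes or full text
df = pd.DataFrame({"item1": ["agree", "disagree", "neutral"]})

# Convert labels to their numeric values when computing composites
df["item1_value"] = df["item1"].map(AGREEMENT_VALUES)
```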
I like working with these "labels" representing levels rather than level text, because it gives me short identifiers that are easier to read & type in scripts than the entire level text. (It gets really useful in filtering, grouping, selecting & otherwise slicing & dicing the data.) And it also avoids the "summing items based on the assumption that they are coded a certain way" problem, as you describe. If I ever need the exact item text, I do a similar transformation:
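Continuing the sketch above (again, the map is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"item1": ["agree", "disagree", "neutral"]})

# label -> text map: the full wording shown to respondents
AGREEMENT_TEXT = {
    "disagree": "Strongly disagree",
    "neutral": "Neither agree nor disagree",
    "agree": "Strongly agree",
}

# Only needed when rendering tables or figures, not in day-to-day analysis
df["item1_text"] = df["item1"].map(AGREEMENT_TEXT)
```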
Like you said, in a lot of cases the label -> value mappings are not meaningful. In these cases, I agree, the best practice should be to avoid putting implementation-specific "suggested" values into the schema. (The label -> text information, on the other hand, I consider part of the item definition, and so I include this map in my custom schema props.)

In some cases, though, I do think label -> value mappings are meaningful, and so I like including a label -> value map in my schema in addition to my standard label -> text map -- for example, for a group of items that can be aggregated to produce scores on scales that have been normed or standardized across a population. In this case, the underlying values are not merely a suggestion; I would argue they become an intrinsic attribute of that item's scale, in the same way that the level text is an intrinsic attribute of the item's display. It's useful having this info in the global schema, because I want all scripts / implementations reading the schema to be sure to translate values using this map in their scoring calculations, and I want these values to be noted in the codebooks I generate with the schema. (Like you, I maintain scripts that convert to SAS / Stata / etc. representations and handle the implementation-specific parts.)

I think the potential meta-pattern here is that, in general, enum levels may have multiple attributes that are relevant for a complete description and use of the data. The abstraction progression I'm seeing is something like this:
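Roughly: (1) bare enum levels, (2) a single extra attribute per level (what enumLabels gives us), (3) arbitrary attributes per level. In sketch form (the third form's "levels" syntax is hypothetical, just to show the shape):

```
{ "constraints": { "enum": ["disagree", "agree"] } }

{ "constraints": { "enum": ["disagree", "agree"] },
  "enumLabels": { "disagree": "Strongly disagree", "agree": "Strongly agree" } }

{ "levels": [
    { "label": "disagree", "value": 1, "text": "Strongly disagree" },
    { "label": "agree",    "value": 2, "text": "Strongly agree" }
] }
```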
The possible uses of extended enum level-attributes are something I'm still chewing on, and I want to think it through a lot more before proposing anything… I'd be curious to hear more of your thoughts / reactions, if you're up for it! In the meantime, the present addition of value labels in this PR is going to go a long way toward representing the data I'm putting together. Thanks again for your work and for this discussion! :)
Thanks @khusmann for sharing this additional information. I think one key issue here involves distinguishing between which metadata belong in a Table Schema and which belong elsewhere. And fortunately, the ability to use custom schema properties without creating incompatibilities with the standard tools gives users (and vendors) a lot of flexibility to make their own choices. I would just add two more comments: …
In sum, the purpose of the specific pattern proposed here is simply to include in the standard Table Schema the minimal information necessary to work effectively with categorical data (primarily from an analytic viewpoint), excluding anything that is software-specific. And in this case, a lot is software-specific, given that the different analytic packages have such different features and functionality. What would then be nice, I think, would be to create a space where those of us implementing and/or working with specific software can share and discuss the features of those software implementations.
Thanks @pschumm for your additional comments. It sounds like we’re actually very much on the same page regarding big-picture direction here. As you say, it looks like the minor differences in perspectives relate to how to parsimoniously (but still flexibly & inclusively) define the separation between what is software specific and what should be natively understood / archived by the schema definition.
I wholeheartedly agree! We're wading into territory beyond this specific PR, and rather than responding to your points above in this thread, it'd be nice to continue this conversation in a space more tailored for it. I'm new to the frictionless scene, so do you know of spaces that already exist that would be a good fit? If not, would you potentially be interested in co-organizing something with me? Issue threads are good, but easy to lose momentum in… It'd be nice to have a meeting every once in a while to discuss bigger-picture ideas (e.g. patterns for handling "scales" across different software implementations) to help strategise / structure collaboration efforts as we build features in these directions on our own projects that may have application to the wider community. …And speaking of collaboration in this direction, I'm happy to put together a PR implementing this extension in the frictionless-r package, if nobody else has started on this yet!
I'm the maintainer of that R package and that sounds excellent @khusmann! We currently have a number of changes lined up for a version 1.1 of the package, which I hope to release before the end of the year. I think implementing the functionality proposed in this PR would be good for a version 1.2, but it could also be considered for 1.1 depending on timing.
I'm starting to implement this extension in …
What does it mean when an enumLabels property exists without an enum constraint? What does it mean when a key exists in enumLabels that is not present in the enum constraint, and vice versa?
If there is an enum constraint or enumLabels is defined, I'm constructing a categorical/factor variable, right? Categorical variables are either unordered or ordered, so if enumOrdered is not defined and I should not interpret this as enumOrdered: false, does this mean I should make them ordered by default (enumOrdered: true)?
In some situations you may want to label specific values but not impose a constraint.
One example where some values in the …
IMO an …
A good question. I must confess, when I was writing this I was thinking about pandas' Categorical, which uses a nullable boolean (with default …).
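To illustrate the pandas behavior in question (a quick sketch, as I understand it):

```python
import pandas as pd

# Without an explicit flag, a Categorical is treated as unordered
cat = pd.Categorical(["Mild", "Severe", "None"],
                     categories=["None", "Mild", "Moderate", "Severe"])
print(cat.ordered)  # False

# An explicitly ordered Categorical supports comparisons and min/max
ordered_cat = pd.Categorical(["Mild", "Severe", "None"],
                             categories=["None", "Mild", "Moderate", "Severe"],
                             ordered=True)
print(ordered_cat < "Severe")  # [ True False  True]
```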
I don't, but I'd defer to the community leaders here to ensure that whatever we do is both helpful and consistent with existing initiatives and workflows. I think that the defining feature of what we're doing here is that it is focused on the use of Frictionless data packages/resources across a broad range of analytic software (e.g., Stata, R, SAS, SPSS, Pandas, Julia). Thus, while one could imagine a section on categoricals or value labels in the documentation for separate plugins for each of these software packages, that wouldn't be very efficient, nor would it permit cross-fertilization. My go-to here would be to start with a dedicated GitHub repository containing files in reStructuredText or Markdown format, rendered via Sphinx and exposed via GitHub Pages. This is both easy to edit and easy to consume. The content would be ideas, tips, etc. for using Frictionless data packages/resources as part of statistical analyses in fields like biomedical research, social science research, etc. (these are intended as examples only, not meant to be exclusive). Once there is a critical mass of content, it can always be moved elsewhere and/or reorganized. But as I said, I'm glad to defer to the wisdom of others on this.
I agree. To summarize: …
Agreed. Perhaps another way of saying this is: "If there is an enum constraint, the default value of enumOrdered SHOULD be false. If there is no enum constraint, the enumOrdered property SHOULD NOT be defined."
Hmm, in that case, when enumLabels is used on a non-enum type (e.g. an integer or numeric), having "enum" in the name may be a little misleading. In this case, they're no longer labels for enum values; they're labels for integer or numeric values. Connecting this back to something you said earlier: …
I can get behind this way of thinking. Along with your examples, it helps clarify how numeric codes (as used by SPSS/Stata et al.) entangle at least two separable concepts: "value labels" and "encoding".

Clear examples of "encoding" would be designated values for missingness (e.g., .a, .b, .c or -97, -98, -99), or a categorical measure (say, 0: MALE, 1: FEMALE) where the numeric levels don't mean anything. In these cases, the value being stored doesn't have any significance to the schema; it's entirely implementation-specific. We only mention such values in the schema for the purpose of translating the stored file, similar to a CSV dialect. By contrast, the two labels anchoring a pain scale from 0-10 are not an "encoding"; they're "value labels", that is, extra metadata attached to already-meaningful values.

Given that these are two separate concerns, what if we split the enumLabels property in two? For example, here's a 0-5 pain scale with two value-label anchors, with Stata-encoded missing reasons:
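(The property names here are placeholders I'm making up; "storageEncoding" is the one discussed below:)

```json
{
  "name": "pain",
  "type": "integer",
  "constraints": { "enum": [0, 1, 2, 3, 4, 5] },
  "enumOrdered": true,
  "valueLabels": {
    "0": "No pain at all",
    "5": "Worst pain imaginable"
  },
  "storageEncoding": {
    ".a": "REFUSED",
    ".b": "DONT_KNOW"
  }
}
```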
And a binary gender scale, stored with SPSS-like encodings (note here how 0: Male and 1: Female are explicitly noted as an encoding and so do not pollute the value labels):
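(Again, a sketch with placeholder property names:)

```json
{
  "name": "gender",
  "type": "string",
  "constraints": { "enum": ["MALE", "FEMALE"] },
  "storageEncoding": {
    "0": "MALE",
    "1": "FEMALE"
  }
}
```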
And an integer (non-enum) scale with two value labels:
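(Same caveat on names; the field and anchors are made up for illustration:)

```json
{
  "name": "hours_slept",
  "type": "integer",
  "constraints": { "minimum": 0, "maximum": 24 },
  "valueLabels": {
    "0": "Did not sleep at all",
    "24": "Slept the entire day"
  }
}
```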
The advantage of an approach like this is that by decomposing the two roles that enumLabels currently plays, … The disadvantage would be, of course, a little more complexity. [Full disclosure: one of the projects I'm working on, and hope to release soon, is an "interactive codebook" viewer for Frictionless packages that renders histograms, etc. based on type / schema information… so that's part of my motivation for bringing this up.]
I agree.
Same here; I'm interested in doing whatever is most helpful, and I defer to whatever the community leaders think is best. And I agree with your characterization of what we're doing!
I definitely see your argument, but I think it's just a matter of balancing the conceptual advantages against the additional complexity. For example, one might argue that the …

One quick comment about nomenclature: the name "storageEncoding" is almost synonymous with file encoding (e.g., UTF-8), so I don't think that would work. The chance of misinterpretation or confusion is too high.

If you look at the original version of this PR, you'll see that my own thinking on this has evolved a bit thanks to comments by @peterdesmet (I started with something a bit more complex), and I've come to rather like the result. So the only thing I know to do is to offer my own best endorsement/explanation of the current proposal, with the understanding that I'm glad to defer to the Frictionless design/development team if they want to make modifications.

The property enumLabels …

1. Datasets written to a CSV file in "encoded" form by commonly used software such as Stata, SAS, SPSS, REDCap, etc.

More generally, enumLabels …

Finally, it is understood that for certain software and/or specific purposes, there may be a desire for additional metadata in conjunction with discrete/categorical fields (e.g., additional value label mappings). Since these are by definition special-purpose, they can be freely included in additional properties in the schema, in supplementary files in the data package (e.g., a JSON/YAML file or another tabular resource linked via a foreign key), or in an …

That's the best I can do based on my current thinking, but as I said, I'm glad to defer to the Frictionless team if they want to make changes.
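For concreteness, a field using the pattern's properties would look roughly like this (filling out the fragment quoted earlier in the thread; the label text is illustrative):

```json
{
  "name": "physical_health",
  "type": "integer",
  "constraints": { "enum": [1, 2, 3, 4, 5] },
  "enumOrdered": true,
  "enumLabels": {
    "1": "Poor",
    "2": "Fair",
    "3": "Good",
    "4": "Very good",
    "5": "Excellent"
  }
}
```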
Agreed. Another dimension we're balancing is backwards compatibility, which I think is one of the advantages of the current proposal as it stands.
Exactly. This is part of the larger conversation I'm interested in continuing (somewhere outside of this PR). The present software landscape has representations that support some concepts but not others, and additionally has features that conflate separable concepts. It creates quite a tangle. I'd really love to continue working on this with folks like yourself from different substantive & implementation backgrounds to build a map of these concepts (what are the separable, higher-order types / constructs?) and the ways they fit together (what is preserved / lost when translating between implementations / representations?). The goal here, of course, is to maximize "cross-pollination" of data across implementations & representations, as you put it.

In addition to enums, I think the way missingness and its associated metadata is represented (in general) is also a big piece of the puzzle. I notice you had a field-specific "missingValues" property in your original proposal, which reinforces to me that we're thinking along similar lines (thanks for pointing me there, by the way; I didn't realize so much had changed from the earlier revisions). I have some ideas regarding representations of missingness across implementations, but yeah, all this is a conversation for a different context, and I defer to community leaders for the best way to facilitate it.

In the meantime, sorry I've taken up so much space in this PR; I didn't mean to delay this from going through or to open a can of worms! I really appreciate your patience with me @pschumm as I've been getting up to speed. I hope I can repay some of your time by helping to put together PRs that implement this extension in different libraries.
Totally agree. That name was just what I came up with on the spur of the moment. Maybe …
Me too! I realize this is a careful compromise across a lot of dimensions, and that we should not let conceptual perfection become the enemy of the practical "good enough". Thanks again for the engaging discussion and for all your work spearheading this proposal @pschumm! Like I said at the beginning, better enum support in Frictionless is key for its adoption in a lot of the groups I'm a part of. I'll get back to work on a potential implementation of this extension in …
PS: I don't have edit access, so I'm not sure where else to put this minor edit, but I think this example is missing a …
Indeed—thanks! I've just fixed these.
Exactly. Most packages now have importers or exporters to/from other formats (e.g., SAS can export a Stata file, and Stata can import files from SAS or SPSS), but those don't address the need to create open, software-agnostic data resources that are broadly and easily accessible (i.e., frictionless).
I agree, which is why I've been trying to respond. And I agree that the way missingness is handled is also an important part of facilitating use of Frictionless among various fields and with specific software. I decided to drop it from this pattern, partly in response to discussions here and elsewhere, to better maintain separation of concerns and to keep things moving along.
No need for apologies. I'm committed to helping to make use of Frictionless seamless for health and social scientists (including the data repositories that they use), and for statistical analysis across the widest possible range of software. It's terrific to meet others with similar objectives who are willing to work together.
Thanks a lot @pschumm! Dear all, are we ready to merge? It looks really good to me.