Update all labels according to a standard convention (replaces #20) #227

Closed
DanCarey404 opened this issue Apr 22, 2020 · 34 comments · Fixed by #428
Assignees
Labels
  • impact: minor: New, backward-compatible functionality (does not change inferences; e.g., adding a term)
  • priority: should have: Medium priority feature or bug fix
  • status: implementation specified: Implementation has been specified. A developer should be assigned.

Comments

@DanCarey404
Contributor

Add/replace rdfs:label values according to the agreed-to standard.

DanCarey404 self-assigned this Apr 22, 2020
DanCarey404 added the "impact: minor" and "priority: should have" labels Apr 22, 2020
@rjyounes
Collaborator

@DanCarey404 Can you please add a summary of the agreed-upon standard so that it is documented here? Then can #20 be closed?

@rjyounes
Collaborator

Replaces #20

@DanCarey404
Contributor Author

Per Rebecca's request, these are the labeling standards being implemented.

Classes

  • Sentence case
  • Normalized to natural language standards. E.g., hyphens inserted, acronyms in all caps, etc.
    • Examples: AMA guideline, ISBN-10

Properties

  • Same as classes, but initial lowercase
  • Examples: has unit of measure, has SSN.

@rjyounes
Collaborator

Just to clarify: this was not my request: it was my proposal, which the group discussed and agreed on. :)

rjyounes added the "status: implementation specified" and "status: triaged" labels May 28, 2020
@rjyounes
Collaborator

rjyounes commented Dec 4, 2020

This task needs an assignee.

@uscholdm
Contributor

uscholdm commented Dec 4, 2020

@sa-bpelakh Boris has a query for this. Can you pop it into this issue, for convenience?

@sa-bpelakh
Collaborator

What I have are SHACL rules that validate that labels are conformant: https://github.com/semanticarts/platts-ontology/blob/develop/shapes/ontologyShapes.ttl. They enforce the policy described above, and even detect acronyms in all caps (minimum of 2 letters, I believe) and ignore their casing. The current version does not allow for numbers in class or property names, so if that's a requirement, we'll have to make some changes.

@rjyounes
Collaborator

rjyounes commented Dec 6, 2020

I think we should allow numbers; e.g., hypothetically we could define classes like Shimano105Components, Iso639 (subclass of Category), TourDeFrance2020Racers, CharactersIn1984, ...

@sa-bpelakh
Collaborator

I can see that for a specific domain, but for the base gist? We can set up a fun game of regex golf for labels.

@rjyounes
Collaborator

rjyounes commented Dec 9, 2020

Maybe it's less likely in an upper ontology, but why exclude it in principle?

Re regex golf - that looks fun! Maybe at our next happy hour?

@uscholdm
Contributor

uscholdm commented Dec 9, 2020

Maybe it's less likely in an upper ontology, but why exclude it in principle?

We get to choose our own stylistic conventions, as does each client project. I don't think we want gist to have numbers in IRIs, and a rule for this would find a one that looks exactly like a lowercase el. So I vote to put it in our gist checks as a warning.

@rjyounes
Collaborator

rjyounes commented Dec 9, 2020

If we disallow numbers in local names, and we happen to come across the need for one, we are then forced to spell the number out, which I think is worse. What's wrong with numbers in IRIs?

A reminder that this issue is the implementation of a set of conventions that had already been decided on and documented in the gist style guide. The point is not to revisit the decisions here. Quoting from the style guide that we had agreed on:

  • Alphanumeric characters only.
    • Example: Isbn10, not Isbn-10 or ISBN-10.

This issue surfaced because I want to find a new assignee, since we agreed on the implementation back in April and have been postponing it since then.

@rjyounes
Collaborator

rjyounes commented Dec 9, 2020

@sa-bpelakh Platts and gist should be able to have different naming conventions. Is it onto_tool that applies the SHACL rules? If so, the SHACL shapes or files to invoke (or a folder containing them) should be configured in the YAML file, or stored in a particular directory, or something.

@uscholdm
Contributor

uscholdm commented Dec 9, 2020

It's true that this issue should not get into what the style conventions are. That can be debated in a separate issue if anyone cares enough to raise it.

@sa-bpelakh
Collaborator

@rjyounes Yes, the bundle file configures which shapes to apply. So we can configure whatever we consider appropriate for gist, and customers can, um, customize 😄 whichever way they want.

@marksem
Collaborator

marksem commented Dec 10, 2020

Team has disagreement on the naming convention as of 12/10/2020 issues meeting. @DanCarey404 will poll SA ontologists. While there IS consensus to follow the standard, there is not consensus ON the standard.

(Detail: Some want Title Case for classes, not sentence case. Some want Title Case for all concepts. Rationale: all are concepts, and labeling for particular use cases (like sentence generation vs. column headings) won't always work.)

PS @sa-bpelakh will modularize the SHACL checking to make it easy to apply different conventions based on where the standard lands.

sa-bpelakh self-assigned this Dec 10, 2020
@rjyounes
Collaborator

rjyounes commented Dec 14, 2020

@marksem @DanCarey404 Can we please move discussion of this issue to a gist review meeting and record notes here? Our goal is to be transparent, and decisions made by internal polling are not. In addition, there needs to be a rationale for reopening a decision that was made months ago. We cannot rethink every issue for those who did not attend the discussion. If someone who is unable to attend wants to provide input, that can be indicated here and we can accommodate them by scheduling a special meeting if needed.

@rjyounes
Collaborator

rjyounes commented Dec 14, 2020

My input is based on earlier decisions now recorded in the gist style guide:

Classes

  • Sentence case
  • Normalized to natural language standards. E.g., hyphens inserted, acronyms in all caps, etc.
    • Examples: AMA guideline, ISBN-10

Properties

  • Same as classes, but initial lowercase
  • Examples: has unit of measure, has SSN.

Rationale

We adopt sentence over title case because the latter, while technically well-defined, has more complex rules and can introduce inconsistencies when implemented by different users.


Additional notes:

  • Sentence case vs title case: I hold by the decision made earlier: We adopt sentence over title case because the latter, while technically well-defined, has more complex rules and can introduce inconsistencies when implemented by different users.
  • Lower case for all properties, object and datatype
  • Acronyms in labels: since I believe that labels (as opposed to local names) should be in natural language form, acronyms should be spelled as they normally are. I will note that UoM is not an actual English-language acronym and therefore is not a good test case. We should also be careful about when an acronym is a prefLabel and when an altLabel: there are cases where the acronym is the most common term (e.g., "CIA", "FBI") and therefore it should be the prefLabel and the fully-spelled out version should be the altLabel, but there are also cases of the reverse (e.g., "Electronic Arts" not "EA").
  • Labels are meant to be in natural language, not camelcase etc. Therefore, hyphens are appropriate where they are used in natural language (in this case English) but not otherwise.

@uscholdm
Contributor

I find @rjyounes's arguments and rationale compelling. If anyone wants to use labels for column headers, they can introduce a subproperty of altLabel called, say, titleCaseLabel.

rjyounes added the "status: under review" label and removed the "status: implementation specified" label Dec 14, 2020
@rjyounes
Collaborator

I didn't realize that one of the issues at stake in the renewed discussion was the use of labels as column headers. IMO that makes the case even stronger: it's hard to justify considering the preferred label as one designed for column headings or any other implementation-specific use. We have actually had this discussion during review of #20, where we reached the same conclusion as in @uscholdm's suggestion above, to define additional annotations for application-specific needs. In the case of column headers, they are (or could be) the same as the local names, so one could parse the IRIs to derive the local names for use as column headers and not maintain the values in an annotation.

@DanCarey404
Contributor Author

I suggest that all words in a label have a leading capital. One reason for this suggestion is that Notepad++ has a convert case option (Proper Case) which does that, as does MS Word (Capitalize Each Word). This removes ambiguity from the rule and ensures the consistency that some are looking for.

@rjyounes
Collaborator

rjyounes commented Jan 15, 2021

@DanCarey404 Are you suggesting that even function words (prepositions, articles, etc) would be capitalized? That's not a type of casing I've ever heard of, other than the applications you mention.

@rjyounes
Collaborator

One reason for using initial lower for properties: we use labels that are tied to the local names, and should preferably be derivable from them by some simple rules, such as adding whitespace at word boundaries indicated by camel-casing. Since our properties have local names with initial lowercase, this suggests the labels should follow suit.
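The "simple rules" mentioned above could look something like the following sketch. It is only an illustration, not the project's script: the function names are hypothetical, and it assumes local names use plain camel case with no embedded all-caps acronyms.

```python
import re


def words_from_local_name(local_name: str) -> list[str]:
    """Split a camel-case local name at lowercase/digit -> uppercase boundaries."""
    # Insert a space wherever a lowercase letter or digit is followed by an
    # uppercase letter, then split on whitespace.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local_name).split()


def property_label(local_name: str) -> str:
    """Derive an all-lowercase property label, e.g. hasGiver -> 'has giver'."""
    return " ".join(word.lower() for word in words_from_local_name(local_name))


def class_words(local_name: str) -> str:
    """Derive a space-separated class label, keeping each word's capital."""
    return " ".join(words_from_local_name(local_name))
```

With these assumptions, `property_label("hasUnitOfMeasure")` gives `has unit of measure` and `class_words("TemporalRelation")` gives `Temporal Relation`. Acronyms and hyphens are not recovered by this rule.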

@rjyounes
Collaborator

rjyounes commented Jan 15, 2021

These are the logical options for classes and properties:

  1. Title case for all: Temporal Relation, Has Giver, Identified By (in title case, prepositions at the end of a phrase receive stress and are in upper case)
  2. Title case for classes, lower case for properties: Temporal Relation, has giver
  3. Sentence case for all: Temporal relation, Has giver
  4. Sentence case for classes, lower case for properties: Temporal relation, has giver.
  5. Same as local name: TemporalRelation, hasGiver
  6. Lower case for all: This has not been mentioned and I doubt if anyone wants it; we can probably rule it out.
  7. Every word upper case: Has Unit Of Measure

Note: 2-4 make exceptions for acronyms and terms that are generally capitalized: Social Security Number, has SSN, has Social Security Number.

I would reject 5 because a label is meant for humans and thus should be in natural language.

We haven't mentioned taxonomy terms. Logical options for taxonomy terms:

  1. Title case
  2. Sentence case
  3. Lower case

Review of conventions used by well-known ontologies:

SKOS: Concept Scheme, exact match (2)
PROV: SoftwareAgent, atLocation (5)
FOAF: Online Account, based near (2)
OAI-ORE: Aggregated Resource, Is Aggregated By (1)
OWL Time: Duration description, has beginning (4)
BIBFRAME (Library of Congress): Key title, Has event content (3)
dcterms: Method of Accrual, Date Modified (1)
Schema: Ignore Action, Accepted Offer (1)
Lingvo: Language resource, resource type (4)
Open Annotation: TextPositionSelector, hasBody (5)
Ordered List Ontology: Ordered List, has ordered list (2)

Conclusion: There are no generally accepted conventions; we should choose whichever one we like best.

Note on title case:
There is no one standard for title case: see https://en.wikipedia.org/wiki/Title_case. Chicago Manual of Style, Associated Press, etc. each define their own, though of course the broad convention is common to all. If we adopt title case, I propose that we choose one of these standard variants (or invent our own) and document it in the gist style guide as a reference for ontology developers and reviewers.

I also propose that labels conform to natural language standards by the insertion of, for example, hyphens, even if our standards for local names do not include such characters. E.g., ISBN-10 for class Isbn10.

@rjyounes
Collaborator

rjyounes commented Jan 15, 2021

Notes from 2021-01-14 triage meeting:

Dave: When do we see labels?

  • Graphics
  • Forms

Which would you rather see in these contexts?

Rebecca: we also see them in documentation (e.g., Widoco)

Peter: accuracy more important than typographic consistency

Will vote next meeting.

@uscholdm
Contributor

Thank you @rjyounes for comprehensive summary.

Conclusion: There are no generally accepted conventions; we should choose whichever one we like best.

Exactly.

We haven't mentioned taxonomy terms.

Most taxonomy terms are instances of gist:Category, which is a lot like a class, semantically. The key technical difference is that we use gist:categorizedBy instead of rdf:type to indicate what kind of thing something is. So we may want to adopt the same convention for taxonomy terms as we do for classes.

@rjyounes
Collaborator

rjyounes commented Jan 28, 2021

These are the logical options for class and property labels:

  1. Title case for all: Temporal Relation, Has Giver, Identified By (in title case, prepositions at the end of a phrase receive stress and are in upper case)
  2. Title case for classes, lower case for properties: Temporal Relation, has giver
  3. Sentence case for all: Temporal relation, Has giver
  4. Sentence case for classes, lower case for properties: Temporal relation, has giver
  5. Same as local name: TemporalRelation, hasGiver
  6. Lower case for all: This has not been mentioned and I doubt if anyone wants it; we can probably rule it out.
  7. Every word upper case: Has Unit Of Measure

Offline voting yields #2 as the winner.

Rebecca will compile a short list of title case conventions for consideration at next meeting. The selected convention will be included in the gist style guide.

rjyounes assigned rjyounes and unassigned sa-bpelakh Jan 28, 2021
semanticarts deleted a comment from uscholdm Jan 29, 2021
sa-bpelakh linked a pull request Feb 1, 2021 that will close this issue
@rjyounes
Collaborator

rjyounes commented Feb 2, 2021

I've sorted through a number of style guides from reputable sources (AP, APA, Chicago Manual of Style, MLA, NYT, Wikipedia). The details are included in the attached document as I think they will not be of general interest. I've come up with an amalgam of various conventions that is also computable (e.g., a rule to capitalize nouns, verbs, adjectives, adverbs, and pronouns, or to lowercase prepositions unless stressed, is not computable), as follows:

  1. Capitalize:
    a. First and last words
    b. Words of four or more letters
    c. Second part of hyphenated word (e.g., Data-Centric, not Data-centric)
  2. Lowercase:
    a. Articles: a, an, the
    b. Conjunctions: and, but, if, for, or, nor, so, yet
    c. Prepositions: as, at, by, cum, ere, for, in, of, off, on, out, per, pre, pro, qua, re, sub, to, up, via
  3. Capitalize everything else
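
To show that the rules above are computable, here is a minimal sketch of them in Python. It is not the script from the linked pull request. The names `LOWERCASE_WORDS` and `title_case` are hypothetical, and the sketch assumes that capitalizing a word means uppercasing only its first letter, so that existing capitals such as acronyms are left alone.

```python
# Words that stay lowercase unless another rule capitalizes them (rule 2).
LOWERCASE_WORDS = {
    "a", "an", "the",                                    # articles
    "and", "but", "if", "for", "or", "nor", "so", "yet",  # conjunctions
    "as", "at", "by", "cum", "ere", "in", "of", "off", "on", "out",
    "per", "pre", "pro", "qua", "re", "sub", "to", "up", "via",  # prepositions
}


def _capitalize(word: str) -> str:
    """Uppercase the first character only; leave the rest untouched."""
    return word[:1].upper() + word[1:]


def title_case(label: str) -> str:
    """Apply the proposed title case rules to a space-separated label."""
    words = label.split()
    result = []
    for index, word in enumerate(words):
        first_or_last = index == 0 or index == len(words) - 1
        # Rule 1a/1b take precedence over rule 2; rule 3 covers the rest.
        if first_or_last or len(word) >= 4 or word.lower() not in LOWERCASE_WORDS:
            # Rule 1c: capitalize every part of a hyphenated word.
            result.append("-".join(_capitalize(part) for part in word.split("-")))
        else:
            result.append(word.lower())
    return " ".join(result)
```

For example, `title_case("unit of measure")` yields `Unit of Measure`, and `title_case("identified by")` yields `Identified By` because the last word is always capitalized.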

Attachment: Title Case Conventions.pdf

@rjyounes
Collaborator

rjyounes commented Feb 2, 2021

Regarding automated conversion of local names to labels: there's an issue in the conversion of acronyms and hyphenated words. There are two possible local name conventions:

  1. Represent as in natural language - generally all uppercase - e.g., hasSSN
  2. Represent in camel case - e.g., hasSsn. The argument is that word boundaries can be easily detected. isCiaAgent allows word boundary detection, while isCIAAgent does not. Even for human users, the word boundary is easier to see in the former.

However, labels should include natural language formats: is CIA agent, not is Cia agent. The correct version cannot be algorithmically computed from either local name.

The same may be true of hyphenated words, depending on the local name convention. ISBN-10 can be automatically computed from ISBN-10 but not from Isbn-10, ISBN10, or Isbn10.

In fact, in general it is easier to derive the local name from the label than vice versa.
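
A small sketch of the label-to-local-name direction, under the alphanumeric-only camel-case convention quoted earlier (Isbn10, hasSsn). The function name and the `is_property` flag are hypothetical, introduced only for illustration.

```python
import re


def local_name_from_label(label: str, is_property: bool) -> str:
    """Build a camel-case local name from a natural-language label."""
    # Keep only alphanumeric runs, so hyphens and spaces become word breaks.
    words = re.findall(r"[A-Za-z0-9]+", label)
    # Capitalize each word and lowercase its remainder (CIA -> Cia).
    name = "".join(word[0].upper() + word[1:].lower() for word in words)
    if is_property and name:
        # Property local names start with a lowercase letter.
        name = name[0].lower() + name[1:]
    return name
```

Here `local_name_from_label("is CIA agent", True)` returns `isCiaAgent` and `local_name_from_label("ISBN-10", False)` returns `Isbn10`. The conversion discards the acronym casing and the hyphen, which is exactly the information the reverse direction cannot recover.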

If we want to stick to our proposed local name conventions, we will use the forms hasSsn, isCiaAgent, and Isbn10. These require human correction once the automated label generator has been applied. If the latter runs before every release, we would need human intervention each time. Another option: add a skos:editorialNote indicating to the generator that the label should not be touched.

@uscholdm
Contributor

uscholdm commented Feb 2, 2021

In fact, in general it is easier to derive the local name from the label than vice versa.

Interesting observation, it usually goes the other way, but this sounds correct.

The argument is that word boundaries can be easily detected. isCiaAgent allows word boundary detection, while isCIAAgent does not. Even for human users, the word boundary is easier to see in the latter.

I think it is easier to see the boundary in the former: isCiaAgent . Was that a typo?

@rjyounes
Collaborator

rjyounes commented Feb 2, 2021

Yes, that's an error. I've fixed it above.

@rjyounes
Collaborator

Title case proposal above accepted for implementation.

rjyounes assigned sa-bpelakh and unassigned rjyounes Feb 11, 2021
rjyounes added the "status: implementation specified" label and removed the "status: under review" label Feb 11, 2021
@rjyounes
Collaborator

Boris will fix all labels, first by automation and then manual adjustment for exceptions.

@rjyounes
Collaborator

rjyounes commented Feb 12, 2021

In writing the label validation script (see PR #428), Boris noted that proper nouns in labels must also retain capitalization. An emended version of the label conventions follows:

Title Case Convention

  1. Capitalize:
    a. First and last words
    b. Words of four or more letters
    c. Second part of hyphenated word (e.g., Data-Centric, not Data-centric)
  2. Lowercase:
    a. Articles: a, an, the
    b. Conjunctions: and, but, if, for, or, nor, so, yet
    c. Prepositions: as, at, by, cum, ere, for, in, of, off, on, out, per, pre, pro, qua, re, sub, to, up, via
  3. Capitalize everything else

Label Conventions
Classes: title case (as above)
Properties: all lowercase

The following exceptions apply to both class and property labels:

  • Acronyms and proper nouns are kept intact (e.g., has SSN, unit symbol Unicode, ISBN-10)
  • Numbers are allowed (e.g., ISBN-10)
  • Hyphens are allowed (e.g., ISBN-10)

The exception for proper nouns makes the convention not fully automatable.

The implementation of these conventions in current labels will be done by Boris using a script with manual corrections (for the non-automatable exceptions). To support label validation as part of bundling the ontology for release, we will add an additional ontology file with an annotation signaling to the validation script that the label is not subject to the validation rules. We propose gist:nonConformingLabel for the annotation. See additional notes in PR #428.
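
As a rough illustration of how such an exemption could interact with validation, here is a sketch for property labels only. It is not the script from PR #428. It assumes the annotated terms are available as a plain set of IRIs, treats any run of two or more capitals as an acronym, and uses hypothetical function names and example IRIs.

```python
import re

# A token is either lowercase letters/digits or an all-caps acronym (2+ letters).
TOKEN = re.compile(r"[a-z0-9]+|[A-Z]{2,}")


def property_label_conforms(label: str) -> bool:
    """Check the 'all lowercase, acronyms intact, numbers and hyphens allowed' rule."""
    tokens = re.split(r"[ -]", label)
    return all(TOKEN.fullmatch(token) for token in tokens)


def find_violations(labels: dict[str, str], exempt: set[str]) -> list[str]:
    """Return IRIs whose labels fail the check and are not marked as exempt."""
    return sorted(
        iri
        for iri, label in labels.items()
        if iri not in exempt and not property_label_conforms(label)
    )


# Illustrative data only; the IRIs are made up for this example.
labels = {
    "ex:hasSsn": "has SSN",
    "ex:unitSymbolUnicode": "unit symbol Unicode",
    "ex:hasGiver": "Has giver",
}
```

Without exemptions, both `ex:hasGiver` and `ex:unitSymbolUnicode` are reported, the latter because a proper noun such as Unicode cannot be told apart from a casing error. Passing `{"ex:unitSymbolUnicode"}` as `exempt` leaves only the genuine violation.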

Any objections to the annotation name should be voiced here.

5 participants