Copyediting on mapping-predicates text #204

Merged 1 commit on Jun 18, 2022
12 changes: 6 additions & 6 deletions src/docs/mapping-predicates.md
@@ -1,6 +1,6 @@
# How to pick the right mapping predicates

A mapping predicate such as skos:exactMatch species the semantics of the mapping relation - in other words, it defines how a computer (and human!) should interpret the mapping when it is being used. For example, a computer program may be allowed to merge nodes in a knowledge graph _only when they are `skos:exactMatch`_, but not when they are, say, `skos:closeMatch`.
A mapping predicate such as skos:exactMatch specifies the semantics of the mapping relation - in other words, it defines how a computer (and human!) should interpret the mapping when it is being used. For example, a computer program may be allowed to merge nodes in a knowledge graph _only when they are `skos:exactMatch`_, but not when they are, say, `skos:closeMatch`.

Picking the right predicate to specify the meaning of your mapping is often a difficult process. The following guide should help you to understand the most widely used mapping predicates and when they are appropriate.

@@ -30,14 +30,14 @@ There are at least three things you need to decide before selecting an appropria

### What is the **precision** of the mapping?

As a curator, you should try to investigate the **intended meaning** of both the subject and the object. This task usually involves trying to find out as much as possible about the mapped identifiers: What is their human readable definition? Are their any logical axioms that could help understanding the intended meaning? Sometimes, this even involves asking the respective stewards of the database or ontology for clarification. **Important:** The key here is "intended meaning", i.e. when you see `FOODON:Apple` (FOODON is an ontology), you do not try to figure out _what an apple is_, but what thing in the world (in your conceptual model of the world) the FOODON developers _intended the `FOODON:Apple` identifier to refer to_. This may be an Apple that you can eat. Or a [cultivar](https://en.wikipedia.org/wiki/List_of_apple_cultivars)!
As a curator, you should try to investigate the **intended meaning** of both the subject and the object. This task usually involves trying to find out as much as possible about the mapped identifiers: What is their human-readable definition? Are there any logical axioms that could help with understanding the intended meaning? Sometimes, this even involves asking the respective stewards of the database or ontology for clarification. **Important:** The key here is "intended meaning", i.e. when you see `FOODON:Apple` (FOODON is an ontology), you do not try to figure out _what an apple is_, but what thing in the world (in your conceptual model of the world) the FOODON developers _intended the `FOODON:Apple` identifier to refer to_. This may be an apple that you can eat. Or a [cultivar](https://en.wikipedia.org/wiki/List_of_apple_cultivars)!

The **precision** is simply: is the mapping `exact`, `close`, `broad`, `narrow` or `related`? Here is a basic guide to how to think of each:

- `exact`: The two terms are intended to refer to the same thing. For example, both the subject and the object identifiers refer to the concept of [Gala cultivar](https://en.wikipedia.org/wiki/Gala_(apple)).
- `close`: The two terms are intended to refer to roughly the same thing, but not quite. This is a hazy category and should be avoided in practice, because when taken too literally, most mappings could be interpreted as close mappings. This is not the point of creating mappings, if their intention is to be useful (see "use case" considerations later in this document). An example of a `close` mapping is one between the "heart" concept in a database of anatomical entities for biological research on chimpanzees and the "human heart" in an electronic health record for humans.
- `broad`: The object refers to a broader term than the subject. For example, "human heart" in an electronic health record is a `broad` match to "heart" in a general anatomy ontology that covers all species, such as Uberon. Another example is "Gala (cultivar)" in one ontology or database mapped to "Apple (cultivar)" in another: "Apple (cultivar)" has a broader meaning than "Gala (cultivar)". For a good mapping, it is advisable that "broad" and "narrow" are applied a bit more strictly than is technically permitted by the SKOS specification: both the subject and the object should belong to the same **category**. For example, you should use broad (or narrow) only if both the subject and the object are "cultivars" (in the above example).
- `narrow`: The subject refers to a broader term. For example "Apple (cultivar)" is a narrow match to "Gala (cultivar)". Think of it as the opposite of "broad". `broad` and `narrow` are so called inverse categories: If "Gala (cultivar)" is a `broad` match to "Apple (cultivar)", then "Apple (cultivar)" is a `narrow` match to "Gala (cultivar)"! One **note of caution**: `narrow` matches generally have less useful applications then `broad` ones. For example, if we want to _group_ subject entities in a database under an ontology to make them query-able in a knowledge graph, only `broad` matches to the ontology can be retrieved. For example, if we map "Gala (cultivar)" in a database to "Apple (cultivar)" in an ontology, and we wish to write a semantic query to obtain all records that are about "Apple (cultivar)" according to the ontology, we obtain "Gala (cultivar)". This is not true the other way around: if the ontology term is _more_ specific then the database time, it cant be use the group the database data.
- `narrow`: The subject refers to a broader term than the object. For example, "Apple (cultivar)" is a `narrow` match to "Gala (cultivar)". Think of it as the opposite of "broad". `broad` and `narrow` are so-called inverse categories: If "Gala (cultivar)" is a `broad` match to "Apple (cultivar)", then "Apple (cultivar)" is a `narrow` match to "Gala (cultivar)"! One **note of caution**: `narrow` matches generally have fewer useful applications than `broad` ones. For example, if we want to _group_ subject entities in a database under an ontology to make them query-able in a knowledge graph, only `broad` matches to the ontology can be retrieved. If we map "Gala (cultivar)" in a database to "Apple (cultivar)" in an ontology, and we wish to write a semantic query to obtain all records that are about "Apple (cultivar)" according to the ontology, we obtain "Gala (cultivar)". This is not true the other way around: if the ontology term is _more_ specific than the database term, it can't be used to group the database data.
- `related`: The subject refers to an analogous concept of a different category. For example, "Apple" and "Apple tree" are considered `related` matches, but not `exact` matches, as "Apple" is of the "fruit" category, and "Apple tree" of the "tree" category. Other examples include: "disease" and "phenotype", "chemical" and "chemical exposure", "car" and "car manufacturing process". In general, `related` mappings should be reserved for "direct analogues". For example, we should not try to map to `related` and `broad` categories at the same time, such as "Gala (cultivar)" to "Apple tree". This leads to a proliferation of very "low value" mappings (see use case section later).
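
The inverse relationship between `broad` and `narrow`, and the grouping behaviour described in the note of caution, can be illustrated with a minimal Python sketch. It assumes mappings are stored as plain `(subject, precision, object)` tuples; the `DB:` and `ONT:` identifiers and the helper names `invert` and `group_under` are hypothetical and not part of any library.

```python
# Illustrative sketch only: identifiers and helper names are hypothetical.
# A mapping is a (subject, precision, object) tuple.

# `broad` and `narrow` are inverses of each other; the other three
# precision levels read the same in both directions.
INVERSE = {
    "exact": "exact",
    "close": "close",
    "related": "related",
    "broad": "narrow",
    "narrow": "broad",
}


def invert(mapping):
    """Return the same mapping read from object to subject."""
    subject, precision, obj = mapping
    return (obj, INVERSE[precision], subject)


def group_under(ontology_term, mappings):
    """Return the subjects that can be grouped under `ontology_term`.

    Only `exact` and `broad` matches qualify, because only then is the
    ontology term at least as general as the subject.
    """
    return sorted(
        subject
        for subject, precision, obj in mappings
        if obj == ontology_term and precision in {"exact", "broad"}
    )


mappings = [
    ("DB:gala", "broad", "ONT:apple_cultivar"),
    ("DB:fuji", "broad", "ONT:apple_cultivar"),
    ("DB:fruit_cultivar", "narrow", "ONT:apple_cultivar"),
    ("DB:apple_tree", "related", "ONT:apple_cultivar"),
]

inverted = invert(mappings[0])
grouped = group_under("ONT:apple_cultivar", mappings)
```

Here `grouped` contains only the two cultivars mapped with `broad`; the `narrow` and `related` subjects are left out, which is why `narrow` matches are of limited use for grouping.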

<a id="noise"></a>
@@ -49,7 +49,7 @@ Depending on what you want to do with your mappings, different quality levels ar
While reading through this section, you should keep one thing in mind: It is _never_ a good idea to think about mappings as "correct" or "wrong". Even the exact same identifier (for example in Wikidata, or even the biomedical data domain) can mean something very different depending on which database uses it, or in which part of which data model (or value set) it is used. Mapping should therefore be perceived as an inexact art where the goal is not "correctness" but "fitness for purpose": can the mappings deliver the use case I am interested in? In the following, we will take a closer look at the varying levels of noise you may need to weigh against each other.

- "zero-noise". Some mappings directly inform decision processes of downstream consumers, such as clinical decision support or manufacturing. For example, in an electronic health record (EHR) system we may want to know what the latest recommended drugs (or contra-indications) for a condition are, and the disease-drug relationships may be curated using one terminology such as [OMOP](https://ohdsi.org/omop), and the EHR may be represented using [ICD10-CM](https://icd.codes/icd10cm) (a clinical terminology used widely by hospitals). In these cases, noise should be zero or close to zero, as patient lives depend on the correctness of these mappings.
- "low-noise". Most mappings are used to augment/inform processes that are a bit upstream of the final consumer. For example, mappings are used to group data for analysis or make it easier to find related data during search (enhancing search indexing semantically). The final consumer does not immediately "see" the mappings, but they just see the consequences of applying the mappings. In these cases, a bit of noise may be acceptable, i.e. some mappings that are "not quite right". Practically, these os very often the case where data sources are aligned automatically to enable searches across, so a few bad mappings are better than having none.
- "low-noise". Most mappings are used to augment/inform processes that are a bit upstream of the final consumer. For example, mappings are used to group data for analysis or make it easier to find related data during search (enhancing search indexing semantically). The final consumer does not immediately "see" the mappings; they just see the consequences of applying the mappings. In these cases, a bit of noise may be acceptable, i.e. some mappings that are "not quite right". Practically, this is very often the case where data sources are aligned automatically to enable searches across them, so a few bad mappings are better than having none.
- "high-noise". Some use cases employ data processing approaches that are themselves highly resilient to noise, like Machine Learning. Here, even a larger number of mappings (in a knowledge graph for example) which are "not quite right", or noisy, may be acceptable (if the signal-to-noise ratio is still OK, i.e. there are "more good than bad" mappings).

There is no easy recipe by which you can decide what level of noise is acceptable. Your use case will determine this. What you, as the steward of your organisation's mapping data, should consider is that there is (roughly) an order of magnitude in cost involved between the three levels:
@@ -85,7 +85,7 @@ Other key considerations in the sections are:
There are four semantic frameworks/formalisms that SSSOM supports by default: (1) [SPARQL/RDF(S)](https://www.w3.org/TR/rdf-sparql-query/) (querying an integrated knowledge graph with basic SPARQL); (2) [Simple Knowledge Organization System (SKOS)](https://www.w3.org/TR/skos-reference/); (3) [Web Ontology Language (OWL)](https://www.w3.org/TR/owl2-syntax/); (4) no formalism (property graphs, non-semantic use cases). We will briefly discuss the implications of each for your use cases.

- SPARQL/RDF(S) is a very general semantic framework that allows querying across [property paths](https://www.w3.org/TR/sparql11-property-paths/). Many SPARQL engines provide at least the RDFS entailment regime, which allows for some (basic) semantic reasoning (subClassOf, property domains). This is the most likely semantic framework of choice if your use case involves semantic queries such as those involving sub-class groupings.
- SKOS is semantic framework that layers on top of RDF and specifies semantics for a handful of properties that are useful for building taxonomies that do not seek to follow the rigorous semantics of the class-level modelling constructs such as subClassOf. We have no experience with SKOS reasoners, and do not know if there are any out there. This means, in effect, that this "case" (semantic framework) has the same exact considerations as the SPARQL/RDF(S) one above.
- SKOS is a semantic framework that layers on top of RDF and specifies semantics for a handful of properties that are useful for building taxonomies that do not seek to follow the rigorous semantics of the class-level modelling constructs such as subClassOf. We have no experience with SKOS reasoners, and do not know if there are any out there. This means, in effect, that this "case" (semantic framework) has the same exact considerations as the SPARQL/RDF(S) one above.
- OWL is a very powerful semantic framework that is based on formal logic. Ontologies represented in OWL offer support for complex expressions of knowledge, way beyond what RDFS and SKOS can do. OWL is the semantic framework of choice if the goal is to build **and reason** over an integrated (merged) ontology. An example use case where OWL is the appropriate framework is integration of species-specific anatomy ontologies under species-neutral ones, see for example [Uberon](https://github.com/obophenotype/uberon). A basic rule of thumb is: unless you know positively that you have to reason over the _merged_ graph, i.e. set of all ontologies you have mapped across, OWL is probably overkill and should be avoided.
- Using no semantic framework does not mean semantic mappings are useless! Many extremely useful applications exist for mappings which do not involve a semantic framework, such as those related to [Labelled Property Graphs](https://www.oxfordsemantic.tech/fundamentals/what-is-a-labeled-property-graph) (for example [neo4j](https://neo4j.com/)). Even if you just want to translate your data into a graph, it is useful to know the semantics of your mappings as they can inform your graph queries.

@@ -117,7 +117,7 @@ Note that it does not make sense to try and map instances of concepts, or concep
Typical use cases for mappings include:

1. _Semantic data integration_. This often involves linking data to ontologies or semantic layers in knowledge graphs. Data from one source (such as an EHR) is translated to another (such as OMOP, see above). To analyse the data semantically, the most valuable links are `exact` and `broad` as these allow you to directly query the ontology to retrieve instance data. `close` and `narrow` matches are less useful for such a use case, but may be consulted as the "next best thing" to an exact mapping. Often, a low level of noise is acceptable.
2. _Data translation_. Similar to data integration, but we want to map as precisely as possible. Only `exact` matches really matter if we want to make sure that data annotated with one ontology means the exact same thing as data annotated with another. Noise in the mappings is often not acceptable. An example for this is if one source has annotated all its genes using the Huge Gene Nomenclature Committee (HGNC) while another is using NCBI Gene Database identifiers. `broad`, `narrow` and even `close` matches are mostly meaningless - we need a 1:1 translation table with next to zero noise.
2. _Data translation_. Similar to data integration, but we want to map as precisely as possible. Only `exact` matches really matter if we want to make sure that data annotated with one ontology means the exact same thing as data annotated with another. Noise in the mappings is often not acceptable. An example for this is if one source has annotated all its genes using the HUGO Gene Nomenclature Committee (HGNC) while another is using NCBI Gene Database identifiers. `broad`, `narrow` and even `close` matches are mostly meaningless - we need a 1:1 translation table with next to zero noise.
3. _Ontology and knowledge graph merging_. Here, the key issue is that `exact` matches have as little noise as possible. Some merging approaches use probabilistic algorithms to weed out potentially bad mappings (low levels of noise may be acceptable, see for example [boomer](https://github.com/INCATools/boomer)), but any naive merging approach, which is still prevalent in the knowledge graph world, will usually do the following: (1) Merge all `exact` matches into one "node" in the knowledge graph and (2) redirect all data against all these `exact` matches to that newly created node.
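
The two steps of the naive merging approach can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular knowledge graph tool works: mappings are plain `(subject, precision, object)` tuples, the identifiers `A:1`, `B:1`, `C:1` and `D:2` are made up, and a small union-find structure stands in for the "merge into one node" step.

```python
# Illustrative sketch only: identifiers and function names are hypothetical.

def merge_exact_matches(mappings):
    """Step (1): collapse every chain of `exact` matches into one node.

    Returns a dict from each merged term to its representative node.
    """
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # shorten the path as we go
            term = parent[term]
        return term

    for subject, precision, obj in mappings:
        if precision != "exact":
            continue  # close, broad, narrow and related are never merged
        root_s, root_o = find(subject), find(obj)
        if root_s != root_o:
            # Use the lexicographically smaller root as the merged node.
            parent[max(root_s, root_o)] = min(root_s, root_o)

    return {term: find(term) for term in list(parent)}


def redirect(records, node_of):
    """Step (2): point each record at the merged node of its term."""
    return {rec: node_of.get(term, term) for rec, term in records.items()}


mappings = [
    ("A:1", "exact", "B:1"),
    ("B:1", "exact", "C:1"),
    ("A:1", "close", "D:2"),
]
records = {"rec1": "B:1", "rec2": "C:1", "rec3": "D:2"}

node_of = merge_exact_matches(mappings)
redirected = redirect(records, node_of)
```

Because the merge is transitive, a single wrong `exact` match between `B:1` and `C:1` would pull everything attached to `C:1` onto the same node as `A:1`, which is why noise in `exact` matches is so costly for this use case.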

<a id="tenstep"></a>