sonormal is a Python library to assist with extraction and processing of schema.org content, with emphasis on the Dataset class.
Included is a command line tool, jld, for retrieving and extracting JSON-LD from a web page or other resource and performing various operations on JSON-LD.
This library and tool are focused on supporting schema.org harvesting for the DataONE infrastructure.
Usage: jld [OPTIONS] COMMAND [ARGS]...
Retrieve and process JSON-LD.
Options:
-b, --base TEXT Base URI
-p, --profile TEXT JSON-LD Profile
-P, --request-profile TEXT JSON-LD Request Profile
-r, --response Show response information
-R, --relaxed-json Relax strict JSON deserialization
-W, --webpage Render SPA page
--soprod Use schema.org production context instead of v12 https
--help Show this message and exit.
Commands:
cache Cache management, list or purge
canon Normalize and render canonical form
compact Compact the JSON-LD SOURCE
frame Apply frame to source
get Retrieve JSON-LD
identifiers Extract Dataset identifiers
nquads Transform JSON-LD to N-Quads
play Load in JSON-LD Playground
cache lists entries in the local cache (in folder ~/.local/sonormal/cache) and optionally purges entries.
canon canonicalizes the source JSON-LD by expanding it and applying the URDNA2015 algorithm, then serializes the result with ordered terms, no newlines, and no spaces between delimiters. Checksums computed on the result are therefore consistent across different arrangements of the same input source.
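The serialization and checksum step can be illustrated with the Python standard library alone. This is a minimal sketch, not the sonormal implementation: it does not expand the document or run URDNA2015, and the function names `serialize_compact` and `checksums` are hypothetical. It only shows why ordered terms and fixed delimiters give stable checksums when the same keys arrive in a different order.

```python
import hashlib
import json


def serialize_compact(doc):
    """Serialize with sorted keys, no newlines, and no spaces between delimiters.

    Assumption: this stands in for the final serialization step only; real
    canonicalization also needs JSON-LD expansion and URDNA2015.
    """
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def checksums(text):
    """Compute hex digests of the UTF-8 encoded text."""
    data = text.encode("utf-8")
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
    }


# Two arrangements of the same (made-up) content.
a = {"@type": "Dataset", "name": "Example", "@id": "https://example.org/ds/1"}
b = {"name": "Example", "@id": "https://example.org/ds/1", "@type": "Dataset"}

# Both arrangements serialize to the same string, so their checksums match.
same = checksums(serialize_compact(a)) == checksums(serialize_compact(b))
```

Key ordering alone does not handle blank node labels or differing contexts, which is what the expansion and URDNA2015 steps address.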
compact applies the JSON-LD compaction algorithm to the source using the context:
{"@context": [
"https://schema.org/",
{
"id": "id",
"type": "type"
}
]
}
frame applies the JSON-LD framing algorithm to structure the JSON-LD for ease of identifier extraction from a Dataset instance, using the frame:
{
"@context": {"@vocab":"https://schema.org/"},
"@type": "Dataset",
"identifier": {},
"creator": {}
}
get retrieves the document from a file or URL, following redirects and Link headers as appropriate. Content is extracted from HTML pages and, optionally (with the -W flag set), from single page applications where the JSON-LD may be generated on the fly.
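Extraction from a static HTML page generally means locating script elements of type application/ld+json and parsing their contents. The following is a rough standard-library sketch of that idea, not the code jld uses; the class `JsonLdExtractor`, the function `extract_jsonld`, and the sample page are hypothetical. It does not fetch anything, follow Link headers, or render single page applications.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in the HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Made-up page used only to exercise the sketch.
sample_page = """<html><head>
<script type="application/ld+json">
{"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset", "name": "Example"}
</script>
</head><body><p>Landing page</p></body></html>"""

found = extract_jsonld(sample_page)
```

Strict parsing with `json.loads` raises an error on malformed blocks; how jld relaxes this under the -R option is not shown here.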
identifiers extracts Dataset identifier values and computes checksums of the JSON-LD.
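As an illustration of what identifier extraction involves, the sketch below walks a compacted JSON-LD document and collects the @id, url, and identifier values of Dataset nodes. It is an assumption-laden approximation, not the approach jld takes (which relies on framing): the helper names are hypothetical, only string identifiers and PropertyValue-style objects with a "value" key are handled, and terms are assumed to be compacted against the schema.org vocabulary.

```python
def _as_list(value):
    """Wrap a single value in a list; treat a missing value as an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _is_dataset(node):
    """True if the node's @type includes Dataset (compact or full IRI)."""
    types = _as_list(node.get("@type"))
    return any(t in ("Dataset", "https://schema.org/Dataset") for t in types)


def dataset_identifiers(doc):
    """Collect @id, url, and identifier values from Dataset nodes.

    Assumption: top-level nodes and @graph members are inspected; Datasets
    nested under other properties are not.
    """
    results = []
    for node in _as_list(doc):
        if not isinstance(node, dict):
            continue
        if "@graph" in node:
            results.extend(dataset_identifiers(node["@graph"]))
        if not _is_dataset(node):
            continue
        identifiers = []
        for item in _as_list(node.get("identifier")):
            if isinstance(item, str):
                identifiers.append(item)
            elif isinstance(item, dict) and "value" in item:
                identifiers.append(item["value"])
        results.append(
            {
                "@id": _as_list(node.get("@id")),
                "url": _as_list(node.get("url")),
                "identifier": identifiers,
            }
        )
    return results


# Made-up document used only to exercise the sketch.
example = {
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "https://example.org/ds/1",
    "url": "https://example.org/ds/1",
    "@type": "Dataset",
    "identifier": {"@type": "PropertyValue", "propertyId": "UUID", "value": "uuid:1234"},
}
extracted = dataset_identifiers(example)
```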
nquads serializes the JSON-LD to N-Quads format.
Download and extract JSON-LD from Hydroshare:
jld get "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257/"
{
"@context": {
"@vocab": "https://schema.org/",
"datacite": "http://purl.org/spar/datacite/"
},
"@id": "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257",
"url": "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257",
"@type": "Dataset",
"additionalType": "http://www.hydroshare.org/terms/CompositeResource",
...
Download and extract JSON-LD from a DataONE single page application (with JSON-LD rendered by the client):
jld -W get "https://search.dataone.org/view/urn%3Auuid%3Add9ad874-ded8-48fe-908a-06732b9a6297"
[
{
"@context": {
"@vocab": "https://schema.org/"
},
"@type": "Dataset",
"@id": "https://dataone.org/datasets/urn%3Auuid%3Add9ad874-ded8-48fe-908a-06732b9a6297",
"datePublished": "2013-10-23T00:00:00Z",
"publisher": {
"@type": "Organization",
"name": "California Ocean Protection Council Data Repository"
},
"identifier": "urn:uuid:dd9ad874-ded8-48fe-908a-06732b9a6297",
...
Processing operations can take stdin as input. For example, normalize JSON-LD using the URDNA2015 algorithm, which assigns ids to blank nodes. The source is expanded and canonicalized, and the output is serialized with no newlines and no spaces between delimiters in preparation for calculating checksums.
jld get "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257/" | jld canon
[{"@id":"_:c14n0","@type":["http://purl.org/spar/datacite/ResourceIdentifier","https://schema.org/PropertyValue"],
"http://purl.org/spar/datacite/usesIdentifierScheme":[{"@id":"http://purl.org/spar/datacite/
local-resource-identifier-scheme"}],"https://schema.org/propertyId":[{"@value":"UUID"}],"https://schema.org/value":
[{"@value":"uuid:058d173af80a4784b471d29aa9ad7257"}]},{"@id":"_:c14n1","@type":["https://schema.org/Place"],
...
Extract identifiers and compute checksums:
jld get "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257/" | jld identifiers -c
[
{
"@id": [
"https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257"
],
"url": [
"https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257"
],
"identifier": [
"uuid:058d173af80a4784b471d29aa9ad7257"
],
"hashes": {
"sha256": "a8cb4e5806045032fc2e7ad0b762336ff76f3792271ddc071c0d8c85d6b69ac5",
"sha1": "f6abef03156a5adb6d395f385628a2894e7b920e",
"md5": "03a357ba8043ac734aa3b9e9bb514ff9"
}
}
]
Open the canonical form of the BCO-DMO dataset https://www.bco-dmo.org/dataset/839373 in JSON-LD Playground:
jld get "https://www.bco-dmo.org/dataset/839373" | jld canon | jld play -B
New public gist created at:
https://gist.github.com/datadavev/4f3cad1a104263bcf1c1bb96723911fc
Link to JSON-LD playground:
https://json-ld.org/playground/#startTab=tab-expanded&json-ld=https%3A%2F%2Fgist.githubusercontent.com%2Fdatadavev%2F4f3cad1a104263bcf1c1bb96723911fc%2Fraw
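The playground link in the output above appears to be the playground address followed by a URL fragment that carries the percent-encoded raw gist URL. A minimal sketch of building such a link, inferred only from that output (the function name `playground_link` and its parameters are hypothetical, and this is not taken from the jld source):

```python
from urllib.parse import quote

PLAYGROUND = "https://json-ld.org/playground/"


def playground_link(raw_url, start_tab="tab-expanded"):
    """Build a playground link whose fragment points at a raw JSON-LD URL."""
    # safe="" percent-encodes ":" and "/" as well, matching the output above.
    return f"{PLAYGROUND}#startTab={start_tab}&json-ld={quote(raw_url, safe='')}"


link = playground_link(
    "https://gist.githubusercontent.com/datadavev/4f3cad1a104263bcf1c1bb96723911fc/raw"
)
```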
Install using poetry. For example:
git clone https://github.com/datadavev/sonormal.git
cd sonormal
poetry install
Then run using:
poetry run jld
Alternatively, install into a separately created virtual environment:
poetry install
Then run directly:
jld
Note that the play command for uploading to the JSON-LD Playground requires that the GitHub command line tool gh is available on the path, and that you have authenticated the tool.