Add unique ID to the notebook metadata #148
Comments
For W3C Web Annotations (JSONLD RDF), is it necessary to associate threaded comments and highlights with a URI subject? A UUID URN could be the canonical identifier for Web Annotations. [1][2] A UUID can be a URI when it has the urn:uuid URN namespace prefix: From https://en.wikipedia.org/wiki/Universally_unique_identifier#Format :
[1] w3c/wpub#56 (comment) links to When would the UUID need to be changed?
What sort of UI does this need?
|
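The `urn:uuid:` form mentioned above can be produced with Python's standard `uuid` module. This is only a minimal sketch; the variable names are illustrative and not part of any proposal in this thread.

```python
import uuid

# Generate a random (version 4) UUID.
notebook_uuid = uuid.uuid4()

# The `urn` attribute renders the UUID with the urn:uuid: namespace prefix,
# which makes it usable as a URI (for example as an annotation subject).
notebook_urn = notebook_uuid.urn

# Parsing the URN string back yields the same UUID.
roundtripped = uuid.UUID(notebook_urn)
```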
I'm a little late to the thread but I agree having a uuid in the notebook format would be hugely beneficial. It might be required to only change the ID when the location or name of the notebook changes to be compatible with existing assumptions made in applications. This basically follows what is proposed here, by implying that copying a notebook should impose an id update. But edits to an existing notebook would not. The hard part for this is, how does one treat non-application copies? If a user does |
My suggestion of adding a UUID comes from looking at the PDF format (originally motivated by wanting to understand how hypothesis does its magic). https://www.seanh.cc/2017/11/22/pdf-fingerprinting/ is a good, readable post on some of the basics and how to extract fingerprints from PDFs. For PDFs the idea is that copying, renaming and so on does not change the ID. I think that makes sense, as the reason to have the ID is to be able to identify that two files are the same independent of the filename or URL. PDFs have a second ID which starts out the same as the first but is updated when the content changes. I think we could do worse than to copy what PDF does (use two IDs). Their spec is in section 14.4 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf |
That's a fairly compelling argument for ID management, with a similar use case and well-established precedent. It doesn't solve all problems, like users copying from a common starter notebook for all their patterns, but it would give a lot more information and insight into notebooks in a way that handles file copy and edits a little better. I'm for this pattern -- do you think we should formulate a JEP for the idea? |
Google Colab includes a 'provenance' section in the Colab-specific notebook metadata. This is an array of what we consider IDs indicating where the file came from:
"metadata": {
"colab": {
"provenance": [
{
"file_id": "1Rgt3Q7hVgp4Dj8Q7ARp7G8lRC-0k8TgF",
"timestamp": 1560453945720
},
{
"file_id": "https://gist.github.com/blois/057009f08ff1b4d6b7142a511a04dad1#file-post_run_cell-ipynb",
"timestamp": 1560453945720
}
], Every time the file is cloned from within Colab we push a new entry into that list indicating where the file was cloned from. The We don't make heavy use of this data because:
and
If they are copies then it does not seem that they are the same notebook. It seems unexpected for comments on Alice's copy to be shown to Bob. For persistence in browser storage: is there a canonical URL for the notebook that BinderHub could use? |
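The clone-tracking behaviour described above could be sketched roughly as follows. This is not Colab's actual implementation: the function name `record_clone` is hypothetical, and appending to the end of the list and using a millisecond timestamp are assumptions based only on the sample metadata shown in the comment.

```python
import time


def record_clone(notebook: dict, source_file_id: str) -> dict:
    """Append a provenance entry saying where this copy was cloned from.

    Hypothetical helper: only the "file_id" and "timestamp" keys are taken
    from the sample metadata above; everything else is an assumption.
    """
    colab_meta = notebook.setdefault("metadata", {}).setdefault("colab", {})
    provenance = colab_meta.setdefault("provenance", [])
    provenance.append(
        {
            "file_id": source_file_id,
            # Milliseconds since the Unix epoch (assumed unit).
            "timestamp": int(time.time() * 1000),
        }
    )
    return notebook


# Usage: two successive clones leave two entries, oldest first.
nb = record_clone({}, "original-file-id")
nb = record_clone(nb, "first-copy-file-id")
```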
There's a W3C spec for data like this: (1) where the inputs came from; (2) where the outputs come from; and (3) "who is making said claims with which cryptographic signature" requires additional specs like Linked Data Signatures, W3C Verifiable Claims (and Decentralized Identifiers). https://www.w3.org/TR/prov-overview/ https://www.w3.org/TR/prov-primer/#introduction https://en.wikipedia.org/wiki/PROV_(Provenance)
https://www.w3.org/TR/prov-primer/#derivation-and-revision-1 @prefix exg: <http://example.org/doc1#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
exg:dataset2 a prov:Entity ;
prov:wasRevisionOf exg:dataset1 .
There aren't JSON-LD examples in the PROV primer, but you can convert them from Turtle (N3) to JSON-LD through an online RDF translator (or {
"@context": {
"exg": "http://example.org/doc1#",
"prov": "http://www.w3.org/ns/prov#"
},
"@id": "exg:dataset2",
"@type": "prov:Entity",
"prov:wasRevisionOf": {
"@id": "exg:dataset1"
}
} When/where the document was copied, revised, and executed would be useful information to share in a tool-independent way. |
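As an illustration of how a tool might emit such a revision statement for a notebook, here is a minimal sketch that builds the same JSON-LD shape with the standard library only. The function name `revision_record` and the use of `urn:uuid:` identifiers as `@id` values are assumptions for the example, not something the PROV specs or any Jupyter tool prescribe.

```python
import json
import uuid


def revision_record(new_id: str, old_id: str) -> dict:
    """Build a JSON-LD document stating that new_id is a revision of old_id."""
    return {
        "@context": {"prov": "http://www.w3.org/ns/prov#"},
        "@id": new_id,
        "@type": "prov:Entity",
        "prov:wasRevisionOf": {"@id": old_id},
    }


# Usage: identifiers here are freshly generated UUID URNs (an assumption).
old_urn = uuid.uuid4().urn
new_urn = uuid.uuid4().urn
record = revision_record(new_urn, old_urn)
serialized = json.dumps(record, indent=2)
```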
I'd be up for this.
What is the difference between a canonical/resolvable URL (at which you can't actually download the notebook) and a unique identifier (that is a random number)? Is there an advantage to using one or the other? Number 4 or 5 from the list above seem good. I think anything that gives away where a notebook came from (GitHub repo, local file path, etc.) has the potential for a privacy disaster.
I think a copy of a file made with a tool like I think I'd start a JEP with proposing to include two identifiers in the metadata. Both are chosen "somehow" when the notebook is first created. The first identifier never changes, the second gets changed "somehow" when a user edits or otherwise "meaningfully changes" the notebook. This lets us tell if two notebooks are copies of each other, just located in different parts of the galaxy, if one notebook was somehow derived from another one (shared first identifier, different second one) or if they are completely unrelated. It also means that we don't have to scrub notebooks before people share them (you don't learn very much from getting my notebook and looking at the identifiers). |
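The two-identifier idea can be sketched as below. The key names `permanent_id` and `revision_id`, and the helper functions, are hypothetical; the comment above deliberately leaves open how the identifiers are chosen and what counts as a meaningful change, so random UUIDs are just one possible choice.

```python
import uuid
from enum import Enum


class Relation(Enum):
    SAME = "same"            # both identifiers match: plain copies
    DERIVED = "derived"      # first identifier matches, second differs
    UNRELATED = "unrelated"  # first identifier differs


def new_ids() -> dict:
    """Pick both identifiers when a notebook is first created."""
    return {"permanent_id": uuid.uuid4().hex, "revision_id": uuid.uuid4().hex}


def mark_edited(ids: dict) -> dict:
    """Keep the first identifier, replace the second after a meaningful edit."""
    return {"permanent_id": ids["permanent_id"], "revision_id": uuid.uuid4().hex}


def compare(a: dict, b: dict) -> Relation:
    """Classify how two notebooks relate, using only their identifiers."""
    if a["permanent_id"] != b["permanent_id"]:
        return Relation.UNRELATED
    if a["revision_id"] == b["revision_id"]:
        return Relation.SAME
    return Relation.DERIVED


# Usage: a byte-for-byte copy, an edited copy, and an unrelated notebook.
original = new_ids()
copied = dict(original)
edited = mark_edited(original)
other = new_ids()
```

Because both values are random, sharing a notebook reveals nothing about where it was stored, which matches the privacy concern raised above.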
Hi Wes, On Twitter you asked about a durable ID to associate annotations to a notebook page. Here's an example of what we recommend:
https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/#dublin-core-metadata |
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/annotating-jupyter-notebooks/2079/6 |
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/annotating-jupyter-notebooks/2079/7 |
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/annotating-jupyter-notebooks/2079/12 |
Since this is about tracking notebooks, I wanted to link to recent work on the File ID service for jupyter-server: jupyter-server/jupyter_server#940, https://github.com/jupyter-server/jupyter_server_fileid, jupyterlab/jupyterlab#12614. It seems that the motivation for the File ID service and for this discussion was similar (enabling comment tracking). CC @ellisonbg @dlqqq just to reconcile the discussions in case you have not seen this one in a while. |
It would be nice to have a "practically unique across the universe identifier" in the notebook metadata. This would allow you to recognise a notebook based on this ID. Right now if Alice and Bob have a copy of the same notebook there is no way to know if they are the same or not. Even for Alice on her laptop and her desktop this is hard. If the notebook contained a unique ID it would be clear that (at some point) these were the same notebook.
I can think of three use cases:
I'd propose that the notebook format starts recommending that tools which create notebooks add a "unique_id" field to the notebook level metadata that contains a value like uuid.uuid4().hex. The value of this field should not be changed by reading and writing to the notebook.
I am new to this repo so please close this and link to an existing issue/PR if there is one. I searched for "unique" and didn't find anything.
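A minimal sketch of what a notebook-creating tool could do under this proposal, working on the notebook as a plain dict so that it does not depend on any particular library. The helper name `ensure_unique_id` is hypothetical; only the "unique_id" field name and the `uuid.uuid4().hex` value come from the proposal above.

```python
import json
import uuid


def ensure_unique_id(notebook: dict) -> str:
    """Add metadata.unique_id if it is missing; never overwrite an existing one."""
    metadata = notebook.setdefault("metadata", {})
    if "unique_id" not in metadata:
        metadata["unique_id"] = uuid.uuid4().hex
    return metadata["unique_id"]


# Usage: the ID is assigned once and survives a write/read round trip.
nb = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
first_id = ensure_unique_id(nb)
reloaded = json.loads(json.dumps(nb))
second_id = ensure_unique_id(reloaded)
```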