Add JEP for adding $schema to notebook format #97

Merged (6 commits) on Jul 11, 2023

@filipsch
Contributor

filipsch commented Mar 8, 2023

This JEP proposes adding a new top-level field, $schema, to the notebook JSON, and updating the notebook JSON schema accordingly. The new field deprecates nbformat and nbformat_minor.

I skipped the step of creating a GitHub issue and deciding whether it's a JEP in this repository after discussing with @fcollonval. There was broad consensus about this change, and about it being a JEP, during the notebook format workshop held in Paris (Feb 28 - Mar 2), so I thought it okay to file a PR straight away. I will be the shepherd.

Voting from @jupyter/software-steering-council

@willingc
Member

@MSeal @rgbkrk Please review. It looks fine to me, provided there is a migration path from old to new formats so that notebooks from different nbformat versions can be executed.


```
{
"$schema": "http://json-schema.org/draft-04/schema#",
Member

From our notebook format workshop meeting this morning, we'll need to bump to at least JSON Schema draft 2019 to have the deprecated keyword, and maybe we should bump this to the 2020 draft (i.e., the latest draft)?

Contributor

@agoose77 agoose77 Mar 14, 2023

We will also need to introduce support for this new keyword in the existing schemas, i.e. backport the addition to our existing schemas. This should be acceptable, as these schemas are not declared immutable, and it would be a permissive change.

We should also define how to ensure that the nbformat versions align with the document schema during the deprecation period. One solution is to ensure that they're constants in the schema. Thereafter, we could move to a single version number (major) for each schema revision, as compatibility is enforced by $schema itself.

I don't know of a reason not to bump to 2020 draft, besides the risk of existing tooling not supporting newer drafts.

Contributor

I would say that for the newer version it is acceptable to bump the schema to the latest draft, after ensuring there is no backward incompatibility between the current version based on draft 04 and the new draft. As far as I know, some breaking changes are introduced when upgrading from draft 04 to draft 06.

From a quick look it seems ok

Additionally, if we are to bump the JSON Schema draft, I would switch every enum with a single value to the new const, as used for the nbformat version numbers (see mainly cell_type and output_type).

@jasongrout
Member

Can we add what the proposed changes are to the schema that removes the deprecated attributes, to have an idea of what the resolution of the deprecated properties looks like?

@westurner

Please consider YAML-LD (JSON-LD) in naming the attribute $schema

Given that Linked Data is ideal for science publishing and the internet, as explained and justified by https://5stardata.info/

Eventually, I and I believe also @bollwyvl TODO argue that nbformat should have a JSON-LD Context which would make .ipynb transformable to RDF, in order to both:

  • publish Linked Data from nbformat notebooks
  • publish nbformat notebooks as JSON-LD Linked Data

Eventually,

That's out of scope for this issue, but FWIW the YAML-LD Convenience Context does map a bunch of things that start with $ to their @ equivalents in JSON-LD, and $schema may or may not be confusing when working with nbformat as YAML-LD:

{
  "@context": {
    "$base": "@base",
    "$container": "@container",
    "$direction": "@direction",
    "$graph": "@graph",
    "$id": "@id",
    "$import": "@import",
    "$included": "@included",
    "$index": "@index",
    "$json": "@json",
    "$language": "@language",
    "$list": "@list",
    "$nest": "@nest",
    "$none": "@none",
    "$prefix": "@prefix",
    "$propagate": "@propagate",
    "$protected": "@protected",
    "$reverse": "@reverse",
    "$set": "@set",
    "$type": "@type",
    "$value": "@value",
    "$version": "@version",
    "$vocab": "@vocab"
  }
}

$schema is not on the list.

@agoose77
Contributor

agoose77 commented Mar 14, 2023

@westurner the $schema top-level property is already prior-art for declaring that a document conforms to a JSON Schema. If we use a different property here, we lose the ability to have a large class of validators understand our document. We've touched on RDF/JSON-LD in our weekly meetings, which you're encouraged to join!

FWIW, as I understand it, JSON-LD and JSON Schema are orthogonal concepts. In this JEP, we're concerned about the validation side of things; down the road, the linked-document properties of LD will be useful.

@tonyfast
Contributor

currently, the top level notebook schema does not allow for any additionalProperties defined in the container, so we can't have any LD @context. we're hoping to introduce @context as a top level key in future schema. there are likely a few proposals between this JEP and an @context proposal. so this is on folks' minds, but we decided to defer linked data proposals until some prior JEPs are accepted. advancing the schema will mean good things for our ability to write linked data contexts.

change the schema to draft2020-12
@tonyfast
Contributor

@jupyter/software-steering-council we are working on a draft to present to y'all for the JEP. yesterday we were wondering what to expect with the process. is there any way someone can outline what the process will look like so we can plan our work accordingly and set some deadlines?

@rgbkrk
Member

rgbkrk commented Mar 16, 2023

This is so much more sensible than the incrementing numbers and awkward compatibility between notebook formats. I'm wholly on board. Thank you all so much for pushing forward with this approach.

@fcollonval
Contributor

Thanks all for the great discussion.

is there any way someone can outline what the process will look like so we can plan our work accordingly and set some deadlines?

By my reading there are three open questions:

  • What should be the JSON Schema Draft version?
    • Should we annotate the nbformat_minor and nbformat as deprecated (if we use draft 2019 or later)?
  • Should we switch more enum with single value to const as the newer draft allows that (this will ease the understanding).

And I'm unclear about the following comment of @agoose77 :

We will also need to introduce support for this new keyword in the existing schemas

Which new keyword are we speaking about?


To get validation (from the SSC), the easiest would be to resolve all pending questions and then ping the SSC that this is ready for approval. If some questions are left open, I would recommend summarizing them in a comment with the possible solutions. Then ping the SSC, which will have to figure out how to move forward.

@agoose77
Contributor

agoose77 commented Mar 16, 2023

What should be the JSON Schema Draft version?

At least 2019-09. I'd be curious to know whether there are downsides to just jumping straight to 2020-12. See filipsch#2 :)

Should we annotate the nbformat_minor and nbformat as deprecated

Yes, I think so.

Should we switch more enum with single value to const as the newer draft allows that (this will ease the understanding).

Yes, I think so.

We will also need to introduce support for this new keyword in the existing schemas

Which new keyword are we speaking about?

Actually, this is something I wanted to follow up with @jasongrout on. Due to the fact that we have additionalProperties: false, no document with the $schema top-level property will be considered valid for existing schemas. Right now, this doesn't cause a hard-failure with nbformat; the validator complains about a validation error, but ultimately loads notebooks with additional properties.

My understanding of our deprecation process is that we will update nbformat so that it always uses $schema if it finds it. The deprecation period simply means that a v4 notebook might have $schema, or it might not. We should keep the nbformat properties in these transition notebooks so that out-of-date nbformat libraries / other validators have a chance at being able to read the notebook if they're permissive enough. i.e., if notebook consumers are not strictly rejecting the document outright due to the new $schema property, then they will have sufficient information to know that it's nbformat 4.

I was originally thinking that we would need to backport $schema to older (<v4.7) schemas, but actually I don't think that's the case.

Going forward, we will in-principle be moving away from a need for major epochs of a schema; we can version the schema by calver (like JSON Schema drafts) if we want to (and without further context, I'd prefer that). To my mind, if we need to be able to upgrade/downgrade notebooks between schema versions, we can do this on a calver-like ordering, i.e. change the API of nbformat.

```
After the deprecation period expires, a future JEP will remove these `nbformat` and `nbformat_minor` properties from the notebook schema. These properties are retained to permit legacy notebook consumers to read notebooks authored during this deprecation period.

The addition of the `$schema` property removes a level of indirection between the notebook and the schema against which it is validated. It also guarantees that the schema against which it is validated is invariant with respect to time; the schema URI should refer to an immutable document.
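As a rough illustration of the deprecation period described in this excerpt, a notebook could carry both the new `$schema` field and the retained version fields. The sketch below is not text from the JEP: the version numbers (4 and 5) and the URI are placeholders that follow the URI pattern settled on later in this thread, and `legacy_version` is a hypothetical helper standing in for an older, permissive consumer.

```python
import json

# Hypothetical transitional notebook: "$schema" is new, while "nbformat" and
# "nbformat_minor" are kept so that older readers can still identify the format.
# The URI and version numbers are illustrative placeholders.
notebook = {
    "$schema": "https://schema.jupyter.org/notebook/v4.5/notebook.json",
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [],
}


def legacy_version(document):
    """Mimic a permissive legacy consumer: ignore "$schema", read the old fields."""
    return document["nbformat"], document["nbformat_minor"]


# Round-trip through JSON to show the document is plain, serializable data.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)
print(legacy_version(restored))
```

A consumer that understands `$schema` would read that field instead; the old fields only matter to readers that predate it.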

the schema URI should refer to an immutable document.

It would be good to define where the canonical version of the schema document is stored. This may already be defined somewhere else, in which case that information could be referenced here.

While it's not necessary, the schema could be hosted at the URL represented by the URI. That is, https://jupyter.org/schema/notebook/notebook-{nbformat}.{nbformat_minor}.schema.json would resolve to the actual schema document. Or, it could be tied to a GitHub repo, branch, and tag.

Contributor

Agreed, this is something we've touched upon in the meetings. It's my feeling that we haven't wanted to define that in this JEP (to avoid taking on too many responsibilities). As of right now, the schemas used to validate notebooks are stored in the nbformat repository / wheels. I don't think they're hosted on a standalone URL, but I've not checked!

I wouldn't call this a strong need or requirement. But when I see a schema declaration, my instinct is to ask where I can find the actual document and the statement regarding an immutable document reinforces that. If this isn't the right time to define the process for hosting the schema, then I would suggest just documenting the current location as information for the reader.

could be tied to a GitHub repo, branch, and tag.

jupyter.org will, in all likelihood, remain under control of Project Jupyter, while GitHub could pull a Docker, Inc., and make things much more challenging.

Further, and not explicitly stated (again to avoid any scope creep): it must not be an expectation that a validating tool will need to (or even be able to) fetch the schema in order to validate it.

While it's my strong feeling the whole family of current and future Jupyter schema should all have an "official", inspectable URL together, with unified tooling for generating human-readable documentation of the schema, generated, lightweight header/typing packages should obviate much of the need for "go grab something off the internet at runtime".

Today, nbformat publishes canonical packages on pypi.org and npmjs.com, which is a great start! But really every language community that wants representation in Jupyter should be able to propose and maintain lightweight packages, and get up-to-date packages.

Here's a quick strawman sketch of what something like that might look like.

Member

FYI: #107

@westurner

westurner commented Mar 17, 2023

  • Should we annotate the nbformat_minor and nbformat as deprecated (if we use draft 2019 or later)?

When [W3C SHACL] validation for Linked Data notebooks becomes the norm (because Linked Data Notebook outputs are most practically validated as Linked Data with Shapes and Constraints), then the (URI-namespaced) property for the version of the SHACL validation document would need to supersede $schema, again,
So no: $schema URL should not be the nbformat version number because other [SHACL,] schema changes would not result in an implicit change to $schema.

@westurner

westurner commented Mar 17, 2023

currently, the top level notebook schema does not allow for any additionalProperties defined in the container,

Is JSONschema with additionalProperties: false fundamentally incompatible with JSON-LD?

so we can't have any LD @context. we're hoping to introduce @context as a top level key in future schema.

When the versioned URI and contents of the @context attribute change, does the nbformat major or minor version need to change?

there are likely a few proposals between this JEP and an @context proposal. so this is on folks' minds, but we decided to defer linked data proposals until some prior JEPs are accepted. advancing the schema will mean good things for our ability to write linked data contexts.

nbformat is older than jsonschema, and may outlast jsonschema draft n, so a separate version string that doesn't change between implementations would be great for backward compatibility

@tonyfast
Contributor

Is JSONschema with additionalProperties: false fundamentally incompatible with JSON-LD?

notebook documents, the serialized versions of someone's notebooks, are fundamentally incompatible with JSON-LD. we can add @context and @graph or any other json-ld key into the metadata properties because they have permissive keys. the top level notebook document is much more strict. in fact, without this JEP, $schema is not something that can exist in a serialized notebook document because additionalProperties is false.
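To make the `additionalProperties` point concrete, here is a minimal sketch in plain Python. It checks only that one rule against heavily simplified, made-up stand-ins for the notebook schema; `unexpected_top_level_keys`, both schema dicts, and the placeholder URI are assumptions for illustration, and a real JSON Schema validator does far more.

```python
def unexpected_top_level_keys(document, schema):
    """Return top-level keys rejected by `additionalProperties: false`.

    Only this one rule is checked; everything else in JSON Schema is ignored.
    """
    if schema.get("additionalProperties", True) is not False:
        return []  # extra keys are permitted
    allowed = set(schema.get("properties", {}))
    return sorted(key for key in document if key not in allowed)


# Heavily simplified stand-ins for a notebook schema before and after this JEP.
old_schema = {
    "additionalProperties": False,
    "properties": {"metadata": {}, "nbformat": {}, "nbformat_minor": {}, "cells": {}},
}
new_schema = {
    "additionalProperties": False,
    "properties": {**old_schema["properties"], "$schema": {"type": "string"}},
}

notebook = {
    "$schema": "https://example.invalid/notebook.json",  # placeholder URI
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
    "cells": [],
}

print(unexpected_top_level_keys(notebook, old_schema))  # ['$schema']
print(unexpected_top_level_keys(notebook, new_schema))  # []
```

The same reasoning applies to a top-level `@context`: it would be rejected until a schema revision declares it.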

When the versioned URI and contents of the @context attribute change, does the nbformat major or minor version need to change?

this is a good consideration. as we deprecate nbformat and nbformat_minor we'll have to increment them with each version until they are removed. we have ongoing discussions about how to handle mismatched $schema and nbformat keys, likely $schema takes precedence.

nbformat is older than jsonschema, and may outlast jsonschema draft n, so a separate version string that doesn't change between implementations would be great for backward compatibility

wow, you're right! nbformat does predate jsonschema, it seems v3 is the first version to rely on draft04. that was a fun dig into history. anyway, current json schema efforts seem to be a well supported community and they are rigorous about changing their versions. nbformat will update the json schema draft it is based on less often than we will update our own schema versions.

we've been spending a lot of time discussing backwards compatibility, and how to handle that best. ongoing work...

When [W3C SHACL] validation for Linked Data notebooks becomes the norm (because Linked Data Notebook outputs are most practically validated as Linked Data with Shapes and Constraints), then the (URI-namespaced) property for the version of the SHACL validation document would need to supersede

a shacl context for notebook schema will undoubtedly show up in the future. the nbformat schemas serve as valuable interfaces defining linked data contexts. for example, nbformat could be mapped to shacl using a context like:

 {"@vocab": "https://github.com/jupyter/nbformat/blob/main/nbformat/v4/nbformat.v4.5.schema.json#", "@base": "http://www.w3.org/ns/shacl#"}

@agoose77
Contributor

we have ongoing discussions about how to handle mismatched $schema and nbformat keys, likely $schema takes precedence.

@tonyfast I was thinking about this after the meeting, and it seems to me that we should literally define these as constants in the schema. My take is that if you author a notebook with $schema, you're literally asking for it to conform to that schema, and that includes nbformat minor and major being valid.
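A minimal sketch of what "constants in the schema" could mean in practice, assuming a simplified schema fragment in which the version fields are pinned with `const`. The helper name `const_mismatches` and the version numbers are hypothetical, and only top-level `const` keywords are examined.

```python
def const_mismatches(document, schema):
    """List (key, expected, actual) for top-level properties pinned with "const"."""
    problems = []
    for key, subschema in schema.get("properties", {}).items():
        if "const" in subschema and key in document:
            if document[key] != subschema["const"]:
                problems.append((key, subschema["const"], document[key]))
    return problems


# Simplified, hypothetical schema fragment: the version fields are constants,
# so a notebook that names this schema must also carry matching numbers.
schema = {
    "properties": {
        "nbformat": {"const": 4},
        "nbformat_minor": {"const": 5},
    }
}

consistent = {"nbformat": 4, "nbformat_minor": 5}
mismatched = {"nbformat": 4, "nbformat_minor": 2}

print(const_mismatches(consistent, schema))  # []
print(const_mismatches(mismatched, schema))  # [('nbformat_minor', 5, 2)]
```

With this approach a mismatch is simply a validation failure, so no separate precedence rule between `$schema` and the version fields is needed.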

change schema draft to draft2020-12
@tonyfast
Contributor

moving agenda minutes over from the team compass.

March 7th, 2023

| Name | Affiliation | GitHub | Favorite Schema Key |
| --- | --- | --- | --- |
| tonyfast | | @tonyfast | properties |
| fcollonval | QuantStack | @fcollonval | |
| Angus Hollands | Princeton University | @agoose77 | 😄 |
| Rowan | Curvenote / ExecutableBooks | @rowanc1 | |
| Nick Bollweg | Georgia Tech | @bollwyvl | |

Agenda

first meeting of the notebook cells schema group outside of the nbformat workshop.

  • Meeting logistics

    • use hackmd for notes
    • use google meet for video because jovyan is crowded
      • ⚠️ this account is limited to one hour so we have a real hard stop.
    • the textual format team is working in other channels to submit their jeps.
  • Research

    • which schema draft are we using?
    • should only be adding cells and metadata
    • how is this file format going to be reused?
    • introduction of notebook mimetype. how do we carry around the mimebundle across documents and use that information.
    • how do we use attachments better? where do attachments belong?
      • could attachments just be a cell? hold the whole mimebundle
    • Distinguish between saving and reading - always uphold $schema, but not extraSchemas?
    • Should extraSchemas allow embedding schema?
    • Do we include @context?
      • Probably a separate JEP because the value proposition is a different learning curve.
  • Interests

    • Rowan - standardization of notebooks in scientific publishing. dealing with authorship, title, subtitles, scholarship.

to do

  • follow up with JEP shepherd
  • post an issue to the team compass
  • add the event to the community calendar

$vocabulary

  • does this provide the convention (and therefore the tools) we need

https://gregsdennis.github.io/Manatee.Json/usage/schema/vocabs.html

"$vocabulary": {
    "https://json-schema.org/draft/2019-WIP/vocab/core": true,
    "https://json-schema.org/draft/2019-WIP/vocab/applicator": true,
    "https://json-schema.org/draft/2019-WIP/vocab/validation": true,
    "https://json-schema.org/draft/2019-WIP/vocab/meta-data": true,
    "https://json-schema.org/draft/2019-WIP/vocab/format": true,
    "https://json-schema.org/draft/2019-WIP/vocab/content": true,
    "https://myserver.net/my-vocab": true
  },
  • Angus' understanding of vocabulary [1]:
    • Vocabularies allow meta schemas to define custom keywords, e.g. a units keyword that adds units to an integer:
       {
           "type": "number",
           "units": "kg/s"
       }
    • One must create a new metaschema that defines these vocabularies, and copies the meta-schema that it "inherits" from (or use allOf?)
    • The $vocabulary section of a metaschema lists the vocabularies, and a boolean flag of whether they constitute a failure if they cannot be located. The units keyword above does not affect validation, so it can safely be ignored if the validator cannot find the URI (it's metadata). Other keyword schemas might not be so permissive:
       {
           "type": "number",
           "isEven": true
       }
      A validator that ignores this keyword would incorrectly accept documents with odd integers, but the essence is still upheld. A keyword that changed the "type" could not be ignored if the validator is to be at all useful.

      Modern JSON Schema introduces vocabularies, which allow you to define a group of keywords and identify them with a URI. Schema authors can then use that URI to tell implementations that they need to support the vocabulary in order to use the schema. If they can't, instead of failing validation, the implementation refuses to run the schema and indicates which vocabularies it doesn't understand. [2]

    • i.e. $vocabulary solves the problem of "is this failure an 'unrecoverable' error?".
    • We could use this to introduce a top-level extraSchemas field (?)
      • Crucially, it means that validators that don't understand what to do with extraSchemas don't try and validate the document.
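The required/optional distinction in the notes above can be sketched as a small check. This is only an illustration of the idea: `can_process` is a hypothetical helper, the vocabulary URIs are placeholders rather than real vocabularies, and no actual validator API is being described.

```python
def can_process(metaschema, supported_vocabularies):
    """Decide whether a validator may use schemas written against `metaschema`.

    In "$vocabulary", a value of true marks a vocabulary as required and
    false marks it as optional. Returns (ok, missing_required).
    """
    declared = metaschema.get("$vocabulary", {})
    missing = sorted(
        uri
        for uri, required in declared.items()
        if required and uri not in supported_vocabularies
    )
    return (not missing, missing)


# Hypothetical metaschema: the URIs below are placeholders, not real vocabularies.
metaschema = {
    "$vocabulary": {
        "https://example.invalid/vocab/core": True,
        "https://example.invalid/vocab/units": False,  # annotation only, optional
        "https://example.invalid/vocab/is-even": True,  # affects validation
    }
}

supported = {"https://example.invalid/vocab/core"}
print(can_process(metaschema, supported))
```

Here the unsupported `units` vocabulary is tolerated because it is optional, while the unsupported `is-even` vocabulary makes the validator refuse to run, which is the behaviour the notes hope to rely on for something like extraSchemas.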

Challenges

flowchart
    mimetypes --> IANA
    multiple_schema[multiple schema]
    validation --> validation_report[validation report]
    JEP --> end_meeting[end this meeting]
  • Extra schemas: Failure modes
    • How can our approaches fail?
      • two conflicting extra schemas
    • How can users save themselves if we break stuff? what happens when code/clients break?

Reference

JEP Drafts

References

March 14th, 2023

| Name | Affiliation | GitHub |
| --- | --- | --- |
| tonyfast | | @tonyfast |
| Steve Purves | Curvenote | @stevejpurves |
| Jason Grout | Databricks | @jasongrout |
| Angus Hollands | Princeton University | @agoose77 |
| Nick Bollweg | GTech | @bollwyvl |

Agenda

March 21

no meeting

Footnotes

  1. https://json-schema.org/learn/glossary.html

  2. https://modern-json-schema.com/what-is-modern-json-schema

@tonyfast
Contributor

tonyfast commented Apr 3, 2023

here are the notes from last week. see y'all tomorrow. please add anything you might like to talk about to the agenda.

March 28th, 2023

| Name | Affiliation | GitHub |
| --- | --- | --- |
| tonyfast | | @tonyfast |
| Nick Bollweg | GTech | @bollwyvl |
| Steve Purves | Curvenote | @stevejpurves |
| Afshin T. Darian | QuantStack | @afshin |

Agenda

  • markdown text format updates

  • problem with jeps: they aren't validated

    • how could we use notebooks to validate jeps? how can we validate schema?
  • [name=Nick] need to be able to reference schema from schema

    • portable cells that can copy and paste across documents
    • treat all cells as the same
    • attachments are broken (not discoverable), UI is busted
    • slugifying headers isn't consistent across implementations
    • eventually register cells as a mimetype
  • extra schema use cases

    • jupyter.org could/should host schema as an official namespace
    • the format should avoid content validation unless explicitly in the purview of the schema
    • vendors could provide content validation
  • how to demo?

    • nbformat, jupyter server
    • traitlets to schema

@tonyfast
Contributor

attaching notes from last week's meeting. see folks tomorrow.

April 4th, 2023

| Name | Affiliation | GitHub |
| --- | --- | --- |
| tonyfast | | @tonyfast |
| jeremy ravenal | naas | @jravenel |
| Angus Hollands | Princeton University | @agoose77 |
| Afshin T. Darian | QuantStack | @afshin |

Agenda

  • schema provide solutions for validation and ui
  • angus on extra schemas
    • platform to design different notebooks that will allow different input cells.
    • extension authors can encode some other validation logic
    • mainly useful for the front end.
  • what is the history of this meeting?
    • spun out of a workshop that discussed modifying the notebook format
    • we talked about different ways to create new cells types: is there one code cell or many different cells?
    • there is no nice way for myst to store metadata, there is no way to enshrine metadata in the schema
  • naas - push notebooks to production seamlessly. package software, data, and chats.
  • sell things with demos
    • raw cell bolt-on
    • add a MIME / text entry widget to the cell view
      • specify the mimetype of the contents
    • add a custom renderer based upon mime type
      • HTML
      • SVG
      • Form generation (one way, code generation, hacky!)
    • Execute cell has two steps
      • Two steps, one "executes", another "renders"
    • ability to select output mimetype as well?
  • we talked a lot about cell types
    • use raw cells for another cell type
    • use the kernelspec in cells to identify cell actions
  • steps to getting JEP accepted
    • what implementation will need updating? (open question)
  • Split these conversations:
    • $schema - uncontentious
    • extraSchemas - motivates extension validation, needs discussion
    • @context, annotation, etc. additional discussions!

@tonyfast
Contributor

hey folks. i likely will miss the meeting today. hopefully someone else can drive the ship. the hackmd is all set up https://hackmd.io/@tonyfast/H1Xnx1B12

@bollwyvl bollwyvl mentioned this pull request Apr 20, 2023
30 tasks
@tonyfast
Contributor

tonyfast commented May 1, 2023

April 25th, 2023

| Name | Affiliation | GitHub |
| --- | --- | --- |
| tonyfast | | @tonyfast |
| Nick Bollweg | GTech | @bollwyvl |

Agenda

@tonyfast
Contributor

tonyfast commented May 2, 2023

May 2nd, 2023

| Name | Affiliation | GitHub |
| --- | --- | --- |
| tonyfast | | @tonyfast |
| Angus Hollands | Princeton University | @agoose77 |

Agenda

is there someone with the proper rights to label these JEPs?

@fcollonval
Contributor

@/all (but especially @jupyter/software-steering-council) in 0871ad1 I updated the schema URI to align with JEP #108; i.e. from https://jupyter.org/schema/notebook/notebook-{nbformat}.{nbformat_minor}.schema.json to https://schema.jupyter.org/notebook/v{nbformat}.{nbformat_minor}/notebook.json

For this particular URI I did not use a subproject (as allowed by the JEP). Let me know if it needs further changes.
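For readers who want to see the updated URI pattern applied, here is a small sketch that builds a schema URI from version numbers and parses it back. The template string comes from the comment above; the helper names `schema_uri` and `version_from_uri` and the example version (4, 5) are assumptions. It is string handling only, and nothing is fetched from the network.

```python
import re

# Template taken from the comment above; the helper names are hypothetical.
URI_TEMPLATE = "https://schema.jupyter.org/notebook/v{nbformat}.{nbformat_minor}/notebook.json"
URI_PATTERN = re.compile(
    r"^https://schema\.jupyter\.org/notebook/v(\d+)\.(\d+)/notebook\.json$"
)


def schema_uri(nbformat, nbformat_minor):
    """Build the schema URI for a given major/minor version pair."""
    return URI_TEMPLATE.format(nbformat=nbformat, nbformat_minor=nbformat_minor)


def version_from_uri(uri):
    """Recover (major, minor) from a schema URI, or None if it does not match."""
    match = URI_PATTERN.match(uri)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


uri = schema_uri(4, 5)
print(uri)
print(version_from_uri(uri))
```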

@fcollonval
Contributor

The vote is now closed with the results:

In favor: 8
Against: 0
Abstention: 0
No vote: 3

--> In light of those results, this JEP is accepted.
