Jupyter Telemetry Enhancement Proposal #41

Closed
wants to merge 5 commits into from

jaipreet-s

Contains two accompanying files

  • Press Release
  • Technical proposal

cc @yuvipanda @Zsailer

@westurner

A couple thoughts:

  • Pluggable persistence would likely eventually be an objective
  • Should folks use this event bus / messaging system for non-Jupyter application message persistence? Or "this is for logging structured metrics for Jupyter and extensions only"?

@choldgraf (@mybinder) and I were just talking about how to profile BinderHub container launches
https://twitter.com/westurner/status/1142175356880900102 :

https://binderhub.readthedocs.io/en/latest/overview.html#a-diagram-of-the-binderhub-architecture

But there's nothing that can easily profile all of the layers of the distributed stack for a given container launch request (when the image is already cached)? Maybe @sysdig?
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/#sysdig

Sysdig pulls together data from system calls, Kubernetes events, Prometheus metrics, statsD, JMX, and more into a single pane that gives you a comprehensive picture of your environment.

JSON with a JSON Schema should be easy enough to integrate with a tool like sysdig, for example.

Presumably there'd be sinks for the supported persistence backends. Would there be a standard interface for reviewing telemetry events and quantitative metrics from within Notebook or JupyterLab; or would users be expected to also configure Grafana / ELK / Loki / Splunk / Sentry?

I'm not at all familiar with the Wikimedia or Mozilla telemetry systems; so, is this a JSON message store with input validation?

30-telemetry/proposal.md (outdated)
`app` object. They should use the core eventlogging library directly, and admins
should be able to configure it as they would a standalone application.

#### Authenticated routing service
Member

This would be a "parallel universe" to https://github.com/jupyter/enhancement-proposals/pull/41/files#diff-5c74b6c64dfb44b841261c64623c9c6eR140 right? As in as an extension I could send events to either of these and they'd end up in the same sinks?

Author

I'm going to defer to @yuvipanda on the JupyterHub functionality 😃

@betatim
Member

betatim commented Jul 6, 2019

One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks? From the later parts of the proposal this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks.

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

@yuvipanda

I wrote up https://github.com/jupyterlab/jupyterlab-telemetry/blob/master/design.md earlier which has informed a lot of choices in this, and has a ton of background material as well. Would recommend reading :)

@betatim
Member

betatim commented Jul 6, 2019

I've read it previously and now but I don't think it answers my questions.

@westurner

westurner commented Jul 6, 2019 via email

@yuvipanda

@betatim:

I've read it previously and now but I don't think it answers my questions.

Apologies, that wasn't directed at you - just a general comment to those who might not have seen it yet.

@betatim
Member

betatim commented Jul 7, 2019

One more thing I forgot to write down last time: I think adding a field to the messages that lets someone looking at the logs later tell if this message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this via having a "source" attribute that is added by a trusted component. I think for audit purposes anything that frontend sends is "useless" because that could have been tampered with by the user (I think).
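A minimal sketch of that idea, assuming a server-side component stamps every event before it is logged. The field name `source`, the helper names, and the example events are hypothetical and not part of the proposal; the point is only that the server overwrites whatever the client sent.

```python
import copy

# Hypothetical field name for illustration; the proposal does not define it.
SOURCE_FIELD = "source"


def stamp_source(event: dict, trusted: bool) -> dict:
    """Return a copy of `event` with a server-assigned source field."""
    stamped = copy.deepcopy(event)
    # Overwrite any client-supplied value so a frontend cannot fake it.
    stamped[SOURCE_FIELD] = "trusted" if trusted else "untrusted"
    return stamped


def audit_trail(events: list) -> list:
    """Keep only events that the server marked as trusted."""
    return [e for e in events if e.get(SOURCE_FIELD) == "trusted"]


# A frontend event that tries to claim it is trusted, and a server-side event.
frontend_event = {"event_name": "cell_executed", "source": "trusted"}
server_event = {"event_name": "kernel_started"}

log = [
    stamp_source(frontend_event, trusted=False),
    stamp_source(server_event, trusted=True),
]
```

Here `audit_trail(log)` keeps only the server-side event, even though the frontend event arrived claiming to be trusted.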


#### Open Questions

1. Is this work done on the standalone jupyter-server implementation or on the classic jupyter/notebook?

I think it should be in the jupyter/notebook package for now, since I think that's going to see active use for the next 3-5 years.

Author

@jaipreet-s jaipreet-s Jul 10, 2019

Yeah, we discussed this, but this doc hasn't been updated. We do need to have some plan for porting these changes into jupyter_server. @Zsailer since you are also close to jupyter_server, WDYT?

Update: Just committed a change to re-word this open question.

Member

Yeah, there's some effort to port PRs from notebook to jupyter_server in jupyter-server/jupyter_server#53.

I agree with Yuvi. This should go into notebook and be ported to/mirrored in jupyter_server. I think we're going to be stuck with constantly syncing/porting PRs for a while.

@westurner

I think adding a field to the messages that lets someone looking at the logs later tell if this message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this via having a "source" attribute that is added by a trusted component. I think for audit purposes anything that frontend sends is "useless" because that could have been tampered with by the user (I think).

Re: components self-identifying as "trusted"

Private key integrity may be the most challenging part of this. A JS app running in a browser (with the obfuscated or unobfuscated source available) does not have a secure enclave within which to store a cryptographic key to be used for signing messages. A JS or Python component would need to generate message signing keys which are then somehow approved as trusted.

CSRF mitigations like per-request token generation may negatively affect performance because there's a shortage of randomness (entropy).
https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/Cross-Site_Request_Forgery_Prevention_Cheat_Sheet.md#csrf-defense-recommendations-summary

There's already the Jupyter auth token; though that's not per-component and AFAIU is not designed to be used as a message signing key.

@westurner

HMAC ("hash-based message authentication code") tokens are one way to mitigate the risk of CSRF (a different thing submitting a message as a trusted thing)
https://en.wikipedia.org/wiki/HMAC

Because JSON message key orderings are not necessarily stable (the key order may be different if an attribute is deleted and then inserted again later, for example), the cryptographic hash or signature varies unless the message is canonicalized first. json.dumps(sort_keys=True) is basically a message canonicalization algorithm.
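As a rough sketch of the canonicalize-then-sign step using only the Python standard library: the helper names, the example messages, and the demo key below are assumptions for illustration. Note that `json.dumps(sort_keys=True)` is not a full canonicalization scheme (for example, it does not normalize number formatting the way RFC 8785 does).

```python
import hashlib
import hmac
import json


def canonicalize(message: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so key order cannot change the bytes."""
    return json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(message: dict, key: bytes) -> str:
    """HMAC-SHA256 over the canonical form, as a hex string."""
    return hmac.new(key, canonicalize(message), hashlib.sha256).hexdigest()


def verify(message: dict, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(message, key), signature)


key = b"demo-pre-shared-key"  # placeholder; a real key must not live in source code

# Same content, different key order.
a = {"event_name": "cell_executed", "seq": 1}
b = {"seq": 1, "event_name": "cell_executed"}

signature = sign(a, key)
```

Because both messages canonicalize to the same bytes, `verify(b, key, signature)` succeeds, while a message with a different `seq` fails verification.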

Linked Data Signatures have (URIs for) signature suites, message canonicalization algorithms, and message digest algorithms. This makes things future proof in that instead of saying this is jupyter_telemetry_message_format v2, you specify the proof type (which defines a canonicalizationAlgorithm, digestAlgorithm, and proofAlgorithm)
https://w3c-dvcg.github.io/ld-signatures/#terminology

{
  "@context": "https://w3id.org/identity/v1",
  "title": "Hello World!",
  "proof": {
    "type": "RsaSignature2018",
    "creator": "https://example.com/i/pat/keys/5",
    "created": "2017-09-23T20:21:34Z",
    "domain": "example.org",
    "nonce": "2bbgh3dgjg2302d-d2b3gi423d42",
    "proofValue": "eyJ0eXAiOiJK...gFWFOEjXk"
  }
}

https://w3c-dvcg.github.io/ld-signatures/#signature-suites :

{
  "id": "https://w3id.org/security#RsaSignature2018",
  "type": "SignatureSuite",
  "canonicalizationAlgorithm": "https://w3id.org/security#GCA2015",
  "digestAlgorithm": "https://www.ietf.org/assignments/jwa-parameters#SHA256",
  "proofAlgorithm": "https://www.ietf.org/assignments/jws-parameters#RSASSA-PSS"
}

https://web-payments.org/vocabs/security#LinkedDataSignature2015 :

{
  "@context": ["https://w3id.org/security/v1", "http://json-ld.org/contexts/person.jsonld"],
  "@type": "Person",
  "name": "Manu Sporny",
  "homepage": "http://manu.sporny.org/",
  "signature": {
    "@type": "LinkedDataSignature2015",
    "creator": "http://manu.sporny.org/keys/5",
    "created": "2015-09-23T20:21:34Z",
    "signatureValue": "OGQzNGVkMzVmMmQ3ODIyOWM32MzQzNmExMgoYzI4ZDY3NjI4NTIyZTk="
  }
}

HMACs use symmetric keys (pre-shared key),
cryptographic signatures use asymmetric keys (public and private keys). In either case, if a key is kept in code and/or RAM, it's really not that secret.
https://gist.github.com/westurner/4345987bb29fca700f52163c339a270f#gistcomment-2822602

... What's a good way for a component to indicate that it's trusted?

jaipreet-s and others added 2 commits July 10, 2019 12:04
Co-Authored-By: Tim Head <betatim@gmail.com>
@jaipreet-s
Author

jaipreet-s commented Jul 10, 2019

One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks? From the later parts of the proposal this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks.

Hi @betatim ,
Thanks for the feedback!

The router fundamentally decouples event publishers from event consumers. For example, without the router, if an event sink's interface is updated or an event sink is replaced, each event publisher will need to be updated to use the new interface. With the router, this is not an issue, since publishers still talk to the router and event sinks can be added or dropped via the telemetry_event_sinks configuration.

In addition, the router abstracts common functionality that would otherwise have to be implemented by each event sink, such as the items listed in the Core Event Router section:

  • Schema validation
  • A mechanism for adding metadata fields
  • Dropping events that are not whitelisted in a given deployment
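To make the decoupling concrete, here is a toy sketch of such a router. It is not the implementation from the proposal: the class name, the schema IDs, and the metadata are made up, sinks are plain callables, and "validation" is reduced to a required-key check standing in for real JSON Schema validation.

```python
from typing import Callable, Dict, List, Set

Sink = Callable[[dict], None]


class EventRouter:
    """Toy router: publishers call publish(); sinks are plain callables."""

    def __init__(self, sinks: List[Sink], allowed_schemas: Set[str],
                 required_fields: Dict[str, List[str]], metadata: dict):
        self.sinks = sinks
        self.allowed_schemas = allowed_schemas
        self.required_fields = required_fields
        self.metadata = metadata

    def publish(self, schema_id: str, event: dict) -> bool:
        # Drop events whose schema is not allowed in this deployment.
        if schema_id not in self.allowed_schemas:
            return False
        # Stand-in for JSON Schema validation: only checks required keys.
        missing = [k for k in self.required_fields.get(schema_id, []) if k not in event]
        if missing:
            raise ValueError(f"event missing required fields: {missing}")
        # Add router-side metadata, then fan out to every configured sink.
        enriched = {**event, **self.metadata, "schema_id": schema_id}
        for sink in self.sinks:
            sink(enriched)
        return True


file_sink: List[dict] = []     # pretend these lists are two different backends
console_sink: List[dict] = []
router = EventRouter(
    sinks=[file_sink.append, console_sink.append],
    allowed_schemas={"example.org/cell-executed"},
    required_fields={"example.org/cell-executed": ["event_name"]},
    metadata={"deployment": "demo"},
)
accepted = router.publish("example.org/cell-executed", {"event_name": "cell_executed"})
dropped = router.publish("example.org/not-allowed", {"event_name": "other"})
```

Swapping or adding a sink only changes the `sinks` list passed to the router; the publisher code calling `publish()` stays the same.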

@jaipreet-s
Author

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT?

@jaipreet-s
Author

Hi @betatim and @westurner - sorry for being late to get back re: components self-identifying as "trusted"

These are all good points. The current implementation of the event publisher interface makes it possible for publishers to do this themselves, and also for consumers to validate the trust/integrity at their end.

That said, we should consider offering ways to make this easier to do for publishers. jupyter/telemetry#21 has a few ideas on how to provide this functionality.

@Zsailer
Member

Zsailer commented Aug 7, 2019

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT?

@betatim and @jaipreet-s

Yes, this proposal is trying to communicate that we're injecting telemetry across various "layers" of the Jupyter stack (i.e. Kernel, Server, Lab, Hub, etc.). We want everyone to be aware of these changes without fear that "Jupyter is secretly collecting data about users". We'll provide tools for admins to inform users that data is being collected. And, like @jaipreet-s said, we'll likely provide UI in JupyterLab that allows users to have some control over event collection.

We could remove the technical design plans for "consent" from this proposal and make that a separate discussion if necessary, but I don't think we should remove the language that we care about user privacy and awareness.

@betatim
Member

betatim commented Aug 8, 2019

I think my main point was that I'd avoid talking about user choice and audit trails in the same part of the document because they have such different requirements. They can't be reconciled, but that is fine as they are two very different things :)

@Zsailer
Member

Zsailer commented Aug 8, 2019

I'd avoid talking about user choice and audit trails in the same part of the document

That makes sense—these are really two different experiences/environments. Maybe we should split that bit into two different paragraphs (assuming that you're talking about the press-release document right now).

  • One paragraph about environments where the user is offering consent for the admin or extension developer to collect data.
  • Another paragraph about strictly controlled environments where auditing is required. In this case, Jupyter provides tools that make it easy for the environment admin to inform users that auditing is happening.

In both cases, we're communicating that Jupyter's stance is that administrators should be transparent with users.

@adpatter

This is an example of a potential use case:

Our telemetry project, ETC JupyterLab Telemetry Extension, captures user interactions and logs these messages to a specified handler. The ETC JupyterLab Telemetry Example repo gives an example of the service provided by the extension being consumed and the events being logged to console.log.

Presently, we are capturing several user interactions with the Notebook:

  • Active Cell Changed
  • Cell Added
  • Cell Executed
  • Cell Removed
  • Notebook Opened
  • Notebook Saved
  • Notebook Scrolled

For each event, a list of cells relevant to the event is captured as well. This is described here. The messages include a list of relevant cells and the present state of the Notebook. Cell contents that have been seen before get replaced with a cell 'ID' in order to save storage space, which still allows the state of the Notebook to be reconstructed at a later time. The reason I point that out is that there might be use cases where multiple schemas could be registered for a single event.
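The "replace seen contents with an ID" step could look roughly like the sketch below. This is an assumption about the general technique, not the extension's actual code: the class name, the use of a SHA-256 content hash, and the example cells are all hypothetical.

```python
import hashlib
from typing import Dict, List


class CellDeduplicator:
    """Send a cell's full contents once; afterwards send only its ID until it changes."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # cell id -> hash of the contents last sent

    def compact(self, cells: List[dict]) -> List[dict]:
        compacted = []
        for cell in cells:
            digest = hashlib.sha256(cell["source"].encode("utf-8")).hexdigest()
            if self._seen.get(cell["id"]) == digest:
                # Contents already logged: keep only the reference fields.
                compacted.append({"id": cell["id"], "index": cell["index"]})
            else:
                # New or changed contents: log the full cell and remember its hash.
                self._seen[cell["id"]] = digest
                compacted.append(dict(cell))
        return compacted


dedup = CellDeduplicator()
cells = [{"id": "a1", "index": 0, "source": "x = 1"}]
first = dedup.compact(cells)    # full contents on first sight
second = dedup.compact(cells)   # only id/index on repeat
edited = dedup.compact([{"id": "a1", "index": 0, "source": "x = 2"}])  # full again after an edit
```

A consumer replaying the log in order can then rebuild the Notebook state by substituting the most recent full contents for each bare ID.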

This JSON schema matches the event messages:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "event_name": {
      "type": "string"
    },
    "cells": {
      "type": "array",
      "items": [
        {
          "type": "object",
          "properties": {
            "id": {
              "type": "string"
            },
            "index": {
              "type": "integer"
            }
          },
          "required": [
            "id",
            "index"
          ]
        }
      ]
    },
    "notebook": {
      "type": "object",
      "properties": {
        "metadata": {
          "type": "object",
          "properties": {
            "kernelspec": {
              "type": "object",
              "properties": {
                "display_name": {
                  "type": "string"
                },
                "language": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                }
              },
              "required": [
                "display_name",
                "language",
                "name"
              ]
            },
            "language_info": {
              "type": "object",
              "properties": {
                "codemirror_mode": {
                  "type": "object",
                  "properties": {
                    "name": {
                      "type": "string"
                    },
                    "version": {
                      "type": "integer"
                    }
                  },
                  "required": [
                    "name",
                    "version"
                  ]
                },
                "file_extension": {
                  "type": "string"
                },
                "mimetype": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                },
                "nbconvert_exporter": {
                  "type": "string"
                },
                "pygments_lexer": {
                  "type": "string"
                },
                "version": {
                  "type": "string"
                }
              },
              "required": [
                "codemirror_mode",
                "file_extension",
                "mimetype",
                "name",
                "nbconvert_exporter",
                "pygments_lexer",
                "version"
              ]
            }
          },
          "required": [
            "kernelspec",
            "language_info"
          ]
        },
        "nbformat_minor": {
          "type": "integer"
        },
        "nbformat": {
          "type": "integer"
        },
        "cells": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "cell_type": {
                "type": "string"
              },
              "source": {
                "type": "string"
              },
              "metadata": {
                "type": "object",
                "properties": {
                  "trusted": {
                    "type": "boolean"
                  }
                },
                "required": [
                  "trusted"
                ]
              },
              "execution_count": {
                "type": "null"
              },
              "outputs": {
                "type": "array",
                "items": {}
              },
              "id": {
                "type": "string"
              }
            },
            "required": [
              "id"
            ]
          }
        }
      },
      "required": [
        "metadata",
        "nbformat_minor",
        "nbformat",
        "cells"
      ]
    },
    "seq": {
      "type": "integer"
    },
    "notebook_path": {
      "type": "string"
    },
    "user_id": {
      "type": "string"
    }
  },
  "required": [
    "event_name",
    "cells",
    "notebook",
    "seq",
    "notebook_path",
    "user_id"
  ]
}

Please let me know if anyone has any questions regarding our use case.

@jaipreet-s
Author

Hi @Zsailer - Do you think we can close this PR now? It hasn't had active discussion for a while now :) Thanks!

@Zsailer
Member

Zsailer commented Apr 25, 2023

I'm going to close this enhancement proposal, as it has been mostly implemented anyway.

For folks reading this in the future, the work evolved and now resides in jupyter_events.

I think it's still worth opening a new JEP that describes the Jupyter Event System and documents how other projects should leverage this work going forward. In many follow-on discussions, we are aiming to put jupyter_events in many layers of the Jupyter stack. A JEP would help define best practices.

Further, we're in the process of creating a schema.jupyter.org subdomain where all Jupyter Event JSON schemas should be published.

@Zsailer Zsailer closed this Apr 25, 2023