This document supplements implementations.md
and has sections detailing the
eventlogging design that can be common to various parts of the Jupyter
ecosystem. These two documents will co-evolve - as we think more about
implementation, the design will change, and vice versa.
The primary reasons for collecting such data are:
-
Better understanding of usage of their infrastructure. This might be for capacity planning, metrics, billing, etc
-
Auditing requirements - for security or legal reasons. Our Telemetry work in the Jupyter project is necessary but not sufficient for this, since it might have more stringent requirements around secure provenance & anti-tampering.
-
UX / UI Events from end user behavior. These are often targeted measurements to help UX designers / developers determine if particular UX decisions are meeting their goals.
-
Operational metrics. Prometheus metrics should be used for most operational metrics (error rates, percentiles of server or kernel start times, memory usage, etc). However, some operational data is much more useful when lossless than when sampled, such as server start times or contentmanager usage.
Both Metrics and Events are telemetry, but are fundamentally different. Katy Farmer explains it thus:
I want to keep track of my piggy bank closely. Right now, there’s only one metric I care about: total funds. Anyone can put money into my piggy bank, so I want to report the total funds at a one-minute interval. This means that every minute, my database will receive a data point with the timestamp and the amount of total funds in my piggy bank.
Now, I want to track specific events for my piggy bank: deposits and withdrawals. When a deposit occurs, my database will receive a data point with the “deposit” tag, the timestamp and the amount of the deposit. Similarly, when a withdrawal occurs, my database will receive a data point with the “withdrawal” tag, the timestamp and the amount of the withdrawal.
Imagine now that this is the same basic idea behind online banking. We could add more metadata to add detail to the events, like attaching a user ID to a deposit or withdrawal.
Metrics let us answer questions like 'what is the 99th percentile start time for our user servers over the last 24 hours?' or 'what is the current rate of 5xx errors in notebook servers running on machines with GPUs?'. They have limited cardinality, and are usually aggregated at the source. Usually, they are pulled into a central location at regular intervals. They rarely contain any PII, although they might leak some if we are not careful. These are primarily operational. We already support metrics via the [prometheus](https://prometheus.io/) protocol in JupyterHub, Notebook Server and BinderHub. This is heavily used in a bunch of places - see the public [grafana instance](https://grafana.mybinder.org/) showing visualizations from metrics data, and documentation about what is collected.
Events let us answer questions like 'which users opened a notebook named this in the last 48h?' or 'what JupyterLab commands have been executed most when running with an IPython kernel?'. They have much more information in them, and do not happen with any regularity. Usually, they are also 'pushed' to a centralized location, and often contain PII - so they need to be treated carefully. BinderHub emits events around repos launched there, and the mybinder.org team has a very small pipeline that cleans these events and publishes them at archive.analytics.mybinder.org for the world to see.
This document focuses primarily on Events, and doesn't talk much about metrics.
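The piggy bank example above can be sketched in a few lines of Python. This is only an illustration of the distinction: the class names, field names and sample values are made up for this sketch and are not part of any Jupyter API. It shows that a metric sample can be derived from the events, while the events (who deposited what, and when) cannot be recovered from the metric.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One lossless record per deposit or withdrawal."""
    timestamp: int   # seconds since the piggy bank was set up
    kind: str        # "deposit" or "withdrawal"
    amount: float
    user_id: str     # extra metadata; this is the kind of field that can be PII


@dataclass
class MetricSample:
    """One aggregated data point per reporting interval."""
    timestamp: int
    total_funds: float


def sample_total_funds(events, at):
    """Aggregate all events up to time `at` into a single metric sample."""
    total = 0.0
    for event in events:
        if event.timestamp <= at:
            sign = 1 if event.kind == "deposit" else -1
            total += sign * event.amount
    return MetricSample(timestamp=at, total_funds=total)


# Hypothetical data: irregular events, each carrying its own detail.
events = [
    Event(5, "deposit", 10.0, "alice"),
    Event(20, "deposit", 5.0, "bob"),
    Event(70, "withdrawal", 3.0, "alice"),
]

# The metric view: one number per minute, with no record of who did what.
samples = [sample_total_funds(events, at) for at in (60, 120)]
```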
-
End Users
Primary stakeholder, since it is their data. They have a right to know what information is being collected about them. We should make this the default, and provide automated, easy to understand ways for them to see what is collected about them.
-
Operators
The operators of the infrastructure where various Jupyter components run are the folks interested in collecting various bits of Events. They have to:
a. Explicitly decide what kinds of Events at what level they are going to be collecting and storing
b. Configure where these Events need to go. It should be very easy for them to integrate this with the rest of their infrastructure.
By default, we should not store any Events unless an operator explicitly opts into it.
-
Developers
Developers will be emitting Events from various parts of the code. They should only be concerned about emitting Events, and not about policy enforcement around what should be kept and where it should be stored. We should also provide easy interfaces for them to emit information in various places (backends, frontends, extensions, kernels, etc)
-
Analysts
These are the folks actually using the event data to make decisions, and hence the ultimate consumers of all this data. They should be able to clearly tell what the various fields in the data represent, and how complete it is. We should also make the data easily consumable by common analyst tools - such as pandas, databases, data lakes, etc.
We aren't the first group to try to design a unified eventlogging system that is easy to use, transparent and privacy preserving by default. Here are some examples of prior art we can draw inspiration from.
-
Wikimedia's EventLogging
A simple and versatile system that can scale from the needs of a small organization running MediaWiki to the 7th largest website in the world. The Guide lays out the principles behind how things work and why they work the way they do. The Operational Information Page shows how this is configured in a large scale installation.
Let's take an example case to illustrate this.
Each eventlogging use case must be documented in a public schema. This schema documents events collected about account creation. This is very useful for a variety of stakeholders.
-
Users can see what information is being collected about them if they wish.
-
Analysts know exactly what each field in their dataset means
-
Operators can use this to perform automatic data purging, anonymizing or other retention policies easily. See how wikimedia does it, to be compliant with GDPR and friends.
-
Developers can easily log events that conform to the schema with standardized libraries that are provided for them, without having to worry about policy around recording and retention. See some sample code to get a feel for how it is.
Thanks to Ori Livneh, one of the designers of this system, for conversations that have influenced how the Jupyter Telemetry system is being designed.
-
Mozilla's Telemetry system
Firefox runs on a lot of machines, and has a lot of very privacy conscious users & developers. Mozilla has a well thought out [data collection policy](https://wiki.mozilla.org/Firefox/Data_Collection).
There is a technical overview of various capabilities available. Their events system is most similar to what we want here. Similar to the wikimedia example, every event must have a corresponding schema, and you can see all the schemas in their repository. They also provide easy ways for developers to emit events from the frontend JS.
There is a lot more information in their telemetry data portal, particularly around how analysts can work with this data.
-
Debian 'popularity contest'
The Debian project has an opt-in way to try to map the popularity of various packages used in end user systems with the popularity contest. It is a purely opt-in system, and records packages installed in the system and the frequency of their use. This is sort of anonymously, sort of securely sent to a centralized server, which then produces useful graphs. Ubuntu and NeuroDebian run versions of this as well for their own packages.
This is different from the other systems in being extremely single purpose, and not particularly secure in terms of user privacy. This model might be useful for particular things that need to work across a large swath of the ecosystem - such as package usage metrics - but is of limited use in Jupyter itself.
-
Homebrew's analytics
The popular OS X package manager Homebrew collects information about usage with Google Analytics. This is very similar to the Debian popularity contest system, except it sends events to a third party (Google Analytics) instead. You can opt out of it if you wish.
-
Bloomberg?
Paul Ivanov mentioned that Bloomberg has their own data collection system around JupyterLab. Would be great to hear more details of that here.
-
Other organizations
Everyone operating at scale has some way of doing this kind of analytics pipeline. Would be great to add more info here!
-
Schema
Each event type needs a JSON Schema associated with it. This schema is versioned to allow analysts, operators and users to see when new fields are added / removed. The descriptions should also be clear enough to inform users of what is being collected, and analysts of what they are actually analyzing. We could also use this to mark specific fields as PII, which can then be automatically mangled, anonymized or dropped.
-
EventLogging Python API
A simple Python API that lets server-side code (JupyterHub, Notebook Server, Kernel, etc.) emit events. This will:
- Validate the events to make sure they conform to the schema they claim to represent.
- Look at traitlet configuration to see if the events should be dropped, and immediately drop it if so. So nothing leaves the process unless explicitly configured to do so.
- Filter / obfuscate / drop PII if configured so.
- Wrap the event in an event capsule with common information for all events - timestamp (of sufficient granularity), schema reference, origin, etc.
- Emit the event to a given 'sink'. We should leverage the ecosystem built around Python Loggers for this, so we can send events to a wide variety of destinations - files, files with automatic rotation, arbitrary HTTP output, Kafka, Google Cloud's Stackdriver, AWS CloudWatch, ElasticSearch and many many more. This should help integrate with whatever systems the organization is already using.
This helps us centralize all the processing around event validity, PII handling and sink configuration. Organizations can then decide what to do with the events afterwards.
-
EventLogging REST API
This is an HTTP endpoint to the Python API, and is a way for frontend JavaScript and other remote clients to emit events. It could exist in many places:
- Inside JupyterHub, and all events can be sent via that.
- Inside Jupyter Notebook Server, so it can collect info from the user running it. The Notebook Server can then send it someplace.
- A standalone service, that can be sent events from everywhere.
By separating (2) and (3), we can cater to a variety of scales and use cases.
-
EventLogging JavaScript API
This is the equivalent of the Python API, but in JavaScript.
It should receive configuration in a similar way as the server-side APIs, but be able to send events to various sinks directly instead of being forced to go through the REST API. This is very useful in cases where events should be sent directly to a pre-existing collection service - such as Google Analytics or mixpanel. Those can be supported as various sinks that plug into this API, so the code that is emitting the events can remain agnostic to where they are being sent.
The default sink can be the REST API, but we should make sure we implement at least 1 more sink to begin with so we don't overfit our API design.
We should be careful to make sure that these events still conform to schemas, need to be explicitly turned on in configuration, and follow all the other expectations we have around eventlogging data.
-
User consent / information UI
Every application collecting data should have a way to make it clear to the user what is being collected, and possibly ways to turn it off. We could possibly let admins configure opt-in / opt-out options.
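The steps listed for the Python API above (validate, drop unless configured, handle PII, wrap in a capsule, emit to a sink) can be sketched with only the standard library. This is a minimal sketch under stated assumptions, not the actual implementation: the `EventLog` class, its method names, the example schema, and the `pii` marker on properties are all hypothetical. The validation step is a simple required/unknown property check standing in for full JSON Schema validation, and plain constructor arguments stand in for traitlet configuration.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical schema. "pii" is an illustrative marker, not a JSON Schema keyword.
EXAMPLE_SCHEMA = {
    "$id": "jupyter.org/example-project/server-start",
    "version": 1,
    "type": "object",
    "properties": {
        "username": {"type": "string", "description": "User who started the server", "pii": True},
        "duration_seconds": {"type": "number", "description": "Time taken to start"},
    },
    "required": ["username", "duration_seconds"],
}


class EventLog:
    """Hypothetical event emitter that uses logging handlers as sinks."""

    def __init__(self, handlers=(), allowed_schemas=(), drop_pii=True):
        self.allowed_schemas = set(allowed_schemas)  # empty by default: nothing is emitted
        self.drop_pii = drop_pii
        self.schemas = {}
        # A dedicated logger, so events never mix with ordinary log output.
        self.logger = logging.getLogger(f"eventlog-sketch-{id(self)}")
        self.logger.propagate = False
        self.logger.setLevel(logging.INFO)
        for handler in handlers:
            self.logger.addHandler(handler)

    def register_schema(self, schema):
        self.schemas[(schema["$id"], schema["version"])] = schema

    def record_event(self, schema_id, version, event):
        schema = self.schemas[(schema_id, version)]
        properties = schema["properties"]

        # 1. Validate (stand-in check; a real version would use a JSON Schema validator).
        missing = [key for key in schema["required"] if key not in event]
        unknown = [key for key in event if key not in properties]
        if missing or unknown:
            raise ValueError(f"invalid event: missing={missing} unknown={unknown}")

        # 2. Drop immediately unless the operator opted in to this schema.
        if schema_id not in self.allowed_schemas:
            return None

        # 3. Drop fields marked as PII, if configured to do so.
        if self.drop_pii:
            event = {k: v for k, v in event.items() if not properties[k].get("pii")}

        # 4. Wrap in a capsule with information common to all events.
        capsule = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "schema": schema_id,
            "version": version,
            "event": event,
        }

        # 5. Emit to whatever logging handlers (sinks) were configured.
        self.logger.info(json.dumps(capsule))
        return capsule
```

Because sinks are ordinary `logging.Handler` objects, an operator could pass in, for example, a `logging.handlers.RotatingFileHandler` without any change to the code that emits events.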
Schema naming conventions are very important, and affect multiple stakeholders.
-
Analysts are affected the most. When looking at event data, they should have an easy, reliable way to get the JSON schema referenced there. This schema will have documentation describing the fields, which should be of immense help in understanding the data they are working with.
-
Developers want to avoid cumbersome, hard to remember names when recording events. They might also have private schemas they do not want to publicly publish. There should also be no central naming authority for event schemas, as that will slow down development. They also want their code to be fast, so recording events should never require a network call to fetch schemas.
So the goal should be to provide a set of naming recommendations that can be implemented as a standalone utility for analysts to get the JSON schema from a given schema name. This could even be deployed as a small public tool that can resolve public schemas and display them in a nice readable format, like this.
There's lots of prior art here, but we'll steal most of our recommendations from Go's remote package naming conventions.
-
All schema names must be valid URIs, with no protocol part. This is the only requirement - these URIs need not actually resolve to anything. This lets developers get going quickly, and makes private schemas easy to do.
-
`jupyter.org` URIs will be special cased. `jupyter.org/<project>/<schema>` would resolve to:
a. The GitHub repository `jupyter/<project-name>`
b. The directory `event-schemas/<schema>` in the project
c. Files inside this directory should be named `v<version>.json`, where `<version>` is the integer version of the schema being used. All schema versions must be preserved in the repository.
-
`lab.jupyter.org`, `hub.jupyter.org` and `ipython.jupyter.org` URIs will also be special cased, pointing to projects under the `jupyterlab`, `jupyterhub` and `ipython` GitHub organizations respectively.
-
For arbitrary other public projects, we assume they most likely use a public version control repository. Here we borrow from Go's remote syntax for VCS repos - looking for a version control system specific suffix in any part of the path.
For example, if I want to add eventlogging to the project hosted at `https://github.com/yuvipanda/hubploy.git`, the recommendation is that I use URIs of the form `github.com/yuvipanda/hubploy.git/<schema-name>`. The resolver can then look for the directory `event-schemas/<schema-name>` after cloning the repository, and find files in there of the form `v<version>.json`, same as the `jupyter.org` special case.
The suggestion is that `jupyter.org` and other special cases are just shortcuts for expanding into the full git repo URL form.
-
If a git repository is not used, the URI is treated as a https endpoint, and fetched. Different actions are taken based on the `Content-Type` of the response.
a. If `application/json` or `application/json+schema`, the response is assumed to be the actual schema.
b. If `text/html`, we look for a `<link>` tag that can point us to a different URI to resolve. We use the standard `rel='alternate'` attribute, with `type='application/json+schema'`, and the `href` attribute pointing to another URI. The entire resolution algorithm is then run on this URI, until a schema is produced.
This is slightly different than what Go does, since they just invented their own `<meta>` tag. We instead use existing standard `<link>` tags for the same purpose.
This lets URLs provide a human readable version directly with HTML for user consumption, with a link to a machine readable version for computer usage.
-
If none of these work, the URI is assumed to be known to the end user. This might be likely for internal, private schemas that are made available to specific internal users only. Even for private schemas, ideally developers will follow the same naming recommendations as specified here - just for the sake of analysts. However, they might already have other systems of documentation in place, and we do not enforce any of this.
A small reference tool that implements schema resolution using these rules should be produced to see what problems we end up with, and the design adjusted accordingly.
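As a starting point for such a tool, the naming rules above could be sketched roughly as follows. This is a sketch under assumptions, not a reference implementation: the function and class names are hypothetical, the special-case hosts are assumed to expand to `https://github.com/<org>/<project>.git`, only the `.git` suffix is recognized, and no network access or cloning is performed. It only computes where a resolver would look, and finds the `<link>` tag in an HTML response.

```python
from html.parser import HTMLParser
from typing import NamedTuple, Optional

# Special-cased hosts from the rules above, mapped to GitHub organizations.
SPECIAL_CASES = {
    "jupyter.org": "jupyter",
    "lab.jupyter.org": "jupyterlab",
    "hub.jupyter.org": "jupyterhub",
    "ipython.jupyter.org": "ipython",
}

# Only git is described in the text; other VCS suffixes could be added here.
VCS_SUFFIXES = (".git",)


class Resolution(NamedTuple):
    kind: str                         # "git" or "https"
    location: str                     # repository URL, or URL to fetch
    schema_dir: Optional[str] = None  # directory inside the repository, for "git"


def resolve_name(name: str) -> Resolution:
    """Work out where the schema for a given schema name would be looked up."""
    parts = name.split("/")
    host = parts[0]

    # jupyter.org/<project>/<schema> and friends expand to a git repository.
    if host in SPECIAL_CASES and len(parts) == 3:
        _, project, schema = parts
        repo = f"https://github.com/{SPECIAL_CASES[host]}/{project}.git"
        return Resolution("git", repo, f"event-schemas/{schema}")

    # Look for a VCS suffix in any part of the path.
    for index, part in enumerate(parts):
        if part.endswith(VCS_SUFFIXES):
            repo = "https://" + "/".join(parts[: index + 1])
            schema = "/".join(parts[index + 1:])
            return Resolution("git", repo, f"event-schemas/{schema}")

    # Otherwise, treat the name as an https endpoint to fetch.
    return Resolution("https", "https://" + name)


def schema_file(schema_dir: str, version: int) -> str:
    """Path of a specific schema version inside the repository."""
    return f"{schema_dir}/v{version}.json"


class _SchemaLinkFinder(HTMLParser):
    """Records the href of the first matching <link rel='alternate'> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            tag == "link"
            and self.href is None
            and attributes.get("rel") == "alternate"
            and attributes.get("type") == "application/json+schema"
        ):
            self.href = attributes.get("href")


def find_schema_link(html: str) -> Optional[str]:
    """Return the URI an HTML response points to, or None if there is none."""
    finder = _SchemaLinkFinder()
    finder.feed(html)
    return finder.href
```

A full resolver would loop: fetch the `https` location, return the body if it is JSON, and otherwise call `resolve_name` again on whatever `find_schema_link` returns. If none of these steps produce a schema, the name is left for the end user to interpret, as described above.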
Here's a list of open questions.
-
How do we signal strongly that telemetry / events are never sent to the Jupyter project / 3rd party unless you explicitly configure it to do so? This is a common meaning of the word 'telemetry' today, so we need to make sure we communicate clearly what this is, what this isn't, and what it can be used for. Same applies to communicating that nothing is collected or emitted anywhere, despite the possible presence of emission code in the codebase.
-
Add yours here!