# EventLogging design

This document supplements `implementations.md` and has sections detailing
the eventlogging design that can be common to various parts of the Jupyter
ecosystem. These two documents will co-evolve - as we think more about
implementation, the design will change, and vice versa.

## Why collect data?

The primary reasons for collecting such data are:

1. Better understanding of *usage* of the infrastructure. This
   might be for capacity planning, metrics, billing, etc.

2. *Auditing* requirements - for security or legal reasons. Our
   telemetry work in the Jupyter project is necessary but not
   sufficient for this, since auditing might have more stringent
   requirements around secure provenance & anti-tampering.

3. *UX / UI* events from end user behavior. These are often
   targeted measurements that help UX designers / developers
   determine if particular UX decisions are meeting their
   goals.

4. *Operational* metrics. Prometheus metrics should be used for
   most operational metrics (error rates, percentiles of server
   or kernel start times, memory usage, etc). However, some
   operational data is much more useful when lossless than
   when sampled, such as server start times or ContentsManager
   usage.

## Metrics vs Events

Both Metrics and Events are telemetry, but are fundamentally
different. Katy Farmer [explains it](https://thenewstack.io/what-is-the-difference-between-metrics-and-events/)
thus:

> I want to keep track of my piggy bank closely. Right now, there’s only one
> metric I care about: total funds. Anyone can put money into my piggy bank, so
> I want to report the total funds at a one-minute interval. This means that
> every minute, my database will receive a data point with the timestamp and
> the amount of total funds in my piggy bank.
>
> Now, I want to track specific events for my piggy bank: deposits and
> withdrawals. When a deposit occurs, my database will receive a data point
> with the “deposit” tag, the timestamp and the amount of the deposit.
> Similarly, when a withdrawal occurs, my database will receive a data point
> with the “withdrawal” tag, the timestamp and the amount of the withdrawal.
>
> Imagine now that this is the same basic idea behind online banking. We could
> add more metadata to add detail to the events, like attaching a user ID to a
> deposit or withdrawal.

Metrics let us answer questions like 'what is the 99th percentile start time
for our user servers over the last 24 hours?' or 'what is the current rate
of 5xx errors in notebook servers running on machines with GPUs?'. They have
limited cardinality, are usually aggregated at the source, and are typically
pulled into a central location at regular intervals. They rarely contain any
PII, although they might leak some if we are not careful. These are primarily
operational. We already support metrics via the
[Prometheus](https://prometheus.io/) protocol in
[JupyterHub](https://github.com/jupyterhub/jupyterhub/pull/1581),
[Notebook Server](https://github.com/jupyter/notebook/pull/3490) and
[BinderHub](https://github.com/jupyterhub/binderhub/pull/150). This is
heavily used in production - see the public
[grafana instance](https://grafana.mybinder.org/) showing visualizations
from metrics data, and documentation about
[what is collected](https://mybinder-sre.readthedocs.io/en/latest/components/metrics.html).

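To make the metric half of this distinction concrete, here is a minimal
sketch (not from any existing Jupyter codebase) of recording a metric with
the [prometheus_client](https://github.com/prometheus/client_python) library;
the metric name and observed value are made up for illustration:

```python
# A hypothetical metric: an aggregated, low-cardinality summary that a
# Prometheus server scrapes at regular intervals - no per-event detail.
from prometheus_client import Histogram, start_http_server

SERVER_START_TIME = Histogram(
    'server_start_seconds',               # made-up metric name
    'Time taken to start a user server',  # help text shown to operators
)

start_http_server(8000)         # expose /metrics for Prometheus to scrape
SERVER_START_TIME.observe(3.2)  # record one observation (seconds)
```
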
Events let us answer questions like 'which users opened a notebook with this
name in the last 48 hours?' or 'which JupyterLab commands have been executed
most often when running with an IPython kernel?'. They carry much more
information, and do not happen with any regularity. Usually, they are
'pushed' to a centralized location, and they often contain PII - so they need
to be treated carefully. BinderHub [emits events](https://binderhub.readthedocs.io/en/latest/eventlogging.html)
around repos launched there, and the mybinder.org team has a [very small
pipeline](https://github.com/jupyterhub/mybinder.org-deploy/tree/master/images/analytics-publisher)
that cleans these events and publishes them at
[archive.analytics.mybinder.org](https://archive.analytics.mybinder.org/)
for the world to see.

This document focuses primarily on *Events*, and doesn't talk much about
metrics.

## Stakeholders

1. End Users

   The primary stakeholders, since it is their data. They have a right
   to know what information is being collected about them. We should
   make that transparency the default, and provide automated, easy to
   understand ways for them to see what is collected about them.

2. Operators

   The operators of the infrastructure where various Jupyter components
   run are the folks interested in collecting various kinds of Events.
   They have to:

   a. Explicitly decide what kinds of Events, at what level of detail,
      they are going to collect and store.

   b. Configure where these Events need to go. It should be very
      easy for them to integrate this with the rest of their infrastructure.

   By default, we should not store any Events unless an operator
   explicitly opts into it.

3. Developers

   Developers will be emitting Events from various parts of the code.
   They should only be concerned with emitting Events, and
   not with policy enforcement around what should be kept and
   where it should be stored. We should also provide easy
   interfaces for them to emit information in various places
   (backends, frontends, extensions, kernels, etc.).

4. Analysts

   These are the folks actually using the event data to make
   decisions, and hence the ultimate consumers of all this data.
   They should be able to clearly tell what the various fields
   in the data represent, and how complete the data is. We should
   also make the data easily consumable by common analyst tools -
   such as pandas, databases, data lakes, etc.

## Other systems to study

We aren't the first group to try to design a unified eventlogging
system that is easy to use, transparent and privacy-preserving
by default. Here are some examples of prior art we can draw
inspiration from.

* Wikimedia's [EventLogging](https://www.mediawiki.org/wiki/Extension:EventLogging/Guide)

  A simple and versatile system that can scale from the needs
  of a small organization running MediaWiki to the 7th largest
  website in the world. The [Guide](https://www.mediawiki.org/wiki/Extension:EventLogging/Guide)
  lays out the principles behind how things work and why they
  work the way they do. The [Operational Information Page](https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging)
  shows how this is configured in a large scale installation.

  Let's take an example case to illustrate this.

  Each eventlogging use case must be documented in a public
  schema. [This schema](https://meta.wikimedia.org/wiki/Schema:ServerSideAccountCreation)
  documents the events collected around account creation.
  This is very useful for a variety of stakeholders.

  1. Users can see what information is being collected about
     them if they wish.

  2. Analysts know exactly what each field in their dataset means.

  3. Operators can use this to perform automatic data purging,
     anonymizing or other retention policies easily. See
     how [Wikimedia does it](https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging)
     to be compliant with GDPR and friends.

  4. Developers can easily log events that conform to the schema
     with standardized libraries that are provided for them,
     without having to worry about policy around recording and
     retention. See some [sample code](https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#Underlying_technology)
     to get a feel for how it works.

* Mozilla's Telemetry system

  Firefox runs on a lot of devices, and has a lot of very privacy conscious
  users & developers. Mozilla has a well thought out
  [data collection policy](https://wiki.mozilla.org/Firefox/Data_Collection).

  There is a [technical overview](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/start/adding-a-new-probe.html)
  of the various capabilities available. Their [events](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/events.html)
  system is most similar to what we want here. Similar to the Wikimedia
  example, every event must have a corresponding schema, and you can see all
  the schemas in their [repository](https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Events.yaml).
  They also provide easy ways for developers to [emit events](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/events.html#the-api)
  from frontend JS.

  There is a lot more information in their [telemetry data portal](https://docs.telemetry.mozilla.org/),
  particularly around how analysts can work with this data.

* Debian 'popularity contest'

  The Debian project has an opt-in way to map the popularity of
  various packages used in end user systems: the [popularity
  contest](https://popcon.debian.org/). It records the packages
  installed on a system and the frequency of their use. This data is
  [sort of anonymously, sort of securely](https://popcon.debian.org/FAQ)
  sent to a centralized server, which then produces useful graphs.
  [Ubuntu](https://popcon.ubuntu.com/) and [NeuroDebian](http://neuro.debian.net/popcon/)
  run versions of this as well for their own packages.

  This is different from the other systems in being extremely
  single purpose, and not particularly secure in terms of user
  privacy. This model might be useful for particular things that
  need to work across a large swath of the ecosystem - such as
  package usage metrics - but is of limited use in Jupyter itself.

* Homebrew's analytics

  The popular macOS package manager [homebrew](https://brew.sh)
  [collects information](https://github.com/Homebrew/brew/blob/master/docs/Analytics.md) about
  usage with Google Analytics. This is very similar to the Debian popularity
  contest system, except it sends events to a third party (Google Analytics)
  instead. You can opt out of it if you wish.

* Bloomberg?

  Paul Ivanov mentioned that Bloomberg has their own data collection
  system around JupyterLab. It would be great to hear more details of
  that here.

* Other organizations

  Everyone operating at scale has some way of running this kind of analytics
  pipeline. It would be great to add more info here!

## Design proposal

1. *Schema*

   Each event type needs a [JSON Schema](https://json-schema.org/) associated
   with it. This schema is versioned to allow analysts, operators
   and users to see when fields are added / removed. The descriptions
   should also be clear enough to inform users of what is being collected,
   and analysts of what they are actually analyzing. We could also use this
   to mark specific fields as PII, which can then be automatically mangled,
   anonymized or dropped.

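   As a rough illustration, here is what such a schema might look like,
   written as a Python dict. All field names, the `$id` namespace and the
   `pii` marker are assumptions for illustration, not a settled convention:

   ```python
   # A hypothetical, versioned event schema. Unknown keywords such as
   # 'version' and 'pii' are ignored by standard JSON Schema validators,
   # so they can carry our own metadata.
   NOTEBOOK_OPEN_SCHEMA = {
       '$id': 'event.jupyter.org/notebook-open',  # namespacing; may not resolve
       'version': 1,
       'type': 'object',
       'properties': {
           'path': {
               'type': 'string',
               'description': 'Path of the opened notebook',
               'pii': True,  # mark for mangling / anonymizing / dropping
           },
           'kernel': {
               'type': 'string',
               'description': 'Name of the kernel spec in use',
           },
       },
       'required': ['path'],
   }
   ```
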
2. *EventLogging Python API*

   A simple Python API that lets developers in serverside code (JupyterHub,
   Notebook Server, Kernel, etc) emit events. This will:

   1. Validate the events to make sure they conform to the schema
      they claim to represent.
   2. Look at traitlet configuration to see if the events should be
      dropped, and immediately drop them if so. So nothing leaves the
      process unless explicitly configured to do so.
   3. Filter / obfuscate / drop PII if configured to do so.
   4. Wrap the event in an *event capsule* with common information
      for all events - timestamp (of sufficient granularity),
      schema reference, origin, etc.
   5. Emit the event to a given 'sink'. We should leverage the
      ecosystem built around Python Loggers for this, so we can
      send events to a wide variety of destinations - [files](https://docs.python.org/3/library/logging.handlers.html#filehandler),
      [files with automatic rotation](https://docs.python.org/3/library/logging.handlers.html#rotatingfilehandler),
      [arbitrary HTTP output](https://docs.python.org/3/library/logging.handlers.html#httphandler),
      [kafka](https://pypi.org/project/python-kafka-logger/),
      [Google Cloud's Stackdriver](https://pypi.org/project/google-cloud-logging/),
      [AWS CloudWatch](https://github.com/kislyuk/watchtower),
      ElasticSearch, and many many more. This should help integrate with
      whatever systems the organization is already using.

   This helps us centralize all the processing around event validity,
   PII handling and sink configuration. Organizations can then decide
   what to do with the events afterwards.

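   As a very rough sketch of this API (under assumptions: class and method
   names are made up, configuration is shown as plain constructor arguments
   rather than traitlets, and the schema is a trimmed version of the
   hypothetical one above):

   ```python
   import json
   import logging
   from datetime import datetime, timezone

   import jsonschema  # third-party: pip install jsonschema

   # Trimmed illustrative schema (see the sketch under 'Schema' above).
   SCHEMA = {
       '$id': 'event.jupyter.org/notebook-open',
       'version': 1,
       'type': 'object',
       'properties': {'path': {'type': 'string'}},
       'required': ['path'],
   }

   class EventLog:
       def __init__(self, handler, allowed_schemas):
           # Reuse the stdlib logging machinery as the pluggable 'sink'.
           self.logger = logging.getLogger('eventlog')
           self.logger.setLevel(logging.INFO)
           self.logger.addHandler(handler)
           self.allowed_schemas = allowed_schemas  # operator opt-in list

       def emit(self, schema, event):
           # Drop immediately unless the operator explicitly opted in.
           if schema['$id'] not in self.allowed_schemas:
               return
           # Validate against the schema the event claims to represent.
           jsonschema.validate(event, schema)
           # Wrap in an event capsule with common fields.
           capsule = {
               'timestamp': datetime.now(timezone.utc).isoformat(),
               'schema': schema['$id'],
               'version': schema['version'],
               'event': event,
           }
           # Emit to the configured sink; any logging handler works.
           self.logger.info(json.dumps(capsule))

   # Usage: write events to a file, with one schema explicitly allowed.
   log = EventLog(logging.FileHandler('events.log'),
                  allowed_schemas={'event.jupyter.org/notebook-open'})
   log.emit(SCHEMA, {'path': 'Untitled.ipynb'})
   ```
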
3. *EventLogging REST API*

   This is an HTTP endpoint in front of the Python API, and is a way for
   frontend JavaScript and other remote clients to emit events. This
   interface could exist in many places:

   1. Inside JupyterHub, so all events can be sent via the Hub.
   2. Inside the Jupyter Notebook Server, so it can collect events from
      the user running it and then forward them someplace.
   3. A standalone service that can receive events from everywhere.

   By separating (2) and (3), we can cater to a variety of scales
   and use cases.

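   A rough sketch of what such an endpoint might look like, using
   [Tornado](https://www.tornadoweb.org/) (which the Jupyter server stack
   already builds on). The URL, the payload shape and the `emit` callable
   are assumptions, not a defined protocol:

   ```python
   import json

   from tornado import web

   class EventLoggingHandler(web.RequestHandler):
       def initialize(self, emit):
           # 'emit' stands in for the Python API above; a real server would
           # look up the full schema by its $id before validating.
           self.emit = emit

       def post(self):
           # Expect a body like {"schema": "<$id>", "event": {...}}.
           payload = json.loads(self.request.body)
           self.emit(payload['schema'], payload['event'])
           self.set_status(204)  # accepted, nothing to return

   def make_app(emit):
       return web.Application([
           ('/api/eventlog', EventLoggingHandler, {'emit': emit}),
       ])

   # Usage sketch: route received events to stdout.
   # app = make_app(lambda schema_id, event: print(schema_id, event))
   # app.listen(8888)
   ```
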
4. *EventLogging JavaScript API*

   This is the equivalent of (2), but in JavaScript.

   It should receive configuration in a similar way to (2) and (3), but be
   able to send events to various sinks *directly* instead of being forced
   to go through (3). This is very useful in cases where events should be
   sent directly to a pre-existing collection service - such as Google
   Analytics or Mixpanel. Those can be supported as various sinks that plug
   into this API, so the code that is emitting the events can remain
   agnostic to where they are being sent.

   The default sink can be (3), but we should make sure we implement
   at least one more sink to begin with so we don't overfit our API design.

   We should be careful to make sure that these events still conform to
   schemas, need to be explicitly turned on in configuration, and follow
   all the other expectations we have around eventlogging data.

5. *User consent / information UI*

   Every application collecting data should have a way to make it
   clear to the user what is being collected, and possibly ways
   to turn it off. We could also let admins configure opt-in /
   opt-out options.

## Open questions

Here's a list of open questions.

1. How do we reference schemas in ways people can find them? The current
   example points to URLs that don't exist, purely as a way to
   namespace them. Perhaps we should have them point to URLs that
   *do* exist?

2. How do we signal strongly that telemetry / events are never sent
   to the Jupyter project / 3rd parties unless you explicitly configure
   it to do so? This is a common meaning of the word 'telemetry'
   today, so we need to make sure we communicate clearly what this
   is, what this isn't, and what it can be used for. The same applies
   to communicating that nothing is collected or emitted anywhere,
   despite the possible presence of emission code in the codebase.

3. Add yours here!

---

# Implementation

This document supplements `design.md` and has sections for implementation of
the eventlogging design in various parts of the Jupyter ecosystem. These two
documents will co-evolve - as we think more about implementation, the design
will change, and vice versa.

## JupyterLab

A *lot* of user actions happen here, so a well thought out, extensible
mechanism is extremely important.

Some parts of this might be:

1. The core event routing component that sends events to the configured "sink".
2. An interface exposed to JupyterLab extension developers so they can
   publish custom events.
3. An interface exposed to operators / admins writing their own custom sinks
   (as JupyterLab extensions themselves), which register themselves with the
   core routing component.

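The real implementation would be TypeScript inside JupyterLab, but the shape
of that routing component might look something like this sketch (written in
Python to match the examples in `design.md`; all names are made up):

```python
class EventRouter:
    """Hypothetical core routing component."""

    def __init__(self):
        self.sinks = []

    def register_sink(self, sink):
        # Operator / admin-provided sinks (e.g. a REST sink, a Google
        # Analytics sink) register themselves here.
        self.sinks.append(sink)

    def publish(self, schema_id, event):
        # Extension developers only call publish(); routing, policy and
        # sink selection stay in one place.
        for sink in self.sinks:
            sink(schema_id, event)

# Usage sketch: a trivial sink that prints events.
router = EventRouter()
router.register_sink(lambda schema_id, event: print(schema_id, event))
router.publish('event.jupyter.org/notebook-open', {'path': 'Untitled.ipynb'})
```
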
## JupyterHub

## Notebook Server

## Classic Notebook?

> **Review comment:** Since the notebook is on its way out, should we worry
> about adding telemetry to the notebook API and frontend? My vote is that we
> focus on Jupyter Server for backend and JupyterLab for client-side
> telemetry.

> **Review comment:** +1 - This is not saying we should completely drop
> support for Jupyter Classic, but we should be JupyterLab first.

> **Review comment:** I agree. However, in the pedagogy world, the classic
> notebook isn't going away for many many years. It makes sense for the
> Jupyter community / JupyterLab devs to focus on JupyterLab first, but we
> should make sure we don't make active choices that preclude classic
> notebook use.

## Kernels?

> **Review comment:** 👍 Collecting kernel events is something I've heard
> many people want. Obviously, the number of events can get very large, so
> we'll need to design something extensible to handle this.

> **Review comment:** IMO, we should move forward with Jupyter Server, since
> that's the future implementation of the notebook server. I think Jupyter
> Server will be ready before telemetry is ready, so it would be easier to
> focus our work there.

> **Review comment:** I'm guessing most of the events that are going to be
> emitted are going to come from the notebook REST APIs, kernel spawning,
> etc. Will those also be in Jupyter Server?