Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fine grained data collection #49

Merged
merged 29 commits into from
Sep 9, 2020

Conversation

Zsailer
Copy link
Member

@Zsailer Zsailer commented Aug 24, 2020

This replaces #46. The PR was originally directed at handling personal data. Now, it's a bit more general.

Here, I've changed the allowed_schemas trait to explicitly list all event properties that should be recorded based on their name or category. This can be used to filter-out personal data.

Summary of the changes made:

  • allowed_schemas can now be a dictionary describing what data to keep when recording an event (see example below).
  • schemas require a categories field under each property. This field can be used to group properties.
  • unrestricted is a special category that means the named property will always be recorded when that event is recorded.

How it works

allowed_schemas is now a nested dictionary where the keys are the schema IDs, and the values are dictionaries with two optional fields, properties and categories, describing the properties in the event that should be recorded.

As an example,

allowed_schemas = {
    "uri.to.schema": {
        "allowed_properties": ["name", "email"],
        "allowed_categories": ["personal-information"]
    }
}


c.EventLog.allowed_schemas=allowed_schemas

When the uri.to.schema event is recorded, it will include the properties, email and name, and any other properties with the category label personal-information. All other properties will not be recorded.

Personal data

While this PR doesn't specifically address personal data, it provides a mechanism to filter out personal data. For example, properties can be labeled with categories like hippa, gdpr, pii, etc. to make filtering under different regimes possible.

jupyter_telemetry/traits.py Outdated Show resolved Hide resolved
@Zsailer Zsailer force-pushed the fine-grained-data-collection branch from 5418a15 to c7318e3 Compare August 31, 2020 19:29
docs/pages/application.rst Outdated Show resolved Hide resolved
docs/pages/application.rst Outdated Show resolved Hide resolved
docs/pages/schemas.rst Outdated Show resolved Hide resolved
tests/test_register_schema.py Outdated Show resolved Hide resolved
Copy link

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yuvipanda
Copy link
Collaborator

Thanks a lot for this, @Zsailer and @blink1073.

I've one suggestion - when specifying the schema, can we call the properties recorded_properties and recorded_categories? That makes it much clearer what they are than now.

For categories, I'm worried that different schemas will use different words to refer to the same concepts. I don't think we should enforce that with software, but by providing useful guidance & some well known category names. For that, I propose:

  1. Category names are also URIs, similar to schemas. They don't have to be, but by convention they are.

  2. We define (in docs?) some well-known categories that can be used by schemas, and use them in our project jupyter schemas. Examples would be: category.jupyter.org/user-identifier, category.jupyter.org/unrestricted, category.jupyter.org/action-timestamp, etc.

    This would mean we change 'unrestricted' to something like 'category.jupyter.org/unrestricted' (or similar? it's too long...), and document some more. The documentation doesn't have to happen in this PR though.

@yuvipanda
Copy link
Collaborator

From the admin perspective, this now requires the admin to:

  1. Read the schema to understand what fields there are, and what categories
  2. Figure out the fields they actually want in their emitted events, with a combination of categories and field names. This is an 'OR' list - fields will be included based on two of their properties (name, category).

From the analyst perspective, this gets more complicated. Events are no longer self contained - the information about what categories were present is now out-of-band information, that requires co-ordination with the admin to know about. This gets a little messy. For example, if you load the output of events into pandas, what columns you get will now depend on admin parameters. Admins can change this at any time, without a change in schema - causing downstream users to fail. The shape of the data is now different without any such indication to the analyst.

A simplifying suggestion I now have is:

  1. Drop categories. Admins need to read the schema anyway, so require them to just explicitly list fields instead.
  2. When emitting events, include all field names regardless of whether they are allowed or not. But for fields that are not allowed, set them to null.

This gives us two things:

  1. Admins need to explicitly think about each field they are adding and why
  2. All events for a schema will actually conform to that schema, but might have missing data. This preserves the shape of the event for all events conforming to a schema, just setting some data as missing instead.

What do you both think? I apologize profusely for providing this feedback so late.

@yuvipanda
Copy link
Collaborator

I actually take back what I said about categories. For admins, they might not actually know which fields to select! Having a category set by the authors of the schema should definitely help here.

for property_name, data in event.items():
prop_categories = schema["properties"][property_name]["categories"]
# If the property is explicitly listed in
# the allowed_properties, then include it in the capsule
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of excluding it from the capsule, set it to null to indicate missing value?

@yuvipanda
Copy link
Collaborator

yuvipanda commented Sep 8, 2020

So to recap, the things I think we should do is:

  1. Rename fields in admin config to allowed_properties and allowed_categories.
  2. Suggest that category names URIs
  3. Reserve some for well known categories, like user identifier.
  4. When a field shouldn't be recorded, emit it still but set it to null. This preserves shape of data for analysts.

Thank you for coming on this asynchronous journey with me :)

@yuvipanda
Copy link
Collaborator

Thanks a lot, @Zsailer! This LGTM now, once the test failures are fixed.

@Zsailer
Copy link
Member Author

Zsailer commented Sep 9, 2020

Thanks, @yuvipanda and @blink1073! Merging!

@Zsailer Zsailer merged commit d44e217 into jupyter:master Sep 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants