Fine grained data collection #49

Zsailer · 2020-08-24T16:07:01Z

This replaces #46. The PR was originally directed at handling personal data. Now, it's a bit more general.

Here, I've changed the allowed_schemas trait to explicitly list all event properties that should be recorded, based on their name or category. This can be used to filter out personal data.

Summary of the changes made:

allowed_schemas can now be a dictionary describing what data to keep when recording an event (see example below).
schemas require a categories field under each property. This field can be used to group properties.
unrestricted is a special category that means the named property will always be recorded when that event is recorded.

How it works

allowed_schemas is now a nested dictionary where the keys are the schema IDs, and the values are dictionaries with two optional fields, allowed_properties and allowed_categories, describing the properties in the event that should be recorded.

As an example,

allowed_schemas = {
    "uri.to.schema": {
        "allowed_properties": ["name", "email"],
        "allowed_categories": ["personal-information"]
    }
}


c.EventLog.allowed_schemas=allowed_schemas

When the uri.to.schema event is recorded, it will include the properties email and name, plus any other properties with the category label personal-information. All other properties will not be recorded.
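The selection rule described above can be sketched in plain Python. This is a minimal illustration, not the code in jupyter_telemetry/eventlog.py: the helper name filter_event, the example schema, and the property names phone, page, and version are assumptions made up for the example. A property is kept if its name is listed in allowed_properties, if any of its categories appears in allowed_categories, or if it carries the unrestricted category.

```python
def filter_event(schema, event, allowed):
    """Return only the event properties permitted by ``allowed``.

    Hypothetical helper for illustration; not part of jupyter_telemetry.
    """
    allowed_properties = set(allowed.get("allowed_properties", []))
    allowed_categories = set(allowed.get("allowed_categories", []))
    recorded = {}
    for property_name, value in event.items():
        prop_categories = set(
            schema["properties"][property_name].get("categories", [])
        )
        if (
            property_name in allowed_properties          # matched by name
            or prop_categories & allowed_categories      # matched by any category
            or "unrestricted" in prop_categories         # always recorded
        ):
            recorded[property_name] = value
    return recorded


# Made-up schema: every property carries a list of category labels.
schema = {
    "$id": "uri.to.schema",
    "properties": {
        "name": {"type": "string", "categories": ["user-identifier"]},
        "email": {"type": "string", "categories": ["user-identifier"]},
        "phone": {"type": "string", "categories": ["personal-information"]},
        "page": {"type": "string", "categories": ["navigation"]},
        "version": {"type": "string", "categories": ["unrestricted"]},
    },
}

allowed_schemas = {
    "uri.to.schema": {
        "allowed_properties": ["name", "email"],
        "allowed_categories": ["personal-information"],
    }
}

event = {
    "name": "Ada",
    "email": "ada@example.org",
    "phone": "555-0100",
    "page": "/lab",
    "version": "1.0",
}
recorded = filter_event(schema, event, allowed_schemas["uri.to.schema"])
print(recorded)
```

In this sketch, page is the only property that matches neither list, so it is the only one left out.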

Personal data

While this PR doesn't specifically address personal data, it provides a mechanism to filter out personal data. For example, properties can be labeled with categories like hipaa, gdpr, pii, etc. to make filtering under different regimes possible.
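As a rough illustration of the labeling idea, a schema might tag each property with the regimes that apply to it, so that a deployment can see which properties a given label covers. The schema ID, property names, label assignments, and the helper properties_with_category below are invented for the example and carry no legal meaning.

```python
# Hypothetical schema fragment; only the "categories" field matters here.
schema = {
    "$id": "uri.to.schema",
    "properties": {
        "email": {"type": "string", "categories": ["pii", "gdpr"]},
        "ip_address": {"type": "string", "categories": ["gdpr"]},
        "kernel_name": {"type": "string", "categories": ["unrestricted"]},
    },
}


def properties_with_category(schema, category):
    """List the property names in ``schema`` labeled with ``category``."""
    return sorted(
        name
        for name, spec in schema["properties"].items()
        if category in spec.get("categories", [])
    )


# A deployment can check which properties a given label would select.
print(properties_with_category(schema, "gdpr"))
print(properties_with_category(schema, "pii"))
```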

blink1073

LGTM!

yuvipanda · 2020-09-08T06:43:44Z

Thanks a lot for this, @Zsailer and @blink1073.

I've one suggestion: when specifying the schema, can we call the properties allowed_properties and allowed_categories? That makes it much clearer what they are than the current names.

For categories, I'm worried that different schemas will use different words to refer to the same concepts. I don't think we should enforce consistency in software, but rather encourage it by providing useful guidance and some well-known category names. For that, I propose:

Category names are also URIs, similar to schemas. They don't have to be, but by convention they are.
We define (in docs?) some well-known categories that can be used by schemas, and use them in our project jupyter schemas. Examples would be: category.jupyter.org/user-identifier, category.jupyter.org/unrestricted, category.jupyter.org/action-timestamp, etc.

This would mean we change 'unrestricted' to something like 'category.jupyter.org/unrestricted' (or similar? it's too long...), and document some more. The documentation doesn't have to happen in this PR though.
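A small sketch of what the convention could look like in code, assuming the three example names above were adopted as written. The constant names, the WELL_KNOWN_PREFIX value, and the unknown_categories helper are illustrative assumptions, not project API. In keeping with the suggestion, the helper only reports unfamiliar labels and does not reject them.

```python
# Well-known category names taken from the examples in the comment above.
# Everything else here is a hypothetical illustration.
WELL_KNOWN_PREFIX = "category.jupyter.org/"

USER_IDENTIFIER = WELL_KNOWN_PREFIX + "user-identifier"
UNRESTRICTED = WELL_KNOWN_PREFIX + "unrestricted"
ACTION_TIMESTAMP = WELL_KNOWN_PREFIX + "action-timestamp"

WELL_KNOWN_CATEGORIES = frozenset({USER_IDENTIFIER, UNRESTRICTED, ACTION_TIMESTAMP})


def unknown_categories(schema):
    """Return category labels used in ``schema`` that are not well-known.

    By convention only: unknown labels are reported, not rejected.
    """
    found = set()
    for spec in schema["properties"].values():
        found.update(spec.get("categories", []))
    return sorted(found - WELL_KNOWN_CATEGORIES)


# Made-up schema mixing well-known labels with a schema-specific one.
schema = {
    "properties": {
        "username": {"categories": [USER_IDENTIFIER]},
        "opened_at": {"categories": [ACTION_TIMESTAMP]},
        "theme": {"categories": ["example.org/ui-preference"]},
    }
}
print(unknown_categories(schema))
```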

yuvipanda · 2020-09-08T06:52:39Z

From the admin perspective, this now requires the admin to:

Read the schema to understand what fields there are, and what categories
Figure out the fields they actually want in their emitted events, using a combination of categories and field names. This is an 'OR' list: a field is included if either its name or one of its categories matches.

From the analyst perspective, this gets more complicated. Events are no longer self-contained: which categories were recorded is now out-of-band information that requires coordination with the admin. This gets a little messy. For example, if you load the output of events into pandas, the columns you get will now depend on admin parameters. Admins can change this at any time, without a change in schema, causing downstream code to fail. The shape of the data is now different without any indication to the analyst.

A simplifying suggestion I now have is:

Drop categories. Admins need to read the schema anyway, so require them to just explicitly list fields instead.
When emitting events, include all field names regardless of whether they are allowed or not. But for fields that are not allowed, set them to null.

This gives us two things:

Admins need to explicitly think about each field they are adding and why
All events for a schema will actually conform to that schema, but might have missing data. This preserves the shape of the event for all events conforming to a schema, just setting some data as missing instead.
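The shape-preserving idea can be shown with a short sketch. The helper name mask_event and the example event are assumptions for illustration only; this is not the project's implementation. Every key in the event is kept, and values for fields that are not explicitly allowed are replaced with None.

```python
def mask_event(event, allowed_properties):
    """Keep every key in ``event``; replace non-allowed values with None.

    Hypothetical helper for illustration; not part of jupyter_telemetry.
    """
    allowed = set(allowed_properties)
    return {key: (value if key in allowed else None) for key, value in event.items()}


event = {"name": "Ada", "email": "ada@example.org", "page": "/lab"}

# Two different admin configurations applied to the same event.
strict = mask_event(event, ["page"])
lenient = mask_event(event, ["name", "email", "page"])

# Both records expose the same keys, so downstream code sees one shape.
print(strict)
print(strict.keys() == lenient.keys())
```

Because the set of keys never changes, an analyst loading these records sees the same columns under either configuration, with missing values where data was withheld.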

What do you both think? I apologize profusely for providing this feedback so late.

yuvipanda · 2020-09-08T06:56:46Z

I actually take back what I said about categories. For admins, they might not actually know which fields to select! Having a category set by the authors of the schema should definitely help here.

yuvipanda · 2020-09-08T06:58:32Z

jupyter_telemetry/eventlog.py

+        for property_name, data in event.items():
+            prop_categories = schema["properties"][property_name]["categories"]
+            # If the property is explicitly listed in
+            # the allowed_properties, then include it in the capsule


Instead of excluding it from the capsule, set it to null to indicate missing value?

yuvipanda · 2020-09-08T07:00:07Z

So to recap, the things I think we should do are:

Rename fields in admin config to allowed_properties and allowed_categories.
Suggest that category names be URIs.
Reserve some for well-known categories, like user identifier.
When a field shouldn't be recorded, still emit it but set it to null. This preserves the shape of the data for analysts.

Thank you for coming on this asynchronous journey with me :)

yuvipanda · 2020-09-09T06:49:22Z

Thanks a lot, @Zsailer! This LGTM now, once the test failures are fixed.

Zsailer · 2020-09-09T17:59:57Z

Thanks, @yuvipanda and @blink1073! Merging!

Zsailer mentioned this pull request Aug 31, 2020

Allow schemas to be extended #52

Open

Zsailer added 21 commits August 31, 2020 12:26

Add PII awareness to event formatter

881c077

mult-level security

b1c30e8

add default event level to handlers in eventlog initialize

58188c0

fix tests

d28cb14

various linting issues

f2f76a6

add docs about sensitive data

6f1805d

wip

aa676f0

replace multi-level approach to tagging approach

318a076

add explicit argument to collect personal data

a9feb13

more explicit language in docs

6961778

add sentence to schema docs about personal data

9b301e1

typo in formatter

4aa9be5

remove unused traits

6616fbe

linting errors fixed

7490f8f

make categories field a list

02bf53b

linting trouble

d022769

make allowed_schemas trait handle category filtering

a5c2a3d

update docs

d9866ae

fix keys in tests

99d2e5e

linting cleanup

3b0bfce

change category collection from all to any

c7318e3

Zsailer force-pushed the fine-grained-data-collection branch from 5418a15 to c7318e3 Compare August 31, 2020 19:29

relax requirement to list unrestricted alone; remove unnecessary categories field in SchemaOption trait

0a40c57

blink1073 reviewed Aug 31, 2020

remove some old references to PII in documentation and tests

83ffb02

blink1073 approved these changes Aug 31, 2020

remove unused traitlets import

2b3c6dc

Zsailer mentioned this pull request Sep 3, 2020

Public Meeting on Telemetry #53

Closed

yuvipanda reviewed Sep 8, 2020

Zsailer added 2 commits September 8, 2020 16:29

switch to 'allowed_categories' and 'allowed_properties' in allowed_schemas

37c2da9

return None for non recorded properties

0921465

Zsailer added 3 commits September 9, 2020 09:31

update tests to incorporate changes in allowed schemas

57a5cdf

update docs; add docs about URIs for category labels

f2c46cb

minor fix in docs for new category URI

717f8ac

Zsailer merged commit d44e217 into jupyter:master Sep 9, 2020
