Fine grained data collection #49
5418a15 to c7318e3
…gories field in SchemaOption trait
LGTM!
Thanks a lot for this, @Zsailer and @blink1073. I've one suggestion - when specifying the schema, can we call the properties
For categories, I'm worried that different schemas will use different words to refer to the same concepts. I don't think we should enforce that with software, but by providing useful guidance & some well known category names. For that, I propose:
From the admin perspective, this now requires the admin to:
From the analyst perspective, this gets more complicated. Events are no longer self-contained: the information about which categories were present is now out-of-band information, which requires coordination with the admin to know about. This gets a little messy. For example, if you load the output of events into pandas, the columns you get will now depend on admin parameters. Admins can change these at any time without a change in schema, causing downstream users' code to fail. The shape of the data is now different, with no indication of that to the analyst.
A simplifying suggestion I now have is:
This gives us two things:
What do you both think? I apologize profusely for providing this feedback so late.
I actually take back what I said about categories. For admins, they might not actually know which fields to select! Having a category set by the authors of the schema should definitely help here.
for property_name, data in event.items():
    prop_categories = schema["properties"][property_name]["categories"]
    # If the property is explicitly listed in
    # the allowed_properties, then include it in the capsule
Instead of excluding it from the capsule, set it to null to indicate missing value?
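To illustrate the difference this suggestion makes, here is a minimal sketch. The function names `drop_disallowed` and `null_disallowed`, the sample event, and the `allowed` set are all hypothetical and are not taken from the PR.

```python
def drop_disallowed(event, allowed):
    """Exclude properties that are not allowed (the keys vary with config)."""
    return {name: value for name, value in event.items() if name in allowed}


def null_disallowed(event, allowed):
    """Keep every key, replacing disallowed values with None (null in JSON)."""
    return {
        name: (value if name in allowed else None)
        for name, value in event.items()
    }


# Hypothetical event and admin configuration.
event = {"action": "open", "email": "user@example.com"}
allowed = {"action"}

dropped = drop_disallowed(event, allowed)
nulled = null_disallowed(event, allowed)
```

With the second form the set of keys no longer depends on what the admin allowed, so a consumer loading the events into a table sees the same columns either way.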
So to recap, the things I think we should do are:
Thank you for coming on this asynchronous journey with me :)
Thanks a lot, @Zsailer! This LGTM now, once the test failures are fixed.
Thanks, @yuvipanda and @blink1073! Merging!
This replaces #46. The PR was originally directed at handling personal data. Now, it's a bit more general.
Here, I've changed the `allowed_schemas` trait to explicitly list all event properties that should be recorded based on their name or category. This can be used to filter out personal data.

Summary of the changes made:
- `allowed_schemas` can now be a dictionary describing what data to keep when recording an event (see example below).
- `categories` field under each property. This field can be used to group properties.
- `unrestricted` is a special category that means the named property will always be recorded when that event is recorded.

How it works
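`allowed_schemas` is now a nested dictionary where the keys are the schema IDs and the values are dictionaries with two optional fields, `properties` and `categories`, describing the properties in the event that should be recorded. As an example, consider an entry for `uri.to.schema` that lists the properties `email` and `name` and the category `personal-information`.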
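A configuration matching that description might look like the sketch below. It is assembled from the prose in this description, not copied from the PR; only the trait name `allowed_schemas`, the two field names, and the property and category names come from the text.

```python
# Hypothetical value for the `allowed_schemas` trait:
# schema ID -> which properties of that event to record.
allowed_schemas = {
    "uri.to.schema": {
        # Properties recorded because they are listed by name.
        "properties": ["email", "name"],
        # Properties recorded because they carry one of these categories.
        "categories": ["personal-information"],
    }
}
```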
When the `uri.to.schema` event is recorded, it will include the properties `email` and `name`, and any other properties with the category label `personal-information`. All other properties will not be recorded.

Personal data
While this PR doesn't specifically address personal data, it provides a mechanism to filter out personal data. For example, properties can be labeled with categories like `hipaa`, `gdpr`, `pii`, etc. to make filtering under different regimes possible.
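The selection rules described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the function name `filter_event`, the sample schema and event, and the choice to set unrecorded properties to `None` (following the review suggestion) instead of omitting them are all hypothetical.

```python
def filter_event(schema, event, allowed):
    """Return a copy of `event` keeping only the values that may be recorded.

    `allowed` is one entry of `allowed_schemas`: a dict with optional
    "properties" and "categories" lists.
    """
    allowed_properties = set(allowed.get("properties", []))
    allowed_categories = set(allowed.get("categories", []))

    filtered = {}
    for property_name, data in event.items():
        prop_categories = set(
            schema["properties"][property_name].get("categories", [])
        )
        keep = (
            property_name in allowed_properties       # listed by name
            or "unrestricted" in prop_categories      # always recorded
            or bool(prop_categories & allowed_categories)  # shared category
        )
        # Assumption: unrecorded values become None so the keys stay stable.
        filtered[property_name] = data if keep else None
    return filtered


# Hypothetical schema fragment: each property carries a `categories` list.
schema = {
    "properties": {
        "action": {"categories": ["unrestricted"]},
        "email": {"categories": ["pii", "gdpr"]},
        "name": {"categories": ["pii"]},
    }
}
event = {"action": "login", "email": "user@example.com", "name": "Ada"}

gdpr_view = filter_event(schema, event, {"categories": ["gdpr"]})
default_view = filter_event(schema, event, {})
```

Here `gdpr_view` keeps `action` and `email` and nulls `name`, while `default_view` keeps only the `unrestricted` property `action`.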