
Publish raw data of repositories launched #789

Open
yuvipanda opened this issue Oct 30, 2018 · 15 comments

Comments


yuvipanda commented Oct 30, 2018

With the progress made on #97, we are now close to publishing raw information on repositories launched. This contains the following information:

  1. Timestamp of repository launched (possibly truncated to minute resolution)
  2. Provider of repo launched (GitHub / GitLab / etc)
  3. Repo name
  4. Commit hash / branch run
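
For illustration, one such record could look something like this (field names here are hypothetical, not a final schema):

```python
# Hypothetical example of a single launch event (illustrative field names only):
event = {
    "timestamp": "2018-10-30T12:34:00Z",  # truncated to minute resolution
    "provider": "GitHub",                 # GitHub / GitLab / etc.
    "spec": "someorg/somerepo",           # repo name
    "ref": "3f1c9a2",                     # commit hash / branch that was run
}
```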

This lets us (and others!) make more dashboards and run analysis on repository usage on mybinder.org. Something like https://tools.wmflabs.org/pageviews/ sounds awesome :)

This doesn't include any information about our users - only about the repositories being launched. A possible privacy issue is that we might 'leak' a repo that a user is only using for themselves. However, we only support public repos already, so IMO this isn't a concern - and we already say as much in our docs.

This issue should track our work in making this info public. I'd also want to check that it fits with what we'd like our privacy policy to be.

@yuvipanda

/cc @minrk @willingc @betatim @choldgraf @jzf2101 what do you think?


betatim commented Oct 30, 2018

I can't immediately think of anything we could leak about individual users, which is what I'd worry about. A repo being used on mybinder.org doesn't seem like information that needs protecting, unless it tells you something about an individual human.

@yuvipanda

My plan now is to publish this every day as a JSON (one entry per line) file, with timestamps truncated to per-minute resolution (since that's the only bit of info that's related to a user action in any form).
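
A minimal sketch of that, assuming each event is already a Python dict with a datetime timestamp (names and file name are illustrative):

```python
import json
from datetime import datetime

def truncate_to_minute(ts: datetime) -> str:
    # Drop seconds and microseconds so published timestamps only carry minute resolution.
    return ts.replace(second=0, microsecond=0).isoformat()

def write_daily_archive(events, path="events-2018-11-01.jsonl"):
    # One JSON object per line ("JSON Lines").
    with open(path, "w") as f:
        for event in events:
            event["timestamp"] = truncate_to_minute(event["timestamp"])
            f.write(json.dumps(event) + "\n")
```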


betatim commented Nov 1, 2018

Sounds good to me!

What will the publishing workflow look like, and what do you think of (eventually) transitioning this to a live stream of events (with a limited history)? I was thinking of something a bit like the Twitter firehose. If we could combine daily digests and a live stream into one service, that would be neat.


yuvipanda commented Nov 2, 2018

ok, I've done a bunch of work that lets us build images on demand in this repo, and push them to GCR with chartpress. https://github.com/jupyterhub/mybinder.org-deploy/tree/master/images/events-archiver is the beginning of the script that'll do the archiving.

Next steps:

  • Write code that reads events and puts them in storage (a rough sketch is below)
  • Run it in a cron

And see how that goes!

This image building infra should also be very useful for other things.
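
As a rough, hypothetical sketch of the "read events and put them in storage" step (bucket and file names are placeholders, not the real archiver):

```python
import json

from google.cloud import storage  # assumes the google-cloud-storage package

def archive_day(events, date, bucket_name="example-events-archive"):
    # Serialize one day's events as JSON Lines and upload them to a bucket.
    # `events` is an iterable of dicts; the bucket name is a placeholder.
    payload = "".join(json.dumps(event) + "\n" for event in events)
    blob = storage.Client().bucket(bucket_name).blob(f"events-{date}.jsonl")
    blob.upload_from_string(payload, content_type="application/json")
```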

@yuvipanda

I've written code that does this, but Stackdriver read limits are pretty low (1 request per second across the whole project). I've instead set up exports from Stackdriver to Cloud Storage (https://cloud.google.com/logging/docs/export/using_exported_logs#gcs-overview), and the script can read from these, post-process them, and publish the result as processed public files.
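
A hedged sketch of what reading those exports could look like (bucket name, prefix, and field names are placeholders; assumes a recent google-cloud-storage):

```python
import json

from google.cloud import storage  # assumes the google-cloud-storage package

def processed_events(bucket_name, prefix):
    # Read Stackdriver's exported log objects from GCS and yield only the
    # fields we want to publish. Bucket, prefix, and field names are placeholders.
    client = storage.Client()
    for blob in client.bucket(bucket_name).list_blobs(prefix=prefix):
        for line in blob.download_as_text().splitlines():
            entry = json.loads(line)
            payload = entry.get("jsonPayload", {})
            yield {
                "timestamp": entry.get("timestamp"),
                "provider": payload.get("provider"),
                "spec": payload.get("spec"),
                "status": payload.get("status"),
            }
```

The output of something like this could then feed the daily public archive files.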


yuvipanda commented Nov 5, 2018

With a large number of PRs ending in #817, most of this is done! https://archive.analytics.staging.mybinder.org/ exists for staging, and shortly https://archive.analytics.mybinder.org/ will exist for prod!

Things left to do:

  • Events from the last few hours of a day are currently missed, because we run the archiver only every few hours.
  • Add Piwik tracking code so we have some sense of who is visiting the page.
  • Write docs on the structure of the files.

We shouldn't publicize this until these things are done, but they should all be done very soon.


betatim commented Nov 6, 2018

I fetched a file and tried to open it with: json.load(open("events-2018-11-06.jsonl")) because "what is this jsonl thing? let's try and open it" and it fails :-/

What is the trade-off between using jsonl and plain json with a set of [] around the whole file? Making it easy to open the files is going to be key if we want lots of people to build on them. This makes me think json.load and the pandas equivalent should "just work". If we stick with jsonl we should supply a snippet for how to read the files. Without guidance/googling I am now thinking I will have to iterate over each line, call json.loads on it, and collect things into a list like that.
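
For reference, that line-by-line approach would be something like this (a minimal sketch, reusing the file name from above):

```python
import json

# Read a JSON Lines file the "manual" way: one json.loads call per line.
with open("events-2018-11-06.jsonl") as f:
    events = [json.loads(line) for line in f if line.strip()]
```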


yuvipanda commented Nov 6, 2018

Yep, lotta docs to be written. pandas.read_json does work with these files, I think! If you wanna read it in plain Python, you do have to loop over every line.

The big advantage is that you can stream these, which you cannot do with plain JSON files - there you must read the entire thing into memory before you can do anything with it. IMO that's enough of an advantage to make it worth it. This is how JSON / structured logging works pretty much everywhere, and tools like jq work very well with it.
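
As a sketch of what streaming buys you (the "spec" field name is an assumption for illustration):

```python
import json
from collections import Counter

# Stream the file line by line - nothing here needs the whole day in memory.
# The "spec" field name is assumed for illustration.
launches = Counter()
with open("events-2018-11-06.jsonl") as f:
    for line in f:
        if line.strip():
            launches[json.loads(line)["spec"]] += 1

print(launches.most_common(10))
```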

See http://jsonlines.org/ for more info. Googling for 'json lines' also produces a lot of info.

I'll be writing a lot of documentation today.

@yuvipanda

pandas.read_json(url, lines=True) does seem to have problems with the nesting, however. I'm gonna de-nest the structure.
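
A minimal sketch of the kind of de-nesting meant here (the nested shape shown is hypothetical, not the actual event structure):

```python
def flatten(event, parent_key="", sep="_"):
    # Flatten nested dicts, e.g. {"repo": {"provider": "GitHub"}} becomes
    # {"repo_provider": "GitHub"}. The nested shape here is hypothetical.
    flat = {}
    for key, value in event.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat
```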

@choldgraf

@yuvipanda wanna hack together on a documentation PR sometime this week?


betatim commented Nov 6, 2018

The streaming aspect seems like a good point, and it means you could concatenate lots of days into one file easily. Maybe we put pandas.read_json(..., lines=True) on the index page as a pointer? That would have made me find/use it.
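
Something like this on the index page would probably do it (the URL pattern is assumed from the file name above, not confirmed):

```python
import pandas as pd

# One day's archive, read straight from the web; lines=True handles the JSONL format.
events = pd.read_json(
    "https://archive.analytics.mybinder.org/events-2018-11-06.jsonl",
    lines=True,
)

# Concatenating several days is just as easy.
days = ["2018-11-04", "2018-11-05", "2018-11-06"]
all_events = pd.concat(
    pd.read_json(f"https://archive.analytics.mybinder.org/events-{d}.jsonl", lines=True)
    for d in days
)
```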

@yuvipanda

@betatim yep, we should have code samples in at least Python and JS.

@choldgraf sure! I think I can write one up later today, and we can iterate from there.

@yuvipanda

Documentation now at https://mybinder-sre.readthedocs.io/en/latest/analytics/events-archive.html. This is linked to from https://archive.analytics.mybinder.org/.

Instead of using piwik, I'm going to get stackdriver to send the logs from the nginx proxy serving https://archive.analytics.mybinder.org/ to GCS for storage. This lets us get better metrics on how people are fetching and using this data.

EXCITING! Now that the data engineering is all complete, we 'just' need someone to do something cool with this data.

@yuvipanda

There is now a stackdriver sink in binder-prod called events-archive-access-logs that is archiving nginx logs from archive.analytics.mybinder.org to a GCS bucket named mybinder-events-archive-access-logs. We can use this later to do analytics on our analytics.
