
Publish raw data of repositories launched #789

Open
yuvipanda opened this issue Oct 30, 2018 · 15 comments

Comments


yuvipanda commented Oct 30, 2018

With the progress made on #97, we are now close to publishing raw information on repositories launched. This contains the following information:

  1. Timestamp of repository launched (possibly truncated to minute resolution)
  2. Provider of repo launched (GitHub / GitLab / etc)
  3. Repo name
  4. Commit hash / branch run
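
For illustration, one such record could look something like this (field names here are hypothetical, not a final schema):

```python
# Hypothetical example of a single launch event (illustrative field names only):
event = {
    "timestamp": "2018-10-30T12:34:00Z",  # truncated to minute resolution
    "provider": "GitHub",                 # GitHub / GitLab / etc.
    "spec": "someorg/somerepo",           # repo name
    "ref": "3f1c9a2",                     # commit hash / branch that was run
}
```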

This lets us (and others!) make more dashboards and run analysis on repository usage on mybinder.org. Something like https://tools.wmflabs.org/pageviews/ sounds awesome :)

This doesn't include any information about our users - only about the repositories being launched. A possible privacy issue is that we might 'leak' a repo that a user is only using for themselves. However, we only support public repos already, so IMO this isn't a concern - and we already say as much in our docs.

This issue should track our work in making this info public. I'd also want to check that it fits with what we'd like our privacy policy to be.

@yuvipanda

/cc @minrk @willingc @betatim @choldgraf @jzf2101 what do you think?


betatim commented Oct 30, 2018

I can't immediately think of anything we could leak about individual users, which is what I'd worry about. A repo being used on mybinder.org doesn't seem like information that needs protecting, unless it tells you something about an individual human.

@yuvipanda

My plan now is to publish this every day as a JSON (one entry per line) file, with timestamps truncated to per-minute resolution (since that's the only bit of info that's related to a user action in any form).
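
A minimal sketch of that, assuming each event is already a Python dict with a datetime timestamp (names and file name are illustrative):

```python
import json
from datetime import datetime

def truncate_to_minute(ts: datetime) -> str:
    # Drop seconds and microseconds so published timestamps only carry minute resolution.
    return ts.replace(second=0, microsecond=0).isoformat()

def write_daily_archive(events, path="events-2018-11-01.jsonl"):
    # One JSON object per line ("JSON Lines").
    with open(path, "w") as f:
        for event in events:
            event["timestamp"] = truncate_to_minute(event["timestamp"])
            f.write(json.dumps(event) + "\n")
```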


betatim commented Nov 1, 2018

Sounds good to me!

What will the publishing workflow look like, and what do you think of (eventually) transitioning this to a live stream of events (with a limited history)? I was thinking of something a bit like the Twitter firehose. If we could combine daily digests and a live stream into one service, that would be neat.


yuvipanda commented Nov 2, 2018

ok, I've done a bunch of work that lets us build images on demand in this repo, and push them to GCR with chartpress. https://github.com/jupyterhub/mybinder.org-deploy/tree/master/images/events-archiver is the beginning of the script that'll do the archiving.

Next steps:

  • Write code that reads events and puts them in storage (a rough sketch is below)
  • Run it in a cron

And see how that goes!

This image building infra should also be very useful for other things.
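
As a rough, hypothetical sketch of the "read events and put them in storage" step (bucket and file names are placeholders, not the real archiver):

```python
import json

from google.cloud import storage  # assumes the google-cloud-storage package

def archive_day(events, date, bucket_name="example-events-archive"):
    # Serialize one day's events as JSON Lines and upload them to a bucket.
    # `events` is an iterable of dicts; the bucket name is a placeholder.
    payload = "".join(json.dumps(event) + "\n" for event in events)
    blob = storage.Client().bucket(bucket_name).blob(f"events-{date}.jsonl")
    blob.upload_from_string(payload, content_type="application/json")
```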

@yuvipanda

I've written code that does this, but Stackdriver read limits are pretty low (1 request per second across the whole project). I've instead set up exports from Stackdriver to Cloud Storage (https://cloud.google.com/logging/docs/export/using_exported_logs#gcs-overview), and the script can read from these, post-process them, and publish the result as processed public files.
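
A hedged sketch of what reading those exports could look like (bucket name, prefix, and field names are placeholders; assumes a recent google-cloud-storage):

```python
import json

from google.cloud import storage  # assumes the google-cloud-storage package

def processed_events(bucket_name, prefix):
    # Read Stackdriver's exported log objects from GCS and yield only the
    # fields we want to publish. Bucket, prefix, and field names are placeholders.
    client = storage.Client()
    for blob in client.bucket(bucket_name).list_blobs(prefix=prefix):
        for line in blob.download_as_text().splitlines():
            entry = json.loads(line)
            payload = entry.get("jsonPayload", {})
            yield {
                "timestamp": entry.get("timestamp"),
                "provider": payload.get("provider"),
                "spec": payload.get("spec"),
                "status": payload.get("status"),
            }
```

The output of something like this could then feed the daily public archive files.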


yuvipanda commented Nov 5, 2018

With a large number of PRs ending in #817, most of this is done! https://archive.analytics.staging.mybinder.org/ exists for staging, and shortly https://archive.analytics.mybinder.org/ will exist for prod!

Things left to do:

  • Events from the last few hours of a day are currently missed, because we run the archiver only every few hours.
  • Add Piwik tracking code so we have some sense of who is visiting the page.
  • Write docs on the structure of the files.

We shouldn't publicize this until these things are done, but they should all be done very soon.


betatim commented Nov 6, 2018

I fetched a file and tried to open it with: json.load(open("events-2018-11-06.jsonl")) because "what is this jsonl thing? let's try and open it" and it fails :-/

What is the trade-off between using jsonl and plain json with a set of [] around the whole file? Making it easy to open the files is going to be key if we want lots of people to build on them. This makes me think json.load and the pandas equivalent should "just work". If we stick with jsonl we should supply a snippet for how to read the files. Without guidance/googling I am now thinking I will have to iterate over each line, call json.loads on it, and collect things into a list like that.
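
For reference, that line-by-line approach would be something like this (a minimal sketch, reusing the file name from above):

```python
import json

# Read a JSON Lines file the "manual" way: one json.loads call per line.
with open("events-2018-11-06.jsonl") as f:
    events = [json.loads(line) for line in f if line.strip()]
```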


yuvipanda commented Nov 6, 2018

Yep, lotta docs to be written. pandas.read_json does work with these files, I think! If you wanna read it in plain Python, you do have to loop over every line.

The big advantage is that you can stream these, which you cannot do with plain JSON files - there you must read the entire thing into memory before you can do anything with it. IMO that's enough of an advantage to make it worth it. This is how JSON / structured logging works pretty much everywhere, and tools like jq work very well with it.
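
As a sketch of what streaming buys you (the "spec" field name is an assumption for illustration):

```python
import json
from collections import Counter

# Stream the file line by line - nothing here needs the whole day in memory.
# The "spec" field name is assumed for illustration.
launches = Counter()
with open("events-2018-11-06.jsonl") as f:
    for line in f:
        if line.strip():
            launches[json.loads(line)["spec"]] += 1

print(launches.most_common(10))
```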

See http://jsonlines.org/ for more info. Googling for 'json lines' also produces a lot of info.

I'll be writing a lot of documentation today.

@yuvipanda

pandas.read_json(url, lines=True) does seem to have problems with the nesting, however. I'm gonna de-nest the structure.
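
A minimal sketch of the kind of de-nesting meant here (the nested shape shown is hypothetical, not the actual event structure):

```python
def flatten(event, parent_key="", sep="_"):
    # Flatten nested dicts, e.g. {"repo": {"provider": "GitHub"}} becomes
    # {"repo_provider": "GitHub"}. The nested shape here is hypothetical.
    flat = {}
    for key, value in event.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat
```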

@choldgraf

@yuvipanda wanna hack together on a documentation PR sometime this week?


betatim commented Nov 6, 2018

The streaming aspect seems like a good point, and it means you could concatenate lots of days into one file easily. Maybe we put pandas.read_json(..., lines=True) on the index page as a pointer? That would have made me find/use it.
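
Something like this on the index page would probably do it (the URL pattern is assumed from the file name above, not confirmed):

```python
import pandas as pd

# One day's archive, read straight from the web; lines=True handles the JSONL format.
events = pd.read_json(
    "https://archive.analytics.mybinder.org/events-2018-11-06.jsonl",
    lines=True,
)

# Concatenating several days is just as easy.
days = ["2018-11-04", "2018-11-05", "2018-11-06"]
all_events = pd.concat(
    pd.read_json(f"https://archive.analytics.mybinder.org/events-{d}.jsonl", lines=True)
    for d in days
)
```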

@yuvipanda

@betatim yep, we should have code samples in at least Python and JS.

@choldgraf sure! I think I can write one up later today, and we can iterate from there.

@yuvipanda

Documentation now at https://mybinder-sre.readthedocs.io/en/latest/analytics/events-archive.html. This is linked to from https://archive.analytics.mybinder.org/.

Instead of using piwik, I'm going to get stackdriver to send the logs from the nginx proxy serving https://archive.analytics.mybinder.org/ to GCS for storage. This lets us get better metrics on how people are fetching and using this data.

EXCITING! Now that the data engineering is all complete, we 'just' need someone to do something cool with this data.

@yuvipanda

There is now a stackdriver sink in binder-prod called events-archive-access-logs that is archiving nginx logs from archive.analytics.mybinder.org to a GCS bucket named mybinder-events-archive-access-logs. We can use this later to do analytics on our analytics.
