
Public analytics & metrics pipeline #97

Closed
yuvipanda opened this issue Oct 22, 2017 · 9 comments

@yuvipanda
Contributor

yuvipanda commented Oct 22, 2017

One of the coolest things about Wikimedia is the large amount of usage data it makes available publicly: the Page Views API & dumps, content dumps, client usage, the live recent changes stream, etc. This is useful for a number of purposes - fundraising, quantifying impact, and more. By simply publishing the raw data, Wikimedia lets a wide variety of people derive whatever meaning they want from it, enabling creativity & removing itself as a bottleneck.

We should take a similar approach, both because we strive to be open & because we're a small team that cannot build all the cool things this approach would make possible.

The simple proposal here is:

  1. Instrument binderhub to emit an event, with some metadata, every time someone builds a repo. We could do this with regular logging.
  2. Collect this info in a structured fashion as an event stream (such as 'all launches').
  3. Provide this as a primary source, in the form of an event stream that the internet can subscribe to.
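The first step above could be sketched roughly like this: each launch is logged as one structured JSON line that a separate collector can later tail. The field names (`event`, `provider`, `repo`, `status`) are illustrative assumptions, not a decided schema:

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical sketch: emit one JSON event per repo launch via regular
# logging. A downstream collector can tail these lines and re-publish
# them as a public event stream.
logger = logging.getLogger("binderhub.events")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def emit_launch_event(provider, repo, status):
    """Log one launch as a single JSON line."""
    event = {
        "event": "launch",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "provider": provider,   # e.g. "GitHub"
        "repo": repo,           # e.g. "org/repository"
        "status": status,       # e.g. "success" or "failure"
    }
    logger.info(json.dumps(event))
    return event

emit_launch_event("GitHub", "binder-examples/requirements", "success")
```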

This can be our primary information source. On top of this, multiple other things can be built:

  1. Someone just listens on this and produces daily, weekly, monthly aggregate statistics
  2. This can also be used to provide nicer badges that have a 'launch count' on them
  3. People can make cool visualizations like https://listen.hatnote.com/
  4. Bots and humans can use this to spot builds that are failing
  5. Can easily be used to justify your own repo's funding / credit, since you know how many times it has been launched.
  6. Can produce leaderboards!
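The first consumer above (daily/weekly/monthly aggregates) could be a small fold over the event stream. A minimal sketch, assuming events are JSON objects with `timestamp` and `repo` fields (the real schema is undecided):

```python
from collections import Counter

# Sketch of an aggregate-statistics consumer: fold a stream of launch
# events into per-day, per-repo counts. Field names are assumptions.
def daily_counts(events):
    counts = Counter()
    for event in events:
        day = event["timestamp"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        counts[(day, event["repo"])] += 1
    return counts

stream = [
    {"timestamp": "2017-10-22T10:00:00Z", "repo": "org/a"},
    {"timestamp": "2017-10-22T11:30:00Z", "repo": "org/a"},
    {"timestamp": "2017-10-23T09:15:00Z", "repo": "org/b"},
]
print(daily_counts(stream))
# Counter({('2017-10-22', 'org/a'): 2, ('2017-10-23', 'org/b'): 1})
```

The same fold, keyed only by repo, would also drive the launch-count badges and leaderboards.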

And far more. This also prevents us from being a bottleneck, and opens up space for a developer community that uses binder (rather than just one that develops binder). We also determine what kind of info is emitted, making sure we preserve our users' privacy.

This issue is primarily to talk about this approach, rather than technical details. Thoughts?

@choldgraf
Member

Totally agree, this could be a great community feature, and is a key step towards making a "case" for binder tech as being impactful.

One challenge: how would this work once Binder is federated? Would these statistics be kept at the BinderHub level? If there are multiple public streams out there, then it would be straightforward to aggregate them, so maybe not such a big deal so long as the data is there.

@yuvipanda
Contributor Author

Indeed, ideally every BinderHub would make its stream public and people can aggregate them. See https://wikiapiary.com/wiki/Main_Page for how this sort of aggregation happens for MediaWiki instances (the software that runs Wikimedia's sites but also many other websites unrelated to Wikimedia).

@ctb

ctb commented Oct 22, 2017

+1 on the federation question :). Does this mean we'll need to have a BinderHubHub?

@choldgraf
Member

I prefer "binderbinderhubhub"

@choldgraf
Member

then we can make it into a song like

who's the best at building docker? binderbinderhubhub
who supports both R and Rocker? binderbinderhubhub
who connects all the pieces? binderbinderhubhub
who can reproduce your thesis? binderbinderhubhub

ok no more coffee for me this morning

choldgraf added a commit to choldgraf/mybinder.org-deploy that referenced this issue Oct 24, 2017
@betatim
Member

betatim commented Oct 27, 2017

What kind of tools/setups would we use to collect the events emitted by binderhub?

As a user of this data, I'd hit "events.mybinder.org" and receive all future events (similar to how you subscribe to the twitter 1% stream?).

@yuvipanda
Contributor Author

The way you'd usually do this is:

  1. Move binderhub to emitting structured logs (Switch to structured logging binderhub#219)
  2. Tail these logs from a different service (just do the equivalent of kubectl logs, or pull in from the stackdriver API)
  3. This service is accessible over the web (as events.mybinder.org, sure!), and probably produces an EventStream (so can be easily consumed from front end JS as well as other languages)

This accomplishes a few things:

  1. It keeps the whole thing completely optional, and doesn't bloat the binderhub code
  2. Steps (2) and (3) are quite generic and unrelated to binderhub itself, so we might actually be able to find an existing tool that already does them. Even if we don't, this is conceptually quite simple to write, and scales well horizontally
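Step (3) could be as small as a formatter that wraps each tailed JSON log line as a Server-Sent Events frame, which browsers consume natively via `EventSource`. A hedged sketch (the event name and framing here are assumptions, not a decided format):

```python
import json

# Sketch of the relay's output side: turn one tailed JSON log line into
# a "text/event-stream" (Server-Sent Events) frame. A web framework
# would stream these frames at events.mybinder.org.
def to_sse(log_line):
    """Wrap one JSON log line as an SSE frame (event name + data + blank line)."""
    event = json.loads(log_line)
    name = event.get("event", "message")
    return f"event: {name}\ndata: {json.dumps(event)}\n\n"

frame = to_sse('{"event": "launch", "repo": "org/a"}')
print(frame)
```

A front-end consumer would then just be `new EventSource("https://events.mybinder.org")` plus an event listener, with no binderhub-specific code on either side of the relay.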

@yuvipanda
Contributor Author

We could also have an explicit 'public' field in the JSON log output, thus whitelisting what appears in the public stream. This protects against things like secrets accidentally leaking.
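A minimal sketch of that whitelist filter, assuming a boolean `public` field (a hypothetical name): only records explicitly marked public are forwarded to the open stream; everything else, including unparseable lines, is dropped.

```python
import json

# Sketch of the whitelist: forward only records explicitly marked
# "public": true. Anything unmarked or unstructured never reaches the
# public stream, so an accidentally logged secret stays private.
def public_events(log_lines):
    for line in log_lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # drop unstructured lines entirely
        if event.get("public") is True:
            event.pop("public")  # internal marker, not part of the payload
            yield event

lines = [
    '{"public": true, "event": "launch", "repo": "org/a"}',
    '{"public": false, "event": "build", "token": "SECRET"}',
    'plain text line',
]
print(list(public_events(lines)))
# [{'event': 'launch', 'repo': 'org/a'}]
```

Defaulting to private (emit only when `public` is explicitly true) errs on the safe side when a log statement forgets the marker.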
