
Public analytics & metrics pipeline #97

Closed
yuvipanda opened this issue Oct 22, 2017 · 9 comments

@yuvipanda
Contributor

yuvipanda commented Oct 22, 2017

One of the coolest things about Wikimedia is the large amount of usage data it makes available publicly: the Page Views API & dumps, content dumps, client usage, the live recent changes stream, etc. This is useful for a number of purposes - fundraising, quantifying impact, and more. By simply publishing the raw data, Wikimedia lets a wide variety of people derive whatever meaning they want from it, enabling creativity & removing itself as a bottleneck.

We should take a similar approach, both because we strive to be open & because we're a small team that cannot build all the cool things this approach would make possible.

The simple proposal here is:

  1. Instrument binderhub to emit an event, with some metadata, every time someone builds a repo. We could do this with regular logging.
  2. Collect this info in a structured fashion as an event stream (such as 'all launches').
  3. Provide this as a primary source, in the form of an event stream that the internet can subscribe to.
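The first step above could be sketched roughly like this: each launch is logged as one structured JSON line that a separate collector can later tail. The field names (`event`, `provider`, `repo`, `status`) are illustrative assumptions, not a decided schema:

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical sketch: emit one JSON event per repo launch via regular
# logging. A downstream collector can tail these lines and re-publish
# them as a public event stream.
logger = logging.getLogger("binderhub.events")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def emit_launch_event(provider, repo, status):
    """Log one launch as a single JSON line."""
    event = {
        "event": "launch",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "provider": provider,   # e.g. "GitHub"
        "repo": repo,           # e.g. "org/repository"
        "status": status,       # e.g. "success" or "failure"
    }
    logger.info(json.dumps(event))
    return event

emit_launch_event("GitHub", "binder-examples/requirements", "success")
```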

This can be our primary information source. On top of this, multiple other things can be built:

  1. Someone just listens on this and produces daily, weekly, monthly aggregate statistics
  2. This can also be used to provide nicer badges that have a 'launch count' on them
  3. People can make cool visualizations like https://listen.hatnote.com/
  4. Bots and humans can use this to spot builds that are failing
  5. Can easily be used to justify your own repo's funding / credit, since you know how many times it has been launched.
  6. Can produce leaderboards!
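The first consumer above (daily/weekly/monthly aggregates) could be a small fold over the event stream. A minimal sketch, assuming events are JSON objects with `timestamp` and `repo` fields (the real schema is undecided):

```python
from collections import Counter

# Sketch of an aggregate-statistics consumer: fold a stream of launch
# events into per-day, per-repo counts. Field names are assumptions.
def daily_counts(events):
    counts = Counter()
    for event in events:
        day = event["timestamp"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        counts[(day, event["repo"])] += 1
    return counts

stream = [
    {"timestamp": "2017-10-22T10:00:00Z", "repo": "org/a"},
    {"timestamp": "2017-10-22T11:30:00Z", "repo": "org/a"},
    {"timestamp": "2017-10-23T09:15:00Z", "repo": "org/b"},
]
print(daily_counts(stream))
# Counter({('2017-10-22', 'org/a'): 2, ('2017-10-23', 'org/b'): 1})
```

The same fold, keyed only by repo, would also drive the launch-count badges and leaderboards.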

And far more. This also prevents us from being a bottleneck, and opens up space for a developer community that uses binder (rather than just one that develops binder). We also determine what kind of info is emitted, making sure we preserve our users' privacy.

This issue is primarily to talk about this approach, rather than technical details. Thoughts?

@choldgraf
Member

Totally agree, this could be a great community feature, and is a key step towards making a "case" for binder tech as being impactful.

One challenge: how would this work once Binder is federated? Would these statistics be kept at the BinderHub level? If there are multiple public streams out there, then it would be straightforward to aggregate them, so maybe not such a big deal so long as the data is there.

@yuvipanda
Contributor Author

Indeed, ideally every BinderHub would make its stream public and people can aggregate them. See https://wikiapiary.com/wiki/Main_Page for how this sort of aggregation happens for MediaWiki instances (the software that runs Wikimedia's sites but also many other websites unrelated to Wikimedia).

@ctb

ctb commented Oct 22, 2017

+1 on the federation question :). Does this mean we'll need to have a BinderHubHub?

@choldgraf
Member

I prefer "binderbinderhubhub"

@choldgraf
Member

then we can make it into a song like

who's the best at building docker? binderbinderhubhub
who supports both R and Rocker? binderbinderhubhub
who connects all the pieces? binderbinderhubhub
who can reproduce your thesis? binderbinderhubhub

ok no more coffee for me this morning

choldgraf added a commit to choldgraf/mybinder.org-deploy that referenced this issue Oct 24, 2017
@betatim
Member

betatim commented Oct 27, 2017

What kind of tools/setups would we use to collect the events emitted by binderhub?

As a user of this data, I'd hit "events.mybinder.org" and receive all future events (similar to how you subscribe to the twitter 1% stream?).

@yuvipanda
Contributor Author

The way you'd usually do this is:

  1. Move binderhub to emitting structured logs (Switch to structured logging binderhub#219)
  2. Tail these logs from a different service (just do the equivalent of kubectl logs, or pull in from the stackdriver API)
  3. This service is accessible over the web (as events.mybinder.org, sure!), and probably produces an EventStream (so can be easily consumed from front end JS as well as other languages)

This accomplishes a few things:

  1. It keeps the whole thing completely optional, and doesn't bloat the binderhub code
  2. Steps (2) and (3) are quite generic and unrelated to binderhub itself, so we might actually be able to find an existing tool that already does them. Even if we don't, this is conceptually quite simple to write, and scales well horizontally
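Step (3) could be as small as a formatter that wraps each tailed JSON log line as a Server-Sent Events frame, which browsers consume natively via `EventSource`. A hedged sketch (the event name and framing here are assumptions, not a decided format):

```python
import json

# Sketch of the relay's output side: turn one tailed JSON log line into
# a "text/event-stream" (Server-Sent Events) frame. A web framework
# would stream these frames at events.mybinder.org.
def to_sse(log_line):
    """Wrap one JSON log line as an SSE frame (event name + data + blank line)."""
    event = json.loads(log_line)
    name = event.get("event", "message")
    return f"event: {name}\ndata: {json.dumps(event)}\n\n"

frame = to_sse('{"event": "launch", "repo": "org/a"}')
print(frame)
```

A front-end consumer would then just be `new EventSource("https://events.mybinder.org")` plus an event listener, with no binderhub-specific code on either side of the relay.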

@yuvipanda
Contributor Author

We could also have an explicit 'public' field in the JSON log output, thus whitelisting what appears in the public stream. This protects against things like secrets accidentally leaking.
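A minimal sketch of that whitelist filter, assuming a boolean `public` field (a hypothetical name): only records explicitly marked public are forwarded to the open stream; everything else, including unparseable lines, is dropped.

```python
import json

# Sketch of the whitelist: forward only records explicitly marked
# "public": true. Anything unmarked or unstructured never reaches the
# public stream, so an accidentally logged secret stays private.
def public_events(log_lines):
    for line in log_lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # drop unstructured lines entirely
        if event.get("public") is True:
            event.pop("public")  # internal marker, not part of the payload
            yield event

lines = [
    '{"public": true, "event": "launch", "repo": "org/a"}',
    '{"public": false, "event": "build", "token": "SECRET"}',
    'plain text line',
]
print(list(public_events(lines)))
# [{'event': 'launch', 'repo': 'org/a'}]
```

Defaulting to private (emit only when `public` is explicitly true) errs on the safe side when a log statement forgets the marker.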
