Incremental loading of Github events and issues. #16
Conversation
Dammit, found a bug. My theory about IDs doesn't seem to hold, at least between types. Will change back to created_at again. I don't believe it will be a performance issue: AFAIK it is treated as a long integer internally, so sorting should be quick, but we might want to verify that. I measured it myself earlier: around 0.1 seconds the first time, and around 5-10 ms after that. That was for around 1300 documents, after filtering to only include documents containing created_at. My CPU is a 2012 2.8 GHz i7, with the datastore running from an SSD.
…believed. Does not seem to be a performance hit.
That should do it :)
    IndexRequest req = null;
    for (JsonElement e : array) {
        if (type.equals("event")) {
            req = indexEvent(e);
            if (req == null) {
                continueIndexing = false;
                logger.debug("Found existing event, all remaining events have already been indexed");
Let's be 100% sure about this :D. Wouldn't want to miss any events!
Yeah, that would be bad.
We could run through them all in the beginning, just to be sure (though we'd lose quite a bit of the speedup). If we still make the check, we could log a warning (or at least an info) if an unindexed event is registered after the continueIndexing flag (which should probably be renamed in that case) has been set. If we see none like this for a few releases, it should be safe to enable the optimization.
A middle ground could also be to run through the rest of the current block, and keep the continueIndexing warning. If there are any bugs or bad assumptions, there would still be a chance of losing events, but at least it would be quite a bit smaller, as a bug would have to hit the boundary between blocks for things like off-by-one errors to have an effect. If "last event first" doesn't hold in all cases, then all bets are of course off.
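That middle-ground idea can be sketched as pure logic. Everything here (the class, scanBlock, the warnings list) is illustrative, not the plugin's actual code: keep scanning the rest of the current block after the first already-seen event, and record a warning whenever an unseen event turns up after a seen one, since that would mean the "newest events first" assumption was violated.

```java
import java.util.*;

// Sketch only: scanBlock and all names here are illustrative, not the plugin's API.
class BlockScanSketch {
    // Returns the event IDs that still need indexing. After the first
    // already-seen event, the rest of the block is still scanned; an unseen
    // event after a seen one is indexed anyway, and a warning is recorded
    // instead of silently dropping it.
    static List<String> scanBlock(List<String> eventIds, Set<String> seenIds, List<String> warnings) {
        List<String> toIndex = new ArrayList<>();
        boolean seenExisting = false;
        for (String id : eventIds) {
            if (seenIds.contains(id)) {
                seenExisting = true;
            } else {
                if (seenExisting) {
                    warnings.add("Unindexed event " + id + " found after an already indexed one");
                }
                toIndex.add(id);
            }
        }
        return toIndex;
    }
}
```

If the warning never fires over a few releases, the early-exit optimization could be enabled with more confidence.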
It's actually a liiiitle strange that they didn't implement "since" for events.
Why don't we try the first way you suggested for 1-2 days and watch the logs. If we see no occurrence, we should be fine.
re: since for events - maybe there is a way. I suggest you shoot an email to support@github.com, I've done it a couple times and they were very helpful. Maybe there's something that we are missing?
I sent them a mail, so let's see what they say. I'll make the change tonight. :)
    }
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    addAuthHeader(connection);
    if (type.equals("event")) {
Indeed, it is a bit unintuitive to have this here. :)
Yeah, that was exactly what I meant with "stretching the design" :D
First of all, thanks for all your work, and sorry for the delay! The changes make sense, and I agree that a better design is desirable. That being considered, how about we merge these changes as they are and refactor things in a new PR? (It would be great if you feel like doing that as well, since you definitely have a good direction already.)
You're very welcome, and never mind (I wasn't exactly speedy myself) :) I agree. It's probably best to merge changes like this a little fast, as most unrelated changes can easily create conflicts. I'll see if I can come up with something for a redesign of the indexing process.
Sounds great. Let's do as discussed in the "checking for the already seen events" thread and try the code without merging for a couple of days, and afterwards let's merge it. Sounds good? BTW, please bump the version in pom.xml and README.
Okay, sounds good! I'll bump the version. I guess I should also explain that the default poll time is reduced, and why.
ACK!
…vent always first" theory. If/when confirmed, this can be reverted.
Done! I also got a reply from Github support. They also believe that newest events always come first, but they didn't sound exactly 100% sure, so I think we should stick to testing a bit for now. The first test run seemed to go smoothly: got 8 new events, and then 292 already existing events (I'm running with debug logging enabled).
Cool! I don't have my laptop handy now, but I will in the morning. I'll start the indexer on a few repos and check the logs. If I see no warnings by the end of the day, I'll revert a1f5c8c and then I'll merge the PR.
Sounds good! ☺
@LosD I'm getting 403 whenever it checks the collaborators:
Are you seeing this as well?
@@ -32,7 +32,7 @@ curl -XPUT localhost:9200/_river/my_gh_river/_meta -d '{
     "github": {
         "owner": "gabrielfalcao",
         "repository": "lettuce",
-        "interval": 3600,
+        "interval": 60,
Please add a comment saying that this is now optional (see how it is for auth, below).
Actually, if the user sets a poll interval that is really large (rarely polling), there will be the possibility of losing events, correct?
So maybe it's better to remove that argument altogether? What do you think?
It has always been optional, AFAIK (the default was 3600 before).
I think we should let it stay, especially until we ETag everything. Without it, an unauthenticated user would be 100% sure to hit the request limit, and they might not care about the events at all; we don't for our project, for example (we use it to do graphs for issue open time). But we should probably warn that hitting the limit is a very real possibility.
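For a rough sense of why the limit matters: GitHub's API allows 60 requests/hour unauthenticated and 5000/hour authenticated, so a short poll interval without auth overruns the budget almost immediately. A back-of-the-envelope helper (requestsPerPoll is a made-up knob for illustration, not a real plugin setting):

```java
// Back-of-the-envelope only; the 60/5000 figures are GitHub's documented
// unauthenticated/authenticated hourly limits, requestsPerPoll is illustrative.
class RateBudget {
    static final int UNAUTHENTICATED_LIMIT = 60;   // requests per hour
    static final int AUTHENTICATED_LIMIT = 5000;   // requests per hour

    // How many requests an indexer issues per hour at a given poll interval.
    static int requestsPerHour(int pollIntervalSeconds, int requestsPerPoll) {
        return (3600 / pollIntervalSeconds) * requestsPerPoll;
    }
}
```

Polling every 60 seconds with even a handful of endpoints per poll blows past the unauthenticated limit, while the old 3600-second default stays well under it.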
Whoops, my bad, I forgot :). Ok, good point.
I've added a bit about it in the readme.
Hmmm... No, at least not last night, when I checked. The only thing I saw was some connection issues around when my computer slept and resumed, but that is pretty much expected. We should probably output the headers when we get an angry response from the server, if possible. What do you see if you try manually, e.g. with curl -i? There should be no difference in how we fetch stuff, so it's pretty strange.
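One way to do that header dump: HttpURLConnection.getHeaderFields() returns a Map&lt;String, List&lt;String&gt;&gt; where the status line lives under the null key, so it can be flattened into a loggable string. A minimal sketch; the class and method names are made up:

```java
import java.util.*;

// Sketch only: HeaderDump is a made-up name. The input is the shape returned
// by HttpURLConnection.getHeaderFields(), where the null key holds the status line.
class HeaderDump {
    static String formatHeaders(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = (e.getKey() == null) ? "(status)" : e.getKey();
            sb.append(name).append(": ").append(String.join(", ", e.getValue())).append('\n');
        }
        return sb.toString();
    }
}
```

On a 403, GitHub's rate-limit headers (X-RateLimit-Remaining and friends) in such a dump would quickly show whether the limit is the culprit.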
Yep, I guess we can either remove the collaborators fetching, or try doing it and not panic if it fails - but then we need to document this in the readme (e.g. "we also fetch collaborators if the credentials used have push access to the repo"). What do you think?
I think we should do it if possible. But can we check the permissions properly somehow, or would we constantly be hitting the server with unallowed requests?
Yes, I think we can do a GET[1] and look at the permissions dict (check the example there - not sure which one of source/parent we need to check).
I think it should be the permissions one in the root. parent seems to be the organization itself, and I'd guess that source is the one a repo is based on. For example, my https://github.com/LosD/elasticsearch-river-github is not really a standalone repo, but sourced by yours.
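A sketch of that check, modeling the root-level permissions dict from the repo JSON (e.g. {"admin": false, "push": false, "pull": true}) as a plain Map instead of the Gson tree the plugin actually parses; the class and method names are made up:

```java
import java.util.*;

// Sketch only: models GET /repos/:owner/:repo's root-level "permissions"
// object as a Map rather than a parsed Gson tree.
class PermissionCheck {
    // Only attempt the collaborators request when the credentials have push
    // access; without it, the request is guaranteed to 403.
    static boolean shouldFetchCollaborators(Map<String, Boolean> permissions) {
        return permissions != null && Boolean.TRUE.equals(permissions.get("push"));
    }
}
```

Checking once per poll (or caching the result) would avoid hammering the server with requests that are known to fail.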
The question is if we shouldn't solve that as a separate issue? It's not really related to incremental loading :)
Ah, I didn't see a permissions field in the root. Perfect if it's there!
Agreed. Will open a new issue.
…hentication added.
OK, I didn't see any of those warnings in the logs. @LosD please revert that commit and I'll merge the changes. Thanks!
[I'd revert it myself but I don't have access to your branch.]
This reverts commit a1f5c8c.
Cool, done! :)
Incremental loading of Github events and issues.
Awesome, thanks so much! Deploying to the central repo as we speak.
Perfect, you are very welcome! :) A little break tonight, then I'll take a look at some of the other stuff we discussed.
Haha, take your time!
This is an attempt to use the ETag header and the since parameter to make the polling more efficient.
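The ETag idea boils down to per-endpoint bookkeeping: send the previous poll's ETag as If-None-Match and skip processing on a 304 Not Modified. A minimal sketch of that state (class and method names are made up; the real plugin stores its state differently):

```java
import java.util.*;

// Sketch only: all names are made up; the plugin keeps its state elsewhere.
class EtagCache {
    private final Map<String, String> etags = new HashMap<>();

    // Value to send as If-None-Match, or null on the first poll of an endpoint.
    String conditionalHeader(String endpoint) {
        return etags.get(endpoint);
    }

    // Feed back the response: HTTP 304 means nothing changed since the stored
    // ETag; any other success carries a fresh ETag to remember. Returns
    // whether the response body needs processing.
    boolean update(String endpoint, int responseCode, String newEtag) {
        if (responseCode == 304) {
            return false; // not modified; keep the old ETag
        }
        if (newEtag != null) {
            etags.put(endpoint, newEtag);
        }
        return true;
    }
}
```

A nice property of GitHub's API is that conditional requests answered with 304 don't count against the rate limit, so this also helps with the polling-interval discussion above.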
Sorry about all the stupid commits (updating the README, then reverting; commenting out GPG, then uncommenting it again); I'm not entirely used to working in a fork. :)
The changes aren't super pretty, but I wanted to keep as close to the original design as possible, which may have stretched it a bit too far. I'm thinking we could introduce some kind of indexer or fetcher class/interface that could be subclassed for special cases like Events, to help keep track of ETags and state, but I'd like your opinion first.
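For discussion's sake, that fetcher abstraction could look something like the following; every name here is hypothetical, and String stands in for the real JsonElement/IndexRequest types so the example is self-contained:

```java
// Sketch only: all names are hypothetical, and String stands in for the real
// JsonElement / IndexRequest types.
interface EndpointFetcher {
    // Relative API path this fetcher polls, e.g. "events" or "issues".
    String endpoint();

    // Whether this fetcher manages its own incremental state (ETag, "since",
    // or the stop-at-first-seen-event logic used for events).
    boolean tracksOwnState();

    // Turn one element of the response array into something indexable.
    String toIndexRequest(String jsonElement);
}

// Example specialization for the events endpoint.
class EventsFetcher implements EndpointFetcher {
    public String endpoint() { return "events"; }
    public boolean tracksOwnState() { return true; } // ETag + seen-event cutoff
    public String toIndexRequest(String jsonElement) { return jsonElement; }
}
```

The main loop would then iterate over a list of fetchers instead of branching on the type string, which is the "stretching the design" smell mentioned above.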
Fixes #15 when merged.