Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental loading of Github events and issues. #16

Merged
merged 16 commits into from
Nov 19, 2014
Merged

Incremental loading of Github events and issues. #16

merged 16 commits into from
Nov 19, 2014

Conversation

LosD
Copy link
Contributor

@LosD LosD commented Nov 16, 2014

This is an attempt to use the ETag header and since parameter to make the polling more effecient.

Sorry about all the stupid commits (updating README, then revert, commenting GPG, then uncommenting again), not entirely used to working in a fork. :)

The changes aren't super pretty, but I wanted to keep as close to the original design as possible, which may have streched it a bit too far. I'm thinking we could introduce some kind of indexer or fetcher class/interface that could be subclassed for special cases like Events to help keeping track of ETags and state, but I'd like your opinion first.

Fixes #15 when merged.

LosD and others added 11 commits November 6, 2014 19:02
This reverts commit cbb8c0a.
This reverts commit c915697.
 GitHub returns events in LIFO order, so that the newest event is returned first. This means that we can stop as soon as we see an existing event.
…re/incremental-update

Conflicts:
	src/main/java/com/ubervu/river/github/GitHubRiver.java
@LosD LosD changed the title Incremental updates. Fixes #15 Incremental loading of Github events and issues. Nov 16, 2014
@LosD
Copy link
Contributor Author

LosD commented Nov 16, 2014

Dammit, found a bug. My theory about IDs doesn't seem to hold, at least between types. Will change back to created_at again. I don't believe it will be a performance issue, AFAIK it is treated as an long integer internally, so sorting should be quick, but we might want to verify that.

Previously I measured for myself, and first time around it took around 0.1 second for the first time, and around 5-10 ms after that. That was for around 1300 documents, after filtering to only include documents containing created_at. My CPU is a 2012 2.8 GHz i7, datastore running from a SSD.

…believed. Does not seem to be a performance hit.
@LosD
Copy link
Contributor Author

LosD commented Nov 16, 2014

That should do it :)

IndexRequest req = null;
for (JsonElement e: array) {
if (type.equals("event")) {
req = indexEvent(e);
if (req == null) {
continueIndexing = false;
logger.debug("Found existing event, all remaining events has already been indexed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be 100% sure about this :D. Wouldn't want to miss any events!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be bad.

We could run through them all in the beginning, just to be sure (though we lose quite a bit of the speedup). If we still make the check, we could log a warning (or at least an info) if an unindexed event is registered after the continueIndexing flag (which should probably be renamed in that case) has been set. If we see none like this for a few releases, it should be safe to enable the optimization.

A middle ground could also be to run through the rest of the current block, and keep the continueIndexing warning. If there is any bugs or bad assumptions, there would still be a chance of losing events, but at least it would be quite a bit smaller, as it would have to hit in the boundary between blocks at for thing like off-by-one bugs to have an effect. If "last event first" doesn't hold in all cases, then all bets are of course off.

It's actually a liiiitle strange that they didn't implement "since" for events.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we try the first way you suggested for 1-2 days and watch the logs. If we see no occurrence, we should be fine.

re: since for events - maybe there is a way. I suggest you shoot an email to support@github.com, I've done it a couple times and they were very helpful. Maybe there's something that we are missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent them a mail, so let's see what they say. I'll make the change tonight. :)

@LosD LosD mentioned this pull request Nov 17, 2014
}
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
addAuthHeader(connection);
if (type.equals("event")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it is a bit unintuitive to have this here. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was exactly what I meant with "stretching the design" :D

@mihneadb
Copy link
Contributor

First of all, thanks for all your work and sorry for the delay!

The changes make sense, and I agree that a better design is desirable. That being considered, how about we merge these changes as they are and refactor things in a new PR? (Would be great if you feel like doing that as well since you definitely have a good direction already.)

@LosD
Copy link
Contributor Author

LosD commented Nov 17, 2014

You're very welcome, and nevermind (I wasn't exactly speedy myself) :)

I agree. It's probably best to merge changes like this a little fast, as most unrelated changes can easily create conflicts.

I'll see if I can come up with something for a redesign of the indexing proces.

@mihneadb
Copy link
Contributor

Sounds great. Let's do as discussed in the "checking for the already seen events" thread and try the code without merging for a couple of days and afterwards let's merge it. Sounds good?

BTW, please bump the version in pom.xml and README.

@LosD
Copy link
Contributor Author

LosD commented Nov 18, 2014

Okay, sounds good! I'll bump the version. I guess I should also explain that the default poll time is reduced, and why.

@mihneadb
Copy link
Contributor

ACK!

On Tue, Nov 18, 2014 at 11:52 AM, Dennis Du Krøger <notifications@github.com

wrote:

Okay, sounds good! I'll bump the version. I guess I should also explain
that the default poll time is reduced, and why.


Reply to this email directly or view it on GitHub
#16 (comment)
.

http://hootsuite.com
Mihnea Dobrescu-Balaur
Software Engineer | Hootsuite https://www.hootsuite.com
t: +40 720 533 802 | @mihneadb http://twitter.com/mihneadb
Find Hootsuite online:
[image: Hootsuite Blog RSS] http://blog.hootsuite.com [image: Facebook]
https://facebook.com/hootsuite [image: Twitter]
https://twitter.com/hootsuite [image: Youtube]
https://youtube.com/hootsuite [image: Instagram]
http://instagram.com/hootsuite [image: Google+]
https://plus.google.com/+HootSuite/posts
We are hiring in a big way! Apply now http://hootsuite.com/careers

Hootsuite empowers customers to transform messages into meaningful
relationships.

This email is being sent on behalf of Hootsuite Media, Inc
http://hootsuite.com/. If you are no longer interested in receiving
emails from Hootsuite, please click here
https://socialbusiness.hootsuite.com/unsubscribe.html.

Hootsuite Media Inc., 5 East 8th Avenue, Vancouver, BC, V5T 1R6.

Dennis Du Krøger added 2 commits November 18, 2014 21:10
…vent always first" theory.

If/when confirmed, this can be reverted.
@LosD
Copy link
Contributor Author

LosD commented Nov 18, 2014

Done! I also got a reply from Github support. They also believe that newest events always come first, but didn't exactly sound 100% sure, so I think we should stick to testing a bit for now.

First test run seemed to run smoothly: Got 8 new events, and then 292 already existing events (I'm running with debug logging enabled)

@mihneadb
Copy link
Contributor

Cool! I don't have my laptop handy now, but I will in the morning. I'll start the indexer on a few repos and check the logs. If I see no warnings by the end of the day I'll revert a1f5c8c and then I'll merge the PR.

@LosD
Copy link
Contributor Author

LosD commented Nov 18, 2014

Sounds good! ☺

@mihneadb
Copy link
Contributor

@LosD I'm getting 403 whenever it checks the collaborators:

[2014-11-19 10:01:52,052][INFO ][com.ubervu.river.github.GitHubRiver] [Bucky] [github][gh_river2] Exception encountered (403 usually is rate limit exceeded):
java.io.IOException: Server returned HTTP response code: 403 for URL: https://api.github.com/repos/jquery/jquery/collaborators?per_page=1000
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at com.ubervu.river.github.GitHubRiver$DataStream.indexResponse(GitHubRiver.java:159)
    at com.ubervu.river.github.GitHubRiver$DataStream.getData(GitHubRiver.java:334)
    at com.ubervu.river.github.GitHubRiver$DataStream.getData(GitHubRiver.java:312)
    at com.ubervu.river.github.GitHubRiver$DataStream.run(GitHubRiver.java:415)

Are you seeing this as well?

@@ -32,7 +32,7 @@ curl -XPUT localhost:9200/_river/my_gh_river/_meta -d '{
"github": {
"owner": "gabrielfalcao",
"repository": "lettuce",
"interval": 3600,
"interval": 60,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment saying that this is now optional (see how it is for auth, below).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if the user sets a poll interval that is really large (rarely polling), there will be the possibility of losing events, correct?

So maybe it's better to remove that argument altogether? What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has always been optional, AFAIK (the default was 3600 before).

I think we should let it stay, especially until we ETag everything. Without it, an unauthenticated user would be 100% sure to hit the request limit, and they might not care about the events at all, i.e. we don't for our project (we use it to do graphs for issue open time). But we should probably warn that it is a very real possibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, my bad, I forgot :). Ok, good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a bit about it in the readme.

@LosD
Copy link
Contributor Author

LosD commented Nov 19, 2014

Hmmm... No, at least not last night, when I checked. Only thing I saw was some connection issues around when my computer slept and resumed but that is pretty much expected.

We should probably output the headers when we get an angry response from the server, if possible.

What do you see if you try manually using i.e. curl -i? There should be no difference in how we fetch stuff, so it's pretty strange.

@mihneadb
Copy link
Contributor

Yep, "message": "Must have push access to view repository collaborators.". Let's output the message when we get an error from GitHub as well.

I guess we can either remove the collaborators fetching or try doing it and if it raises don't panic - but then we need to document this in the readme (i.e. "we also fetch collaborators if the credentials used have push access to the repo").

What do you think?

@LosD
Copy link
Contributor Author

LosD commented Nov 19, 2014

I think we should do it if possible. But can we check the permissions properly somehow, or would we constantly be hitting the server with unallowed requests?

@mihneadb
Copy link
Contributor

Yes, I think we can do a GET[1] and look at the permissions dict (check the example there - not sure which one of source/parent we need to check).

[1] https://developer.github.com/v3/repos/#get

@LosD
Copy link
Contributor Author

LosD commented Nov 19, 2014

I think it should be the one permissions in the root. parent seem to be the organization itself, and I'd guess that source is the one a repo is based on. Like my https://github.com/LosD/elasticsearch-river-github is not really a standalone repo, but sourced by yours.

@LosD
Copy link
Contributor Author

LosD commented Nov 19, 2014

The question is if we shouldn't solve that as a separate issue? It's not really related to incremental loading :)

@mihneadb
Copy link
Contributor

Ah I didn't see a permissions field in the root. Perfect if it's there!

On Wed, Nov 19, 2014 at 11:52 AM, Dennis Du Krøger <notifications@github.com

wrote:

I think it should be the one permissions in the root. parent seem to be
the organization itself, and I'd guess that source is the one a repo is
based on. Like my https://github.com/LosD/elasticsearch-river-github is
not really a standalone repo, but sourced by yours.


Reply to this email directly or view it on GitHub
#16 (comment)
.

http://hootsuite.com
Mihnea Dobrescu-Balaur
Software Engineer | Hootsuite https://www.hootsuite.com
t: +40 720 533 802 | @mihneadb http://twitter.com/mihneadb
Find Hootsuite online:
[image: Hootsuite Blog RSS] http://blog.hootsuite.com [image: Facebook]
https://facebook.com/hootsuite [image: Twitter]
https://twitter.com/hootsuite [image: Youtube]
https://youtube.com/hootsuite [image: Instagram]
http://instagram.com/hootsuite [image: Google+]
https://plus.google.com/+HootSuite/posts
We are hiring in a big way! Apply now http://hootsuite.com/careers

Hootsuite empowers customers to transform messages into meaningful
relationships.

This email is being sent on behalf of Hootsuite Media, Inc
http://hootsuite.com/. If you are no longer interested in receiving
emails from Hootsuite, please click here
https://socialbusiness.hootsuite.com/unsubscribe.html.

Hootsuite Media Inc., 5 East 8th Avenue, Vancouver, BC, V5T 1R6.

@mihneadb
Copy link
Contributor

Agreed. Will open a new issue.

On Wed, Nov 19, 2014 at 11:53 AM, Dennis Du Krøger <notifications@github.com

wrote:

The question is if we shouldn't solve that as a separate issue? It's not
really related to incremental loading :)


Reply to this email directly or view it on GitHub
#16 (comment)
.

http://hootsuite.com
Mihnea Dobrescu-Balaur
Software Engineer | Hootsuite https://www.hootsuite.com
t: +40 720 533 802 | @mihneadb http://twitter.com/mihneadb
Find Hootsuite online:
[image: Hootsuite Blog RSS] http://blog.hootsuite.com [image: Facebook]
https://facebook.com/hootsuite [image: Twitter]
https://twitter.com/hootsuite [image: Youtube]
https://youtube.com/hootsuite [image: Instagram]
http://instagram.com/hootsuite [image: Google+]
https://plus.google.com/+HootSuite/posts
We are hiring in a big way! Apply now http://hootsuite.com/careers

Hootsuite empowers customers to transform messages into meaningful
relationships.

This email is being sent on behalf of Hootsuite Media, Inc
http://hootsuite.com/. If you are no longer interested in receiving
emails from Hootsuite, please click here
https://socialbusiness.hootsuite.com/unsubscribe.html.

Hootsuite Media Inc., 5 East 8th Avenue, Vancouver, BC, V5T 1R6.

@mihneadb
Copy link
Contributor

OK, I didn't see any of those warnings in the logs. @LosD please revert that commit and I'll merge the changes.

Thanks!

@mihneadb
Copy link
Contributor

[I'd revert it myself but I don't have access to your branch.]

@LosD
Copy link
Contributor Author

LosD commented Nov 19, 2014

Cool, done! :)

mihneadb added a commit that referenced this pull request Nov 19, 2014
Incremental loading of Github events and issues.
@mihneadb mihneadb merged commit fb5ec3f into uberVU:master Nov 19, 2014
@mihneadb
Copy link
Contributor

Awesome, thanks so much! Deploying to the central repo as we speak.

@LosD
Copy link
Contributor Author

LosD commented Nov 19, 2014

Perfect, you are very welcome! :) A little break tonight, then I'll take a look at some of the other stuff we discussed.

@mihneadb
Copy link
Contributor

Haha take your time!

@mihneadb mihneadb mentioned this pull request Nov 19, 2014
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fetch updates incrementally, but as often as allowed
2 participants