Incremental loading of Github events and issues. #16
Conversation
Dammit, found a bug. My theory about IDs doesn't seem to hold, at least between types. Will change back to created_at again. I don't believe it will be a performance issue: AFAIK it is treated as a long integer internally, so sorting should be quick, but we might want to verify that. I measured it myself earlier: around 0.1 seconds the first time, and around 5-10 ms after that. That was for around 1300 documents, after filtering to only include documents containing created_at. My CPU is a 2012 2.8 GHz i7, with the datastore running from an SSD.
…believed. Does not seem to be a performance hit.
That should do it :)
    IndexRequest req = null;
    for (JsonElement e : array) {
        if (type.equals("event")) {
            req = indexEvent(e);
            if (req == null) {
                continueIndexing = false;
                logger.debug("Found existing event, all remaining events have already been indexed");
Let's be 100% sure about this :D. Wouldn't want to miss any events!
Yeah, that would be bad.
We could run through them all in the beginning, just to be sure (though we'd lose quite a bit of the speedup). If we still make the check, we could log a warning (or at least an info) if an unindexed event is registered after the continueIndexing flag (which should probably be renamed in that case) has been set. If we see none like this for a few releases, it should be safe to enable the optimization.
A middle ground could also be to run through the rest of the current block, and keep the continueIndexing warning. If there are any bugs or bad assumptions, there would still be a chance of losing events, but at least it would be quite a bit smaller, as a bug would have to hit the boundary between blocks for things like off-by-one errors to have an effect. If "last event first" doesn't hold in all cases, then all bets are of course off.
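That middle-ground idea can be sketched as pure logic. Everything here (the class, scanBlock, the warnings list) is illustrative, not the plugin's actual code: keep scanning the rest of the current block after the first already-seen event, and record a warning whenever an unseen event turns up after a seen one, since that would mean the "newest events first" assumption was violated.

```java
import java.util.*;

// Sketch only: scanBlock and all names here are illustrative, not the plugin's API.
class BlockScanSketch {
    // Returns the event IDs that still need indexing. After the first
    // already-seen event, the rest of the block is still scanned; an unseen
    // event after a seen one is indexed anyway, and a warning is recorded
    // instead of silently dropping it.
    static List<String> scanBlock(List<String> eventIds, Set<String> seenIds, List<String> warnings) {
        List<String> toIndex = new ArrayList<>();
        boolean seenExisting = false;
        for (String id : eventIds) {
            if (seenIds.contains(id)) {
                seenExisting = true;
            } else {
                if (seenExisting) {
                    warnings.add("Unindexed event " + id + " found after an already indexed one");
                }
                toIndex.add(id);
            }
        }
        return toIndex;
    }
}
```

If the warning never fires over a few releases, the early-exit optimization could be enabled with more confidence.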
It's actually a liiiitle strange that they didn't implement "since" for events.
Why don't we try the first way you suggested for 1-2 days and watch the logs. If we see no occurrence, we should be fine.
re: since for events - maybe there is a way. I suggest you shoot an email to support@github.com, I've done it a couple times and they were very helpful. Maybe there's something that we are missing?
I sent them a mail, so let's see what they say. I'll make the change tonight. :)
    }
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    addAuthHeader(connection);
    if (type.equals("event")) {
Indeed, it is a bit unintuitive to have this here. :)
Yeah, that was exactly what I meant with "stretching the design" :D
First of all, thanks for all your work, and sorry for the delay! The changes make sense, and I agree that a better design is desirable. That being considered, how about we merge these changes as they are and refactor things in a new PR? (It would be great if you feel like doing that as well, since you definitely have a good direction already.)
You're very welcome, and never mind (I wasn't exactly speedy myself) :) I agree. It's probably best to merge changes like this a little fast, as most unrelated changes can easily create conflicts. I'll see if I can come up with something for a redesign of the indexing process.
Sounds great. Let's do as discussed in the "checking for the already seen events" thread and try the code without merging for a couple of days, and afterwards let's merge it. Sounds good? BTW, please bump the version in pom.xml and README.
Okay, sounds good! I'll bump the version. I guess I should also explain that the default poll time is reduced, and why.
ACK!
…vent always first" theory. If/when confirmed, this can be reverted.
Done! I also got a reply from Github support. They also believe that newest events always come first, but they didn't sound exactly 100% sure, so I think we should stick to testing a bit for now. The first test run seemed to go smoothly: got 8 new events, and then 292 already existing events (I'm running with debug logging enabled).
Cool! I don't have my laptop handy now, but I will in the morning. I'll start the indexer on a few repos and check the logs. If I see no warnings by the end of the day, I'll revert a1f5c8c and then I'll merge the PR.
Sounds good! ☺
@LosD I'm getting 403 whenever it checks the collaborators:
Are you seeing this as well?
@@ -32,7 +32,7 @@ curl -XPUT localhost:9200/_river/my_gh_river/_meta -d '{
     "github": {
         "owner": "gabrielfalcao",
         "repository": "lettuce",
-        "interval": 3600,
+        "interval": 60,
Please add a comment saying that this is now optional (see how it is for auth, below).
Actually, if the user sets a poll interval that is really large (rarely polling), there will be the possibility of losing events, correct?
So maybe it's better to remove that argument altogether? What do you think?
It has always been optional, AFAIK (the default was 3600 before).
I think we should let it stay, especially until we ETag everything. Without it, an unauthenticated user would be 100% sure to hit the request limit, and they might not care about the events at all; we don't for our project, for example (we use it to do graphs for issue open time). But we should probably warn that hitting the limit is a very real possibility.
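For a rough sense of why the limit matters: GitHub's API allows 60 requests/hour unauthenticated and 5000/hour authenticated, so a short poll interval without auth overruns the budget almost immediately. A back-of-the-envelope helper (requestsPerPoll is a made-up knob for illustration, not a real plugin setting):

```java
// Back-of-the-envelope only; the 60/5000 figures are GitHub's documented
// unauthenticated/authenticated hourly limits, requestsPerPoll is illustrative.
class RateBudget {
    static final int UNAUTHENTICATED_LIMIT = 60;   // requests per hour
    static final int AUTHENTICATED_LIMIT = 5000;   // requests per hour

    // How many requests an indexer issues per hour at a given poll interval.
    static int requestsPerHour(int pollIntervalSeconds, int requestsPerPoll) {
        return (3600 / pollIntervalSeconds) * requestsPerPoll;
    }
}
```

Polling every 60 seconds with even a handful of endpoints per poll blows past the unauthenticated limit, while the old 3600-second default stays well under it.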
Whoops, my bad, I forgot :). Ok, good point.
I've added a bit about it in the readme.
Hmmm... No, at least not last night, when I checked. The only thing I saw was some connection issues around when my computer slept and resumed, but that is pretty much expected. We should probably output the headers when we get an angry response from the server, if possible. What do you see if you try manually, e.g. with curl -i? There should be no difference in how we fetch stuff, so it's pretty strange.
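One way to do that header dump: HttpURLConnection.getHeaderFields() returns a Map&lt;String, List&lt;String&gt;&gt; where the status line lives under the null key, so it can be flattened into a loggable string. A minimal sketch; the class and method names are made up:

```java
import java.util.*;

// Sketch only: HeaderDump is a made-up name. The input is the shape returned
// by HttpURLConnection.getHeaderFields(), where the null key holds the status line.
class HeaderDump {
    static String formatHeaders(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = (e.getKey() == null) ? "(status)" : e.getKey();
            sb.append(name).append(": ").append(String.join(", ", e.getValue())).append('\n');
        }
        return sb.toString();
    }
}
```

On a 403, GitHub's rate-limit headers (X-RateLimit-Remaining and friends) in such a dump would quickly show whether the limit is the culprit.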
Yep, I guess we can either remove the collaborators fetching, or try doing it and not panic if it fails - but then we need to document this in the readme (e.g. "we also fetch collaborators if the credentials used have push access to the repo"). What do you think?
I think we should do it if possible. But can we check the permissions properly somehow, or would we constantly be hitting the server with unallowed requests?
Yes, I think we can do a GET[1] and look at the permissions dict (check the example there - not sure which one of source/parent we need to check).
I think it should be the permissions one in the root. parent seems to be the organization itself, and I'd guess that source is the one a repo is based on. For example, my https://github.com/LosD/elasticsearch-river-github is not really a standalone repo, but sourced by yours.
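A sketch of that check, modeling the root-level permissions dict from the repo JSON (e.g. {"admin": false, "push": false, "pull": true}) as a plain Map instead of the Gson tree the plugin actually parses; the class and method names are made up:

```java
import java.util.*;

// Sketch only: models GET /repos/:owner/:repo's root-level "permissions"
// object as a Map rather than a parsed Gson tree.
class PermissionCheck {
    // Only attempt the collaborators request when the credentials have push
    // access; without it, the request is guaranteed to 403.
    static boolean shouldFetchCollaborators(Map<String, Boolean> permissions) {
        return permissions != null && Boolean.TRUE.equals(permissions.get("push"));
    }
}
```

Checking once per poll (or caching the result) would avoid hammering the server with requests that are known to fail.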
The question is if we shouldn't solve that as a separate issue? It's not really related to incremental loading :)
Ah, I didn't see a permissions field in the root. Perfect if it's there!
Agreed. Will open a new issue.
…hentication added.
OK, I didn't see any of those warnings in the logs. @LosD please revert that commit and I'll merge the changes. Thanks!
[I'd revert it myself but I don't have access to your branch.]
This reverts commit a1f5c8c.
Cool, done! :)
Incremental loading of Github events and issues.
Awesome, thanks so much! Deploying to the central repo as we speak.
Perfect, you are very welcome! :) A little break tonight, then I'll take a look at some of the other stuff we discussed.
Haha, take your time!
This is an attempt to use the ETag header and the since parameter to make the polling more efficient.
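The ETag idea boils down to per-endpoint bookkeeping: send the previous poll's ETag as If-None-Match and skip processing on a 304 Not Modified. A minimal sketch of that state (class and method names are made up; the real plugin stores its state differently):

```java
import java.util.*;

// Sketch only: all names are made up; the plugin keeps its state elsewhere.
class EtagCache {
    private final Map<String, String> etags = new HashMap<>();

    // Value to send as If-None-Match, or null on the first poll of an endpoint.
    String conditionalHeader(String endpoint) {
        return etags.get(endpoint);
    }

    // Feed back the response: HTTP 304 means nothing changed since the stored
    // ETag; any other success carries a fresh ETag to remember. Returns
    // whether the response body needs processing.
    boolean update(String endpoint, int responseCode, String newEtag) {
        if (responseCode == 304) {
            return false; // not modified; keep the old ETag
        }
        if (newEtag != null) {
            etags.put(endpoint, newEtag);
        }
        return true;
    }
}
```

A nice property of GitHub's API is that conditional requests answered with 304 don't count against the rate limit, so this also helps with the polling-interval discussion above.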
Sorry about all the stupid commits (updating the README, then reverting; commenting out GPG, then uncommenting it again); I'm not entirely used to working in a fork. :)
The changes aren't super pretty, but I wanted to keep as close to the original design as possible, which may have stretched it a bit too far. I'm thinking we could introduce some kind of indexer or fetcher class/interface that could be subclassed for special cases like Events, to help keep track of ETags and state, but I'd like your opinion first.
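For discussion's sake, that fetcher abstraction could look something like the following; every name here is hypothetical, and String stands in for the real JsonElement/IndexRequest types so the example is self-contained:

```java
// Sketch only: all names are hypothetical, and String stands in for the real
// JsonElement / IndexRequest types.
interface EndpointFetcher {
    // Relative API path this fetcher polls, e.g. "events" or "issues".
    String endpoint();

    // Whether this fetcher manages its own incremental state (ETag, "since",
    // or the stop-at-first-seen-event logic used for events).
    boolean tracksOwnState();

    // Turn one element of the response array into something indexable.
    String toIndexRequest(String jsonElement);
}

// Example specialization for the events endpoint.
class EventsFetcher implements EndpointFetcher {
    public String endpoint() { return "events"; }
    public boolean tracksOwnState() { return true; } // ETag + seen-event cutoff
    public String toIndexRequest(String jsonElement) { return jsonElement; }
}
```

The main loop would then iterate over a list of fetchers instead of branching on the type string, which is the "stretching the design" smell mentioned above.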
Fixes #15 when merged.