Refresh token flow breaks up during db server downtime #1928
Refresh tokens are one-time tokens, meaning that they can be used only once. It's possible that there is a bug in the application which refreshes a token but does not update it properly in your storage. Another possibility is a slow request that times out, causing the token to never reach your application in the first place. Unless you have problems with the database, it would be quite unlikely for the cause to be there. I'm pretty sure it's one of the above issues. Regarding your flow, your application needs to be able to deal with this! A flow like this can break at any time (e.g. because the user removed consent) and should be accounted for :) I don't dispute that such a bug is annoying though!
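For illustration, here is a minimal Node.js/TypeScript sketch of what handling this on the client side could look like. The token endpoint URL, client credentials, and the startAuthorizationCodeFlow() helper are placeholders for your own app's values, not part of Hydra:

```typescript
// Sketch only: refresh the token, and if the refresh token is rejected
// (revoked, already used, or lost), fall back to a full re-authorization.
// Assumes Node 18+ (global fetch); all identifiers below are placeholders.

const TOKEN_URL = "https://hydra.example.com/oauth2/token"; // placeholder

interface TokenSet {
  access_token: string;
  refresh_token: string;
}

async function refreshOrReauthorize(tokens: TokenSet): Promise<TokenSet> {
  const res = await fetch(TOKEN_URL, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      grant_type: "refresh_token",
      refresh_token: tokens.refresh_token,
      client_id: "my-client",     // placeholder
      client_secret: "my-secret", // placeholder
    }),
  });

  if (res.ok) {
    return (await res.json()) as TokenSet;
  }

  // The refresh token is gone; the only recovery is to send the user
  // through the authorization code flow again.
  console.warn(`refresh failed (${res.status}), restarting authorization flow`);
  return startAuthorizationCodeFlow();
}

async function startAuthorizationCodeFlow(): Promise<TokenSet> {
  // Placeholder: a real app would redirect the user to the authorization
  // endpoint and complete the code exchange again.
  throw new Error("re-authorization required");
}
```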
Thank you very much for looking into this case, much appreciated!
I was able to reproduce this issue using the sample app. You mention two possible causes (a bug in the application or a timed-out request), so we need to find a way to rule them out.
This is exactly what happens in our production from time to time and what I tried to replicate locally. To make the issue stand out, the following script was running in the background:
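Roughly something like this (an illustrative Node.js/TypeScript sketch rather than the exact script; it assumes PostgreSQL runs in a local Docker container named hydra-postgres):

```typescript
// Sketch only: repeatedly take the database down and bring it back up to
// simulate short outages while the token refresh loop keeps running.
import { execSync } from "node:child_process";
import { setTimeout as sleep } from "node:timers/promises";

const CONTAINER = "hydra-postgres"; // placeholder container name

async function main(): Promise<void> {
  for (;;) {
    execSync(`docker stop ${CONTAINER}`, { stdio: "inherit" });
    await sleep(10_000); // keep the DB down for 10 seconds
    execSync(`docker start ${CONTAINER}`, { stdio: "inherit" });
    await sleep(30_000); // let things recover before the next outage
  }
}

main().catch(console.error);
```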
Normally Hydra survives such conditions quite well: while the DB is up the token is refreshed successfully, and we see errors while it is down. Once the DB is up again, everything is back to normal:
The problem is that sometimes this flow breaks:
At this point all subsequent requests to refresh the token fail. Please note that the app received all responses; there were no timeouts or anything of that sort.
Thank you for providing a detailed response here! It appears to me that there is an additional error happening:
It's hard to say what the error is exactly (maybe you could check the Hydra logs with LOG_LEVEL=trace), but I'm pretty sure it has to do with some transactional error. If you are able to find the error log and post it here, that would be a big help. By the way, in the reproduction you did - what Hydra version are you using there?
I am using
Here is the Hydra log with info level:
Please set the log level to trace!
I've reproduced this issue again; in this case the logs before the failure, from the app's standpoint, looked like this:
And Hydra logs with trace log level:
Thank you so much for looking into it!
Yeah, it looks like PostgreSQL is shutting down / being disconnected while we're in the middle of the SQL transaction which removes the old refresh token and inserts the new one. Apparently this can lead to a state where the old refresh token is removed without the new refresh token being stored. It's a bit surprising that the transaction is not properly rolled back, but it is possible that this is due to PostgreSQL shutting down - not sure, to be honest.

The only solution here is probably to change the transaction isolation level - and I'm not even sure that would really help, because we don't actually know whether the request made it through or not when the TCP ACK is simply missing. Further increasing the isolation level would (I believe) have serious performance implications, because we're already at a pretty high isolation level for this particular transaction.

I think this is one of the cases where we unfortunately have to tag this with "wontfix" and where your application should be able to deal with it by re-running the flow - unless someone comes along with a great idea on how to resolve this. I guess this also needs some in-depth knowledge of how PostgreSQL itself works.
Thank you for getting back to me, Aeneas!
Is this an assumption, or do you see it somewhere in the logs? Because this sounds quite improbable; there is little to no doubt that Postgres has either completely applied or rolled back the transaction. Which leads us to:
Please bear with me here, I am not familiar with Go: if we get an error during the transaction commit, do we still return the generated refresh token and everything else, or is it all replaced with the error message itself? I've found the following case in PostgreSQL 9 Administration Cookbook - Second Edition:
If that's the case and my assumption about the error message is correct, then everything fits together: the transaction was committed, but the client received an error instead of a token.
While the application of course should be able to handle everything thrown at it, this part leaves me unsettled. If the case above is not correct, then Hydra under some conditions experiences data loss, and that's a bummer. There may be different priorities, but closing an actual issue with "wontfix" seems wrong. Please tell me if I can help narrow this issue down. Thank you for your involvement, cheers!
Experience ;) I am not debating that PG has committed the transaction; the problem is that the TCP connection aborts while the transaction is running. The error returned by the broken transaction then causes the Go code to try to roll back the transaction, which does not happen because the TCP connection is still down! In the meantime, the transaction actually succeeded on the PG side. For the Go code it is impossible to know whether the transaction failed or succeeded, so it assumes an error, even if the transaction actually went through. This is not solvable from a code perspective in my experience and is just one of the things that can happen in distributed systems. There's actually an issue for that in lib/pq: lib/pq#871

As we're moving to gobuffalo/pop we'll also move to jackc/pgx, but that library is not able to deal with these types of scenarios either.
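To make the failure mode concrete, here is a minimal sketch of the same shape of problem in TypeScript with node-postgres. This is purely illustrative - it is not Hydra's actual Go code, and the table and column names are made up:

```typescript
// Sketch only: if the connection dies while COMMIT is in flight, the client
// gets an error but has no way to know whether the server committed the
// transaction before the connection was lost.
import { Client } from "pg";

async function rotateRefreshToken(client: Client, oldSig: string, newSig: string) {
  await client.query("BEGIN");
  try {
    // Placeholder table/column names, not Hydra's actual schema.
    await client.query("DELETE FROM refresh_tokens WHERE signature = $1", [oldSig]);
    await client.query("INSERT INTO refresh_tokens (signature) VALUES ($1)", [newSig]);
    await client.query("COMMIT"); // <- the connection can drop right here
  } catch (err) {
    // ROLLBACK over a dead connection fails as well; and if the COMMIT did
    // reach the server, the old token is already replaced while the caller
    // only sees an error and never hands the new token to the user.
    await client.query("ROLLBACK").catch(() => {});
    throw err;
  }
}
```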
If you have lots of error messages and broken flows like these, I would recommend trying to reduce the flakiness of the network, either by co-locating ORY Hydra with your SQL database or by doing other things that help reduce error rates and latency between the two.
We already have the DB co-located with Hydra. At the moment the refresh token flow break appears when there are problems on the DB side. We use Stolon in our Postgres setup, so maybe there are some peculiarities there. While I understand your point that the described problem is unlikely, we experience it quite often during short DB downtimes.

Of course clients do support re-auth, but this breaks the end user's flow. For example, for an automation system like Zapier (for which we have an integration), re-auth means the automation stops working until the user does some manual steps. Zapier does notify the end user about the broken flow by email, but who reads emails nowadays... The situation is similar for Slack bots, etc.

The main problem is that there is no clear workaround. The only one I see is greatly increasing the lifespan of the access_token, which is of course insecure. Just to note, there are already similar proposals in the Hydra repo issues, like this one. That would definitely solve our issue. I've looked through another OAuth server, IdentityServer; it also supports a similar feature.
Ok, thank you for the detailed response. Given that there is nothing we can really do right now to address the exact issue (PG-related network problems), and that #1831 will solve this for you, I think it's ok to close this and count it as an additional use case / upvote for #1831.
Describe the bug
Sometimes I start receiving "The refresh token has not been found: : not_found" during the refresh token flow. After that I can no longer refresh the token, and the end user has to go through the OAuth flow again.
Reproducing the bug
I didn't spend enough time to reproduce it on a local setup; the first attempts failed. Even on the production setup it does not reproduce every time, so I'll describe the steps on the production setup instead.
1.) I have a client that infinitely refreshes the OAuth token. It is a Node.js Express app that handles the OAuth flow and then starts the infinite token refresh (an illustrative sketch is shown after these steps):
2.) In my Kubernetes cluster I have Hydra version v1.4.5-alpine. I use the Helm chart of this version, but I believe that is not important. I have replicaCount set to 3 and use Postgres as the database.
3.) At some point during the day my Postgres DB experiences downtime, which is related to the current cloud provider.
4.) After that, sometimes my client cannot refresh the token any more. Here is part of the Hydra logs:
I removed log records with similar content to keep the posted logs short.
After 09:08:54, every attempt to refresh the token gets the same not_found error.
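As referenced in step 1, here is an illustrative sketch of such a client. It is not the exact snippet; it assumes Node 18+ (global fetch), Express, and placeholder Hydra endpoints and client credentials:

```typescript
// Sketch only: exchange the authorization code, then refresh the token forever.
import express from "express";

const TOKEN_URL = "https://hydra.example.com/oauth2/token"; // placeholder
const CLIENT_ID = "my-client";      // placeholder
const CLIENT_SECRET = "my-secret";  // placeholder
const REDIRECT_URI = "http://localhost:3000/callback";

async function token(body: Record<string, string>) {
  const res = await fetch(TOKEN_URL, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ ...body, client_id: CLIENT_ID, client_secret: CLIENT_SECRET }),
  });
  if (!res.ok) throw new Error(`token request failed: ${res.status} ${await res.text()}`);
  return res.json() as Promise<{ access_token: string; refresh_token: string }>;
}

const app = express();

app.get("/callback", async (req, res) => {
  // Exchange the authorization code, then refresh the token in an endless loop.
  let tokens = await token({
    grant_type: "authorization_code",
    code: String(req.query.code),
    redirect_uri: REDIRECT_URI,
  });
  res.send("authorized, starting refresh loop");

  for (;;) {
    await new Promise((r) => setTimeout(r, 30_000)); // refresh every 30 seconds
    try {
      tokens = await token({ grant_type: "refresh_token", refresh_token: tokens.refresh_token });
      console.log("refreshed, new refresh_token:", tokens.refresh_token);
    } catch (err) {
      // This is where "The refresh token has not been found: : not_found" shows up.
      console.error("refresh failed:", err);
    }
  }
});

app.listen(3000);
```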
Server configuration
Expected behavior
The refresh flow does not break.
Environment
I believe it is described in the Helm chart.
Additional context
I was able to reproduce the issue on an older version of Hydra locally, but didn't find precise steps. What I tried was to turn the DB on/off and restart the Hydra container while executing the infinite token refresh. After upgrading to the latest Hydra version I was not able to reproduce the issue and hoped everything would go well, but it didn't.
I certainly understand that the described details are not precise and may not be enough; I will probably try to find exact steps. My only alternative is to move to some other OAuth server, which is of course time consuming.
Maybe someone can suggest a workaround for this issue. At the moment I have several client apps, like Zapier and Slack integrations, and clients are not happy about occasionally losing auth info.