[chart/redis-ha][BUG] #52
Comments
@krakazyabra are you connecting to redis via ha-proxy? Or straight to one of the redis pods? What chart version are you using?
Noticed that I cannot connect from one redis pod to another:
maybe because of
Zero output, but at the same time
Hello. I'm connecting via haproxy.
Chart version is 4.6.2.
@krakazyabra that's interesting. To me it seems that you are facing several problems here. One is frequent redis failover, which causes election of a new master. The connectivity issues, gateway timeouts, the "redis server went away" log messages, and the amount of traffic should all be inter-related. Another is Nextcloud being unable to gracefully deal with redis being temporarily unavailable (during failover, for example). This probably causes the pages without CSS and the need to flush the cache after connectivity between Nextcloud and redis is restored. I believe good questions to ask are: does redis failover work as expected in the staging environment? What happens if you deliberately kill the master in staging? Can a new master be successfully elected? What causes the connectivity problems or failover in the production environment? Do production redis pods get re-scheduled by kubernetes to other nodes?
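For example, a small probe like the sketch below (assuming the redis-py package, the commonly used master group name `mymaster`, and a sentinel endpoint reachable on port 26379; all names are examples to adapt) could be left running while you kill the master in staging to see whether and how quickly a new master is elected:

```python
# failover_watch.py - rough sketch: poll sentinel and log master changes.
# Assumes redis-py and a sentinel endpoint reachable from where this runs
# (e.g. via `kubectl port-forward`); host and group names are examples only.
import time
from redis.sentinel import Sentinel

sentinel = Sentinel([("prod-redis-ha", 26379)], socket_timeout=1)

last = None
while True:
    try:
        master = sentinel.discover_master("mymaster")  # (ip, port) of current master
    except Exception as exc:
        master = None
        print(f"{time.strftime('%H:%M:%S')} sentinel unreachable: {exc}")
    if master != last:
        print(f"{time.strftime('%H:%M:%S')} master is now {master}")
        last = master
    time.sleep(1)
```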
Thanks for your advice. I killed the master pod.
For 32 seconds redis was not able to choose a new master. Is that ok? And redis-0 became master again (even though I had deleted it).
I don't know. Maybe node overload, maybe OOM, maybe k8s reschedules pods to free nodes.
@yariksheptykin Failover should be almost immediate assuming your quorum is healthy and the two remaining slaves are able to determine that the master is down (the above shows they are). I should also check: what state is the container itself in during these 32 seconds? Does it no longer exist? As I think the announce service will rely on the pod physically not existing / the process having completely terminated to be considered in a state of "changeover". For example, if you actually just remove the selector / affinity for the existing master, how quickly does it fail over then? One thing I should ask: is hairpin mode enabled on your CNI? I've seen a lot of known issues where people don't have hairpinning, so the announce services fail to resolve from their own pod.
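If you want to check the hairpin / announce-service behaviour directly, a rough sketch like the following attempts a plain TCP connection to the pod's own announce service. It would need to run from inside a redis pod (e.g. via `kubectl exec`) and only works if Python is available there; the service name is just an example to adapt to your release:

```python
# hairpin_check.py - rough sketch: verify a redis pod can reach itself
# through its own announce service (this requires hairpin mode on the CNI).
import socket
import sys

# Placeholder announce service name; pass the real one as an argument.
announce = sys.argv[1] if len(sys.argv) > 1 else "prod-redis-ha-announce-0"

try:
    with socket.create_connection((announce, 6379), timeout=2):
        print(f"ok: reached {announce}:6379")
except OSError as exc:
    print(f"failed to reach {announce}:6379: {exc}")
```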
Hi, @DandyDeveloper. Thanks for your message.
Which container? The master is restarting (Terminating, Init, ContainerCreating). The slaves are always Running.
I'll try this during the day.
Yes, I'm using hairpin mode on the service. I enabled it when I could not connect to a pod through its own service.
Switching was instant. Now it's time to test on prod :) Probably the previous time I deleted 2 pods out of 3, so there was no quorum, and that's why a pod couldn't become master.
So, I performed the same test on the production redis. The result was different: it took more time to elect a master (~20 sec), but after the new master was elected, the application couldn't work with it. In Nextcloud's log I saw this message:
If you need it, here is the full trace from the Nextcloud log.
In the Nextcloud config I'm using the prod-redis-ha-haproxy host.
Outage happened again. redis-0
redis-1
redis-2
FLUSHALL + FLUSHDB fixed the problem.
@krakazyabra so it seems that the failover does work fine. According to @DandyDeveloper it takes somewhat longer than usual (20 sec on production). Nevertheless, the sentinels eventually elect a new master and the redis "cluster" restores its normal operation. To me it all sounds like the redis-ha chart does its job. Still, for some reason Nextcloud cannot handle redis failover.

You mention that flushing redis contents restores Nextcloud's operation. Does failover corrupt redis contents? This might occur if two masters exist simultaneously, unaware of each other due to a network partition. If Nextcloud writes into the stale master during failover and then switches to the newly elected master, the data that went into the old master gets lost. But that seems to happen also when you manually delete the master pod, in which case there is no stale master to write to. Still Nextcloud breaks. You mention that your simple script can successfully communicate with redis after the breakage whereas Nextcloud cannot. Thus it seems like Nextcloud is reading something from redis that causes it to break. Why do you believe the bug is in the chart, and not in Nextcloud?

I don't think that you can solve this issue without being able to reproduce it in a staging environment where you can study the breakage safely. Meanwhile on production, as a workaround, you could check Nextcloud's health periodically and flush the redis cache if you detect an outage. Or, if that is not an option, take the next outage as an opportunity to investigate the problem as it occurs, before you flush the cache manually.
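A very rough sketch of that kind of workaround might look like the following; it assumes the redis-py package, and the status URL, host name, and polling interval are placeholders to adapt:

```python
# nc_redis_watchdog.py - rough sketch of the workaround described above:
# periodically check Nextcloud health and flush the redis cache on failure.
import time
import urllib.request

import redis

NEXTCLOUD_STATUS_URL = "https://nextcloud.example.com/status.php"  # placeholder
REDIS_HOST = "prod-redis-ha-haproxy"                               # placeholder

def nextcloud_healthy() -> bool:
    try:
        with urllib.request.urlopen(NEXTCLOUD_STATUS_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

while True:
    if not nextcloud_healthy():
        print(f"{time.strftime('%H:%M:%S')} Nextcloud unhealthy, flushing redis")
        redis.Redis(host=REDIS_HOST, port=6379).flushall()
    time.sleep(60)
```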
Hi @yariksheptykin. I can exclude network problems, because we use a lot of redis-ha "clusters" from this chart with other applications, and there are no errors. But one thing haunts me: the outage can also happen if I use a single replica without haproxy and sentinel, so the master cannot go away in that case. I'll wait for answers from the Nextcloud team then. Anyway, thank you for your help.
@yariksheptykin @krakazyabra Worth noting, I'm saying it can take upwards of 20 seconds depending on how your cluster responds to the announce services declaring that the container hosting the master is offline. Typically, the "quick" failovers happen in scenarios like corruption in the db or node networking completely dying, i.e. things that result in an immediate notice that the master is offline. Terminating a container is not the same, as it's a graceful shutdown, which means the master is still responsive until the Redis process terminates on the pod itself. So there's a potential delay in failover until the master reports that it's well and truly down.

That being said, I completely agree with @yariksheptykin on having some more firm reproducible steps on a minikube environment or something similar. I'm not doubting improvements can be made here, but issues like this are much harder to tie down.

One workaround I could advise: you could add an automated remediation step to ungracefully force-kill the pod, which should help the failover happen faster, but could have other negative effects in bringing that original master back online.
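A rough sketch of such a remediation step, assuming the official kubernetes Python client; the pod name and namespace are placeholders for the current master in your release:

```python
# force_kill_master.py - rough sketch of the "ungraceful kill" remediation
# mentioned above: delete the master pod with no grace period.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

v1 = client.CoreV1Api()
v1.delete_namespaced_pod(
    name="prod-redis-ha-server-0",  # placeholder: the current master pod
    namespace="default",            # placeholder
    grace_period_seconds=0,         # skip the graceful shutdown period
)
print("sent force delete")
```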
I'm closing this as it's been a while since I've heard from the reporter and Nextcloud appeared to be the root cause. Let me know if you did find a fix!
Describe the bug
Hello. I have a very strange bug and cannot locate it. We're using Nextcloud, which is connected to redis-ha.
Several times a week Nextcloud goes down (504 gateway timeout, or the page loads but without CSS). After a lot of research, we figured out that the problem is somewhere on redis's side.
After I saw the 504 error or a page without CSS, I performed a test: I wrote a simple script which connects to the current Redis and creates, puts, and gets a key-value pair. The script is on the same machine as Nextcloud. Nextcloud cannot connect to Redis, but the script can.
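A minimal sketch of this kind of test script (not the reporter's actual script; it assumes the redis-py package and the prod-redis-ha-haproxy service name mentioned earlier in the thread):

```python
# redis_probe.py - minimal connectivity probe: write and read back a key
# through the haproxy service while Nextcloud reports redis as unreachable.
import time

import redis

r = redis.Redis(host="prod-redis-ha-haproxy", port=6379, socket_timeout=2)

key = f"probe:{int(time.time())}"
r.set(key, "ok", ex=60)   # create/put a short-lived key
value = r.get(key)        # get it back
print(f"wrote and read {key!r} -> {value!r}")
```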
In the Nextcloud log there is only one message: "redis server went away". In the redis log there is nothing at all.
There is no logic to when the problem occurs: it can happen at night, when there are no online users, or during the day, when there are 40-50 online users.
For now, the only solution I found is to connect to redis and execute the FLUSHALL command. After this operation Nextcloud starts to work again.
At the same time we're using the same configuration in the staging environment, but with no active users, and there is no such problem there. So it is somehow connected with load. I could not reproduce this problem anywhere else, so, unfortunately, I cannot provide steps to reproduce it.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Working without outages