
[chart/redis-ha][BUG] #52

Closed
krakazyabra opened this issue Aug 21, 2020 · 14 comments

Labels: bug Something isn't working

Comments

@krakazyabra

Describe the bug
Hello. I have a very strange bug and cannot locate it. We're using Nextcloud, which is connected to redis-ha.
Several times a week Nextcloud goes down (504 gateway timeout, or the page loads but without CSS). After a lot of research we figured out that the problem is somewhere on Redis's side.

After I saw the 504 error or the page without CSS, I performed a test: I wrote a simple script that connects to the current Redis, creates, puts and gets a key-value pair. The script runs on the same machine as Nextcloud. Nextcloud cannot connect to Redis, but the script can.
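For illustration, a minimal round-trip check of that kind could look like the sketch below (this is not the exact script; redis-cli, the test key name, and the <redis-host> placeholder for whatever endpoint the application uses are assumptions):

# set, read back and remove a throwaway key to prove basic connectivity
redis-cli -h <redis-host> -p 6379 SET probe:connectivity ok
redis-cli -h <redis-host> -p 6379 GET probe:connectivity
redis-cli -h <redis-host> -p 6379 DEL probe:connectivity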

In the Nextcloud log there is only one message: "redis server went away". In the Redis log there is nothing at all.

There is no pattern to when the problem occurs: it can happen at night, when there are no online users, or during the day, when there are 40-50 online users.

For now, the only fix I have found is to connect to Redis and execute the FLUSHALL command. After this operation Nextcloud starts to work again.
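The workaround is essentially the following (a sketch, assuming redis-cli can reach the same endpoint the application uses; note that FLUSHALL wipes every key in the instance, not just Nextcloud's cache entries):

# drop all keys so Nextcloud rebuilds its cache from scratch
redis-cli -h <redis-host> -p 6379 FLUSHALL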

At the same time we're using the same configuration in a staging environment, but with no active users, and there the problem does not occur. So it is somehow connected with load. I could not reproduce the problem anywhere else, so unfortunately I cannot provide exact steps to reproduce it.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy chart
  2. Connect application to it

Expected behavior
Working without outages

krakazyabra added the bug label Aug 21, 2020
@yariksheptykin

@krakazyabra are you connecting to redis via ha-proxy, or straight to one of the redis pods? What chart version are you using?

@krakazyabra

I noticed that I cannot connect from one Redis pod to another:

/data $ redis-cli -p 6379 -h prod-redis-ha-server-1
Could not connect to Redis at prod-redis-ha-server-1:6379: Name does not resolve
not connected> 

maybe because of:

root@nextcloud-0:/var/www/html# getent ahosts prod-redis-ha-server-0.prod-redis-ha-announce-0

Zero output, but at the same time:

root@nextcloud-0:/var/www/html# getent ahosts prod-redis-ha-server-0.prod-redis-ha
10.113.63.110   STREAM prod-redis-ha-server-0.prod-redis-ha.nextcloud.svc.cluster.local
10.113.63.110   DGRAM  
10.113.63.110   RAW    

@krakazyabra

krakazyabra commented Aug 21, 2020

@krakazyabra are you connecting to redis via ha-proxy, or straight to one of the redis pods? What chart version are you using?

Hello. I'm connecting via haproxy:

      'redis' =>
      array (
        'host' => 'prod-redis-ha-haproxy',
        'port' => '6379',
        'timeout' => 0.0,
      ),

Chart version is 4.6.2

@yariksheptykin

@krakazyabra that's interesting.

To me it seems that you are facing several problems here. One is frequent Redis failover, which causes election of a new master. The connectivity issues, the gateway timeouts, the "redis server went away" log messages, and the amount of traffic are probably all interrelated.

Another is Nextcloud being unable to gracefully deal with Redis being temporarily unavailable (during failover, for example). This probably causes the pages without CSS and the need to flush the cache after connectivity between Nextcloud and Redis is restored.

I believe the good questions to ask are: does Redis failover work as expected in the staging environment? What happens if you deliberately kill the master in staging? Can a new master be successfully elected? What causes the connectivity problems or failover in the production environment? Do production Redis pods get rescheduled by Kubernetes to other nodes?
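For example, a deliberate failover test in staging could look roughly like the following (a sketch; the pod and announce-service names follow the naming seen in this thread and may differ in your release):

# delete the pod that currently reports role:master
kubectl delete pod stage-redis-ha-server-0

# from another pod, watch how quickly the remaining replicas see a new master
redis-cli -h stage-redis-ha-announce-1 INFO replication
redis-cli -h stage-redis-ha-announce-2 INFO replication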

@krakazyabra

krakazyabra commented Aug 21, 2020

What happens if you deliberately kill the master in staging?

Thanks for your advice. I killed the master pod.
While the master pod was down, on the slaves I saw:

stage-redis-ha-announce-2:6379> INFO replication
# Replication
role:slave
master_host:10.105.28.146
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:533984
master_link_down_since_seconds:32
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:11a0f7f63526fdb89f601913be98421a31b3f4df
master_replid2:d309ed9e580077c3df3babbcc0a64178a6dc1097
master_repl_offset:533984
second_repl_offset:496807
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:867
repl_backlog_histlen:533118

Can the new master be successfully elected?

In 32 seconds Redis was not able to choose a new master. Is that ok? And redis-0 became master again (even though I had deleted it).

What causes the connectivity problems or failover in the production environment?

I don't know. Maybe node overload, maybe OOM, maybe k8s reschedules pods to freer nodes.
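One way to see what the sentinels themselves think during such a test (a sketch; it assumes the chart's default sentinel port 26379 and master group name mymaster, which may be overridden in your values):

# which address does this sentinel currently consider the master?
redis-cli -h stage-redis-ha-announce-0 -p 26379 SENTINEL get-master-addr-by-name mymaster
# full sentinel view of the master, including failover state and quorum
redis-cli -h stage-redis-ha-announce-0 -p 26379 SENTINEL master mymaster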

@DandyDeveloper

DandyDeveloper commented Aug 23, 2020

@yariksheptykin Failover should be almost immediate, assuming your quorum is healthy and the two remaining slaves are able to determine that:

a. the master is down (the output above shows they do), and
b. the slaves are healthy.

I should also check: what state is the container itself in during these 32 seconds? Does it no longer exist? I think the announce service relies on the pod physically not existing / the process having completely terminated before it is considered to be in a state of "changeover". For example, if you actually just remove the selector / affinity for the existing master, how quickly does it fail over then?

One thing I should ask: is hairpin mode enabled on your CNI? I've seen a lot of issues where people don't have hairpinning enabled, so the announce services fail to resolve from their own pod.
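A quick way to check this is to try reaching a pod through its own announce service from inside that same pod (a sketch; run it from a shell inside the pod, e.g. via kubectl exec):

# from a shell inside prod-redis-ha-server-0
redis-cli -h prod-redis-ha-announce-0 -p 6379 ping
# PONG means the pod can reach itself through its announce service (hairpinning works);
# a timeout or "Name does not resolve" points to a hairpin or DNS problem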

@krakazyabra

Hi, @DandyDeveloper. Thanks for your message.

What state is the container itself in during these 32 seconds?

Which container? The master is restarting (Terminating, Init, ContainerCreating). The slaves are always Running.

if you actually just remove the selector / affinity for the existing master, how quickly does it fail over then?

I'll do this during the day.

is hairpin mode enabled on your CNI?

Yes, I'm using hairpin in the service. I enabled it when I could not connect to a pod through its own service:

kind: Service
metadata:
  annotations:
    kube-router.io/service.hairpin: ""

@krakazyabra

krakazyabra commented Aug 24, 2020

if you actually just remove the selector / affinity for the existing master, how quickly does it fail over then?

Switching was instant. Now it's time to test on prod :)

Probably last time I deleted 2 pods out of 3, so there was no quorum, which is why a pod couldn't become master.

@krakazyabra

krakazyabra commented Aug 24, 2020

So, I performed the same test on the production Redis. The result was different. It took more time to elect a master (~20 sec), and after the new master was elected the application couldn't work with it. In Nextcloud's log I saw this message:

"message": {
  "Exception": "RedisException",
  "Message": "NOREPLICAS Not enough good replicas to write.",

If you need it, here is the full trace:

Nextcloud log
{
	"reqId": "45BO04TogwWjXaeqZ5Nm",
	"level": 3,
	"time": "2020-08-24T11:21:48+02:00",
	"remoteAddr": "46.28.104.66",
	"user": "domain-auth-dd2f1a98-72ff-42a8-b69e-d0c997fcf33e",
	"app": "index",
	"method": "GET",
	"url": "/apps/files/?dir=/home-nextcloud/PHOTO/Photos/%D0%91%D0%B5%D0%B7%D0%B5%D0%BD%D0%B3%D0%B8%20%D0%BE%D1%82%20%D0%A1%D0%B0%D1%88%D0%B8%2C%20%D0%B8%D1%8E%D0%BD%D1%8C%202018/100ANV01&fileid=5407188",
	"message": {
		"Exception": "RedisException",
		"Message": "NOREPLICAS Not enough good replicas to write.",
		"Code": 0,
		"Trace": [
			{
				"file": "/var/www/html/lib/private/Memcache/Redis.php",
				"line": 65,
				"function": "setex",
				"class": "Redis",
				"type": "->",
				"args": [
					"5adeb77b6d2a45ad965d6c2b4311e1e7/AutoloaderViewController",
					60,
					"[]"
				]
			},
			{
				"file": "/var/www/html/lib/autoloader.php",
				"line": 171,
				"function": "set",
				"class": "OC\\Memcache\\Redis",
				"type": "->",
				"args": [
					"ViewController",
					[],
					60
				]
			},
			{
				"function": "load",
				"class": "OC\\Autoloader",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"function": "spl_autoload_call",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/Utility/SimpleContainer.php",
				"line": 99,
				"function": "__construct",
				"class": "ReflectionClass",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/Utility/SimpleContainer.php",
				"line": 116,
				"function": "resolve",
				"class": "OC\\AppFramework\\Utility\\SimpleContainer",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/ServerContainer.php",
				"line": 149,
				"function": "query",
				"class": "OC\\AppFramework\\Utility\\SimpleContainer",
				"type": "->",
				"args": [
					"ViewController",
					true
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/DependencyInjection/DIContainer.php",
				"line": 388,
				"function": "query",
				"class": "OC\\ServerContainer",
				"type": "->",
				"args": [
					"ViewController",
					true
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/App.php",
				"line": 110,
				"function": "query",
				"class": "OC\\AppFramework\\DependencyInjection\\DIContainer",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/Routing/RouteActionHandler.php",
				"line": 47,
				"function": "main",
				"class": "OC\\AppFramework\\App",
				"type": "::",
				"args": [
					"ViewController",
					"index",
					{
						"__class__": "OC\\AppFramework\\DependencyInjection\\DIContainer"
					},
					{
						"_route": "files.view.index"
					}
				]
			},
			{
				"function": "__invoke",
				"class": "OC\\AppFramework\\Routing\\RouteActionHandler",
				"type": "->",
				"args": [
					{
						"_route": "files.view.index"
					}
				]
			},
			{
				"file": "/var/www/html/lib/private/Route/Router.php",
				"line": 297,
				"function": "call_user_func",
				"args": [
					{
						"__class__": "OC\\AppFramework\\Routing\\RouteActionHandler"
					},
					{
						"_route": "files.view.index"
					}
				]
			},
			{
				"file": "/var/www/html/lib/base.php",
				"line": 1007,
				"function": "match",
				"class": "OC\\Route\\Router",
				"type": "->",
				"args": [
					"/apps/files/"
				]
			},
			{
				"file": "/var/www/html/index.php",
				"line": 37,
				"function": "handleRequest",
				"class": "OC",
				"type": "::",
				"args": []
			}
		],
		"File": "/var/www/html/lib/private/Memcache/Redis.php",
		"Line": 65,
		"CustomMessage": "--"
	},
	"userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
	"version": "19.0.1.1"
}

In the Nextcloud config I'm using the prod-redis-ha-haproxy host:

  'redis' => 
  array (
    'host' => 'prod-redis-ha-haproxy',
    'port' => '6379',
    'timeout' => 0.0,
  ),
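For reference, NOREPLICAS is returned by the master when its min-replicas-to-write / min-replicas-max-lag protection decides it does not see enough healthy replicas. A sketch of how to inspect those settings through the same haproxy endpoint (the actual values depend on the chart's redis.conf):

# what the current master requires before accepting writes
redis-cli -h prod-redis-ha-haproxy CONFIG GET min-replicas-to-write
redis-cli -h prod-redis-ha-haproxy CONFIG GET min-replicas-max-lag
# how many replicas it currently counts as "good"
redis-cli -h prod-redis-ha-haproxy INFO replication | grep -E 'connected_slaves|min_slaves_good_slaves'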

@krakazyabra

krakazyabra commented Aug 24, 2020

The outage happened again.
The redis/sentinel/haproxy logs are empty (nothing unusual).
From the Nextcloud pod I can connect to Redis:

redis-0
root@nextcloud-0:/var/www/html# redis-cli -h prod-redis-ha-announce-0
prod-redis-ha-announce-0:6379> INFO replication
# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1332637369
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332637369
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331588794
repl_backlog_histlen:1048576


prod-redis-ha-announce-0:6379> INFO
# Server
redis_version:5.0.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c33a24c35f10df99
redis_mode:standalone
os:Linux 4.15.18-28-pve x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:7d15135b21e457ef0bbe0077dcdc25238faf1ffc
tcp_port:6379
uptime_in_seconds:324213
uptime_in_days:3
hz:10
configured_hz:10
lru_clock:4450371
executable:/data/redis-server
config_file:/data/conf/redis.conf

# Clients
connected_clients:8
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# Memory
used_memory:3030880
used_memory_human:2.89M
used_memory_rss:6508544
used_memory_rss_human:6.21M
used_memory_peak:5972704
used_memory_peak_human:5.70M
used_memory_peak_perc:50.75%
used_memory_overhead:2098660
used_memory_startup:791320
used_memory_dataset:932220
used_memory_dataset_perc:41.63%
allocator_allocated:3249752
allocator_active:3780608
allocator_resident:11415552
total_system_memory:67416342528
total_system_memory_human:62.79G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.16
allocator_frag_bytes:530856
allocator_rss_ratio:3.02
allocator_rss_bytes:7634944
rss_overhead_ratio:0.57
rss_overhead_bytes:-4907008
mem_fragmentation_ratio:2.18
mem_fragmentation_bytes:3519664
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:0
mem_clients_normal:168148
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1976
rdb_bgsave_in_progress:0
rdb_last_save_time:1598285145
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:581632
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:1442317
total_commands_processed:73777474
instantaneous_ops_per_sec:7
total_net_input_bytes:8463013966
total_net_output_bytes:111404463614
instantaneous_input_kbps:0.48
instantaneous_output_kbps:1.46
rejected_connections:0
sync_full:2
sync_partial_ok:3
sync_partial_err:2
expired_keys:92715
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:63706642
keyspace_misses:245179
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:201
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1332704850
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332704850
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331656275
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:4092.593556
used_cpu_user:1406.304003
used_cpu_sys_children:1.139776
used_cpu_user_children:4.543999

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1360,expires=485,avg_ttl=2400226
redis-1
root@nextcloud-0:/var/www/html# redis-cli -h prod-redis-ha-announce-1
prod-redis-ha-announce-1:6379> INFO replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=10.111.179.178,port=6379,state=online,offset=1332660983,lag=0
slave1:ip=10.98.125.94,port=6379,state=online,offset=1332660983,lag=0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:ee8596993dd6a4ea403df2728166bbd092be62c2
master_repl_offset:1332661125
second_repl_offset:1292024488
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331612550
repl_backlog_histlen:1048576


prod-redis-ha-announce-1:6379> INFO
# Server
redis_version:5.0.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c33a24c35f10df99
redis_mode:standalone
os:Linux 5.4.0-42-generic x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:088c4d4f1a2d405c5efbd2bc817cb90fc7ea4beb
tcp_port:6379
uptime_in_seconds:324300
uptime_in_days:3
hz:10
configured_hz:10
lru_clock:4450415
executable:/data/redis-server
config_file:/data/conf/redis.conf

# Clients
connected_clients:13
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# Memory
used_memory:3330240
used_memory_human:3.18M
used_memory_rss:6664192
used_memory_rss_human:6.36M
used_memory_peak:7263552
used_memory_peak_human:6.93M
used_memory_peak_perc:45.85%
used_memory_overhead:2365354
used_memory_startup:791320
used_memory_dataset:964886
used_memory_dataset_perc:38.00%
allocator_allocated:3520432
allocator_active:4083712
allocator_resident:11649024
total_system_memory:67312480256
total_system_memory_human:62.69G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.16
allocator_frag_bytes:563280
allocator_rss_ratio:2.85
allocator_rss_bytes:7565312
rss_overhead_ratio:0.57
rss_overhead_bytes:-4984832
mem_fragmentation_ratio:2.05
mem_fragmentation_bytes:3416888
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:99388
mem_clients_normal:318302
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1647
rdb_bgsave_in_progress:0
rdb_last_save_time:1598285145
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:577536
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:737061
total_commands_processed:29667369
instantaneous_ops_per_sec:355
total_net_input_bytes:3172946430
total_net_output_bytes:35185604100
instantaneous_input_kbps:41.23
instantaneous_output_kbps:539.16
rejected_connections:0
sync_full:1
sync_partial_ok:2
sync_partial_err:1
expired_keys:38949
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:21096558
keyspace_misses:127908
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:478
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=10.111.179.178,port=6379,state=online,offset=1332715787,lag=0
slave1:ip=10.98.125.94,port=6379,state=online,offset=1332715787,lag=0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:ee8596993dd6a4ea403df2728166bbd092be62c2
master_repl_offset:1332715787
second_repl_offset:1292024488
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331667212
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:2146.312761
used_cpu_user:908.084540
used_cpu_sys_children:1.189789
used_cpu_user_children:5.337125

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1372,expires=497,avg_ttl=1384779

redis-2
root@nextcloud-0:/var/www/html# redis-cli -h prod-redis-ha-announce-2
prod-redis-ha-announce-2:6379> INFO replication
# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:1332670864
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332670864
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331622289
repl_backlog_histlen:1048576


prod-redis-ha-announce-2:6379> INFO
# Server
redis_version:5.0.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c33a24c35f10df99
redis_mode:standalone
os:Linux 4.15.18-28-pve x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:5785cea7517aa3589b4525cc83809bb441f822ef
tcp_port:6379
uptime_in_seconds:324199
uptime_in_days:3
hz:10
configured_hz:10
lru_clock:4450297
executable:/data/redis-server
config_file:/data/conf/redis.conf

# Clients
connected_clients:7
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# Memory
used_memory:3011544
used_memory_human:2.87M
used_memory_rss:6574080
used_memory_rss_human:6.27M
used_memory_peak:4686520
used_memory_peak_human:4.47M
used_memory_peak_perc:64.26%
used_memory_overhead:2082442
used_memory_startup:791320
used_memory_dataset:929102
used_memory_dataset_perc:41.85%
allocator_allocated:3270800
allocator_active:3854336
allocator_resident:11304960
total_system_memory:67416309760
total_system_memory_human:62.79G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.18
allocator_frag_bytes:583536
allocator_rss_ratio:2.93
allocator_rss_bytes:7450624
rss_overhead_ratio:0.58
rss_overhead_bytes:-4730880
mem_fragmentation_ratio:2.18
mem_fragmentation_bytes:3563584
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:0
mem_clients_normal:151226
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1881
rdb_bgsave_in_progress:0
rdb_last_save_time:1598285145
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:577536
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:380762
total_commands_processed:7715420
instantaneous_ops_per_sec:7
total_net_input_bytes:571975929
total_net_output_bytes:799037066
instantaneous_input_kbps:0.43
instantaneous_output_kbps:1.28
rejected_connections:0
sync_full:3
sync_partial_ok:0
sync_partial_err:3
expired_keys:9
expired_stale_perc:0.29
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:4588
keyspace_misses:14
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:232
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1332685387
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332685387
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331636812
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:409.080339
used_cpu_user:312.498500
used_cpu_sys_children:1.044103
used_cpu_user_children:4.027608

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1371,expires=496,avg_ttl=1169059

FLUSHALL + FLUSHDB fixed the problem.

@yariksheptykin

@krakazyabra so it seems that the failover does work fine. According to @DandyDeveloper it takes somewhat longer than usual (20 sec in production). Nevertheless, the sentinels eventually elect a new master and the Redis "cluster" restores its normal operation. To me it all sounds like the redis-ha chart does its job.

Still, for some reason Nextcloud cannot handle a Redis failover. You mention that flushing the Redis contents restores Nextcloud's operation. Does failover corrupt the Redis contents? This might occur if two masters exist simultaneously, unaware of each other due to a network partition. If Nextcloud writes into the stale master during failover and then switches to the newly elected master, the data that went into the old master gets lost. But that also seems to happen when you manually delete the "master pod", in which case there is no stale master to write to; still Nextcloud breaks. You mention that your "simple script" can successfully communicate with Redis after the breakage whereas Nextcloud cannot. So it seems like Nextcloud is reading something from Redis that causes it to break.

Why do you believe the bug is in the chart, and not in Nextcloud?

I don't think you can solve this issue without being able to reproduce it in a staging environment where you can study the breakage safely. Meanwhile on production, as a workaround, you could check Nextcloud's health periodically and flush the Redis cache if you detect an outage. Or, if that is not an option, take the next outage as an opportunity to investigate the problem as it occurs, before you flush the cache manually.
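Such a periodic check could be as simple as the following (a sketch; the Nextcloud URL is a placeholder, /status.php is Nextcloud's standard status endpoint, and flushing on every failed check is deliberately blunt):

#!/bin/sh
# run from cron: if Nextcloud's status endpoint stops answering, flush the Redis cache
if ! curl -fsS --max-time 10 https://cloud.example.org/status.php > /dev/null; then
  redis-cli -h prod-redis-ha-haproxy -p 6379 FLUSHALL
fi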

@krakazyabra

Hi @yariksheptykin,
There is logic in your words :) Maybe the problem really is in Nextcloud.
I've already opened the same issue in Nextcloud's repo (nextcloud/server#22250), but there is no answer or reply yet. You were faster :)

I can exclude network problems, because we use a lot of redis-ha "clusters" from this chart with other applications, and there are no errors there.

But one thing haunts me: the outage can also happen when I use a single replica without haproxy and sentinel, so in that case the master cannot have gone away.

I'll wait for answers from the Nextcloud team then. Anyway, thank you for your help.

@DandyDeveloper

DandyDeveloper commented Aug 27, 2020

@yariksheptykin @krakazyabra Worth noting: I'm saying it can take upwards of 20 seconds depending on how your cluster responds to the announce services declaring that the container hosting the master is offline.

Typically, the "quick" failovers are triggered by things like corruption in the DB or node networking dying completely - things that result in an immediate notice that the master is offline.

Terminating a container is not the same, as it's a graceful shutdown, which means the master is still responsive until the Redis process terminates on the pod itself. So there is a potential delay in the failover until the master actually reports as down.

That being said, I completely agree with @yariksheptykin about getting firmer reproducible steps in a minikube environment or something similar. I'm not doubting that improvements can be made here, but issues like this are much harder to tie down.

One workaround I could advise is to add an automated remediation step that ungracefully force-kills the pod, which should help the failover happen faster, but it could have other negative effects when bringing that original master back online.
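For reference, an ungraceful kill of the master pod could look like this (a sketch; the pod name and namespace follow this thread, and skipping the grace period means Redis gets no chance to shut down cleanly):

# delete the current master pod immediately, without waiting for a graceful shutdown
kubectl -n nextcloud delete pod prod-redis-ha-server-0 --grace-period=0 --force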

@DandyDeveloper

I'm closing this, as it's been a while since I've heard from the reporter and Nextcloud appeared to be the root cause. Let me know if you did find a fix!
