
Commands fail to be popped from stack #1564

Closed
jeisinge opened this issue Dec 21, 2020 · 9 comments
Labels
status: waiting-for-feedback We need additional information before we can continue type: bug A general bug

Comments

@jeisinge

jeisinge commented Dec 21, 2020

Bug Report

We have started seeing issues with Lettuce hanging and not returning any results. We believe this might be caused by a race condition in canceling a request. We can reproduce manually every time; however, we are having trouble creating a script to reproduce.

We have tested with Spring Boot 3.7.6, 4.0.0, and 4.0.1. The app is connected to an AWS ElastiCache Redis cluster with one replica.

The other change on our side was an increase in value size from 0.5 KB to 2 KB and a codec deserialization change.

Current Behavior

Commands are not processed, or responses get mixed up. On a web request to Spring that accesses Redis, we issue a simple `mget` or `get`. Most of the time this works. However, for some values, some of the time, the request errors out with a timeout. After this, the application enters a mostly-failed state, with **Redis commands timing out or returning the wrong responses.** We note the following:

1. Before, we were seeing logs of `Stack contains: 1 commands`. Now, this number becomes `Stack contains: 2 commands` or higher.
2. If we send another request while the first request is "stuck," the first request completes, but the second request is now "stuck." The first request receives another command's response. For example, when working with two keys, the responses for the two keys sometimes get mixed up. It is as if the new request pops out an older response.

Our testing procedure is to load up the app and hit it with a web browser. When we reload two tabs at once a couple of times, we enter this error state.

On the first error, we are sometimes able to capture a Mono error of:

java.util.NoSuchElementException: null
    at java.base/java.util.Arrays$ArrayItr.next(Arrays.java:4431)
    at io.lettuce.core.output.KeyValueListOutput.set(KeyValueListOutput.java:80)
    at io.lettuce.core.protocol.RedisStateMachine.safeSet(RedisStateMachine.java:810)
    at io.lettuce.core.protocol.RedisStateMachine.handleBytes(RedisStateMachine.java:572)
    at io.lettuce.core.protocol.RedisStateMachine$State$Type.handle(RedisStateMachine.java:206)
    at io.lettuce.core.protocol.RedisStateMachine.doDecode(RedisStateMachine.java:334)
    at io.lettuce.core.protocol.RedisStateMachine.decode(RedisStateMachine.java:295)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:799)
    at io.lettuce.core.protocol.CommandHandler.decode0(CommandHandler.java:750)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:724)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:618)
    at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:560)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)

Input Code

API implementation of:

  @GetMapping(path = "/test")
  public Mono<String> test(@RequestParam(name = "key") List<String> keys) {
    return cacheOps.opsForValue().multiGet(keys)
        .doOnEach(s -> logger.info("Encoding signal: {}", s))
        .map(t -> t.toString())
        .timeout(Duration.ofSeconds(3))
        .onErrorResume(e -> {
          logger.error("JT 100", e);
          return Mono.just("JT 100");
        });
  }

Serialization code of:

  public static TwoValues toTwoValues(ByteBuffer byteBuffer) {
    TwoValues twoValues = null;
    try {
      if (byteBuffer.hasRemaining()) {
        var floatBuffer = byteBuffer.asFloatBuffer();
        var numFloatsInValues = 256;            
        var values1 = new float[numFloatsInValues];
        var values2 = new float[numFloatsInValues];
        floatBuffer.get(values1, 0, numFloatsInValues);
        floatBuffer.get(values2, 0, numFloatsInValues);
        twoValues = new TwoValues(Floats.asList(values1), Floats.asList(values2));
      }
      else {
        log.error("Returning null for the decode because no more remaining in the buffer!");
      }      
    } catch (Throwable t) {
      log.error("Unable to create two values from {}", byteBuffer, t);
    }
    return twoValues;
  }
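For context, a minimal stdlib-only sketch (hypothetical demo class, not part of the reported code) of the buffer arithmetic the decoder above relies on: two arrays of 256 floats each consume 2 × 256 × 4 = 2048 bytes, which matches the 2 KB value size mentioned earlier. It also shows that reading through the `asFloatBuffer()` view never advances the backing `ByteBuffer`'s position, so the bytes are never "marked consumed" on the original buffer:

```java
import java.nio.ByteBuffer;

public class BufferMathDemo {
    public static void main(String[] args) {
        int numFloatsInValues = 256;
        // 2 arrays * 256 floats * 4 bytes per float = 2048 bytes (2 KB)
        ByteBuffer byteBuffer = ByteBuffer.allocate(2 * numFloatsInValues * 4);
        for (int i = 0; i < 2 * numFloatsInValues; i++) {
            byteBuffer.putFloat(i * 0.5f);
        }
        byteBuffer.flip();

        var floatBuffer = byteBuffer.asFloatBuffer();
        var values1 = new float[numFloatsInValues];
        var values2 = new float[numFloatsInValues];
        floatBuffer.get(values1, 0, numFloatsInValues);
        floatBuffer.get(values2, 0, numFloatsInValues);

        // The float view tracks its own position; the backing buffer's
        // position is unchanged, so byteBuffer still "has remaining".
        System.out.println(byteBuffer.position());  // 0
        System.out.println(byteBuffer.hasRemaining());  // true
    }
}
```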

Expected behavior/code

We expect all requests to act independently of one another. A request should never see another request's response, and a request should not require another request to complete. If our user code violates a contract, we should see an error message with details on why.

Environment

  • Lettuce version(s): 6.0.1
  • Redis version: 6.0.5

Possible Solution

From the debug messages and the stack trace, it appears that a previous command which fails to complete, for whatever reason, is not removed from the stack. Lettuce needs to clean up better on failures.

Additional context

I have a feeling this is caused by the increased value size we have in Redis. My guess is that either we or Lettuce is not handling the ByteBuffers correctly. Maybe we need to mark the buffer as consumed, or copy the data out more explicitly? Or maybe Lettuce needs to ensure that the ByteBuffer is not reused?

Since others are not seeing this, I reckon it has to do with our code and/or the contract between our code and Lettuce.

@jeisinge jeisinge added the type: bug A general bug label Dec 21, 2020
@mp911de
Collaborator

mp911de commented Dec 21, 2020

Are you able to capture a debug log?

@jeisinge
Author

jeisinge commented Dec 21, 2020

Thanks for the very fast response! Yes, I meant to post the debug log. Here is a snippet from when the error is happening. If you would like more detail or a different spot, just let me know!

2020-12-21 16:46:28,642 [XNIO-1 task-1] DEBUG io.lettuce.core.RedisChannelHandler - dispatching command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,642 [XNIO-1 task-1] DEBUG i.l.c.m.UpstreamReplicaConnectionProvider - getConnectionAsync(READ)
2020-12-21 16:46:28,642 [XNIO-1 task-2] DEBUG io.lettuce.core.RedisChannelHandler - dispatching command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,642 [XNIO-1 task-2] DEBUG i.l.c.m.UpstreamReplicaConnectionProvider - getConnectionAsync(READ)
2020-12-21 16:46:28,643 [XNIO-1 task-1] DEBUG io.lettuce.core.RedisChannelHandler - dispatching command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,643 [XNIO-1 task-2] DEBUG io.lettuce.core.RedisChannelHandler - dispatching command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,643 [XNIO-1 task-1] DEBUG i.l.core.protocol.DefaultEndpoint - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, epid=0x7] write() writeAndFlush command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,643 [XNIO-1 task-2] DEBUG i.l.core.protocol.DefaultEndpoint - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, epid=0x7] write() writeAndFlush command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,643 [XNIO-1 task-1] DEBUG i.l.core.protocol.DefaultEndpoint - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, epid=0x7] write() done
2020-12-21 16:46:28,643 [XNIO-1 task-2] DEBUG i.l.core.protocol.DefaultEndpoint - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, epid=0x7] write() done
2020-12-21 16:46:28,643 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] write(ctx, SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], promise)
2020-12-21 16:46:28,643 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandEncoder - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379] writing command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,646 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] write(ctx, SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], promise)
2020-12-21 16:46:28,646 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandEncoder - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379] writing command SubscriptionCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2020-12-21 16:46:28,741 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] Received: 2057 bytes, 4 commands in the stack
2020-12-21 16:46:28,741 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] Stack contains: 4 commands
2020-12-21 16:46:28,741 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.RedisStateMachine - Decode done, empty stack: true
2020-12-21 16:46:28,741 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] Completing command LatencyMeteredCommand [type=GET, output=ValueOutput [output=java.nio.HeapByteBuffer[pos=0 lim=2048 cap=2048], error='null'], commandType=io.lettuce.core.RedisPublisher$SubscriptionCommand]
2020-12-21 16:46:28,745 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] Received: 2057 bytes, 3 commands in the stack
2020-12-21 16:46:28,746 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] Stack contains: 3 commands
2020-12-21 16:46:28,746 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.RedisStateMachine - Decode done, empty stack: true
2020-12-21 16:46:28,746 [lettuce-epollEventLoop-11-7] DEBUG i.l.core.protocol.CommandHandler - [channel=0xa017a14e, /10.10.10.152:37900 -> host-2.account.0001.use1.cache.amazonaws.com/172.23.24.33:6379, chid=0x7] Completing command LatencyMeteredCommand [type=GET, output=ValueOutput [output=java.nio.HeapByteBuffer[pos=0 lim=2048 cap=2048], error='null'], commandType=io.lettuce.core.RedisPublisher$SubscriptionCommand]
2020-12-21 16:46:31,643 [parallel-8] ERROR c.t.sp.Controller - JT 100
java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 3000ms in 'map' (and no fallback has been configured)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:294)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:279)
	at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:418)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:119)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-12-21 16:46:31,645 [parallel-8] ERROR c.t.sp.Controller - JT 100
java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 3000ms in 'map' (and no fallback has been configured)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:294)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:279)
	at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:418)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:119)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Also, I just attempted to remove the deserializer and instead work off of `byte[]` arrays. Same issue:

return new ReactiveRedisTemplate<>(factory, RedisSerializationContext.byteArray());

@jeisinge
Author

FYI - I just tested removing Spring's facade:

      val client = RedisClient.create();
      val nodes = Arrays.asList(
          RedisURI.create("redis://host.account.ng.0001.use1.cache.amazonaws.com:6379"),
          RedisURI.create("redis://host-1.account.0001.use1.cache.amazonaws.com:6379"),
          RedisURI.create("redis://host-2.account.0001.use1.cache.amazonaws.com:6379"));
      val connection = MasterReplica.connect(client, ByteArrayCodec.INSTANCE, nodes);
      connection.setReadFrom(ReadFrom.UPSTREAM_PREFERRED);
      instance = connection.reactive();

Same issue. It is not Spring.

@jeisinge
Author

I believe the issue might be the value we are storing. In particular, if the first byte of the value is 62 signed / ASCII `>`, the issue happens. Specifically, the following value causes the breakage:

set jt:103 ">\x98\xf6u\xbf1uN>\xe6\x14\xcb>Mq\xc9>\xd0bG>\xe6\x8a\xc1=\xfc\xe1}\xbe\x92\xaa\x85\xbe\x86,\a\xbd\xbf\xc9\x9e?\t\xc4H\xbeu\x85\x16>\xb4\xa9\n\xbd\xb6\xd2\xa7\xbe\x00\x1a6\xbe9\xaf\xfd>\x81\xd3\x8a>\xffb\x14\xbe\xe8\xb1\x91\xbe\xc8 8\xbe\xfb\xcbS=\x06\xc8\xa5\xbe\xa62\x92\xbf\n\x15\xf2\xbeW\x1e\xdf\xbf\t\xacL>\xeap\x83\xbf_\xfb\xde\xbf\b\xb3\xc6=\xe5\xab\x9c\xbe\xdc!8=\xe4cG"

I tried with shorter values and couldn't reproduce. Swapping the `>` out for an `A` fixes the issue. Tomorrow, I'll attempt to reduce the size of the value further while still reproducing.

Any thoughts on why `>` might be sometimes-special?

@jeisinge
Author

The smallest value that I could get to reproduce is:

set jt:104 ">\x98\xf6u"

or bytes 62, -104, -10, 117

As I made the value smaller and smaller, it became harder and harder to reproduce. This seems to imply a timing issue that is exacerbated by my internet and VPN connection.
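A small stdlib-only demo (hypothetical class, not from the thread) confirming the byte arithmetic above: signed byte 62 is ASCII `>`, and the four bytes 62, -104, -10, 117 are exactly `>\x98\xf6u` — a plausible big-endian IEEE-754 float, consistent with the float arrays being stored:

```java
import java.nio.ByteBuffer;

public class SmallestReproDemo {
    public static void main(String[] args) {
        // The smallest failing value, as signed Java bytes.
        byte[] bytes = {62, (byte) -104, (byte) -10, 117};

        // Signed 62 is the ASCII code point for '>'.
        System.out.println((char) bytes[0]);  // >

        // Interpreted big-endian, the four bytes form a small positive float,
        // i.e. ordinary serialized-float data, not protocol text.
        System.out.println(ByteBuffer.wrap(bytes).getFloat());
    }
}
```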

@mp911de
Collaborator

mp911de commented Dec 22, 2020

Thanks for digging into the issue. Indeed, a lingering `>` causes that problem. We fixed the issue with #1512, but that version isn't released yet. Can you retest against the latest 6.0.2 snapshots?
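For context (my summary, not stated in the thread): in the RESP protocol the first byte of each reply identifies its type, and in RESP3 the `>` byte introduces an out-of-band push message. So if unconsumed payload bytes beginning with `>` are ever re-examined as the start of a frame, they can be mistaken for a new push message rather than value data. A hypothetical stdlib-only classifier of the common marker bytes (not Lettuce code) illustrates why `>` is special while `A` is harmless:

```java
public class RespMarkerDemo {
    // Classify the first byte of a RESP reply; '>' marks a RESP3 push
    // message, which is why a value *starting* with '>' is dangerous to
    // misinterpret as the start of a protocol frame.
    static String classify(byte first) {
        switch (first) {
            case '+': return "simple string";
            case '-': return "error";
            case ':': return "integer";
            case '$': return "bulk string";
            case '*': return "array";
            case '>': return "push message (RESP3)";
            default:  return "not a known marker";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify((byte) '>'));  // push message (RESP3)
        System.out.println(classify((byte) 'A'));  // not a known marker
    }
}
```

This matches the observation above that swapping the leading `>` for an `A` makes the problem disappear.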

@mp911de mp911de added the status: waiting-for-feedback We need additional information before we can continue label Dec 22, 2020
@jeisinge
Author

You are the best. Testing now, thanks!

@mp911de
Collaborator

mp911de commented Dec 22, 2020

You can grab the artifact from oss.sonatype.org (see https://github.com/lettuce-io/lettuce-core#binariesdownload)

@jeisinge
Author

@mp911de, you are great. Thank you for the quick response and the pointers! I tested locally and can no longer reproduce.

Thanks again!
