Lettuce metrics creates lots of long arrays and gives an out-of-memory error. #1210

Closed
omjego opened this issue Jan 20, 2020 · 3 comments
Labels: type: bug (A general bug)
Milestone: 5.2.2

omjego commented Jan 20, 2020

Current Behavior

We are using Lettuce 5.0.4.RELEASE in our production environment to fetch data from Redis. While scaling out, we hit a very strange issue: the app ran out of memory under moderate traffic. I took a heap dump of the app and analyzed it, and found that Lettuce was allocating lots of AtomicLongArrays, and those objects accounted for almost 80% of the memory. When I forced a GC with the YourKit profiler, I could see a GC pause in the graphs, but no memory was freed. This happened three times; only when I forced GC for the fourth time did the JVM release most of its memory.
After walking the GC roots of those long arrays, I found that they are related to the command latency metrics in Lettuce: the arrays belonged to the ConcurrentHashMap allocated by the DefaultCommandLatencyCollector class.
I'm not sure what's happening. Is this expected behavior?

I'm attaching screenshots from YourKit.

[Screenshot: YourKit memory analysis, 2020-01-20 2:22 PM]

[Screenshot: YourKit memory analysis, 2020-01-20 2:41 PM]

Environment

  • Lettuce version(s): 5.0.4.RELEASE
omjego added the type: bug (A general bug) label Jan 20, 2020
mp911de (Collaborator) commented Jan 20, 2020

So basically, there are two issues:

  1. Memory reclaimed only after 4th full GC.
  2. Huge growth in LatencyStats objects along with their child objects.

We're using org.HdrHistogram and org.LatencyUtils to keep track of individual command latencies. Latency stats are keyed by source and target endpoints (IP and port) along with the command type (GET, SET, HMGET, …).

We have 194 commands. Given that there are ~ 72000 CommandLatencyIds, there could be the following causes:

  • Enabled local distinction in CommandLatencyCollectorOptions (disabled by default)
  • A huge number of Redis connections (over 300) that are or were active during the measurement window
  • A custom command type that implements ProtocolKeyword without implementing equals(…) and hashCode(). Looking at the Lettuce code, we have issues here when using dynamic Redis commands ("Redis Command Interfaces") with Redis modules or non-standard Redis commands (see the sketch right after this list).
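
To make the third point concrete, here is a minimal sketch of such a custom command type. The class name ModuleCommand and its fields are hypothetical, not taken from this issue; the point is that without equals()/hashCode() every instance becomes a distinct metrics key.

import java.nio.charset.StandardCharsets;

import io.lettuce.core.protocol.ProtocolKeyword;

// Hypothetical custom command type. Without equals()/hashCode(), every instance
// that ends up in a latency metrics key is distinct, so latency tracking objects
// grow with each command invocation instead of staying bounded.
public class ModuleCommand implements ProtocolKeyword {

    private final String name;

    public ModuleCommand(String name) {
        this.name = name;
    }

    @Override
    public byte[] getBytes() {
        return name.getBytes(StandardCharsets.US_ASCII);
    }

    @Override
    public String name() {
        return name;
    }

    // Name-based equality keeps the number of metrics keys bounded by the
    // number of distinct command names.
    @Override
    public boolean equals(Object o) {
        return o instanceof ModuleCommand && name.equals(((ModuleCommand) o).name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}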

Do you have a heap dump where you could inspect the various CommandLatencyId instances?

Given that the memory could be reclaimed after a couple of GC cycles, I'm not sure how to classify this issue. It could be that a metrics run released the latency objects and the fourth GC run was then actually able to evacuate them from memory.

Can you check whether any of the points mentioned above applies to your setup?

omjego (Author) commented Jan 21, 2020

Hi @mp911de, thank you for the quick response. Yes, I did inspect the CommandLatencyId objects, and most of them relate to the custom Redis module command that we use in our codebase, even though that command implements the Commands interface. Please refer to the following code.

import io.lettuce.core.dynamic.Commands;
import io.lettuce.core.dynamic.annotation.Command;

public interface CustomRedisCommand extends Commands
{
    @Command("ip2contry.getIp2country")
    String ipToCountry(String ip);
}

I checked the first two points as well. We don't provide CommandLatencyCollectorOptions when creating ClientResources, so local distinction stays at its default (disabled). And we're using a connection pool with max-active set to 60, so the number of connections won't exceed that.
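
For completeness, here's roughly how our ClientResources are created. This is a sketch rather than our exact production code, and it assumes the Lettuce 5.x builder API; spelling out the collector options only shows that localDistinction is off, which is also what you get when no options are passed at all.

import io.lettuce.core.RedisClient;
import io.lettuce.core.metrics.DefaultCommandLatencyCollectorOptions;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class ClientSetupSketch {

    public static void main(String[] args) {
        // Omitting commandLatencyCollectorOptions(...) entirely gives the same
        // defaults; it is shown here only to make the localDistinction setting visible.
        ClientResources resources = DefaultClientResources.builder()
                .commandLatencyCollectorOptions(DefaultCommandLatencyCollectorOptions.builder()
                        .localDistinction(false) // default; true would key metrics per local endpoint too
                        .build())
                .build();

        RedisClient client = RedisClient.create(resources, "redis://localhost");

        // ... use the client, then release resources on shutdown
        client.shutdown();
        resources.shutdown();
    }
}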

mp911de added this to the 5.2.2 milestone Jan 21, 2020
mp911de added a commit that referenced this issue Jan 21, 2020
…1210

We now use the String representation of a command for equality and hashCode computation.
mp911de (Collaborator) commented Jan 21, 2020

We updated CommandLatencyId to consider the String representation of a command name. This change leads to the accumulation of metrics that share the same command name instead of unbounded growth of latency tracking objects.
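
Roughly, the idea behind the change looks like the following simplified sketch (not the actual CommandLatencyId source): equality and hashCode of the metrics key are derived from the command's String name rather than from the ProtocolKeyword instance.

import java.net.SocketAddress;
import java.util.Objects;

// Simplified sketch of the idea behind the fix, not the actual Lettuce code:
// keying on the command's String name means dynamic command instances without
// equals()/hashCode() no longer create a new key per invocation.
final class LatencyKey {

    private final SocketAddress local;
    private final SocketAddress remote;
    private final String commandName; // e.g. derived from ProtocolKeyword.name()

    LatencyKey(SocketAddress local, SocketAddress remote, String commandName) {
        this.local = local;
        this.remote = remote;
        this.commandName = commandName;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LatencyKey)) {
            return false;
        }
        LatencyKey other = (LatencyKey) o;
        return local.equals(other.local) && remote.equals(other.remote)
                && commandName.equals(other.commandName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(local, remote, commandName);
    }
}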

A snapshot build is available that contains the fix. Care to give lettuce-core-5.2.2.BUILD-SNAPSHOT a spin to see whether it addresses your issue? See the readme for snapshot usage.

mp911de closed this as completed Jan 30, 2020