
JetStream is discarding messages it likely shouldn't #5638

Closed
aksdb opened this issue Jul 10, 2024 · 27 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@aksdb

aksdb commented Jul 10, 2024

Observed behavior

We have a work queue stream that is consumed by about 250 pull consumers and requires ACKs. Over the past few weeks we noticed that a small fraction of messages gets lost. Watching the advisory subjects, it looks like JetStream is dropping the messages itself. Example:

13:20:39 {"type":"io.nats.jetstream.advisory.v1.terminated","id":"8pZH7RdhGdWrEJbkdfdy2D","timestamp":"2024-07-09T13:20:39.504816844Z","stream":"corejobs","consumer":"posts-PostSaved-worker","consumer_seq":311629,"stream_seq":49245214,"deliveries":1,"reason":"Unacknowledged message was deleted"}

The consumer doesn't have a MaxDeliver set, and as you can see, the message was only delivered once, so there should be no reason to drop it while an ACK was still pending. (We also saw messages being dropped with that reason that had deliveries:0.) From what I can tell, no limits were exceeded either: since the work queue only carries very small messages (basically just headers without payload), the stream limit of 500 MiB was certainly not reached, and the message in question was far younger than the maximum age (7d; it was barely a few minutes old).

Even more suspicious: our workers log the processing of messages, and we never saw this message hit a worker. We also sniffed the $JS.API and _INBOX traffic, and from what we can tell, the messages in question were never delivered at all. We basically only see them being ACK'ed by JetStream on receipt, and soon after we see them being terminated on the advisory subject.
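For reference, these advisories can be tailed with a plain core NATS subscription on the terminated-message advisory subject. A minimal Go sketch (we use the Java client in production, so this is only illustrative):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Terminated-message advisories for the stream in question.
	// Subject layout: $JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.<stream>.<consumer>
	subj := "$JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.corejobs.>"
	if _, err := nc.Subscribe(subj, func(m *nats.Msg) {
		log.Printf("%s", m.Data)
	}); err != nil {
		log.Fatal(err)
	}

	select {} // block and keep printing advisories
}
```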

Here is the configuration of the stream in question:

Information for Stream corejobs created 2022-07-22 08:16:16

              Subjects: core.jobs.>
              Replicas: 3
               Storage: File

Options:

             Retention: WorkQueue
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: 500 MiB
           Maximum Age: 7d0h0m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-cluster-4_v1
               Replica: nats-cluster-0_v1, current, seen 1.85s ago
               Replica: nats-cluster-3_v1, current, seen 1.85s ago

State:

              Messages: 0
                 Bytes: 0 B
        First Sequence: 420,487,665
         Last Sequence: 420,487,664 @ 2024-07-10 10:10:49 UTC
      Active Consumers: 238
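For reference, a Go-client sketch of the same stream configuration (values transcribed from the info output above; this assumes an established *nats.Conn nc and a context ctx, with error handling omitted, since our streams are actually managed elsewhere):

```go
// Sketch of the stream above using the nats.go "jetstream" package.
js, _ := jetstream.New(nc)
_, err := js.CreateStream(ctx, jetstream.StreamConfig{
	Name:       "corejobs",
	Subjects:   []string{"core.jobs.>"},
	Storage:    jetstream.FileStorage,
	Replicas:   3,
	Retention:  jetstream.WorkQueuePolicy,
	Discard:    jetstream.DiscardOld,
	Duplicates: 2 * time.Minute,
	MaxBytes:   500 * 1024 * 1024, // 500 MiB
	MaxAge:     7 * 24 * time.Hour,
})
```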

Just in case, this is what our pull consumers look like:

Information for Consumer corejobs > notifications-SendMail-worker created 2024-02-06T09:58:48Z

Configuration:

                    Name: notifications-SendMail-worker
               Pull Mode: true
          Filter Subject: core.jobs.notifications.SendMail
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 1m0s
           Replay Policy: Instant
         Max Ack Pending: 1,000
       Max Waiting Pulls: 512

Cluster Information:

                    Name: nats
                  Leader: nats-cluster-4_v1
                 Replica: nats-cluster-0_v1, current, seen 829ms ago
                 Replica: nats-cluster-3_v1, current, seen 829ms ago

State:

  Last Delivered Message: Consumer sequence: 3,652,619 Stream sequence: 420,486,612 Last delivery: 2m58s ago
    Acknowledgment Floor: Consumer sequence: 3,652,619 Stream sequence: 420,487,750 Last Ack: 2m58s ago
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
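And a matching consumer definition as a Go-client sketch (same assumptions as the stream sketch above; the real consumers are created from our Java services):

```go
cons, err := js.CreateOrUpdateConsumer(ctx, "corejobs", jetstream.ConsumerConfig{
	Durable:       "notifications-SendMail-worker",
	FilterSubject: "core.jobs.notifications.SendMail",
	DeliverPolicy: jetstream.DeliverAllPolicy,
	AckPolicy:     jetstream.AckExplicitPolicy,
	AckWait:       time.Minute,
	ReplayPolicy:  jetstream.ReplayInstantPolicy,
	MaxAckPending: 1000,
	MaxWaiting:    512,
})
```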

Expected behavior

The messages should not be dropped.

Server and client version

Server: 2.10.16 and 2.10.17
Client: Java SDK 2.19.1

Host environment

Kubernetes on Azure

Steps to reproduce

No response

@aksdb aksdb added the defect Suspected defect such as a bug or regression label Jul 10, 2024
@kozlovic
Member

@aksdb I would recommend that you try a build from main, since I believe the fix in #5639 will address your issue. Once you confirm, we can close this issue.

@aksdb
Author

aksdb commented Jul 11, 2024

We'll deploy the latest nightly build and see how it behaves. I'll report back ASAP.

@chrisdelachal

chrisdelachal commented Jul 11, 2024

Hi all.
We are encountering the same issue, even with release v2.10.18-RC.2.
In our case, we have a stream S1 on mytopic.>.
We have 20 consumers: C1 on mytopic.C1, C2 on mytopic.C2, ..., C20 on mytopic.C20.
Each consumer from C1 to C19 publishes data to mytopic.C20.
When consumer C20 fetches data from mytopic.C20 and acks a message, all other consumers lose their data.
The main stream has messages published on mytopic.C1 to mytopic.C19. An easy way to reproduce: publish data to mytopic.C1 through mytopic.C19, start consumers C1 to C19, consume a message and ack it, then run C20, read and ack. All messages in mytopic.C1 to mytopic.C19 are lost.
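A rough sketch of that scenario with the Go client might look like this (names and exact configuration are assumed; error handling is omitted for brevity):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL)
	defer nc.Drain()
	js, _ := jetstream.New(nc)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// One work-queue stream over the shared subject space (assumed config).
	s, _ := js.CreateStream(ctx, jetstream.StreamConfig{
		Name:      "S1",
		Subjects:  []string{"mytopic.>"},
		Retention: jetstream.WorkQueuePolicy,
	})

	// Twenty filtered pull consumers, C1..C20, one subject each.
	for i := 1; i <= 20; i++ {
		s.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
			Durable:       fmt.Sprintf("C%d", i),
			FilterSubject: fmt.Sprintf("mytopic.C%d", i),
			AckPolicy:     jetstream.AckExplicitPolicy,
		})
	}

	// Publish to mytopic.C1..C19, plus one message to mytopic.C20
	// (standing in for what C1..C19 would normally publish).
	for i := 1; i <= 19; i++ {
		js.Publish(ctx, fmt.Sprintf("mytopic.C%d", i), []byte("payload"))
	}
	js.Publish(ctx, "mytopic.C20", []byte("payload"))

	// Fetch and ack a single message on C20, then inspect the stream:
	// according to the report, the messages for C1..C19 vanish here.
	c20, _ := s.Consumer(ctx, "C20")
	if batch, err := c20.Fetch(1); err == nil {
		for msg := range batch.Messages() {
			msg.Ack()
		}
	}
	info, _ := s.Info(ctx)
	fmt.Println("messages left in stream:", info.State.Msgs)
}
```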

@wallyqs
Member

wallyqs commented Jul 11, 2024

@chrisdelachal can you share the consumer info from a couple of those consumers along with the stream info?

@chrisdelachal

[Screenshots of the stream and consumer info attached]

@aksdb
Author

aksdb commented Jul 11, 2024

Two awesome colleagues of mine monitored the situation today. It seems the behavior has changed a little with synadia/nats-server:nightly-20240711, but the problem persists.

We still saw messages being TERMed with deliveries:0; however, we could see that those messages had actually been delivered but were NAKed by our worker (this can happen when the load on a single worker is too high and it wants to "send back" the message to be processed by someone else). As our stream and consumer settings in the initial post show, this should of course not lead to the messages being dropped.

It also looks like this happens suspiciously often when pull consumers hit their pull timeout (the timestamps indicate that the TERMs happen around the same second of each minute, which seems to correlate with the second the service was started and therefore with the initial pull).
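For clarity, the worker-side pattern is roughly the following (a Go sketch with invented helper names; our real workers use the Java SDK, and cons is assumed to be bound to one of the durable pull consumers described above):

```go
batch, err := cons.Fetch(10, jetstream.FetchMaxWait(5*time.Second))
if err != nil {
	return err
}
for msg := range batch.Messages() {
	if overloaded() { // hypothetical local load check
		// Hand the message back so another worker can pick it up.
		// With no MaxDeliver on the consumer, a NAK should only ever
		// cause a redelivery, never a termination.
		_ = msg.Nak()
		continue
	}
	process(msg) // hypothetical business logic
	_ = msg.Ack()
}
```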

@chrisdelachal

FYI, in case it helps: since this was a major issue for us in production, we downgraded to release 2.10.9, and hopefully the issue is temporarily fixed.

@aksdb
Author

aksdb commented Jul 11, 2024

FYI, in case it helps: since this was a major issue for us in production, we downgraded to release 2.10.9, and hopefully the issue is temporarily fixed.

Thanks :) Thankfully the only work queue we have (at the moment) is backed by another database (don't ask, please 🙄), so we can mitigate this by having a cron job re-schedule suspiciously old work items every few minutes.

@kozlovic
Member

@aksdb I have not been able to reproduce the issue based on what you described. However, I found an open issue that sounds quite similar (notion of NAK and timeout): #5526.
Out of curiosity, are you using the simplified JetStream API in the client (use of CreateOrUpdateXXX) or the "old" API?
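For anyone following along, the two styles being asked about look roughly like this in the Go client (illustrative only; the stream and consumer names are taken from the issue, the filter subject is assumed, and the actual clients involved here are Java and NATS.js):

```go
// Simplified JetStream API (nats.go "jetstream" package):
js, _ := jetstream.New(nc)
cons, _ := js.CreateOrUpdateConsumer(ctx, "corejobs", jetstream.ConsumerConfig{
	Durable:       "posts-PostSaved-worker",
	FilterSubject: "core.jobs.posts.PostSaved", // assumed filter subject
	AckPolicy:     jetstream.AckExplicitPolicy,
})
if batch, err := cons.Fetch(10); err == nil {
	for msg := range batch.Messages() {
		msg.Ack()
	}
}

// "Old" API (nats.go "nats" package):
legacy, _ := nc.JetStream()
sub, _ := legacy.PullSubscribe("core.jobs.posts.PostSaved", "posts-PostSaved-worker")
if msgs, err := sub.Fetch(10); err == nil {
	for _, m := range msgs {
		m.Ack()
	}
}
```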

@plucked

plucked commented Jul 16, 2024

I have the same or a similar issue. I also tried the latest RC3 candidate.

In essence: one stream, many consumers with non-wildcard filter subjects. When I push messages at a rate of about 10k, I see dropped messages. I tried several configurations (file vs. memory, single replica vs. 3 replicas), but none changed the outcome.

The machine is a Google n2 instance and I see about 500-700 micro cores of CPU usage, so it's busy but not exhausted. I cannot reproduce this myself on my Mac, only within Kubernetes.

Is there any information I can provide that might help to reproduce this?

@kozlovic
Member

@plucked But are we sure in your case that there is no limit on the max number of redeliveries? Because even if a message is not explicitly ack'ed, if it reaches the max number of redeliveries it will get removed. As for providing information to help reproduce: you mentioned that you are not able to reproduce on your Mac, but do you have a test case (some code) that simulates what you are doing? This could help us figure out the chain of events and either get lucky reproducing it or at least check code paths and see if we find something.
I have continued trying to reproduce with NAKs, purging the stream while some messages were pending, publishing new ones, updating the stream and/or consumer, etc., and I am still not able to reproduce the issue. That is, so far messages are removed only when ack'ed or when the stream is purged.

@plucked

plucked commented Jul 17, 2024

Update from my side: version 2.10.18 fixes it for us. I do not see any messages dropped anymore under load. If something comes up, I will create a ticket, or update here if this one is not closed.

@ThalesValentim

Hi everyone,

We have been following this issue, as the symptoms are very similar to an issue we are facing with our JetStream work queue.
Unfortunately, version 2.10.18 didn't fix it for us, and we weren't able to reproduce it in a local environment (macOS), even with Docker.
Currently, it is only reproducible in a production environment.

We downgraded to version 2.10.14, where the issue doesn't occur. That might give an idea of what has changed since then.

Server and client version

Server: 2.10.16, 2.10.17 and 2.10.18
Client: NATS.js 2.28.0

Host environment

Clustered NATS / 3 replicas
Docker
AWS

Steps to reproduce

Our use case is straightforward and the message processing is very light. Processing each message generates a log entry, so we expect the number of logged items to match the number of published messages:

  1. Publish 1000+ messages to a single topic at once
  2. At least 3 consumers subscribe to receive them
  3. Messages are dropped without being acknowledged
  4. Total log entries don't match the published messages

The queue settings are the following (a rough Go-client sketch of an equivalent consumer follows the list):

  • ACK policy is Explicit
  • Manual ack is enabled
  • Max ACK pending is 30
  • Consumers have the same delivery group
  • Discard policy is New
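The sketch below shows roughly what such a consumer setup looks like with the Go client (our actual services use NATS.js 2.28.0; the stream, subject, and queue names here are invented for illustration):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// All subscribers share one deliver (queue) group, use explicit
	// manual acks, and allow at most 30 unacknowledged messages in flight.
	_, err = js.QueueSubscribe("jobs.process", "workers", func(m *nats.Msg) {
		log.Printf("processing %s", m.Subject) // the per-message log entry
		if err := m.Ack(); err != nil {
			log.Printf("ack failed: %v", err)
		}
	},
		nats.BindStream("JOBS"), // hypothetical work-queue stream
		nats.AckExplicit(),
		nats.ManualAck(),
		nats.MaxAckPending(30),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}
```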

Below are graphs showing the behavior of each NATS server version, 2.10.14 vs 2.10.18.

[2.10.14] The following picture represents the expected behavior, where 1000 messages were pending in the work queue and consumed completely after about 4 minutes:
[Screenshot: work queue graph on 2.10.14]

[2.10.18] The following shows pending messages being dropped right after publishing them. About 70-80% of the messages were not acknowledged.
[Screenshot: work queue graph on 2.10.18]

Is there any other information we could provide to help investigate the issue?

@aksdb
Author

aksdb commented Jul 24, 2024

For us it's also still happening. We ran 2.10.18 for a few days and still saw terminated messages on our stream:

{"type":"io.nats.jetstream.advisory.v1.terminated","id":"a5wvC7ppU0tFv6oeAttEPh","timestamp":"2024-07-23T07:49:13.535409202Z","stream":"corejobs","consumer":"media-UpdateEntityReferences-worker","consumer_seq":1227973,"stream_seq":49364147,"deliveries":0,"reason":"Unacknowledged message was deleted"}

@ripienaar
Contributor

This one means a limit (max age, size, etc.) was hit for a message that had not yet been processed by your app, so I believe this is different.

@aksdb
Author

aksdb commented Jul 24, 2024

@ripienaar as you can see in the stream info posted in the bug report, the stream doesn't have a retry limit. The message is far younger than the 7d limit (only a few minutes old) and the stream's size limit is not exceeded either. There is no reason for JetStream to discard the message.

@kozlovic
Member

@aksdb @ripienaar @neilalexander I wonder then if this could be related to the open PR: #5697?

@srs-adamr

srs-adamr commented Jul 25, 2024

This exact issue is happening to us as well! (For about 2 months now)

We thought we were going crazy!

.NET V2 SDK
nats:alpine:latest
3 node cluster
using MEMORY storage

[Screenshots of the stream and consumer state attached]

@neilalexander
Member

Yes, this should be fixed by #5697 in the next version.

@srs-adamr

srs-adamr commented Jul 25, 2024

Now running version 2.10.18; gave it a test and this is still occurring. Unless you mean that the fix isn't in that release yet.

[Screenshot attached]

@derekcollison
Member

You would want to try the nightly build from main, which has the fix. It will be pulled into 2.10.19 as well.

@srs-adamr

srs-adamr commented Jul 25, 2024

You would want to try the nightly build from main, which has the fix. It will be pulled into 2.10.19 as well.

https://hub.docker.com/_/nats/tags?page=&page_size=&ordering=&name=
I am not seeing the nightly build among the Docker images?

Never mind, I found it; it's under Synadia's Docker Hub!

https://hub.docker.com/r/synadia/nats-server/tags

Unfortunately, it looks like there isn't a nightly build for Alpine.

@derekcollison
Member

@aksdb
Author

aksdb commented Jul 26, 2024

Our first impression of the nightly is good: we haven't seen a TERM advisory yet. We will have to monitor it a bit more next week, since we can't trigger the behavior directly, so we may have just been "lucky" today.

@srs-adamr

srs-adamr commented Jul 26, 2024

We have seen the issue cease on the nightly; the stream messages aren't being cleared anymore!

However, not sure if it's related, but the Deleted count on our stream keeps rising and then randomly resets (we aren't deleting any messages manually, and all Max* parameters are set extremely high).

@aksdb
Author

aksdb commented Aug 1, 2024

After a week of monitoring, it seems the initial problem is gone with the nightly build. We haven't seen any unexpected TERMs anymore, and apparently we aren't missing any messages either.

@wallyqs
Member

wallyqs commented Aug 28, 2024

Fixed via #5697, which is part of v2.10.19.
