
JetStream is discarding messages it likely shouldn't #5638

Closed
aksdb opened this issue Jul 10, 2024 · 27 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@aksdb

aksdb commented Jul 10, 2024

Observed behavior

We have a work queue stream that is consumed by about 250 pull consumers and requires ACKs. Over the past few weeks we noticed that a small fraction of messages gets lost. Watching the advisory subjects, it looks like JetStream is dropping the messages itself. Example:

13:20:39 {"type":"io.nats.jetstream.advisory.v1.terminated","id":"8pZH7RdhGdWrEJbkdfdy2D","timestamp":"2024-07-09T13:20:39.504816844Z","stream":"corejobs","consumer":"posts-PostSaved-worker","consumer_seq":311629,"stream_seq":49245214,"deliveries":1,"reason":"Unacknowledged message was deleted"}

The consumer doesn't have a MaxDeliver set, and as you can see, the message was only delivered once, so there should be no reason to drop it while an ACK was still pending. (We also saw messages being dropped with that reason that had deliveries:0.) From what I can tell, no limits were exceeded either: since the work queue only carries very small messages (basically just headers without payload), the stream limit of 500 MiB was certainly not reached, and the message in question was far younger than the maximum age (7d; it was barely a few minutes old).

Even more suspicious: our workers log the processing of messages, and we never saw this message hit a worker. We also sniffed the $JS.API and _INBOX traffic, and from what we can tell, the messages in question were never delivered at all. We basically only see them being ACK'ed by JetStream on receipt, and soon after we see them being terminated on the advisory subject.
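For reference, these advisories can be tailed with a plain core NATS subscription on the terminated-message advisory subject. A minimal Go sketch (we use the Java client in production, so this is only illustrative):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Terminated-message advisories for the stream in question.
	// Subject layout: $JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.<stream>.<consumer>
	subj := "$JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.corejobs.>"
	if _, err := nc.Subscribe(subj, func(m *nats.Msg) {
		log.Printf("%s", m.Data)
	}); err != nil {
		log.Fatal(err)
	}

	select {} // block and keep printing advisories
}
```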

Here is the configuration of the stream in question:

Information for Stream corejobs created 2022-07-22 08:16:16

              Subjects: core.jobs.>
              Replicas: 3
               Storage: File

Options:

             Retention: WorkQueue
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: 500 MiB
           Maximum Age: 7d0h0m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-cluster-4_v1
               Replica: nats-cluster-0_v1, current, seen 1.85s ago
               Replica: nats-cluster-3_v1, current, seen 1.85s ago

State:

              Messages: 0
                 Bytes: 0 B
        First Sequence: 420,487,665
         Last Sequence: 420,487,664 @ 2024-07-10 10:10:49 UTC
      Active Consumers: 238
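For reference, a Go-client sketch of the same stream configuration (values transcribed from the info output above; this assumes an established *nats.Conn nc and a context ctx, with error handling omitted, since our streams are actually managed elsewhere):

```go
// Sketch of the stream above using the nats.go "jetstream" package.
js, _ := jetstream.New(nc)
_, err := js.CreateStream(ctx, jetstream.StreamConfig{
	Name:       "corejobs",
	Subjects:   []string{"core.jobs.>"},
	Storage:    jetstream.FileStorage,
	Replicas:   3,
	Retention:  jetstream.WorkQueuePolicy,
	Discard:    jetstream.DiscardOld,
	Duplicates: 2 * time.Minute,
	MaxBytes:   500 * 1024 * 1024, // 500 MiB
	MaxAge:     7 * 24 * time.Hour,
})
```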

Just in case, this is what our pull consumers look like:

Information for Consumer corejobs > notifications-SendMail-worker created 2024-02-06T09:58:48Z

Configuration:

                    Name: notifications-SendMail-worker
               Pull Mode: true
          Filter Subject: core.jobs.notifications.SendMail
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 1m0s
           Replay Policy: Instant
         Max Ack Pending: 1,000
       Max Waiting Pulls: 512

Cluster Information:

                    Name: nats
                  Leader: nats-cluster-4_v1
                 Replica: nats-cluster-0_v1, current, seen 829ms ago
                 Replica: nats-cluster-3_v1, current, seen 829ms ago

State:

  Last Delivered Message: Consumer sequence: 3,652,619 Stream sequence: 420,486,612 Last delivery: 2m58s ago
    Acknowledgment Floor: Consumer sequence: 3,652,619 Stream sequence: 420,487,750 Last Ack: 2m58s ago
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
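And a matching consumer definition as a Go-client sketch (same assumptions as the stream sketch above; the real consumers are created from our Java services):

```go
cons, err := js.CreateOrUpdateConsumer(ctx, "corejobs", jetstream.ConsumerConfig{
	Durable:       "notifications-SendMail-worker",
	FilterSubject: "core.jobs.notifications.SendMail",
	DeliverPolicy: jetstream.DeliverAllPolicy,
	AckPolicy:     jetstream.AckExplicitPolicy,
	AckWait:       time.Minute,
	ReplayPolicy:  jetstream.ReplayInstantPolicy,
	MaxAckPending: 1000,
	MaxWaiting:    512,
})
```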

Expected behavior

The messages should not be dropped.

Server and client version

Server: 2.10.16 and 2.10.17
Client: Java SDK 2.19.1

Host environment

Kubernetes on Azure

Steps to reproduce

No response

@aksdb aksdb added the defect Suspected defect such as a bug or regression label Jul 10, 2024
@kozlovic
Member

@aksdb I would recommend that you try a build from main, since I believe the fix in #5639 will address your issue. Once you confirm, we can close this issue.

@aksdb
Author

aksdb commented Jul 11, 2024

We'll deploy the latest nightly build and see how it behaves. I'll report back ASAP.

@chrisdelachal

chrisdelachal commented Jul 11, 2024

Hi all.
We are encountering the same issue, even with release v2.10.18-RC.2.
In our case, we have a stream S1 on mytopic.>.
We have 20 consumers: C1 on mytopic.C1, C2 on mytopic.C2, ..., C20 on mytopic.C20.
Each consumer from C1 to C19 publishes data to mytopic.C20.
When consumer C20 fetches data from mytopic.C20 and acks a message, all other consumers lose their data.
The main stream has messages published on mytopic.C1 to mytopic.C19. An easy way to reproduce: publish data to mytopic.C1 through mytopic.C19, start consumers C1 to C19, consume a message and ack it, then run C20, read and ack. All messages in mytopic.C1 to mytopic.C19 are lost.
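A rough sketch of that scenario with the Go client might look like this (names and exact configuration are assumed; error handling is omitted for brevity):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL)
	defer nc.Drain()
	js, _ := jetstream.New(nc)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// One work-queue stream over the shared subject space (assumed config).
	s, _ := js.CreateStream(ctx, jetstream.StreamConfig{
		Name:      "S1",
		Subjects:  []string{"mytopic.>"},
		Retention: jetstream.WorkQueuePolicy,
	})

	// Twenty filtered pull consumers, C1..C20, one subject each.
	for i := 1; i <= 20; i++ {
		s.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
			Durable:       fmt.Sprintf("C%d", i),
			FilterSubject: fmt.Sprintf("mytopic.C%d", i),
			AckPolicy:     jetstream.AckExplicitPolicy,
		})
	}

	// Publish to mytopic.C1..C19, plus one message to mytopic.C20
	// (standing in for what C1..C19 would normally publish).
	for i := 1; i <= 19; i++ {
		js.Publish(ctx, fmt.Sprintf("mytopic.C%d", i), []byte("payload"))
	}
	js.Publish(ctx, "mytopic.C20", []byte("payload"))

	// Fetch and ack a single message on C20, then inspect the stream:
	// according to the report, the messages for C1..C19 vanish here.
	c20, _ := s.Consumer(ctx, "C20")
	if batch, err := c20.Fetch(1); err == nil {
		for msg := range batch.Messages() {
			msg.Ack()
		}
	}
	info, _ := s.Info(ctx)
	fmt.Println("messages left in stream:", info.State.Msgs)
}
```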

@wallyqs
Member

wallyqs commented Jul 11, 2024

@chrisdelachal can you share the consumer info from a couple of those consumers along with the stream info?

@chrisdelachal

[Screenshots of the stream and consumer info attached]

@aksdb
Author

aksdb commented Jul 11, 2024

Two awesome colleagues of mine monitored the situation today. It seems the behavior has changed a little with synadia/nats-server:nightly-20240711, but the problem persists.

We still saw messages being TERMed with deliveries:0; however, we could see that those messages had actually been delivered but were NAKed by our worker (this can happen when the load on a single worker is too high and it wants to "send back" the message to be processed by someone else). As our stream and consumer settings in the initial post show, this should of course not lead to the messages being dropped.

It also looks like this happens suspiciously often when pull consumers hit their pull timeout (the timestamps indicate that the TERMs happen around the same second of each minute, which seems to correlate with the second the service was started and therefore with the initial pull).
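For clarity, the worker-side pattern is roughly the following (a Go sketch with invented helper names; our real workers use the Java SDK, and cons is assumed to be bound to one of the durable pull consumers described above):

```go
batch, err := cons.Fetch(10, jetstream.FetchMaxWait(5*time.Second))
if err != nil {
	return err
}
for msg := range batch.Messages() {
	if overloaded() { // hypothetical local load check
		// Hand the message back so another worker can pick it up.
		// With no MaxDeliver on the consumer, a NAK should only ever
		// cause a redelivery, never a termination.
		_ = msg.Nak()
		continue
	}
	process(msg) // hypothetical business logic
	_ = msg.Ack()
}
```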

@chrisdelachal

FYI, in case it helps: since this was a major issue for us in production, we downgraded to release 2.10.9, and hopefully the issue is temporarily fixed.

@aksdb
Author

aksdb commented Jul 11, 2024

FYI, in case it helps: since this was a major issue for us in production, we downgraded to release 2.10.9, and hopefully the issue is temporarily fixed.

Thanks :) Thankfully the only work queue we have (at the moment) is backed by another database (don't ask, please 🙄), so we can mitigate this by having a cron job re-schedule suspiciously old work items every few minutes.

@kozlovic
Member

@aksdb I have not been able to reproduce the issue based on what you described. However, I found an open issue that sounds quite similar (notion of NAK and timeout): #5526.
Out of curiosity, are you using the simplified JetStream API in the client (use of CreateOrUpdateXXX) or the "old" API?
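For anyone following along, the two styles being asked about look roughly like this in the Go client (illustrative only; the stream and consumer names are taken from the issue, the filter subject is assumed, and the actual clients involved here are Java and NATS.js):

```go
// Simplified JetStream API (nats.go "jetstream" package):
js, _ := jetstream.New(nc)
cons, _ := js.CreateOrUpdateConsumer(ctx, "corejobs", jetstream.ConsumerConfig{
	Durable:       "posts-PostSaved-worker",
	FilterSubject: "core.jobs.posts.PostSaved", // assumed filter subject
	AckPolicy:     jetstream.AckExplicitPolicy,
})
if batch, err := cons.Fetch(10); err == nil {
	for msg := range batch.Messages() {
		msg.Ack()
	}
}

// "Old" API (nats.go "nats" package):
legacy, _ := nc.JetStream()
sub, _ := legacy.PullSubscribe("core.jobs.posts.PostSaved", "posts-PostSaved-worker")
if msgs, err := sub.Fetch(10); err == nil {
	for _, m := range msgs {
		m.Ack()
	}
}
```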

@plucked

plucked commented Jul 16, 2024

I have the same or a similar issue. I also tried the latest RC3 candidate.

In essence: one stream, many consumers with non-wildcard filter subjects. When I push messages at a rate of about 10k, I see dropped messages. I tried several configurations (file vs. memory, single replica vs. 3 replicas), but none changed the outcome.

The machine is a Google n2 instance and I see about 500-700 micro cores of CPU usage, so it's busy but not exhausted. I cannot reproduce this myself on my Mac, only within Kubernetes.

Is there any information I can provide that might help to reproduce this?

@kozlovic
Member

@plucked But are we sure in your case that there is no limit on the max number of redeliveries? Because even if a message is not explicitly ack'ed, if it reaches the max number of redeliveries it will get removed. As for providing information to help reproduce: you mentioned that you are not able to reproduce on your Mac, but do you have a test case (some code) that simulates what you are doing? This could help us figure out the chain of events and either get lucky reproducing it or at least check code paths and see if we find something.
I have continued trying to reproduce with NAKs, purging the stream while some messages were pending, publishing new ones, updating the stream and/or consumer, etc., and I am still not able to reproduce the issue. That is, so far messages are removed only when ack'ed or when the stream is purged.

@plucked

plucked commented Jul 17, 2024

Update from my side: version 2.10.18 fixes it for us. I do not see any messages dropped anymore under load. If something comes up, I will create a ticket, or update here if this one is not closed.

@ThalesValentim

Hi everyone,

We have been following this issue, as the symptoms are very similar to an issue we are facing with our JetStream work queue.
Unfortunately, version 2.10.18 didn't fix it for us, and we weren't able to reproduce it in a local environment (macOS), even with Docker.
Currently, it is only reproducible in a production environment.

We downgraded to version 2.10.14, where the issue doesn't occur. That might give an idea of what has changed since then.

Server and client version

Server: 2.10.16, 2.10.17 and 2.10.18
Client: NATS.js 2.28.0

Host environment

Clustered NATS / 3 replicas
Docker
AWS

Steps to reproduce

Our use case is straightforward and the message processing is very light. Processing each message generates a log entry, so we expect the number of logged items to match the number of published messages:

  1. Publish 1000+ messages to a single topic at once
  2. At least 3 consumers subscribe to receive them
  3. Messages are dropped without being acknowledged
  4. Total log entries don't match the published messages

The queue settings are the following (a rough Go-client sketch of an equivalent consumer follows the list):

  • ACK policy is Explicit
  • Manual ack is enabled
  • Max ACK pending is 30
  • Consumers have the same delivery group
  • Discard policy is New
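The sketch below shows roughly what such a consumer setup looks like with the Go client (our actual services use NATS.js 2.28.0; the stream, subject, and queue names here are invented for illustration):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// All subscribers share one deliver (queue) group, use explicit
	// manual acks, and allow at most 30 unacknowledged messages in flight.
	_, err = js.QueueSubscribe("jobs.process", "workers", func(m *nats.Msg) {
		log.Printf("processing %s", m.Subject) // the per-message log entry
		if err := m.Ack(); err != nil {
			log.Printf("ack failed: %v", err)
		}
	},
		nats.BindStream("JOBS"), // hypothetical work-queue stream
		nats.AckExplicit(),
		nats.ManualAck(),
		nats.MaxAckPending(30),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}
```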

Below are graphs showing the behavior of each NATS server version, 2.10.14 vs 2.10.18.

[2.10.14] The following picture represents the expected behavior, where 1000 messages were pending in the work queue and consumed completely after about 4 minutes:
[Screenshot: work queue graph on 2.10.14]

[2.10.18] The following shows pending messages being dropped right after publishing them. About 70-80% of the messages were not acknowledged.
[Screenshot: work queue graph on 2.10.18]

Is there any other information we could provide to help investigate the issue?

@aksdb
Author

aksdb commented Jul 24, 2024

For us it's also still happening. We ran 2.10.18 for a few days and still saw terminated messages on our stream:

{"type":"io.nats.jetstream.advisory.v1.terminated","id":"a5wvC7ppU0tFv6oeAttEPh","timestamp":"2024-07-23T07:49:13.535409202Z","stream":"corejobs","consumer":"media-UpdateEntityReferences-worker","consumer_seq":1227973,"stream_seq":49364147,"deliveries":0,"reason":"Unacknowledged message was deleted"}

@ripienaar
Contributor

This one means a limit (max age, size, etc.) was hit for a message that had not yet been processed by your app, so I believe this is different.

@aksdb
Author

aksdb commented Jul 24, 2024

@ripienaar as you can see in the stream info posted in the bug report, the stream doesn't have a retry limit. The message is far younger than the 7d limit (only a few minutes old) and the stream's size limit is not exceeded either. There is no reason for JetStream to discard the message.

@kozlovic
Member

@aksdb @ripienaar @neilalexander I wonder then if this could be related to the open PR: #5697?

@srs-adamr

srs-adamr commented Jul 25, 2024

This exact issue is happening to us as well! (For about 2 months now)

We thought we were going crazy!

.NET V2 SDK
nats:alpine:latest
3 node cluster
using MEMORY storage

[Screenshots of the stream and consumer state attached]

@neilalexander
Member

Yes, this should be fixed by #5697 in the next version.

@srs-adamr

srs-adamr commented Jul 25, 2024

Now running version 2.10.18; gave it a test and this is still occurring. Unless you mean that the fix isn't in that release yet.

[Screenshot attached]

@derekcollison
Member

You would want to try the nightly build from main, which has the fix. It will be pulled into 2.10.19 as well.

@srs-adamr

srs-adamr commented Jul 25, 2024

You would want to try the nightly build from main, which has the fix. It will be pulled into 2.10.19 as well.

https://hub.docker.com/_/nats/tags?page=&page_size=&ordering=&name=
I am not seeing the nightly build among the Docker images?

Never mind, I found it; it's under Synadia's Docker Hub!

https://hub.docker.com/r/synadia/nats-server/tags

Unfortunately, it looks like there isn't a nightly build for Alpine.

@derekcollison
Member

@aksdb
Author

aksdb commented Jul 26, 2024

Our first impression of the nightly is good: we haven't seen a TERM advisory yet. We will have to monitor it a bit more next week, since we can't trigger the behavior directly, so we may have just been "lucky" today.

@srs-adamr

srs-adamr commented Jul 26, 2024

We have seen the issue cease on the nightly; the stream messages aren't being cleared anymore!

However, not sure if it's related, but the Deleted count on our stream keeps rising and then randomly resets (we aren't deleting any messages manually, and all Max* parameters are set extremely high).

@aksdb
Author

aksdb commented Aug 1, 2024

After a week of monitoring, it seems the initial problem is gone with the nightly build. We haven't seen any unexpected TERMs anymore, and apparently we aren't missing any messages either.

@wallyqs
Member

wallyqs commented Aug 28, 2024

Fixed via #5697, which is part of v2.10.19.
