PersistentStreamPullingAgent skips over the message under a certain condition #9023
Comments
Any updates? |
@benjaminpetit It seems like our fix for preventing duplicate event delivery (https://github.com/dotnet/orleans/pull/7699/files from #7686) is inadvertently causing messages to be skipped. We recently upgraded to Orleans 8 from 3.x, and are seeing this issue as well, although we are using the Event Hub stream provider, not MemoryStreams. |
I've played around with the tests in Orleans and I've been able to reproduce the issue here: … Removing the … |
Is there any update on this? The risk of skipped messages is the reason we still haven't adopted the Streams feature. |
@ReubenBond Can we get an official fix in for this? We have been using a forked version of Orleans with this fix in place since we discovered the issue in late August. I am happy to create a PR for this; please confirm that the fix I have in my branch linked above makes sense from your perspective. |
Hi, |
This is not the same issue. A queue cache miss means that your grains are out-living the lifetime of the cache (roughly speaking). This isn't intuitive, unfortunately.

The way Orleans figures out whether you may have missed a message is that it tries to keep track of the last message you processed in the cache. It needs to see that last message when it receives a new message, to know that there haven't been any intermediate messages (that you may have missed). It tracks the last message for as long as the stream is active for a given grain observer. Now that your grains are living for up to two hours, when a grain receives a new message on the hour, the message from an hour ago has long since been purged from the cache. So Orleans will now throw the queue cache miss exception, which actually means "I have no idea whether I missed a message, because the last message I processed is no longer visible [in the cache]." It's just a warning, but a very spammy one at that.

In short, the timespan in MetadataMinTimeInCache must always be larger than the grain collection age for the type of grain that is consuming the stream. So, if your grains are now living for up to two hours, you need to keep metadata for, let's say, 2h15m to be safe. Metadata is used for tracking the n-1 message only and doesn't have a significant effect on cache memory usage; the other fields control the data cache itself. Before, when your grains only lived 15m, the stream cursor would be deallocated and the streaming runtime would "forget" about the last message received before the next one could arrive 45m later.

This kind of problem is unique to the "rewindable cache" model used by advanced streaming providers like Event Hubs. Orleans reads all partitions into silo-specific memory caches to allow individual grains to rewind the stream without blocking other grains, as they would if the rewind went into the partition directly. There is one cache per agent, one agent per partition, and possibly multiple agents per silo. |
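For illustration, a minimal configuration sketch of that relationship. It assumes an Event Hub provider named "sessions" and the `Configure<GrainCollectionOptions>`/`ConfigureCacheEviction` hooks; the specific TimeSpan values are examples, not recommendations, and the Event Hub connection setup shown later in this thread is omitted here:

```csharp
// Sketch only: keep stream metadata cached for longer than the consuming
// grain's collection age, so the n-1 token is still in the cache when the
// next message arrives. Provider name and values are illustrative.
siloBuilder.Configure<GrainCollectionOptions>(options =>
{
    // Grains consuming the stream may now live for up to two hours.
    options.CollectionAge = TimeSpan.FromHours(2);
});

siloBuilder.AddEventHubStreams("sessions", configurator =>
{
    configurator.ConfigureCacheEviction(builder => builder.Configure(options =>
    {
        // Must exceed the grain collection age above, with some margin.
        options.MetadataMinTimeInCache = TimeSpan.FromMinutes(135);
    }));
});
```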
Hi Oisin, I know that QueueCacheMissExceptions used to be warnings and the messages were delivered anyway. |
Oh, I missed that part. I don't recall any new behaviour that should drop messages. That's quite odd and definitely sounds like a bug somewhere. If it is this issue then yeah, you may be able to avoid triggering it |
We continue to experience this problem. Here's a recap of what we did so far:

- Since our streams receive 1 message per hour, we increased the grain collection age from 15 minutes to 2 hours in order to avoid unnecessary traffic on the grain storage account every hour. This resulted in the loss of most messages.
- We deactivated cache metadata purging by setting …
- I then increased the stream inactivity period from 30 minutes to 2 hours. Again this resulted in the loss of most messages.

So apparently, when you receive messages infrequently, it helps to disable cache metadata purging, but the stream must be deactivated before the next message or it will likely be dropped.

Honestly, this gives me very poor confidence in Orleans Streams at the moment. For new projects we went with an Event Hubs Processor + Broadcast Channels instead of Orleans Streams. We lost some scalability but gained some peace of mind and reliability. |
Following as well. We're seeing stream events with Event Hub occasionally not being delivered to our consumer grains as expected. This seems to have started when we upgraded from v3 to v8, and it is still happening on the latest v9. |
I am seeing the
To help investigate, let me share my setup:
The silo configuration:

```csharp
hostBuilder.AddKeyedAzureTableClient("clustering");
hostBuilder.AddKeyedAzureBlobClient("grain-state");
hostBuilder.UseOrleans(siloBuilder =>
{
siloBuilder.Services.AddSingleton<SessionsEventAdapter>();
siloBuilder.AddEventHubStreams(
name: "sessions",
(ISiloEventHubStreamConfigurator configurator) =>
{
configurator.ConfigureEventHub(builder => builder.Configure(options =>
{
options.ConfigureEventHubConnection(
GetCpmsEventHubFqdn(hostBuilder),
eventHubName: "sessions",
GetPricingEngineConsumerGroup(hostBuilder),
new DefaultAzureCredential(new DefaultAzureCredentialOptions { TenantId = TenantId }));
}));
configurator.ConfigurePartitionReceiver(configure =>
{
configure.Configure(options =>
{
options.PrefetchCount = 100;
});
});
configurator.UseDataAdapter((services, name) => services.GetRequiredService<SessionsEventAdapter>());
configurator.UseAzureTableCheckpointer(
builder => builder.Configure(options =>
{
options.TableServiceClient = new TableServiceClient(
siloBuilder.Configuration.GetConnectionString("EventHubCheckpointStorage"),
options.ClientOptions);
options.PersistInterval = TimeSpan.FromSeconds(10);
}));
configurator.ConfigureStreamPubSub(StreamPubSubType.ImplicitOnly);
});
if (hostBuilder.Environment.IsDevelopment())
{
siloBuilder.UseDashboard(options =>
{
options.HostSelf = false;
});
}
});
```
```csharp
// SessionsEventAdapter.cs
internal sealed class SessionsEventAdapter(Serializer serializer) : EventHubDataAdapter(serializer)
{
public override string GetPartitionKey(StreamId streamId)
=> streamId.ToString();
public override StreamId GetStreamIdentity(EventData queueMessage)
{
// When charging sessions are published, the sessions ID is passed as the partition key.
// That means the Grain that should receive the charging session event is the one with the same ID.
var sessionId = queueMessage.PartitionKey;
return StreamId.Create(EventHubConstants.Sessions.EventHubName, sessionId);
}
public override EventData ToQueueMessage<T>(StreamId streamId, IEnumerable<T> events, StreamSequenceToken token, Dictionary<string, object> requestContext)
=> throw new NotSupportedException("This adapter only supports reading CPMS charging sessions.");
protected override IBatchContainer GetBatchContainer(EventHubMessage eventHubMessage)
=> new SessionsBatchContainer(eventHubMessage);
}
// SessionsBatchContainer.cs
[GenerateSerializer, Immutable]
internal sealed class SessionsBatchContainer : IBatchContainer
{
private static readonly JsonSerializerOptions SessionsJsonSerializerOptions = JsonSerializerOptionsFactory.Create();
[Id(0)]
private readonly EventHubMessage eventHubMessage;
[Id(1)]
public StreamSequenceToken SequenceToken { get; }
public StreamId StreamId => eventHubMessage.StreamId;
public SessionsBatchContainer(EventHubMessage eventHubMessage)
{
this.eventHubMessage = eventHubMessage;
SequenceToken = new EventHubSequenceTokenV2(eventHubMessage.Offset, eventHubMessage.SequenceNumber, 0);
}
public IEnumerable<Tuple<T, StreamSequenceToken>> GetEvents<T>()
{
try
{
if (JsonSerializer.Deserialize<T>(eventHubMessage.Payload, SessionsJsonSerializerOptions) is { } message)
{
return [Tuple.Create(message, SequenceToken)];
}
}
catch (Exception)
{
}
return [];
}
public bool ImportRequestContext() => false;
}
```

Here is my grain:

```csharp
[ImplicitStreamSubscription("sessions")]
internal class ChargingSessionGrain(
[PersistentState("ChargingSession")] IPersistentState<ChargingSession> session,
ILogger<ChargingSessionGrain> logger) : Grain, IChargingSessionGrain, IStreamSubscriptionObserver, IAsyncObserver<SessionEvent>
{
public async Task OnSubscribed(IStreamSubscriptionHandleFactory handleFactory)
{
var handler = handleFactory.Create<SessionEvent>();
await handler.ResumeAsync(this, session.State.Token);
}
public async Task OnNextAsync(SessionEvent item, StreamSequenceToken? token = null)
{
if (session.State.SessionId is not null && session.State.SessionId != item.SessionId)
{
logger.LogError($"Receives a message for a different Session ID. Current = {session.State.SessionId}. Received = {item.SessionId}.");
return;
}
session.State = session.State with
{
SessionId = item.SessionId,
EvseId = item.EvseId,
ConsumedKwh = item.ConsumedKwh,
LastEvent = item.Type,
LastEventTimestamp = item.Timestamp,
EventCount = session.State.EventCount + 1,
Token = token
};
await session.WriteStateAsync();
}
public Task OnCompletedAsync()
{
logger.LogInformation("Stream completed for {EvseId}.", session.State.EvseId);
return Task.CompletedTask;
}
public Task OnErrorAsync(Exception ex)
{
logger.LogError(ex, "An error occurred while processing the stream.");
return Task.CompletedTask;
}
}
```

Telemetry collected over 12 hours:
Let me know if other stats are interesting. I let it run locally on my computer and collected stats/logs via Aspire dashboard. |
I think there are different issues here. I think the first analysis from @tchelidze makes sense, and we may be skipping events when the metadata cache is used. We have different "persistent" providers that don't behave the same, which contributes to this issue... I also think some issues in this thread are caused by configuration mismatches; we should publish some guidance and add some config checks at runtime for that: Metadata cache time > Streaming cache min time > Grain collection time (but if the grain is not called only by streaming, I wonder if it could still trigger the issue you are encountering...). |
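To make the suggested runtime check concrete, here is a hypothetical validation sketch; nothing like this exists in Orleans today, and mapping "streaming cache min time" to `DataMinTimeInCache` is an assumption:

```csharp
using Orleans.Configuration;

// Hypothetical startup check (not an existing Orleans feature) for the ordering
// discussed above: metadata cache time > data cache min time > grain collection age.
static void ValidateStreamingConfiguration(
    StreamCacheEvictionOptions eviction, GrainCollectionOptions collection)
{
    if (eviction.MetadataMinTimeInCache <= eviction.DataMinTimeInCache)
        throw new InvalidOperationException(
            "MetadataMinTimeInCache must be larger than DataMinTimeInCache.");

    if (eviction.DataMinTimeInCache <= collection.CollectionAge)
        throw new InvalidOperationException(
            "DataMinTimeInCache must be larger than the collection age of the grains consuming the stream.");
}
```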
@benjaminpetit will do. @oising kindly provided input on my issue in Discord, and I definitely need more insight into how to configure streams for event ingestion. More guidance/docs and configuration validation would absolutely be great. |
In Orleans 3.x, grain collection time was 2 or 3 hours by default and the cache min time was 30 min, and that did not cause messages to be dropped. The grain collection age was changed to 15 min in Orleans 7.0. I'll create a separate issue if it helps. |
💯 |
Is the recommendation to disable the metadata cache for now, until this potential bug is fixed? |
I created a new issue for a similar bug when |
Hi @benjaminpetit, I opened a separate issue for my Event Hubs streaming problem: #9299 |
This can be closed as resolved I think? (#9336) |
Problem:

When publishing a message under the following conditions:

- the stream is inactive, so it has no entry in `PersistentStreamPullingAgent.pubSubCache`, and
- `PersistentStreamPullingAgent.queueCache` has no messages in it,

then the published message is lost.
How to reproduce:
Here is the link to the GitHub repo demonstrating the issue: https://github.com/tchelidze/Orleans_MemoryStream_LostMessage/tree/master/Orleans_MemoryStream_LostMessage
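The scenario described in the analysis below can be sketched roughly like this (a condensed illustration, not the exact code from the linked repo; it assumes an `IClusterClient` named `client`, a memory stream provider named "memory", and a 30-minute inactivity/cache window):

```csharp
// Publish message 1, let the stream go inactive and the cache purge,
// then publish messages 2 and 3; the consumer never sees message 2.
var stream = client
    .GetStreamProvider("memory")
    .GetStream<int>(StreamId.Create("ns", Guid.NewGuid()));

await stream.OnNextAsync(1);          // message 1 is delivered

// Wait longer than StreamInactivityPeriod and DataMaxAgeInCache so the stream
// is deactivated and message 1 is purged from the pulling agent's cache.
await Task.Delay(TimeSpan.FromMinutes(35));

await stream.OnNextAsync(2);          // this is the message that gets skipped
await stream.OnNextAsync(3);          // message 3 is delivered
```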
Analysis:
Consider the following scenario: we publish message number 1 to the stream. Then we wait, and in the meantime the stream goes inactive and the message cache gets purged. After that we publish messages number 2 and 3 to the stream. Then the following happens.
Inside `PersistentStreamPullingAgent.DoHandshakeWithConsumer` (which gets called from `RegisterAsStreamProducer`; remember, the stream is inactive) we retrieve the last processed message token from the consumer. In our case that would be message number 1. Then we take that token (pointing to message number 1) and call `queueCache.GetCacheCursor`, passing that token.

What `GetCacheCursor` does is where the problem lies, specifically `PooledQueueCache.SetCursor`, where it checks whether the `oldestMessage` is past the given token. `oldestMessage` in our case would be message number 2, while the token points to message number 1, so the `if` statement on line 201 is executed. Then comes the interesting part: we check whether the `lastPurgedToken` is the same as, or past, the given token. `lastPurgedToken` again points to message number 1, because that was the last message evicted from the cache. So that `if` statement also executes, and `PooledQueueCache.SetCursor` sets the `SequenceToken` to the oldest message, which is message number 2.
Issue number 1:

As I understand it, `lastPurgedToken` points to a message which was evicted and is no longer in the cache, so checking for `sequenceToken.CompareTo(entry.Token) >= 0` does not seem correct here; instead I think it should be `sequenceToken.CompareTo(entry.Token) > 0`.
The story continues. Back to `PersistentStreamPullingAgent.DoHandshakeWithConsumer`, line 315. Here the expectation is that `queueCache.GetCacheCursor` gives us back a cursor that points to the last processed message, but because `queueCache` no longer has that message (message number 1), it returns a cursor pointing to the oldest message, which in our case would be message number 2. On line 315 we move the cursor forward (because, remember, the expectation was that the cursor was pointing to the last processed message). As a result, the cursor now points to message number 3, and that is how message number 2 is lost.
Issue number 2:

I think in `PersistentStreamPullingAgent.DoHandshakeWithConsumer`, instead of blindly moving the cursor to the next message, we should check whether it points to the same position as `requestedHandshakeToken`, and if it does not, we should not move it forward.
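Expressed against the public cache abstractions, the proposed guard might look roughly like this; it is a sketch of the idea only, not the actual `PersistentStreamPullingAgent` code, and the method and variable names are illustrative:

```csharp
using Orleans.Runtime;
using Orleans.Streams;

// Sketch of the guard proposed in "Issue number 2" (not the Orleans source):
// decide whether the first cached message should be stepped over during the
// consumer handshake.
static bool ShouldSkipFirstMessage(
    IQueueCache queueCache, StreamId streamId, StreamSequenceToken requestedHandshakeToken)
{
    IQueueCacheCursor cursor = queueCache.GetCacheCursor(streamId, requestedHandshakeToken);
    if (requestedHandshakeToken is null || !cursor.MoveNext())
    {
        // Nothing was processed before, or the cache is empty: deliver from the start.
        return false;
    }

    IBatchContainer first = cursor.GetCurrent(out _);

    // Only step over the first message if it really is the one the consumer
    // already processed. If that message was purged, the cursor is already on
    // an unseen message (message number 2 above) and skipping it loses it.
    return first.SequenceToken.Equals(requestedHandshakeToken);
}
```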
Workaround

The only workaround I can think of is to set `StreamPullingAgentOptions.StreamInactivityPeriod` and `StreamCacheEvictionOptions.DataMaxAgeInCache` to very high values, to avoid the scenario where the `queueCache` is empty and the stream is inactive.

Thoughts?
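For reference, a sketch of what that workaround could look like for an Event Hub provider; the provider name and the TimeSpan values are placeholders, and keeping more data in the cache increases silo memory usage:

```csharp
// Workaround sketch only: stretch the inactivity window and cache retention so
// the "stream inactive + empty cache" combination becomes unlikely.
siloBuilder.AddEventHubStreams("sessions", configurator =>
{
    configurator.ConfigurePullingAgent(builder => builder.Configure(options =>
    {
        options.StreamInactivityPeriod = TimeSpan.FromHours(6);
    }));
    configurator.ConfigureCacheEviction(builder => builder.Configure(options =>
    {
        options.DataMinTimeInCache = TimeSpan.FromHours(5);
        options.DataMaxAgeInCache = TimeSpan.FromHours(6);
    }));
});
```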