[FEATURE REQ] EventHub consumer group / EventProcessorClient metrics #19391
Comments
@conniey could you please take a look?
Hey, thanks for suggesting this. At the moment, we have incorporated client-side tracing around processing each message (https://github.com/Azure/azure-sdk-for-java/wiki/Tracing). However, these metrics seem like they would have to be surfaced by the service? @JamesBirdsall, do you know if this is something that Event Hubs publishes?
Tracing is great, but I doubt it can address this issue, as comparing consumption (i.e. the state of the consumer group) with production (i.e. the state of the hub) seems required.

Disclaimer: I'm not an Event Hubs power user, I only maintain a couple of consumer groups and might say odd things. I can imagine four options:

1. The Event Hubs service publishes consumer group metrics. Users can view them using an Azure service or retrieve them programmatically. From my user perspective, that would be the ideal solution. However, I don't see how the Event Hubs service could compute such metrics, as consumer group offsets are persisted in a blob checkpoint store. I doubt that Event Hubs knows where consumer group checkpoints are stored, nor that it can read them.
2. Monitor one or a set of consumer groups using a provided tool or library. Combining information from a CheckpointStore and an EventHubConsumerClient should be enough (i.e. checkpoint.getSequenceNumber(), partitionProperties.getLastEnqueuedSequenceNumber(), partitionProperties.getLastEnqueuedTime(), and maybe client.receiveFromPartition("", 1, EventPosition.fromSequenceNumber(offset)) to compute latency); a sketch of this computation follows this list. This would be similar to what kafka-lag-exporter does. The main challenge is likely how to monitor multiple consumer groups without cumbersome configuration. Because Kafka brokers persist consumer group state and expose an API, one can easily deploy a single kafka-lag-exporter to monitor all of a cluster's consumer groups. I don't see such a pleasant deployment option for Event Hubs.
3. The consumer application monitors its consumer groups as in the previous option. The application already knows the consumer group name, checkpoint location, credentials, etc. However, it would mean that each application replica monitors the consumer group, or that an election mechanism is required.
4. The consumer application monitors events flowing through an EventProcessorClient rather than consumer group state. This seems doable by combining DataEvent, InitializationContext and CloseContext. However, metric semantics and alerting might be harder to get right, as the application might stop, some partitions may be unassigned, some partitions might receive or consume no events during a window, etc.
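To make option 2 concrete, here is a minimal sketch of the lag computation it describes, combining a BlobCheckpointStore with an EventHubConsumerClient. This is not part of the SDK; the connection strings, names, and the ConsumerGroupLagProbe class are placeholders, and it assumes the checkpoints record a sequence number.

```java
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubConsumerClient;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.messaging.eventhubs.models.PartitionProperties;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class ConsumerGroupLagProbe {
    public static void main(String[] args) {
        // Placeholder configuration: replace with real values.
        String eventHubConnectionString = "<event-hub-connection-string>";
        String eventHubName = "<event-hub-name>";
        String consumerGroup = "<consumer-group>";

        // The same blob container the consumer application checkpoints into.
        BlobCheckpointStore checkpointStore = new BlobCheckpointStore(
            new BlobContainerClientBuilder()
                .connectionString("<storage-connection-string>")
                .containerName("<checkpoint-container>")
                .buildAsyncClient());

        try (EventHubConsumerClient consumer = new EventHubClientBuilder()
                .connectionString(eventHubConnectionString, eventHubName)
                .consumerGroup(consumerGroup)
                .buildConsumerClient()) {

            checkpointStore
                .listCheckpoints(consumer.getFullyQualifiedNamespace(), eventHubName, consumerGroup)
                .toIterable()
                .forEach(checkpoint -> {
                    Long checkpointed = checkpoint.getSequenceNumber();
                    if (checkpointed == null) {
                        return; // no sequence number checkpointed for this partition yet
                    }
                    // Compare the consumer group's checkpoint with the head of the partition.
                    PartitionProperties partition =
                        consumer.getPartitionProperties(checkpoint.getPartitionId());
                    long lag = partition.getLastEnqueuedSequenceNumber() - checkpointed;
                    System.out.printf("partition %s: lag=%d events, last enqueued at %s%n",
                        checkpoint.getPartitionId(), lag, partition.getLastEnqueuedTime());
                });
        }
    }
}
```

Latency could be estimated in the same spirit by also reading the enqueued time of the first unprocessed event, as the comment suggests.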
Trying to produce good metrics for Event Hubs is a complex issue. Event Hubs itself is a distributed system, and frequently the consumer application is as well, and neither side has a complete picture. Kafka can do a bit better because checkpoints are stored in the service, giving the service side more insight. That may eventually become an option for Event Hubs accessed via AMQP as well, but adding that functionality is not currently on any plan or schedule.

One thing we do have internally is a metric for "backlog" on a receiver, which is the difference between the highest sequence number flowed to the receiver and the highest sequence number in the partition. This doesn't map directly to lag at the application level, because AMQP is a streaming protocol and messages flow down the wire into a buffer within the client while the application is doing other things, and it is common that the messages flowed are well ahead of what the application has actually processed. It is still a useful proxy: if the backlog is increasing, we can infer that the application is processing slowly (or stopped); if the backlog is decreasing, the application is catching up; and if the backlog remains 0, then the application is keeping up with message flow and presumably healthy. However, it is possible to have more than one receiver on the same partition and consumer group, and that's something which is more likely to happen when there are troubles; for example, the application is in a bad state and not processing messages for a partition, but the receiver is still connected while the application creates another receiver to get message processing started again. What then is the backlog for the partition on that consumer group? A human engineer can look at these metrics and interpret them usefully, but trying to devise rules for an automated monitoring system would be difficult.

Our best existing solution to this problem is a feature which adds information about the current state of the partition to each message, which gives the application side the most complete view of the overall situation. This may be what you were talking about with EventContext, although it looks like what I'm thinking of is exposed in the new client as PartitionProperties. It suffers from certain drawbacks, in that the information is already at least a little old when your application sees it, and of course if communications with the service pause or stop then the application can't update the metrics at all. However, as things stand, the consuming application is the only part of the combined system that knows what messages have actually been processed, so it is uniquely positioned to measure the actual lag.
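For reference, a minimal sketch of the "partition state delivered with each event" approach described above, assuming the azure-messaging-eventhubs processor with trackLastEnqueuedEventProperties enabled. Connection strings and names are placeholders, and the lag values are only printed rather than exported as metrics.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.messaging.eventhubs.models.EventContext;
import com.azure.messaging.eventhubs.models.LastEnqueuedEventProperties;
import com.azure.storage.blob.BlobContainerClientBuilder;

import java.time.Duration;

public class LagAwareProcessor {
    public static void main(String[] args) {
        EventProcessorClient processor = new EventProcessorClientBuilder()
            .connectionString("<event-hub-connection-string>", "<event-hub-name>")
            .consumerGroup("<consumer-group>")
            .checkpointStore(new BlobCheckpointStore(
                new BlobContainerClientBuilder()
                    .connectionString("<storage-connection-string>")
                    .containerName("<checkpoint-container>")
                    .buildAsyncClient()))
            // Ask the service to send the partition's latest state with each event.
            .trackLastEnqueuedEventProperties(true)
            .processEvent(LagAwareProcessor::onEvent)
            .processError(context -> System.err.println("Error: " + context.getThrowable()))
            .buildEventProcessorClient();

        processor.start();
        // ... keep the application running; call processor.stop() on shutdown.
    }

    private static void onEvent(EventContext context) {
        LastEnqueuedEventProperties last = context.getLastEnqueuedEventProperties();
        if (last != null && last.getSequenceNumber() != null) {
            // Lag in events and in time between this event and the partition's head,
            // as known at the moment this event was delivered.
            long sequenceLag = last.getSequenceNumber() - context.getEventData().getSequenceNumber();
            Duration timeLag = Duration.between(
                context.getEventData().getEnqueuedTime(), last.getEnqueuedTime());
            System.out.printf("partition %s: lag=%d events, %s behind%n",
                context.getPartitionContext().getPartitionId(), sequenceLag, timeLag);
        }
        context.updateCheckpoint();
    }
}
```

As noted above, these numbers go stale if communications pause, so an external watchdog would still be needed to detect a partition that has stopped receiving entirely.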
Has any progress been made in the past months? Event Hubs consumers are a major pain point of our information system. Consumers sporadically stop processing some partitions until the application is restarted, and this cannot be detected automatically because consumer group state cannot be observed. It seems inconceivable that a service such as Event Hubs does not enable users to easily and robustly monitor whether a consumer group is lagging or not. This is a non-existent issue for any other messaging system that I'm aware of. @JamesBirdsall While I understand your explanation (and with no better alternative we developed some monitoring based on message consumption and EventContext), this cannot be an acceptable answer from the Event Hubs team.
We have several GitHub issues that cover the gaps in our observability experience. Closed in favour of:
Is your feature request related to a problem? Please describe.
I maintain applications consuming data from Event Hubs managed by external legal entities and would like to monitor my consumer groups' health. AFAIK azure-sdk-for-java provides no consumer group or EventProcessorClient related metrics, nor tools that could observe consumer group health (e.g. https://github.com/lightbend/kafka-lag-exporter).
I must compute such metrics by myself, and the EventProcessorClient API is rather burdensome for such a task.
Describe the solution you'd like
I would like an EventProcessorClient to produce metrics enabling consumer group supervision. SDK metrics should be easy to merge with other application metrics and not be tied to Azure Monitor. Kafka's consumer metrics and KIP-489 (Kafka Consumer Record Latency Metric) could serve as a source of inspiration.
The metrics exported by kafka-lag-exporter are also a good reference for useful metrics. At the very least, I want to be able to monitor consumer group consumption rate, max lag and latency, and production rate.
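To illustrate the "not tied to Azure Monitor" requirement, here is a hypothetical sketch (not an existing SDK feature) of exposing per-partition lag through Micrometer, which most metrics backends can scrape. The metric name, tags, and the ConsumerLagMetrics class are made up for the example.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Tracks per-partition lag and exposes it as a gauge that any Micrometer backend
// (Prometheus, StatsD, ...) can scrape, independently of Azure Monitor.
public class ConsumerLagMetrics {
    private final MeterRegistry registry;
    private final Map<String, AtomicLong> lagByPartition = new ConcurrentHashMap<>();

    public ConsumerLagMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Called with the lag computed from EventContext in the processEvent handler. */
    public void recordLag(String consumerGroup, String partitionId, long sequenceLag) {
        lagByPartition
            .computeIfAbsent(partitionId, id -> registry.gauge(
                "eventhubs.consumer.lag",
                Tags.of("consumer_group", consumerGroup, "partition", id),
                new AtomicLong()))
            .set(sequenceLag);
    }

    public static void main(String[] args) {
        ConsumerLagMetrics metrics = new ConsumerLagMetrics(new SimpleMeterRegistry());
        metrics.recordLag("<consumer-group>", "0", 42L); // hypothetical values
    }
}
```

The point of the sketch is the shape of the API, not the backend: an SDK-provided gauge like this could be registered against whatever meter registry the application already uses.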
Describe alternatives you've considered
Additional context
We started a prototype based on EventContext and wrapping an event processor, but that proved to be more difficult than expected: