[FEATURE REQ] EventHub consumer group / EventProcessorClient metrics #19391
Comments
@conniey could you please take a look?
Hey, thanks for suggesting this. At the moment, we have incorporated client-side tracing around processing each message (https://github.com/Azure/azure-sdk-for-java/wiki/Tracing). However, these metrics seem like they would have to be surfaced by the service? @JamesBirdsall, do you know if this is something that Event Hubs publishes?
Tracing is great, but I doubt it can address this issue, as comparing consumption (i.e. the state of the consumer group) with production (i.e. the state of the hub) seems required.

Disclaimer: I'm not an Event Hubs power user, I only maintain a couple of consumer groups and might say odd things. I can imagine four options:

1. The Event Hubs service publishes consumer group metrics. Users can view them using an Azure service or retrieve them programmatically. From my user perspective, that would be the ideal solution. However, I don't see how the Event Hubs service could compute such metrics, as consumer group offsets are persisted in a blob checkpoint store. I doubt that Event Hubs knows where consumer group checkpoints are stored, nor that it can read them.
2. Monitor one or a set of consumer groups using a provided tool or library. Combining information from a CheckpointStore and an EventHubConsumerClient should be enough (i.e. checkpoint.getSequenceNumber(), partitionProperties.getLastEnqueuedSequenceNumber(), partitionProperties.getLastEnqueuedTime(), and maybe client.receiveFromPartition("", 1, EventPosition.fromSequenceNumber(offset)) to compute latency); a sketch of this computation follows this list. This would be similar to what kafka-lag-exporter does. The main challenge is likely how to monitor multiple consumer groups without cumbersome configuration. Because Kafka brokers persist consumer group state and expose an API, one can easily deploy a single kafka-lag-exporter to monitor all of a cluster's consumer groups. I don't see such a pleasant deployment option for Event Hubs.
3. The consumer application monitors its consumer groups as in the previous option. The application already knows the consumer group name, checkpoint location, credentials, etc. However, it would mean that each application replica monitors the consumer group, or that an election mechanism is required.
4. The consumer application monitors events flowing through an EventProcessorClient rather than consumer group state. This seems doable by combining DataEvent, InitializationContext and CloseContext. However, metric semantics and alerting might be harder to get right, as the application might stop, some partitions may be unassigned, some partitions might receive or consume no events during a window, etc.
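To make option 2 concrete, here is a minimal sketch of the lag computation it describes, combining a BlobCheckpointStore with an EventHubConsumerClient. This is not part of the SDK; the connection strings, names, and the ConsumerGroupLagProbe class are placeholders, and it assumes the checkpoints record a sequence number.

```java
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubConsumerClient;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.messaging.eventhubs.models.PartitionProperties;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class ConsumerGroupLagProbe {
    public static void main(String[] args) {
        // Placeholder configuration: replace with real values.
        String eventHubConnectionString = "<event-hub-connection-string>";
        String eventHubName = "<event-hub-name>";
        String consumerGroup = "<consumer-group>";

        // The same blob container the consumer application checkpoints into.
        BlobCheckpointStore checkpointStore = new BlobCheckpointStore(
            new BlobContainerClientBuilder()
                .connectionString("<storage-connection-string>")
                .containerName("<checkpoint-container>")
                .buildAsyncClient());

        try (EventHubConsumerClient consumer = new EventHubClientBuilder()
                .connectionString(eventHubConnectionString, eventHubName)
                .consumerGroup(consumerGroup)
                .buildConsumerClient()) {

            checkpointStore
                .listCheckpoints(consumer.getFullyQualifiedNamespace(), eventHubName, consumerGroup)
                .toIterable()
                .forEach(checkpoint -> {
                    Long checkpointed = checkpoint.getSequenceNumber();
                    if (checkpointed == null) {
                        return; // no sequence number checkpointed for this partition yet
                    }
                    // Compare the consumer group's checkpoint with the head of the partition.
                    PartitionProperties partition =
                        consumer.getPartitionProperties(checkpoint.getPartitionId());
                    long lag = partition.getLastEnqueuedSequenceNumber() - checkpointed;
                    System.out.printf("partition %s: lag=%d events, last enqueued at %s%n",
                        checkpoint.getPartitionId(), lag, partition.getLastEnqueuedTime());
                });
        }
    }
}
```

Latency could be estimated in the same spirit by also reading the enqueued time of the first unprocessed event, as the comment suggests.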
Trying to produce good metrics for Event Hubs is a complex issue. Event Hubs itself is a distributed system, and frequently the consumer application is as well, and neither side has a complete picture. Kafka can do a bit better because checkpoints are stored in the service, giving the service side more insight. That may eventually become an option for Event Hubs accessed via AMQP as well, but adding that functionality is not currently on any plan or schedule.

One thing we do have internally is a metric for "backlog" on a receiver, which is the difference between the highest sequence number flowed to the receiver and the highest sequence number in the partition. This doesn't map directly to lag at the application level, because AMQP is a streaming protocol and messages flow down the wire into a buffer within the client while the application is doing other things, and it is common that the messages flowed are well ahead of what the application has actually processed. It is still a useful proxy: if the backlog is increasing, we can infer that the application is processing slowly (or stopped); if the backlog is decreasing, the application is catching up; and if the backlog remains 0, then the application is keeping up with message flow and presumably healthy. However, it is possible to have more than one receiver on the same partition and consumer group, and that's something which is more likely to happen when there are troubles; for example, the application is in a bad state and not processing messages for a partition, but the receiver is still connected while the application creates another receiver to get message processing started again. What then is the backlog for the partition on that consumer group? A human engineer can look at these metrics and interpret them usefully, but trying to devise rules for an automated monitoring system would be difficult.

Our best existing solution to this problem is a feature which adds information about the current state of the partition to each message, which gives the application side the most complete view of the overall situation. This may be what you were talking about with EventContext, although it looks like what I'm thinking of is exposed in the new client as PartitionProperties. It suffers from certain drawbacks, in that the information is already at least a little old when your application sees it, and of course if communications with the service pause or stop then the application can't update the metrics at all. However, as things stand, the consuming application is the only part of the combined system that knows what messages have actually been processed, so it is uniquely positioned to measure the actual lag.
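For reference, a minimal sketch of the "partition state delivered with each event" approach described above, assuming the azure-messaging-eventhubs processor with trackLastEnqueuedEventProperties enabled. Connection strings and names are placeholders, and the lag values are only printed rather than exported as metrics.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.messaging.eventhubs.models.EventContext;
import com.azure.messaging.eventhubs.models.LastEnqueuedEventProperties;
import com.azure.storage.blob.BlobContainerClientBuilder;

import java.time.Duration;

public class LagAwareProcessor {
    public static void main(String[] args) {
        EventProcessorClient processor = new EventProcessorClientBuilder()
            .connectionString("<event-hub-connection-string>", "<event-hub-name>")
            .consumerGroup("<consumer-group>")
            .checkpointStore(new BlobCheckpointStore(
                new BlobContainerClientBuilder()
                    .connectionString("<storage-connection-string>")
                    .containerName("<checkpoint-container>")
                    .buildAsyncClient()))
            // Ask the service to send the partition's latest state with each event.
            .trackLastEnqueuedEventProperties(true)
            .processEvent(LagAwareProcessor::onEvent)
            .processError(context -> System.err.println("Error: " + context.getThrowable()))
            .buildEventProcessorClient();

        processor.start();
        // ... keep the application running; call processor.stop() on shutdown.
    }

    private static void onEvent(EventContext context) {
        LastEnqueuedEventProperties last = context.getLastEnqueuedEventProperties();
        if (last != null && last.getSequenceNumber() != null) {
            // Lag in events and in time between this event and the partition's head,
            // as known at the moment this event was delivered.
            long sequenceLag = last.getSequenceNumber() - context.getEventData().getSequenceNumber();
            Duration timeLag = Duration.between(
                context.getEventData().getEnqueuedTime(), last.getEnqueuedTime());
            System.out.printf("partition %s: lag=%d events, %s behind%n",
                context.getPartitionContext().getPartitionId(), sequenceLag, timeLag);
        }
        context.updateCheckpoint();
    }
}
```

As noted above, these numbers go stale if communications pause, so an external watchdog would still be needed to detect a partition that has stopped receiving entirely.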
Has any progress been made in the past months? Event Hubs consumers are a major pain point of our information system. Consumers sporadically stop processing some partitions until the application is restarted, and this cannot be detected automatically because consumer group state cannot be observed. It seems inconceivable that a service such as Event Hubs does not enable users to easily and robustly monitor whether a consumer group is lagging or not. This is a non-existent issue for any other messaging system that I'm aware of. @JamesBirdsall While I understand your explanation (and with no better alternative we developed some monitoring based on message consumption and EventContext), this cannot be an acceptable answer from the Event Hubs team.
We have several GitHub issues that cover the gaps in our observability experience. Closed in favour of:
Is your feature request related to a problem? Please describe.
I maintain applications consuming data from Event Hubs managed by external legal entities and would like to monitor my consumer groups' health. AFAIK azure-sdk-for-java provides no consumer group or EventProcessorClient related metrics, nor tools that could observe consumer group health (e.g. https://github.com/lightbend/kafka-lag-exporter).
I must compute such metrics by myself, and the EventProcessorClient API is rather burdensome for such a task.
Describe the solution you'd like
I would like an EventProcessorClient to produce metrics enabling consumer group supervision. SDK metrics should be easy to merge with other application metrics and not be tied to Azure Monitor. Kafka's consumer metrics and KIP-489 (Kafka Consumer Record Latency Metric) could serve as a source of inspiration.
The metrics exported by kafka-lag-exporter are also a good reference for useful metrics. At the very least, I want to be able to monitor consumer group consumption rate, max lag and latency, and production rate.
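To illustrate the "not tied to Azure Monitor" requirement, here is a hypothetical sketch (not an existing SDK feature) of exposing per-partition lag through Micrometer, which most metrics backends can scrape. The metric name, tags, and the ConsumerLagMetrics class are made up for the example.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Tracks per-partition lag and exposes it as a gauge that any Micrometer backend
// (Prometheus, StatsD, ...) can scrape, independently of Azure Monitor.
public class ConsumerLagMetrics {
    private final MeterRegistry registry;
    private final Map<String, AtomicLong> lagByPartition = new ConcurrentHashMap<>();

    public ConsumerLagMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Called with the lag computed from EventContext in the processEvent handler. */
    public void recordLag(String consumerGroup, String partitionId, long sequenceLag) {
        lagByPartition
            .computeIfAbsent(partitionId, id -> registry.gauge(
                "eventhubs.consumer.lag",
                Tags.of("consumer_group", consumerGroup, "partition", id),
                new AtomicLong()))
            .set(sequenceLag);
    }

    public static void main(String[] args) {
        ConsumerLagMetrics metrics = new ConsumerLagMetrics(new SimpleMeterRegistry());
        metrics.recordLag("<consumer-group>", "0", 42L); // hypothetical values
    }
}
```

The point of the sketch is the shape of the API, not the backend: an SDK-provided gauge like this could be registered against whatever meter registry the application already uses.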
Describe alternatives you've considered
Additional context
We started a prototype based on EventContext and wrapping an event processor, but that proved to be more difficult than expected: