
(sentry-metrics): Metrics indexer consumer #28431

Merged: 19 commits merged into master from metrics/SNS-397 on Sep 27, 2021
Conversation

@MeredithAnya (Member) commented Sep 7, 2021

Metrics Indexer Consumer:
Messages produced by Relay into the ingest-metrics topic have a metric name along with any number of tag key/value string pairs associated with the metric. The snuba topic snuba-metrics (which can be changed here) expects integers instead of strings.

The real indexer (which will be implemented later) will store the string-to-int mapping in Postgres, but for now this just uses a mock indexer to do the conversion.

In this PR the consumer consumes messages from the ingest-metrics topic, translates the payload to have ints instead of strings, and then produces to the snuba-metrics topic so that snuba can store the data.
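
To make the translation concrete, here is a minimal sketch (not the PR's actual code): the payload field names and the indexer's record(org_id, string) -> int method are assumptions for illustration only.

def translate(message, indexer):
    # Replace the metric name and every tag key/value string with the integer
    # ids assigned by the indexer (field names are hypothetical).
    org_id = message["org_id"]
    translated = dict(message)
    translated["metric_id"] = indexer.record(org_id, translated.pop("name"))
    translated["tags"] = {
        indexer.record(org_id, key): indexer.record(org_id, value)
        for key, value in message["tags"].items()
    }
    return translated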

RedisMockIndexer
The temporary Redis indexer can be enabled by changing the following in conf/server.py:

SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.redis_mock.RedisMockIndexer"

It also uses a bulk_record method to get and set all the strings (metric name, tag keys and values) for a message at once, as sketched below.
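
For illustration only, here is a rough sketch of what a Redis-backed bulk_record could look like, assuming a single Redis hash of string-to-int mappings per organization and a counter for assigning new ids; the actual RedisMockIndexer in this PR may use different keys and signatures.

import redis

class RedisMockIndexerSketch:
    def __init__(self):
        self.client = redis.Redis()

    def bulk_record(self, org_id, strings):
        key = f"indexer:{org_id}"
        # Fetch all existing string -> int mappings in one round trip.
        existing = self.client.hmget(key, strings)
        results, new = {}, {}
        for string, value in zip(strings, existing):
            if value is None:
                # Assign a fresh id for strings we have not seen before.
                value = self.client.incr("indexer:next-id")
                new[string] = value
            results[string] = int(value)
        if new:
            self.client.hset(key, mapping=new)
        return results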

@MeredithAnya MeredithAnya changed the title WIP(metrics): Metrics indexer consumer WIP(sentry-metrics): Metrics indexer consumer Sep 7, 2021
@MeredithAnya (Member Author)

@fpacifici @jjbayer (cc @jan-auer): some questions/concerns/thoughts I have:

  • I'm not sure if we want to use the BatchKafkaConsumer, or if we should write our own consumer, like what is done for the QuerySubscriptionConsumer.
  • I don't know exactly how best to make the dummy indexer work for what the product needs to build on top of this, prior to adding the actual indexer.
  • Is the UseCase actually needed for the indexer? Or is that just information that needs to be passed along later to the metrics product data model?

@jjbayer (Member) commented Sep 8, 2021

  • I don't know exactly how best to make the dummy indexer work for what the product needs to build on top of this, prior to adding the actual indexer.

For release health we cannot really mock tag values, because the release tag may have any value. Maybe we can use a redis key-value lookup as the simplest possible indexer implementation?

  • Is the UseCase actually needed for the indexer? Or is that just information that needs to be passed along later to the metrics product data model?

I think it's not necessary from a functional perspective -- but maybe for partitioning?

@MeredithAnya (Member Author)

  • I don't know exactly how best to make the dummy indexer work for what the product needs to build on top of this, prior to adding the actual indexer.

For release health we cannot really mock tag values, because the release tag may have any value. Maybe we can use a redis key-value lookup as the simplest possible indexer implementation?

  • Is the UseCase actually needed for the indexer? Or is that just information that needs to be passed along later to the metrics product data model?

I think it's not necessary from a functional perspective -- but maybe for partitioning?

Updates:

  • Added Redis per @jjbayer's suggestion, assuming it's an acceptable implementation to get things unblocked
  • Changed the interface to have only record and reverse_resolve, where record both looks up the value and records it if it's not already there (sketched below)
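
Roughly, the resulting interface looks like the following sketch; the parameter names and types are assumptions, not the PR's exact signatures.

from typing import Optional

class StringIndexerSketch:
    def record(self, org_id: int, string: str) -> int:
        # Look up the integer id for a string, creating it if it does not exist.
        raise NotImplementedError

    def reverse_resolve(self, org_id: int, id: int) -> Optional[str]:
        # Return the original string for an id, if one has been recorded.
        raise NotImplementedError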

@fpacifici (Contributor) left a comment


Thanks, this is a step in the right direction.
Will review again tomorrow with some more details on how to produce in an efficient manner.
I am not sure about the goal of that pubsub class that does not allow us to set a callback.

@click.option("--topic", default="ingest-metrics", help="Topic to get subscription updates from.")
@batching_kafka_options("metrics-consumer")
@configuration
def metrics_consumer(**options):
Contributor

Is this going to start the consumer by default? I don't think we need it yet by default.

@MeredithAnya (Member Author) Sep 17, 2021

I've had to manually run sentry run metrics-consumer after starting the sentry devserver, so I don't think this starts by default.

@lynnagara (Member) Sep 21, 2021

How come we are adding a separate command rather than using the single ingest-consumer command, which runs the rest of the ingest consumers? We could still temporarily omit metrics from --all-consumer-types and only run it if explicitly called with the metrics consumer type, if that was the concern. Curious if there is another reason.

@MeredithAnya (Member Author)

@lynnagara (cc @fpacifici) the ingest-consumer command ends up using the IngestConsumerWorker, whereas we want to use the MetricsIndexerWorker. I felt it was easier to keep these separate for now than to refactor the ingest-consumer command. It seems like this could easily be changed down the line if we wanted, but I'm open to changing it now if people feel strongly.

Comment on lines 49 to 53
snuba_metrics_publisher = KafkaPublisher(
kafka_config.get_kafka_producer_cluster_options(cluster_name),
asynchronous=False,
)
snuba_metrics_publisher.publish(snuba_metrics["topic"], json.dumps(message))
@fpacifici (Contributor) Sep 10, 2021

This would flush for every message, which is not ideal considering you have batches of messages and you can flush only once per batch, before committing on the ingest topic.
We should also set the delivery callback for each message.

I would have a look at this approach to see how to use the callback
https://github.com/getsentry/cdc/blob/8643ee7a5bf491755c46169c6841131521d34b6c/cdc/producer.py#L41-L161
You probably do not need all that complexity.
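
For reference, a minimal sketch of that pattern with confluent-kafka (assuming a confluent_kafka.Producer; this is not the PR's code, and the topic name is taken from the description above):

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
delivery_errors = []

def delivery_callback(error, message):
    # Invoked from flush()/poll() once the broker acknowledges (or rejects) the message.
    if error is not None:
        delivery_errors.append(error)

def flush_batch(batch):
    # Produce asynchronously: messages are buffered and sent in the background.
    for payload in batch:
        producer.produce("snuba-metrics", payload, on_delivery=delivery_callback)
    # Flush once per batch; only commit the ingest offsets if every delivery succeeded.
    producer.flush()
    if delivery_errors:
        raise RuntimeError(f"{len(delivery_errors)} messages failed delivery")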

Contributor

Some follow up on what a consumer should do:

When writing a consumer like this that needs to achieve a high throughput, there are a few elements to take into account and a few requirements:

  • We cannot auto commit, because we would be committing before processing the message. If there is an error in processing, the message is lost. This is taken into account, as we do not auto commit.
  • We don't need to commit after every offset. This means less load on the broker and network.
  • We must not lose messages. So if an error happens at any point during processing, we cannot commit the entire batch but only the portion that we sent to Kafka.
  • Producing is asynchronous, so we need to wait for the callback before being sure that the message is persisted in a Kafka topic: https://docs.confluent.io/clients-confluent-kafka-python/current/overview.html#asynchronous-writes
  • We should really avoid duplicates, as they would not be deduplicated in ClickHouse since metrics are stored in a pre-aggregated way. So if there is an error during processing that causes the consumer to crash, we should have committed up to the last acknowledged message (acknowledged meaning that we did receive the callback from the producer).
  • We should keep producing asynchronously and not flush every message individually, as that would have a real impact on throughput.
  • Exactly-once semantics are technically not achievable, as nobody can deduplicate messages. But we can make duplication extremely rare (basically only in case the Kafka commit fails multiple times, or the consumer crashes from running out of memory after flushing but before being able to commit).

So there are a few ways to do that:

  • Simple batching consumer. Do the processing phase, then have the batch flush send all the messages; at the end, flush and wait for the callbacks. Only commit to Kafka the offset of the last callback received (sketched below).
  • Something like the cdc link above. Keep producing messages as soon as they are processed and periodically commit the last offset we got the callback for.
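
A minimal sketch of the first option, assuming confluent-kafka and a hypothetical translate() helper for the indexing step; offsets are tracked via the delivery callbacks so that we only ever commit up to the last acknowledged message:

from confluent_kafka import Consumer, Producer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "metrics-indexer",
    "enable.auto.commit": False,  # commit manually, only after delivery is confirmed
})
consumer.subscribe(["ingest-metrics"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

acked_offsets = {}  # (topic, partition) -> highest offset whose delivery was confirmed

def process_batch(batch):
    for message in batch:
        payload = translate(message.value())  # hypothetical indexing/translation step
        def on_delivery(error, _produced, source=message):
            if error is None:
                key = (source.topic(), source.partition())
                acked_offsets[key] = max(acked_offsets.get(key, -1), source.offset())
        producer.produce("snuba-metrics", payload, on_delivery=on_delivery)
    producer.flush()  # wait for every delivery callback of this batch
    # Commit only up to the last acknowledged offset of each partition.
    offsets = [
        TopicPartition(topic, partition, offset + 1)
        for (topic, partition), offset in acked_offsets.items()
    ]
    if offsets:
        consumer.commit(offsets=offsets, asynchronous=False)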

@jjbayer (Member) left a comment

This looks great! I did a manual end-to-end test locally and everything works as expected.


def resolve(self, organization: Organization, use_case: UseCase, string: str) -> Optional[int]:
Member

Contrary to what I said previously, I think it would make sense to keep the resolve method. It's already in use here, and if the indexer entries ever get a TTL, it would not make sense to prolong the retention every time the indexer is queried from the product side.

@nikhars (Member) commented Sep 24, 2021

LGTM. Nice job.

@fpacifici (Contributor) left a comment

Good first step. Thanks

on_delivery=self.callback,
)

messages_left = self.__producer.flush(5.0)
Contributor

Please, as a follow-up PR: we will want a metric here to measure how long we wait.
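
Something along these lines could work as that follow-up (a sketch assuming Sentry's sentry.utils.metrics timing helper is available in this module; the metric name is hypothetical), replacing the flush line quoted above:

import time

from sentry.utils import metrics

start = time.monotonic()
messages_left = self.__producer.flush(5.0)
metrics.timing("metrics_consumer.producer_flush_duration", time.monotonic() - start)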

@fpacifici (Contributor)

Please figure out what is wrong with CI before merging

@MeredithAnya MeredithAnya merged commit d393562 into master Sep 27, 2021
@MeredithAnya MeredithAnya deleted the metrics/SNS-397 branch September 27, 2021 16:33
jjbayer added a commit that referenced this pull request Sep 28, 2021
vuluongj20 pushed a commit that referenced this pull request Sep 30, 2021
* WIP(metrics): Metrics indexer consumer

* use redis for mock indexer

* add bulk_record and async producing

* make redis mock indexer separate file

* fix type errors

* add comment

* remove UseCase and updates tests

* update more tests

* clean up part I

* mini cleanup

* add basic tests

* all org_ids are ints

* missed one

* more clean up

* consumer test

* rename test file

* lil updates

* attempt to fix tests

* try dis tho
vuluongj20 pushed a commit that referenced this pull request Sep 30, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Oct 13, 2021