fix(reprocessing): Batch handle_remaining_events, fix more race conditions [INGEST-452] #29033
Conversation
```diff
@@ -2313,10 +2313,30 @@ def build_cdc_postgres_init_db_volume(settings):
 SENTRY_USE_UWSGI = True

+# When copying attachments for to-be-reprocessed events into processing store,
+# how large is an individual file chunk? Each chunk is stored as Redis key.
```
Thank you for adding comments on these constants.
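For context, a minimal sketch of what "each chunk is stored as Redis key" could look like; the chunk size value, function name, and key layout below are assumptions for illustration, not taken from the diff:

```python
# Hypothetical chunk size; the real default is set next to the comment above.
SENTRY_REPROCESSING_ATTACHMENT_CHUNK_SIZE = 2 ** 20  # 1 MiB per Redis key

def copy_attachment_chunks(client, cache_key, data):
    # One Redis key per chunk of the attachment, as the settings comment says.
    size = SENTRY_REPROCESSING_ATTACHMENT_CHUNK_SIZE
    for index, offset in enumerate(range(0, len(data), size)):
        client.set(f"attachment:{cache_key}:{index}", data[offset : offset + size])
```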
src/sentry/reprocessing2.py (Outdated)

```python
# event IDs in Redis. We need this because Snuba cannot handle many tiny
# messages and prefers big ones instead.
#
# Best performance is achieved when timestamps are close together. Luckily we
# happen to iterate through events ordered by timestamp.
```
The way this comment is written suggests that we should enforce this, or write this as a precondition.
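One way to act on this: promote the observation into an explicit precondition at the top of the batching helper. A sketch, with a hypothetical helper name:

```python
def push_remaining_events(client, key, datetime_to_event):
    # Precondition instead of a lucky accident: callers must iterate events
    # ordered by timestamp so the batches sent to Snuba stay well-clustered.
    timestamps = [ts for ts, _event_id in datetime_to_event]
    assert timestamps == sorted(timestamps), "events must be ordered by timestamp"
```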
src/sentry/reprocessing2.py (Outdated)

```python
project_id=project_id,
old_group_id=old_group_id,
new_group_id=new_group_id,
event_ids=event_ids_batch,
```
This sends about 20kB into the task queue. Are we comfortable with such large argument lists?
I thought we drew the line at event-payload-sized payloads, but I think it may be easy enough to refactor this code to simply keep things in Redis
> but I think it may be easy enough to refactor this code to simply keep things in Redis

It would be a nice separation if one task pushes to redis and the other task reads from it (but that's a nit).
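What the suggested separation might look like at the call site; `remaining_events_key` is a hypothetical parameter name, the rest of the signature comes from the snippet above:

```python
# Before: ~20 kB of event IDs travel through the Celery broker.
handle_remaining_events.delay(
    project_id=project_id,
    old_group_id=old_group_id,
    new_group_id=new_group_id,
    event_ids=event_ids_batch,
)

# After: one task pushes the batch into Redis and the other drains it; only
# a short key name goes through the broker.
handle_remaining_events.delay(
    project_id=project_id,
    old_group_id=old_group_id,
    new_group_id=new_group_id,
    remaining_events_key=new_key,  # hypothetical parameter
)
```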
src/sentry/reprocessing2.py (Outdated)

```python
# TODO: Redis 6 introduces LPOP with <count> argument, use here?
while True:
    row = client.lpop(key)
```
Since you intend to empty the key, can you check if you can atomically take the entire key value here instead of popping them individually?
There actually seems to be no way to do that; Redis doesn't have a "pop key" command. However, I moved it into the other Celery task now.
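For the record, a close approximation does exist even without Redis 6: wrapping `LRANGE` and `DEL` in a `MULTI`/`EXEC` transaction reads and clears the list atomically. A sketch with redis-py (not what the PR ended up doing; it renames the key instead, see further down):

```python
def take_whole_list(client, key):
    # LRANGE + DEL inside a single transaction: no other client can touch
    # the list between the read and the delete.
    with client.pipeline(transaction=True) as pipe:
        pipe.lrange(key, 0, -1)
        pipe.delete(key)
        rows, _deleted = pipe.execute()
    return rows
```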
src/sentry/reprocessing2.py (Outdated)

```python
key = f"re2:remaining:{project_id}:{old_group_id}"

if datetime_to_event:
    client.lpush(
```
It seems that `LPUSH` returns the list size, so we could omit the call to `LLEN` further down, but only if `LPUSH` is actually called. Not sure if it's worth it.
Yup, added.
src/sentry/reprocessing2.py (Outdated)

```python
# Our internal sync counters are counting over *all* events, but the
# progressbar in the frontend goes until max_events. Advance progressbar
# proportionally.
pending = int(int(pending) * info["totalEvents"] / float(info.get("syncCount", 1)))
```
This might be out of scope for this PR, but should we maybe just update the UI to reflect the number of events that are actually being touched?
Is there any other API client that relies on this information being a real event count, instead of a scaled-down version?
I think there are too many places whose wording we'd have to redesign. Besides the progressbar, the same counter is used in the activity feed of an issue.
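A worked example of the scaling line above, with made-up numbers; the field semantics are inferred from the code comment (`syncCount` is the internal counter's range over all events, `totalEvents` the UI-facing bound):

```python
info = {"totalEvents": 3, "syncCount": 1_000_000}  # "reprocess 3 of 1,000,000"
pending = 500_000  # internal counter: half of all events still outstanding

pending = int(int(pending) * info["totalEvents"] / float(info.get("syncCount", 1)))
assert pending == 1  # UI sees "1 of 3 pending" instead of a bar stuck at 3/3
```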
All concerns addressed, good to go from my end once E2E tests are passing.
Please attend the deploy and watch for new errors.
```python
try:
    # Rename `key` to a new temp key that is passed to celery task. We
    # use `renamenx` instead of `rename` only to detect UUID collisions.
    assert client.renamenx(key, new_key), "UUID collision for new_key?"
```
Optional: can we catch this and report it to Sentry?
I think we should halt reprocessing... if we simply continue with the next page, `key` might become too large.
We could potentially catch the assertion, delete `key` from Redis, and sort of chug on with reprocessing.

But I think the best way to recover from random errors in reprocessing could be: abort reprocessing, merge the old and new issues together via a regular issue merge, tell the user "sorry", and hope no data was lost.
Only minor comments, otherwise this looks good to me!
```diff
 if datetime_to_event:
-    client.lpush(
+    llen = client.lpush(
         key,
         *(f"{to_timestamp(datetime)};{event_id}" for datetime, event_id in datetime_to_event),
     )
     client.expire(key, settings.SENTRY_REPROCESSING_SYNC_TTL)
```
nit: This might simplify the condition further down:

```python
else:
    llen = client.llen(key)
```
fixed!
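Putting the diff and the suggestion together, the resulting code presumably looks roughly like this:

```python
if datetime_to_event:
    # LPUSH returns the new list length, so no extra LLEN roundtrip is
    # needed on this branch.
    llen = client.lpush(
        key,
        *(f"{to_timestamp(datetime)};{event_id}" for datetime, event_id in datetime_to_event),
    )
    client.expire(key, settings.SENTRY_REPROCESSING_SYNC_TTL)
else:
    llen = client.llen(key)
```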
```python
# `key` does not exist in Redis. `ResponseError` is a bit too broad
# but it seems we'd have to do string matching on error message
# otherwise.
return
```
Should we log anything here?
This condition happens "too often", i.e. whenever reprocessing is finished and we execute force-flushing.
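Reconstructing the surrounding control flow from the fragments in this thread, the early return presumably sits in a handler along these lines (`redis.exceptions.ResponseError` is what redis-py raises for "no such key" on a rename):

```python
from redis.exceptions import ResponseError

try:
    assert client.renamenx(key, new_key), "UUID collision for new_key?"
except ResponseError:
    # `key` does not exist in Redis: nothing buffered, nothing to flush.
    # Happens routinely when force-flushing after reprocessing finished.
    return
```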
```python
event_ids_batch.append(event_id)

client.delete(key)
```
`client.lpop()` was atomic, right? Is there any danger of two task instances running at the same time and working on the same `lrange`?
Since I now rename the key to something unique, I don't have concurrent access on it anymore.
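The pattern described here, as a sketch; the suffix format and function name are illustrative, while the base key format appears earlier in this review:

```python
import uuid

def claim_remaining_events(client, project_id, old_group_id):
    key = f"re2:remaining:{project_id}:{old_group_id}"
    # Rename the shared list onto a per-flush key. From here on, only this
    # call chain knows the key name, so draining it needs no locking.
    new_key = f"{key}:{uuid.uuid4().hex}"
    if not client.renamenx(key, new_key):
        raise RuntimeError("UUID collision for new_key?")
    return new_key
```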
```python
# How many event IDs to buffer up in Redis before sending them to Snuba. This
# is about "remaining events" exclusively.
SENTRY_REPROCESSING_REMAINING_EVENTS_BUF_SIZE = 500
```
This will be fine. When we decide to raise this number, please keep Search and Storage involved to ensure we do not go beyond the maximum ClickHouse query size.
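For scale, a back-of-the-envelope estimate: with entries shaped like `timestamp;event_id` (a 32-character hex event ID plus a float timestamp), a full buffer is roughly 500 × ~50 B ≈ 25 kB per Snuba message, the same order as the ~20 kB argument size measured earlier in this review, and far below ClickHouse's default `max_query_size` of 256 KiB.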
Seems good to me
We want to reprocess 10 events at a time, but that causes problems in Snuba. ClickHouse is not very good at processing many tiny INSERTs, and unfortunately every eventstream message (in our case `replace_groups`) maps to exactly one of those queries.

Implement simple Redis-based batching for `handle_remaining_events` so that we don't take ClickHouse down with our many tiny event ID requests. With this PR we send 500 events at a time to Snuba.

This unearthed a couple of other race conditions with regard to how we finish reprocessing. See the removed comment: "finish_reprocessing may execute sooner than the last reprocess_group", but it can also happen that `handle_remaining_events` runs after `exclude_groups`. This is not something I would expect to happen, but because Snuba replacements completely ignore group exclusions, it all just happens to work out in the end.

This PR changes our synchronization countdown to contain the full event count, and calls `mark_event_reprocessed` for every single event regardless of `max_events`. This makes our entire synchronization model simpler. Basically we now:

- set the sync counter to `<number of events in snuba>`
- spawn `handle_remaining_events` or `preprocess_event` for every event
- call `mark_event_reprocessed` unconditionally
- run `finish_reprocessing` once the counter hits zero

Before:

- set the sync counter to `min(max_events, <number of events in snuba>)`
- spawn `preprocess_event`, which will decrement the count, OR spawn `handle_remaining_events`, which won't (but sometimes it does, e.g. when an event is handled as "remaining" because of missing attachments)
- `finish_reprocessing` is scheduled before or after `exclude_groups`, but it doesn't matter because of Snuba internals

Unfortunately that Redis key is used for two things:

- internal synchronization, i.e. deciding when reprocessing is done
- the progress counter shown in the UI

Meaning that when we change this, it has UI impact. In a job "reprocess m out of n events", the UI should show a progressbar going from 0 to m. With our change it would show a progressbar going from 0 to n. To counteract that, we still report the bounds of the progressbar to the UI as before, and additionally downscale our sync counter to be within the bounds `0..m` again whenever the UI needs it.

The only UI-visible impact this has is that the progressbar is advanced more evenly: when reprocessing 3 out of 1000000 events, it used to be that the progressbar immediately went to 3/3, then just sort of hung while Sentry was migrating/deleting the remaining events.
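A minimal sketch of the simplified synchronization model, assuming a plain Redis counter; the key name and helper signatures are illustrative, and `finish_reprocessing` is the task named above — only the overall scheme comes from this description:

```python
def start_reprocessing(client, group_id, events_in_snuba):
    # New model: the countdown covers *all* events, not min(max_events, n).
    client.set(f"re2:count:{group_id}", events_in_snuba)

def mark_event_reprocessed(client, group_id):
    # Called unconditionally for every event, whether it went through
    # preprocess_event or handle_remaining_events.
    pending = client.decr(f"re2:count:{group_id}")
    if pending == 0:
        finish_reprocessing(group_id)  # fires exactly once, after the last event
```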
Follow-up items