Fixes: slow entity purging - added couple of indices #1417

Open
wants to merge 1 commit into
base: master

@sadiqkhoja sadiqkhoja commented Feb 22, 2025

Issue: Purging entities takes too long when the entities table has many rows. Our staging server has 1.6M+ entities, and the purge query took up to 611 minutes to execute.

Solution: Create an index on the "deletedAt" column in the entities table, and an index on the "sourceId" column in the entity_defs table.
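For illustration, the two indices could be created with statements like the ones below. This is only a sketch: the table and column names come from this description, but the index names are assumptions, and the migration in this PR may define them differently.

```sql
-- Index names here are hypothetical; the table and column names are from the PR description.

-- Lets the purge find soft-deleted entities without scanning the whole table.
CREATE INDEX entities_deletedat_index ON entities ("deletedAt");

-- Lets the entity_defs_sourceid_foreign check find referencing rows quickly
-- when a row is deleted from entity_def_sources.
CREATE INDEX entity_defs_sourceid_index ON entity_defs ("sourceId");
```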

Diagnosis: I tried running parts of the CTE query individually, but even the simplest query, DELETE FROM entities WHERE "deletedAt" IS NOT NULL, was taking more than 5 minutes (I cancelled the execution).

Since even the simple query was taking so long, I hypothesized that an index on the "deletedAt" column would help. Creating that index was itself too slow on the existing instance, so we increased the database instance size from t3.small to t3.xlarge, created the index, and then downsized the database back to t3.small.

After the index was created, the simple query was fast enough, but the full CTE query was still taking too long (I cancelled the execution after 5 minutes). I ran the parts of the CTE query individually and they were all quick, so I concluded that we needed to break the CTE into parts. That conclusion was wrong. While executing the individual parts, I was getting an entity_defs_sourceid_foreign key constraint violation when deleting data from entity_def_sources:

DELETE FROM entity_def_sources
      USING entity_defs, entities, datasets
      WHERE entity_def_sources.id = entity_defs."sourceId"
      AND entity_defs."entityId" = entities.id
      AND entities."datasetId" = datasets.id
      AND (entity_def_sources.type = 'submission' OR (entity_def_sources.type = 'api' AND (entity_def_sources.details IS NULL OR entity_def_sources.details = 'null'))) -- don't delete bulk source
      AND "deletedAt" IS NOT NULL

So I purged the deleted rows from the entity_defs table and ran the above query again. It was quick, but only because it deleted nothing; I overlooked that fact and came to the wrong conclusion.

Further experimentation showed that a simple query like DELETE FROM entity_def_sources WHERE id = (113) was taking ~10 seconds. Deleting by primary key shouldn't take that long, so I suspected triggers or foreign key validation. There are no triggers on the table, but the entity_defs table, which had 1.6M+ rows, references it through the entity_defs_sourceid_foreign constraint. PostgreSQL does not automatically index the referencing column of a foreign key, so without an index each delete had to search entity_defs for referencing rows. I created an index on the "sourceId" column in the entity_defs table, re-ran the simple delete from entity_def_sources, and it was quick.
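One way to confirm that the time goes into foreign key validation rather than the delete itself is to run the statement under EXPLAIN ANALYZE inside a transaction that is rolled back. The sketch below reuses the id from the example above. I did not capture this output during the investigation, so treat the described output as what PostgreSQL generally reports, not as a recorded result.

```sql
BEGIN;

-- EXPLAIN ANALYZE actually executes the DELETE, hence the surrounding transaction.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM entity_def_sources WHERE id = 113;

-- In the output, PostgreSQL reports time spent in constraint triggers on a
-- separate line of the form:
--   Trigger for constraint entity_defs_sourceid_foreign: time=... calls=...
-- A large time there, next to a cheap index scan on the primary key, points
-- at the missing index on entity_defs."sourceId".

ROLLBACK;
```

If the row is still referenced from entity_defs, the statement is expected to fail with the constraint violation instead of printing a plan, so this is best run against a source row whose defs have already been removed.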

After the second index was created, the complete CTE query was quick too.

What has been done to verify that this works as intended?

See above. Additionally, CPU consumption in the staging environment no longer spikes around 4 AM UTC.
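As an extra check after the migration runs, the indices present on the two tables can be listed from the pg_indexes system view. This is a generic PostgreSQL query, not part of this PR.

```sql
-- Lists every index on the two tables, including the new ones once the
-- migration has been applied.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('entities', 'entity_defs')
ORDER BY tablename, indexname;
```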

Why is this the best possible solution? Were any other approaches considered?

Creating the two indices solves the problem. Breaking up the CTE query might be slower and would bring its own complexities, such as first locking the deleted entity rows so that a concurrent undelete can't happen. We can certainly come back to this if the problem resurfaces.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Faster purge command and cron job.

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

None.

Before submitting this PR, please make sure you have:

  • run make test and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources is properly credited in comments or that everything is internally sourced

Additional notes:

We should probably create an index on "deletedAt" in the Submission table as well, either in this PR or separately.

@sadiqkhoja sadiqkhoja marked this pull request as ready for review February 22, 2025 13:31