refactor: replace windowed_query with yield_per #17361

Merged — miketheman merged 2 commits into pypi:main from miketheman:miketheman/search-query-refactor on Jan 6, 2025
Conversation
Using a SQLAlchemy built-in for a server-side cursor.

Instead of performing a single query at the top to create the windowed columns, then issuing subsequent queries for windows of data, use `yield_per()` to stream the results in generator fashion using a server-side cursor. In dev, this shaves ~40% of the time off `reindex()`.

Refs: https://docs.sqlalchemy.org/en/20/orm/queryguide/api.html#fetching-large-result-sets-with-yield-per

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
A demonstration of the SQL queries emitted during a full reindex in development:

**Before** — three queries:

```sql
-- first query to create the windowed "rownum" column for the 25k splits
SELECT anon_1.name
FROM (SELECT release_descriptions.raw AS description, releases.author AS author, releases.author_email AS author_email, releases.maintainer AS maintainer, releases.maintainer_email AS maintainer_email, releases.home_page AS home_page, releases.summary AS summary, releases.keywords AS keywords, releases.platform AS platform, releases.download_url AS download_url, releases.created AS created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name AS normalized_name, projects.name AS name, row_number() OVER (ORDER BY projects.name) AS rownum
FROM (SELECT DISTINCT ON (releases.project_id) releases.id AS id
FROM releases
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) ORDER BY releases.project_id, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC) AS anon_2 JOIN releases ON releases.id = anon_2.id JOIN release_descriptions ON release_descriptions.id = releases.description_id LEFT OUTER JOIN projects ON projects.id = releases.project_id) AS anon_1
WHERE anon_1.rownum %% %(rownum_1)s::INTEGER = %(param_1)s::INTEGER

-- second query to fetch the actual data, uses the windowed `name` column to find the rows to return
SELECT release_descriptions.raw AS description, releases.author, releases.author_email, releases.maintainer, releases.maintainer_email, releases.home_page, releases.summary, releases.keywords, releases.platform, releases.download_url, releases.created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name, projects.name
FROM (SELECT DISTINCT ON (releases.project_id) releases.id AS id
FROM releases
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) ORDER BY releases.project_id, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC) AS anon_1 JOIN releases ON releases.id = anon_1.id JOIN release_descriptions ON release_descriptions.id = releases.description_id LEFT OUTER JOIN projects ON projects.id = releases.project_id
WHERE projects.name >= %(name_1)s::VARCHAR AND projects.name < %(name_2)s::VARCHAR ORDER BY projects.name

-- third query to get any remaining rows that weren't returned in the first query (dev has 29k rows)
SELECT release_descriptions.raw AS description, releases.author, releases.author_email, releases.maintainer, releases.maintainer_email, releases.home_page, releases.summary, releases.keywords, releases.platform, releases.download_url, releases.created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name, projects.name
FROM (SELECT DISTINCT ON (releases.project_id) releases.id AS id
FROM releases
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) ORDER BY releases.project_id, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC) AS anon_1 JOIN releases ON releases.id = anon_1.id JOIN release_descriptions ON release_descriptions.id = releases.description_id LEFT OUTER JOIN projects ON projects.id = releases.project_id
WHERE projects.name >= %(name_1)s::VARCHAR ORDER BY projects.name
```

**After** — one query emitted, and results streamed:

```sql
SELECT DISTINCT ON (projects.name) release_descriptions.raw AS description, releases.author, releases.author_email, releases.maintainer, releases.maintainer_email, releases.home_page, releases.summary, releases.keywords, releases.platform, releases.download_url, releases.created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name, projects.name
FROM releases JOIN release_descriptions ON release_descriptions.id = releases.description_id JOIN projects ON projects.id = releases.project_id
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) AND TRUE ORDER BY projects.name, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC
```
ewdurbin approved these changes on Jan 6, 2025:
slick!
Using a SQLAlchemy built-in for a server-side cursor.

Instead of performing a single query at the top to create the windowed columns, then issuing subsequent queries for windows of data, use `yield_per()` to stream the results in generator fashion using a server-side cursor. In dev, shaves ~40% of time off `reindex()` and `reindex_project()` calls.

Refs: https://docs.sqlalchemy.org/en/20/orm/queryguide/api.html#fetching-large-result-sets-with-yield-per

Signed-off-by: Mike Fiedler <miketheman@gmail.com>