Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: replace windowed_query with yield_per #17361

Merged
merged 2 commits into from
Jan 6, 2025

Conversation

miketheman
Copy link
Member

Using a SQLAlchemy built-in for server-side cursor.

Instead of performing a single query at the top to create the windowed
columns, then issuing subsequent queries for windows of data, use
yield_per() to stream the results in a generator fashion using a
server-side cursor.

In dev, shaves ~40% of time off reindex() and reindex_project() calls.

Refs: https://docs.sqlalchemy.org/en/20/orm/queryguide/api.html#fetching-large-result-sets-with-yield-per

Signed-off-by: Mike Fiedler miketheman@gmail.com

Using a SQLAlchemy built-in for server-side cursor.

Instead of performing a single query at the top to create the windowed
columns, then issuing subsequent queries for windows of data, use
`yield_per()` to stream the results in a generator fashion using a
server-side cursor.

In dev, shaves ~40% of time off `reindex()`.

Refs: https://docs.sqlalchemy.org/en/20/orm/queryguide/api.html#fetching-large-result-sets-with-yield-per

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
@miketheman miketheman requested a review from a team as a code owner January 6, 2025 16:18
@miketheman miketheman added the search Elasticsearch, search filters, and so on label Jan 6, 2025
@miketheman
Copy link
Member Author

A demonstration of the SQL queries emitted during a full reindex in development:

Before

-- first query to create the windowed "rownum" column for the 25k splits
SELECT anon_1.name
FROM (SELECT release_descriptions.raw AS description, releases.author AS author, releases.author_email AS author_email, releases.maintainer AS maintainer, releases.maintainer_email AS maintainer_email, releases.home_page AS home_page, releases.summary AS summary, releases.keywords AS keywords, releases.platform AS platform, releases.download_url AS download_url, releases.created AS created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name AS normalized_name, projects.name AS name, row_number() OVER (ORDER BY projects.name) AS rownum
FROM (SELECT DISTINCT ON (releases.project_id) releases.id AS id
FROM releases
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) ORDER BY releases.project_id, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC) AS anon_2 JOIN releases ON releases.id = anon_2.id JOIN release_descriptions ON release_descriptions.id = releases.description_id LEFT OUTER JOIN projects ON projects.id = releases.project_id) AS anon_1
WHERE anon_1.rownum %% %(rownum_1)s::INTEGER = %(param_1)s::INTEGER

-- second query to fetch the actual data, uses the windowed `name` column to find the rows to return 
SELECT release_descriptions.raw AS description, releases.author, releases.author_email, releases.maintainer, releases.maintainer_email, releases.home_page, releases.summary, releases.keywords, releases.platform, releases.download_url, releases.created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name, projects.name
FROM (SELECT DISTINCT ON (releases.project_id) releases.id AS id
FROM releases
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) ORDER BY releases.project_id, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC) AS anon_1 JOIN releases ON releases.id = anon_1.id JOIN release_descriptions ON release_descriptions.id = releases.description_id LEFT OUTER JOIN projects ON projects.id = releases.project_id
WHERE projects.name >= %(name_1)s::VARCHAR AND projects.name < %(name_2)s::VARCHAR ORDER BY projects.name

-- third query to get any remaining rows that weren't returned in the first query (dev has 29k rows)
SELECT release_descriptions.raw AS description, releases.author, releases.author_email, releases.maintainer, releases.maintainer_email, releases.home_page, releases.summary, releases.keywords, releases.platform, releases.download_url, releases.created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name, projects.name
FROM (SELECT DISTINCT ON (releases.project_id) releases.id AS id
FROM releases
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) ORDER BY releases.project_id, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC) AS anon_1 JOIN releases ON releases.id = anon_1.id JOIN release_descriptions ON release_descriptions.id = releases.description_id LEFT OUTER JOIN projects ON projects.id = releases.project_id
WHERE projects.name >= %(name_1)s::VARCHAR ORDER BY projects.name

After

One query emitted, and results streamed

SELECT DISTINCT ON (projects.name) release_descriptions.raw AS description, releases.author, releases.author_email, releases.maintainer, releases.maintainer_email, releases.home_page, releases.summary, releases.keywords, releases.platform, releases.download_url, releases.created, (SELECT array_agg(trove_classifiers.classifier) AS array_agg_1
FROM release_classifiers JOIN trove_classifiers ON trove_classifiers.id = release_classifiers.trove_id
WHERE releases.id = release_classifiers.release_id) AS classifiers, projects.normalized_name, projects.name
FROM releases JOIN release_descriptions ON release_descriptions.id = releases.description_id JOIN projects ON projects.id = releases.project_id
WHERE releases.yanked IS false AND (EXISTS (SELECT 1
FROM release_files
WHERE releases.id = release_files.release_id)) AND TRUE ORDER BY projects.name, releases.is_prerelease NULLS LAST, releases._pypi_ordering DESC

Copy link
Member

@ewdurbin ewdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slick!

@miketheman miketheman merged commit 35f9cac into pypi:main Jan 6, 2025
20 checks passed
@miketheman miketheman deleted the miketheman/search-query-refactor branch January 6, 2025 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search Elasticsearch, search filters, and so on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants