Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cloud-function): redirect non-canonical URLs #11151

Merged
merged 22 commits into from
Jun 21, 2024

Conversation

caugner
Copy link
Contributor

@caugner caugner commented May 17, 2024

Summary

(MP-1111)

Problem

We currently serve the same content independent of capitalization:

Solution

Fetch the metadata.json and match the URL (excluding search parameters) against the mdn_url in it.

Build a canonical map to determine the canonical URL for a requested URL, and redirect if the requested URL differs.


How did you test this change?

  1. Ran npm ci && npm start in /cloud-function.
  2. Verified that http://localhost:5100/en-US/docs/web?foo=bar redirects to http://localhost:5100/en-US/docs/Web?foo=bar.

@caugner caugner requested a review from a team as a code owner May 17, 2024 23:07
@github-actions github-actions bot added redirects all things related to redirecting cloud-function labels May 17, 2024
Copy link
Contributor

@fiji-flo fiji-flo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a way that doesn't add another fetch please?

@caugner caugner marked this pull request as draft June 7, 2024 21:34
caugner added 3 commits June 8, 2024 00:21
1. Build `sitemap.txt`, with all pages from all sitemaps.
2. Build `sitemap.json` for the Cloud Function, similar to the `redirects.json`.
3. Use middleware to redirect all URLs that match after normalization.
@caugner caugner marked this pull request as ready for review June 7, 2024 22:28
@caugner caugner requested a review from fiji-flo June 7, 2024 22:28
@caugner
Copy link
Contributor Author

caugner commented Jun 7, 2024

Can we use a way that doesn't add another fetch please?

Here's my second approach, utilizing a global sitemap.

@caugner caugner changed the title feat(cloud-function): redirect capitalization feat(cloud-function): redirect non-canonical URLs Jun 7, 2024
@@ -0,0 +1,39 @@
import { readFile, stat, writeFile } from "node:fs/promises";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, why not create the json instead of the .txt? From what I see, this does not need to be in cloud function and it's just on line in the build/sitemaps.ts.

I mainly want be aware of what's the contract between building and the cloud function and I think the json makes a better api.

If there's a strong reason to keep this I'd just have ask to use a Object instead of the Map since we're putting it into and Object any way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, why not create the json instead of the .txt?

I went for text format, because text sitemaps are a thing and it's a simpler format with fewer assumptions than a JSON, although I'm not technically adhering to the sitemap convention of using fully qualified URLs.

I mainly want be aware of what's the contract between building and the cloud function and I think the json makes a better api.

Are you suggesting to publish a sitemap.json with a mapping of normalized path to canonical path, to be used by other consumers potentially? That would actually be my concern that other parties start using this JSON if they come across it. On the other hand, would we change it?

If there's a strong reason to keep this I'd just have ask to use a Object instead of the Map since we're putting it into and Object any way.

Done in d2e58ca.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I was suggesting to build the same sitemap.json as in this file. I wouldn't call it sitemap though.

I would not publish that file. Is there a benefit for having the sitemap as text and as XML? If we go for only txt sitemap I'm all in for this. Otherwise this ads complexity and we potentially have to support two sitemaps formats from now on.

Copy link
Contributor Author

@caugner caugner Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind renaming (done in c9a9474), and I could surely build that map as part of the build, but I don't think it reduces complexity:

  1. I need to get all URLs from all sitemaps, so either I gather them as a side-effect of building the sitemaps, or I go over all (gzip-compressed) XML sitemaps later, parsing out the URLs.
  2. We build and deploy the Cloud Function after we sync the build to Cloud Storage, and gsutil doesn't support exclusion very well (cf. the plain.html files that we sync), so we might need to prevent access to that file in the Cloud Function.
  3. We definitely want to keep using the XML sitemap for search engines (as advised in robots.txt), because the text format doesn't support modification dates.
  4. We would need to extract the normalizePath() function into a lib shared by build and cloud function, and I assume this would no longer be possible with rari. We could then duplicate the logic, but it's still risky.

Overall, I would recommend to keep the current approach.

cloud-function/src/utils.ts Outdated Show resolved Hide resolved
@caugner caugner force-pushed the MP-1111-redirect-capitalization branch from 2913e3a to d2e58ca Compare June 11, 2024 09:44
@caugner caugner requested a review from fiji-flo June 11, 2024 09:44
@github-actions github-actions bot added the merge conflicts 🚧 Please rebase onto or merge the latest main. label Jun 19, 2024
Copy link
Contributor

This pull request has merge conflicts that must be resolved before it can be merged.

@github-actions github-actions bot removed the merge conflicts 🚧 Please rebase onto or merge the latest main. label Jun 19, 2024
@@ -39,6 +40,7 @@ router.all(
router.all("/submit/mdn-yari/*", requireOrigin(Origin.main), proxyTelemetry);
router.all("/pong/*", requireOrigin(Origin.main), express.json(), proxyPong);
router.all("/pimg/*", requireOrigin(Origin.main), proxyPong);
router.use(redirectNonCanonicals);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reorder the routes here? Move everything that doesn't need the redirectNonCanonicals up (I guess only get / and the assests).

Otherwise ready to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fiji-flo I moved redirectNonCanonicals down instead (and the search-index.json route up next to the assets route), to make the change less invasive: https://github.com/mdn/yari/pull/11151/files/55d08d37374c549d154529d8228dabcca4fc9a2d..45501138daec268e2fdb0dfa7a6ab9229f99f4b3

@caugner caugner force-pushed the MP-1111-redirect-capitalization branch from 0a78822 to 353c553 Compare June 21, 2024 10:29
@caugner caugner requested a review from fiji-flo June 21, 2024 10:35
@caugner
Copy link
Contributor Author

caugner commented Jun 21, 2024

@fiji-flo See #11151 (comment).

Copy link
Contributor

@fiji-flo fiji-flo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@caugner caugner merged commit 8568a5e into main Jun 21, 2024
13 checks passed
@caugner caugner deleted the MP-1111-redirect-capitalization branch June 21, 2024 10:41
ferdnyc pushed a commit to ferdnyc/yari that referenced this pull request Jul 13, 2024
Previously, we served the same content independent of capitalization:
- https://developer.mozilla.org/en-US/docs/web (incorrect capitalization)
- https://developer.mozilla.org/en-US/docs/Web (correct capitalization)

Now, we build a map to match the requested against the canonical URL,
redirecting if necessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-function github-actions redirects all things related to redirecting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants