Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made legacy bitstream URLs redirect with 301 status code #3062

Merged

Conversation

alexandrevryghem
Copy link
Member

@alexandrevryghem alexandrevryghem commented May 17, 2024

References

Description

This fixes an issue where legacy bitstream URLs are redirected with status code 302 instead of 301 (permanent URL redirect).

Instructions for Reviewers

List of changes in this PR:

  • Refactored the existing legacy url resolver into a redirect guard and updated the status code of the request before redirecting to the

Guidance for how to test & review this PR:

  • Run your frontend in production mode with sandbox.dspace.org as backend
  • Run curl --head http://localhost:4000/bitstream/handle/123456789/258/Money%20and%20Emerging%20Adults.pdf and verify that the status code is now 301
  • Run curl --head http://localhost:4000/bitstream/handle/123456789/258/Money%20and%20Emerging and verify that the status code for an invalid url is still 404

Checklist

  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & specs/tests), or I have provided reasons as to why that's not possible.
  • My PR passes ESLint validation using yarn lint
  • My PR doesn't introduce circular dependencies (verified via yarn check-circ-deps)
  • My PR includes TypeDoc comments for all new (or modified) public methods and classes. It also includes TypeDoc for large or complex private methods.
  • My PR passes all specs/tests and includes new/updated specs or tests based on the Code Testing Guide.
  • If my PR includes new libraries/dependencies (in package.json), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR includes new features or configurations, I've provided basic technical documentation in the PR itself.
  • If my PR fixes an issue ticket, I've linked them together.

Kuno Vercammen and others added 2 commits May 16, 2024 20:41
…e redirect status code to 301 Moved Permanently
…direct_contribute-main

# Conflicts:
#	src/app/bitstream-page/bitstream-page-routing.module.ts
#	src/app/bitstream-page/legacy-bitstream-url.resolver.spec.ts
#	src/app/bitstream-page/legacy-bitstream-url.resolver.ts
@alexandrevryghem alexandrevryghem self-assigned this May 17, 2024
@alexandrevryghem alexandrevryghem added bug component: SEO Search Engine Optimization high priority testathon Reported by a tester during Community Testathon claimed: Atmire Atmire team is working on this issue & will contribute back labels May 17, 2024
@artlowel artlowel added this to the 8.0 milestone May 17, 2024
@tdonohue tdonohue self-requested a review May 17, 2024 13:54
Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexandrevryghem : This isn't working properly for me from either Chrome or from curl (version 7). In both, I see the same behavior.

  1. First, for a bitstream without special characters, I do get a 301 redirect, but it's Location: is wrongly set to the REST API (/api/core/bitstreams/[uuid]/content). So, it looks like this from curl (and I see this same result in Chrome DevTools from the "Network" tab):
    curl --head http://localhost:4000/bitstream/handle/123456789/1361/test_excel.xls
    HTTP/1.1 301 Moved Permanently
    X-Powered-By: Express
    X-RateLimit-Limit: 500
    X-RateLimit-Remaining: 499
    Date: Mon, 20 May 2024 19:06:02 GMT
    X-RateLimit-Reset: 1716232004
    Cache-Control: max-age=604800
    Link: <http://localhost:4000/signposting/linksets/9a171726-3060-4859-9d0d-8c147cb148c5> ; rel="linkset" ; type="application/linkset" , <http://localhost:4000/signposting/linksets/9a171726-3060-4859-9d0d-8c147cb148c5/json> ; rel="linkset" ; type="application/linkset+json" , <http://localhost:4000/items/9a171726-3060-4859-9d0d-8c147cb148c5> ; rel="collection" ; type="text/html"
    Location: http://localhost:8080/server/api/core/bitstreams/8443db73-d5c4-4069-a438-f04fe6d02f57/content
    Vary: Accept
    Content-Type: text/plain; charset=utf-8
    Content-Length: 127
    Connection: keep-alive
    Keep-Alive: timeout=5
    
    • The proper behavior here (for Google Scholar & similar) would be to perform a 301 Redirect to the new User Interface path, e.g. http://localhost:4000/bitstream/handle/123456789/1361/test_excel.xls might 301 redirect to http://localhost:4000/bitstreams/[uuid]/download (which is the URL you see when you hover over the download link in the UI). It's necessary for this to be a User Interface path for SEO purposes, as we want this link associated with indexing the UI.
    • I do realize this would mean we'd have a 301 Redirect (to /bitstreams/[uuid]/download in the UI) followed by a 302 Redirect (to /api/core/bitstreams/[uuid]/content in the REST API). But, the purpose of the 301 is to provide the new UI download URL to Google Scholar (and similar indexers), while the 302 is then to actually send along the content.
  2. Unfortunately, if I attempt to use a bitstream with a special character, I now receive a 404 response both in Chrome and in curl. So, if I paste a URL like this into Chrome: http://localhost:4000/bitstream/handle/123456789/1361/test_pdf_ć.pdf , I see that the browser encodes the special character ć into %C4%87 and the resulting URL returns a 404 response. I can also reproduce this from curl --head when using curl version 7.81.0

My suspicion is that the URL encoding that occurs is tripping things up. This might be related to the issues (at least the 200 OK ones) in #2727 as well, though I'm not sure.

…id bitstream urls in combination with a HardRedirectService#redirect, this will make ensure the redirect is visible for curl instead of being performed by Angular
…itstream-redirect_contribute-main

# Conflicts:
#	src/app/bitstream-page/legacy-bitstream-url-redirect.guard.spec.ts
#	src/app/bitstream-page/legacy-bitstream-url-redirect.guard.ts
@alexandrevryghem
Copy link
Member Author

@tdonohue: Thnx for the feedback! The first issue you mentioned with the legacy bitstream URL directly pointing to the backend has been solved. This was because there was actually no real redirect because the redirect was handled by Angular, so curl wasn't able to see this internal redirect. For the second issue, I was not able to reproduce it with my local setup even without these new changes so could you check if you can still reproduce this?

@tdonohue tdonohue self-requested a review May 24, 2024 13:37
Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexandrevryghem : I can confirm that the 301 redirect is now working properly. So, I'm basically a +1 to this PR.

However, I'm still getting a 404 Error whenever I try to use these legacy bitstream URLs with any web browser. I'm a bit surprised you cannot reproduce this on your end, as I see the same thing in Firefox (v126.0), Chrome (v125.0.6422.77 (Official Build) (64-bit)) and even Microsoft Edge (v125.0.2535.51 (Official build) (64-bit))

In any of those browsers, if I copy a URL like this into my browser window, I always end up at DSpace's 404 page:
http://localhost:4000/bitstream/handle/123456789/1361/test_pdf_ć.pdf

However, if the URL has no special characters, then I see the 301 followed by the 302 in all browsers. For example:
http://localhost:4000/bitstream/handle/123456789/1361/test_pdf.pdf

So, overall, I think this PR fixes the 301 redirect issue. But, I'm still seeing the open bug (#2727) related to bitstream redirects with special characters.

If you could retest these redirects in other browsers, or see if you have colleagues who could retest them, I'd appreciate it. I'm confused as to why I get consistent errors for redirects with special characters (in every browser/tool that I try), but you do not.

@alexandrevryghem
Copy link
Member Author

alexandrevryghem commented May 29, 2024

@tdonohue: I retested it and asked someone else to also retest it and I think that it might just be because the URL you used wasn't valid. Could you maybe try and check if you can reproduce it by running this branch in prod mode with sandbox.dspace.org as backend. I created this item with the two different versions of the accented c character:

  • (which is the combination of 2 characters: the regular c (U+0063), followed by the combining acute accent ◌́ (U+0301)) → c%CC%81
  • ć (this is one character, the letter c with an acute accent (U+0107)) → %C4%87

I added their links in the metadata under dc.identifier.uri. (I used this tool to see the difference in UTF encodings). If you still experience this issue could you maybe create an item on sandbox so that we are sure we use the same DB & same URLs

@tdonohue
Copy link
Member

tdonohue commented May 29, 2024

@alexandrevryghem : I tried your example item on the Sandbox site, and I can verify that Item works for me. I run this PR locally on http://localhost:4000 pointed at the Sandbox backend, and I see the 301 redirect followed by the 302 redirect, even for the files which have special characters. I tried this in several browsers and they all work!

However, strangely, this code doesn't work locally. I tried copying your example item to my own backend running in Docker on http://localhost:8080. I copied it by exporting it using the export script on Sandbox, and then imported using the import script locally. The result is that I have the same Item with the same http://localhost:4000/handle/10673/1122 handle. In this scenario, when I use the same URLs (e.g. http://localhost:4000/bitstream/handle/10673/1122/1test_ć.pdf), I get 404 errors when the URL has a special character.

It's baffling to me that I can run the same UI code via http://localhost:4000/ and it works when pointed at https://sandbox.dspace.org/server/ but doesn't work when pointed at http://localhost:8080/server/. I have no idea why this is happening. But, at this point, I'm assuming it must be something in my local Docker setup, since it seems to only be reproducible on my machine.

In any case, since you've proven this works with the Sandbox as a backend, I'll merge this as-is and mark #2727 as "fixed" by this.

I appreciate you investigating this further and showing that it works on the Sandbox! I have no idea why the same item won't work locally, but it definitely seems like your fixes are working.

@tdonohue tdonohue self-requested a review May 29, 2024 18:36
Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks again @alexandrevryghem ! Merging this as it works properly when I use https://sandbox.dspace.org/server/ as the backend.

@tdonohue tdonohue added the port to dspace-7_x This PR needs to be ported to `dspace-7_x` branch for next bug-fix release label May 29, 2024
@tdonohue tdonohue merged commit 1c325cd into DSpace:main May 29, 2024
13 checks passed
@dspace-bot
Copy link
Contributor

Successfully created backport PR for dspace-7_x:

@tdonohue tdonohue removed the port to dspace-7_x This PR needs to be ported to `dspace-7_x` branch for next bug-fix release label May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug claimed: Atmire Atmire team is working on this issue & will contribute back component: SEO Search Engine Optimization high priority testathon Reported by a tester during Community Testathon
Projects
No open projects
Status: ✅ Done
4 participants