-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Format RFC links with anchors nicely #11809
Conversation
Previously if we wrote :rfc:`1234#section-2` links would be formatted like RFC 1234#section-2 which isn't quite nice. After this patch: RFC 1234 Section 2 which looks and reads better. Resolves: sphinx-doc#7027
I'm not entirely sure if the build failures are actually related to the patch |
sphinx/roles.py
Outdated
""" | ||
parts = target.replace('#', ' ').replace('-', ' ').split() | ||
if len(parts) >= 2: | ||
parts[1] = parts[1].title() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RFC links can have section names, for example
https://datatracker.ietf.org/doc/html/rfc9076.html#name-risks-in-the-dns-data
and title casing arbitrary English text well is complicated, for example the above would be formatted as "RFC 9076 Name Risks in the Dns Data" which is worse than "RFC 9076#name-risks-in-the-dns-data" in my opinion. So you should only do this formatting for exact matches for a few specific cases, I think the ones you put in the test "Section" "Page" "Appendix" are good.
Or if you extract all possible links from all the RFCs and their corresponding natural language text, then you could write and test that this function perfectly extracts the original titles from all possible existing #
links.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I didn't know that was possible – thank you.
What do you think we should do in that case? Leaving it be and rendering as RFC 9076#name-risks-in-the-dns-data
is easy but it doesn't look great.
On the other hand I can't think of a good capitalization logic that would convert things like "risks-in-the-dns-data" to "Risks in the DNS data" reliably.
Something like this would be easy (when a name-
prefix is detected in the anchor): RFC 9076 Section risks-in-the-dns-data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS.
the above would be formatted as "Name Risks in the Dns Data"
Note that I'm currenly applying title()
to the first word in that sequence, not the whole thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good capitalization logic
If we're only limiting ourselves to links that appear on datatracker.ietf.org then as I said, we have a specific number of text documents (RFCs) with a specific number of human readable section names that are being lowercased and .replace(' ', '-')
'd and we want to recover that specific set of original strings. The way we can do it is to extract every single possible link on the website and the actual original human readable text, which should be doable locally (i.e. without scraping) (I did something similar once here and here) since it's an open source website. There's like 9000 RFCs, if each has 100 links then that's a list of 900,000 2 item test cases that your function needs to convert correctly. Title casing (without "and" "as" "but" "for" "if" "nor" "or" "so" "yet" "a" "an" "the" "as" "at" "by" "for" "in" "of" "off" "on" "per" "to" "up" "via") + a list of acronyms + 100 hard coded special cases would probably do it. Then you make the extraction easily repeatable so that people can periodically re-run the extraction code as the RFCs are edited. Also, I think for section names it might make sense to also add a colon like "RFC 9076: Name Risks in the DNS Data".
Or you can just do it for sections, to me it's better to leave section names in URL syntax than do a bad job making them human readable because it's a link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to leave anchors with names (and anything else not already determined to be safe) alone for now, probably not worth blocking this PR because of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally,
- RFC 1234, Section 2.
- RFC 1234, Appendix 2, page 23.
- RFC 1234: The Section Title (capitalize words except common conjunctions)
But I'm not sure whether it's worth the case-handling.
# Conflicts: # sphinx/roles.py # tests/test_markup.py
Thank you @jstasiak, this is a nice improvement. A |
Feature or Bugfix
Purpose
Previously if we wrote
links would be formatted like
which isn't quite nice. After this patch:
which looks and reads better.
There are some non-section anchors that I discovered when
looking at a random RFC, I included them all in unit tests.
Detail
Relates
Resolves: #7027