Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support symbol '-' in standard_ref() #344

Closed
tocic opened this issue Jul 28, 2024 · 5 comments · Fixed by #353
Closed

Support symbol '-' in standard_ref() #344

tocic opened this issue Jul 28, 2024 · 5 comments · Fixed by #353

Comments

@tocic
Copy link
Contributor

tocic commented Jul 28, 2024

We should support the symbol - in our references. For example,

§[dcl.spec.auto]¶general-example-4

should not be rendered as

§[dcl.spec.auto]¶general-example-4

@tocic
Copy link
Contributor Author

tocic commented Aug 4, 2024

I tried to make a solution for the gerenal case and came up with the patch below. But it turned out to be more difficult than I expected :).

I downloaded this page and used the following grep to get a list of possible sections:

grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/[^#]+#\K[^"]+' 14882_\ Index.html > sections.txt

As you can see, there are many interesting examples that we don't support yet, especially the ones that end with ) and -, because we can't easily disambiguate them from the surrounding punctuation. For example,

... (§[defns.block]¶def:block_(execution)).

should be displayed as

... (§[defns.block]¶def:block_(execution)).

I don't think we can solve that with a single regex (but maybe we can postprocess the match).

But there are only 60 \W$ hits out of 5238 lines in generalindex, 0/3502 in grammarindex, 0/492 in headerindex, 678/9520 in libraryindex, 0/1940 in conceptindex, and 4/337 in impldefindex. So what do you think if we support only those sections that end with a word character (\w)? Maybe allowing the chars +-=*~ is also safe (leaving only 39 examples in generalindex and 90 examples in libraryindex unsupported).

click here to see the patch
diff --git a/quiz/templatetags/quiz_extras.py b/quiz/templatetags/quiz_extras.py
index 4b637ab..0295d9f 100644
--- a/quiz/templatetags/quiz_extras.py
+++ b/quiz/templatetags/quiz_extras.py
@@ -18,21 +18,36 @@ def format_reference(match):
     full_link = "https://timsong-cpp.github.io/cppwp/n4659/" + section_name
     if paragraph_number:
         full_link = full_link + "#" + paragraph_number
-    return "<em><a href=\"" + full_link + "\">" + full_reference + "</a></em>"
+    return f"*[{full_reference}]({full_link})*"
 
 
 def standard_ref(text):
-    section_name = u'(\[(?P<section_name>[\w:]+(\.[\w:]+)*)\])'
-    possible_paragraph = u'(¶(?P<paragraph>\d+(\.\d+)*))*'
-    regex = re.compile('§(' + section_name + possible_paragraph + ')')
-    return re.sub(regex, format_reference, text)
+    regex = re.compile(
+        r'''
+        §\[                    # every ref starts its section part with §[
+            (?P<section_name>  #
+                [^]]+          # at least one character other than ]
+            )                  #
+        \]                     # every ref ends its section part with ]
+        (?:¶                   # ref may start an optional paragraph part with ¶
+            (?P<paragraph>     #
+                \S*            # any number of non-whitespace chars
+                \w             # supported sections must end with a word char
+            )                  #
+        )?                     #
+        ''',
+        re.VERBOSE,
+    )
+    return regex.sub(format_reference, text)
 
 
 @register.filter(needs_autoescape=True)
 def to_html(text, autoescape=None):
     return mark_safe(
-        standard_ref(
-            markdown.markdown(text, extensions=['nl2br', 'pymdownx.superfences'])))
+        markdown.markdown(
+            standard_ref(text), extensions=['nl2br', 'pymdownx.superfences']
+        )
+    )

@knatten
Copy link
Owner

knatten commented Aug 8, 2024

Thanks for investigating all of this! I'm very busy this week so don't have time to dig into it. Something that supports most cases and isn't too complicated sounds good to me. But I'd like to see some examples of the types of links we match, and some things we shouldn't match.

However, instead of regular unit tests, it can often be nice to just have a comment in the code with a link to regex101, like this https://regex101.com/r/Wxld8e/2

@tocic
Copy link
Contributor Author

tocic commented Aug 11, 2024

I believe we should parse the refs first and only then convert markdown to html. In that case, this is the list of possible refs we might encounter.

tmp=''
for index in 'general' 'grammar' 'header' 'library' 'concept' 'impldef'; do
  tmp+=$(grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/\K[^"\s]+(?=")' ${index}.html | python3 -c "import sys,html,urllib.parse; print(urllib.parse.unquote(html.unescape(sys.stdin.read())))")
  tmp+=$'\n'
done

echo "${tmp%?}" | sed -e 's/^/§[/' | sed -e 's/#/]¶/' | sort | uniq | head -c -1 > possible_refs.txt

possible_refs.txt

And this is how the suggested regex performs on the dataset (1711/20945 are unsupported):

https://regex101.com/r/hf2P3X/1

@tocic
Copy link
Contributor Author

tocic commented Aug 12, 2024

And here are the characters I think we can safely support:

§\[[^]]+\](?:¶\S*[\w&></^|+%=~*-])?

, leaving 158/20945 unsupported (https://regex101.com/r/hf2P3X/2).

Alternatively, we can try to enumerate the postfixes that the valid reference can't have like this:

§\[[^]]+\](?:¶\S*[^\s:,;.?)](?<!\?!|""|\*\*|!!))?

, leaving 131/20945 unsupported (https://regex101.com/r/hf2P3X/3).

@knatten
Copy link
Owner

knatten commented Aug 17, 2024

I prefer the first one, the second one matches more things but is just so hard to read. Want to make a PR?

tocic added a commit to tocic/cppquiz that referenced this issue Aug 25, 2024
tocic added a commit to tocic/cppquiz that referenced this issue Aug 27, 2024
All currently existing refs in cppquiz are supported, 297 out of 20945
in C++23 index are unsupported (160 containing an asterisk, 131 ending
in a right parenthesis, 4 ending in a quotation mark, 2 ending in
an explamation mark).

Fixes knatten#344.
tocic added a commit to tocic/cppquiz that referenced this issue Aug 27, 2024
All currently existing refs in cppquiz are supported, 297 out of 20945
in C++23 index are unsupported (160 containing an asterisk, 131 ending
in a right parenthesis, 4 ending in a quotation mark, 2 ending in
an explamation mark).

Fixes knatten#344.
tocic added a commit to tocic/cppquiz that referenced this issue Aug 27, 2024
All currently existing refs in cppquiz are supported, 297 out of 20945
in C++23 index are unsupported (160 containing an asterisk, 131 ending
in a right parenthesis, 4 ending in a quotation mark, 2 ending in
an explamation mark).

Fixes knatten#344.
tocic added a commit to tocic/cppquiz that referenced this issue Aug 27, 2024
All currently existing refs in cppquiz are supported, 297 out of 20945
in C++23 index are unsupported (160 containing an asterisk, 131 ending
in a right parenthesis, 4 ending in a quotation mark, 2 ending in
an explamation mark).

Fixes knatten#344.
tocic added a commit to tocic/cppquiz that referenced this issue Aug 28, 2024
All currently existing refs in cppquiz are supported, 297 out of 20945
in C++23 index are unsupported (160 containing an asterisk, 131 ending
in a right parenthesis, 4 ending in a quotation mark, 2 ending in
an exclamation mark).

Fixes knatten#344.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants