Support symbol '-' in standard_ref() #344

tocic · 2024-07-28T07:27:22Z

We should support the symbol - in our references. For example,

§[dcl.spec.auto]¶general-example-4

should not be rendered as

§[dcl.spec.auto]¶general-example-4

The text was updated successfully, but these errors were encountered:

tocic · 2024-08-04T11:39:10Z

I tried to make a solution for the gerenal case and came up with the patch below. But it turned out to be more difficult than I expected :).

I downloaded this page and used the following grep to get a list of possible sections:

grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/[^#]+#\K[^"]+' 14882_\ Index.html > sections.txt

As you can see, there are many interesting examples that we don't support yet, especially the ones that end with ) and -, because we can't easily disambiguate them from the surrounding punctuation. For example,

... (§[defns.block]¶def:block_(execution)).

should be displayed as

... (§[defns.block]¶def:block_(execution)).

I don't think we can solve that with a single regex (but maybe we can postprocess the match).

But there are only 60 \W$ hits out of 5238 lines in generalindex, 0/3502 in grammarindex, 0/492 in headerindex, 678/9520 in libraryindex, 0/1940 in conceptindex, and 4/337 in impldefindex. So what do you think if we support only those sections that end with a word character (\w)? Maybe allowing the chars +-=*~ is also safe (leaving only 39 examples in generalindex and 90 examples in libraryindex unsupported).

click here to see the patch

diff --git a/quiz/templatetags/quiz_extras.py b/quiz/templatetags/quiz_extras.py
index 4b637ab..0295d9f 100644
--- a/quiz/templatetags/quiz_extras.py
+++ b/quiz/templatetags/quiz_extras.py
@@ -18,21 +18,36 @@ def format_reference(match):
     full_link = "https://timsong-cpp.github.io/cppwp/n4659/" + section_name
     if paragraph_number:
         full_link = full_link + "#" + paragraph_number
-    return "<em><a href=\"" + full_link + "\">" + full_reference + "</a></em>"
+    return f"*[{full_reference}]({full_link})*"
 
 
 def standard_ref(text):
-    section_name = u'(\[(?P<section_name>[\w:]+(\.[\w:]+)*)\])'
-    possible_paragraph = u'(¶(?P<paragraph>\d+(\.\d+)*))*'
-    regex = re.compile('§(' + section_name + possible_paragraph + ')')
-    return re.sub(regex, format_reference, text)
+    regex = re.compile(
+        r'''
+        §\[                    # every ref starts its section part with §[
+            (?P<section_name>  #
+                [^]]+          # at least one character other than ]
+            )                  #
+        \]                     # every ref ends its section part with ]
+        (?:¶                   # ref may start an optional paragraph part with ¶
+            (?P<paragraph>     #
+                \S*            # any number of non-whitespace chars
+                \w             # supported sections must end with a word char
+            )                  #
+        )?                     #
+        ''',
+        re.VERBOSE,
+    )
+    return regex.sub(format_reference, text)
 
 
 @register.filter(needs_autoescape=True)
 def to_html(text, autoescape=None):
     return mark_safe(
-        standard_ref(
-            markdown.markdown(text, extensions=['nl2br', 'pymdownx.superfences'])))
+        markdown.markdown(
+            standard_ref(text), extensions=['nl2br', 'pymdownx.superfences']
+        )
+    )

knatten · 2024-08-08T20:02:19Z

Thanks for investigating all of this! I'm very busy this week so don't have time to dig into it. Something that supports most cases and isn't too complicated sounds good to me. But I'd like to see some examples of the types of links we match, and some things we shouldn't match.

However, instead of regular unit tests, it can often be nice to just have a comment in the code with a link to regex101, like this https://regex101.com/r/Wxld8e/2

tocic · 2024-08-11T09:20:51Z

I believe we should parse the refs first and only then convert markdown to html. In that case, this is the list of possible refs we might encounter.

tmp=''
for index in 'general' 'grammar' 'header' 'library' 'concept' 'impldef'; do
  tmp+=$(grep -oP '<a href="https://timsong-cpp.github.io/cppwp/std23/\K[^"\s]+(?=")' ${index}.html | python3 -c "import sys,html,urllib.parse; print(urllib.parse.unquote(html.unescape(sys.stdin.read())))")
  tmp+=$'\n'
done

echo "${tmp%?}" | sed -e 's/^/§[/' | sed -e 's/#/]¶/' | sort | uniq | head -c -1 > possible_refs.txt

possible_refs.txt

And this is how the suggested regex performs on the dataset (1711/20945 are unsupported):

https://regex101.com/r/hf2P3X/1

tocic · 2024-08-12T05:06:04Z

And here are the characters I think we can safely support:

§\[[^]]+\](?:¶\S*[\w&></^|+%=~*-])?

, leaving 158/20945 unsupported (https://regex101.com/r/hf2P3X/2).

Alternatively, we can try to enumerate the postfixes that the valid reference can't have like this:

§\[[^]]+\](?:¶\S*[^\s:,;.?)](?<!\?!|""|\*\*|!!))?

, leaving 131/20945 unsupported (https://regex101.com/r/hf2P3X/3).

knatten · 2024-08-17T09:22:38Z

I prefer the first one, the second one matches more things but is just so hard to read. Want to make a PR?

Fixes knatten#344.

All currently existing refs in cppquiz are supported, 297 out of 20945 in C++23 index are unsupported (160 containing an asterisk, 131 ending in a right parenthesis, 4 ending in a quotation mark, 2 ending in an explamation mark). Fixes knatten#344.

All currently existing refs in cppquiz are supported, 297 out of 20945 in C++23 index are unsupported (160 containing an asterisk, 131 ending in a right parenthesis, 4 ending in a quotation mark, 2 ending in an exclamation mark). Fixes knatten#344.

knatten added good first issue help wanted labels Jul 29, 2024

tocic added a commit to tocic/cppquiz that referenced this issue Aug 25, 2024

Support more chars in refs

c2b0b8d

Fixes knatten#344.

tocic mentioned this issue Aug 25, 2024

Support more chars in refs #353

Merged

knatten closed this as completed in #353 Sep 16, 2024

knatten closed this as completed in 0eb7fbc Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support symbol '-' in standard_ref() #344

Support symbol '-' in standard_ref() #344

tocic commented Jul 28, 2024 •

edited

Loading

tocic commented Aug 4, 2024 •

edited

Loading

knatten commented Aug 8, 2024

tocic commented Aug 11, 2024 •

edited

Loading

tocic commented Aug 12, 2024 •

edited

Loading

knatten commented Aug 17, 2024

Support symbol '-' in standard_ref() #344

Support symbol '-' in standard_ref() #344

Comments

tocic commented Jul 28, 2024 • edited Loading

tocic commented Aug 4, 2024 • edited Loading

knatten commented Aug 8, 2024

tocic commented Aug 11, 2024 • edited Loading

tocic commented Aug 12, 2024 • edited Loading

knatten commented Aug 17, 2024

tocic commented Jul 28, 2024 •

edited

Loading

tocic commented Aug 4, 2024 •

edited

Loading

tocic commented Aug 11, 2024 •

edited

Loading

tocic commented Aug 12, 2024 •

edited

Loading