Merge pull request #629 from ossf/python-ruby-permissive

Fix "Correctly Using Regex" table for Python and Ruby
ossf · Sep 24, 2024 · f7ba781
2 parents 3ba1400 + ff74d1b
commit f7ba781
Showing 4 changed files with 21 additions and 5 deletions.
diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -175,7 +175,7 @@ Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, bu
 The [Python3 language documentation on re](https://docs.python.org/3/library/re.html) notes that its operations are “similar to those found in Perl” - but note that they are _similar_ not _identical_. In this library:
 
 * ^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
-* $ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
+* $ Matches the end of the string or just before the newline at the end of the string (it is _permissive_), and in MULTILINE mode it also matches before a newline.
 * \A Matches only at the start of the string.
 * \Z Matches only at the end of the string. Note that this is spelled \Z not \z, and there is no \z.
 

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
@@ -102,7 +102,7 @@ Platform
    </td>
    <td>“\Z” (not “$” nor “\z”)
    </td>
-   <td>No
+   <td>Yes
    </td>
   </tr>
   <tr>
@@ -112,18 +112,18 @@ Platform
    </td>
    <td>“\z” (not “$”)
    </td>
-   <td>No
+   <td>Yes
    </td>
   </tr>
 </table>
 
-For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “<tt>^(ab&#x7c;de)$</tt>”. To validate the same thing in Python, use “<tt>^(ab&#x7c;de)\Z</tt>” or “<tt>\A(ab&#x7c;de)\Z</tt>”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
+For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “<tt>^(ab&#x7c;de)$</tt>”. To validate the same thing in Python, use “<tt>^(ab&#x7c;de)\Z</tt>” or “<tt>\A(ab&#x7c;de)\Z</tt>”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
 
 In addition, ensure your regex is not vulnerable to a Regular Expression Denial of Service (ReDoS) attack. A ReDoS “[is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)”. Many regex implementations are “backtracking” implementations, that is, they try all possible matches. In these implementations,  a poorly-written regular expression can be exploited by an attacker to take a vast amount of time.
 
 1. One solution is to use a regex implementation that does not have this vulnerability because it never backtracks. E.g., use Go’s default regex system, RE2, or on .NET enable the RegexOptions.NonBacktracking option. Non-backtracking implementations can sometimes be orders of magnitude faster, but they also omit some features (e.g., backreferences).
 2. Alternatively, create regexes that require no or little backtracking. Where a branch (“&#x7c;”) occurs, the next character should select one branch. Where there is optional repetition (e.g., “&#x2a;”), the next character should determine if there is a repetition or not. One common cause of unnecessary backtracking are poorly-written regexes with repetitions in repetitions, e.g., “(a+)&#x2a;”. Some tools can help find these defects.
-3. A partial countermeasure is to greatly limit the length of the untrusted input. This can limit the impact of a vulnerability.
+3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability. For example, in a regex, use “{0,4}” (0 through 4 repetitions inclusive) instead of “*” (0 or more repetitions, with no maximum).
 
 ## Detailed Rationale
 

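Note on item 3 above: the partial countermeasure (cap the input length and bound the repetition count) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `MAX_INPUT_LEN`, the `(?:ab){0,4}` pattern, and the `is_valid` helper are hypothetical names chosen for the example and are not part of this PR or the guide.

```python
import re

# Hypothetical cap on untrusted input length (an assumption for this sketch).
MAX_INPUT_LEN = 64

# Hypothetical pattern: "ab" repeated 0 through 4 times inclusive.
# "{0,4}" bounds the repetition, unlike "*", which has no maximum.
# \A and \Z anchor the match to the whole string in Python.
BOUNDED = re.compile(r'\A(?:ab){0,4}\Z')


def is_valid(text: str) -> bool:
    """Reject over-long input first, then apply the bounded regex."""
    if len(text) > MAX_INPUT_LEN:
        return False
    return BOUNDED.search(text) is not None


print(is_valid("abab"))     # within the repetition bound
print(is_valid("ab" * 5))   # exceeds {0,4}
print(is_valid("a" * 1000)) # rejected by the length cap before matching
```

Checking the length before matching keeps the regex engine from ever seeing very long untrusted input, which is why the cap limits the impact of a pattern that still backtracks.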
diff --git a/docs/src/regex.py b/docs/src/regex.py
@@ -0,0 +1,9 @@
+#!/usr/bin/env python3
+
+import re
+
+print('Test Python regex')
+print("Must be false: ", bool(re.search(r'^wrong$', "hello")))
+print("Must be true: ", bool(re.search(r'^hello$', "hello")))
+print("True if permissive: ", bool(re.search(r'^hello$', "hello\n")))
+print("Should be false: ", bool(re.search(r'^hello$', "hello\nthere")))
diff --git a/docs/src/regex.rb b/docs/src/regex.rb
@@ -0,0 +1,7 @@
+#!/usr/bin/env ruby
+
+puts('Test Ruby regex')
+puts("Must be false: ", !! /^wrong$/.match("hello"))
+puts("Must be true: ", !! /^hello$/.match("hello"))
+puts("True if permissive: ", !! /^hello$/.match("hello\n"))
+puts("Should be true ($ always multi): ", !! /^hello$/.match("hello\nthere"))
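Note on the test scripts: `docs/src/regex.py` in this PR exercises only `$`. A minimal sketch contrasting the permissive `$` with the strict `\Z` recommended in the table might look like the following. The helper name `is_exact_hello` is hypothetical, and `re.fullmatch` is not mentioned in the PR; it is shown here only as a standard-library alternative that requires the whole string to match.

```python
import re


def is_exact_hello(text: str) -> bool:
    """Return True only when text is exactly 'hello', using \\A and \\Z."""
    return re.search(r'\Ahello\Z', text) is not None


# "$" is permissive in Python: it also matches just before a trailing newline.
permissive = bool(re.search(r'^hello$', "hello\n"))

# "\Z" matches only at the very end of the string, so the newline is rejected.
strict = is_exact_hello("hello\n")

# re.fullmatch requires the entire string to match, with no explicit anchors.
full = bool(re.fullmatch(r'hello', "hello\n"))

print("$ accepts trailing newline:", permissive)    # True
print("\\Z accepts trailing newline:", strict)       # False
print("fullmatch accepts trailing newline:", full)  # False
```

The first line of output corresponds to the "True if permissive" case in `docs/src/regex.py`; the other two show forms that reject the trailing newline.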