Non-range-forming "-"s not correctly handled for character groups for JavaScript #5

Closed
danny0838 opened this issue Nov 18, 2022 · 9 comments


danny0838 commented Nov 18, 2022

Example 1 (trailing -)

Code:

console.warn(JSON.stringify(Regex.Analyzer(/[a-]/).tree(), null, 2))

Actual:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 512,
          "val": [
            "a",
            "]"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Expected:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 256,
          "val": [
            "a",
            "-"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Example 2 (leading -)

Code:

console.warn(JSON.stringify(Regex.Analyzer(/[-a]/).tree(), null, 2))

Actual:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 512,
          "val": [
            null,
            "a"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Expected:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 256,
          "val": [
            "-",
            "a"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Example 3 (leading - in a negated group)

Code:

console.warn(JSON.stringify(Regex.Analyzer(/[^-a]/).tree(), null, 2))

Actual:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 512,
          "val": [
            null,
            "a"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        }
      ],
      "flags": {
        "NegativeMatch": 1
      },
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Expected:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 256,
          "val": [
            "-",
            "a"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {
        "NegativeMatch": 1
      },
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Example 4 (- next to a character class escape)

Code:

console.warn(JSON.stringify(Regex.Analyzer(/[\d-x]/).tree(), null, 2))

Actual:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "d",
          "flags": {
            "MatchDigitChar": 1
          },
          "typeName": "Special"
        },
        {
          "type": 512,
          "val": [
            "d",
            "x"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Expected:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "d",
          "flags": {
            "MatchDigitChar": 1
          },
          "typeName": "Special"
        },
        {
          "type": 256,
          "val": [
            "-",
            "x"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Example 5 (- forming a range with a special char)

Code:

console.warn(JSON.stringify(Regex.Analyzer(/[\t-x]/).tree(), null, 2))

Actual:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "t",
          "flags": {
            "HorizontalTab": 1
          },
          "typeName": "Special"
        },
        {
          "type": 512,
          "val": [
            "t",
            "x"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

Expected:

{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 512,
          "val": [
            {
              "type": 128,
              "val": "t",
              "flags": {
                "HorizontalTab": 1
              },
              "typeName": "Special"
            },
            "x"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

foo123 commented Nov 18, 2022

The correct way to indicate that - is a literal character rather than a range operator is to escape it, like

/[a\-]/


https://foo123.github.io/examples/regex-analyzer/#action=analyze&regex=%2F%5Ba%5C-%5D%2F

foo123 closed this as completed Nov 18, 2022

danny0838 commented Nov 18, 2022

@foo123 Most real-world JavaScript engines interpret [a-] as [a\-] and [-a] as [\-a]. If this is not fixed, many real-world regexes that run fine in a browser will be analyzed incorrectly.

For example, AdguardTeam/AdguardFilters#134630
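
The engine behaviour described above can be checked with the built-in RegExp. The following is a minimal sketch that uses only native JavaScript; it does not call regex-analyzer, and the variable names are illustrative.

```javascript
// Native RegExp behaviour only; regex-analyzer is not involved here.
const trailing = /^[a-]$/;  // hyphen is the last class member
const leading = /^[-a]$/;   // hyphen is the first class member
const negated = /^[^-a]$/;  // hyphen is first after the negation marker
const escaped = /^[a\-]$/;  // explicitly escaped hyphen, for comparison

const results = {
  trailingMatchesHyphen: trailing.test('-'),
  leadingMatchesHyphen: leading.test('-'),
  negatedRejectsHyphen: !negated.test('-'),
  // The unescaped and escaped forms agree on every probe character.
  sameAsEscaped: ['a', '-', 'b', ']'].every(
    (ch) => trailing.test(ch) === escaped.test(ch)
  ),
};

console.log(results);
```

Every field in `results` should be `true` in an engine that treats the leading or trailing hyphen as a literal.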


foo123 commented Nov 18, 2022

That may be true, but it is still not correct. The - needs to be escaped. It is like browsers accepting faulty HTML.
I am not sure this needs any fixing.


foo123 commented Nov 18, 2022

Anyway, I will take a closer look at it in the next update. Thanks.

foo123 reopened this Nov 18, 2022
danny0838 commented

According to the ECMAScript spec, a leading or trailing hyphen in a character class should be treated as a literal -.


danny0838 commented Nov 18, 2022

Additionally, a hyphen next to a character class escape (such as \d) should be treated as a literal. For example, [\d-w] should be treated as a character group composed of \d, -, and w.

Also, a range whose endpoint is a special char is not handled correctly, e.g. [\t-x].

Added related examples in the OP.
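
For reference, both cases can be observed with the built-in RegExp. This is a minimal sketch of native engine behaviour without the u flag (it does not call regex-analyzer, and the names are illustrative). With the u flag, a class escape next to a range hyphen is a syntax error instead.

```javascript
// Native RegExp behaviour only; regex-analyzer is not involved here.
const classEscapeThenHyphen = /^[\d-w]$/; // \d, "-" and "w"; no range is formed
const tabToX = /^[\t-x]$/;                // range from U+0009 (tab) to U+0078 ("x")

const checks = {
  digit: classEscapeThenHyphen.test('5'),
  hyphen: classEscapeThenHyphen.test('-'),
  w: classEscapeThenHyphen.test('w'),
  e: classEscapeThenHyphen.test('e'), // would match only if "d-w" were a range
  tab: tabToX.test('\t'),
  space: tabToX.test(' '),            // U+0020 lies inside the range
  y: tabToX.test('y'),                // U+0079 lies just past "x"
};

// With the u flag the same pattern is rejected when the RegExp is constructed.
let unicodeModeThrows = false;
try {
  new RegExp('[\\d-w]', 'u');
} catch (err) {
  unicodeModeThrows = err instanceof SyntaxError;
}

console.log(checks, unicodeModeThrows);
```

Here `checks.e` and `checks.y` are the only entries expected to be `false`.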

danny0838 changed the title from "Starting or ending "-" are not correctly handled for character groups" to "Non-range-forming "-"s not correctly handled for character groups for JavaScript" Nov 19, 2022

foo123 commented Nov 19, 2022

Updated lib to 1.2.0 (js only). This feature has been added.


https://foo123.github.io/examples/regex-analyzer/#action=analyze&regex=%2F%5B-a%5D%2F

danny0838 commented

@foo123 [x-\d] still does not seem to be handled correctly in 1.2.0.


foo123 commented Nov 19, 2022

New upload of 1.2.0. Output for several test patterns:

/[a-]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 256,
          "val": [
            "a",
            "-"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/[-a]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 256,
          "val": [
            "-",
            "a"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/[\d-x]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "d",
          "flags": {
            "MatchDigitChar": 1
          },
          "typeName": "Special"
        },
        {
          "type": 256,
          "val": [
            "-",
            "x"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/[x-\d]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "d",
          "flags": {
            "MatchDigitChar": 1
          },
          "typeName": "Special"
        },
        {
          "type": 256,
          "val": [
            "x",
            "-"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/[abx-\dA-Z]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "d",
          "flags": {
            "MatchDigitChar": 1
          },
          "typeName": "Special"
        },
        {
          "type": 512,
          "val": [
            "A",
            "Z"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        },
        {
          "type": 256,
          "val": [
            "a",
            "b",
            "x",
            "-"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/[abx-\dxyA-Z]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 128,
          "val": "d",
          "flags": {
            "MatchDigitChar": 1
          },
          "typeName": "Special"
        },
        {
          "type": 512,
          "val": [
            "A",
            "Z"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        },
        {
          "type": 256,
          "val": [
            "a",
            "b",
            "x",
            "-",
            "x",
            "y"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/[abdA-Zxy0-9]/
{
  "type": 1,
  "val": [
    {
      "type": 8,
      "val": [
        {
          "type": 512,
          "val": [
            "A",
            "Z"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        },
        {
          "type": 512,
          "val": [
            "0",
            "9"
          ],
          "flags": {},
          "typeName": "CharacterRange"
        },
        {
          "type": 256,
          "val": [
            "a",
            "b",
            "d",
            "x",
            "y"
          ],
          "flags": {},
          "typeName": "Characters"
        }
      ],
      "flags": {},
      "typeName": "CharacterGroup"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

[x-\d] still does not seem to be handled correctly in 1.2.0.

Fixed

Additionally, a hyphen next to a character class escape (such as \d) should be treated as a literal. For example, [\d-w] should be treated as a character group composed of \d, -, and w.

Done
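
For readers who want to see the rule in code, here is a rough sketch of how the hyphens in a class body could be classified. It is not regex-analyzer's implementation: `classifyClassBody` and `toItem` are hypothetical helpers, the output shape is invented for illustration, and only single-character escapes are handled (no \xHH, \uHHHH or \cX). It assumes non-unicode (no u flag) semantics and takes the text between [ and ], without any leading ^.

```javascript
// Hypothetical helper: turn an atom into an output item.
function toItem(atom) {
  return { kind: atom.isSet ? 'set' : 'char', value: atom.text };
}

// Hypothetical helper: classify the members of a character class body.
function classifyClassBody(body) {
  // Step 1: split into atoms; a backslash binds to the next character.
  const atoms = [];
  for (let i = 0; i < body.length; i++) {
    if (body[i] === '\\' && i + 1 < body.length) {
      const ch = body[i + 1];
      // \d \D \w \W \s \S stand for sets and cannot be range endpoints.
      atoms.push({ text: '\\' + ch, isSet: 'dDwWsS'.includes(ch), rawHyphen: false });
      i++;
    } else {
      atoms.push({ text: body[i], isSet: false, rawHyphen: body[i] === '-' });
    }
  }

  // Step 2: an unescaped hyphen forms a range only when it sits between
  // two atoms and neither of them is a set.
  const out = [];
  let i = 0;
  while (i < atoms.length) {
    const a = atoms[i];
    const dash = atoms[i + 1];
    const b = atoms[i + 2];
    if (dash && dash.rawHyphen && b) {
      if (a.isSet || b.isSet) {
        out.push(toItem(a), { kind: 'char', value: '-' }, toItem(b));
      } else {
        out.push({ kind: 'range', from: a.text, to: b.text });
      }
      i += 3;
    } else {
      out.push(toItem(a)); // includes a leading or trailing hyphen
      i += 1;
    }
  }
  return out;
}

console.log(classifyClassBody('a-'));     // two chars: "a" and "-"
console.log(classifyClassBody('x-\\d'));  // char "x", char "-", set "\d"
console.log(classifyClassBody('\\t-x'));  // one range from "\t" to "x"
```

The sketch keeps members in source order and does not check that a range's start is not greater than its end, which a real parser would also need to do.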
