Add folding and simplification for OP_ECLASS
Fixes #537
NWilson committed Dec 1, 2024
1 parent 55fda7f commit b62c07c
Showing 18 changed files with 1,324 additions and 951 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/dev.yml
@@ -68,7 +68,7 @@ jobs:

 - name: Test (main test script)
 run: |
-ulimit -S -s 32768 # Raise stack limit; ASAN with -O0 is very stack-hungry
+ulimit -S -s 49152 # Raise stack limit; ASAN with -O0 is very stack-hungry
 ./RunTest
 - name: Test (JIT test program)
35 changes: 35 additions & 0 deletions HACKING
@@ -633,6 +633,41 @@ When XCL_NOT is set, the bit map, if present, contains bits for characters that
are allowed (exactly as for OP_NCLASS), but the list of items that follow it
specifies characters and properties that are not allowed.

The meaning of the bitmap indicated by XCL_MAP is that, if one is present, then
it fully describes which code points < 256 match the class (without needing to
invert the check according to XCL_NOT); the other items in the OP_XCLASS need
not be consulted. However, if a bitmap is not present, then code points < 256
may still match, so the other items in the OP_XCLASS must be consulted.
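The rule above for code points < 256 can be sketched in C. This is only an illustration under stated assumptions: the names map_result and xclass_check_map are hypothetical and are not part of PCRE2, and the bitmap is assumed to be 32 bytes with one bit per code point, least significant bit first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of consulting the optional bitmap for one code point.
   All names here are hypothetical; they only illustrate the rule. */
typedef enum { MAP_NO_MATCH, MAP_MATCH, MAP_CONSULT_ITEMS } map_result;

/* map is NULL when XCL_MAP is not set. Otherwise it is assumed to point to
   32 bytes holding one bit per code point < 256. */
static map_result xclass_check_map(uint32_t c, const uint8_t *map)
{
  /* Without a bitmap, or for code points >= 256, the bitmap says nothing,
     so the remaining items must be consulted. */
  if (c >= 256 || map == NULL) return MAP_CONSULT_ITEMS;

  /* With a bitmap, the bit is the final answer for c < 256. No inversion
     according to XCL_NOT is applied here. */
  return (map[c / 8] & (1u << (c & 7))) != 0 ? MAP_MATCH : MAP_NO_MATCH;
}
```

A caller would fall through to the item list only when the result is MAP_CONSULT_ITEMS.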

For classes containing logical expressions, such as "[\p{Greek} && \p{Lu}]" for
"uppercase Greek letters", OP_ECLASS is used. The expression is encoded as a
stack-based series of operands and operators, in Reverse Polish Notation.
Like an OP_XCLASS, the OP_ECLASS is first followed by a LINK_SIZE value
containing the total length of the opcode and its data. That is followed by a
code containing flags: currently just ECL_MAP indicating that a bit map is
present. There follows the bit map, if ECL_MAP is set. Finally, there is a
sequence of items, each of which is either an operand or an operator. Each item
starts with a single code unit containing its type:

  ECL_AND     AND; no additional data
  ECL_OR      OR; no additional data
  ECL_XOR     XOR; no additional data
  ECL_NOT     NOT; no additional data
  ECL_XCLASS  The additional data which follows ECL_XCLASS is the same as for
              an OP_XCLASS, except that this data is preceded by ECL_XCLASS
              rather than OP_XCLASS.
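To make the stack-based Reverse Polish encoding concrete, here is a minimal evaluator sketch in C. It is not the PCRE2 implementation: the sk_* names are hypothetical, and each operand is simplified to a single code point range standing in for the real ECL_XCLASS data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical item types mirroring the ECL_* names; the values are
   illustrative and are not the ones PCRE2 uses. */
typedef enum { SK_AND, SK_OR, SK_XOR, SK_NOT, SK_OPERAND } sk_type;

/* Simplified item: an operand is a code point range [lo, hi], standing in
   for the variable-length data that follows a real ECL_XCLASS. */
typedef struct { sk_type type; uint32_t lo, hi; } sk_item;

/* Evaluate an RPN item sequence for code point c using a small stack. */
static bool sk_eval(const sk_item *items, size_t n, uint32_t c)
{
  bool stack[32];
  size_t sp = 0;

  for (size_t i = 0; i < n; i++) {
    switch (items[i].type) {
      case SK_OPERAND:            /* push the operand's result */
        assert(sp < 32);
        stack[sp++] = (c >= items[i].lo && c <= items[i].hi);
        break;
      case SK_NOT:                /* unary: rewrite the top of the stack */
        assert(sp >= 1);
        stack[sp - 1] = !stack[sp - 1];
        break;
      case SK_AND: case SK_OR: case SK_XOR: {
        /* binary: pop the right operand, combine into the left one */
        assert(sp >= 2);
        bool rhs = stack[--sp];
        bool lhs = stack[sp - 1];
        stack[sp - 1] = items[i].type == SK_AND ? (lhs && rhs)
                      : items[i].type == SK_OR  ? (lhs || rhs)
                      : (lhs != rhs);
        break;
      }
    }
  }
  assert(sp == 1);                /* a well-formed expression leaves one value */
  return stack[0];
}
```

For example, a class meaning "a to z but not m to p" would be the sequence operand [a-z], operand [m-p], NOT, AND; evaluating it for 'c' yields true and for 'n' yields false.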

Additionally, there are two intermediate values used during compilation, but
these are folded away during generation of the opcode, and so never appear
inside an OP_ECLASS at match time. They are:

  ECL_ANY     match all characters; no additional data
  ECL_NONE    match no characters; no additional data
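One way to picture how constant operands such as ECL_ANY and ECL_NONE can be folded away is through the usual Boolean identities. The C sketch below is an assumption about the general technique, not a description of the actual PCRE2 compiler code; fold_val and the fold_* functions are hypothetical names, and the sketch tracks only whether a subexpression is known to be "any", "none", or something else.

```c
#include <assert.h>

/* Hypothetical summary of a subexpression: a known constant, or V_OTHER
   for a real operand whose result depends on the character. */
typedef enum { V_NONE, V_ANY, V_OTHER } fold_val;

/* NOT swaps the two constants; anything else stays non-constant. */
static fold_val fold_not(fold_val a)
{
  if (a == V_NONE) return V_ANY;
  if (a == V_ANY) return V_NONE;
  return V_OTHER;
}

/* X AND NONE = NONE; X AND ANY = X. */
static fold_val fold_and(fold_val a, fold_val b)
{
  if (a == V_NONE || b == V_NONE) return V_NONE;
  if (a == V_ANY) return b;
  if (b == V_ANY) return a;
  return V_OTHER;
}

/* X OR ANY = ANY; X OR NONE = X. */
static fold_val fold_or(fold_val a, fold_val b)
{
  if (a == V_ANY || b == V_ANY) return V_ANY;
  if (a == V_NONE) return b;
  if (b == V_NONE) return a;
  return V_OTHER;
}

/* X XOR NONE = X; X XOR ANY = NOT X. */
static fold_val fold_xor(fold_val a, fold_val b)
{
  if (a == V_NONE) return b;
  if (b == V_NONE) return a;
  if (a == V_ANY) return fold_not(b);
  if (b == V_ANY) return fold_not(a);
  return V_OTHER;
}
```

Under these identities, a constant operand never needs to survive next to an operator, which is consistent with the statement that ECL_ANY and ECL_NONE never appear inside an OP_ECLASS at match time.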

The meaning of the bitmap indicated by ECL_MAP differs from that of XCL_MAP for
OP_XCLASS in one way: the ECL_MAP bitmap is present whenever any code points
< 256 match the class, so the absence of a bitmap means that no code point
< 256 matches.


Back references
---------------
24 changes: 13 additions & 11 deletions src/pcre2_auto_possess.c
@@ -480,13 +480,13 @@ switch(c)

 case OP_NCLASS:
 case OP_CLASS:
+#ifdef SUPPORT_WIDE_CHARS
 case OP_XCLASS:
 case OP_ECLASS:
-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
 if (c == OP_XCLASS || c == OP_ECLASS)
 end = code + GET(code, 0) - 1;
 else
+#endif
 end = code + 32 / sizeof(PCRE2_UCHAR);
 class_end = end;

@@ -1118,17 +1118,15 @@ for(;;)
 list_ptr[2] + LINK_SIZE, (const uint8_t*)cb->start_code, utf))
 return FALSE;
 break;
-#endif

-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Enclose in "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
 case OP_ECLASS:
 if (PRIV(eclass)(chr,
 (list_ptr == list ? code : base_end) - list_ptr[2] + LINK_SIZE,
 (list_ptr == list ? code : base_end) - list_ptr[3],
 (const uint8_t*)cb->start_code, utf))
 return FALSE;
 break;
+#endif /* SUPPORT_WIDE_CHARS */

 default:
 return FALSE;
@@ -1236,13 +1234,17 @@ for (;;)
 }
 c = *code;
 }
-else if (c == OP_CLASS || c == OP_NCLASS || c == OP_XCLASS || c == OP_ECLASS)
+else if (c == OP_CLASS || c == OP_NCLASS
+#ifdef SUPPORT_WIDE_CHARS
+|| c == OP_XCLASS || c == OP_ECLASS
+#endif
+)
 {
-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
+#ifdef SUPPORT_WIDE_CHARS
 if (c == OP_XCLASS || c == OP_ECLASS)
 repeat_opcode = code + GET(code, 1);
 else
+#endif
 repeat_opcode = code + 1 + (32 / sizeof(PCRE2_UCHAR));

 c = *repeat_opcode;
@@ -1315,12 +1317,12 @@ for (;;)
 code += GET(code, 1 + 2*LINK_SIZE);
 break;

-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
-case OP_ECLASS:
+#ifdef SUPPORT_WIDE_CHARS
 case OP_XCLASS:
+case OP_ECLASS:
 code += GET(code, 1);
 break;
+#endif

 case OP_MARK:
 case OP_COMMIT_ARG: