diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index b7f64b4f2..65a50fba5 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -4048,9 +4048,18 @@
+The callout function is not called for simulated substitutions that happen as a
+result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. In this mode, when
+substitution processing exceeds the buffer space provided by the caller,
+processing continues by counting code units. The simulation is unable to
+populate the callout block, and so the simulation is pessimistic about the
+required buffer size. The larger of the accepted and the rejected substitution
+is reported as the required size. Therefore, the returned buffer length may be
+an overestimate (without a substitution callout, it is normally an exact
+measurement).
The first argument of the callout function is a pointer to a substitute callout
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index b10f86028..8cd8eeb4d 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -3893,12 +3893,20 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
The pcre2_set_substitution_callout() function can be used to specify a
callout function for pcre2_substitute(). This information is passed in
a match context. The callout function is called after each substitution
- has been processed, but it can cause the replacement not to happen. The
- callout function is not called for simulated substitutions that happen
- as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
+ has been processed, but it can cause the replacement not to happen.
+
+ The callout function is not called for simulated substitutions that
+ happen as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. In
+ this mode, when substitution processing exceeds the buffer space pro-
+ vided by the caller, processing continues by counting code units. The
+ simulation is unable to populate the callout block, and so the simula-
+ tion is pessimistic about the required buffer size. The larger of the
+ accepted and the rejected substitution is reported as the required
+ size. Therefore, the returned buffer length may be an overestimate
+ (without a substitution callout, it is normally an exact measurement).
The first argument of the callout function is a pointer to a substitute
- callout block structure, which contains the following fields, not nec-
+ callout block structure, which contains the following fields, not nec-
essarily in this order:
uint32_t version;
@@ -3909,34 +3917,34 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
uint32_t oveccount;
PCRE2_SIZE output_offsets[2];
- The version field contains the version number of the block format. The
- current version is 0. The version number will increase in future if
- more fields are added, but the intention is never to remove any of the
+ The version field contains the version number of the block format. The
+ current version is 0. The version number will increase in future if
+ more fields are added, but the intention is never to remove any of the
existing fields.
The subscount field is the number of the current match. It is 1 for the
first callout, 2 for the second, and so on. The input and output point-
ers are copies of the values passed to pcre2_substitute().
- The ovector field points to the ovector, which contains the result of
+ The ovector field points to the ovector, which contains the result of
the most recent match. The oveccount field contains the number of pairs
that are set in the ovector, and is always greater than zero.
- The output_offsets vector contains the offsets of the replacement in
- the output string. This has already been processed for dollar and (if
+ The output_offsets vector contains the offsets of the replacement in
+ the output string. This has already been processed for dollar and (if
requested) backslash substitutions as described above.
- The second argument of the callout function is the value passed as
- callout_data when the function was registered. The value returned by
+ The second argument of the callout function is the value passed as
+ callout_data when the function was registered. The value returned by
the callout function is interpreted as follows:
- If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
- STITUTE_GLOBAL is set, processing continues with a search for the next
- match. If the value is not zero, the current replacement is not ac-
- cepted. If the value is greater than zero, processing continues when
- PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero
+ If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
+ STITUTE_GLOBAL is set, processing continues with a search for the next
+ match. If the value is not zero, the current replacement is not ac-
+ cepted. If the value is greater than zero, processing continues when
+ PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero
or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied
- to the output and the call to pcre2_substitute() exits, returning the
+ to the output and the call to pcre2_substitute() exits, returning the
number of matches so far.
Substitution case callouts
@@ -3946,21 +3954,21 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
void *callout_data);
The pcre2_set_substitution_case_callout() function can be used to spec-
- ify a callout function for pcre2_substitute() to use when performing
- case transformations. This does not affect any case insensitivity be-
- haviour when performing a match, but only the user-visible transforma-
+ ify a callout function for pcre2_substitute() to use when performing
+ case transformations. This does not affect any case insensitivity be-
+ haviour when performing a match, but only the user-visible transforma-
tions performed when processing a substitution such as:
pcre2_substitute(..., "\\U$1", ...)
- The default case transformations applied by PCRE2 are reasonably com-
+ The default case transformations applied by PCRE2 are reasonably com-
plete, and, in UTF or UCP mode, perform the basic locale-invariant case
- transformations as specified by Unicode. This is suitable for the in-
- ternal (invisible) case-equivalence procedures used during pattern
+ transformations as specified by Unicode. This is suitable for the in-
+ ternal (invisible) case-equivalence procedures used during pattern
matching, but an application may wish to use more sophisticated locale-
aware processing for the user-visible substitution transformations.
- One example implementation of the callout_function using the ICU li-
+ One example implementation of the callout_function using the ICU li-
brary would be:
static uint32_t icu_case_callout(uint32_t ch, int to, void *)
@@ -3971,15 +3979,15 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
: ch;
}
- The first argument of the case callout function is the Unicode charac-
+ The first argument of the case callout function is the Unicode charac-
ter to transform.
- The second argument is one of the constants PCRE2_SUBSTI-
+ The second argument is one of the constants PCRE2_SUBSTI-
TUTE_CASE_LOWER, PCRE2_SUBSTITUTE_CASE_UPPER, or PCRE2_SUBSTI-
TUTE_CASE_TITLE.
- The third argument is the callout_data supplied to pcre2_set_substi-
- tute_case_callout(), and the return value is the transformed Unicode
+ The third argument is the callout_data supplied to pcre2_set_substi-
+ tute_case_callout(), and the return value is the transformed Unicode
character, which may be equal to the input character.
@@ -3988,56 +3996,56 @@ DUPLICATE CAPTURE GROUP NAMES
int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
- When a pattern is compiled with the PCRE2_DUPNAMES option, names for
- capture groups are not required to be unique. Duplicate names are al-
- ways allowed for groups with the same number, created by using the (?|
+ When a pattern is compiled with the PCRE2_DUPNAMES option, names for
+ capture groups are not required to be unique. Duplicate names are al-
+ ways allowed for groups with the same number, created by using the (?|
feature. Indeed, if such groups are named, they are required to use the
same names.
- Normally, patterns that use duplicate names are such that in any one
- match, only one of each set of identically-named groups participates.
+ Normally, patterns that use duplicate names are such that in any one
+ match, only one of each set of identically-named groups participates.
An example is shown in the pcre2pattern documentation.
- When duplicates are present, pcre2_substring_copy_byname() and
- pcre2_substring_get_byname() return the first substring corresponding
- to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
- SET is returned. The pcre2_substring_number_from_name() function re-
- turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate
+ When duplicates are present, pcre2_substring_copy_byname() and
+ pcre2_substring_get_byname() return the first substring corresponding
+ to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
+ SET returned. The pcre2_substring_number_from_name() function re-
+ turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate
names.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre2_substring_nametable_scan() function. The
- first argument is the compiled pattern, and the second is the name. If
- the third and fourth arguments are NULL, the function returns a group
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre2_substring_nametable_scan() function. The
+ first argument is the compiled pattern, and the second is the name. If
+ the third and fourth arguments are NULL, the function returns a group
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
When the third and fourth arguments are not NULL, they must be pointers
- to variables that are updated by the function. After it has run, they
+ to variables that are updated by the function. After it has run, they
point to the first and last entries in the name-to-number table for the
- given name, and the function returns the length of each entry in code
- units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
+ given name, and the function returns the length of each entry in code
+ units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
no entries for the given name.
The format of the name table is described above in the section entitled
- Information about a pattern. Given all the relevant entries for the
- name, you can extract each of their numbers, and hence the captured
+ Information about a pattern. Given all the relevant entries for the
+ name, you can extract each of their numbers, and hence the captured
data.
FINDING ALL POSSIBLE MATCHES AT ONE POSITION
- The traditional matching function uses a similar algorithm to Perl,
- which stops when it finds the first match at a given point in the sub-
+ The traditional matching function uses a similar algorithm to Perl,
+ which stops when it finds the first match at a given point in the sub-
ject. If you want to find all possible matches, or the longest possible
- match at a given position, consider using the alternative matching
- function (see below) instead. If you cannot use the alternative func-
+ match at a given position, consider using the alternative matching
+ function (see below) instead. If you cannot use the alternative func-
tion, you can kludge it up by making use of the callout facility, which
is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre2_match() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre2_match() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
@@ -4049,27 +4057,27 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount);
- The function pcre2_dfa_match() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
+ The function pcre2_dfa_match() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
subject string just once (not counting lookaround assertions), and does
- not backtrack (except when processing lookaround assertions). This has
- different characteristics to the normal algorithm, and is not compati-
- ble with Perl. Some of the features of PCRE2 patterns are not sup-
+ not backtrack (except when processing lookaround assertions). This has
+ different characteristics to the normal algorithm, and is not compati-
+ ble with Perl. Some of the features of PCRE2 patterns are not sup-
ported. Nevertheless, there are times when this kind of matching can be
- useful. For a discussion of the two matching algorithms, and a list of
+ useful. For a discussion of the two matching algorithms, and a list of
features that pcre2_dfa_match() does not support, see the pcre2matching
documentation.
- The arguments for the pcre2_dfa_match() function are the same as for
+ The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other com-
- mon arguments are used in the same way as for pcre2_match(), so their
+ mon arguments are used in the same way as for pcre2_match(), so their
description is not repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
- keeping track of multiple paths through the pattern tree. More work-
- space is needed for patterns and subjects where there are a lot of po-
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
+ keeping track of multiple paths through the pattern tree. More work-
+ space is needed for patterns and subjects where there are a lot of po-
tential matches.
Here is an example of a simple call to pcre2_dfa_match():
@@ -4089,45 +4097,45 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre2_dfa_match()
- The unused bits of the options argument for pcre2_dfa_match() must be
- zero. The only bits that may be set are PCRE2_ANCHORED,
- PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
+ The unused bits of the options argument for pcre2_dfa_match() must be
+ zero. The only bits that may be set are PCRE2_ANCHORED,
+ PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
- PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
- PCRE2_DFA_RESTART. All but the last four of these are exactly the same
+ PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
+ PCRE2_DFA_RESTART. All but the last four of these are exactly the same
as for pcre2_match(), so their description is not repeated here.
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
- These have the same general effect as they do for pcre2_match(), but
- the details are slightly different. When PCRE2_PARTIAL_HARD is set for
- pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
+ These have the same general effect as they do for pcre2_match(), but
+ the details are slightly different. When PCRE2_PARTIAL_HARD is set for
+ pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
- matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
- return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
- if the end of the subject is reached, there have been no complete
+ matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
+ return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
+ if the end of the subject is reached, there have been no complete
matches, but there is still at least one matching possibility. The por-
- tion of the string that was inspected when the longest partial match
+ tion of the string that was inspected when the longest partial match
was found is set as the first matching string in both cases. There is a
- more detailed discussion of partial and multi-segment matching, with
+ more detailed discussion of partial and multi-segment matching, with
examples, in the pcre2partial documentation.
PCRE2_DFA_SHORTEST
- Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
+ Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
- tive algorithm works, this is necessarily the shortest possible match
+ tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE2_DFA_RESTART
- When pcre2_dfa_match() returns a partial match, it is possible to call
+ When pcre2_dfa_match() returns a partial match, it is possible to call
it again, with additional subject characters, and have it continue with
the same match. The PCRE2_DFA_RESTART option requests this action; when
- it is set, the workspace and wscount options must reference the same
- vector as before because data about the match so far is left in them
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcre2partial documentation.
@@ -4135,8 +4143,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
When pcre2_dfa_match() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -4151,80 +4159,80 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION