YuPcre2 1.9.2 D7-D10.3 Rio

YuPcre2 is a library of Delphi components and procedures that implement regular expression pattern matching using the same syntax and semantics as Perl, with just a few differences. There are two matching algorithms, the standard Perl algorithm and an alternative DFA algorithm:

The Perl algorithm is what you are used to from Perl and JavaScript. It is fast and supports the complete pattern syntax. You will likely be using it most of the time.
DFA is a special purpose algorithm. It finds all possible matches and, in particular, it finds the longest. It never backtracks and supports partial matching better, in particular multi-segment matching of very long subject strings.
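
The difference between the two result sets can be sketched in a few lines. The example below uses Python's standard `re` module purely as an illustration, not the YuPcre2 API; the helper names `first_match` and `all_matches_at_start` are hypothetical, and the "all matches" result is emulated by brute force over prefixes rather than by a real DFA.

```python
import re

def first_match(pattern, subject):
    """Perl-style result: the first match the backtracking matcher finds."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

def all_matches_at_start(pattern, subject):
    """DFA-style result set: every match starting at offset 0, longest first.

    Emulated by testing each prefix of the subject with fullmatch; a real
    DFA matcher finds these in a single pass without backtracking.
    """
    compiled = re.compile(pattern)
    return [subject[:end]
            for end in range(len(subject), -1, -1)
            if compiled.fullmatch(subject[:end])]

pattern = r"cat|category|cate"
subject = "category theory"

print(first_match(pattern, subject))           # cat
print(all_matches_at_start(pattern, subject))  # ['category', 'cate', 'cat']
```

The backtracking matcher stops at the first alternative that succeeds, while the emulated DFA result lists every alternative that matches, with the longest one first.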

YuPcre2 has native interfaces for 8-bit, 16-bit, and 32-bit strings. Component wrappers are available for UnicodeString / WideString and AnsiString / Utf8String / RawByteString.

The YuPcre2 RegEx2 classes descend from common ancestors which implement the core functionalities:

Match strings and extract full or substring matches.
Search for regular expressions within streams and memory buffers. TDIRegExSearchStream descendants employ a buffered search within streams and files (of virtually unlimited size) and use little memory.
Replace full matches or partial substrings.
List full matches or partial substrings.
Format full matches or partial substrings by adding static or dynamic text.
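
A buffered search over a stream can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not how TDIRegExSearchStream is implemented: the function name `search_stream` and the parameters `chunk_size` and `max_match_len` are hypothetical, and the sketch assumes no match is longer than `max_match_len` characters.

```python
import io
import re

def search_stream(pattern, stream, chunk_size=4096, max_match_len=256):
    """Yield (offset, text) for each match in a text stream with bounded memory.

    Assumption: no match is longer than max_match_len characters.
    """
    compiled = re.compile(pattern)
    buffer = ""
    offset = 0  # absolute stream offset of buffer[0]
    while True:
        chunk = stream.read(chunk_size)
        at_eof = chunk == ""
        buffer += chunk
        # A match starting in the last max_match_len characters might continue
        # into the next chunk, so it is deferred until more data has arrived.
        safe_end = len(buffer) + 1 if at_eof else max(0, len(buffer) - max_match_len)
        pos = 0
        while pos <= len(buffer):
            m = compiled.search(buffer, pos)
            if m is None or m.start() >= safe_end:
                break
            yield offset + m.start(), m.group(0)
            pos = max(m.end(), m.start() + 1)  # always advance, even on empty matches
        if at_eof:
            return
        # Discard everything that has been fully searched.
        keep_from = max(pos, safe_end)
        buffer = buffer[keep_from:]
        offset += keep_from

text = "alpha id=17 beta id=4242 gamma id=9"
matches = list(search_stream(r"id=\d+", io.StringIO(text), chunk_size=8, max_match_len=16))
print(matches)
```

Only the unsearched tail of the buffer is kept between reads, so memory use depends on `chunk_size` and `max_match_len` rather than on the size of the stream.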

Users familiar with DIRegEx might be interested in the differences between YuPcre2 and DIRegEx.

Pattern Syntax

The YuPcre2 regular expression pattern syntax is mostly compatible with Perl. It includes the following:

Escaped Characters
Character Types
General Category Properties for \p and \P
PCRE2 Special Category Properties for \p and \P
Script Names for \p and \P
Character Classes
Anchors and Simple Assertions
Match Point Reset
Atomic Groups
Option Setting
Newline Convention
What \R Matches
Lookahead and Lookbehind Assertions
Subroutine References (possibly recursive)
Conditional Patterns
Backtracking Control

YuPcre2 RegEx2 String Processing

YuPcre2 can Replace, List, or Format regular expression matches or any of their substrings, which is useful for text editors and word processors. Variable portions of the match can be included in the result text. The full match can be referenced by number, substrings also by name. The character that introduces these references is freely configurable. FormatOptions allow features to be turned on or off as required.

Replace returns the original subject string with matches replaced, similar to but more flexible than Delphi's StringReplace() function.
List collects all string matches into a single string. It extracts multiple phone numbers, e-mail addresses, or URLs, with a single call.
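
The Replace and List operations described above can be illustrated with Python's standard `re` module. This is a sketch of the concepts only; it does not use the YuPcre2 classes, and the reference syntax shown (`\g<name>`) is Python's, not the configurable YuPcre2 syntax. The pattern and sample text are made up for the example.

```python
import re

# Two named substrings: the area code and the rest of the number.
PHONE = re.compile(r"(?P<area>\d{3})-(?P<number>\d{3}-\d{4})")
subject = "Call 555-867-5309 or 555-123-4567 today."

# Replace: return the subject with each match rewritten; substrings are
# referenced by name here, and could equally be referenced by number (\1, \2).
replaced = PHONE.sub(r"(\g<area>) \g<number>", subject)

# List: collect every full match into a single string with one pass.
listed = ", ".join(m.group(0) for m in PHONE.finditer(subject))

print(replaced)  # Call (555) 867-5309 or (555) 123-4567 today.
print(listed)    # 555-867-5309, 555-123-4567
```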

YuPcre2 RegEx2 MaskControls

YuPcre2 includes two regular expression mask edits: TDIRegEx2MaskEdit and TDIRegEx2ComboBox. Both controls validate keyboard input against a regular expression. They work similarly to Delphi’s TMaskEdit, but are more flexible and powerful.

The regular expression mask edits can:

accept / reject specific characters at determined positions;
allow / reject particular characters if they follow defined character(s);
restrict input text to begin / end with exact character(s);
flag incomplete text to show that more input is needed.
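
The three outcomes a mask edit distinguishes (rejected, incomplete, complete) can be sketched as below. This is a deliberately simplified Python illustration with a hypothetical fixed-position mask (`DATE_MASK`) and a hypothetical helper (`classify`); the YuPcre2 controls validate against a full regular expression, which is more general than one pattern per position.

```python
import re

# Hypothetical mask: one single-character pattern per input position (YYYY-MM-DD).
DATE_MASK = [r"\d"] * 4 + ["-"] + [r"\d"] * 2 + ["-"] + [r"\d"] * 2

def classify(text, mask=DATE_MASK):
    """Return 'rejected', 'incomplete', or 'complete' for the text typed so far."""
    if len(text) > len(mask):
        return "rejected"
    for ch, unit in zip(text, mask):
        # Reject as soon as a character does not fit its position.
        if not re.fullmatch(unit, ch):
            return "rejected"
    # Every typed character fits; more input is needed until the mask is full.
    return "complete" if len(text) == len(mask) else "incomplete"

print(classify("2019-0"))      # incomplete
print(classify("2019-01-08"))  # complete
print(classify("20x"))         # rejected
```

An edit control built on this idea would refuse a keystroke that leads to "rejected" and mark the field visually while the state is "incomplete".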

Examples: Numbers, number ranges, dates, phone numbers, e-mail addresses, URLs, currency, and more.

Workbench Application

The YuPcre2 RegEx2 Workbench helps to design and test regular expressions. It lets you set options, measure execution times, and save and load settings for later use.

The YuPcre2 RegEx2 Workbench is available as

Design-Time Component Editor and
Standalone Application.

YuPcre2 1.9.2 – 8 Jan 2019

Matching the pattern (*UTF)\C[^\v]+\x80 against an 8-bit string containing multi-code-unit characters caused bad behaviour and possibly a crash.
When returning an error from pcre2_pattern_convert, ensure the error offset is set to zero for early errors.
Refactored pcre2_dfa_match so that the internal recursive calls no longer use the stack for local workspace and local ovectors. Instead, an initial block of stack is reserved, but if this is insufficient, heap memory is used. The heap limit parameter now applies to pcre2_dfa_match.
In pcre2_substitute, with global matching, a pattern that matched an empty string, but never at the starting match offset, was not handled in a Perl-compatible way. The pattern (?<=\G.) is an example of such a pattern. Because \G is in a lookbehind assertion, there has to be a “bumpalong” before there can be a match. The automatic “advance by one character after an empty string match” rule is therefore inappropriate. A more complicated algorithm has now been implemented.
When checking to see if a lookbehind is of fixed length, lookaheads were correctly ignored, but quantifiers on lookaheads were not being ignored, leading to an incorrect “lookbehind assertion is not fixed length” error.
Updated to Unicode version 11.0.0. As well as the usual addition of new scripts and characters, this involved re-jigging the grapheme break property algorithm because Unicode has changed the way emojis are handled.
Fixed an obscure bug that struck when there were two atomic groups not separated by something with a backtracking point. There could be an incorrect backtrack into the first of the atomic groups. A complicated example is (?>a(*:1))(?>b)(*SKIP:1)x|.* matched against “abc”, where the *SKIP shouldn't find a MARK (because it is in an atomic group), but it did.
(*ACCEPT:ARG), (*FAIL:ARG), and (*COMMIT:ARG) are now supported.
A (*MARK) name was not being passed back for positive assertions that were terminated by (*ACCEPT).
Add support for \N{U+dddd}, but only in Unicode mode.
Add support for (?^) for unsetting all imnsx options.
The PCRE2_EXTENDED (/x) option only ever discarded space characters whose code point was less than 256. Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085, U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by Unicode as “Pattern White Space”. This makes PCRE2 compatible with Perl.
In certain circumstances, option settings within patterns were not being correctly processed. For example, the pattern ((?i)A)(?m)B incorrectly matched “ab”. (The (?m) setting lost the fact that (?i) should be reset at the end of its group during the parse process, but without another setting such as (?m) the compile phase got it right.)
When serializing a pattern, set the memctl, executable_jit, and tables fields (that is, all the fields that contain pointers) to zeros so that the result of serializing is always the same. These fields are re-set when the pattern is deserialized.
In a pattern such as [^\x{100}-\x{ffff}]*[\x80-\xff] which has a repeated negative class with no characters less than 0x100 followed by a positive class with only characters less than 0x100, the first class was incorrectly being auto-possessified, causing incorrect match failures.
If the only branch in a conditional subpattern was anchored, the whole subpattern was treated as anchored, when it should not have been, since the assumed empty second branch cannot be anchored. Demonstrated by test patterns such as (?(1)^())b or (?(?=^))b.
A repeated conditional subpattern that could match an empty string was always assumed to be unanchored. Now it is checked just like any other repeated conditional subpattern, and can be found to be anchored if the minimum quantifier is one or more.
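
The auto-possessification item above, for the pattern [^\x{100}-\x{ffff}]*[\x80-\xff], is easier to follow with a concrete subject. The sketch below uses Python's `re` module rather than PCRE2, so the escapes are written in Python syntax, and the possessive behaviour is emulated with the lookahead-plus-backreference idiom; it only illustrates why treating the first class as possessive makes the match fail.

```python
import re

subject = "ab\x80"

# Normal backtracking: the starred class first consumes all three characters,
# then gives back "\x80" so that the final class can match it.
backtracking = re.compile(r"[^\u0100-\uffff]*[\x80-\xff]")

# Possessive behaviour, emulated: the lookahead captures the longest run and
# the backreference consumes exactly that run, so nothing is ever given back.
possessive = re.compile(r"(?=([^\u0100-\uffff]*))\1[\x80-\xff]")

print(backtracking.search(subject) is not None)  # True
print(possessive.search(subject) is not None)    # False
```

Both classes can match "\x80", so the repeated class must remain able to backtrack; making it possessive is only safe when the two classes have no characters in common.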