Parsing direction of greedy regular expressions

2024-03-09 10:09:23

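I have found two different views on how a greedy regular expression is executed:

> One says the engine reads the input string from the back: it first consumes the whole input, so the first match attempt is the entire string. An article supporting this view is the official Oracle Java tutorial: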

Greedy quantifiers are considered “greedy” because they force the
matcher to read in, or eat, the entire input string prior to
attempting the first match. If the first match attempt (the entire
input string) fails, the matcher backs off the input string by one
character and tries again, repeating the process until a match is
found or there are no more characters left to back off from.

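See also this article: Performance of Greedy vs. Lazy Regex Quantifiers

> The other says matching starts from the front: the first match attempt begins at index 0 on the left. When a match is found the engine does not stop; it keeps matching the rest until it fails, and then it backtracks. An article I found that supports this view is:

Repetition with Star and Plus, whose "Looking Inside the Regex Engine" section discusses <.+>: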

The first token in the regex is <. This is a literal. As we already
know, the first place where it will match is the first < in the
string.

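Which one is correct? It matters because it affects the efficiency of a regular expression. I added tags for several languages because I would like to know whether the implementations differ from one language to another.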

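Solution:

"The matcher backs off the input string by one character and tries again" is simply a description of backtracking, so "then it will backtrack" says the same thing. Since both of your statements about greediness say the same thing, both are correct. (Your third quote has nothing to do with greediness.) The two descriptions differ only in emphasis: the engine tries start positions from left to right, and at each start position a greedy quantifier first takes as much as it can and then gives characters back one at a time.

Let's work through an example.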

'xxabbbbbxxabbbbbbbbb' =~ /([ab]*)bb/;

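> Try at pos 0.

> [ab]* matches 0 characters "".

> At pos 0, bb fails to match ⇒ backtrack.

> [ab]* has no alternatives left ⇒ backtrack.

> Try at pos 1.

> [ab]* matches 0 characters "".

> At pos 1, bb fails to match ⇒ backtrack.

> [ab]* has no alternatives left ⇒ backtrack.

> Try at pos 2.

> [ab]* matches 6 characters "abbbbb".

> At pos 8, bb fails to match ⇒ backtrack.

> [ab]* matches 5 characters "abbbb". (Back off one.)

> At pos 7, bb fails to match ⇒ backtrack.

> [ab]* matches 4 characters "abbb". (Back off one.)

> At pos 6, bb matches.

> Success.

So $1 is "abbb". (Not "abbbbbbb": "greedy" does not mean "longest possible match".)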
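The same result can be checked in another backtracking engine. The snippet below is a small sketch using Python's `re` module rather than Perl (an assumption made for illustration; the variable names are not from the original answer). It shows that the first successful start position wins, and that a longer capture exists further right but is never reached.

```python
import re

s = "xxabbbbbxxabbbbbbbbb"
pattern = re.compile(r"([ab]*)bb")

# search() tries start positions from left to right and returns
# the first one at which the whole pattern succeeds.
first = pattern.search(s)
print(first.start(), repr(first.group(1)))

# For comparison only: anchor an attempt at every start position and
# keep the longest capture. This is what "longest possible match"
# would have meant, and it is not what the engine reports.
captures = []
for pos in range(len(s)):
    m = pattern.match(s, pos)
    if m:
        captures.append(m.group(1))
longest = max(captures, key=len)
print(repr(longest))
```

The quantifier is greedy only within the attempt that starts at position 2; it has no influence on which start position is tried first.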

Now let's see what happens if we make the "*" non-greedy.

'xxabbbbbxxabbbbbbbbb' =~ /([ab]*?)bb/;

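> Try at pos 0.

> [ab]*? matches 0 characters "".

> At pos 0, bb fails to match ⇒ backtrack.

> [ab]*? can't match any more characters ⇒ backtrack.

> Try at pos 1.

> [ab]*? matches 0 characters "".

> At pos 1, bb fails to match ⇒ backtrack.

> [ab]*? can't match any more characters ⇒ backtrack.

> Try at pos 2.

> [ab]*? matches 0 characters "".

> At pos 2, bb fails to match ⇒ backtrack.

> [ab]*? matches 1 character "a". (Add one.)

> At pos 3, bb matches.

> Success.

So $1 is "a".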

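A specific implementation may do things differently as an optimization, as long as it gives the same result as presented here. You can watch Perl at work using: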

perl -Mre=debug -E'say "xxabbbbbxxabbbbbbbbb" =~ /([ab]*)bb/;'
perl -Mre=debug -E'say "xxabbbbbxxabbbbbbbbb" =~ /([ab]*?)bb/;'
