A pedant that hangs out in the dark corner-cases of the web.

Tuesday, May 06, 2008

Visual Studio's NIH RegEx Syntax

Here's a quick phrasebook for Visual Studio's NIH RegEx syntax:

VS Editor RegEx Syntax  Real RegEx Syntax*              Meaning
{}                      ()                              tagged / captured submatch
()                      (?:)                            non-capturing submatch
(none)                  (?=)                            lookahead assertion
~()                     (?!)                            negative lookahead assertion / prevent match
(none)                  (?<=)                           lookbehind assertion
(none)                  (?<!)                           negative lookbehind assertion
(none)                  (?>)                            nonbacktracking (greedy) subexpression
<                       \<                              start of word
>                       \>                              end of word
\<                      <                               matches < character
\>                      >                               matches > character
(<|>)                   \b                              word boundary
~(<|>)                  \B                              not a word boundary
(none)                  ?                               zero-or-one quantifier
(none)                  ??                              minimal zero-or-one quantifier
@                       *?                              minimal zero-or-more quantifier
\@                      @                               matches @ character
#                       +?                              minimal one-or-more quantifier
\#                      #                               matches # character
^n                      {n}                             match n times quantifier
\^                      ^                               matches ^ character
(none)                  {n,}                            match at least n times quantifier
(none)                  {m,n}                           match between m and n times (inclusive) quantifier
(none)                  {n,}?                           minimally match at least n times quantifier
(none)                  {m,n}?                          minimally match between m and n times (inclusive) quantifier
\(w,n)                  (none)                          (replacement expression) left-pad captured group n to w characters
\(-w,n)                 (none)                          (replacement expression) right-pad captured group n to w characters
\g                      \a                              alert / bell
\h                      [\b]                            backspace
\:                      :                               matches : character
:i                      ([a-zA-Z_$][a-zA-Z0-9_$]*)      identifier
:q                      (("[^"]*")|('[^']*'))           quoted string
:h                      ([0-9A-Fa-f]+)                  hexadecimal number (not including any prefix, e.g. 0x or \x or \u)
:n                      ((\d+\.\d*)|(\d*\.\d+)|(\d+))   rational number
:w                      (\p{L}+)                        letters
:b                      [ \t]                           space or tab (like \s without \n or \v)
:z                      \d+                             integer (one or more decimal digits)
:a                      \w                              word / alphanumeric character
[^:a]                   \W                              non-word / non-alphanumeric character
:c                      \p{L}                           letter character (like \w without the _)
:d                      \d                              decimal digit
[^:d]                   \D                              non-decimal-digit character
:U                      \p{U}                           matches Unicode character category U
(none)                  \p{IsBlock}                     matches characters in Unicode named block Block
[^:U]                   \P{U}                           does not match Unicode character category U
(none)                  \P{IsBlock}                     does not match characters in Unicode named block Block
:Al                     \p{L}                           letter
:Nu                     \d                              decimal digit
:Pu                     \p{P}                           punctuation character
:Wh                     \s                              whitespace character
[^:Wh]                  \S                              non-whitespace character
:Bi                     ?                               bidirectional character
:Ha                     \p{IsHangulJamo}                Korean Hangul and combining Jamos
:Hi                     \p{IsHiragana}                  hiragana character
:Ka                     \p{IsKatakana}                  katakana character
:Id                     ?                               ideographic characters, such as Han and kanji

*This is the syntax supported by everything else, including the .NET System.Text.RegularExpressions library. For some reason, Microsoft decided to create a new syntax just for the Visual Studio editor! But they just don't have the resources to implement "your favorite standard".
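
To make a few of the rows concrete, here is a small sketch of the "real" side of the phrasebook using Python's re module. Python is my choice for illustration only, not something Visual Studio or .NET uses, and its re module does not accept \<, \> or \p{...}, so only rows it does support appear here. The sample strings are made up.

```python
import re

# VS `@` (minimal zero-or-more) corresponds to `*?`:
# the lazy form stops at the first `>` it can, the greedy form runs to the last.
lazy = re.search(r"<.*?>", "<a><b>").group(0)
greedy = re.search(r"<.*>", "<a><b>").group(0)

# VS `~(bar)` (prevent match) corresponds to the negative lookahead `(?!bar)`.
not_foobar = re.findall(r"foo(?!bar)\w*", "foobar foobaz food")

# VS `(<|>)` (word boundary) corresponds to `\b`.
whole_words = re.findall(r"\bcat\b", "cat concat cats")

# VS `[0-9]^4` (match n times) corresponds to `[0-9]{4}`.
is_year = re.fullmatch(r"[0-9]{4}", "2008") is not None

print(lazy, greedy, not_foobar, whole_words, is_year)
```

The lazy search returns `<a>` while the greedy one returns `<a><b>`, which is the whole difference between `@` and `*` on the Visual Studio side.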

Anything that's not on the list should work the same way in both syntaxes.
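
If you would rather not translate by hand, the phrasebook can be mechanized. What follows is only a rough sketch in Python covering a handful of rows. The function name `vs_to_standard` and the lookup tables are hypothetical names of my own, not part of any Visual Studio or .NET API. It makes several simplifying assumptions: character classes are copied through untouched (apart from `[^:a]` and `[^:d]`), `<` and `>` are left alone because Python's re has no `\<` or `\>`, and the Unicode shorthands are skipped.

```python
import re

# Hypothetical lookup tables covering only part of the phrasebook.
ESCAPES = {r"\@": "@", r"\#": "#", r"\:": ":", r"\<": "<", r"\>": ">"}
SHORTHANDS = {
    ":i": r"([a-zA-Z_$][a-zA-Z0-9_$]*)",
    ":q": r"""(("[^"]*")|('[^']*'))""",
    ":h": r"([0-9A-Fa-f]+)",
    ":b": r"[ \t]",
    ":z": r"\d+",
    ":a": r"\w",
    ":d": r"\d",
}
NEGATED = {"[^:a]": r"\W", "[^:d]": r"\D"}
SIMPLE = {"{": "(", "}": ")", "(": "(?:", "~(": "(?!", "@": "*?", "#": "+?"}

# Order matters: escapes first, then whole character classes, then operators.
# `^n` is only treated as a quantifier when something precedes it.
TOKEN = re.compile(r"\\.|\[[^\]]*\]|~\(|(?<=.)\^\d+|:[a-z]|[{}(@#]")


def vs_to_standard(pattern: str) -> str:
    """Translate a subset of VS editor regex syntax to standard syntax."""
    def convert(match: re.Match) -> str:
        tok = match.group(0)
        if tok.startswith("\\"):
            return ESCAPES.get(tok, tok)      # unknown escapes pass through
        if tok.startswith("["):
            return NEGATED.get(tok, tok)      # other classes copied verbatim
        if tok.startswith("^"):
            return "{" + tok[1:] + "}"        # ^n  ->  {n}
        if tok in SHORTHANDS:
            return SHORTHANDS[tok]
        return SIMPLE.get(tok, tok)
    return TOKEN.sub(convert, pattern)


print(vs_to_standard(r"~(foo)x^3\@"))   # prints (?!foo)x{3}@
```

Note that shorthands such as `:i` bring their own parentheses, so a translated pattern can have more capture groups than the `{}` pairs in the original suggest.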
