Regular expressions are a powerful way of specifying complex search and replace operations. ne supports the full regular expression syntax on US-ASCII and 8-bit buffers, but has to impose a restriction on character sets when searching in UTF-8 text. See UTF-8 Support.
The following section is taken (with minor modifications) from the GNU regular expression library documentation and is Copyright © Free Software Foundation.
A regular expression describes a set of strings. The simplest case is one that describes a particular string; for example, the string ‘foo’ when regarded as a regular expression matches ‘foo’ and nothing else. Nontrivial regular expressions use certain special constructs so that they can match more than one string. For example, the regular expression ‘foo|bar’ matches either the string ‘foo’ or the string ‘bar’; the regular expression ‘c[ad]*r’ matches any of the strings ‘cr’, ‘car’, ‘cdr’, ‘caar’, ‘cadddar’ and all other such strings with any number of ‘a’'s and ‘d’'s.
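As a rough illustration of the two examples above, the following sketch checks which strings each expression describes. It uses Python's `re` module purely as a stand-in: ne has its own regular expression engine, and the only assumption here is that these two particular patterns are written the same way in both. The helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    # Assumption: Python's re stands in for ne's engine for these patterns.
    return re.fullmatch(pattern, text) is not None

# 'foo|bar' describes exactly two strings.
alternatives = [s for s in ["foo", "bar", "foobar", "fo"] if matches("foo|bar", s)]

# 'c[ad]*r' describes 'c', then any number of 'a's and 'd's, then 'r'.
candidates = ["cr", "car", "cdr", "caar", "cadddar", "cbr", "ca"]
cxr = [s for s in candidates if matches("c[ad]*r", s)]

print(alternatives)  # ['foo', 'bar']
print(cxr)           # ['cr', 'car', 'cdr', 'caar', 'cadddar']
```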
Regular expressions have a syntax in which a few characters are special constructs and the rest are ordinary. An ordinary character is a simple regular expression which matches that character and nothing else. The special characters are ‘$’, ‘^’, ‘.’, ‘*’, ‘+’, ‘?’, ‘[’, ‘]’, ‘(’, ‘)’, ‘|’ and ‘\’. Any other character appearing in a regular expression is ordinary, unless a ‘\’ precedes it.
For example, ‘f’ is not a special character, so it is ordinary, and therefore ‘f’ is a regular expression that matches the string ‘f’ and no other string. (It does not match the string ‘ff’.) Likewise, ‘o’ is a regular expression that matches only ‘o’.
Any two regular expressions a and b can be concatenated. The result is a regular expression that matches a string if a matches some amount of the beginning of that string and b matches the rest of the string.
As a simple example, we can concatenate the regular expressions ‘f’ and ‘o’ to get the regular expression ‘fo’, which matches only the string ‘fo’. Still trivial.
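The definition of concatenation can be spelled out directly: try every way of splitting the string in two, and accept if a matches the first part and b matches the rest. The sketch below does this and compares the result with matching the joined pattern. It is only an illustration in Python, not how ne or the GNU library implements matching; the function names are hypothetical, and `(?:...)` is Python's own non-capturing group syntax, used here only to keep each operand intact when the two patterns are joined.

```python
import re

def concat_matches(a: str, b: str, text: str) -> bool:
    """Try every split point: a must match the prefix, b the remainder."""
    return any(
        re.fullmatch(a, text[:i]) and re.fullmatch(b, text[i:])
        for i in range(len(text) + 1)
    )

def joined_matches(a: str, b: str, text: str) -> bool:
    """Match against the concatenated pattern directly."""
    # (?:...) is Python-specific; it groups without numbering the group.
    return re.fullmatch(f"(?:{a})(?:{b})", text) is not None

# Both views agree on the 'f' + 'o' example from the text.
for text in ["fo", "f", "o", "foo", "of"]:
    assert concat_matches("f", "o", text) == joined_matches("f", "o", text)

print(concat_matches("f", "o", "fo"))   # True
print(concat_matches("f", "o", "foo"))  # False
```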
Note: special characters are treated as ordinary ones if they are in contexts where their special meanings make no sense. For example, ‘*foo’ treats ‘*’ as ordinary since there is no preceding expression on which the ‘*’ can act. It is poor practice to depend on this behaviour; better to quote the special character anyway, regardless of where it appears.
The following are the characters and character sequences that have special meaning within regular expressions. Any character not mentioned here is not special; it stands for exactly itself for the purposes of searching and matching.
The case of zero ‘o’'s is allowed: ‘fo*’ does match ‘f’.
‘*’ always applies to the smallest possible preceding
expression. Thus, ‘fo*’ has a repeating ‘o’, not a repeating
‘fo’.
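The scope of ‘*’ can be checked with a small sketch. Python's `re` module is assumed here as a stand-in for ne's engine (the two agree on these simple patterns), and the helper `full` is a hypothetical name. The second pattern uses a ‘( ... )’ grouping to make the whole ‘fo’ repeat, for contrast.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# 'fo*': the star repeats only the 'o', and zero 'o's are allowed.
star_on_o = [s for s in ["f", "fo", "fooo", "fofo"] if full("fo*", s)]

# '(fo)*': the grouping makes the star repeat the whole 'fo'.
star_on_fo = [s for s in ["", "fo", "fofo", "fooo"] if full("(fo)*", s)]

print(star_on_o)   # ['f', 'fo', 'fooo']
print(star_on_fo)  # ['', 'fo', 'fofo']
```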
Character ranges can also be included in a character set, by writing two characters with a ‘-’ between them. Thus, ‘[a-z]’ matches any lower-case letter. Ranges may be intermixed freely with individual characters, as in ‘[a-z$%.]’, which matches any lower-case letter or ‘$’, ‘%’ or period.
Note that the usual special characters are not special any more inside a character set. A completely different set of special characters exists inside character sets: ‘]’, ‘-’ and ‘^’.
To include a ‘]’ in a character set, you must make it the first character. For example, ‘[]a]’ matches ‘]’ or ‘a’. To include a ‘-’, you must use it in a context where it cannot possibly indicate a range: that is, as the first character, or immediately after a range.
Note that when searching in UTF-8 text, a character set may contain
US-ASCII characters only.
‘^’ is not special in a character set unless it is the first character.
The character following the ‘^’ is treated as if it were first (it may
be a ‘-’ or a ‘]’).
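The rules for ‘]’, ‘-’ and ‘^’ inside a character set are easy to get wrong, so a short sketch may help. It assumes Python's `re` module as a stand-in for ne's engine; Python happens to follow the same placement rules for these characters, but that is a property of Python, not something the ne documentation states. The helper `chars_matching` is a hypothetical name. A leading ‘^’ is taken to complement the set, as in the GNU syntax.

```python
import re

def chars_matching(char_set: str, candidates: str) -> str:
    """Return the characters of `candidates` that the set matches."""
    return "".join(c for c in candidates if re.fullmatch(char_set, c))

# Range mixed with individual characters; '$' and '.' are ordinary here.
print(chars_matching("[a-z$%.]", "aZ$%.*"))  # a$%.

# ']' placed first is a member of the set.
print(chars_matching("[]a]", "]a["))         # ]a

# '-' placed first cannot indicate a range.
print(chars_matching("[-a]", "-ab"))         # -a

# A leading '^' complements the set; the next character counts as first.
print(chars_matching("[^a-z]", "aZ9"))       # Z9
print(chars_matching("[^]a]", "]ab"))        # b
```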
Because ‘\’ quotes special characters, ‘\$’ is a regular expression that matches only ‘$’, and ‘\[’ is a regular expression that matches only ‘[’, and so on.
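A minimal sketch of quoting, again assuming Python's `re` module as a stand-in for ne's engine. The variable names are hypothetical; the point is only that a quoted special character matches itself and nothing else.

```python
import re

# Unquoted, '$' and '[' are special. Quoted, each matches only itself.
dollar_ok = re.fullmatch(r"\$", "$") is not None
bracket_ok = re.fullmatch(r"\[", "[") is not None

# Searching for the literal text 'a.b*' needs '.' and '*' quoted,
# otherwise '.' would match any character and '*' would repeat the 'b'.
literal = r"a\.b\*"
found = [s for s in ["a.b*", "axb", "a.bbb"] if re.fullmatch(literal, s)]

print(dollar_ok, bracket_ok)  # True True
print(found)                  # ['a.b*']
```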
For the most part, ‘\’ followed by any character matches only that
character. However, there are several exceptions: characters which, when
preceded by ‘\’, are special constructs. Such characters are always
ordinary when encountered on their own.
Thus, ‘foo|bar’ matches either ‘foo’ or ‘bar’ but no other string.
‘|’ applies to the largest possible surrounding expressions. Only a
surrounding ‘( ... )’ grouping can limit the grouping power of
‘|’.
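The effect of a surrounding grouping on ‘|’ can be seen by comparing two patterns on the same word list. This is a sketch using Python's `re` module as an assumed stand-in for ne's engine; the helper `full` and the word list are made up for the illustration.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

words = ["foo", "bar", "fooar", "fobar", "fobr"]

# Ungrouped: '|' splits the whole expression into 'foo' and 'bar'.
ungrouped = [w for w in words if full("foo|bar", w)]

# Grouped: '|' only chooses between 'o' and 'b' inside the parentheses.
grouped = [w for w in words if full("fo(o|b)ar", w)]

print(ungrouped)  # ['foo', 'bar']
print(grouped)    # ['fooar', 'fobar']
```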
This last application is not a consequence of the idea of a parenthetical
grouping; it is a separate feature that happens to be assigned as a second
meaning to the same ‘( ... )’ construct because there is no
conflict in practice between the two meanings. Here is an explanation of
this feature:
The strings matching the first nine ‘( ... )’ constructs appearing in a regular expression are assigned numbers 1 through 9 in order of their beginnings. ‘\1’ through ‘\9’ may be used to refer to the text matched by the corresponding ‘( ... )’ construct.
For example, ‘(.+)\1’ matches any nonempty string that is composed of
two identical halves. The ‘(.+)’ matches the first half, which may be
anything nonempty, but the ‘\1’ that follows must match exactly the same
text.
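The ‘(.+)\1’ example can be tried out with the sketch below. It assumes Python's `re` module, which uses the same ‘\1’ notation for back references; ne's own engine is not involved. The function name `is_doubled` is hypothetical.

```python
import re

def is_doubled(text: str) -> bool:
    """True if `text` is a nonempty string made of two identical halves."""
    return re.fullmatch(r"(.+)\1", text) is not None

results = {s: is_doubled(s) for s in ["abab", "aa", "abcabc", "aba", "a", ""]}
for s, ok in results.items():
    print(repr(s), ok)
```

Only ‘abab’, ‘aa’ and ‘abcabc’ are accepted: the others cannot be split into a first half and an identical second half.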
The replacement string also has some special features when doing a regular expression search and replace. Exactly as during the search, ‘\’ followed by a digit stands for “the text matched by the ‘( ... )’ construct with that number in the search expression”. Moreover, ‘\0’ represents the whole string matched by the regular expression. Thus, for instance, the replacement string ‘\0\0’ has the effect of doubling any string matched.
Another example: if you search for ‘(a+)(b+)’, replacing with ‘\2x\1’, you will match any string composed of a series of ‘a’'s followed by a series of ‘b’'s, and you will replace it with the string obtained by moving the ‘a’'s after the ‘b’'s and inserting an ‘x’ in between. For instance, ‘aaaab’ will be matched and replaced by ‘bxaaaa’.
Note that the backslash character can escape itself. Thus, to put a backslash in the replacement string, you have to use ‘\\’.
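The replacement rules above can be approximated in Python as a sketch. Note the assumptions: Python's `re.sub` stands in for ne's search and replace, and Python's replacement syntax differs in one respect, namely that the whole match is written ‘\g<0>’ rather than ‘\0’. The ‘\1’, ‘\2’ and ‘\\’ forms behave the same way in both.

```python
import re

# ne's '\2x\1' carries over to Python unchanged.
swapped = re.sub(r"(a+)(b+)", r"\2x\1", "aaaab")

# ne's '\0\0': Python spells the whole match as \g<0>, not \0.
doubled = re.sub(r"fo+", r"\g<0>\g<0>", "a foo b")

# '\\' in the replacement string produces a single backslash.
slashes = re.sub(r"/", r"\\", "a/b")

print(swapped)  # bxaaaa
print(doubled)  # a foofoo b
print(slashes)  # a\b
```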