$Revision: 5.0.2.3 $
A regular expression is a pattern that matches one or more strings. Technically, the patterns we support go beyond regular expressions, but we will continue to call them regular expressions. Allegro CL provides regular expression (regexp) functionality as described in this document.
In a regular expression string, characters are either normal or special. A normal character stands for itself; a special character has a meaning described below.
There are two contexts in a regular expression: characters within a [...] expression and those elsewhere.
Special characters outside a [...] are: . * [ ] \ + ^ $
These characters can be made normal characters by preceding them with a backslash. Some of these characters are only special in certain places in the string (see below for the details).
Special characters within a [...] are: ^ ] -
These characters can only be made non-special by placing them in a certain position in the [...] expression (more details below). Note that backslash is not a special character in this context.
There are certain characters that when preceded by a backslash outside of a [...] expression turn into special characters. Those characters are: ( ) | w W b B 0 1 2 3 4 5 6 7 8 9.
A regular expression is defined as:
a | Where a is any non-special character, matches itself. |
xy | Where x and y are regular expressions, matches the concatenation of the regular expressions. |
m* | A single character regular expression followed by * matches zero or more occurrences of m. If there is a choice, it always matches the longest sequence of m's. |
m+ | A single character expression followed by + matches one or more occurrences of m. If there is a choice, it always matches the longest sequence of m's. |
. | A period matches any character (except newline by default; see the :newlines-special argument of the match-regexp function). |
^ | If this is the first character of the regular expression string it forces the match to start at the beginning of the to-be-matched string. If this character appears after the beginning of the string it stands for itself. |
$ | If this is the last character of the regular expression string it forces the match to end at the end of the to-be-matched string. If this character appears before the end of the regular expression string, it stands for itself. |
[..] | This matches exactly one character from the set of characters denoted by the pattern inside the brackets. This pattern has a different form than elsewhere in the regular expression: [abcs-y3-8] matches a, b or c, or s through y, or 3 through 8. You can invert the set by using a caret as the first character: [^a-z] matches any character not in the range a through z. To include the right bracket in the set, list it first (or immediately after the caret): []ab] matches a, b or the right bracket, and [^]ab] matches any character except a, b or the right bracket. To match a hyphen, put it first or last: [b-] matches b or a hyphen. To match a caret, put it anywhere other than first: [b^] matches b or a caret. |
\(x\) | This grouping syntax matches whatever x matches, and at the same time remembers what x matched. There can be up to 9 groups defined in a regular expression string. Each group is given a number from 1 to 9 based on the order in which the groups appear in the pattern string. When the match is made, the value of each group can be returned by the match-regexp function. |
x\|y | This tries to match x, and if that fails it tries to match y. To control what constitutes x and y you can use the \( \) grouping. For example, abc\|def means abc or def, whereas a\(bc\|de\)f means a followed by bc or de followed by f. |
\n | If n is 1 through 9 then this stands for the string matched by group n. If there is no string assigned to group n then the match fails. There is no group 0, so the form \0 is illegal and an error is signaled when the regular expression is parsed. |
\w | Matches a word character. It is equivalent to [a-zA-Z0-9] . |
\W | Matches anything but a word character. It is equivalent to [^a-zA-Z0-9] . |
\b | Matches a blank character (one of space, form feed, tab and newline). |
\B | Matches anything but a blank character (one of space, form feed, tab and newline). |
\a | For any character a not mentioned above, \a stands for a itself. It is unwise to put in extra backslashes: while \x may stand for just x today, in the future it may have a different meaning. |
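A few of the table's rules can be tried directly at the Lisp prompt. The following is a minimal sketch rather than an authoritative test suite. It assumes Allegro CL with the match-regexp function described later in this document; whether the regexp module must be loaded first with (require :regexp) is an assumption that may depend on the release.

```lisp
;; Allegro CL only: excl:match-regexp is the function documented below.
;; Assumption: some releases may need (require :regexp) before these calls.

;; A bracket set followed by +: one or more characters from s through y.
(excl:match-regexp "[s-y]+" "xyz" :return nil)

;; The right bracket is listed first in the set, so it is an ordinary member.
(excl:match-regexp "[]ab]" "]" :return nil)

;; Group 1 remembers a run of a's; \1 must then match that same text again.
;; Each backslash is doubled because the pattern is a Lisp string constant.
(excl:match-regexp "\\(a+\\)b\\1" "aabaa")
```

Read against the table above, all three calls should succeed, and the third should report aa as the text remembered by group 1.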
When typing a regular expression in Lisp source code, keep in mind that representing a backslash in a string constant takes two backslashes. The Lisp reader reads "foo\+" as "foo+", when what you probably wanted was "foo\\+" (where the backslash in front of the + removes its special meaning so that the pattern matches the string foo+).
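The effect of the Lisp reader on backslashes can be checked without calling any regexp function. The sketch below uses only standard Common Lisp; the variable names are hypothetical and exist only for illustration.

```lisp
;; Standard Common Lisp: only the reader is being exercised here.
(defparameter *one-backslash* "foo\+")     ; the reader drops the backslash
(defparameter *two-backslashes* "foo\\+")  ; the reader keeps one backslash

(length *one-backslash*)            ; 4 characters: f o o +
(length *two-backslashes*)          ; 5 characters: f o o \ +
(char *two-backslashes* 3)          ; the backslash character
(string= *one-backslash* "foo+")    ; true: the two strings are identical
```

Only the second string carries a backslash through to the regular expression parser, so only it makes the + a normal character.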
The + and * characters must follow a single character regular expression. They cannot follow a group expression, even if that group matches just one character. In other words, \(a\)* is not legal. [a]* is legal since the [..] expression always matches one character.
Compatibility with other regular expression parsers: the UNIX version (on Solaris) supports x\{m,n\} meaning between m and n occurrences of x. It also supports \< and \> as matching word beginnings and word endings, where a word is a C identifier. The UNIX version does not support the + special character (possibly since you can get it with \{1,\}).
The Linux version supports x? meaning 0 or 1 occurrence of x. We can get that with \(x\|\).
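The \(x\|\) idiom for an optional item can be sketched as follows. This assumes Allegro CL and the match-regexp function described below; the variable name is hypothetical.

```lisp
;; Allegro CL only. The pattern reads: colo, then either u or nothing, then r.
;; Backslashes are doubled because the pattern is a Lisp string constant.
(defparameter *color-pattern* "colo\\(u\\|\\)r")  ; hypothetical name

(excl:match-regexp *color-pattern* "color" :return nil)   ; empty alternative used
(excl:match-regexp *color-pattern* "colour" :return nil)  ; u alternative used
```

Following the description of \| above, both calls should succeed: the first alternative is tried, and the empty alternative is used when it fails.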
The GNU Emacs regular expression parser supports a lot of additional features.
Two functions are supported in this facility. Each is documented on its own page. We repeat most of the information here for reading convenience.
excl:compile-regexp
[function]
Arguments: regexp
Compiles the string regexp into a regular expression object and returns that object. If there are syntax errors in the string, an error will be signaled.
excl:match-regexp
[function]
Arguments: regexp match-string &key (newlines-special t) case-fold shortest (return :string) (start 0) (end (length match-string))
The regexp argument is either a regular expression object (the result of compile-regexp) or a string (in which case it will be compiled into a regular expression object). The match-string is a string to match against the regular expression. The function will attempt to match the regular expression against the match-string starting at the first character of the match-string, and if that fails it will look inside the match-string for a match (unless the regular expression begins with a caret).
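The search behavior just described can be sketched as follows, assuming Allegro CL. The expected results in the comments are read off the description above, not taken from a recorded session.

```lisp
;; Allegro CL only.

;; No match at the first character, so the matcher looks further inside
;; the string: expected values are t and "bbb".
(excl:match-regexp "b+" "aabbbcc")

;; A leading caret forbids searching past the start: expected value is nil.
(excl:match-regexp "^b+" "aabbbcc")

;; Anchored, and the string does begin with a's: expected t and "aa".
(excl:match-regexp "^a+" "aabbbcc")
```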
The keyword arguments are:
:newlines-special | If true then a newline will not match the . regular expression. This is useful to prevent multiline matches. |
:case-fold | If true then the match-string is effectively mapped to lower case before doing the match. Thus lower case characters in the regular expression match either case and upper case characters match only upper case characters. |
:return | The return value from a failed match is nil. If the value of return is :string then a successful match returns multiple values. The first value is t. The second value is the substring of the match-string that matched the regular expression. The third value (if any) is the substring that matched group 1, the fourth value is the substring that matched group 2, and so on. If you use the \| form, then some groups may have no associated match, in which case nil will be returned as that value. In highly nested \| forms, a group may return a match string even though that group had no match in the final match. If the value of return is :index then it is just like :string except that instead of each string, a cons is returned giving the start and end indices of the match in the original match-string. The end index is one greater than the index of the last character in the substring. If the value of return is nil then the single value t is returned when the match succeeds. |
:start | The first character in the match-string to match against. |
:end | One past the last character in the match-string to match against. |
:shortest | This makes match-regexp return the shortest rather than the longest match. One motivation for this is parsing HTML. Suppose you want to search for the next item in italics, which in HTML looks like <i>foo</i>. If you do (match-regexp "<i>.*</i>" string) and the string is <i>foo</i> and <i>bar</i>, then you will match the whole string, including the non-italic part. However, if you use the :shortest keyword argument then you will only match the <i>foo</i> part. |
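The keyword arguments can be combined as in the sketch below, which assumes Allegro CL. The sample text and the variable name are hypothetical, and the expected results in the comments are read off the table above.

```lisp
;; Allegro CL only. Hypothetical sample text, 25 characters long.
(defparameter *html* "<i>foo</i> and <i>bar</i>")

;; Default: longest match, so the second value should be the whole string.
(excl:match-regexp "<i>.*</i>" *html*)

;; :shortest t should stop at the first closing tag: "<i>foo</i>".
(excl:match-regexp "<i>.*</i>" *html* :shortest t)

;; :return :index gives a cons of start and end indices in place of each
;; substring; for the shortest match the first cons should be (0 . 10).
(excl:match-regexp "<i>.*</i>" *html* :shortest t :return :index)

;; :case-fold t lets lower case pattern characters match either case.
(excl:match-regexp "foo" "Some FOO here" :case-fold t :return nil)
```

Passing :return nil, as in the last call, is the cheapest choice when only success or failure matters, since no substrings or conses are returned.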
Compilation note: there is a compiler macro defined for match-regexp that handles specially those match-regexp calls whose first argument is a constant string. That is, the form (match-regexp "foo" x) will compile to code that arranges to call compile-regexp on the string once, when the compiled (fasl) file is loaded. Since the cost of compile-regexp is high, this saves a lot of time.
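When the pattern is not a constant string, the compiler macro has nothing to work with, and a similar saving can be had by calling compile-regexp once and reusing the result. The function below is a hypothetical helper written for illustration, assuming Allegro CL; it is not part of the regexp facility.

```lisp
;; Allegro CL only. count-matching-lines is a hypothetical helper.
(defun count-matching-lines (pattern-string lines)
  "Count the strings in LINES that contain a match for PATTERN-STRING."
  ;; Compile once, outside the loop, because compile-regexp is costly.
  (let ((compiled (excl:compile-regexp pattern-string)))
    (count-if (lambda (line)
                ;; :return nil asks only for success or failure.
                (excl:match-regexp compiled line :return nil))
              lines)))

;; Usage: two of the three strings contain an a followed by one or more b's.
(count-matching-lines "ab+" '("xabbx" "none" "ab"))
```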
Copyright (C) 1998-1999, Franz Inc., Berkeley, CA. All Rights Reserved.