CFX_PCRegEx

Abstract

CFX_PCRegEx is a Perl-compatible regular expression (regex) parsing extension tag for the Allaire ColdFusion server software, originally written by Rick Osborne in September of 2000. It is designed to be a replacement for (or supplement to) the existing ColdFusion regex capabilities. Both Find() and Replace() capabilities are available, including backreferences, POSIX expressions, and just about anything else you can do with Perl regexes.

The tag uses the PCRE (Perl Compatible Regular Expression) engine, which was written by Philip Hazel and is copyright by the University of Cambridge, England. For more information on the PCRE engine, see <ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/>.

Tag Attributes

<CFX_PCREGEX SUBJECT="#Subject#" PATTERN="#Pattern#" RESULTS="#ResultVar#" OFFSET="1" COUNT="1" MAXSUBS="ALL" DEBUG="True">

<CFX_PCREGEX SUBJECT="#Subject#" PATTERN="#Pattern#" RESULTS="#ResultVar#" REPLACE="Replacement" COUNT="ALL">

Attribute	Required	Default	Effect/Purpose
Subject	Yes		The string to be matched against.
Pattern	Yes		The regular expression pattern to use on Subject.
Results	Yes		The name of the variable to store the results in.
Offset	No	1	The offset inside the Subject string at which to start the regex match.
Replace	No		The replacement string that will be substituted into the Subject string at the matched locations.
Count	No	1	The maximum number of match attempts made for the Pattern. Use ALL to process every match.
MaxSubs	No	ALL	The maximum number of subexpression matches to return.
Debug	No		Display debugging information. (Only valid with the Debug version of the tag. The Release version will not output any debugging information.)

Returned Variables

Variable	Value
PCRegEx.Time	The amount of time taken to process the regex. (This does not include tag load/unload time.)
PCRegEx.Message	Error message (if any) for the regex.
PCRegEx.Offset	The character offset in the Subject where the error (if any) occurred. This is 0 for no error, and -1 for No Match.
PCRegEx.Version	Version of the tag.
PCRegEx.PCREVersion	Version of the PCRE engine used by the tag.
PCRegEx.PCRELicense	License information for the PCRE engine.
PCRegEx.PCREURL	URL for more information about the PCRE engine.

The variable specified in the Results attribute is set differently depending on the mode of the tag. If a Replace attribute is present, the tag runs in Replace mode; otherwise it runs in Find mode, which is the default.

Find Mode

In Find mode, the result variable is set to a query with the following columns: Match, Sub, Pos, and Len. The Match and Sub columns are only useful when the Count attribute is set to something greater than 1. The Match column contains the number of the matched expression (starting at 1). The Sub column contains the number of the matched subexpression for the current Match. The Pos column is the position in the Subject string where the matching subexpression starts, and the Len column holds its length. When Pos equals 0, the subexpression was not matched. (This differs from the way CF handles subexpressions: CF simply collapses the result set to eliminate unmatched subexpressions.) When Sub equals 0, Pos and Len represent the entire matched expression. The RecordCount for a given result set should equal the number of matches multiplied by one more than the number of subexpressions per match: RecordCount = Matches * (Subexpressions + 1).
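The row layout described above can be sketched outside ColdFusion. The following Python emulation is only an illustration, not the tag's implementation: it uses Python's re module rather than PCRE, the function name find_rows is hypothetical, and reporting a Len of 0 for an unmatched subexpression is an assumption (the text above only specifies Pos).

```python
import re

def find_rows(subject, pattern, count=1, max_subs=None):
    """Build (Match, Sub, Pos, Len) rows as the Find-mode description lays them out.

    count=None stands in for COUNT="ALL"; max_subs=None stands in for MAXSUBS="ALL".
    """
    regex = re.compile(pattern)
    subs = regex.groups if max_subs is None else min(max_subs, regex.groups)
    rows = []
    for match_no, m in enumerate(regex.finditer(subject), start=1):
        if count is not None and match_no > count:
            break
        for sub in range(subs + 1):          # Sub 0 is the entire matched expression
            start = m.start(sub)
            if start == -1:                  # subexpression did not take part in the match
                rows.append((match_no, sub, 0, 0))   # Len of 0 is an assumption
            else:
                rows.append((match_no, sub, start + 1, m.end(sub) - start))  # 1-based Pos
    return rows

rows = find_rows("ab-cd x-", r"(\w+)-(\w+)?", count=None)
for row in rows:
    print(row)
# Two matches, two subexpressions each: 2 * (2 + 1) = 6 rows
print(len(rows))
```

The second match ("x-") leaves the optional second subexpression unmatched, so its row keeps its place in the result set with Pos set to 0 instead of being dropped.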

If you are stuck using a CF-style regex that captures subexpressions that you aren't going to use, you will speed up the execution considerably by setting the MaxSubs attribute to 0.

For any positive result, a few shortcuts can be used. Result.Pos[0] and Result.Len[0] signify the first matched expression. Result.Match[RecordCount] is the total number of matches. Result.Sub[RecordCount] is the number of subexpressions returned for each match. (Remember that this does not include the 0th match.)

Replace Mode

In Replace mode the result variable will always be set to the resultant string. No information about the number of matches or anything like that is set. All you get is the resultant string.

Backreferences to subexpressions can be used in the Replace string, just as with the REReplace() and REReplaceNoCase() standard CFML functions, with one addition: the backreference \0 returns the entire matching expression (like $& in Perl). The backref parser tries to be smart, both to outguess clumsy coders and to be efficient. A pass is made over the Replace string to see whether it contains any actual backrefs. The parser looks for something akin to "\\[[:digit:]][[:digit:]]?", that is, a backslash followed by one or two digits. If such a backref is found, the engine interpolates the Replace string for each matching expression, and you must escape any backslashes that you want to use as actual backslashes. If no valid backrefs are found, you do not need to escape your backslashes. For example, a Replace string of "\1\\\2" would be interpolated, while "\a\\b" would not: in the first case the doubled backslash is an escaped backslash, and in the second case both backslashes are kept as written.
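The two-pass behavior described above can be sketched as follows. This is a Python illustration under stated assumptions, not the tag's actual parser: the names BACKREF, TOKEN, and expand_replacement are hypothetical, a two-digit reference such as \12 is assumed to mean subexpression 12, and a reference to a subexpression that does not exist or did not match is assumed to expand to an empty string.

```python
import re

# Pass 1: is there anything that looks like a backref (backslash + one or two digits)?
BACKREF = re.compile(r"\\\d\d?")
# Pass 2 tokens: an escaped backslash, or a backref.
TOKEN = re.compile(r"\\(\\|\d\d?)")

def expand_replacement(template, match):
    """Return the text that replaces one match, following the heuristic above."""
    if not BACKREF.search(template):
        return template                      # no backrefs: backslashes are kept as written

    def expand_token(token):
        body = token.group(1)
        if body == "\\":
            return "\\"                      # escaped backslash becomes a single backslash
        number = int(body)
        if number > match.re.groups:
            return ""                        # assumption: unknown subexpression expands to nothing
        return match.group(number) or ""     # \0 is the entire matching expression

    return TOKEN.sub(expand_token, template)

m = re.search(r"(\w+)@(\w+)", "user@host")
print(expand_replacement(r"\1\\\2", m))      # interpolated
print(expand_replacement(r"\a\\b", m))       # left untouched
print(expand_replacement(r"[\0]", m))        # whole match
```

Checking for backrefs first means a Replace string with no backrefs is used as written for every match and never has to be interpolated.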

Examples

<!--- From the Allaire book --->
<CFSET data="Some BIG string">
<CFX_PCREGEX SUBJECT="#data#" PATTERN=" [A-Z]+ " RESULTS="bigstring">
<CFIF PCREGEX.MESSAGE IS "">
  <CFOUTPUT>Match found at #bigstring.pos# : #Mid(data,bigstring.pos,bigstring.len)#</CFOUTPUT>
<CFELSE>
  <CFOUTPUT>There was an error: #PCREGEX.MESSAGE#</CFOUTPUT>
</CFIF>
<!--- Should see: Match found at 5 :  BIG  --->

<!--- Find all of the words --- like split() in Perl --->
<CFX_PCREGEX SUBJECT="#data#" PATTERN="\w+" RESULTS="words" COUNT="ALL">
<CFOUTPUT QUERY="words">#Mid(data,Pos,Len)#<BR></CFOUTPUT>
<!--- Should see: Some<BR>BIG<BR>string --->

<!--- From the Allaire book --->
<CFX_PCREGEX SUBJECT="Allaire's Web Site" PATTERN="[[:space:]]" REPLACE="*" RESULTS="starred" COUNT="ALL">
<CFOUTPUT>#starred#</CFOUTPUT>
<!--- Should see: Allaire's*Web*Site --->

<!--- From the Allaire book --->
<CFX_PCREGEX SUBJECT="There is is coffee in the the kitchen" PATTERN="([A-Za-z]+)[ ]+\1" REPLACE="*" RESULTS="starred" COUNT="ALL">
<CFOUTPUT>#starred#</CFOUTPUT>
<!--- Should see: There * coffee in * kitchen --->

<!--- From the Allaire book --->
<CFX_PCREGEX SUBJECT="There is is a cat in in the kitchen" PATTERN="([A-Za-z]+)[ ]+\1" REPLACE="\1" RESULTS="onedupe">
<CFOUTPUT>#onedupe#</CFOUTPUT>
<!--- Should see: There is a cat in in the kitchen --->

<!--- From the Allaire book --->
<CFX_PCREGEX SUBJECT="There is is a cat in in the kitchen" PATTERN="([A-Za-z]+)[ ]+\1" REPLACE="\1" RESULTS="nodupes" COUNT="ALL">
<CFOUTPUT>#nodupes#</CFOUTPUT>
<!--- Should see: There is a cat in the kitchen --->
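The effect of the Count attribute in Replace mode (the last two examples above) can be cross-checked outside ColdFusion. The sketch below uses Python's re module as a stand-in for the tag, so it is only an approximation of the tag's behavior; in Python, count=0 means "no limit" and plays the role of COUNT="ALL".

```python
import re

DUPLICATE_WORD = re.compile(r"([A-Za-z]+)[ ]+\1")
subject = "There is is a cat in in the kitchen"

# Like COUNT="1" (the tag's default): only the first duplicate is collapsed.
one_dupe = DUPLICATE_WORD.sub(r"\1", subject, count=1)

# Like COUNT="ALL": every duplicate is collapsed.
no_dupes = DUPLICATE_WORD.sub(r"\1", subject, count=0)

print(one_dupe)
print(no_dupes)
```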

Common Questions

How can I emulate REReplaceNoCase() and REFindNoCase() (case-insensitive matching)?
Use the "(?i)" directive at the beginning of your pattern.

Where can I find more information on PCRE's eccentricities?
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
http://www.cise.ufl.edu/depot/www/pcre/pcre.html#SEC13

Where can I find regexes for URLs of different protocols?
http://www.foad.org/~abigail/Perl/url2.html
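
As a quick illustration of the "(?i)" directive mentioned in the first question, the following sketch uses Python's re module, which accepts the same inline directive at the start of a pattern. It is not PCRE itself, and the sample strings are made up for the example.

```python
import re

subject = "ColdFusion and COLDFUSION"

# Without the directive the match is case-sensitive and finds nothing.
case_sensitive = re.findall(r"coldfusion", subject)

# With (?i) at the beginning of the pattern, letter case is ignored.
case_insensitive = re.findall(r"(?i)coldfusion", subject)

print(case_sensitive)
print(case_insensitive)
```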

Tag Installation

Installation is the same as for any other CFX tag. See the Allaire reference material for details. The distribution for this program should have come with two DLLs: a Debug and a Release version. Both DLLs have the same functionality, except that the Debug version is compiled with debugging information, while the Release version has none of this and is optimized for speed.

Additional Information

Author

This program was originally written by Rick Osborne. All questions or comments should be directed to him at <pcregex@rixsoft.com>.

Availability

The primary distribution URL for this program is <http://www.rixsoft.com/ColdFusion/CFX/PCRegEx/>. Latest versions will be kept at that URL, so if you did not obtain this program from that URL, please check for a newer version. This help file should be included with every distribution, along with the executables (DLL), source (C++), and test file (CFM). If any parts of this distribution are missing, please visit the preceding URL for a full distribution.

Note: The PCRE source code is not distributed with the source code for this DLL. You must obtain the PCRE source code separately if you want to manually compile the code for this DLL.

License

This program is being released under the same license as the PCRE engine. Please see the next section for details.

PCRE License

PCRE is a library of functions to support regular expressions whose syntax and semantics are as close as possible to those of the Perl 5 language.

Written by: Philip Hazel <ph10@cam.ac.uk>
University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.

Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:

1. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2. The origin of this software must not be misrepresented, either by explicit claim or by omission. In practice, this means that if you use PCRE in software which you distribute to others, commercially or otherwise, you must put a sentence like this

   Regular expression support is provided by the PCRE library package, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England.

   somewhere reasonably visible in your documentation and in any relevant files or online help data or similar. A reference to the ftp site for the source, that is, to

   ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

   should also be given in the documentation.

3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software.

4. If PCRE is embedded in any software that is released under the GNU General Purpose Licence (GPL), then the terms of that licence shall supersede any condition above with which it is incompatible.

Last Updated 2000-10-11 by Rick Osborne