Regular Expression Matching (RegEX) is one of the most frequently required techniques in scripting and programming. It is the process of looking for text inside of other text, and optionally replacing it. It is a more powerful alternative to functions like Replace(), because it lets you describe a pattern that you are looking for instead of requiring an exact match. This makes RegEX extremely useful in Web development for tasks like automatically hyperlinking URLs or pulling email addresses off of Webpages. But RegEX will also be of great use to the InfoChannel WinScript developer.
Searching an entire document with a regular expression is what the Unix tool grep does; its name comes from the editor command g/re/p (globally search for a regular expression and print). There are special programs dedicated to doing regular expression matching, but RegEX is also commonly included natively in programming languages. It is a powerful technique with confusing syntax and implementations that vary from language to language. The particular RegEX implementation is sometimes called the RegEX engine. RegEX in the Windows Scripting Host (WSH) uses a Regular Expression "Object". While in some ways this is powerful, it mostly just adds a layer of obfuscation to an otherwise straightforward technique. Much of this baby-step tutorial will be dedicated to the variance between the WSH/ASP implementation of RegEX and the way the rest of the world uses it.
The RegEX implementation in WSH/ASP is capable of creating an object containing a set of matches that can be iterated through. While this is powerful, it is not what I want to do for the first demonstration. Instead, I am going to make a function that returns a value equal to the first match. I have a real-world example in mind. The email notifications that go out on Scala discussions contain a discussion link that leads sales prospects back to their personal discussion with Scala. These discussions take place in a web-based Message Board system. We encourage people to follow these links back, but on occasion, people reply directly to the notification. I will be building the system that takes these replies and automatically posts them back into the right discussion, saving the manual work of visiting the discussion ourselves and copying and pasting the message into place.
You will not be seeing most of this project. I will be focusing only on the function containing the RegEX. The function will take the body of the email message as input and return the message ID. So, what I need to do is extract the message ID from the body of the email message. The message ID appears in a URL in the body of the message. There are other approaches to doing this that combine VBScript's string processing functions, like InStr(), Replace(), Mid() or Left(). But I can do it in one shot with RegEX. This is a perfect example.
Since I'm talking about a function that goes into a larger program, I will present examples that work as stand-alone programs. But I will have to fake the message. So, let's say an email might look like this...
Now, to include this fake email in the test program, we will need to populate a variable with this email as its contents. Make a new text file. Name it ExtractID.vbs and put the following code into it...
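A minimal sketch of what ExtractID.vbs could contain at this point. The message text and the www.example.com URL are invented placeholders for illustration, not the real notification:

```vbscript
' ExtractID.vbs, step 1: load a fake notification email into a variable.
' Assumption: the message text and URL below are made up for this sketch.
myFakeEmail = "" & _
    "A new reply has been posted to your discussion." & vbNewLine & _
    "Click ""Reply"" on the following page to answer:" & vbNewLine & _
    "http://www.example.com/discuss/read.asp?12345" & vbNewLine & _
    ""
```

Saved as a .vbs file, this can be run by double-clicking it or with cscript ExtractID.vbs.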
As usual, there are nuances to point out. First is the use of the underscore character. A trailing underscore (preceded by a space) is VBScript's line-continuation character: it joins several physical lines of a program into one statement. This keeps you from having to write consecutive lines that look like...

    myFakeEmail = myFakeEmail & "line 1" & vbNewLine
    myFakeEmail = myFakeEmail & "line 2" & vbNewLine
    myFakeEmail = myFakeEmail & "line 3" & vbNewLine

Instead, you write...

    myFakeEmail = myFakeEmail & "line 1" & vbNewLine & _
        "line 2" & vbNewLine & _
        "line 3" & vbNewLine

So, why do I have the extra empty lines at the top and bottom? It's just a technique I use, so the first and last lines of the real body of the message are not special exceptions. I can copy and paste between the empty lines, and run the same macro on every line to fill in the quotes and underscores.

The next thing to point out is the infamous VB double-double quotes. This is the way to escape a quote, since a quote in VBScript (and many other languages) is the delimiter for a string literal. I could have also appended a Chr(34) with the same results. I do it this way out of habit.

The next thing to point out is the vbNewLine constants. I use them to reproduce the line breaks that would exist in the actual email. As we will learn later, the actual placement of the line break is important, as is whether there is a trailing space after the text for which we will be looking. This WSH script will actually run fine, even though it doesn't display any output.
Now, to keep in the spirit of the baby-step tutorials, I'm going to add just a few lines. This will reinforce the message about Option Explicit that I gave in my first tutorial. Use it, or risk introducing bugs. Now that Option Explicit is added, we need to Dim our variables. I have also introduced the first function that has appeared so far in my Scala WSH tutorials. For now, it is just an empty shell. I'm choosing a Function instead of a Sub because this is intended to actually return a value.
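A sketch of this step, assuming a shortened made-up email and a hypothetical function name, ExtractID (neither comes from the original listing):

```vbscript
Option Explicit   ' every variable must now be declared with Dim

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "" & _
    "Click ""Reply"" on the following page to answer:" & vbNewLine & _
    "http://www.example.com/discuss/read.asp?12345" & vbNewLine & _
    ""

' Hypothetical name. The body is an empty shell at this stage.
Function ExtractID(messageBody)
End Function
```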
This will run, but still produces no output.
Here, I demonstrate how a function takes a parameter and returns a value. The last thing a function does is set its return value, by assigning to the function's own name. Some people think this limits functions to returning just a single value. But that's not true, because a function can return arrays (including multidimensional ones) and any object, such as a database recordset. I COULD use it to return an object containing every match found. But I will be using it just to return a single value (the first match found).
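As a rough illustration, the function below takes the message body as a parameter and returns a fixed placeholder string. The function name, the sample text and the use of WScript.Echo for output are assumptions of this sketch:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)   ' shows: nothing extracted yet

Function ExtractID(messageBody)
    ' Assigning to the function's own name sets the return value.
    ExtractID = "nothing extracted yet"
End Function
```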
Now, we do have output, albeit not very useful...
Here, I symmetrically create and destroy an instance of the RegExp object. The RegExp class is built into VBScript, so the CreateObject function is not even required; New RegExp will do.
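A sketch of the create-and-destroy pairing, using a hypothetical variable name, MyRegEx, and the same placeholder return value as an assumption:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)

Function ExtractID(messageBody)
    Dim MyRegEx
    Set MyRegEx = New RegExp      ' built into VBScript, no CreateObject needed
    ExtractID = "nothing extracted yet"
    Set MyRegEx = Nothing         ' release the object we created
End Function
```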
The output is unchanged.
Now, we set up the RegExp environment. We want our matches to be case-insensitive, so a match will occur even if the upper and lower case usage differs. It's not really necessary here, but it is good to deliberately think about it every time. Next, we set Global to True. Global controls whether the engine finds every occurrence of the pattern in the search text or stops after the first one. If you wanted to iterate through every match in the document, you would have to set this value to True. In this situation, where we only use the first match, it is not really necessary, but it won't do any harm.
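The two settings might look like this inside the function. IgnoreCase and Global are real properties of the VBScript RegExp object; the variable and function names and the sample text are assumptions:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)

Function ExtractID(messageBody)
    Dim MyRegEx
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True     ' "HTTP" and "http" are treated alike
    MyRegEx.Global = True         ' collect every occurrence, not just the first
    ExtractID = "nothing extracted yet"
    Set MyRegEx = Nothing
End Function
```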
This is perhaps the most important line. It is where we set the pattern. Currently, I am setting it to a match that I know will work. Eventually, we will be altering this pattern so that it matches the subsequent number as well, no matter what it is.
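A sketch of setting the pattern, assuming the invented www.example.com URL stands in for the real discussion link:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)

Function ExtractID(messageBody)
    Dim MyRegEx
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True
    MyRegEx.Global = True
    ' The fixed part of the URL. The unescaped dots match any character,
    ' which is looser than strictly needed but harmless for this text.
    MyRegEx.Pattern = "http://www.example.com/discuss/read.asp"
    ExtractID = "nothing extracted yet"
    Set MyRegEx = Nothing
End Function
```

Nothing is executed against the pattern yet, so the output is still the placeholder string.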
Now, I add the creation of the matches collection and a single match object. Normally, when you conduct a match, you iterate through the matches collection. But examples of doing that abound on the Web, and in this case, we are only interested in the first match. Just about the only other place I found that documents this is the MSDN page on submatches: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/v...
As usual, any object we create, we symmetrically destroy with the Set Object = Nothing phrase. We use our upper and lower case naming convention to remind ourselves that these are all objects that need destroying.
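A sketch of taking only the first match. Execute, the Matches collection and Match.Value are part of the VBScript RegExp object model; the variable names, function name and sample URL are assumptions:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)   ' http://www.example.com/discuss/read.asp

Function ExtractID(messageBody)
    Dim MyRegEx, MyMatches, MyMatch
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True
    MyRegEx.Global = True
    MyRegEx.Pattern = "http://www.example.com/discuss/read.asp"
    Set MyMatches = MyRegEx.Execute(messageBody)   ' collection of all matches
    Set MyMatch = MyMatches(0)                     ' first match only
    ExtractID = MyMatch.Value
    ' Destroy objects in the reverse order of creation.
    Set MyMatch = Nothing
    Set MyMatches = Nothing
    Set MyRegEx = Nothing
End Function
```

Note that MyMatches(0) raises a runtime error when the collection is empty, so this version is only safe while the pattern is known to match.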
And now we have some new output to look at...
In another act of defensive programming, we are going to want to make sure that a match truly exists before we return a value. I also did a test to ensure that no match will be found (notice the x in the pattern). So, there is once again no output.
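One way to guard against an empty result is to check the Count property of the matches collection before touching the first item. In this sketch the stray x in the pattern is deliberate, and the names and URL are assumptions:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)   ' echoes an empty string

Function ExtractID(messageBody)
    Dim MyRegEx, MyMatches, MyMatch
    ExtractID = ""                    ' default when nothing matches
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True
    MyRegEx.Global = True
    ' The x before "read" guarantees that this pattern cannot match.
    MyRegEx.Pattern = "http://www.example.com/discuss/xread.asp"
    Set MyMatches = MyRegEx.Execute(messageBody)
    If MyMatches.Count > 0 Then       ' only read the first match if one exists
        Set MyMatch = MyMatches(0)
        ExtractID = MyMatch.Value
        Set MyMatch = Nothing
    End If
    Set MyMatches = Nothing
    Set MyRegEx = Nothing
End Function
```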
This step is teensy tiny, but is so important that it must be pointed out. I added a backslash question mark (\?) to the pattern. This is important to note because a normal question mark has a special meaning to RegEX. So, when you actually are looking for a question mark in your text, you must escape it. The backslash is the RegEX character for escaping the character that immediately follows. So, when we match on "\?", we are really just matching on "?". Get it? This is VERY important.
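A sketch of the escaped question mark in action, again with assumed names and an invented URL:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text and URL.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)   ' http://www.example.com/discuss/read.asp?

Function ExtractID(messageBody)
    Dim MyRegEx, MyMatches, MyMatch
    ExtractID = ""
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True
    MyRegEx.Global = True
    ' \? matches a literal question mark; a bare ? would instead make
    ' the preceding "p" optional.
    MyRegEx.Pattern = "http://www.example.com/discuss/read.asp\?"
    Set MyMatches = MyRegEx.Execute(messageBody)
    If MyMatches.Count > 0 Then
        Set MyMatch = MyMatches(0)
        ExtractID = MyMatch.Value
        Set MyMatch = Nothing
    End If
    Set MyMatches = Nothing
    Set MyRegEx = Nothing
End Function
```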
And our output proves that the question mark is now included in the matched pattern...
In this step, we start to show a lot of the features of Regular Expression Matching. Any set of characters enclosed by square brackets ([]) defines a character class. If I wanted to match only on a, b, c and d, I would use [abcd]. Similarly, if I wanted to match on zero through nine, I would use [0123456789]. Happily, these character classes support the hyphen to define a range. So, to match on any digit 0 through 9, you can simply use [0-9], which I did. Now, the asterisk (*) is a very special character in that it says something about the item that came immediately before it. The plus sign (+) and the question mark (?) work in the same way. The asterisk tells the engine to match the previous item (here, the whole character class) zero or more times. So, we are essentially saying match on the URL, plus a number of any length.
To test that this is true, I also added a link to an on-page bookmark in the URL. This is done with the pound (#) symbol. If something on the page had an anchor like this, <a name="12345">on-page link</a>, and that bookmark was appended to the URL, our pattern match would stop short of it, because the pound sign is not part of the pattern.
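A sketch of the character class and asterisk, with a made-up #top bookmark appended to the invented URL to show where the match stops:

```vbscript
Option Explicit

Dim myFakeEmail
' Assumption: placeholder message text, URL and bookmark name.
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345#top" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)   ' http://www.example.com/discuss/read.asp?12345

Function ExtractID(messageBody)
    Dim MyRegEx, MyMatches, MyMatch
    ExtractID = ""
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True
    MyRegEx.Global = True
    ' [0-9]* consumes zero or more digits and stops at the "#".
    MyRegEx.Pattern = "http://www.example.com/discuss/read.asp\?[0-9]*"
    Set MyMatches = MyRegEx.Execute(messageBody)
    If MyMatches.Count > 0 Then
        Set MyMatch = MyMatches(0)
        ExtractID = MyMatch.Value
        Set MyMatch = Nothing
    End If
    Set MyMatches = Nothing
    Set MyRegEx = Nothing
End Function
```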
And the output confirms this is working...
We're almost done. I've only just scratched the surface of Regular Expression Matching, hardly even introducing you to the complex rules. But it's time to cheat a little bit. As you may have surmised, regular expression matching is about consuming characters in your match. You write a pattern that grabs and includes characters. Consequently, even though we are just interested in the number from the URL, we had to grab the entire URL, because it is the URL that tells us where the number is found in the document. Do you detect a little something of a catch-22 in regular expression matching? It is true!
But any facility you could imagine asking for is usually available in RegEx, although support varies between implementations. For example, to do the entire task with just one match, I would use the "look behind" feature in an engine that supports it (the VBScript engine does not). In other words, the part of my pattern that included the URL would not consume the text. I would be looking for a number that matches the pattern, which has the URL before it. This is different from the URL followed by the number. As you will learn with RegEX, it's all in how you state the question. There is much order sensitivity.

So, on to the cheating I talked about. Since the URL is the same every time, and it is of a fixed length, it is easy to chop it off of the beginning of the string before we return the value. We will use VBScript's built-in string functions to do this. It could be done with a second RegEX match. But I think that's overkill at this point. The Right() and Len() functions fit the need very nicely...
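A sketch of the finished function under the same assumptions as before: the URL is an invented placeholder, and URL_PREFIX, ExtractID and the variable names are hypothetical. Right() keeps only the characters that follow the fixed-length prefix:

```vbscript
Option Explicit

' Assumption: the fixed part of the discussion link, as literal text.
Const URL_PREFIX = "http://www.example.com/discuss/read.asp?"

Dim myFakeEmail
myFakeEmail = "See http://www.example.com/discuss/read.asp?12345#top" & vbNewLine

WScript.Echo ExtractID(myFakeEmail)   ' 12345

Function ExtractID(messageBody)
    Dim MyRegEx, MyMatches, MyMatch
    ExtractID = ""
    Set MyRegEx = New RegExp
    MyRegEx.IgnoreCase = True
    MyRegEx.Global = True
    MyRegEx.Pattern = "http://www.example.com/discuss/read.asp\?[0-9]*"
    Set MyMatches = MyRegEx.Execute(messageBody)
    If MyMatches.Count > 0 Then
        Set MyMatch = MyMatches(0)
        ' Drop the prefix: keep the rightmost (total - prefix) characters.
        ExtractID = Right(MyMatch.Value, Len(MyMatch.Value) - Len(URL_PREFIX))
        Set MyMatch = Nothing
    End If
    Set MyMatches = Nothing
    Set MyRegEx = Nothing
End Function
```

Because the pattern's dots each match exactly one character, the matched prefix always has the same length as URL_PREFIX, which is what makes the Len() arithmetic safe.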
And the output of our program demonstrates that we are done...