Searching, matching & replacing successfully - Part I

by Kurt Keller

Could be useful for other stuff, couldn't it?

Did you ever use the command dir *.txt on a DOS or Windows machine? Have you ever typed ls *.tar on Unix? What does the asterisk stand for? Exactly. It means any number (0 or more) of any character, or shorter: anything. So *.txt actually means anything followed by .txt. This can be as little as .txt itself (although Unix shells skip names that start with a dot unless told otherwise), or it can be music.txt or this.txtand-that.txt or even something more complicated.
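If you want to see the wildcard at work without touching your own files, the following is a minimal sketch. The scratch directory and the file names in it are made up for the demonstration, and it assumes a POSIX shell with mktemp available.

```shell
#!/bin/sh
# Create a scratch directory (hypothetical file names, for illustration only).
tmp=$(mktemp -d)
touch "$tmp/music.txt" "$tmp/notes.txt" "$tmp/archive.tar" "$tmp/.txt"

# The shell expands *.txt before the command runs. The subshell keeps the
# cd from affecting the rest of the script.
matches=$(cd "$tmp" && echo *.txt)

# archive.tar does not end in .txt, and .txt is skipped because Unix shells
# do not let * match a leading dot by default.
echo "$matches"

rm -r "$tmp"
```

On a typical shell this prints music.txt and notes.txt on one line.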

Wouldn't it be wonderful if this could be applied to more than just file listings? And wouldn't it be terrific if it were a bit more flexible? Well, the same concept can be applied to many more situations, and it is much more flexible. What you're looking for is called regular expressions, or regex for short. They are available in many utilities and programming languages and are built into many applications. You just need to know they are there and how to use them. If they are not built into the application where you need them, then there is probably not much I can do for you; your best bet would be to export the data to a file and massage that file with some utilities. If the problem is rather that you don't yet know what all these hieroglyphs mean, then I might be able to shed some light on them.

When the talk is about Unix, I feel comfortable and can join the discussion. When the topic is Windows, I don't know much more than the average Joe (most secretaries know Word, Excel and all that stuff way better than I do), and when it comes to Mac, OS/390, Novell and others, I have no idea at all. So in my examples I'm using mostly Unix tools and commands. If you want to follow along with the examples, you don't need to have a Unix machine at home. If you're running Windows, you can install Cygwin (http://www.cygwin.com/), a free Unix environment that runs in a window under Windows. It is simple to install, and you don't need any Unix knowledge to install and run it. If you prefer to set up a Unix machine, you might want to try FreeBSD (http://www.freebsd.org/).

The very basics

With regular expressions, some characters have a special meaning, just like the asterisk in the directory listing example. There is more than one regular expression engine and not every tool supports all of the special characters, but the basics are usually quite universal. Let's start right away with an example.

We have an email address book. It is a simple text file named aliases.txt, plain ASCII with one email address per line:

  AJ editor <editors@TPC.org>
  John Doe <jdoe@hotmail.com>
  Alice Springs <alice@yahoo.co.au>
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>
  Bob Hacker <badboy@ftpcsorg.net>
  Dave Cybercop <hunter@angels.net>

Now we're informed that the Tokyo PC Club is changing its domain name from tpc.org to tokyopc.org, because the Transaction Processing Performance Council so badly wants the domain tpc.org and has offered the Club a lot of money for it. We could now simply open the aliases.txt file in a text editor and change every occurrence of tpc.org into tokyopc.org. If our aliases.txt file consisted only of the seven entries above, this would be no problem. But the whole file has about 700 addresses, and unfortunately I don't have a secretary to whom I could assign the task of adapting my aliases file. This is where a few tools which understand regular expressions can help us.

Regular expressions are always about matching some text. The better we can describe what we want to match, the more successful we're going to be. Regex matching is usually line oriented, that is, the whole match must occur on a single line to be recognized (exceptions are possible with certain tools, but they are rare). As there is exactly one email address per line in our aliases.txt file, we can use the command grep (Global Regular Expression Print) to show us all the tpc.org addresses in the file. So first we must know what text we want to match. Should be tpc.org, right? So let's try:

  %> grep tpc.org aliases.txt
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>
  Bob Hacker <badboy@ftpcsorg.net>

Well, not quite what we expected. First of all, the AJ editor's address is missing. This is because regular expressions are usually case sensitive. The command line switch -i makes grep match case insensitively and will solve that problem. But why do we have Bob Hacker in this list? After all, tpcsorg is not the same as tpc.org, right? Well, the dot is one of the special characters and means any single character. So a single dot matches anything, but not nothing. If we want to match a real dot, we need to take the special meaning away from the dot: we need to escape it. This is done by preceding it with a backslash. To prevent the Unix shell from interpreting the backslash, we'd better put the whole regular expression between single quotes (I'll spare you the detailed explanation for the moment, just believe me for now). So with this added knowledge let's try again:

  %> grep -i 'tpc\.org' aliases.txt
  AJ editor <editors@TPC.org>
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>
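The difference between the unescaped dot, the escaped dot and the -i switch can be put side by side by counting matches with grep -c. This is only a sketch: it writes the seven sample addresses to a temporary file (the temporary path is an assumption, any scratch location will do) so your real aliases.txt stays untouched.

```shell
#!/bin/sh
# Throwaway copy of the seven sample entries (temporary path is arbitrary).
tmp=$(mktemp -d)
cat > "$tmp/aliases.txt" <<'EOF'
AJ editor <editors@TPC.org>
John Doe <jdoe@hotmail.com>
Alice Springs <alice@yahoo.co.au>
TPC President <president@tpc.org>
TPC program director <programs@tpc.org>
Bob Hacker <badboy@ftpcsorg.net>
Dave Cybercop <hunter@angels.net>
EOF

# Unescaped dot: "." stands for any single character, so "tpcsorg" matches.
dot_any=$(grep -c 'tpc.org' "$tmp/aliases.txt")

# Escaped dot: only a literal "." matches; matching is still case sensitive.
dot_literal=$(grep -c 'tpc\.org' "$tmp/aliases.txt")

# Escaped dot plus -i: the upper-case TPC.org entry is found as well.
dot_literal_nocase=$(grep -ci 'tpc\.org' "$tmp/aliases.txt")

echo "unescaped: $dot_any  escaped: $dot_literal  escaped with -i: $dot_literal_nocase"
rm -r "$tmp"
```

With the seven entries above, the counts are 3, 2 and 3: the unescaped pattern picks up Bob Hacker, the escaped one drops him but still misses the AJ editor, and -i brings the AJ editor back.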

Cool, that's the list of all the tpc.org addresses we need to adapt. But I'm not yet satisfied with the regular expression we used. What if I had the email address <mike.oldfield@another-tpc.org> in my aliases.txt file? It would be matched as well because it contains the string tpc.org. We know much more about the string we actually want to match than what we put into our regex. It is an email address and the domain is tpc.org, so there will always be an at sign (@) right in front of it. By including the at sign in our regex, we can further safeguard what we'll be matching. Add the following line to your aliases.txt file to see whether I'm having you on or telling you the truth:

  Mike Oldfield <mike.oldfield@another-tpc.org>

  %> grep -i 'tpc\.org' aliases.txt
  AJ editor <editors@TPC.org>
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>
  Mike Oldfield <mike.oldfield@another-tpc.org>

  %> grep -i '@tpc\.org' aliases.txt
  AJ editor <editors@TPC.org>
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>

Now we have safeguarded one side of the string we want to match. And now you expect me to also take precautions at the end of the string, right? Hey, you're getting the hang of it, good. Add another line to your aliases.txt file:

  Donna Summer <donna.summer@tpc.org.tw>

With the regex we have so far, this address is matched as well, because it contains @tpc.org. When checking the syntax of our aliases.txt file, we see that all the email addresses are enclosed in angle brackets. If this is really so, it will help us form a more reliably matching regular expression. I don't want to strain my eyes visually checking my own 700-line aliases.txt file to see whether really ALL the email addresses end with an angle bracket. Instead I use another regular expression to confirm it. First we need to know how many lines there are in our aliases.txt. We count them by piping the whole file into the wc (Word Count) utility. The -l command line switch makes it count lines only:

  %> cat aliases.txt | wc -l
        9

And now we count how many lines end with closing angle brackets:

  %> grep '>$' aliases.txt | wc -l
        9

With the help of grep we filter all the lines which have a closing angle bracket (>) just before the end of the line ($) and pipe this output into wc, which then counts the lines for us. And we're in luck: the number is the same as with the previous command, which means that all the lines end with a closing angle bracket. So our final grep command looks like this:
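Comparing two counts works, but when they differ you also want to know which lines are the culprits. One possible way, sketched here, is to invert the match with grep -v. The three-line sample file, including the deliberately broken last entry, is made up for this illustration.

```shell
#!/bin/sh
# Small sample with one deliberately malformed entry (hypothetical data).
tmp=$(mktemp -d)
printf '%s\n' \
  'TPC President <president@tpc.org>' \
  'Donna Summer <donna.summer@tpc.org.tw>' \
  'Broken entry <nobody@example.com' > "$tmp/aliases.txt"

# Lines that end with a closing angle bracket.
ending=$(grep -c '>$' "$tmp/aliases.txt")

# -v inverts the match: count the lines that do NOT end with ">".
not_ending=$(grep -vc '>$' "$tmp/aliases.txt")

# -n adds line numbers, so the offenders are easy to find in an editor.
offenders=$(grep -vn '>$' "$tmp/aliases.txt")

echo "ok: $ending  suspicious: $not_ending"
echo "$offenders"
rm -r "$tmp"
```

If the second number is 0, every line ends the way we assumed; otherwise the last command lists the lines to fix first.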

  %> grep -i '@tpc\.org>' aliases.txt
  AJ editor <editors@TPC.org>
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>

As you can see, the output is correct even though we now have the additional two 'troublemaker addresses' in our file, so our matching expression should be fine and reliable. With this output you can open your aliases.txt file and look for these entries to adapt them. No need to check each line in the file separately and possibly miss one.
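To see how each safeguard narrows the match, the three patterns used so far can be run against the same nine-line file and their match counts compared. This is a sketch on a temporary copy of the sample data; the variable names are my own.

```shell
#!/bin/sh
# The seven original entries plus the two troublemaker addresses.
tmp=$(mktemp -d)
cat > "$tmp/aliases.txt" <<'EOF'
AJ editor <editors@TPC.org>
John Doe <jdoe@hotmail.com>
Alice Springs <alice@yahoo.co.au>
TPC President <president@tpc.org>
TPC program director <programs@tpc.org>
Bob Hacker <badboy@ftpcsorg.net>
Dave Cybercop <hunter@angels.net>
Mike Oldfield <mike.oldfield@another-tpc.org>
Donna Summer <donna.summer@tpc.org.tw>
EOF

# No safeguards: also matches another-tpc.org and tpc.org.tw.
loose=$(grep -ci 'tpc\.org' "$tmp/aliases.txt")

# At sign in front: another-tpc.org is gone, tpc.org.tw still matches.
guarded_left=$(grep -ci '@tpc\.org' "$tmp/aliases.txt")

# At sign in front and ">" behind: only the real tpc.org addresses remain.
guarded_both=$(grep -ci '@tpc\.org>' "$tmp/aliases.txt")

echo "loose: $loose  left: $guarded_left  both: $guarded_both"
rm -r "$tmp"
```

The counts shrink from 5 to 4 to 3, one troublemaker dropping out with each safeguard.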

Lazy people need to know a little bit more

I'm going to do something else, though. I'm using regular expressions to actually do the whole work for me. On the one hand I'm lazy, and on the other hand, even though I'm a fast typist, my typing is not so reliable; I make too many mistakes. So I'm going to use another of those handy little Unix utilities: sed, the Stream EDitor. Like most other tools and editors that support regular expressions, it has a substitute function.

What you match with a regular expression (or parts of it, if you want to) can be substituted with something else. Traditional sed does not, however, have an option for case insensitivity (GNU sed offers an I flag as an extension, but it is not portable). So in order to match both upper and lower case I need to use character classes. When you want to match any one of a bunch of possible characters, you can put the list of possibilities within square brackets. For example [abc] will match any one of the letters a, b or c. So if I want to match either an upper or lower case t I can use [Tt], for either case of the letter p it would be [Pp] and so on. The same can be used with grep:

  %> grep '@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>' aliases.txt
  AJ editor <editors@TPC.org>
  TPC President <president@tpc.org>
  TPC program director <programs@tpc.org>
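A quick way to convince yourself that the character classes really do the same job as -i is to compare the output of both commands. The sketch below does that on a small sample; the mixed-case entry is a made-up address added only to show that each class is matched independently of the others.

```shell
#!/bin/sh
# Sample entries; the "Mixed Case" line is hypothetical.
tmp=$(mktemp -d)
cat > "$tmp/aliases.txt" <<'EOF'
AJ editor <editors@TPC.org>
TPC President <president@tpc.org>
Mixed Case <someone@Tpc.Org>
Bob Hacker <badboy@ftpcsorg.net>
EOF

# One character class per letter: each position accepts either case.
with_classes=$(grep '@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>' "$tmp/aliases.txt")

# The same match expressed with grep's -i switch.
with_flag=$(grep -i '@tpc\.org>' "$tmp/aliases.txt")

# Without classes or -i only the all-lower-case address is found.
exact_count=$(grep -c '@tpc\.org>' "$tmp/aliases.txt")

if [ "$with_classes" = "$with_flag" ]; then
  echo "character classes and -i select the same lines"
fi
echo "case-sensitive matches: $exact_count"
rm -r "$tmp"
```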

I'm not going to explain the whole sed command, but if you have been following along, it should not be too difficult to at least guess what is going on:

  %> sed 's/@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>/@tokyopc.org>/' aliases.txt >aliases.new
  %> cat aliases.new >aliases.txt
  %> rm aliases.new

Even though I use regular expressions often, I wouldn't dare do this without first using grep this way to check what is being matched.
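The check-first habit can itself be scripted. What follows is only one possible sketch, not a recipe: it counts the matches with grep, lets sed write to a new file, and replaces the original only if the numbers add up. The backup file name aliases.bak and the temporary directory are my own assumptions.

```shell
#!/bin/sh
# Work on a throwaway copy of the sample data.
tmp=$(mktemp -d)
cat > "$tmp/aliases.txt" <<'EOF'
AJ editor <editors@TPC.org>
TPC President <president@tpc.org>
TPC program director <programs@tpc.org>
Bob Hacker <badboy@ftpcsorg.net>
Mike Oldfield <mike.oldfield@another-tpc.org>
Donna Summer <donna.summer@tpc.org.tw>
EOF

pattern='@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>'

# 1. Preview: how many lines are going to change?
expected=$(grep -c "$pattern" "$tmp/aliases.txt")

# 2. Substitute into a new file; the original is not touched yet.
sed "s/$pattern/@tokyopc.org>/" "$tmp/aliases.txt" > "$tmp/aliases.new"

# 3. Verify: every previewed line was changed and no old address is left.
changed=$(grep -c '@tokyopc\.org>' "$tmp/aliases.new")
leftover=$(grep -c "$pattern" "$tmp/aliases.new")

if [ "$changed" -eq "$expected" ] && [ "$leftover" -eq 0 ]; then
  cp "$tmp/aliases.txt" "$tmp/aliases.bak"   # keep a backup (assumed name)
  mv "$tmp/aliases.new" "$tmp/aliases.txt"
  echo "replaced $changed addresses"
else
  echo "counts do not match, original left untouched" >&2
fi

untouched=$(grep -c 'tpc\.org\.tw>' "$tmp/aliases.txt")
rm -r "$tmp"
```

The last count confirms that the tpc.org.tw troublemaker came through unchanged.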

The whole thing could also be done reliably in a couple of seconds with regular expressions in my favourite text editor, vim. For those of you who know vi or vim:

  :% s/@tpc\.org>/@tokyopc.org>/igc

More to follow

As the sed and vim examples show, knowing how to use regular expressions can simplify your life and job a lot in many situations. Even adapting my 700-line aliases.txt file wouldn't take more than a couple of seconds with regular expressions and vim. Without regular expressions it would be a tedious, error-prone and time-consuming task. We have only scratched the surface of what regular expressions can do.

Some very important concepts, such as quantifiers, or backreferences in substitution have not even been mentioned yet. Watch out for part 2 of this introduction to regular expressions.

References

Cygwin - free Unix under Windows http://www.cygwin.com/
FreeBSD - free Unix http://www.freebsd.org/
Vi IMproved - vi clone with many enhancements http://www.vim.org/
PINBOARD http://www.pinboard.com/
HighTech Samurai http://kurt.www.pinboard.com/

© Algorithmica Japonica Copyright Notice: Copyright of material rests with the individual author. Articles may be reprinted by other user groups if the author and original publication are credited. Any other reproduction or use of material herein is prohibited without prior written permission from TPC. The mention of names of products without indication of Trademark or Registered Trademark status in no way implies that these products are not so protected by law.

Algorithmica Japonica

January, 2003

The Newsletter of the Tokyo PC Users Group


Tokyo PC Users Group, Post Office Box 103, Shibuya-Ku, Tokyo 150-8691, JAPAN