
Filtering and searching text with regular expressions is an important skill for every developer. Regular expressions can be tricky to master. To work with them effectively, you need a detailed understanding of their symbols and syntax.

Fortunately, learning to work with regular expressions can be incremental. You don't need to learn everything all at once to do useful work. Rather, you can start with the basics and then move into more complex topics while developing your understanding and using what you know as you go along.

This article is the second in a series. The first article introduced some basic elements of regular expressions: the basic metacharacters (., *, ^, $, \s, and \d) as well as the escape metacharacter \.

This article introduces some more advanced syntax: quantifiers, pattern collections, groups, and word boundaries. If you haven't read the first article, you might want to review it now before continuing with this content.

These articles demonstrate regular expressions by piping string output from an echo command to the grep utility. The grep utility uses a regular expression to filter content. The benefit of demonstrating regular expressions using grep is that you don't need to set up any special programming environment. You can execute an example of a regular expression immediately by copying and pasting the code directly into your terminal window running under Linux.
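
As a minimal sketch of the pattern used throughout this article, the following pipes a string to grep and captures what grep prints. It assumes GNU grep built with Perl-compatible regular expression support. The variable name matches is chosen here only for illustration:

```shell
# Sample text, taken from the sentence used in this article's examples
teststr="Jeff and the pet Lucky."
# -P: interpret the pattern as a Perl-compatible regular expression
# -o: print only the matching part of the line, one match per line
matches=$(echo "$teststr" | grep -Po 'pet')
echo "$matches"
# Output:
# pet
```

Without the -o option, grep prints the entire line that contains a match instead of only the matched text.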

What's the difference between a regular character and a metacharacter?

A regular character is a letter, digit, or punctuation used in everyday text. When you declare a regular character in a regular expression, the regular expression engine searches content for that declared character. For example, were you to declare the regular character h in a regular expression, the engine would look for occurrences of the character h.

A metacharacter is a placeholder symbol. For example, the metacharacter . (dot) represents "any character," and means any character matches here. The metacharacter \d represents a numerical digit, and means any digit matches here. Thus, when you use a metacharacter, the regex engine searches for characters that comply with the particular metacharacter or set of metacharacters.
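
The difference can be sketched with two short searches over the same text, assuming GNU grep with the -P option available. The variable names literal and digits are illustrative only:

```shell
teststr="Chris has 1 bird named Tweety."
# Regular character: matches only the literal letter h
literal=$(echo "$teststr" | grep -Po 'h')
# Metacharacter: \d matches any single digit
digits=$(echo "$teststr" | grep -Po '\d')
echo "$literal"
echo "$digits"
# Output:
# h
# h
# 1
```

The first pattern finds the h in Chris and the h in has. The second pattern never mentions the character 1, yet it matches it because 1 is a digit.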

What are quantifiers?

A quantifier is a syntactic structure in regular expressions that indicates the number of times a character occurs in sequence in the input text. There are two ways to declare a quantifier. One way is:

x{n}

In this syntax:

  • x is the character to match.
  • n indicates the number of times the character needs to occur.

A related syntax declares a quantifier with a minimum and maximum range:

x{n,m}

In this syntax:

  • x is the character to match.
  • n indicates the minimum number of occurrences and m indicates the maximum number of occurrences.

The following example uses a quantifier to create a matching pattern that identifies two occurrences of the regular character g in sequence:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'g{2}'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

Thus, the regular expression returns the following result:

gg

The following example uses a quantifier to create a matching pattern that identifies a minimum and a maximum for occurrences of the character g in a sequence. The minimum length is 1 and the maximum is 2. The regular expression is processed in a case-insensitive manner, as indicated by the -i option to grep:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Poi 'g{1,2}'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

Because each sequence is identified and returned on a one-by-one basis, the output is:

G
gg
g
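
A quantifier can also follow a metacharacter rather than a regular character. The following sketch uses a made-up sample string (not one of this article's examples) and assumes GNU grep with the -P option. It matches runs of two to three digits:

```shell
# Hypothetical sample text containing numbers of different lengths
rooms="Rooms 7, 42, and 1234"
# \d{2,3}: at least two and at most three digits in sequence
matches=$(echo "$rooms" | grep -Po '\d{2,3}')
echo "$matches"
# Output:
# 42
# 123
```

The single digit 7 is skipped because it falls below the minimum of two. In 1234, the engine takes the first three digits, and the remaining 4 is too short to match on its own.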

What are pattern collections?

A pattern collection is a syntactic structure that describes a character class. A character class is a set of metacharacters and regular characters that combine to create a matching pattern that, like a metacharacter, can match many different characters in text. A pattern collection is defined between square brackets ([ ]).

The following example uses the [A-Z] character class, which denotes any uppercase character from A to Z inclusive, to create a pattern collection that matches only uppercase characters in the given text:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '[A-Z]'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

J
L
G
F
C
T

The following example uses the [0-9] character class, which denotes any digit between 0 and 9, to create a pattern collection that matches only numeric characters in the given text:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '[0-9]'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

1
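
More than one range can appear inside the same pair of square brackets. As a small sketch, assuming GNU grep with the -P option, the following combines the two ranges shown so far to match any uppercase character or any digit:

```shell
teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
# [A-Z0-9]: one character that is either an uppercase letter or a digit
matches=$(echo "$teststr" | grep -Po '[A-Z0-9]')
echo "$matches"
# Output:
# J
# L
# G
# F
# C
# 1
# T
```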

The following example uses a pattern collection that matches certain exact regular characters within a set of regular characters. The regular expression says: Match any f, G, or F:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '[fGF]'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

f
f
G
F

The following example uses a pattern collection with both metacharacters and regular characters. The logic behind the regular expression says: Match any g, r, or e followed by a space character and then the string Fido:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '[gre]\sFido'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

g Fido

The following example uses two pattern collections along with metacharacters that are outside them. The regular expression says: Match a numeric character, then continue matching any character zero or many times that is followed by an uppercase character. The pattern collection [0-9] indicates any numeral from 0 to 9. The metacharacters .* indicate zero or more instances of any character, and the pattern collection [A-Z] indicates any uppercase character from A to Z:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '[0-9].*[A-Z]'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

1 bird named T
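
Pattern collections also combine with the * metacharacter from the first article. The following sketch, which assumes GNU grep with the -P option, matches an uppercase character followed by zero or more lowercase characters. On this text, that picks out the capitalized names:

```shell
teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
# [A-Z]   one uppercase character
# [a-z]*  zero or more lowercase characters
matches=$(echo "$teststr" | grep -Po '[A-Z][a-z]*')
echo "$matches"
# Output:
# Jeff
# Lucky
# Gregg
# Fido
# Chris
# Tweety
```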

The following example uses the negation metacharacter ^ within a pattern collection. The negation metacharacter indicates that the characters that follow it inside the square brackets are not to be matched when the regular expression is executed.

Note: As you might remember from the first article in this series, ^ is the same metacharacter that indicates a line start—but only when used outside square brackets. The ^ metacharacter indicates negation only when it appears within the square brackets ([ ]) that declare a pattern collection.

The following pattern collection says: Match any character that is not a, e, i, o, or u:

$ teststr="Jeff and the pet Lucky."
$ echo $teststr | grep -Po '[^aeiou]'

The regular expression matches the characters highlighted in bold in the following text. The text is underlined to make the space characters apparent:

Jeff and the pet Lucky.

Space characters in the following output are also underlined to make them apparent. Space characters are matched by this regular expression:

J
f
f
_
n
d
_
t
h
_
p
t
_
L
c
k
y
. 
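
The two meanings of ^ can be compared side by side. This sketch assumes GNU grep with the -P option and uses a one-word sample string to keep the output short. The variable names anchored and negated are illustrative only:

```shell
name="Jeff"
# Outside square brackets: ^ anchors the match to the start of the line
anchored=$(echo "$name" | grep -Po '^J')
# Inside square brackets: ^ negates, so match any character that is not J
negated=$(echo "$name" | grep -Po '[^J]')
echo "$anchored"
echo "$negated"
# Output:
# J
# e
# f
# f
```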

Groups

A group in a regular expression is, as the name implies, a sequence of characters that the regular expression engine treats as a single unit. A group declaration can include metacharacters and regular characters. A group is declared between open and closed parentheses like this: ( ).

The following example uses a . (dot) metacharacter, which indicates "any character." The declared group says: Match any three characters as a group and return each group:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '(...)'

The regular expression matches the characters highlighted in alternating bold and non-bold text as shown in the following text. Again, the text is underlined to make the space characters apparent:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

Because the group is identified and returned on a one-by-one basis, the output is:

Jef
f_a
nd_
the
_pe
t_L
uck
y._
Gre
gg_
and
_th
e_d
og_
Fid
o._
Chr
is_
has
_1_
bir
d_n
ame
d_T
wee
ty.

The following example uses the . (dot) metacharacter along with the regular character y to define a group of three characters, of which the first two characters can be anything and the third character must be y:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '(..y)'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

cky
ety

The following example demonstrates a regular expression group that uses the . (dot) metacharacter along with the \d metacharacter to define a group of five characters, of which the first two characters can be any characters, the third character is a digit, and the last two characters can be any characters:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '(..\d..)'

The regular expression matches the characters highlighted in bold in the following text. The text is underlined to make the space characters apparent.

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

s 1 b
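
One practical use of a group is to apply a quantifier to a whole sequence instead of a single character. The following sketch uses a made-up sample string and assumes GNU grep with the -P option. The variable names grouped and ungrouped are illustrative only:

```shell
# Hypothetical sample text
sample="banana bann"
# (an){2}: the group "an" must occur twice in sequence
grouped=$(echo "$sample" | grep -Po '(an){2}')
# an{2}: without the group, the quantifier applies only to n
ungrouped=$(echo "$sample" | grep -Po 'an{2}')
echo "$grouped"
echo "$ungrouped"
# Output:
# anan
# ann
```

With the parentheses, the quantifier repeats the two-character sequence an. Without them, it repeats only the character immediately before it.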

Word boundaries

A word character is declared using the metacharacter \w. A word character is any uppercase letter, lowercase letter, digit, or the underscore (_). A hyphen is not a word character.

A word boundary is the position between a word character and a non-word character, such as a space or a punctuation mark ( .!? ), or between a word character and the start or end of the text. A word boundary is declared using the metacharacter \b.

The following example demonstrates a regular expression that uses the metacharacters \w+ to find occurrences of words within text. The metacharacter + indicates one or more occurrences of a character. The logic in play is: Match one or more word characters:

$ teststr="Jeff and the pet Lucky."
$ echo $teststr | grep -Po '\w+'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky.

Because each word is identified and returned on a one-by-one basis, the output is:

Jeff
and
the
pet
Lucky
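
To see which characters \w+ treats as part of a word, the following sketch runs the same pattern over a made-up string that contains an underscore and a hyphen. It assumes GNU grep with the -P option:

```shell
# Hypothetical sample text: one underscore-joined word, one hyphen-joined word
sample="snake_case kebab-case"
# \w+: one or more word characters (letters, digits, underscore)
matches=$(echo "$sample" | grep -Po '\w+')
echo "$matches"
# Output:
# snake_case
# kebab
# case
```

The underscore counts as a word character, so snake_case is returned whole. The hyphen does not, so it splits kebab-case into two matches.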

The following example uses a word boundary to find occurrences of the regular character a that appears at the beginning of a word:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '\ba'

The regular expression matches the characters highlighted in bold in the following text:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

a
a

The following example uses a word boundary to find occurrences of the regular character y that appear at the end of a word:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'y\b'

The regular expression matches the characters highlighted in bold in the following text. Note that punctuation marks at the end of a word are not considered word characters and are excluded from the match:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

y
y

The following example uses a word boundary to find occurrences of the regular characters Tweety that appear at the end of a word:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'Tweety\b'

The regular expression matches the characters highlighted in bold in the following text. Again, notice that punctuation marks at the end of a word are excluded:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

Tweety

The following example contains a regular expression that uses word boundaries to find occurrences of words that start with the regular character a and end with the regular character d. The regular expression uses the metacharacters \w* to match zero or more word characters between them:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '\ba\w*d\b'

The regular expression matches the characters highlighted in bold in the following text.

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The output is:

and
and
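
To see what the word boundaries contribute, the same pattern can be run with and without them. This is a small sketch assuming GNU grep with the -P option. The variable names bounded and unbounded are illustrative only:

```shell
teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
# With \b on both sides, the match must span a whole word
bounded=$(echo "$teststr" | grep -Po '\ba\w*d\b')
# Without \b, the match can start in the middle of a word
unbounded=$(echo "$teststr" | grep -Po 'a\w*d')
echo "$bounded"
echo "---"
echo "$unbounded"
# Output:
# and
# and
# ---
# and
# and
# amed
```

Without the boundaries, the pattern also picks up amed from inside the word named.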

Grouping and specifying multiple characters simultaneously extend regular expressions

This article introduced quantifiers, pattern collections, groups, and word boundaries. You learned to use quantifiers to declare how many times, or within what range, a character must occur to match. You also learned that pattern collections let you declare character classes that match any one of a set of characters. Groups let you declare a sequence of characters to match as a unit. Word boundaries let you anchor a match to the start or end of a word, where a word character meets a space, a punctuation mark, or the start or end of the text.

The intermediate concepts covered in this article will bring additional power and versatility to working with regular expressions. But there's a lot more to learn. Fortunately, as mentioned at the beginning of this article, you can use the concepts and techniques discussed in this article immediately.

The key is to start practicing what you've learned now. Mastery is the result of small, incremental accomplishments. As with any skill, the more you practice, the better you'll get.
