
A regular expression (also called a regex or regexp) is a rule that a computer can use to match characters or groups of characters within a larger body of text. For instance, using regular expressions, you could find all the instances of the word cat in a document, or all instances of a word that begins with c and ends with t.

Use of regular expressions in the real world can get much more complex—and powerful—than that. For example, imagine you need to write code verifying that all content in the body of an HTTP POST request is free of script injection attacks. Malicious code can appear in any number of ways, but you know that injected script code will always appear between <script></script> HTML tags. You can apply the regular expression <script>.*<\/script>, which matches any block of code text bracketed by <script> tags, to the HTTP request body as part of your search for script injection code.
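As a minimal sketch of that idea, the commands below apply the same regular expression to a made-up request body using grep. The body variable and its contents are hypothetical and exist only for illustration; the -P and -o options are described later in this article.

```shell
# Hypothetical request body, invented for this illustration.
body='name=Bob&comment=<script>alert("gotcha")</script>&page=2'

# -P enables Perl-style syntax; -o prints only the text that matched.
match=$(echo "$body" | grep -Po '<script>.*<\/script>')
echo "$match"
```

With this sample body, the command prints only the <script>alert("gotcha")</script> block and leaves out the rest of the request.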

This example is but one of many uses for regular expressions. In this series, you'll learn more about how the syntax for this and other regular expressions works.

As just demonstrated, a regex can be a powerful tool for finding text according to a particular pattern in a variety of situations. Once mastered, regular expressions provide developers with the ability to locate patterns of text in source code and documentation at design time. You can also apply regular expressions to text that is subject to algorithmic processing at runtime, such as content in HTTP requests or event messages.

Regular expressions are supported by many programming languages, as well as classic command-line applications such as awk, sed, and grep, which were developed for Unix many decades ago and are now offered on GNU/Linux.

This article examines the basics of using regular expressions under grep. The article shows how you can use a regular expression to declare a pattern that you want to match, and outlines the essential building blocks of regular expressions, with many examples. This article assumes no prior knowledge of regular expressions, but you should understand how to work with the Linux operating system at the command line.

What are regular expressions, and what is grep?

As we've noted, a regular expression is a rule used for matching characters in text. These rules are declarative: a rule describes what text to match rather than the steps for finding it, and once declared, it does not change. But a single rule can be applied to any variety of situations.

Regular expressions are written in a special language. Although this language has been standardized, dialects vary from one regular expression engine to another. For example, JavaScript has a regex dialect, as do C++, Java, and Python.

This article uses the regular expression dialect that goes with the Linux grep command, with an extension to support more powerful features. grep is a binary executable that filters content in a file or output from other commands (stdout). Regular expressions are central to grep: The re in the middle of the name stands for "regular expression."

This article uses grep because it doesn't require that you set up a particular coding environment or write any code to work with the examples of regular expressions demonstrated in this article. All you need to do is copy and paste an example onto the command line of a Linux terminal and you'll see results immediately. The grep command can be used in any shell.

Because this article focuses on regular expressions as a language, and not on manipulating files, the examples use samples of text piped to grep instead of input files.

How to use grep against content in a file

To print lines in a file that match a regular expression, use the following syntax:

$ grep -options <regular_expression> /paths/to/files

In this command syntax:

  • -options, if specified, control the behavior of the command.
  • <regular_expression> indicates the regular expression to execute against the files.
  • /paths/to/files indicates one or more files against which the regular expression will be executed.

The options used in this article are:

  • -P: Apply regular expressions in the style of the Perl programming language. This option, which is specific to the GNU version of grep found on GNU/Linux, is used in the article to unlock powerful features that aren't recognized by grep by default. There is nothing specific to Perl in the regular expressions used in this article; the same features can be found in many programming languages.
  • -i: Match in a case-insensitive manner.
  • -o: Print only the characters matching the regular expression. By default, the whole line containing the matching string is printed.
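To make the effect of these options concrete, here is a small sketch that runs the same search three ways. The line variable and its text are assumptions made up for this illustration.

```shell
# Hypothetical sample text; any line containing "Fido" would do.
line="Gregg and the dog Fido."

# Without -o, grep prints the whole line that contains a match.
whole=$(echo "$line" | grep -P 'Fido')

# With -o, grep prints only the matched characters.
only=$(echo "$line" | grep -Po 'Fido')

# With -i, case is ignored, so the lowercase pattern still matches "Fido".
nocase=$(echo "$line" | grep -Poi 'fido')

printf '%s\n' "$whole" "$only" "$nocase"
```

The first command prints the entire sample line, while the second and third each print just Fido.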

How to pipe content to a regular expression

As mentioned earlier, you can also use a regular expression to filter output from stdout. The following example uses the pipe symbol (|) to feed the result of an echo command to grep.

$ echo "I like using regular expressions." | grep -Po 'r.*ar'

The command produces the following output:

regular

Why does grep return the characters regular to match the regular expression specified here? We'll explore the reasons in subsequent sections of this article.

Regular characters, metacharacters, and patterns: The building blocks of regular expressions

You'll use three basic building blocks when working with regular expressions: regular characters, metacharacters, and patterns. Regular characters and metacharacters are used to create a regular expression, and that regular expression represents a matching pattern that the regex engine applies to some content.

You can think of a metacharacter as a placeholder symbol. For example, the . metacharacter (a dot or period) represents "any character." The \d metacharacter represents any single numeral, 0 through 9.

The * metacharacter is shorthand for the instruction "match the preceding character zero or more times in a row." (You'll see how to work with the * metacharacter in sections to come.)

Regular expressions support many metacharacters, each worthy of a page or two of description. For now, the important thing to understand is that a metacharacter is a reserved symbol used by the regex engine to describe a character in a generic manner. Also, certain metacharacters are a shorthand for a search instruction.
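As a small preview of a metacharacter acting as a search instruction, the sketch below applies * to the regular character a. The sample string is an assumption chosen so that c and t are separated by zero, one, or two a characters.

```shell
# Hypothetical sample: "c" and "t" with zero, one, or two "a" characters between them.
sample="ct cat caat"

# a* means "zero or more occurrences of a", so all three words match.
matches=$(echo "$sample" | grep -Po 'ca*t')
echo "$matches"
```

Each of the three words is printed on its own line, because a* is satisfied by no a at all as well as by one or more of them.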

You can combine regular characters with metacharacters to declare rules that define search patterns. For example, consider the following short regular expression:

.t

This matches a pattern consisting of two characters. The first character can be any character, as declared by the . (dot) metacharacter, but the second character must be t. Thus, applying the regular expression .t to the string I like cats but not rats matches the strings marked with square brackets here:

I like c[at]s b[ut] n[ot] r[at]s
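You can check this yourself from the command line. The following sketch pipes the same sentence to grep; the matches variable is simply a convenience for capturing the output.

```shell
# Apply the regular expression .t to the sample sentence.
# -o prints each match on its own line.
matches=$(echo "I like cats but not rats" | grep -Po '.t')
echo "$matches"
```

The command prints four two-character matches, one per line: at, ut, ot, and at.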

You can do a lot using just the basic metacharacters to create regular expressions with grep. The following sections provide a number of useful examples.

Running basic regular expressions

The following subsections demonstrate various examples of regular expressions. The examples are presented as two commands to enter in a Linux terminal. The first command creates a variable named teststr that contains a sample string. The second executes the echo command against teststr and pipes the result of the echo command to grep. The grep command then filters the input according to the associated regular expression.

How to declare an exact pattern match using regular characters

The following example demonstrates how to search a string according to the pattern of regular characters, Fido. The search declaration is case-sensitive:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'Fido'

The result is:

Fido

How to declare a case-insensitive exact pattern match

The following example demonstrates how to search a string according to a pattern of regular characters, fido. The search declaration is case-insensitive, as indicated by the -i option in the grep command. Thus, the regex engine will find occurrences such as FIDO as well as fido or fiDo.

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Poi 'fido'

The result is:

Fido

How to declare a logical pattern match

The following example uses the | metacharacter symbol to search according to a this or that condition—that is, a condition that can be satisfied by either of the regular expressions on either side of |. In this case, the regular expression matches occurrences of the regular character f or g:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'f|g'

The grep command identifies each occurrence that satisfies the rule declared in the regular expression. Conceptually, the regular expression is saying, Return any character that is either an f or a g. We are leaving the search case-sensitive, as is the default. Thus, the identified characters are marked with square brackets here:

Je[f][f] and the pet Lucky. Gre[g][g] and the do[g] Fido. Chris has 1 bird named Tweety.

Because each character is identified and returned on a one-by-one basis, the output sent to the terminal window is:

f
f
g
g
g
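The expressions on either side of | do not have to be single characters. As a brief sketch, assuming you want to find either pet's name in the same sample string, each side can be a whole word:

```shell
teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."

# Match either the word Lucky or the word Fido.
matches=$(echo "$teststr" | grep -Po 'Lucky|Fido')
echo "$matches"
```

Here the output is Lucky followed by Fido, each on its own line, in the order they appear in the text.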

How to find a character at the beginning of a line

The following example uses the ^ metacharacter, which matches the position at the beginning of a line of text rather than any character on the line.

The example executes the regular expression ^J. This regular expression searches for a match that satisfies two conditions. The first condition is to find the beginning of the line; the next is to find the regular character J at that position.

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '^J'

The regular expression matches the character marked with square brackets as shown here:

[J]eff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety.

The result returned to the terminal is:

J
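The effect of ^ is easier to see when the input has more than one line. The sketch below uses printf to send three made-up lines to grep; the text of those lines is an assumption for illustration only.

```shell
# Hypothetical three-line input: only the first two lines begin with "J".
# The third line contains a "J" (in Jill), but not at the start of the line.
matches=$(printf 'Jeff has a pet.\nJill knows Jeff.\nGregg knows Jill.\n' | grep -Po '^J')
echo "$matches"
```

The output is two lines, each containing J. The J in Jeff on the second line and the J in Jill on the third line are skipped because they are not at the beginning of a line.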

How to find a character at the end of a line

The following example uses the $ metacharacter to search for the end of a line of text.

The example executes the regular expression \.$. The regular expression declares a matching rule that has two conditions. First, the regular expression searches for an occurrence of the regular character . (dot). Then the regular expression looks to see whether the end of the line is next. Thus, if the . character comes at the end of the line, it's deemed a match.

The regular expression includes a backslash (\) as an "escape" metacharacter before the dot. The escape metacharacter is needed to override the normal meaning of the dot as a metacharacter. Remember that the . (dot) metacharacter means any character. With the escape character, the dot is treated as a regular character, and so matches just itself:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '\.$'

The regular expression matches the final dot in the text, marked with square brackets as shown here:

Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety[.]

The result is just the final dot:

.

Suppose you were to use an unescaped dot in the regular expression:

$ echo $teststr | grep -Po '.$'

You would get the same result as using the escaped dot, but a different logic is being executed. That logic is: Match any character that is the last character before the end of the line. Thus, the regular expression would match any non-empty line, whatever its last character happens to be. Using the escape character to identify a character as a regular character is a subtle distinction in this case, but an important one nonetheless.
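The difference shows up as soon as a line ends in something other than a dot. The following sketch uses a hypothetical line that ends with an exclamation point:

```shell
# Hypothetical sample line that ends with "!" instead of ".".
line="What a good dog!"

# Unescaped dot: matches whatever the last character is.
any_last=$(echo "$line" | grep -Po '.$')

# Escaped dot: matches only a literal "." at the end of the line.
# grep exits with a nonzero status when nothing matches, hence "|| true".
literal_dot=$(echo "$line" | grep -Po '\.$' || true)

echo "unescaped: '$any_last'"
echo "escaped:   '$literal_dot'"
```

The unescaped version reports ! as a match, while the escaped version finds nothing and leaves its variable empty.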

How to find multiple characters at the end of a line

The following example searches the string assigned to the variable teststr to match the characters ty. when they appear at the end of a line.

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'ty\.$'

The result is:

ty.

Again, note the use of the escape metacharacter (\) to declare the . (dot) character as a regular character.

How to find occurrences of a character using the metacharacters for matching numerals

The following example uses the \d metacharacter to create a regular expression that looks for matches of any numeral in a given piece of text.

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '\d'

Because each numeral is matched and returned on a one-by-one basis, the output sent to the terminal is:

9
2
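Because \d matches one numeral at a time, a number with several digits is returned digit by digit. As a sketch, assuming a variation of the sample string that contains the number 12, you can combine \d with the * metacharacter introduced earlier to keep the digits together:

```shell
# Variation of the sample string with a two-digit number (an assumption for this sketch).
teststr="There are 12 cats and 2 dogs in a box."

# \d alone returns every numeral separately.
single=$(echo "$teststr" | grep -Po '\d')

# \d\d* means "one numeral followed by zero or more numerals".
runs=$(echo "$teststr" | grep -Po '\d\d*')

echo "$single"
echo "$runs"
```

The first search prints 1, 2, and 2 on separate lines. The second prints 12 and then 2.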

How to find a string using metacharacters for a numeral and a space

The following example uses the \d and \s metacharacters along with regular characters to create a regular expression that matches text according to the following logic: Match any numeral that is followed by a space and then the regular characters cats.

The \d metacharacter matches a numeral and the \s metacharacter matches a whitespace character (a space, a tab, or a few other rare characters):

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '\d\scats'

The result is:

9 cats
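Since \s stands for any whitespace character, the same regular expression also matches when the numeral and the word are separated by a tab. The sketch below uses printf to produce a made-up line containing a tab:

```shell
# Hypothetical input in which a tab (\t), not a space, follows the numeral 9.
matches=$(printf 'There are 9\tcats and 2 dogs.\n' | grep -Po '\d\scats')
echo "$matches"
```

The match that is printed consists of the numeral 9, the tab character, and the word cats.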

How to combine metacharacters to create a complex regular expression

The following example uses the \d metacharacter to match a numeral, \s to match a space, and . (dot) to match any character. The regular expression uses the * metacharacter to say, Match zero or more successive occurrences of the preceding character.

The logic expressed in the regular expression is this: Find a string of text that starts with a numeral followed by a space character and the regular characters cats. Then keep going, matching any characters until you come to another numeral followed by a space character and the regular characters dogs:

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '\d\scats.*\d\sdogs'

The result is:

9 cats and 2 dogs

How to traverse a line of text to a stop point

The following example uses the . (dot) metacharacter and * along with the regular characters cats to create a regular expression with the following logic: Match any character zero or more times until you come to the characters cats:

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '.*cats'

The result is:

There are 9 cats

The interesting thing about this regular expression is that starting from the beginning of the line is implicit. The ^ metacharacter could be used to indicate the start of a line, but because the regular expression matches any characters until you come to cats, it isn't necessary to explicitly declare the start of the line using ^. The regular expression starts processing from the beginning of the line by default.
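One detail worth seeing in practice is that .* takes as many characters as it can. In the sketch below, the sample string is an assumption that contains the word cats twice, and the search is run both with and without ^:

```shell
# Variation of the sample string in which "cats" appears twice (an assumption for this sketch).
teststr="There are 9 cats and 2 more cats in a box."

# Without ^: the match still begins at the start of the line.
implicit=$(echo "$teststr" | grep -Po '.*cats')

# With ^: the start of the line is declared explicitly.
anchored=$(echo "$teststr" | grep -Po '^.*cats')

echo "$implicit"
echo "$anchored"
```

Both commands print There are 9 cats and 2 more cats. The match runs to the last occurrence of cats, not the first, and adding ^ makes no difference to the result.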

Regular expressions uncover patterns in text

Regular expressions offer a powerful yet concise way to do complex text filtering. You can use them in programming languages such as JavaScript, Python, Perl, and C++, and directly in a Linux terminal to process files and text using the grep command, as demonstrated in this article.

Getting the hang of regular expressions takes time. Mastering the intricacies of working with the metacharacters alone can be daunting. Fortunately, the learning curve is gradual. You don't have to master the entirety of regular expressions to work with them usefully as a beginner. You can start with the basics, and as you learn more you can do more. Just being able to do pattern matching using the basic examples shown in this article can provide immediate benefit.

An upcoming article in this series will explain regular expression features that are even more powerful.

Last updated: September 15, 2022
