This article is the third in a series about executing regular expressions using the grep executable that ships with Linux operating systems. The grep command filters content read from a file or from standard input (stdin), such as the output piped from another command.

The first article in this series described the basics of using metacharacters and regular characters to create regular expressions. The second discussed working with quantifiers, pattern collections, groups, and word boundaries in regular expressions. This article uses the features described in the previous articles, along with new ones, to match and filter content in HTML files.

Matching and retrieving text from HTML is a common task for a broad variety of IT professionals, particularly when troubleshooting issues in web pages. Thus, being able to apply regular expressions to HTML files is a useful skill.

The article uses grep because it doesn't require you to set up a particular coding environment or write any complex code to work with the regular expressions demonstrated here. All you need to do is copy and paste an example onto the command line of a Linux terminal and you'll see results immediately.

This article is divided into three sections. The first shows you how to create regular expressions that execute against a single HTML file. The second shows you how to work with multiple HTML files. The last shows you how to use a special command-line utility named pcre2grep to execute regular expressions against text split over multiple lines in one or many HTML files.

Regular characters versus metacharacters

A regular character represents itself in the text you're searching. Examples include the letters a, g, or t, or the numerical digits 3 or 8. When you declare a regular character in a regular expression, the regular expression engine searches content for the declared character.

A metacharacter represents a group of characters or other aspects of searching. You can think of a metacharacter as a placeholder symbol. For example, the metacharacter . (dot) represents "any character" and the metacharacters \d represent any digit.
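
As a quick illustration of the distinction, the following sketch pipes a short sample string into grep. The sample text is made up for this illustration and is not part of the HTML files used later. The sketch assumes a GNU grep that supports Perl-compatible regular expressions (the -P option).

```shell
# Hypothetical sample text, used only for this illustration.
sample='cat 3 cot 8 cut'

# Regular characters: match the literal string "cat".
literal=$(printf '%s\n' "$sample" | grep -Po 'cat')

# Metacharacter . (dot): match c, then any one character, then t.
dot=$(printf '%s\n' "$sample" | grep -Po 'c.t')

# Metacharacters \d: match any single digit.
digits=$(printf '%s\n' "$sample" | grep -Po '\d')

printf 'literal:\n%s\n' "$literal"
printf 'dot:\n%s\n' "$dot"
printf 'digits:\n%s\n' "$digits"
```

The literal pattern finds one string, while the two metacharacter patterns each find every string that fits the placeholder.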

Running regular expressions using grep against a single HTML file

In this section, you'll see a variety of regular expressions executed against a single file of HTML. The HTML content used for the demonstration follows. Store the content in a file named regex-content-01.html:

<html>
<head>
<title>A list of interesting and uninteresting people
</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
      <h1>Interesting People</h1>
            <ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
      <h1>Uninteresting People</h1>
            <ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
</body>
</html>

Format for using grep against a single file

The format for using grep against an HTML file at the command line is as follows:

$ grep -Po <regular_expression> <path/with/filename>

The elements of the syntax are as follows:

  • grep is the binary executable.
  • -Po contains options passed to grep. The P option interprets the regular expression as a Perl regular expression. The o option makes grep output only the text that matches, not the full lines containing it.
  • <regular_expression> is the regular expression to execute.
  • <path/with/filename> is the target file's location within the computer's file system, such as ~/Documents/somefile.html. If you provide a plain filename without a path, it refers to a file in the current working directory.
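
To see what the o option changes, the following sketch runs the same expression with and without it against a one-line file. The file name sample.html and the temporary directory are placeholders created only for this illustration.

```shell
# Work in a throwaway directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Placeholder file for this illustration.
printf '%s\n' '<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>' > sample.html

# Without -o, grep prints the whole line that contains a match.
whole_line=$(grep -P 'mick@\w+' sample.html)

# With -o, grep prints only the text that matched.
match_only=$(grep -Po 'mick@\w+' sample.html)

printf 'without -o: %s\n' "$whole_line"
printf 'with -o:    %s\n' "$match_only"
```

The examples in this article use -o so that the output shows exactly what each regular expression matched.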

The following subsections match individual lines within a single file.

Matching occurrences of a string using regular characters

The following example matches occurrences of a set of regular characters in an HTML file named regex-content-01.html. In this case, the regular characters form the string people:

$ grep -Po 'people' regex-content-01.html

The logic that the regular expression executes is as follows: Match any occurrence of the regular characters people in the file regex-content-01.html.

The output is:

people
people
people
people

Matching occurrences of a string using metacharacters and regular characters

The following example matches occurrences of a set of metacharacters and regular characters in regex-content-01.html:

$ grep -Po '@.*people' regex-content-01.html

The logic that the regular expression executes is as follows: In the file regex-content-01.html, match any occurrence of the regular character @ followed by occurrences of any characters zero or more times (.*) until the regular characters people occur.

The output is:

@uninterestingpeople
@uninterestingpeople
@uninterestingpeople

An extended version of the previous example is:

$ grep -Po '@.*people.*\.com' regex-content-01.html

The logic that the regular expression executes is as follows: In the file regex-content-01.html, match any occurrence of the regular character @ followed by occurrences of any characters zero or more times (.*) until the regular characters people occur. Then, match any characters zero or more times (.*) until the regular characters .com occur. Note that the escape metacharacter (\) is used before the dot regular character (.) like so: \.. Using the escape metacharacter indicates that the regular expression has to process the dot as a regular character (.) and not as the metacharacter that means "any character."

The output is:

@uninterestingpeople.com
@uninterestingpeople.com
@uninterestingpeople.com
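
The effect of escaping the dot can be checked with a small experiment. The two input lines below are hypothetical; the second one deliberately uses a hyphen where the dot should be.

```shell
# Hypothetical input: one address ending in ".com" and one near miss.
input='up@uninterestingpeople.com
up@uninterestingpeople-com'

# Unescaped dot: "." stands for any character, so both lines match.
unescaped=$(printf '%s\n' "$input" | grep -Po '@.*people.com')

# Escaped dot: "\." stands for a literal dot, so only the first line matches.
escaped=$(printf '%s\n' "$input" | grep -Po '@.*people\.com')

printf 'unescaped:\n%s\n' "$unescaped"
printf 'escaped:\n%s\n' "$escaped"
```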

Case-insensitive match using metacharacters and regular characters

The following example demonstrates how to run grep using a case-insensitive regular expression, matching either uppercase or lowercase instances of characters. The key to creating a case-insensitive regular expression is to use the -i option when running grep. The -i option indicates case-insensitive processing.

In this case, the regular expression returns any line that matches the regular characters mick jagger in a case-insensitive manner.

$ grep -Poi '.*mick jagger.*' regex-content-01.html

The logic that the regular expression executes is as follows: In the file regex-content-01.html, match any line that has zero or more occurrences of any character (.*) until the regular characters mick jagger occur in either lowercase or uppercase. Then match zero or more occurrences of any character (.*).

The result is the following. 

                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
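
As a variation, the sketch below drops the surrounding .* so that only the name is printed, and shows that a Perl-compatible expression can also request case-insensitive matching from inside the pattern with the inline modifier (?i). The input line is copied from the sample file; the variable names are arbitrary.

```shell
# One line from the sample file, supplied on standard input.
line='<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>'

# Case-insensitive matching requested with the -i option.
with_option=$(printf '%s\n' "$line" | grep -Poi 'mick jagger')

# Case-insensitive matching requested inside the expression with (?i).
with_inline_flag=$(printf '%s\n' "$line" | grep -Po '(?i)mick jagger')

printf '%s\n' "$with_option" "$with_inline_flag"
```

In both cases grep prints the text as it appears in the input, with its original capitalization.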

Matching HTML list items

The following example looks for lines of text that match a string that starts with the tag <li> and ends with the tag </li>:

$ grep -Po '<li>.*</li>' regex-content-01.html

The logic that the regular expression executes is as follows: In the file regex-content-01.html, match lines of text that have an occurrence of the regular characters <li> followed by zero or more occurrences of any character (.*), followed by the regular characters </li>.

The output is:

<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
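
The expression above works cleanly because each line of the sample file holds exactly one list item. The following sketch, which uses a hypothetical line containing two items, shows how the greedy quantifier .* behaves when that is not the case, and how the lazy form .*? changes the result.

```shell
# Hypothetical line with two list items on it.
line='<li>Mick Jagger</li><li>Joan Jett</li>'

# Greedy: .* extends to the last </li> on the line, giving one match.
greedy=$(printf '%s\n' "$line" | grep -Po '<li>.*</li>')

# Lazy: .*? stops at the first </li> it can, giving one match per item.
lazy=$(printf '%s\n' "$line" | grep -Po '<li>.*?</li>')

printf 'greedy:\n%s\n' "$greedy"
printf 'lazy:\n%s\n' "$lazy"
```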

Matching occurrences of a string within an HTML tag according to a range of ID values

The following example defines a character class that declares a range of numerals to match within the id attribute of a <div> tag:

$ grep -Po '<li><div id="[2-4]">.*</li>' regex-content-01.html

The logic that the regular expression executes is as follows: In the file regex-content-01.html, match lines of text that have an occurrence of the regular characters <li><div id=" followed by any regular character that is a numeral in the range 2 to 4 ([2-4]). Then match the regular characters "> followed by zero or more occurrences of any character (.*) that are then followed by the regular characters </li>.

The output is:

<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
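
One detail worth keeping in mind is that a character class such as [2-4] matches exactly one character. The sketch below uses hypothetical id values, including a two-digit one, to show which attributes the class accepts.

```shell
# Hypothetical tags with one-digit and two-digit id values.
input='<div id="1">
<div id="3">
<div id="24">
<div id="4">'

# [2-4] must be the only character between the quotes, so id="24" is skipped.
matches=$(printf '%s\n' "$input" | grep -Po 'id="[2-4]"')

printf '%s\n' "$matches"
```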

Working with multiple HTML files

The following examples demonstrate how to execute regular expressions against multiple HTML files. If you want to get hands-on experience working with the examples in this section, copy and paste the following HTML into a file named regex-content-02.html and save it in the same directory where you previously created regex-content-01.html:

<html>
 <head>
 <title>A list of cool animals
 </title>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 </head>
 <body bgcolor="#000000" text="#ffffff">
      <h1>Interesting Pets</h1>
            <ul>
                  <li><div id="1">Daffy Duck</div></li>
                  <li><div id="2">Porky Pig</div></li>
                  <li><div id="3">Bugs Bunny</div></li>
                  <li><div id="4">Huckleberry Hound</div></li>
                  <li><div id="5">Crusader Rabbit</div></li>
                  <li><div id="6">Top Cat</div></li>
                  <li><div id="7">Rags T. Tiger</div></li>
            </ul>
</body>
</html>

Formats for using grep against multiple files

The format for using grep against multiple HTML files at the command line is as follows:

$ grep -Po <regular_expression> <path/with/filename-01.html> <path/with/filename-02.html> ... <path/with/filename-n.html>

The elements of the syntax are as follows:

  • grep is the binary executable.
  • -Po contains options passed to grep. The P option interprets the regular expression as a Perl regular expression. The o option makes grep output only the text that matches, not the full lines containing it.
  • <regular_expression> is the regular expression to execute.
  • <path/with/filename-01.html>, <path/with/filename-02.html>, and <path/with/filename-n.html> are the various target files within the computer's file system.

The format for using grep with a file specification against multiple HTML files is as follows.

$ grep -Po <regular_expression> <wild_card>.html

<wild_card>.html picks out filenames using wildcard characters. For example, the following declaration finds all files in the current working directory that have any filename ending with the .html filename extension:

*.html

The following declaration finds all files that start with the characters regex-content-0, followed by any single character (?) and ending with the extension .html:

regex-content-0?.html
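
The wildcards are expanded by the shell before grep runs, so grep simply receives a list of filenames. The following sketch makes the expansion visible with echo. It creates empty files in a temporary directory; regex-content-10.html and notes.txt are hypothetical names added only to show what each wildcard excludes.

```shell
# Work in a throwaway directory with a few empty files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch regex-content-01.html regex-content-02.html regex-content-10.html notes.txt

# * matches any run of characters, so every .html file is listed.
all_html=$(echo *.html)

# ? matches exactly one character, so regex-content-10.html is left out.
narrowed=$(echo regex-content-0?.html)

printf 'all:      %s\n' "$all_html"
printf 'narrowed: %s\n' "$narrowed"
```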

The following subsections match individual lines within multiple files.

Matching a string of regular characters across multiple HTML files

The following example matches all strings of regular characters Duck that occur in all files in the current directory that have file names that end with the extension .html:

$ grep -Po 'Duck' *.html

The logic that the regular expression executes is as follows: Match any occurrence of the regular characters Duck in all files in the current directory that have filenames that end with the extension .html.

The output is:

regex-content-01.html:Duck
regex-content-02.html:Duck

Matching lines of text across multiple HTML files according to metacharacters and regular characters

The following command matches a string plus all surrounding text on the same line:

$ grep -Po '.*Duck.*' *.html

The logic that the regular expression executes is as follows: In all files in the current directory that have file names that end with the extension .html, match lines of text that have zero or more occurrences of any character (.*), then the regular characters Duck, followed by zero or more occurrences of any character (.*).

The result is the following.

regex-content-01.html:                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:                  <li><div id="1">Daffy Duck</div></li>

Matching occurrences of characters within an HTML tag across multiple HTML files

The following example finds all characters between the <li> and </li> HTML tags in all files in the current directory that have file names that end with the extension .html:

$ grep -Po '<li>.*</li>' *.html

The logic that the regular expression executes is as follows: In all files in the current directory that have file names that end with the extension .html, match lines of text that have an occurrence of the regular characters <li> followed by zero or more occurrences of any character (.*), followed by the regular characters </li>.

The output is:

regex-content-01.html:<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
regex-content-01.html:<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
regex-content-01.html:<li><div id="3">John Lennon<br>john@beatles.io</div></li>
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-01.html:<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
regex-content-02.html:<li><div id="2">Porky Pig</div></li>
regex-content-02.html:<li><div id="3">Bugs Bunny</div></li>
regex-content-02.html:<li><div id="4">Huckleberry Hound</div></li>
regex-content-02.html:<li><div id="5">Crusader Rabbit</div></li>
regex-content-02.html:<li><div id="6">Top Cat</div></li>
regex-content-02.html:<li><div id="7">Rags T. Tiger</div></li>

Matching occurrences of specific characters within an HTML tag across multiple HTML files

The following command finds <li> elements containing a particular string, Duck:

$ grep -Po '<li>.*Duck.*</li>' *.html

The logic that the regular expression executes is as follows: In all files in the current directory that have file names that end with the extension .html, match lines of text that have an occurrence of the regular characters <li> followed by zero or more occurrences of any character (.*), then the regular characters Duck, followed by zero or more occurrences of any character (.*), which are then followed by the regular characters </li>.

The output is:

regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
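
When searching several files, a few other grep options control how filenames appear in the output. The sketch below is a minimal illustration using two small placeholder files, people.html and pets.html, created in a temporary directory. The options shown are -l (list only the names of files that contain a match), -h (omit the filename prefix), and -c (count matching lines per file).

```shell
# Work in a throwaway directory with two placeholder files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>' \
              '<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>' > people.html
printf '%s\n' '<li><div id="1">Daffy Duck</div></li>' > pets.html

# -l: print only the names of the files that contain a match.
files_only=$(grep -l 'Duck' *.html)

# -h: print the matches without the filename prefix.
no_prefix=$(grep -Poh 'Duck' *.html)

# -c: print the number of matching lines for each file.
counts=$(grep -c 'Duck' *.html)

printf '%s\n' "$files_only" "$no_prefix" "$counts"
```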

Working across multiple lines of HTML

One of the shortcomings of the grep command is that it does not allow you to execute a regular expression across line breaks in text. For example, consider the file regex-content-02.html, which has the following snippet of HTML:

<title>A list of cool animals
</title>

The following regular expression in a grep command will not match the <title>...</title> content in the previous snippet, because grep evaluates each line separately, so the line break metacharacters (\n) never find a line break to match:

$ grep -Po '<title.*\n.*</title>' regex-content-02.html

Other regular expression environments, such as those found in JavaScript, Java, PHP, and C#, can match across line breaks, but grep, which by default processes its input one line at a time, cannot. To match across multiple lines of text at the command line, you can use pcre2grep.
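
The limitation can be checked with a short experiment. The sketch below writes the two-line title snippet to a placeholder file named title.html in a temporary directory and tests whether grep reports a match. It assumes GNU grep with its default line-by-line processing.

```shell
# Work in a throwaway directory with the two-line title snippet.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '<title>A list of cool animals' '</title>' > title.html

# Try to match across the line break; grep reports failure via its exit status.
if two_lines=$(grep -Po '<title.*\n.*</title>' title.html); then
  printf 'matched: %s\n' "$two_lines"
else
  echo 'no match across the line break'
fi

# Each line can still be matched on its own.
opening=$(grep -Po '<title>.*' title.html)
printf 'single-line match: %s\n' "$opening"
```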

Installing pcre2grep

The pcre2grep executable typically needs to be installed by an administrator on a Linux computer. The command does not ship by default.

Run the following commands to install pcre2grep on a computer running Red Hat Enterprise Linux, Fedora, or CentOS Stream:

$ sudo dnf update
$ sudo dnf install pcre2-tools -y

Run the following commands to install pcre2grep on a computer running Ubuntu or another system based on Debian:

$ sudo apt update
$ sudo apt install pcre2-utils -y

Once you have pcre2grep installed, you can run the following examples against the HTML files you created previously.

Finding the content between HTML tags that are defined over two lines

The following example uses pcre2grep to match all content that covers two lines in a case-insensitive manner between <title> and </title> tags, including the tags themselves, within all HTML files in the current directory.

The example uses the -Mi options with pcre2grep. The M option allows matching over multiple lines. The i option conducts matches in a case-insensitive manner. The example uses the metacharacters \n to indicate a line break:

$ pcre2grep -Mi '<title.*\n.*</title>' *.html

The logic that the regular expression executes is as follows: Match the contents in any file in the current directory that has the extension .html. Search the file contents to match the characters <title followed by zero or more occurrences of any character (.*) until the line break metacharacters (\n) occur. Then continue matching occurrences of any character (.*) until the regular characters </title> occur.

The output is:

regex-content-01.html:<title>A list of interesting and uninteresting people
</title>
regex-content-02.html: <title>A list of cool animals
</title>

The example shown above matches only a single line break. The regular expression will not find a match if content including the <title> and </title> tags covers more than two lines, like so:

<title>
A list of
cool animals
</title>

The way to address the problem is to use the metacharacters (?s) so the regular expression interprets the dot metacharacter (.) to include line breaks, as shown in the next example.

Finding the content between unordered list tags in HTML files using (?s)

The following example prepends the metacharacters (?s) to the regular expression to make it process the dot metacharacter (.) to include line breaks as "any character":

$ pcre2grep -Mo '(?s)<ul.+?ul>' *.html

The logic that the regular expression executes is as follows: Match the contents in any file in the current directory that has the extension .html. The "any character" metacharacter (.) can include line breaks, as indicated by the metacharacters (?s) at the start of the regular expression. Start by matching an occurrence of the regular characters <ul, followed by one or more occurrences of any character including line breaks, as few as possible (.+?), followed by an occurrence of the regular characters ul>. The question mark makes the quantifier lazy, so each match stops at the first closing ul> rather than running on to the last one in the file.

The output is:

regex-content-01.html:<ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
regex-content-01.html:<ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
regex-content-02.html:<ul>
                  <li><div id="1">Daffy Duck</div></li>
                  <li><div id="2">Porky Pig</div></li>
                  <li><div id="3">Bugs Bunny</div></li>
                  <li><div id="4">Huckleberry Hound</div></li>
                  <li><div id="5">Crusader Rabbit</div></li>
                  <li><div id="6">Top Cat</div></li>
                  <li><div id="7">Rags T. Tiger</div></li>
            </ul>
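
The same (?s) technique can be applied to the title snippet that spans four lines, shown earlier. The following is a minimal sketch rather than a definitive recipe: it writes that snippet to a placeholder file named title.html in a temporary directory and runs pcre2grep only if the command is available. The lazy quantifier .*? is a choice made for this sketch so that the match ends at the first </title>.

```shell
# Work in a throwaway directory with the four-line title snippet.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '<title>' 'A list of' 'cool animals' '</title>' > title.html

if command -v pcre2grep >/dev/null 2>&1; then
  # -M allows multi-line matches; (?s) lets the dot match line breaks.
  title=$(pcre2grep -Mo '(?s)<title>.*?</title>' title.html)
  printf '%s\n' "$title"
else
  echo 'pcre2grep is not installed; see the installation steps above' >&2
fi
```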

Putting it all together

Filtering content in HTML files is a useful skill for any IT professional working on the web. Whether you're a front-end developer trying to debug a misbehaving web page or a system administrator looking for particular words or phrases in directories full of HTML files, being able to execute regular expressions using grep or across multiple lines of text using pcre2grep is valuable.

The techniques covered in this article are but an introduction. There's a lot more to learn. Still, the basics presented here will provide a solid foundation upon which to move forward in your journey toward mastery of regular expressions.

Last updated: August 14, 2023