Filter content in HTML using regular expressions in grep

This article is the third in a series about running regular expressions with the grep executable that ships with Linux operating systems. The grep command filters content in a file or in text piped to it on standard input, such as the output of another command.

The first article in this series described the basics of using metacharacters and regular characters to create regular expressions. The second discussed working with quantifiers, pattern collections, groups, and word boundaries in regular expressions. This article uses the features described in the previous articles, along with new ones, to match and filter content in HTML files.

Matching and retrieving text from HTML is a common task for a broad variety of IT professionals, particularly when troubleshooting issues in web pages. Thus, being able to apply regular expressions to HTML files is a useful skill.

The article uses grep because it doesn't require you to set up a coding environment or write a program to try the regular expressions demonstrated here. Copy and paste an example onto the command line of a Linux terminal and you'll see results immediately.

This article is divided into three sections. The first shows you how to create regular expressions that execute against a single HTML file. The second shows you how to work with multiple HTML files. The last shows you how to use a special command-line utility named pcre2grep to execute regular expressions against text split over multiple lines in one or many HTML files.

Regular characters versus metacharacters

A regular character represents itself in the text you're searching. Examples include the letters a, g, or t, or the numerical digits 3 or 8. When you declare a regular character in a regular expression, the regular expression engine searches content for the declared character.

A metacharacter represents a group of characters or another aspect of searching. You can think of a metacharacter as a placeholder symbol. For example, the metacharacter . (dot) represents "any character" and the metacharacter \d represents any digit. Note that GNU grep recognizes \d only in Perl-compatible mode (the -P option).
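The difference between the two kinds of characters can be sketched on the command line. This is a minimal illustration, not one of the article's own examples: the four sample words are hypothetical, and the bracket expression [0-9] stands in for \d so that the commands run in grep's default mode.

```shell
# Hypothetical sample lines, not taken from the article's HTML file.
sample=$(printf '%s\n' 'cat' 'cot' 'c3t' 'dog')

# Regular characters only: the pattern matches the literal text c, a, t.
literal=$(printf '%s\n' "$sample" | grep 'cat')

# The metacharacter . stands for any single character,
# so c.t matches cat, cot, and c3t but not dog.
any_char=$(printf '%s\n' "$sample" | grep 'c.t')

# A digit between c and t. [0-9] works in every grep mode;
# \d expresses the same idea but needs GNU grep's -P option.
digit=$(printf '%s\n' "$sample" | grep 'c[0-9]t')

printf 'literal match:\n%s\n' "$literal"
printf 'any-character match:\n%s\n' "$any_char"
printf 'digit match:\n%s\n' "$digit"
```

The variable names (sample, literal, any_char, digit) exist only for this sketch.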

Running regular expressions using grep against a single HTML file

In this section, you'll see a variety of regular expressions executed against a single file of HTML. The HTML content used for the demonstration follows. Store the content in a file named regex-content-01.html:

<html>
<head>
<title>A list of interesting and uninteresting people
</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
      <h1>Interesting People</h1>
            <ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
      <h1>Uninteresting People</h1>
            <ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
</body>
</html>
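Before moving on, it can help to confirm that grep reads the file as expected. The sketch below is an assumption-laden illustration rather than part of the original walkthrough: it writes a trimmed copy of the sample (only the h1 and li lines) into a temporary directory so that it does not overwrite a file you may have already saved, and the variable names are hypothetical.

```shell
# Write a trimmed copy of the sample into a temporary directory.
workdir=$(mktemp -d)
sample_file="$workdir/regex-content-01.html"

cat > "$sample_file" <<'EOF'
<h1>Interesting People</h1>
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<h1>Uninteresting People</h1>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
EOF

# -c counts the lines that contain a list item.
li_count=$(grep -c '<li>' "$sample_file")

# -o prints only the matched text instead of the whole line.
headings=$(grep -o '<h1>[^<]*</h1>' "$sample_file")

# grep also reads standard input, so it can filter another command's output.
john_lines=$(cat "$sample_file" | grep -c 'John')

echo "list items: $li_count"
echo "$headings"
echo "lines containing John: $john_lines"

# Clean up the temporary copy.
rm -r "$workdir"
```

With the full file from the listing saved in your working directory, the same grep commands can be pointed at regex-content-01.html directly.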
