Filter content in HTML using regular expressions in grep

October 5, 2022
Bob Reselman
Related topics:
Linux
Related products:
Red Hat Enterprise Linux


    This article is the third in a series about executing regular expressions using the grep executable that ships with Linux operating systems. The grep command filters content in a file or in text piped to it through stdin.

    The first article in this series described the basics of using metacharacters and regular characters to create regular expressions. The second discussed working with quantifiers, pattern collections, groups, and word boundaries in regular expressions. This article uses the features described in the previous articles, along with new ones, to match and filter content in HTML files.

    Matching and retrieving text from HTML is a common task for a broad variety of IT professionals, particularly when troubleshooting issues in web pages. Thus, being able to apply regular expressions to HTML files is a useful skill.

    The article uses grep because that won't require you to set up a particular coding environment or write any complex programming code to work with the examples of regular expressions demonstrated in this article. All you need to do is copy and paste an example onto the command line of a Linux terminal and you'll see results immediately.

    This article is divided into three sections. The first shows you how to create regular expressions that execute against a single HTML file. The second shows you how to work with multiple HTML files. The last shows you how to use a special command-line utility named pcre2grep to execute regular expressions against text split over multiple lines in one or many HTML files.

    Regular characters versus metacharacters

    A regular character represents itself in the text you're searching. Examples include the letters a, g, or t, or the numerical digits 3 or 8. When you declare a regular character in a regular expression, the regular expression engine searches content for the declared character.

    A metacharacter represents a group of characters or other aspects of searching. You can think of a metacharacter as a placeholder symbol. For example, the metacharacter . (dot) represents "any character" and the metacharacters \d represent any digit.
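
    As a quick illustration of the difference, the following sketch pipes three made-up sample lines into grep. The sample text is an assumption chosen for the demonstration and is not part of the HTML files used later in this article.

```shell
# Three made-up sample lines: only the second contains a literal dot.
sample=$(printf 'a3b\na.b\naxb\n')

# The dot is a metacharacter here, so all three lines match.
any_char=$(printf '%s\n' "$sample" | grep -Po 'a.b')

# The escaped dot is a regular character, so only "a.b" matches.
literal_dot=$(printf '%s\n' "$sample" | grep -Po 'a\.b')

# \d matches a single digit, so only the "3" is printed.
digit=$(printf '%s\n' "$sample" | grep -Po '\d')

printf 'any character:\n%s\n' "$any_char"
printf 'literal dot: %s\n' "$literal_dot"
printf 'digit: %s\n' "$digit"
```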

    Running regular expressions using grep against a single HTML file

    In this section, you'll see a variety of regular expressions executed against a single file of HTML. The HTML content used for the demonstration follows. Store the content in a file named regex-content-01.html:

    <html>
    <head>
    <title>A list of interesting and uninteresting people
    </title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    </head>
    <body bgcolor="#ffffff" text="#000000">
          <h1>Interesting People</h1>
                <ul>
                      <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                      <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                      <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                      <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
                </ul>
          <h1>Uninteresting People</h1>
                <ul>
                      <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                      <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                      <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
                </ul>
    </body>
    </html>

    Format for using grep against a single file

    The format for using grep against an HTML file at the command line is as follows:

    $ grep -Po <regular_expression> <path/with/filename>

    The elements of the syntax are as follows:

    • grep is the binary executable.
    • -Po contains options passed to grep. The P option interprets the regular expression as a Perl regular expression. The o option makes grep output only the text that matches, not the full lines containing it.
    • <regular_expression> is the regular expression to execute.
    • <path/with/filename> is the target file's location within the computer's file system, such as ~/Documents/somefile.html. If you provide a plain filename without a path, it refers to a file in the current working directory.
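
    To see what the o option changes, the following sketch runs the same expression with and without it against one line borrowed from the sample HTML. The expression mick@\w+\.com is an assumption made up for this comparison.

```shell
# One line borrowed from the sample HTML, indentation included.
line='      <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>'

# Without -o, grep prints the entire line that contains a match.
whole_line=$(printf '%s\n' "$line" | grep -P 'mick@\w+\.com')

# With -o, grep prints only the text that the expression matched.
match_only=$(printf '%s\n' "$line" | grep -Po 'mick@\w+\.com')

printf 'without -o: %s\n' "$whole_line"
printf 'with -o:    %s\n' "$match_only"
```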

    The following subsections match individual lines within a single file.

    Matching occurrences of a string using regular characters

    The following example matches occurrences of a set of regular characters in an HTML file named regex-content-01.html. In this case, the regular characters form the string people:

    $ grep -Po 'people' regex-content-01.html

    The logic that the regular expression executes is as follows: Match any occurrence of the regular characters people in the file regex-content-01.html.

    The output is:

    people
    people
    people
    people

    Matching occurrences of a string using metacharacters and regular characters

    The following example matches occurrences of a set of metacharacters and regular characters in regex-content-01.html:

    $ grep -Po '@.*people' regex-content-01.html

    The logic that the regular expression executes is as follows: In the file regex-content-01.html, match any occurrence of the regular character @ followed by occurrences of any characters zero or more times (.*) until the regular characters people occur.

    The output is:

    @uninterestingpeople
    @uninterestingpeople
    @uninterestingpeople

    An extended version of the previous example is:

    $ grep -Po '@.*people.*\.com' regex-content-01.html

    The logic that the regular expression executes is as follows: In the file regex-content-01.html, match any occurrence of the regular character @ followed by occurrences of any characters zero or more times (.*) until the regular characters people occur. Then, match any characters zero or more times (.*) until the regular characters .com occur. Note that the escape metacharacter (\) is used before the dot regular character (.) like so: \.. Using the escape metacharacter indicates that the regular expression has to process the dot as a regular character (.) and not as the metacharacter that means "any character."

    The output is:

    @uninterestingpeople.com
    @uninterestingpeople.com
    @uninterestingpeople.com
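
    The .* metacharacters match any character, so the expressions above depend on each address sitting on its own line. A tighter alternative, sketched here under the assumption that the addresses consist only of word characters, dots, and hyphens, uses character classes to pull out complete email addresses. The input is a cut-down copy of three list items from regex-content-01.html.

```shell
# A cut-down sample: three list items taken from regex-content-01.html.
html=$(cat <<'EOF'
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
EOF
)

# One or more word characters, dots, or hyphens on each side of the @,
# ending in a dot followed by word characters (the top-level domain).
emails=$(printf '%s\n' "$html" | grep -Po '[\w.-]+@[\w.-]+\.\w+')

printf '%s\n' "$emails"
```

    Because < and > are not in the character classes, each match stops at the surrounding tags.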

    Case-insensitive match using metacharacters and regular characters

    The following example demonstrates how to run grep using a case-insensitive regular expression, matching either uppercase or lowercase instances of characters. The key to creating a case-insensitive regular expression is to use the -i option when running grep. The -i option indicates case-insensitive processing.

    In this case, the regular expression returns any line that matches the regular characters mick jagger in a case-insensitive manner.

    $ grep -Poi '.*mick jagger.*' regex-content-01.html

    The logic that the regular expression executes is as follows: In the file regex-content-01.html, match any line that has zero or more occurrences of any character (.*) until the regular characters mick jagger occur in either lowercase or uppercase. Then match zero or more occurrences of any character (.*).

    The result is the following. 

    <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
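
    The effect of the i option is easiest to see side by side. The following sketch, which uses one line from the sample file as its input, runs the same lowercase expression with and without the option.

```shell
line='<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>'

# Without -i the lowercase pattern does not match "Mick Jagger".
# grep exits with status 1 when nothing matches, so "|| true" keeps
# a script running under "set -e" from stopping here.
case_sensitive=$(printf '%s\n' "$line" | grep -Po 'mick jagger' || true)

# With -i the same pattern matches, and -o prints the text as it
# appears in the input, capitalization included.
case_insensitive=$(printf '%s\n' "$line" | grep -Poi 'mick jagger')

printf 'case-sensitive:   [%s]\n' "$case_sensitive"
printf 'case-insensitive: [%s]\n' "$case_insensitive"
```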

    Matching HTML list items

    The following example looks for lines of text that match a string that starts with the tag <li> and ends with the tag </li>:

    $ grep -Po '<li>.*</li>' regex-content-01.html

    The logic that the regular expression executes is as follows: In the file regex-content-01.html, match lines of text that have an occurrence of the regular characters <li> followed by zero or more occurrences of any character (.*), followed by the regular characters </li>.

    The output is:

    <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
    <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
    <li><div id="3">John Lennon<br>john@beatles.io</div></li>
    <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
    <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
    <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
    <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>

    Matching occurrences of a string within an HTML tag according to a range of ID values

    The following example defines a character class that declares a range of numerals to match within the id attribute of a <div> tag:

    $ grep -Po '<li><div id="[2-4]">.*</li>' regex-content-01.html

    The logic that the regular expression executes is as follows: In the file regex-content-01.html, match lines of text that have an occurrence of the regular characters <li><div id=" followed by any regular character that is a numeral in the range 2 to 4 ([2-4]). Then match the regular characters "> followed by zero or more occurrences of any character (.*) that are then followed by the regular characters </li>.

    The output is:

    <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
    <li><div id="3">John Lennon<br>john@beatles.io</div></li>
    <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
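
    If only the names are wanted rather than the whole list item, one possible refinement is the Perl-compatible \K escape, which drops everything matched so far from the reported match. The sketch below is an extension that this article does not otherwise cover. It assumes that each name runs from the end of the <div> tag up to the next < character.

```shell
# Five list items copied from regex-content-01.html.
html=$(cat <<'EOF'
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
EOF
)

# Match the opening <div> for ids 2 to 4, discard it with \K, then
# keep one or more characters that are not "<" (the name itself).
names=$(printf '%s\n' "$html" | grep -Po '<div id="[2-4]">\K[^<]+')

printf '%s\n' "$names"
```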

    Working with multiple HTML files

    The following examples demonstrate how to execute regular expressions against multiple HTML files. If you want to get hands-on experience working with the examples in this section, copy and paste the following HTML into a file named regex-content-02.html and save it in the same directory where you previously created regex-content-01.html:

    <html>
     <head>
     <title>A list of cool animals
     </title>
     <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
     </head>
     <body bgcolor="#000000" text="#ffffff">
          <h1>Interesting Pets</h1>
                <ul>
                      <li><div id="1">Daffy Duck</div></li>
                      <li><div id="2">Porky Pig</div></li>
                      <li><div id="3">Bugs Bunny</div></li>
                      <li><div id="4">Huckleberry Hound</div></li>
                      <li><div id="5">Crusader Rabbit</div></li>
                      <li><div id="6">Top Cat</div></li>
                      <li><div id="7">Rags T. Tiger</div></li>
                </ul>
    </body>
    </html>

    Formats for using grep against multiple files

    The format for using grep against multiple HTML files at the command line is as follows:

    $ grep -Po <regular_expression> <path/with/filename-01.html> <path/with/filename-02.html> ... <path/with/filename-n.html>

    The elements of the syntax are as follows:

    • grep is the binary executable.
    • -Po contains options passed to grep. The P option interprets the regular expression as a Perl regular expression. The o option makes grep output only the text that matches, not the full lines containing it.
    • <regular_expression> is the regular expression to execute.
    • <path/with/filename-01.html>, <path/with/filename-02.html>, and <path/with/filename-n.html> are the various target files within the computer's file system.

    The format for using grep with a file specification against multiple HTML files is as follows.

    $ grep -Po <regular_expression> <wild_card>.html

    <wild_card>.html picks out filenames using wildcard characters. For example, the following declaration finds all files in the current working directory that have any filename ending with the .html filename extension:

    *.html

    The following declaration finds all files that start with the characters regex-content-0, followed by any single character (?) and ending with the extension .html:

    regex-content-0?.html
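
    The shell, not grep, expands these wildcards before the command runs. The following sketch shows the expansion by using echo in a scratch directory. The file names regex-content-10.html and notes.txt are hypothetical and exist only to show which names each wildcard leaves out.

```shell
# Scratch directory with hypothetical file names for the demonstration.
dir=$(mktemp -d)
touch "$dir/regex-content-01.html" "$dir/regex-content-02.html" \
      "$dir/regex-content-10.html" "$dir/notes.txt"

# echo prints the file names that the shell substitutes for each wildcard.
all_html=$(cd "$dir" && echo *.html)
single_char=$(cd "$dir" && echo regex-content-0?.html)

echo "*.html                -> $all_html"
echo "regex-content-0?.html -> $single_char"

rm -r "$dir"
```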

    The following subsections match individual lines within multiple files.

    Matching a string of regular characters across multiple HTML files

    The following example matches all strings of regular characters Duck that occur in all files in the current directory that have file names that end with the extension .html:

    $ grep -Po 'Duck' *.html

    The logic that the regular expression executes is as follows: Match any occurrence of the regular characters Duck in all files in the current directory that have filenames that end with the extension .html.

    The output is:

    regex-content-01.html:Duck
    regex-content-02.html:Duck

    Matching lines of text across multiple HTML files according to metacharacters and regular characters

    The following command matches a string plus all surrounding text on the same line:

    $ grep -Po '.*Duck.*' *.html

    The logic that the regular expression executes is as follows: In all files in the current directory that have file names that end with the extension .html, match lines of text that have zero or more occurrences of any character (.*), then the regular characters Duck, followed by zero or more occurrences of any character (.*).

    The result is the following.

    regex-content-01.html:                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
    regex-content-02.html:                  <li><div id="1">Daffy Duck</div></li>

    Matching occurrences of characters within an HTML tag across multiple HTML files

    The following example finds all characters between the <li> and </li> HTML tags in all files in the current directory that have file names that end with the extension .html:

    $ grep -Po '<li>.*</li>' *.html

    The logic that the regular expression executes is as follows: In all files in the current directory that have file names that end with the extension .html, match lines of text that have an occurrence of the regular characters <li> followed by zero or more occurrences of any character (.*), followed by the regular characters </li>.

    The output is:

    regex-content-01.html:<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
    regex-content-01.html:<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
    regex-content-01.html:<li><div id="3">John Lennon<br>john@beatles.io</div></li>
    regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
    regex-content-01.html:<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
    regex-content-01.html:<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
    regex-content-01.html:<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
    regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
    regex-content-02.html:<li><div id="2">Porky Pig</div></li>
    regex-content-02.html:<li><div id="3">Bugs Bunny</div></li>
    regex-content-02.html:<li><div id="4">Huckleberry Hound</div></li>
    regex-content-02.html:<li><div id="5">Crusader Rabbit</div></li>
    regex-content-02.html:<li><div id="6">Top Cat</div></li>
    regex-content-02.html:<li><div id="7">Rags T. Tiger</div></li>

    Matching occurrences of specific characters within an HTML tag across multiple HTML files

    The following command finds <li> elements containing a particular string, Duck:

    $ grep -Po '<li>.*Duck.*</li>' *.html

    The logic that the regular expression executes is as follows: In all files in the current directory that have file names that end with the extension .html, match lines of text that have an occurrence of the regular characters <li> followed by zero or more occurrences of any character (.*), then the regular characters Duck, followed by zero or more occurrences of any character (.*), which are then followed by the regular characters </li>.

    The output is:

    regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
    regex-content-02.html:<li><div id="1">Daffy Duck</div></li>

    Working across multiple lines of HTML

    One of the shortcomings of the grep command is that it does not allow you to execute a regular expression across multiple line breaks in text. For example, consider the file regex-content-02.html, which has the following snippet of HTML:

    <title>A list of cool animals
    </title>

    The following grep command will not match the <title>...</title> content in the previous snippet. grep evaluates the regular expression against one line at a time, so the line break metacharacters \n in the expression never have a line break to match:

    $ grep -Po '<title.*\n.*</title>' regex-content-02.html

    Other dialects of regular expressions, such as those found in JavaScript, Java, PHP, and C#, can work with line breaks, but grep cannot. In order to do matching across multiple lines of text at the command line, you need to use pcre2grep.
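
    The limitation can be checked from grep's exit status, which is 0 when something matches and 1 when nothing does. The following sketch assumes GNU grep and uses a two-line scratch file in place of regex-content-02.html.

```shell
# Scratch file containing only the two-line <title> element.
dir=$(mktemp -d)
printf '<title>A list of cool animals\n</title>\n' > "$dir/sample.html"

# -q suppresses output so that only the exit status is used.
if grep -Pq '<title.*\n.*</title>' "$dir/sample.html"; then
  across_lines="match"
else
  across_lines="no match"
fi

# Each half of the expression still matches on its own line.
opening=$(grep -Po '<title>.*' "$dir/sample.html")
closing=$(grep -Po '</title>' "$dir/sample.html")

printf 'pattern with \\n: %s\n' "$across_lines"
printf 'opening half:    %s\n' "$opening"
printf 'closing half:    %s\n' "$closing"

rm -r "$dir"
```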

    Installing pcre2grep

    The pcre2grep executable typically needs to be installed by an administrator on a Linux computer. The command does not ship by default.

    Run the following commands to install pcre2grep on a computer running Red Hat Enterprise Linux, Fedora, or CentOS Stream:

    $ sudo dnf update
    $ sudo dnf install pcre2-tools -y

    Run the following commands to install pcre2grep on a computer running Ubuntu or another system based on Debian:

    $ sudo apt update
    $ sudo apt-get install pcre2-utils -y
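
    Before installing anything, it can be worth checking whether pcre2grep is already on the PATH. The following is a minimal sketch of such a check. The variable name pcre2grep_status is made up for the example.

```shell
# command -v succeeds only when the shell can find the executable.
if command -v pcre2grep >/dev/null 2>&1; then
  pcre2grep_status="installed"
  # Print the version banner to confirm which build is in use.
  pcre2grep --version
else
  pcre2grep_status="missing"
  echo "pcre2grep is not on the PATH; install it with one of the commands above."
fi

echo "pcre2grep: $pcre2grep_status"
```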

    Once you have pcre2grep installed, you can run the following examples against the HTML files you installed previously.

    Finding the content between HTML tags that are defined over two lines

    The following example uses pcre2grep to match all content that covers two lines in a case-insensitive manner between <title> and </title> tags, including the tags themselves, within all HTML files in the current directory.

    The example uses the -Mi options with pcre2grep. The M option allows matching over multiple lines. The i option conducts matches in a case-insensitive manner. The example uses the metacharacters \n to indicate a line break:

    $ pcre2grep -Mi '<title.*\n.*</title>' *.html

    The logic that the regular expression executes is as follows: Match the contents in any file in the current directory that has the extension .html. Search the file contents to match the characters <title followed by zero or more occurrences of any character (.*) until the line break metacharacters (\n) occur. Then continue matching occurrences of any character (.*) until the regular characters </title> occur.

    The output is:

    regex-content-01.html: <title>A list of interesting and uninteresting people
    </title>
    regex-content-02.html: <title>A list of cool animals
    </title>

    The example shown above matches only a single line break. The regular expression will not find a match if content including the <title> and </title> tags covers more than two lines, like so:

    <title>
    A list of
    cool animals
    </title>

    The way to address the problem is to use the metacharacters (?s) so the regular expression interprets the dot metacharacter (.) to include line breaks, as shown in the next example.

    Finding the content between unordered list tags in HTML files using (?s)

    The following example prepends the metacharacters (?s) to the regular expression so that the dot metacharacter (.) also matches line breaks. It also uses the lazy quantifier +? so that each match stops at the first closing tag rather than running on to the last one in the file:

    $ pcre2grep -Mo '(?s)<ul.+?ul>' *.html

    The logic that the regular expression executes is as follows: Match the contents in any file in the current directory that has the extension .html. The "any character" metacharacter (.) can include line breaks, as indicated by the metacharacters (?s) at the start of the regular expression. Start by matching an occurrence of the regular characters <ul, followed by one or more occurrences of any character including line breaks, as few as possible (.+?), followed by an occurrence of the regular characters ul>. With the greedy form .+, the two lists in regex-content-01.html would be returned as a single match that runs from the first <ul> to the last </ul>.

    The output is:

    regex-content-01.html:<ul>
                      <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                      <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                      <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                      <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
                </ul>
    regex-content-01.html:<ul>
                      <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                      <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                      <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
                </ul>
    regex-content-02.html:<ul>
                      <li><div id="1">Daffy Duck</div></li>
                      <li><div id="2">Porky Pig</div></li>
                      <li><div id="3">Bugs Bunny</div></li>
                      <li><div id="4">Huckleberry Hound</div></li>
                      <li><div id="5">Crusader Rabbit</div></li>
                      <li><div id="6">Top Cat</div></li>
                      <li><div id="7">Rags T. Tiger</div></li>
                </ul>
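
    The same combination of (?s) and a lazy quantifier can handle the four-line <title> element shown earlier. The following sketch writes that snippet to a scratch file and skips the match when pcre2grep is not installed. The file name sample.html is an assumption made for the demonstration.

```shell
# Scratch file holding the four-line <title> element from earlier.
dir=$(mktemp -d)
printf '<title>\nA list of\ncool animals\n</title>\n' > "$dir/sample.html"

if command -v pcre2grep >/dev/null 2>&1; then
  # (?s) lets the dot match line breaks; .*? stops at the first </title>.
  title=$(pcre2grep -Mo '(?s)<title>.*?</title>' "$dir/sample.html")
  printf '%s\n' "$title"
else
  title=""
  echo "pcre2grep is not installed; skipping the multiline match."
fi

rm -r "$dir"
```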

    Putting it all together

    Filtering content in HTML files is a useful skill for any IT professional working on the web. Whether you're a front-end developer trying to debug a misbehaving web page or a system administrator looking for particular words or phrases in directories full of HTML files, being able to execute regular expressions using grep, or across multiple lines of text using pcre2grep, is valuable.

    The techniques covered in this article are but an introduction. There's a lot more to learn. Still, the basics presented here will provide a solid foundation upon which to move forward in your journey toward mastery of regular expressions.

    Last updated: August 14, 2023
