Prevent Trojan Source attacks with GCC 12

January 12, 2022
David Malcolm
Related topics:
Compilers, C, C#, C++, Developer Tools
Related products:
Red Hat Enterprise Linux

At the start of November 2021, a new kind of software vulnerability was made public: "Trojan Source," in which certain Unicode bidirectional control characters are used to write obfuscated code. These control characters can create text in which the logical order seen by a programming language implementation (such as a compiler or interpreter) differs from the visual order seen by a human reading the code.

Detecting Trojan Source attacks

Red Hat has taken various steps to ensure that our code isn't affected by this kind of problem, and to protect our customers against it. We have added detection for the issue in various places in our workflow and scanned our source code repositories. We have implemented patches to help upstream projects detect code obfuscated in this way and have provided our customers with updates to our tools to detect such issues (see CVE-2021-42574 and CVE-2021-42694). We have also published a script for scanning source repositories for the issue.

I'm part of a team at Red Hat working on GCC, the GNU Compiler Collection. In the spirit of defense in depth, we spent a fair amount of time before the vulnerability went public experimenting with ways GCC could detect such code and warn the user if it reaches the compiler.

Here's one of the example attacks from the Trojan Source researchers, written in C.

#include <stdio.h>
#include <stdbool.h>

int main() {
    bool isAdmin = false;
    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
        printf("You are an admin.\n");
    /* end admins only ‮ { ⁦*/
    return 0;
}

Exactly what the preceding code looks like will vary depending on the tool you use to view it. To a human reader using Firefox, line 6 of the code appears to begin with this comment:

/* begin admins only */

This comment appears to be immediately followed on the same line by a conditional guarding the printf statement:

if (isAdmin) {

Unicode's rules for bidirectional text are decidedly non-trivial, so I wrote a Python 3 script for debugging UTF-8 encoded files. This script mimics how GCC outputs source lines; but rather than just outputting the source lines themselves, it interleaves them with per-character lines showing the Unicode codepoints, the UTF-8 encoding bytes, the name of each character, and, where printable, the characters themselves.

Running it on the preceding example gives the following output for line 6:

   6 |     /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002F            0x2f                                  SOLIDUS /
     |   U+002A            0x2a                                 ASTERISK *
     |   U+202E  0xe2 0x80 0xae                   RIGHT-TO-LEFT OVERRIDE (format control)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+007D            0x7d                      RIGHT CURLY BRACKET }
     |   U+0020            0x20                                    SPACE (separator)
     |   U+2066  0xe2 0x81 0xa6                    LEFT-TO-RIGHT ISOLATE (format control)
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+0066            0x66                     LATIN SMALL LETTER F f
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0028            0x28                         LEFT PARENTHESIS (
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+0064            0x64                     LATIN SMALL LETTER D d
     |   U+006D            0x6d                     LATIN SMALL LETTER M m
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0029            0x29                        RIGHT PARENTHESIS )
     |   U+2069  0xe2 0x81 0xa9                  POP DIRECTIONAL ISOLATE (format control)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+2066  0xe2 0x81 0xa6                    LEFT-TO-RIGHT ISOLATE (format control)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0062            0x62                     LATIN SMALL LETTER B b
     |   U+0065            0x65                     LATIN SMALL LETTER E e
     |   U+0067            0x67                     LATIN SMALL LETTER G g
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0064            0x64                     LATIN SMALL LETTER D d
     |   U+006D            0x6d                     LATIN SMALL LETTER M m
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0020            0x20                                    SPACE (separator)
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+006C            0x6c                     LATIN SMALL LETTER L l
     |   U+0079            0x79                     LATIN SMALL LETTER Y y
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+002F            0x2f                                  SOLIDUS /
     |   U+000A            0x0a                           LINE FEED (LF) (control character)

A careful reading of the above will show that what appeared to be if (isAdmin) { after the comment is actually RLO } LRI if (isAdmin) PDI LRI within the comment, where RLO (RIGHT-TO-LEFT OVERRIDE, U+202E), LRI (LEFT-TO-RIGHT ISOLATE, U+2066), and PDI (POP DIRECTIONAL ISOLATE, U+2069) are Unicode control characters. In other words, the conditional has been surreptitiously commented out.

The issue here is that Unicode's rules for bidirectional text work at the level of paragraphs and lines, whereas C's tokenization rules are affected by boundaries such as those for comments and string literals.
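For readers who want to inspect their own files, a per-character dump like the one shown earlier can be approximated in a few lines of Python 3. This is a minimal sketch and not the script used to produce the output above: the function name describe_line and the column layout are assumptions made for illustration, and it relies only on the standard unicodedata module.

```python
import unicodedata


def describe_line(line: str) -> list:
    """Return one description row per character of `line`."""
    rows = []
    for ch in line:
        # The UTF-8 bytes that encode this character, e.g. "0xe2 0x80 0xae".
        hex_bytes = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
        # Control characters such as LF have no name in unicodedata.
        name = unicodedata.name(ch, "(unnamed)")
        # Format controls (Cf) and other controls (Cc) are not printable,
        # so show only their name.
        shown = "" if unicodedata.category(ch) in ("Cf", "Cc") else ch
        rows.append(f"U+{ord(ch):04X}  {hex_bytes:<14}  {name} {shown}".rstrip())
    return rows


if __name__ == "__main__":
    # The start of line 6 of the example, with the controls written as escapes.
    sample = "/*\u202e } \u2066if (isAdmin)\u2069"
    for row in describe_line(sample):
        print(row)
```

Writing the control characters as \u escapes in the sample keeps the sketch itself free of the obfuscation it is meant to expose.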

My colleague Marek Polacek and I implemented a new warning for GCC 12, -Wbidi-chars, for detecting Trojan Source attacks involving Unicode control characters. Marek implemented the guts of the warning, but when I tried it out on the examples provided by the Trojan Source researchers, I found I had trouble understanding the initial results—precisely because of the obfuscation itself.

So for GCC 12, I've added a new flag to GCC diagnostics, indicating that the diagnostic itself relates to source code encoding. When any such diagnostic is printed, GCC will now escape non-ASCII characters in the source code.

Here's what the preceding example looks like when compiled with GCC 12 (the warning is enabled by default):

$ gcc -c trojan-source/C/commenting-out.c
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |       ~~~~~~~~                                ~~~~~~~~                    ^
      |       |                                       |                           |
      |       |                                       |                           end of bidirectional context
      |       U+202E (RIGHT-TO-LEFT OVERRIDE)         U+2066 (LEFT-TO-RIGHT ISOLATE)
trojan-source/C/commenting-out.c:8:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                        ~~~~~~~~   ~~~~~~~~ ^
      |                        |          |        |
      |                        |          |        end of bidirectional context
      |                        |          U+2066 (LEFT-TO-RIGHT ISOLATE)
      |                        U+202E (RIGHT-TO-LEFT OVERRIDE)

Escaping the non-ASCII characters clarifies exactly which control characters are present in the source file. It also effectively defangs the obfuscation: the visual ordering of the characters will always be the same as the logical ordering in this output.

In the warning, we call a tokenization boundary such as a comment or string literal a bidirectional context. The obfuscation works by exploiting the difference between two views of the same text: the structure the C tokenizer sees in the logical ordering of the characters, and the structure a human reader perceives in the visual ordering produced by the Unicode bidirectional algorithm.

The default is -Wbidi-chars=unpaired, in which the warning complains about unpaired characters within such a bidirectional context. A stronger form of the warning is -Wbidi-chars=any, in which the warning complains about any bidirectional control characters in the source code:

$ gcc -c trojan-source/C/commenting-out.c -Wbidi-chars=any
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:7: warning: found problematic Unicode character "U+202E (RIGHT-TO-LEFT OVERRIDE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |       ^~~~~~~~
trojan-source/C/commenting-out.c:6:10: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |                  ^~~~~~~~
trojan-source/C/commenting-out.c:6:23: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |                                               ^~~~~~~~
trojan-source/C/commenting-out.c:8:24: warning: found problematic Unicode character "U+202E (RIGHT-TO-LEFT OVERRIDE)" [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                        ^~~~~~~~
trojan-source/C/commenting-out.c:8:27: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                                   ^~~~~~~~
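The difference between the two modes can be illustrated with a small Python sketch that checks the text of a single bidirectional context (for example, the body of one comment). This is not GCC's implementation: the function names and the simplified stack handling are assumptions for illustration, and only the nine explicit directional formatting characters are considered.

```python
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE
BIDI_CONTROLS = EMBEDDING_OPENERS | ISOLATE_OPENERS | {PDF, PDI}


def any_bidi(context: str) -> list:
    """Offsets of every bidi control character (the spirit of "any")."""
    return [i for i, ch in enumerate(context) if ch in BIDI_CONTROLS]


def has_unpaired_bidi(context: str) -> bool:
    """True if something is still open when the context ends ("unpaired")."""
    stack = []
    for ch in context:
        if ch in EMBEDDING_OPENERS:
            stack.append("embedding")
        elif ch in ISOLATE_OPENERS:
            stack.append("isolate")
        elif ch == PDF:
            # PDF closes the most recent embedding or override.
            if stack and stack[-1] == "embedding":
                stack.pop()
        elif ch == PDI:
            # PDI closes the most recent isolate and anything opened inside it.
            if "isolate" in stack:
                while stack.pop() != "isolate":
                    pass
    return bool(stack)


if __name__ == "__main__":
    comment = "\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only "
    print(has_unpaired_bidi(comment), any_bidi(comment))
```

A comment whose isolates are all closed before the comment ends passes the "unpaired" check in this sketch but is still reported by the "any" check, which mirrors the relationship between the two modes described above.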

By default, the warning shows the non-ASCII characters in the form <U+xxxx>, but I've also added a new option, -fdiagnostics-escape-format=bytes, which will show the bytes that encoded the characters in question in the form <xx>. Here's what the warning looks like with -fdiagnostics-escape-format=bytes:

$ gcc -c trojan-source/C/commenting-out.c -fdiagnostics-escape-format=bytes
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     /*<e2><80><ae> } <e2><81><a6>if (isAdmin)<e2><81><a9> <e2><81><a6> begin admins only */
      |       ~~~~~~~~~~~~                                        ~~~~~~~~~~~~                    ^
      |       |                                                   |                               |
      |       U+202E (RIGHT-TO-LEFT OVERRIDE)                     U+2066 (LEFT-TO-RIGHT ISOLATE)  end of bidirectional context
trojan-source/C/commenting-out.c:8:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    8 |     /* end admins only <e2><80><ae> { <e2><81><a6>*/
      |                        ~~~~~~~~~~~~   ~~~~~~~~~~~~ ^
      |                        |              |            |
      |                        |              |            end of bidirectional context
      |                        |              U+2066 (LEFT-TO-RIGHT ISOLATE)
      |                        U+202E (RIGHT-TO-LEFT OVERRIDE)
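Both escaped forms are simple to reproduce outside the compiler, which can help when reviewing a suspicious line by hand. The following Python sketch assumes the input has already been decoded as valid UTF-8; the function names escape_unicode and escape_bytes are hypothetical, and GCC's own handling of source bytes may differ in detail.

```python
def escape_unicode(text: str) -> str:
    """Replace each non-ASCII character with <U+XXXX>."""
    return "".join(ch if ord(ch) < 128 else f"<U+{ord(ch):04X}>" for ch in text)


def escape_bytes(text: str) -> str:
    """Replace each byte of a non-ASCII character's UTF-8 encoding with <xx>."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"<{b:02x}>" for b in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    # Line 8 of the example, with the controls written as escapes.
    line = "/* end admins only \u202e { \u2066*/"
    print(escape_unicode(line))
    print(escape_bytes(line))
```

Because every character in the escaped output is ASCII, a terminal or editor has no bidirectional reordering to apply, so what is displayed matches the logical order.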

Let's take a look at how some other examples of this exploit are handled by GCC 12.

early-return.c: Code hidden in a comment

The code in this listing is obfuscated so that return 0; appears to be part of a comment:

#include <stdio.h>

int main() {
    /* Say hello; newline⁧ /*/ return 0 ;
    printf("Hello world.\n");
    return 0;
}

GCC 12 successfully complains about this and shows the logical ordering of the source:

$ gcc -c trojan-source/C/early-return.c
trojan-source/C/early-return.c: In function ‘main’:
trojan-source/C/early-return.c:4:29: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
    4 |     /* Say hello; newline<U+2067> /*/ return 0 ;
      |                          ~~~~~~~~   ^
      |                          |          |
      |                          |          end of bidirectional context
      |                          U+2067 (RIGHT-TO-LEFT ISOLATE)

stretched-string.c: Code hidden in a string literal

This listing includes obfuscated code in which the string literal passed as the second argument of strcmp() appears to be "user" but is actually "userRLO LRI// Check if adminPDI LRI":

#include <stdio.h>
#include <string.h>

int main() {
    char* access_level = "user";
    if (strcmp(access_level, "user‮ ⁦// Check if admin⁩ ⁦")) {
        printf("You are an admin.\n");
    }
    return 0;
}

Again, GCC 12 successfully issues a warning:

$ gcc -c trojan-source/C/stretched-string.c
trojan-source/C/stretched-string.c: In function ‘main’:
trojan-source/C/stretched-string.c:6:53: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     if (strcmp(access_level, "user<U+202E> <U+2066>// Check if admin<U+2069> <U+2066>")) {
      |                                   ~~~~~~~~                                   ~~~~~~~~^
      |                                   |                                          |       |
      |                                   |                                          |       end of bidirectional context
      |                                   U+202E (RIGHT-TO-LEFT OVERRIDE)            U+2066 (LEFT-TO-RIGHT ISOLATE)

What about homoglyphs?

The Trojan Source paper also noted the existence of homoglyphs: characters that look like other characters, so that two different identifiers can appear the same. Indeed, in some cases a computer's font machinery might choose the same "glyph" for two different Unicode characters, making them pixel-for-pixel identical. You could substitute characters with carefully chosen homoglyphs to obfuscate code. Here's an example from the authors of the paper:

#include <stdio.h>

void sayHello() {
    printf("Hello, World!\n");
}

void sayНello() {
    printf("Goodbye, World!\n");
}

int main() {
    sayНello();
    return 0;
}

This code pairs a capital H in sayHello() with a capital Cyrillic letter en (Н) in sayНello(). In many cases, the two characters will look similar or identical.

This isn't a new vulnerability, but I had a go at detecting it for GCC 12. My proof-of-concept patch complains about the preceding code as follows:

$ gcc trojan-source/C/homoglyph-function.c
trojan-source/C/homoglyph-function.c:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
    7 | void say<U+041D>ello() {
      | ^~~~
trojan-source/C/homoglyph-function.c:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
    3 | void sayHello() {
      | ^~~~

There are improvements to be made here; among other issues, the warning doesn't quite underline the correct token—it underlines the void when it should underline the identifier itself.

Unfortunately, it's not clear to me when the compiler should warn for this. Should it complain about any homoglyph identifier pairs seen in source code, or merely those that occur in the same scope? If the former, then this would probably rule out a lot of single-character identifiers because many have homoglyphs that might be in use in a header file. If the latter, does this catch every possible misuse of homoglyphs? I don't yet have a good answer to these questions, so this warning didn't make feature freeze for GCC 12.
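One much narrower heuristic, separate from the proof-of-concept warning shown above, is to flag identifiers that mix letters from more than one script. The Python sketch below is an assumption-laden illustration: the standard library has no script property, so it approximates the script by the first word of each character's Unicode name, and the function names are hypothetical.

```python
import unicodedata


def scripts_in(identifier: str) -> set:
    """Approximate the scripts used, from the first word of each letter's name."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER S" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier: str) -> bool:
    """True if the identifier's letters appear to come from several scripts."""
    return len(scripts_in(identifier)) > 1


if __name__ == "__main__":
    for name in ("sayHello", "say\u041dello"):
        print(ascii(name), sorted(scripts_in(name)), is_mixed_script(name))
```

This catches the sayНello example, but it would miss an identifier written entirely in lookalike characters from a single script, so it does not answer the scoping questions raised above.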

Hunting for Trojan Source

The -Wbidi-chars warning is in trunk for GCC 12, which will likely be released in April of 2022. You can try it out now using the excellent Compiler Explorer website.

As noted earlier, Red Hat has published a script for scanning source repositories for the bidirectional Trojan Source issue, along with the other preventative measures mentioned. See RHSB-2021-007 for more details and for links to package updates.

I tried scanning GCC's own source tree for non-ASCII characters; we only use them in comments, generally when spelling the names of contributors. As a British developer now living in the United States, I only use ASCII in my code. Still, I appreciate that other people will want to use non-ASCII characters and that I likely have an Anglocentric bias.
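A scan of this kind can be sketched in Python as follows. This is not the script Red Hat published: the function names, the default file suffixes, and the choice to report character columns rather than the byte columns GCC prints are all assumptions made for illustration.

```python
import os

# The nine explicit directional formatting characters (U+202A..U+202E, U+2066..U+2069).
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def scan_text(text: str):
    """Yield (line_number, column, codepoint) for each bidi control character."""
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield lineno, col, f"U+{ord(ch):04X}"


def scan_tree(root: str, suffixes=(".c", ".h")):
    """Walk `root` and yield (path, line_number, column, codepoint) hits."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffixes):
                path = os.path.join(dirpath, name)
                # Undecodable bytes are replaced rather than aborting the scan.
                with open(path, encoding="utf-8", errors="replace") as f:
                    for hit in scan_text(f.read()):
                        yield (path, *hit)


if __name__ == "__main__":
    for path, lineno, col, codepoint in scan_tree("."):
        print(f"{path}:{lineno}:{col}: {codepoint}")
```

Changing the membership test to ord(ch) > 127 would turn the same loop into a scan for any non-ASCII character.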

To what extent do you use non-ASCII characters in your source code? Do you use them for identifiers, for string literals, or just for comments? Let us know in the discussion below.

Last updated: September 7, 2022
