At the start of November 2021, a new kind of software vulnerability was made public: "Trojan Source," in which certain Unicode bidirectional control characters are used to write obfuscated code. These control characters can be used to create text in which the logical order seen by a programming language implementation (such as a compiler or interpreter) differs from the visual order seen by a human reading the code.

Detecting Trojan Source attacks

Red Hat has taken various steps to ensure that our code isn't affected by this kind of problem, and to protect our customers against it. We have added detection for the issue in various places in our workflow and scanned our source code repositories. We have implemented patches to help upstream projects detect code obfuscated in this way and have provided our customers with updates to our tools to detect such issues (see CVE-2021-42574 and CVE-2021-42694). We have also published a script for scanning source repositories for the issue.

I'm part of a team at Red Hat working on GCC, the GNU Compiler Collection. In the spirit of defense in depth, we spent a fair amount of time before the vulnerability went public experimenting with ways GCC could detect such code and warn the user if it reaches the compiler.

Here's one of the example attacks from the Trojan Source researchers, written in C.

#include <stdio.h>
#include <stdbool.h>

int main() {
    bool isAdmin = false;
    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
        printf("You are an admin.\n");
    /* end admins only ‮ { ⁦*/
    return 0;
}

Exactly what the preceding code looks like will vary depending on the tool you use to view it. To a human reader using Firefox, line 6 of the code appears to begin with this comment:

/* begin admins only */

This comment appears to be immediately followed on the same line by a conditional guarding the printf statement:

if (isAdmin) {

Unicode's rules for bidirectional text are decidedly non-trivial, so I wrote a Python 3 script for debugging UTF-8 encoded files. This script mimics how GCC outputs source lines; but rather than just outputting the source lines themselves, it interleaves them with per-character lines showing the Unicode codepoints, the UTF-8 encoding bytes, the name of each character, and, where printable, the characters themselves.

Running it on the preceding example gives the following output for line 6:

   6 |     /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002F            0x2f                                  SOLIDUS /
     |   U+002A            0x2a                                 ASTERISK *
     |   U+202E  0xe2 0x80 0xae                   RIGHT-TO-LEFT OVERRIDE (format control)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+007D            0x7d                      RIGHT CURLY BRACKET }
     |   U+0020            0x20                                    SPACE (separator)
     |   U+2066  0xe2 0x81 0xa6                    LEFT-TO-RIGHT ISOLATE (format control)
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+0066            0x66                     LATIN SMALL LETTER F f
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0028            0x28                         LEFT PARENTHESIS (
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+0064            0x64                     LATIN SMALL LETTER D d
     |   U+006D            0x6d                     LATIN SMALL LETTER M m
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0029            0x29                        RIGHT PARENTHESIS )
     |   U+2069  0xe2 0x81 0xa9                  POP DIRECTIONAL ISOLATE (format control)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+2066  0xe2 0x81 0xa6                    LEFT-TO-RIGHT ISOLATE (format control)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0062            0x62                     LATIN SMALL LETTER B b
     |   U+0065            0x65                     LATIN SMALL LETTER E e
     |   U+0067            0x67                     LATIN SMALL LETTER G g
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0064            0x64                     LATIN SMALL LETTER D d
     |   U+006D            0x6d                     LATIN SMALL LETTER M m
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0020            0x20                                    SPACE (separator)
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+006C            0x6c                     LATIN SMALL LETTER L l
     |   U+0079            0x79                     LATIN SMALL LETTER Y y
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+002F            0x2f                                  SOLIDUS /
     |   U+000A            0x0a                           LINE FEED (LF) (control character)

A careful reading of the above will show that what appeared to be if (isAdmin) { after the comment is actually RLO } LRI if (isAdmin) PDI LRI within the comment, where RLO (RIGHT-TO-LEFT OVERRIDE), LRI (LEFT-TO-RIGHT ISOLATE), and PDI (POP DIRECTIONAL ISOLATE) are Unicode control characters. In other words, the conditional has been surreptitiously commented out.
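The debugging script itself isn't reproduced in this article. As a minimal sketch of the same per-character idea, assuming only Python's standard unicodedata module, something like the following would do. The names describe and dump_line are made up for illustration, and the column layout only approximates the output shown above.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return one dump row: code point, UTF-8 bytes, and character name."""
    utf8 = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
    # unicodedata.name() has no name for control codes such as LF,
    # so pass a fallback instead of letting it raise ValueError.
    name = unicodedata.name(ch, "<unnamed>")
    # Echo the character itself only when it is printable and visible.
    shown = f" {ch}" if ch.isprintable() and not ch.isspace() else ""
    return f"U+{ord(ch):04X}  {utf8:<19}  {name}{shown}"

def dump_line(line: str) -> list:
    """Describe every character of a source line in logical order."""
    return [describe(ch) for ch in line]

# Line 6 of the example, with the control characters written as escapes
# so that this listing cannot itself be reordered by a text renderer.
line6 = "    /*\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
for row in dump_line(line6):
    print(row)
```

Writing the control characters as \u escapes keeps the sketch readable in any viewer, which is the same reason the dump is useful in the first place.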

The issue here is that Unicode's rules for bidirectional text work at the level of paragraphs and lines, whereas C's tokenization rules are affected by boundaries such as those for comments and string literals.

My colleague Marek Polacek and I implemented a new warning for GCC 12, -Wbidi-chars, for detecting Trojan Source attacks involving Unicode control characters. Marek implemented the guts of the warning, but when I tried it out on the examples provided by the Trojan Source researchers, I found I had trouble understanding the initial results—precisely because of the obfuscation itself.

So for GCC 12, I've added a new flag to GCC diagnostics, indicating that the diagnostic itself relates to source code encoding. When any such diagnostic is printed, GCC will now escape non-ASCII characters in the source code.

Here's what the preceding example looks like when compiled with GCC 12 (the warning is enabled by default):

$ gcc -c trojan-source/C/commenting-out.c
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |       ~~~~~~~~                                ~~~~~~~~                    ^
      |       |                                       |                           |
      |       |                                       |                           end of bidirectional context
      |       U+202E (RIGHT-TO-LEFT OVERRIDE)         U+2066 (LEFT-TO-RIGHT ISOLATE)
trojan-source/C/commenting-out.c:8:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                        ~~~~~~~~   ~~~~~~~~ ^
      |                        |          |        |
      |                        |          |        end of bidirectional context
      |                        |          U+2066 (LEFT-TO-RIGHT ISOLATE)
      |                        U+202E (RIGHT-TO-LEFT OVERRIDE)

Escaping the non-ASCII characters clarifies exactly which control characters are present in the source file. It also effectively defangs the obfuscation: the visual ordering of the characters will always be the same as the logical ordering in this output.
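The escaping step is easy to illustrate outside the compiler. The following Python sketch is not GCC's implementation, and the name escape_non_ascii is hypothetical; it only shows the transformation, applied to line 8 of the example.

```python
def escape_non_ascii(line: str) -> str:
    """Replace every non-ASCII character with a <U+XXXX> placeholder."""
    return "".join(
        ch if ord(ch) < 128 else f"<U+{ord(ch):04X}>"
        for ch in line
    )

# Line 8 of the example, with the control characters written as escapes.
line8 = "    /* end admins only \u202e { \u2066*/"
print(escape_non_ascii(line8))
# prints:     /* end admins only <U+202E> { <U+2066>*/
```

Because the result contains only ASCII, a renderer has nothing left to reorder.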

In the warning, we call a tokenization boundary such as a comment or string literal a bidirectional context. The obfuscation happens when two views of the same text disagree: the structure the C tokenizer sees in the logical ordering of the characters, and the structure a human reader perceives in the visual ordering of the code, as produced by the Unicode bidirectional algorithm.

The default is -Wbidi-chars=unpaired, in which the warning complains about unpaired characters within such a bidirectional context. A stronger form of the warning is -Wbidi-chars=any, in which the warning complains about any bidirectional control characters in the source code:

$ gcc -c trojan-source/C/commenting-out.c -Wbidi-chars=any
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:7: warning: found problematic Unicode character "U+202E (RIGHT-TO-LEFT OVERRIDE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |       ^~~~~~~~
trojan-source/C/commenting-out.c:6:10: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |                  ^~~~~~~~
trojan-source/C/commenting-out.c:6:23: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |                                               ^~~~~~~~
trojan-source/C/commenting-out.c:8:24: warning: found problematic Unicode character "U+202E (RIGHT-TO-LEFT OVERRIDE)" [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                        ^~~~~~~~
trojan-source/C/commenting-out.c:8:27: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                                   ^~~~~~~~
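The difference between the two modes can be sketched in a few lines of Python. This is a simplified model and not GCC's code: the function names are hypothetical, only the nine embedding, override, and isolate characters are considered, and the "any" function here also counts the closing characters PDF and PDI, which the listing above does not show being reported.

```python
EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATES = {"\u2066", "\u2067", "\u2068"}               # LRI, RLI, FSI
PDF, PDI = "\u202c", "\u2069"                           # the two closers
ALL_BIDI = EMBEDDINGS | ISOLATES | {PDF, PDI}

def any_bidi(context: str) -> list:
    """'any' style: list every bidi control character with its offset."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(context) if ch in ALL_BIDI]

def has_unpaired(context: str) -> bool:
    """'unpaired' style: True if something is still open at the end."""
    stack = []
    for ch in context:
        if ch in EMBEDDINGS or ch in ISOLATES:
            stack.append(ch)
        elif ch == PDF:
            # PDF closes the most recent embedding or override, if any.
            if stack and stack[-1] in EMBEDDINGS:
                stack.pop()
        elif ch == PDI:
            # PDI closes the most recent isolate and anything opened inside it.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth] in ISOLATES:
                    del stack[depth:]
                    break
    return bool(stack)

# Body of the comment on line 6 of the example.
comment6 = "\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only "
print(has_unpaired(comment6))          # True: RLO and the second LRI stay open
print(has_unpaired("\u2067abc\u2069")) # False: RLI is closed by PDI
```

The stack is checked once per context, which mirrors the idea that a comment or string literal should leave no directional state open when it ends.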

By default, the warning shows the non-ASCII characters in the form <U+xxxx>, but I've also added a new option, -fdiagnostics-escape-format=bytes, which will show the bytes that encoded the characters in question in the form <xx>. Here's what the warning looks like with -fdiagnostics-escape-format=bytes:

$ gcc -c trojan-source/C/commenting-out.c -fdiagnostics-escape-format=bytes
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     /*<e2><80><ae> } <e2><81><a6>if (isAdmin)<e2><81><a9> <e2><81><a6> begin admins only */
      |       ~~~~~~~~~~~~                                        ~~~~~~~~~~~~                    ^
      |       |                                                   |                               |
      |       U+202E (RIGHT-TO-LEFT OVERRIDE)                     U+2066 (LEFT-TO-RIGHT ISOLATE)  end of bidirectional context
trojan-source/C/commenting-out.c:8:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    8 |     /* end admins only <e2><80><ae> { <e2><81><a6>*/
      |                        ~~~~~~~~~~~~   ~~~~~~~~~~~~ ^
      |                        |              |            |
      |                        |              |            end of bidirectional context
      |                        |              U+2066 (LEFT-TO-RIGHT ISOLATE)
      |                        U+202E (RIGHT-TO-LEFT OVERRIDE)
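The byte-oriented form can be sketched the same way. Again, this is an illustration in Python and not GCC's code, and escape_bytes is a hypothetical name.

```python
def escape_bytes(line: str) -> str:
    """Show each byte of a non-ASCII character's UTF-8 encoding as <xx>."""
    return "".join(
        chr(b) if b < 0x80 else f"<{b:02x}>"
        for b in line.encode("utf-8")
    )

# Line 8 of the example, with the control characters written as escapes.
line8 = "    /* end admins only \u202e { \u2066*/"
print(escape_bytes(line8))
# prints:     /* end admins only <e2><80><ae> { <e2><81><a6>*/
```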

Let's take a look at how some other examples of this exploit are handled by GCC 12.

early-return.c: Code hidden in a comment

The code in this listing is obfuscated so that return 0; appears to be part of a comment:

#include <stdio.h>

int main() {
    /* Say hello; newline⁧ /*/ return 0 ;
    printf("Hello world.\n");
    return 0;
}

GCC 12 successfully complains about this and shows the logical ordering of the source:

$ gcc -c trojan-source/C/early-return.c
trojan-source/C/early-return.c: In function ‘main’:
trojan-source/C/early-return.c:4:29: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
    4 |     /* Say hello; newline<U+2067> /*/ return 0 ;
      |                          ~~~~~~~~   ^
      |                          |          |
      |                          |          end of bidirectional context
      |                          U+2067 (RIGHT-TO-LEFT ISOLATE)

stretched-string.c: Code hidden in a string literal

This listing includes obfuscated code in which the string literal passed as the second argument of strcmp() appears to be "user" but is actually "userRLO LRI// Check if adminPDI LRI":

#include <stdio.h>
#include <string.h>

int main() {
    char* access_level = "user";
    if (strcmp(access_level, "user‮ ⁦// Check if admin⁩ ⁦")) {
        printf("You are an admin.\n");
    }
    return 0;
}

Again, GCC 12 successfully issues a warning:

$ gcc -c trojan-source/C/stretched-string.c
trojan-source/C/stretched-string.c: In function ‘main’:
trojan-source/C/stretched-string.c:6:53: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     if (strcmp(access_level, "user<U+202E> <U+2066>// Check if admin<U+2069> <U+2066>")) {
      |                                   ~~~~~~~~                                   ~~~~~~~~
      |                                   |                                          |       |
      |                                   |                                          |       end of bidirectional context
      |                                   U+202E (RIGHT-TO-LEFT OVERRIDE)            U+2066 (LEFT-TO-RIGHT ISOLATE)
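To see why the branch is taken, it helps to compare the two strings directly. This small Python sketch stands in for the C comparison; the variable names are illustrative only.

```python
access_level = "user"
# The literal from line 6, with the control characters written as escapes.
literal = "user\u202e \u2066// Check if admin\u2069 \u2066"

# strcmp() returns non-zero when the strings differ, so the C condition
# is true and the "admin" branch runs.
differs = access_level != literal
print(differs)  # True

# Dropping the non-ASCII characters shows the text hidden in the literal.
hidden = "".join(ch for ch in literal if ord(ch) < 128)
print(repr(hidden))
```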

What about homoglyphs?

The Trojan Source paper also noted the existence of homoglyphs: characters that look like other characters, so that two different identifiers can appear to be the same. Indeed, in some cases a computer's font machinery might choose the same "glyph" for two different Unicode characters, making them pixel-for-pixel identical. You could substitute characters with carefully chosen homoglyphs to obfuscate code. Here's an example from the authors of the paper:

#include <stdio.h>

void sayHello() {
    printf("Hello, World!\n");
}

void sayНello() {
    printf("Goodbye, World!\n");
}

int main() {
    sayНello();
    return 0;
}

This code pairs a capital H in sayHello() with a capital Cyrillic letter en (Н) in sayНello(). In many cases, the two characters will look similar or identical.

This isn't a new vulnerability, but I had a go at detecting it for GCC 12. My proof-of-concept patch complains about the preceding code as follows:

$ gcc trojan-source/C/homoglyph-function.c
trojan-source/C/homoglyph-function.c:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
    7 | void say<U+041D>ello() {
      | ^~~~
trojan-source/C/homoglyph-function.c:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
    3 | void sayHello() {
      | ^~~~

There are improvements to be made here; among other issues, the warning doesn't quite underline the correct token—it underlines the void when it should underline the identifier itself.

Unfortunately, it's not clear to me when the compiler should warn for this. Should it complain about any homoglyph identifier pairs seen in source code, or merely those that occur in the same scope? If the former, then this would probably rule out a lot of single-character identifiers because many have homoglyphs that might be in use in a header file. If the latter, does this catch every possible misuse of homoglyphs? I don't yet have a good answer to these questions, so this warning didn't make feature freeze for GCC 12.
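To make the first option concrete, here is a rough Python sketch of flagging any pair of distinct identifiers that look alike, regardless of scope. It is not the proof-of-concept GCC patch. The mapping table is a tiny hand-picked assumption; a real tool would more plausibly load the confusables data that Unicode publishes alongside UTS #39.

```python
# Hypothetical, hand-picked table mapping look-alikes to ASCII.
CONFUSABLE_TO_ASCII = {
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}

def skeleton(identifier: str) -> str:
    """Map each known look-alike to its ASCII twin."""
    return "".join(CONFUSABLE_TO_ASCII.get(ch, ch) for ch in identifier)

def confusable_pairs(identifiers) -> list:
    """Return pairs of distinct identifiers that share a skeleton."""
    groups = {}
    pairs = []
    for ident in identifiers:
        group = groups.setdefault(skeleton(ident), [])
        if ident in group:
            continue  # a repeated use of an identifier already seen
        pairs.extend((other, ident) for other in group)
        group.append(ident)
    return pairs

# Identifiers from the example, in order of appearance.
seen = ["sayHello", "say\u041dello", "main", "say\u041dello"]
print(confusable_pairs(seen))
```

Restricting the check to a single scope would amount to calling confusable_pairs once per scope instead of once per translation unit, which is exactly the policy choice left open above.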

Hunting for Trojan Source

The -Wbidi-chars warning is in trunk for GCC 12, which will likely be released in April of 2022. You can try it out now using the excellent Compiler Explorer website.

As noted earlier, Red Hat has published a script for scanning source repositories for the bidirectional Trojan Source issue, and has taken the other preventative measures mentioned. See RHSB-2021-007 for more details and for links to package updates.
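For readers who want a feel for what such a scan involves, here is a minimal Python sketch. It is not Red Hat's published script: the function names are hypothetical, only the nine embedding, override, and isolate characters are checked, and columns are counted in characters from 1.

```python
import os

# LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"

def scan_text(text: str) -> list:
    """Return (line, column, code point) for each bidi control character."""
    hits = []
    # Split on "\n" only, so the line numbers match what a compiler reports.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

def scan_tree(root: str) -> dict:
    """Walk a directory tree and map each affected file to its hits."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    hits = scan_text(f.read())
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if hits:
                report[path] = hits
    return report

print(scan_text("int x;\n/* \u202e */\n"))
# prints: [(2, 4, 'U+202E')]
```

Calling scan_tree(".") from the top of a checkout would give a per-file report; files that are not valid UTF-8 are silently skipped in this sketch, which a production scanner would probably want to report instead.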

I tried scanning GCC's own source tree for non-ASCII characters; we only use them in comments, generally when spelling the names of contributors. As a British developer now living in the United States, I only use ASCII in my code. Still, I appreciate that other people will want to use non-ASCII characters and that I likely have an Anglocentric bias.

To what extent do you use non-ASCII characters in your source code? Do you use them for identifiers, for string literals, or just for comments? Let us know in the discussion below.

Last updated: September 7, 2022