At the start of November 2021, a new kind of software vulnerability was made public: "Trojan Source," in which certain Unicode bidirectional control characters are used to write obfuscated code. These control characters can be used to create text in which the logical order seen by a programming language implementation (such as a compiler or interpreter) differs from the visual order seen by a human reading the code.
Detecting Trojan Source attacks
Red Hat has taken various steps to ensure that our code isn't affected by this kind of problem, and to protect our customers against it. We have added detection for the issue in various places in our workflow and scanned our source code repositories. We have implemented patches to help upstream projects detect code obfuscated in this way and have provided our customers with updates to our tools to detect such issues—see CVE-2021-42574 and CVE-2021-42694. We have also published a script for scanning source repositories for the issue.
I'm part of a team at Red Hat working on GCC, the GNU Compiler Collection. In the spirit of defense in depth, we spent a fair amount of time before the vulnerability went public experimenting with ways GCC could detect such code and warn the user if it reaches the compiler.
Here's one of the example attacks from the Trojan Source researchers, written in C.
#include <stdio.h>
#include <stdbool.h>

int main() {
    bool isAdmin = false;
    /* } if (isAdmin) begin admins only */
        printf("You are an admin.\n");
    /* end admins only { */
    return 0;
}
Exactly what the preceding code looks like will vary depending on the tool you use to view it. To a human reader using Firefox, line 6 of the code appears to begin with this comment:
/* begin admins only */
This comment appears to be immediately followed on the same line by a conditional guarding the printf statement:
if (isAdmin) {
Unicode's rules for bidirectional text are decidedly non-trivial, so I wrote a Python 3 script for debugging UTF-8 encoded files. This script mimics how GCC outputs source lines, but rather than just outputting the source lines themselves, it interleaves them with per-character lines showing the Unicode code points, the UTF-8 encoding bytes, the name of each character, and, where printable, the characters themselves.
Running it on the preceding example gives the following output for line 6:
    6 |     /* } if (isAdmin) begin admins only */
      | U+0020 0x20 SPACE (separator)
      | U+0020 0x20 SPACE (separator)
      | U+0020 0x20 SPACE (separator)
      | U+0020 0x20 SPACE (separator)
      | U+002F 0x2f SOLIDUS /
      | U+002A 0x2a ASTERISK *
      | U+202E 0xe2 0x80 0xae RIGHT-TO-LEFT OVERRIDE (format control)
      | U+0020 0x20 SPACE (separator)
      | U+007D 0x7d RIGHT CURLY BRACKET }
      | U+0020 0x20 SPACE (separator)
      | U+2066 0xe2 0x81 0xa6 LEFT-TO-RIGHT ISOLATE (format control)
      | U+0069 0x69 LATIN SMALL LETTER I i
      | U+0066 0x66 LATIN SMALL LETTER F f
      | U+0020 0x20 SPACE (separator)
      | U+0028 0x28 LEFT PARENTHESIS (
      | U+0069 0x69 LATIN SMALL LETTER I i
      | U+0073 0x73 LATIN SMALL LETTER S s
      | U+0041 0x41 LATIN CAPITAL LETTER A A
      | U+0064 0x64 LATIN SMALL LETTER D d
      | U+006D 0x6d LATIN SMALL LETTER M m
      | U+0069 0x69 LATIN SMALL LETTER I i
      | U+006E 0x6e LATIN SMALL LETTER N n
      | U+0029 0x29 RIGHT PARENTHESIS )
      | U+2069 0xe2 0x81 0xa9 POP DIRECTIONAL ISOLATE (format control)
      | U+0020 0x20 SPACE (separator)
      | U+2066 0xe2 0x81 0xa6 LEFT-TO-RIGHT ISOLATE (format control)
      | U+0020 0x20 SPACE (separator)
      | U+0062 0x62 LATIN SMALL LETTER B b
      | U+0065 0x65 LATIN SMALL LETTER E e
      | U+0067 0x67 LATIN SMALL LETTER G g
      | U+0069 0x69 LATIN SMALL LETTER I i
      | U+006E 0x6e LATIN SMALL LETTER N n
      | U+0020 0x20 SPACE (separator)
      | U+0061 0x61 LATIN SMALL LETTER A a
      | U+0064 0x64 LATIN SMALL LETTER D d
      | U+006D 0x6d LATIN SMALL LETTER M m
      | U+0069 0x69 LATIN SMALL LETTER I i
      | U+006E 0x6e LATIN SMALL LETTER N n
      | U+0073 0x73 LATIN SMALL LETTER S s
      | U+0020 0x20 SPACE (separator)
      | U+006F 0x6f LATIN SMALL LETTER O o
      | U+006E 0x6e LATIN SMALL LETTER N n
      | U+006C 0x6c LATIN SMALL LETTER L l
      | U+0079 0x79 LATIN SMALL LETTER Y y
      | U+0020 0x20 SPACE (separator)
      | U+002A 0x2a ASTERISK *
      | U+002F 0x2f SOLIDUS /
      | U+000A 0x0a LINE FEED (LF) (control character)
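A per-character dump of this kind needs little more than Python's standard unicodedata module. The following is a minimal sketch of the idea, not the script described above: the function name describe_chars and the exact output layout are assumptions made for illustration.

```python
import unicodedata


def describe_chars(line):
    """Yield one description per character: code point, UTF-8 bytes, name."""
    for ch in line:
        codepoint = f"U+{ord(ch):04X}"
        utf8_bytes = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
        # unicodedata.name() has no name for some characters (for example
        # control characters such as LINE FEED), so supply a fallback.
        name = unicodedata.name(ch, "<unnamed>")
        # Only echo the character itself when it is visible on its own.
        shown = ch if ch.isprintable() and not ch.isspace() else ""
        yield f"{codepoint} {utf8_bytes} {name} {shown}".rstrip()


if __name__ == "__main__":
    # Hypothetical sample: the start of the obfuscated comment.
    sample = "/*\u202e } \u2066if"
    for row in describe_chars(sample):
        print(row)
```

Because the bidirectional control characters are written as \u escapes in the sample string, the sketch itself contains no invisible characters.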
A careful reading of the above will show that what appeared to be the conditional if (isAdmin) { after the comment is actually the sequence } LRI if (isAdmin)PDI LRI within the comment, where LRI and PDI are Unicode control characters. In other words, the conditional has been surreptitiously commented out.
The issue here is that Unicode's rules for bidirectional text work at the level of paragraphs and lines, whereas C's tokenization rules are affected by boundaries such as those for comments and string literals.
My colleague Marek Polacek and I implemented a new warning for GCC 12, -Wbidi-chars, for detecting Trojan Source attacks involving Unicode control characters. Marek implemented the guts of the warning, but when I tried it out on the examples provided by the Trojan Source researchers, I found I had trouble understanding the initial results, precisely because of the obfuscation itself.
So for GCC 12, I've added a new flag to GCC diagnostics, indicating that the diagnostic itself relates to source code encoding. When any such diagnostic is printed, GCC will now escape non-ASCII characters in the source code.
Here's what the preceding example looks like when compiled with GCC 12 (the warning is enabled by default):
$ gcc -c trojan-source/C/commenting-out.c
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |       ~~~~~~~~                                ~~~~~~~~                    ^
      |       |                                       |                           |
      |       |                                       |                           end of bidirectional context
      |       U+202E (RIGHT-TO-LEFT OVERRIDE)         U+2066 (LEFT-TO-RIGHT ISOLATE)
trojan-source/C/commenting-out.c:8:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                        ~~~~~~~~   ~~~~~~~~ ^
      |                        |          |        |
      |                        |          |        end of bidirectional context
      |                        |          U+2066 (LEFT-TO-RIGHT ISOLATE)
      |                        U+202E (RIGHT-TO-LEFT OVERRIDE)
Escaping the non-ASCII characters clarifies exactly which control characters are present in the source file. It also effectively defangs the obfuscation: the visual ordering of the characters will always be the same as the logical ordering in this output.
In the warning, we call a tokenization boundary such as a comment or string literal a bidirectional context. The obfuscation happens when two views of the code disagree: the structure the C tokenizer sees in the logical ordering of the characters, and the structure a human reader perceives in the visual ordering of the code as laid out by the Unicode bidirectional algorithm.
The default is -Wbidi-chars=unpaired, in which the warning complains about unpaired characters within such a bidirectional context. A stronger form of the warning is -Wbidi-chars=any, in which the warning complains about any bidirectional control characters in the source code:
$ gcc -c trojan-source/C/commenting-out.c -Wbidi-chars=any
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:7: warning: found problematic Unicode character "U+202E (RIGHT-TO-LEFT OVERRIDE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |       ^~~~~~~
trojan-source/C/commenting-out.c:6:10: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |                  ^~~~~~~
trojan-source/C/commenting-out.c:6:23: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
      |                                               ^~~~~~~
trojan-source/C/commenting-out.c:8:24: warning: found problematic Unicode character "U+202E (RIGHT-TO-LEFT OVERRIDE)" [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                        ^~~~~~~
trojan-source/C/commenting-out.c:8:27: warning: found problematic Unicode character "U+2066 (LEFT-TO-RIGHT ISOLATE)" [-Wbidi-chars=]
    8 |     /* end admins only <U+202E> { <U+2066>*/
      |                                   ^~~~~~~
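The difference between the two modes can be sketched in a few lines of Python. This is an illustrative simplification, not GCC's implementation: the function names find_any and find_unpaired are hypothetical, a single string stands in for one bidirectional context, the pairing rules are a reduced version of the Unicode ones, and the "any" list here also includes closing characters such as PDI, which may not match GCC's output exactly.

```python
# Bidirectional control characters, written as escapes so that this
# listing contains no invisible characters itself.
EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING: closes an embedding or override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes an isolate
BIDI_CONTROLS = EMBED_OPENERS | ISOLATE_OPENERS | {PDF, PDI}


def find_any(context):
    """Offsets of every bidirectional control character in the context."""
    return [i for i, ch in enumerate(context) if ch in BIDI_CONTROLS]


def find_unpaired(context):
    """Offsets of openers that are still open when the context ends."""
    stack = []  # entries are (offset, kind)
    for i, ch in enumerate(context):
        if ch in EMBED_OPENERS:
            stack.append((i, "embed"))
        elif ch in ISOLATE_OPENERS:
            stack.append((i, "isolate"))
        elif ch == PDF:
            if stack and stack[-1][1] == "embed":
                stack.pop()
        elif ch == PDI:
            # A PDI closes the nearest open isolate, along with any
            # embeddings or overrides opened inside that isolate.
            for j in range(len(stack) - 1, -1, -1):
                if stack[j][1] == "isolate":
                    del stack[j:]
                    break
    return [offset for offset, _ in stack]


if __name__ == "__main__":
    comment = "/*\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
    print("any:", find_any(comment))
    print("unpaired:", find_unpaired(comment))
```

On the comment from the example, the unpaired check reports the RLO and the second LRI, the same two characters GCC underlines in its default mode; the first LRI is closed by the PDI that follows it.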
By default, the warning shows the non-ASCII characters in the form <U+xxxx>, but I've also added a new option, -fdiagnostics-escape-format=bytes, which will show the bytes that encoded the characters in question in the form <xx>. Here's what the warning looks like with -fdiagnostics-escape-format=bytes:
$ gcc -c trojan-source/C/commenting-out.c -fdiagnostics-escape-format=bytes
trojan-source/C/commenting-out.c: In function ‘main’:
trojan-source/C/commenting-out.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     /*<e2><80><ae> } <e2><81><a6>if (isAdmin)<e2><81><a9> <e2><81><a6> begin admins only */
      |       ~~~~~~~~~~~~                                        ~~~~~~~~~~~~                    ^
      |       |                                                   |                               |
      |       U+202E (RIGHT-TO-LEFT OVERRIDE)                     U+2066 (LEFT-TO-RIGHT ISOLATE)  end of bidirectional context
trojan-source/C/commenting-out.c:8:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    8 |     /* end admins only <e2><80><ae> { <e2><81><a6>*/
      |                        ~~~~~~~~~~~~   ~~~~~~~~~~~~ ^
      |                        |              |            |
      |                        |              |            end of bidirectional context
      |                        |              U+2066 (LEFT-TO-RIGHT ISOLATE)
      |                        U+202E (RIGHT-TO-LEFT OVERRIDE)
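The two escape styles are easy to reproduce outside the compiler. Here is a small Python sketch of the transformation, assuming valid UTF-8 input that has already been decoded to a string; the function name escape_non_ascii and its fmt parameter are hypothetical and only mirror the two forms shown above.

```python
def escape_non_ascii(text, fmt="unicode"):
    """Replace each non-ASCII character with a visible ASCII escape.

    fmt="unicode" gives <U+XXXX>; fmt="bytes" gives one <xx> per UTF-8 byte.
    """
    if fmt not in ("unicode", "bytes"):
        raise ValueError(f"unknown escape format: {fmt!r}")
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII passes through unchanged
        elif fmt == "unicode":
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append("".join(f"<{b:02x}>" for b in ch.encode("utf-8")))
    return "".join(out)


if __name__ == "__main__":
    line = "/*\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
    print(escape_non_ascii(line))
    print(escape_non_ascii(line, fmt="bytes"))
```

Since every character of the result is ASCII, no bidirectional reordering can apply to it, which is why the escaped form always displays in logical order.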
Let's take a look at how some other examples of this exploit are handled by GCC 12.
early-return.c: Code hidden in a comment
The code in this listing is obfuscated so that return 0; appears to be part of a comment:
#include <stdio.h>

int main() {
    /* Say hello; newline /*/ return 0 ;
    printf("Hello world.\n");
    return 0;
}
GCC 12 successfully complains about this and shows the logical ordering of the source:
$ gcc -c trojan-source/C/early-return.c
trojan-source/C/early-return.c: In function ‘main’:
trojan-source/C/early-return.c:4:29: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
    4 |     /* Say hello; newline<U+2067> /*/ return 0 ;
      |                          ~~~~~~~~   ^
      |                          |          |
      |                          |          end of bidirectional context
      |                          U+2067 (RIGHT-TO-LEFT ISOLATE)
stretched-string.c: Code hidden in a string literal
This listing includes obfuscated code in which the string literal passed as the second argument of strcmp() appears to be "user" but is actually "userRLO LRI// Check if adminPDI LRI":
#include <stdio.h>
#include <string.h>

int main() {
    char* access_level = "user";
    if (strcmp(access_level, "user // Check if admin ")) {
        printf("You are an admin.\n");
    }
    return 0;
}
Again, GCC 12 successfully issues a warning:
$ gcc -c trojan-source/C/stretched-string.c
trojan-source/C/stretched-string.c: In function ‘main’:
trojan-source/C/stretched-string.c:6:53: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
    6 |     if (strcmp(access_level, "user<U+202E> <U+2066>// Check if admin<U+2069> <U+2066>")) {
      |                                   ~~~~~~~~                                   ~~~~~~~~
      |                                   |                                          |      |
      |                                   |                                          |      end of bidirectional context
      |                                   U+202E (RIGHT-TO-LEFT OVERRIDE)            U+2066 (LEFT-TO-RIGHT ISOLATE)
What about homoglyphs?
The Trojan Source paper also noted the existence of homoglyphs: characters that look like other characters, which can be used to write identifiers that appear identical but are not. Indeed, in some cases a computer's font machinery might choose the same "glyph" for two different Unicode characters, making them pixel-for-pixel identical. You could substitute characters with carefully chosen homoglyphs to obfuscate code. Here's an example from the authors of the paper:
#include <stdio.h>

void sayHello() {
    printf("Hello, World!\n");
}

void sayНello() {
    printf("Goodbye, World!\n");
}

int main() {
    sayНello();
    return 0;
}
This code pairs a Latin capital H in sayHello() with a capital Cyrillic letter en (Н) in sayНello(). In many cases, the two characters will look similar or identical.
This isn't a new vulnerability, but I had a go at detecting it for GCC 12. My proof-of-concept patch complains about the preceding code as follows:
$ gcc trojan-source/C/homoglyph-function.c
trojan-source/C/homoglyph-function.c:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
    7 | void say<U+041D>ello() {
      | ^~~~
trojan-source/C/homoglyph-function.c:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
    3 | void sayHello() {
      | ^~~~
There are improvements to be made here; among other issues, the warning doesn't quite underline the correct token: it underlines the void when it should underline the identifier itself.
Unfortunately, it's not clear to me when the compiler should warn for this. Should it complain about any homoglyph identifier pairs seen in source code, or merely those that occur in the same scope? If the former, then this would probably rule out a lot of single-character identifiers because many have homoglyphs that might be in use in a header file. If the latter, does this catch every possible misuse of homoglyphs? I don't yet have a good answer to these questions, so this warning didn't make feature freeze for GCC 12.
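For comparison, a much cruder check is possible outside the compiler: flag any single identifier that mixes scripts. The Python sketch below is a different heuristic from the proof-of-concept patch described above, which compares pairs of identifiers. It approximates a letter's script by the first word of its Unicode character name, which is an assumption that holds for Latin and Cyrillic letters but is not the real Unicode Script property, and it would miss an identifier written entirely in one look-alike script. The function names are hypothetical.

```python
import unicodedata


def scripts_used(identifier):
    """Approximate the set of scripts used by the letters of an identifier."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER S" -> "LATIN",
            #      "CYRILLIC CAPITAL LETTER EN" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier):
    """True if the identifier's letters come from more than one script."""
    return len(scripts_used(identifier)) > 1


if __name__ == "__main__":
    # The second identifier uses U+041D CYRILLIC CAPITAL LETTER EN.
    for name in ("sayHello", "say\u041dello"):
        print(ascii(name), sorted(scripts_used(name)), is_mixed_script(name))
```

A check like this sidesteps the scoping question, because it never compares two identifiers, but it also says nothing about whether a confusable counterpart actually exists in the program.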
Hunting for Trojan Source
The -Wbidi-chars warning is in trunk for GCC 12, which will likely be released in April of 2022. You can try it out now using the excellent Compiler Explorer website.
As noted earlier, Red Hat has published a script for scanning source repositories for the bidirectional Trojan Source issue, alongside the other preventative measures mentioned above. See RHSB-2021-007 for more details and for links to package updates.
I tried scanning GCC's own source tree for non-ASCII characters; we only use them in comments, generally when spelling the names of contributors. As a British developer now living in the United States, I only use ASCII in my code. Still, I appreciate that other people will want to use non-ASCII characters and that I likely have an Anglocentric bias.
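A scan like this takes only a few lines of Python. The sketch below is an illustration, not the Red Hat script mentioned above and not the command used on GCC's tree: the function names, the choice of .c and .h suffixes, and the path:line:column output format are all assumptions.

```python
from pathlib import Path


def find_non_ascii(data):
    """Return (line, column, code point) for each non-ASCII character.

    data is the raw file contents; invalid UTF-8 is decoded with
    replacement characters, which are themselves reported as U+FFFD.
    """
    hits = []
    text = data.decode("utf-8", errors="replace")
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


def scan_tree(root, suffixes=(".c", ".h")):
    """Print every non-ASCII character found in matching files under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            for line_no, col, codepoint in find_non_ascii(path.read_bytes()):
                print(f"{path}:{line_no}:{col}: {codepoint}")


if __name__ == "__main__":
    scan_tree(".")
```

Columns here count characters rather than bytes or display cells, so they will not always agree with the columns a compiler reports.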
To what extent do you use non-ASCII characters in your source code? Do you use them for identifiers, for string literals, or just for comments? Let us know in the discussion below.
Last updated: September 7, 2022