GNU C library

gnu logoWhite box testing?

Traditional white box testing verifies the internal implementation details of the software under test. As of today the GNU C Library (glibc) has very little if any white box testing. The general policy has been that we implement standards conforming interfaces and that as such we need to test those interfaces. This is a good start, but we need to test more if we are going to cover all cases and configurations, and this includes more detailed failure path testing.

The following covers the use of System Tap (systemtap) to test a particular failure path in glibc. The user in this blog is not completely fictitious since the work was prompted by real users. The discussed scenario is real and happened  on developer boxes during validation and coverage testing.

The Problem

A user has reported that they are seeing intermittent crashes in their applications under high memory load. The crashes all appear to be in glibc. After some back and forth with the user we are able to get a core file of the crash. With the application, core file, and debugging symbols we have an excellent start to solving the problem:

[carlos@koi bug]$ gdb -c core.16292 ./user-application
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/user-application...done.
[New LWP 16292]
Core was generated by `/foo/user-application'.
Program terminated with signal 11, Segmentation fault.
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010,
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175

1175                  newmem->next = transmem_list;
(gdb) bt
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010,
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175
#1  0x00007f380e6391ab in __dcigettext (domainname=0x1c29250 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#2  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#3  0x0000000000400aae in main (argc=2, argv=0x7fffea28b0b8) at nls-test.c:115
(gdb) p newmem
$1 = (transmem_block_t *) 0x0
(gdb)

Looking up the source you see:

1167                   else
1168                     {
1169                       malloc_count = 1;
1170                       freemem_size = INITIAL_BLOCK_SIZE;
1171                       newmem = (transmem_block_t *) malloc (freemem_size);
1172 # ifdef _LIBC
1173                       /* Add the block to the list of blocks we have to free
1174                          at some point.  */
1175                       newmem->next = transmem_list;
1176                       transmem_list = newmem;
1177 # endif
1178                     }
1179                   if (__builtin_expect (newmem == NULL, 0))
1180                     {
1181                       freemem = NULL;
1182                       freemem_size = 0;
1183                       __libc_lock_unlock (lock);
1184                       return (char *) -1;
1185                     }

In a high memory load situation malloc might fail and newmem->next (line 1175) will segfault if newmem is NULL.

The fix is to add a check for NULL and allow the following check to return -1.

It looks as if the the code had expected malloc to fail, but code had been introduced between the malloc and the use that rendered the check moot.

We apply the following fix:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..18b49b3 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -1170,10 +1170,13 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
                      freemem_size = INITIAL_BLOCK_SIZE;
                      newmem = (transmem_block_t *) malloc (freemem_size);
 # ifdef _LIBC
-                     /* Add the block to the list of blocks we have to free
-                        at some point.  */
-                     newmem->next = transmem_list;
-                     transmem_list = newmem;
+                     if (newmem != NULL)
+                       {
+                         /* Add the block to the list of blocks we have to free
+                            at some point.  */
+                         newmem->next = transmem_list;
+                         transmem_list = newmem;
+                       }
 # endif
                    }
                  if (__builtin_expect (newmem == NULL, 0))

We build a new glibc and consider the bug fixed, but there is a nagging worry that this isn't the end of the failures. Are there other instances where similar code might fail? The interface we fixed is now going to return -1 during high load, do the callers work correctly also?

We place the new glibc under high memory load, and follow a series of rather rigorous and lengthy tests, including testing with the customer on their site, and the system fails again.

The core file reveals the following:

[carlos@koi bug]$ gdb -c core.24446 ./user-application
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/foo/user-application...done.
[New LWP 24446]
Core was generated by `/foo/user-application'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
63      while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
#1  0x00007fab870e2a9c in _nl_find_msg (domain_file=domain_file@entry=0x22732e0, domainbinding=domainbinding@entry=0x2272010,
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff780e8c78)
    at dcigettext.c:948
#2  0x00007fab870e31ab in __dcigettext (domainname=0x2273260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#3  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#4  0x0000000000400aae in main (argc=2, argv=0x7fff780e8e38) at nls-test.c:115
(gdb)

Looking at the code in question you see the following:

            /* Get the header entry.  This is a recursion, but it doesn't
               reallocate domain->conversions because we pass convert = 0.  */
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);

            if (nullentry != NULL)
              {
                const char *charsetstr;

                charsetstr = strstr (nullentry, "charset=");

When _nl_find_msg returns an error of -1, that error is incorrectly interpreted as a valid string and passed to strstr which crashes. The code in question can't handle _nl_find_msg returning a value of -1, and this is disconcerting since one expects callers, particularly recursive ones, to handle error conditions correctly.

We fix this problem with the following patch:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..0daa508 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -941,6 +941,11 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
 
+           /* Resource problems are fatal.  If we continue onwards we will
+              only attempt to calloc a new conv_tab and fail later.  */
+           if (__builtin_expect (nullentry == (char *) -1, 0))
+             return (char *) -1;
+
            if (nullentry != NULL)
              {
                const char *charsetstr;

However, now we must audit all callers to determine if they handle the returned error correctly. This is harder than expected as the conditionals and code paths are quite complex. Is there is any way to simulate a failure at this particular point?

Can we find all the callers of _nl_find_msg and force the result of the call to be -1 and observe how the system fails, fixing any further failures?

Yes, there is is a way to inject failures quickly, and we can do so with systemtap.

Validation through failure injection

You can inject failures in several ways, including manually under a debugger like gdb. The way which we are going to show here is using systemtap. We demonstrate systemtap because it's easy and because eventually with the dyninst backend you should be able to do this as a fully unprivileged user (non-dyninst systemtap has some limited-privilege capabilities too, but require some sysadmin setup.). We also chose systemtap because the performance impact on the running application is minimized, while a debugger could cause timeouts and not allow the application to run as it was intended.

We start with the original unpatched build to get comfortable with the process.

To simulate the original failure we need to have malloc return NULL at line 1171.

We use systemtap to see if line 1171 has debug information and which variables are accessible at that point:

stap -L 'process("/home/carlos/build/glibc/libc.so").
    statement("_nl_find_msg@*dcigettext.c:*")' | grep 117
process("/home/carlos/build/glibc/libc.so").
    statement("_nl_find_msg@/home/carlos/src/glibc/intl/dcigettext.c:1170")
     $non_reversible:size_t $freemem:unsigned char* $freemem_size:size_t
     $inbuf:unsigned char const* $outbuf:unsigned char*
     $malloc_count:int $lock:__libc_lock_t $convd:struct converted_domain*
     $domain_file:struct loaded_l10nfile* $domainbinding:struct binding*
     $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32
     $resultlen:size_t
process("/home/carlos/build/glibc/libc.so").
    statement("_nl_find_msg@/home/carlos/src/glibc/intl/dcigettext.c:1176")
     $newmem:transmem_block_t* $non_reversible:size_t $freemem:unsigned char*
     $freemem_size:size_t $inbuf:unsigned char const* $outbuf:unsigned char*
     $malloc_count:int $lock:__libc_lock_t $convd:struct converted_domain*
     $domain_file:struct loaded_l10nfile* $domainbinding:struct binding*
     $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32
     $resultlen:size_t

We have two lines we could put probe statements on and we decide to use line 1176.

We create the following systemtap script:

probe process("/home/carlos/build/glibc/libc.so").statement("_nl_find_msg@*dcigettext.c:1176")
{
        printf("_nl_find_msg: newmem is %d.n", $newmem);
        printf("_nl_find_msg: Forcing newmem to zero to simulate malloc always failing at this point.n");
        $newmem = 0;
}

Next we need an application to test it with, and we use the following somewhat non-trivial program and script to set everything up:

#include <libintl.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <error.h>
#include <errno.h>

const struct
{
  const char *msgid;
  const char *msgstr;
} msgs[] =
{
#define INPUT(Str) { Str,
#define OUTPUT(Str) Str },
#include TESTSTRS_H
};

const char *catname[] =
{
  [LC_MESSAGES] = "LC_MESSAGES",
  [LC_TIME] = "LC_TIME",
  [LC_NUMERIC] = "LC_NUMERIC"
};

#define check_setlocale(cat, name) do {                                       
    if (setlocale (cat, name) == NULL)                                        
      {                                                                       
        printf ("%s:%u: setlocale (%s, "%s"): %mn",                        
                __FILE__, __LINE__, #cat, name);                              
        result = 1;                                                           
      }                                                                       
  } while (0)

static int
positive_gettext_test (void)
{
  size_t cnt;
  int result = 0;

  for (cnt = 0; cnt < sizeof (msgs) / sizeof (msgs[0]); ++cnt)
    {
      const char *found = gettext (msgs[cnt].msgid);
      printf ("found "%s"n", found);

      if (found == NULL
          || (msgs[cnt].msgstr[0] != ''
              && strcmp (found, msgs[cnt].msgstr) != 0))
        {
          /* Oops, shouldn't happen.  */
          printf ("
  gettext ("%s") failed, returned "%s", expected "%s"n",
                  msgs[cnt].msgid, found, msgs[cnt].msgstr);
          result = 1;
        }
    }

  return result;
}

int
main (int argc, char *argv[])
{
  int result = 0;
  char * dom;

  /* This is the place where the .mo files are placed.  */
  if (argc > 1)
    {
      bindtextdomain ("existing-domain", argv[1]);
      bindtextdomain ("existing-time-domain", argv[1]);
      bindtextdomain ("non-existing-domain", argv[1]);
    }
  else
    {
      printf ("No arg!n");
      exit (1);
    }
  /* The locale the catalog is created for is "existing-category".  Now
     set the various variables in question to this value and run the
     test.  */
  setenv ("LANGUAGE", "existing-locale", 1);
  setenv ("LC_ALL", "non-existing-locale", 1);
  setenv ("LC_MESSAGES", "non-existing-locale", 1);
  setenv ("LC_CTYPE", "non-existing-locale", 1);
  setenv ("LANG", "non-existing-locale", 1);
  check_setlocale (LC_CTYPE, "en_GB.UTF-8");
  check_setlocale (LC_MESSAGES, "en_GB.UTF-8");
  unsetenv ("OUTPUT_CHARSET");

  /* This is the name of the existing domain with a catalog for the
     LC_MESSAGES category.  */
  textdomain ("existing-domain");
  puts ("test `gettext' with LANGUAGE set");
  if (positive_gettext_test () != 0)
    {
      puts ("FAILED");
      result = 1;
    }

  return result;
}

The program loads the glibc messages, and translates from the default to en_GB.UTF-8, checking to make sure the translation matches what's expected.

The script expects a glibc build and source directory. The setup script looks like this:

#!/bin/bash
set -e

# common_objpfx is a glibc build directory.
common_objpfx=/home/carlos/build/glibc/
test_program_prefix=/home/carlos/support/bug/
# objpfx is the path to the current directory you're building in
objpfx=/home/carlos/support/bug/
# srcpfx is a glibc source directory.
srcpfx=/home/carlos/src/glibc/

LC_ALL=C gawk -f ${srcpfx}intl/po2test.awk ${srcpfx}po/en_GB.po > ${objpfx}msgs.h

gcc -Wl,-rpath=${common_objpfx} -Wl,--dynamic-linker=${common_objpfx}elf/ld.so
-g3 -O0 -DTESTSTRS_H="msgs.h" -o nls-test nls-test.c

# Create the locale directories.
rm -rf ${objpfx}localedir
rm -rf ${objpfx}domaindir

mkdir -p ${objpfx}localedir/existing-locale/LC_MESSAGES
for f in ADDRESS COLLATE CTYPE IDENTIFICATION MEASUREMENT MONETARY NAME NUMERIC PAPER TELEPHONE TIME; do
  cp -f ${common_objpfx}localedata/en_GB.UTF-8/LC_$f 
        ${objpfx}localedir/existing-locale
done
cp -f ${common_objpfx}localedata/en_GB.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES 
      ${objpfx}localedir/existing-locale/LC_MESSAGES

# Create the domain directories.
mkdir -p ${objpfx}domaindir/existing-locale/LC_MESSAGES
mkdir -p ${objpfx}domaindir/existing-locale/LC_TIME
# Populate them.
msgfmt -o ${objpfx}domaindir/existing-locale/LC_MESSAGES/existing-domain.mo 
       -f ${srcpfx}po/en_GB.po
msgfmt -o ${objpfx}domaindir/existing-locale/LC_TIME/existing-time-domain.mo 
       -f ${srcpfx}po/en_GB.po

echo "GCONV_PATH=${common_objpfx}iconvdata 
LOCPATH=${objpfx}localedir:${common_objpfx}localedata 
${test_program_prefix}nls-test ${objpfx}domaindir > nls-test.out"

echo sudo stap -g nls-orig.stp --sysroot=/ 
--sysenv=GCONV_PATH=${common_objpfx}iconvdata 
--sysenv=LOCPATH=${objpfx}localedir:${common_objpfx}localedata 
-c "${test_program_prefix}nls-test ${objpfx}domaindir"

And we run it like this:

sudo stap -g nls-step0.stp --sysroot=/
 --sysenv=GCONV_PATH=/home/carlos/build/glibc/iconvdata
 --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc/localedata
 -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir"
 > out.txt
[sudo] password for carlos:
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

To our excitement it fails. This is good since it means we might have reproduced the original problem the customer was seeing under heavy memory load.

The output text files shows we failed almost immediately with no additional output:

cat out.txt
_nl_find_msg: newmem is 8312192.
_nl_find_msg: Forcing newmem to zero to simulate malloc always failing at this point.

We look at the generated core file and we see:

[carlos@koi bug]$ gdb -c core.12100 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bugs/nls-test...done.
[New LWP 12100]
Core was generated by `/home/carlos/support/bugs/nls-test /home/carlos/support/bugs/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  _nl_find_msg (domain_file=domain_file@entry=0x7ea2e0, domainbinding=domainbinding@entry=0x7e9010,
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff4dceb098)
    at dcigettext.c:1175
1175                  newmem->next = transmem_list;
(gdb) bt
#0  _nl_find_msg (domain_file=domain_file@entry=0x7ea2e0, domainbinding=domainbinding@entry=0x7e9010,
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff4dceb098)
    at dcigettext.c:1175
#1  0x00007f633a1901ab in __dcigettext (domainname=0x7ea260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#2  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#3  0x0000000000400aae in main (argc=2, argv=0x7fff4dceb258) at nls-test.c:115
(gdb)

Success! This is exactly the same failure as the original failure provided by the user.

What next? We know that fixing this bug causes _nl_find_msg to return -1 which is the cause of the next bug. We need to examine all the callers to see if they take into account a return of -1.

Injecting failures at all call sites

The next step is to find all the call sites of _nl_find_msg, trigger them, and inject failures at the call site, simulating _nl_find_msg returning -1.

grep -rn '_nl_find_msg' *
...
intl/dcigettext.c:630:      retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen);
intl/dcigettext.c:638:          retval = _nl_find_msg (domain->successor[cnt], binding,
intl/dcigettext.c:740:_nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
intl/dcigettext.c:942:          _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
intl/loadmsgcat.c:784:       initialization via _nl_find_msg.  We have initialized
intl/loadmsgcat.c:1259:  nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
intl/gettextP.h:184:char *_nl_find_msg PARAMS ((struct loaded_l10nfile *domain_file,

There are only four locations of interest (the rest are headers, comments or the function istself):

intl/dcigettext.c:630:      retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen);
intl/dcigettext.c:638:          retval = _nl_find_msg (domain->successor[cnt], binding,
intl/dcigettext.c:942:          _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
intl/loadmsgcat.c:1259:  nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);

We use the following systemtap command to list a wider set of line numbers we can use for statement events:

stap -L 'process("/home/carlos/build/glibc/libc.so").statement("*@*dcigettext.c:*")'

We can trigger on line 630, and 632:

process("/home/carlos/build/glibc/libc.so").
    statement("__dcigettext@/home/carlos/src/glibc/intl/dcigettext.c:630")
     $domainname:char const* $msgid1:char const* $msgid2:char const* $plural:int
     $n:long unsigned int $category:int $domain:struct loaded_l10nfile*
     $categoryvalue:char const* $dirname:char* $xdomainname:char* $saved_errno:int
     $search:struct known_translation_t $foundp:struct known_translation_t**
     $localename:char const* $domainname_len:size_t $tree_lock:__libc_rwlock_t
process("/home/carlos/build/glibc/libc.so").
    statement("__dcigettext@/home/carlos/src/glibc/intl/dcigettext.c:632")
     $domainname:char const* $msgid1:char const* $msgid2:char const* $plural:int
     $n:long unsigned int $category:int $domain:struct loaded_l10nfile*
     $categoryvalue:char const* $dirname:char* $xdomainname:char* $retval:char*
     $saved_errno:int $search:struct known_translation_t $foundp:struct known_translation_t**
     $localename:char const* $domainname_len:size_t
     $tree_lock:__libc_rwlock_t

We choose line 632 and set retval to -1 to simulate _nl_find_msg failing in __dcigettext.

We use the following script:

probe process("/home/carlos/build/glibc/libc.so").statement("__dcigettext@*dcigettext.c:632")
{
    printf("_nl_find_msg: retval is %d.n", $retval);
    printf("_nl_find_msg: Forcing retval to -1 to simulate _nl_find_msg failing.n");
    $retval = -1;
}

Run it like this:

sudo stap -g nls-step1a.stp --sysroot=/
 --sysenv=GCONV_PATH=/home/carlos/build/glibc/iconvdata
 --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc/localedata
 -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir"
 > out.txt
Warning: child process exited with status 1
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

Looking at out.txt we see that there have been no crashes or segfaults but the translations are not happening and the default locale messages are being returned.

A quick look at the source shows that this is expected:

 630           retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen);
 631
 632           if (retval == NULL)
 633             {
 634               int cnt;
 635
 636               for (cnt = 0; domain->successor[cnt] != NULL; ++cnt)
 637                 {
 638                   retval = _nl_find_msg (domain->successor[cnt], binding,
 639                                          msgid1, 1, &retlen);
 640
 641                   if (retval != NULL)
 642                     {
 643                       domain = domain->successor[cnt];
 644                       break;
 645                     }
 646                 }
 647             }
 648
 649           /* Returning -1 means that some resource problem exists
 650              (likely memory) and that the strings could not be
 651              converted.  Return the original strings.  */
 652           if (__builtin_expect (retval == (char *) -1, 0))
 653             goto no_translation;

Injecting a failure for the first _nl_find_msg call here results in a goto no_translation which is a non-fatal problem. The translation is not performed if there is an error, and the error itself is propagated to the caller.

In fact we can see our next _nl_find_msg call site at line 638.

Unfortunately triggering this next call to _nl_find_msg requires a considerably more complex set of locales with succesors. We recognize that triggering all cases might be infeasible and this call to _nl_find_msg was sufficiently difficult that we gave up. This process is a heuristic approach at covering all of the failure cases and sometimes it costs too much to get 100% coverage. For the sake of long-term maintenance we fix the code by having it eagerly skip translating if _nl_find_msg returns -1 e.g.:

@@ -638,6 +638,11 @@ DCIGETTEXT (domainname, msgid1, msgid2, plural, n, category)
           retval = _nl_find_msg (domain->successor[cnt], binding,
                      msgid1, 1, &retlen);
 
+          /* Resource problems are not fatal, instead we return no
+             translation.  */
+          if (__builtin_expect (retval == (char *) -1, 0))
+            goto no_translation;
+
           if (retval != NULL)
             {
               domain = domain->successor[cnt];

This is the same as the code on line 652 but this is better contained and less likely to be moved around leaving a call site of _nl_find_msg without a check for -1 errors.

Moving on to the third call site at line 942 we check for a line to use as a statement:

stap -L 'process("/home/carlos/build/glibc/libc.so").
    statement("*@*dcigettext.c:*")' | grep 944
process("/home/carlos/build/glibc/libc.so").
    statement("_nl_find_msg@/home/carlos/src/glibc/intl/dcigettext.c:944")
     $nullentry:char* $nullentrylen:size_t $encoding:char const* $convd:struct converted_domain*
     $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const*
     $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t

We create another failure injection:

probe process("/home/carlos/build/glibc/libc.so").statement("_nl_find_msg@*dcigettext.c:944")
{
        printf("_nl_find_msg: nullentry is %d.n", $nullentry);
        printf("_nl_find_msg: Forcing nullentry to -1 to simulate _nl_find_msg failing.n");
        $nullentry = -1;
}

This failure represents the second failure seen by the user.

Run it like this:

sudo stap -g nls-step2a.stp --sysroot=/
 --sysenv=GCONV_PATH=/home/carlos/build/glibc/iconvdata
 --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc/localedata
 -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir"
 > out.txt
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

Load the core file:

[carlos@koi bug]$ gdb -c core.18236 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 18236]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7f42a2ce5907 "charset=") at ../string/strstr.c:63
63      while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7f42a2ce5907 "charset=") at ../string/strstr.c:63
#1  0x00007f42a2bada93 in _nl_find_msg (domain_file=domain_file@entry=0x65f2e0, domainbinding=domainbinding@entry=0x65e010,
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff8b9c8388)
    at dcigettext.c:948
#2  0x00007f42a2bae1ab in __dcigettext (domainname=0x65f260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#3  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#4  0x0000000000400aae in main (argc=2, argv=0x7fff8b9c8548) at nls-test.c:115
(gdb)

Success! This looks exactly like the second failure reported by the user.

We move on to the last call site line 1259 in intl/loadmsgcat.c.

We choose to set a statement on line 1260, the only available line with debug information nearest to 1259.

We use the following script:

probe process("/home/carlos/build/glibc/libc.so").statement("_nl_load_domain@*loadmsgcat.c:1260")
{
    printf ("_nl_load_domain: nullentry is '%x'n", $nullentry);
    printf ("_nl_load_domain: Forcing nullentry to -1 to simulate _nl_find_msg failure.n");
    $nullentry = -1;
}

Run it like this:

sudo stap -g nls-step3a.stp --sysroot=/
 --sysenv=GCONV_PATH=/home/carlos/build/glibc/iconvdata
 --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc/localedata
 -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir"
 > out.txt
[sudo] password for carlos:
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

This is an unexpected and new failure. Something the user hasn't reported yet!

Loading the core file in gdb:

[carlos@koi bug]$ gdb -c core.19031 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 19031]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7fae5c5e194e "plural=") at ../string/strstr.c:63
63      while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7fae5c5e194e "plural=") at ../string/strstr.c:63
#1  0x00007fae5c4ada96 in __gettext_extract_plural (nullentry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    pluralp=0x79f4b8, npluralsp=0x79f4c0) at plural-exp.c:109
#2  0x00007fae5c4ab009 in _nl_load_domain (domain_file=domain_file@entry=0x79f2e0, domainbinding=domainbinding@entry=0x79e010)
    at loadmsgcat.c:1260
#3  0x00007fae5c4aaafd in _nl_find_domain (dirname=dirname@entry=0x79e040 "/home/carlos/support/bug/domaindir", locale=<optimized out>,
    locale@entry=0x7fffc8e666f0 "existing-locale", domainname=domainname@entry=0x7fffc8e66710 "LC_MESSAGES/existing-domain.mo",
    domainbinding=domainbinding@entry=0x79e010) at finddomain.c:147
#4  0x00007fae5c4aa18c in __dcigettext (domainname=0x79f260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:626
#5  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#6  0x0000000000400aae in main (argc=2, argv=0x7fffc8e66988) at nls-test.c:115
(gdb)

The backtrace shows that if _nl_find_msg fails during _nl_load_domain that the -1 error will be treated as a pluralization string and get passed to strstr which will crash.

The fix for this is to abort the loading of the domain if you can't read the pluralization information:

diff --git a/intl/loadmsgcat.c b/intl/loadmsgcat.c
index e4b7b38..ac90ed1 100644
--- a/intl/loadmsgcat.c
+++ b/intl/loadmsgcat.c
@@ -1237,7 +1237,7 @@ _nl_load_domain (domain_file, domainbinding)
     default:
       /* This is an invalid revision.  */
     invalid:
-      /* This is an invalid .mo file.  */
+      /* This is an invalid .mo file or we ran out of resources.  */
       free (domain->malloced);
 #ifdef HAVE_MMAP
       if (use_mmap)
@@ -1257,6 +1257,11 @@ _nl_load_domain (domain_file, domainbinding)
 
   /* Get the header entry and look for a plural specification.  */
   nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
+  if (__builtin_expect (nullentry == (char *) -1, 0))
+    {
+      __libc_rwlock_fini (domain->conversions_lock);
+      goto invalid;
+    }
   EXTRACT_PLURAL_EXPRESSION (nullentry, &domain->plural, &domain->nplurals);
 
  out:

Further testing shows no more crashes, and we have covered all of the call sites, therefore we are done.

Conclusion

Failure injection testing with systemtap has proven to be very useful and key to easily and quickly locating additional failures without the need to completely trigger the original failure conditions. This process is thus particularly well suited to out-of-memory conditions deep in glibc.

The final step in our process is to patch the original malloc problem and at a minimum the two call sites (intl/dcigettext.c:942 and intl/loadmsgcat.c:1259) with missing checks for -1 returns from _nl_find_msg. Patching the call site at intl/dcigettext.c:638 is a maintenance issue and should be done to prevent future problems.

At this point we have a patch and 4 regression tests that cover all 4 callers of _nl_find_msg.

The regression tests can and should be put back into the glibc testsuite to prevent future changes from regressing the build.

We hope that this blog gives you a good overview of how to use systemtap on more complex projects with glibc as an example.

Notes:

At the time of testing running systemtap required root privileges but on Red Hat Enterprise Linux with Developer Toolset Version 2 and above the dyninst backend will allow all of the aboves tests as a normal user with minimal impact on the running code.

Why didn't we use the systemtap .return probe to capture function returns and adjust the return value of _nl_find_msg? The .return probe isn't supported in the setup I had, otherwise, yes, you can, in a setup that supports it, use the .return probe. I found instrumenting the call sites to be revealing of how the function was used, but in a mass test it would be easier to require .return probe support.

Last updated: February 23, 2024