How to store large amounts of data in a program

Most programs need data in order to work. Sometimes this data is provided to the program when it runs, and sometimes the data is built into the program. In this article, I'll explain how to store large amounts of data inside a program so that it is there when the program runs.

The most obvious method of storing data is to include it in the program's source code. For example, in C:

int a = 1;

This approach works for small amounts of data, but it quickly becomes cumbersome as the amount of data to be stored increases. Additionally, if the data is going to be stored in this way, it's often necessary to create a tool that will convert the data into a form that is acceptable to the programming language used.
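As a rough illustration of what such a conversion tool might look like, the sketch below writes a block of bytes out as a C array definition, much as tools like xxd -i do. The function name emit_c_array and the generated <name>_len variable are hypothetical choices made for this example, and the sketch assumes at least one byte of input, because an empty initializer list is not valid in older C standards.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write `len` bytes of `data` to `out` as a C array
   definition called `name`, followed by a variable holding its length.
   Assumes len > 0.  Returns 0 on success, -1 if a write fails.  */
int emit_c_array(FILE *out, const char *name,
                 const unsigned char *data, size_t len)
{
    size_t i;

    if (fprintf(out, "const unsigned char %s[] = {", name) < 0)
        return -1;

    for (i = 0; i < len; i++) {
        /* Start a new, indented line every 12 bytes.  */
        const char *sep = (i % 12 == 0) ? "\n  " : " ";
        if (fprintf(out, "%s0x%02x,", sep, data[i]) < 0)
            return -1;
    }

    if (fprintf(out, "\n};\nconst unsigned long %s_len = %lu;\n",
                name, (unsigned long) len) < 0)
        return -1;

    return 0;
}
```

The generated text can be saved to a .c file and compiled with the rest of the program. For a multi-megabyte input the resulting source file is several times larger than the data itself, which is part of what makes this approach cumbersome.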

The next choice would be to load the data at run-time. This works, but it has problems, too. For example, it presumes the existence of a filesystem that can be used to store the data file(s). It also means that the program is no longer a single entity but now has to be shipped with these data files. In addition, extra code needs to be written to handle situations where the files are missing or corrupt.
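To give a feel for the extra code involved, here is a minimal sketch of a run-time loader. The function name load_file is made up for this example. It only handles missing or unreadable files and short reads. Detecting corrupt contents (with a checksum, for instance) would need yet more code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole of `path` into a freshly allocated
   buffer.  On success, returns the buffer (the caller frees it) and stores
   its size in *len.  Returns NULL if the file is missing or unreadable.  */
unsigned char *load_file(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf = NULL;
    long size;

    if (f == NULL)
        return NULL;                 /* missing or unreadable file */

    /* Find the file size by seeking to the end and back again.  */
    if (fseek(f, 0, SEEK_END) != 0
        || (size = ftell(f)) < 0
        || fseek(f, 0, SEEK_SET) != 0)
        goto fail;

    /* Allocate at least one byte so that an empty file is not an error.  */
    buf = malloc(size > 0 ? (size_t) size : 1);
    if (buf == NULL)
        goto fail;

    if (fread(buf, 1, (size_t) size, f) != (size_t) size)
        goto fail;                   /* short read */

    fclose(f);
    *len = (size_t) size;
    return buf;

fail:
    free(buf);
    fclose(f);
    return NULL;
}
```

Every caller then has to decide what to do when NULL comes back, a decision that does not arise when the data is part of the executable.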

So, this article presents a method for including large data files in the body of an executable program. The article is written with ELF-based GNU/Linux systems in mind. Other operating systems may have other methods for solving this problem. In particular, it is worth noting that Windows supports the concept of "resources" [1] for programs, which provide read-only access to various types of embedded data.

The INCBIN directive

The method is to make use of an assembler source file, or even inline assembler, and the special assembler pseudo-op called .incbin [2]. This directive allows an arbitrary file to be included in the program at the specified location. For example:

.incbin "foo.jpg"

In practice, it is best to make sure that the data is located in the correct section and that it is aligned correctly. Additionally, symbols will probably be needed to provide access to the data from the high-level source code:

.data
.align 4
.global start_of_foo
start_of_foo:
.incbin "foo.jpg"
.global end_of_foo
end_of_foo:

This could then be accessed in a C source file like this:

extern char start_of_foo;
extern char end_of_foo;
char * p;

for (p = & start_of_foo; p < & end_of_foo; p++)
  ...

Note that the use of the address operator (&) and the absence of pointer types (char *) in the declarations above are correct. This is because of the difference between assembler-created symbols and compiler-generated symbols [3]. When an assembler creates a symbol, all it really does is provide a label that corresponds to a given address. When a compiler creates a symbol, by contrast, it creates a space in the program's data, installs a value into that space, and then uses the symbol as an indirect reference to that value.

The C language does allow symbols to be treated as labels; however, they must be declared as unsized arrays instead:

extern char start_of_foo[];
extern char end_of_foo[];
char * p;

for (p = start_of_foo; p < end_of_foo; p++)
...
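To experiment with these declarations without an image file to hand, the following sketch swaps .incbin for an .ascii directive, so the embedded data is just the five bytes of "hello". It assumes GCC or Clang on an ELF-based GNU/Linux system, and the names start_of_demo, end_of_demo, demo_size, and demo_sum are invented for this example.

```c
#include <stddef.h>

/* Top-level inline assembler: emit five bytes between two global labels.
   .ascii stands in for .incbin here so that no external file is needed.  */
__asm__(
    ".pushsection .rodata\n"
    ".global start_of_demo\n"
    "start_of_demo:\n"
    ".ascii \"hello\"\n"
    ".global end_of_demo\n"
    "end_of_demo:\n"
    ".popsection\n");

/* Unsized arrays: each name evaluates directly to the label's address.  */
extern const char start_of_demo[];
extern const char end_of_demo[];

/* Number of bytes between the two labels.  */
size_t demo_size(void)
{
    return (size_t) (end_of_demo - start_of_demo);
}

/* Sum of the embedded bytes, walking the data like an ordinary array.  */
unsigned demo_sum(void)
{
    const char *p;
    unsigned sum = 0;

    for (p = start_of_demo; p < end_of_demo; p++)
        sum += (unsigned char) *p;
    return sum;
}
```

Because the labels mark addresses rather than objects holding values, the size is obtained by subtracting one label from the other, not by reading anything stored at end_of_demo.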

The assembler code puts the contents of foo.jpg into the program's data section, which means that it can be written to as well as read. If the data needs to be read-only, then it should be placed into the .rodata section instead, like this:

.section .rodata
[...]
.incbin "foo.jpg"
[...]

In fact, it may be desirable to place the data into a section all of its own so that it can be easily located in the resulting executable. The .section directive allows new sections to be created so the following could be used:

.section foo-image, "a", @progbits

The "a" indicates that space should be allocated for the section in the run-time memory image of the program. By default this data is read-only, so if it needs to be writeable, you would add the w flag (i.e., "aw"). The @progbits indicates that the section only contains data, nothing else.

Another thing to consider with this method is that it changes the current section, which could cause problems if the assembler is inlined into higher-level source code. In this case, the .pushsection and .popsection pseudo-ops can be used to change the section safely, like this:

__asm__("\n\
    .pushsection .foo-image, \"a\", @progbits\n\
    .align 4\n\
    .global start_of_foo\n\
start_of_foo:\n\
    .incbin \"foo.jpg\"\n\
    .global end_of_foo\n\
end_of_foo:\n\
    .popsection\n");

Putting the data into a section of its own also has an additional benefit. As long as the section name is a valid C identifier (meaning foo_image is OK, but foo-image is not), then the linker will automatically create beginning and end symbols for it. So, it's not necessary to declare them in the assembler code. Hence the following program will print out the size and contents of a file called foo.jpg, with foo.jpg being embedded into the executable:

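#include <stdio.h>

int
main (void)
{
  /* Start and end symbols created by the linker for section foo_image.  */
  extern const char __start_foo_image[];
  extern const char __stop_foo_image[];
  const char * p;

  /* Embed foo.jpg in its own read-only section.  */
  __asm__("\n\
.pushsection foo_image, \"a\", @progbits\n\
.incbin \"foo.jpg\"\n\
.popsection\n");

  /* The pointer difference is cast to match the %lx conversion.  */
  printf ("image size: %#lx\n",
          (unsigned long) (__stop_foo_image - __start_foo_image));

  for (p = __start_foo_image; p < __stop_foo_image; p++)
    printf ("%d ", *p);

  printf ("\n");
  return 0;
}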

Modifying the in-program data

One problem with storing data inside an executable is that it is then difficult to modify the data. Recompilation is always an option, but there is another option. The objcopy program allows the contents of sections in a program to be changed. Note, however, that it does not allow editing of individual bytes within a section, only the wholesale replacement of the contents of a section. Thus, this method only works if the data has been placed into a section of its own.

The command [4] looks like this:

objcopy --update-section sectionname=filename <file>

So, given the examples above this command:

objcopy --update-section foo_image="bar.jpg" a.out

will replace the foo.jpg image inside a.out with the bar.jpg image.

This method does have a major flaw, however; the replacement does not change the symbols generated by the assembler or the linker, and the compiled code will still use the old values. So, if the new file is of a different size to the old file then the stop/end symbol will be incorrect. The start symbol will still be OK, because its value is relative to the start of the foo_image section, which is always zero. Thus, the moral to this story is that, unless the data is self-describing, do not replace it with anything other than an equal-sized block.
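One way to follow this advice, offered here only as a sketch, is to always embed a fixed-size block and to make it self-describing with a small header that records how much of the block is in use. Replacement files are then padded to the same size. The layout (a 4-byte little-endian length followed by the payload) and the names BLOB_HEADER_SIZE and blob_payload are assumptions made for this example, not part of any tool mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a 4-byte little-endian payload length, followed by
   the payload itself, followed by padding up to the fixed block size.  */
#define BLOB_HEADER_SIZE 4

/* `blob` points at the start of the embedded block and `capacity` is the
   size of the whole block (for example, the stop symbol minus the start
   symbol).  Returns a pointer to the payload and stores its length in
   *len, or returns NULL if the header claims more bytes than fit.  */
const unsigned char *blob_payload(const unsigned char *blob,
                                  size_t capacity, size_t *len)
{
    uint32_t n;

    if (capacity < BLOB_HEADER_SIZE)
        return NULL;

    /* Assemble the length byte by byte so host endianness is irrelevant.  */
    n = (uint32_t) blob[0]
        | ((uint32_t) blob[1] << 8)
        | ((uint32_t) blob[2] << 16)
        | ((uint32_t) blob[3] << 24);

    if (n > capacity - BLOB_HEADER_SIZE)
        return NULL;                 /* header is inconsistent */

    *len = n;
    return blob + BLOB_HEADER_SIZE;
}
```

With this arrangement the start and stop symbols stay valid after a replacement, because the block size never changes. Only the length stored in the header differs from one payload to the next.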

Conclusion

It is possible to store large data sets inside a program, using a little bit of assembler hackery. Putting the data into its own section makes it easier to examine and, if necessary, alter. This approach does make the program bigger, of course, but depending upon the circumstances it may still be better than storing the data outside of the program.

References

[1] https://en.wikipedia.org/wiki/Resource_(Windows)
[2] https://sourceware.org/binutils/docs-2.32/as/Incbin.html#Incbin
[3] https://sourceware.org/binutils/docs-2.32/ld/Source-Code-Reference.html#Source-Code-Reference
[4] https://sourceware.org/binutils/docs-2.32/binutils/objcopy.html#objcopy

Last updated: July 3, 2019
