
Writing assembly code is straightforward when you are familiar with the targeted architecture's instruction set, but what if you need to write the code for more than one architecture? For example, you might want to test whether a particular assembler feature is available, or generate an object file for use with another tool. Writing assembly source code that can work on multiple architectures is not so simple.

This article describes common types of problems encountered when working with assembly code, and the techniques to overcome them. You will learn how to address problems with comments, data, symbols, instructions, and sections in assembly code. To get you started, the Portable assembler demo source file provides many examples of GNU Assembler (GAS) assembly code. I'll use a few of the examples in this article.

Problems with comments

There is no architecture-neutral way of creating a prefixed line comment. As a result,

  # This is a comment

might or might not work, depending on the target. (On some architectures, the hash character is part of the instruction syntax rather than a comment marker, and the same is true of the semicolon and colon characters.)

Instead, the safe approach is to use C-like comments:

  /* This is a comment. */

But keep in mind that these comments cannot be nested:

  /* This is /* not a */ valid comment. */
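As a small illustration of the safe style, the sketch below uses only C-like comments, both as a multi-line block and at the end of a line. The data value is arbitrary and is there only to give the comments something to annotate.

```asm
/*
 * A C-like comment can span several lines,
 * which suits file headers and longer notes.
 */
  .data          /* It can also follow a directive on the same line. */
  .dc.b 0x01     /* Arbitrary example value. */
```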

Problems with data

The size of individual data items, such as integers, pointers, floats, and so on, varies from one architecture to another. Take the following example:

  .data
  .word 0x12345678

This code would fail to assemble on machines where a word is less than 4 bytes long. (Fortunately, the .data directive is universal.)

A more reliable way to insert specific integer values is to use the .dc.<letter> directives, where <letter> is b for bytes, w for 16-bit values, and l for 32-bit values. Here's an example:

  .data
  .dc.b 0x78
  .dc.w 0x5678
  .dc.l 0x12345678

This assembly code works on all targets, regardless of their word size.

Inserting 64-bit integer values

Oddly, the directive for 64-bit values does not follow the same naming scheme. Instead, the directive to use is .quad:

  .quad 0x1234567890abcdef

Endianness

All values are stored in the target's endian format, which is usually the right approach. However, when fixed ordering is required, specifying multiple single-byte values is the way to go:

  .data
  .dc.b 0x78, 0x56, 0x34, 0x12

This code produces a little-endian ordering of bytes, even on a big-endian architecture. You cannot, however, create multi-byte bit patterns this way on targets where the byte size is larger than 8 bits (for example, the Texas Instruments TIC54x). Outside assistance is the only way to handle this particular situation:

  .data
  .ifdef big_bytes
  .dc.b 0x5678, 0x1234
  .else
  .dc.b 0x78, 0x56, 0x34, 0x12
  .endif

This solution works provided the symbol, big_bytes, is defined for architectures with 16-bit bytes and not otherwise. (Symbols can be defined on the GAS command line with --defsym <name>=<value>.)
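The same single-byte technique gives a fixed big-endian layout, which can be handy for values that must appear in network byte order. This is a minimal sketch that assumes a target with 8-bit bytes:

```asm
  .data
  /* 0x12345678 stored most significant byte first,
     regardless of the target's native byte order. */
  .dc.b 0x12, 0x34, 0x56, 0x78
```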

Alignment requirements

Another problem with directives that store data values is that they can have alignment requirements. For example:

  .data
  .dc.b 0xff
  .dc.l 0x12345678

This example fails to assemble for the SH target because the 4 bytes in 0x12345678 are not being stored on a 4-byte aligned boundary. You can solve this issue with an alignment directive, but be cautious of using .align, which has target-specific semantics. Instead, use either the .balign or .p2align directives:

  .data
  .dc.b 0xff
  .balign 4
  .dc.l 0x12345678

Note that this code introduces a gap between the 0xff byte and the 0x12345678 word.
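For comparison, here is a sketch of the same layout written with .p2align, whose argument is a power of two rather than a byte count. With an argument of 2, it requests the same 4-byte boundary as .balign 4:

```asm
  .data
  .dc.b 0xff
  .p2align 2         /* Align to 2**2 = 4 bytes, like .balign 4. */
  .dc.l 0x12345678
```

Which of the two directives to use is largely a matter of taste. The .balign form states the boundary directly, which may read more clearly to people who are new to the code.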

Fixed values

GAS supports simple arithmetic and logical operations on symbols and constants. For most directives, the result must be a fixed value. Here's an example:

  .dc.b (val & 0xff), (val >> 8) & 0xff

This code works provided that the symbol val has a defined value when the directive is evaluated.
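As a fuller sketch of this idea, the fragment below defines val first and then emits all four of its bytes, least significant byte first. The value is purely illustrative.

```asm
  val = 0x12345678   /* Illustrative value, defined before it is used. */

  .data
  /* Emit val one byte at a time, least significant byte first. */
  .dc.b (val & 0xff), (val >> 8) & 0xff
  .dc.b (val >> 16) & 0xff, (val >> 24) & 0xff
```

On a target with 8-bit bytes, this stores the bytes 0x78, 0x56, 0x34, and 0x12 in that order, whatever the target's endianness.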

Storing strings

Strings can be stored easily, but beware that the .ascii directive does not store a terminating NUL byte. For C-like strings, use the .asciz directive instead:

  .ascii "this string has no NUL byte at the end"
  .asciz "this string does"
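To make the difference concrete, the following sketch stores the same four bytes twice: once with .asciz, and once with .ascii followed by an explicit zero byte. It assumes a target with 8-bit bytes.

```asm
  .data
  /* Both sequences store 'a', 'b', 'c', and then a NUL byte. */
  .asciz "abc"

  .ascii "abc"
  .dc.b 0
```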

Problems with symbols

Labels and symbols are defined in various ways, all of which work across most targets:

  val = 0x1234
  .equiv here, .
  .equiv there, here + 4
  this_is_a_label:

For compatibility with the HPPA assembler, however, a label's name must start in the first column of a line. By extension, any line that does not define a label needs a whitespace character in its first column.
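A layout that respects the first-column rule might look like the sketch below. The label name counter is hypothetical and the stored value is arbitrary.

```asm
  .data
counter:
  .dc.l 0            /* Indented, because this line defines no label. */
```

Here the label begins in the first column, and every other line begins with whitespace.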

If a symbol or label holds an address, then it is safest to insert it into the code using the .dc.a directive, like so:

  .dc.a this_is_a_label

You can perform simple addition or subtraction operations on an address, but more complicated operations are often not supported. Calculating the difference between two labels usually works only when they are defined in the same section, and sometimes not even then:

  .dc.a label1 - 2      /* This will work. */
  .dc.a label1 - label2 /* This might not work. */

Problems with instructions

Typically, instructions are specific to individual architectures. As a result, you cannot write a generic assembler source file that contains code. Starting with GAS 2.35, however, there is a new pseudo-op (.nop) that generates a no-op instruction on any target:

  .text
  .nop /* This is a real instruction. */

Problems with sections

All architectures accept the section names .text, .data, and .bss. The old AOUT file format only supports these names. More modern formats such as Portable Executable (PE) and Executable and Linkable Format (ELF) support arbitrary section names. When defining new sections, be aware that the .section directive for ELF targets accepts more arguments than does the PE version:

  .section name                  /* See note 1. */
  .section name, "flags"         /* See note 2. */
  .section name, "flags", %type  /* See note 3. */

Notes:

  1. This form fails on targets where the section flags are compulsory.
  2. This form works for both PE-based and ELF-based targets, although the flags are different.
  3. This form only works on ELF-based targets. Note the use of the % character instead of the @ character, which starts a comment on some targets (ARM, for example).
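Because the flag letters differ between the two formats, one way to share a source file is to select the directive with a symbol supplied by the build, in the same way as big_bytes earlier. The sketch below assumes a hypothetical symbol, pe_target, that is passed with --defsym pe_target=1 when assembling for a PE-based target. The section name .mydata is also made up for the example.

```asm
  .ifdef pe_target
  /* PE flags: d = data section, w = writable. */
  .section .mydata, "dw"
  .else
  /* ELF flags: a = allocatable, w = writable. */
  .section .mydata, "aw", %progbits
  .endif
  .dc.l 0x12345678
```

This covers only the PE and ELF cases. Other object formats would need their own branch or would have to stay with the standard section names.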

Conclusion

This article addressed common problems in writing portable assembly code and provided solutions and examples. In summary, portable assembler is hard to write and best kept simple, and persistence is the key.

Last updated: February 5, 2024