Writing assembly code is straightforward when you are familiar with the targeted architecture's instruction set, but what if you need to write the code for more than one architecture? For example, you might want to test whether a particular assembler feature is available, or generate an object file for use with another tool. Writing assembly source code that can work on multiple architectures is not so simple.
This article describes common types of problems encountered when working with assembly code, and the techniques to overcome them. You will learn how to address problems with comments, data, symbols, instructions, and sections in assembly code. To get you started, the Portable assembler demo source file provides many examples of GNU Assembler (GAS) assembly code. I'll use a few of the examples in this article.
Problems with comments
There is no architecture-neutral way of creating a prefixed line comment. As a result,
# This is a comment
might or might not work, depending on the target. (On some architectures the hash character is part of the instruction syntax, as are the semicolon and colon characters.)
Instead, the safe approach is to use C-like comments:
/* This is a comment. */
But keep in mind that these comments cannot be nested:
/* This is /* not a */ valid comment. */
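As a brief sketch of how this looks in practice, the fragment below uses only C-style comments, both on lines of their own and after a directive. It is an illustration rather than part of the demo source file.

```asm
/* A block comment can span
   more than one line. */
        .data           /* A comment can also follow a directive. */
```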
Problems with data
The size of individual data items, such as integers, pointers, floats, and so on, varies from one architecture to another. Take the following example:
    .data
    .word 0x12345678
This code would fail to assemble on machines where a word is less than 4 bytes long. (Fortunately, the .data directive is universal.)
A more reliable way to insert specific integer values is to use the .dc.<letter> directives, where <letter> is b for bytes, w for 16-bit values, and l for 32-bit values. Here's an example:
    .data
    .dc.b 0x78
    .dc.w 0x5678
    .dc.l 0x12345678
This assembly code works on all targets, regardless of their word size.
Inserting 64-bit integer values
Oddly, the directive for 64-bit values does not follow the same naming scheme. Instead, the directive to use is .quad:
    .quad 0x1234567890abcdef
Endian-ness
All values are stored in the target's endian format, which is usually the right approach. However, when fixed ordering is required, specifying multiple single-byte values is the way to go:
    .data
    .dc.b 0x78, 0x56, 0x34, 0x12
This code produces a little-endian ordering of bytes, even on a big-endian architecture. You cannot, however, create multi-byte bit patterns this way on targets where the byte size is larger than 8 bits (for example, Texas Instruments' TIC54x). Outside assistance is the only way to handle this particular situation:
    .data
    .ifdef big_bytes
    .dc.b 0x5678, 0x1234
    .else
    .dc.b 0x78, 0x56, 0x34, 0x12
    .endif
This solution works provided the symbol big_bytes is defined for architectures with 16-bit bytes and not otherwise. (Symbols can be defined on the GAS command line with --defsym <name>=<value>.)
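To illustrate the --defsym mechanism, here is a sketch of a complete source file together with the command lines that might be used to assemble it. The file names portable.s and portable.o are hypothetical, and as stands for whichever GAS binary targets the architecture in question. The value 1 is arbitrary, because .ifdef only tests whether the symbol is defined.

```asm
/* Hypothetical invocations:
     as -o portable.o portable.s                        (8-bit bytes)
     as --defsym big_bytes=1 -o portable.o portable.s   (16-bit bytes)  */
        .data
        .ifdef big_bytes
        .dc.b 0x5678, 0x1234
        .else
        .dc.b 0x78, 0x56, 0x34, 0x12
        .endif
```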
Alignment requirements
Another problem with directives that store data values is that they can have alignment requirements. For example:
    .data
    .dc.b 0xff
    .dc.l 0x12345678
This example fails to assemble for the SH target because the 4 bytes in 0x12345678 are not stored on a 4-byte aligned boundary. You can solve this issue with an alignment directive, but be cautious about using .align, which has target-specific semantics. Instead, use either the .balign or .p2align directives:
    .data
    .dc.b 0xff
    .balign 4
    .dc.l 0x12345678
Note that this code introduces a gap between the 0xff byte and the 0x12345678 word.
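The same layout can be requested with .p2align, whose argument is the power of two to align to rather than a byte count. A minimal sketch, equivalent in intent to the .balign 4 example above:

```asm
        .data
        .dc.b 0xff
        .p2align 2      /* Align to 2^2 = 4 bytes, like .balign 4. */
        .dc.l 0x12345678
```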
Fixed values
GAS supports simple arithmetic and logical operations on symbols and constants. For most directives, the result must be a fixed value. Here's an example:
    .dc.b (val & 0xff), (val >> 8) & 0xff
This code works provided that the symbol val has a defined value when the directive is evaluated.
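For instance, the sketch below defines val before the directive that uses it. The value 0x1234 is arbitrary; with it, the two expressions evaluate to 0x34 and 0x12.

```asm
val = 0x1234            /* Define the symbol first. */
        .data
        .dc.b (val & 0xff), (val >> 8) & 0xff   /* Stores 0x34, then 0x12. */
```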
Storing strings
Strings can be stored easily, but beware that the .ascii directive does not store a terminating NUL byte. For C-like strings, use the .asciz directive instead:
    .ascii "this string has no NUL byte at the end"
    .asciz "this string does"
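To make the difference concrete, the following sketch stores the same C-style string twice: once with .asciz, and once with .ascii followed by an explicit zero byte. The string contents are arbitrary, and the layout described in the comments assumes a target with 8-bit bytes.

```asm
        .data
        .asciz "hello"  /* Five characters followed by a NUL byte. */
        .ascii "hello"  /* Only the five characters. */
        .dc.b 0         /* The NUL byte, added by hand. */
```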
Problems with symbols
Labels and symbols are defined in various ways, all of which work across most targets:
val = 0x1234
    .equiv here, .
    .equiv there, here + 4
this_is_a_label:
For compatibility with the HPPA assembler, however, a label's name must start in the first column of a line. By extension, the first column of any line must contain a whitespace character if no label is being defined.
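A layout that satisfies this rule might look like the sketch below, where the label name my_label is hypothetical. The label starts in the first column, and every other line begins with whitespace.

```asm
        .data
my_label:
        .dc.b 0x01
        .dc.b 0x02
```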
If a symbol or label holds an address, then it is safest to insert it into the code using the .dc.a directive, like so:
    .dc.a this_is_a_label
You can perform simple addition or subtraction operations on an address, but more complicated operations are often not supported. Calculating the difference between two labels usually works only when they are defined in the same section, and sometimes not even then:
    .dc.a label1 - 2        /* This will work. */
    .dc.a label1 - label2   /* This might not work. */
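One common use of label subtraction is recording the size of a block of data. The sketch below uses the hypothetical labels table_start and table_end, both defined in the same section. As noted above, even this form can be rejected on some targets, so treat it as something to test rather than rely on.

```asm
        .data
table_start:
        .dc.l 1, 2, 3, 4
table_end:
        .dc.a table_end - table_start   /* Size of the table; might not work everywhere. */
```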
Problems with instructions
Typically, instructions are specific to individual architectures. As a result, you cannot write a generic assembler source file that involves code. Starting with GAS 2.35, however, there is a new pseudo-op instruction (.nop), which generates a no-op instruction on any target:
    .text
    .nop    /* This is a real instruction. */
Problems with sections
All architectures accept the section names .text, .data, and .bss. The old AOUT file format only supports these names. More modern formats such as Portable Executable (PE) and Executable and Linkable Format (ELF) support arbitrary section names. When defining new sections, be aware that the .section directive for ELF targets accepts more arguments than does the PE version:
    .section name                   /* See note 1. */
    .section name, "flags"          /* See note 2. */
    .section name, "flags", %type   /* See note 3. */
Notes:
1. This form fails on targets where the section flags are compulsory.
2. This form works for both PE-based and ELF-based targets, although the flags are different.
3. This form only works on ELF-based targets. Note the use of the % character instead of the @ character.
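One way to cope with these differences is to choose the form of the directive with a command-line symbol, in the same way that big_bytes was used earlier. In the sketch below, the symbol elf_target and the section name .mydata are hypothetical, and the "aw" flags are an assumption: on ELF they request an allocatable, writable section, while the PE version is documented as ignoring a and treating w as writable. Targets limited to .text, .data, and .bss will still reject it.

```asm
/* Assemble for ELF targets with: --defsym elf_target=1 */
        .ifdef elf_target
        .section .mydata, "aw", %progbits
        .else
        .section .mydata, "aw"
        .endif
        .dc.l 0x12345678
```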
Conclusion
This article addressed common problems in writing portable assembly code and provided solutions and examples. In summary, portable assembler is hard to write and best kept simple, and persistence is the key.
Last updated: February 5, 2024