Abhilash Meesala

Endianness

If we each take a piece of paper and write today’s date, we may end up with two different representations of the same concept. I may write it as 22/10/23 (DD/MM/YY). You may write it as 23/10/22 (YY/MM/DD). Both formats represent the same date, 22nd October 2023, yet how you and I choose to encode and decode it could be drastically different.

This “encoding and decoding a concept” issue applied to computer architecture is the Endianness problem.


Computer memory is byte-addressable - that is, every byte of memory has a unique address, and it can be set, accessed or cleared using that address. The following shows a typical representation of memory.

Address | 0x0  | 0x1  | 0x2  | 0x3
Value   |      |      |      |

Storing a single-byte value is straightforward. Say we need to store 83; we take the bit representation of 83 (0x53) and store it at the memory location. When we’d like to read back the value, we read the bit sequence at the memory location and interpret it as 83.

Address | 0x0  | 0x1  | 0x2  | 0x3
Value   | 0x53 |      |      |
83 stored at location 0

Storing an ASCII string is also easy. We take each character in the string and write them to consecutive locations in the memory. We also add a null terminator (0x00) at the end to mark the end of the string.

Address | 0x0  | 0x1  | 0x2  | 0x3
Value   | 0x4F | 0x4B | 0x00 |
String 'OK' stored at location 0 ('O' is 0x4F and 'K' is 0x4B in ASCII)

To read the string back, we read the bit sequence from our start address until we encounter a null terminator. Then, we chunk the bit sequence into byte-sized slices and interpret each byte as a character in the string.

Storing multi-byte values requires some thinking. Take 30,54,19,896 (0x12345678) for example. How do we store this value?

One option is to store it just like how it reads in English or in binary - left to right, starting with 0x12.

Address | 0x0  | 0x1  | 0x2  | 0x3
Value   | 0x12 | 0x34 | 0x56 | 0x78
Storing 0x12345678 left to right

When we store 0x12 at address 0, we are storing the Most Significant Byte (MSB) at the lowest address. This is analogous to representing the date as YY/MM/DD. The most significant part, the one that can swing the value the most, is at the leftmost end.

Alternatively, we can represent the same number from right to left.

Address | 0x0  | 0x1  | 0x2  | 0x3
Value   | 0x78 | 0x56 | 0x34 | 0x12
Storing 0x12345678 right to left

Storing the Least Significant Byte (LSB) at a lower address is analogous to representing the date as DD/MM/YY. The MSB, in this case, is stored far to the right. Note that bits within a byte are still stored left to right. Only the byte order is reversed.

So, which of these options do we choose?

There’s no right answer. Some systems store the MSB at a lower address, while others store the LSB at a lower address. Endianness refers to the byte order in which we store the sequence of bytes in memory. We refer to a system as Big-Endian if it stores the MSB at a lower address and Little-Endian if it stores the LSB at a lower address.


Each system, whether big-endian or little-endian, is internally consistent. If 0x1234 is stored at location 0, a read from location 0 by the same system will yield the same value.

Endianness becomes an issue when two systems start communicating with each other. Suppose you store the integer 0x12345678 to a file and send it to a machine that uses the opposite Endianness; the value it reads will drastically differ from what you intended to send.

In this example, if 30,54,19,896 (0x12345678) is saved using big-endian order, reading it on a little-endian system would yield 2,01,89,15,346 (0x78563412). Not good.

So, how do we deal with endianness issues?

Broadly speaking, two categories of issues arise due to Endianness - code portability and data sharing.


We usually take one or more of the following approaches to solve data-sharing issues.

  1. Define the Endianness beforehand as part of the specification, or
  2. Detect the Endianness of the data and architecture at runtime and do translations accordingly, or
  3. Convert all scalar data into ASCII character strings, which are endian-independent.

Approach 1 is by far the most popular. Most network stacks, communication protocols and file formats define their Endianness as part of their specification. For example, the header fields of the core TCP/IP protocols are big-endian (which is why big-endian is also called network byte order); JPEG is big-endian, while GIF is little-endian.

File formats such as TIFF and XML take the second approach: a marker at the start of the data identifies its Endianness. A TIFF file begins with the two bytes “II” for little-endian or “MM” for big-endian. XML encoded as UTF-16 uses a Byte Order Mark (BOM), the value 0xFEFF. If the first two bytes you read are 0xFE 0xFF, the data is encoded as Big-Endian; if they are 0xFF 0xFE, it is Little-Endian.

Because Endianness is only an issue for multi-byte values, a few systems convert all scalar data into ASCII character strings. The receiving system will parse the string and convert it back. While this approach is simple and works well in most cases, it is not efficient.

If you are designing a system that needs to share data with other systems, defining the Endianness as part of the specification is best.


Code portability is a trickier issue because it is easy to write code that’s not portable, and there’s no proven way to detect these issues at compile time. In a language like C, where the programmer directly controls what is stored and read from memory locations, shooting yourself in the foot is easy.

/* A contrived example showing how easy it is to write non-portable AND wrong code */

#include <stdio.h>

int main(void) {
  unsigned int x = 0x1284;
  // c points to the first byte of x, which is dependent on the Endianness of the system:
  // 0x84 on a little-endian machine, 0x00 on a big-endian one (assuming a 4-byte int)
  char *c = (char*) &x;
  printf("%x\n", *c);
  // This is wrong on so many levels: whether plain char is signed is
  // implementation-defined, and x is unsigned, so it can never be negative.
  if(*c < 0) {
    printf("Number is negative");
  } else {
    printf("Number is positive");
  }
  return 0;
}

If you need to access individual bytes of a multi-byte value, detect the Endianness at run time and appropriately handle the data.

When dealing with data external to your system, stick to standard functions that abstract away the endianness details. For example, if you need to send a 32-bit integer over a network, use functions such as htonl and ntohl, which translate between the host’s byte order and network (big-endian) byte order.

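#include <arpa/inet.h> /* htonl, ntohl (POSIX) */
#include <inttypes.h>  /* PRIx32 */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* htonl and ntohl operate on 32-bit values, so use uint32_t rather than long */
    uint32_t x = 0x12345678;

    printf("x: 0x%" PRIx32 "\n", x);

    /* Host byte order to network (big-endian) byte order */
    uint32_t network_order = htonl(x);
    printf("htonl(x): 0x%" PRIx32 "\n", network_order);

    /* Network byte order back to host byte order */
    uint32_t host_order = ntohl(network_order);
    printf("ntohl(htonl(x)): 0x%" PRIx32 "\n", host_order);
    return 0;
}

// Output on a little-endian host (on a big-endian host, htonl leaves the value unchanged):
// x: 0x12345678
// htonl(x): 0x78563412
// ntohl(htonl(x)): 0x12345678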

Spending a little effort upfront in making your code endian-independent goes a long way in avoiding the grief of debugging byte-order issues.

If you’d like to read more about Endianness, I recommend you go through Danny Cohen’s On Holy Wars and a Plea for Peace. It is a classic paper that first introduced the terms Little Endian and Big Endian to computing. This StackExchange question discusses the advantages and disadvantages of using one Endianness over the other.