Escher Technologies Articles on Formal Verification

Using and Abusing Unions

The C union type is one of those features that is generally frowned on by those who set programming standards for critical systems, yet is quite often used. MISRA C 2004 rule 18.4 bans unions (“unions shall not be used”) on the grounds that there is a risk that the data may be misinterpreted. However, it goes on to say that deviations are acceptable for packing and unpacking of data, and for implementing variant records provided that the variants are differentiated by a common field.

According to K&R’s The C Programming Language, “A union is a variable that may hold (at different times) objects of different types and sizes, with the compiler keeping track of size and alignment requirements.” Note the words at different times. So it appears that they didn’t expect programmers to use them to pack and unpack data, using code such as the following:

uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
  union {
    uint16_t wordData;
    uint8_t byteData[2];
  } temp;
  temp.byteData[0] = lobyte;
  temp.byteData[1] = hibyte;
  return temp.wordData;
}

I regard this as an abuse of unions. This code is not portable, because its behaviour depends on how the compiler lays out the union, and on whether the processor is big-endian or little-endian. Is there any need to use it? Let’s look at the alternative:

uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
  return (uint16_t)((((uint16_t)hibyte) << 8) | (uint16_t)lobyte);
}

This version also makes it clear that lobyte contains exactly the lower 8 bits of the data (which was probably read from an I/O port), and does not assume that uint8_t has exactly 8 bits (as opposed to at least 8 bits).
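
To make the portability difference concrete, here is a minimal sketch that places the two versions side by side. The names unpack_union, unpack_shift and host_is_little_endian are hypothetical, introduced only for this illustration. The shift version returns the same value on any target, while the union version returns whatever the target's byte order produces.

```c
#include <stdint.h>

/* Union version from the article: the result depends on byte order. */
uint16_t unpack_union(uint8_t lobyte, uint8_t hibyte) {
  union {
    uint16_t wordData;
    uint8_t byteData[2];
  } temp;
  temp.byteData[0] = lobyte;
  temp.byteData[1] = hibyte;
  return temp.wordData;
}

/* Shift version: the result is defined by arithmetic, not by memory layout. */
uint16_t unpack_shift(uint8_t lobyte, uint8_t hibyte) {
  return (uint16_t)(((uint16_t)hibyte << 8) | (uint16_t)lobyte);
}

/* Hypothetical helper: returns 1 if the least significant byte is stored first. */
int host_is_little_endian(void) {
  const uint16_t probe = 1u;
  return *(const unsigned char *)&probe == 1u;
}
```

With lobyte 0x34 and hibyte 0x12, unpack_shift yields 0x1234 everywhere. unpack_union yields 0x1234 on a little-endian target and 0x3412 on a big-endian one.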

Is there any reason not to use this code and avoid the union? There might be a performance penalty, but only if you are using a poor compiler or a very low optimization level, so that the compiler does not implement the shift using a move or byte-swap instruction, and the processor has no barrel shifter. In other cases, this version might even be faster than the union version, because optimizing it does not require a temporary variable to be eliminated.

eCv requires programs to be type-safe, and doesn’t make assumptions about endianness or struct and union layout and alignment. So it doesn’t support use of unions in this way. In the event that you really do need to use a union for packing or unpacking data, you can fool eCv like this:

#ifdef __ECV__
// define the shift version of unpack
...
#else
// define the union version of unpack
...
#endif

but you are then assuming responsibility for ensuring that the union version behaves correctly.
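
Filled in, the pattern might look like the sketch below. It reuses the two definitions of unpack shown earlier and relies only on the __ECV__ macro named above. The comments are illustrative. Note that the two bodies agree only on a little-endian target, and checking that is exactly the responsibility you take on.

```c
#include <stdint.h>

#ifdef __ECV__
/* Shift version: this is the definition that eCv sees and verifies. */
uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
  return (uint16_t)(((uint16_t)hibyte << 8) | (uint16_t)lobyte);
}
#else
/* Union version: this is the definition that the compiler sees.
   It is NOT verified; it matches the shift version only if the
   target stores the low byte first (little-endian). */
uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
  union {
    uint16_t wordData;
    uint8_t byteData[2];
  } temp;
  temp.byteData[0] = lobyte;
  temp.byteData[1] = hibyte;
  return temp.wordData;
}
#endif
```

One way to discharge that responsibility is a unit test, run on the real target, that compares the union version against the shift version for a range of inputs.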

What about using unions for their intended purpose, i.e. holding different types of data at different times? The usual criticism here is that C unions don’t have automatic discriminants, so the compiler cannot insert run-time checks. Why not verify formally that the data is never misinterpreted instead? What we need to ensure is that a union is only ever read through the same member as was last used to assign it. We express the concept that “member M was last used to assign the value of E” in eCv using the syntax E holds M. We can use a holds expression anywhere in a specification or any other ghost context, but not of course in real code. Here’s an example:

struct Status { ... };
struct Error { ... };

union StatusOrError {
  struct Status st;
  struct Error err;
};

static union StatusOrError lastResult;

Whenever eCv sees lastResult.err or lastResult.st being read, it will attempt to prove lastResult holds err or lastResult holds st respectively. If we want to write a function that assumes that lastResult holds a particular member, eCv will fail to verify the function unless we declare that assumption as a precondition. For example:

void displayError()
pre(lastResult holds err)
{ ... lastResult.err ... }

void displayStatus()
pre(lastResult holds st)
{ ... lastResult.st ... }

Now eCv will need to verify that the precondition holds at each call to displayError or displayStatus:

lastResult.err = ... ;
displayError();     // OK
displayStatus();    // verification failure here

So we have made unions type safe, effectively by adding a ghost discriminant that can be interrogated by a holds expression. If you want to store a real discriminant, you can tie the two together using an invariant:

struct WrappedStatusOrError {
  union StatusOrError stOrErr;
  enum { disc_st, disc_err } disc;
  invariant((disc == disc_st) == (stOrErr holds st))
  invariant((disc == disc_err) == (stOrErr holds err))
};
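
For comparison, here is a minimal plain C sketch of the same wrapper with the discriminant maintained by hand and checked at run time. The struct members code and errnum, and the functions setStatus, setError, getStatus and getError, are hypothetical: the article elides the real contents of Status and Error, so they are stand-ins for illustration only. The assert calls check dynamically the same property that the invariants above state for eCv.

```c
#include <assert.h>

/* Hypothetical members: the real contents of these structs are elided. */
struct Status { int code; };
struct Error { int errnum; };

union StatusOrError {
  struct Status st;
  struct Error err;
};

struct WrappedStatusOrError {
  union StatusOrError stOrErr;
  enum { disc_st, disc_err } disc;
};

/* Hypothetical setters: write the member and its discriminant together,
   so the two can never disagree. */
void setStatus(struct WrappedStatusOrError *w, struct Status s) {
  w->stOrErr.st = s;
  w->disc = disc_st;
}

void setError(struct WrappedStatusOrError *w, struct Error e) {
  w->stOrErr.err = e;
  w->disc = disc_err;
}

/* Hypothetical getters: check the discriminant before reading the member. */
struct Status getStatus(const struct WrappedStatusOrError *w) {
  assert(w->disc == disc_st);
  return w->stOrErr.st;
}

struct Error getError(const struct WrappedStatusOrError *w) {
  assert(w->disc == disc_err);
  return w->stOrErr.err;
}
```

Routing every access through functions like these keeps the discriminant and the union in step. The cost is a run-time check on each read, which is the check that a formal proof of the holds condition aims to make unnecessary.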

Unions are rarely used in regular C++ programming, because variant data is almost always better represented using a class inheritance hierarchy. However, that approach normally requires dynamic memory allocation. Therefore, C-style unions still have a place in embedded C++ programming.
