David

Verifying a binary search - Part 2

In my last entry I showed how to use a correct-by-construction approach to develop a binary search function. We got as far as specifying the function and the loop, but we left the loop body undefined. The function declaration looked like this:

size_t bSearch(const LinEntry* array table, size_t nElems, uint16_t key)
pre(table.lwb == 0; table.lim == nElems)
pre(forall a in table.indices; b in table.indices
      :- b > a => table[b].raw > table[a].raw)
post(result <= nElems)
post(forall i in 0..(result - 1) :- key >= table[i])
post(forall i in result..(nElems - 1) :- key <= table[i]);

and the function definition looked like this:

size_t bSearch(const LinEntry* array table, size_t nElems, uint16_t key) {
  size_t low = 0, high = nElems;
  while (high != low)
  keep(high >= low)
  keep(high <= nElems)
  keep(forall i in 0..(low - 1) :- table[i].rawValue <= key)
  keep(forall i in high..(nElems - 1) :- table[i].rawValue >= key)
  decrease(high - low) {
    low = ?; high = ?;
  }
  return low;
}

eCv verified the program, apart from the incomplete loop body. So we just need to write loop body code that preserves the four loop invariants (the expressions in the keep clauses), decreases the variant (the expression in the decrease clause), and leaves the variant non-negative unless the loop is about to terminate.

The classic way to do a binary search is to pick an index midway between low and high, compare the table item at that point with the key, and adjust either low or high to that index depending on the result. So here’s a first attempt at the loop body code:

const size_t mid = (low + high)/2;
if (key >= table[mid].raw) {
low = mid;
} else {
high = mid;
}

If we try to verify this using eCv, we get a single failed verification condition:

c:\escher\ecvtest\ex5.c (23,9): Warning! Exceeded time limit proving:
   Loop body establishes end condition or decreases variant
       (defined at c:\escher\ecvtest\ex5.c (21,5))
      (see C:\escher\ecvtest\ex5_unproven.htm#6), did not prove:
      (low$loopstart_7603,5$ + high$loopend$) < (low$loopend$ + high$loopstart_7603,5$).

The prover has timed out trying to prove that the body either decreases the variant or leads to termination of the loop. There are four possibilities:

  1. There is a genuine problem – the code may loop indefinitely.
  2. The code won’t loop indefinitely, but we need a different loop variant in order to make this provable.
  3. The verification condition was simply too hard for the prover (this is always a possibility when the proof timed out). We could try increasing the prover timeout.
  4. There is a bug in eCv.

In this case, if we follow the link to the HTML proof report file, we see this:

Could not prove any of:
!((1 + lowloopstart_7603,5) == highloopstart_7603,5)
…

So the prover thinks that the loop may not make any progress if high == 1 + low at the start of the loop. Sure enough, there is a problem in this case; because in this situation, the integer division rounds down, giving mid == low. So we may end up doing the assignment low = low, leading to an indefinite loop.

One way to fix this is to avoid the situation high = 1 + low at the start of the loop. For example, we could change the while-condition to high > low + 1. Returning low immediately after the end of the loop will no longer be correct, so we’ll need to do something different.

However, the loop invariant we have to maintain is that all elements up to and including low – 1 hold raw values less than or equal to key. Therefore, having established that the element at mid is less than or equal to key, we can set low to mid + 1 instead of mid, avoiding the problem. So the code now looks like this:

size_t bSearch(const LinEntry* array table, size_t nElems, uint16_t key) {
  size_t low = 0, high = nElems;
  while (high != low)
  keep(high >= low)
  keep(forall i in 0..(low - 1) :- table[i].rawValue <= key)
  keep(forall i in high..(nElems - 1) :- table[i].rawValue >= key)
  decrease(high - low) {
    const size_t mid = (low + high)/2;
    if (key >= table[mid].raw) {
      low = mid + 1;
    } else {
      high = mid;
    }
  }
  return low;
}

This is sufficient to make the function fully verifiable.

I hope you will agree that this has demonstrated a correct-by-construction approach to writing loops, avoiding the dangers of non-termination and off-by-one errors. It’s all too easy to introduce off-by-one errors in C, because – unlike Ada – the C language doesn’t provide a form of loop that iterates over exactly the index range of an array, so you need to design the termination condition yourself.

Those of you who are familiar with SPARK Ada may be wondering why I used multiple preconditions, postcondition and loop invariants, rather than “and”-ing them all together like this:

keep(high >= low
     && high <= nElems
     && (forall i in 0..(low - 1) :- table[i].rawValue <= key)
     && (forall i in high..(nElems - 1) :- table[i].rawValue >= key))

You can do that in eCv, but there are two reasons why it is better not to do so, as follows:

Firstly, when eCv is generating verification conditions – in this case, that the initialization establishes the loop invariant that the loop body preserves it – eCv will generate a separate verification condition for each invariant expression. If you “and” the invariants together, you have a single expression, and therefore a single verification condition. If eCv fails to prove it, you can’t easily tell which part of the invariant failed without looking at the prover output. Writing the invariants separately (or, alternatively, putting them all in one keep-clause as separate expressions separated by semicolons) causes eCv to generate a separate verification condition for each one, so that you can see immediately which one(s) failed.

The second reason for using multiple specification clauses is also to do with error reporting. Remember that eCv keywords such as pre, post and keep are implemented as macros, so that the specifications can be made invisible to standard C compilers. When a C preprocessor expands the macros, it generally ignores the layout of the actual macro arguments, and generates the expanded version using the layout of the macro definition. So a macro argument that you wrote across several lines in your source code typically gets expanded to a single line. This means that if an error is reported, the line number will be reported as the line on which the eCv keyword occurred, even if the error is in a macro argument a few lines later. So it’s best to keep the arguments to eCv specification macros on the same line as the keyword. Therefore, when using long expressions, you’ll want to use a separate instance of the eCv keyword with each expression.

ARTICLES   List of Articles     TOPTOP