I’ve received lots of positive feedback following the release of bfc v1.0.0, my optimising BF compiler.
However, the competition has been heating up! It’s been interesting to compare performance with other optimising BF implementations. In a few cases, bfc’s performance was matched by simpler implementations.
Challenge accepted. How can we leverage bfc’s optimisations to produce even faster BF programs?
Reorder With Offset
One popular optimisation bfc lacked was instruction offsets.
If we have the code:

    >+>>++>
Our IR previously looked like this:
    PointerIncrement 1
    Increment 1
    PointerIncrement 2
    Increment 2
    PointerIncrement 1
However, accessing cells in our array at a specific offset is cheap. If we’re coding in C, accessing some_array[index + 1] is often as fast as accessing some_array[index].
We add an offset attribute to Increment and Set instructions. This has two big advantages:
- Array access with a constant offset is cheap in hardware.
- We can combine many more pointer increment instructions.
After introducing offsets, our IR looks like this:
    Increment 1 (offset 1)
    Increment 2 (offset 3)
    PointerIncrement 4
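This combining pass can be sketched in a few lines of Python. The tuple-based IR here — ('ptr', n) for PointerIncrement, ('inc', amount, offset) for Increment — is a stand-in for bfc’s real (Rust) instruction types, not its actual representation:

```python
def combine_offsets(instrs):
    """Fold pointer movements into instruction offsets, emitting a
    single trailing PointerIncrement for the net movement."""
    result = []
    net = 0  # net pointer movement seen so far
    for instr in instrs:
        if instr[0] == "ptr":
            net += instr[1]
        elif instr[0] == "inc":
            # Apply the increment at the current net offset instead of
            # physically moving the pointer first.
            result.append(("inc", instr[1], instr[2] + net))
    if net != 0:
        result.append(("ptr", net))
    return result

# >+>>++>  as IR: PointerIncrement 1, Increment 1, PointerIncrement 2,
# Increment 2, PointerIncrement 1
ir = [("ptr", 1), ("inc", 1, 0), ("ptr", 2), ("inc", 2, 0), ("ptr", 1)]
print(combine_offsets(ir))
# → [('inc', 1, 1), ('inc', 2, 3), ('ptr', 4)]
```

Five instructions become three, and only one actual pointer move survives.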
There’s another advantage of adding offsets to Increment and Set instructions. Given a sequence of Increments and Sets with different offsets, we can freely reorder them. Given the IR:
    Increment 1 (offset 1)
    Increment 2 (offset -1)
    Set 3 (offset 2)
    Set 4 (offset 0)
We sort by offset, producing the IR:
    Increment 2 (offset -1)
    Set 4 (offset 0)
    Increment 1 (offset 1)
    Set 3 (offset 2)
By accessing the cells array in sequential order, we get better cache locality during program execution.
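The reordering itself is just a sort. A minimal Python sketch, again using a hypothetical tuple IR of ('inc', amount, offset) and ('set', value, offset) instructions:

```python
def sort_by_offset(instrs):
    """Reorder a run of Increment/Set instructions by cell offset.

    Instructions at different offsets touch distinct cells, so they
    commute freely. Python's sorted() is stable, so instructions that
    share an offset keep their relative order, preserving behaviour."""
    return sorted(instrs, key=lambda instr: instr[2])

ir = [("inc", 1, 1), ("inc", 2, -1), ("set", 3, 2), ("set", 4, 0)]
print(sort_by_offset(ir))
# → [('inc', 2, -1), ('set', 4, 0), ('inc', 1, 1), ('set', 3, 2)]
```

The stability requirement is the important detail: two writes to the same offset must not swap places.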
Next Mutation Analysis
In bfc v1.0.0, we would remove redundant adjacent instructions. For example, incrementing a cell is pointless just before writing to it:

    +,

and the second of two consecutive loops is dead, because the first loop leaves the current cell at zero:

    [some loop here][this loop is dead]
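This adjacent-instruction peephole can be sketched as a single pass in Python, assuming a hypothetical tuple IR of ('inc', amount, offset), ('read',) and ('loop', body) instructions:

```python
def remove_redundant_adjacent(instrs):
    """Drop instructions made redundant by their immediate neighbour."""
    result = []
    for instr in instrs:
        # An increment immediately before a read is overwritten by the
        # input value, so drop it.
        if instr[0] == "read" and result and result[-1][0] == "inc":
            result.pop()
        # A loop immediately after a loop never runs: the first loop
        # only exits once the current cell is zero.
        if instr[0] == "loop" and result and result[-1][0] == "loop":
            continue
        result.append(instr)
    return result

print(remove_redundant_adjacent([("inc", 1, 0), ("read",)]))
# → [('read',)]
print(remove_redundant_adjacent([("loop", ["x"]), ("loop", ["y"])]))
# → [('loop', ['x'])]
```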
However, now that we’re reordering instructions, techniques that only inspect adjacent instructions can miss opportunities.

bfc solves this with ‘next mutation analysis’. Given an instruction, bfc is now able to find the next mutation instruction for the current cell.
Of course, it’s not always possible to find one. It may not be possible to determine the next mutation at compile time (e.g. complex loops), or there may not be another mutation for this cell.
Here’s an example. Given the IR:
    Loop (some loop body)
    Set 0 (offset 0) <- this is redundant!
    Set 0 (offset -1)
If a loop terminates, the current cell is 0, so we know this set is redundant. However, after reordering, that set does not immediately follow the loop:
    Loop (some loop body)
    Set 0 (offset -1)
    Set 0 (offset 0) <- this is still redundant!
bfc uses next mutation analysis to find this redundant set and remove it, giving us:
    Loop (some loop body)
    Set 0 (offset -1)
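A minimal Python model of this analysis, assuming a hypothetical tuple IR of ('inc', amount, offset), ('set', value, offset) and ('loop', body) instructions (bfc’s real analysis is more general than this sketch):

```python
def next_mutation(instrs, start, offset):
    """Return the index of the next instruction after `start` that
    mutates the cell at `offset`, or None if we can't tell."""
    for i in range(start + 1, len(instrs)):
        instr = instrs[i]
        if instr[0] in ("inc", "set") and instr[2] == offset:
            return i
        if instr[0] == "loop":
            # We can't cheaply reason about what a loop mutates.
            return None
    return None

def remove_redundant_sets(instrs):
    """After a loop terminates the current cell is 0, so the next
    mutation of offset 0 is redundant if it is also `Set 0`."""
    dead = set()
    for i, instr in enumerate(instrs):
        if instr[0] == "loop":
            j = next_mutation(instrs, i, 0)
            if j is not None and instrs[j] == ("set", 0, 0):
                dead.add(j)
    return [ins for i, ins in enumerate(instrs) if i not in dead]

ir = [("loop", ["..."]), ("set", 0, -1), ("set", 0, 0)]
print(remove_redundant_sets(ir))
# → [('loop', ['...']), ('set', 0, -1)]
```

Note how the intervening Set at offset -1 doesn’t block the removal: it mutates a different cell, so the search simply skips past it.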
This is a generalisation of the techniques used in v1.0.0, and it applies in other situations too. For example, consider the program:
    [loop X]>.<[loop Y]
Loop Y is dead here, and bfc is able to remove it.
Partial Loop Execution
bfc was already able to execute BF code at compile time, producing big speedups in many popular BF programs.
However, v1.0.0 required loops to be completely executed at compile time. This simplified the implementation, but many larger BF programs have a big outer loop. These programs did not benefit from bfc’s compile time execution.
For example, given the IR:
    A
    B
    Loop:
        C
        D
        Loop:
            E
        F
If we couldn’t execute the whole outer loop at compile time, we would only be able to remove A and B from the final executable, giving us:
    Loop:
        C
        D
        Loop:
            E
        F
Now in v1.2.0, we can execute up to an arbitrary position in the code. As in v1.0.0, bfc executes until it reaches a , (read) instruction, reaches a time limit, or the program terminates.
If we only manage to partially execute a loop, we split the basic block, and runtime execution begins where compile time execution finished.
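The bounded compile-time execution can be sketched as a small Python model. The flat tuple IR here — ('inc', amount, offset), ('set', value, offset), ('ptr', n), ('read',) — is a simplifying assumption; bfc’s real implementation also handles loops and cell wrapping:

```python
def execute_at_compile_time(instrs, max_steps=10_000_000):
    """Speculatively run the program at compile time.

    Returns (cells, ptr, resume_at): the cell state, the pointer, and
    the index where runtime execution must take over."""
    cells = [0] * 30000
    ptr = 0
    for i, instr in enumerate(instrs):
        # Stop at a read (input is unknown at compile time) or when
        # the step budget runs out.
        if i >= max_steps or instr[0] == "read":
            return cells, ptr, i
        if instr[0] == "inc":
            cells[ptr + instr[2]] += instr[1]
        elif instr[0] == "set":
            cells[ptr + instr[2]] = instr[1]
        elif instr[0] == "ptr":
            ptr += instr[1]
    # The whole program ran at compile time.
    return cells, ptr, len(instrs)

ir = [("inc", 1, 0), ("ptr", 1), ("read",), ("inc", 1, 0)]
cells, ptr, resume_at = execute_at_compile_time(ir)
print(resume_at)  # → 2: runtime execution resumes at the read
```

Everything before resume_at is baked into the executable’s initial cell state; everything from resume_at onwards is compiled normally.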
In our example above, suppose we reach the time limit just after executing E. Our final executable looks like this:
    GOTO start
    Loop:
        C
        D
        Loop:
            E
        start:
        F
For v1.2.0 I’ve measured bfc using Mats Linander’s benchmark suite.
v1.2.0 has improved performance in every case. If we look at normalised performance, we can see which benchmarks have improved the most relative to v1.0.0:
There are a number of lessons I’ve learnt from these additional optimisation techniques.
Firstly, optimisation is hard. Static analysis of programs is easy to screw up even with a good test suite. It took me several hours to get the last (known) bug out of the next mutation analysis.
Secondly, optimisation is never done. There are always further improvements possible. The challenge is finding which optimisations are sufficiently general that they apply to ‘real world’ code. It’s not even clear what the upper limit on performance is. Mats Linander’s project still beats bfc in many cases!
Thirdly, it’s not enough to rely on an optimising compiler backend. LLVM has many powerful optimisations, and documented tips for using them effectively, but this is no substitute for exploiting the semantics of the target language in the frontend.
Finally, we have completed our challenge! bfc v1.2.0 produces executables that are, on average, 28% faster than v1.0.0. This totally ridiculous project is available in all its glory on GitHub.