When Intel Corp announced the Pentium, one question that taxed the company was the extent to which applications needed to be recompiled to get the best out of the chip. It wasn’t that recompilation was needed for the application to run; it was simply that to get the last ounce of performance out of the processor, the compiler had to be aware of some of the new features that the enhanced chip offered. Intel acknowledged that some applications could benefit substantially from being optimised for the Pentium – on average they could run 30% faster. But surely optimising for the Pentium meant de-optimising for the 80386 and 80486 processors? At this point, the Intel public relations line tends to be: “Oh no – optimise for the Pentium and you will also optimise for the 80486.” At which point the sceptic gets a little more sceptical. To be fair, Intel is probably right – it is possible to include optimisations for a chip with in-built pipelining and parallel execution without damaging the code’s speed of execution on a simpler processor. Today the same questions can be asked of the PowerPC family of chips. If I optimise my code specifically for the PowerPC 604, how much extra speed will I be able to squeeze out of the processor? If I optimise for the 601, will I get substandard performance if the code runs on a 603 or a 604, or vice versa? A number of compiler writers will point out that at this stage the question is a little premature.
“Here we are with a family of chips that goes blindingly fast and you want to worry about squeezing out the last drop of performance,” they cry, quite rightly. However, we’ll ask it anyway, since it is a question that will become important to the software community in the next year or so. As with so many questions, the answer is “it depends”, according to Mike Phillip, who manages Motorola Inc’s RISC compiler tools group. It depends on the kind of application, on the design of the target computer system and on the compiler. Given all those imponderables, however, he says, full optimisation for the 604 might result in a 10% increase in performance. The question then becomes: how much will optimising for one processor slug performance on the others? Again, says Phillip, there is no simple answer, but generally compiling for a chip with a large degree of internal parallelism (such as the 604) will not hurt a more lowly processor too badly. One way to get a good approximation of the truth would be to run optimised SPECmark code on one processor and then on the other. Unfortunately, no-one has done this yet, partly because all those SPECint and SPECfp figures that you have seen for the 604 seem to have been generated by simulation rather than by compiling actual code. It is generally accepted that there will be a performance hit of some kind, but nobody has figures. PowerPC News is keen to hear from anyone who has run comparative tests. The degree of parallelism is just one way in which the members of the PowerPC family differ. Apart from the obvious multiple integer units, the 604 also has a re-engineered floating-point unit. The 604’s is a single-pass double-precision unit, meaning that both single- and double-precision operations zip through the chip in one pass with a latency of three cycles. Essentially, it has two multiply units for double-precision operands.
The 601 and the 603, by contrast, require double-precision operations to travel the pipeline twice, giving a four-cycle latency and a two-cycle throughput. Couple that with the fact that the 604 now has two reservation-station queueing places in front of the floating-point unit, and a compiler writer has one or two things to consider when getting the best out of the processor.
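To make the latency question concrete, here is a hypothetical C sketch (ours, not from Motorola) of the sort of transformation a scheduling compiler, or a determined programmer, might apply: splitting a dot product into independent accumulators so that successive double-precision operations do not have to wait on one another’s results.

```c
/* Hypothetical illustration: hiding floating-point latency with
   independent dependency chains. On a 601/603 (four-cycle latency,
   two-cycle throughput) two chains keep the pipeline busy; the same
   code also schedules well on a 604 (three-cycle latency). */

double dot_naive(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];       /* each add waits on the previous one */
    return sum;
}

double dot_unrolled(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0;    /* two independent dependency chains */
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                    /* odd leftover element */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```

The two versions compute the same result; the point is simply that the second gives the processor independent work to overlap, which pays most on the deeper pipelines.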
By Chris Rose
Nearly as important for scientific applications is the organisation of the cache on the target machine. Different processors have their caches arranged in different ways. The 604, for example, has separate on-board data and instruction caches, while the 601 unifies them, but that will not make too much difference to the compiler, Phillip says. More important is the size and organisation of the Level 2 off-chip cache. Since commercial application developers will have no idea what kind of cache their customers’ machines will have, caching will most probably be ignored. In any case, says Phillip, it will be the scientific, memory-bound applications that notice the difference. Last but not least, there are the differences in instruction set between the various processors. The most trivial example of this is the extra instructions that were retained from the old Power architecture in the PowerPC 601. The fact is that the 601 includes many instructions that are not strictly part of the PowerPC set, but are there to provide a bridge between the two architectures. The advantage is that software from an old RS/6000 will usually run fine, without recompilation, on the 601-based RS/6000s. Theoretically, a piece of code aggressively optimised for the 601 might take some of these bridging instructions, causing all kinds of nasties when it runs on a 603 or 604, as the processor could trap the illegal instructions. This, admittedly, is a daft example, although there is probably someone out there trying to make a 601 application go faster using these methods. It is worth noting, however, that both the 603 and 604 processors also have additional instructions in them, which are not included in the 601. These are the so-called graphics instructions:

stfiwx – Store Floating-Point as Integer Word
fres – Floating Reciprocal Estimate Single
frsqrte – Floating Reciprocal Square Root Estimate
fsel – Floating-Point Select

Examples of how these instructions can be used in graphical applications are given in Appendix E of the PowerPC documentation; they are basically quick ’n’ dirty ways of carrying out floating-point operations where speed is more important than accuracy. Useful as they are, they are not supported by the 601 processor and will generate an error there. So can they be safely used at all?
Currently, the wisdom from the IBM Corp and Motorola compiler divisions is that these kinds of instructions will not be generated by everyday compilers; instead they will appear in specialised graphics libraries.
Phillip suggests that people may adopt a method of using dynamically linkable libraries, which can be swapped in at run-time depending on the processor in use. If the instructions are restricted to specialist graphics libraries, there should not be too many problems. However, other informed IBMers suggest that these instructions will become more pervasive over time as the focus switches to floating-point performance in general applications. So what is the answer? Motorola’s Phillip suggests one solution may be smart installers. In Motorola’s compilers, as in many others, switches abound. Developers can compile for particular targets, and for particular chips. The compilers also allow for the production of multiple sets of object code. So an installation CD-ROM could hold multiple copies of (parts of?) the application, together with a smart installer that could read the system configuration, processor type and so forth, and copy the most suitable code across. It is an evolution of the fat binary idea. Will it catch on? We don’t know, because until definitive SPECmarks or other benchmarks are run and published, it is not certain whether there is even an issue. Watch out for the launch of the first 604-based Macintoshes, however, as that is when we will see the subject appear again.
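The swap-in-the-right-code idea can be sketched in a few lines of C. Everything here is hypothetical – the function names and the string-based processor check are our own stand-ins for whatever detection a real installer or loader would perform – but it shows the shape of the scheme: several builds of one routine, with the choice made once the chip is known.

```c
#include <string.h>

/* Hypothetical run-time dispatch: several builds of the same routine,
   selected once the processor is identified. A smart installer would
   do the equivalent at install time by copying the right object code. */

typedef double (*transform_fn)(double);

/* Baseline build: uses only instructions common to 601/603/604. */
double transform_generic(double x) { return x * 2.0; }

/* 604-tuned build: in reality this is where 603/604-only instructions
   (fsel, fres and friends) and aggressive scheduling would live. */
double transform_604(double x) { return x * 2.0; }

/* Pick an implementation by processor name (detection is assumed). */
transform_fn select_transform(const char *cpu)
{
    if (strcmp(cpu, "604") == 0 || strcmp(cpu, "603") == 0)
        return transform_604;     /* safe: these chips have the extras */
    return transform_generic;     /* never traps on a 601 */
}
```

The two builds must agree on results; only the speed differs – which is what makes the fat-binary approach transparent to the user.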
Chris Rose edits the new PowerPC News publication, from which this item is taken. PowerPC News is published fortnightly and is presently available only on the Internet.