AltiVec for Real-Time Multiprocessor Implementations

Steve Paavola
Marketing Manager, SKY Computers, Inc.
www.sky.com

AltiVec Overview

AltiVec Technology is an exciting new capability that Motorola has made available with the MPC7400 PowerPC microprocessor. The MPC7400 outperforms DSP chips on DSP algorithms, yet it is much easier to program than DSP chips, especially for large, complex algorithms. The MPC7400 adds SIMD (Single Instruction Multiple Data) capability to enhance its performance in signal processing, image processing, and graphics.

PowerPC with AltiVec

AltiVec Technology adds 32 128-bit "vector" registers, 162 new instructions, and 4 data stream pre-fetch engines to the PowerPC architecture. It performs a 128-bit arithmetic operation every processor clock. Each 128-bit register can contain 4 32-bit floating-point numbers, 4 32-bit integers, 8 16-bit integers, or 16 8-bit integers. Since the full 128-bit computation is performed in a single clock, as many as 4 floating-point adds or 16 8-bit adds can be performed every clock. AltiVec provides multiply-accumulate functions, so 2 computations can be performed per data item every clock, e.g. 8 FLOPS (4 adds and 4 multiplies), or as many as 32 OPS per clock.
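The per-clock figures above follow from simple division. The sketch below is only an illustration of that arithmetic, using the article's way of counting a multiply-accumulate as two operations per element; the function names are made up for this example.

```c
#include <stdio.h>

/* Width of one AltiVec vector register, in bits (from the text above). */
enum { VECTOR_BITS = 128 };

/* Number of elements of a given width that fit in one vector register. */
static int lanes_per_register(int element_bits)
{
    return VECTOR_BITS / element_bits;
}

/* Peak operations per clock when every lane performs a multiply-accumulate,
   counted as two operations (one multiply, one add) per lane. */
static int peak_ops_per_clock(int element_bits)
{
    return 2 * lanes_per_register(element_bits);
}

/* Print the lane count and peak rate for each element width in the text. */
static void print_lane_table(void)
{
    static const int widths[] = { 32, 16, 8 };
    for (int i = 0; i < 3; i++) {
        printf("%2d-bit elements: %2d lanes, %2d ops/clock peak\n",
               widths[i], lanes_per_register(widths[i]),
               peak_ops_per_clock(widths[i]));
    }
}
```

Calling print_lane_table() lists 4, 8, and 16 lanes for 32-, 16-, and 8-bit elements, with peak rates of 8, 16, and 32 operations per clock.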

The MPC7400 has 32 KB of instruction cache and 32 KB of data cache on-chip. It includes a backside cache interface similar to that of the PowerPC 750 product family, but with a capacity of up to 2 MB - twice what is supported by the 750. It supports the current 60x bus for memory and I/O, and also supports a new native bus interface.

The MPC7400 can be a simple drop-in replacement in existing PowerPC 750 designs. However, taking full advantage of the new capabilities in a multiprocessor board or system requires a design built around the microprocessor's capabilities. The easiest new feature for multicomputing vendors to implement is the expanded backside cache. By simply doubling the number of SRAM chips and connecting the additional address line, vendors can provide twice as much fast memory to hold frequently used data and code. This is especially important to users in signal processing, where the real-time core, coefficients, and filters of the application are stored in the backside cache.

Implementing the native bus requires more work. In order to understand the value of the native bus, we must first understand the 60x bus. The 60x bus interface in previous PowerPC products has a MESI cache coherency protocol, independent address and data interfaces, and dead cycles sprinkled in. The PowerPC 604e microprocessor provided the best implementation, with 2 pipeline stages for cached data accesses and the ability to stream memory reads with no dead cycles. Taking advantage of this capability for more than 2 sequential cache line loads requires a clever memory controller design because of the delays inside the controller. All other PowerPC processors require some number of dead cycles between cache lines, which considerably slows down or blocks processing.

The native bus on the MPC7400 is similar to the 60x bus - it has a MERSI cache coherency protocol, independent address and data interfaces, and the ability to handle wait states if needed. For data intensive applications, the MPC7400's native bus interface has additional pipeline stages, allowing multicomputing vendors to greatly simplify the design of the high-performance memory controller, ultimately giving the application developer dramatically higher sustained memory bandwidth. Data can stream for both reads and writes, and cache lines can be processed out of order. Each address request comes with a tag, and the memory controller can specify which tag it is completing. The memory controller can optimize the sequence of completions rather than having to process the requests in FIFO order.
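To see why tagged completion helps, consider a toy model of a memory controller. This is a sketch under stated assumptions, not a description of the actual bus protocol: the Request structure, the ready_cycle field, the limit of 8 outstanding requests, and the figure of 4 data beats per cache line (a 32-byte line over a 64-bit bus) are all assumptions made for illustration.

```c
#include <stddef.h>

enum { BEATS_PER_LINE = 4,   /* assumed: 32-byte line over a 64-bit bus */
       MAX_REQUESTS   = 8 }; /* hypothetical limit for this sketch */

typedef struct {
    int tag;          /* tag issued with the address request */
    int ready_cycle;  /* hypothetical cycle at which memory has the data */
} Request;

/* In-order controller: data phases must follow the request order, so a
   slow request at the head of the queue stalls everything behind it. */
static int finish_cycle_fifo(const Request *req, size_t n)
{
    int t = 0;
    for (size_t i = 0; i < n; i++) {
        if (req[i].ready_cycle > t)
            t = req[i].ready_cycle;      /* bus idles until data is ready */
        t += BEATS_PER_LINE;
    }
    return t;
}

/* Tagged controller: at each step, complete the pending request whose data
   is ready earliest and report its tag. Returns the cycle at which the last
   transfer ends, or -1 if n exceeds MAX_REQUESTS. */
static int finish_cycle_tagged(const Request *req, size_t n, int *tag_order)
{
    int done[MAX_REQUESTS] = { 0 };
    int t = 0;
    if (n > MAX_REQUESTS)
        return -1;
    for (size_t k = 0; k < n; k++) {
        size_t best = n;                 /* n means "none chosen yet" */
        for (size_t i = 0; i < n; i++) {
            if (!done[i] &&
                (best == n || req[i].ready_cycle < req[best].ready_cycle))
                best = i;
        }
        done[best] = 1;
        if (req[best].ready_cycle > t)
            t = req[best].ready_cycle;
        t += BEATS_PER_LINE;
        tag_order[k] = req[best].tag;    /* completion names its tag */
    }
    return t;
}
```

With four requests whose data becomes ready at cycles 10, 2, 3, and 4, the in-order model finishes at cycle 26, while the tagged model completes tags 1, 2, 3, 0 and finishes at cycle 18.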

MPC7400 as DSP Processor

The PowerPC is already being used successfully in a wide range of DSP applications. Often it performs better than, and is cost competitive with, DSP implementations. With the addition of AltiVec, the MPC7400 is very attractive for DSP. At 333 MHz, the MPC7400 is rated at 2.6 GFLOPS. For 16-bit arithmetic, the MPC7400 is rated at 4.8 GOPS. These are impressive numbers - better than the announced performance of any DSP chip.

However, at the application level the questions are how close you can get to this peak performance level, and how much software effort is required to get that performance. The answer depends on the application, the specific processor implementation, and the software tools available to support it.

There are several features that affect the performance of a processor including operation speed, memory speed and characteristics, and I/O speeds. In general, RISC processors take a somewhat different architectural approach toward optimizing these features than the DSP processors take.

The first feature to maximize is the number of operations executed per second. There are two ways to increase this - increase the clock speed, or increase the number of operations executed every clock. Increasing the clock speed is the obvious approach, and general-purpose RISC processors lead here.

In addition to faster clock speed, the MPC7400 implements both a super-scalar architecture and SIMD to increase the number of operations executed per clock.

The super-scalar architecture can start more than one instruction every clock. This provides the processor with the ability to look ahead some number of instructions to find work to do based on the number of functional units in the processor and the instruction mix of the application. As a result, the processor is able to perform some instruction optimizations at run-time. The benefit of using a super-scalar architecture is that the RISC architecture is maintained, allowing compilers to generate efficient code. It is still possible to hand optimize some classes of functions to enhance their performance, but the compilers do a good job on general code.

SIMD is another single instruction optimization. As implemented in the AltiVec section of the MPC7400, a single instruction is applied to multiple data items simultaneously. This capability is starting to appear in some DSP implementations as well. DSPs usually increase the number of operations per clock by making the instruction word more complex, putting multiple functions in the instruction word. Some DSP chips are implementing VLIW (Very Long Instruction Word) which emphasizes how many operations they are putting in each word. This approach reduces the complexity of the processor design, but increases the software challenge. It is very difficult for a compiler to generate efficient code for these architectures. As a result users end up writing more code in assembly language.
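The SIMD idea can be modeled in portable C. The sketch below is not AltiVec code: the Vec8u16 type and vec8_add function are hypothetical stand-ins for a 128-bit register holding 8 16-bit integers and for a single vector add instruction. On real hardware the inner loop is one instruction; here it is spelled out so the per-lane behavior is visible.

```c
#include <stdint.h>

enum { LANES_16 = 8 };  /* 128 bits / 16 bits per element */

/* Stand-in for one 128-bit vector register holding 8 16-bit integers. */
typedef struct {
    uint16_t lane[LANES_16];
} Vec8u16;

/* Models a single SIMD add: the same operation is applied to all 8 lanes.
   Each lane wraps modulo 65536 independently; no carry crosses lanes. */
static Vec8u16 vec8_add(Vec8u16 a, Vec8u16 b)
{
    Vec8u16 r;
    for (int i = 0; i < LANES_16; i++)
        r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    return r;
}
```

A scalar loop over the same data would issue 8 separate adds; the SIMD version issues one instruction for the whole group.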

Memory Optimization

System performance is more than just instruction execution speed. Data bandwidth is often more important - if you can't get the data to the computational units, it doesn't matter how fast they are. One thing that microprocessor vendors do to optimize performance is to put fast SRAM close to the processor to speed up access to data.

The RISC processors implement cache memory, which is fast SRAM in the chip. The MPC7400 has 32 KB of instruction cache and 32 KB of data cache, for a total of 512 Kbits. The MPC7400 also implements a backside L2 cache, which is additional SRAM chips interfaced to the MPC7400 through a dedicated bus. The MPC7400 will support up to 2 MB, or 16 Mbits, of backside L2 cache. These caches have cache tags in the processor. The nice thing about this architecture is that the processor will automatically load the cache with the appropriate data as the application needs it, making for simple software algorithms.

DSPs put a large directly addressable SRAM inside the chip. The DSP processor vendors tend to measure this memory in Mbits, with a large SRAM coming in at 1 or 2 Mbits. Being directly addressable, there isn't the potential latency of dealing with cache tags. But, the user must manage this address space carefully because the processor won't automatically fill the SRAM with the data when the application needs it, as is done with a cache. As a result, the programmer must explicitly move data and potentially code in and out of the on-chip SRAM, increasing the complexity of larger applications.
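The difference in programming effort can be sketched in C. Both functions below scale an array in place. The first relies on a cache to bring data in on demand; the second stages data through a small buffer by hand, the way a program must when the fast memory is directly addressed. The on_chip buffer, its size, and the use of memcpy in place of a DMA engine are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

enum { SRAM_WORDS = 256 };          /* hypothetical on-chip buffer size */
static float on_chip[SRAM_WORDS];   /* stands in for directly addressed SRAM */

/* Cache-based style: the code simply walks the array and the hardware
   fetches each cache line when it is first touched. */
static void scale_cached(float *data, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= k;
}

/* Explicitly managed style: the program moves each block into fast memory,
   computes there, and moves the result back out. */
static void scale_staged(float *data, size_t n, float k)
{
    for (size_t base = 0; base < n; base += SRAM_WORDS) {
        size_t count = n - base;
        if (count > (size_t)SRAM_WORDS)
            count = SRAM_WORDS;
        memcpy(on_chip, data + base, count * sizeof(float));  /* copy in  */
        for (size_t i = 0; i < count; i++)
            on_chip[i] *= k;                                  /* compute  */
        memcpy(data + base, on_chip, count * sizeof(float));  /* copy out */
    }
}
```

Both produce the same result; the staged version carries the extra bookkeeping that grows with the size and complexity of the application.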

For I/O and DRAM interfaces, general purpose processors have a single bus interface for DRAM and I/O. In the case of the MPC7400, another chip must be employed to bridge between this bus and any DRAM, FLASH and I/O busses. This bus is fast at up to 100 MHz, and wide at 64 bits. DSP chips tend to implement slower and narrower memory busses. Commonly they provide only a 32 bit interface, and a fast bus has been 40 MHz. Recent processors now have speeds up to 75 MHz or so, and 64 bit busses are starting to appear. However, they are still not as fast as the MPC7400. Features the DSPs usually provide that aren't available on the MPC7400 are serial ports and link ports. These interfaces provide fast serial I/O and alternate interconnects, whereas the only MPC7400 interconnect is through its processor bus.
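The bus comparison can be put in numbers. The helper below is a rough sketch that assumes one transfer per bus clock with no wait states or dead cycles, so it gives an upper bound rather than a sustained rate; the function name is made up for this example.

```c
/* Peak bus bandwidth in millions of bytes per second, assuming one
   transfer of width_bits on every bus clock (an idealized upper bound). */
static unsigned peak_bus_mbytes_per_sec(unsigned clock_mhz, unsigned width_bits)
{
    return clock_mhz * (width_bits / 8u);
}
```

Under that assumption, a 100 MHz, 64 bit bus peaks at 800 MB/sec, a 40 MHz, 32 bit bus at 160 MB/sec, and a 75 MHz, 32 bit bus at 300 MB/sec.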

Typical Implementation

The new Merlin board from SKY Computers is a typical implementation of the MPC7400. It includes 4 MPC7400 processors, and some SRAM chips for the backside cache, a bridge chip, and some DRAM memory. Bridge chips compatible with the MPC7400 are available from several vendors, with the I/O bus being PCI. Other configurations are possible, including fast interconnects such as the 320 MB/sec ANSI standard SKYchannel used on Merlin.

The software is as important as the hardware in a successful AltiVec system. C compilers are available with enhanced syntax to support AltiVec. Although the compilers don't generate AltiVec instructions from scalar code, AltiVec "vector" functions can be written into the C application. As a result, the programmer doesn't have to write in assembly language, and some of the optimizations provided by the compiler, like register allocation and optimization, can be applied to the AltiVec code.
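As an illustration of what this enhanced syntax looks like, here is a minimal sketch of a multiply-add over float arrays. It uses the vector float type and the vec_ld, vec_madd, and vec_st operations from the AltiVec C programming interface when the compiler defines __ALTIVEC__, and falls back to a plain scalar loop otherwise. The function name multiply_add is made up for this example, and the vector path assumes 16-byte aligned pointers and a length that is a multiple of 4.

```c
#include <stddef.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* y[i] = a[i] * x[i] + y[i] for i in [0, n).
   Vector path assumptions: a, x, and y are 16-byte aligned and n is a
   multiple of 4. The scalar path has no such restrictions. */
static void multiply_add(const float *a, const float *x, float *y, size_t n)
{
#ifdef __ALTIVEC__
    for (size_t i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);      /* load 4 floats */
        vector float vx = vec_ld(0, x + i);
        vector float vy = vec_ld(0, y + i);
        vec_st(vec_madd(va, vx, vy), 0, y + i);  /* 4 multiply-adds, store */
    }
#else
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * x[i] + y[i];
#endif
}
```

The programmer writes the vector operations explicitly, but register allocation and instruction scheduling are left to the compiler.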

In order to write efficient AltiVec code, the user program must be carefully written. Data vectors must be aligned on 128 bit boundaries, or significant additional code must be written to determine the actual alignment and arrange the data within the AltiVec registers so that the computations can be performed. AltiVec provides some nice features to align the data with minimal overhead, but it is additional code that must be written.
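One simple way to cope with unknown alignment is to test the pointer and handle the leading elements with scalar code until a 128 bit boundary is reached. This is a different and simpler technique than rearranging data inside the vector registers, and the helper names below are hypothetical. The sketch assumes the data is an array of 4-byte floats that is at least 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

enum { VECTOR_BYTES = 16 };  /* 128 bits */

/* Nonzero if p sits on a 16-byte (128 bit) boundary. */
static int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % VECTOR_BYTES) == 0;
}

/* Number of leading float elements to process with scalar code before
   the pointer reaches a 16-byte boundary. Assumes p is 4-byte aligned. */
static size_t scalar_prefix_elems(const float *p)
{
    size_t misalign = (size_t)((uintptr_t)p % VECTOR_BYTES);
    if (misalign == 0)
        return 0;
    return (VECTOR_BYTES - misalign) / sizeof(float);
}
```

For a pointer one float past an aligned address, the prefix is 3 elements; after those, the remaining data can be processed with aligned vector loads.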

Another issue that must be addressed is the vector length. As long as the data is a multiple of 128 bits or 16 bytes long, there isn't any problem. However, if this isn't the case, some additional code must be written. Again, the AltiVec designers made this simple.
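A common way to handle lengths that are not a multiple of 4 floats is to run the vector loop over the full groups and finish the leftover elements with scalar code. The portable sketch below models the 4 vector lanes with a small array; it is an illustration of the pattern, not AltiVec code, and the function name is made up for this example.

```c
#include <stddef.h>

/* Sum n floats. Full groups of 4 go through the "vector" loop, which keeps
   4 partial sums (one per lane); any leftover elements are added in a
   scalar tail loop. */
static float sum_floats(const float *x, size_t n)
{
    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t full = n - (n % 4);            /* elements covered by full groups */

    for (size_t i = 0; i < full; i += 4)  /* one vector add per iteration */
        for (int k = 0; k < 4; k++)
            lane[k] += x[i + k];

    float total = lane[0] + lane[1] + lane[2] + lane[3];

    for (size_t i = full; i < n; i++)     /* scalar tail: 0 to 3 elements */
        total += x[i];

    return total;
}
```

For 10 elements, 8 are handled in two full groups and the last 2 by the tail loop.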

Taking advantage of AltiVec at this level is still easier than writing in assembly language for DSP chips. The compiler manages some of the optimization and bookkeeping for the programmer. However, the downside is that the resulting code is not portable. Moving the application to the NGP (Next Great Processor) will require re-coding if NGP doesn't have AltiVec functions.

SKY Computers has addressed this problem by porting its Standard Math Library (SML) to AltiVec. SKY has supplied SML on several microprocessor platforms to solve software portability problems. The architecture-independent SML uses SKY's advanced compiler technology to maximize memory and cache bandwidth utilization. As a result, the programmer doesn't have to write AltiVec functions into the application, alignment can be ignored, and the compiler and SML functions will execute correctly with any vector length. Applications written 10 years ago for SKY's processors will compile and execute on the MPC7400, and will execute faster because they use AltiVec.

MPC7400: The Supercomputer in a Chip

The new MPC7400 with AltiVec Technology will improve raw compute performance by a factor of 4 over the fastest PowerPC microprocessors available today. The key to harnessing the power of the MPC7400 for embedded real-time applications is the efficient hardware design of the multicomputing board or system. Issues such as full utilization of the backside cache will make for designs that can take advantage of the higher processing performance.

Software portability will be a key issue for legacy code and code now in development. SKY provides tools for customers to easily move their applications up to the new performance levels. Without such tools, the expense of re-writing code will become a stumbling block to the achievement of the impressive new levels of performance we can now provide.

The MPC7400 is faster than yesterday's supercomputers. But for real-world, real-time applications, we must make that performance easily accessible to programmers and application developers.