Motorola's AltiVec™ Technology Simplifies the Design of
High Performance Embedded Applications

Sam Fuller
System Architecture and Product Planning Manager
Networking and Computing Systems Group
Motorola Semiconductor Products Sector

 

Introduction:

Since the birth of the first general purpose microprocessors in the late 1970s, demand for increased processing performance has driven the semiconductor industry to develop better, faster, less expensive devices. To meet this demand, semiconductor suppliers rely on advances in manufacturing process technology as well as improvements to microprocessor architecture and design.

By providing ever more performance at a competitive price, the microprocessor has proliferated into numerous applications beyond the personal computer that initially drove the technology in the 1980s. Today, new applications such as Voice over IP, multi-channel modems, speech processing, and image and video processing have formed whole new markets that have assumed the role of technology driver. For example, with the growth of the Internet, the telecommunications infrastructure is moving from its voice-oriented, circuit-switched roots to a data-oriented, packet-switched network based on Internet Protocol (IP). This conversion has created a tremendous opportunity for microprocessors and digital signal processors to be used as controllers, switch managers, database managers, and protocol converters in this new digital communications infrastructure. Applications such as these may require performance increases of several orders of magnitude to handle the enormous computational and bandwidth demands placed on the system.

Two standard approaches are used to boost microprocessor performance: leading-edge manufacturing processes and overall improvements to the microprocessor architecture and design. As the capabilities of manufacturing process technology continue to improve, the number of transistors that can economically be placed on a silicon die increases. These additional transistors allow microprocessor architects to introduce new, more sophisticated functionality dedicated to solving old problems better and, perhaps more importantly, to solving whole new classes of problems.

Motorola's AltiVec Technology was developed to satisfy many application demands for increased performance. AltiVec Technology expands Motorola's PowerPC™ Architecture through the addition of a 128-bit vector execution unit, which operates concurrently with the existing integer and floating point units. This new engine provides for highly parallel operations, allowing simultaneous execution of up to 16 operations in a single clock cycle.

This article addresses how Motorola's new AltiVec Technology takes advantage of both standard approaches to increasing performance, leading-edge manufacturing processes and architecture/design improvements, while addressing the inherent limitations of each. Examples in this article refer specifically to the first device to implement AltiVec Technology, Motorola's PowerPC G4 microprocessor.

A Short History

The first mainstream general purpose 16-bit microprocessors were the Motorola 68000 and the Intel 8086. These processors were introduced in the late 1970s and did not include support for floating point or virtual memory. Over the next 10 years, subsequent incarnations of these processors grew to include full 32-bit architectures with virtual memory support, integrated caches and floating point co-processors.

While all of these functions were available in the mainframes and minicomputers of the day, the economics of integration prohibited placing them on the microprocessor itself. The advent of sub-micron technology in the late 1980s and early 1990s was required before floating point units, caches and MMUs could be routinely included on high-volume, low-cost microprocessors. This represented the first phase of microprocessor integration beyond the creation of the microprocessor itself.

During the second half of the 1990s, processing power has continued to advance according to Moore's law, doubling every 18 months. Clock frequencies have reached the 300 to 400 MHz range, and a new class of applications for microprocessors has appeared. This new class of application deals with rich natural data types such as speech, video, high-resolution still images, 3D graphics and virtual reality. The computational requirements of these new data types are several orders of magnitude beyond those required to process the text and numerical data types common in the most popular applications of the 1980s and early 1990s, namely word processing and spreadsheets. Desktop computers and embedded devices are now much more likely to be dealing with motion video editing and presentation, video conferencing, and 3D gaming, in addition to IP-based voice telephony and graphics-rich World Wide Web browsing.

The network infrastructure must also process these new data types. The arrival of digital cellular telephony has created new performance requirements for advanced DSP devices to deal with the coding and decoding of voice traffic, both within the digital cellular network and between the digital cellular network and the existing analog telephone infrastructure. The same phenomenon is observed within the emerging IP-based telephony infrastructure. Additionally, network and telecommunications service providers are creating new classes of telephone-based automated speech attendants that not only understand spoken requests but also respond in more natural synthesised speech.

All of these applications require much higher performance, sometimes two to three orders of magnitude higher. At the same time, these new applications also require much lower cost and power dissipation to meet the price and industrial design targets for consumer and embedded devices.

The advance in process technology from a 1 µm minimum feature size in 1990 to the 0.1 µm minimum feature size expected to be available in 2005 will increase the average number of transistors available on a 100 mm^2 die to more than 100 million, roughly a 100-fold increase from 1990. At the same time, the speed of the transistors will have improved from approximately 20 MHz to in excess of 2 GHz (if current trends continue). This is another 100-fold increase, for a total improvement in computational capability of 10,000X in 15 years!

Unfortunately, this 10,000X improvement in device capability does not necessarily mean a 10,000X improvement in performance and end-use capability. Significant bottlenecks stand in the way of a microprocessor's ability to translate the capabilities of the technology into performance actually delivered to the application. Two of the main bottlenecks that must be addressed are the latency to memory and the lack of available instruction level parallelism.

AltiVec Technology Handles Memory Latency

The challenge facing processor architects is to make productive use of the continually increasing computational capabilities that manufacturing process technology is providing. Two significant barriers stand in the way of making full use of the technology's capabilities. The first is memory latency. Memory latency has been essentially flat for the last 15 years and does not look likely to improve much in the future. New DRAM technologies like DDR SDRAM and Rambus DRDRAM improve the bandwidth of the memory but do little to improve its latency.

On-chip SRAM arrays in the form of caches and even local memories can help to reduce the impact of long-latency memory operations to off-chip SRAM and DRAM devices. The Motorola G4 processor contains 64 Kbytes of on-chip cache memory divided evenly between separate instruction and data caches. The caches are 8-way set associative, which increases their effective capacity significantly beyond that of a direct-mapped cache of similar size. The Motorola G4 processor also contains tag RAM and support logic for an external L2 cache of up to 2 Mbytes in size organized as 2-way set associative.

In addition to the cache support, the G4 processor offers a sophisticated data prefetch mechanism to reduce overall memory latency as seen by the processor. Because AltiVec technology is designed to work with streaming data types, software can be much more intelligent about the future data requirements of a task. AltiVec technology provides a powerful mechanism to bring data into the processor in advance of its actual use. This mechanism is referred to as data stream touching and is accessed through the Data Stream Touch (DST) instruction.

DST differs significantly from the cache pre-load operations employed by other microprocessor architectures (including PowerPC's block touch operations), which traditionally bring one cache line of data speculatively into the processor for each pre-load instruction. In contrast, the DST instruction refers to a whole block of data described by a starting address, a block size (1 to 32 16-byte vectors), a number of blocks to prefetch (1 to 256 blocks) and a signed stride in bytes (-32,768 to +32,768). DST effectively kicks off a DMA operation that brings data into the on-chip cache hierarchy asynchronously and independently of the processor's instruction execution. The G4 processor supports up to 4 independent DST streams.

To take advantage of this capability, a programmer or intelligent compiler identifies the data to be operated on (for example, an MPEG 8x8 pixel block residing in memory) and issues a DST operation ahead of the code that consumes the pixels. Because these operations are highly repetitive, a form of software pipelining can be employed whereby the next block is prefetched into the cache while the current block is being processed. The overhead of this approach is significantly less than what would be required using only the cache line prefetch facilities provided by the PowerPC architecture without AltiVec technology.
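The four stream parameters listed above (starting address, block size, block count and signed stride) determine which addresses a stream covers. The following is a minimal portable C sketch of that addressing only; it does not issue any prefetch. The type `stream_desc` and the functions `stream_block_addr` and `stream_total_bytes` are hypothetical names introduced for illustration and are not part of the AltiVec programming interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor mirroring the four DST parameters in the text. */
typedef struct {
    uintptr_t start;         /* address of the first block                 */
    unsigned  block_vectors; /* block size: 1..32 vectors of 16 bytes each */
    unsigned  block_count;   /* number of blocks: 1..256                   */
    int32_t   stride;        /* signed byte distance between block starts  */
} stream_desc;

/* Address of the first byte of block k (0-based) within the stream. */
static uintptr_t stream_block_addr(const stream_desc *s, unsigned k)
{
    /* Ranges as quoted in the article. */
    assert(s->block_vectors >= 1 && s->block_vectors <= 32);
    assert(s->block_count >= 1 && s->block_count <= 256);
    assert(s->stride >= -32768 && s->stride <= 32768);
    assert(k < s->block_count);

    /* Unsigned wraparound makes negative strides step backwards. */
    return s->start + (uintptr_t)((intptr_t)s->stride * (intptr_t)k);
}

/* Total number of bytes the stream describes. */
static size_t stream_total_bytes(const stream_desc *s)
{
    return (size_t)s->block_vectors * 16u * (size_t)s->block_count;
}
```

As an illustration, a 16-byte-wide, 8-row tile inside a frame with an assumed 720-byte row pitch would be described as `{ start, 1, 8, 720 }`: eight one-vector blocks, 128 bytes in total, one block per image row.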

The Motorola G4 primary and secondary caches, along with the data stream prefetch mechanism supplied by AltiVec technology, provide the tools developers need to ensure memory latency does not become a bottleneck to the G4's processing capabilities.

AltiVec Technology Manages the Instruction Set Dependency Bottleneck

The other barrier to effective use of expanding transistor budgets is the very limited parallelism inherent in current CISC and RISC instruction set architectures. A common approach for many processors, including PowerPC processors from Motorola, is to provide for super-scalar dispatch and parallel execution and completion of instructions. This is done to exploit instruction level parallelism and, in some cases, data level parallelism. While this approach has benefits for two- to three-issue machines, design complexity coupled with a lack of significant available instruction parallelism causes the benefits to break down quickly beyond three-issue super-scalar machines.

While there are real problems with the parallel execution of traditional scalar integer code, the emergence of a whole new class of multimedia and DSP algorithms has created significant opportunities for the microprocessor architect to exploit data parallelism, thus increasing overall microprocessor performance. Examples of these algorithms include discrete cosine transforms, convolutional encoders, and Viterbi decoders. These algorithms are widely used in modem, voice, speech and video processing. In the most interesting application areas these algorithms deal with 8- and 16-bit data types instead of the more traditional 32-bit data types. When they are implemented on a traditional RISC processor, much of the processor's 32-bit or 64-bit register files, data paths and ALUs are underutilized. Single instruction multiple data (SIMD) technology has emerged as a common technique to close this semantic gap between the demands of the algorithms and the capabilities of the hardware.

The Motorola G4 processor contains a significant SIMD expansion to the PowerPC architecture in the form of AltiVec technology. Motorola's AltiVec technology is a vector SIMD architecture based on 128-bit wide vectors providing full support for 8-, 16- and 32-bit data types. It is general-purpose in nature and applicable wherever there is data parallelism, across a wide range of applications. AltiVec technology is similar to the Intel MMX, Sun VIS, HP MAX and Alpha MVI extensions. However, AltiVec technology is significantly more powerful than these other architectural extensions, providing support for 3D geometry, high-end audio, speech recognition, DSP, text character processing and data mining tasks. Also, the permutation capability of AltiVec technology provides data reorganization capabilities far beyond those of any other existing general-purpose processor. This is accomplished through the AltiVec permute instruction, which provides for the arbitrary selection of up to 16 byte elements from a set of 32 bytes. This operation provides two very important functions, namely data reorganization and parallel table lookup. Permute provides the functionality of a full 32x16 byte-wise crossbar attached to the AltiVec register files.
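The selection performed by permute, 16 result bytes each chosen from 32 source bytes, can be modeled in scalar C. This is a sketch of the semantics only, not of how the hardware is built. The function name `permute16` is hypothetical, and the model treats the two 16-byte sources as one 32-entry table indexed by the low five bits of each control byte.

```c
#include <stdint.h>

/* Scalar model of a byte permute.
 * a supplies table entries 0..15 and b supplies entries 16..31.
 * Each control byte picks one entry for the matching output byte.
 * out must not overlap a, b or control. */
static void permute16(const uint8_t a[16], const uint8_t b[16],
                      const uint8_t control[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        /* Mask to five bits so the index always falls within 0..31. */
        unsigned sel = control[i] & 0x1Fu;
        out[i] = (sel < 16u) ? a[sel] : b[sel - 16u];
    }
}
```

With control bytes 15, 14, ..., 0 the model reverses the bytes of `a`, which is a data reorganization. With `a` and `b` holding a 32-entry table and the control bytes holding 16 independent indices, the same loop performs 16 table lookups at once.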

AltiVec technology provides 8 parallel full function ALUs for 16-bit data types that are common in multimedia and DSP applications.

AltiVec instructions are executed out of the same instruction stream as PowerPC scalar integer, floating point and branch instructions. All AltiVec instructions operate on fixed length vectors, each instruction performing the same operation on corresponding elements in the source vector operands.

AltiVec Technology has the following architectural characteristics:

  • Fixed vector length of 128 bits, comprising 16 8-bit elements, 8 16-bit elements, or 4 32-bit elements
  • Signed and unsigned 8-, 16-, and 32-bit integers, and IEEE single-precision floating point numbers
  • Saturation and modulo arithmetic
  • 32-register namespace
  • Vector register file architecturally separate from the floating-point and integer registers
  • No mode switching that would increase the overhead of using the instructions
  • 4-operand, non-destructive instructions (3 source, 1 result)
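The difference between the saturation and modulo arithmetic listed above shows up only when a result exceeds the range of its element type. The following scalar C sketch models both behaviors for eight signed 16-bit elements, one of the vector layouts in the list. The function names `add8_modulo` and `add8_saturate` are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a two's-complement value without
 * relying on implementation-defined narrowing conversions. */
static int16_t wrap16(uint16_t u)
{
    return (u >= 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* Modulo add: each result wraps around at 16 bits. */
static void add8_modulo(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = wrap16((uint16_t)((uint16_t)a[i] + (uint16_t)b[i]));
}

/* Saturating add: each result is clamped to the int16_t range. */
static void add8_saturate(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        out[i] = (int16_t)sum;
    }
}
```

For the element pair 30000 + 10000, the modulo form yields -25536 while the saturating form yields 32767. Clamping is usually the preferred behavior for audio samples and pixel values, where wraparound produces audible or visible artifacts.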

 

 

The accompanying diagram demonstrates how a pair of AltiVec instructions can produce an 8x8 dot product (two vectors of eight 16-bit elements each). The vector multiply sum (VMSUM) instruction performs eight parallel 16-bit multiplications, with the results summed into a target vector register as 32-bit integers. The subsequent vector sum across (VSUM) instruction combines the four separate 32-bit results into a single 32-bit result. Because the G4 processor is pipelined and can deliver a result per clock cycle, the actual dot product throughput can be quite high. Detailed work at Motorola shows that for 16-bit integer data types the G4 processor with AltiVec technology is capable of sustaining more than five multiply-accumulate (MAC) operations per clock cycle in the form of a traditional FIR filter.
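The two-step dot product described above can be sketched in scalar C. The function names are hypothetical, and the pairing of adjacent elements into each 32-bit partial sum is an assumption of this sketch, chosen to match the description of eight products reduced to four 32-bit results. The model also assumes the inputs are small enough that no 32-bit sum overflows.

```c
#include <stdint.h>

/* Step 1: eight 16-bit multiplies, with adjacent products added into
 * four 32-bit partial sums on top of an accumulator input. */
static void multiply_sum8(const int16_t a[8], const int16_t b[8],
                          const int32_t acc[4], int32_t out[4])
{
    for (int w = 0; w < 4; w++) {
        out[w] = acc[w]
               + (int32_t)a[2 * w]     * (int32_t)b[2 * w]
               + (int32_t)a[2 * w + 1] * (int32_t)b[2 * w + 1];
    }
}

/* Step 2: combine the four partial sums into one 32-bit result. */
static int32_t sum_across4(const int32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Dot product of two 8-element vectors using the two steps above. */
static int32_t dot8(const int16_t a[8], const int16_t b[8])
{
    const int32_t zero[4] = {0, 0, 0, 0};
    int32_t partial[4];

    multiply_sum8(a, b, zero, partial);
    return sum_across4(partial);
}
```

Passing the previous partial sums back in as `acc` instead of zeros lets a longer dot product, such as a FIR filter tap loop, defer the sum-across step until the end.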

For networking infrastructure applications, this level of parallel execution performance, coupled with the high clock frequency of PowerPC processors, means that a Motorola G4 processor has the computational bandwidth to implement a 30-channel E1 line using the G.729a codec, including echo cancellation, in a Voice over IP telephony application.

To facilitate programming the AltiVec hardware contained in Motorola processors, Motorola has defined a set of C language extensions to represent the intrinsic parallelism of the AltiVec technology explicitly. These extensions allow a programmer to work in a high-level language such as C and thus avoid the complexities of working in assembly language while having direct access to the data parallel facilities of the AltiVec execution units.
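As a hedged illustration of what such C extensions look like in practice, the sketch below uses the `vec_ld`, `vec_madd` and `vec_st` operations as exposed by GCC's `altivec.h` header; none of these names appear in this article, and the function `multiply_add` is hypothetical. The vector path is compiled only when the compiler defines `__ALTIVEC__`; otherwise a plain scalar loop computes the same result, so the example builds on any platform.

```c
#include <stddef.h>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* y[i] = a[i] * x[i] + y[i] for n floats.
 * Assumes n is a multiple of 4 and that a, x and y are 16-byte aligned. */
static void multiply_add(const float *a, const float *x, float *y, size_t n)
{
#if defined(__ALTIVEC__)
    /* Four single-precision elements are processed per iteration. */
    for (size_t i = 0; i < n; i += 4) {
        __vector float va = vec_ld(0, a + i);
        __vector float vx = vec_ld(0, x + i);
        __vector float vy = vec_ld(0, y + i);
        vec_st(vec_madd(va, vx, vy), 0, y + i);
    }
#else
    /* Scalar fallback for targets without AltiVec support. */
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * x[i] + y[i];
#endif
}
```

The vector path reads as ordinary C: loads, one multiply-add and a store, with no assembly language and no explicit register allocation.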

The Motorola G4 processor, planned for introduction in 1999, is designed to take full advantage of the capabilities offered by the 0.25 µm and 0.18 µm process technology generations. This microprocessor, which is designed to support both desktop computing and high-performance embedded applications, contains 10.5 million transistors on an 83 mm^2 die using a 0.22 µm CMOS technology with six layers of copper metal for interconnect. Initial product offerings in 1999 at 400 MHz will offer typical power dissipation well below 10 Watts to meet the more stringent power and thermal requirements of notebook and embedded applications. For targeted applications such as Voice over IP processing, the G4 typical power dissipation is estimated to be under 3 Watts. The sub-100 mm^2 die size of the G4 will allow the device to meet the stringent cost requirements of many embedded applications as well as future consumer-oriented computer and entertainment applications.

Summary

The first round of microprocessor integration and development in the late 1980s and early 1990s saw mainstream microprocessors evolve into 32-bit processors with integrated caches, memory management units, and floating point co-processors. The second round of integration began in the late 1990s and includes integrated support for multimedia data types and DSP-type operations. Integrated second level caches have also made their appearance and will most likely become very common in the early years of the next century. Motorola is leading the way for the industry.

The networking infrastructure and the consumer entertainment markets are supplanting the desktop computing market as the main drivers for process technology and architectural innovation. These embedded market customers demand very low cost and very low power designs that offer the highest possible performance. Meeting these customer demands requires leading-edge manufacturing processes and innovative architecture. Motorola's PowerPC G4 with AltiVec Technology takes advantage of both of these standard approaches to deliver maximum performance.

PowerPC and PowerPC 750 are trademarks of International Business Machines

AltiVec is a trademark of Motorola, Incorporated