Level: Introductory
Cathleen Shamieh (cathleen.shamieh@verizon.net), Consultant
19 Oct 2004
Each of the leading microprocessor manufacturers has announced the availability of one or more 64-bit desktop processors, but differences exist in architectural design, fabrication, support, and intended use of each processor. This article looks at the critical issues in a few of IBM's 64-bit POWER designs, covering 32-bit compatibility, power management, processor bus design, and the manufacturing process.
When people talk about 64-bit computing, it's not always clear what they mean.
Most often, they mean some combination of register width, bus width, or
address space. For the purposes of this article, it means a processor with
64-bit registers and 64-bit addressing.
In the late 1980s, as the desktop processing industry was still
struggling to transition to 32 bits, several companies (among them DEC,
IBM, Motorola, and Apple) were already in the throes of 64-bit
development. When the first 64-bit processor was introduced in 1992 for
high-end UNIX servers and workstations, who would have thought of using
that kind of processing power just a dozen years later in cell phones and
laptops? Yet by the end of the 20th century, desktop applications had
tested the computational and memory limits of 32-bit processors, and
growth in high-performance consumer electronics called for increasingly
powerful embedded processors with ever-tighter power budgets. Early in
the 21st century, the need for strong security grew tremendously,
sparking renewed interest in encryption and decryption algorithms, as
well as desktop security. The desktop and embedded industries needed
another bit. To preserve the "power of two" progression that people have
come to expect from the 8-, 16-, and 32-bit words of existing
architectures, developers are getting 32 more bits instead, resulting in
an address space that is about 4 billion (2^32) times larger than the
maximum address space of a 32-bit system.
A 64-bit processor can address an astronomical 18 billion GB, or 18
exabytes, of memory (see Resources for the definition of exabyte). Its
64-bit integer registers also accelerate complex mathematical
calculations: the processor can operate directly on 64-bit numbers, and
it can perform multiple operations on smaller numbers within a single
CPU cycle. The impact of 64-bit processing is
substantial: the time it takes to render a 3D model can be reduced
dramatically, freeing up computing resources, compressing diagnostic
timeframes, and enabling you to work more efficiently.
Billions and billions
When a kilobyte was a big chunk of space, no one cared that it was 1024
instead of 1000 bytes. When the megabyte came along, the usage was
entrenched. But, as the orders of magnitude add up, so does the error.
One recommended solution is to distinguish between binary and decimal,
using "kilo" for 1000, and "kibi" for 1024. So, 2^64 is
18446744073709551616, which is just over eighteen quintillion, or eighteen
billion billion, bytes -- but it's really only seventeen billion gigabytes.
Got that down? While we're at it, we should point out that our British
readers probably think we're sloppy at math, because, to them, "billion"
often means "million million", and "trillion" means "million million
million", so they would call 2^64 "eighteen trillion" instead of "eighteen
quintillion".
Want a simpler unit for the amount of memory in a 64-bit address space?
Refer to it as "lots," or fall back on Carl Sagan's old line.
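The sidebar's arithmetic can be checked with a short C sketch. The function names are made up for this illustration and do not come from the article. Because 2^64 itself does not fit in a 64-bit integer, the binary figure is computed from the exponents, and the decimal figure from the largest 64-bit value.

```c
#include <stdint.h>

/* Illustrative helpers; the names are hypothetical. */

/* Binary gigabytes (2^30 bytes each) in an address space of the given
   width: 2^bits / 2^30 = 2^(bits - 30). Valid for 30 <= bits <= 93. */
uint64_t address_space_in_binary_gb(unsigned address_bits)
{
    return (uint64_t)1 << (address_bits - 30);
}

/* Decimal gigabytes (10^9 bytes each) in a 64-bit address space,
   rounded down. UINT64_MAX is 2^64 - 1, one byte short of the full
   space, which does not change the rounded-down quotient. */
uint64_t address_space_64_in_decimal_gb(void)
{
    return UINT64_MAX / 1000000000u;
}
```

Calling address_space_in_binary_gb(64) gives the "seventeen billion gigabytes" figure, while address_space_64_in_decimal_gb() gives the "eighteen billion" one; the gap between them is the accumulated 1024-versus-1000 error the sidebar describes.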
This processing power, which used to be available only on high-end servers
for complex enterprise applications like real-time business intelligence,
is now available on the desktop. Small businesses and home PC users can
perform video editing and rendering tasks that were the stuff of dreams a
decade ago. Just as 32-bit processing became commonplace in desktops and
entry-level servers, so 64-bit processing is poised to become
increasingly common over the next few years. From a theoretical feature
touted in trade magazines to a reasonably cost-effective choice for
high-end embedded systems, 64-bit processing has come a long way.
A 64-bit address space is especially useful for simulations and large
databases. While home users rarely have a working set of more than 4GB of
data, scientists and database technicians are quite happy to have a little
more room to work on large datasets, or build larger, more complete,
simulations. With modern databases frequently holding terabytes of data,
the ability to have more than 4GB of what is, effectively, a working
cache, can improve performance dramatically.
More memory than you can possibly count
For applications that don't need to address memory beyond the 32-bit
processor limit of 4GB, 64-bit systems still provide substantial benefits
in terms of processing speed. In 32-bit computing, integer math uses
32-bit-wide general-purpose registers. With 64-bit computing, each
general-purpose register is 64 bits wide and can represent a much larger
integer. High-level languages, such as C and C++, support 64-bit
mathematical operations on 32-bit processors by splitting a 64-bit number
across two 32-bit registers. On a 64-bit machine, a 64-bit integer type
(such as int64_t, usually spelled "long long" on 32-bit systems) fits
within a single register. This register-width difference
produces a substantial difference in resource requirements when performing
64-bit math, as Table 1 illustrates.
Table 1. Resources required to load, add, and store two 64-bit integers
| Operation | Resources on 32-bit processor | Resources on 64-bit processor | Effective improvement with 64-bit |
| Load two 64-bit integers | Four 32-bit registers to hold data; 4 load instructions | Two 64-bit registers to hold data; 2 load instructions | Half as many load instructions and half as many registers consumed |
| Add two 64-bit integers | 2 addition instructions: an add that records the carry, then an extended add that includes the carry | 1 addition instruction | Half as many instructions, and reduced interlocking among instructions and carry status |
| Store two 64-bit integers | Four 32-bit registers to hold data; 4 store instructions to save data | Two 64-bit registers to hold data; 2 store instructions to save data | Half as many store instructions and half as many registers consumed |
| Total resources | 10 instructions issued; 4 registers plus carry field | 5 instructions issued; 2 registers used | One half the instructions, less than one half the resources consumed |
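The "add" row of Table 1 can be sketched in C. This is a minimal illustration of the technique, not the code any particular compiler emits; the type and function names are made up for the example.

```c
#include <stdint.h>

/* A 64-bit value held the way a 32-bit processor must hold it:
   split across two 32-bit registers. The type name is illustrative. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} split64;

/* Two dependent additions: the low words first, which may produce a
   carry, then the high words plus that carry. */
split64 add_split64(split64 a, split64 b)
{
    split64 sum;
    sum.lo = a.lo + b.lo;              /* add, recording the carry      */
    uint32_t carry = (sum.lo < a.lo);  /* unsigned wrap-around = carry  */
    sum.hi = a.hi + b.hi + carry;      /* extended add using the carry  */
    return sum;
}

/* With 64-bit registers the same sum is one register-wide add. */
uint64_t add_native64(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Helper to compare the two forms. */
uint64_t join_split64(split64 v)
{
    return ((uint64_t)v.hi << 32) | v.lo;
}
```

The second addition in add_split64 cannot start until the first has produced its carry, which is the interlocking the table refers to.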
Logical operations (AND, OR, XOR) also benefit from wider registers, since
they can operate on a much larger data size. As a result, applications
that involve the manipulation of huge data sets, such as document
management and decision support, run much faster on a 64-bit system.
Finally, 64-bit processors can drive 32-bit applications even faster, by
handling more data per clock cycle than a 32-bit processor. Therefore,
even apps that don't need to address memory beyond 4GB can benefit from
64-bit processing.
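The point about logical operations can be illustrated with a small C sketch that intersects two bitmaps. The function names are hypothetical, and the operation counts assume that each word type maps onto one native register, which holds for uint64_t only on a 64-bit processor.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helpers; the names are hypothetical. Both functions AND
   two bitmaps of n_bits bits and return how many AND operations they
   performed. n_bits is assumed to be a multiple of 64 to keep the
   sketch short. */

size_t bitmap_and_32(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                     size_t n_bits)
{
    size_t words = n_bits / 32;
    for (size_t i = 0; i < words; i++)
        dst[i] = a[i] & b[i];   /* 32 bits combined per operation */
    return words;
}

size_t bitmap_and_64(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                     size_t n_bits)
{
    size_t words = n_bits / 64;
    for (size_t i = 0; i < words; i++)
        dst[i] = a[i] & b[i];   /* 64 bits combined per operation */
    return words;
}
```

For the same bitmap, the 64-bit version runs its loop half as many times, so the same data is covered with half the loads, ANDs, and stores.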
The impact of design differences
Frank Lloyd Wright once said, "Architecture is the triumph of human
imagination over materials, methods, and men." Like building design,
microprocessor design involves imagination and creativity, makes use of
different materials and processes, and should bear in mind the intended
use of the design. Decisions made during the design process have a great
impact on the ultimate "form and function" of the resultant composition.
So what are the critical considerations in the design of a 64-bit
processor? It is important to note here that this discussion focuses on
64-bit computing in the desktop, entry-level server, and embedded markets.
Sixty-four-bit computing in the high-end server environment has been well
established for several years; processors in that category, such as the
IBM® POWER4™ and POWER5™, are outside the scope of this article. Success in the
desktop, embedded, and small-scale server environment depends on a
combination of performance, power, compatibility with existing 32-bit
code, and middleware support. Some of the design factors affecting these
elements are:
- Architecture design (for example, pipeline, register sets)
- Performance of 32-bit software
- Silicon manufacturing
- Power management
- System interface speed (bus architecture)
Architected for what?
The days of "my processor's clock is faster than your processor's clock"
are over. Sure, clock speed is important, but if a processor gets bogged
down making calls to memory, I/O devices, and other processors in the
system, what difference does it make? Remember, a chain is only as strong
as its weakest link. In the microprocessor world, performance is defined
by throughput and capacity -- not just clock speed. Processor frequency,
cache size, memory bandwidth, and processor architecture all contribute to
overall performance. At the 2004 International Solid-State Circuits
Conference, a panel of processor architects from IBM, AMD, Intel, Fujitsu,
Sun, and Stanford University generally agreed that chips will increasingly
rely on parallelism, rather than clock rates, for achieving faster speeds.
Performance benchmarks
One recognized source of benchmark performance data is the non-profit
Standard Performance Evaluation Corporation (SPEC). SPEC's philosophy is
that relevant benchmarks should be based on real-world applications, such
as weather prediction and image processing. The SPEC CPU benchmarks
compare systems on a known compute-intensive workload, and the results
reflect the combined performance of key system components, such as the
system's processor, memory hierarchy, and compiler. For more information
and the latest set of benchmark data, visit www.spec.org.
The architecture and its instruction set form the core of a microprocessor
system, and architectures are typically designed with one or more goals in
mind. The specific goals of a particular design are important
considerations that have an impact on the intended use of the processor.
For instance, some processors are designed with an emphasis on clock speed
or data crunching, while others seek to optimize throughput. Some are
focused on general purpose computing, while others are designed to meet
the unique needs of embedded systems. Other design goals may include
native 32-bit compatibility (see next section), support for symmetric
multiprocessing (SMP), and optimization for certain types of applications.
Processors with specific design goals will perform better on some
applications than others. Dynamic workloads will perform better on
processors optimized for throughput, whereas workloads that involve
predictable algorithms operating on static data call for processors
architected for number crunching.
The IBM® PowerPC® 970 was designed for high-performance, general-purpose processing. Multiple pipelined execution units, branch prediction, and a
SIMD, or vector processing (AltiVec), unit combine to allow up to 215
in-flight instructions. With each clock cycle, up to eight instructions
can be fetched from the direct-mapped 64K L1 instruction cache, broken
down, and dispatched into the execution units, while 32K of write-through,
two-way associative L1 data cache can fetch up to eight active data
streams, which are loaded into data registers behind the execution units.
Different types of instructions are processed concurrently by the execution
units, which include two floating-point units, two integer units,
two load/store units, a condition register unit, a branch prediction unit,
and a vector processing unit. This dual-pipeline 128-bit vector engine
performs SIMD processing, applying a single instruction to multiple data
simultaneously, and uses a set of 162 specialized SIMD instructions for
optimal performance.
Figure 1. PowerPC 970 architecture (Adapted from Apple's "PowerPC G5" white paper, June 2004. See Resources.)
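What "a single instruction applied to multiple data" means can be modeled in portable C. This is only a scalar model of one 128-bit vector add over four 32-bit lanes; it does not use AltiVec itself, and the type and function names are made up for the example.

```c
#include <stdint.h>

/* Scalar model of a 128-bit vector register viewed as four 32-bit
   lanes. The type and function names are illustrative and are not
   AltiVec intrinsics. */
typedef struct {
    uint32_t lane[4];
} vec128_u32;

/* What one vector add does: the same operation applied to every lane.
   A SIMD unit performs all four lane additions for one instruction;
   this portable model spells them out with a loop. */
vec128_u32 vec_add_model(vec128_u32 a, vec128_u32 b)
{
    vec128_u32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];  /* lanes never carry into each other */
    return r;
}
```

The lanes are independent: an overflow in one lane wraps within that lane and does not disturb its neighbors, which is what distinguishes a vector add from a single 128-bit integer add.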
Evolution and compatibility
Customers with substantial technology investments in 32-bit systems will
move towards 64-bit computing at different rates and for different
reasons, such as the need for large file support. Some applications are
best left as 32-bit programs, but should be able to coexist with
applications that are ported to 64-bit. To provide customers with
investment protection while offering the flexibility to deploy 64-bit
technology according to their specific business needs, 64-bit systems
should support 32-bit compatibility, and 32-bit and 64-bit computing
environments should be able to coexist and share resources on the same
system, just as 32-bit programs have in the past.
There are two different ways of providing 32-bit compatibility in 64-bit
processor design: native 32-bit support or 32-bit emulation. Native 32-bit
support provides full binary compatibility with existing 32-bit
applications, enabling them to run at full processor speed. Compatibility
through emulation requires the translation of the 32-bit application
instructions on the fly, incurring substantial processing overhead and
resulting in sub-optimal 32-bit performance.
The IBM PowerPC 970 family of 64-bit processors provides native support
for 32-bit processing, enabling user mode 32-bit PowerPC applications to
run on the PowerPC 970 processors without any modifications. Because the
64-bit PowerPC architecture is a superset of the 32-bit PowerPC architecture, the
PowerPC 970 processor can run 32-bit programs the same way the programs
run on a 32-bit processor. The PowerPC 970 has two execution modes:
32-bit, which enables instructions and addressing to behave the same as on
a 32-bit processor, and 64-bit, which produces 64-bit addressing and
instruction behavior for a true 64-bit environment. Additional supervisor
instructions are provided to set up and control the execution mode on a
per-process level, which enables the creation of a mixed environment of
concurrent 32-bit and 64-bit processes at the system level.
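One visible consequence of the two execution modes is how addresses behave. The sketch below is a simplified model, assuming that in 32-bit mode only the low 32 bits of a computed address are used, so that addressing wraps at 4GB as it would on a 32-bit processor. The enum and function names are hypothetical and are not taken from IBM documentation.

```c
#include <stdint.h>

/* Hypothetical names for the two execution modes described above. */
typedef enum { MODE_32BIT, MODE_64BIT } exec_mode;

/* Simplified model of address formation: the sum is always computed
   in a 64-bit register, but in 32-bit mode (by assumption) only the
   low 32 bits are used as the address. */
uint64_t effective_address(uint64_t base, int64_t displacement,
                           exec_mode mode)
{
    uint64_t ea = base + (uint64_t)displacement;
    if (mode == MODE_32BIT)
        ea &= 0xFFFFFFFFu;   /* wrap at 4GB, like a 32-bit processor */
    return ea;
}
```

Under this model the same base and displacement can yield different addresses depending on the mode of the running process, which is why the mode has to be tracked per process.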
Some operating systems (for example, Linux) support a mix of 64-bit applications
and 32-bit applications running at the same time. This allows for
customer-controlled migration to a 64-bit environment, and enables
customers to port only those applications that truly benefit from 64-bit
computing. For maximum flexibility, the IBM PowerPC 970 processor family
can execute code in 32-bit environments, mixed 32-bit and 64-bit
environments, or in a pure 64-bit environment.
Waiting for something to work on: front-side bus architecture
One reason that cache size is so important in modern processors is that
even the fastest processors can get bogged down communicating with the
memory controller. Conventional bidirectional buses carry data to and
from the processor over the same link, incurring delays when the bus
switches direction and while the processor and the memory controller
negotiate use of the bus. Dual-channel unidirectional buses enable data to
flow to and from the processor simultaneously, eliminating negotiation
overhead and more than doubling the effective data rate. The trade-off
involved in bus architecture design is cost versus performance. While a
dual-channel design revs up memory performance, system costs tend to be
higher due to the need for memory module pairs as well as more
sophisticated chipset technology to handle the higher complexity of the
memory bus.
The IBM PowerPC 970 family of processors features two unidirectional
32-bit point-to-point channels designed to operate at an integer fraction
of the CPU core frequency. With a core clock speed of 2.5GHz, the 90nm IBM
PowerPC 970FX has a front-side bus that is theoretically capable of
operating at up to 1.25GHz, for an aggregate bandwidth of up to 10GBps
(5GBps in each direction). This type of bus architecture achieves its
highest throughput only when reads and writes are fairly well balanced.
A bidirectional bus
architecture, as seen on Intel IA-64 and AMD Athlon processors, achieves a
lower peak throughput, but it can deliver its peak throughput in either
direction, making it better suited for applications that perform mostly
reads or mostly writes.
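The bandwidth figures above, and the sensitivity to the read/write mix, can be worked through in a few lines of C. This is an idealized sketch: it treats every bus cycle as carrying data and ignores address, command, and protocol overhead. The function names are made up for the example.

```c
/* Peak bandwidth of one channel, in bytes per second. */
double channel_bandwidth(double width_bits, double clock_hz)
{
    return (width_bits / 8.0) * clock_hz;
}

/* Two unidirectional channels: reads travel on one, writes on the
   other. read_fraction is the share of traffic that is reads
   (0.0 to 1.0). The busier channel saturates first, so the total
   throughput is capped at per_channel / max(read_fraction,
   1 - read_fraction). */
double dual_unidirectional_throughput(double per_channel,
                                      double read_fraction)
{
    double busier = read_fraction > 0.5 ? read_fraction
                                        : 1.0 - read_fraction;
    return per_channel / busier;
}
```

With a 32-bit channel at 1.25GHz, channel_bandwidth gives 5GBps per direction. A perfectly balanced mix (read_fraction of 0.5) reaches the 10GBps aggregate, while a read-only workload (read_fraction of 1.0) is limited to the 5GBps of a single channel.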
Manufacturing process matters too
The manufacturing process used in creating processor technology has a
tremendous impact on both the performance and power metrics.
Traditionally, new processor technology that introduces an increase in
processing speed has been accompanied by an inevitable increase in power
consumption. The processor industry has come to expect this. However,
recent breakthroughs in chip fabrication have enabled manufacturers to
produce faster processors -- with decreased power consumption. And 64-bit
processors are among the first to benefit from these breakthroughs.
When strained silicon and silicon-on-insulator (SOI) technology are
integrated into the same manufacturing process, electrons flow faster
through transistors, and neighboring transistors are isolated by an
insulating layer in the silicon. The result is higher performance with
reduced power consumption. Copper wiring used in place of the 30-year-old
practice of connecting transistors through aluminum conduits further boosts
performance, through improved conductivity and reliability.
A powerful lineage
One of the original design goals of the Apple-IBM-Motorola partnership
that developed the PowerPC architecture back in 1991 was to define a
64-bit architecture that was a superset of the 32-bit architecture, in
order to provide application binary compatibility for 32-bit applications.
The PowerPC architecture that was born of this partnership is -- and
always was -- a 64-bit architecture derived from the IBM POWER
architecture. From the very beginning, PowerPC was designed to support
switching between the 64-bit mode and the 32-bit mode. As a relative of
the IBM POWER4 and POWER5 processors, the PowerPC 970 family may be a new
generation of PowerPC processors, but it inherits a history of over ten
years of 64-bit computing at IBM.
The 90nm IBM PowerPC 970FX is the first chip fabricated using a
combination of SOI, strained silicon, and copper wiring technologies,
placing over 58 million transistors on a 65mm² die -- a 50% die shrink
over its predecessor, the 130nm IBM PowerPC 970. The PowerPC 970FX runs at
speeds up to 2.5GHz and is smaller, faster, and more power-efficient
than the PowerPC 970. These new fabrication technologies are now being
deployed by other chip manufacturers anxious to gain the same
power-performance advantage over their own predecessors.
Note that the performance of the PowerPC 970 family actually exceeds that
of its award-winning parent, the high-end IBM POWER4 processor, in many
areas. This is because the circuit and process technology
used for the POWER4 processor was designed to achieve the levels of
reliability necessary for the continuous-availability server market,
at the expense of transistor switching speed. Those levels can be relaxed
for the desktop and small-scale server market. The fabrication
technology used for the PowerPC 970 was therefore designed to eke out
higher performance by giving up some of that reliability margin; for
these markets, the trade-off between reliability and performance is
different.
Overcoming a power struggle
With more transistors being crammed into smaller chips in order to enhance
microprocessor performance, power management has become increasingly
challenging. Clock gating and other simple techniques have reached their
limit, leading chip designers to implement somewhat precarious techniques,
such as tweaking individual devices in critical sections of their chips to
match a specific need and designing chips to operate close to their
thermal limits. Ongoing research seeks to manage power dissipation while
maintaining high levels of processor performance.
Historically, IBM Power Architecture microprocessors have incorporated
features to help users effectively manage power dissipation. The PowerPC
750 microprocessor, produced in 0.25-µm technology, first gave users
dynamic power management, in which execution units were not clocked when
idle, along with three software-selectable power-saving modes.
The power-saving modes reduced functionality in other areas: nap and
doze modes limited cache and bus snooping operations, and sleep mode
turned off all functional units except for interrupts. These techniques
were an effective way to reduce power, as they reduced switching on the
chip.
As the process geometries have been reduced to below 130 nm, power
dissipation due to leakage currents has greatly increased. IBM addressed
this challenge in the 90nm PowerPC 970FX microprocessor by integrating
strained silicon and SOI into the same manufacturing
process, as previously discussed. This technique speeds the flow of
electrons through transistors to increase performance and provides an
insulating layer in the silicon that isolates transistors and decreases
power consumption.
A new approach to power management, patented by IBM, involves adding some
power-control features within the processor chip. This power tuning
technique, enabled through advanced system-wide tuning and controlling of
processor frequency and voltage, allows designers to quickly and
seamlessly change the frequency from full frequency to f/2 and f/4. The
frequency switch is applied at a system level -- affecting the processor
bus and the bridge and memory controller support chip as well as the
processor core. The PowerPC 970FX microprocessor takes advantage of this
IBM-refined power-saving technique, enabling a seamless, fine-grained,
system-wide frequency and voltage change without stopping core execution
units, disrupting interrupts, or disabling bus snooping.
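The article does not give a power model, but the effect of switching to f/2 or f/4 can be estimated with the usual first-order approximation for dynamic CMOS power, P ~ C * V^2 * f. The sketch below is an assumption-laden illustration: it ignores leakage power, and any voltage ratio passed to it is hypothetical rather than a PowerPC 970FX figure.

```c
/* Relative dynamic power under the first-order approximation
   P ~ C * V^2 * f. Both arguments are ratios against full-speed
   operation, so full speed is relative_dynamic_power(1.0, 1.0),
   which is 1.0. Leakage power is not modeled. */
double relative_dynamic_power(double voltage_ratio, double frequency_ratio)
{
    return voltage_ratio * voltage_ratio * frequency_ratio;
}
```

Halving the frequency alone halves the estimated dynamic power. Lowering the voltage at the same time helps much more, because voltage enters as a square, which is why frequency and voltage are changed together.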
If all of that isn't enough, of course, there's always the option of using
liquid cooling, as Apple did in the 2.5GHz G5 machines (see Resources).
Apple's pick
The 90nm IBM PowerPC 970FX leverages patented fabrication processes and
power management techniques, along with ten years of 64-bit computing
experience, to achieve high performance on compute- and
bandwidth-intensive applications while maintaining compatibility with
32-bit code. Apparently, Apple appreciated the choices IBM processor
architects made when designing the 970 family; dual 130nm PowerPC 970s
form the foundation of the Power Mac G5, and the PowerPC 970FX is at the
core of the Apple Xserve G5, a rack-mount server. Reliability,
performance, backward compatibility, and years of IBM research and
development have come together to produce 64-bit computing for the masses.
Resources
- Find the definition of exabyte on wikipedia.org.
- "PowerPC Architecture: A High-Performance Architecture with a
History" covers the development of the PowerPC architecture (IBM).
- "The RS/6000 64-bit Solution" discusses the benefits of 64-bit
technology (IBM).
- The Wikipedia article on 64-bit computing provides a good, general
foundation and helpful links.
- The IBM Application Note "Developing Embedded Software for the IBM
PowerPC 970FX Processor" discusses issues associated with developing new
software and porting existing software to the PowerPC 970FX processor
(IBM, July 2004).
- The IBM white paper "An Introduction to 64-bit Computing and the IBM
PowerPC 970FX" provides an overview of 64-bit computing and discusses the
advantages of a 64-bit operating system environment (IBM, April 2004).
- The software reference manual "PowerPC Microprocessor Family:
Programming Environments Manual for 64-bit Microprocessors" (in PDF) can
help you develop software that is compatible across the entire family of
64-bit PowerPC processors (IBM, July 2005).
- Learn more about power tuning in the PowerPC 970FX processor with
"Frequency switching improves power management in Power Architecture
chips" by Helena Purgatorio (developerWorks, September 2004).
- The article "The IBM PowerPC 970FX power envelope and power management"
provides an understanding of the PowerPC 970FX processor's advanced power
management techniques (developerWorks, September 2004).
- Find out more about the International Solid-State Circuits
Conference, where a panel agreed that parallelism, not clock speed, would
be the biggest component of upcoming performance gains.
- Check out the radiator-like liquid cooling system used in the 2.5GHz G5
(The Detroit News, September 2004).
- Peter Perlso has posted a PowerPC overview, which covers many things
most other overviews do not -- including die and L1 and L2 cache sizes for
the various iterations of PowerPC chips through the years. Compare with a
similar page he has for chips of the Intel variety -- he's even got a page
on the kilo- or kibi- ("which definition of exabyte?") problem.
About the author
Cathleen Shamieh has engineering and consulting experience in the
telecommunications, speech processing, computer telephony, and medical
electronics industries. She can be reached at
cathleen.shamieh@verizon.net.