
Dissecting The ATI RADEON 9700 PRO
December 12, 2002 Dave Barron

Summary: By now we all know that ATI's RADEON 9700 PRO VPU is fast, but have you ever wondered how ATI makes the frames fly so fast? In today's article, Dave explores the RADEON 9700 PRO architecture in intricate detail, in the first part of multiple articles. See what makes the R300 core tick in our latest article!


Introduction

ATI has taken an indisputable technology lead with the release of the R300-based RADEON 9700 PRO. Not only has this product outperformed the best offerings of NVIDIA, Matrox and others, but it has provided a feature set that is considerably richer. For those who might think so, I want to reassure you that we have not turned into “fanboys”. A close examination of the performance and feature set of R300 establishes this technology lead we speak of, and it is a lead that simply cannot be argued with.



With that said, NVIDIA should be expected to answer back, and with the announcement of GeForce FX (NV30), they have responded. However, GeForce FX is still some time away from shipping and so for now ATI has the extreme high-end market as their own. We felt that it would be interesting to take a close look at R300, looking not only at the surface, but digging deep into the technology. In this two-part article we will first consider each of the 3D related technologies within the chip, from there examining the performance of not simply the whole chip, but each individual unit.

SmartShader 2.0 - Pixel Shader

The introduction of pixel shader standards brought about a change that went from wanting more pixels to wanting smarter, more advanced pixels. A variety of new operations were brought to the forefront with the use of pixel shader hardware, including division, subtraction, square roots and many others. These allowed game developers to produce dramatically more complex lighting, shadowing, and bump-mapping effects.

DirectX 9 has taken pixel shader functionality to a new level of complexity. There are several new operations available, in addition to greater overall flexibility. R300 follows the pixel shader 2.0 (PS2.0) spec almost perfectly, with the addition of greater instruction support. The following chart compares R300’s pixel shader to the pixel shader 2.0 specification. In certain cases R300 exceeds the PS2.0 specification, where in others it meets it exactly. Where R300 has greater support, the PS2.0 limit likely exists because another hardware vendor’s upcoming design has such limitations.

PS 2.0 vs. R300

                     Pixel Shader 2.0    R300
Max Textures         16                  16
Instruction Slots    96                  160
Color Ops            64                  128
Address Ops          32                  32












Core Architecture

Each of R300’s eight pixel pipelines is capable of addressing 16 textures through a single-pass, multi-clock loopback, and together the pipelines deliver a total fill-rate of 2.6 GP/sec. Because each pipeline supports a single texturing operation per clock, applying 16 textures requires 16 clock cycles. Complex pixel shading operations require a greater number of clock cycles as well.
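The fill-rate and loopback figures follow from simple arithmetic. The sketch below assumes the RADEON 9700 PRO's 325 MHz core clock, which is not stated in this article, and the function names are purely illustrative.

```python
# Back-of-the-envelope fill-rate arithmetic for an R300-style design.
# CORE_CLOCK_MHZ is an assumption (the 9700 PRO's shipping core clock).
PIPELINES = 8
CORE_CLOCK_MHZ = 325


def peak_fill_rate_gpixels(pipelines=PIPELINES, clock_mhz=CORE_CLOCK_MHZ):
    """One pixel per pipeline per clock, in gigapixels per second."""
    return pipelines * clock_mhz / 1000.0


def textured_fill_rate_gpixels(textures, pipelines=PIPELINES,
                               clock_mhz=CORE_CLOCK_MHZ):
    """With one texture lookup per pipeline per clock, N textures cost N clocks."""
    return peak_fill_rate_gpixels(pipelines, clock_mhz) / max(1, textures)


print(peak_fill_rate_gpixels())        # single-textured peak
print(textured_fill_rate_gpixels(16))  # all 16 textures applied via loopback
```

Under these assumptions the peak rate matches the 2.6 GP/sec quoted above, and a pixel using all 16 textures costs one sixteenth of that.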

[image]


Each of ATI’s pixel pipelines can execute three instructions per-clock. The first of these is a texture lookup, followed by an address operation, and then a color operation. Simultaneous execution of these three operations allows for minimal cycle consumption in simple pixel shader ops, as much of the work can be completed in a single cycle. With that said, a complex pixel shader nearing the hardware limit of 160 instructions will require more than a few cycles to complete.

[image]



As per the DirectX 9 specification, R300 supports data precision that is substantially greater than previous generations of hardware. Most graphics hardware has supported 32-bit integer formats, with internal rendering accuracy falling between 32 and 48 bits per pixel. Now supported is 128-bit floating point precision, which is dramatically more accurate.

ATI’s design is slightly confusing to most, for saying that it provides 128-bit floating point precision is accurate, though not the whole story. R300’s texture address logic supports 32 bits per channel (128-bit floats), where the shader logic uses 24-bit per-channel floating point accuracy, for a total precision of 96 bits per pixel. The output can either be reduced to a format of lower accuracy, or expanded all the way to a 128-bit float.

Why the move to 128-bit floating point?

In supporting higher precision formats, two primary tasks are accomplished. The first is derived from rendering artifacts associated with multi-pass rendering. Each rendering pass in multi-pass rendering results in a precision loss of roughly 50%. With multiple passes, precision levels can quickly drop to a point where artifacts (color banding) become apparent. Increased precision works to alleviate this by starting at a higher base precision, allowing more room for quality loss without a noticeable effect. The second reason is a bit more interesting.

Lighting is a primary aspect of 3D rendering. Proper lighting effects make the difference between an immersive, realistic environment and one that is dull and clearly fake. With the use of floating point data formats, the dynamic range of the output can be increased considerably, allowing for greater realism and convincing overbright lighting. ATI has demonstrated this with their Natural Light demo, as seen in the image below. The right side uses floating point accuracy and a high dynamic range, where the left does not.

[image]
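Both motivations for higher precision can be illustrated with a few lines of code. The sketch below is a toy model, not R300's actual datapath: the first part writes intermediate results back to an 8-bit buffer after every pass and counts how many distinct shades survive, and the second part compares a pipeline that clamps to 1.0 after each light with one that keeps overbright values until the end. The pass factors, light intensities and exposure value are all made-up inputs.

```python
def quantize(value, bits):
    """Round a value in [0, 1] to the nearest level an integer buffer can hold."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels


def multipass(value, factors, bits=None):
    """Apply each factor as its own pass; with `bits` set, every intermediate
    result is written back to an integer buffer of that depth."""
    for factor in factors:
        value = min(1.0, value * factor)
        if bits is not None:
            value = quantize(value, bits)
    return value


def distinct_shades(factors, bits=None, steps=256):
    """Count the distinct 8-bit output shades produced from a smooth input ramp."""
    return len({round(multipass(i / (steps - 1), factors, bits) * 255)
                for i in range(steps)})


def shade(lights, exposure, clamp_each_pass):
    """Sum light intensities, optionally clamping to 1.0 after each light."""
    total = 0.0
    for intensity in lights:
        total += intensity
        if clamp_each_pass:
            total = min(1.0, total)
    return min(1.0, total * exposure)


# Darken to a quarter, then brighten four times: fewer shades means banding.
print(distinct_shades([0.25, 4.0], bits=8), distinct_shades([0.25, 4.0]))

# A modestly lit surface and a very bright one under the same exposure.
dim, bright = [0.6, 0.5], [0.9, 0.8, 2.5]
print(shade(dim, 0.25, True), shade(bright, 0.25, True))    # clamped pipeline
print(shade(dim, 0.25, False), shade(bright, 0.25, False))  # floating point
```

In the clamped pipeline the two surfaces come out identical, because everything above 1.0 was discarded before the exposure was applied. The floating point version keeps them apart.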





SIDEBAR: If you don’t mind paying MSRP, ATI.com is offering free shipping on all its products.



TRUFORM 2.0

R300 includes support for higher order surfaces through the use of a fixed function tessellation engine. Higher order surfaces allow for increased geometric detail without the original source object being generated at that level. Doing so allows for higher geometric complexity without consuming the massive amount of AGP and local memory bandwidth that a fully detailed source model would require.



R300 supports a feature known as adaptive tessellation. While this is not a higher order surface in and of itself, it is an important feature worth consideration. When an object is tessellated, the geometric detail increases as the object comes nearer and decreases as the distance from the object increases. Processing power is thus saved by not rendering inessential detail.

[image]


Continuous tessellation is supported as well, providing smooth tessellation between detail levels. Oftentimes software applications will provide multiple models of the same object with different levels of tessellation. In rendering these, the object’s distance determines which model is rendered. The problem in doing so is that the object can “pop” into greater detail as it draws near. One frame will feature a low detail model where the next frame suddenly has much greater detail.

Moving tessellation to the hardware level, continuous tessellation provides a constant, smooth transition between model details. Rather than jumping between detail levels, the level of detail is gradually increased as the object nears the viewer. There are no sudden changes, just a gradual progression. Doing so allows for the performance and quality benefits of tessellation, without the drawback of detail popping.
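The difference between swapping pre-built models and continuous tessellation can be sketched as two functions of distance. The thresholds, ranges and level values below are made-up illustration values, not anything ATI has published.

```python
def discrete_level(distance, thresholds=(10.0, 25.0, 50.0), levels=(8, 4, 2, 1)):
    """Pick one of several pre-built models by distance; detail changes in jumps."""
    for limit, level in zip(thresholds, levels):
        if distance < limit:
            return level
    return levels[-1]


def continuous_level(distance, near=5.0, far=50.0, max_level=8.0, min_level=1.0):
    """Interpolate the tessellation level smoothly between `near` and `far`."""
    t = (distance - near) / (far - near)
    t = max(0.0, min(1.0, t))  # clamp outside the transition range
    return max_level + (min_level - max_level) * t


# Walk an object toward the viewer and compare the two schemes.
for distance in (60.0, 26.0, 24.0, 11.0, 9.0, 5.0):
    print(distance, discrete_level(distance), round(continuous_level(distance), 2))
```

The discrete version halves or doubles its detail when the object crosses a threshold, which is the "pop" described above. The continuous version changes only slightly between neighboring distances.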

Since the release of R200, ATI has supported N-Patches, and this remains true with R300. N-Patches are a higher order surface that, when enabled, will tessellate rounded surfaces to greater detail. These can provide substantial increases in quality where used appropriately, as character detail can become significantly improved. Yet at the same time, if an application developer allows free use of N-patches on every game model, the scene can also become seriously distorted, as has been seen with certain games such as Serious Sam.

New to DirectX 9 is support for displacement mapping. Displacement mapping is similar to bump-mapping in its effect, yet where bump-mapping simulates surface detail, displacement mapping increases geometry detail so that the enhancements are actually there. DirectX 9 provides two types of displacement mapping implementation, only one of which is supported by ATI.



SIDEBAR: ATI recently cut the price on RADEON 9000 PRO from $149 to $129



Displacement Mapping

Pre-computed displacement mapping is done by an artist’s creation of a displacement map. When the displacement map is fed into the tessellator, pre-specified tessellation points are used and the high detail surface is generated.



This type of implementation is best suited for object models in that there is no concern about object distortion. Though this is the type ATI chose to implement (probably because it is a simpler design), its inability to support dynamic tessellation makes it far from ideal.

[image]


Though not implemented on R300, sampled displacement mapping is certainly worth consideration. Using an evaluator for displacement map sampling, the displacement map is sampled through either bilinear or trilinear filtering. Ideally this is done with trilinear filtering across terrain and other larger objects. Much like trilinear texture filtering corrects ugly texture borders, applying it to a displacement map can alleviate issues at displacement map borders, where cracking and other distortions might occur. With the values sampled, scalar results are stored in a register for use by the tessellator. With these values, the base mesh is tessellated into the expected higher detail surface. The following image provides some detail on what is involved in this implementation, though the concept can be applied to both types.

[image]
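As a rough illustration of the sampled approach, the sketch below bilinearly samples a small height map and moves a vertex along its normal by the resulting scalar. It is a software toy under our own assumptions (a height map stored as a list of rows, unit-length normals, hypothetical function names), not a description of any hardware evaluator.

```python
def bilinear_sample(height_map, u, v):
    """Sample a 2D grid of heights at normalized coordinates (u, v) in [0, 1]."""
    rows, cols = len(height_map), len(height_map[0])
    x = u * (cols - 1)
    y = v * (rows - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two neighboring rows, then vertically.
    top = height_map[y0][x0] * (1 - fx) + height_map[y0][x1] * fx
    bottom = height_map[y1][x0] * (1 - fx) + height_map[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def displace(position, normal, height, scale=1.0):
    """Move a vertex along its (unit) normal by the sampled scalar height."""
    return tuple(p + n * height * scale for p, n in zip(position, normal))


height_map = [[0.0, 1.0],
              [1.0, 2.0]]
h = bilinear_sample(height_map, 0.5, 0.5)
print(h, displace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), h, scale=2.0))
```

A trilinear version would take two such samples from neighboring mip levels of the displacement map and blend between them.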

One issue to be aware of with displacement mapping comes from collision detection. When a surface is dramatically displaced, the original collision boundaries no longer fall along the newly generated surface. As a result, a player could find themselves moving into a seemingly solid surface because the collision detection was not updated to match the displaced geometry.

Vertex Shader

R300’s vertex shader implementation has in many ways gone beyond the 2.0 shader requirement. Within each of the four available vertex shader units are a vector processor and a scalar processor operating simultaneously. Most operations can be handled by the vector processor, which works on four vector elements at once. Operations on vectors with fewer than four elements are handled by the scalar processor, as the vector processor requires the same amount of time to handle three or fewer elements as it does four. By operating both processors in parallel, efficiency is increased and overall computation time is reduced.

[image]






SIDEBAR: TRUFORM wasn’t enabled with early RADEON 9700 PRO drivers, but it’s definitely up and running now.



2.0 Vertex Shaders

As per the vertex shader 2.0 specification, R300 provides space for 256 instructions. By means of flow control, the number of instruction slots no longer defines the number of instructions that can be executed. Flow control allows loops to be executed in the shader code, increasing the total number of instructions that can be run.

The maximum number of instructions that can be run via loops is 1024. Beyond this point a hardware abort is in place to avoid the possibility of infinite loops. Of interest is that R300 can actually support a total of 65,280 instructions when a CAPS bit is used to disable the abort function. With a greater number of clock cycles (slower performance) required for such a program, it is unlikely that this would be used for anything other than creating pre-rendered images.
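To make the distinction between instruction slots and executed instructions concrete, here is a small sketch. The three limits are the figures quoted in this article. The program shape (some straight-line setup code followed by one loop) and the function names are our own simplification.

```python
SLOTS = 256             # static instruction slots
DEFAULT_LIMIT = 1024    # executed-instruction cap enforced by the hardware abort
EXTENDED_LIMIT = 65280  # cap when the abort is disabled


def executed_instructions(setup, body, iterations):
    """Executed count for `setup` straight-line instructions followed by a loop
    whose `body` instructions run `iterations` times."""
    if setup + body > SLOTS:
        raise ValueError("program does not fit in the instruction slots")
    return setup + body * iterations


def allowed(count, abort_disabled=False):
    """Would a program executing `count` instructions run to completion?"""
    return count <= (EXTENDED_LIMIT if abort_disabled else DEFAULT_LIMIT)


short = executed_instructions(setup=16, body=40, iterations=24)
long_ = executed_instructions(setup=16, body=40, iterations=30)
print(short, allowed(short))
print(long_, allowed(long_), allowed(long_, abort_disabled=True))
```

Both programs occupy only 56 slots, yet the second executes more than 1024 instructions and would only complete with the abort disabled. As an aside, 65,280 equals 255 x 256. Whether that reflects 255 passes over a full 256-slot program is a guess on our part, not something ATI has confirmed.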



The DX9 vertex shader on R300 also supports the use of code jumps. A hardware jump is done with the following code:

Jump condition, label

In this situation, condition defines the circumstances under which the jump should take place, while label identifies the instruction further down the code path to jump to. Of note is that backwards jumps are not allowed, nor are jumps into or out of loops.
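The two restrictions on jumps can be expressed as a small validator. This is a toy model: the tuple-based program representation and the instruction names are hypothetical and only stand in for real shader assembly.

```python
def validate_jumps(program):
    """Check a toy program for illegal jumps.

    `program` is a list of tuples: ("op",), ("label", name), ("jump", name),
    ("loop",) or ("endloop",). Loops are assumed to be properly nested.
    Returns a list of error strings (empty when every jump is legal).
    """
    labels = {}
    scopes = []            # enclosing loop ids for each instruction
    stack, next_id = [], 0
    for index, instr in enumerate(program):
        if instr[0] == "loop":
            stack.append(next_id)
            next_id += 1
        elif instr[0] == "endloop":
            stack.pop()
        scopes.append(tuple(stack))
        if instr[0] == "label":
            labels[instr[1]] = index

    errors = []
    for index, instr in enumerate(program):
        if instr[0] != "jump":
            continue
        target = labels.get(instr[1])
        if target is None:
            errors.append(f"{index}: unknown label {instr[1]}")
        elif target <= index:
            errors.append(f"{index}: backward jump to {instr[1]}")
        elif scopes[target] != scopes[index]:
            errors.append(f"{index}: jump crosses a loop boundary")
    return errors


forward = [("jump", "skip"), ("op",), ("label", "skip"), ("op",)]
backward = [("label", "top"), ("op",), ("jump", "top")]
into_loop = [("jump", "inside"), ("loop",), ("label", "inside"), ("endloop",)]
for program in (forward, backward, into_loop):
    print(validate_jumps(program))
```

Only the first program passes. The second jumps backwards and the third jumps into a loop body.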

The following chart demonstrates how R300’s vertex shader compares to the vertex shader 2.0 specification. As can be seen, R300 matches the specification fairly closely. As with the pixel shader, it is likely that the vertex shader 2.0 specification’s temporary registers are fewer in number due to another hardware manufacturer not meeting ATI’s level of support.

VS 2.0 vs. R300

                      Vertex Shader 2.0    R300
Instruction Slots     256                  256
Max Instructions      1024                 65280
Temp Registers        12                   32
Constant Registers    32                   32




SmoothVision 2.0 - Anti-Aliasing

ATI’s approach to anti-aliasing has been through their “SmoothVision” technique. With R200 this was a simple jittered super-sampling approach similar to the method found on 3dfx’s Voodoo5. The difference between the two lay in ATI’s use of multiple jitter patterns, where 3dfx used a fixed rotated grid. Selecting from one of several pattern options, ATI produced a more random anti-aliasing, capable of better covering aliasing artifacts.

R300’s approach to anti-aliasing has taken the anticipated leap forward with a move to multi-sampling. Multi-sampling is similar to super-sampling in that multiple sub-samples are used to determine anti-aliasing for the final image. Where both super-sampling and multi-sampling will sample depth values for every sub-pixel, multi-sampling will only sample sub-pixel colors for pixels falling along a triangle’s edge. All sub-pixels inside a triangle share the same color as the original pixel.

[image]





SIDEBAR: Look for the RADEON 8500 to return in the form of the RADEON 9100 soon.



SmoothVision 2.0 (cont’d)

The result of multi-sampling is edge-only anti-aliasing. Without sampling the color of non-edge sub-pixels, textures are unable to receive the anti-aliasing associated with super-sampling. However, skipping this work removes the fill-rate hit associated with super-sampling, freeing those clock cycles for more important tasks, such as pixel shader calculations.
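The fill-rate saving can be estimated with a simple cost model that follows the description above: super-sampling computes a color for every sub-pixel, while multi-sampling computes one color for interior pixels and per-sample colors only along edges. The edge fraction is an assumed input and will vary widely from scene to scene.

```python
def color_samples(width, height, samples, edge_fraction, multisample):
    """Count color evaluations for one frame.

    edge_fraction is the assumed share of pixels crossed by a triangle edge.
    """
    pixels = width * height
    if not multisample:
        return pixels * samples  # super-sampling shades every sub-pixel
    edge_pixels = round(pixels * edge_fraction)
    interior_pixels = pixels - edge_pixels
    # Interior pixels reuse one color; edge pixels need a color per sample.
    return interior_pixels + edge_pixels * samples


supersampled = color_samples(1024, 768, 4, 0.05, multisample=False)
multisampled = color_samples(1024, 768, 4, 0.05, multisample=True)
print(supersampled, multisampled, supersampled / multisampled)
```

With few edge pixels, 4x multi-sampling costs only slightly more color work than rendering without anti-aliasing. Depth testing is not counted here, since both techniques test depth for every sub-pixel.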

A problematic area of multi-sampling involves alpha-blended surfaces. Edge detection fails along these surfaces, leaving objects such as grass and leaves without anti-aliasing. Surfaces that use alpha testing, such as fences and grated walkways, remain aliased as well.



Of interest is ATI’s implementation of anti-aliasing gamma correction to smooth the color gradient between pixels. In the top image of our examples we see that gamma correction is not applied, with the resulting color gradient being very sharp. By applying gamma correction in the second image, the center pixels are brightened so as to provide a much smoother gradient. Doing so increases the smoothness of a surface’s edge by providing greater area for a slower color transition.

[image]
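The brightening effect of gamma-corrected blending is easy to reproduce. The sketch below assumes a simple power-law display with a gamma of 2.2, which is a common approximation and not necessarily the curve ATI's hardware uses.

```python
GAMMA = 2.2  # assumed display gamma


def blend_naive(samples):
    """Average the stored (gamma-encoded) sample values directly."""
    return sum(samples) / len(samples)


def blend_gamma_correct(samples, gamma=GAMMA):
    """Convert to linear light, average there, then re-encode for display."""
    linear = [s ** gamma for s in samples]
    return (sum(linear) / len(linear)) ** (1.0 / gamma)


# An edge pixel with half its samples on a black surface and half on white.
edge_pixel = [0.0, 0.0, 1.0, 1.0]
print(blend_naive(edge_pixel), blend_gamma_correct(edge_pixel))
```

The naive average stores a mid-gray that the display shows as considerably darker than halfway between black and white. The gamma-corrected result is a brighter value, which is why the intermediate edge pixels in the corrected image appear lighter.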


Texture Filtering

At the texture level, R300 supports bilinear, trilinear, and anisotropic filtering. While bilinear and trilinear filtering are nothing new, ATI’s anisotropic implementation is worth consideration. Where previous incarnations of SmoothVision’s anisotropic filtering were limited to bilinear sampling (sampling across a single mip-map level), version 2.0 finally allows for use with trilinear filtering. This implementation, while slower, provides greater quality by sampling across two mip-map levels and blending between them to prevent the nasty line artifacts that would often appear at mip-map boundaries.

A primary aspect to ATI’s historically superior anisotropic filtering performance has been the use of adaptive anisotropic filtering. Not every surface requires the same number of samples to be taken. Some surfaces offer no visual difference whether 4 or 64 samples are taken per-pixel. With that in mind, ATI dynamically calculates the level of anisotropic filtering to use on a surface, depending on its distance, size and rotation.
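ATI has not published its selection heuristic, so the following is only a generic sketch of the idea: estimate how stretched a pixel's footprint is in texture space and pick the smallest filtering level that covers it. The footprint axes are assumed inputs that would normally be derived from the surface's distance and rotation.

```python
def anisotropy_level(major_axis, minor_axis, max_level=16):
    """Pick the smallest power-of-two level covering the footprint's axis ratio."""
    ratio = major_axis / max(minor_axis, 1e-6)
    level = 1
    while level < ratio and level < max_level:
        level *= 2
    return level


def taps(level, trilinear=False):
    """Texel reads per pixel: four per bilinear probe, doubled when two
    mip-map levels are blended."""
    return level * 4 * (2 if trilinear else 1)


# A wall facing the viewer, a moderately tilted floor, and a floor seen
# at a grazing angle (footprint sizes are made up).
for major, minor in ((1.0, 1.0), (3.0, 1.0), (40.0, 1.0)):
    level = anisotropy_level(major, minor)
    print(level, taps(level))
```

The surface facing the viewer gets a single bilinear probe while the grazing-angle floor gets the full 16x treatment, which works out to the 64 taps mentioned in this article. Spending samples only where they make a visible difference is where the performance advantage comes from.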

With R200, adaptive anisotropic filtering presented a serious issue with surfaces at certain Z-axis rotations. The incorrect filtering level would be set because of this, and the result would be nothing more than standard bilinear filtering. ATI has since improved their algorithm on R300, increasing detection accuracy. The following image shows the quality difference between a standard bilinear filter, up through a 16x (64-tap) anisotropic filter.

[image]






SIDEBAR: The RADEON 9100 cards won’t be produced by ATI, rather their third-party board partners.



256-bit Memory Interface

As the second company to introduce the use of a 256-bit memory bus, ATI has placed more than the usual attention on memory bandwidth. Bandwidth efficiency becomes a factor with a memory bus as wide as 128 bits, let alone 256. To alleviate this problem ATI has taken on the use of multi-channel memory access.



ATI’s memory controller is very similar to NVIDIA’s crossbar controller. Providing four independent channels, ATI transfers 64-bit (128-bit effective with DDR memory) data blocks across each channel. These independent channels allow R300 to send and receive data simultaneously, increasing overall efficiency.

[image]


Yet where does the inefficiency come from in a single channel bus? While the bus itself is not inefficient, the potential lack of data makes it such. When using a single channel 256-bit DDR memory bus, 512-bits of data are theoretically transferred per-clock. In a single clock cycle, the chip might provide 64-bits of data while requiring 128-bits of data from local memory. With the first clock cycle 64-bits of data are sent to memory out of the 512 possible bits, creating an efficiency of 12.5%. In this clock cycle the chip stalls as it is waiting for the requested 128-bits of data. On the second clock cycle, the requested data is being retrieved, while still only providing an efficiency of 25% and the chip remains stalled until the following cycle when the data becomes available for use.

Use of a multi-channel memory controller alleviates this issue by simultaneously sending and receiving the 64 and 128 respective bits. Efficiency is increased to 37.5%, while the graphics processor’s stall time is reduced by one clock cycle.
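The percentages in this example can be reproduced directly. The request sizes are the ones used in the text, and the variable names are ours.

```python
BUS_BITS_PER_CLOCK = 512  # 256-bit DDR bus: two transfers per clock


def efficiency(bits_moved, bus_bits=BUS_BITS_PER_CLOCK):
    """Fraction of the bus's per-clock capacity that carries useful data."""
    return bits_moved / bus_bits


# Single channel: the 64-bit write and the 128-bit read take separate clocks.
single_channel = [efficiency(64), efficiency(128)]

# Multi-channel: both transfers are in flight during the same clock.
multi_channel = efficiency(64 + 128)

print([f"{e:.1%}" for e in single_channel], f"{multi_channel:.1%}")
```

The multi-channel figure is simply the two single-channel transfers sharing one clock, which is also why the stall is shortened by a cycle.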





SIDEBAR: ATI is rumored to be close to finishing RV350. This chip will bring much of the functionality present in RADEON 9700 to even lower price points.



HyperZ III

Being aware of the costs of depth complexity, ATI has put great effort into using available resources as efficiently as possible. Though not as efficient in occlusion culling as deferred rendering, ATI has thus far developed the next best thing.



The improvements found in HyperZ III have surprisingly received very little discussion, but they are certainly worth noting. The first of these is in occlusion culling. R200-based solutions used a hierarchical Z-buffer to determine occluded pixels. In doing this, they would test the visibility of an 8x8 pixel block based on what was already rendered. To determine this, the farthest visible pixel was stored on chip as a reference value for the target location of the new set of pixels. The nearest vertex of the incoming polygon was compared to the reference value to determine visibility. If the nearest vertex fell behind the reference value, the block was determined to be occluded. If it did not, however, the entire block was rendered.

With such an implementation the ability to determine occlusion is severely limited. Let us assume that a single pixel out of the 8x8 pixel block is nearer to the viewer than the reference value. The entire 8x8 block must then be rendered. This results in a lot of wasted rendering, so ATI has taken steps to avoid this.

With R300 the first thing to note is that ATI no longer uses the nearest vertex to test for visibility. Rather, the minimum and maximum Z values of the polygon (or the portion of it) within the block are used. This allows for more accurate testing, as the nearest vertex might not accurately reflect the actual block being tested, resulting in pixels being rendered wastefully. At this point, the 8x8 block is tested. If it is determined to be occluded, the block is culled and the next block is considered. However, unlike with R200, rendering does not take place at this point if the block is determined to be visible.

When a block remains visible after a visibility check, the 8x8 block is sub-divided into four 4x4 blocks. These blocks are tested for visibility just as the 8x8 block was. If any of these blocks is determined to be visible, occlusion tests go to the pixel level with early Z compares.

Early Z compares simply add an additional Z check just prior to the pixel pipeline. As with a final Z compare, the depth value of the existing pixel is read and compared to that of the new pixel. If the new pixel is determined to be visible, it is then rendered and displayed. If, however, it is occluded, it is culled.

One might question why ATI goes through the work of using a hierarchical Z-buffer when the last stage of HyperZ III is testing at the pixel level. Efficiency is the answer to this. Early Z checks must read each individual Z value from the Z buffer, then do a compare to determine visibility. This Z read requires bandwidth and bandwidth is what ATI is trying to conserve. A hierarchical Z can store the reference values on-chip, requiring no access to local memory (or, if stored in local memory, only a single read is required for the entire block). The result is little or no bandwidth being used while potentially culling an 8x8 or at least a 4x4 block.
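The three-stage test can be sketched in software as follows. This is a simplified model under our own assumptions: smaller Z means nearer to the viewer, the incoming polygon covers the whole block, and the block's farthest stored depth is recomputed from the Z-buffer here, whereas the hardware keeps it as an on-chip reference value.

```python
def farthest(zbuffer, x0, y0, size):
    """Farthest depth already stored for a block (an on-chip value in hardware)."""
    return max(zbuffer[y][x] for y in range(y0, y0 + size)
               for x in range(x0, x0 + size))


def nearest(incoming, x0, y0, size):
    """Nearest depth of the incoming polygon within the block."""
    return min(incoming[y][x] for y in range(y0, y0 + size)
               for x in range(x0, x0 + size))


def visible_pixels(zbuffer, incoming, x0=0, y0=0, size=8):
    """Return the (x, y) pixels of `incoming` that survive all three stages."""
    # Block-level test: cull everything if even the nearest new depth is hidden.
    if nearest(incoming, x0, y0, size) >= farthest(zbuffer, x0, y0, size):
        return set()
    # Still potentially visible: split the 8x8 block into 4x4 sub-blocks.
    if size > 4:
        half = size // 2
        result = set()
        for dy in (0, half):
            for dx in (0, half):
                result |= visible_pixels(zbuffer, incoming, x0 + dx, y0 + dy, half)
        return result
    # Early Z: the per-pixel compare, which needs a Z read for every pixel.
    return {(x, y) for y in range(y0, y0 + size) for x in range(x0, x0 + size)
            if incoming[y][x] < zbuffer[y][x]}


zbuffer = [[0.5] * 8 for _ in range(8)]
hidden = [[0.9] * 8 for _ in range(8)]
poking_through = [row[:] for row in hidden]
poking_through[3][2] = 0.1  # a single pixel nearer than the stored depth
print(visible_pixels(zbuffer, hidden), visible_pixels(zbuffer, poking_through))
```

In the second case three of the four sub-blocks are rejected without any per-pixel compare, and only the sub-block containing the near pixel reaches the early Z stage.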

[image]





SIDEBAR: HYPERZ was first introduced in the original RADEON family two years ago.



HyperZ III (cont’d)

Compression

Z-buffer compression is another important aspect of HyperZ III, in that it can substantially reduce bandwidth requirements, especially when anti-aliasing is enabled. The compression unit takes a pixel block into consideration and generates a plane equation from its slope. Instead of storing a Z value for each pixel, the plane equation is stored and used to calculate the depth value of any pixel within the block. The compression ratio peaks at 4:1, a best case that can only be met with one or two triangles within the block. This will often be achieved in environments where larger triangles are used, but characters will not achieve this level of compression.
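The plane-equation idea can be demonstrated on a small block of integer depth values. This sketch only shows why a block covered by a single triangle is losslessly recoverable from three numbers. It does not model ATI's storage format or the two-triangle case, so the ratios it implies should not be compared with the 4:1 figure.

```python
def fit_plane(block):
    """Derive z = a*x + b*y + c from three samples in the block's corner."""
    c = block[0][0]
    a = block[0][1] - c  # change in z per step in x
    b = block[1][0] - c  # change in z per step in y
    return a, b, c


def compress(block):
    """Return the plane if it reproduces every depth value exactly, else None."""
    a, b, c = fit_plane(block)
    for y, row in enumerate(block):
        for x, z in enumerate(row):
            if a * x + b * y + c != z:
                return None  # more than one surface in the block: store raw
    return a, b, c


def decompress(plane, size):
    """Rebuild every depth value in the block from the stored plane."""
    a, b, c = plane
    return [[a * x + b * y + c for x in range(size)] for y in range(size)]


# One large triangle covering the block, then the same block with a bump.
flat = [[3 * x + 5 * y + 100 for x in range(8)] for y in range(8)]
bumpy = [row[:] for row in flat]
bumpy[4][4] += 1
print(compress(flat), compress(bumpy))
```

The single-surface block collapses to one plane, while the block with a stray depth value has to fall back to uncompressed storage. This mirrors why large environment triangles compress well and finely tessellated characters do not.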

[image]


Enabling anti-aliasing increases the theoretical compression ratio as more samples are used within each triangle. Theoretically, the compression ratio is increased to a peak of 24:1 in cases where 6x AA is being used. Being a peak, a 24:1 compression ratio should not be consistently expected. With that said, enabling anti-aliasing allows for the use of a new feature, namely color compression.



HyperZ III’s color compression is not much of a surprise, as its design is somewhat obvious. With multi-sampling anti-aliasing, each sub-pixel that does not fall along a triangle edge has a color identical to the original pixel. This same color value is then stored 2, 4 or 6 times, depending on the level of anti-aliasing. With the values being identical to the original value, storing more than one value is redundant. Examining a pixel block, the compression algorithm first determines if any of the pixels within the block fall along a triangle edge. If none of them do, a single color value is stored per pixel rather than one per sub-pixel, resulting in a compression ratio of 2x, 4x or 6x, depending on the respective anti-aliasing level. If a triangle edge is located along any of the pixels within the block, compression is either reduced or eliminated, depending on the specifics of the situation.
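A minimal sketch of that decision looks like this. For simplicity it treats any pixel whose sub-samples differ as an edge pixel and gives up on compression for the whole block, whereas the text notes that real hardware may only reduce it. The data layout is our own.

```python
def compress_block(block):
    """Compress a block of multi-sampled pixels.

    `block` is a list of pixels, each a list of sub-sample colors.
    Returns (data, ratio): one color per pixel when no pixel contains
    differing sub-samples, otherwise the block unchanged with ratio 1.
    """
    if all(len(set(samples)) == 1 for samples in block):
        return [samples[0] for samples in block], len(block[0])
    return block, 1


red, blue = (255, 0, 0), (0, 0, 255)
interior = [[red] * 6 for _ in range(4)]             # 6x AA, inside a triangle
edge = [[red] * 6, [red, red, red, blue, blue, blue],
        [blue] * 6, [blue] * 6]                      # one pixel straddles an edge

print(compress_block(interior)[1], compress_block(edge)[1])
```

The interior block shrinks by a factor equal to the sample count, matching the 2x, 4x or 6x figures above.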

[image]






SIDEBAR: HYPERZ III offers lossless z-buffer compression up to 24:1.



Conclusion

Having taken an overview of R300’s primary 3D functionality blocks it becomes apparent that ATI put considerable time into R300’s design. With such complex designs, many months of planning are required before code work can even begin. The architecture must be laid out and planned for an optimal design. Specifications of each unit must be set in place and new technologies must be developed.



With GeForce FX right around the corner, NVIDIA has certainly answered back to R300. Early benchmark results released by NVIDIA have shown it to have a considerable performance advantage, while its specification makes it clearly superior in functionality (though it must be admitted that the additional functionality exposed within NV30 will have no real bearing on gamers). Yet ATI is expected to answer back with R350, which will likely be a simple evolution of R300. Of course, it is only with time and a little patience that we will discover where 3D hardware will take us and who will lead the industry.






© Copyright 2003 FS Media, Inc.