Summary: In the first part of our dissection article, Dave examined the theory behind the RADEON 9700 PRO's architecture. In part two, Dave takes a closer look at the performance of each subunit, including the pixel and vertex shaders, as well as ATI's HYPERZ III. See how they perform in the conclusion to this series!
In our original article on the dissection of R300 we considered the key elements of the 3D pipeline, including the pixel and vertex shading. We also looked at some of the more advanced features, including HyperZ III and SmoothVision 2.0. In doing that, we came to a better understanding as to how R300 operates.
When we originally planned this article, it was our intent to show how each of the different aspects of HyperZ brought about performance benefits. There was no anticipated problem with this, as a variety of registry strings were made available to allow for enabling and disabling these features.
Unfortunately, sometimes things simply do not work as planned, and this was one such case. While the necessary strings exist in the registry, we found that ATI appears to have disabled them: changing them seemingly accomplishes nothing. Altering certain values yielded inconsistent results, and generally no performance change at all. Because of this, we decided to take a slightly different approach to this article.
While we were originally going to focus on the performance benefits of the new techniques used by R300, we will instead consider what really matters: real world performance. We’re not talking about regular benchmarks, testing how fast game X is or how fast game Y is, though there is a little of that in our 3DMark tests. Rather, we will break down and closely examine R300’s capabilities in the pixel shader, vertex shader, and other areas. Doing this will allow us to find R300’s strong and weak areas.
R300 has, without a doubt, the most powerful pixel shader out there. Though only allowing for a single pixel in each rendering pass, it can write up to 8 pixels per clock, with 3 operations in each cycle. First, let us consider R300’s peak fill-rate in an optimal environment.
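As a rough illustration of where a theoretical peak figure comes from, the sketch below multiplies the 8 pixels per clock mentioned above by the core clock. The 325 MHz value is an assumption (the reference core clock of the RADEON 9700 PRO, which this article does not state), and the function name is purely illustrative.

```python
def peak_fill_rate_mpixels(pixels_per_clock: int, core_clock_mhz: float) -> float:
    """Theoretical peak pixel fill-rate in megapixels per second."""
    return pixels_per_clock * core_clock_mhz


PIXELS_PER_CLOCK = 8      # from the article: up to 8 pixels written per clock
CORE_CLOCK_MHZ = 325.0    # ASSUMPTION: reference core clock, not given in the article

rate = peak_fill_rate_mpixels(PIXELS_PER_CLOCK, CORE_CLOCK_MHZ)
print(f"Theoretical peak: {rate:.0f} Mpixels/s")
```

Measured fill-rate will normally fall short of this figure, since memory bandwidth and texturing also limit throughput.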
To delve further into R300’s pixel shading capability, we ran 3DMark 2001’s shader tests. For this, we are looking at performance in the normal and advanced pixel shader test, as well as EMBM and DOT3 bump-mapping. With bump-mapping operations occurring within the pixel shader, examining EMBM and DOT3 is very revealing.
Interestingly, DOT3 performance is lower than in our other tests, which is likely due to the more complex lighting in the scene. R300’s high bandwidth shows particularly in the advanced pixel shader test, which makes heavy use of alpha blending, a notoriously bandwidth-intensive operation.
By using NVIDIA’s Chameleon Mark test, we can really see how pixel shader complexity impacts overall performance. The Real test shows a notable drop in performance compared with both the Shiny and Glass tests. While the difference is not critical, at the upper resolutions it amounts to roughly 20 fps. With even greater shader complexity, we can expect the performance gap to be wider still.
R300’s vertex shader is important in that it provides not only vertex shading capability but also static T&L operations. The performance of any vertex shader varies between applications. There are a variety of ways to handle geometry data, and each vertex shader is optimized for certain methods, so both the method an application uses and the quality of its code affect overall performance. With this in mind, we have run several different geometry tests to examine overall performance.
With static T&L having been considered, what of vertex shader calculations? Just as with a pixel shader, vertex shader code complexity has a major impact on performance. While R300’s vertex shader is capable of running approximately 65,000 instructions, vertex shader performance can actually be reduced to a crawl when compared to even the most complex pixel shader programs. The following performance numbers are from 3DMark 2001 SE’s vertex shader performance test:
The primary gain found with HyperZ III is from R300’s hierarchical Z-buffer. In the first article we discussed exactly how this operates, and in doing that we noted how scene rendering order was critical for the use of it. The reason behind this is quite simple, in that the detection algorithm relies on what has already been rendered to detect if the next pixel will be visible. If the scene renders back-to-front we find that every new pixel will come closer to the viewer, so no pixel can ever be culled. On the other hand, if the scene is rendered front-to-back, the nearest layer is rendered first, thus allowing for the detection of all non-visible pixels behind this front layer.
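The effect of rendering order can be sketched with a toy model. The code below is not R300's hierarchical Z algorithm, which works on blocks of pixels in hardware. It is a simplified per-pixel depth test, with hypothetical names, that only shows why draw order decides how much can be rejected.

```python
def count_shaded_and_culled(layer_depths):
    """Draw full-screen layers over one pixel in the given order.

    Returns (shaded, culled): how many layers passed the depth test and
    had to be rendered, and how many were rejected as hidden.
    Smaller depth values are closer to the viewer.
    """
    nearest = float("inf")  # depth currently stored for this pixel
    shaded = culled = 0
    for depth in layer_depths:
        if depth < nearest:   # closer than anything drawn so far: visible
            nearest = depth
            shaded += 1
        else:                 # behind existing geometry: can be culled
            culled += 1
    return shaded, culled


layers = [1.0, 2.0, 3.0, 4.0]  # hypothetical depths, nearest first

front_to_back = count_shaded_and_culled(layers)
back_to_front = count_shaded_and_culled(list(reversed(layers)))

print("front-to-back:", front_to_back)  # (1, 3): only the front layer is shaded
print("back-to-front:", back_to_front)  # (4, 0): every layer is shaded
```

With four overlapping layers, front-to-back order shades one layer and rejects three, while back-to-front order shades all four and rejects none.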
With R300’s use of multi-sampling and both color and Z-buffer compression, anti-aliasing no longer has the performance impact that it once did. In the days of super-sampling, there was little getting around the 75% performance loss, as it was all eaten up in fill-rate. With multi-sampling requiring only minimal additional fill-rate, the performance loss associated with such can be dramatically reduced. The following charts show the performance of anti-aliasing with 3DMark’s Complex Race Scene:
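The 75% figure follows from simple arithmetic if we assume it refers to 4x super-sampling in a purely fill-rate-bound scene; both are assumptions, as the article does not state the sample count. A minimal sketch with illustrative function names:

```python
def supersampling_fill_rate_loss(samples: int) -> float:
    """Fraction of effective fill-rate lost when every sample is fully rendered."""
    return 1.0 - 1.0 / samples


def shaded_fragments_per_pixel(samples: int, multisampling: bool) -> int:
    """Super-sampling shades every sample; multi-sampling shades once per pixel."""
    return 1 if multisampling else samples


SAMPLES = 4  # ASSUMPTION: the sample count behind the 75% figure

print(f"Super-sampling loss at {SAMPLES}x: {supersampling_fill_rate_loss(SAMPLES):.0%}")
print("Shaded fragments per pixel, super-sampling:",
      shaded_fragments_per_pixel(SAMPLES, multisampling=False))
print("Shaded fragments per pixel, multi-sampling:",
      shaded_fragments_per_pixel(SAMPLES, multisampling=True))
```

Multi-sampling still stores and reads the extra color and Z samples, which is where the compression mentioned above helps.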
Anisotropic filtering causes a similar loss in performance, with greater texture sampling numbers and more complex filtering operations. While the quality benefit is certainly dramatic, it does not come for free. Here are 16x (64-tap) anisotropic filtering results from the same scene.
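The "16x (64-tap)" label can be read as a worst-case texture sample count. The sketch below assumes each of the 16 anisotropic probes is a 4-tap bilinear fetch, an assumption inferred from the 64-tap figure rather than stated in the article.

```python
def max_taps_per_pixel(aniso_level: int, taps_per_probe: int = 4) -> int:
    """Worst-case texture taps for one pixel at a given anisotropy level."""
    return aniso_level * taps_per_probe


# ASSUMPTION: 4 taps per probe (bilinear), inferred from "16x (64-tap)".
for level in (1, 2, 4, 8, 16):
    print(f"{level:>2}x anisotropic: up to {max_taps_per_pixel(level)} taps per pixel")
```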
Combining both anisotropic filtering and anti-aliasing delivers ideal image quality results, though with the greatest performance loss.
It is clearly apparent how multi-sampling aids in maintaining performance. In our first article we noted R300’s use of an adaptive anisotropic filtering technique, and the benefits of this are very apparent. Without this technique we can rest assured that performance numbers with 16x anisotropic and 6x multi-sampling would likely end up in the single digits.
As we stated in the beginning of this article, we had hoped to more closely examine certain aspects of HyperZ in this article. It simply was not possible to do so, even after requests to ATI for assistance. However, we were still able to take a closer look at the deeper aspects of R300’s performance. We certainly found that HyperZ provides dramatic benefits with front-to-back and even random rendering orders.
© Copyright 2003 FS Media, Inc.