FS Guides: Occlusion Culling
January 14, 2002 Tony Smith
Summary: Everyone knows about Transform and Lighting, fill rate, and shaders, but one of the oldest principles of 3D rendering is getting a lot more attention today - the Z, or depth, buffer. Creative engineering has yielded a number of ways to speed up the display of 3D graphics through manipulation of depth information, and here's how modern 3D accelerators do it.
Introduction
What you can’t see can hurt you
Occlusion culling is becoming a more and more important part of complex 3D graphics rendering, not only because applications now exhibit higher depth complexity, but because the scenes rendered are also more complex at the pixel level. For example, with a highly depth-complex scene and advanced pixel shaders, multiple clock cycles of extremely complex calculations and multiple texture lookups might be spent on objects that are never even seen. It is no longer just a matter of rendering one or two textures on an occluded surface; a considerable number of clock cycles can be wasted on what are effectively non-existent objects.
With that in mind, this article will examine a few different implementations of hardware occlusion culling. Certainly there are plenty of software algorithms for this as well, and they can help dramatically, depending on their efficiency. At the same time, however, no software algorithm presently in use is effective at removing all depth complexity, so hardware occlusion culling can offer considerable gains on top of software techniques. In this article, we will consider early Z checking, hierarchical Z-buffering, and deferred rendering.
SIDEBAR: All current-generation 3D accelerators make some use of occlusion-culling technology. As scenes become more complex, the importance of rendering optimization increases correspondingly, and even obscure Z-buffer operations are starting to get the ™ treatment.
Early Z Checks
An ounce of prevention…
Early Z checking (also known as Early Z Out) is a fairly basic idea that works around one of the inefficiencies of a traditional rendering pipeline. In a traditional pipeline, texturing operations occur first, with the final result typically being a pixel written to the color buffer. To determine whether this write should occur, a depth (Z) test is performed to determine whether or not the pixel is visible. If the Z value of the new pixel is less than the buffered value, the pixel is visible and is written to the buffer. If the Z value is greater than the buffered value, the pixel is occluded and is discarded. The inefficiency is that the pixel's visibility is determined only after its color value has been finalized, wasting texture bandwidth and fill rate on pixels that will never be seen.
Early Z checks add an additional Z compare early in the pixel pipeline -- before texturing operations are performed. This initial Z compare attempts to determine whether or not the pixel is visible, just as a traditional Z compare does. If it can establish that the pixel is not visible (by reading from the Z-buffer that an existing pixel has a nearer depth), the pixel is culled and texturing operations are never performed.
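To make the difference concrete, here is a minimal software sketch of a pixel pipeline with both tests. Framebuffer, shadePixel(), and writePixel() are hypothetical stand-ins for hardware stages; real accelerators do this in fixed-function logic, not C++:

```cpp
#include <cstdint>
#include <vector>

struct Framebuffer {
    int width = 0, height = 0;
    std::vector<float>    depth; // Z-buffer, one float per pixel
    std::vector<uint32_t> color; // color buffer
};

// Stand-in for the expensive texturing/shading stage.
uint32_t shadePixel(int, int) { return 0xFFFFFFFFu; }

void writePixel(Framebuffer& fb, int x, int y, float z, bool earlyZEnabled)
{
    size_t i = size_t(y) * fb.width + x;

    // Early Z: reject the pixel *before* any texturing work is done.
    // Only safe when the shader cannot modify Z later in the pipe.
    if (earlyZEnabled && z >= fb.depth[i])
        return; // occluded -- no texture fetches, no shading cost

    uint32_t c = shadePixel(x, y); // texture bandwidth and fill rate spent here

    // Late Z: the traditional test, still required in the general case.
    if (z < fb.depth[i]) {
        fb.depth[i] = z;
        fb.color[i] = c;
    }
}
```

The point of the sketch is simply where the rejection happens: before shadePixel() runs, rather than after.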
When a graphics scene is rendered, it is often rendered in a specific order. Some applications render the scene back-to-front, where the objects furthest away are drawn first and progressively nearer objects follow. This is typically the worst case for a 3D accelerator, because a color write must occur for every pixel at every depth. Other applications use a front-to-back render order, where the objects nearest the viewer are drawn first. This situation is optimal for the hardware, as it requires only a single color write per pixel location. The final situation is random-order rendering, where scene objects come down the pipeline in no particular order. This is sub-optimal, though better than back-to-front, as only some pixel color writes must occur more than once.
For early Z checking, front-to-back ordering is optimal, just as on a traditional pipeline. By rendering front-to-back, all visible portions of the scene are drawn immediately, allowing the early Z compare to read the needed depth values and cull any pixel that does not sit at the nearest depth. Random-order rendering still allows some pixels to be culled, but the exact gain is impossible to pin down, as it varies from scene to scene. An application that renders back-to-front will see no gain at all from early Z checking. In that case, if the hardware does not disable early Z checking (assuming a bandwidth-limited product, as all current hardware is today), there would actually be a loss in performance. Why is this?
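On the application side, exploiting this is largely a matter of submission order. A minimal sketch, assuming each opaque object carries a precomputed view-space depth (a hypothetical field), might look like this:

```cpp
#include <algorithm>
#include <vector>

struct Object { float viewDepth; /* geometry, materials, ... */ };

// Nearest-first submission lets the early Z test reject occluded pixels.
void sortFrontToBack(std::vector<Object>& opaque)
{
    std::sort(opaque.begin(), opaque.end(),
              [](const Object& a, const Object& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```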
Just as on a traditional rendering pipeline, hardware that supports early Z checks must still perform a Z compare late in the pipeline. This is true for a variety of reasons, such as the possibility of a pixel shader modifying the Z value of a pixel. Another possibility is that an application might modify the depth value of a texture using the TexDepth function in DX8.1 Pixel Shaders 1.3 and 1.4. In such a case, the hardware would not only need a late Z compare, but would also want to disable the early Z compare for any primitive associated with the TexDepth modification.
SIDEBAR: Some light (or otherwise) reading is available on the subject at graphbib.org
Hierarchical Z-buffer
HyperZ – Theory and application
Graphics guru Ned Greene originally developed hierarchical Z-buffering for quickly rendering scenes with very high depth complexity. The general concept is to use multiple Z-buffer resolutions to progressively determine object visibility; Greene's original implementation used octrees to do this efficiently.
Of today's 3D graphics accelerators, only ATI's use hierarchical Z-buffering. While it is not the complete implementation that Greene originally proposed, it can offer considerable gains in texturing efficiency. ATI's hierarchical Z-buffering is part of what they call HyperZ -- a collection of hardware features designed to reduce bandwidth requirements and increase rasterization performance.
The hierarchical Z-buffering implementation works by keeping a reference Z value for every 8x8 pixel block of the Z-buffer, with the blocks determined by tiling the buffer. Each reference value must be the deepest (furthest) value of all pixels in its block. Together these values form a low-resolution Z-buffer, which is then used to make a rough visibility estimate.
Initially, the Z-buffer is cleared, filled entirely with the maximum (farthest) depth value. To see any benefit from hierarchical Z, at least one object must be rendered; we'll assume a single triangle is rendered that covers the entire screen. For each 8x8 block the triangle covers, the deepest Z value is kept as the reference value. As the next object comes down to be rendered, it too is broken down into 8x8 pixel blocks. The nearest pixel Z value of each new block (ATI actually derives this conservatively from the block's vertices) is compared against the reference value of the existing 8x8 block at the same screen location. If the existing block's reference value is found to be closer to the viewer than the new block's nearest value, the new block is culled and the next block is compared. Otherwise the new block may be visible and must be rendered.
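As a rough sketch of that comparison, assuming one reference value per 8x8 tile -- the structure and names below are illustrative, since ATI has not published the actual hardware layout:

```cpp
#include <vector>

struct HierZ {
    int tilesX = 0, tilesY = 0;
    std::vector<float> maxZ; // deepest Z seen in each 8x8 tile

    // Coarse test: a new block whose *nearest* depth is blockMinZ can only
    // be visible if it is nearer than the *deepest* pixel already in the tile.
    bool blockMayBeVisible(int tx, int ty, float blockMinZ) const {
        return blockMinZ < maxZ[size_t(ty) * tilesX + tx];
    }

    // After a block is rendered, the tile's reference value is refreshed
    // to the new deepest Z in the tile (computed by the caller here).
    void updateTile(int tx, int ty, float newMaxZ) {
        maxZ[size_t(ty) * tilesX + tx] = newMaxZ;
    }
};
```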
The primary issue, as some may have noted, is that if even a single pixel in the 8x8 block is nearer to the viewer than the deepest existing one, the entire block must be rendered. This reduces the total gain of hierarchical Z in comparison to early Z checking. On the other hand, with each reference value stored on-chip (or requiring only a single value lookup per block), the memory bandwidth requirements are dramatically lower.
SIDEBAR: www.ngreene.com - no need for Z-ordering here. Not even a hidden message in the source.
Deferred Rendering
The black sheep of the Z family
Deferred rendering is truly different from any traditional architecture; of the technologies currently out there, it is likely the most radical of them all. It is, of course, what is found in PowerVR, as well as in others such as Gigapixel. ATI is believed to be working on a deferred rendering architecture as well, and they have recently received several patents in the area. Deferred rendering's basic concept is to delay rendering each scene by one frame, making it possible to do additional work on the scene -- namely, removing hidden surfaces.
The first stage of deferred rendering is known as “sorting and binning.” In this stage, the hardware determines where geometry is located within the scene (think of it as a rough rendering of the scene) and writes that geometry to a scene buffer. The actual location of the geometry within the scene is stored in bins as pointers: the scene is broken into tiles of 32x16 or 32x32 pixels, and each tile's pointer list records which geometry falls within it.
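A simplified sketch of this binning stage, assuming 32x32 tiles and binning by each triangle's screen-space bounding box (Triangle, binScene(), and the bin layout here are illustrative, not a description of any shipping hardware):

```cpp
#include <algorithm>
#include <vector>

struct Vec2     { float x, y; };
struct Triangle { Vec2 v[3]; /* plus attributes in a real system */ };

constexpr int kTileSize = 32;

// Bin every triangle into the list of each tile its screen-space bounding
// box touches; bins[tileIndex] holds pointers into the scene buffer.
std::vector<std::vector<const Triangle*>>
binScene(const std::vector<Triangle>& tris, int width, int height)
{
    int tilesX = (width  + kTileSize - 1) / kTileSize;
    int tilesY = (height + kTileSize - 1) / kTileSize;
    std::vector<std::vector<const Triangle*>> bins(tilesX * tilesY);

    for (const Triangle& t : tris) {
        // Screen-space bounding box, clamped to the render target.
        float minX = std::min({t.v[0].x, t.v[1].x, t.v[2].x});
        float maxX = std::max({t.v[0].x, t.v[1].x, t.v[2].x});
        float minY = std::min({t.v[0].y, t.v[1].y, t.v[2].y});
        float maxY = std::max({t.v[0].y, t.v[1].y, t.v[2].y});

        int tx0 = std::max(0, int(minX) / kTileSize);
        int ty0 = std::max(0, int(minY) / kTileSize);
        int tx1 = std::min(tilesX - 1, int(maxX) / kTileSize);
        int ty1 = std::min(tilesY - 1, int(maxY) / kTileSize);

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(&t);
    }
    return bins;
}
```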
After the binning stage, rendering takes place one tile at a time. For each tile, the hardware pulls up the tile's pointer list to see what geometry is needed, places that geometry into the tile buffer, and then determines pixel visibility. There are different ways to do this, depending on the architecture, but in each case the occluded pixels can be identified and culled from the scene, and only the visible pixels are then textured.
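Continuing the binning sketch above, the per-tile pass can resolve depth for the whole tile in an on-chip buffer before a single texture fetch happens. rasterizeDepth() and shadeVisible() below are hypothetical stand-ins for hardware stages:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Triangle; // as in the binning sketch above

// Hypothetical hardware stages, stubbed out for the sketch:
void rasterizeDepth(const Triangle&, int, int, float*)       { /* Z-only pass */ }
void shadeVisible  (const Triangle&, int, int, const float*) { /* texture pass */ }

constexpr int kTileSize = 32;

void renderTiles(const std::vector<std::vector<const Triangle*>>& bins,
                 int tilesX, int tilesY)
{
    std::vector<float> tileZ(kTileSize * kTileSize);
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx) {
            // Clear the on-chip tile Z-buffer to the far plane.
            std::fill(tileZ.begin(), tileZ.end(),
                      std::numeric_limits<float>::max());

            // Pass 1: depth only. Afterwards tileZ holds the nearest Z at
            // every pixel of the tile -- visibility is fully resolved.
            for (const Triangle* t : bins[size_t(ty) * tilesX + tx])
                rasterizeDepth(*t, tx, ty, tileZ.data());

            // Pass 2: texture and shade only the surviving pixels, then
            // write the finished tile to the framebuffer exactly once.
            for (const Triangle* t : bins[size_t(ty) * tilesX + tx])
                shadeVisible(*t, tx, ty, tileZ.data());
        }
}
```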
Of course, everyone’s got their own ideas
It is worth noting that PowerVR handles scene data somewhat differently. With PowerVR, a scene is converted into “Infinite Planes,” a different mathematical representation of the scene's surfaces. To determine pixel visibility, PowerVR uses a process known as ray casting: a ray is shot through each pixel to find the first visible opaque surface, and any pixels beyond that point are culled. To keep performance up, the ray casting is done by a massively parallel system, shooting separate rays through many pixels at once. The end result is the same as with any other approach; PowerVR simply takes a slightly different route to it.
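As a very loose illustration of the idea only: cast a ray against a set of planes and keep the nearest forward hit. PowerVR's actual infinite-plane representation and its massively parallel evaluation are considerably more involved than this generic intersection loop:

```cpp
#include <limits>
#include <vector>

struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; int surfaceId; }; // plane: dot(n, p) + d = 0

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns the id of the nearest opaque surface hit by the ray, or -1.
// Everything beyond that hit is culled before texturing.
int castRay(Vec3 origin, Vec3 dir, const std::vector<Plane>& planes)
{
    float nearestT = std::numeric_limits<float>::max();
    int   nearest  = -1;
    for (const Plane& p : planes) {
        float denom = dot(p.n, dir);
        if (denom == 0.0f)
            continue; // ray parallel to this plane
        float t = -(dot(p.n, origin) + p.d) / denom;
        if (t > 0.0f && t < nearestT) { // in front of the viewer, and nearer
            nearestT = t;
            nearest  = p.surfaceId;
        }
    }
    return nearest;
}
```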
The first advantage we see is that only the visible pixels are rendered; all hidden pixels are removed before texturing. The next advantage is that multiple frame-buffer accesses are no longer needed -- there is only a single write per tile. It is possible to approach this on a traditional architecture in most cases, but here the result is a considerable reduction in color buffer reads and writes. Since the Z-buffer is on-chip, using a 32-bit Z-buffer consumes no additional external bandwidth either. Another advantage is that multi-sample anti-aliasing can be done nearly for free, the only difference being a need for more, smaller tiles and thus more bins. There are other advantages as well, but these are some of the key ones.
Deferred rendering is not perfect, though. PowerVR, for example, has had a number of hardware-related issues with certain applications. Their driver team has been fairly successful at addressing many of these, but they still crop up. Many of the issues are addressed by simply disabling certain features of the deferred architecture, bringing it closer to the behavior of an immediate-mode renderer; this, however, reduces the benefits of the architecture too. There is also discussion that deferred rendering may eventually have trouble with highly complex scenes, where large amounts of geometry must be binned. There are ways to address such issues, but the eventual outcome remains to be seen.
SIDEBAR: Polar bears are left handed.
Conclusion
Certainly there are advantages and disadvantages to all of these implementations. With a careful implementation, the disadvantages need never outweigh the advantages; with a sloppy one, the result can be more bad than good. All of them have the ability to remove hidden surfaces, and that is exactly the point.
At the same time, there is nothing to say that multiple implementations cannot be used simultaneously. For example, one might use early Z checking to remove surfaces prior to binning on a deferred rendering architecture. That is just one possibility among the many that hardware developers will, and certainly do, investigate.
The end result, of course, is removing as much redundant and unused information as possible in order to arrive at the fastest, most efficient rendering scheme possible. In today's persistently fill-rate- and bandwidth-limited world, the one thing we can all be sure of is more research and a heavier reliance on occlusion culling in future-generation hardware.
SIDEBAR: Occlusion Culling – interesting technology or dead-end theory? Let us know in the FS News Comments!