Dual Graphics Engines, Tessellation, ROPs and More
AMD is introducing asynchronous dispatch for GPU compute with the 6900 series. The GPU can execute multiple compute kernels simultaneously, with each kernel having its own protected virtual address space and its own command queue. While Nvidia’s Fermi architecture can handle parallel kernels, it must switch between them in order to process them. For AMD, asynchronous dispatch means not only that multiple kernels can be spawned from a single thread, but that the GPU can manage multiple different applications completely independently at the same time. This opens the door to the GPU running several applications at once. While this is not something covered under Direct3D 11 (DX11), Baumann stated that “AMD will look to add extensions to expose this through OpenCL.”
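To get a feel for why running kernels side by side matters, here is a rough back-of-the-envelope model in Python. It is only a sketch under loud assumptions: every work-group takes one equal "time step", a work-group occupies exactly one SIMD, and the kernel names, work-group counts, and SIMD count are made up for illustration rather than taken from AMD. It compares running kernels one at a time against letting all of them share the SIMDs at once.

```python
import math

def serial_steps(kernels, simds):
    """Time steps needed when only one kernel may occupy the GPU at a time."""
    # Each kernel runs alone, so idle SIMDs cannot be used by other kernels.
    return sum(math.ceil(groups / simds) for groups in kernels.values())

def concurrent_steps(kernels, simds):
    """Time steps needed when work-groups from all kernels share the SIMDs."""
    # All work-groups are pooled, so free SIMDs are filled by any kernel.
    return math.ceil(sum(kernels.values()) / simds)

# Hypothetical kernels: name -> number of work-groups (illustrative values).
kernels = {"physics": 8, "video_filter": 10, "ai_pathing": 6}
SIMDS = 24  # assumed SIMD count for this toy model

print("one kernel at a time:", serial_steps(kernels, SIMDS), "steps")
print("kernels side by side:", concurrent_steps(kernels, SIMDS), "steps")
```

In this toy case, three small kernels that each leave most of the chip idle finish in one step instead of three when they run together. Real hardware scheduling is far more involved, so treat the numbers as an illustration of the idea and not as a performance prediction.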
AMD is utilizing two bidirectional DMA engines on the PCI Express interface. This means that the 6900 series cards can perform multiple concurrent data transfers across the PCI Express bus. Separately, the GDDR5 memory runs at data rates of 5.5Gbps on the HD 6970 and 5.0Gbps on the HD 6950. Additionally, AMD has improved the way each SIMD stores data locally. With the Radeon HD 48xx (RV770), AMD introduced a Local Data Store (LDS), or Local Data Share in AMD’s own terms, for each SIMD array. This allowed threads running on an array to store information for other threads on the same array to access. There was also a Global Data Store for sharing data between arrays. In Cayman, AMD has enlarged the LDS to 32KB and also given the arrays the ability to fetch directly from the LDS.
Looking at the chart above, you can see that the 6970 has the same number of ROPs as the HD 5970, but AMD has modified them to process certain integer and floating-point operations faster. The color ROPs can now process 16-bit integer (unorm and snorm) operations up to two times faster and 32-bit floating-point single- and double-component operations up to four times faster. AMD also made the way ROPs coalesce and then write data more efficient. What this means is that the ROPs take fragments and blocks of data and combine them into one write operation instead of spreading them across multiple writes. AMD has applied similar coalescing enhancements to ALU read operations as well.
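The general idea behind write coalescing can be sketched in a few lines of Python. This is not AMD's hardware logic, just a minimal illustration that assumes fragments arrive as (start address, data) pairs that never overlap; the function name and sample addresses are hypothetical.

```python
def coalesce_writes(fragments):
    """Merge contiguous (start_address, data) fragments into single writes.

    Assumes fragments do not overlap. Returns a list of (start_address, data)
    tuples in which back-to-back address ranges have been combined.
    """
    writes = []
    for addr, data in sorted(fragments, key=lambda f: f[0]):
        if writes and writes[-1][0] + len(writes[-1][1]) == addr:
            # This fragment starts exactly where the previous write ends,
            # so extend that write instead of issuing a new one.
            prev_addr, prev_data = writes[-1]
            writes[-1] = (prev_addr, prev_data + data)
        else:
            writes.append((addr, data))
    return writes

# Four small fragments, three of which sit next to each other in memory.
fragments = [(8, b"CCCC"), (0, b"AAAA"), (4, b"BBBB"), (32, b"DDDD")]
merged = coalesce_writes(fragments)
print(len(fragments), "fragment writes ->", len(merged), "coalesced writes")
```

Here the three adjacent 4-byte fragments collapse into one 12-byte write, while the fragment at address 32 stays on its own. Fewer, larger writes generally make better use of memory bandwidth, which is the efficiency described above.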
Previously, AMD added a second rasterizer to improve performance. This time around, AMD duplicated the entire geometry block, so the 6900 series cards can process two primitives per clock. Theoretically, that equates to twice the performance for transform and backface culling (eliminating the workload for geometry that is facing away from the camera). With two rasterizers, AMD can again process up to 32 pixels per clock.
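For readers unfamiliar with backface culling, the test itself is simple. The Python sketch below shows one common software formulation using the signed area of a triangle in screen space. It assumes counter-clockwise winding means front-facing with the y axis pointing up, and it treats zero-area triangles as culled; the function names and the clock value used in the throughput line are placeholders for illustration, not AMD specifications.

```python
def signed_area_2x(v0, v1, v2):
    """Twice the signed area of a 2D triangle (positive if counter-clockwise)."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])

def is_backfacing(v0, v1, v2):
    """True if the triangle faces away from the camera (or has zero area)."""
    # Assumption: counter-clockwise winding is front-facing, y axis points up.
    return signed_area_2x(v0, v1, v2) <= 0

def peak_primitives_per_second(prims_per_clock, clock_hz):
    """Theoretical peak primitive rate: primitives per clock times clock rate."""
    return prims_per_clock * clock_hz

front = ((0, 0), (1, 0), (0, 1))  # counter-clockwise, kept
back = ((0, 0), (0, 1), (1, 0))   # clockwise, culled
print("front triangle culled?", is_backfacing(*front))
print("back triangle culled?", is_backfacing(*back))

CLOCK_HZ = 800e6  # placeholder clock for the arithmetic only
print("1 prim/clock:", peak_primitives_per_second(1, CLOCK_HZ))
print("2 prims/clock:", peak_primitives_per_second(2, CLOCK_HZ))
```

The last two lines show where the "twice the performance" figure comes from: at any given clock, doubling primitives per clock doubles the theoretical peak rate. Whether a game sees that in practice depends on where the rest of the pipeline bottlenecks.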
AMD also improved the tessellation unit in each of the geometry blocks. AMD states that these improvements, combined with the duplication of the units, should provide three times the overall tessellation performance. When we get to the Unigine Heaven 2.1 benchmark results, you will see a dramatic improvement in tessellation performance.