Broken Toys
Below you can see the original block diagram that Nvidia supplied to show some of the functional units inside GF100. However, you will notice that something is missing. What is missing in that hole is exactly what didn’t show up inside GeForce GTX 480. We received 480 (93.75%) of the possible 512 arithmetic logic units (ALUs), or “CUDA cores.” The missing 32 amount to one full streaming multiprocessor (SM), which is made up of one SIMD, 32 ALUs and a tessellator.
Each SM's Polymorph Engine can generate about 0.25 triangles per clock, which works out to 1 per clock for each Graphics Processing Cluster (GPC) of four SMs. In theory this is a balanced approach, as the single Raster Unit in each GPC can render 1 triangle per clock. Not having the 16th SM means that the GPU can create 3.75 triangles per clock cycle versus the 4 it can rasterize per clock. This imbalance creates a slight bottleneck between creating and rasterizing triangles, both of which are important for triangle subdivision and for using textures for displacement mapping. While 0.25 triangles per clock may not seem like a lot, theoretically it equates to about 193 million triangles per second at a 772 MHz core clock (about 175 million at the GTX 480's 700 MHz) of diminished geometry throughput, and with it less performance from tessellation and vertex texture fetching. I say theoretically because in the real world, not all triangles are equal. Under certain usage patterns it is closer to 2 billion triangles per second versus the 3 billion that a full 16 SM graphics processor could supposedly output.
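For anyone who wants to check the arithmetic, here is a rough sketch of the theoretical numbers above. The per-SM and per-GPC rates come from the text; the function name `geometry_budget`, the count of 4 GPCs (inferred from 4 rasterized triangles per clock at 1 per GPC), and the 700 MHz and 772 MHz clocks plugged in are assumptions made for illustration, not figures from Nvidia's diagram.

```python
# Back-of-the-envelope model of GF100 triangle setup versus rasterization.
# Peak theoretical rates only; real workloads will land well below these.

ALUS_PER_SM = 32               # from the text: 32 ALUs per SM
TRIS_PER_SM_PER_CLOCK = 0.25   # from the text: one Polymorph Engine per SM
TRIS_PER_GPC_PER_CLOCK = 1.0   # from the text: one Raster Unit per GPC
GPCS = 4                       # assumption: inferred from 4 rasterized per clock


def geometry_budget(active_sms, core_clock_hz):
    """Return theoretical per-clock rates and the per-second setup shortfall."""
    setup = active_sms * TRIS_PER_SM_PER_CLOCK
    raster = GPCS * TRIS_PER_GPC_PER_CLOCK
    shortfall = max(raster - setup, 0.0)
    return {
        "alus": active_sms * ALUS_PER_SM,
        "setup_per_clock": setup,
        "raster_per_clock": raster,
        "shortfall_per_second": shortfall * core_clock_hz,
    }


# 15 active SMs at an assumed 700 MHz core clock (GTX 480)
gtx480 = geometry_budget(15, 700e6)
# All 16 SMs active at an assumed 772 MHz core clock
full_chip = geometry_budget(16, 772e6)

print(gtx480)
print(full_chip)
```

With 15 SMs the setup rate is 3.75 triangles per clock against 4 rasterized, and the 0.25 gap scales linearly with whatever core clock you plug in. With all 16 SMs the gap closes entirely.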
GF100 debuted with lower-than-expected core and memory clocks, higher power consumption, and additional heat that required a more powerful and louder cooling solution. Despite its limitations, Nvidia’s GTX 480 is certainly a monster of a chip and can render almost anything currently available on the market. That being said, who cares about GF100? We now have GF110. Nvidia took what it learned from launching GTX 480 and designed a piece of silicon that combines the best of GF100 with some of the improvements from GF104 to deliver what we now know as GeForce GTX 580.