The present I wanted for so long!
This is the Fermi I was told about, only better. FULL EVERYTHING! All things being equal, the architectural changes and fixes could improve performance by as much as 5% in Unigine Heaven 2.1 and Metro 2033. Nvidia is even claiming as much as 15% in Dirt 2.

As you can see from the tables, there are very nice improvements all over the place from GeForce GTX 480 to GTX 580. We are clearly expecting GeForce GTX 580 to crush geometry and be a bit soft on shading. A big surprise… right? When has that really been different when looking at AMD and Nvidia?

Additionally, Fermi is scalar-heavy compared to the AMD GPUs. Beyond3D ran some compute tests and showed that GF100 can issue twice as many scalar instructions as AMD, while AMD can issue twice as many Vec4 instructions. [Alex Voixu, et al. Beyond3D.com] Again, not too surprising, as old habits die hard and we like to do what we have always been good at. We expect GF110 to be an even bigger brute at geometry (and it is… we peeked at the test scores).
Nvidia gave GF100 support for more tile formats in its depth buffering to improve z-culling. The basic premise here is that the z-buffer is a table. If you look at this table like a texture, it can be stored, compressed, uncompressed, given different levels of detail and so on. You can therefore access larger depth data sets and use whatever best speeds up your application. This is something that we could probably write an entire article on, but the key point is that streamlining the process of getting pixels to your screen (and making sure they look correct) is what matters most. Therefore, pixels that cannot be seen because they sit on geometry obscured by other geometry should be removed from the work schedule as soon as possible. Just like people, the less time a GPU spends thinking about or doing fruitless work, the more efficient it is.
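To make the "z-buffer as a table with levels of detail" idea concrete, here is a minimal sketch of coarse, tile-level depth rejection. It is an illustration under stated assumptions, not Nvidia's actual hardware logic: the tile size, the function names `build_tile_max` and `tile_is_hidden`, and the "smaller depth is closer" convention are all hypothetical choices made for the example.

```python
# Hypothetical sketch of tile-level z-culling; not Nvidia's real implementation.
TILE = 4  # assumed tile edge length in pixels


def build_tile_max(depth, tile=TILE):
    """Build a coarse table holding the farthest stored depth of each tile."""
    rows, cols = len(depth), len(depth[0])
    coarse = []
    for ty in range(0, rows, tile):
        coarse_row = []
        for tx in range(0, cols, tile):
            coarse_row.append(max(
                depth[y][x]
                for y in range(ty, min(ty + tile, rows))
                for x in range(tx, min(tx + tile, cols))
            ))
        coarse.append(coarse_row)
    return coarse


def tile_is_hidden(coarse, tile_x, tile_y, nearest_incoming_depth):
    """Assuming a 'smaller is closer' depth test: if even the nearest point of
    the incoming geometry is no closer than the farthest pixel already stored
    in the tile, nothing in that tile can pass, so the work can be skipped."""
    return nearest_incoming_depth >= coarse[tile_y][tile_x]


# 8x8 depth buffer: left half covered by a near wall (0.2), right half empty (1.0).
depth_buffer = [[0.2] * 4 + [1.0] * 4 for _ in range(8)]
coarse_table = build_tile_max(depth_buffer)

# Incoming geometry whose nearest point sits at depth 0.5.
print(tile_is_hidden(coarse_table, 0, 0, 0.5))  # tile behind the wall
print(tile_is_hidden(coarse_table, 1, 0, 0.5))  # tile over empty space
```

One comparison against the coarse table stands in for sixteen per-pixel depth tests here, which is the kind of fruitless work the paragraph above is talking about.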
GF100 incorporated fully compliant IEEE 754-2008 single and double precision. Each of these ALUs uses fused multiply-add (FMA) instructions. Looking at the diagram below, you can see that an FMA retains more precision than a MAD (multiply-add), because the result is rounded once at the end rather than after the multiply and again after the add. An FMA also counts as two floating-point operations per clock cycle. This is huge when AMD and Nvidia are making talking points out of GFLOP throughput calculations.
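The precision difference is easy to show in software. The sketch below is only an emulation under stated assumptions: it runs in double precision on the CPU, it fakes the fused operation with exact rational arithmetic followed by a single rounding, and the names `mad`, `fma` and `peak_gflops` are hypothetical helpers. The core counts and shader clocks fed to `peak_gflops` are the commonly published reference specifications, not figures taken from this article.

```python
from fractions import Fraction


def mad(a, b, c):
    """Unfused multiply-add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fma(a, b, c):
    """Emulated fused multiply-add: exact product and sum, one final rounding."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def peak_gflops(cores, shader_clock_mhz, flops_per_clock=2):
    """Theoretical peak, counting each FMA as two floating-point operations."""
    return cores * shader_clock_mhz * flops_per_clock / 1000.0


# Inputs chosen so the exact product, 1 - 2**-60, cannot be stored in a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(mad(a, b, c))  # the tiny term is rounded away before the add
print(fma(a, b, c))  # the tiny term survives the single rounding

# Assumed reference specs: GTX 480 = 480 cores @ 1401 MHz, GTX 580 = 512 cores @ 1544 MHz.
print(peak_gflops(480, 1401))
print(peak_gflops(512, 1544))
```

With these inputs the unfused version returns exactly zero while the fused version keeps the small negative remainder, which is the whole argument for FMA in one line.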
GF110 got something juicy from its little brother. GeForce GTX 460 (GF104) introduced “full speed” 64-bit floating-point (FP16) texture filtering. DX9.0c introduced a minimum 32-bit floating-point lighting precision. When hardware started supporting FP16 blending, high dynamic range rendering (HDRR) really came alive. That being said, GF100 was designed to handle one texture address and four samples per texture unit. Each of the 64 texture units can still only address one location, but thanks to GF104 it can now return and filter four INT8 (32-bit), four FP16 (64-bit) or one FP32 (128-bit) texture samples. This should help GF110 not only with HDR processing but also with displacement mapping and texture-heavy applications.
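The bit widths in parentheses follow from four channels per texel, and the sample counts scale directly with the 64 texture units. A small sketch of that arithmetic follows, assuming RGBA texels and assuming the quoted sample counts are per unit per clock; the helper names `texel_bits` and `samples_per_clock` are hypothetical.

```python
CHANNELS = 4          # assumed RGBA texel layout
TEXTURE_UNITS = 64    # texture unit count quoted in the text

BITS_PER_CHANNEL = {"INT8": 8, "FP16": 16, "FP32": 32}
# Samples returned and filtered per texture unit, as stated in the text
# (assumed here to be per clock).
SAMPLES_PER_UNIT = {"INT8": 4, "FP16": 4, "FP32": 1}


def texel_bits(fmt):
    """Total size of one texel in bits for the given channel format."""
    return CHANNELS * BITS_PER_CHANNEL[fmt]


def samples_per_clock(fmt, units=TEXTURE_UNITS):
    """Filtered samples across all texture units under the assumptions above."""
    return units * SAMPLES_PER_UNIT[fmt]


for fmt in ("INT8", "FP16", "FP32"):
    print(fmt, texel_bits(fmt), "bits per texel,",
          samples_per_clock(fmt), "samples per clock")
```

Under these assumptions FP16 filtering keeps pace with INT8 even though each texel is twice as wide, which is what “full speed” refers to.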