
AMD Radeon 6950/6970 Performance Preview
December 17, 2010 Darren Polkowski

Summary: AMD's high-end DX11 offerings for gamers are here just in time for Christmas. With a new VLIW4 architecture, 2GB of memory, and PowerTune, AMD's latest GPUs bring many new features to the table. See how they stack up against the competition in today's article!


Introduction

Here we go again! The past two months have been among the craziest on record for graphics product launches. While it makes a lot of work for the reviewer, it is a very good day for the consumer. The Nvidia launch of GeForce GTX 570 last Tuesday was a good shot in the arm for those of you looking for great performance under $400. At $350 the GTX 570 is an attractive buy, but that is still a bit out of reach for the majority of gamers. While this is the FiringSquad, “Home of the Hardcore Gamer,” most of us don’t have hardcore discretionary budgets to acquire toys with.

[image]


Today AMD is launching its Cayman architecture under the names Radeon HD 6970 and Radeon HD 6950. As you will see, AMD changed parts of the architecture to make it faster in several areas and is also introducing some new features. The best part of this launch is the performance per dollar. Radeon HD 6970 is being introduced at $369 to compete directly with GeForce GTX 570. Radeon HD 6950 is launching at $299, which is the price range of GeForce GTX 470 and Radeon HD 5870. This should ultimately lower the price on all of the cards in that $250-300 band as HD 6950 pushes its way in.

On the pages that follow, we will explore the new power and overclocking features inside of Cayman. We will also take a closer look at the changes in the architecture and point out why AMD made these changes and how they relate to performance. AMD has a real winner in both cards it is launching today, and as we walk through all of the changes you should come to a clear understanding of this as well.






Architecture & Theory

The first major change to the architecture in Cayman is the use of VLIW4 (VLIW stands for Very Long Instruction Word). Since HD 2900 (R6xx), AMD/ATI has been utilizing blocks of five ALUs strung together, performing VLIW5 superscalar instruction-level parallelism. Basically, the SIMD tells the array to execute a task in parallel and bundles those instructions in blocks of five to be run on the four normal ALUs and the “Fat” unit that handles transcendental processing such as sine, cosine, exponent, square root, and log.

Take what you have known about AMD’s architecture under VLIW5 and wrap your mind around VLIW4. AMD stated that in most games the average slot utilization is 3.4 of the 5 stream processors. Most of the time the last ALU is not doing any work, and under normal operation the processor is only at 68% utilization. What we have come to know as VLIW5 in current architectures traces back to R300, where we were first introduced to the idea. The ATI Radeon 9700 utilized a Vec4+1, or vector plus scalar, processing architecture. The Vec4 processor would take four-element vector operations (Vec3/Vec4) and could handle a scalar operation in parallel. Rocket eight years forward to the present and the Vec4 is now a bundle of four scalars, plus a scalar.
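The utilization figure AMD quotes is simple arithmetic, and it is easy to see why a four-wide bundle looks attractive. The short Python sketch below assumes, as an idealization, that the same average of 3.4 busy slots carries over unchanged to a four-wide bundle; the function name is made up for illustration.

```python
def slot_utilization(avg_busy_slots: float, bundle_width: int) -> float:
    """Fraction of ALU slots doing useful work per issued bundle."""
    if bundle_width <= 0:
        raise ValueError("bundle_width must be positive")
    # A bundle can never keep more slots busy than it has.
    return min(avg_busy_slots, bundle_width) / bundle_width


AVG_BUSY_SLOTS = 3.4  # AMD's quoted average for games

vliw5 = slot_utilization(AVG_BUSY_SLOTS, 5)
vliw4 = slot_utilization(AVG_BUSY_SLOTS, 4)  # assumes the same 3.4 average
print(f"VLIW5: {vliw5:.0%}  VLIW4: {vliw4:.0%}")
```

Real shader code will not pack this neatly, since transcendental operations now occupy several of the four slots, so treat the VLIW4 number as a best case.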

[image]





In retrospect, since almost everything in the world of computers is built on a binary foundation, I am not sure why AMD had not already moved to an architecture with a base of two scalars. Someone could just shout out a base of one, but I guess this kind of talk drives us back to a discussion of scalar vs. superscalar. That debate has changed over time and back again. It is more than can be addressed in this article, but please feel free to discuss it in the forums.

In the previous VLIW5 architectures, the transcendental unit could do the same calculations as the other four but could also do certain specific calculations in one pass. As you can see from the block diagram below, AMD has removed the “Fat” or “T” unit. This means that the remaining four ALUs have to process transcendental calculations together without the help of special fixed functions.

Dave Baumann, HD 6900 Product Manager, stated that “this allows better utilization than the previous VLIW5 design. Effectively the shader processors are being utilized more frequently and more efficiently. This basically brings about 10% more performance per millimeter.” What Dave is saying is that using VLIW4 vs. VLIW5 allows AMD to put more SIMD units in the same space on the die and therefore should increase performance as you have more units executing more threads and calculations.

[image]

In this configuration, each block of four ALUs in AMD’s Cayman architecture can calculate four 32-bit floating point (FP) fused multiply-add (FMA), multiply-add (MAD), multiply (MUL) or addition (ADD) operations in one clock cycle. It can also handle two 64-bit FP ADDs, one 64-bit FMA or MUL, and one special function (transcendental). On the integer side of things it can handle four 24-bit MADs, MULs or ADDs, or four 32-bit ADDs or bitwise operations (shifts, XOR, etc). It can also handle one 32-bit MAD or MUL or one 64-bit ADD per clock.

This rearranging of the ALUs allows AMD to use 1,536 Stream Processors (ALUs) across 24 SIMD (Single Instruction, Multiple Data) arrays. By comparison, Radeon HD 5870 has 1,600 ALUs across 20 SIMD arrays. All things being equal and under certain circumstances, the new configuration would process fewer calculations per clock than the VLIW5 design but may be closer to full utilization. Four more SIMD arrays in the VLIW4 configuration means that four more independent instructions can be computed in the same clock than with VLIW5. The trade-off is more efficiency and diversity of workload vs. more brute strength and special transcendental calculations per clock. Obviously AMD feels the new configuration is better.
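To make the layout comparison concrete, here is a small Python sketch that derives the per-SIMD numbers from the totals quoted above. The class and field names are my own, and the only inputs are the ALU counts, SIMD counts and bundle widths given in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderConfig:
    name: str
    stream_processors: int  # total ALUs
    simd_arrays: int
    vliw_width: int         # ALUs per VLIW bundle

    @property
    def vliw_units(self) -> int:
        return self.stream_processors // self.vliw_width

    @property
    def alus_per_simd(self) -> int:
        return self.stream_processors // self.simd_arrays

    @property
    def vliw_units_per_simd(self) -> int:
        return self.vliw_units // self.simd_arrays


hd6970 = ShaderConfig("Radeon HD 6970 (VLIW4)", 1536, 24, 4)
hd5870 = ShaderConfig("Radeon HD 5870 (VLIW5)", 1600, 20, 5)

for gpu in (hd6970, hd5870):
    print(f"{gpu.name}: {gpu.vliw_units} VLIW units, "
          f"{gpu.alus_per_simd} ALUs per SIMD, "
          f"{gpu.vliw_units_per_simd} VLIW units per SIMD")
```

Both chips end up with the same number of VLIW units per SIMD; Cayman simply has more SIMDs, each built from narrower units.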




Dual Graphics Engines, Tessellation, ROPs and More


AMD is introducing asynchronous dispatch for GPU compute for the 6900 series. It can execute multiple compute kernels simultaneously where each kernel has its own protected virtual address and its own command queue. While Nvidia’s Fermi architecture can handle parallel kernels, it must switch between them in order to process them. For AMD and asynchronous dispatch, this means not only can it have multiple kernels spawn from a single thread, but that it can actually manage multiple different applications completely and independently at the same time. This is something of interest as the GPU could start running various applications at the same time. While this is NOT something covered under Direct3D 11 (DX11), Baumann stated that “AMD will look to add extensions to expose this through OpenCL.”

[image]


AMD is utilizing two bidirectional DMA engines on the PCI Express interface. This means that the 6900 series cards can do multiple concurrent data transfers across the PCI Express bus. This shows up as data rates of 5.5Gbps on HD 6970 and 5.0Gbps for HD 6950. Additionally, AMD has improved the way each SIMD can deal with storing data locally. In Radeon HD 48xx (RV770), AMD introduced a Local Data Store (LDS), or Local Data Share in AMD’s own terms, for each SIMD array. This allowed threads within the array to store information for other threads in that array to access. It also had Global Data Stores for sharing between arrays. In Cayman, AMD has enlarged the LDS to 32KB and has also given the arrays the ability to fetch directly from the LDS.

[image]

Looking at the chart above, you can see that the 6970 has the same number of ROPs as the HD 5870; however, AMD has modified them to process INT8 and FP16 operations faster. The ROPs handling color can now process INT8 16-bit (unorm and snorm) operations up to two times faster, while FP16 32-bit single- and double-component operations run up to four times faster. AMD also added new efficiencies in how ROPs coalesce and then write data. What this means is that ROPs take fragments and blocks of data and put them together in one write operation instead of spreading them across multiple writes. AMD applies similar coalescing enhancements to ALU read operations as well.
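Write coalescing is easier to picture with a toy example. The Python sketch below is not how the ROP hardware is implemented; it is a simplified model, with made-up names, that merges fragments landing on back-to-back addresses into a single write. It assumes fragments never overlap.

```python
from typing import List, Tuple

Write = Tuple[int, bytes]  # (start address, payload)


def coalesce_writes(fragments: List[Write]) -> List[Write]:
    """Merge fragments that touch back-to-back addresses into single writes."""
    merged: List[Write] = []
    for addr, data in sorted(fragments, key=lambda w: w[0]):
        if merged:
            last_addr, last_data = merged[-1]
            if last_addr + len(last_data) == addr:
                # Contiguous with the previous write: extend it.
                merged[-1] = (last_addr, last_data + data)
                continue
        merged.append((addr, data))
    return merged


fragments = [(0, b"AAAA"), (8, b"CCCC"), (4, b"BBBB"), (32, b"DDDD")]
coalesced = coalesce_writes(fragments)
print(f"{len(fragments)} fragment writes -> {len(coalesced)} coalesced writes")
```

Fewer, larger writes mean fewer trips to memory for the same data, which is the efficiency AMD is describing.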

Previously AMD added a second rasterizer to improve performance. This time around AMD duplicated the entire geometry block, so there are now two. This means that the 6900 series cards can process two primitives per clock. Theoretically that would equate to twice the performance for transform and backface culling (eliminating the workload for geometry that is facing away from the camera). With two rasterizers, AMD can again process up to 32 pixels per clock.

AMD improved the tessellation unit in each of the geometry blocks. AMD states that the improvements to the tessellation unit and the fact that AMD duplicated the units should provide three times the overall performance. When we get to the Unigine Heaven 2.1 benchmark results you will see a dramatic improvement to tessellation.





PowerTune Technology

AMD is introducing what it calls PowerTune Technology. All semiconductor products operate under specific design specifications. One of these is referred to as TDP, or Thermal Design Power. As its name implies, the TDP for a product is the maximum amount of heat, in watts, that the chip is designed to dissipate. So TDP is the figure used to describe how much power a chip will pull under “normal” circumstances. There are several applications that cause a chip to draw more power than it was designed for.

Nvidia introduced monitoring on the input rails for GTX 580 and GTX 570. While this approach is beneficial for blocking overdraw on those cards while running FurMark, Game Test 4 in 3DMark 03, or OCCT, it may not give the best performance under other conditions. AMD’s innovation can monitor and adjust power across the entire chip rather than only at the input rails. This means that AMD can “clamp” to the TDP instead of to a measured power draw. Not all processors are manufactured equally; some overclock better, have less leakage, or run at lower temperatures. Because of these disparities between processors, the better the die, the more you can push it without passing the original TDP threshold. Poorer quality dies will hit the maximum TDP while pushing out less performance. The diagram below shows an example that AMD shared which demonstrates the frequency of the processor fluctuating based on a clamped TDP target.

[image]


Because of these variances, AMD utilizes an algorithm in conjunction with the power draw counters in each part of the chip to tailor a specific power design. The chip can therefore scale up to meet demand or slow its clock speeds to stay within the TDP. While most game applications fall far below the TDP for a specific chip, there are outlying applications which go over it. So instead of building the TDP around outliers such as OCCT or upcoming game titles, AMD can use a TDP which accounts for the majority of applications. AMD can utilize the algorithm and counters to constrain the application to the TDP instead of the application constraining the TDP.

[image]

AMD makes control over PowerTune available to the end user. This means that you the consumer can adjust the limit from -20% to +20% of the specified TDP. In some instances you may want to get that last ounce of FPS when running OCCT. By boosting the threshold up by 20% you may be able to deliver the power the application wants to draw. Be warned, this is considered over-volting and can void your warranty.



In the image below you can see how altering the power control allows the chip to hit the maximum FPS that it is capable of with a relaxed TDP. The bottom line shows that the clock frequencies have been reduced because the Perlin Noise test in 3DMark Vantage is drawing too much power. As you raise the maximum TDP, the FPS output climbs because the chip can draw more power, which gets it closer to its maximum performance. Remember, the application will ask for power but it does not mean all applications will demand all of the power. Adjusting the TDP constraint for an application that does not ask for extra power will not gain performance. You would have to overclock the chip first and then the application may request the extra power.
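AMD has not published the PowerTune algorithm, so the following Python sketch is only a toy model of the clamping idea described above. It assumes a made-up linear power estimate (activity times clock times a constant), and every number in it is hypothetical rather than a real HD 6900 figure.

```python
def clamp_clock(activity: float, max_clock_mhz: int, tdp_watts: float,
                watts_per_mhz_at_full_activity: float,
                step_mhz: int = 10, min_clock_mhz: int = 500) -> int:
    """Return the highest clock whose estimated power stays within the TDP.

    activity is a 0..1 load estimate, standing in for the on-chip counters.
    """
    clock = max_clock_mhz
    while clock > min_clock_mhz:
        estimated_watts = activity * watts_per_mhz_at_full_activity * clock
        if estimated_watts <= tdp_watts:
            break
        clock -= step_mhz  # Over budget: back the clock off one step and re-check.
    return clock


# Hypothetical card: 900MHz top clock, 200W limit, 0.25W per MHz at full load.
game = clamp_clock(0.7, 900, 200.0, 0.25)            # typical game load
stress = clamp_clock(1.0, 900, 200.0, 0.25)          # power-virus style load
stress_plus20 = clamp_clock(1.0, 900, 240.0, 0.25)   # same load, limit raised 20%
game_plus20 = clamp_clock(0.7, 900, 240.0, 0.25)     # game load, limit raised 20%
print(game, stress, stress_plus20, game_plus20)
```

In this model the stress load gets its full clock back once the limit is raised, while the game load was never clamped in the first place and gains nothing, which matches the behavior described for the slider.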



Dual BIOS and Outputs

You tweakers out there will love what AMD is shipping on all reference-based graphics boards: a dual BIOS. One setting is the factory default and cannot be overwritten. The second is an unprotected BIOS which the user can change. So if you like to tweak your cards or look for a modified BIOS from the web, you now have the ability to change yours without the fear of “bricking” your card. If you have an error while flashing the unprotected BIOS, all you need to do is change to the protected factory BIOS (Setting 2) and reboot. While in operation, you can flip the switch to the unprotected BIOS and re-flash it.

[image]


One more thing to point out about the images above is the fact that there are two CrossFire connectors on HD 6970 and 6950. This means that these cards should be capable of 3-way and 4-way CrossFire configurations.

Cayman comes equipped with the same Unified Video Decoder (UVD 3) launched with the 6xxx series as its fixed-function video playback accelerator. UVD 3 added acceleration for MPEG-2 bitstreams, DivX/Xvid (MPEG-4 Part 2 ASP) and Multiview Video Coding (MVC) formats. MVC is the codec specified for Blu-ray 3D movies, where it combines two stereo views encoded in H.264.

[image]




AMD continues to provide outputs for up to six displays. There is a single-link DVI connection on the bottom and a dual-link DVI connection on top. Additionally, it provides an HDMI 1.4a connection. Ideally you can run two 30” monitors off the dual-link DVI plus the HDMI outputs. AMD also provides two DisplayPort 1.2 (miniDP) outputs, each of which supports up to two 30” monitors, or which together can run EIGHT monitors at 1920x1080 (four on each)! Of course this could require additional outside hardware, but being able to run four or six monitors from a single graphics card is awesome for those of you (me included) who run multi-monitor setups.

[image]

I listed a chart above from a presentation which shows the limitations of the number of monitors per cable based on the bandwidth required. To read more about DisplayPort, please visit their website. They have a lot of good information listed there. (DisplayPort.org)
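For a rough feel of where the per-cable limits come from, the Python sketch below compares uncompressed pixel data rates against the payload rate of a DisplayPort 1.2 link. The assumptions are mine and not from AMD’s chart: 60Hz refresh, 24 bits per pixel, no allowance for blanking intervals, and a link of four lanes at 5.4Gbit/s each with 8b/10b encoding.

```python
DP12_PAYLOAD_GBPS = 4 * 5.4 * 8 / 10  # four lanes at 5.4Gbit/s, 8b/10b encoding


def stream_gbps(width: int, height: int, refresh_hz: int = 60,
                bits_per_pixel: int = 24) -> float:
    """Uncompressed pixel data rate in Gbit/s, ignoring blanking intervals."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def fits(width: int, height: int, monitors: int) -> bool:
    """True if that many identical monitors fit within one DP 1.2 link."""
    return monitors * stream_gbps(width, height) <= DP12_PAYLOAD_GBPS


for label, (w, h), count in [("1920x1080", (1920, 1080), 4),
                             ("2560x1600", (2560, 1600), 2)]:
    need = count * stream_gbps(w, h)
    verdict = "fits" if fits(w, h, count) else "too much"
    print(f"{count} x {label}: {need:.1f} of {DP12_PAYLOAD_GBPS:.2f} Gbit/s -> {verdict}")
```

Because blanking is ignored, real-world headroom is smaller than this suggests, but the estimate lines up with four 1920x1080 panels or two 30” panels per output.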



Testing Setup

The test system has been overclocked to 4.33GHz and SHOULD eliminate any potential for the benchmarks to be CPU bound. You can find more specifics on the system in the CPU-Z screenshot. Additionally, we have supplied GPU-Z screenshots so you can see the card-specific details.

[image]


We ran each test at least three times. The “average” we report for each resolution and configuration is the geometric mean of all of the runs, which we use where possible to get a true center of the data. The minimum is the minimum across all of the runs. We would like to demonstrate user experiences as much as possible. While it does not play well for PR and marketing types, it is what we experienced and what a gamer would experience under the same conditions. Additionally, the three resolutions we selected are the two most popular and the maximum. We used a resolution of 1920x1080 for Unigine Heaven 2.1. If you have comments about the test setup or what you would like to see run through the paces, please contact us.
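For readers who want to reproduce the math, this is roughly how the reported numbers are put together. The Python sketch below uses made-up frame rates rather than measured results, and the function name is only for illustration; `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import geometric_mean


def summarize_runs(fps_runs):
    """Return (geometric-mean FPS, minimum FPS) across repeated runs.

    fps_runs: list of (average_fps, minimum_fps) tuples, one per run.
    """
    averages = [avg for avg, _ in fps_runs]
    minimums = [low for _, low in fps_runs]
    return geometric_mean(averages), min(minimums)


# Made-up numbers for three runs of one test; not measured results.
runs = [(60.0, 41.0), (62.5, 38.0), (57.6, 40.0)]
reported_avg, reported_min = summarize_runs(runs)
print(f"average: {reported_avg:.1f} fps, minimum: {reported_min:.1f} fps")
```

The geometric mean is less swayed by a single unusually fast run than the arithmetic mean, while the minimum deliberately keeps the worst case a gamer would actually see.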

AMD Reference Radeon HD 6970


[image]


AMD Reference Radeon HD 6950


[image]


Nvidia Reference GeForce GTX 570


[image]


Nvidia Reference GeForce GTX 580


[image]


EVGA GeForce GTX 480


[image]


Asus EAH6870


[image]


Asus EAH6850 DirectCU


[image]


AMD Reference Radeon HD 5870




[image]


AMD Reference Radeon HD 5850


[image]




DX11 and Tessellation - Unigine Heaven 2.1



As mentioned earlier, AMD has doubled geometry throughput by duplicating its “Graphics Engine.” This doubling means there are two rasterizers, two tessellation units, and two sets of vertex and geometry assemblers. Now, what I would like to point out is that not all tessellation is identical. AMD calls this its 8th generation tessellator. The basic premise of the tessellator stage is to subdivide lines, triangles and quads into smaller objects. This means lines can spawn more points, triangles can spawn more triangles or quads, etc. The tessellator uses a determined tessellation factor (how much to subdivide) and then generates the topology once per patch.
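To show how quickly the geometry load grows with the tessellation factor, here is a simplified Python sketch that uniformly subdivides one triangle patch. It is only an illustration: the real DX11 tessellator uses separate edge and inside factors and several partitioning modes, none of which are modeled here, and the function name is my own.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


def subdivide_triangle(tri: Triangle, factor: int) -> List[Triangle]:
    """Uniformly split a triangle so each edge has `factor` segments."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    a, b, c = tri

    def grid_point(i: int, j: int) -> Point:
        # Barycentric weights: i steps toward b, j steps toward c.
        u, v = i / factor, j / factor
        w = 1.0 - u - v
        return tuple(w * pa + u * pb + v * pc for pa, pb, pc in zip(a, b, c))

    out: List[Triangle] = []
    for i in range(factor):
        for j in range(factor - i):
            # Upright triangle in this grid cell.
            out.append((grid_point(i, j), grid_point(i + 1, j), grid_point(i, j + 1)))
            # Inverted triangle, present in every cell except the last of each row.
            if j < factor - i - 1:
                out.append((grid_point(i + 1, j), grid_point(i + 1, j + 1),
                            grid_point(i, j + 1)))
    return out


patch: Triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
for factor in (1, 2, 4, 8):
    print(f"factor {factor}: {len(subdivide_triangle(patch, factor))} triangles")
```

With this uniform scheme the triangle count grows with the square of the factor, which is why heavy tessellation leans so hard on geometry throughput.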

[image]


Many games do not take advantage of tessellation in terms of what it can do for a scene or game. Civilization V for example uses tessellation for surface and water displacement. While this makes the ground and water look more realistic, most of the time I play the game at such a distance I cannot really enjoy the added geometry. I am not saying it is not important, but developers still have a way to go to really take advantage of the hardware that exists, especially now that Cayman adds even more punch in terms of geometry.

Below you can immediately see the massive improvement to geometry and tessellation that Cayman brings to the table. While this usage scenario is not the norm compared to games, it shows how far we can expect games to go as developers add more dynamic elements which can utilize tessellation on a larger scale. At the resolution and tessellation settings we chose, the new Cayman-based cards scamper up the performance ladder to be in line with GTX 570 and GTX 480.






DX11 FPS: Battlefield: Bad Company 2 & Metro 2033

Battlefield: Bad Company 2


Battlefield: Bad Company 2 makes use of DX11, but as stated in an interview at PC Games Hardware, Anders Gyllenburg says “The main benefits for us are efficient soft shadowmap filtering, and some smaller performance optimizations.” (www.pcgameshardware.com)

[image]




Looking at the graphs below you will see the Nvidia cards holding the lead at lower resolutions, but as the number of pixels per frame increases, the AMD 6900 series cards start kicking butt and taking names. The HD 6970, for instance, is basically delivering the same frame rates as the GTX 580 but at a $150 discount. The HD 6950 is doing the same to the GTX 570 but at a $50 discount. So while the frame rates don’t jump out and slap me across the face, the frame rate in comparison to the cost per card does.








DX11 FPS: Metro 2033

Metro 2033


Like Battlefield: Bad Company 2, we crank up the settings here because current cards should be able to chew through most of what the games can throw at them. We therefore enable tessellation and set multi-sampling antialiasing to 4x and anisotropic filtering to 16x.

[image]


Looking at the results you should note that the new HD 6970 and HD 6950 hold their own against the respective cards at their price points. At higher resolutions, though, we see AMD take the lead. The HD 6950 beats the more expensive GTX 570 and GTX 480. At 1920x1080 all of the Nvidia candidates and now the new Cayman-equipped cards are playable with minimum frame rates above 30fps.







DX11 RTS: Civilization V & DX11 Driving Sim: Dirt 2

Civilization V


There are several ways to use Civilization to test rendering horsepower. For the most part, we value actual game play versus a built-in benchmark. Both types of testing scenarios make sense as they can give us a valid look at what hardware and drivers can do. However, playing the game is what you the end user will most likely be using a game for… as an entertainment application. So for Civilization we have a preset pattern of moving around an endgame save from a game I actually played, so we can repeat the test pattern.

[image]


We get the same pattern as before in Civilization. The higher the resolution (with a lot of features turned on), the better Cayman performs compared to the competition. At 1680x1050 the GTX 570 and HD 6970 are neck and neck. As we move to 1920x1080 we see the HD 6970 pull ahead. At 2560x1600, the HD 6970 spanks the GTX 570. At that resolution even the HD 6950 pulls ahead to be neck and neck with a card that is $50 more expensive.








Colin McRae: Dirt 2


Dirt 2 is a good racing game. Codemasters uses DX11 and tessellation on the water and on the people in the crowd (I still think they could have done a better job on the people, especially for PC). We only test this game at the highest resolution and at 4xAA and 16xAF because this game is a console port with PC tweaks. Modern graphics cards should have no issue with it.

[image]


As you can see from the results, there was some sort of anomaly in the minimum frame rates for the HD 6970, and I plan on rerunning the tests when I do a follow-up article on SLI and CrossFire scaling. I want to put an end to the discussion on what is the best bang for the buck. Moving on, the new Cayman cards outperform the previous generation as well as the 6800 series cards, but Nvidia has them on the ropes in Dirt 2.






Other Tests: Sound, Heat and Power

Noise


The GTX 570 is still the king in terms of reference design cards. What we would like to point out, and what the chart does NOT show, is that when the HD 6970 starts up we got the classic spin-up and slow-down we used to know in older cards and in the HD 5870 (aka “The Batmobile”). It remains to be seen whether this was specific to the cards we received and the press BIOS or whether it is common to all of the retail cards as well. The reason I point this out is that the HD 6950 did not spin up when the system was started. We will keep you posted when I get an answer to that question.





Power Utilization


We utilize a Diamond BizView BV200 card to help with the power test to allow us to get into the operating system and back out. We leave the card inserted for all of the power, temperature and acoustic tests for consistency. Looking at the cards below, the HD 6970 requires an external 8-pin and 6-pin power connection while the HD 6950 only requires two 6-pin connections.

[image]


AMD engineering comes through again with low power consumption. Their techno-wizardry was able to keep the power draw below their Nvidia counterparts in all classes and price points. That said, the 6900 cards are a little more power hungry than the 6800-based parts. What we need to point out is that while consumption is in line with the other parts and far below Nvidia’s, there was a clear improvement to performance.




Temperatures


Okay, there is no winner in terms of temperature, while there is a loser (GTX 480). Overall the temperatures are where we would expect them to be. With more happening inside both flavors of Cayman we should expect it to get warmer. Moving to VLIW4 from VLIW5 means that we should have greater utilization across the entire chip. Just looking at the shader ALUs alone, we should expect them to be better utilized. If the previous configuration was utilizing up to 68%, Cayman can utilize up to 85% with everything else being equal.






Conclusions

Cayman is a superb piece of architecture. It is fast and efficient, but not quite exactly what the press slide decks claim. I know where the PR and marketing folks are coming from so I will cut them some slack. However, I care about how you and I, the consumers, benefit. I like the steps taken to look forward such as enhanced DMA performance; 5.5Gbps is insane. I like PowerTune. It allows future designs to utilize power more effectively, if not more efficiently, and gives the user some control over what they want the card to do. I also like the dual BIOS, and yes, I have bricked cards before.

This race for the fastest and best in each price category has left a wake of confusion as to what is the best configuration for the cost. As for single card configurations, there isn’t a new champion but there are heroes. GTX 570 is still an amazing card for what it can do. But now that we have price points below where I thought they would appear, I would have to lean toward the HD 6970.

[image]


At $299 the HD 6950 is extremely attractive. For $75 more than a GTX 580 you could put two cards in for better performance. What makes things hard is that you can pick up a pair of HD 6850 cards for about the same amount you would spend on a single HD 6970. This is why I need to do some specific configurations to show the scaling and performance as well as the price per performance.

If you are like me, you love multi-monitor setups. I am not sure how I functioned before I had more than one monitor running. The HD 6900 and 6800 series give you that and more. If you want to run three or more monitors the cards from AMD can supply it, while GTX 570 needs a second card to accomplish the same task. That brings up Eyefinity and multi-monitor gaming. While GTX 570 has over 1GB of memory, that is a limited amount to span three monitors, even if it could output to three monitors.




[image]

We are back to my opening remarks; it is a good day to be a consumer. There are a lot of choices at many price points. The only cards that we know should be coming are a replacement to the GTX 460 (probably GTX 560) and the dual-chip Cayman card. This should settle your minds if you are sitting on the fence about what to buy. If you are like me and can’t stand playing at anything under 2560x1600, you should seriously look at the Cayman-based HD 6970 and HD 6950. They have a lot of horsepower and plenty of memory, which is why they pull ahead of Nvidia at the higher resolutions. If, on the other hand, you play at 2MP (1920x1080), both are attractive. UVD 3, dual BIOS, multi-monitor, and PowerTune then become the value-added proposition to look at. If you play at something less than that, Nvidia has a clear advantage.


© Copyright 2003 FS Media, Inc.