
Jakub's Rant: 'Real-world' Benchmarking
July 20, 2006

Summary: [H]ardOCP has claimed our Core 2 benchmarks lie to you, that only their real-world GPU bottleneck tests can show real-world CPU performance. We address the issue of "real-world" vs "canned", and go over some of the myths and errors propagated by our friendly rivals, while also going over the pros and cons of both methods.


Introduction


We test Intel's Core 2 Duo and Extreme using real-world gaming. Don't let a bunch of canned benchmarks lie to you about gaming performance, real gameplay experience tells a different story. Unless of course you game at 800x600.

The [H]ardOCP article quoted above then goes on to state:

Let's just cut to the chase. You will see a lot of gaming benchmarks today that just simply lie to you. That is right, you will see frames per second numbers that are at best total BS, and at their worst a terrible representation of what difference a new Intel Core 2 processor will make in your gaming experience.


One of the reasons they give for why their testing methods are better than those used by other sites is that timedemo benchmarks are “canned” runs that do not include physics or AI calculations. However, this assumption is simply not true. The truthful answer is that it varies from game to game: timedemos can and do involve physics and AI calculations.

Quake 4 has two options for benchmarking: the “timedemo” console command and the “playnettimedemo” command. “playnettimedemo” plays back demos recorded over the ‘net (or network), and benchmarking with this command includes everything, including physics. This isn't a new development either, as Unreal Tournament 2004 has done this in botmatches for years.

And for the record, we did use "playnettimedemo" for the CPU benches in our Core 2 article.

What’s wrong with removing physics and AI anyway?

When testing a specific hardware component, whether it’s a graphics card, CPU, or anything else, as a reviewer you want to isolate the performance of that component as much as possible. That’s why reviewers will often test without sound, or without other apps such as email or ICQ running in the background, even though that’s not necessarily how you do things in everyday use. After all, these variables can affect the true performance of the component you’re testing.

Timedemos that don’t include physics or AI are another tool the reviewer can use to isolate the performance of a specific component, which is why they’re often used to test video cards. When you're testing for video performance, removing CPU bottlenecks is the best way to see the true speeds of the graphics adapters.

Testing with this method isn’t “lying” to readers – you’re showing the true performance of the hardware component you’re reviewing. Even if you do want to make the argument that since some timedemos remove AI and physics you’re not simulating real-world use, that still doesn’t change the fact that as long as you test hardware the same way it won’t matter.

In other words, ATI’s card isn’t going to suddenly come out ahead of NVIDIA because the timedemo didn’t include physics or AI. As long as you’re testing the hardware properly, the outcome shouldn’t be affected; only the frame rate will be a little different. Besides, taking out aspects like AI and physics removes one potential bottleneck that prevents you from seeing the true performance potential of the hardware. This is why timedemos that don’t use AI/physics can be good for testing graphics cards, or certain aspects of graphics cards.

When testing with timedemos, we attempt to minimize and isolate the effect other system components can have on the item being reviewed, and to show how that component performs in comparison to other potential upgrades the reader may be considering, as well as to older products. The graphs are presented in a concise, easy-to-read format at multiple resolutions, since results can vary with resolution and not everyone is stuck at one resolution.



Real-world?

Who defines what “real-world” is?


In a flight simulator for instance, is “real-world” testing at 35,000 feet with the graphics card merely rendering the sky, clouds, and any other objects that may be up at that altitude? Or is real-world defined as flying nap-of-the-earth at 200 feet with your fighter jet flying at 600 miles per hour and the graphics card rendering not only the sky and clouds, but individual trees and other terrain objects? (Not to mention the difference in the flight model and how the CPU handles physics of that flight model between these two scenarios.)

Or we can switch genres and go to Quake. Is real-world testing outdoors or indoors? How many enemies on-screen? 5, 10, 15? Is it more real-world to test single-player where you’ve got the CPU handling both AI and physics of perhaps 3-10 enemies shooting at you simultaneously, or hop on a multiplayer server where the CPU no longer has to handle AI but you may have dozens of guys on a map at once?

These are the types of decisions a reviewer must face when testing hardware. As you can see, there are countless combinations out there that can affect performance. It would be very difficult for anyone to test them all, so sacrifices must be made in order to get the review online in time for the deadline. Does that make your test any more or less of a “real-world” test?

Instead of worrying whether something is real-world or not, what we attempt to do with our methodology is create a scenario that’s as difficult as possible, or close to it, for the component we happen to be testing at the moment. By running a worst-case scenario we can see how the hardware performs when it’s stressed the most; under less stressful scenarios it should be capable of delivering better performance that hopefully produces an adequate frame rate. (This brings us to another point: what exactly is adequate anyway? For some people 30 FPS may be fine, while others may want no less than 60. The truthful answer is that it’s completely subjective. What’s considered “adequate” is going to vary from person to person and even game to game; a twitch shooter like Quake, for instance, demands a higher frame rate than a more strategic shooter like Splinter Cell or Rainbow Six.)

According to [H]ardOCP, "real-world" CPU performance testing means using a variety of AA, AF, resolution and in-game settings to try to average 40fps... or 35fps, or 73fps. Whatever the video card happens to max out at. "Real-world" benchmarks take no notice of your preference to run at a minimum of 60fps, if that's your flavor, or if you prefer lower resolutions and higher refresh rates to make things easy on the eyes, rather than extra eye candy.

You see, the assertion is that the hardware reviewer in question knows what resolution and settings you prefer to play at, and these apparently will always top out the video card in question. With the graphics adapter thus acting as a bottleneck, the latest CPU from Intel, which in other tests is about 10-25% faster than AMD's offering, turns out to be no better in these benchmarks.
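
The bottleneck argument can be put in numbers with a deliberately simple model, in which the delivered frame rate is whichever is lower of what the CPU and the GPU can each sustain. This is only a sketch under that assumption; the function name and every frame rate below are made up for illustration and are not measurements from any review.

```python
def delivered_fps(cpu_fps, gpu_fps):
    """Simplified model: the slower component sets the frame rate."""
    return min(cpu_fps, gpu_fps)


# Hypothetical CPU-limited frame rates; cpu_b is 20% faster than cpu_a.
cpu_a, cpu_b = 100.0, 120.0

# Hypothetical GPU-limited frame rates at two groups of settings.
gpu_low_res = 200.0   # e.g. low resolution, no AA/AF
gpu_high_res = 45.0   # e.g. high resolution with AA/AF

for label, gpu in (("low-res", gpu_low_res), ("high-res", gpu_high_res)):
    a = delivered_fps(cpu_a, gpu)
    b = delivered_fps(cpu_b, gpu)
    gap = (b / a - 1) * 100
    print(f"{label}: CPU A {a:.0f} fps, CPU B {b:.0f} fps, gap {gap:.0f}%")
```

In this toy model the faster CPU keeps its full lead when the video card has headroom, and the lead vanishes entirely once the video card is the limit, even though the CPUs themselves have not changed.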

We as a community are no doubt truly grateful to our friends and rivals in Texas for pointing out that if your video card is maxed out, differences in CPU speed can be difficult to see. We said the same thing in our GeForce 7800 GTX CPU scaling article last year, where an Athlon 64 3500+ delivered performance similar to an Athlon 64 FX-57 in our benchmarks. Of course, in that article we happened to include results at not just 1600x1200, but also 1280x1024, a common resolution for many LCDs. At that resolution the FX-57 did show a slight performance advantage over the 3500+. In the past we’ve run scaling results for other GPUs as well, including the Radeon X800 and GeForce 4.

[H]ardOCP does raise a good point that most people don't run their games at the lowest possible video settings to get the highest framerate. These truly are not the days of Quake any more; we do like our eye candy...



The future

Here’s a question for you: after seeing CPU benchmarks that show identical performance between three processors because the video card is the bottleneck, what have we learned about the CPU’s performance? How do we know how it compares to other processors in its class? Moreover, those tests indicate nothing about what we can expect from it if it is paired with a better video card.

Remember, CPUs tend to be longer-term purchases. The CPU industry typically doesn’t move as fast as the graphics industry; up to this point AMD and Intel have only introduced radically new CPU microarchitectures roughly every 4-5 years. In comparison, ATI and NVIDIA introduce their next-gen GPUs roughly every 12-24 months. In the graphics world, going from one generation to the next often equates to a doubling in performance. We saw this when going from the GeForce 4 to the Radeon 9700, from the Radeon 9700/9800 to the X800, and from the Radeon X800 to the GeForce 7800. The list of examples goes on and on.


Keep in mind that with the debut of previous next-gen GPUs, it wasn’t unheard of for the new graphics card to be CPU-bound at resolutions as high as 1600x1200.

Now consider that just a few months from now, we are expecting ATI and NVIDIA's latest creations - the R600 and G80 GPUs. Unlike the last two revisions, this is truly next-generation hardware, the kind of leap we saw last when the GeForce 6800 and Radeon X800 came out to replace the 5900 and 9800 lines.

When these next-gen video cards come out, how will you be able to choose the appropriate processor to match them, if all your CPU benchmarks were run with older video cards that are maxed out? Even if you cannot afford a Core 2 Duo E6700 with G80 now, next summer after price cuts you may be able to.

With traditional benchmarks, you'll still be able to see its relative performance to other CPUs with no bottlenecks, while "real-world" benchmarks will uselessly indicate that all CPUs are created equal and there is no point trying to discern between them.

Meanwhile, the standard benchmark approach, refined over the years, also shows that once you bump up resolution and settings, the differences between the CPUs are more or less eliminated. It becomes obvious to anyone with even nominal tweaking experience that there are two forces at play: the CPU matters more at lower resolutions, and the video card at higher resolutions and settings. Of course, if upcoming next-gen video cards make another leap forward as they did with the last generational leap, even high resolutions may become CPU-bound.
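
The two forces described above can be sketched with a toy model. It assumes, purely for illustration, that the GPU-limited frame rate falls in proportion to the number of pixels rendered and that the CPU-limited frame rate does not depend on resolution; real games are messier. All names and figures here are hypothetical.

```python
RESOLUTIONS = [(800, 600), (1280, 1024), (1600, 1200), (2048, 1536)]


def gpu_fps(width, height, fps_at_800x600):
    """Assumed model: GPU-limited fps scales inversely with pixel count."""
    return fps_at_800x600 * (800 * 600) / (width * height)


def limiting_component(cpu_fps, width, height, fps_at_800x600):
    """Name whichever component caps the frame rate at this resolution."""
    if cpu_fps <= gpu_fps(width, height, fps_at_800x600):
        return "CPU"
    return "GPU"


def report(cpu_fps, fps_at_800x600):
    """Map each resolution to the component that limits it."""
    return {
        f"{w}x{h}": limiting_component(cpu_fps, w, h, fps_at_800x600)
        for w, h in RESOLUTIONS
    }


cpu_fps = 100.0        # hypothetical CPU-limited frame rate
current_gpu = 240.0    # hypothetical GPU-limited fps at 800x600
next_gen_gpu = 480.0   # a hypothetical card twice as fast

print("current :", report(cpu_fps, current_gpu))
print("next-gen:", report(cpu_fps, next_gen_gpu))
```

With these made-up figures, 1600x1200 is GPU-bound on the current card but becomes CPU-bound once the card's throughput doubles, which is the situation a GPU-bound CPU test cannot tell you anything about.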



Real-world testing (cont’d)

Why “real-world” now?

This is the question I’ve been wondering about, particularly in light of the upcoming next-gen GPU launches mentioned earlier. In all honesty this probably goes beyond the premise of this article, which was to explain our methodology and why we test hardware the way we do, but then I saw this post by Kyle in his forums, where he suggests that this is somehow a new phenomenon that he has only now realized:


Using "gaming benchmarks" for CPU testing is broken in its current state. What worked a decade ago, does not work now to show anything applicable to gaming performance as it once did. It does not work any more to reflect any sort of real world benefits.


The fact of the matter is this is simply not true. Gaming performance at high resolutions has been bound by the GPU rather than the CPU dating all the way back to when the first GPUs began to offload more work from the CPU. I can say this with complete certainty because we’ve been publishing both high-resolution and low-resolution results in our CPU reviews for some time.


Are "real-world" CPU benchmarks useless? No, they're a perfectly good illustration for new readers that video cards can be a bottleneck for CPUs at high resolutions. But the moment a better video card comes out, or someone is interested in CrossFire/SLI results, these "real-world" benchmarks become irrelevant. Let's be honest, when we want to know how many pixels a video card can push, we read video card reviews.

[H]ardOCP’s numbers are very good at presenting Core 2’s performance for a very specific audience: those users with a Core 2 Extreme X6800, Core 2 Duo E6700, or AMD Athlon 64 FX-62 who happen to have a GeForce 7900 GTX and also game at high screen resolutions and high image quality settings. Beyond that, they don't tell you anything. What if you run lower resolutions? What if you're going to buy a G80 or R600? What if you have SLI or CrossFire?

Instead of trying to replicate “real-world” usage, we tried to provide a mixture of both: tests that aren’t GPU-bound (the 800x600 results in our Core 2 review) to better illustrate the performance differences between the CPUs, as well as high-resolution 1600x1200 results with 4xAA/8xAF to demonstrate usage more typical of a high-end gamer. We also included CrossFire results at 1600x1200 and 2048x1536 to give dual-GPU users an idea of how CPU performance scales from a single-GPU config to a dual-GPU config.

Calling this method of testing “virtually worthless” or suggesting that we’re somehow “lying” to our readers is not only misleading to the public, it’s downright unfair to us and other websites.



Our real-world testing

Now that I’ve discussed the differences between benchmarking with demos and “real-world” usage with manual walkthroughs, I’ve got something to admit: we combined both techniques in our Core 2 article, and have been using both for quite some time now when testing hardware. Any website that tests with Bethesda’s Oblivion, for example, is doing the same, as the game doesn’t have a built-in method for recording demos. Guess a lot of sites are more real-world than [H]ardOCP thinks, huh?

So how do we conduct our “real-world” testing with manual walkthroughs?


If you don’t try to minimize the variability between runs, you basically end up doing one thing with one piece of hardware and a totally different thing with the second. That's not a very scientific way of testing. And by the way, with timedemos, the load is the same for both.

That's why when we do manual runs, we have to minimize our interaction with the environment. In other words, we don't shoot at things and we don't allow our character to be shot at. We've also got to walk a tightrope, basically, running down the same set path as closely as we can every time. Even then there's still going to be variability. In the Oblivion City area for our Core 2 review, for instance, I've observed that the NPCs may take slightly different routes, and one or two may not even show up at all from one run to the next.

The long and short of it is, there's no way any one method can 100% replicate what the end user is going to experience, because everyone plays games differently. So what you've got to do is make things as even as possible so that the hardware is tested properly. You don't want to give one piece of hardware more of a load than the 2nd piece of hardware. The load needs to be the same.
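
One way to check whether a set of manual runs is even enough is to repeat each run several times and compare the spread against the gap between components. The sketch below assumes you have logged an average frame rate per run. The figures, the helper names, and the rule of thumb (treat a gap as real only if it exceeds the combined run-to-run spread) are illustrative assumptions, not a description of our actual procedure.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean fps, sample standard deviation) for repeated runs."""
    return mean(runs), stdev(runs)


def gap_exceeds_noise(runs_a, runs_b):
    """Crude check: is the difference in means larger than the combined spread?"""
    mean_a, sd_a = summarize(runs_a)
    mean_b, sd_b = summarize(runs_b)
    return abs(mean_a - mean_b) > (sd_a + sd_b)


# Hypothetical average fps from three manual walkthroughs per component.
part_a = [41.8, 43.1, 42.4]
part_b = [42.0, 42.9, 41.6]
part_c = [50.3, 49.1, 51.0]

print("A vs B differ beyond noise:", gap_exceeds_noise(part_a, part_b))
print("A vs C differ beyond noise:", gap_exceeds_noise(part_a, part_c))
```

A check like this is deliberately conservative: if two parts land within each other's run-to-run spread, the walkthrough is not consistent enough to separate them, and the path or the test area needs tightening before the numbers mean anything.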

As a result, testing with manual walkthroughs can’t be quite as stressful as testing with timedemos because the walkthroughs don’t involve intense combat, so instead we try our best to offset this by finding areas that will put as much stress on the component tested as possible. An example of this would be our foliage tests in GPU reviews, and the city testing in CPU reviews.



Summary


What we do with our testing is try to gather as much information about the performance potential of the hardware as possible. This includes not only testing at multiple resolutions, but also including as wide a variety of hardware as possible, so that you can see how the component being reviewed compares not only to the latest and greatest hardware but also to the previous generation. Essentially all we’re doing is saying “this is how fast x piece of hardware performed in our testing with this application, and this is how well it performed with that app”.

During the testing process, it’s important that we try and tax the hardware as much as possible, so you can see how well it truly performs in comparison to other potential upgrade choices. After all, if we didn’t do this, how else would you be able to determine which component to get?

From there it is up to you, the reader, to make the decision that’s right for your needs. Then, once you’ve purchased your upgrade, it’s also up to you to determine which screen resolution and graphics settings work best for you, based on the rest of your system’s components and, just as importantly, on your tastes and preferences when it comes to frame rate.

This silly notion that [H]ardOCP is somehow doing something dramatically different from other sites simply isn’t true, and it needs to stop. Clearly they haven’t reinvented the wheel when it comes to benchmarking.

To claim that what they’re doing is “real-world” testing and that what other sites are doing is “total BS”, “misleading”, and “worthless” is unfair to all other websites with Core 2 reviews, especially in light of how narrow the scope of their “real-world” testing was: only 3 CPUs were tested, and multi-GPU performance was never touched.

If this is what hardware journalism has resorted to, we want no part of it.








© Copyright 2003 FS Media, Inc.