Summary: [H]ardOCP has claimed that our Core 2 benchmarks lie to you, and that only their real-world, GPU-bottlenecked tests can show real-world CPU performance. We address the issue of "real-world" vs. "canned" testing, go over some of the myths and errors propagated by our friendly rivals, and weigh the pros and cons of both methods.
“We test Intel's Core 2 Duo and Extreme using real-world gaming. Don't let a bunch of canned benchmarks lie to you about gaming performance, real gameplay experience tells a different story. Unless of course you game at 800x600.”
In his article he then goes on to state:
“Let's just cut to the chase. You will see a lot of gaming benchmarks today that just simply lie to you. That is right, you will see frames per second numbers that are at best total BS, and at their worst a terrible representation of what difference a new Intel Core 2 processor will make in your gaming experience.”
One of their assertions for why their testing methods are better than those used by other sites is that timedemo benchmarks are “canned” runs that do not include physics or AI calculations. This assumption is simply not true; the truthful answer is that it varies from game to game. Timedemos can and do involve physics and AI calculations. Quake 4 has two options for benchmarking: the “timedemo” console command and the “playnettimedemo” command. “playnettimedemo” plays back demos recorded over the 'net (or network), and benchmarking with this command includes everything, including physics. This isn't a new development either, as Unreal Tournament 2004 has done this in botmatches for years. And for the record, we did use “playnettimedemo” for our CPU benches in our Core 2 article.
What’s wrong with removing physics and AI anyway?
When testing a specific hardware component, whether it’s a graphics card, CPU, or anything else, as a reviewer you want to isolate the performance of that component as much as possible. That’s why reviewers will often test without sound, or without other apps like email or ICQ running in the background, even though that’s not necessarily how you do things in everyday use. After all, these variables can affect the true performance of the component you’re testing.
Who defines what “real-world” is?
Here’s a question for you: after seeing CPU benchmarks that show identical performance between three processors because the video card is the bottleneck, what have we learned about the CPU’s performance? How do we know how it compares to other processors in its class? Moreover, those tests indicate nothing about what we can expect from it if it is paired with a better video card.
Remember, CPUs tend to be longer-term purchases. The CPU industry typically doesn’t move as fast as the graphics industry: up to this point, AMD and Intel have only introduced radically new CPU microarchitectures roughly every 4-5 years. In comparison, ATI and NVIDIA introduce their next-gen GPUs roughly every 12-24 months. In the graphics world, going from one generation to the next often equates to a doubling in performance. We saw this when going from the GeForce 4 to the Radeon 9700, from the Radeon 9700/9800 to the X800, and from the Radeon X800 to the GeForce 7800. The list of examples goes on and on. Keep in mind that with the debut of previous next-gen GPUs, it wasn’t unheard of for the new graphics card to be CPU-bound at resolutions as high as 1600x1200.
Now consider that just a few months from now, we are expecting ATI and NVIDIA's latest creations, the R600 and G80 GPUs. Unlike the last two revisions, this is truly next-generation hardware, the kind of leap we last saw when the GeForce 6800 and Radeon X800 came out to replace the 5900 and 9800 lines. When these next-gen video cards come out, how will you be able to choose the appropriate processor to match them if all your CPU benchmarks were run with older video cards that are maxed out? Even if you cannot afford a Core 2 Duo E6700 with G80 now, next summer after price cuts you may be able to. With traditional benchmarks, you'll still be able to see its performance relative to other CPUs with no bottlenecks, while "real-world" benchmarks will uselessly indicate that all CPUs are created equal and there is no point trying to discern between them.
Meanwhile, the benchmark standard, refined over the years, also shows that once you bump up resolution and settings, the differences between the CPUs are more or less eliminated. It becomes obvious to anyone with even nominal tweaking experience that there are two forces at play: the CPU matters more at lower resolutions, and the video card matters more at higher resolutions and settings. Of course, if upcoming next-gen video cards make another leap forward as they did with the last generational leap, even high resolutions may become CPU-bound.
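To make the "two forces" argument concrete, here is a minimal sketch of the bottleneck idea in Python. It assumes a deliberately simplified model in which the delivered frame rate is just the lower of what the CPU and the video card can each sustain; real games are messier than that. The CPU names and every number below are hypothetical, picked only for illustration, and are not results from our testing.

```python
def delivered_fps(cpu_fps_limit, gpu_fps_limit):
    """Simplified model: the slower component caps the frame rate."""
    return min(cpu_fps_limit, gpu_fps_limit)


# Hypothetical limits for illustration only, NOT measured results.
cpu_limits = {"CPU A": 90.0, "CPU B": 120.0, "CPU C": 150.0}
gpu_limits = {"800x600": 300.0, "1280x1024": 140.0, "1600x1200": 80.0}


def results_table(cpu_limits, gpu_limits):
    """Return {resolution: {cpu: fps}} under the simplified model."""
    return {
        res: {cpu: delivered_fps(c, g) for cpu, c in cpu_limits.items()}
        for res, g in gpu_limits.items()
    }


def is_gpu_bound(row):
    """True when every CPU posts the same score at this resolution."""
    return len(set(row.values())) == 1


for res, row in results_table(cpu_limits, gpu_limits).items():
    label = "GPU-bound" if is_gpu_bound(row) else "CPUs separated"
    print(res, row, label)
```

In this toy model the three CPUs are clearly separated at 800x600 and post identical scores at 1600x1200, which is exactly the case where a GPU-limited test tells you nothing about the processor. Raise the hypothetical GPU limit at 1600x1200 to stand in for a faster next-gen card and the CPU differences reappear.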
Why “real-world” now?
This is the question I’ve been wondering about, particularly in light of the upcoming launch of next-gen GPUs mentioned earlier. In all honesty, this probably goes beyond the premise of this article, which was to explain our methodology and why we test hardware the way we do, but then I saw this post by Kyle in his forums, where he suggests that this is somehow a new phenomenon that he has only just now realized:
Now that I’ve discussed the differences between benchmarking with demos and “real-world” usage with manual walkthroughs, I’ve got something to admit: we combined both techniques in our Core 2 article, and have been using both for quite some time now when testing hardware. Any website that tests with Bethesda’s Oblivion, for example, is doing the same, as the game doesn’t have a built-in method for recording demos. Guess a lot of sites are more real-world than [H]ardOCP thinks, huh?
So how do we conduct our “real-world” testing with manual walkthroughs?
If you don’t try to minimize the variability between runs, you basically end up doing one thing with one piece of hardware and a totally different thing with the second. That's not a very scientific way of testing. And by the way, with timedemos, the load is the same for both. That's why when we do manual runs, we have to minimize our interaction with the environment. In other words, we don't shoot at things and we don't allow our character to be shot at. We've also got to walk a tightrope, basically, running down the same set path as closely as we can every time. Even then there's still going to be variability: in the Oblivion City area for our Core 2 review, for instance, I've observed that the NPCs may take slightly different routes, and one or two may not even show up at all from one run to the next.
The long and short of it is, there's no way any one method can 100% replicate what the end user is going to experience, because everyone plays games differently. So what you've got to do is make things as even as possible so that the hardware is tested properly. You don't want to give one piece of hardware more of a load than the second piece of hardware. The load needs to be the same.
As a result, testing with manual walkthroughs can’t be quite as stressful as testing with timedemos because the walkthroughs don’t involve intense combat, so instead we try our best to offset this by finding areas that will put as much stress on the component tested as possible. An example of this would be our foliage tests in GPU reviews, and the city testing in CPU reviews.
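One way to keep yourself honest about the run-to-run variability of manual walkthroughs is to repeat each run a few times and look at the spread. The Python sketch below is only an illustration of that idea, not a description of our actual tooling: the function names, the 3% cutoff, and the frame rates are all assumptions made up for the example.

```python
import statistics


def summarize_runs(fps_runs):
    """Mean, sample standard deviation, and coefficient of variation (%)."""
    mean = statistics.mean(fps_runs)
    stdev = statistics.stdev(fps_runs)  # needs at least two runs
    return {"mean": mean, "stdev": stdev, "cv_percent": 100.0 * stdev / mean}


def is_repeatable(fps_runs, max_cv_percent=3.0):
    """Assumed rule of thumb: accept a set of runs if the spread is small.

    The 3% default is a hypothetical threshold, not an established standard.
    """
    return summarize_runs(fps_runs)["cv_percent"] <= max_cv_percent


# Hypothetical average frame rates from repeated walkthroughs of one path.
careful_runs = [41.8, 42.5, 42.1]   # same path, no combat
sloppy_runs = [35.0, 42.0, 49.0]    # path and interaction varied each time

print(summarize_runs(careful_runs), is_repeatable(careful_runs))
print(summarize_runs(sloppy_runs), is_repeatable(sloppy_runs))
```

If the spread between repeated runs on the same hardware is larger than the gap you are trying to measure between two components, the walkthrough needs to be tightened up before the comparison means anything.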
What we do with our testing is try to gather as much information about the performance potential of the hardware as possible. This includes not only testing at multiple resolutions, but also including as wide a variety of hardware as possible, so that you can see how the component being reviewed compares not just to the latest and greatest hardware but to the previous generation as well. Essentially all we’re doing is saying “this is how fast x piece of hardware performed in our testing with this application, and this is how well it performed with this app”. During the testing process, it’s important that we tax the hardware as much as possible, so you can see how well it truly performs in comparison to other potential upgrade choices. After all, if we didn’t do this, how else would you be able to determine which component to get? From there it is up to you, the reader, to make the decision that’s right for your needs. Then, once you’ve purchased your upgrade, it’s also up to you to determine which screen resolution and graphics settings work best for you, based on the rest of your system’s components and, just as importantly, on your tastes and preferences when it comes to frame rate.
This silliness that [H]ardOCP is somehow doing something dramatically different than other sites simply isn’t true, and it needs to stop. Clearly they haven’t reinvented the wheel when it comes to benchmarking. To claim that what they’re doing is “real-world” testing and that what other sites are doing is “total BS”, “misleading”, and “worthless” is unfair to all other websites with Core 2 reviews, especially in light of how narrow the scope of their “real-world” testing was: only 3 CPUs were tested, and multi-GPU performance was never touched. If this is what hardware journalism has resorted to, we want no part of it.
© Copyright 2003 FS Media, Inc.