The Problem with Benchmarks
Mikki
08-13-2003, 08:07 AM
We all know the problem...there isn't a single benchmark out there that both displays accurate results consistently and scales properly. We've been complaining about this for a long time and unfortunately there doesn't seem to be a light at the end of the tunnel. :(
This problem is amplified for the BE Crew because we strive to provide everyone with accurate reviews, and those reviews are largely based on test results.
One possible solution is running a multitude of benchmarks, each for a specific purpose. Even this has its faults: it takes a lot of time to run all those tests, and even then we are seeing varied results.
Another solution is to design a real-world test based on such things as video encoding. This has real promise because we'd be running a test with a real-world app, not a simulation. The problem lies in that something like this needs to be freely (and legally) distributable.
So, with this in mind, I go where we always go when we have a problem...the forum!! :) We have an excellent group of smart enthusiasts here and I appeal to you to help BE and the rest of the overclocking community to solve this problem.
Please post your ideas and suggestions here, and more importantly - your results. If you have any test results which demonstrate a problem with a particular benchmark, please post them. If you have results which demonstrate the accurate and consistent results of a particular benchmark or real-world solution, please post them. Opinions are welcome, but please base them on facts and/or test results.
What am I hoping to accomplish with this? Three things, actually. First, I want to identify as many problems as we can. Secondly, I hope to at least come up with a solution to this problem, whatever it may be. Thirdly, once we have collected a decent amount of good solid data, I am going to contact as many benchmark companies as I can, show them the data and see if they will work with the overclocking community to make something better.
TIA!! :wave: :-)
pointreyes
08-13-2003, 08:39 AM
For disk benchmarks, we need to use the industry standard: iometer. Just wish someone knew how to use it and could teach us. :rolleyes: iometer was originally developed by Intel, and it's actually an old benchmark tool, but it tests the drives in a manner that is closer to emulating a drive dealing with loads, which normal disk benchmarks don't do as well. Note that you will see iometer results used in advertising for the more expensive disk controllers. It's about the only test that 3Ware will use.
I'll start it with Sandra. . . Specifically the memory tests. You do a simple refresh and the scores change. I'll post real examples later on, but we need a memory tool to properly gauge the memory throughput and performance.
:D
eva2000
08-13-2003, 09:02 AM
anyone good at software programming ? wanna whip up a batch file to run some benchmarks or a new app hehe
we can call the app
B.E.A.S.T
Bleedin Edge Application System Tester :D
then you can say to users, can you run the B.E.A.S.T. on your pc and post the results hehe :-)
pointreyes
08-13-2003, 09:24 AM
KT would have my vote for developing BEAST.
Mikki
08-13-2003, 11:29 AM
B.E.A.S.T....lol! I like it!!:-) ;)
We've been looking into the feasibility of creating a BE Benchmark for some time now, and decided against it for right now. Something like that would take a large amount of time that is unavailable, not to mention the hours and hours it would take to test it on a multitude of different platforms to ensure it did what it was supposed to do. After all, there are companies out there who do nothing but that, and their programs don't do what we need them to do. Should someone like KT volunteer to take on a project like that, we'd be grateful, but for now we need to work with what we have (or don't have)...;):)
eva2000
08-13-2003, 11:40 AM
KT?
pointreyes
08-13-2003, 11:46 AM
Originally posted by eva2000
KT?
Heinz 57, er user id number of 57. :p
http://www.bleedinedge.com/forum/member.php3?s=&action=getinfo&userid=57
ThugsRook
08-13-2003, 02:52 PM
i have found 2 issues with sandra membench over the last few months while doing reviews for BE.com. 1 of these issues also spills over into ANY unbuffered membench. (like m86)
right now nothing is perfect. formerly good benchmarks have become useless (on springdale/canterwood) and surprisingly enough, formerly lame benchmarks are now showing hope of being very good (on springdale/canterwood)
but then again i keep having to say (springdale/canterwood) dont i? :rolleyes:
heres my "personal" take on things as of this moment ~ i am not preaching this, this is just whats working best for me.....
if i had to be on a deserted island with only 1 benchmark what would it be?
3DMark2001
if i was allowed 2 benchmarks?
3DMark2001
PCMark2002
3?
3DMark2001
PCMark2002
UT2003 demo
4?
3DMark2001
PCMark2002
UT2003 demo
Q3 Extreme benchmark <--- this is new
you get my point right?
ill post my results of those "issues" i mentioned earlier.....
ThugsRook
08-13-2003, 03:24 PM
my 1st issue was found while comparing granite bay to a springdale board. this was when springdale was 1st released and the whole entire world was (mistakenly) basing its performance (vs granite bay) on sandra and UB sandra results.....
ok ppl ~ i have something serious to discuss here.
something is lying in a big way.
ill post the comparison and you decide for yourself wth is going on here...
1) P4G8X 2.53b @ 3401mhz 179fsb cas 2-2-2-6-1
2) P4P800 2.53b @ 3401mhz 179fsb cas 2.5-4-4-8
> 3DMark2001<
P4G8X = 15782
P4P800 = 15031
> Sandra Unbuffered <
P4G8X = 2418-2491
P4P800 = 2929-2536 :scratch: :!:
there is definitely a huge flaw here somewhere.
we definitely have a conflict when translating Sandra memory bandwidth into actual performance.
3dmark says p4g8x mops the floor with a p4p800~
sandra says the exact opposite :smash:
discuss...
ThugsRook
08-13-2003, 03:33 PM
the 2nd issue i found was during recent testing for a review....
ok here's wth is going on.....
im tweaking out the 4:5 to get my benchmarks for the review.
since GAT doesnt work async, im currently testing a GAT suboption called read delay adjust.
it has 3 modes ~ auto enabled disabled
(in this particular case, auto runs the same as disabled)
here's the results ~ 2.4b @ 2820mhz 160fsb 4:5 400ddr c2622....
http://www.bleedinedge.com/images/thugs/ico-m86.gif Memtest86 v3.0
RDA enabled = 2625
RDA disabled = 2625
m86 is basically an unbuffered benchmark, it doesnt see the difference
http://www.bleedinedge.com/images/thugs/ico-a32.gif Aida32 v3.70 read/write
RDA enabled = 4562 / 1905
RDA disabled = 4394 / 1898
a32 sees a nice difference, on the read side of things
http://www.bleedinedge.com/images/thugs/ico-pcm2k2.gif PCMark2002 memory
RDA enabled = 8946
RDA disabled = 8627
pcm sees a nice big difference
http://www.bleedinedge.com/images/thugs/ico-sss.gif SiSoft Sandra v7973 unbuffered
RDA enabled = 2894 / 2823
RDA disabled = 2870 / 2805
very very small difference. again sandra and the unbuffered benchmarks have failed me. if i only ran sandra i woulda missed this tweak!
(sandra is kinda wild, that small difference could easily be random flux for sandra)
what is RDA actually worth in 3dmark terms?
121 3dmarks :eek2:
http://www.bleedinedge.com/images/thugs/ico-3dm2k1.gif 3DMark2001
RDA Enabled = 15982
RDA Disabled = 15861
121 3dmarks? from a simple tweak that sandra and m86 basically shrugged off as nothing??? :smash:
discuss...
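To put the numbers in this post on one scale, here is a minimal Python sketch that turns each RDA enabled/disabled pair into a percentage gain. The scores are copied from the post; the dictionary layout and the function name are illustrative assumptions, and for the two-figure results only the first figure is used (the read score in Aida32's case).

```python
# Scores copied from the post above as (RDA enabled, RDA disabled).
# For benchmarks that report two figures, only the first one is used here.
RESULTS = {
    "Memtest86 v3.0": (2625, 2625),
    "Aida32 v3.70 read": (4562, 4394),
    "PCMark2002 memory": (8946, 8627),
    "Sandra unbuffered (first figure)": (2894, 2870),
    "3DMark2001": (15982, 15861),
}


def percent_gain(enabled, disabled):
    """Return the gain of `enabled` over `disabled` as a percentage."""
    return 100.0 * (enabled - disabled) / disabled


if __name__ == "__main__":
    for name, (enabled, disabled) in RESULTS.items():
        gain = percent_gain(enabled, disabled)
        print(f"{name:34s} {gain:6.2f}%")
```

Expressing every result as a percentage makes it easier to compare benchmarks whose raw scores sit on very different scales, since a 121-point change means something different on a 15,000-point score than on a 2,800-point one.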
Mikki
08-13-2003, 04:00 PM
Thanks Thugs, just the kind of stuff I was looking for...:D
So what do you make of those issues, any theories? :)
ThugsRook
08-13-2003, 04:09 PM
theories?
theory or not ~ sandra has failed me twice.
its results need to be verified by other benchmarks ~ which makes it a waste of my time.
the reason why i run so many benchmarks per tweak/setting is to verify that they are all telling the truth time after time again.
do you guys have any idea how many reviews are based on sandra results? how do you feel about those reviews now that you have this info??
the last thing i need is a benchmark that misleads me, or worse, misleads others.
> sandra has been removed from my system (again) <
m86 membench will be used sparingly~
aida32 and pcmark2002 are now in my front seats for memory bandwidth testing :wave:
pointreyes
08-13-2003, 05:20 PM
My theory is that the Intel dual DDR chipset boards have changed the way benchmark software can function. I find it interesting that the Granite Bay I had was only slightly faster than my AMD dual DDR nForce2 board. The Granite Bay had a 2.4b in it oc'ed to 3.0 GHz and the nForce2 board was running at 200fsb with the XP2100 working at 2.1 GHz. A 900 MHz difference and yet a very small difference in Sandra's memory throughput.
On a sidebar, I love the fact that I can have dual DDR with three sticks on a nForce2 but you must do pairs on an Intel system. :rolleyes:
eva2000
08-13-2003, 07:49 PM
Originally posted by ThugsRook
my 1st issue was found while comparing granite bay to a springdale board. this was when springdale was 1st released and the whole entire world was (mistakenly) basing its performance (vs granite bay) on sandra and UB sandra results.....
maybe to do with the cas latencies and how they'd work on 865/875 http://www.mushkin.com/mushkin/pop-up/latencies.htm
What is tRAS and why is it backwards and important at the same time?
The word latencies is generally used to describe a delay. However, Merriam-Webster defines the word’s origin as period of dormancy and in technical parlance, latency is often used to describe simply the duration of any event. One example is the PCI latency which describes the time any device has access to the PCI bus before it will be automatically disconnected to allow other devices access to the same resources.
Why are we talking about this? Very simple, the access latencies of any device to the PCI bus are usually eight cycles, but the total latency can be set from 16-256 cycles. This shows that the same word is used to describe two entirely different parameters, the first being the time until any transactions can start, the second referring to the time that is available for transactions (minus the access latencies). As an example, a PCI latency of 32 will carry a penalty (access latency) of 8 cycles which leaves 24 cycles for actual data transfers. Therefore, decreasing this latency will not increase performance, on the contrary.
The exact same is true for tRAS short for the RAS Pulse width. Historically, tRAS was defined as the time needed to establish the necessary potential between a bitline pair within the memory array until it was safe to write back the data to the memory cells of origin after a (destructive) read. Pay attention to the word read here.
Memory, in many ways is like a book, you can only read after opening a book to a certain page and paragraph within that particular page. The RAS Pulse Width is the time until a page can be closed again. Therefore, just by definition, the minimum tRAS must be the RAS-to-CAS delay plus the read latency (CAS delay). That is fine for FPM and EDO memory with their single word data transfers. With SDRAM, memory controllers started to output a chain of four consecutive quadwords on every access. With DDR, that number has increased to eight quadwords that effectively are two consecutive bursts of four.
Now imagine someone closes the book you are reading from in the middle of a sentence. Right in your face! And does it over and again. This is what happens if tRAS is set too short. So here is the really simple calculation: The second burst of four has at least to be initiated and prefetched into the output buffers (like you get a glimpse at the headline in a book) before you can close the page without losing all information. That means that the minimum tRAS would be tRCD+CAS latency + 2 cycles (to output the first burst of four and make way for the second burst in the output buffers).
Any tRAS setting lower than tRCD + CAS + 2 cycles will allow the memory controller to close the page “in your face!” over and again and that will cause a performance hit because of a truncated transfer that needs to be repeated. Along with those hassles comes the self-explanatory risk for data corruption. That one is not a real problem as long as the system is kept running but in case it is shut down and the memory content is written back to the hard disk drive, the consequences can be catastrophic. For the drive, that is.
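The two calculations in the quoted article (minimum tRAS = tRCD + CAS latency + 2 cycles, and PCI transfer time = total latency minus the 8-cycle access latency) can be written out as a short Python sketch. The function names and the sample timings are assumptions chosen for illustration only; they are not settings taken from any board in this thread.

```python
def min_tras(trcd, cas, burst_cycles=2):
    """Minimum tRAS per the rule quoted above: tRCD + CAS latency + 2 cycles.

    `burst_cycles` is the 2 cycles the article allows for outputting the
    first burst of four before the page may be closed.
    """
    return trcd + cas + burst_cycles


def pci_transfer_cycles(total_latency, access_latency=8):
    """Cycles left for data once the access latency has been paid."""
    return total_latency - access_latency


if __name__ == "__main__":
    # Hypothetical timings: tRCD = 3 with CAS 2, and tRCD = 3 with CAS 2.5.
    print("min tRAS (tRCD 3, CAS 2):  ", min_tras(3, 2))
    print("min tRAS (tRCD 3, CAS 2.5):", min_tras(3, 2.5))
    # The article's own PCI example: a latency of 32 leaves 24 cycles.
    print("PCI transfer cycles at 32: ", pci_transfer_cycles(32))
```

By this rule, a tRAS value below the computed minimum would be the "closing the book in your face" case the article describes.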
Mikki
08-14-2003, 08:08 AM
Lots of really good info coming so far, you guys rock! :rock:
Thugs, nuff said...;):)
pointreyes, I agree, the architecture of a chipset may affect how certain things run, but my opinion on that is 1) it should only show up as a performance increase/decrease and 2) that should parallel the real-time operation of any software. A program is a program is a program, and if well written should always run the same. It may be faster or slower in its operation but it should operate the same no matter what it's running on.
A good solid benchmark should measure the performance of a given system, period, no matter what the system. For instance, a benchmark should measure how long it takes a particular system (hardware and OS included) to take a certain amount of data and write it into memory. That's it. It shouldn't be affected by anything else. If the chipset does something to speed that particular process up, then it should be reflected in the benchmark, simple as that.
In a way, when you think about this, it's so simple. You're just trying to measure how fast your system does something, this shouldn't be that hard and we certainly shouldn't be having all these problems, y'know? It's irritating...:rolleyes:
eva2000, nice post....:D
Originally posted by pointreyes
My theory is that the Intel dual DDR chipset boards have changed the way benchmark software can function. I find it interesting that the Granite Bay I had was only slightly faster than my AMD dual DDR nForce2 board. The Granite Bay had a 2.4b in it oc'ed to 3.0 GHz and the nForce2 board was running at 200fsb with the XP2100 working at 2.1 GHz. A 900 MHz difference and yet a very small difference in Sandra's memory throughput.
Of course part of the problem we have is the vast myriad of chipsets and hardware platforms.
Has anyone thought that Chipset drivers might be "tuned" for benchmarks like some have said Nvidia drivers are?
Mikki
08-15-2003, 07:36 AM
PF, see my post above yours...;)
eva2000
08-15-2003, 08:07 AM
Originally posted by Mikki
eva2000, nice post....:D
always :D
sodface
08-15-2003, 08:14 AM
I'm a bit late as usual on this topic, but mikki and I have talked about it at length, and I've been following the thread. I haven't done a lot of research about how benchmarks work but here are some observations, right or wrong.
I think a benchmark should be run in the target OS using the OS APIs, assuming that that is the way the majority of application developers write their software. What does a benchmark that runs from a boot floppy tell us about how a particular piece of hardware will perform once the OS is active? On the benchmarks that are run from within the OS environment, how do we know the method used to access the hardware? Do some use the Windows API as the middleman or are they somehow bypassing Windows and accessing hardware directly, and if so, how does that translate into performance in a particular application that is programmed to use the API?
Visual Basic is about the only programming language I know halfway decently, and in that I do know that the method you use to do something WILL affect the speed of execution. If you use an intrinsic VB control to perform a task such as displaying a messagebox, that method may be slower than displaying a messagebox by making a direct call to the API. (This is an example, I don't know this to be true.) I just think a benchmark should be written to perform tasks the same way an application developer would write it.
Of course, there is more than one way to skin a cat, and one man's idea of efficient code could be rewritten by someone else to perform much more quickly. If you had 2 different programs to calculate PI to a certain number of places, both programs would not necessarily tackle this task the same way and as a result would return times significantly different from each other.
I've lost my train of thought. Sorry if this post is illustrating my keen grasp of the obvious.:(
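The PI example is easy to try. Below is a minimal Python sketch, assuming two textbook series (Leibniz and Nilakantha) as the "2 different programs"; neither is taken from any real benchmark, and `math.pi` is used only as the reference for deciding when to stop. Both reach the same accuracy, but they need very different amounts of work to get there.

```python
import math
import time


def leibniz(tol):
    """Leibniz series: pi = 4 * (1 - 1/3 + 1/5 - ...). Returns (terms, value)."""
    total, sign, k = 0.0, 1.0, 0
    while abs(4.0 * total - math.pi) >= tol:
        total += sign / (2 * k + 1)
        sign = -sign
        k += 1
    return k, 4.0 * total


def nilakantha(tol):
    """Nilakantha series: pi = 3 + 4/(2*3*4) - 4/(4*5*6) + ... Returns (terms, value)."""
    approx, sign, n, terms = 3.0, 1.0, 2, 0
    while abs(approx - math.pi) >= tol:
        approx += sign * 4.0 / (n * (n + 1) * (n + 2))
        sign = -sign
        n += 2
        terms += 1
    return terms, approx


def timed(func, tol):
    """Run func(tol) once and return (terms, value, elapsed seconds)."""
    start = time.perf_counter()
    terms, value = func(tol)
    return terms, value, time.perf_counter() - start


if __name__ == "__main__":
    for func in (leibniz, nilakantha):
        terms, value, elapsed = timed(func, 1e-6)
        print(f"{func.__name__:10s} terms={terms:8d} value={value:.7f} time={elapsed:.6f}s")
```

Same task, same hardware, same answer, yet the elapsed times differ because the algorithms differ, which is the point about a benchmark reflecting how its author chose to write it.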
Mikki
08-15-2003, 08:27 AM
Thanks sodface :) I agree with you, I see no sense in running a bench that doesn't use APIs. What's the sense in benchmarking a system that isn't running at real-world specs (such as a DOS bench) or not using the OS calls?
The OS is part of the system, is part of what is being benched. Folks with identical rigs running different OSes are going to get different results, are we going to say they perform the same? Of course not...the OS is part of the system.
Something I don't understand is if you decide to write a benchmark in VB or I decide to write one in C++, if we invoke the calls to time a certain operation, shouldn't that be all there is to it? If we are timing an operation that, say...writes a chunk of data to memory, shouldn't that be a clear-cut operation? Pass it over to the OS and display the results, how can it be more simple? If so then why are all the benchmarks screwed up unless they've tweaked them somehow for certain chipsets or drivers? :rolleyes:
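As a rough illustration of "time an operation that writes a chunk of data to memory", here is a minimal Python sketch. The function name, the buffer size, and the repeat count are all assumptions, and in Python the figure includes interpreter overhead, so it should be read as a sketch of the idea and not as a real memory benchmark.

```python
import time


def time_memory_write(size_bytes, repeats=5):
    """Copy `size_bytes` of data into a preallocated buffer `repeats` times.

    Returns the elapsed time of each copy in seconds.
    """
    src = bytes(size_bytes)        # source data (all zero bytes)
    dst = bytearray(size_bytes)    # destination buffer, allocated up front
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src               # same-length slice assignment: copies the data
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MiB, an arbitrary choice
    for i, seconds in enumerate(time_memory_write(size), start=1):
        mb_per_s = (size / (1024 * 1024)) / seconds if seconds > 0 else float("inf")
        print(f"run {i}: {seconds:.6f}s  ({mb_per_s:.0f} MiB/s)")
```

Even something this small forces choices: how big the buffer is, how many runs to take, which run to report, and which timer to use. Each choice can move the reported number, which may be part of why two benchmarks timing "the same" operation disagree.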
Mikki
08-15-2003, 08:50 AM
Something else that I was just thinking (and due in part to a conversation I had with Thugs).......I think most of us would agree that the benchmarks which we consider most worthy right now, such as 3DMark2001SE and X-Isle Demo, are good multi-platform benchmarks due to the fact that they were written a long time ago. And so the theory goes that they aren't optimized for any particular piece of hardware or any driver. So.....wouldn't it be that a simple, general program could actually be more accurate than all these fancy ones? :)
pointreyes
08-15-2003, 09:21 AM
Originally posted by Mikki
The OS is part of the system, is part of what is being benched. Folks with identical rigs running different OSes are going to get different results, are we going to say they perform the same? Of course not...the OS is part of the system.
Hehe, one of the reasons that I prefer using the Windows Server OSes that support more than 4 procs is the noticeable speed difference. On benches there is no increase, but when it comes to doing some of the stuff I do, there is a performance difference, normally in favor of the Server OSes.
Note that I did say 'support more than 4 procs.' 2000 Standard Server and 2003 Web Server are not nearly as good as 2000 Advanced Server and 2003 Enterprise Server.
And then there is (yes, I just had to mention it) Linux. My RAID5 is noticeably faster running in Linux than it was when I tried it in w2k AS. And here's the other catch to the Linux twist: I'm using reiserfs, not ext3fs. So now we have the other problem with benching disks: how much does the file system affect the results? reiserfs is superior to ntfs, and for large files XFS is even better than reiserfs. XFS was developed by SGI, the same company that used to supply the majority of hardware for Hollywood machines. I will only use NTFS whereas others will only use FAT32.
Maybe we need to have a deviation range? e.g. You are using NTFS and with this test NTFS can have a 2% performance loss. If you were using FAT32 you might get this much more speed on this OS.
Like sodface, I'm losing my train of thought. I'm busy on many software configuration issues so my mind is getting muddled-sorry about that.
Mikki
08-15-2003, 09:30 AM
And see, that's just what I'm talking about. If one OS is noticeably faster than another, like you mentioned pointreyes, then a benchmark should reflect that. It's part of the system.
Same with file systems. The bench should tell you how fast it is, period. How fast does this system write a 10g file to the hard drive? Simple question, simple answer...so why are we having problems? Such things like fragmentation will affect the results, but they affect real-world performance also, so they should be included. Same with the amount of data on the drive...that will affect where the data is written to on the physical drive and therefore affects performance.
I'm a firm believer in testing a system as is, just like I believe a system should be stress-tested as is. What's the point in altering your system and testing when you aren't going to operate the system under those parameters?
Thanks for the post pointreyes...:D
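The "how fast does this system write a file to the hard drive" question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the chunk size, and the use of a temporary file are all choices made here, and the file size in the example is far smaller than the 10g file mentioned above so that it runs quickly.

```python
import os
import tempfile
import time


def time_file_write(size_bytes, chunk_bytes=1024 * 1024, directory=None):
    """Write `size_bytes` of zeros to a temporary file and return the seconds taken.

    The file is flushed and fsync'ed before the clock stops, so the time
    includes handing the data to the storage device, then the file is deleted.
    """
    chunk = b"\0" * chunk_bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            written = 0
            while written < size_bytes:
                n = min(chunk_bytes, size_bytes - written)
                f.write(chunk[:n])
                written += n
            f.flush()
            os.fsync(f.fileno())   # ask the OS to push the data to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return elapsed


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # 64 MiB, an arbitrary size for a quick run
    seconds = time_file_write(size)
    print(f"wrote {size // (1024 * 1024)} MiB in {seconds:.3f}s")
```

Because it goes through the OS and whatever file system the target directory lives on, the result includes fragmentation, free space layout, and file system overhead, in keeping with the "test the system as is" approach argued for above.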
Mikki
08-19-2003, 10:04 AM
Here's (http://www.pinoypc.net/articles/viewarticle.php?article=276) a very interesting editorial about the benchmarking subject from Planet Savage...:)
oldfart
08-19-2003, 10:22 AM
How did i miss this one? This is a subject that interests me.
I have a couple of things that I consider to be mandatory for a benchmark to be considered a good one.
1) Must represent real world performance. Too many benches, notably synthetic mem benches such as SiSoft, Aida, Memtest, etc., do not at all represent real world performance. I've seen tons of these types of benches show a 20% - 30% gain that is <1% or even a loss in performance when compared to a real world bench on the same system. This type of bench is used FAR too often. The results are very misleading. I refuse to use any of them.
2) The results must be consistent. I can run 3Dmark01 without making a single change to my system and get a different result each time. When you are looking for small performance changes while tweaking your system, the normal variation of a 17K score is too much to draw solid conclusions from it. Same goes with SiSoft. Too much variation from run to run. Some programs that are real world and do have consistent results: TMPGEnc, Lame, Winzip, Winrar, Q3A, UT2K3, UT. 3Dmark03 is much better than 01, but it doesn't work on older video cards.
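One way to put a number on the consistency requirement is to repeat a benchmark several times and look at the spread before trusting any single difference. The Python sketch below is a rough rule of thumb, not a rigorous statistical test; the scores are made up for illustration (they are not measurements from this thread), and the "2 standard deviations" threshold is an assumption.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation, variation as % of mean)."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return mean, stdev, 100.0 * stdev / mean


def gain_exceeds_noise(before, after, k=2.0):
    """True if the change in mean score is larger than k times the larger spread."""
    mean_before, stdev_before, _ = summarize(before)
    mean_after, stdev_after, _ = summarize(after)
    noise = max(stdev_before, stdev_after)
    return abs(mean_after - mean_before) > k * noise


if __name__ == "__main__":
    # Hypothetical repeated runs of a ~17K score, before and after a tweak.
    baseline = [17010, 16950, 17080, 16990, 17040]
    tweaked = [17060, 17000, 17120, 17030, 17090]
    for name, runs in (("baseline", baseline), ("tweaked", tweaked)):
        mean, stdev, pct = summarize(runs)
        print(f"{name:8s} mean={mean:.0f} stdev={stdev:.1f} ({pct:.2f}% of mean)")
    print("tweak stands out from run-to-run noise:", gain_exceeds_noise(baseline, tweaked))
```

With made-up numbers like these, a tweak worth a few dozen points disappears inside the normal run-to-run variation, which is the problem described above when chasing small gains with a noisy benchmark.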
Mikki
08-19-2003, 10:36 AM
Thanks for your input oldfart...:) You have two valid points and I think everyone agrees with you...;)
Mikki
08-19-2003, 10:51 AM
And here (http://www.aceshardware.com/read.jsp?id=60000242) is an article which shows 3DMark2003 results from Ace's Hardware. There's no surprise at the results they came up with...;)
pointreyes
08-19-2003, 11:06 AM
Originally posted by Mikki
Thanks for your input oldfart...:) You have two valid points and I think everyone agrees with you...;)
I know I do. I hated the P4 with a passion. I could care less about benchmarks-Oracle was an OldFart *pun intended :p * with the P4. The AMD XP/MP and PIII procs (hence the reason why I went from a dual PIII system to a dual MP system when my PIII system died) were running circles around the P4 for my line of work. My XP2100 was running circles around my 2.4b oc'ed to 3.0. However, the P4 HT line has finally made Oracle work the way it should on a system.