29 Comments
Theunis - Wednesday, December 31, 2008 - link
I'm not entirely happy with the results, and I would also like to know which -march was used to compile the 64-bit Linux kernel, MySQL, and the other Linux benchmarks, because optimizing for a specific -march is crucial to comparing apples with apples. OK, yes, AMD wins, assuming standard/generic compile flags were used. But what if both were built with architecture-optimized compile flags? Then if AMD lost, they could blame gcc. It would still be interesting to see the outcome for open source support, how it ties in with vendor support for open source, and how it affects the distributions that go as far as supporting architecture-specific builds (the Gentoo Linux distribution comes to mind).
Intel knew that they were lagging behind because their arch was too specific and wasn't running as efficiently on generically compiled code. Core i7 was supposed to fix this. But what if that is still not the case, and architecture-specific optimization is still necessary to see any improvement on Intel's side?
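As a rough illustration of the comparison suggested above, the same benchmark source could be built once with generic flags and once per vendor-tuned target, and each binary timed separately. This is only a sketch under stated assumptions: `bench.c`, the output names, the use of `-O2`, and the choice of these three flag sets are hypothetical and are not the settings used in the article. The `-march` and `-mtune` values themselves are existing GCC options.

```python
import shlex

# Hypothetical flag sets: a generic x86-64 baseline plus two tuned builds.
# Which targets to compare, and the use of -O2, are assumptions made for
# illustration only; they are not taken from the article.
FLAG_SETS = {
    "generic": ["-O2", "-march=x86-64", "-mtune=generic"],
    "intel-tuned": ["-O2", "-march=core2"],
    "amd-tuned": ["-O2", "-march=barcelona"],
}


def build_commands(source, flag_sets=FLAG_SETS, compiler="gcc"):
    """Return {label: argv} for compiling `source` once per flag set."""
    commands = {}
    for label, flags in flag_sets.items():
        output = f"bench-{label}"  # hypothetical output binary name
        commands[label] = [compiler, *flags, "-o", output, source]
    return commands


if __name__ == "__main__":
    # Print the command lines; running and timing them is left to the reader.
    for label, argv in build_commands("bench.c").items():
        print(f"{label:12s} {shlex.join(argv)}")
```

Reporting the generic build next to the tuned builds for each CPU would show how much of any difference comes from the compiler flags rather than from the hardware.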
James5mith - Tuesday, December 30, 2008 - link
Just wanted to mention that I love that case. The Supermicro 846 series is what I'm looking at for some serious disk density in a new storage server. I'm just wondering how much the SAS backplane affects access latencies, etc. (If you are using the one with the LSI expander chip.)
Krobaruk - Saturday, December 27, 2008 - link
Just wanted to add a few bits of info. Back to an earlier comment: it is definitely incorrect to call SAS and Infiniband the same. The cables are in fact of slightly different composition (differences in shielding), although they are terminated the same way. Let's not forget that 10G Ethernet uses the same termination in some implementations too.
Also, at least under Xen, AMD platforms still do not offer PCI passthrough. This is a fairly big inconvenience and should probably be mentioned, as support is not expected until the HT3 Shanghais are released later this year. Particularly interesting are the results here, which show that despite NUMA greatly reducing the load on the HT links, it makes very little difference to the VMmark result:
http://blogs.vmware.com/performance/2007/02/studyi...
I would imagine HT3 will only really benefit 8-way Opterons in the typical figure-of-8 config, as the bus is that much more heavily loaded.
It's odd that your Supermicro quad had problems with the Shanghais and certain apps; there were no problems in testing with a Tyan TN68 here. Was the BIOS a beta?
Are any benches with Xen likely to happen in future, or is AnandTech solidly VMware?
Answering another user's question about mail servers and SpamAssassin: I do not know of any decent benchmarks for this, but I have seen the load that different mail filter servers can handle under the MailWatch/Mailgate system. Fast SAS RAID 1 and dual quad cores seem to give the best value, and the RAM requirement is about 1 GB per core. It would be interesting to see some sort of Linux mail filter benchmark if you can construct something to do this.
RagingDragon - Saturday, December 27, 2008 - link
I'd imagine a lot of software development servers are used for testing applications which fall into one of the other categories. There'd also be bug tracking and version control servers, but most interesting from a performance perspective might be build servers (i.e. servers used for compiling software). The best benchmark for that would probably be compile times for various compilers (e.g. GNU, Intel and MS C/C++ compilers; Sun, Oracle and IBM Java compilers; etc.)
bobbozzo - Friday, December 26, 2008 - link
IMO, many (most?) mail servers need more than fast I/O, what with so much anti-spam and anti-virus filtering going on nowadays.
SpamAssassin, although wonderful, can be slow under heavy loads and with many features enabled, and the same goes for A/Vs such as clamscan.
That said, I don't know of any good benchmarks.
vtechk - Thursday, December 25, 2008 - link
Very nice article, but the Fluent benchmarks are far too simple to give relevant information. Standard Fluent job these days has 25+ million elements, so sedan_4m is more of a synthetic test than a real-world one. It would also be very interesting to see Nastran, Abaqus and PamCrash numbers.
Amiga500 - Sunday, December 28, 2008 - link
Just to reinforce Alpha's comments... 25 million being standard isn't even close to being true!!!
Sure, if you want to perform a full global simulation, you'll need that number (and more) for something like a car (you can add another 0 to the 25 million for F1 teams!), or an aircraft.
But mostly the problems are broken down into much smaller pieces, under 10 million cells. A rule of thumb with Fluent is 700K cells for every GB of RAM... so working on that principle, you'd need a 16GB workstation for 10 million cells...
Anything more, and you'll need a full cluster to make the turnaround times practical. :-)
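The rule of thumb above is easy to turn into a quick estimate. The sketch below is only an illustration: the 700K cells per GB figure is the one quoted in the comment, while rounding up to the next power-of-two RAM size is an assumption about how workstations are typically configured, and both function names are made up for the example.

```python
CELLS_PER_GB = 700_000  # rule of thumb quoted in the comment above


def ram_needed_gb(cells, cells_per_gb=CELLS_PER_GB):
    """Estimated RAM in GB for a mesh with the given number of cells."""
    return cells / cells_per_gb


def round_up_to_power_of_two(gb):
    """Smallest power-of-two size in GB that is >= gb.

    Assumption: RAM is bought in power-of-two sizes.
    """
    size = 1
    while size < gb:
        size *= 2
    return size


if __name__ == "__main__":
    for cells in (4_000_000, 10_000_000, 25_000_000):
        need = ram_needed_gb(cells)
        print(f"{cells:>11,d} cells: ~{need:.1f} GB, "
              f"so a {round_up_to_power_of_two(need)} GB machine")
```

For 10 million cells this gives roughly 14.3 GB, which is where the 16GB workstation figure comes from.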
alpha754293 - Saturday, December 27, 2008 - link
"Standard Fluent job these days has 25+ million elements"
That's NOT entirely true. The size of the simulation is dependent on what you are simulating. (And also hardware availability/limitations).
Besides, I don't think that Johan actually ran the benchmarks himself. He just dug up the results database and mentioned it here (wherever the processor specs were relevant).
He also notes that the benchmark itself can be quite costly (e.g. a Fluent license can easily be $20k+), and that a degree of understanding is needed to run the benchmark (which, as he also stated, neither he nor the lab has).
And on that note: Johan, if you need help interpreting the HPC results, feel free to get in touch with me. I sent you an email on this topic (I'm actually running the LS-DYNA 3-car collision benchmark on my systems as we speak). I couldn't find the case data needed to run the Fluent benchmarks as well, but if I do, I'll run them and let you know.
Otherwise, EXCELLENT. Looking forward to lots more coming from you once the NDA is lifted on the Nehalem Xeons.
zpdixon42 - Wednesday, December 24, 2008 - link
Johan,
The article makes a point of explaining how fair the Opteron Killer? section is, by assuming that unbuffered DDR3-1066 will provide results close enough to registered DDR3-1333 for Nehalem. But what is nowhere mentioned is that all of the benchmarks unfairly penalize the 45nm Opteron because registered DDR2-800 was used whereas faster DDR2-1067 is supported by Shanghai. If you go to great lengths justifying memory specs for Intel, IMHO you should mention that point for AMD as well.
The Oracle Charbench graph shows "Xeon 5430 3.33GHz". This is wrong, it's the X5470 that runs at 3.33GHz, the E5430 runs at 2.66GHz.
The 3DSMax 2008 32 bit graph should show the Quad Opteron 8356 bar in green color, not blue.
In the 3DSMax 2008 32 bit benchmark, some results are clearly abnormal. For example a Quad Xeon X7460 2.66GHz is beaten by an older microarchitecture running at a slower speed (Quad Xeon 7330 2.4GHz). Why is that?
The article mentions the Opteron "8484" in 2 places; this should be "8384".
The Opteron Killer? section says "the boost from Hyper-Threading ranges from nothing to about 12%". It should rather say "ranges from -5% to 12%" (i.e. HT degrades performance in some cases).
There is a typo in the same section: "...a small advantage at* it can use..." s/at/as/.
Also, I think CPU benchmarking articles should draw graphs to represent performance/dollar or performance/watt (instead of absolute performance), since that's what matters in the end.
JohanAnandtech - Wednesday, December 24, 2008 - link
"But what is nowhere mentioned is that all of the benchmarks unfairly penalize the 45nm Opteron because registered DDR2-800 was used whereas faster DDR2-1067 is supported by Shanghai."
Considering that Shanghai has just made DDR-2 800 (buffered) possible, I think it is highly unlikely that we'll see buffered DDR-2 1066 very soon. Is it possible that you are thinking of Deneb which can use DDR-2 1066 unbuffered?
"In the 3DSMax 2008 32 bit benchmark, some results are clearly abnormal. For example a Quad Xeon X7460 2.66GHz is beaten by an older microarchitecture running at a slower speed (Quad Xeon 7330 2.4GHz). Why is that?"
Because 3DS Max does not like 24 cores. See here:
http://it.anandtech.com/cpuchipsets/intel/showdoc....
"Also, I think CPU benchmarking articles should draw graphs to represent performance/dollar or performance/watt (instead of absolute performance), since that's what matters in the end."
In most cases performance/dollar is a confusing metric for server CPUs, as it greatly depends on what application you will be running. For example, if you are spending 2/3 of your money on a storage system for your OLTP app, the server CPU price is less important. It is better to compare to similar servers.
Performance/Watt was impossible as our Quad Socket board had a beta BIOS which disabled PowerNow!, so that would not have been fair.
I'll check out the typos you have discovered and fix them. Thx.
zpdixon42 - Wednesday, December 24, 2008 - link
DDR2-1067: oh, you are right. I was thinking of Deneb.
Yes, performance/dollar depends on the application you are running, so what I am suggesting, more precisely, is that you compute some perf/$ metric for every benchmark you run. And even if the CPU price is negligible compared to the rest of the server components, it is always interesting to look at both absolute perf and perf/$ rather than just absolute perf.
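A per-benchmark perf/$ figure of the kind suggested above could be computed along these lines. This is a sketch only: the CPU names, scores and prices are invented placeholders to show the arithmetic and are NOT results or prices from the article, and dividing by CPU cost alone (rather than whole-server cost) is an assumption.

```python
# Placeholder inputs only: these scores and prices are invented to show the
# calculation and are NOT results or list prices from the article.
PRICES = {"cpu_a": 1000.0, "cpu_b": 1500.0}  # hypothetical price per CPU, USD
SCORES = {
    "benchmark_1": {"cpu_a": 100.0, "cpu_b": 120.0},
    "benchmark_2": {"cpu_a": 80.0, "cpu_b": 130.0},
}


def perf_per_dollar(scores, prices, sockets=1):
    """Return {benchmark: {cpu: score per dollar spent on CPUs}}."""
    table = {}
    for bench, by_cpu in scores.items():
        table[bench] = {
            cpu: score / (prices[cpu] * sockets)
            for cpu, score in by_cpu.items()
        }
    return table


if __name__ == "__main__":
    # Print absolute perf and perf/$ side by side for a 4-socket system.
    ratios = perf_per_dollar(SCORES, PRICES, sockets=4)
    for bench, row in ratios.items():
        for cpu, value in row.items():
            print(f"{bench:12s} {cpu:6s} "
                  f"score={SCORES[bench][cpu]:7.1f} perf/$={value:.4f}")
```

With these placeholder numbers the cheaper part leads on perf/$ in the first benchmark even though it trails on absolute score, which is the kind of difference a perf/$ graph would expose.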
denka - Wednesday, December 24, 2008 - link
32-bit? 1.5 GB SGA? This is really ridiculous. Your tests should be bottlenecked by IO
JohanAnandtech - Wednesday, December 24, 2008 - link
I forgot to mention that the database created is slightly larger than 1 GB. And we wouldn't be able to get >80% CPU load if we were bottlenecked by I/O
denka - Wednesday, December 24, 2008 - link
You are right, this is a smallish database. By the way, when you report CPU utilization, do you count IOWait separately from CPU used? If they are taken together (which was not clear), it is possible to get 100% CPU utilization of which 90% is IOWait :)
denka - Wednesday, December 24, 2008 - link
Not to be negative: excellent article, by the way
mkruer - Tuesday, December 23, 2008 - link
If/when AMD does release Istanbul (K10.5 6-core), Nehalem will again be relegated to second place for most HPC.
Exar3342 - Wednesday, December 24, 2008 - link
Yeah, by that time we will have 8-core Sandy Bridge 32nm chips from Intel...
Amiga500 - Tuesday, December 23, 2008 - link
I guess the key battleground will be Shanghai versus Nehalem in the virtualised server space...
AMD need their optimisations to shine through.
It's entirely understandable that you could not conduct virtualisation tests on the Nehalem platform, but unfortunate from the point of view that it may decide whether Shanghai is a success or failure over its life as a whole. As always, time is the great enemy! :-)
JohanAnandtech - Tuesday, December 23, 2008 - link
"you could not conduct virtualisation tests on the Nehalem platform"
Yes. At the moment we have only 3 GB of DDR-3 1066. So that would make pretty poor Virtualization benches indeed.
"unfortunate from the point of view that it may decide whether Shanghai is a success or failure"
Personally, I think this might still be one of Shanghai's strong points. Virtualization is about memory bandwidth, cache size and TLBs. Shanghai can't beat Nehalem's BW, but when it comes to TLB size it can make up a bit.
VooDooAddict - Tuesday, December 23, 2008 - link
With the VMware benchmark, it is really just a measure of the CPU / Memory. Unless you are running applications with very small datasets where everything fits into RAM, the primary bottleneck I've run into is the storage system. I find it much better to focus your hardware funds on the storage system and use the company-standard hardware for the server platform.
This isn't to say the bench isn't useful. Just wanted to let people know not to base your VMware buildout solely on those numbers.
Bruce Herndon - Tuesday, December 23, 2008 - link
I'm surprised by your comments. You claim that VMmark is a CPU/memory-centric benchmark. If I look at the raw data in the VMmark disclosure for Dell's R905 score of 20.35 @ 14 tiles, I see that the benchmark is driving 250-300 MB/s of disk IO across several HBAs and storage LUNs. This characteristic scales with the various systems mentioned in the article.
As a designer of VMmark, I happen to know that both storage bandwidth (for the fileserver) and latency (for mail and database) are critical to achieving good VMmark scores. Furthermore, the webserver drives substantial network IO. The only purely CPU-centric component of VMmark is the javaserver. Overall, the benchmark does exercise the entire virtualization solution: hypervisor, CPU, memory, disk, and network.
cdillon - Tuesday, December 23, 2008 - link
While SAS and Infiniband share some connectors and obtain similar data rates, they are incompatible technologies with two different purposes. Infiniband can be used for disk shelf connections, but it is less common and definitely not the case here. You should not call the connection between the Adaptec 5805 controller and the disk shelf an "Infiniband connection". Even if it is using Infiniband connectors and cables, it is simply a SAS connection.
JohanAnandtech - Tuesday, December 23, 2008 - link
Well, the physical layer is Infiniband, the protocol used is SCSI. I can understand that calling it an "infiniband connection" may be confusing, but the cable is an infiniband cable.
shank15217 - Friday, December 26, 2008 - link
Anand, I think the above poster is right. The Adaptec RAID 5805 uses SFF-8087 connectors but the protocol is SSP (Serial SCSI Protocol). Infiniband is a physical layer protocol that shares the same connector as SAS but they are not the same. Nothing in the Adaptec RAID 5805 spec mentions Infiniband as a supported protocol.
http://www.adaptec.com/en-US/products/Controllers/...
niva - Tuesday, December 23, 2008 - link
I'm not sure you can run your same ol' benchmark for rendering, and I'd really like more insight into what you guys are rendering and whether it's indeed using all 16/24 (six core 4 point system)/32 (hyperthreading) cores on the system.
What renderer, what scene, details details...
These chips get gobbled up by render farms and this is indeed where they can really flex their muscles to the fullest.
JohanAnandtech - Tuesday, December 23, 2008 - link
Just click on the link under "we have performed so many times before" :-)
akinneyww - Tuesday, December 23, 2008 - link
I read DailyTech and anandtech.com to keep up with the latest in IT. I appreciate the thought that has gone into putting together this article. I would like to see more articles like this one.
Jammrock - Tuesday, December 23, 2008 - link
The VMware results shocked me the most. I know AMD has been working hard on the virtualization sector and it looks like their work has paid off.
classy - Tuesday, December 23, 2008 - link
With the rapid increase of virtualization, AMD is looking really strong. We have begun using VMware 3.5 and are expanding our use of it. Virtualization is truly becoming a big factor in server choice.