We always tell people that heat and electronics (especially computers) don't mix well. Today we'll give you one example of just how badly things can go when excessive heat and computers collide.
This week has been a horrid one for us - we've had so much hardware die or fail. First our RAID 0 array failed, then our water cooling pump gave out, then one of our Vista installs corrupted itself, and finally we narrowly avoided a fire hazard but lost an nVidia GeForce 7900GS graphics card in the process. They say bad things happen in groups of three - we're at number four and hoping for no more problems.
Today we'll tell you about the death of that nVidia GeForce 7900GS graphics card and how we think it could have been avoided by better, more thoroughly tested Windows Vista ForceWare drivers. It looks like a major blunder by nVidia, and we feel it is important enough to alert our readers to the issue.
Let's get started!
GeForce 7900GS meets early death
Death of a GeForce 7900GS graphics card
It would seem most of our horrid problems this week were cooling (heat) related - except perhaps the RAID 0 array, which remains unexplained at the moment. On Wednesday night I went out for dinner for several hours (six, to be exact), arrived home and noticed my computer had switched itself off, which seemed quite intriguing. Searching for an answer to the power down, I tried turning the system back on several times - it would power up but then shut down again almost instantly. After opening the case, I noticed coolant from the water cooling setup pooled at the bottom of the case (and on the floor), which naturally raised some fear. Coming after the other system failures this week, I was beginning to think my computer was cursed and the gods had something against me.
On further and closer inspection, I noticed the water cooling VGA block literally hanging from the graphics card - that, along with the heat and strange smells, told the story. The water cooling pump, which is responsible for cooling both the GPU and CPU, had failed. The rest of the system seemed to be okay - the motherboard was obviously fine because it was still powering up, which suggested the CPU, memory and so on had also been saved from death, helped along by the high quality coolant I was using. One of the most important (and most expensive) components, the CPU, survived because Intel has advanced thermal management (monitoring and throttling) built into its latest Core 2 processors - and has had thermal management in its older processors for some time now. It is controlled through the system BIOS: if excessive heat occurs, no matter what you are doing on the PC, the system will take appropriate measures - powering down, or at least reducing clock speeds and voltage levels.
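The throttling behaviour described above can be pictured as a simple control loop. The sketch below is purely illustrative - the thresholds, the function name and the proportional scaling are our own assumptions for the sake of the example, not Intel's actual firmware logic:

```python
# Illustrative sketch of BIOS-level CPU thermal management as described
# above. All thresholds and the scaling rule are assumptions, not
# Intel's real implementation.

THROTTLE_TEMP_C = 85    # assumed temperature where throttling begins
SHUTDOWN_TEMP_C = 100   # assumed temperature forcing a hard power-off

def thermal_action(core_temp_c, base_clock_mhz, base_vcore):
    """Return (action, clock_mhz, vcore) for a given core temperature."""
    if core_temp_c >= SHUTDOWN_TEMP_C:
        # Too hot to run at all: power the system down to prevent damage.
        return ("shutdown", 0, 0.0)
    if core_temp_c >= THROTTLE_TEMP_C:
        # Scale clock and voltage down in proportion to how far past the
        # throttle point we are (real hardware uses duty-cycle modulation,
        # but the effect - less heat produced - is the same).
        excess = (core_temp_c - THROTTLE_TEMP_C) / (SHUTDOWN_TEMP_C - THROTTLE_TEMP_C)
        factor = 1.0 - 0.5 * excess
        return ("throttle", int(base_clock_mhz * factor), round(base_vcore * factor, 3))
    return ("normal", base_clock_mhz, base_vcore)
```

With these assumed numbers, a core at 90C would see its clock cut by roughly a sixth - exactly the kind of slowdown that kept our CPU cold while the GPU, with no such protection active, cooked itself.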
There wasn't much heat radiating off the CPU - in fact it was very cool (almost cold). Thermal throttling had kicked in once the water stopped flowing through the pump, radically reducing clock speeds and voltages to compensate for the rise in temperature. The graphics card was another story: I could barely touch the PCB, let alone the other components such as the capacitors, RAM or core - it was clear thermal management hadn't kicked in. It is quite scary to realise that the card kept operating for a full 6 or 7 hours before it finally fizzled out and died; under slightly different circumstances, the effects of the cooling failure could have been far worse.
The graphics card generated so much heat that it completely melted the plastic clips attaching the VGA water block to the card - a quick test with a cigarette lighter showed that the part of the water block which screws into the card is flammable, so we must have got lucky in avoiding a fire there. It also got so hot that it melted the upper copper surface of the VGA block, imprinting the nVidia logo and part of the GPU core markings onto the water block base itself, as you can see above. On the reverse side of the water block (the side that doesn't contact the GPU core), the copper surface was discolored and had even started to crack in places, suggesting extraordinary peak temperatures. Basically, it was a scary sight. Considering the temperatures the graphics card allowed itself to reach, it was a blessing the PCB didn't catch fire and, I hate to say it, take the entire computer - and possibly the desk and then the house - with it. The chance of that is obviously low, but when you're dealing with such high temperatures, anything is possible and nothing can be left to chance.
"Why did this happen?" you may be asking. That's what got us thinking too, and we'll try to provide some reasoning and cautionary advice. We are certainly not on a mission to make companies look bad - but we feel it would be a disservice to our readers if we did not write this article, hence here we are.
No Thermal Management in Vista?
Where is my GPU thermal management in Windows Vista?
So, our topic today: how on earth did our system allow the graphics card to reach such extreme temperatures? Where is the kill switch? More precisely, why did the graphics card driver NOT tell the system to shut down? Is thermal management (monitoring and throttling) not working as it should under Microsoft's latest operating system, Windows Vista? We think some companies have some serious answering to do - namely nVidia and Microsoft, two big fish of the industry.
In nVidia's defense, they were very concerned about the issue once we reported it to them, but they are on Thanksgiving holidays right now in the USA. That meant we couldn't talk directly to the nVidia driver team, although other nVidia offices around the world are still active. After exchanging several phone calls and emails with nVidia on the issue, the response which shocked us most was this one from a contact at the nVidia Taiwan office who works in a technical marketing capacity:
In regards to the issue you've found, I think it may be an individual case. I tried testing a 7900GS board in my lab this afternoon with Vista running and our drivers posted on the net, but did not see a similar occurrence. It worked fine.
And my response:
Just another question, when you said "It worked fine" - please clarify, did you remove ALL cooling and let the graphics card run for an extended amount of time (and how long...)?
And the last response from nVidia:
I did not remove all cooling to let it run because this is not the normal sku that graphics cards using our GPUs are shipped. I ran the system with benchmarks as well as let the system sit for a while for a total of around 10 hours. But of course, this is with the normal cooling on the card.
Truthfully speaking, I find it somewhat amazing that a GPU can get to temperatures so hot to melt copper. Considering that the melting point of copper is close to 1100C, I would think the whole board would've caught on fire before copper would melt.
As for the temperature monitoring, since it notes that the feature will be available in a separate downloadable utility, I'm assuming that it would be best if you download that utility. But as for how our drivers are programmed and how our hardware is designed, I am not at liberty to discuss this information. It would really be best if you can wait until we get a more concrete response from our counterparts in the US.
Let's start from the beginning. I was using a water cooling setup whose pump has no water flow protection, so when the water stopped pumping, the pump could not detect that fundamental failure and shut the system down - yes, we are in talks with that company as well about why their pump lacks this feature. The water stopped flowing and temperatures started to rise rapidly (even with the GPU sitting idle in Vista, although the Aero interface probably results in higher idle temperatures). In Windows XP, nVidia's ForceWare graphics drivers have GPU monitoring (full thermal management) enabled, but ForceWare 96.85 for Vista (the latest publicly available on the nVidia website) lacks many features. nVidia's own release notes claim these to be "Not NVIDIA Issues":
This section lists issues that are not due to the NVIDIA driver as well as features that are not meant to be supported by the NVIDIA driver for Windows Vista.
That's interesting, to say the least! It would seem nVidia have this BETA driver available on their website but are effectively trying to isolate themselves from responsibility for any unforeseeable issues which may occur. That type of action is shocking, and even if the driver is in BETA and not "officially supported", that means nothing. The driver is available and listed on nVidia's public website as WHQL certified (Microsoft tests it and certifies it for public consumption - an exercise costing around $10,000 USD per driver), hence driver related issues ARE their responsibility, because they are making it available to users. nVidia have a duty of care to the computer users whose graphics cards use their GPUs - any judge or lawyer will tell you this; it is among the first things you learn in legal studies.
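Coming back to the pump for a moment: the flow protection it lacked is conceptually trivial. Here is a hypothetical sketch - the sensor readings, threshold and grace period are all our own assumptions - of the shutdown logic a simple flow sensor in the loop would enable:

```python
# Hypothetical sketch of the flow-protection feature our pump lacked:
# a flow sensor in the loop, and a shutdown decision when flow stops.
# Threshold and grace period are assumptions for illustration only.

MIN_FLOW_LPM = 0.5     # assumed minimum safe flow, litres per minute
GRACE_READINGS = 3     # tolerate brief dips (e.g. a passing air bubble)

def should_shut_down(flow_readings_lpm):
    """True once flow stays below the safe minimum for several readings."""
    low_streak = 0
    for flow in flow_readings_lpm:
        if flow < MIN_FLOW_LPM:
            low_streak += 1
            if low_streak >= GRACE_READINGS:
                return True
        else:
            low_streak = 0  # flow recovered, reset the streak
    return False
```

A pump with even this much logic wired to the power switch would have had our system off within seconds of the failure, instead of letting the GPU bake for hours.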
GPU Temperature Monitoring
Temperature monitoring is no longer supported in the default GPU driver control panel. This feature will be available in a separate downloadable utility.
This is definitely one for the books! From our research so far, because the current public Vista drivers for GeForce graphics cards are unable to monitor GPU temperatures in their control panel, it appears they are also unable to operate the usual internal thermal management systems they run in Windows XP. If our cooling failure had occurred in Windows XP, the driver would have forced a crash: in a game running the higher 3D clock speeds, the game would crash to the desktop; sitting idle in Windows, the driver would force a crash and usually the system would power down. This gives you a chance to fix the cooling issue and avoid the death of your graphics card - and possibly other components, if things turned really bad.
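The XP-era behaviour we describe - crash the 3D application if one is running, otherwise force a power-down - amounts to only a few lines of decision logic, which makes its absence from the Vista driver all the stranger. A hypothetical sketch (the threshold and names are our assumptions, not nVidia's actual code):

```python
# Hypothetical sketch of the XP-era driver protection described above.
# The critical threshold and all names are assumptions for illustration.

GPU_CRITICAL_TEMP_C = 115  # assumed critical GPU temperature

def protective_action(gpu_temp_c, running_3d_app):
    """Decide what the driver should do at a given GPU temperature."""
    if gpu_temp_c < GPU_CRITICAL_TEMP_C:
        return "none"
    if running_3d_app:
        # Killing the 3D app drops clocks back to 2D levels and sheds
        # the biggest heat load first.
        return "crash_3d_app_to_desktop"
    # Idle and still critical means the cooling itself has failed:
    # force a power-down before hardware damage occurs.
    return "power_down_system"
```

Under Vista with ForceWare 96.85, the evidence suggests the answer was "none" at every temperature - which is precisely how our card got hot enough to melt its own water block.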
We also used this exact same 7900GS graphics card under Windows XP before installing Vista, so we can confirm it does have a thermal diode and that thermal management does work under the older OS.
What was Microsoft thinking?
Yes, these drivers are WHQL Certified!
Now, in this part of the article, we pass the ball (responsibility) over to Microsoft. nVidia's ForceWare 96.85 drivers were released on October 17, 2006 and are WHQL certified! That means that before nVidia releases the drivers, they are checked by Microsoft and officially approved and certified for use with their operating system. We know for a fact that this process for ATI Radeon Catalyst drivers takes around 24 hours (give or take) - is that enough time to properly test a driver for every possible error? Obviously not, as they didn't pick up on the problem we found. Microsoft dropped the ball here, too.
So how on earth could Microsoft approve the ForceWare 96.85 drivers knowing that GPU temperature monitoring was no longer supported under their latest OS? Did they not consider what might happen if a graphics card overheats in Vista? The usual Windows XP mechanism, which tells the system to crash or power down, is simply not present. The answer is simple: the drivers should never have passed WHQL approval in Redmond, and should never have made it to the nVidia website for public consumption. It's a major Vista driver blunder by two big companies.
We know Vista RTM has only just been released, but many people have been using forms of Vista (RC1, RC2 and even earlier releases) for quite some time now. We also appreciate that Vista (final RTM) is a new operating system and nVidia are still fine-tuning their driver performance and features to make everything work just like XP. However, none of this excuses the missing thermal management - that should have been the first thing nVidia made sure worked 100%.
Personally, I would much rather have seen nVidia spend their time making the internal thermal management systems work properly before concentrating on improving frame rates or making SLI work better, for instance. There may well be an underlying reason why thermal management isn't working under Vista, but nVidia should have worked harder to fix it and should at the very least have made it clear that this feature isn't working properly - you shouldn't need to dig through pages of release notes to find out, as it may well be too late by the time you do. It should be visibly clear.
Further Testing and Industry Concerns
Further Testing with GeForce 7600GS
In an attempt to replicate the problem, I went to the nearest computer shop and bought another GeForce graphics card, knowing full well that it probably would die. Being a mainstream shop, the highest-end card they stocked was a GeForce 7600GS, so we picked one up straight away. This was a Leadtek card with no fan, just a passive heat sink with a heat pipe design.
We installed the card without any cooling, but the GPU got so hot before Vista even had a chance to load that the system shut down. This suggests that some recent GeForce cards do come with thermal diodes and do have thermal management working effectively. Keep in mind, though, that this happened outside Windows Vista - we're talking around the BIOS POST screen!
We could have tried removing the heat sink while in Windows Vista, but we weren't on a mission to get electric shocks, so we didn't attempt it. The chances are high that the same thing would have happened - the card would have kept running until the GPU core got so hot it fizzled out and died, hopefully before we saw fire. Keep in mind that electrical fires are the hardest to control, especially without a fire extinguisher - I will be buying one soon, for safety's sake.
Industry Concerns on the subject
We were very interested to hear ATI's thoughts on the subject and asked them over email what would happen if cooling failed on a Radeon based graphics card under Windows Vista - we had a Power Color Radeon X800GTO spare and ready for death in our Taiwan lab.
ATI suggested that emulating a failed cooling system was not advisable, but said that as long as the particular graphics card partner installed thermal control, their driver should be able to handle the rest without any problems under Vista. Since we could not confirm whether Power Color used thermal control, we didn't attempt the test, on ATI's advice. From the ATI driver team in Canada:
If there is no thermal controller, then the chip can be damaged (as there is no way to monitor the board - it is up to each partner whether they include a thermal chip - some leave it off to save money on the cost of the board).
But in the case of the X800 GTO (it may be that a thermal controller was included on the board, but OD is still not supported) - I don't know, but I would not recommend taking the heatsink off of your board - there's a good reason why they're there
So this isn't a driver issue - the board needs the thermal Asic to begin with, if one is there it is very easy to monitor the temperature and ensure nothing happens to it.
Our concern with this response is that ATI / AMD doesn't force its graphics board partners to install a thermal monitoring device on Radeon graphics cards. Both nVidia and ATI should make thermal monitoring a must-have on all graphics cards from low-end to high-end - all graphics cards get hot.
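As ATI's driver team notes, once a thermal ASIC is on the board, monitoring it is easy. A hypothetical sketch of what the driver-side polling amounts to (the thresholds, names and the `read_temp` callback are our assumptions; real drivers read the sensor chip over a bus such as I2C, and the details differ per board partner):

```python
# Hypothetical sketch of polling a board thermal ASIC from a driver.
# read_temp stands in for the real bus access to the sensor chip;
# thresholds are assumptions for illustration only.

def classify(temp_c, warn_c=95, critical_c=115):
    """Map a temperature reading to a protection status."""
    if temp_c >= critical_c:
        return "critical"   # act now: throttle or shut the system down
    if temp_c >= warn_c:
        return "warn"       # log it, ramp up the fan if there is one
    return "ok"

def monitor(read_temp, max_polls=10):
    """Poll the sensor and return the first non-'ok' status seen."""
    for _ in range(max_polls):
        status = classify(read_temp())
        if status != "ok":
            return status
    return "ok"
```

The point stands: the hard part is not the software, it is the few cents of sensor hardware that some partners leave off the board.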
If you are unlucky enough to buy a graphics card with no thermal monitoring (because the card partner wanted to save a few bucks!), you had better pray you never have a cooling failure. It's high time ATI, nVidia and the other graphics chip companies forced their partners to implement thermal control - don't make it an option!
Saving just a few dollars per card does add up when you consider how many graphics cards are sold, but it should in no way come at the cost of safety - we are working with electronics and moving parts, and they do fail from time to time without warning.
We will no doubt quiz ATI and nVidia more about this and also talk to board partners about the issue.
The normal warranty response from all companies is simple and acceptable: if you change the stock cooling, your warranty will be void. That in itself is fair enough - if you change the cooling, you have changed the way the graphics card was designed to be used.
However, in an enthusiast computing world where water cooling (and other cooling methods) is becoming more prevalent - nVidia's very own nForce 680i motherboard is designed with water cooling in mind - nVidia should have made 100% sure thermal management was working correctly, or at least made the missing feature in Vista much more obvious. As mentioned, both ATI and nVidia should make sure that all graphics cards using their silicon come with thermal control. nVidia know quite well that their products sell to people who overclock (and hence raise temperatures), and we can't help but feel nVidia made a major blunder in their current Vista driver. Personally, I don't care that I ended up with a dead GeForce 7900GS - it was a review sample and we didn't pay for it. What matters is that you learn something from our experience and understand how and why it happened.
Even if we hadn't changed the cooler and had used the default shipping cooler, there is always a chance that a cooling fan will fail. If the fan fails, you will end up seeing a similar aftermath to our failed water cooling pump - the death will just take a little longer.
It goes without saying that we hope nVidia brings thermal management back in its upcoming Vista drivers ASAP - it is a must. Both nVidia and ATI drivers have problems right now, and not everything is supported and working as it does in Windows XP, but both companies are due to release new drivers next week and are said to be working hard on this.
It would be a fundamental mistake if nVidia released these drivers without thermal management working 100% effectively, but we will follow up on this and conduct further testing. For the stacks of users out there who change the cooling on their graphics card and overclock (reaching 20% of the market these days), we urge you to take some caution away from this article. If you run into the same problems as us, you will be lucky if you only lose your $500 graphics card - and you can then only hope the problems stop there and don't cause any further disaster.
We are waiting for further response from nVidia and will update this article once we hear more from them and other companies on the subject. Please feel free to leave your comments below - have you experienced this problem or anything similar? Do let us know!