Failing hardware part 3
More updates on the constant meltdown of my workstation. Thanks to the folks who commented on this blog and on IRC -- I had some helpful suggestions and thoughts.
I managed to get rid of the graphical corruption by switching to a different graphics card. Yes, I know the same kinda corruption can occur with a bad or missing grub splashimage, but I'd been running with one for two years with no issues. Since my grub.conf doesn't really change, that wasn't the cause. So I kinda fixed one issue -- new graphics card, no framebuffer/grub corruption.
Unfortunately, the lockups continue. So the biggest issue the new purchase was supposed to fix . . . didn't. I got shafted on tax and shipping, so what should have been an $86 card turned into $101. Ouch. That's a fair bit above the $80 neighborhood that the RadeonHD 4670 is supposed to be. Still, I got a nice HIS IceQ card. I truly can't hear its fan; it's got a very nice Arctic model. It's a nice enough upgrade, but since my system still has issues it's fairly pointless.
I switched power outlets, thinking maybe it was bad wiring in the walls. No change. Tried switching NICs again. No change. Played with some BIOS settings; I noticed a couple of IRQs being shared that really didn't need to be. No change. Tried removing my sound card, since it's the only PCI device in the system. No change.
Then I tried booting a LiveCD. SliTaz, actually. It runs entirely in RAM. It worked okay for about an hour and a half. I let my system mostly idle for a few hours; when I came back, it was hardlocked and showed no sign of recovering.
Fortunately, I'd backed up all my data to a separate disk, in preparation for reinstalling. My system stayed stable just long enough to copy the data and pull the drive; it actually hardlocks on shutdown and reboot now, too.
So, since a completely different kernel and operating system didn't show any improvements, and neither did the graphics card or running without CD drives or hard disks, I figure the problem must lie with the motherboard.
Right?
To that end, I've been searching for cheap ($80) motherboards that support socket AM2/AM2+, since I plan to save some money and reuse my Athlon X2 4600+, rather than switch to Intel. Based on all the failure reports out there, I'm avoiding Gigabyte, EVGA, and ECS motherboards. ASUS has a very attractive line of M3A78 boards, usually with the AMD 780G chipset. I'd like one with a 790GX chipset, but that's in the $150 range. The ASUS boards I've looked at are all in my price range; it's just a matter of finding one that I like. About the only requirement I have is that it uses only solid-state capacitors and power circuitry. I'm not going to touch anything that has the possibility of leaking/exploding caps.
The other major requirement is that it needs to have working AHCI mode for SATA disks. I ran into a recent Phoronix forums post that said ATI chipsets have "buggy as hell" AHCI support, which doesn't bode well. But I don't really want to go down the nVidia route either, not since I filed that bug a while ago for the failure of sata_nv to work with optical drives on the MCP55 chipset. Any suggestions, chipset-wise?
I suppose it's not really a good time to be buying stuff from either AMD or Intel right now, given the nifty stuff coming out next month and after, but I've no time to sit around waiting. I've got stuff to do. Ebuilds to be hacked. Newsletters to be sent. Docs to be written. I mean, what with the coming stabilization of baselayout-2, OpenRC, Portage-2.2 . . . you name it.
5 comments
I just want to refer you to my previous comment http://planet.gentoo.org/developers/nightmorph/2008/10/02/failing_hardware#c19978
because the more you describe your problem, the more it looks like you have some power problem (voltage peaks). If you don't want to buy a UPS -- which I would recommend -- you could buy a quality, stabilized (!) power supply, which should smooth out the power output.
Cheers
Jochen
They have several very similar submodels, differing AFAIK just by amount of the sideport RAM and one of them being small ITX.
They have had quite favourable reviews and are, to my knowledge, the cheapest in that range.
While at it, check your PSU. It might be that bad caps in the PSU started it all and that the PSU gradually started killing everything else...
I had a 'crashy' system here for a while due to memory. It was NOT bad memory cells -- memtest checked out fine. Rather, it was that the (generic) memory wasn't quite stable at the rated speed, and would occasionally corrupt a transfer.
As with you, it happened pretty much any time, but more often when there was heavy memory bandwidth usage. My system is ECC enabled and I toggled that on and off, but if anything it was worse with it ON than off, apparently due to the transfer of the additional ECC bits (?). I don't have strong enough 3D to do much there, but I did notice more frequent lockups while doing emerges -- but it wasn't regular enough to really pin it down.
Eventually, I found something on MCEs (machine check exceptions), and enabled the option for that in the kernel. I'd get them, but then had to trace down some way of figuring out what the number meant. I found an app called parsemce (IIRC by Dave Miller?, google...) that told me it was the memory.
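For anyone wondering what "figuring out what the number meant" involves: the status value logged with a machine check is a 64-bit word made of flag bits plus error-code fields. Below is a minimal sketch of splitting one apart, assuming the usual x86 status-register layout (flag bits 63 down to 57, model-specific code in bits 31:16, MCA error code in bits 15:0). The function name and the sample value are made up for illustration, and this is no substitute for a real decoder; check your CPU's manual for what the codes mean on your hardware.

```python
# Hypothetical helper, not taken from any existing tool: split a 64-bit
# machine-check status word into its flag bits and error-code fields.
# Assumes the common x86 bank status layout; verify against your CPU manual.
FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the companion MISC register is valid
    58: "ADDRV",  # the companion ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Return the set flags and the two 16-bit error-code fields."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Made-up sample value, purely to show the shape of the output.
    sample = 0xB400000000000135
    print(decode_mci_status(sample))
```

The flags tell you how bad it was (corrected or not, whether an address was captured); the error code is the part a decoder then maps to "memory", "cache", "bus" and so on.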
Eventually a BIOS upgrade gave me memory de-clocking ability and I found de-clocking the memory a single notch was all it needed. I could even decrease the various individual wait-states/cycle-counts and did so; I just couldn't return to the rated memory clock.
Eventually I upgraded memory and could set the system back to normal memory clock after that. Both the de-clocked old memory and the full-clocked new memory were rock stable, so it was just that, the memory.
If you're lucky, you can de-clock the memory in your BIOS, or have either spare sticks or other machines that you can swap memory with. That'd give you a zero-cost test method. But unless you require registered memory as I did, a single 512 meg or whatever stick should be cheap and enough to test with.
Duncan
If you are buying a new board, ignore boards without the SB750 southbridge.
Older ones with SB700/SB600 have various problems and lack a crucial function, ACC, that might give you better OC or lower power consumption at rated speed and is obligatory for the upcoming K-10 Denebs...
The price delta for such a board is too low not to go for it...