In hindsight, I made some pretty foolish assumptions that led to my failure to interpret some early warning signs…
Last night I finally pinpointed the exact cause of the problems that have been keeping my dual-Athlon workstation PC down for the past four months. I ran a copy of Alexander Grigoriev’s MemTest utility (which was recommended to me by some of the chipset engineers at work, and has proven to be a really excellent piece of software for testing memory) on “Aki” (have I mentioned my computer’s name is Aki?), and sure enough it found a huge chunk of completely trashed memory. Further testing allowed me to isolate the problem to one DIMM, and there you have it: one of the memory modules had flaked out and was the reason behind all (?) the instability I’d experienced.
I’d foolishly assumed that, because I’d set up an ECC (Error Checking and Correction) memory subsystem, the memory itself was unlikely to be at fault. But ECC can only fix single-bit memory “glitches” (which can occur occasionally due to “cosmic radiation” and other causes). I’d overlooked the fact that it was entirely possible for a whole memory module to go bad, which turned out to be the case.
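To make that limitation concrete, here is a toy sketch of the kind of code ECC memory relies on: single-error-correct, double-error-detect (SECDED). It protects only 4 data bits with a Hamming(7,4) code plus one overall parity bit, purely for illustration; real ECC modules typically protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are my own, not taken from any chipset or library.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (a list of 0/1 values).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit  # overall parity over positions 1..7
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = 0
    for bit in code:
        overall ^= bit  # 0 when the overall parity still checks out

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Exactly one bit flipped; the syndrome is its index
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is nonzero: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1   # a single "cosmic ray" glitch
print(decode(one_flip))

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1  # a second bad bit in the same word
print(decode(two_flips))
```

The single flip is repaired silently, while the double flip can only be flagged as uncorrectable. A module that has gone bad wholesale corrupts far more than one bit per word, so the best a scheme like this can do is report the damage, not repair it.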
Secondly, I’d assumed that because of the ECC system, the computer would inform me in some obvious way if something was wrong with the memory. And in a way it had tried: I’d long heard the occasional POST (Power-On Self-Test) beep code emitted by the system, but had figured it to be due to some poorly seated memory modules (because I re-seated the memory and the problem had seemed to go away). And I suspected the “STOP 0x8E” failure that I kept seeing in Windows to be somehow related to the memory subsystem. But my false sense of confidence in the memory had prevented me from leaping to the obvious conclusion. Robust, the memory could be (if it was good to begin with, which it wasn’t); “unsinkable” it could not.
Anyway, I removed the bad memory module (lowering the total memory to 768 MB) and the system has been running perfectly since. The situation had become increasingly critical, as for the past few months I’ve been unable to access some important E-mail that I have archived on that system, in addition to various programming projects and other stuff.
What’s ironic about this whole affair is that this is the first time memory problems have been a serious issue in any computer system that I’ve built, yet the ECC and buffering I built into the system should have made it the most reliable memory in any system I’d built.
What’s even more ironic is that shortly before discovering the problem last night, I’d started ordering parts for a system to supersede Aki. My reasoning was that, since the motherboard, CPU(s), or memory had likely gone bad, I might as well replace the whole lot rather than waste any more time trying to diagnose the exact problem.
Anyway I’m still going to build a new replacement system, with a new philosophy that I’ve mentioned: smaller and simpler. That way, next time something goes wrong it won’t take me months to diagnose the trouble.