Here’s one fascinating article on DIMM memory errors (thanks to @codinghorror for the link!)
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
Basically, the authors partnered with Google to capture and analyze logs from thousands of Google servers over a period of two years.
Their conclusions, in short (the interpretation is mine; check the actual article for the original):
- learn how to recognize when a server failure was caused by a memory error, and immediately replace any DIMM that has had an uncorrectable error.
- don’t hesitate to spend extra money on premium memory that features chipkill error correction capability (e.g. HP Chipspare, Intel SDDC or Sun Extended ECC)
- don’t get religious about temperature in your datacenter; it has much less impact on memory errors than you’d think (unless of course your temp goes through the roof)
- manufacturer / brand is pretty much irrelevant to memory errors, as are the memory technology (DDR1/DDR2/FBDIMM), the chip size, etc.
- do burn in your memory before putting it in PROD, and replace any bad DIMMs. (You can use memtest if you want to test for memory errors specifically.)
- do expect more frequent glitches after 10-18 months of DIMM life
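
For the first point, here is one way you might check a Linux box for memory errors. This is only a rough sketch and it is not from the article: it assumes the kernel's EDAC driver is loaded and exposes its usual per-memory-controller counters (ce_count for corrected errors, ue_count for uncorrected ones) under /sys/devices/system/edac/mc. The function names and the "any uncorrectable error means replace" rule are my own illustration of the bullet above.

```python
from pathlib import Path

# Usual Linux EDAC sysfs location (assumes the EDAC driver is loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable: skip this controller
        counts[mc.name] = (ce, ue)
    return counts


def needs_replacement(counts):
    """Hypothetical policy: flag any controller that saw an uncorrectable error."""
    return [name for name, (_, ue) in counts.items() if ue > 0]
```

Calling needs_replacement(read_edac_counts()) gives you the list of controllers to look at. Note that these counters are per memory controller, so mapping a flagged controller back to a physical DIMM slot depends on your platform.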
