Faulty ECC RAM modules in a server: how to detect them on a running machine (CentOS 7)

The problem looks like this:

Message from syslogd@server at Aug 15 14:50:19 ...
 kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Message from syslogd@server at Aug 15 14:50:19 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@server at Aug 15 14:50:19 ...
 kernel:[Hardware Error]: CPU:0 (f:41:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc7a4000e6080a13
Message from syslogd@server at Aug 15 14:50:19 ...
 kernel:[Hardware Error]: MC4_ADDR: 0x000000008f1f3500
Message from syslogd@server at Aug 15 14:50:19 ...
 kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
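
These messages also end up in the kernel log, so they can be counted after the fact. What follows is only a rough sketch: the function names are made up for this post, and the patterns assume the exact message wording shown above, which may differ on other CPUs or kernel versions.

```shell
# Hypothetical helpers; both read kernel log text from stdin.

# Count how many DRAM ECC events appear in the log.
count_ecc_events() {
    grep -c 'DRAM ECC error detected'
}

# Print the reported MC4_ADDR values, one per line.
ecc_addresses() {
    sed -n 's/.*MC4_ADDR: \(0x[0-9a-fA-F]*\).*/\1/p'
}

# Typical use on the server (run as root):
#   dmesg | count_ecc_events
#   dmesg | ecc_addresses | sort | uniq -c
```

If the same address keeps coming back, that may point to a single weak spot on one module, but the log alone does not say which physical DIMM it is.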

Solution:

root@server# yum install -y edac-utils
root@server# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow4: 0 Uncorrected Errors
mc0: csrow4: mc#0csrow#4channel#0: 0 Corrected Errors
mc0: csrow5: 0 Uncorrected Errors
mc0: csrow5: mc#0csrow#5channel#0: 0 Corrected Errors
mc0: csrow6: 0 Uncorrected Errors
mc0: csrow6: mc#0csrow#6channel#0: 0 Corrected Errors
mc0: csrow7: 0 Uncorrected Errors
mc0: csrow7: mc#0csrow#7channel#0: 42 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: mc#1csrow#2channel#0: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: mc#1csrow#3channel#0: 0 Corrected Errors
mc1: csrow4: 0 Uncorrected Errors
mc1: csrow4: mc#1csrow#4channel#0: 0 Corrected Errors
mc1: csrow5: 0 Uncorrected Errors
mc1: csrow5: mc#1csrow#5channel#0: 0 Corrected Errors
mc1: csrow6: 0 Uncorrected Errors
mc1: csrow6: mc#1csrow#6channel#0: 0 Corrected Errors
mc1: csrow7: 0 Uncorrected Errors
mc1: csrow7: mc#1csrow#7channel#0: 0 Corrected Errors
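In the output above, the only non-zero counter is on mc0, csrow7, channel 0 (42 corrected errors). Because the server can only boot with an even number of RAM modules, the last PAIR of modules has to be removed or replaced; after that the server runs fine again.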
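
With many memory controllers and rows, the one bad line is easy to miss. A small filter can reduce the output to counters that are not zero. This is a minimal sketch, assuming the `edac-util -v` line format shown above; `nonzero_edac` is a name invented here, not part of edac-utils.

```shell
# Hypothetical helper: keep only edac-util lines whose error count is above zero.
# Reads the output of `edac-util -v` from stdin.
nonzero_edac() {
    awk '{
        # The count is the number right before "Corrected" or "Uncorrected".
        for (i = 1; i < NF; i++) {
            if (($(i + 1) == "Corrected" || $(i + 1) == "Uncorrected") &&
                $i ~ /^[0-9]+$/ && $i + 0 > 0) {
                print
                break
            }
        }
    }'
}

# Typical use on the server:
#   edac-util -v | nonzero_edac
```

Fed with the output listed above, it would leave just the mc0 csrow7 line. Which physical slot that row corresponds to depends on the mainboard; `dmidecode -t memory` lists the installed modules and their slot labels, but the mapping from csrow to slot still has to be checked against the board documentation.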

