DIMM Replacement
From UF HPC Wiki
Machines at the UF HPC Center occasionally report MCE (machine check exception) errors. We have developed a procedure for tracing an MCE error to an individual DIMM, so that only that DIMM needs to be replaced.
Initial Detection
Initial detection is performed by mcelog, whose output we pass through a filter called <mcelog-filter>. A cron job runs this every five minutes on each node and writes the output to /var/log/mcelog.
Every night these logs are scanned to see how many MCE errors have occurred, what types they were, and on which processors they happened. This report is emailed to the HPC-Logs mailing list. It is also loaded into a MySQL database, which backs the MCE Logs page on the HPC website.
Phase III
MCE Detection
When a DIMM faults on one of the Phase III nodes, there are two different indications of the problem:
- Yellow LED on the front of the chassis for the node lights up, indicating that there is a problem with the machine. This does not necessarily mean the problem is with memory, as it could be with a faulted CPU, drive, or fan.
- Yellow LED on the inside of the machine on the motherboard, next to the associated DIMM that has faulted. This is visible to some extent through the back of the case, though it is hard to read sometimes, particularly if the machine is fully loaded with DIMMs.
MCEs are a lot easier to detect on Phase III nodes than on Phase II nodes. They can be detected with the IPMI tools that are installed on the nodes. For instance, the event log of a node that is having issues would look something like this:
[root@r10a-s6 ~]# ipmitool sel elist
4 | 01/14/2009 | 07:45:27 | Event Logging Disabled System Event Log | Log area reset/cleared | Asserted
1bc | 02/02/2009 | 08:07:33 | System ACPI Power State ACPI State | S5/G2: soft-off | Asserted
1d0 | 02/02/2009 | 08:07:34 | System Event #0x83 | Timestamp Clock Sync | Asserted
1e4 | 02/02/2009 | 08:08:24 | System Event #0x83 | Timestamp Clock Sync | Asserted
1f8 | 02/02/2009 | 08:08:25 | Power Unit Power Unit Stat | Power off/down | Asserted
20c | 02/02/2009 | 08:45:33 | Processor Proc 1 Status | Presence detected | Asserted
220 | 02/02/2009 | 08:45:34 | Processor Proc 2 Status | Presence detected | Asserted
234 | 02/02/2009 | 08:24:09 | Power Unit Power Unit Stat | AC lost | Asserted
248 | 02/02/2009 | 08:45:35 | Power Unit Power Unit Stat | AC lost | Deasserted
25c | 02/02/2009 | 08:46:27 | Button Button | Power Button pressed | Asserted
270 | 02/02/2009 | 08:46:44 | Drive Slot Drv 1 Pres | Device Present
284 | 02/02/2009 | 08:46:59 | System Event #0x83 | Timestamp Clock Sync | Asserted
298 | 02/02/2009 | 08:46:59 | System Event #0x83 | Timestamp Clock Sync | Asserted
2ac | 02/02/2009 | 08:47:03 | Slot/Connector DIMM A1 | Device Installed | Asserted
2c0 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM A2 | Device Installed | Asserted
2d4 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM A3 | Device Installed | Asserted
2e8 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM A4 | Device Installed | Asserted
2fc | 02/02/2009 | 08:47:03 | Slot/Connector DIMM B1 | Device Installed | Asserted
310 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM B2 | Device Installed | Asserted
324 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM B3 | Device Installed | Asserted
338 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM B4 | Device Installed | Asserted
34c | 02/02/2009 | 08:47:03 | Slot/Connector DIMM C1 | Device Installed | Asserted
360 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM C2 | Device Installed | Asserted
374 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM C3 | Device Installed | Asserted
388 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM C4 | Device Installed | Asserted
39c | 02/02/2009 | 08:47:03 | Slot/Connector DIMM D1 | Device Installed | Asserted
3b0 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM D2 | Device Installed | Asserted
3c4 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM D3 | Device Installed | Asserted
3d8 | 02/02/2009 | 08:47:03 | Slot/Connector DIMM D4 | Device Installed | Asserted
...System booted here...
720 | 02/03/2009 | 08:58:11 | System Event #0x01 | OEM System boot event | Asserted
734 | 02/03/2009 | 08:58:23 | System ACPI Power State ACPI State | S0/G0: working | Asserted
748 | 02/03/2009 | 20:40:17 | Memory #0x08 | Correctable ECC | Asserted
75c | 02/04/2009 | 03:17:56 | Memory #0x08 | Correctable ECC | Asserted
770 | 02/05/2009 | 04:18:36 | Memory #0x08 | Correctable ECC | Asserted
784 | 02/05/2009 | 14:30:27 | Memory #0x08 | Correctable ECC | Asserted
798 | 02/05/2009 | 18:08:12 | Memory #0x08 | Correctable ECC | Asserted
7ac | 02/07/2009 | 23:18:37 | Memory #0x08 | Correctable ECC | Asserted
The above shows that 16 DIMMs were detected in their slots and that the system was booted. After that, a number of correctable ECC errors were asserted over several days. Unfortunately, this output does not show WHICH DIMM the ECC errors occurred on. The only time this particular command identifies the DIMM in trouble is when the IPMI system has decided that enough is enough and disabled that DIMM.
However, it is possible to find out which DIMM is being complained about by using ipmitool's sel save command:
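As a rough illustration of reading this log, the sketch below pulls the Correctable ECC records out of `ipmitool sel elist` text. It assumes the pipe-separated layout shown above (id, date, time, sensor, event, direction); the function name `correctable_ecc_events` is hypothetical and not part of any tool we run.

```python
def correctable_ecc_events(elist_text):
    """Return (date, time) tuples for every 'Correctable ECC' record."""
    events = []
    for line in elist_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Assumed layout: id | date | time | sensor | event | direction
        if len(fields) >= 5 and fields[4] == "Correctable ECC":
            events.append((fields[1], fields[2]))
    return events


# Three records copied from the listing above.
sample = """\
734 | 02/03/2009 | 08:58:23 | System ACPI Power State ACPI State | S0/G0: working | Asserted
748 | 02/03/2009 | 20:40:17 | Memory #0x08 | Correctable ECC | Asserted
75c | 02/04/2009 | 03:17:56 | Memory #0x08 | Correctable ECC | Asserted
"""

for date, time in correctable_ecc_events(sample):
    print(date, time)
```

A count of these events per node could be used to decide which nodes deserve a closer look.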
[root@r10a-s6 ~]# ipmitool sel save output
[root@r10a-s6 ~]# more output
...
0x04 0x0c 0x08 0x6f 0x20 0xff 0x07 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x07 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
Here we see the same Correctable ECC messages, but each is now preceded by a row of byte values. The first six bytes do not matter for our purposes (they have meaning, but not for what we are trying to find out). The last byte is what really matters, as it identifies exactly which DIMM is having the issue. In this case, we see that both DIMMs 0x07 and 0x09 are having issues. From the following table we can look up exactly which DIMMs these are:
| Byte Code | DIMM |
|---|---|
| 0x00 | A1 |
| 0x01 | A2 |
| 0x02 | A3 |
| 0x03 | A4 |
| 0x04 | B1 |
| 0x05 | B2 |
| 0x06 | B3 |
| 0x07 | B4 |
| 0x08 | C1 |
| 0x09 | C2 |
| 0x0a | C3 |
| 0x0b | C4 |
| 0x0c | D1 |
| 0x0d | D2 |
| 0x0e | D3 |
| 0x0f | D4 |
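The lookup can also be automated. The following is a minimal sketch that assumes the saved file has the layout shown above (seven byte values, then a `#` comment) and that the table above applies to the node. The names `SLOTS` and `faulting_dimms` are made up for illustration.

```python
# Slot names in byte-code order: index 0x00 is A1, 0x01 is A2, ... 0x0f is D4.
SLOTS = [bank + str(number) for bank in "ABCD" for number in range(1, 5)]


def faulting_dimms(save_text):
    """Count Correctable ECC records per DIMM slot in 'ipmitool sel save' text."""
    counts = {}
    for line in save_text.splitlines():
        data, _, comment = line.partition("#")
        if "Correctable ECC" not in comment:
            continue
        byte_codes = data.split()
        if not byte_codes:
            continue
        code = int(byte_codes[-1], 16)  # the last byte identifies the DIMM
        if code < len(SLOTS):
            counts[SLOTS[code]] = counts.get(SLOTS[code], 0) + 1
    return counts


# The six records from the saved output above.
sample = """\
0x04 0x0c 0x08 0x6f 0x20 0xff 0x07 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x07 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
0x04 0x0c 0x08 0x6f 0x20 0xff 0x09 # Memory #0x08 Correctable ECC
"""

print(faulting_dimms(sample))  # {'B4': 2, 'C2': 4}
```

For the example node this reports two errors on B4 and four on C2, matching byte codes 0x07 and 0x09 in the table.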
Phase II
MCE Error
L2 Cache Errors
MCE 0
CPU 0 2 bus unit TSC 326a37baf152
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
prefetch mem transaction
memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
Northbridge Errors
MCE 1
CPU 0 4 northbridge TSC 326a37baf65e
ADDR 1717cdf0
Northbridge ECC error
ECC syndrome = 20
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d410400100000813 MCGSTATUS 0
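The `bitNN` lines in these listings appear to correspond to bit positions in the STATUS value. As a sanity check, the sketch below tests only the two bits named in both listings (46 and 62) against the STATUS words shown. The helper `decode_status` and the `FLAGS` table are illustrative assumptions, not part of mcelog.

```python
# Bit positions and descriptions taken from the listings above.
FLAGS = {
    46: "corrected ecc error",
    62: "error overflow (multiple errors)",
}


def decode_status(status_hex):
    """Return descriptions of the known flag bits set in a STATUS word."""
    status = int(status_hex, 16)
    return [name for bit, name in sorted(FLAGS.items()) if (status >> bit) & 1]


for word in ("d000400000000863", "d410400100000813"):
    print(word, decode_status(word))
```

Both example STATUS words have bits 46 and 62 set, which agrees with the text mcelog printed for them.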
Tracing the error
Looking at the actual MCE error listings above, you can see that some of the errors have a physical memory address (the ADDR line) associated with them. This is a good thing, because we can use that address to figure out which DIMM it belongs to.
To do this, we first need a listing of the DIMMs and their associated memory address ranges. We can get this from the program dmidecode:
Memory Address Range
Handle 0x0038
DMI type 20, 19 bytes.
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x0003FFFFFFF
Range Size: 1 GB
Physical Device Handle: 0x0037
Memory Array Mapped Address Handle: 0x0034
Partition Row Position: 1
Corresponding DIMM Mapping
Handle 0x0037
DMI type 17, 27 bytes.
Memory Device
Array Handle: 0x0033
Error Information Handle: Not Provided
Total Width: 128 bits
Data Width: 128 bits
Size: 1024 MB
Form Factor: DIMM
Set: None
Locator: DIMM_A2
Bank Locator: BANK1
Type: DDR
Type Detail: Synchronous
Speed: 400 MHz (2.5 ns)
Manufacturer: Manufacturer1
Serial Number: SerNum1
Asset Tag: AssetTagNum1
Part Number: PartNum1
Analysis
So, from the Memory Address Range above, we can see that any memory address between 0x00000000000 and 0x0003FFFFFFF corresponds to this DIMM. The DIMM itself is referenced by Physical Device Handle: 0x0037, so all we have to do is look up the record with that handle, which is the one listed under Corresponding DIMM Mapping. There it becomes obvious that the DIMM is in location DIMM_A2, which on our motherboards is slot A2.
So, for the northbridge MCE error above, we see that the error occurred at memory address 0x1717cdf0, which falls in the range for this particular DIMM. As further confirmation, the MCE error message reports the error on CPU 0, which is the CPU this DIMM is associated with.
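The same lookup can be sketched in code. This example assumes dmidecode output in the layout shown above, where each record begins with a `Handle` line followed by `Key: Value` lines. The helpers `parse_records` and `locate` are hypothetical names, and the sample text is trimmed to the fields that matter.

```python
def parse_records(text):
    """Group 'Key: Value' lines by the 'Handle 0x....' line that precedes them."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle 0x"):
            handle = line.split()[1].rstrip(",")
            current = records.setdefault(handle, {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def locate(addr, records):
    """Return the Locator of the DIMM whose mapped range contains addr."""
    for record in records.values():
        if "Starting Address" not in record or "Physical Device Handle" not in record:
            continue  # not a Memory Device Mapped Address record
        start = int(record["Starting Address"], 16)
        end = int(record["Ending Address"], 16)
        if start <= addr <= end:
            device = records.get(record["Physical Device Handle"], {})
            return device.get("Locator")
    return None


SAMPLE = """\
Handle 0x0038
DMI type 20, 19 bytes.
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x0003FFFFFFF
Physical Device Handle: 0x0037
Memory Array Mapped Address Handle: 0x0034

Handle 0x0037
DMI type 17, 27 bytes.
Memory Device
Locator: DIMM_A2
Bank Locator: BANK1
"""

print(locate(0x1717CDF0, parse_records(SAMPLE)))  # DIMM_A2
```

An address that no mapped range covers returns `None`, so the caller can tell the lookup failed.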
Memory maps on our machines also appear to be relatively consistent, and we appear to have the following ranges:
0x00000000000 - 0x0003FFFFFFF: DIMM_A1 or DIMM_A2
0x00040000000 - 0x0007FFFFFFF: DIMM_B1 or DIMM_B2
0x00080000000 - 0x000BFFFFFFF: DIMM_C1 or DIMM_C2
0x000C0000000 - 0x000FFFFFFFF: DIMM_D1 or DIMM_D2
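If the map really is this consistent, a small lookup table avoids running dmidecode each time. The sketch below hard-codes the four ranges listed above; the names `RANGES` and `dimm_for_address` are made up for illustration, and the table should be checked against dmidecode on any node whose memory layout differs.

```python
# (start, end, slots) for each 1 GB range listed above.
RANGES = [
    (0x00000000000, 0x0003FFFFFFF, "DIMM_A1 or DIMM_A2"),
    (0x00040000000, 0x0007FFFFFFF, "DIMM_B1 or DIMM_B2"),
    (0x00080000000, 0x000BFFFFFFF, "DIMM_C1 or DIMM_C2"),
    (0x000C0000000, 0x000FFFFFFFF, "DIMM_D1 or DIMM_D2"),
]


def dimm_for_address(addr):
    """Return the slot label whose range contains addr, or None if none does."""
    for start, end, label in RANGES:
        if start <= addr <= end:
            return label
    return None


print(dimm_for_address(0x1717CDF0))  # DIMM_A1 or DIMM_A2
```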
Problems
One problem with this method is that it does not account for memory hole remapping below the four gigabyte boundary. To make all four gigabytes installed in our machines usable, some of the memory has to be remapped above the four gigabyte boundary, because the x86 architecture reserves part of the address space below it. As a result, there is a range of addresses that these errors can fall into that does not map to a specific DIMM:
MCE 0
CPU 2 0 data cache TSC 43d1ead1671c
ADDR 118e3a0c0
Data cache ECC error (syndrome b)
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS d405c00000000833 MCGSTATUS 0
For four gigabytes of memory, the address range of the physical DIMMs is 0x00000000000 to 0x000FFFFFFFF. In this case, the error occurred at address 0x00118e3a0c0, which is outside that range but still inside the machine's total address range of 0x00000000000 to 0x00133FFFFFF. The range from 0x00100000000 to 0x00133FFFFFF is the remapped area. Unfortunately, we have not yet figured out where this remapped range actually comes from on the physical DIMMs. We know it is on either DIMM C1 or D2 (the two DIMMs associated with CPU 2/3), but we do not know where on those two DIMMs it actually falls.
As such, any error that occurs outside the physical address range of the four DIMMs cannot be pinpointed to a single DIMM, though it can be narrowed down to one of the two DIMMs associated with whichever CPU reported the error.
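A script that traces addresses automatically would need to flag this case instead of guessing. The sketch below classifies an address using the boundaries quoted above for our four gigabyte nodes. The constant names and the `classify` function are hypothetical, and the boundaries would differ on machines with other memory sizes.

```python
PHYSICAL_END = 0x000FFFFFFFF  # last address mapped directly to a DIMM
REMAP_START = 0x00100000000   # start of the remapped area
REMAP_END = 0x00133FFFFFF     # end of the machine's total address range


def classify(addr):
    """Say whether addr can be traced directly, is remapped, or is invalid."""
    if 0 <= addr <= PHYSICAL_END:
        return "direct"
    if REMAP_START <= addr <= REMAP_END:
        return "remapped"
    return "out of range"


print(classify(0x1717CDF0))   # direct
print(classify(0x118E3A0C0))  # remapped
```

For a "remapped" result, the best available fallback is the pair of DIMMs associated with the CPU named in the MCE record.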
