December 2, 2010

My physical server does not respond to remote crash dump!

We mainly use HP servers as our Wintel and ESX hardware. When a server stops responding, we sometimes have to force a crash dump so that Microsoft can analyze the memory dump. Normally, if the server is at the same site as I am, I use the usual keyboard combination: hold the right CTRL key and press SCROLL LOCK twice.

But what happens if the server is at a remote site? Being a lazy system admin, I looked for a way to trigger a crash dump remotely without having to travel to the other site. Fortunately for me, where there is a will, there is a way. This method uses the HP server's own iLO GUI to generate the crash dump remotely through the Non-Maskable Interrupt (NMI) switch.

But first, the complete memory dump option must be enabled on the server, and the paging file must be at least the size of the physical RAM plus 1 MB. Microsoft also recommends a paging file of at least 1.5 times the physical memory.
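As a rough illustration of the sizing rule above, here is a small Python sketch that works out both figures for a given amount of RAM. The function name and the 16 GB example server are made up for illustration; only the "RAM + 1 MB" and "1.5 times RAM" rules come from this post.

```python
def paging_file_sizes_mb(physical_ram_mb: int) -> dict:
    """Return the two paging file sizes discussed above, in MB.

    minimum_mb     : physical RAM + 1 MB (needed for a complete memory dump)
    recommended_mb : 1.5 x physical RAM, rounded up to a whole MB
    """
    if physical_ram_mb <= 0:
        raise ValueError("physical_ram_mb must be positive")
    return {
        "minimum_mb": physical_ram_mb + 1,
        # Integer arithmetic for 1.5x, rounded up.
        "recommended_mb": (physical_ram_mb * 3 + 1) // 2,
    }


if __name__ == "__main__":
    # Hypothetical server with 16 GB (16384 MB) of RAM.
    print(paging_file_sizes_mb(16384))
```

Either value can then be entered as the custom paging file size in the steps below.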

The steps below require a reboot. They should therefore be done before the system has any issue; otherwise, set them afterwards and wait for the issue to occur again.

Symptoms
  • Unable to trigger a memory crash dump on a remote server using the keyboard command (right CTRL + SCROLL LOCK + SCROLL LOCK)

To resolve
  1. Right click "My Computer" -> Properties -> "Advanced" tab -> "Performance" -> Settings -> "Advanced" tab -> "Virtual memory" -> Change -> "Custom size"

    Set the paging file size to either the physical memory + 1 MB, or the Microsoft-recommended 1.5 times the physical memory

  2. Right click "My Computer" -> Properties -> "Advanced" tab -> "Startup and Recovery" -> Settings -> "Write debugging information" -> Complete memory dump
  3. Regedit -> HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
  4. Add a new REG_DWORD value
    • Name : NMICrashDump
    • Value : 1
  5. Restart the server
  6. Wait for the problem to surface again
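If the same registry change has to go onto several servers, steps 3 and 4 could be scripted instead of clicked through in Regedit. The Python sketch below only builds the text of a .reg file for the NMICrashDump value; the function name is my own invention, and it assumes the standard "Windows Registry Editor Version 5.00" file format. It does not touch the registry itself.

```python
CRASH_CONTROL_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"
)


def build_reg_file(key: str, dword_values: dict) -> str:
    """Build the text of a .reg file that sets REG_DWORD values under one key."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, value in dword_values.items():
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a DWORD: {value}")
        # .reg files store DWORDs as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same setting as steps 3 and 4: NMICrashDump = 1.
    print(build_reg_file(CRASH_CONTROL_KEY, {"NMICrashDump": 1}))
```

The output can be saved as a .reg file and imported on the server from an administrator session (for example with "reg import"). The restart in step 5 is still needed afterwards.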

The crash dump method can be tested after the reboot by using the NMI feature in the HP iLO GUI:

System Status -> Diagnostics -> Generate NMI to System

The server will stop with a STOP 0x00000080 (hardware malfunction) error, which indicates that the NMI crash dump is working.



