Crashes you can’t handle easily #1: SEH failure on x64 Windows

TL;DR: Don’t expect structured exception handling mechanisms to always work correctly on x64 Windows.

If you ship software, you probably care about crashes. Your product fails and gets terminated, your users get frustrated, their workflow is disrupted, and – worst of all – they might even lose some data. When a crash happens, you want to make sure relevant information is collected and sent back to you, the developer, so the problem can be investigated and fixed.

However, if you don’t rely on your platform’s built-in crash handling facilities, even detecting some crashes is far from trivial. I started this series of blog posts to write about such cases.


But why would you not rely on your system’s crash infrastructure? For instance, if you ship to multiple platforms, it’s best not to. Having to look at crash reports at different places for different platforms, feature disparity, the inability to correlate multiplatform crashes (easily) – these problems are all solved if you have your own, unified, multiplatform crash infrastructure. Even if you ship to one platform only, it might still be worth the effort to use your own infrastructure for total control.

Windows Error Reporting

On Windows, the built-in infrastructure is called Windows Error Reporting (formerly known as Dr. Watson). If an application crashes, WER takes over, displays a well-known dialog about the crash, and collects some data. If the user consents, the report is sent to Microsoft. If you code sign your application and register at the Windows Dev Center, you can access reports for your application.

[Image: the WER crash dialog] As a maintainer/user of a custom crash infrastructure, you don’t want to see this dialog.

To some degree you can customize this report by marking memory blocks and files for inclusion. However, you still cannot control the exact actions to be taken, nor the location to send the report to.

There is a way to take over WER and thus achieve greater control; more on that later.

The basis of exception-based crash detection

The easiest way to detect crashes that manifest in unhandled exceptions (such as access violations, stack overflows, dividing by zero, etc.) is installing a callback for your process called the top level UnhandledExceptionFilter. It’s practically a process-wide global variable (a callback function) that gets called when the system thinks an exception is unhandled (I didn’t say “when there is an unhandled exception” on purpose).

Aside from 3rd party DLLs, shell extensions, and other external components potentially overriding your carefully set Unhandled Exception Filter, life should be easy, right? It’s just a callback after all that should cover all exception-based crashes. No. Life is never that easy.

I’ll demonstrate a case (a category of cases, to be precise) which escapes UEFs, but to understand what happens, we need to have at least a very basic idea of how exception handling works on x64 Windows.

The basics of x64 exception handling presented in 1 minute (or less)

Unlike x86, which has frame pointer chains and a linked list of exception handlers that are currently in scope, x64 has metadata-based exception handling. (Almost) every function has an associated block of metadata, which describes its prolog (its stack layout and usage, essentially) and the exception handlers contained in the function (if any). To traverse stack frames and find exception handlers, the system relies on the presence of this metadata.

When an exception is thrown, the system looks for potential handlers. As you might expect, it starts with the top of the stack, and “goes down” iteratively until a handler handles the exception, or the bottom of the call stack is reached (which means it is unhandled). So first, it looks up the metadata block for the address the instruction pointer was pointing to when the exception occurred. If there is a handler, it gets called. If it refuses to handle the exception, or there was no handler, the system gets to the next stack frame using the metadata for the current frame, and so on. If there is no metadata (or the operating system can’t find it), this process gets stuck in the middle of exception handling. There is a special category of functions (called leaf functions) that do not require unwind data, but for simplicity, I ignore them throughout this post.
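To make the search loop above concrete, here is a toy model in C++. It is only a sketch and not how the OS implements it: FunctionEntry, Lookup, and FindHandler are hypothetical names, the metadata is reduced to an address range plus two flags, and the instruction pointer of each caller frame is handed in precomputed instead of being derived from the prolog description. Leaf functions are ignored here as well.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for one function's unwind metadata (hypothetical layout).
struct FunctionEntry {
    std::uintptr_t begin;   // first address of the function
    std::uintptr_t end;     // one past the last address of the function
    bool hasHandler;        // does the function contain an exception handler?
    bool handlerAccepts;    // would that handler take this exception?
};

enum class SearchResult { Handled, Unhandled, Stuck };

// Find the metadata block that covers `ip`, or nullptr if there is none.
inline const FunctionEntry* Lookup(const std::vector<FunctionEntry>& table,
                                   std::uintptr_t ip) {
    for (const FunctionEntry& entry : table) {
        if (ip >= entry.begin && ip < entry.end) return &entry;
    }
    return nullptr;
}

// `frameIps` holds the instruction pointer of every frame, innermost first.
inline SearchResult FindHandler(const std::vector<FunctionEntry>& table,
                                const std::vector<std::uintptr_t>& frameIps) {
    for (std::uintptr_t ip : frameIps) {
        const FunctionEntry* entry = Lookup(table, ip);
        // Without metadata we can neither look for a handler nor compute
        // the caller's frame, so the search cannot continue.
        if (entry == nullptr) return SearchResult::Stuck;
        if (entry->hasHandler && entry->handlerAccepts) return SearchResult::Handled;
    }
    // Walked the whole call stack and nobody took the exception.
    return SearchResult::Unhandled;
}
```

In this model, a frame list that starts at an address outside every range (such as 0x1234) yields Stuck even if an outer frame has a handler that would have accepted the exception.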

windbg’s .fnent command is one way to dump unwind metadata. For the main function of the code sample below, it shows no handlers and no non-volatile register save operations in the prolog. The function allocates at most 128 bytes of stack space, hence the UWOP_ALLOC_SMALL code.
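The UWOP_ALLOC_SMALL encoding is simple enough to decode by hand. The sketch below assumes the documented x64 unwind code layout: a prolog offset byte, followed by a byte that holds the operation in its low four bits and the operation info in its high four bits. The struct and function names are made up for illustration; they are not the names used in the Windows headers.

```cpp
#include <cassert>
#include <cstdint>

// Operation code of a small stack allocation in the x64 unwind data format.
constexpr std::uint8_t kUwopAllocSmall = 2;

// One 16-bit unwind code slot (hypothetical struct mirroring the documented layout).
struct UnwindCodeSlot {
    std::uint8_t codeOffset;  // offset of the end of the instruction in the prolog
    std::uint8_t opAndInfo;   // low 4 bits: operation, high 4 bits: operation info
};

constexpr std::uint8_t UnwindOp(UnwindCodeSlot slot) {
    return static_cast<std::uint8_t>(slot.opAndInfo & 0x0F);
}

constexpr std::uint8_t OpInfo(UnwindCodeSlot slot) {
    return static_cast<std::uint8_t>(slot.opAndInfo >> 4);
}

// Bytes allocated by a UWOP_ALLOC_SMALL slot: OpInfo * 8 + 8, i.e. 8 to 128.
inline unsigned SmallAllocSize(UnwindCodeSlot slot) {
    assert(UnwindOp(slot) == kUwopAllocSmall);
    return OpInfo(slot) * 8u + 8u;
}
```

For example, a prolog that does `sub rsp, 28h` would be described by an operation info value of 4, since 4 * 8 + 8 = 40 = 0x28.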

So, what happens when unwinding gets stuck? Well, since the system is stuck in the middle of a call stack, it can’t really know whether the exception would be handled, or not. And guess what: your carefully set Unhandled Exception Filter is not called either. Instead, the process crashes with Windows Error Reporting. Ouch.

A concrete example

“Alright, but that sounds like an edge case” – you might say. Even if it were, you would still want to handle it. But it is not. Consider a case where you have code that accidentally corrupts the stack. If it also corrupts the return address, then when the unwinding process gets to that frame, the system won’t find unwind tables for that garbage return address. As a result, your UEF won’t get called. Here’s some code that demonstrates this:

// Compile with MSVC. Target x86, then target x64.
// See what happens (run without a debugger attached).
#include <cstdint>
#include <iostream>
#include <intrin.h> // _AddressOfReturnAddress
#include <windows.h>

LONG WINAPI MyUEF (PEXCEPTION_POINTERS pExp)
{
    std::cout << "Unhandled exception with code " << std::hex <<
        pExp->ExceptionRecord->ExceptionCode << std::endl;
    Sleep (5000);
    TerminateProcess (GetCurrentProcess (), pExp->ExceptionRecord->ExceptionCode);
    // Never reached, but make the compiler happy
    return EXCEPTION_CONTINUE_SEARCH;
}

__declspec(noinline) void CorruptStack ()
{
    // Simulate corruption of the return address
    *(uintptr_t*) _AddressOfReturnAddress () = 0x1234;
}

int main ()
{
    SetUnhandledExceptionFilter (MyUEF);
    std::cout << "Corrupting stack..." << std::endl;
    CorruptStack ();
}

It’s easy to see how and what goes wrong:

  1. When main calls CorruptStack, the return address (pointing inside main) is placed on the stack.
  2. CorruptStack overwrites this with a bogus address.
  3. Upon returning from CorruptStack, the CPU will try to execute code at address 0x1234. This will raise an access violation.
  4. The system will start looking for exception handlers. It will start with address 0x1234, since that’s the value of the instruction pointer.
  5. There won’t be metadata present for that address, so we are in trouble (actually, it will be assumed to be a leaf function, so the handler search phase will go astray and get stuck sooner or later).
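Step 5 can be checked directly, because RtlLookupFunctionEntry is the documented API for asking the system which metadata covers an address in the current process. The following is a minimal sketch: HasUnwindData is a hypothetical helper name, and the branch for anything other than x64 Windows is just a stub so that the snippet compiles elsewhere.

```cpp
#include <cstdint>

#if defined(_WIN32) && (defined(_M_X64) || defined(__x86_64__))
#include <windows.h>
#define HAVE_X64_UNWIND_LOOKUP 1
#endif

// Returns true if the system finds unwind metadata covering `address`
// in the current process.
inline bool HasUnwindData(std::uintptr_t address) {
#ifdef HAVE_X64_UNWIND_LOOKUP
    DWORD64 imageBase = 0;
    // The history table is an optional lookup cache, so nullptr is fine here.
    return RtlLookupFunctionEntry(static_cast<DWORD64>(address),
                                  &imageBase, nullptr) != nullptr;
#else
    // Stub (assumption of this sketch): no lookup outside x64 Windows.
    (void)address;
    return false;
#endif
}
```

Calling HasUnwindData (0x1234) should return false in the sample program, since no loaded module covers that address. Keep in mind that a missing entry is also what a genuine leaf function looks like, so this check alone cannot tell the two apart.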

It’s worth noting that not being able to handle a subset of stack corruptions is just a specific case of the more general problem: unavailability/lack of unwind data. There are two more cases that come to my mind that I’ve actually seen in production.

The first one is handwritten assembly code. For x64 assembly functions, the assembler won’t generate unwind data for your functions automagically; you have to annotate your code manually (see MASM’s documentation for an example). Many multiplatform assemblers have no support for this at all. The second one is where code is copied or generated without generating/copying the corresponding unwind information as well. Anti-debugging, DRM, and anti-tamper solutions might do such cheeky things.

A solution

Regarding possible solutions, I don’t have good news. I don’t know of any direct callbacks that you can register for cases like this (like SetUnhandledExceptionFilter for unhandled exceptions). The only solution I know is based on WER. I mentioned previously that you can somewhat customize WER reports, and you can also “hook” WER. I put the word “hook” in quotation marks, because I don’t mean classical function hooking. You will see what I mean by that in a minute.

Windows Error Reporting is an out-of-process mechanism. If your application crashes, your crashed process does really minimal things. It signals the WER service that it crashed, which will spawn a new process for this fault to collect some data, maybe write a minidump, display a dialog, etc. It’s no accident that it does the heavy lifting in a separate process, because doing even trivial things in a crashed process (like allocating memory) is dangerous. The good news is, you can act as a middleman in this process. It has its quirks and twists, but it’s definitely doable.

What we need to do here is create a so-called runtime exception helper DLL with some predefined exported callbacks, register that DLL, and in our application (at runtime) tell WER that we want this DLL to potentially handle our crashes. The said DLL must:

  • export OutOfProcessExceptionEventCallback,
  • export OutOfProcessExceptionEventSignatureCallback,
  • export OutOfProcessExceptionEventDebuggerLaunchCallback.

The functions you need to implement and export must be regular C functions (I’m honestly grateful to Microsoft for not making this a COM API); for their signatures, see Microsoft’s documentation for each callback. Even though the documentation states that this DLL is

“[…] a custom runtime exception handler that is used to provide custom error reporting for crashes”,

the API, its parameters, and the way these functions are called back all suggest that this mechanism is designed to complement WER’s workflow, and not to take it over. However, with some trickery, we can actually take over. Here’s the full code of a DLL that does just that:

#include <cwchar> // wcscpy, wcslen
#include <windows.h>
#include <werapi.h>

extern "C" {

__declspec(dllexport) HRESULT WINAPI OutOfProcessExceptionEventCallback (
    PVOID /*pContext*/,
    const PWER_RUNTIME_EXCEPTION_INFORMATION /*pExceptionInformation*/,
    BOOL* pbOwnershipClaimed,
    PWSTR pwszEventName,
    PDWORD pchSize,
    PDWORD pdwSignatureCount)
{
    *pbOwnershipClaimed = TRUE; // Claim this crash
    const WCHAR* eventDescription = L"Your event description";
    wcscpy (pwszEventName, eventDescription);
    // According to the documentation, we must include the terminating L'\0'
    *pchSize = static_cast<DWORD> (wcslen (eventDescription) + 1);
    // Call back OutOfProcessExceptionEventSignatureCallback exactly one time
    *pdwSignatureCount = 1;
    return S_OK;
}

__declspec(dllexport) HRESULT WINAPI OutOfProcessExceptionEventSignatureCallback (
    PVOID /*pContext*/,
    const PWER_RUNTIME_EXCEPTION_INFORMATION pExceptionInformation,
    DWORD /*dwIndex*/,
    PWSTR /*pwszName*/,
    PDWORD /*pchName*/,
    PWSTR /*pwszValue*/,
    PDWORD /*pchValue*/)
{
    // pExceptionInformation contains an exception record, HANDLEs to the
    // crashing thread and process, and the crashing thread's context.
    // Using these you can write a minidump, walk the stack, etc.
    // [ Handle the crash here using pExceptionInformation ]
    // Terminate WER's workflow by terminating the crashing process
    TerminateProcess (pExceptionInformation->hProcess,
        pExceptionInformation->exceptionRecord.ExceptionCode);
    // From the viewpoint of WER this function did not succeed, because
    // we did not provide the parameter name and value (pchName, pchValue)
    return E_FAIL;
}

__declspec(dllexport) HRESULT WINAPI OutOfProcessExceptionEventDebuggerLaunchCallback (
    PVOID /*pContext*/,
    const PWER_RUNTIME_EXCEPTION_INFORMATION /*pExceptionInformation*/,
    PBOOL /*pbIsCustomDebugger*/,
    PWSTR /*pwszDebuggerLaunch*/,
    PDWORD /*pchDebuggerLaunch*/,
    PBOOL /*pbIsDebuggerAutolaunch*/)
{
    // Since we returned E_FAIL in OutOfProcessExceptionEventSignatureCallback,
    // this should not be called by WER
    return E_FAIL;
}

} // extern "C"

When a crash happens, WER will call OutOfProcessExceptionEventCallback. Here we have to decide whether we want to claim the crash, and provide a description for the event. We also have to specify how many parameters (key-value pairs) we want to add to the WER report (this is the number of times OutOfProcessExceptionEventSignatureCallback will be called back). Since we just want to use OutOfProcessExceptionEventSignatureCallback as a “hooking point”, we pass 1, and return success. Next, WER calls OutOfProcessExceptionEventSignatureCallback. Here we use the pExceptionInformation parameter to do the actual crash reporting. This structure is quite handy, as it contains the exception record, HANDLEs to the crashed thread and process, and the CONTEXT of the crashed thread. After we are done with our custom crash reporting actions, we terminate the crashed process, and return failure. This way WER backs off, and we are happy. Note that we could do the same dance earlier in OutOfProcessExceptionEventCallback, because pExceptionInformation gets passed to that function as well. But first actually claiming the crash, and then doing the dirty things just feels a bit nicer.

If you just cringed, I feel you, because this is relying on undocumented behavior (terminating the crashed application in the process of creating a WER report). However, there is not much WER could do if the process it wants to report a fault for disappears. I tried this on Windows 7, 8.1, and 10, and I saw the same behavior. An alternative solution would be disabling WER’s dialog with SetErrorMode and SEM_NOGPFAULTERRORBOX. Edit: that doesn’t work. Setting SEM_NOGPFAULTERRORBOX basically turns WER off for your process, including this plugin mechanism.

Now that we’ve got our WER helper DLL up and running, there are two more things left to get this working. First, we have to add our DLL’s path to the registry at HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting\RuntimeExceptionHelperModules, as described in Microsoft’s documentation. werfault.exe will refuse to load our runtime exception helper DLL if it’s not on this list. Second, in our application (preferably at an early point of execution) we have to call WerRegisterRuntimeExceptionModule with the helper DLL path. Be aware that this function does not do any validation (just because it returns S_OK doesn’t mean that everything will work fine). It doesn’t check whether the required functions are exported, nor if the DLL exists at all. It just writes some registration data to your process’s PEB. Be nice and unregister the DLL at a late point with WerUnregisterRuntimeExceptionModule.
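The in-process half of the registration could be wrapped in a small RAII class, so that unregistering at a late point happens automatically. This is a sketch under assumptions: WerHelperRegistration is a hypothetical name, no context pointer is passed to the callbacks, and the registry entry is assumed to have been written already (by an installer, for instance). On non-Windows platforms the class compiles but does nothing.

```cpp
#include <string>
#include <utility>

#ifdef _WIN32
#include <windows.h>
#include <werapi.h>
#endif

// Registers a WER runtime exception helper DLL for the current process on
// construction and unregisters it on destruction.
class WerHelperRegistration {
public:
    explicit WerHelperRegistration(std::wstring dllPath)
        : path_(std::move(dllPath)) {
#ifdef _WIN32
        // Success only means the path was recorded for this process;
        // the DLL and its exports are not validated at this point.
        registered_ = SUCCEEDED(
            WerRegisterRuntimeExceptionModule(path_.c_str(), nullptr));
#endif
    }

    ~WerHelperRegistration() {
#ifdef _WIN32
        // Must be called with the same path and context as the registration.
        if (registered_) {
            WerUnregisterRuntimeExceptionModule(path_.c_str(), nullptr);
        }
#endif
    }

    WerHelperRegistration(const WerHelperRegistration&) = delete;
    WerHelperRegistration& operator=(const WerHelperRegistration&) = delete;

    bool registered() const { return registered_; }
    const std::wstring& path() const { return path_; }

private:
    std::wstring path_;
    bool registered_ = false;
};
```

Creating one instance at the top of main with the full path of the helper DLL keeps the registration alive for the lifetime of the process. As noted above, registered () returning true says nothing about whether werfault.exe will actually load the DLL.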

Please take this solution with a pinch of salt, as there is not much experience with these APIs on the internet, and I myself haven’t used this feature in production (yet). Let’s take a moment to review the pros and cons.

Pros:

  • It works.
  • Aside from solving the specific case presented in this post, it actually provides full coverage. It can handle all crashes WER could, which is really nice. Edit: that’s not entirely true. See the follow-up post mentioned in the update history at the bottom. If you can live with its disadvantages, it’s a good idea to make use of this mechanism regardless of this specific problem.
  • It’s an out-of-process mechanism out of the box (you don’t have to write a single line of code to make it out-of-process).

Cons:

  • There are several different components working together, which makes maintenance and testing hard(er).
  • It requires administrative privileges to register your WER DLL, since you have to write to HKLM in the registry. This doesn’t work well with portable programs or with development builds, as they don’t have installers.
  • The APIs used here are available only on Windows 7 and later. If you have to target an earlier version, you are pretty much out of luck.
  • The runtime exception handler module is hard to debug since it gets loaded by a process that is spawned on-demand.

Closing thoughts

When Structured Exception Handling fails, I would expect my UnhandledExceptionFilter to be called. If the system can’t determine whether there is a handler or not, then at the end of the day no one handles the exception, thus it is unhandled. Maybe the function is named incorrectly, and there is a technical reason for this behavior, but I couldn’t find any information on that.

As for the solution I presented, I’m not a fan of that either. It’s quirky, complex, and inelegant. I wish there were alternatives, but I just couldn’t find any. If we look at how other operating systems provide similar functionality, Mach’s exception ports seem superior: they are relatively simple, hierarchical, and you get to choose whether you want them out-of-process or in-process.

Update history:

  • 2017-06-04: Fixed grammatical errors and typos. Referenced the follow-up post.

Author: Donpedro

C++ programmer with an interest in operating systems and everything low level.
