TL;DR: WER plugins are useful, but it turns out they’re not the silver bullet of crash detection.
Almost exactly a year ago I blogged about the so-called runtime exception helper modules of Windows Error Reporting (that’s a mouthful, so from now on I’ll refer to them by their internal name, WER plugins), and how you can enhance your crash detection capabilities with them on x64 Windows. I apologize for the lack of posts lately; I have been working on something in my free time that I’m not ready to share with the public yet. Since the post I referred to above, I’ve gained some production experience with WER plugins, and I thought it was worth sharing.
Mistakes were made
If you’ve read my previous post, you know that the most important type of crash that only a WER plugin could detect is the corruption of the return address. Here’s the relevant code snippet as a reminder:
|// Compile with MSVC. Target x86, then target x64.|
|// See what happens (run without a debugger attached).|
|#include <windows.h>|
|#include <intrin.h>|
|#include <iostream>|
|LONG WINAPI MyUEF (PEXCEPTION_POINTERS pExp)|
|{|
|    std::cout << "Unhandled exception with code " << std::hex <<|
|        pExp->ExceptionRecord->ExceptionCode << std::endl;|
|    TerminateProcess (GetCurrentProcess (), pExp->ExceptionRecord->ExceptionCode);|
|    // Never reached, but make the compiler happy|
|    return EXCEPTION_CONTINUE_SEARCH;|
|}|
|__declspec(noinline) void CorruptStack ()|
|{|
|    // Simulate corruption of the return address|
|    *(uintptr_t*) _AddressOfReturnAddress () = 0x1234;|
|}|
|int main ()|
|{|
|    SetUnhandledExceptionFilter (MyUEF);|
|    std::cout << "Corrupting stack..." << std::endl;|
|    CorruptStack ();|
|}|
Long story short: unwinding fails because the return address is corrupted, so the UEF does not get called back. Here comes the plot twist: even WER plugins can’t handle this (which is a damn shame, if you ask me). If you run this code with a WER plugin active, the plugin won’t get loaded or called back.
“Wait a second, one of the major points of your previous post about this topic was the ability to handle that case. How come you didn’t notice back then?” You are correct, I was careless and wrong. I only have a vague recollection of how I tried things out back then, but my theory is that I tested my plugin with RaiseFailFastException, a function which more or less throws an SEH exception without calling back UEFs and VEHs (it’s a trivial and easy way of trying out a WER plugin).
I assumed that WER plugins worked for all kinds of crashes, so I was happy when I got it working for one type of crash, and rushed to write a blog post about it. It was an irresponsible thing to do, and now I look stupid. My bad.
Anyways, this raises the obvious question: Why does it not work in the case of a corrupted return address? I traced this behavior back to a function in one of WER’s modules, Faultrep.dll. The function is called CRuntimePlugin::Initialize. It has a precondition check that looks like this in assembly:
Those constants are pretty familiar and suspicious at the same time, right? Here’s the equivalent C code:
|/* ... */|
|PEXCEPTION_RECORD pExc = __rax;|
|if (pExc->ExceptionCode == EXCEPTION_STACK_OVERFLOW |||
|(pExc->ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&|
|pExc->ExceptionInformation[0] == EXCEPTION_EXECUTE_FAULT))|
|/* ... */|
So if the exception in question is a stack overflow or an access violation of the execute kind, the plugin mechanism will not function. Since an overwritten return address will manifest as the latter, we are out of luck.
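To make the consequence of that check concrete, here is a minimal, portable sketch of the same predicate. The constant values match the Windows SDK definitions of EXCEPTION_ACCESS_VIOLATION, EXCEPTION_STACK_OVERFLOW and EXCEPTION_EXECUTE_FAULT, but the names `CrashInfo`, `WerPluginWouldBeSkipped` and `RunExamples` are my own illustrative inventions. This is only a restatement of the decompiled condition, not WER's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Values mirror the Windows SDK definitions; they are redefined here only so
// that the sketch compiles without <windows.h>.
constexpr std::uint32_t kExceptionAccessViolation = 0xC0000005u; // EXCEPTION_ACCESS_VIOLATION
constexpr std::uint32_t kExceptionStackOverflow   = 0xC00000FDu; // EXCEPTION_STACK_OVERFLOW
constexpr std::uintptr_t kReadFault    = 0; // ExceptionInformation[0] for a read
constexpr std::uintptr_t kWriteFault   = 1; // ExceptionInformation[0] for a write
constexpr std::uintptr_t kExecuteFault = 8; // EXCEPTION_EXECUTE_FAULT

// Hypothetical stand-in for the two EXCEPTION_RECORD fields the check reads.
struct CrashInfo {
    std::uint32_t  code;        // ExceptionRecord->ExceptionCode
    std::uintptr_t accessType;  // ExceptionRecord->ExceptionInformation[0]
};

// Hypothetical helper: true when the condition described above holds,
// i.e. when the plugin would not be loaded for this crash.
constexpr bool WerPluginWouldBeSkipped(const CrashInfo& crash) {
    return crash.code == kExceptionStackOverflow ||
           (crash.code == kExceptionAccessViolation &&
            crash.accessType == kExecuteFault);
}

// A few representative crashes and what the predicate says about them.
inline void RunExamples() {
    // Returning to a bogus address such as 0x1234: execute fault, skipped.
    assert(WerPluginWouldBeSkipped({kExceptionAccessViolation, kExecuteFault}));
    // Stack overflow: skipped regardless of the second field.
    assert(WerPluginWouldBeSkipped({kExceptionStackOverflow, 0}));
    // Ordinary read or write access violations still reach the plugin.
    assert(!WerPluginWouldBeSkipped({kExceptionAccessViolation, kReadFault}));
    assert(!WerPluginWouldBeSkipped({kExceptionAccessViolation, kWriteFault}));
}
```

Only the execute flavor of access violation is filtered out, which is why a stack corruption that faults on a read or write before the function returns can still be caught.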
I know this is not an insightful answer (or an answer at all, as some might argue), but to be fair, no one other than Microsoft can really explain this without access to the source. I contacted the WER team at Microsoft (or at least one of their mailing lists), but received no response. I can’t imagine why this limitation exists. This code runs in a separate process (WerFault.exe, to be specific), so these two types of crashes shouldn’t need a complete backoff for safety or other reasons.
The other thing I wanted to mention briefly is client-side throttling of WER reports. I haven’t found any official documentation on this, so I’m speaking purely from experience. It seems that if some number of crashes with the same signature happen within a given timeframe, further reports will be suppressed for some time.
This popped up a few times in production where users complained about the same crash happening throughout the day. When I checked their machines, I noticed that after a number of crashes, no events were emitted in the event log, nor was our WER plugin called back, despite the crash definitely happening.
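To illustrate the kind of behavior I mean, here is a toy model of per-signature throttling. Everything in it is an assumption: the class name `ReportThrottleModel`, the sliding-window policy, and the limits passed in are all made up, since I don’t know the real thresholds or how WER implements this. It only mimics what I observed from the outside.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <map>
#include <string>

using Seconds = std::chrono::seconds;

// Hypothetical model: allow at most `maxReports` reports per crash signature
// within a sliding `window`; anything beyond that is silently suppressed.
class ReportThrottleModel {
public:
    ReportThrottleModel(std::size_t maxReports, Seconds window)
        : maxReports_(maxReports), window_(window) {}

    // Returns true if a crash with this signature at time `now` would still
    // produce a report (event log entry, plugin callback).
    bool ShouldReport(const std::string& signature, Seconds now) {
        std::deque<Seconds>& recent = reported_[signature];
        // Forget reports that have fallen out of the window.
        while (!recent.empty() && now - recent.front() >= window_) {
            recent.pop_front();
        }
        if (recent.size() >= maxReports_) {
            return false; // throttled: the crash happens, but nothing is reported
        }
        recent.push_back(now);
        return true;
    }

private:
    std::size_t maxReports_;
    Seconds window_;
    std::map<std::string, std::deque<Seconds>> reported_;
};
```

With made-up limits of two reports per minute, the third crash with the same signature inside that minute would go unreported, while a crash with a different signature would still get through. That matches the symptom above: the user sees the crash, but the event log and the plugin stay silent.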
I’m really disappointed by the fact that not all crashes can be detected using WER plugins. The official documentation doesn’t mention this limitation, and I failed to notice it back then, so I got burned by this peculiar behavior. Stack corruptions are dangerous errors, and not being able to detect even a subset of them is a pity. I wonder if there’s an official stance by Microsoft regarding custom crash detection and handling. Either such a thing does not exist, or it is officially unsupported.
Either way, this does not mean that WER plugins are useless for custom crash detection. Since this technique went into production at the company I work at, I’ve already seen some crashes caught with them that would otherwise have gone undetected. For example, regular crashes in non-unwindable code, third-party components calling RaiseFailFastException, and ugly stack corruptions where an access violation happened before the attempt to return.