A follow-up on WER plugins

TL;DR: WER plugins are useful, but it turns out they’re not the silver bullet of crash detection.

Almost exactly a year ago I blogged about the so-called runtime exception helper modules of Windows Error Reporting (that’s a mouthful, so from now on I’ll refer to them by their internal name, WER plugins), and how you can enhance your crash detection capabilities with them on x64 Windows. I apologize for the lack of posts lately; I have been working on something in my free time that I’m not ready to share with the public yet. Since that post, I’ve gained some production experience with WER plugins, and I think it is worth sharing.

Mistakes were made

If you’ve read my previous post, you know that the most important type of crash that only a WER plugin could detect is the corruption of the return address. Here’s the relevant code snippet as a reminder:

// Compile with MSVC. Target x86, then target x64.
// See what happens (run without a debugger attached).
#include <cstdint>
#include <iostream>
#include <intrin.h>   // _AddressOfReturnAddress
#include <windows.h>

LONG WINAPI MyUEF (PEXCEPTION_POINTERS pExp)
{
  std::cout << "Unhandled exception with code " << std::hex <<
    pExp->ExceptionRecord->ExceptionCode << std::endl;

  Sleep (5000);
  TerminateProcess (GetCurrentProcess (), pExp->ExceptionRecord->ExceptionCode);

  // Never reached, but make the compiler happy
  return EXCEPTION_CONTINUE_SEARCH;
}

__declspec(noinline) void CorruptStack ()
{
  // Simulate corruption of the return address
  *(uintptr_t*) _AddressOfReturnAddress () = 0x1234;
}

int main ()
{
  SetUnhandledExceptionFilter (MyUEF);
  std::cout << "Corrupting stack..." << std::endl;
  CorruptStack ();
}

Long story short: unwind failure happens because the return address is corrupted, so the UEF does not get called back. Here comes the plot twist: even WER plugins can’t handle this (which is a damn shame, if you ask me). If you run this code with a WER plugin active, it won’t get loaded and called back.

“Wait a second, one of the major points of your previous post about this topic was the ability to handle that case. How come you didn’t notice back then?” You are correct, I was careless and wrong. I only have a vague recollection of how I tried things out back then, but my theory is that I tested my plugin with RaiseFailFastException, a function which, roughly speaking, throws an SEH exception without calling back UEFs and VEHs (it’s a trivial and easy way of trying out a WER plugin).

I assumed that WER plugins worked for all kinds of crashes, so I was happy when I got it working for one type of crash, and rushed to write a blog post about it. It was an irresponsible thing to do, and now I look stupid. My bad.

Anyways, this raises the obvious question: Why does it not work in the case of a corrupted return address? I could trace back this behavior to a function in one of WER’s modules, Faultrep.dll. The function is called CRuntimePlugin::Initialize. It has a precondition check that looks like this in assembly:

Those constants are pretty familiar and suspicious at the same time, right? Here’s the equivalent C code:


/* ... */
PEXCEPTION_RECORD pExc = __rax; // exception record pointer, held in rax here

if (pExc->ExceptionCode == EXCEPTION_STACK_OVERFLOW ||
    (pExc->ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&
     pExc->ExceptionInformation[0] == EXCEPTION_EXECUTE_FAULT))
{
  return E_NOTIMPL;
}
/* ... */

So if the exception in question is a stack overflow or an access violation of the execute kind, the plugin mechanism will not function. Since an overwritten return address will manifest in the latter, we are out of luck.
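To make the “access violation of the execute kind” part concrete, here is a minimal, portable sketch of that half of the condition. It is not WER’s code: FaultInfo, IsExecuteAccessViolation, CheckClassification and the k-prefixed constants are names made up for illustration, and the constants are redefined locally (with the same values as in the Windows SDK headers) only so the snippet compiles without windows.h.

```cpp
#include <cassert>
#include <cstdint>

// Same numeric values as the Windows SDK macros named in the comments;
// redefined here (assumption: illustration only) to avoid <windows.h>.
constexpr std::uint32_t  kAccessViolation = 0xC0000005u; // EXCEPTION_ACCESS_VIOLATION
constexpr std::uintptr_t kReadFault       = 0;           // EXCEPTION_READ_FAULT
constexpr std::uintptr_t kWriteFault      = 1;           // EXCEPTION_WRITE_FAULT
constexpr std::uintptr_t kExecuteFault    = 8;           // EXCEPTION_EXECUTE_FAULT

// Hypothetical stand-in for the two EXCEPTION_RECORD fields the check reads.
struct FaultInfo
{
  std::uint32_t  code;         // ExceptionRecord->ExceptionCode
  std::uintptr_t information0; // ExceptionRecord->ExceptionInformation[0]
};

// True when the fault is an access violation caused by an instruction fetch,
// which is what returning to a bogus address such as 0x1234 produces.
constexpr bool IsExecuteAccessViolation (const FaultInfo& fault)
{
  return fault.code == kAccessViolation &&
         fault.information0 == kExecuteFault;
}

// A few spot checks: only the execute flavor matches.
inline void CheckClassification ()
{
  assert (IsExecuteAccessViolation (FaultInfo {kAccessViolation, kExecuteFault}));
  assert (!IsExecuteAccessViolation (FaultInfo {kAccessViolation, kReadFault}));
  assert (!IsExecuteAccessViolation (FaultInfo {kAccessViolation, kWriteFault}));
}
```

An ordinary read or write through a bad pointer carries 0 or 1 in ExceptionInformation[0], so it does not match this half of the condition.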

I know this is not an insightful answer (or an answer at all, as some might argue), but to be fair, no one outside Microsoft can really explain this without access to the source. I contacted the WER team at Microsoft (some mailing list of them at least), but received no response. I can’t imagine why this limitation exists. The plugin code runs in a separate process (WerFault.exe, to be specific), so these two types of crashes shouldn’t need a complete backoff for safety or other reasons.

Throttling

The other thing I wanted to mention briefly is client-side throttling of WER reports. I haven’t found any official documentation on this, so I’m speaking purely from experience. It seems that if some number of crashes with the same signature happen within a given timeframe, further reports are suppressed for some time.

This popped up a few times in production, where users complained about the same crash happening throughout the day. When I checked their machines, I noticed that after a number of crashes, no events were written to the event log, nor was our WER plugin called back, despite the crash definitely happening.

Closing thoughts

I’m really disappointed by the fact that not all crashes can be detected using WER plugins. The official documentation doesn’t mention this limitation, and I failed to notice it back then, so I got burned by this peculiar behavior. Stack corruptions are dangerous errors, and not being able to detect even a subset of them is a pity. I wonder if there’s an official stance by Microsoft regarding custom crash detection and handling. Either no such stance exists, or custom crash handling is officially unsupported.

Either way, this does not mean that WER plugins are useless for custom crash detection. Since this technique went into production at the company I work at, I’ve already seen some crashes caught with them which otherwise would have been undetected. For example, regular crashes in non-unwindable code, 3rd party components calling RaiseFailFastException, and ugly stack corruptions where an access violation happened before the attempt to return.

Author: Donpedro

C++ programmer with an interest in operating systems and everything low level.

4 thoughts on “A follow-up on WER plugins”

  1. >>I contacted the WER team at Microsoft (some mailing list of them at least), but received no response.

    Sorry about that! I never received your mail. I’m the dev lead for WER, and we’re doing some work in this area, which is why I came across this blog.

    Your analysis is correct, of course — we don’t pass control to the plugin if the exception information suggests that it’s a GS (buffer overrun) or NX (execution of non-executable memory). Why? Because one of the main purposes of a plugin is to alter the crash signature, and we don’t want these particular exceptions (which indicate potential security issues) to have their crash signature altered.

    Having said that, we’re considering a change to WER that would invoke the OutOfProcessExceptionEventCallback anyway and ignore the attempt to claim the exception. That would allow a plugin to act on the exception without changing the signature.

    1. Thanks for reaching out Alan, I really appreciate it. I’d like to forward my previous e-mail to you. I’ve tried sending it to “your first name followed by the first letter of your last name at microsoft dot com”, but that address doesn’t seem to exist. What would be the best way to contact you?

      > Because one of the main purposes of a plugin is to alter the crash signature, and we don’t want these particular exceptions (which indicate potential security issues) to have their crash signature altered.

      There are two problems here. First, we are practically abusing WER plugins. We (by “we” I mean my employer and myself) are not using plugins to provide a signature, but to “steal” crashes from WER. We always try to be good citizens, but in this case, there is no possibility for real cooperation (that is, reporting a crash for ourselves while also letting WER handle it).

      The second problem is that the root cause in these cases is not always security related. Even though I erroneously state in this post that 0xC0000409 is stack overflow, it’s actually __fastfail (I guess you refer to this as GS because it used to mean that in the past, but was later “overloaded” for the more generic __fastfail). Some prominent/”famous” developers are already recommending using __fastfail over abort, throwing an SEH exception, etc. I know of one major graphics driver vendor who uses __fastfail in their drivers to signal unrecoverable errors in client processes. Heck, even the CRT uses it for std::abort (recent versions, at least). One could argue that abort can be caught by setting up an abort handler, but that doesn’t work if your process has multiple CRTs loaded e.g. from 3rd party libraries, shell extensions, etc. (fun times…).

      > we’re considering a change to WER that would invoke the OutOfProcessExceptionEventCallback anyway and ignore the attempt to claim the exception

      That’d be a very welcome change. How can we make our voice heard? UserVoice or the Feedback Hub don’t seem to be the appropriate channels for this kind of feedback 🙂

      My employer has vast experience with crash reporting (collection, analysis, etc.), I’m pretty sure it’s not as sophisticated/smart/scalable as WER, but either way, I think we could both benefit from a two-way conversation.
