TL;DR: There is a bug in Microsoft OneDrive’s shell extension (which gets installed automatically with Office, by the way) that makes programs skip global variable destructors in DLLs if they ever open a file dialog.
The company I work for has a really decent custom crash detection system. It’s multiplatform, it has features like out-of-process crash reporting, automated categorization, etc. If you’ve read some of my previous posts on this blog, then you are aware that implementing such a system is no easy feat. No matter how thorough and careful you are, there’ll always be crashes that escape your ability of detection (given your program is complex enough and has plenty of users…).
This is one of the main reasons why it has a feature called abnormal exit detection. Upon exiting, the program using said crash detection system must signal that it’s about to exit. If it fails to do that, the out-of-process crash reporting server (this is just a “watchdog process” that runs on the same machine) knows that something went wrong, and takes a note of the abnormal exit and the exit code. So, let’s say we detected that our program vanished, and its exit code was 0xC0000005 (the exception code of an access violation). That’s a telltale sign of an uncaught crash, because when an unhandled SEH exception occurs in a program, it gets terminated with the exception’s code. At this point, you can’t create a minidump or walk the stack of course, because the process is already gone, but having the exit code at least is still better than nothing.
Other than pinning down potentially unhandled crashes, sometimes this mechanism catches miscellaneous errors, and bugs in 3rd party libraries as well. This time it caught a bug in the OneDrive shell extension, GROOVEEX.dll. But let’s not skip ahead, it took quite a lot of time and effort to come to that conclusion. Let me tell you the whole story…
Error: The operation completed succesfully
Pretty much since the abnormal exit detection feature went live for our users, we have been getting a lot of these reports with various exit codes. In fact, we get more of these reports than actual crashes, so believe me when I say we get a lot of them. Surprisingly, the vast majority of these reports contained the exit code of 0. That’s right, it seemed that many times our program failed to exit normally, but the exit code was ERROR_SUCCESS anyway.
Sometimes this phenomenon happened to in-house users as well (testers, developers, etc.). Whenever we saw this kind of issue popping up in our crash reporting database with an in-house machine, we asked the affected users if there was anything strange or circumstance worthy of mention. They always told us that nothing strange was happening, the program seemed to exit normally. And also, it always happened sporadically, so no chance of a repro, either.
Without a repro in our hands, we still had to do something to find an explanation. Of course when you encounter something like this, it’s usually best that you don’t assume it’s an error in the operating system or a 3rd party component, so we started looking for errors in our own code (remember the First Rule of Programming).
One potential cause could be that we (or some other process) calls TerminateProcess on our program with the exit code parameter set to zero. Another one could be an error in our out-of-process crash reporting infrastructure: maybe these reports are just false positives, or maybe we misdetect the exit code.
Several code reviews later we actually found some minor mistakes that could be the cause of these reports given the right circumstances (we didn’t get our hopes up, though…). Since we couldn’t reproduce the issue, we couldn’t try if our fixes worked. We had to wait for the next patch to be deployed, and see whether these reports stop flowing in. They didn’t.
The NtGlobalFlag value that was almost useful
The product that was affected by this problem has a beta testing period that takes place every year, a few months before the final release of the newest major version. It has a reward system (with actual rewards, and not just wallpapers, for example), so participating users are usually motivated and responsive.
During this beta period at the beginning of this year, we spotted two beta users that were affected by this issue more frequently than usual. It was still sort of sporadic, but it seemed it could be worth reaching out to them. Recall that our two main suspects were:
- Bug in our out-of-process crash reporting system
- Someone terminating our process early with exit code zero
Even though we’ve already reviewed relevant parts of our code, we still couldn’t rule out suspect number one. We asked for logs from the users hoping to spot anomalies, but there wasn’t any in them.
As for suspect number two: after some googling, I found a global flag value (if you don’t know what that is, it’s basically just a per-process registry value that lets you tell Windows to handle your process differently, or trigger special behavior) called FLG_MONITOR_SILENT_PROCESS_EXIT. This seemed like just the thing we needed: if this flag is active, a minidump will be written when the process is terminated, and an event will be emitted in the Windows event log containing the terminating process and the exit code. It has some additional features as well, so if you are curious, check out the documentation.
After checking whether this feature works on a clean install (you need to watch out for this; some features driven by global flags require the Windows SDK to be installed, like the Page Heap, for instance), I threw together a bunch of batch files that turn on and off this feature for our program. Then I handed these files over to the users with instructions and asked them to work for a few days like this and send back the resulting minidumps and event log.
Only one of them repro’d the issue during this phase, but that was enough, because we could correlate the entry in the user’s Windows event log with a record in our crash reporting database. The exit in question was what the documentation refers to as a “self exit”, that is, the process either exited normally, or called TerminateProcess on itself. That certainly rules out another process terminating our program, but that was never our main suspicion anyway.
And what about the minidump? It contained a call stack like this:
This seems like a normal exit: main returned, and the runtime is calling the Win32 function ExitProcess. Even though the documentation is not exactly clear about this, it turns out this silent process exit monitoring feature writes minidumps upon calling ExitProcess and TerminateProcess1.
So basically, we were asking the system the following (wrong) question: “Who and when calls ExitProcess or TerminateProcess on our program?”. And the system righteously answered: “The C runtime”. Well, duh! That’s always the case given a graceful exit, so that’s not really useful (in this particular case, at least). If you know how a regular exit goes down in case of an MSVC-compiled application, then you are aware that the point of the ExitProcess call is far from the end of the party. If you don’t, here’s how things happen (super roughly):
- Your main function returns, or exit gets called
- The runtime calls global variables’ destructors contained in your executable
- The runtime calls ExitProcess
- The call to ExitProcess will terminate all threads, except the calling one
- ExitProcess then will start unloading DLLs
- As DLLs are unloaded, global destructors contained in them are also called
- The process is dead
Well, what do you know! The logic responsible for signaling succesful exits is implemented in a global C++ object’s destructor that happens to be in a DLL (it’s worth noting that this is on purpose, we need this code to run as late as possible to cover as much lifetime as possible). The minidump was taken at point 3, while our exit signaling mechanism runs at point 6. The nastiness must happen between these two points, but unfortunately, that’s a blind spot for silent process exit monitoring.
Sir, you need a debugger for that
If we could have a look at the process’ very last state just before it ceased to exist (anyone remembers the scene from Wild Wild West where they made a projector out of a man’s head to find out what he saw just before his death?), that would hopefully reveal the culprit. It’s certainly possible, windbg, for example, can show you the final call stack of a debuggee (or even write a minidump).
So we could either:
- Teach the poor user the basics of windbg
- Write a dbgeng script and ask the user to execute it (preferably with cdb, the GUI-less “version” of windbg)
- Or organize a remote session and live debug the issue ourselves
The first option was out of the question for obvious reasons (if you wonder why, you’ve probably never used windbg; the TL;DR is it’s a power tool that’s cumbersome to use even for developers). The second option might have been viable, but I was afraid that even if I could hack together a script that does the job, an unexpected event or circumstance would make it fail or go astray (and also, the dbgeng script syntax is god awful…). The third option would have worked if the issue was reproducible reliably, but it wasn’t.
There is nothing special about windbg that makes this particular feature possible. On Windows, the ability to intercept the exit point of a debuggee comes out-of-the-box. Given this, I eventually ended up writing a really simple debugger that writes a minidump of the process upon receiving the EXIT_PROCESS_DEBUG_EVENT event. Even though I’ve never used the debugging APIs before, it only took a couple of hours to throw this mini-debugger together. You create a process with the DEBUG_ONLY_THIS_PROCESS flag, enter the so-called debug loop, and respond to events, easy as that (the documentation was quite helpful).
I sent this small debugger to the user with instructions, asked him to work like this for a few days, and send back the resulting minidumps. A couple of days later he provided us with the dumps.
The gates are now electrified, indeed
Most of the minidumps contained a call stack for regular exits (similar to the one shown earlier). Some of them, however, looked like this:
ucrtbase!__crt_seh_guarded_call::operator()<, & __ptr64, >
When I looked at this call stack2, I immediately knew what the problem was. Not because I’m that experienced, but because I’m a regular reader of Raymond Chen’s blog. I remembered a particular post that talked about what happens when you try to enter an abandoned critical section during process shutdown: your process gets terminated, right on the spot. And as it turns out, with exit code zero.
The whole phenomenon started to make perfect sense! When a user opens a file dialog, the OneDrive shell extension, GROOVEEX.dll, gets loaded. It must have been creating threads, one (or more) of which uses a critical section. But a global destructor in this shell extension also wants to lock this critical section (to perform some kind of cleanup, supposedly). However, if the thread that was holding it earlier was terminated because of a process shutdown (point 4 of the shutdown sequence explained above), we end up with the situation exactly like described in this post. This also explains why we were never able to reproduce this, the process needed to be shut down while the critical section in question was held by a thread other than the one initiating the exit (a race condition).
Reporting the issue
We finally had the culprit, it was time to report the bug to Microsoft. If you report problems to Microsoft from time to time (like myself), then you know it’s way harder than hunting down the actual bugs3. “I Can Win” case: you encounter a crash or problem in Visual Studio or one of its components (like MSVC). Report it through Connect or Developer Community, and call it a day. “Hurt Me Plenty” case: you find a bug in a component that has an active MSDN blog, or you know its developers are lurking around on forums or Reddit. You can report it to them directly. “Nightmare!” case: you want to report a problem in a component that’s sort of exotic. General googling and looking up conference speeches about the topic (hoping the presenter actually works on that component) is the best you can do. If you have a name, you might have success snatching that person’s e-mail address using services like Clearbit Connect and RocketReach.
This particular problem turned out to be a “Nightmare!” case, but with the method described above, I could find and contact a Program Manager on the OneDrive team. I sent him a document describing the issue, which he forwarded to a developer. We exchanged some e-mails with the said developer, but the whole thing stalled after that, unfortunately (no fix, no response).
A couple of months later something changed: the number of incoming reports (abnormal exit with exit code zero) largely increased. For some reason (maybe an implementation detail changed in GROOVEEX.dll) the issue suddenly became a 100% reproducible. As they say, when life gives you lemons, make lemonade. We did, in form of a minimal repro program, which we sent out to Microsoft immediately.
The repro program itself is a super simple console application:
- It has a static DLL dependency
- There is a global C++ object in this DLL which prints a message in its constructor and destructor
- The program opens a file dialog, and when it’s closed, discards the result
You can try it out yourself, given you have Office (and therefore the OneDrive shell extension) installed, and at the time you try it the bug is still not fixed. You can download the application from here (you carefully inspect the code and confirm that it’s harmless before compiling it, right?).
Unfortunately, even though we’ve provided a minimal repro program, the bug is still not fixed at the time of writing this post. Here’s a timeline:
- Summer of 2016: Abnormal exit detection feature of our crash reporting system went live.
- Fall of 2016: We noticed these peculiar reports and started looking for possible explanations, we reviewed our code.
- Winter of 2016: After several code reviews, we landed some patches hoping to fix the problem.
- February of 2017: Beta testing of the next major version of our product began.
- March of 2017: We identified a handful of beta test participants who seemed to be affected by this issue more frequently than others.
- April of 2017: After reaching out to some beta testers, one user was able to provide a minidump that pinpointed GROOVEEX.dll as the culprit.
- May of 2017: I contacted a Program Manager on the OneDrive team about the issue. He forwarded my e-mail to a developer.
- July of 2017: Bug suddenly became always reproducible. Minimal repro program created, and sent to the Program Manager (developer CC’d), asking for escalation.
- September of 2017: Since the bug still wasn’t fixed, I asked for a status report from the Program Manager (no reply as of yet).
I know some people will react to this whole story with something like “But Peter, you shouldn’t rely on global destructors anyway!” or “This is just another reason why you should avoid globals, el-oh-el!”, and in a sense, they’d be right. Unsurprisingly, we are aware of best practices like this, so the product where we’ve hit this does not suffer from any feature or data loss because of this issue. But a bug is a bug, and bugs ought to be fixed. The abnormal exit detection feature of our crash reporting system is quite useful, it does catch errors in our own code from time to time, but for a year or so we’ve been spammed with reports of this GROOVEEX.dll issue, which is kind of inconvenient.
As for Windows’s behavior, according to Raymond’s blog post that I also linked above, in Windows XP and earlier the system was quite forgiving. If someone tried to enter an orphaned critical section during process shutdown, they were simply let through. It’s easy to see why Microsoft changed this in Vista: if a thread was terminated while owning a critical section, the actual resource that was guarded by that critical section could very well be corrupt, so letting someone touch it could potentially cause more harm than good.
Since Vista, Windows gives you a slap on the face in this scenario, which is fine. What’s not fine is pretending that nothing wrong happened. The process’ exit code will be zero (with some cheating4), and Windows Error Reporting will neither write a report nor an event log entry.
This decision was most likely made to retain compatibility (“Hey Microsoft, program X and Y worked fine on XP, but on Vista, they seem to crash when I close them!”). However, in my opinion, it would have been much better to default to crashing the offending program in this scenario with WER, and resort to swallowing the error only with a compatibility shim, or when the minimum subsystem version of the executable image is less than or equal to that of Windows XP. I bet that if reports of GROOVEEX.dll trying ot enter an orphaned critical section flooded
winqual the Windows Dev Center (this is where WER reports end up), the issue would’ve already been fixed. In other words, random Central-Eastern European guys writing blog posts don’t scale, automated problem reports do.
1 Fun fact: if both gets called (for example, you have a global variable in a DLL that calls TerminateProcess in its destructor), two minidumps will be written for the same exit (at two different points, of course).
2 If you are wondering where those GROOVEEX!DllUnregisterServer calls come from, the answer is they are not actually GROOVEEX!DllUnregisterServer calls. This DLL is not on the MS Symbol Server, so the debugger can’t get a PDB for it. However, that doesn’t stop windbg from trying to resolve addresses, so it found the nearest exported function, and assumed the calls belong to it. In short, interpret those lines as “private functions that we don’t have any symbolic information for”.
3 I’m aware that the official way to report problems regarding any Windows component is through the Feedback Hub app. However, I haven’t had any success with it so far, so I recommend using the alternative methods described here.
4 Inspecting the code that runs in this case revealed that the actual NtTerminateProcess syscall is invoked with the exit code parameter of 0xC000004B, which is NTSTATUS STATUS_THREAD_IS_TERMINATING. I didn’t bother to check the kernel mode code as well, but apparently there is a branch in there somewhere that checks for this exit code and “corrects” it to zero (with a cheeky comment in the source code, I imagine).