Be careful what you ask for (especially in WMI queries)

TL;DR: In exotic environments, WMI queries might take longer than you think.

Recently, we’ve received a report from a user claiming that our product was taking ages to launch. According to the report the program started eventually, but a few minutes of delay was inconvenient enough for the client to file a support case (unsurprisingly).

The initial ETW trace provided showed that the product in question was hung for more than 4 minutes (with no CPU usage), then startup proceeded normally.

The “Delay Type” column showing “COM Call” suggests that this issue is COM-related. The main thread was waiting for a background thread belonging to a 3rd party library, and that thread was blocked on a call stack like this:

ntdll.dll!NtAlpcSendWaitReceivePort
rpcrt4.dll!LRPC_BASE_CCALL::DoSendReceive
rpcrt4.dll!I_RpcSendReceive
combase.dll!ThreadSendReceive
combase.dll!CSyncClientCall::SendReceive2
combase.dll!CSyncClientCall::SendReceive
combase.dll!NdrExtpProxySendReceive
rpcrt4.dll!NdrpClientCall2
combase.dll!ObjectStublessClient
combase.dll!ObjectStubless
fastprox.dll!CEnumProxyBuffer::XEnumFacelet::Next
App.exe!FunctionOf3rdParyLibrary
[…]

Uh-oh. This confirms that this is indeed an issue where COM is involved. I hate diagnosing COM/RPC related issues, because:

  • I’m not that familiar with COM internals, and in my experience that’s needed for diagnosing issues like this
  • The causality chain usually spans multiple threads and processes, making it very cumbersome to follow
  • Most of the time the causality chain leads to nowhere relevant (sooner or later being readied by the Idle process, for example)

I attempted to follow the causality chain anyway. Just as expected there was no direct culprit, but WMIPrvSE.exe showed up, which was something to start with, at least.

Did someone say WMI?

WMIPrvSE.exe is a host process for WMI, Windows Management Instrumentation. In short, WMI is a management technology that represents system and hardware components, certain states, processes etc. as objects. You can query all of this (even remotely) with a SQL-like language called WQL.

Does this sound like a feature an IT department would find most useful? Yes. WMI is seldom used by regular applications. However, since you can query pretty much everything with WMI, a too ambitious IT department can incidentally DoS your machine if they are being reckless with their queries (that would explain why this particular issue never popped up before).

When a system component turns out to be highly relevant in an investigation, I always run xperf -providers and search for potentially useful ETW providers. Then taking a trace on my own machine shows which ones could be handy for further investigation. In this case, Microsoft-Windows-WMI-Activity seemed useful, so I asked the user to capture another trace with this provider enabled.

For not well-known providers, the “Task Name” column is very useful to get an idea of which events mean what, but in this case, they were empty (the provider’s manifest must lack this information for some reason). I could see that processes of interest emitted events, but I couldn’t make sense of the data.

It seemed like events with Id 11 and 13 were emitted in pairs, with an integer field called OperationId binding them together. All I needed to do was somehow group these events and calculate their duration, maybe a long-running WMI operation would show up. If only there was a WPA feature for this…

Regions of interest

Well, there is, and it’s called Regions of Interest. Very simply put, it’s a tool that allows you to make rectangles out of pairs of events in WPA. Simply put, it’s an XML-based format that contains a description for defining regions made up of events. You can pair them based on their properties, name the regions, etc.

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<InstrumentationManifest>
<Instrumentation>
<Regions>
<RegionRoot Guid="{EFA7A927-BAE3-48F6-92E1-000000000000}" Name="WMI-Regions">
<Region Guid="{45EEAD74-5B56-4B38-8CEC-CF66312D6F9B}" Name="WMI-Operations">
<Start>
<Event Provider="{1418ef04-b0b4-4623-bf7e-d74ab47bbdaa}" Id="11" Version="0" />
</Start>
<Stop>
<Event Provider="{1418ef04-b0b4-4623-bf7e-d74ab47bbdaa}" Id="13" Version="0" />
</Stop>
<Match>
<Event>
<Payload FieldName="OperationId" TargetFieldName="OperationId" />
</Event>
</Match>
<Naming>
<PayloadBased NameField="Operation" />
</Naming>
</Region>
</RegionRoot>
</Regions>
</Instrumentation>
</InstrumentationManifest>
view raw WMI_durations.xml hosted with ❤ by GitHub

This barely needs explanation, but here it goes:

  • Events of Microsoft-Windows-WMI-Activity with Id 11 and 13 should be paired into regions…
  • …but only ones with matching OperationId fields
  • Created regions should be named based on the Operation field of events

After applying this file, I could see all WMI queries ordered by their duration:

Looks like the winner is SELECT * FROM Win32_NTDomain, which took almost the same time as the application was hung. But why does it take so long on the user’s machine? This should just return the domain that the computer is part of, right? That shouldn’t take long, right? You can run WMI queries by:

  1. Running wbemtest.exe elevated
  2. Pressing “Connect…”, then connecting to root\wmi
  3. Pressing “Query…”, entering your query, and hitting “Apply”

I ran this query on my own developer machine, and got this result in an instant:

I asked the user to do the same, and note how much time it takes to execute. They sent me back this, and told me that the query took a few minutes to complete:

What’s up with all these entries? Shouldn’t a computer be part of one domain exactly? The documentation of Win32_NTDomain explains this “phenomenon”:

You can use this class to list all of the domains that can be discovered from the local computer.

Aha! So it’s not the single domain that your machine is part of, it’s all discoverable domains (I had no idea that such a thing existed, but hey, you learn new things every day). Capturing a quick network trace showed that running this query involves communication with the domain controller, so in case of this user if the connection quality is poor with one or more domain controllers, no wonder the query is taking ages to finish. By intentionally degrading my network connection with clumsy, I could get the query to take as long as a minute on my machine. Even though the documentation states that “the number of domains that the provider can discover is limited by […] network latency”, it doesn’t really seem to be the case.

Mystery solved. As a workaround, we’ve instructed the user to contact their IT department to limit the number of discoverable domains, if possible. In the meantime we’ve also contacted the vendor of the 3rd party library initiating this WMI query, and brought this case to their attention.

Closing thoughts

Maybe it’s just a coincidence, but to me it feels like the problems caused by WMI (either directly or indirectly) shows an increasing trend. Take this, this, or this case, for example.

At first glance WMI might seem like a straightforward technology, however, even the simplest of queries can hold systemwide locks, perform network communication, and initiate non-trivial amounts of IO. After reading about numerous cases like this one, it’s clear that the actual cost of WMI queries (especially in corner cases) is very hard to estimate, so in general, it’s best used sparingly.

Author: Donpedro

C++ programmer with an interest in operating systems and everything low level.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s