Thursday, December 30, 2010

About IRP_MJ_CREATE and minifilter design considerations - Part III

An interesting topic when discussing creates is the context (thread and process context) in which the create happens. This isn't really interesting from the OS perspective (since the OS always receives the request in the context of the requestor) but from a filtering perspective. In the previous post we discussed how the OS takes the request and eventually sends an IRP to the file system. There are some things to note:

  1. CREATE operations must be synchronized by the OS. I think this is true for any stateful protocol (and stateless protocols don't really have a CREATE operation anyway). The CREATE operation simply means "hey everyone, there will be some requests in this context for this object so you'd better set up your contexts so you know what we're talking about when you get the next request". So the requestor can't really do anything until the request is complete since they don't even have a handle. This means that the IO manager will pretty much execute in a single thread and when it needs to wait for some other service (like the FS) it will send a request (the IRP_MJ_CREATE IRP) and wait for it to come back.
  2. The FS stack however is layered. The implication of this is that while the user can treat the CREATE operation as synchronous, the layers involved in processing that create can't. For file system filters (legacy and minifilters), there are 3 distinct steps:
    1. Before the request makes it to the minifilter (before the preCreate callback is called)
    2. After the request is seen by the minifilter, but before the minifilter knows the request has been completed by the lower layers (after the preCreate callback but before the postCreate callback)
    3. After the minifilter knows the request has completed, but before the IO manager knows about it (after the postCreate callback)
    This is important to understand because there are certain limitations, depending on what each layer of the OS knows about the request. For example, during a preCreate callback, the IO manager knows someone wants to open a file but the FS doesn't yet know about that file. So even though the minifilter has a FILE_OBJECT structure (which comes from the IO manager), trying to use it to request something from the FS (like reading or writing or even queries) cannot work since the FS has not yet seen the request and has no idea what the FILE_OBJECT is supposed to represent (the information about which stream on disk the FILE_OBJECT will represent is stored in the create IRP and not in the FILE_OBJECT). In a similar fashion, during the postCreate callback the filter knows how the FS handled the request (whether it was a successful request or not) but the IO manager doesn't, so trying to call a function that involves the IO manager for that FILE_OBJECT (for example ObOpenObjectByPointer, which will create a HANDLE given an OBJECT) will fail.
  3. FltMgr will also synchronize IRP_MJ_CREATE requests for a couple of reasons. From a minifilter perspective, this is beneficial because it simplifies the model quite a bit. In general synchronized operations are somewhat simpler to handle in the postOp callback but synchronizing every operation will have a negative impact on the system. So FltMgr won't synchronize by default any operation except CREATE, where there is no negative impact because the IO manager synchronizes it already. While this is guaranteed by documentation, minifilters should still always return FLT_PREOP_SYNCHRONIZE instead of FLT_PREOP_SUCCESS_WITH_CALLBACK for IRP_MJ_CREATE just so this behavior is made obvious.
  4. This brings us to the most important point. FltMgr documentation mentions in a bunch of different places that the postCreate callback will be called in the same context as the preCreate callback. In some cases I've seen this statement interpreted as "FltMgr guarantees that the postCreate will be called in the same thread where the user request was issued". However, this is not the case. FltMgr makes no guarantees about what thread the preCreate callback will be called on, just that it will call postCreate on the same thread. What can happen is that a filter (legacy or minifilter) can return STATUS_PENDING for an IRP_MJ_CREATE and then continue the request on a different thread, in a different process altogether. This is a legal option and what happens is that the filter below the filter that returned pending will have its preCreate callback called on the new thread, in the new process context. This is a brief example of what happens in this case (let's say the FS will return STATUS_REPARSE):
    1. The IO manager receives the CREATE request on Thread1 and issues an IRP_MJ_CREATE on the same thread.
    2. FilterA (let's say it's a legacy filter) sees the IRP_MJ_CREATE request on Thread1, pends it and then sends it down on a different thread, Thread2.
    3. MinifilterB (below FilterA) sees the IRP_MJ_CREATE request (i.e. minifilter B's preCreate callback is called) on Thread2, where it queues the request and returns FLT_PREOP_PENDING.
    4. MinifilterB then dequeues the request on a different thread (Thread3) and it sends it down (calls FltCompletePendedPreOperation with FLT_PREOP_SYNCHRONIZE for example)
    5. The FS receives the IRP_MJ_CREATE on Thread3, processes it, discovers it is a reparse point and so it returns STATUS_REPARSE.
    6. FltMgr's completion routine gets called on Thread3 and since FltMgr knows the operation is synchronized, it simply signals the event that FltMgr is waiting on in Thread2 (Event2).
    7. FltMgr resumes the operation on Thread2 where it was waiting for the event and calls the postCreate callback for minifilterB.
    8. Minifilter B does whatever processing it does for STATUS_REPARSE and returns FLT_POSTOP_FINISHED_PROCESSING.
    9. FltMgr completes the request (we're still on Thread2).
    10. FilterA's IoCompletion routine gets called on Thread2 and FilterA performs whatever processing it needs before completing the IRP.
    11. The IO manager's IoCompletion routine gets called (still on Thread2), but the IO manager is synchronizing the operation so it signals the event it is waiting on in Thread1 (Event1).
    12. IO manager's wait on Thread1 returns so the IO manager can inspect the result of the call. Since the FS returned STATUS_REPARSE, it might return back to OB and restart parsing from there… This in turn might come down the same path and issue a new IRP_MJ_CREATE on Thread1 and so on...
    Here is a picture of what this would look like.
As you can see, it is impossible for a filter to guarantee that its preCreate callback will be called on the thread of the original request. So what can a file system filter (or a file system) do? Well, there are largely three reasons why a file system (or filter) might care about the context of a certain operation:
  • The operation refers to some buffer and the VA is only valid in the process context of the originator.
  • The operation refers to some other variable that is process specific (for example, a handle), like IRP_MJ_SET_INFORMATION with FileRenameInformation or FileLinkInformation, where the parameters contain a handle.
  • The operation needs to evaluate security so it needs to know who is the requestor for the operation.
IRP_MJ_CREATE doesn't care about user buffers or other process dependent variables (they are all captured before getting to the IO manager) so file systems and filters don't need to worry about that. Security, on the other hand, is a really big part of IRP_MJ_CREATE processing so filters often need to know who is requesting the operation. Fortunately, as I mentioned in the previous post in this series, the security context is captured in nt!ObOpenObjectByName and sent in the IRP parameters (Parameters.Create.SecurityContext), so the file system and the filters can simply use the context there to decide who is requesting the operation.
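To make the point about the captured security context concrete, here is a minimal user-mode sketch in C. It is only a toy model and not kernel code: `g_current_thread_user`, `struct create_request`, `issue_create`, `switch_thread` and the two `requestor_*` helpers are all hypothetical names made up for illustration. The `captured_user` field stands in for what Parameters.Create.SecurityContext carries in the real IRP, and "switching threads" is modelled by simply changing a global.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for "the identity the current thread is running as". */
static const char *g_current_thread_user = "nobody";

/* Toy create request. captured_user plays the role of the security
 * context that is captured once, when the request is issued. */
struct create_request {
    const char *file_name;
    const char *captured_user;
};

/* Issue a create: the requestor's identity is captured here, up front. */
struct create_request issue_create(const char *file_name)
{
    struct create_request r;
    r.file_name = file_name;
    r.captured_user = g_current_thread_user;
    return r;
}

/* Model a filter pending the request and continuing it elsewhere. */
void switch_thread(const char *user)
{
    g_current_thread_user = user;
}

/* Fragile: asks "who am I running as right now?". */
const char *requestor_from_current_thread(const struct create_request *r)
{
    assert(r != NULL);
    return g_current_thread_user;
}

/* Robust: uses what was captured with the request. */
const char *requestor_from_request(const struct create_request *r)
{
    assert(r != NULL);
    return r->captured_user;
}
```

If the request is issued as one user and then continued on a "thread" that runs as a different one, `requestor_from_request` still reports the original requestor while `requestor_from_current_thread` reports whoever happens to be running the worker thread.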
In conclusion, the fact that a filter can't guarantee that it will be called in the context of the thread where the original request was issued doesn't matter much.

Thursday, December 23, 2010

About IRP_MJ_CREATE and minifilter design considerations - Part II

Since we've discussed the concepts last time we can finally start looking at the debugger. Because we're mostly interested in the create operation from a filter perspective, I put a breakpoint on fltmgr!FltpCreate so that we can see exactly what the stack looks like when the request reaches a filter. Let's say we're trying to open the file "C:\Foo\Bar.txt". Here is what the stack looks like.

00 9b5c5a70 828484bc fltmgr!FltpCreate
01 9b5c5a88 82a4c6ad nt!IofCallDriver+0x63
02 9b5c5b60 82a2d26b nt!IopParseDevice+0xed7
03 9b5c5bdc 82a532d9 nt!ObpLookupObjectName+0x4fa
04 9b5c5c38 82a4b62b nt!ObOpenObjectByName+0x165
05 9b5c5cb4 82a56f42 nt!IopCreateFile+0x673
06 9b5c5d00 8284f44a nt!NtCreateFile+0x34

In order to discuss the flow of the IO through the OS we're going to look at what each of these functions does.
nt!NtCreateFile
This is how the OS receives a request to open a file or a device (at this level there is no distinction between the two yet). NtCreateFile doesn't really do much, it's just a wrapper over an internal OS function (IopCreateFile). The file name here is something like "\??\C:\Foo\Bar.txt".
nt!IopCreateFile
This is the function to open a device (or a file) at the IO manager level. This is an internal function where most requests to open a file or a device end up (NtOpenFile, IoCreateFile and friends and so on). This is what happens here:
  1. The parameters for the operation are validated and checked to see if they make sense. Here is where STATUS_INVALID_PARAMETER is returned if you do something like ask for DELETE_ON_CLOSE but not ask for DELETE access… There are a lot of checks to validate the parameters, but no actual security or sharing checks.
  2. A very important structure is allocated, the OPEN_PACKET. This is an internal structure to the IO manager and it is the context that the IO manager has for this create. The create parameters are copied in initially. This is a structure that's available in the debugger:
    1: kd> dt nt!_OPEN_PACKET
        +0x000 Type             : Int2B
        +0x002 Size             : Int2B
        +0x004 FileObject       : Ptr32 _FILE_OBJECT
        +0x008 FinalStatus      : Int4B
        +0x00c Information      : Uint4B
        +0x010 ParseCheck       : Uint4B
        +0x014 RelatedFileObject : Ptr32 _FILE_OBJECT
        +0x018 OriginalAttributes : Ptr32 _OBJECT_ATTRIBUTES
        +0x020 AllocationSize   : _LARGE_INTEGER
        +0x028 CreateOptions    : Uint4B
        +0x02c FileAttributes   : Uint2B
        +0x02e ShareAccess      : Uint2B
        +0x030 EaBuffer         : Ptr32 Void
        +0x034 EaLength         : Uint4B
        +0x038 Options          : Uint4B
        +0x03c Disposition      : Uint4B
        +0x040 BasicInformation : Ptr32 _FILE_BASIC_INFORMATION
        +0x044 NetworkInformation : Ptr32 _FILE_NETWORK_OPEN_INFORMATION
        +0x048 CreateFileType   : _CREATE_FILE_TYPE
        +0x04c MailslotOrPipeParameters : Ptr32 Void
        +0x050 Override         : UChar
        +0x051 QueryOnly        : UChar
        +0x052 DeleteOnly       : UChar
        +0x053 FullAttributes   : UChar
        +0x054 LocalFileObject  : Ptr32 _DUMMY_FILE_OBJECT
        +0x058 InternalFlags    : Uint4B
        +0x05c DriverCreateContext : _IO_DRIVER_CREATE_CONTEXT
     
    This structure is pretty important to the flow of the IO operation but there is no way to access it as a developer so it's going to be just an important concept to remember later on.
  3. Finally, since we've copied all internal parameters and all the IO manager has at this point is an OB manager path (in the ObjectAttributes parameter to the call), it must call the OB manager to open the device (ObOpenObjectByName, see below).
  4. After ObOpenObjectByName returns this function cleans up and returns.
nt!ObOpenObjectByName
This is the call to have the OB manager create a handle for an object when we know the name. This isn't a public interface since 3rd party drivers only need to open objects that have their own create or open APIs (for example ZwCreateFile, ZwOpenKey, ZwOpenSection, ZwCreateSection, ZwOpenProcess and so on). Another thing to note about the OB APIs is that they fall largely into two classes:
  1. Functions that reference objects (that just operate on the reference count of objects), like ObReferenceObject, ObReferenceObjectByName and ObReferenceObjectByPointer.
  2. Functions that create handles to objects in addition to referencing them (which is called an "open"), like ObOpenObjectByName and ObOpenObjectByPointer.
Anyway, this is roughly what goes on in here:
  1. Capture the security context for this open, so that whoever needs to open the actual object can perform access checks. This also means that the file system itself doesn't rely on the thread context being the same and instead uses the context captured here. So minifilters should do the same when they care about the security context of a create.
  2. Call the actual function that looks up the path in the namespace (ObpLookupObjectName, see below)
  3. If ObpLookupObjectName was able to find an object then a handle is created for that object (since this was an open type function).
nt!ObpLookupObjectName
This is the function where the OB manager actually looks in the namespace for the path it needs to open (which at this point is still "\??\C:\Foo\Bar.txt"). One thing to note is that the OB namespace has a hierarchical structure, with DIRECTORY_OBJECT types of objects that hold other objects. The root of the namespace ("\") is such a DIRECTORY_OBJECT.
Anyway, this is what happens in this function. The parsing starts at the root of the namespace, "\". This is a loop that runs until we find the final object to return to the user or find that there is no object by that name (and therefore fail the request):
  1. If the current directory is the root directory then check if the name starts with "\??\" and make it point to the \GLOBAL?? directory. This is a hardcoded hack in IO manager (which is why calling "!object \" in WinDbg doesn't show a "??" folder). (so our name becomes "\GLOBAL??\C:\Foo\Bar.txt")
  2. Find the first component in the path (which is GLOBAL??) in the current directory.
  3. If the component found is a DIRECTORY_OBJECT, open it and continue parsing from that point using the rest of the name (in our case, "C:\Foo\Bar.txt" is the remaining name). Continue the loop with remaining path.
  4. If the object has a parse procedure, call that parse procedure and give it the rest of the path. If the parse procedure returns STATUS_REPARSE (and it hasn't reparsed too many times already), start again at the root of the namespace with the new name returned by the parse procedure. Otherwise the parse procedure should either return STATUS_SUCCESS and return an object or a failure status.
Some notable things are:
  • OB will do a case sensitive or a case insensitive search of the OB namespace, depending on the OBJ_CASE_INSENSITIVE flag that is passed into the OBJECT_ATTRIBUTES, which is why it's important to set this correctly when calling FltCreateFile in a filter (specifically from a NormalizeNameComponent callback), since if it's not correctly set the request might not make it down the IO stack at all.
  • The OB namespace uses symlinks quite a lot. OB symlinks are a special type of object that has a string member that points to a different point in the namespace, and a parse procedure:
    0: kd> dt _OBJECT_SYMBOLIC_LINK
     nt!_OBJECT_SYMBOLIC_LINK
        +0x000 CreationTime     : _LARGE_INTEGER
        +0x008 LinkTarget       : _UNICODE_STRING
        +0x010 DosDeviceDriveIndex : Uint4B
     
    So in our example, when OB gets to "\GLOBAL??\C:" it discovers it is a symlink and it calls the parse procedure with the rest of the remaining name ("\Foo\Bar.txt"). The symlink for "\GLOBAL??\C:" points to "\Device\HarddiskVolume2" and the symlink's parse procedure concatenates that name with the remaining path that it got, and so the new name after the symlink is "\Device\HarddiskVolume2\Foo\Bar.txt". See this:
    0: kd> !object \GLOBAL??\C:
     Object: 96f7f188  Type: (922b7f78) SymbolicLink
         ObjectHeader: 96f7f170 (new version)
         HandleCount: 0  PointerCount: 1
         Directory Object: 96e08f38  Name: C:
         Target String is '\Device\HarddiskVolume2'
         Drive Letter Index is 3 (C:)
     
    The parse procedure of a symlink always returns STATUS_REPARSE.
  • Once we get to the "\Device\HarddiskVolume2\Foo\Bar.txt" path, while parsing OB will find that "\Device\HarddiskVolume2" is a DEVICE_OBJECT type of object and that it has a parse procedure. The parse procedure for a DEVICE_OBJECT is IopParseDevice, so that function gets called.
  • Another thing to note is that there is a limit to the number of times OB will reparse, and each time it sees a STATUS_REPARSE counts against that limit (so it doesn't matter whether it was a reparse from a symlink or a DEVICE_OBJECT, everything counts). So it is possible to reparse to the point where OB won't reparse anymore.
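The lookup loop described above can be sketched as a small user-mode C model. This is a deliberate simplification and not how OB is implemented: it matches whole object names by prefix instead of walking DIRECTORY_OBJECTs one component at a time, the namespace is a hardcoded table, `MAX_REPARSES` is a made-up value (the real limit is internal to OB), and the self-referential "Loop:" symlink exists only to exercise that limit. All function and type names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_REPARSES 32   /* hypothetical limit; the real value is internal to OB */
#define MAX_NAME 260      /* arbitrary buffer size for this toy model */

enum lookup_status { LOOKUP_FOUND_DEVICE, LOOKUP_NOT_FOUND, LOOKUP_TOO_MANY_REPARSES };

struct symlink { const char *name; const char *target; };

/* Toy namespace: two symlinks (one of them a deliberate loop) and one device. */
static const struct symlink g_symlinks[] = {
    { "\\GLOBAL??\\C:",    "\\Device\\HarddiskVolume2" },
    { "\\GLOBAL??\\Loop:", "\\GLOBAL??\\Loop:" },
};
static const char *const g_devices[] = { "\\Device\\HarddiskVolume2" };

/* Nonzero if path names `object` itself or something underneath it. */
static int starts_with_object(const char *path, const char *object)
{
    size_t n = strlen(object);
    return strncmp(path, object, n) == 0 && (path[n] == '\0' || path[n] == '\\');
}

/* Resolve `name` to a device plus the remaining path handed to its parse procedure. */
enum lookup_status ob_lookup(const char *name,
                             char *device, size_t device_len,
                             char *remaining, size_t remaining_len)
{
    char current[MAX_NAME], next[MAX_NAME];
    int reparses = 0;
    size_t i;

    assert(name != NULL && device != NULL && remaining != NULL);
    if (strlen(name) >= sizeof(current))
        return LOOKUP_NOT_FOUND;
    strcpy(current, name);

    for (;;) {                       /* each iteration restarts at the root */
        int reparsed = 0;

        /* "\??\" at the root is redirected to "\GLOBAL??\". */
        if (strncmp(current, "\\??\\", 4) == 0) {
            snprintf(next, sizeof(next), "\\GLOBAL??\\%s", current + 4);
            strcpy(current, next);
        }

        /* A symlink's parse procedure always asks for a reparse:
         * new name = link target + remaining path. */
        for (i = 0; i < sizeof(g_symlinks) / sizeof(g_symlinks[0]); i++) {
            if (starts_with_object(current, g_symlinks[i].name)) {
                if (++reparses > MAX_REPARSES)
                    return LOOKUP_TOO_MANY_REPARSES;
                snprintf(next, sizeof(next), "%s%s", g_symlinks[i].target,
                         current + strlen(g_symlinks[i].name));
                strcpy(current, next);
                reparsed = 1;
                break;
            }
        }
        if (reparsed)
            continue;

        /* A device has a parse procedure that receives the rest of the name. */
        for (i = 0; i < sizeof(g_devices) / sizeof(g_devices[0]); i++) {
            if (starts_with_object(current, g_devices[i])) {
                snprintf(device, device_len, "%s", g_devices[i]);
                snprintf(remaining, remaining_len, "%s",
                         current + strlen(g_devices[i]));
                return LOOKUP_FOUND_DEVICE;
            }
        }
        return LOOKUP_NOT_FOUND;
    }
}
```

With this model, looking up "\??\C:\Foo\Bar.txt" goes through one reparse (the "C:" symlink) and ends at the device with "\Foo\Bar.txt" as the remaining name, which is the point where the real OB manager would call IopParseDevice.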
nt!IopParseDevice
The name here is just "\Foo\Bar.txt" and the parse procedure gets a reference to the device where the path should be searched. This is where the difference between a file and a device becomes relevant. If there is no remaining path, this is treated as an open to the device. If there is a path, then this is assumed to be a file (or directory) open. This is a pretty involved function with many special cases. However, there are only a couple of steps that we're going to talk about:
  1. Get the context for this create, which is the OPEN_PACKET structure from before. This works because the OPEN_PACKET is IO manager's structure passed from IopCreateFile to IopParseDevice. This is important because this is a nice way to have context across calls through other subsystems (OB manager) and still keep context that is opaque to those subsystems. This isn't always the case unfortunately and whenever two subsystems share the same structure the architecture gets complicated.
  2. Check to see if a file system is mounted on this device and if not then mount it.
  3. Process the device hint if there was any.
  4. Allocate the IRP_MJ_CREATE IRP.
  5. Allocate the FILE_OBJECT that will represent the open file.
  6. Call the FastIoQueryOpen function (which minifilters see as the IRP_MJ_NETWORK_QUERY_OPEN). The IRP parameter to this call is the IRP that was just allocated.
  7. If the FastIoQueryOpen didn't work, send the full Irp to the file system stack by calling IoCallDriver.
  8. Wait for IRP to complete (i.e. the IRP is synchronized by the IO manager).
  9. If the request was a STATUS_REPARSE, then first check if it is a directory junction or a symlink and do some additional processing for those. Anyway, copy the new name to open from the FILE_OBJECT (the actual name to open is passed in and out of this function through a parameter).
  10. If the status from the IRP was not a success status or it was a STATUS_REPARSE, clean up the FILE_OBJECT and release the references associated with it. The IRP is always released anyway.
  11. Return the status. If this was successful, the FILE_OBJECT will be the one used to represent the file.
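The first step above (recovering the OPEN_PACKET after the request has travelled through the OB manager) follows a common pattern: one subsystem hands an opaque context to another, which passes it back untouched through a callback. Here is a minimal user-mode C sketch of that pattern. The struct layout, the `OPEN_PACKET_TYPE` tag value and every function name are hypothetical; the check of the type and size fields is an assumption about how a callee might validate such a context, not a description of the actual IopParseDevice code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OPEN_PACKET_TYPE 0x0008   /* hypothetical type tag */

/* Toy version of the IO manager's private create context. */
struct open_packet {
    short    type;
    short    size;
    unsigned create_options;
    int      final_status;
    int      is_device_open;      /* set by the parse procedure */
};

typedef int (*parse_procedure)(void *context, const char *remaining_name);

/* "OB manager": forwards the context without knowing what it is. */
int ob_open_by_name(const char *remaining_name, parse_procedure parse, void *context)
{
    assert(parse != NULL);
    return parse(context, remaining_name);
}

/* "IO manager" parse procedure: recovers its own context and validates it. */
int io_parse_device(void *context, const char *remaining_name)
{
    struct open_packet *op = context;

    if (op == NULL || op->type != OPEN_PACKET_TYPE ||
        op->size != (short)sizeof(*op))
        return -1;                /* not a context we created */

    /* No remaining path means the device itself is being opened;
     * otherwise this is a file (or directory) open. */
    op->is_device_open = (remaining_name[0] == '\0');
    op->final_status = 0;
    return 0;
}

/* "IO manager" entry point: builds the context and calls into OB. */
int io_create_file(const char *remaining_name, unsigned create_options,
                   int *is_device_open)
{
    struct open_packet op;
    int status;

    memset(&op, 0, sizeof(op));
    op.type = OPEN_PACKET_TYPE;
    op.size = (short)sizeof(op);
    op.create_options = create_options;

    status = ob_open_by_name(remaining_name, io_parse_device, &op);
    if (status == 0)
        *is_device_open = op.is_device_open;
    return status;
}
```

The middle layer (`ob_open_by_name`) never looks inside the context, which is what keeps the two subsystems decoupled.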

This is a pretty high level view of the process but it should explain why some of the things we're going to talk about in future posts work the way they do.

Thursday, December 16, 2010

About IRP_MJ_CREATE and minifilter design considerations - Part I

This is the first in a series of posts where I'll try to address various common questions about IRP_MJ_CREATE. My plan is to address the following topics:

  • What exactly is it that IRP_MJ_CREATE creates? (a bit of rambling on one of my favorite topics, operating systems design)
  • Why is there no IRP_MJ_OPEN? Surely MS could afford one more IRP :)...
  • Flow of a file open request through the OS.
  • What is the difference between a stream and a file from an FS perspective?
  • What does STATUS_REPARSE do?
  • What is name tunneling? How does it affect creates?
  • How to open the same stream as an existing FILE_OBJECT in a name-safe way.
  • What are stream file objects and why are they necessary?
  • Various strategies to redirect a file open to a different file.
  • How to track a create when reparsing?

In order to address this properly, I'd like to explain some things about operating systems. This is a rather dry topic but in my opinion the things I'm going to talk about are fundamental for understanding not only how IRP_MJ_CREATE works, but also why it works the way it does.

There are many ways to define an operating system but for this topic I think that a very useful way to describe it is as a hardware abstraction layer. It is a library of functions combined with a machine abstraction. As such, OS code is pretty much dedicated to either "abstract stuff that people use a lot" (allocate memory, create a window, draw strings and so on) or "hardware interaction code" (talk to the disk, talk to the memory controller hardware, talk to the graphics hardware). As such it should come as no surprise that the kernel part of OS is designed around interaction with hardware (as opposed to the user mode part which in general implements more abstract services).

File systems (and the whole file system stack including legacy filters and minifilters) are "higher level drivers" (since they don't usually talk to hardware directly). However, they must fit into the OS model which is built around hardware. This is why file systems still create device objects and why the name returned by FltGetFileNameInformation starts with "\Device\....".

One other very important concept that plays into why IRP_MJ_CREATE functions the way it does is that the OS itself is implemented as a set of "services". Each service has its own protocol, usually described by an API set (the memory manager has its own command set, the object manager has its own set and so does the IO manager). Most (if not all) of these protocols are stateful. The caller issues an "initialize" command (ExAllocatePool, ZwCreateFile, FltRegisterFilter) and they get back a more or less opaque handle (for ExAllocatePool, the pointer serves as the handle; ZwCreateFile -> an actual handle; FltRegisterFilter -> a PFLT_FILTER pointer and so on) and they can then issue additional commands that require that handle to be passed in (ExFreePool, ZwReadFile, FltStartFiltering). For stateful protocols the service (or server) has a blob of data that describes the internal state of each object and based on that data it knows how to satisfy each request. The opaque handle is a key that helps the service find that data. For example, for ExAllocatePool the internal data blob is the nt!_POOL_HEADER, for ZwCreateFile the context is pretty much a set of granted access rights for that handle and a reference to the FILE_OBJECT, and for FltStartFiltering it is the FLT_FILTER structure. From this point on I'll call that blob of data a context (as in MM's context, IO manager's context, FltMgr's filter context). For services that already provide support for caller defined contexts (like FltMgr) I'll use the terms "internal context" and "user's context" to differentiate the two. The conclusion here is that any stateful protocol must have some context on the service (or server) side that the service can use to keep track of the state of communication with the client.
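As a rough illustration of the "opaque handle as a key to the service's context" idea, here is a tiny user-mode C sketch of a stateful service. It does not correspond to any real OS API: `svc_create`, `svc_read`, `svc_close`, the table size and the `ACCESS_READ` bit are all invented for this example. The handle is just an index, and the context holds the granted access and a bit of per-object state.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS    8
#define INVALID_HANDLE (-1)
#define ACCESS_READ    0x1

/* The service's private context for each open object. */
struct context {
    int in_use;
    int granted_access;
    int position;
};

static struct context g_table[MAX_OBJECTS];

/* The handle is only a key; this is how the service finds its context. */
static struct context *svc_lookup(int handle)
{
    if (handle < 0 || handle >= MAX_OBJECTS || !g_table[handle].in_use)
        return NULL;
    return &g_table[handle];
}

/* "Initialize" command: set up the context and return a handle to it. */
int svc_create(int desired_access)
{
    int h;
    for (h = 0; h < MAX_OBJECTS; h++) {
        if (!g_table[h].in_use) {
            g_table[h].in_use = 1;
            g_table[h].granted_access = desired_access;
            g_table[h].position = 0;
            return h;
        }
    }
    return INVALID_HANDLE;
}

/* Follow-up command: only works with a valid handle and the right access.
 * Returns the new position, or -1 on failure. */
int svc_read(int handle, int count)
{
    struct context *ctx = svc_lookup(handle);
    if (ctx == NULL || !(ctx->granted_access & ACCESS_READ))
        return -1;
    ctx->position += count;
    return ctx->position;
}

/* Tear down the context; the handle is meaningless afterwards. */
int svc_close(int handle)
{
    struct context *ctx = svc_lookup(handle);
    if (ctx == NULL)
        return -1;
    ctx->in_use = 0;
    return 0;
}
```

Nothing can be done with the service before `svc_create` returns, because the caller has no handle yet, and nothing can be done after `svc_close`, because the context the handle referred to is gone.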

The important thing I wanted to get to is that sometimes some operations require multiple OS components to work together to satisfy a user request and as such each component might need to create its own contexts. For example, for a ZwCreateFile call some of the following contexts might need to be created: a handle, a FILE_OBJECT, a FltMgr internal context, some minifilter contexts, one or more file system contexts and a couple of MM contexts (where all the other contexts will be stored).

So with all these things in place, we can start talking about IRP_MJ_CREATE. As I said above, the OS has an abstract interface which consists mainly of OBJECTs for various things. When someone needs to talk to a device (physical or a virtual device, like a file system; anything that can be represented internally by a DEVICE_OBJECT), the OS context is a FILE_OBJECT. So in other terms, the FILE_OBJECT simply represents the state associated with the OS communicating to a DEVICE_OBJECT. The "create" word in ZwCreateFile and IRP_MJ_CREATE simply refers to FILE_OBJECT itself. There is no IRP_MJ_OPEN because there is no way to open an existing FILE_OBJECT. In order to get a FILE_OBJECT one must either create it or already have a reference to it (pointer or handle) and must call either ObReferenceObject or ObReferenceObjectByHandle to get another reference to that FILE_OBJECT.

The next topic, which is the flow of a create operation through the OS, is pretty long so I'll save it for next week. In the meantime please feel free to let me know what other topics related to the IRP_MJ_CREATE path you'd like me to address.

Thursday, December 9, 2010

More on IRPs and IRP_CTRLs

Sometimes I see posts on discussion lists about how a callback is not being called for some operation that a minifilter registered for. In most (possibly all) cases it turns out that that's not what the problem is and that the callback is in fact called, it's just that the poster can't tell it happened. It's happened to me a couple of times, but since I have a lot of confidence in FltMgr (having worked on it and all) I start off with the assumption that it must be something I'm doing wrong.


However, I've been wondering why people seem so keen on assuming that they don't get to see the callback for minifilters. And then I realized that it might have something to do with the fact that minifilters use a callback model whereas the NT IO model is call-through. I'll talk a bit about the call-through model and the limitations it has. I'll start with a brief refresher on the NT IO model and then explain the limitations and how the minifilter model tries to address them. Then I'll explain some of the downsides and how to work around them.


When an IO request (open a file, read or write and so on) reaches the IO manager, the information about the request is put in an IO request packet (IRP). Then the IO manager calls the driver that should process that IRP by calling IoCallDriver. There may be multiple drivers needed in order to complete a single operation. For example, when the user opens a remote file the IO request goes to a file system, which then needs to send something to the network, so there are at least two drivers involved. One could design the OS so that the drivers go back to the IO manager and let it dispatch the request to the appropriate driver again, or let the two drivers communicate directly. NT was designed to let the drivers communicate directly. Moreover, in many cases one request may pass through many drivers that make up an IO stack (like the file system stack or the storage stack or the network stack), where each driver performs a specific role. So the IRP is potentially modified by each driver and sent to the next driver by calling IoCallDriver.


This is a call-through model. In the debugger it can sometimes look like this (please note that the IRP model allows the request to be completely decoupled from the thread but in practice you still see a lot of cases where a lot of drivers simply call the next driver in the same thread):


1: kd> kn
 # ChildEBP RetAddr  
00 a204bb10 828734bc volmgr!VmReadWrite
01 a204bb28 963bc475 nt!IofCallDriver+0x63
02 a204bb34 963bc548 fvevol!FveRequestPassThrough+0x31
03 a204bb50 963bc759 fvevol!FveReadWrite+0x4e
04 a204bb80 963bc7a9 fvevol!FveFilterRundownReadWrite+0x197
05 a204bb90 828734bc fvevol!FveFilterRundownWrite+0x33
06 a204bba8 9639a76e nt!IofCallDriver+0x63
07 a204bc88 9639a8a5 rdyboost!SmdProcessReadWrite+0xa14
08 a204bca8 828734bc rdyboost!SmdDispatchReadWrite+0xcb
09 a204bcc0 965a0fd9 nt!IofCallDriver+0x63
0a a204bce8 965a12fd volsnap!VolsnapWriteFilter+0x265
0b a204bcf8 828734bc volsnap!VolSnapWrite+0x21
0c a204bd10 960b091c nt!IofCallDriver+0x63
0d a204bd1c 828a711e Ntfs!NtfsStorageDriverCallout+0x14
0e a204bd1c 828a7215 nt!KiSwapKernelStackAndExit+0x15a
0f 981c964c 828c711d nt!KiSwitchKernelStackAndCallout+0x31
10 981c96c0 960af939 nt!KeExpandKernelStackAndCalloutEx+0x29d
11 981c96ec 960b05a6 Ntfs!NtfsCallStorageDriver+0x2d
12 981c9730 960af0a0 Ntfs!NtfsMultipleAsync+0x4d
13 981c9860 960ae0a6 Ntfs!NtfsNonCachedIo+0x413
14 981c9978 960af85f Ntfs!NtfsCommonWrite+0x1ebd
15 981c99f0 828734bc Ntfs!NtfsFsdWrite+0x2e1
16 981c9a08 9605f20c nt!IofCallDriver+0x63
17 981c9a2c 9605f3cb fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x2aa
18 981c9a64 828734bc fltmgr!FltpDispatch+0xc5
19 981c9a7c 82a74f6e nt!IofCallDriver+0x63
1a 981c9a9c 82a75822 nt!IopSynchronousServiceTail+0x1f8
1b 981c9b38 8287a44a nt!NtWriteFile+0x6e8
1c 981c9b38 828798b5 nt!KiFastCallEntry+0x12a
1d 981c9bd4 82a266a8 nt!ZwWriteFile+0x11

So here we can see how a write (ZwWriteFile) goes through FltMgr, NTFS, volsnap, rdyboost, fvevol and volmgr (where I set my breakpoint for this blog post).


One big problem with this approach is that the size of the kernel stack in NT is pretty small (depends on the architecture and so on but it's something like 12K or 20K..) and so if there are enough drivers, each of them using some stack space then it is possible to run out of stack. This in fact happens in some cases (AV filters were notorious for using a lot of stack) and the outcome is a bugcheck. Please note that in the example above, most filters were just letting the request pass through them, without necessarily doing anything to it. So they still use stack space even if they don't care about the operation at all…


Another problem with this approach is that it is almost impossible to unload a driver because very often each driver remembers which driver they need to send the IRP to next, so they are either referencing it (so it will never go away) or just using it without referencing it and so immediately after it goes away there is a bugcheck.


FltMgr was designed primarily to increase system reliability (yeah, making file system filter development easier was just a secondary objective) and it tried to address these issues by making the minifilter model a callback model. This addresses both problems. Unloading a minifilter works because now each filter doesn't need to know which is the next filter to call and so the only component that must reference a minifilter is FltMgr, which then allows a minifilter to go away by informing only FltMgr about it.


The way this takes care of stack usage is a bit more interesting. When the minifilter callback is done it returns to FltMgr a status that tells FltMgr whether the minifilter wants to be notified when the request completes (or one of a couple of other statuses), and that's it. The stack space associated with the call to the minifilter's callback (the stack frame) is released and can be reused. This is why in the stack above, the IRP simply goes from IO manager to FltMgr and then to the file system. It doesn't matter how many minifilters were attached to the volume, they all use no stack space at all at this time.
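The difference in stack usage between the two models can be illustrated with a small user-mode C sketch. This is a toy model that counts nested calls rather than real stack bytes, and nothing in it is FltMgr or IO manager code: `NUM_FILTERS` and all function names are made up for the example. In the call-through version every filter stays on the stack while the request travels down; in the callback version each callback returns to the manager before the next one is called.

```c
#include <assert.h>

#define NUM_FILTERS 5   /* arbitrary number of filters for this model */

static int g_depth;      /* current nesting depth (frames "on the stack") */
static int g_max_depth;  /* deepest nesting seen during one request */

static void enter(void) { if (++g_depth > g_max_depth) g_max_depth = g_depth; }
static void leave(void) { --g_depth; assert(g_depth >= 0); }

/* The bottom of the stack in both models. */
static void file_system(void) { enter(); leave(); }

/* Call-through: each filter calls the next one and stays on the stack
 * until the whole chain below it has returned. */
static void legacy_filter(int index)
{
    enter();
    if (index + 1 < NUM_FILTERS)
        legacy_filter(index + 1);
    else
        file_system();
    leave();
}

int call_through_max_depth(void)
{
    g_depth = g_max_depth = 0;
    legacy_filter(0);
    return g_max_depth;
}

/* Callback model: the callback does its work and returns to the manager. */
static void minifilter_pre_callback(void) { enter(); leave(); }

/* The manager calls every callback in turn, then calls the file system. */
static void filter_manager(void)
{
    int i;
    enter();
    for (i = 0; i < NUM_FILTERS; i++)
        minifilter_pre_callback();
    file_system();
    leave();
}

int callback_max_depth(void)
{
    g_depth = g_max_depth = 0;
    filter_manager();
    return g_max_depth;
}
```

In this model the call-through depth grows with the number of filters (one frame per filter plus the file system), while the callback depth stays at two (the manager plus whichever single callee is currently running) no matter how many filters are registered.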


Now, let's look in more detail at filter manager's stack frame. There are no minifilter functions on the stack because they all returned nicely to FltMgr and no longer use any stack space. This is the most confusing thing about this model: the minifilters cannot be seen on the stack, so it looks like they have never been called at all… However, now that we know that FltMgr must have called some minifilters, is there a way to see which minifilters were called and so on? In a previous post I explained that FltMgr has an internal structure that wraps the IRP called the IRP_CTRL (also known as a CALLBACK_DATA), and all the information about the request is stored in there. FltMgr clearly must remember the IRP_CTRL associated with this IRP someplace, but where?


1: kd> kbn
 # ChildEBP RetAddr  Args to Child              
...
16 981c9a08 9605f20c 93460958 94301bf8 00000000 nt!IofCallDriver+0x63
17 981c9a2c 9605f3cb 981c9a4c 93460958 00000000 fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x2aa
18 981c9a64 828734bc 93460958 94301bf8 94301bf8 fltmgr!FltpDispatch+0xc5
19 981c9a7c 82a74f6e 93715f80 94301bf8 94301dac nt!IofCallDriver+0x63
...


Well, it turns out that there is another very useful structure called the IRP_CALL_CTRL, which is a structure that associates an IRP and an IRP_CTRL and other context that FltMgr keeps for the operation:


1: kd> dt 981c9a4c fltmgr!_IRP_CALL_CTRL
   +0x000 Volume           : 0x932f1008 _FLT_VOLUME
   +0x004 Irp              : 0x94301bf8 _IRP
   +0x008 IrpCtrl          : 0x93591de0 _IRP_CTRL
   +0x00c StartingCallbackNode : 0xffffffff _CALLBACK_NODE
   +0x010 OperationStatusCallbackListHead : _SINGLE_LIST_ENTRY
   +0x014 Flags            : 0x204 (No matching name)

From here we can see the IRP_CTRL pointer and call my favorite extension, !fltkd (I get a complaint on my current symbols about how the PVOID type is not defined, which I've edited out):



1: kd> !fltkd.irpctrl 0x93591de0

IRP_CTRL: 93591de0  WRITE (4) [00000001] Irp
Flags                    : [10000004] DontCopyParms FixedAlloc
Irp                      : 94301bf8 
DeviceObject             : 93460958 "\Device\HarddiskVolume2"
FileObject               : 93715f80 
CompletionNodeStack      : 93591e98   Size=5  Next=1
SyncEvent                : (93591df0)
InitiatingInstance       : 00000000 
Icc                      : 981c9a4c 
PendingCallbackNode      : ffffffff 
PendingCallbackContext   : 00000000 
PendingStatus            : 0x00000000 
CallbackData             : (93591e40)
 Flags                    : [00000001] Irp
 Thread                   : 93006020 
 Iopb                     : 93591e6c 
 RequestorMode            : [00] KernelMode
 IoStatus.Status          : 0x00000000 
 IoStatus.Information     : 00000000 
 TagData                  : 00000000 
 FilterContext[0]         : 00000000 
 FilterContext[1]         : 00000000 
 FilterContext[2]         : 00000000 
 FilterContext[3]         : 00000000 

   Cmd     IrpFl   OpFl  CmpFl  Instance FileObjt Completion-Context  Node Adr
--------- -------- ----- -----  -------- -------- ------------------  --------
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591fb8
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591f70
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591f28
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591ee0
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [4,0]    00060a01  00   0002   9341d918 93715f80 9608e55e-2662d614   93591e98
            ("FileInfo","FileInfo")  fileinfo!FIPostReadWriteCallback 
     Args: 00020000 00000000 003a0000 00000000 92fc6000 0000000000000000
Working IOPB:
>[4,0]    00060a01  00          9341d918 93715f80                     93591e6c
            ("FileInfo","FileInfo")  
     Args: 00020000 00000000 003a0000 00000000 92fc6000 0000000000000000

Here we can see what the minifilter stack looks like and that the FileInfo minifilter wanted a postOp callback for this operation. Another thing we can do is this (using the FLT_VOLUME pointer from the IRP_CALL_CTRL):



1: kd>  !fltkd.volume 0x932f1008

FLT_VOLUME: 932f1008 "\Device\HarddiskVolume2"
   FLT_OBJECT: 932f1008  [04000000] Volume
      RundownRef               : 0x00000074 (58)
      PointerCount             : 0x00000001 
      PrimaryLink              : [9334f404-932ad9b4] 
   Frame                    : 930adcc0 "Frame 0" 
   Flags                    : [00000064] SetupNotifyCalled EnableNameCaching FilterAttached
   FileSystemType           : [00000002] FLT_FSTYPE_NTFS
   VolumeLink               : [9334f404-932ad9b4] 
   DeviceObject             : 93460958 
   DiskDeviceObject         : 932b2320 
   FrameZeroVolume          : 932f1008 
   VolumeInNextFrame        : 00000000 
   Guid                     : "" 
   CDODeviceName            : "\Ntfs" 
   CDODriverName            : "\FileSystem\Ntfs" 
   TargetedOpenCount        : 55 
   Callbacks                : (932f109c)
   ContextLock              : (932f12f4)
   VolumeContexts           : (932f12f8)  Count=0
   StreamListCtrls          : (932f12fc)  rCount=2630 
   FileListCtrls            : (932f1340)  rCount=0 
   NameCacheCtrl            : (932f1388)
   InstanceList             : (932f1058)
      FLT_INSTANCE: 94114498 "luafv" "135000"
      FLT_INSTANCE: 9341d918 "FileInfo" "45000"


From here we can tell that there are in fact two minifilters attached to this volume, luafv and fileinfo. We knew about fileinfo from the IRP_CTRL, but what about luafv ? Did it even get called ? Well, unfortunately the only thing we can know for sure is that luafv was registered with fltmgr and attached to this volume. It might not have a callback registered for WRITEs, or its callback was called but returned FLT_PREOP_SUCCESS_NO_CALLBACK, in which case fltmgr didn't use a completion node for it and there is no record of it… We can look at the filter and see the registered callbacks, but we might not be able to find a record of whether the callback was actually called.
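Why a filter can be called and still leave no trace is easier to see in a small user-mode model. This is only a sketch under assumptions: the types, the `PREOP_*` values, `run_pre_callbacks` and the instance names ("unregistered", "quiet", "logger") are all hypothetical and say nothing about how luafv actually behaves or how FltMgr lays out its completion nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the pre-operation return statuses. */
typedef enum { PREOP_NO_CALLBACK, PREOP_WITH_CALLBACK } PreopStatus;
typedef PreopStatus (*PreCallback)(void);

typedef struct {
    const char *name;
    PreCallback pre_write;   /* NULL: nothing registered for writes */
} Instance;

PreopStatus pre_wants_post(void) { return PREOP_WITH_CALLBACK; }
PreopStatus pre_no_post(void)    { return PREOP_NO_CALLBACK; }

/* Calls each registered pre-callback in order and records a node only for
 * instances that asked for a post-callback. Returns the number of nodes used. */
size_t run_pre_callbacks(const Instance *inst, size_t count,
                         const char **nodes, size_t max_nodes)
{
    size_t used = 0;
    assert(inst != NULL && nodes != NULL);
    for (size_t i = 0; i < count; i++) {
        if (inst[i].pre_write == NULL)
            continue;                       /* never called, no record */
        if (inst[i].pre_write() == PREOP_WITH_CALLBACK && used < max_nodes)
            nodes[used++] = inst[i].name;   /* the only trace left behind */
        /* PREOP_NO_CALLBACK: the callback ran, but nothing is recorded */
    }
    return used;
}

/* Three hypothetical instances on one volume; only one leaves a node. */
size_t demo_node_count(void)
{
    const Instance volume[] = {
        { "unregistered", NULL },
        { "quiet",        pre_no_post },
        { "logger",       pre_wants_post },
    };
    const char *nodes[3];
    return run_pre_callbacks(volume, 3, nodes, 3);
}
```

From the recorded nodes alone, "unregistered" and "quiet" look identical: neither appears, even though one of them was actually called.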

Thursday, December 2, 2010

More thoughts on FltDoCompletionProcessingWhenSafe and minifilter completion in general

I promised in the last post that I'd talk about how FltDoCompletionProcessingWhenSafe can deadlock. I've never actually seen such a deadlock so I've spent some time thinking about it and I went over various scenarios but in the end I couldn't find anything specific to FltDoCompletionProcessingWhenSafe.

However, thinking about deadlocks in the completion path there is a way a deadlock can happen anyway, so I'll write about that instead and explain how I think this works with FltDoCompletionProcessingWhenSafe :).

There are some drivers that take the approach of queuing up requests and then using one or more threads to dequeue the requests and process them. In theory this can happen anywhere, in a minifilter, in the file system and in the storage stack. In fact the ramdisk sample in the WDK is implemented using such a queue (at least, as far as I can tell, WDF is not my forte). Anyway, the point to remember is that this is a fairly common design strategy, possibly even more so with storage drivers.

This will be easier to explain with an example, so I'll describe a possible architecture for a storage driver. This driver marks all requests as pending, queues them to an internal queue, releases a semaphore (or some similar mechanism) and then returns STATUS_PENDING to the caller. The driver also has one thread that waits on the semaphore and then when it is signaled it dequeues one request and processes it synchronously (it waits for it to complete), after which it calls IoCompleteRequest and goes back to waiting. Pretty simple, right ? For this discussion I'll simplify things by assuming the storage driver never calls IoCompleteRequest at DPC, so that is not an issue.

Now, here is where a minifilter enters the picture. Let's say I need a minifilter that performs some sort of logging and after each successful operation (or unsuccessful, it doesn't matter, I'm just trying to find something plausible a minifilter would do) it writes a record to a log file. So its postOp routine does something like this:

if (NT_SUCCESS(status)) FltWriteFile(..., logEntry, ...);

Now, let's say that because the minifilter writer expects multiple threads to be writing at the same time, it is easier to open the log file for synchronous IO and not worry about maintaining the current byte offset and so on. Which means that it will issue a synchronous write (if no CallbackRoutine is provided when calling FltWriteFile then the write will be a synchronous one).

If these implementations happen to meet on a machine, here's how a deadlock might happen:

User Thread (issuing a read operation for example) :
1. Minifilter gets called and it wants to log the operation and so it returns FLT_PREOP_SUCCESS_WITH_CALLBACK
2. The file system receives the operation and doesn't do much (let's say it's a small non-cached read) and sends it down to the storage device.
3. The storage device pends the IRP_MJ_READ and adds it to the queue.

Storage Driver Thread
1. Get notification about the pended IRP_MJ_READ and dequeue it
2. Perform the operations associated with the request (read from an internal buffer, queue a DMA transfer or do whatever it is that storage drivers do when they need to read data :)).
3. Call IoCompleteRequest on the IRP_MJ_READ
4. The file system's IoCompletionRoutine gets called, which doesn't do much and returns STATUS_SUCCESS
5. The minifilter's postOp callback gets called
6. The minifilter calls FltWriteFile(…logEntry….)
7. FltMgr sends an IRP_MJ_WRITE to the file system.
8. The Storage Driver gets an IRP_MJ_WRITE and it queues it and returns STATUS_PENDING.
9. FltMgr gets the STATUS_PENDING and since the caller wanted a synchronous write, it waits for the IRP to complete.. However, since this is the Storage Driver Thread already, it will never dequeue the request and it will deadlock.

Now, this might look like a pretty forced scenario (which it is :)), but it's meant to describe what the problem looks like. So now let's discuss what a more "real-world" scenario would look like and how some different design decisions might affect this outcome:

  • What if the storage driver had multiple threads (can we blame the writer of the storage driver)? Clearly this would help the scenario. But then even when there are multiple threads, there are some operations that likely need to be synchronized. For example, maybe the storage driver can perform multiple reads but only one write at one time.. This would solve the issue because the minifilter would issue the request from one of the reader threads and it would wait for the writer thread.. But what if the minifilter did the same thing for IRP_MJ_WRITEs ? The problem is still there. 
  • What if the driver supports multiple threads for both reads and writes ? Well, there is likely some operation that requires synchronization. For example, a VHD storage driver (a dynamic VHD extends in blocks so when a new block is needed, metadata operations need to happen so some synchronization is required) might have multiple threads for IRP_MJ_READs and IRP_MJ_WRITEs, but if the IRP_MJ_WRITE is an extending one (i.e. when a new block must be allocated), it might still queue the IRP_MJ_WRITE to a single "extending write" processing thread. So now the deadlock would happen only when the user's write would require the VHD to extend and when the minifilter's log write is also an extending one.. 
  • And even if there are multiple threads that are completely independent, if there are enough simultaneous requests or if there are enough minifilters blocking those threads, this might still happen.
  • What if the minifilter issued an asynchronous request and just waited for it to complete ? Well, this is largely equivalent to issuing a synchronous request so the issue is still there.
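The scenario and its multi-thread variations can be reduced to a small deterministic model. This is a sketch under simplifying assumptions, not a storage driver: requests are dequeued strictly in FIFO order, every user request's postOp issues one log write and waits for it on the worker thread that completed the request, and log writes do not trigger further log writes. All names (`all_requests_complete`, `REQ_USER`, `REQ_LOG`, `MAX_USER`, `MAX_QUEUE`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { REQ_USER, REQ_LOG };
enum { MAX_USER = 32, MAX_QUEUE = 2 * MAX_USER };

/* Returns true if every request completes, false if the model deadlocks. */
bool all_requests_complete(int workers, int user_requests)
{
    int queue[MAX_QUEUE];
    int head = 0, tail = 0;
    int idle = workers;   /* worker threads free to dequeue */
    int blocked = 0;      /* worker threads waiting inside a postOp */

    assert(workers >= 0 && user_requests >= 0 && user_requests <= MAX_USER);

    for (int i = 0; i < user_requests; i++)
        queue[tail++] = REQ_USER;

    while (head < tail) {
        if (idle == 0)
            return false;            /* work is queued but nobody can dequeue it */

        if (queue[head++] == REQ_USER) {
            /* The postOp runs on this worker: it queues the log write
             * and then waits for it, so the worker stops dequeuing. */
            queue[tail++] = REQ_LOG;
            idle--;
            blocked++;
        } else {
            /* A log write completes and releases the worker waiting on it. */
            blocked--;
            idle++;
        }
    }
    return blocked == 0;
}
```

With one worker a single user request is enough to deadlock. With two workers one request completes, but two simultaneous requests block both workers before either log write is dequeued, which is the "enough simultaneous requests" case from the list above.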

It might seem that this scenario simply can't work and that issuing a write from a completion routine is always deadlock prone, but there are some things that could fix this problem, so let's talk about them as well:

  • The minifilter could issue a completely asynchronous request and NOT WAIT for it. This can work for logging since it might not matter when the logging happens, so the minifilter doesn't actually need to wait. But what if the minifilter is not just logging but is doing something that simply must complete before the original request completes ? Then the minifilter can simply issue the asynchronous request and return FLT_POSTOP_MORE_PROCESSING_REQUIRED, and then complete the original request from the CompletionRoutine. This would work because when FLT_POSTOP_MORE_PROCESSING_REQUIRED is returned, control is returned to where IoCompleteRequest was called, which was right where the Storage Driver Thread called IoCompleteRequest. So now the Storage Driver Thread is no longer blocked and can go back to processing more IO (this is very similar to what FltDoCompletionProcessingWhenSafe does).
  • What if the minifilter doesn't want to issue an asynchronous request, since synchronous requests are much easier to handle ? Then the minifilter could queue the synchronous request to a worker thread and return FLT_POSTOP_MORE_PROCESSING_REQUIRED and have the worker thread complete the user's request after the synchronous request it issued completes.
  • And yet another approach a minifilter can take is to return FLT_PREOP_SYNCHRONIZE instead of FLT_PREOP_SUCCESS_WITH_CALLBACK. This means that once the request is completed in the storage driver, FltMgr will simply acknowledge that completion and not block that thread at all. This has the added benefit of executing in the context of the original request, which is usually a much better idea for minifilters that need to do complicated things in their postOp routines.
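What these fixes have in common is that they move the wait off the Storage Driver Thread. A tiny sketch of that idea, assuming the single-threaded storage driver from the example (the enum and function names are hypothetical, and the mapping is my reading of the options above rather than anything FltMgr exposes):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { THREAD_STORAGE, THREAD_SYSTEM_WORKER, THREAD_REQUESTOR } ThreadId;

typedef enum {
    SYNC_WRITE_IN_POSTOP,       /* the original, deadlock-prone design */
    QUEUE_AND_MORE_PROCESSING,  /* queue the work, return MORE_PROCESSING_REQUIRED */
    PREOP_SYNCHRONIZE           /* postOp runs in the originating thread */
} Strategy;

/* Which thread ends up waiting for the log write under each strategy. */
ThreadId waiting_thread(Strategy s)
{
    switch (s) {
    case SYNC_WRITE_IN_POSTOP:      return THREAD_STORAGE;
    case QUEUE_AND_MORE_PROCESSING: return THREAD_SYSTEM_WORKER;
    case PREOP_SYNCHRONIZE:         return THREAD_REQUESTOR;
    }
    assert(!"unknown strategy");
    return THREAD_STORAGE;
}

/* With one storage thread, the log write can only be dequeued if that
 * thread is not the one waiting for it. */
bool log_write_can_complete(Strategy s)
{
    return waiting_thread(s) != THREAD_STORAGE;
}
```

The fully asynchronous option is not modelled because in that case nobody waits at all.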

Now, the reason this is not specific to FltDoCompletionProcessingWhenSafe is that I already asserted that the storage driver never completes a request at DPC, so calling FltDoCompletionProcessingWhenSafe is unnecessary. However, even if the storage driver could call IoCompleteRequest at DPC, FltDoCompletionProcessingWhenSafe would simply queue the work and hand back FLT_POSTOP_MORE_PROCESSING_REQUIRED, so the thread where IoCompleteRequest was called would not be blocked. Besides, that thread would likely be an arbitrary thread anyway (since completion at DPC usually happens in whatever thread happened to be running when the request was completed by the hardware). Anyway, there are other more complicated reasons why this in fact simply can't happen when the request actually completes at DPC (or at least I don't think so) but I won't go into that now.

However, one thing to keep in mind is that if completion doesn't actually happen at DPC, FltDoCompletionProcessingWhenSafe doesn't do anything more than call the user's completion function inline so the deadlock I described above can still happen.

So I guess the bottom line is that the warning that provoked this post should in fact be something more like :
Caution   To avoid deadlocks, minifilters should not perform synchronous requests from a postOp callback and should instead either:

  • queue the operation and return FLT_POSTOP_MORE_PROCESSING_REQUIRED from the postOp callback or
  • return FLT_PREOP_SYNCHRONIZE from the preOp

I hope this makes sense. Please feel free to comment on anything I might have missed (since this is a pretty complicated scenario and I haven't in fact ever seen this in practice so it's all hypothetical :) ).