Thursday, December 2, 2010

More thoughts on FltDoCompletionProcessingWhenSafe and minifilter completion in general

I promised in the last post that I'd talk about how FltDoCompletionProcessingWhenSafe can deadlock. I've never actually seen such a deadlock, so I spent some time thinking about it and went over various scenarios, but in the end I couldn't find anything specific to FltDoCompletionProcessingWhenSafe.

However, while thinking about deadlocks in the completion path I did find a way a deadlock can happen anyway, so I'll write about that instead and explain how I think it interacts with FltDoCompletionProcessingWhenSafe :).

There are some drivers that take the approach of queuing up requests and then using one or more threads to dequeue the requests and process them. In theory this can happen anywhere: in a minifilter, in the file system or in the storage stack. In fact the ramdisk sample in the WDK is implemented using such a queue (at least as far as I can tell; WDF is not my forte). Anyway, the point to remember is that this is a fairly common design strategy, possibly even more so with storage drivers.

This will be easier to explain with an example, so I'll describe a possible architecture for a storage driver. This driver marks all requests as pending, adds them to an internal queue, releases a semaphore (or some similar mechanism) and then returns STATUS_PENDING to the caller. The driver also has one thread that waits on the semaphore. When the semaphore is signaled, the thread dequeues one request, processes it synchronously (it waits for it to complete), calls IoCompleteRequest and goes back to waiting. Pretty simple, right? For this discussion I'll simplify things by assuming the storage driver never actually calls IoCompleteRequest at DPC, so that is not an issue.
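The architecture above can be modeled in a few lines of user-mode C. This is only a toy sketch, not WDK code: `ToyQueue`, `toy_dispatch` and `toy_worker_step` are names I made up for illustration, the semaphore is modeled as a plain counter, and the "worker thread" is just a function the caller invokes by hand.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_CAPACITY 8
#define TOY_STATUS_PENDING 0x103      /* same numeric value as STATUS_PENDING */

/* Hypothetical model of the storage driver's internal queue. */
typedef struct ToyQueue {
    int    requests[TOY_QUEUE_CAPACITY]; /* request ids in FIFO order */
    size_t head;                         /* index of the next request to dequeue */
    size_t count;                        /* queued requests (the "semaphore" count) */
    int    completed;                    /* number of requests completed so far */
} ToyQueue;

/* Dispatch path: mark the request pending, queue it, "release the semaphore". */
int toy_dispatch(ToyQueue *q, int request_id)
{
    assert(q->count < TOY_QUEUE_CAPACITY);
    q->requests[(q->head + q->count) % TOY_QUEUE_CAPACITY] = request_id;
    q->count++;
    return TOY_STATUS_PENDING;
}

/* One iteration of the worker loop. Returns the id of the request it
 * completed, or -1 if the queue is empty (a real thread would block here). */
int toy_worker_step(ToyQueue *q)
{
    if (q->count == 0) {
        return -1;
    }
    int id = q->requests[q->head];
    q->head = (q->head + 1) % TOY_QUEUE_CAPACITY;
    q->count--;
    /* Process synchronously, then do the equivalent of IoCompleteRequest. */
    q->completed++;
    return id;
}
```

The important property is that only `toy_worker_step` ever removes a request from the queue, so anything that stops the worker from getting back to its loop stalls every queued request.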

Now, here is where a minifilter enters the picture. Let's say I need a minifilter that performs some sort of logging and after each successful operation (or unsuccessful, it doesn't matter, I'm just trying to find something plausible a minifilter would do) it writes a record to a log file. So its postOp routine does something like this:

if (NT_SUCCESS(status)) FltWriteFile(..., logEntry, ...);

Now, let's say that because the minifilter writer expects multiple threads to be writing at the same time, it is easier to open the log file for synchronous IO and not worry about maintaining the current byte offset and so on. This means that the minifilter will issue a synchronous write (if no CallbackRoutine is provided when calling FltWriteFile then the write will be a synchronous one).

If these implementations happen to meet on a machine, here's how a deadlock might happen:

User Thread (issuing a read operation for example) :
1. Minifilter gets called and it wants to log the operation and so it returns FLT_PREOP_SUCCESS_WITH_CALLBACK
2. The file system receives the operation and doesn't do much (let's say it's a small non-cached read) and sends it down to the storage device.
3. The storage device pends the IRP_MJ_READ and adds it to the queue.

Storage Driver Thread
1. Get notification about the pended IRP_MJ_READ and dequeue it
2. Perform the operations associated with the request (read from an internal buffer, queue a DMA transfer or do whatever it is that storage drivers do when they need to read data :)).
3. Call IoCompleteRequest on the IRP_MJ_READ
4. The file system's IoCompletionRoutine gets called, which doesn't do much and returns STATUS_SUCCESS
5. The minifilter's postOp callback gets called
6. The minifilter calls FltWriteFile(..., logEntry, ...)
7. FltMgr sends an IRP_MJ_WRITE to the file system.
8. The Storage Driver gets an IRP_MJ_WRITE and it queues it and returns STATUS_PENDING.
9. FltMgr gets the STATUS_PENDING and, since the caller wanted a synchronous write, it waits for the IRP to complete. However, since this is already the Storage Driver Thread, and it is the only thread that dequeues requests, the write will never be dequeued and the thread will deadlock.
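The sequence above can be condensed into a small user-mode C model. Again, this is a hedged sketch rather than driver code: `ToyWorker`, `toy_sync_write` and `toy_complete_read` are hypothetical names, and instead of really hanging the model returns `TOY_DEADLOCK` when the only thread that could dequeue the write is the one waiting for it.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_COMPLETED, TOY_DEADLOCK } ToyOutcome;

/* Hypothetical model of the single storage driver thread. */
typedef struct ToyWorker {
    int  queued;         /* requests waiting for the storage thread */
    bool in_completion;  /* true while the thread is inside IoCompleteRequest */
} ToyWorker;

/* Synchronous write: the storage driver pends and queues the request,
 * then the caller waits until the storage thread has processed it. */
ToyOutcome toy_sync_write(ToyWorker *w, bool caller_is_storage_thread)
{
    w->queued++;  /* step 8: queued, STATUS_PENDING returned */
    if (caller_is_storage_thread && w->in_completion) {
        /* Step 9: the waiter is the only thread that could dequeue. */
        return TOY_DEADLOCK;
    }
    w->queued--;  /* storage thread is free: it dequeues and completes */
    return TOY_COMPLETED;
}

/* Storage thread completes a read; the postOp callback then logs the
 * operation with a synchronous write on the same thread. */
ToyOutcome toy_complete_read(ToyWorker *w)
{
    w->in_completion = true;                      /* step 3 */
    ToyOutcome outcome = toy_sync_write(w, true); /* steps 5 to 9 */
    w->in_completion = false; /* only the model gets here after a deadlock */
    return outcome;
}
```

Calling `toy_complete_read` on a fresh `ToyWorker` yields `TOY_DEADLOCK` and leaves one write stuck in the queue, while the same `toy_sync_write` issued from any other thread completes normally.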

Now, this might look like a pretty forced scenario (which it is :)), but it serves to describe what the problem looks like. So now let's discuss what a more "real-world" scenario would look like and how some different design decisions might affect the outcome:

  • What if the storage driver had multiple threads (can we blame the writer of the storage driver)? Clearly this would help the scenario. But even when there are multiple threads, there are some operations that likely need to be synchronized. For example, maybe the storage driver can perform multiple reads but only one write at a time. This would solve the issue because the minifilter would issue the request from one of the reader threads and it would wait for the writer thread. But what if the minifilter did the same thing for IRP_MJ_WRITEs? The problem is still there.
  • What if the driver supports multiple threads for both reads and writes? Well, there is likely still some operation that requires synchronization. For example, a dynamic VHD extends in blocks, so when a new block is needed some metadata operations must happen, which requires synchronization. A VHD storage driver might therefore have multiple threads for IRP_MJ_READs and IRP_MJ_WRITEs, but if the IRP_MJ_WRITE is an extending one (i.e. a new block must be allocated) it might still queue the IRP_MJ_WRITE to a single "extending write" processing thread. So now the deadlock would happen only when the user's write requires the VHD to extend and the minifilter's log write is also an extending one.
  • And even if there are multiple threads that are completely independent, if there are enough simultaneous requests or if there are enough minifilters blocking those threads, this might still happen.
  • What if the minifilter issued an asynchronous request and just waited for it to complete? Well, this is largely equivalent to issuing a synchronous request, so the issue is still there.

It might seem that this scenario simply can't work and that issuing a write from a completion routine is always deadlock prone, but there are some things that could fix this problem, so let's talk about them as well:

  • The minifilter could issue a completely asynchronous request and NOT WAIT for it. This can work for logging since it might not matter when the logging happens, so the minifilter doesn't actually need to wait. But what if the minifilter is not just logging but is doing something that simply must complete before the original request completes? Then the minifilter can issue the asynchronous request, return FLT_POSTOP_MORE_PROCESSING_REQUIRED and then, in the CompletionRoutine, complete the original request (by calling FltCompletePendedPostOperation). This works because when FLT_POSTOP_MORE_PROCESSING_REQUIRED is returned, control is returned to where IoCompleteRequest was called, which is right where the Storage Driver Thread called IoCompleteRequest. So now the Storage Driver Thread is no longer blocked and can go back to processing more IO (this is very similar to what FltDoCompletionProcessingWhenSafe does).
  • What if the minifilter doesn't want to issue an asynchronous request, since synchronous requests are much easier to handle? Then the minifilter could queue the synchronous request to a worker thread, return FLT_POSTOP_MORE_PROCESSING_REQUIRED and have the worker thread complete the user's request after the synchronous request it issued completes.
  • And yet another approach a minifilter can take is to return FLT_PREOP_SYNCHRONIZE instead of FLT_PREOP_SUCCESS_WITH_CALLBACK. This means that once the request is completed in the storage driver, FltMgr will simply acknowledge that completion and not block that thread at all; the postOp callback is then called in the thread that issued the original request. Executing in the context of the original request is usually a much better idea for minifilters that need to do complicated things in their postOp routines.
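To make the second option above more concrete, here is a user-mode C model of the "queue to a worker thread and return FLT_POSTOP_MORE_PROCESSING_REQUIRED" pattern. Everything in it is an assumption made for illustration (`ToySystem`, `toy_post_op`, `toy_storage_complete_read`, `toy_filter_worker_step`); the two "threads" are plain functions called in sequence, and a real minifilter would finish the pended operation with FltCompletePendedPostOperation.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_FINISHED, TOY_MORE_PROCESSING_REQUIRED } ToyPostOpStatus;

/* Hypothetical model: one storage thread plus one minifilter worker thread. */
typedef struct ToySystem {
    bool storage_busy;    /* storage thread is inside a completion path */
    int  deferred_logs;   /* log writes handed to the minifilter's worker */
    int  user_completed;  /* user requests that were fully completed */
} ToySystem;

/* postOp callback: do not wait here, just hand the log write to the worker. */
ToyPostOpStatus toy_post_op(ToySystem *s)
{
    s->deferred_logs++;
    return TOY_MORE_PROCESSING_REQUIRED;
}

/* Storage thread completes the user's read and goes back to its loop. */
void toy_storage_complete_read(ToySystem *s)
{
    s->storage_busy = true;
    ToyPostOpStatus status = toy_post_op(s);
    assert(status == TOY_MORE_PROCESSING_REQUIRED);
    s->storage_busy = false;  /* completion unwinds, the thread is free again */
}

/* Minifilter worker: issue the synchronous log write, then complete the
 * user's request. Returns false if there is nothing to do or if the write
 * would have to wait for a blocked storage thread. */
bool toy_filter_worker_step(ToySystem *s)
{
    if (s->deferred_logs == 0 || s->storage_busy) {
        return false;
    }
    s->deferred_logs--;   /* storage thread dequeues and completes the write */
    s->user_completed++;  /* real code: FltCompletePendedPostOperation */
    return true;
}
```

The design point the model tries to show is that the storage thread never waits on the log write: by the time the minifilter's worker issues it, the storage thread is back in its loop and can dequeue it.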

Now, the reason this is not specific to FltDoCompletionProcessingWhenSafe is that I already asserted that the storage driver never completes a request at DPC, so calling FltDoCompletionProcessingWhenSafe is unnecessary. However, even if the storage driver could call IoCompleteRequest at DPC, FltDoCompletionProcessingWhenSafe would simply defer the callback to a worker thread and hand back FLT_POSTOP_MORE_PROCESSING_REQUIRED, so the thread where IoCompleteRequest was called would not be blocked. Besides, that thread would likely be an arbitrary thread anyway (since completion at DPC usually happens in whatever thread happened to be running when the request was completed by the hardware). Anyway, there are other more complicated reasons why this in fact simply can't happen when the request actually completes at DPC (or at least I don't think it can) but I won't go into that now.

However, one thing to keep in mind is that if completion doesn't actually happen at DPC, FltDoCompletionProcessingWhenSafe doesn't do anything more than call the user's completion function inline so the deadlock I described above can still happen.
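The two behaviors described here (inline call below DPC level, deferral otherwise) can be sketched as a user-mode C model. This is not the real FltDoCompletionProcessingWhenSafe signature or implementation: `toy_do_completion_when_safe`, `ToyDeferred` and `toy_log_callback` are hypothetical, and only the numeric IRQL values mirror PASSIVE_LEVEL and DISPATCH_LEVEL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PASSIVE_LEVEL  0   /* same value as PASSIVE_LEVEL */
#define TOY_DISPATCH_LEVEL 2   /* same value as DISPATCH_LEVEL */

typedef enum { TOY_FINISHED, TOY_MORE_PROCESSING_REQUIRED } ToyPostOpStatus;
typedef ToyPostOpStatus (*ToySafeCallback)(void *context);

/* Hypothetical one-slot "worker queue" for a deferred callback. */
typedef struct ToyDeferred {
    ToySafeCallback callback;
    void           *context;
    bool            pending;
} ToyDeferred;

/* Model: below DISPATCH_LEVEL the callback runs inline on the completing
 * thread; at DISPATCH_LEVEL or above it is deferred and completion stops. */
ToyPostOpStatus toy_do_completion_when_safe(int irql,
                                            ToySafeCallback callback,
                                            void *context,
                                            ToyDeferred *deferred)
{
    if (irql >= TOY_DISPATCH_LEVEL) {
        deferred->callback = callback;
        deferred->context  = context;
        deferred->pending  = true;
        return TOY_MORE_PROCESSING_REQUIRED;
    }
    /* Inline: a synchronous write in here blocks the completing thread. */
    return callback(context);
}

/* Example callback that just counts how many times it ran. */
ToyPostOpStatus toy_log_callback(void *context)
{
    int *calls = context;
    (*calls)++;
    return TOY_FINISHED;
}
```

In the inline branch nothing separates the callback from the thread that called IoCompleteRequest, which is exactly the situation in which the deadlock described earlier can occur.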

So I guess the bottom line is that the warning that provoked this post should in fact be something more like:
Caution   To avoid deadlocks, minifilters should not perform synchronous requests from a postOp callback and should instead either:

  • queue the operation and return FLT_POSTOP_MORE_PROCESSING_REQUIRED from the postOp callback or
  • return FLT_PREOP_SYNCHRONIZE from the preOp

I hope this makes sense. Please feel free to comment on anything I might have missed (since this is a pretty complicated scenario and I haven't in fact ever seen this in practice so it's all hypothetical :) ).