Thursday, December 2, 2010

More thoughts on FltDoCompletionProcessingWhenSafe and minifilter completion in general

I promised in the last post that I'd talk about how FltDoCompletionProcessingWhenSafe can deadlock. I've never actually seen such a deadlock, so I've spent some time thinking about it and going over various scenarios, but in the end I couldn't find anything specific to FltDoCompletionProcessingWhenSafe.

However, while thinking about deadlocks in the completion path I did find a way a deadlock can happen anyway, so I'll write about that instead and explain how I think it relates to FltDoCompletionProcessingWhenSafe :).

There are some drivers that take the approach of queuing up requests and then using one or more threads to dequeue the requests and process them. In theory this can happen anywhere, in a minifilter, in the file system and in the storage stack. In fact the ramdisk sample in the WDK is implemented using such a queue (at least, as far as I can tell, WDF is not my forte). Anyway, the point to remember is that this is a fairly common design strategy, possibly even more so with storage drivers.

This will be easier to explain with an example, so I'll describe a possible architecture for a storage driver. This driver marks all requests as pending, queues them to an internal queue, releases a semaphore (or some similar mechanism), and then returns pending to the caller. The driver also has one thread that waits on the semaphore; when it is signaled, it dequeues one request and processes it synchronously (it waits for it to complete), after which it calls IoCompleteRequest and goes back to waiting. Pretty simple, right? For this discussion I'll simplify things by assuming the storage driver never actually calls IoCompleteRequest at DPC, so that is not an issue.
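The queuing design above can be sketched roughly as follows. This is kernel-mode C and only a sketch: the names (IvmDispatch, IVM_EXTENSION, IvmProcessRequestSynchronously) are hypothetical, and error handling is omitted.

/* Dispatch routine: pend the IRP, queue it, wake the worker. */
NTSTATUS IvmDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PIVM_EXTENSION ext = DeviceObject->DeviceExtension;

    IoMarkIrpPending(Irp);
    ExInterlockedInsertTailList(&ext->Queue,
                                &Irp->Tail.Overlay.ListEntry,
                                &ext->QueueLock);
    KeReleaseSemaphore(&ext->QueueSemaphore, IO_NO_INCREMENT, 1, FALSE);
    return STATUS_PENDING;
}

/* The single worker thread: dequeue one request at a time and
   process it synchronously before completing it. */
VOID IvmWorkerThread(PVOID Context)
{
    PIVM_EXTENSION ext = Context;

    for (;;) {
        PLIST_ENTRY entry;
        PIRP irp;

        KeWaitForSingleObject(&ext->QueueSemaphore, Executive,
                              KernelMode, FALSE, NULL);
        entry = ExInterlockedRemoveHeadList(&ext->Queue, &ext->QueueLock);
        irp = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);

        IvmProcessRequestSynchronously(ext, irp);   /* waits for the IO */

        irp->IoStatus.Status = STATUS_SUCCESS;
        IoCompleteRequest(irp, IO_NO_INCREMENT);    /* completion runs here */
    }
}

The property that matters for the rest of this discussion is that IoCompleteRequest, and therefore everyone's completion processing, runs on this one worker thread.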

Now, here is where a minifilter enters the picture. Let's say I need a minifilter that performs some sort of logging and after each successful operation (or unsuccessful, it doesn't matter; I'm just trying to find something plausible a minifilter would do) it writes a record to a log file. So its postOp routine does something like this:

if (NT_SUCCESS(Data->IoStatus.Status)) {
    FltWriteFile( ..., logEntry, ... );
}

Now, let's say that because the minifilter writer expects multiple threads to be writing at the same time, it is easier to open the log file for synchronous IO and not worry about maintaining the current byte offset and so on. This means it will issue a synchronous write (if no CallbackRoutine is provided when calling FltWriteFile, the write will be a synchronous one).
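Putting these pieces together, the logging postOp callback might look like the sketch below. LogFileObject, logEntry and logEntrySize are hypothetical; the key detail is that no CallbackRoutine is supplied to FltWriteFile, which makes the write synchronous.

FLT_POSTOP_CALLBACK_STATUS
LogPostOperation(PFLT_CALLBACK_DATA Data,
                 PCFLT_RELATED_OBJECTS FltObjects,
                 PVOID CompletionContext,
                 FLT_POST_OPERATION_FLAGS Flags)
{
    if (NT_SUCCESS(Data->IoStatus.Status)) {

        /* Synchronous write: no CallbackRoutine is supplied, so
           FltWriteFile does not return until the write completes. */
        FltWriteFile(FltObjects->Instance,
                     LogFileObject,
                     NULL,              /* ByteOffset: use current offset */
                     logEntrySize,
                     logEntry,
                     0,                 /* Flags */
                     NULL,              /* BytesWritten */
                     NULL, NULL);       /* no CallbackRoutine/context */
    }

    return FLT_POSTOP_FINISHED_PROCESSING;
}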

If these implementations happen to meet on a machine, here's how a deadlock might happen:

User Thread (issuing a read operation for example) :
1. Minifilter gets called and it wants to log the operation and so it returns FLT_PREOP_SUCCESS_WITH_CALLBACK
2. The file system receives the operation and doesn't do much (let's say it's a small non-cached read) and sends it down to the storage device.
3. The storage device pends the IRP_MJ_READ and adds it to the queue.

Storage Driver Thread
1. Get notification about the pended IRP_MJ_READ and dequeue it
2. Perform the operations associated with the request (read from an internal buffer, queue a DMA transfer or do whatever it is that storage drivers do when they need to read data :)).
3. Call IoCompleteRequest on the IRP_MJ_READ
4. The file system's IoCompletionRoutine gets called, which doesn't do much and returns STATUS_SUCCESS
5. The minifilter's postOp callback gets called
6. The minifilter calls FltWriteFile(…logEntry….)
7. FltMgr sends an IRP_MJ_WRITE to the file system.
8. The Storage Driver gets an IRP_MJ_WRITE and it queues it and returns STATUS_PENDING.
9. FltMgr gets the STATUS_PENDING and, since the caller wanted a synchronous write, it waits for the IRP to complete. However, since this is the Storage Driver Thread itself, the only thread that could dequeue the IRP_MJ_WRITE is now blocked waiting on it, and it will deadlock.

Now, this might look like a pretty forced scenario (which it is :)), but it describes what the problem looks like. So now let's discuss what a more "real-world" scenario might look like and how some different design decisions might affect this outcome:

  • What if the storage driver had multiple threads (so can we blame the writer of the storage driver)? Clearly this would help. But even with multiple threads, there are some operations that likely need to be synchronized. For example, maybe the storage driver can perform multiple reads but only one write at a time. This would solve our scenario, because the minifilter would issue its write from one of the reader threads and wait on the writer thread. But what if the minifilter did the same thing for IRP_MJ_WRITEs? The problem is still there. 
  • What if the driver supports multiple threads for both reads and writes? Well, there is still likely some operation that requires synchronization. For example, a VHD storage driver (a dynamic VHD extends in blocks, so when a new block is needed, metadata operations need to happen and some synchronization is required) might have multiple threads for IRP_MJ_READs and IRP_MJ_WRITEs, but if an IRP_MJ_WRITE is an extending one (i.e. when a new block must be allocated), it might still queue that IRP_MJ_WRITE to a single "extending write" processing thread. So now the deadlock would happen only when the user's write requires the VHD to extend and the minifilter's log write is also an extending one. 
  • And even if there are multiple threads that are completely independent, with enough simultaneous requests or enough minifilters blocking those threads, this might still happen.
  • What if the minifilter issued an asynchronous request and just waited for it to complete? Well, this is largely equivalent to issuing a synchronous request, so the issue is still there.

It might seem that this scenario simply can't work and that issuing a write from a completion routine is always deadlock prone, but there are some things that can fix this problem, so let's talk about them as well:

  • The minifilter could issue a completely asynchronous request and NOT WAIT for it. This can work for logging, since it might not matter when the logging happens, so the minifilter doesn't actually need to wait. But what if the minifilter is not just logging and is doing something that simply must complete before the original request completes? Then the minifilter can issue the asynchronous request, return FLT_POSTOP_MORE_PROCESSING_REQUIRED, and then complete the original request from the CallbackRoutine of that asynchronous request. This works because when FLT_POSTOP_MORE_PROCESSING_REQUIRED is returned, control is returned to the caller of IoCompleteRequest, which is right where the Storage Driver Thread called IoCompleteRequest. So now the Storage Driver Thread is no longer blocked and can go back to processing more IO (this is very similar to what FltDoCompletionProcessingWhenSafe does).
  • What if the minifilter doesn't want to issue an asynchronous request, since synchronous requests are much easier to handle? Then the minifilter could queue the synchronous request to a worker thread, return FLT_POSTOP_MORE_PROCESSING_REQUIRED, and have the worker thread complete the user's request after the synchronous request it issued completes.
  • And yet another approach a minifilter can take is to return FLT_PREOP_SYNCHRONIZE instead of FLT_PREOP_SUCCESS_WITH_CALLBACK. This means that once the request is completed in the storage driver, FltMgr will simply acknowledge that completion and not block that thread at all. This has the added benefit of executing in the context of the original request, which is usually a much better idea for minifilters that need to do complicated things in their postOp routines.
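The first fix above (asynchronous write plus FLT_POSTOP_MORE_PROCESSING_REQUIRED) might be sketched like this. The log-file names are again hypothetical and error handling is omitted; the real API calls are FltWriteFile with a CallbackRoutine and FltCompletePendedPostOperation.

/* Completion callback for the asynchronous log write. The Context is
   the original operation's callback data, pended in the postOp. */
VOID
LogWriteCallback(PFLT_CALLBACK_DATA CallbackData,
                 PFLT_CONTEXT Context)
{
    PFLT_CALLBACK_DATA originalData = (PFLT_CALLBACK_DATA)Context;

    UNREFERENCED_PARAMETER(CallbackData);

    /* The log write is done; now let the original request complete. */
    FltCompletePendedPostOperation(originalData);
}

FLT_POSTOP_CALLBACK_STATUS
LogPostOperation(PFLT_CALLBACK_DATA Data,
                 PCFLT_RELATED_OBJECTS FltObjects,
                 PVOID CompletionContext,
                 FLT_POST_OPERATION_FLAGS Flags)
{
    /* Asynchronous write: a CallbackRoutine is supplied, so FltWriteFile
       returns immediately instead of waiting for the write. */
    FltWriteFile(FltObjects->Instance, LogFileObject, NULL,
                 logEntrySize, logEntry, 0, NULL,
                 LogWriteCallback, Data);

    /* Control returns to the caller of IoCompleteRequest (the storage
       driver's thread), which can go back to processing its queue. */
    return FLT_POSTOP_MORE_PROCESSING_REQUIRED;
}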

Now, the reason this is not specific to FltDoCompletionProcessingWhenSafe is that I already asserted that the storage driver never completes a request at DPC, so calling FltDoCompletionProcessingWhenSafe is unnecessary. However, even if the storage driver could call IoCompleteRequest at DPC, FltDoCompletionProcessingWhenSafe would simply return STATUS_MORE_PROCESSING_REQUIRED, so the thread in which IoCompleteRequest was called would not be blocked. Besides, that thread would likely be an arbitrary thread anyway (since completion at DPC usually happens in whatever thread happened to be running when the request was completed by the hardware). Anyway, there are other, more complicated reasons why this in fact simply can't happen when the request actually completes at DPC (or at least I don't think so), but I won't go into that now.

However, one thing to keep in mind is that if completion doesn't actually happen at DPC, FltDoCompletionProcessingWhenSafe doesn't do anything more than call the user's completion function inline so the deadlock I described above can still happen.

So I guess the bottom line is that the warning that provoked this post should in fact read something more like:
Caution   To avoid deadlocks, minifilters should not issue synchronous requests from a postOp callback and should instead either:

  • queue the operation and return FLT_POSTOP_MORE_PROCESSING_REQUIRED from the postOp callback or
  • return FLT_PREOP_SYNCHRONIZE from the preOp
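The FLT_PREOP_SYNCHRONIZE option amounts to the sketch below: with it, the postOp runs at PASSIVE_LEVEL in the thread of the original requestor, where a synchronous FltWriteFile is safe from this particular deadlock. One documented caveat worth noting: minifilters must not return FLT_PREOP_SYNCHRONIZE for asynchronous paging IO.

FLT_PREOP_CALLBACK_STATUS
LogPreOperation(PFLT_CALLBACK_DATA Data,
                PCFLT_RELATED_OBJECTS FltObjects,
                PVOID *CompletionContext)
{
    UNREFERENCED_PARAMETER(Data);
    UNREFERENCED_PARAMETER(FltObjects);
    UNREFERENCED_PARAMETER(CompletionContext);

    /* FltMgr's own completion routine just signals an event, so the
       completing thread (e.g. the storage driver's worker) is released
       immediately; our postOp then runs back in the original thread,
       at PASSIVE_LEVEL, where a synchronous write is fine. */
    return FLT_PREOP_SYNCHRONIZE;
}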

I hope this makes sense. Please feel free to comment on anything I might have missed (since this is a pretty complicated scenario and I haven't in fact ever seen this in practice so it's all hypothetical :) ).

Friday, November 12, 2010

Some thoughts on FltDoCompletionProcessingWhenSafe

I've been meaning to talk about this for a while. There is a warning in the MSDN page for FltDoCompletionProcessingWhenSafe which is pretty interesting:

Caution   To avoid deadlocks, FltDoCompletionProcessingWhenSafe cannot be called for I/O operations that can be directly completed by a driver in the storage stack, such as the following:
• IRP_MJ_READ
• IRP_MJ_WRITE
• IRP_MJ_FLUSH_BUFFERS

Let's start by looking a bit at how file systems handle requests. There are multiple ways in which file systems can complete user requests, but largely they fall into a few cases. I'd like to point out that I'm simplifying things here; there are many ways in which file systems might handle operations, and the same goes for storage devices. What I'm describing is not an exhaustive list of how things happen in a file system and storage stack, but rather a plausible way in which they can happen in some file systems in some cases:
• Synchronous - when all the data is readily available then the file system doesn't need to do any additional steps and can just perform the operation and return to the caller. For example, when setting the delete disposition on a file, the file system only needs to access the FCB and set the flag (because the delete disposition is a flag on the FCB). If the file system can acquire the FCB immediately it can just set the flag to whatever disposition the caller wanted, release the FCB and call IoCompleteRequest. When this happens the completion routines (and the postOp callbacks for minifilters) are actually called in the same thread as the original operation, at the same IRQL (which is very likely at PASSIVE_LEVEL)...
• Queued (asynchronous) - this happens when the file system realizes it can't complete the operation immediately and it needs to pend the request and complete it when some condition occurs. There are a lot of cases when this happens, for example when the file system needs to acquire some resource and it doesn't want to wait for it inline. Another case where this is pretty much the only course of action is when the caller registers notifications for something (oplocks, directory changes and such) and the IRP gets pended. In these cases, the postOp callbacks will be called generally in the context of the thread that released the resource or that did something to trigger the notification (acknowledge an oplock break, rename a file and so on). This is usually a different thread from the original thread the request came in, and usually the IRQL is <= APC_LEVEL.
• Forwarded - this can happen when the file system needs to get some data from the storage device and simply forwards the request to the underlying device. For example, let's say that a user wants to read some aligned data from a file. The file system might simply calculate where the data begins on disk (by consulting its allocation maps, which we'll assume are cached so no reading from the device is necessary), change the offset in the IRP_MJ_READ parameters to the right sector where the data is located, lock the buffer in memory and then call IoCallDriver. When this request is satisfied by the storage stack, the storage stack will call IoCompleteRequest and the file system will pretty much not do anything (or free some resources or some such) and then let the request go up. In this case, the thread in which the postOp callback gets called is whatever thread was running when the disk IO was completed by the device (the IO will be completed in an interrupt, which will likely queue a DPC, which will then execute in whatever thread context the CPU happened to be running when the interrupt triggered), and at DPC_LEVEL.

Now, in a lot of cases the file system will need to perform a bunch of things in response to one single user request. For example, a request to write something might mean the file system will need to do at least the following (please ignore the order of the operations here):
• Write the data
• Update the last access time
• Update the file size
All these changes need to be saved to different places on disk (usually, it really depends on the filesystem) so the request might be pended by the file system while it issues a bunch of different IO requests to the storage device and when all of them complete it can complete the request. So in most cases operations are a combination of queued and forwarded operations.

The reason I went into all of this was because I wanted to make this point: in most cases, the postOp callback will be called at DPC only if the operation required one or more IOs to be sent to the storage device and the file system didn't need to synchronize the operation back to some internal thread and instead simply had a passthrough completion routine (see FatSingleAsyncCompletionRoutine in the FASTFAT sample). The file system will not usually complete an operation at DPC in other cases (again, different file systems do things differently, so it MIGHT still happen).

Now, this means that either the warning or the function is useless, because the only reason FltDoCompletionProcessingWhenSafe exists is to let minifilters write completion routines that use functions requiring IRQL <= APC_LEVEL without worrying about whether the postOp callback is called at DPC. So if, according to the warning, "FltDoCompletionProcessingWhenSafe cannot be called for I/O operations that can be directly completed by a driver in the storage stack", then this is like saying that FltDoCompletionProcessingWhenSafe cannot be called for operations that might be completed at DPC_LEVEL, which is the only case where it is useful.
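For reference, the usual calling pattern for FltDoCompletionProcessingWhenSafe looks roughly like this sketch (the safe-post callback name MySafePostCallback is made up):

FLT_POSTOP_CALLBACK_STATUS
MyPostOperation(PFLT_CALLBACK_DATA Data,
                PCFLT_RELATED_OBJECTS FltObjects,
                PVOID CompletionContext,
                FLT_POST_OPERATION_FLAGS Flags)
{
    FLT_POSTOP_CALLBACK_STATUS retStatus = FLT_POSTOP_FINISHED_PROCESSING;

    /* If we are at DPC_LEVEL this queues MySafePostCallback to a worker
       thread and sets retStatus to FLT_POSTOP_MORE_PROCESSING_REQUIRED;
       otherwise MySafePostCallback is simply called inline, in this
       thread, which is why the deadlock above is still possible. */
    if (!FltDoCompletionProcessingWhenSafe(Data, FltObjects,
                                           CompletionContext, Flags,
                                           MySafePostCallback,
                                           &retStatus)) {
        /* The work item could not be queued; fail the operation. */
        Data->IoStatus.Status = STATUS_UNSUCCESSFUL;
        Data->IoStatus.Information = 0;
    }

    return retStatus;
}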

I'll talk about the actual deadlocks in a post next week.

Thursday, November 4, 2010

ObQueryNameString can return names with a NULL Buffer ( and an example with SR.sys)

ObQueryNameString is a very useful API. It's used in a lot of places and is a pretty good choice if you want to find the name of an OB object. However, using it is not without pitfalls. At the moment the documentation page on MSDN has this to say in the Remarks section: "If the given object is unnamed, or if the object name was not successfully acquired, ObQueryNameString sets Name.Buffer to NULL and sets Name.Length and Name.MaximumLength to zero.". What is not clearly spelled out there is the fact that the return status in this case will be STATUS_SUCCESS.

So let's recap. Any app developer can call ObQueryNameString and get STATUS_SUCCESS, but Name.Buffer will be NULL, and they might not expect that. I've seen this issue over and over again. People get a reference to an object, query the name, get a NULL buffer, and then try to read/compare/do whatever with it, and they get a visit from the bugcheck fairy. Please note that since Length and MaximumLength are both 0, people would be safe using the Rtl functions, since those tend to check for these sorts of things.
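A defensive call therefore needs to check both the return status and the returned name, along these lines (the pool tag and the 1024-byte buffer size are arbitrary choices for this sketch):

POBJECT_NAME_INFORMATION nameInfo;
ULONG returnedLength = 0;
NTSTATUS status;

nameInfo = ExAllocatePoolWithTag(NonPagedPool, 1024, 'mNbO');
if (nameInfo == NULL) {
    return STATUS_INSUFFICIENT_RESOURCES;
}

status = ObQueryNameString(Object, nameInfo, 1024, &returnedLength);

/* STATUS_SUCCESS alone is NOT enough: for an unnamed (or just-deleted)
   object, Name.Buffer is NULL and Name.Length is 0. */
if (NT_SUCCESS(status) &&
    nameInfo->Name.Buffer != NULL &&
    nameInfo->Name.Length != 0) {
    /* safe to use nameInfo->Name here */
}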

It is also interesting to understand how people get bitten by this. The documentation specifies that "If the given object is unnamed, or if the object name was not successfully acquired...", which I guess for most people translates into "if the name was not successfully acquired then I will get some error NTSTATUS… and if this object is unnamed then it's not clear what I get, maybe also some error code?..". So I suppose that people who create named objects that they own (or objects that the system creates and that are guaranteed to be named) imagine that they can never get the NULL buffer with STATUS_SUCCESS. But any named object can become unnamed when it is deleted. After all, the namespace entry is simply an additional reference to the object, and deleting a named object simply deletes that reference, but the object might still be kept around by other references. One easy way to see this is to follow the calls to IoCreateDevice. For example, for an unnamed device one can see this:

Immediately after IoCreateDevice for an unnamed device:

3: kd> !devobj 93602e48  
Device object (93602e48) is for:
  \FileSystem\FltMgr DriverObject 92cb6660
Current Irp 00000000 RefCount 0 Type 00000003 Flags 00000080
DevExt 93602f00 DevObjExt 93602f30 
ExtensionFlags (0x00000800)  
                             Unknown flags 0x00000800
Device queue is not busy.

3: kd> !object 93602e48  
Object: 93602e48  Type: (922d6440) Device
    ObjectHeader: 93602e30 (new version)
    HandleCount: 0  PointerCount: 1

And immediately after a named device:
2: kd> !devobj 930e0628  
Device object (930e0628) is for:
 FltMgr \FileSystem\FltMgr DriverObject 92f691e8
Current Irp 00000000 RefCount 0 Type 00000008 Flags 000000c0
Dacl 96fd2eec DevExt 00000000 DevObjExt 930e06e0 
ExtensionFlags (0x00000800)  
                             Unknown flags 0x00000800
Device queue is not busy.

2: kd> !object 930e0628
Object: 930e0628  Type: (922d7508) Device
    ObjectHeader: 930e0610 (new version)
    HandleCount: 0  PointerCount: 2
    Directory Object: 96e61948  Name: FltMgr

Please notice how the pointer count is different. Once the named device is deleted (IoDeleteDevice), the reference from the OB namespace is removed (and the object's name in the OB header is changed) and then, when the reference count eventually reaches 0, the object is freed. However, if anyone calls ObQueryNameString on one of those remaining references, they will get the NULL Name.Buffer...

So it is perfectly possible for a driver that is working with an object that it knows must be named to actually hit the window between when the object is removed from the OB namespace and when the final reference is released (the driver will of course have a reference of its own in order to be able to access the object…). What this means is that calling ObQueryNameString might return STATUS_SUCCESS and a NULL Name.Buffer even for a named object.

I've recently had the pleasure of debugging an issue with SR.sys and my virtual volume drive on XP SP3. I will share it since it was somewhat interesting and it points to this specific issue. This is what the stack looks like:

1: kd> lm v m sr
start    end        module name
f8489000 f849af00   sr         (pdb symbols)          d:\symbols\sr.pdb\9D5432B7234C4CD2A8F6275B9D9AF41F1\sr.pdb
    Loaded symbol image file: sr.sys
    Image path: sr.sys
    Image name: sr.sys
    Timestamp:        Sun Apr 13 11:36:50 2008 (480252C2)
    CheckSum:         00012604
    ImageSize:        00011F00
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
The problem in SR is this one:
sr!SrGetObjectName+0xd4:
f849105c ff15c49b48f8    call    dword ptr [sr!_imp__ObQueryNameString (f8489bc4)]  <- call ObQueryNameString
f8491062 3bc3            cmp     eax,ebx   <- check for STATUS_SUCCESS
f8491064 894514          mov     dword ptr [ebp+14h],eax  <- save the status...
f8491067 7c24            jl      sr!SrGetObjectName+0x105 (f849108d)
f8491069 0fb707          movzx   eax,word ptr [edi]   <-  this is the Length member of the UNICODE_STRING for the name
f849106c 8b4f04          mov     ecx,dword ptr [edi+4]  <- this is the Buffer member of the UNICODE_STRING..
f849106f d1e8            shr     eax,1   <- calculate the number of characters instead of the number of bytes
f8491071 66897702        mov     word ptr [edi+2],si   <- write some value in MaximumLength… 
f8491075 66891c41        mov     word ptr [ecx+eax*2],bx  <-   write in the buffer a 0 (basically, make sure the string is NULL terminated).. But ECX can be NULL
The stack when I hit this problem looks like this:
1: kd> kbn
 # ChildEBP RetAddr  Args to Child              
00 f80b7944 f849440d 00000000 81dc0a18 e10eac08 sr!SrGetObjectName+0xed
01 f80b7990 f848ecf2 81dc0a18 8239a818 f80b79c0 sr!SrCreateAttachmentDevice+0x99
02 f80b79c4 f848ee0f 8239a818 8239a8d0 81fb4d48 sr!SrFsControlMount+0x2e
03 f80b79e0 804ef18f 8239a8d0 81fb4c90 81fb4c90 sr!SrFsControl+0x4b
04 f80b79f0 80581bc7 00000000 81dc0a18 806e6a4c nt!IopfCallDriver+0x31
05 f80b7a40 804f53d6 c000014f f80b7b00 00000000 nt!IopMountVolume+0x1b9
06 f80b7a70 80582bc0 81e1f268 81dc0a18 f80b7ba4 nt!IopCheckVpbMounted+0x5e
07 f80b7b60 805bf444 81dc0a18 00000000 81fc6600 nt!IopParseDevice+0x3d8
08 f80b7bd8 805bb9d0 00000000 f80b7c18 00000040 nt!ObpLookupObjectName+0x53c
09 f80b7c2c 80576033 00000000 00000000 00000001 nt!ObOpenObjectByName+0xea
0a f80b7ca8 805769aa 009bef80 00100001 009bef24 nt!IopCreateFile+0x407
0b f80b7d04 8057a1a9 009bef80 00100001 009bef24 nt!IoCreateFile+0x8e
0c f80b7d44 8054161c 009bef80 00100001 009bef24 nt!NtOpenFile+0x27
0d f80b7d44 7c90e4f4 009bef80 00100001 009bef24 nt!KiFastCallEntry+0xfc
0e 009beef4 7c90d58c 7c80ec86 009bef80 00100001 ntdll!KiFastSystemCallRet
0f 009beef8 7c80ec86 009bef80 00100001 009bef24 ntdll!NtOpenFile+0xc
10 009bf1f0 7c80ef87 01be31e8 00000000 01be7bf0 kernel32!FindFirstFileExW+0x1a7
11 009bf210 751b1e05 01be31e8 01be7bf0 751a2a04 kernel32!FindFirstFileW+0x16
12 009bf240 751aad1f 009bf714 00000001 000e1358 srsvc!Delnode_Recurse+0x12e
13 009bfb34 751abd1f 009bfd54 7c97b440 7c97b420 srsvc!CEventHandler::OnFirstWrite_Notification+0x3cd
14 009bff60 7c927ba5 00000000 0000006a 000e5f40 srsvc!IoCompletionCallback+0x17a
15 009bff74 7c927b7c 751abba5 00000000 0000006a ntdll!RtlpApcCallout+0x11
16 009bffb4 7c80b713 00000000 00000000 00000000 ntdll!RtlpWorkerThread+0x87
17 009bffec 00000000 7c910230 00000000 00000000 kernel32!BaseThreadStart+0x37
So as you can see, in the mount path SR.sys is trying to create their device to attach to the volume and while doing that it tries to get the name for this device:
1: kd> !devobj 81dc0a18 
Device object (81dc0a18) is for:
  \Driver\IvmVhd DriverObject 81fad590
Current Irp 00000000 RefCount 1 Type 00000007 Flags 00000050
Vpb 81ea0f10 Dacl e1f17924 DevExt 81dc0ad0 DevObjExt 81dc0c30 
ExtensionFlags (0x00000002)  DOE_DELETE_PENDING
Device queue is not busy.
This happens to be my virtual volume device, which as you can tell from the DOE_DELETE_PENDING flag, is about to be torn down. So what this all looks like is this:

1. Something is trying to open a file, see the IopCreateFile (frame 0xb)

2. Io manager, while trying to send the IRP_MJ_CREATE irp (frame 7) wants to make sure the volume is mounted. Please note that at this point the volume device is still in the OB namespace, since otherwise the ObpLookupObjectName call (frame 8) would not have been able to reference it… So at this point IO manager resolved the name to a device object and it now has a reference to the device...

3. IopCheckVpbMounted (frame 6) finds the volume is not mounted (since I dismount it before tearing it down) so it tries to mount it…

4. SR.sys gets the mount request and tries to build a device to attach to the newly mounted volume (in case the mount succeeds). This is pretty standard stuff for a legacy filter… Anyway, in doing so it calls ObQueryNameString, which no longer finds a name for the device and returns a NULL buffer. SR checks for NT_SUCCESS but doesn't check that the buffer is not NULL (or that the length is not 0) and blindly tries to make sure the string is NULL terminated (which is also pointless, since the ObQueryNameString documentation mentions that "The object name (when present) includes a NULL-terminator and all path separators "\" in the name.")… bugcheck.

What my driver did was simply call IoDeleteDevice somewhere between frame 8 and frame 0.

I'm willing to bet that not checking for the null Name.Buffer is a pretty common mistake. For example, there is some code posted on a blog that looks like this:

Status = ObQueryNameString(FileObject->Vpb->RealDevice, OBI, Returned, &Returned);

if (NT_SUCCESS(Status)) {
    if (Root) {
        /* BUG: on success Name.Buffer can still be NULL (and Name.Length 0),
           so this wcscat can dereference a NULL pointer */
        wcscat(OBI->Name.Buffer, L"\\");
        OBI->Name.Length += sizeof(WCHAR);
    }
}

Thursday, October 28, 2010

Useful Models - how choosing the right abstraction can help design and some useful abstractions for working with minifilters

The poll on the site indicated this was the topic most people were interested in so here it is.

I find myself quite often in the position of trying to explain why something doesn't work the way someone expects it to. I guess this is due in large part to the fact that the work I do (storage and file systems) is something people interact with quite often but that in fact operates quite differently from the abstraction it presents to users. I've mentioned this in my other posts anyway…

So in order to explain why some architecture won't work, I try to find an analogy or a model that would immediately make the problem obvious. Some of these models are very dependent on the problem I'm dealing with, while some others I keep reusing. Some of the models are obviously not practical, but they highlight certain features of the system. It would be nice if these models could be implemented as actual tools (like Driver Verifier), but the reality is that in some cases the effort to write something like this would not justify the benefits… So I guess most of them will remain in the realm of thought experiments, but they can be useful nevertheless...

I'll go through a list of commonly asked questions and the models that I find help explain the problem. I'm sure most of the readers of this post could contribute their own examples so please do so through the comments.

Q: Why not send the file name directly to our minifilter from a service or some other user mode program ?
A: it really depends on the other minifilters on the system. The model here is a minifilter that implements ALL of the namespace perfectly, with file IDs and hardlinks and so on, at its level, and below itself keeps a flat structure where all streams are identified by GUIDs and there are no directories. If your minifilter happens to be below such a filter, then obviously the name of the file at your level (which is a GUID) has absolutely nothing to do with the name the user mode service sees (which can be a regular path). Now, it must be said that any minifilter that does anything like this to the namespace would be in the virtualization group, so if you are above the virtualization group you don't have this problem. But if you are IN or below the virtualization group, then you must take this into account.

Q: Why not communicate with my minifilter through a private communication channel and have it open and read files on behalf of my service ?
A: if you are in or below the virtualization group, see the example above. If you are below the AV group, then you should always think about malware. Let's say you do something very benign, like open your own file and read some configuration data (as opposed to opening and parsing or executing random user files). If there is a vulnerability with your parsing code, this allows someone to write a file based exploit targeting your product and no AVs will be able to see your accesses to the file and catch the vulnerability. Unfortunately, there isn't a good generic malware model so you need to construct your own every time you need to explain why bypassing some security measure is not a good idea…

Q: Why not create a back-up of a VHD file while the volume is mounted ? (which is another way of saying "why not try to read the data on a mounted volume by directly accessing the sectors ?").. This is a question that's not really related to file systems but to the storage stack.. However, I find a lot of people are confused about this and keep trying to read mounted volumes.
A: the model I find helps is that of a volume with a file system on top that on volume mount reads everything into memory and then it only writes the odd bytes (byte 1, 3, 5 and so on) of anything and keeps the even bytes in a cache, until it gets either a flush or a dismount. This makes immediately visible what would happen if you tried to read it. However, once I mention this people immediately ask whether we could flush and then take a snapshot, but then I point out that immediately after the flush the system might already have received some writes and then only the odd bytes have been written so you need a way to guarantee that no more writes happen on the file system, and the only way to do that is to dismount it.

Probably the most powerful model that exposes a lot of issues with filters (not only file system filters; any filters of any component really) is the "filter attached on top of itself" model. This is important because in general anything you can do in your filter someone else can do in theirs. For example, let's say the discussion is whether creating a new FSCTL that is currently unused and sending it down the FS stack to your filter is a good idea (spoiler: it's not). In the general case this wouldn't work with your filter attached twice, since all the FSCTLs will be captured by the top instance of the filter. This might not be an obvious problem (because depending on what the filter should do with the FSCTL, it might still work fine), but then consider that someone else can write a filter just like yours using the same FSCTL derived through the same mechanism, and then you can expect more serious problems. So in this particular case you would want to make sure to use a communication mechanism guaranteed to deliver messages directly to your filter, like a control device or (if using a minifilter) communication ports. The same applies to file names (what if there already is a file with that name?) and other named resources. Thinking about what would happen if your filter were attached on top of itself is always an interesting thought experiment and highly recommended, since it will expose potential problems with your design. Once you know what the problems are you can decide how likely they are to happen and whether you should address them.


I thought I had more models and I should have done a better job at keeping track of them but I can't remember anymore right now. I will update the post when I do.

Thursday, October 21, 2010

Filtering in the Windows Storage Space

This post assumes that the reader has some knowledge about the IO subsystem in Windows.

The file system stack is simply a set of drivers between the IO manager and the file system (including the file system). These drivers are usually referred to as file system filters. In general the file system is the component that implements the hierarchy of files and directories and perhaps an additional set of features (like byte-range locking or hardlinks and so on). The file system filters usually add some functionality on top of what the file system provides (such as encryption or replication or security (think anti-virus scanners), quota management and so on). Most of these features could be implemented at any of these layers (for example, byte-range locking is usually done in the file system, but a filter can do it as well…). The decision is usually driven by customer requirements and even in the OS itself some things are done in filters, so that customers that don't need the feature don't pay the price.

For a pretty complete list of types of things file system filters can do, one can take a look at the list here. Of course, this is not a complete list, but still it shows how rich the ecosystem really is. I remember hearing that an average user on a Windows machine is running around 4 or 5 file system filters, usually without even realizing it.

The interface between the IO manager and the file system is very rich and complex. There are very many rules and everything is asynchronous, which makes things very complicated. On top of this, while there is support in the NT model for filtering, it doesn't really provide some of the facilities that file system filter writers need (for example, there is not a lot of support for getting the name of a file or for attaching context to a certain file). This is where minifilters come in. The minifilter infrastructure was written primarily to address things that almost all file system filters need, without really changing the filtering model too much (which is why I'm avoiding the phrase "minifilter model": it doesn't really change the IO model, it just adds some features to it). This is all implemented via a support driver called filter manager. Filter manager is a legacy filter that is part of the operating system and it provides things such as:
1. Support for contexts
2. An easier model for attaching to a volume
3. An easier model for file name querying
4. Support for unloading filters
5. A predictable filtering order
6. Easier communication between a user mode service and a driver

Some of these are just nice features (like context support; a legacy filter can still reliably implement its own scheme if it wants to) while some are downright impossible in the legacy model (for example, it used to be very problematic to guarantee that an anti-virus filter would not be loaded below an encryption filter, which would make scanning files useless).

The numbers that I've heard were that a legacy filter needs about 5000 lines of (very complicated and highly sensitive) code to just load and do nothing. With the minifilter model I'd say less than 50 are necessary, and most of them are just setting up structures and such.
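For scale, a minimal do-nothing minifilter looks roughly like the sketch below. This is a sketch only: it needs the WDK headers to build, most callbacks are left NULL, and error handling is reduced to the essentials; the real work of a filter would go into the `OperationRegistration` callback array.

```c
/* Sketch only: requires the WDK (fltKernel.h); will not build in user mode. */
#include <fltKernel.h>

PFLT_FILTER gFilterHandle;

NTSTATUS
MyUnload(FLT_FILTER_UNLOAD_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(Flags);
    FltUnregisterFilter(gFilterHandle);
    return STATUS_SUCCESS;
}

const FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),   /* Size */
    FLT_REGISTRATION_VERSION,   /* Version */
    0,                          /* Flags */
    NULL,                       /* ContextRegistration */
    NULL,                       /* OperationRegistration: filter nothing yet */
    MyUnload,                   /* FilterUnloadCallback */
};

NTSTATUS
DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    NTSTATUS status;
    UNREFERENCED_PARAMETER(RegistryPath);

    /* Register with filter manager, then start receiving IO. */
    status = FltRegisterFilter(DriverObject, &FilterRegistration, &gFilterHandle);
    if (NT_SUCCESS(status)) {
        status = FltStartFiltering(gFilterHandle);
        if (!NT_SUCCESS(status))
            FltUnregisterFilter(gFilterHandle);
    }
    return status;
}
```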

Of course, a legacy filter can do all a minifilter can because filter manager itself is a legacy filter and it doesn't use private or undocumented interfaces. However, since the minifilter model is supported on all platforms since Windows 2000 there is really no reason for anyone developing a new filter to write a legacy filter. At least, that's my view. There are some people who disagree with this statement (as with any other model in fact) but the fact is that Microsoft is moving towards making the legacy model obsolete.

It is important to note that the storage infrastructure consists of two big parts: the file system stack and the disk stack. The disk stack deals with the IO that is issued by the file system. The file system stack encapsulates all the complexity of operating with files and folders and such, and issues just sector reads and writes to the disk stack. The disk stack has no concept of byte-range locks, files and so on; what it deals with are sectors. Filters in this space are categorized both by what they filter (disk, partition or volume) and by the functionality they provide (encryption, compression, replication and so on). For example, filters can offer things like volume snapshots, full volume or full disk encryption, volume or partition replication, performance monitoring at all levels and so on.

As you can see, the storage subsystem is very rich and most of the time filters play a huge role in it (at least in the Windows world, where one can't just modify the source to add features to an operating system component). However, with so many ways to do things it is sometimes hard to know what architecture is best suited for a certain type of problem, and unfortunately selecting the wrong one can have a huge impact on the cost and complexity of a project.

Monday, September 20, 2010

Namespaces (part 1) - the OB namespace

I've been getting a lot of hits to my page about name usage in file system filters so I've decided to expand on the subject of names a bit further. This blog post is more about software design (and especially about OS design) and less about file system filters.

The role of language in shaping the way we think is a very interesting subject and one I've been interested in for a while. The book "Language in Thought and Action" is a very good introduction to it. One of the ideas in the book is that the mapping of names to objects changes the way we think about the objects. While this is true to a certain extent in programming (think about how often you've heard the phrase "well, this API would have been better named BlahBlah…"), computer science as a discipline has a completely new class of problems that I'd like to focus on in this post: the problems associated with actually designing namespaces. I'm not sure why designing and identifying namespaces isn't as popular a topic in computer science circles as concepts like indirection and variable scope, because it's at least as important.

I don't think writing a formal definition of a namespace would actually be very interesting so I'll go straight to some examples of namespaces.

Probably the best known one is the file system namespace. The main elements of this namespace are file and directory names, and the namespace serves to map file paths to streams of bytes. Also quite well known is the registry, which serves a very similar purpose. For people writing kernel mode drivers in Windows, another pretty familiar one is the object manager namespace (or the OB namespace), where object names are used to identify kernel objects.

In some operating systems users are used to seeing and working with other namespaces grafted into the main OS namespace (in Windows users don't usually see the OB namespace, but it can be explored using tools like WinObj). For example, the storage devices namespace, the COM ports namespace or the running processes namespace.

For developers some familiar namespaces are the types namespace and the variables namespace (in the compiler).

But there are others that are even more interesting. For example, a namespace doesn't have to use ASCII or UNICODE strings to identify objects. If one were to use numbers (1, 2, 3 and so on) the namespace would be an array. Similarly, process handles form a namespace, where the handle is used as the name. By now it's probably pretty clear that any key-value type of structure is a namespace. Even memory is a namespace, where the name is the address.
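Since any key-value structure qualifies, a namespace can be as small as a handle table, where the "name" handed back to the caller is just an index. This is a made-up user-mode sketch, not the real Windows handle table:

```c
#include <assert.h>
#include <stddef.h>

/* A tiny handle table: the "name" is just a small integer index. */
#define TABLE_SIZE 8

typedef struct {
    void *slots[TABLE_SIZE];
} handle_table_t;

/* "Create a name": find a free slot and return its index, or -1 if full. */
static int ht_insert(handle_table_t *t, void *object) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = object;
            return i;
        }
    }
    return -1;
}

/* "Resolve a name": map the handle back to the object it names. */
static void *ht_lookup(handle_table_t *t, int handle) {
    if (handle < 0 || handle >= TABLE_SIZE)
        return NULL;
    return t->slots[handle];
}
```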

Now that we have some examples of namespaces we can look at some choices the designers of these namespaces made and what is the impact of those choices on the way they are used.

First, let's look at the object manager namespace in windows (which, as I said before, I'll refer to as the OB namespace).

I'll start by listing some of the properties of this namespace. The names in the OB namespace are UNICODE strings. As is usually the case with namespaces where the names are strings, the namespace implements a hierarchy of names, and it is public. Some interesting features are that it supports links from one point in the namespace to another and that it supports objects that don't have a name at all (we could treat anonymous OB objects as a separate namespace, but that's not particularly interesting).

Support for anonymous objects is by far the choice with the biggest impact, because it means that whoever implements the namespace can't use the fact that an object was removed from the namespace as an indication that the object should be deleted. So they must use some other technique to track object usage, and in the case of OB that technique is reference counting. From a user's perspective this means they have to do the little dance of increasing the reference count before sharing the object with anyone and decreasing it when they're done using it. It also means that removing an object from the namespace (a delete) can happen immediately (as opposed to happening when the object is closed, as in file systems). Another implication of this architecture is that it's hard to keep logs, because an object might not have a name, so how does one log it? The memory address doesn't usually convey any information about the object.
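The "little dance" can be sketched in a few lines of user-mode C (hypothetical names; in the kernel the equivalents are ObReferenceObject and ObDereferenceObject): removing the object from the namespace merely drops one reference, and deletion only happens when the last reference goes away.

```c
#include <assert.h>

/* An object whose lifetime is driven by a reference count, not by its
 * presence in any namespace (made-up structure, not the real OB header). */
typedef struct {
    int refcount;
    int deleted;   /* set when the last reference goes away */
} obj_t;

static void obj_reference(obj_t *o) {
    o->refcount++;
}

static void obj_dereference(obj_t *o) {
    if (--o->refcount == 0)
        o->deleted = 1;   /* a real implementation would free it here */
}
```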

The fact that a namespace supports links is also quite interesting. The designer needs to decide whether to support links to directories in the namespace or just links to "leaves" (like files). For example, NTFS supports hardlinks only between files, not directories. The OB namespace, however, supports links to directories, which means the OB namespace can contain loops, so the designer must come up with a way to deal with them. Another interesting implication is that the caller might need to remember the way they arrived at an object in the namespace (the path to that object) in a way that takes links into account. The OB namespace doesn't do that, but it is required for some features (like file system symlinks), so the users of the namespace must implement it themselves.
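One common way to deal with the loops that directory links introduce is to bound the traversal. This toy C sketch (made-up structures, not the OB implementation) counts reachable nodes and simply refuses to follow paths deeper than a fixed limit, so a link back to an ancestor can't make it run forever:

```c
#include <assert.h>
#include <stddef.h>

/* A tiny directory graph: each node has up to two children. A link back
 * to an ancestor creates a loop, so traversal caps its depth. */
#define MAX_DEPTH 16

typedef struct node {
    struct node *child[2];
} node_t;

/* Count node visits, refusing to follow paths deeper than MAX_DEPTH. */
static int count_visits(node_t *n, int depth) {
    if (n == NULL)
        return 0;
    if (depth >= MAX_DEPTH)
        return 0;   /* loop (or a very deep path): stop following links */
    return 1 + count_visits(n->child[0], depth + 1)
             + count_visits(n->child[1], depth + 1);
}
```

A depth cap is the crudest answer; real designs can also track the set of directories already visited on the current path, which is essentially what file systems do to reject symlink cycles.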

One final characteristic is that the namespace is hierarchical. Hierarchical namespaces have some advantages from the perspective of the implementer since they allow grouping objects that belong together. The main advantages are security and support for isolation. A flat namespace on the other hand is easy to implement, but it's very limited as it is basically just a hash.

To get a better picture of the implications of implementing a hierarchical namespace versus a flat one, let's consider some namespaces that don't support hierarchies, like the named synchronization primitives namespace in Windows (events, mutexes and so on). It's easy to get name collisions, so each Windows application must make sure it's using a name that no one else is using. And from a security perspective there is no way to limit listing the names: you can either prevent someone from seeing any of the names or allow them to see all of them. Access control is possible, but only on a case-by-case basis, and there usually isn't a way to inherit security permissions from another object.
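The collision problem of a flat namespace is easy to demonstrate with a toy model (made-up names and API; note that the real Win32 CreateEvent actually opens the existing object and reports ERROR_ALREADY_EXISTS rather than failing outright):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A flat namespace: one global list of names, no directories. Two
 * unrelated components that pick the same name collide immediately. */
#define MAX_NAMES 8

static const char *names[MAX_NAMES];

/* Create a named object; fails if the name is already taken. */
static int ns_create(const char *name) {
    for (int i = 0; i < MAX_NAMES; i++)
        if (names[i] && strcmp(names[i], name) == 0)
            return -1;   /* name already in use somewhere in the system */
    for (int i = 0; i < MAX_NAMES; i++)
        if (names[i] == NULL) {
            names[i] = name;
            return 0;
        }
    return -1;           /* namespace full */
}
```

With no hierarchy there is no per-directory scope to hide behind, which is why applications end up decorating names with GUIDs or process IDs to stay unique.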

The isolation part is also pretty important. For example, consider the fact that Windows supports sessions. It helps to keep the resources that are semantically linked to a session in a directory, so they can be easily enumerated and operated on (even if they are just links to the actual objects). Isolation is also really useful in virtualization, because the user of that part of the namespace doesn't necessarily see all the available objects, just the ones they're supposed to see.

This is getting pretty long so I'll stop here and talk about the file system namespace in a different post. If there is enough interest I might talk about other namespaces like the processes namespace (please leave some comments if this sounds interesting to you).

Saturday, September 18, 2010

I'm back

Hello everyone, I'm sorry I've been neglecting this blog for the past couple of months; a lot of things have changed and I've been really busy trying to adjust. I'm still not quite there yet, but I'll try to do a better job with the blog from now on.

The good news is that I've been thinking about all sorts of things that I think would make good posts and so I should have some new material coming up.

In the meantime, please feel free to let me know if you have any suggestions for future topics and I'll do my best.