Friday, November 12, 2010

Some thoughts on FltDoCompletionProcessingWhenSafe

I've been meaning to talk about this for a while. There is a warning in the MSDN page for FltDoCompletionProcessingWhenSafe which is pretty interesting:

Caution   To avoid deadlocks, FltDoCompletionProcessingWhenSafe cannot be called for I/O operations that can be directly completed by a driver in the storage stack, such as the following:
• IRP_MJ_READ
• IRP_MJ_WRITE
• IRP_MJ_FLUSH_BUFFERS

Let's start by looking a bit at how file systems handle requests. There are multiple ways in which file systems can complete user requests, but largely they fall into a few cases. I'd like to point out that I'm simplifying things here; there are many ways in which file systems might handle operations, and the same goes for storage devices… What I'm describing is not an exhaustive list of how things happen in a file system and storage stack, but rather a plausible way in which they can happen in some file systems in some cases:
• Synchronous - when all the data is readily available then the file system doesn't need to do any additional steps and can just perform the operation and return to the caller. For example, when setting the delete disposition on a file, the file system only needs to access the FCB and set the flag (because the delete disposition is a flag on the FCB). If the file system can acquire the FCB immediately it can just set the flag to whatever disposition the caller wanted, release the FCB and call IoCompleteRequest. When this happens the completion routines (and the postOp callbacks for minifilters) are actually called in the same thread as the original operation, at the same IRQL (which is very likely at PASSIVE_LEVEL)...
• Queued (asynchronous) - this happens when the file system realizes it can't complete the operation immediately and needs to pend the request and complete it when some condition occurs. There are a lot of cases where this happens, for example when the file system needs to acquire some resource and doesn't want to wait for it inline. Another case where this is pretty much the only course of action is when the caller registers notifications for something (oplocks, directory changes and such) and the IRP gets pended. In these cases, the postOp callbacks will generally be called in the context of the thread that released the resource or that did something to trigger the notification (acknowledge an oplock break, rename a file and so on). This is usually a different thread from the one the request came in on, and usually the IRQL is <= APC_LEVEL.
• Forwarded - this can happen when the file system needs to get some data from the storage device and it simply forwards the request to the underlying device. For example, let's say that a user wants to read some aligned data from a file. The file system might simply calculate where the data begins on disk (by consulting its allocation maps, which we'll assume are cached so no reading from the device is necessary), change the offset in the IRP_MJ_READ parameters to the right sector where the data is located, lock the buffer in memory and then call IoCallDriver. When this request is satisfied by the storage stack, it will call IoCompleteRequest and the file system will pretty much not do anything (or free some resources or some such) and then let the request go up. In this case, the thread in which the postOp callback gets called is the thread that was running when the disk IO was completed by the device (the IO will be completed in an interrupt, which will likely queue a DPC, which will then execute in whatever thread context the CPU happened to be running when the interrupt triggered) and at DPC_LEVEL.
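The three cases above are exactly the situation FltDoCompletionProcessingWhenSafe is meant to deal with. Here is a hedged sketch of the decision it makes, written with user-mode stand-ins (the IRQL enum, the globals and all function bodies are simplifications invented for illustration, not the fltKernel.h definitions), so the control flow can actually be compiled and followed:

```c
#include <assert.h>

/* User-mode stand-in for KIRQL / KeGetCurrentIrql(). */
typedef enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1, DISPATCH_LEVEL = 2 } IRQL;

static IRQL g_CurrentIrql = PASSIVE_LEVEL;

/* The "safe" completion work: something that must run at IRQL <= APC_LEVEL,
   e.g. acquiring a resource or touching paged data. */
static void SafePostProcessing(void *context)
{
    assert(g_CurrentIrql <= APC_LEVEL);
    (void)context;
}

/* Conceptual sketch of what FltDoCompletionProcessingWhenSafe does:
   run the callback inline if the current IRQL is already safe, otherwise
   defer it to a worker thread that runs at PASSIVE_LEVEL. */
static int DoCompletionProcessingWhenSafe(void *context)
{
    if (g_CurrentIrql <= APC_LEVEL) {
        SafePostProcessing(context);  /* safe to run inline, same thread */
        return 0;                     /* completed synchronously */
    }
    /* At DISPATCH_LEVEL: queue the work instead of running it here.
       Simulated by "switching" to a passive-level worker thread. */
    g_CurrentIrql = PASSIVE_LEVEL;
    SafePostProcessing(context);
    return 1;                         /* work was deferred */
}
```

In a real minifilter, the postOp callback hands a safe-post callback to FltDoCompletionProcessingWhenSafe, and when the work has to be deferred the operation is completed later from the worker thread (the postOp callback returns FLT_POSTOP_MORE_PROCESSING_REQUIRED).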

Now, in a lot of cases the file system will need to perform a bunch of things in response to one single user request. For example, a request to write something might mean the file system will need to do at least the following (please ignore the order of the operations here):
• Write the data
• Update the last access time
• Update the file size
All these changes need to be saved to different places on disk (usually, it really depends on the filesystem) so the request might be pended by the file system while it issues a bunch of different IO requests to the storage device and when all of them complete it can complete the request. So in most cases operations are a combination of queued and forwarded operations.

The reason I went into all of this is that I wanted to make this point: in most cases, the postOp callback will be called at DPC only if the operation required one or more IOs to be sent to the storage device and the file system didn't need to synchronize the operation back to some internal thread and instead simply had a passthrough completion routine (see FatSingleAsyncCompletionRoutine in the FASTFAT sample). The file system will not usually complete an operation at DPC in other cases (again, different file systems do things differently so it MIGHT still happen).

Now, this means that either the warning or the function is useless, because the only reason FltDoCompletionProcessingWhenSafe exists is to enable minifilters to write completion routines that use functions requiring IRQL <= APC_LEVEL without worrying about whether the postOp callback is called at DPC. So if, according to the warning, "FltDoCompletionProcessingWhenSafe cannot be called for I/O operations that can be directly completed by a driver in the storage stack", then this is like saying that FltDoCompletionProcessingWhenSafe cannot be called for operations that might be completed at DPC_LEVEL, which is the only case where it is useful.

I'll talk about the actual deadlocks in a post next week.

Thursday, November 4, 2010

ObQueryNameString can return names with a NULL Buffer (and an example with SR.sys)

ObQueryNameString is a very useful API. It's used in a lot of places and is a pretty good choice if you want to find the name for an OB object. However, using it is not without pitfalls. At the moment the documentation page on MSDN has this to say in the Remarks section: "If the given object is unnamed, or if the object name was not successfully acquired, ObQueryNameString sets Name.Buffer to NULL and sets Name.Length and Name.MaximumLength to zero.". What is not clearly spelled out there is the fact that the return status in this case will be STATUS_SUCCESS.

So let's recap. Any app developer can call ObQueryNameString and get STATUS_SUCCESS, but the Name.Buffer will be NULL and they might not expect that. I've seen this issue over and over again. People get a reference to an object and they query the name, get a NULL buffer and then try to read/compare/do whatever with it and they get a visit from the bugcheck fairy. Please note that since the Length and MaximumLength are both 0, people would be safe using the Rtl functions, since they tend to check these sorts of things.
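A defensive check for this pitfall might look like the following. The struct layouts are user-mode stand-ins mirroring the kernel definitions, and ObjectNameIsUsable is a hypothetical helper name, so the pattern can be compiled and exercised outside a driver:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* User-mode stand-ins that mirror the kernel definitions. */
typedef struct {
    unsigned short Length;        /* in bytes, not characters */
    unsigned short MaximumLength; /* in bytes */
    wchar_t *Buffer;
} UNICODE_STRING;

typedef struct {
    UNICODE_STRING Name;
} OBJECT_NAME_INFORMATION;

/* NT_SUCCESS alone is not enough: ObQueryNameString can return
   STATUS_SUCCESS with Name.Buffer == NULL and Name.Length == 0 for an
   unnamed (or just-deleted) object, so check the buffer as well. */
static int ObjectNameIsUsable(const OBJECT_NAME_INFORMATION *info)
{
    return info != NULL &&
           info->Name.Buffer != NULL &&
           info->Name.Length > 0;
}
```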

It is also interesting to understand how people get bitten by this. The documentation specifies that "If the given object is unnamed, or if the object name was not successfully acquired...", which I guess for most people translates into "if the name was not successfully acquired then I will get some error NTSTATUS… if this object is unnamed then it's not clear what I get, maybe also some error code?..". So I suppose that people who create named objects that they own (or objects that the system creates and are guaranteed to be named) imagine that they can never get the NULL buffer and STATUS_SUCCESS. But any named object can become unnamed when it is deleted. After all, the namespace entry is simply an additional reference to the object, and deleting a named object simply deletes that reference, but the object might still be kept around by other references. One easy way to see this is to follow the calls to IoCreateDevice. For example, for an unnamed device one can see this:

Immediately after an IoCreateDevice call for an unnamed device:

3: kd> !devobj 93602e48  
Device object (93602e48) is for:
  \FileSystem\FltMgr DriverObject 92cb6660
Current Irp 00000000 RefCount 0 Type 00000003 Flags 00000080
DevExt 93602f00 DevObjExt 93602f30 
ExtensionFlags (0x00000800)  
                             Unknown flags 0x00000800
Device queue is not busy.

3: kd> !object 93602e48  
Object: 93602e48  Type: (922d6440) Device
    ObjectHeader: 93602e30 (new version)
    HandleCount: 0  PointerCount: 1

And immediately after a named device:
2: kd> !devobj 930e0628  
Device object (930e0628) is for:
 FltMgr \FileSystem\FltMgr DriverObject 92f691e8
Current Irp 00000000 RefCount 0 Type 00000008 Flags 000000c0
Dacl 96fd2eec DevExt 00000000 DevObjExt 930e06e0 
ExtensionFlags (0x00000800)  
                             Unknown flags 0x00000800
Device queue is not busy.

2: kd> !object 930e0628
Object: 930e0628  Type: (922d7508) Device
    ObjectHeader: 930e0610 (new version)
    HandleCount: 0  PointerCount: 2
    Directory Object: 96e61948  Name: FltMgr

Please notice how the pointer count is different. Once the named device is deleted (IoDeleteDevice), the reference from the OB namespace is removed (and the object's name in the OB header is changed) and then, when the reference count eventually reaches 0, the object is freed. However, if anyone calls ObQueryNameString on one of those references, they will get the NULL Name.Buffer...

So it is perfectly possible for a driver that is working with an object that it knows must be named to actually hit the window between the time the object is removed from the OB namespace and the time the final reference is released (the driver will of course have a reference of its own in order to be able to access the object…). So what this means is that calling ObQueryNameString might return STATUS_SUCCESS and a NULL Name.Buffer even for a named object.

I've recently had the pleasure to debug an issue with SR.sys and my virtual volume driver on XP SP3. I will share it since it was somewhat interesting and it points to this specific issue. This is what the stack looks like:

1: kd> lm v m sr
start    end        module name
f8489000 f849af00   sr         (pdb symbols)          d:\symbols\sr.pdb\9D5432B7234C4CD2A8F6275B9D9AF41F1\sr.pdb
    Loaded symbol image file: sr.sys
    Image path: sr.sys
    Image name: sr.sys
    Timestamp:        Sun Apr 13 11:36:50 2008 (480252C2)
    CheckSum:         00012604
    ImageSize:        00011F00
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
The problem in SR is this one:
sr!SrGetObjectName+0xd4:
f849105c ff15c49b48f8    call    dword ptr [sr!_imp__ObQueryNameString (f8489bc4)]  <- call ObQueryNameString
f8491062 3bc3            cmp     eax,ebx   <- check for STATUS_SUCCESS
f8491064 894514          mov     dword ptr [ebp+14h],eax  <- save the status...
f8491067 7c24            jl      sr!SrGetObjectName+0x105 (f849108d)
f8491069 0fb707          movzx   eax,word ptr [edi]   <-  this is the Length member of the UNICODE_STRING for the name
f849106c 8b4f04          mov     ecx,dword ptr [edi+4]  <- this is the Buffer member of the UNICODE_STRING..
f849106f d1e8            shr     eax,1   <- calculate the number of characters instead of the number of bytes
f8491071 66897702        mov     word ptr [edi+2],si   <- write some value in MaximumLength… 
f8491075 66891c41        mov     word ptr [ecx+eax*2],bx  <-   write in the buffer a 0 (basically, make sure the string is NULL terminated).. But ECX can be NULL
The stack when I hit this problem looks like this:
1: kd> kbn
 # ChildEBP RetAddr  Args to Child              
00 f80b7944 f849440d 00000000 81dc0a18 e10eac08 sr!SrGetObjectName+0xed
01 f80b7990 f848ecf2 81dc0a18 8239a818 f80b79c0 sr!SrCreateAttachmentDevice+0x99
02 f80b79c4 f848ee0f 8239a818 8239a8d0 81fb4d48 sr!SrFsControlMount+0x2e
03 f80b79e0 804ef18f 8239a8d0 81fb4c90 81fb4c90 sr!SrFsControl+0x4b
04 f80b79f0 80581bc7 00000000 81dc0a18 806e6a4c nt!IopfCallDriver+0x31
05 f80b7a40 804f53d6 c000014f f80b7b00 00000000 nt!IopMountVolume+0x1b9
06 f80b7a70 80582bc0 81e1f268 81dc0a18 f80b7ba4 nt!IopCheckVpbMounted+0x5e
07 f80b7b60 805bf444 81dc0a18 00000000 81fc6600 nt!IopParseDevice+0x3d8
08 f80b7bd8 805bb9d0 00000000 f80b7c18 00000040 nt!ObpLookupObjectName+0x53c
09 f80b7c2c 80576033 00000000 00000000 00000001 nt!ObOpenObjectByName+0xea
0a f80b7ca8 805769aa 009bef80 00100001 009bef24 nt!IopCreateFile+0x407
0b f80b7d04 8057a1a9 009bef80 00100001 009bef24 nt!IoCreateFile+0x8e
0c f80b7d44 8054161c 009bef80 00100001 009bef24 nt!NtOpenFile+0x27
0d f80b7d44 7c90e4f4 009bef80 00100001 009bef24 nt!KiFastCallEntry+0xfc
0e 009beef4 7c90d58c 7c80ec86 009bef80 00100001 ntdll!KiFastSystemCallRet
0f 009beef8 7c80ec86 009bef80 00100001 009bef24 ntdll!NtOpenFile+0xc
10 009bf1f0 7c80ef87 01be31e8 00000000 01be7bf0 kernel32!FindFirstFileExW+0x1a7
11 009bf210 751b1e05 01be31e8 01be7bf0 751a2a04 kernel32!FindFirstFileW+0x16
12 009bf240 751aad1f 009bf714 00000001 000e1358 srsvc!Delnode_Recurse+0x12e
13 009bfb34 751abd1f 009bfd54 7c97b440 7c97b420 srsvc!CEventHandler::OnFirstWrite_Notification+0x3cd
14 009bff60 7c927ba5 00000000 0000006a 000e5f40 srsvc!IoCompletionCallback+0x17a
15 009bff74 7c927b7c 751abba5 00000000 0000006a ntdll!RtlpApcCallout+0x11
16 009bffb4 7c80b713 00000000 00000000 00000000 ntdll!RtlpWorkerThread+0x87
17 009bffec 00000000 7c910230 00000000 00000000 kernel32!BaseThreadStart+0x37
So as you can see, in the mount path SR.sys is trying to create its device to attach to the volume and while doing that it tries to get the name of this device:
1: kd> !devobj 81dc0a18 
Device object (81dc0a18) is for:
  \Driver\IvmVhd DriverObject 81fad590
Current Irp 00000000 RefCount 1 Type 00000007 Flags 00000050
Vpb 81ea0f10 Dacl e1f17924 DevExt 81dc0ad0 DevObjExt 81dc0c30 
ExtensionFlags (0x00000002)  DOE_DELETE_PENDING
Device queue is not busy.
This happens to be my virtual volume device which, as you can tell from the DOE_DELETE_PENDING flag, is about to be torn down. So what this all looks like is this:

1. Something is trying to open a file, see the IopCreateFile call (frame 0xa)

2. The IO manager, while trying to send the IRP_MJ_CREATE irp (frame 7), wants to make sure the volume is mounted. Please note that at this point the volume device is still in the OB namespace, since otherwise the ObpLookupObjectName call (frame 8) would not have been able to reference it… So at this point the IO manager has resolved the name to a device object and now has a reference to the device...

3. IopCheckVpbMounted (frame 6) finds the volume is not mounted (since I dismount it before tearing it down) so it tries to mount it…

4. SR.sys gets the mount request and is trying to build a device to attach to the newly mounted volume (in case the mount succeeds). This is pretty standard stuff for a legacy filter… Anyway, in doing so it calls ObQueryNameString which no longer finds a name for the device and returns a NULL buffer. SR checks for NT_SUCCESS but doesn't check the buffer to make sure it's not null (or even the length which is 0) and it blindly tries to make sure the string is NULL terminated (which is also pointless since ObQueryNameString documentation mentions that "The object name (when present) includes a NULL-terminator and all path separators "\" in the name.")… bugcheck.

What my driver did was simply call IoDeleteDevice somewhere between frame 8 and frame 0.

I'm willing to bet that not checking for the null Name.Buffer is a pretty common mistake. For example, there is some code posted on a blog that looks like this:

Status = ObQueryNameString(FileObject->Vpb->RealDevice, OBI, Returned, &Returned);

if (NT_SUCCESS(Status)) {
    if (Root) {
        wcscat(OBI->Name.Buffer, L"\\");
        OBI->Name.Length += sizeof(WCHAR);
    }
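A corrected version of that fragment would check the buffer and the available space before appending the separator. The types below are user-mode stand-ins mirroring the kernel definitions, and AppendBackslash is a hypothetical helper written for this post (not actual SR or blog code); it just shows the checks that were missing:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* User-mode stand-in mirroring the kernel UNICODE_STRING. */
typedef struct {
    unsigned short Length;        /* bytes */
    unsigned short MaximumLength; /* bytes */
    wchar_t *Buffer;
} UNICODE_STRING;

/* Hypothetical fixed-up version of the quoted fragment: only append the
   separator when the buffer is non-NULL and has room for the extra
   character plus the terminator. Returns 1 on success, 0 otherwise. */
static int AppendBackslash(UNICODE_STRING *name)
{
    if (name == NULL || name->Buffer == NULL)
        return 0;   /* ObQueryNameString succeeded, but object was unnamed */
    if (name->Length + 2 * sizeof(wchar_t) > name->MaximumLength)
        return 0;   /* no room for L'\\' plus the NULL terminator */
    name->Buffer[name->Length / sizeof(wchar_t)] = L'\\';
    name->Length += (unsigned short)sizeof(wchar_t);
    name->Buffer[name->Length / sizeof(wchar_t)] = L'\0';
    return 1;
}
```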

Thursday, October 28, 2010

Useful Models - how choosing the right abstraction can help design and some useful abstractions for working with minifilters

The poll on the site indicated this was the topic most people were interested in so here it is.

I find myself quite often in the position of trying to explain why something doesn't work the way someone expects it would. I guess this is due in large part to the fact that the work I do (storage and file systems) is something that people interact with quite often but that in fact operates quite differently from the abstraction it presents to its users. I've mentioned this in my other posts anyway…

So in order to explain why some architecture won't work, I try to find an analogy or a model that would immediately make the problem obvious. Some of these models are very dependent on the problem I'm dealing with while some others I keep reusing. Some of the models are obviously not practical, but they highlight certain features of the system. It would be nice if these models could be implemented as actual tools (like Driver Verifier) but the reality is that in some cases the effort to write something like this would not justify the benefits… So I guess most of them will remain in the realm of thought experiments, but they can be useful nevertheless...

I'll go through a list of commonly asked questions and the models that I find help explain the problem. I'm sure most of the readers of this post could contribute their own examples so please do so through the comments.

Q: Why not send the file name directly to our minifilter from a service or some other user mode program?
A: it really depends on the other minifilters on the system. The model here is a minifilter that implements ALL of the namespace perfectly, with file IDs and hardlinks and so on, at its level, and below itself it keeps a flat structure where all streams are identified by GUIDs and there are no directories. If your minifilter happens to be below such a filter then obviously the name of the file at your level (which is a GUID) has absolutely nothing to do with the name the user mode service sees (which can be a regular path). Now, it must be said that any minifilter that does anything like this to the namespace would be in the virtualization group, so if you are above the virtualization group you don't have this problem. But if you are IN or below the virtualization group, then you must take this into account.

Q: Why not communicate with my minifilter through a private communication channel and have it open and read files on behalf of my service?
A: if you are in or below the virtualization group, see the example above. If you are below the AV group, then you should always think about malware. Let's say you do something very benign, like open your own file and read some configuration data (as opposed to opening and parsing or executing random user files). If there is a vulnerability with your parsing code, this allows someone to write a file based exploit targeting your product and no AVs will be able to see your accesses to the file and catch the vulnerability. Unfortunately, there isn't a good generic malware model so you need to construct your own every time you need to explain why bypassing some security measure is not a good idea…

Q: Why not create a back-up of a VHD file while the volume is mounted? (which is another way of saying "why not try to read the data on a mounted volume by directly accessing the sectors?"). This is a question that's not really related to file systems but to the storage stack. However, I find a lot of people are confused about this and keep trying to read mounted volumes.
A: the model I find helps is that of a volume with a file system on top that, on volume mount, reads everything into memory and then writes only the odd bytes (bytes 1, 3, 5 and so on) of anything, keeping the even bytes in a cache until it gets either a flush or a dismount. This makes it immediately visible what would happen if you tried to read the volume directly. However, once I mention this, people immediately ask whether we could flush and then take a snapshot. Then I point out that immediately after the flush the system might already have received some writes, so again only the odd bytes have been written; you need a way to guarantee that no more writes happen on the file system, and the only way to do that is to dismount it.

Probably the most powerful model that exposes a lot of issues with filters (not only file system filters, any filters of any component really) is the "filter attached on top of itself" model. This is important because, in general, anything you can do in your filter someone else can do in theirs. For example, let's say the discussion is whether creating a new FSCTL that is currently unused and sending it down the FS stack to your filter is a good idea (spoiler: it's not). In the general case this wouldn't work with your filter attached twice, since all the FSCTLs would be captured by the top filter. This might not be an obvious problem (because depending on what the filter should do with the FSCTL, it might still work fine), but then consider that someone else can write a filter just like yours using the same FSCTL derived through the same mechanism, and then you can expect more serious problems. So in this particular case you would want to make sure to use a communication mechanism guaranteed to deliver messages directly to your filter, like a control device or (if using a minifilter) communication ports. The same applies to file names (what if there already is a file with that name?) and other named resources. Thinking about what would happen if your filter were attached on top of itself is always an interesting thought experiment and highly recommended, since it will expose potential problems with your design. Once you know what the problems are you can decide how likely each is to happen and whether you should address the issue.


I thought I had more models and I should have done a better job at keeping track of them but I can't remember anymore right now. I will update the post when I do.

Thursday, October 21, 2010

Filtering in the Windows Storage Space

This post assumes that the reader has some knowledge about the IO subsystem in Windows.

The file system stack is simply a set of drivers between the IO manager and the file system (including the file system). These drivers are usually referred to as file system filters. In general the file system is the component that implements the hierarchy of files and directories and perhaps an additional set of features (like byte-range locking or hardlinks and so on). The file system filters usually add some functionality on top of what the file system provides (such as encryption or replication or security (think anti-virus scanners), quota management and so on). Most of these features could be implemented at any of these layers (for example, byte-range locking is usually done in the file system, but a filter can do it as well…). The decision is usually driven by customer requirements and even in the OS itself some things are done in filters, so that customers that don't need the feature don't pay the price.

For a pretty complete list of types of things file system filters can do, one can take a look at the list here. Of course, this is not a complete list, but still it shows how rich the ecosystem really is. I remember hearing that an average user on a Windows machine is running around 4 or 5 file system filters, usually without even realizing it.

The interface between the IO manager and the file system is very rich and complex. There are many rules and everything is asynchronous, which makes things very complicated. On top of this, while there is support in the NT model for filtering, it doesn't really provide some of the facilities that file system filter writers need (for example, there is not a lot of support for getting the name of a file or for attaching context to a certain file). This is where minifilters come in. The minifilter infrastructure was written primarily to address some things that almost all file system filters need, without really changing the filtering model too much (which is why I'm avoiding the phrase "minifilter model", since it doesn't really change the IO model much, it just adds some features to it). This is all implemented via a support driver called filter manager. Filter manager is a legacy filter that is a part of the operating system and it provides things such as:
1. Support for contexts
2. An easier model for attaching to a volume
3. Easier model for file name querying
4. Support for unloading filters
5. Predictable filtering order
6. Easier communication between a user mode service and a driver.

Some of these are just nice features (like context support, where a legacy filter can still reliably implement their own scheme if they want) while some are downright impossible in the legacy model (for example, it used to be very problematic to make sure that an anti-virus filter would not be loaded below an encryption filter (which would make scanning files useless)).

The numbers that I've heard were that a legacy filter needs about 5000 lines of (very complicated and highly sensitive) code to just load and do nothing. With the minifilter model I'd say less than 50 are necessary, and most of them are just setting up structures and such.
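Those few dozen lines boil down to roughly the following shape. Everything above the divider is a user-mode stand-in invented so the sketch compiles and runs on its own; in a real minifilter the types and functions (FLT_REGISTRATION, FltRegisterFilter, FltStartFiltering, and a DriverEntry that receives a DRIVER_OBJECT and a registry path) come from fltKernel.h and the FLT_REGISTRATION structure carries many more fields:

```c
#include <assert.h>
#include <stddef.h>

/* ---- user-mode stand-ins for the filter manager pieces ---- */
typedef long NTSTATUS;
#define STATUS_SUCCESS 0L

typedef struct { int Registered; int Filtering; } FLT_FILTER;

typedef struct {
    unsigned short Size;
    NTSTATUS (*FilterUnloadCallback)(void);
} FLT_REGISTRATION;

static FLT_FILTER g_Filter;  /* stands in for the object fltmgr allocates */

static NTSTATUS FltRegisterFilter(const FLT_REGISTRATION *Reg,
                                  FLT_FILTER **RetFilter)
{
    (void)Reg;
    g_Filter.Registered = 1;
    *RetFilter = &g_Filter;
    return STATUS_SUCCESS;
}

static NTSTATUS FltStartFiltering(FLT_FILTER *Filter)
{
    Filter->Filtering = 1;
    return STATUS_SUCCESS;
}

static void FltUnregisterFilter(FLT_FILTER *Filter)
{
    Filter->Filtering = 0;
    Filter->Registered = 0;
}

/* ---- the do-nothing "minifilter" itself ---- */

static FLT_FILTER *gFilterHandle;

static NTSTATUS FilterUnload(void)
{
    FltUnregisterFilter(gFilterHandle);
    return STATUS_SUCCESS;
}

NTSTATUS DriverEntry(void)
{
    static const FLT_REGISTRATION FilterRegistration = {
        sizeof(FLT_REGISTRATION), FilterUnload
    };
    NTSTATUS status = FltRegisterFilter(&FilterRegistration, &gFilterHandle);
    if (status != STATUS_SUCCESS)
        return status;
    status = FltStartFiltering(gFilterHandle);
    if (status != STATUS_SUCCESS)
        FltUnregisterFilter(gFilterHandle);
    return status;
}
```

Fill in a registration structure, register, start filtering: that really is the whole load path, which is the point of the comparison with the legacy model.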

Of course, a legacy filter can do everything a minifilter can, because filter manager itself is a legacy filter and it doesn't use private or undocumented interfaces. However, because the minifilter model is supported on all platforms since Windows 2000, there is really no reason for anyone developing a new filter to write a legacy filter. At least, that's my view. There are some people who disagree with this statement (as with any other model, in fact) but the fact is that Microsoft is moving towards making the legacy model obsolete.

It is important to note that the storage infrastructure consists of two big parts, the file system stack and the disk stack. The disk stack deals with IO that is issued by the file system. The file system stack encapsulates all the complexity of operating with files and folders and such and issues just sector reads and writes. The disk stack has no concept of byte-range locks, files and so on. What it deals with is sectors. The types of filters in this space are categorized by what they filter (disk, partition or volume) as well as the functionality they provide (encryption, compression, replication and so on). For example, filters can offer things like volume snapshots, full volume encryption or full disk encryption, volume or partition replication, performance monitoring at all levels and so on.

As you can see, the storage subsystem is very rich and most of the time filters play a huge role in it (at least in the Windows world, where one can't just modify the source to add features to an operating system component). However, with so many ways to do things it is sometimes hard to know what architecture is best suited for a certain type of problem, and unfortunately selecting the wrong one can have a huge impact on the cost and complexity of a project.

Monday, September 20, 2010

Namespaces (part 1) - the OB namespace

I've been getting a lot of hits to my page about name usage in file system filters so I've decided to expand on the subject of names a bit further. This blog post is more about software design (and especially about OS design) and less about file system filters.

The role of language in shaping the way we think is a very interesting subject and one I've been interested in for a while. The book "Language in Thought and Action" is a very good introduction to the subject. One of the ideas in the book is that the mapping of names to objects changes the way we think about the object. While this is true to a certain extent in programming (think about how often you heard the phrase "well, this API would have been better named BlahBlah …"), computer science as a discipline has a completely new class of problems that I'd like to focus on in this post. The problems associated with actually designing namespaces. I'm not sure why designing and identifying namespaces isn't as popular in computer science circles as other concepts like indirection and variable scope because it's at least as important.

I don't think writing a formal definition of a namespace would actually be very interesting so I'll go straight to some examples of namespaces.

Probably the best known one is the file system namespace. The main elements of this namespace are file and directory names, and the namespace serves to map file paths to streams of bytes. Also quite well known is the registry, and it serves a very similar purpose. For people writing kernel mode drivers in Windows, another pretty familiar one is the object manager namespace (or the OB namespace), where object names are used to identify kernel objects.

In some operating systems users are used to seeing and working with some other namespaces grafted into the main OS namespace (in Windows users don't usually see the OB namespace, but it can be explored using tools like WinObj). For example, the storage devices namespace, the COM ports namespace or the running processes namespace.

For developers some familiar namespaces are the types namespace and the variables namespace (in the compiler).

But there are others even more interesting. For example, a namespace doesn't have to use ASCII or UNICODE strings to identify objects. If one were to use numbers, like 1, 2, 3 and so on, the namespace would be an array. Similarly, the process handles form a namespace, where the handle is used as the name. By now it's probably pretty clear that any key-value type of structure is a namespace. Even memory is a namespace, where the name is the address.

Now that we have some examples of namespaces we can look at some choices the designers of these namespaces made and what is the impact of those choices on the way they are used.

First, let's look at the object manager namespace in windows (which, as I said before, I'll refer to as the OB namespace).

I'll start by listing some of the properties of this namespace. The names in the OB namespace are UNICODE strings. As is usually the case with namespaces where the names are strings, the namespace implements a hierarchy of names and it is public. Some interesting features are that it supports links from one point in the namespace to another part and that it supports objects that don't have a name (we could treat anonymous OB objects as a different namespace but that's not particularly interesting).

Support for anonymous objects is by far the choice with the biggest impact, because it means that whoever implements the namespace can't use the fact that the object is removed from the namespace as an indication that the object needs to be deleted. So they must use some different technique to track object usage, and in the case of OB that technique is reference counting. From a user's perspective this means they have to do the little dance that involves increasing the reference count before sharing the object with anyone and decreasing the reference count when they're done using it. It also means that removing an object from the namespace (a delete) can happen immediately (as opposed to happening when the object is closed, like in file systems). Another implication of this architecture is that it's hard to keep logs of things, because an object might not always have a name, so how does one log it? The memory address doesn't usually convey any information about the object.
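The reference-counting dance described above can be sketched like this. This is a toy user-mode model with hypothetical names standing in for ObReferenceObject/ObDereferenceObject; the Freed flag is instrumentation for the example only, where a real implementation would actually free the memory:

```c
#include <assert.h>
#include <stdlib.h>

/* A toy reference-counted object: take a reference before sharing, drop
   it when done, and the object goes away only when the count hits zero.
   This mirrors why a named OB object can outlive its namespace entry:
   the name is just one of the references. */
typedef struct {
    long RefCount;
    int  Freed;   /* instrumentation for the example only */
} OBJECT;

static OBJECT *CreateObject(void)
{
    OBJECT *obj = malloc(sizeof(OBJECT));
    obj->RefCount = 1;   /* the creator holds the first reference */
    obj->Freed = 0;
    return obj;
}

static void ReferenceObject(OBJECT *obj)
{
    obj->RefCount++;     /* in the kernel this is an interlocked increment */
}

static void DereferenceObject(OBJECT *obj)
{
    if (--obj->RefCount == 0) {
        obj->Freed = 1;  /* a real implementation frees the object here */
    }
}
```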

The fact that a namespace supports links is also quite interesting. The designer needs to decide whether they support links to directories in the namespace or just links to "leaves" (like files). For example NTFS supports hardlinks only between files, not directories. The OB namespace however supports links to directories, which means the OB namespace can contain loops. So the designer must come up with a way to deal with potential loops in the namespace. Another interesting implication is the fact that the caller might need to remember which way they arrived at an object in the namespace (the path to that object) in a way that takes links into account. The OB namespace doesn't do that but it is required for some features (like file system symlinks) so the users of the namespace must implement that themselves.

One final characteristic is that the namespace is hierarchical. Hierarchical namespaces have some advantages from the perspective of the implementer since they allow grouping objects that belong together. The main advantages are security and support for isolation. A flat namespace, on the other hand, is easy to implement, but it's very limited as it is basically just a hash table.

To get a better picture of the implications of implementing a hierarchical namespace versus a flat one, let's consider some namespaces that don't support hierarchies, like the named synchronization primitives namespace in Windows (events, mutexes and so on). It's easy to get name collisions, so each Windows application must make sure it's using a name that no one else is using. And then from a security perspective there is no way to limit listing them: basically, you can either prevent someone from seeing any of the names or allow them to see all the names. Access control is possible, but only on a case-by-case basis, and there usually isn't a way to inherit security permissions from another object.

The isolation part is also pretty important. For example, consider the fact that Windows supports sessions. It helps to keep the resources that are semantically linked together in a directory, so they can be easily enumerated and operated on (even if they are just links to the actual objects). Isolation is really useful in virtualization because the user of that part of the namespace doesn't necessarily see all the available objects, just the ones they're supposed to see.

This is getting pretty long so I'll stop here and talk about the file system namespace in a different post. If there is enough interest I might talk about other namespaces like the processes namespace (please leave some comments if this sounds interesting to you).

Saturday, September 18, 2010

I'm back

Hello everyone, I'm sorry I've been neglecting this blog for the past couple of months; a lot of things have changed and I've been really busy trying to adjust. I'm still not quite there yet, but I'll try to do a better job with the blog from now on.

The good news is that I've been thinking about all sorts of things that I think would make good posts and so I should have some new material coming up.

In the meantime, please feel free to let me know if you have any suggestions for future topics and I'll do my best.

Thursday, February 11, 2010

Context Usage in Minifilters

I’m not sure why, but in spite of there being pretty good documentation and even a sample available, the topic of how contexts work and how filters should use them comes up a lot.

There are a couple of rules that govern contexts and pretty much everything follows from the interaction between these rules (this applies to all contexts). Please note that this is more of a design discussion and the implementation might be slightly different:

  1. When the reference count on the context gets to 0, the memory is freed.
  2. Any pointer to the context needs to have a corresponding increment on the reference count. This is done transparently when the filter requests the context via one of the functions (FltAllocateContext(), FltGetXxxContext(), FltReferenceContext() and even FltSetXxxContext() and so on).
  3. A context needs to be linked to the underlying structure (i.e. StreamContext  to the stream, StreamHandleContext to the FILE_OBJECT, VolumeContext to the volume and so on…; please note that we are talking from a design perspective, the implementation of exactly which structure has the pointer to the context might be different, but this is irrelevant for this discussion).

This is pretty much it. I’d like to walk through the most common scenarios and explain how those rules apply:

A filter allocates a context (FltAllocateContext) and it gets a pointer to a context (refcount 1). The context is not linked to anything at this point in time. If the filter calls FltReleaseContext, the refcount will drop to 0 and the context will be freed. If the filter tries to attach the context to the structure (say by using FltSetStreamContext – I’ll use StreamContexts for the rest of the discussion, and the underlying structure in this case is the SCB (Stream Context Block) or, for file systems that don’t support multiple streams per file (aka alternate data streams), the FCB), then there are three cases:

  1. It succeeds. refcount is now 2, one for the link from the SCB and the other one is the one the filter has. 
  2. It fails and the filter doesn’t get a context back (for whatever reason: memory is low or the filter passed a NULL pointer for OldContext or there is some other failure). In this case there is still only one pointer, the one the filter has, so the refcount needs to be 1.
  3. It fails and the filter gets another context back (there already was a context attached and OldContext was not NULL). Now the filter has two contexts, the original context that it has allocated which has a refcount of 1 (only the filter has a pointer to it) and a new context (though the name is OldContext), with a reference count of at least 2 (because there are at least two pointers to it, one from the underlying structure, the SCB, and one that was just returned in OldContext so the filter can use it – there could be other references from other threads, but to keep things simple we will ignore those). The filter will need to release the original context it has allocated because it can’t use it (and since the refcount was 1 this will drop it to 0 and will free it). The filter will also need to eventually release the reference it got on OldContext, after using it (which will drop it back to 1, which represents the pointer from the SCB to the context).

Before we go any further I want to discuss what a filter can do when getting a context fails for whatever reason (this includes allocation failures and failing to set or get the context). Some filters can simply ignore that object (for example, a filter trying to log file IO might make a note in the log that IO to file X will not be logged and that’s that). Other filters might work in a degraded mode (for example, an anti-virus filter that is trying to be smart about scanning a file when it’s closed might want to remember whether there was any write to the file; if it fails to get a context it might scan the file anyway… performance might be worse but it will still work). And yet another case is where a filter might simply not be able to work when it doesn’t get a context. In that case the filter might want to allocate and initialize the context early enough that the operation can be failed, usually in the Create path, so that if the allocation fails the filter can fail the Create and the file won’t be opened at all.

Yet another thing to mention is that if a filter needs to use a context at DPC (let’s say in postWrite) then the context needs to be allocated from nonpaged pool, and since the context functions are not callable at DPC the recommended way is to get the context (which might involve allocating and attaching it) in the preOperation callback and pass it through the CompletionContext to the postOperation callback, which can use it and then call FltReleaseContext to release the reference (yes, even at DPC, provided the context is allocated from nonpaged pool).

One might wonder why the strange dance with OldContext and NewContext. Couldn’t the filter just check if there is a context and only allocate one if there isn’t? Well, of course it could, but because the environment is asynchronous, multiple threads might be doing the same thing at the same time: they will all check if there is a context, find none, allocate a context and try to set it, so now you have 10 threads each trying to attach a different context to the same SCB… So the set operation needs to be a CAS (compare-and-swap) so that only one thread succeeds.

Thus a filter should not really start using a context that it allocated until it actually manages to attach it to the structure. However, immediately after the context is attached another thread might get to it, so it needs to be in some defined state; otherwise that thread will get an invalid context (more on this later). The steps need to be something like this (this is pretty much the logic in CtxFindOrCreateStreamContext in the ctx sample in the WDK):

  1. context = FltGetStreamContext().
  2. If we didn’t get one:
    1. context = NULL
    2. FltAllocateContext (NewContext)
    3. Initialize the context to whatever default values make sense. Please note that those values need to take into account the current reference as well. I’ll explain more below.
    4. FltSetStreamContext(NewContext, OldContext)
    5. If it failed:
      1. FltReleaseContext(NewContext) –> no point in keeping it around. Since we had the only reference, the refcount was 1 and it dropped to 0, so the context will be freed.
      2. If we got OldContext, context = OldContext
      3. else, we didn’t get OldContext but we also couldn’t attach our context for some reason – the filter needs to continue without a context, whatever that means… (and no, KeBugCheck is not a good idea :)… )
    6. if it didn’t fail –> context = NewContext
  3. At this point context points to the context to use. If it is NULL, something went wrong and we should bail (we could bail here or in 2.5.3., it doesn’t matter). By bail I mean we should either fail the operation, pop up a warning to the user or mark somewhere that we missed one so the results are not reliable anymore.
  4. do things with context….
  5. FltReleaseContext(context) – here we release our reference. We can do this later; for example, if we get the context in the PreOperation callback we might want to pass it via the CompletionContext to the PostOperation callback and release it there. Or we could queue a work item, pass the context to it and have the work item release it. Anyway, once the reference is released the reference count on the context will drop back to 1 (for the link from the SCB to the context).

In step 2.3. I said that the context needs to be initialized to whatever values make sense, but IT MUST take into account the current reference. Well, this is not always needed; it depends on the particular design of the filter (it’s usually needed though, so keep reading). Consider a filter that uses a StreamContext to keep track of how many threads it has doing IO on a stream, using a handle that the filter opened via FltCreateFileEx2. Let’s say that when the count gets to 0 the filter will call FltClose on the filter’s handle. Now let’s imagine a case where in step 2.3. the filter simply initializes the count to 0. The logic would be something like this:

  1. context = GetStreamContext(); // allocate new context or get the existing one. also get a handle by calling FltCreateFileEx2(…) if needed
  2. context->Count++
  3. Do things on context->Handle
  4. context->Count--;
  5. If (context->Count == 0) then FltClose(context->Handle).
  6. FltReleaseContext(context);

Do you see the problem here? What happens if there are two threads, T1 and T2, and T1 allocates the new context, initializes it so that context->Count is 0 (which means it is initialized to a default value that doesn’t take into account the current reference) and then sets the context (refcount is 2: 1 for T1 and one for the underlying SCB)? Before getting to step 2. it gets preempted by T2, which starts at the top. T2 will get a context (refcount is 3: 1 for T1, 1 for T2 and one for the SCB), it will increment the count (so context->Count is 1), it “does things”, then it decreases the count in step 4. (so context->Count is now 0) and then step 5. will proceed to close the handle. Step 6. will release the context (so the refcount drops back to 2). Then when T1 resumes it will be at step 2. and it will again increase context->Count to 1 (this is a manifestation of the ABA problem), then it will do things on context->Handle, which has been closed… And there you have it. This could have been avoided if GetContext() had initialized the newly allocated context’s Count to 1. This complicates things a bit because step 2. now needs to run only when the context was not allocated in this thread, meaning that step 2. will probably need to move into GetContext() and so on.

Another thing worth mentioning is that once the underlying object is torn down, the link from it to the context will be broken (i.e. the pointer from the underlying object will go away), so the reference count will need to be decremented. In most cases, where there are no outstanding references (no other threads using the context), the refcount will go to 0 and the context will be freed (and the filter’s context teardown callback will be called, if one was registered). This has a couple of implications. If a filter simply allocates a context, associates it with an object and then calls FltReleaseContext() (which is the normal way to set up a context), the filter doesn’t need to do anything else to make sure the context goes away. It will be torn down when the underlying object is torn down.

The other thing that follows from the fact that the context is tied to the lifetime of the underlying object is that a filter can never leave the context in a bad state assuming that it will go away, because the underlying object might hang around for a while and get reused, reviving the context. For example, for a StreamContext where a filter has a pointer to an ERESOURCE and an open count, it would be a mistake to free the ERESOURCE when the open count gets to 0 under the assumption that once the last handle goes away the SCB will go away as well, because that might not be true. The file system might cache the SCB, and if a new open to the same file comes along the file system will reuse the cached SCB, which means that the filter will get a context that has an invalid pointer to an ERESOURCE. So in this case the right place to free the ERESOURCE is in the context teardown callback.

Finally, the last thing I want to mention is what FltDeleteContext does. FltDeleteContext unlinks the context from the underlying object. So if a filter decides it no longer needs a context associated with a stream (for example), it will need to do something like this:

  1. context = FltGetStreamContext();
  2. If (context != NULL)
    1. FltDeleteContext(context);
    2. FltReleaseContext(context);

At this point it should be obvious that FltDeleteContext needs to be called before FltReleaseContext (because FltReleaseContext releases the reference associated with the context, it is not safe to use the context at all after calling it; FltDeleteContext only removes the reference from the underlying object, if it is set, and does nothing about the current reference). Please note that after FltDeleteContext unlinks the StreamContext, any thread trying to get it from the object will not find it. This means that the filter should not try to use it in a meaningful way, since other threads might not see the changes. Basically, once FltDeleteContext has been called, the filter should simply call FltReleaseContext.

I hope this makes sense. If nothing else it should be useful when you have trouble sleeping (I almost fell asleep twice while proofreading it).