Thursday, March 29, 2012

Directory Entries and File Properties

This post is about a pretty well-known behavior of NTFS that nevertheless seems to occasionally surprise people: directory entries aren't an authoritative source of information when it comes to file properties. In other words, it's perfectly possible for a "dir" command to return inaccurate information for a file. When hearing about this for the first time most people think of the situation where a file is open and actively being written to (appended, for example): the dir command returns the size the file had at the time of the query, but the file has been modified since and so that size obviously doesn't match the file size right at this moment.
What I've described above is actually a pretty straightforward case and I've not yet met anyone who was surprised or confused by it. However, following this train of thought leads us to pretty interesting places. If the situation I've described above can happen frequently, how is an application expected to get the actual file size and know that it can't change? (For an example of where this might be used, think of a copy application that's trying to figure out whether there is enough space on a volume to copy some file there before it starts the copy…) Well, the right approach is to open the file and query the file size and not rely on the directory entry. However, it is possible that the file is modified even while the application has the file opened, so if being able to "know" the file size is really important then the application should not allow other applications write access to the file while it has it open. This is done through the sharing mode parameter of the create operation: by not sharing write access the application prevents other handles from being opened for write, thus making sure the file size or contents can't change.
So let's recap. If the application must know the file size and must make sure it doesn't change, it must open a handle to the file. If there is no handle then the file size information can't be guaranteed to be accurate. So then why even bother to actually return the file size in a dir command ? It turns out that for a lot of cases knowing the file sizes without guaranteeing that they won't change is sufficient. After all, think back to all the cases where you've done a dir and looked at file sizes. I bet in most cases you didn't care if the file size changed a bit later.
So now we know that enumerating the files in a directory isn't by definition an operation that is expected to be 100% accurate and if an app needs those kinds of guarantees then the app must implement its own synchronization mechanism (that might involve opening all the files without sharing write and so on). So with this in mind, what should a file system do to implement the IRP_MJ_DIRECTORY_CONTROL with the IRP_MN_QUERY_DIRECTORY minor code ? One might be tempted to go through all the files in the directory and find their on-disk information and retrieve the file attributes and file information from there, but that would certainly be rather slow. So since the information isn't expected to be accurate, wouldn't it be better to have a sort of cache of the file information ? In fact that's how most file systems implement this. The directory actually stores information about the files it contains in a cache and it returns the information from that cache, which is much faster. One can even see this in action when comparing the time it takes to get a directory listing in CMD with the time it takes to open the directory in Explorer. Explorer displays additional information for each file (the icon) and so when it goes through a directory it will need to open each file and figure out what icon it should display. But please also note that Explorer implements an icon cache as well (there are many posts describing this icon cache, google it).
Now that we've established that a folder caches the file information, when does the cache get updated ? It would make sense that the cache gets updated when the file is closed, which is pretty much how most file systems do it. Incidentally, this explains why if you have a file that is constantly being written to in a thread and then you open it and close it from a different thread (or process) the attributes and file size get updated.
The really interesting thing happens for NTFS when you have hardlinks for a file. One might expect that the file system updates all the directories containing the file to show the new information, but that's not what happens. Instead, only the link that was used to open the file is updated. It's not even about the directory that contains that link, it's that particular link. BTW, this is documented behavior: see Hard Links and Junctions and the MSDN page for the CreateHardLink function. This is what it looks like (please note how the file size changes for the link I modify the file through, and how it then changes for the other link when I open the file through it, even without modifying it):
D:\templink>dir
 Volume in drive D is Data
 Volume Serial Number is 3817-6E24

 Directory of D:\templink

03/29/2012  11:46 AM    <DIR>          .
03/29/2012  11:46 AM    <DIR>          ..
03/29/2012  11:46 AM                 8 foo.txt
               1 File(s)              8 bytes
               2 Dir(s)  74,514,755,584 bytes free

D:\templink>mklink /H bar.txt foo.txt
Hardlink created for bar.txt <<===>> foo.txt

D:\templink>dir
 Volume in drive D is Data
 Volume Serial Number is 3817-6E24

 Directory of D:\templink

03/29/2012  11:47 AM    <DIR>          .
03/29/2012  11:47 AM    <DIR>          ..
03/29/2012  11:46 AM                 8 bar.txt
03/29/2012  11:46 AM                 8 foo.txt
               2 File(s)             16 bytes
               2 Dir(s)  74,514,755,584 bytes free

D:\templink>echo hello world >foo.txt

D:\templink>dir
 Volume in drive D is Data
 Volume Serial Number is 3817-6E24

 Directory of D:\templink

03/29/2012  11:47 AM    <DIR>          .
03/29/2012  11:47 AM    <DIR>          ..
03/29/2012  11:46 AM                 8 bar.txt
03/29/2012  11:47 AM                14 foo.txt
               2 File(s)             22 bytes
               2 Dir(s)  74,514,755,584 bytes free

D:\templink>type bar.txt
hello world

D:\templink>dir
 Volume in drive D is Data
 Volume Serial Number is 3817-6E24

 Directory of D:\templink

03/29/2012  11:47 AM    <DIR>          .
03/29/2012  11:47 AM    <DIR>          ..
03/29/2012  11:47 AM                14 bar.txt
03/29/2012  11:47 AM                14 foo.txt
               2 File(s)             28 bytes
               2 Dir(s)  74,514,755,584 bytes free
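The transcript above can be mimicked with a toy model. This is only a sketch of the observable behavior, not of how NTFS is implemented: the types and functions (toy_file, toy_link, toy_open_close and so on) are made up for illustration, and the only assumption baked in is the one described in this post, namely that each link's directory entry keeps its own copy of the size and that copy is refreshed when a handle opened through that particular link is closed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: one file, several hard links. Each link's directory
 * entry carries its own cached copy of the file size. */
typedef struct {
    size_t actual_size;          /* the authoritative size kept with the file */
} toy_file;

typedef struct {
    toy_file *file;              /* the file this link names */
    size_t    cached_size;       /* what a directory listing would report */
} toy_link;

/* Creating a link copies the current size into the new directory entry. */
toy_link toy_make_link(toy_file *file)
{
    toy_link link = { file, file->actual_size };
    return link;
}

/* Directory enumeration: answer from the entry, never touch the file. */
size_t toy_dir_size(const toy_link *link)
{
    return link->cached_size;
}

/* Open through a link, optionally rewrite the file, then close. Only the
 * entry of the link that was used for the open is refreshed on close. */
void toy_open_close(toy_link *link, int rewrite, size_t new_size)
{
    assert(link != NULL && link->file != NULL);
    if (rewrite) {
        link->file->actual_size = new_size;
    }
    link->cached_size = link->file->actual_size;   /* refresh on close */
}

/* Replays the session above: foo.txt is rewritten, bar.txt is only read. */
void toy_demo(void)
{
    toy_file file = { 8 };
    toy_link foo = toy_make_link(&file);
    toy_link bar = toy_make_link(&file);

    toy_open_close(&foo, 1, 14);           /* echo hello world >foo.txt */
    assert(toy_dir_size(&foo) == 14);
    assert(toy_dir_size(&bar) == 8);       /* bar.txt is stale */

    toy_open_close(&bar, 0, 0);            /* type bar.txt */
    assert(toy_dir_size(&bar) == 14);      /* now it has caught up */
}
```

The point of the model is that toy_dir_size() never looks at the file itself, which is exactly why the listing is fast and why it can be stale.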
This is very interesting to think about from a filter perspective. This kind of behavior where the file system will return data without any guarantees that it will remain consistent is fairly common and identifying the pattern can make life a lot easier for filter developers. For example, let's say we have a filter that wants to make certain files appear in a directory. To make things harder, let's say that all the files are stored somewhere on a network with very expensive characteristics, for example in the cloud somewhere where there is a real dollar cost in terms of bytes of traffic. If the filter is written with the assumption that the directory entries must always reflect the actual file size in the cloud then on each IRP_MN_QUERY_DIRECTORY it might query the file size from the cloud, which generates traffic and so it has a real dollar cost associated with it. However, once the developer understands this particular contract of the file system they can get away with caching the file properties locally and only updating them when the file is actually opened.
Another such example that is very dear to me is file names. Most minifilters implement very complicated procedures to store names and cache them in file contexts and so on without taking advantage of the fact that in most cases names are meant to be transient information (and also without taking advantage of the fact that FltMgr's name cache is doing exactly that anyway). For more on this see my previous post on Names and file systems filters.

Thursday, March 22, 2012

Some Limitations Using Files Opened By ID

The ability to open files by ID is a pretty nice feature of certain file systems, especially from the perspective of filters. The fact that file IDs are small and fixed in size makes them very suitable for things like storing in fixed-size records or allocating them from lookaside lists. Unfortunately the semantics for files opened by ID are a bit different from the semantics the same files would have had if they had been opened by name.

As you can see we again end up talking about names. What is the relationship between a file and its name? Technically both the file's ID and the file's name are identifiers for the file. However, they belong to different namespaces, with different rules. The rules can differ between file systems, so to keep things simple I'll stick to talking about NTFS for the rest of this post. The file name namespace, for example, allows multiple names for a file (hardlinks) while the ID namespace does not.

So let's get straight to the interesting bits. The different semantics of the different namespaces can make it so that some operations don't make sense. For example, because NTFS allows multiple names for a file, if the file is opened by ID and an operation that changes the namespace is attempted, which name should be affected? To make this very clear: if \Foo\f.txt and \Bar\b.txt are hardlinks to the same file and I open the file by ID and try to rename it, which name should change? How about if I try a delete?

Naturally NTFS will return some status codes that hardly describe what is going on and I've spent many hours reviewing code to figure out what I may have done wrong before figuring out that I'm working on a file that was opened by ID and that the particular operation wasn't supported. So here are some operations that I've found to not work when files are opened by ID:

I'm sure the list is not complete and if you have more examples please contribute them through comments and I'll update the list.

Anyway, the main point of this post is to remind (myself mostly :)) that if a request fails in an unexpected way (most likely with STATUS_INVALID_PARAMETER) even though it works in a lot of cases, and you've validated that the parameters are actually good, then check whether the file might have been opened by ID and, if so, verify that the operation makes sense on a file opened by ID.

In closing I'll show you how I quickly check that (and showcase the FileTest tool which is simply awesome!):

  1. First create a file (since you can't create a new file by ID) and then close it.
  2. Then get the file ID and click on the "Use" button.
  3. Then just open the file by ID (make sure to change DesiredAccess to match what you're trying to do).
  4. And then finally just try the operation to see if it can work on files opened by ID.

Thursday, March 15, 2012

Volume Names

In this post I want to talk about something that's not directly related to file system filters but that I've spent a lot of time fighting with: volume names. The reason this is important to me is that these days I work on virtualization filters, and in some cases when creating virtual files I need to make them feel the same as regular files on a real volume. The way some applications (both kernel mode and user mode) handle volume names is downright broken.
The most important point I want to make is that a volume name is NOT a drive letter. I read a lot of articles and attend a lot of presentations where volumes are identified by "drive letter" which, while useful as a way to express a concept because everyone is familiar with drive letters, is actually wrong. Drive letters are a DOS concept; the NT concept is the volume name (and it looks like this '\\?\Volume{0d5759d1-429c-11df-8e0f-806e6f6e6963}'). The easiest way to see this is to run the "mountvol.exe" command line tool. This difference is very clearly expressed in the mountmgr.h file (%DDKPATH%\inc\ddk\mountmgr.h) where there are macros like 'MOUNTMGR_IS_VOLUME_NAME(s)' and 'MOUNTMGR_IS_DRIVE_LETTER(s)':


//
// Macro that defines what a "drive letter" mount point is.  This macro can
// be used to scan the result from QUERY_POINTS to discover which mount points
// are find "drive letter" mount points.
//

#define MOUNTMGR_IS_DRIVE_LETTER(s) (   \
    (s)->Length == 28 &&                \
    (s)->Buffer[0] == '\\' &&           \
    (s)->Buffer[1] == 'D' &&            \
    (s)->Buffer[2] == 'o' &&            \
    (s)->Buffer[3] == 's' &&            \
    (s)->Buffer[4] == 'D' &&            \
    (s)->Buffer[5] == 'e' &&            \
    (s)->Buffer[6] == 'v' &&            \
    (s)->Buffer[7] == 'i' &&            \
    (s)->Buffer[8] == 'c' &&            \
    (s)->Buffer[9] == 'e' &&            \
    (s)->Buffer[10] == 's' &&           \
    (s)->Buffer[11] == '\\' &&          \
    (s)->Buffer[12] >= 'A' &&           \
    (s)->Buffer[12] <= 'Z' &&           \
    (s)->Buffer[13] == ':')

//
// Macro that defines what a "volume name" mount point is.  This macro can
// be used to scan the result from QUERY_POINTS to discover which mount points
// are "volume name" mount points.
//

#define MOUNTMGR_IS_VOLUME_NAME(s) (                                          \
     ((s)->Length == 96 || ((s)->Length == 98 && (s)->Buffer[48] == '\\')) && \
     (s)->Buffer[0] == '\\' &&                                                \
     ((s)->Buffer[1] == '?' || (s)->Buffer[1] == '\\') &&                     \
     (s)->Buffer[2] == '?' &&                                                 \
     (s)->Buffer[3] == '\\' &&                                                \
     (s)->Buffer[4] == 'V' &&                                                 \
     (s)->Buffer[5] == 'o' &&                                                 \
     (s)->Buffer[6] == 'l' &&                                                 \
     (s)->Buffer[7] == 'u' &&                                                 \
     (s)->Buffer[8] == 'm' &&                                                 \
     (s)->Buffer[9] == 'e' &&                                                 \
     (s)->Buffer[10] == '{' &&                                                \
     (s)->Buffer[19] == '-' &&                                                \
     (s)->Buffer[24] == '-' &&                                                \
     (s)->Buffer[29] == '-' &&                                                \
     (s)->Buffer[34] == '-' &&                                                \
     (s)->Buffer[47] == '}'                                                   \
    )
So unless you're writing applications that are specific to DOS, please stop thinking in terms of "drive letters" and instead think of "volume names", especially when writing articles and presentations. There are many user mode volume APIs that are very well documented (see the page Volume Management Functions in MSDN) and that should be used. Also, as a developer, never write a function that takes a volume parameter as a "char"; instead always use mount points or volume names (which are strings). There is also a page on Naming a Volume which discusses some of the use cases and the available APIs.
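To make the two macros a bit more concrete, here is a small portable sketch that applies the same checks to a few sample names. The my_ustr type is a hypothetical stand-in for UNICODE_STRING (real code should use the real type and the real macros from mountmgr.h), and the helper names are made up for this example. The thing worth noticing is that Length is in bytes, which is where the 28, 96 and 98 in the macros come from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical stand-in for UNICODE_STRING: Length is in BYTES and the
 * buffer holds 16-bit characters. */
typedef struct {
    unsigned short  Length;
    const char16_t *Buffer;
} my_ustr;

/* Wrap a UTF-16 literal; sizeof includes the terminating NUL, so drop it. */
#define MY_USTR(lit) { (unsigned short)(sizeof(lit) - sizeof(char16_t)), (lit) }

/* Same checks as MOUNTMGR_IS_DRIVE_LETTER: "\DosDevices\X:" (14 chars). */
int my_is_drive_letter(const my_ustr *s)
{
    static const char16_t prefix[] = u"\\DosDevices\\";

    return s->Length == 28 &&
           memcmp(s->Buffer, prefix, sizeof(prefix) - sizeof(char16_t)) == 0 &&
           s->Buffer[12] >= u'A' && s->Buffer[12] <= u'Z' &&
           s->Buffer[13] == u':';
}

/* Same checks as MOUNTMGR_IS_VOLUME_NAME: "\??\Volume{GUID}" or
 * "\\?\Volume{GUID}", optionally followed by one trailing backslash. */
int my_is_volume_name(const my_ustr *s)
{
    static const char16_t tag[] = u"Volume{";
    const char16_t *b = s->Buffer;

    if (!(s->Length == 96 || (s->Length == 98 && b[48] == u'\\'))) {
        return 0;
    }
    return b[0] == u'\\' && (b[1] == u'?' || b[1] == u'\\') &&
           b[2] == u'?' && b[3] == u'\\' &&
           memcmp(b + 4, tag, sizeof(tag) - sizeof(char16_t)) == 0 &&
           b[19] == u'-' && b[24] == u'-' && b[29] == u'-' &&
           b[34] == u'-' && b[47] == u'}';
}

/* Classifies the kinds of names mentioned in this post. */
void my_classify_demo(void)
{
    my_ustr drive  = MY_USTR(u"\\DosDevices\\C:");
    my_ustr volume = MY_USTR(u"\\??\\Volume{0d5759d1-429c-11df-8e0f-806e6f6e6963}");
    my_ustr win32  = MY_USTR(u"\\\\?\\Volume{0d5759d1-429c-11df-8e0f-806e6f6e6963}\\");

    assert(my_is_drive_letter(&drive)  && !my_is_volume_name(&drive));
    assert(my_is_volume_name(&volume)  && !my_is_drive_letter(&volume));
    assert(my_is_volume_name(&win32));   /* 49 chars, trailing backslash */
}
```

Note that a volume device name like '\Device\HarddiskVolume2' matches neither check; it is a third kind of name for the same volume.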
As I mentioned in my previous post on Problems with STATUS_REPARSE - Part II, a lot of the times the problems come from user mode apps trying to build a path to a file and they expect to get a drive letter as the volume, which is just wrong. Even the MSDN example Obtaining a File Name From a File Handle falls into this trap by using drive letters all over:


…
        if (GetLogicalDriveStrings(BUFSIZE-1, szTemp)) 
        {
          TCHAR szName[MAX_PATH];
          TCHAR szDrive[3] = TEXT(" :");   // <- this is wrong…
          BOOL bFound = FALSE;
          TCHAR* p = szTemp;

          do 
          {
            // Copy the drive letter to the template string
            *szDrive = *p;

            // Look up each device name
            if (QueryDosDevice(szDrive, szName, MAX_PATH))  // <- this is wrong...
            {
              size_t uNameLen = _tcslen(szName);

              if (uNameLen < MAX_PATH) 
              {
                bFound = _tcsnicmp(pszFilename, szName, uNameLen) == 0
                         && *(pszFilename + uNameLen) == _T('\\');

                if (bFound) 
                {
                  // Reconstruct pszFilename using szTempFile
                  // Replace device path with DOS path
                  TCHAR szTempFile[MAX_PATH];
                  StringCchPrintf(szTempFile,
                            MAX_PATH,
                            TEXT("%s%s"),
                            szDrive,
                            pszFilename+uNameLen);
                  StringCchCopyN(pszFilename, MAX_PATH+1, szTempFile, _tcslen(szTempFile));
...
I've always wondered, as a Windows developer, does it not bother people that they're calling functions like "QueryDosDevice"? What does DOS have to do with anything? Step into the 21st century already!
Anyway, the best way to do this is to call GetFinalPathNameByHandle() with the VOLUME_NAME_GUID flag, which also helps with getting used to volume names. Unfortunately this is only available in Vista and newer OSes, so for XP one could still use the technique described in Obtaining a File Name From a File Handle, but something needs to be changed. The problem is that the volume APIs don't seem to have a way to convert a volume device name ('\Device\HarddiskVolume2') to a volume GUID name. In fact, none of the volume APIs offer an easy way to work with volume device names. The one way I've been able to do this in the general case was to use the MountMgr APIs directly. I don't have any user mode code that shows exactly what needs to be done but I'll show the kernel mode code that queries the MountMgr:

#define MY_MOUNTMGR_MOUNT_POINT_TAG 'mMyM'

typedef enum _MY_MOUNTMGR_BUFFER_TYPE {

    //
    // we'll query the MOUNTMGR using one of the three keys it supports..
    //

    MY_MOUNTMGR_SYMLINK = ' myS',
    MY_MOUNTMGR_UNIQUE_ID = 'DIUU',
    MY_MOUNTMGR_DEVICE = ' veD',
        
} MY_MOUNTMGR_BUFFER_TYPE, *PMY_MOUNTMGR_BUFFER_TYPE;


NTSTATUS
MyQueryMountMgr(
    __in PVOID Buffer,
    __in USHORT BufferLength,
    __in MY_MOUNTMGR_BUFFER_TYPE BufferType,  
    __out PMOUNTMGR_MOUNT_POINTS * MountPoints  
    )
/*++

Routine Description:

    Call MountMgr to get the names of a volume when knowing one of the
    other names.

Arguments:

    Buffer - the buffer that we want to send MountMgr to allow it to identify 
             the volume we're talking about. 

    BufferLength - the length of that buffer

    BufferType - the type of information that the buffer describes.

    MountPoints - this is a buffer that is allocated inside this function that 
                  the caller must free which is the list of mount points that
                  MountMgr returned... if it's NULL then no buffer is returned..
                  This is NOT the standard convention (the caller should supply 
                  the buffer) but it saves time.

Return Value:

    an appropriate NTSTATUS value

--*/
{
    NTSTATUS status = STATUS_SUCCESS;

    PMOUNTMGR_MOUNT_POINT mountMgrKey = NULL;
    ULONG mountMgrKeyLength = 0;

    PIRP irp = NULL;

    UNICODE_STRING mountMgrName;
    PFILE_OBJECT mountMgrFileObject = NULL;
    PDEVICE_OBJECT mountMgrDeviceObject = NULL;

    IO_STATUS_BLOCK ioStatus;
    KEVENT ioEvent;

    PMOUNTMGR_MOUNT_POINTS mountMgrMountPoints = NULL;
    ULONG mountMgrMountPointsLength = 0;
    
    PAGED_CODE();

    __try{

        KeInitializeEvent( &ioEvent, NotificationEvent, FALSE);

        //
        // first try to set up the buffer for the name..
        //
        
        mountMgrKeyLength = sizeof(MOUNTMGR_MOUNT_POINT);
        mountMgrKeyLength += BufferLength;


        mountMgrKey = ExAllocatePoolWithTag( PagedPool,
                                             mountMgrKeyLength,
                                             MY_MOUNTMGR_MOUNT_POINT_TAG );

        if (mountMgrKey == NULL) {

            status = STATUS_INSUFFICIENT_RESOURCES;
            __leave;
        }

        //
        // populate the structure..
        //

        RtlZeroMemory( mountMgrKey, mountMgrKeyLength);

        switch(BufferType) {

            case MY_MOUNTMGR_DEVICE:

                mountMgrKey->DeviceNameLength = BufferLength;
                mountMgrKey->DeviceNameOffset = sizeof(MOUNTMGR_MOUNT_POINT);
                break;

            case MY_MOUNTMGR_UNIQUE_ID:

                mountMgrKey->UniqueIdLength= BufferLength;
                mountMgrKey->UniqueIdOffset = sizeof(MOUNTMGR_MOUNT_POINT);
                break;

            case MY_MOUNTMGR_SYMLINK:

                mountMgrKey->SymbolicLinkNameLength= BufferLength;
                mountMgrKey->SymbolicLinkNameOffset= sizeof(MOUNTMGR_MOUNT_POINT);
                break;

            default:

                status = STATUS_INVALID_PARAMETER;
                __leave;
                break;
        }

        RtlCopyMemory( Add2Ptr(mountMgrKey, sizeof(MOUNTMGR_MOUNT_POINT)),
                       Buffer,
                       BufferLength );

        //
        // now we need a reference to MountMgr
        //

        RtlInitUnicodeString(&mountMgrName, MOUNTMGR_DEVICE_NAME);
        
        status = IoGetDeviceObjectPointer( &mountMgrName,
                                           FILE_READ_ATTRIBUTES, 
                                           &mountMgrFileObject, 
                                           &mountMgrDeviceObject);
        
        if (!NT_SUCCESS(status)) {
        
            __leave;
        }

        mountMgrMountPointsLength = sizeof(MOUNTMGR_MOUNT_POINTS);

        status = STATUS_BUFFER_OVERFLOW;

        while(status == STATUS_BUFFER_OVERFLOW) {

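            NT_ASSERT(mountMgrMountPoints == NULL);

            //
            // the event is a NotificationEvent so it stays signaled once an
            // IRP completes, reset it before we send another request..
            //

            KeClearEvent( &ioEvent );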

            mountMgrMountPoints = ExAllocatePoolWithTag( PagedPool,
                                                         mountMgrMountPointsLength,
                                                         MY_MOUNTMGR_MOUNT_POINT_TAG );

            if (mountMgrMountPoints == NULL) {

                status = STATUS_INSUFFICIENT_RESOURCES;
                __leave;
            }

            irp = IoBuildDeviceIoControlRequest( IOCTL_MOUNTMGR_QUERY_POINTS,
                                                 mountMgrDeviceObject, 
                                                 mountMgrKey, 
                                                 mountMgrKeyLength, 
                                                 mountMgrMountPoints, 
                                                 mountMgrMountPointsLength, 
                                                 FALSE, 
                                                 &ioEvent, 
                                                 &ioStatus);

            if (irp == NULL) {

                status = STATUS_INSUFFICIENT_RESOURCES;
                __leave;
            }
        
            status = IoCallDriver( mountMgrDeviceObject, irp );
            
            if (status == STATUS_PENDING) {
            
                status = KeWaitForSingleObject( &ioEvent,
                                                Executive,
                                                KernelMode,
                                                FALSE,
                                                NULL );
            
                status = ioStatus.Status;
            }

            switch (status) {

                case STATUS_BUFFER_OVERFLOW:

                    //
                    // we need a bigger buffer, the Size should tell us how big.
                    // assert that it's more that we previously had...
                    //

                    NT_ASSERT(mountMgrMountPointsLength < mountMgrMountPoints->Size);

                    mountMgrMountPointsLength = mountMgrMountPoints->Size;

                    ExFreePoolWithTag( mountMgrMountPoints, MY_MOUNTMGR_MOUNT_POINT_TAG );
                    mountMgrMountPoints = NULL;
                    
                    break;

                case STATUS_OBJECT_NAME_NOT_FOUND:

                    //
                    // it is possible that the IOCTL doesn't find anything, this
                    // is not a problem...
                    //
                    
                    break;

                case STATUS_SUCCESS:

                    //
                    // we got the links back, all is good...
                    //

                    NT_ASSERT(mountMgrMountPoints->NumberOfMountPoints != 0);

                    break;

                default:

                    NT_ASSERT(!"why are we here ? investigate...");
                    break;
                        
            }

        }

    }
    __finally{

        if (mountMgrKey != NULL) {

            ExFreePoolWithTag( mountMgrKey, MY_MOUNTMGR_MOUNT_POINT_TAG );
        }

        if (mountMgrFileObject != NULL) {

            ObDereferenceObject( mountMgrFileObject );
        }

    }
    
    //
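    // if we were successful and the caller gave us a pointer, hand the
    // mount points back through it. Otherwise free the buffer so that it
    // doesn't leak..
    //

    if (NT_SUCCESS(status) &&
        MountPoints != NULL) {

        *MountPoints = mountMgrMountPoints;

    } else if (mountMgrMountPoints != NULL) {

        ExFreePoolWithTag( mountMgrMountPoints, MY_MOUNTMGR_MOUNT_POINT_TAG );
    }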
     
    return status;
}
Something very similar can be done in user mode (though it would be a lot simpler), where instead of IoGetDeviceObjectPointer() one would have to open MOUNTMGR_DOS_DEVICE_NAME and get a handle to the MountMgr device and also the call to IoBuildDeviceIoControlRequest would be replaced with DeviceIoControl.
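The part of the routine above that tends to trip people up is the layout of the input buffer: a zeroed header followed immediately by the name bytes, with an offset/length pair in the header pointing at them. Here is a portable sketch of just that step for the device name case. The my_mount_point structure is a hypothetical mirror that only declares the fields used in the code above (real code should include mountmgr.h and use MOUNTMGR_MOUNT_POINT), and my_build_device_key is a made-up helper name. In user mode the resulting buffer would be what gets passed as the input buffer to DeviceIoControl.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical mirror of the MOUNTMGR_MOUNT_POINT fields used above.
 * Real code should include mountmgr.h instead of redeclaring this. */
typedef struct {
    uint32_t SymbolicLinkNameOffset;
    uint16_t SymbolicLinkNameLength;
    uint32_t UniqueIdOffset;
    uint16_t UniqueIdLength;
    uint32_t DeviceNameOffset;
    uint16_t DeviceNameLength;
} my_mount_point;

/* Build a query key for a device name: a zeroed header followed
 * immediately by the name bytes, with offset/length pointing at them.
 * Returns a malloc'ed buffer (the caller frees it) or NULL on failure. */
my_mount_point *my_build_device_key(const char16_t *name,
                                    uint16_t name_bytes,
                                    size_t *key_bytes)
{
    size_t total = sizeof(my_mount_point) + name_bytes;
    my_mount_point *key = calloc(1, total);

    if (key == NULL) {
        return NULL;
    }

    /* The other two offset/length pairs stay zero: we only know the
     * device name and want MountMgr to tell us the rest. */
    key->DeviceNameOffset = (uint32_t)sizeof(my_mount_point);
    key->DeviceNameLength = name_bytes;
    memcpy((unsigned char *)key + key->DeviceNameOffset, name, name_bytes);

    *key_bytes = total;
    return key;
}

/* Builds the key for the device name used as an example in this post. */
void my_key_demo(void)
{
    static const char16_t dev[] = u"\\Device\\HarddiskVolume2";
    uint16_t dev_bytes = (uint16_t)(sizeof(dev) - sizeof(char16_t));
    size_t key_bytes = 0;
    my_mount_point *key = my_build_device_key(dev, dev_bytes, &key_bytes);

    assert(key != NULL);
    assert(key_bytes == sizeof(my_mount_point) + dev_bytes);
    free(key);
}
```

As in the kernel code, the lengths are in bytes and the name is not NUL-terminated inside the buffer.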

Thursday, March 8, 2012

Name Provider Changes

I just wanted to add a couple of things to my previous post on what a passthrough nameprovider should look like. There are two changes I'd like to make:
  1. The code I posted for the PtGenerateFileNameCallback() function has an assert that it shouldn't receive a generate request for a normalized name. However, in Win8 the file system can now generate a normalized name directly (see my previous post on Name Normalization in Win8) and as such FltMgr has been changed so that it can now also request a normalized name in the generate name callback. So that assert needs to be changed to allow a normalized name.
  2. Also in the code for PtGenerateFileNameCallback() I omitted a very important flag, FLT_FILE_NAME_DO_NOT_CACHE. It is very important because leaving it out can lead to problems with FltMgr's name cache. It isn't really necessary for my example because the name above my filter is the same as the name below my filter, but it is very important for filters that change the name of a file. I'll explain this issue in more detail below.
So FltMgr's name cache is simply a FltMgr specific context associated with the stream (or with the FILE_OBJECT depending on some factors) where FltMgr remembers the name for the object. Because before Win8 building a name was such an expensive operation FltMgr wanted to cache the name whenever it was generated in response to a filter request (I mean that FltMgr never actually went and populated the name by itself) and also FltMgr tried really hard not to lose any such name.
First I'd like to point out that FltMgr caches each name at multiple levels, one for each name provider. So if there is no name provider on the stack there is only one entry, for the name at the file system level. If there is a name provider then there are two entries in the cache, one at the name provider's level and one at the file system's level. And so on.
The problem with using stream contexts (or stream handle contexts) is that you can't know what the actual context is until postCreate. However, it is not uncommon for filters to query for the name during preCreate, which means that FltMgr can't look it up in the cache and so it must generate it every time (which is why querying for names in preCreate is a pretty big performance hit). Moreover, once the name is generated FltMgr can't even cache it because it doesn't have the stream yet. So in this case FltMgr passes the name from preCreate to postCreate in a fashion similar to what a minifilter would do (using the CompletionContext parameter of the preCreate callback). However, since there is no such parameter available to FltMgr itself, it uses the IRP to send some information from preCreate to postCreate, including the name. Once the IRP_MJ_CREATE operation is complete and FltMgr knows what the stream is, it tries to save the name it generated during preCreate into the name cache.
This is where FLT_FILE_NAME_DO_NOT_CACHE comes in. Let's say we have a minifilter that changes the name for a file from C:\foo.txt to C:\bar.txt (so C:\foo.txt is the name seen above the minifilter and C:\bar.txt is the name seen at the file system's level). If the name provider minifilter simply passes the name request down to the file system during preCreate then the file system will see a request for C:\foo.txt, which means that the name generated at the file system level will look like "C:\foo.txt". Once the name is generated FltMgr will hold on to it and associate it with the IRP_MJ_CREATE operation. Then, once the actual IRP_MJ_CREATE is completed, since the C:\foo.txt name is now associated with the IRP, FltMgr will cache the name at the file system level as "C:\foo.txt" instead of the "C:\bar.txt" that it should be. This is why name providers must always set the FLT_FILE_NAME_DO_NOT_CACHE flag when making their own name requests below them, to tell FltMgr not to cache those names.
So this is the new & improved code (the changes are the NameOptions check at the top, which now also accepts normalized names, and the new SetFlag call for FLT_FILE_NAME_DO_NOT_CACHE):


 NTSTATUS PtGenerateFileNameCallback(  
   __in   PFLT_INSTANCE Instance,  
   __in   PFILE_OBJECT FileObject,  
   __in_opt PFLT_CALLBACK_DATA CallbackData,  
   __in   FLT_FILE_NAME_OPTIONS NameOptions,  
   __out   PBOOLEAN CacheFileNameInformation,  
   __out   PFLT_NAME_CONTROL FileName  
   )  
 {  
   NTSTATUS status = STATUS_SUCCESS;  
   PFLT_FILE_NAME_INFORMATION belowFileName = NULL;  
   PT_DBG_PRINT( PTDBG_TRACE_ROUTINES,  
          ("PassThrough!PtGenerateFileNameCallback: Entered\n") );  
   __try {  

     //
     //  We expect to only get requests for opened, short and normalized names.
     //  If we get something else, fail. 
     //

     if (!FlagOn( NameOptions, FLT_FILE_NAME_OPENED ) &&
         !FlagOn( NameOptions, FLT_FILE_NAME_SHORT ) &&
         !FlagOn( NameOptions, FLT_FILE_NAME_NORMALIZED )) {

         ASSERT(!"we have received a request for an unknown format. investigate!");

         return STATUS_NOT_SUPPORTED;
     }

     //  
     // First we need to get the file name. We're going to call   
     // FltGetFileNameInformation below us to get the file name from FltMgr.   
     // However, it is possible that we're called by our own minifilter for   
     // the name so in order to avoid an infinite loop we must make sure to   
     // remove the flag that tells FltMgr to query this same minifilter.   
     //  
     ClearFlag( NameOptions, FLT_FILE_NAME_REQUEST_FROM_CURRENT_PROVIDER );  
     SetFlag( NameOptions, FLT_FILE_NAME_DO_NOT_CACHE );
     //  
     // this will be called for FltGetFileNameInformationUnsafe as well and  
     // in that case we don't have a CallbackData, which changes how we call   
     // into FltMgr.  
     //  
     if (CallbackData == NULL) {  
       //  
       // This must be a call from FltGetFileNameInformationUnsafe.  
       // However, in order to call FltGetFileNameInformationUnsafe the   
       // caller MUST have an open file (assert).  
       //  
       ASSERT( FileObject->FsContext != NULL );  
       status = FltGetFileNameInformationUnsafe( FileObject,  
                            Instance,  
                            NameOptions,  
                            &belowFileName );   
       if (!NT_SUCCESS(status)) {  
         __leave;  
       }                              
     } else {  
       //  
       // We have a callback data, we can just call FltMgr.  
       //  
       status = FltGetFileNameInformation( CallbackData,  
                         NameOptions,  
                         &belowFileName );   
       if (!NT_SUCCESS(status)) {  
         __leave;  
       }                              
     }  
     //  
     // At this point we have a name for the file (the opened name) that   
     // we'd like to return to the caller. We must make sure we have enough   
     // buffer to return the name or we must grow the buffer. This is easy   
     // when using the right FltMgr API.  
     //  
     status = FltCheckAndGrowNameControl( FileName, belowFileName->Name.Length );  
     if (!NT_SUCCESS(status)) {  
       __leave;  
     }  
     //  
     // There is enough buffer, copy the name from our local variable into  
     // the caller provided buffer.  
     //  
     RtlCopyUnicodeString( &FileName->Name, &belowFileName->Name );   
     //  
     // And finally tell the user they can cache this name.  
     //  
     *CacheFileNameInformation = TRUE;  
   } __finally {  
     if ( belowFileName != NULL) {  
       FltReleaseFileNameInformation( belowFileName );        
     }  
   }  
   return status;  
 }  
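
To make the caching problem a bit more concrete, here is a small user-mode toy model of it. This is only a sketch under loud assumptions: it is not how FltMgr is implemented, and model_name_cache, model_query_below, model_name_seen_later and MODEL_DO_NOT_CACHE are all hypothetical names made up for the illustration. The only thing it tries to show is that a name queried while the request still carries C:\foo.txt gets remembered unless the query asks for it not to be cached.

```c
#include <assert.h>
#include <string.h>

/* Toy user-mode model of the caching problem. This is NOT how FltMgr is
 * implemented; every name below is hypothetical. */

#define MODEL_DO_NOT_CACHE 0x1u  /* stand-in for FLT_FILE_NAME_DO_NOT_CACHE */

typedef struct {
    char name[64];   /* name cached for the file system level */
    int  valid;      /* nonzero once a name has been cached */
} model_name_cache;

/* Answer a name query for the level below the name provider.
 * current_name is the name the request carries at the time of the query. */
static const char *model_query_below(model_name_cache *cache,
                                     const char *current_name,
                                     unsigned flags)
{
    assert(cache != NULL && current_name != NULL);
    assert(strlen(current_name) < sizeof(cache->name));

    if (cache->valid) {
        return cache->name;              /* served from the cache */
    }
    if (!(flags & MODEL_DO_NOT_CACHE)) {
        strcpy(cache->name, current_name);
        cache->valid = 1;                /* remembered for later queries */
    }
    return current_name;
}

/* The provider queries below itself while the request still carries the
 * foo.txt name, then the create completes against bar.txt and the name is
 * queried again. Returns the name that the later query sees. */
static const char *model_name_seen_later(model_name_cache *cache,
                                         unsigned provider_flags)
{
    memset(cache, 0, sizeof(*cache));
    (void)model_query_below(cache, "C:\\foo.txt", provider_flags);
    return model_query_below(cache, "C:\\bar.txt", 0);
}
```

In this model, calling model_name_seen_later with no flags hands back the stale "C:\foo.txt", while passing MODEL_DO_NOT_CACHE for the provider's own query lets the later query see "C:\bar.txt".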

Thursday, March 1, 2012

Win8 Bugcheck 0x1a_61946

I was going to skip posting this week, but at plugfest last week I ran into an issue that I believe might impact users testing with Win8, and I wanted to explain what the problem is and how it can be avoided.

This is what the bugcheck might look like:

1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

MEMORY_MANAGEMENT (1a)
    # Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 00061946, The subtype of the bugcheck.
Arg2: d0427500
Arg3: 000214ce
Arg4: 00000000

Debugging Details:
------------------


BUGCHECK_STR:  0x1a_61946

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  2

LAST_CONTROL_TRANSFER:  from 8217aa43 to 82151d70

STACK_TEXT:  
8989f2ec 8217aa43 00000003 07cae677 0000001a nt!RtlpBreakWithStatusInstruction
8989f33c 8217a088 00000003 80bf4138 8989f744 nt!KiBugCheckDebugBreak+0x1c
8989f718 821509da 0000001a 00061946 d0427500 nt!KeBugCheck2+0x594
8989f73c 82150911 0000001a 00061946 d0427500 nt!KiBugCheck2+0xc6
8989f75c 821f0e08 0000001a 00061946 d0427500 nt!KeBugCheckEx+0x19
8989f79c 820b9b14 8989f7c0 000000c4 d0427500 nt! ?? ::FNODOBFM::`string'+0x1dc6f
8989f800 82518ec8 d0427500 00000000 00000001 nt!MmProbeAndLockPages+0x134
8989f820 8281fb38 d0427500 00000000 00000001 nt!VerifierMmProbeAndLockPages+0x7b
8989f860 8282553d d53fc4f8 d5601e28 00000001 Ntfs!NtfsLockUserBuffer+0x4c
8989f880 828177eb 00000001 00000000 d53fc4f8 Ntfs!NtfsPrePostIrpInternal+0xa1
8989f8a0 82824d68 00000001 8dcd8e00 db5e1c9f Ntfs!NtfsPostRequest+0x21
8989f90c 8281d01e d53fc4f8 d5601e28 c00000d8 Ntfs!NtfsProcessException+0x2b4
8989f988 82500e6b 88cac018 d5601e28 d5601c6c Ntfs!NtfsFsdRead+0x376
8989f9b0 82090047 8278b0ee d5601c68 8989fa14 nt!IovCallDriver+0x2f3
8989f9c0 8278b0ee d5601e28 88ca6ba8 d421e6c0 nt!IofCallDriver+0x72
8989fa14 8278c432 8989fa38 00000000 00000000 FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x25b
8989fa50 82500e6b 88ca6ba8 d5601e28 88ccf120 FLTMGR!FltpDispatch+0xca
8989fa78 82090047 822d9c57 d5601e28 8989fadc nt!IovCallDriver+0x2f3
8989fa88 822d9c57 d5601e28 d5602000 d5601e30 nt!IofCallDriver+0x72
8989fadc 822d3b37 88ca6ba8 00000001 00000000 nt!IopSynchronousServiceTail+0x10a
8989fb74 821cabec 88ca6ba8 80002e70 00000000 nt!NtReadFile+0x3f7
8989fb74 8214f1f1 88ca6ba8 80002e70 00000000 nt!KiFastCallEntry+0x12c
8989fc10 82780b85 80002038 80002e70 00000000 nt!ZwReadFile+0x11
8989fc94 82771fa9 8989fcc8 94c0f000 00001000 mydriver!MyDriverReadFile+0x28f 

STACK_COMMAND:  kb

FOLLOWUP_IP: 
nt! ?? ::FNODOBFM::`string'+1dc6f
821f0e08 8bd0            mov     edx,eax

SYMBOL_STACK_INDEX:  5

SYMBOL_NAME:  nt! ?? ::FNODOBFM::`string'+1dc6f

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

IMAGE_NAME:  ntkrpamp.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  4f23bd74

FAILURE_BUCKET_ID:  0x1a_61946_VRF_nt!_??_::FNODOBFM::_string_+1dc6f

BUCKET_ID:  0x1a_61946_VRF_nt!_??_::FNODOBFM::_string_+1dc6f

Followup: MachineOwner
---------

Now, what MyDriver was doing was taking an MDL that was used for a paging read, creating a system VA for it (by calling MmGetSystemAddressForMdlSafe()) and then populating some parts of it by calling ZwReadFile to read into that buffer. The problem is that when calling ZwReadFile with that address the pages will be marked dirty. However, the pages were supposed to be populated from a paging read and so it makes no sense that they are marked "dirty" in the paging read path. There are a couple of different reasons why this is problematic:

  • If the pages are marked dirty it means that MM will try to save their contents (since it believes the page contains data that hasn't been saved to disk yet - that's what a dirty page means). So if MM is running out of physical pages it will try to create more space by reusing some of the existing physical pages. For pages that aren't dirty MM can simply discard them because it knows it can get the contents back later (through another paging read). However, if the pages are marked dirty it must save the contents somewhere (in this case in the page file) and fault them in from the page file later. This is slow (MM must first save the contents to the page file even though it doesn't need to) and it can lead to problems because space in the page file hasn't been allocated for this page, so MM might run out of space later and bugcheck.
  • This can also be problematic for image pages (for exes, dlls) because any fixups that might normally be applied on the fly when paging them in from the file system will become "permanent" because the data is now dirty, and so it might lead to crashes later on when the page is used in a different process.
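
The first point can be sketched with a tiny user-mode toy model. It is not MM's real logic in any way, and model_page, model_populate and model_reclaim are hypothetical names invented for the example. It only counts what reclaiming a set of pages would cost, depending on whether they were left clean (as a paging read would leave them) or marked dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-mode model of reclaiming physical pages. Not MM's real logic;
 * all names are hypothetical. */

typedef struct {
    bool dirty;   /* true: contents are treated as not saved anywhere yet */
} model_page;

typedef struct {
    size_t discarded;        /* clean pages dropped, refetched by a paging read */
    size_t pagefile_writes;  /* dirty pages that had to go to the page file */
} model_reclaim_stats;

/* Populate pages either the way a paging read leaves them (clean) or through
 * a path that marks them dirty. */
static void model_populate(model_page *pages, size_t count, bool marks_dirty)
{
    assert(pages != NULL);
    for (size_t i = 0; i < count; i++) {
        pages[i].dirty = marks_dirty;
    }
}

/* Count what reclaiming every page in the array would cost. */
static model_reclaim_stats model_reclaim(const model_page *pages, size_t count)
{
    model_reclaim_stats stats = { 0, 0 };

    assert(pages != NULL);
    for (size_t i = 0; i < count; i++) {
        if (pages[i].dirty) {
            stats.pagefile_writes++;   /* must be saved before reuse */
        } else {
            stats.discarded++;         /* can simply be thrown away */
        }
    }
    return stats;
}
```

With four pages populated cleanly, the model reports four discards and no page file writes; with the same four pages marked dirty, every one of them turns into a page file write that nobody planned for.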

Anyway, this is clearly bad so let's discuss what a driver should do about this:

  • If a file system filter (or any other type of driver) has an MDL and it wants to populate it with data from a file (or from multiple files) (i.e. if it needs to issue one or more reads to get the data) then it must issue an IRP_MJ_READ request using the same MDL that was passed in (please note that it might be necessary to split the MDL by using IoBuildPartialMdl()). Minifilters can use FltReadFileEx() to read data directly into the MDL (or MDLs). Legacy filters or other types of drivers must NOT use ZwReadFile and instead they must issue an IRP_MJ_READ with the partial MDLs.
  • If a file system (or file system filter) wants to populate the MDL or parts of the MDL by writing into it directly (for example an encryption filter that will write the decrypted data into the user's buffer directly; another example would be a filter that completely owns a virtual file for which it supplies all the contents without trying to read them from disk at all) it can call MmGetSystemAddressForMdlSafe() to get a system VA and then the filter can write directly into that buffer. However, the filter must not create a new MDL for any part of that buffer. If a filter must use an MDL for all or some part of the buffer it should build a partial MDL.

Please note that this has always been the correct way of doing things, but developers have been taking the easy way out and getting lucky until now. Checked builds of previous OS releases asserted when this was detected. One additional benefit of doing things the right way is that performance should be slightly better as well.

Just to point out what this might look like for a user, please take a look at this VirtualBox ticket https://www.virtualbox.org/ticket/10290 which illustrates this issue.