Thursday, April 26, 2012

Smart Reading and Storing of Files

I often run into tools that move files around (backup solutions, cloud storage clients, file copy utilities) that don't support some core NTFS features like alternate data streams. The lack of support for such a useful feature (just think of how many file formats out there provide little value beyond storing metadata alongside the main data stream) is very frustrating.

Anyway, from looking at various applications that don't support this, I see two main reasons. One is that they're multi-platform apps that only support the feature set common to all the platforms (which makes sense in some cases but not in others). The other reason, I suspect, is that people just don't want to deal with enumerating alternate data streams and then coming up with a mechanism to serialize them into a single file. On the other hand, most such solutions must read and preserve other file metadata, such as file attributes, timestamps and so on, which I suspect they do on their own, using various Windows APIs that provide that functionality. Smarter solutions also need to deal with sparse files (it would be pretty stupid for a cloud storage solution to ignore the file system's information that a large chunk of a file is all 0s and instead use bandwidth to transfer those zeros) and possibly even hardlinks and reparse points.

Fortunately, this is simpler than it sounds, at least on Windows :). There is a set of Windows APIs that operate on a stream of data describing a file complete with alternate data streams information, sparse blocks information, security information and so on. This stream is formatted in a way that allows the caller to understand what type of data each part of the stream represents. The API set even allows skipping certain types of file data that are not relevant to the caller. The APIs I'm referring to are the Backup APIs: BackupRead, BackupSeek and BackupWrite.

There is also the WIN32_STREAM_ID structure, which describes the information that follows the structure in the stream and allows for figuring out whether the stream is interesting for the caller, how long it is and so on.

These APIs allow the caller to read the file contents and metadata as one long stream of bytes, seek within that stream, and write all the information back as a single stream of bytes. Moreover, since the stream is formatted, it's also possible to read only certain types of data. For example, consider a solution that archives data to some cloud storage and then wants to restore just the data into a file on an OS other than Windows. It's quite easy to keep track, in some database, of where the main data stream begins and how long it is, so that only that part needs to be downloaded for that platform.
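To make the stream layout concrete, here's a minimal sketch in portable C that walks a buffer formatted the way BackupRead returns it. The struct fields mirror the documented WIN32_STREAM_ID layout (stream id, attributes, 64-bit size, name size, then the name and data); the helper that builds a synthetic buffer is purely for illustration, since a real caller would get the bytes from BackupRead and skip uninteresting streams with BackupSeek.

```c
#include <stdint.h>
#include <string.h>

/* Stream ids from the Win32 headers (subset). */
#define BACKUP_DATA            1   /* the main (unnamed) data stream */
#define BACKUP_ALTERNATE_DATA  4   /* an alternate data stream */

/* The fixed part of WIN32_STREAM_ID is 20 bytes on the wire:
   dwStreamId, dwStreamAttributes, Size (64-bit), dwStreamNameSize.
   Parse it field by field to avoid struct padding surprises.
   (Assumes a little-endian host, like the on-disk format.) */
#define STREAM_HEADER_SIZE 20

typedef struct {
    uint32_t id;
    uint32_t attributes;
    uint64_t size;        /* length of the data following the name */
    uint32_t name_size;   /* length of the stream name, in bytes */
} stream_header;

static int read_header(const uint8_t *buf, size_t len, stream_header *h)
{
    if (len < STREAM_HEADER_SIZE)
        return 0;
    memcpy(&h->id, buf, 4);
    memcpy(&h->attributes, buf + 4, 4);
    memcpy(&h->size, buf + 8, 8);
    memcpy(&h->name_size, buf + 16, 4);
    return 1;
}

/* Walk the stream and add up the bytes of the main data stream only,
   skipping everything else - the same thing a caller would do with
   BackupSeek when it only cares about the default stream. */
uint64_t main_stream_bytes(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    uint64_t total = 0;
    stream_header h;

    while (read_header(buf + pos, len - pos, &h)) {
        size_t data_at = pos + STREAM_HEADER_SIZE + h.name_size;
        if (h.id == BACKUP_DATA)
            total += h.size;
        pos = data_at + (size_t)h.size;   /* jump to the next header */
        if (pos > len)
            break;
    }
    return total;
}

/* Build one serialized stream record (illustration only; BackupRead
   produces this format for real files). Returns bytes written. */
size_t put_stream(uint8_t *out, uint32_t id, const void *data, uint64_t size)
{
    uint32_t zero = 0;
    memcpy(out, &id, 4);
    memcpy(out + 4, &zero, 4);       /* attributes */
    memcpy(out + 8, &size, 8);
    memcpy(out + 16, &zero, 4);      /* no stream name, for brevity */
    memcpy(out + STREAM_HEADER_SIZE, data, (size_t)size);
    return STREAM_HEADER_SIZE + (size_t)size;
}
```

This is also roughly what the hypothetical cloud solution above would do: record the offset and size of the BACKUP_DATA chunk once, and later download only that range.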

There is quite a bit of documentation on these APIs and on what a backup solution should do with them and so on. There are some documents under the [MS-BKUP]: Microsoft NT Backup File Structure page and even a basic sample, Creating a Backup Application.

So please, next time you run across a solution that doesn't handle sparse files or alternate data streams or some other such features, feel free to point the developers to these APIs :)

Thursday, April 19, 2012

The Standby List and Storage Overprovisioning

This post is about an interesting issue I spent quite a bit of time debugging. As is often the case with very complex systems, I knew most of the bits of information related to the issue but I hadn't quite put everything together, and so this scenario still surprised me.
It all started with me playing with file sizes and directory entries, and so I was copying a large number of files to a VHD. Since I didn't have a lot of space for the VHD in my VM I decided to make all the files sparse so that they wouldn't take any space on the VHD, which was rather small (2GB). I created about 20GB worth of sparse files on it and all was well. I'd actually been using this setup for a while, but when doing this in a Win8 VM I quickly ran into problems: the VHD ran out of free space. I knew there was no way for that to happen since all my files were sparse, and I didn't expect that I suddenly had so many directory entries that the file system metadata actually used most of the 2GB of the VHD. So I figured this would be an interesting investigation.
The first thing I noticed was that if I dismounted the VHD and mounted it again, the VHD looked pretty much the way I expected: all the files were there and there was roughly 2GB worth of free space. So it seemed to be a transient situation. I spent a bit of time looking at various NTFS counters (fsutil is a good tool for that), looking at ProcMon logs and poking the file system in various ways, but I couldn't find anything. I was about to embark on the next step, which was to try to figure out which blocks on the volume were owned by which file so that I could see why those files weren't sparse, but I was lucky enough to discover by chance that exactly the same behavior happens on a Win7 machine when running Microsoft Security Essentials. This was quite helpful because I stopped suspecting some new behavior in NTFS in Win8 and could instead focus on Microsoft Security Essentials (MSE). Other AV products I had running in my VMs didn't seem to have the same effect, so this was particular to MSE. One thing I knew that was rather unique to MSE (at least it was at some point in the past) was that MSE uses mapped files (also known as sections in the NT world) to read file data, so I started wondering if that had anything to do with it.
So using fsutil I created a new 1GB file and made it sparse. Then I opened it with FileTest, created a file mapping and then mapped a view for the whole file. Guess what: NTFS reserved space for the range I was reading (naturally I didn't expect MSE to change the files it was scanning, so I was just reading the files). This is necessary because in case something writes to the file through the section, NTFS must be able to save that information to disk. When working with mapped files NTFS can't know in advance what data will be written (if any), and so it does a lot of preparation to accommodate the scenario where everything is written. So it was pretty clear what was going on: MSE creating sections for my sparse files made NTFS reserve blocks for them. The last thing I had to figure out was why MSE held on to the sections for so long. My files were pretty small (less than 1MB on average), so it took quite a lot of them to get NTFS to run out of free space. Initially I suspected a section leak in MSE, but while playing with FileTest I noticed that even when I closed FileTest (so I knew for sure that all the file handles, memory mapping handles, mapped views and so on were released) the blocks still weren't returned to the free space pool. And at this point it hit me that it must have been MM (the memory manager) that kept the section open, and indeed using RamMap I could see that was the case.
Here is a quick recap of what the standby list is. When a file is used for memory mapped IO and its pages are no longer in use (the view is unmapped, the section or the file handle is closed, or even when the whole process is terminated), the pages that are backed by the file are moved to the standby list. They will be moved off the standby list to either the free list or the zero list (depending on whether there is memory pressure in the system and who's asking for what kind of pages), or they will be reused if the same file is used for memory mapped or cached IO again. This last behavior is pretty much a file cache (not to be confused with the cache manager, which has quite a different role). In my case, since there was no memory pressure, the pages would remain on the standby list for quite a while, and so NTFS would not see the section being closed and kept the blocks reserved.
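To make that lifecycle concrete, here is a toy model in C. The states and functions are invented for illustration (the real memory manager structures are far more involved), but the transitions match the recap above: unmapping moves pages to standby rather than freeing them, remapping the same file reuses them, and only memory pressure (or RamMap's Empty Standby List) actually frees them.

```c
#include <stddef.h>

/* Toy model of the page lifecycle described above. Not the real MM
   data structures - just the state machine from the recap. */
enum page_state { PAGE_FREE, PAGE_ACTIVE, PAGE_STANDBY };

typedef struct {
    int file_id;            /* which file backs this page (0 = none) */
    enum page_state state;
} page;

#define NPAGES 8

/* Map a file: reuse its standby pages first (the "file cache" effect),
   then take free pages for anything still missing. Returns how many
   pages were reused from the standby list. */
int map_file(page pages[], int file_id, int want)
{
    int reused = 0;
    for (int i = 0; i < NPAGES && want > 0; i++) {
        if (pages[i].state == PAGE_STANDBY && pages[i].file_id == file_id) {
            pages[i].state = PAGE_ACTIVE;
            reused++; want--;
        }
    }
    for (int i = 0; i < NPAGES && want > 0; i++) {
        if (pages[i].state == PAGE_FREE) {
            pages[i].file_id = file_id;
            pages[i].state = PAGE_ACTIVE;
            want--;
        }
    }
    return reused;
}

/* Unmap: the pages are NOT freed - they move to the standby list,
   which is why NTFS still considers the section referenced. */
void unmap_file(page pages[], int file_id)
{
    for (int i = 0; i < NPAGES; i++)
        if (pages[i].state == PAGE_ACTIVE && pages[i].file_id == file_id)
            pages[i].state = PAGE_STANDBY;
}

/* Memory pressure (or emptying the standby list by hand): only now do
   the pages really return to the free list. Returns pages freed. */
int empty_standby_list(page pages[])
{
    int freed = 0;
    for (int i = 0; i < NPAGES; i++)
        if (pages[i].state == PAGE_STANDBY) {
            pages[i].state = PAGE_FREE;
            pages[i].file_id = 0;
            freed++;
        }
    return freed;
}
```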
Please note that this is not unique to sparse files; any file that is overprovisioned (such as a compressed file) has the same semantics. So it is quite easy to run out of space on a volume where the total logical size of all the files exceeds the volume's capacity, even without writing anything to the volume.
Now, since this is a file systems and filters development blog I should mention that if you are a file system or a filter and you do any work with compressed files or sparse files or some such, you can actually tell MM to close a section using MmForceSectionClosed().
Update: I wanted to add some steps on how to reproduce this problem, in case you're interested.
  1. Create an empty 1GB file:
    C:\>fsutil file createnew C:\TestFile.bin 1000000000
    File C:\TestFile.bin is created
    
  2. Make the file sparse:
    C:\>fsutil sparse setflag C:\TestFile.bin
    C:\>fsutil sparse setrange C:\TestFile.bin 0 1000000000
  3. Open the file in FileTest.exe. Make sure to request GENERIC_WRITE access.
  4. Create a read-only file mapping, map a view and read the whole file.
  5. Unmap the view, close the section handle and then the file handle, and then close FileTest.exe (we could have closed it directly as well).
  6. You now have 1 GB less free space on C:
    C:\>dir
     Volume in drive C has no label.
     Volume Serial Number is 10FA-5C1D
    
     Directory of C:\
    
    06/10/2009  02:42 PM                24 autoexec.bat
    06/10/2009  02:42 PM                10 config.sys
    03/02/2010  06:31 PM    <DIR>          Far
    07/13/2009  07:37 PM    <DIR>          PerfLogs
    08/11/2011  11:13 AM    <DIR>          Perl
    11/05/2010  09:34 AM    <DIR>          Program Files
    04/20/2012  10:12 AM     1,000,000,000 TestFile.bin
    11/04/2009  12:57 PM    <DIR>          Users
    11/05/2010  09:40 AM    <DIR>          Windows
                   3 File(s)  1,000,000,034 bytes
                   6 Dir(s)  39,859,396,608 bytes free
    
  7. Use RamMap and see all the pages for the file on the standby list.
  8. Empty the standby list (click on Empty->Empty Standby List).
  9. Finally check the free space on C: again:
    C:\>dir
     Volume in drive C has no label.
     Volume Serial Number is 10FA-5C1D
    
     Directory of C:\
    
    06/10/2009  02:42 PM                24 autoexec.bat
    06/10/2009  02:42 PM                10 config.sys
    03/02/2010  06:31 PM    <DIR>          Far
    07/13/2009  07:37 PM    <DIR>          PerfLogs
    08/11/2011  11:13 AM    <DIR>          Perl
    11/05/2010  09:34 AM    <DIR>          Program Files
    04/20/2012  10:12 AM     1,000,000,000 TestFile.bin
    11/04/2009  12:57 PM    <DIR>          Users
    11/05/2010  09:40 AM    <DIR>          Windows
                   3 File(s)  1,000,000,034 bytes
                   6 Dir(s)  41,361,416,192 bytes free

Thursday, April 12, 2012

Useful Tools for File System Developers

In this post I'll go over some of the tools I use pretty frequently to investigate file system behavior and debug my filters. I've noticed that some of the tools I use are somewhat unknown to other developers and so I hope this might save someone some time.
  • FileTest - I've already mentioned this tool many times in my previous posts. It's extremely useful to me for investigating file system behavior as well as trying to reproduce bugs in my filter. I really like the ability to generate a breakpoint right before a certain operation is issued, which is very helpful when debugging certain code paths where a regular breakpoint would just be too noisy. In those cases I enable the breakpoint in FileTest and then when it triggers I do a "bp /t @$thread" and the breakpoint will only trigger in the context of that thread. It's also very useful for seeing how some of the structures for information classes and such are set by the file system. Overall, a great tool.
  • ProcMon - I'm sure everybody knows of ProcMon already. I use it a lot when I want to investigate the behavior of a certain application (how does an application write to a file or how does it use a log or some such). It's also the first thing I try when I get bug reports where "application X works without the filter but fails when the filter is present". For these debugging scenarios I also found it very useful for something else: finding the log file for the application. Most properly written applications have a mechanism to log errors and/or warnings, but since there are so many ways to do that in Windows I found that ProcMon is quite useful for finding the log file (if there is one). I'm sure everyone reading my blog is a power-user of ProcMon so I won't go into more detail about the use cases and such. However, there is one trick I'd like to mention that makes it very useful for file system developers: changing the altitude of ProcMon's filter. For minifilter developers this is incredibly useful because it allows them to see what the IO their filter issues looks like. Also, for debugging interop issues it's very useful to be able to put ProcMon between the two filters and see what's going on. First I capture ProcMon's configuration like so (make sure to capture it again when upgrading ProcMon):
    1. Start procmon.
    2. Open the registry entry for the filter: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\PROCMON20 (for the current version on one of my test machines)
    3. Export the key into a human-readable format (I use *.REG files).
    At this point I have a *.REG file with procmon's configuration which I can use whenever I need to load ProcMon at a different altitude. Please note that I remove the Enum key, so what I have looks something like this:
    Windows Registry Editor Version 5.00
    
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\PROCMON20]
    
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\PROCMON20\Instances]
    "DefaultInstance"="Process Monitor Instance"
    
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\PROCMON20\Instances\Process Monitor Instance]
    "Altitude"="385200"
    "Flags"=dword:00000000
    
    So now whenever I need to have ProcMon filter at a different altitude I open the file and change the Altitude to what I need it to be. After that the steps are:
    1. Load the REG file into the registry.
    2. Open regedit and go to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\PROCMON20\Instances\Process Monitor Instance key.
    3. Add DENY for Everyone for Delete and Set Value.
    4. Start ProcMon.
    5. Use fltmc to make sure that ProcMon is loaded at the right altitude.
  • fsutil - I don't know exactly why but this very useful tool is not as well known as I would have expected. It is a Microsoft tool that allows command line access to many file system functions that are useful for file system developers and which would otherwise require writing an in-house tool to access. This is a list of the ones I've used over the years but there are many more:
    • get file name by ID
    • set file shortname
    • query the allocated ranges for a file
    • query all sorts of information about the file system on a volume, including file system capabilities, configuration options, statistics and so on.
    • create and list hardlinks for a file
    • display information for reparse points (symlinks, mountpoints, directory junctions and custom ones)
    • work with sparse files (query and set the ranges, set the attribute and so on)
    • manage the USN journal (create, delete, query the entries for a file and so on)
  • mountvol - this is another Microsoft tool. I found it very useful for volume management. It's great to create mount points and change (or remove) drive letters, get familiar with using the NT names for volumes and so on.
  • diskpart - another Microsoft tool useful for creating VHDs and formatting volumes with various unusual characteristics. Useful for file system developers that want to test their filters with various file systems (see my previous post on Testing a Minifilter on More Filesystems: UDF and ExFAT).
These are the ones I use the most these days. I'll update the list if I remember another tool. Also please feel free to mention your favorite tool in the comments.

Thursday, April 5, 2012

Setting IoStatus.Information

I spent a couple hours yesterday investigating a bug that turned out to be a pretty silly mistake on my part, but I wanted to share some of the analysis with you. The problem was that an application was failing with a message of "There is not enough space on the disk", according to its logs.

Naturally the first step was to run procmon with and without my filter and compare the logs. After some filtering and searching and looking at timestamps (as an aside, any decent logging solution should have timestamps for the events, it makes it so much easier to find things) I managed to narrow things down to this:

And this is what the log looked like when the application worked:

The thing I found most puzzling just by looking at these two screenshots was that when the application was failing, the IRP_MJ_QUERY_INFORMATION call for the FilePositionInformation class was successful but ProcMon still didn't print the position like it did when the application was working. Anyway, the problem was obviously that the application was trying to set the EndOfFile to a very large offset, and it failed because the disk wasn't that large. I was curious to see how that file size was calculated, so I spent some time looking at the number, in the hope that maybe some high bit was set somewhere, messing up the number. But to no avail, the value looked positively random: 1,288,524,915,353 bytes is 0x12C0211E299…

Anyway, with that investigation leading nowhere (the next step was to debug the application to figure out how that value came about, something I wasn't very eager to do), I decided I should take a break and instead investigate the other puzzling aspect: ProcMon not displaying the FilePositionInformation even though the call was successful. So I spent a bit of time trying to figure it out, and in my filter everything looked fine, I was setting the position properly:


            case FilePositionInformation:
                {
                    PFILE_POSITION_INFORMATION positionInfo = Data->Iopb->Parameters.QueryFileInformation.InfoBuffer;

                    // retrieve the current position from the file object

                    positionInfo->CurrentByteOffset = FltObjects->FileObject->CurrentByteOffset;

                    // then complete the operation

                    Data->IoStatus.Status = STATUS_SUCCESS;
                    callbackStatus = FLT_PREOP_COMPLETE;
                }
                break;

Compare this with the FastFat implementation:


VOID
FatQueryPositionInfo (
    IN PIRP_CONTEXT IrpContext,
    IN PFILE_OBJECT FileObject,
    IN OUT PFILE_POSITION_INFORMATION Buffer,
    IN OUT PLONG Length
    )

/*++

Routine Description:

    This routine performs the query position information function for fat.

Arguments:

    FileObject - Supplies the File object being queried

    Buffer - Supplies a pointer to the buffer where the information is to
        be returned

    Length - Supplies the length of the buffer in bytes, and receives the
        remaining bytes free in the buffer upon return.

Return Value:

    None

--*/

{
    PAGED_CODE();

    DebugTrace(+1, Dbg, "FatQueryPositionInfo...\n", 0);

    //
    //  Get the current position found in the file object.
    //

    Buffer->CurrentByteOffset = FileObject->CurrentByteOffset;

    //
    //  Update the length and status output variables
    //

    *Length -= sizeof( FILE_POSITION_INFORMATION );

    DebugTrace( 0, Dbg, "*Length = %08lx\n", *Length);

    DebugTrace(-1, Dbg, "FatQueryPositionInfo -> VOID\n", 0);

    UNREFERENCED_PARAMETER( IrpContext );

    return;
}

Not all that different, right? Anyway, I figured I should set an on-access breakpoint to see where the information I set in the buffer was read:

1: kd> ba r 4 937dd744  
1: kd> g
Breakpoint 0 hit
nt!ExDeferredFreePool+0x2e3:
82931943 894604          mov     dword ptr [esi+4],eax

So as you can see, the buffer wasn't accessed until the memory was freed. I then decided to compare this with what happens when my filter is not in the picture:

1: kd> ba r 4 94b6c874  
1: kd> g
Breakpoint 0 hit
nt!memcpy+0x134:
8284f8f4 89448ffc        mov     dword ptr [edi+ecx*4-4],eax
1: kd> kn
 # ChildEBP RetAddr  
00 9a037bf8 828b31c3 nt!memcpy+0x134
01 9a037c48 82a6e017 nt!IopCompleteRequest+0xa0
02 9a037d18 8285444a nt!NtQueryInformationFile+0x86c
03 9a037d18 779c64f4 nt!KiFastCallEntry+0x12a
...

And this was where it hit me. Since this was BUFFERED_IO, the IO manager was supposed to copy the data I provided back into the user's buffer. But as we can clearly see, with my filter there was no access to the buffer at all. This is because I didn't set IoStatus.Information properly. I did set it to 0 somewhere in the beginning of the function but forgot to set it to reflect the size of the data I filled in when completing the request. The net result was that ProcMon was smart enough not to display anything because it got no data back, but the application had no idea this was going on, so it used whatever happened to be in the buffer as the actual position and then proceeded to set that position as the EndOfFile. After setting IoStatus.Information properly the application started working. For completeness, this is the code that worked:


            case FilePositionInformation:
                {
                    PFILE_POSITION_INFORMATION positionInfo = Data->Iopb->Parameters.QueryFileInformation.InfoBuffer;

                    // retrieve the current position from the file object

                    positionInfo->CurrentByteOffset = FltObjects->FileObject->CurrentByteOffset;

                    // then complete the operation

                    Data->IoStatus.Status = STATUS_SUCCESS;
                    Data->IoStatus.Information = sizeof(positionInfo->CurrentByteOffset);
                    callbackStatus = FLT_PREOP_COMPLETE;
                }
                break;
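The copy-back behavior is easy to model. Here is a sketch in C of what the buffered-IO completion path does with IoStatus.Information; the names are invented for illustration (the real work happens in IopCompleteRequest), but it shows why Information = 0 leaves the caller staring at whatever garbage was already in its buffer.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of buffered-IO completion: the IO manager copies back only
   IoStatus.Information bytes from the system buffer to the caller's
   buffer. The types below are stand-ins, not the real kernel ones. */

typedef struct {
    int    status;        /* stand-in for IoStatus.Status */
    size_t information;   /* stand-in for IoStatus.Information */
} io_status_block;

/* Complete a buffered query: copy back what the driver said it wrote.
   If the filter forgot to set Information, this copies nothing and the
   caller is left reading stale data from its own buffer. */
void complete_buffered_request(const io_status_block *iosb,
                               const void *system_buffer,
                               void *user_buffer)
{
    memcpy(user_buffer, system_buffer, iosb->information);
}
```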

In closing I'd like to point to Doron's blog on How to return the number of bytes required for a subsequent operation that explains some more of the inner workings on IoStatus.Information and how the IO manager uses it.