Thursday, June 16, 2011

Rename in File System Filters - part I

Rename is one of the operations that many different classes of minifilters need to handle. How rename works has surprised me on a couple of occasions, so I figured it might be worthwhile to talk about how it is implemented and what particular problems it creates for file system filters.

Before we go any further, there are some architectural decisions that are important to understand in order to see where some of the peculiarities of the rename operation come from:

  • In general, the file system on a volume isn't aware of the other volumes mounted by the same file system. This means that a mounted file system (which I will refer to as a file system volume, to differentiate it from a storage volume) never performs operations on a different file system volume. For example, the NTFS code running on C: never touches the NTFS structures for D: and is generally unaware of the existence of D: altogether. This is a big architectural decision, and it has the advantage that it can tremendously simplify the implementation. It's easy to imagine how complicated the file system stack would become if a file system had to take locks and references against a different file system volume for some operations. Because of this, rename operations at the file system level only support the same file system volume, and in fact even the OS doesn't support cross-volume renames. (Actually, this depends on what your definition of the OS is: there is no support at the NT level, which to me is the OS, but Win32 does introduce some support for cross-volume renames.)
  • File paths are a big part of what a file system does. In fact, maintaining the namespace is the most important function of a file system, considering that file data storage is largely implemented by the storage stack and all the file system has to do is translate file offsets into storage offsets (to be fair, there are cases where the file system does more than that, for example when compressing or encrypting). So clearly a lot of the code in a file system is dedicated to maintaining and operating on the namespace. The one code path that obviously needs to deal with file names is the IRP_MJ_CREATE path (regardless of whether files and directories are being opened or created), so naturally a lot of the coding effort goes into implementing and optimizing that path. Once a file is opened, the file system would prefer not to deal with the file name at all (and this is indeed how it's implemented). With this in mind, it's easy to see why any operation that deals with names should reuse as much of the CREATE code path as possible. This reuse can take two forms: either the OS issues an IRP_MJ_CREATE every time it needs to pass a file path to the file system and gets back a FILE_OBJECT on which to operate, or the file system internally calls many of the functions that collectively make up its create path (which doesn't really happen). This is why a lot of the Win32 APIs take file paths (like DeleteFile()) and then internally convert the call into "handle = ZwCreateFile(file_path), internal_operation(handle), ZwClose(handle)".

First, let's look at the APIs that are usually used to rename a file. The user-mode APIs are:

BOOL WINAPI MoveFile(
  __in  LPCTSTR lpExistingFileName,
  __in  LPCTSTR lpNewFileName
);

BOOL WINAPI MoveFileEx(
  __in      LPCTSTR lpExistingFileName,
  __in_opt  LPCTSTR lpNewFileName,
  __in      DWORD dwFlags
);

BOOL WINAPI MoveFileWithProgress(
  __in      LPCTSTR lpExistingFileName,
  __in_opt  LPCTSTR lpNewFileName,
  __in_opt  LPPROGRESS_ROUTINE lpProgressRoutine,
  __in_opt  LPVOID lpData,
  __in      DWORD dwFlags
);

BOOL WINAPI MoveFileTransacted(
  __in      LPCTSTR lpExistingFileName,
  __in_opt  LPCTSTR lpNewFileName,
  __in_opt  LPPROGRESS_ROUTINE lpProgressRoutine,
  __in_opt  LPVOID lpData,
  __in      DWORD dwFlags,
  __in      HANDLE hTransaction
);

These APIs follow a sort of progression, each one adding more options. However, for a file system developer it doesn't really matter whether the user wants to track the progress of the operation or whether they specify a transaction (since at the file system filter's level the transaction is always available if the OS supports transactions). The most important parameters are the existing file name, the destination file name and the flags. Of the flags, the most important one is MOVEFILE_COPY_ALLOWED, because it instructs the Win32 layer that if the destination file is on a different volume, it should copy the file to the new volume and delete it from the source volume. This flag is usually specified, and it's on by default when calling MoveFile (on my Win7 VM, MoveFileW() is just a wrapper over MoveFileWithProgressW() where this is the only flag specified).

Now, once the request gets to the OS layer, a rename is issued by calling ZwSetInformationFile() with the FileRenameInformation information class, which takes a FILE_RENAME_INFORMATION structure:

NTSTATUS ZwSetInformationFile(
  __in   HANDLE FileHandle,
  __out  PIO_STATUS_BLOCK IoStatusBlock,
  __in   PVOID FileInformation,
  __in   ULONG Length,
  __in   FILE_INFORMATION_CLASS FileInformationClass
);

typedef struct _FILE_RENAME_INFORMATION {
  BOOLEAN ReplaceIfExists;
  HANDLE  RootDirectory;
  ULONG   FileNameLength;
  WCHAR   FileName[1];
} FILE_RENAME_INFORMATION, *PFILE_RENAME_INFORMATION;

The documentation for FILE_RENAME_INFORMATION and for IRP_MJ_SET_INFORMATION is pretty good and explains a lot about the parameters. However, the documentation for ZwSetInformationFile does not mention the FileRenameInformation case, which is actually pretty interesting.

One thing I mentioned at the beginning of this post is that one file system volume never interacts with another file system volume. Since the target of a rename can be a full path that points to a different volume, the OS should only send the path down if it is on the same volume. So ZwSetInformationFile must validate whether the rename is within the same volume before sending the IRP_MJ_SET_INFORMATION request down. This is done in the function IopOpenLinkOrRenameTarget, which performs a few simple steps:

  1. Check whether FILE_RENAME_INFORMATION->RootDirectory is a user mode handle and, if so, convert it to a kernel handle by calling IoConvertFileHandleToKernelHandle. This is very important for filter writers because it means that a filter cannot assume that the RootDirectory handle in FILE_RENAME_INFORMATION is a kernel handle.
  2. Issue an IoCreateFileEx to open the target of the rename, using either a full path or a path relative to RootDirectory (depending on whether FILE_RENAME_INFORMATION->RootDirectory is NULL or not). This IoCreateFileEx inherits both the transaction and the DeviceObject hint from the source file object. Also, this IRP_MJ_CREATE always has the SL_OPEN_TARGET_DIRECTORY flag set.
  3. If the create succeeds, the DeviceObject for the source FILE_OBJECT is compared with the DeviceObject for the target FILE_OBJECT, and if they are different, IopOpenLinkOrRenameTarget returns STATUS_NOT_SAME_DEVICE.
  4. If they are the same, the new FILE_OBJECT is stored in IrpSp->Parameters.SetFile.FileObject. This FILE_OBJECT is always a directory: the parent directory of the path specified by FILE_RENAME_INFORMATION->FileName, always obtained from the IRP_MJ_CREATE with SL_OPEN_TARGET_DIRECTORY described above. The file system then uses it to determine the parent directory for the target of the rename.

Once IopOpenLinkOrRenameTarget returns, the only thing left for ZwSetInformationFile to do is send the IRP to the file system.

I have some more things to say about renames that I'll save for next week, but I'd like to close with a list of things that make life really hard for file system developers:

  • The fact that ZwSetInformationFile checks whether a rename would be a cross-volume rename before issuing the actual IRP_MJ_SET_INFORMATION means that all a file system filter sees at that point is an IRP_MJ_CREATE with SL_OPEN_TARGET_DIRECTORY, and it cannot know what the source file for this rename is. This is problematic for filters that might need to redirect renames to different locations depending on the source file (an application virtualization filter, for example, might want to redirect renames for certain files to some other location); such filters need to wait until the IRP_MJ_SET_INFORMATION request arrives and change the destination at that point. Another class of filters impacted by this are filters that create a virtual namespace on a volume (for example in a directory on that volume) by bringing in contents from another volume. Clearly renames won't work across volumes, but even if the filter is prepared to implement something similar to the MOVEFILE_COPY_ALLOWED flag and copy the file when it crosses volumes, it doesn't have that option, because the IRP_MJ_SET_INFORMATION simply won't arrive if the target and the source are on different volumes. One feature that I feel would be very helpful here would be to add the source FILE_OBJECT as an ECP on the IRP_MJ_CREATE issued by IopOpenLinkOrRenameTarget, which would allow filters to determine the source file for the rename.
  • Another thing that makes things unnecessarily complicated for filter developers is the reuse of the DeviceObject hint in IopOpenLinkOrRenameTarget. This means that any time a minifilter wants to rename a FILE_OBJECT that it created via FltCreateFile, the path it uses as FILE_RENAME_INFORMATION->FileName must be on the same device as that FILE_OBJECT. This isn't that big of a problem since the minifilter must already know the right device, but it might still need to do some file path manipulation to make sure that the path itself is on that device (which can happen if the target of a rename comes from the user and contains a reparse point). Just something to keep in mind, I guess.
  • The presence of a handle (possibly a user mode handle) in the FILE_RENAME_INFORMATION structure means that a minifilter that wants to pend a rename operation must take extra steps to make sure that the handle is still valid in the context of the process where the thread handling the pended rename runs. The minifilter must also keep all the data in FILE_RENAME_INFORMATION and IrpSp->Parameters.SetFile in sync, because it can't make any assumptions about which of them another file system filter, or even the file system itself, might use.
  • The fact that the FILE_OBJECT opened by IopOpenLinkOrRenameTarget is used by the file system when processing the rename (and in particular how it is used) also means that filters that want to issue their own rename operation (by directly issuing an IRP or FLT_CALLBACK_DATA) must duplicate the steps the OS takes in order to build a proper request. But more on this next week.