Thursday, April 26, 2012

Smart Reading and Storing of Files

I often run into tools that move files around (back-up solutions or cloud storage solutions or file copy utilities) that don't support some core NTFS features like alternate data streams. This lack of support for a very useful feature (just think about how many different file formats are out there that provide little value besides adding support for storing metadata along with the main data stream) is very frustrating.

Anyway, from looking at various applications that don't support this I see two main reasons for it. One is that they're multi-platform apps that only support the common feature set between all the platforms (which makes sense in some cases but doesn't really make much sense in other cases). The other reason is that I suspect people just don't want to have to deal with enumerating alternate data streams and then coming up with a mechanism to serialize them into a single file. On the other hand, most such solutions must read other file metadata and preserve it, such as file attributes, timestamps and so on, which I suspect they do on their own, using various Windows APIs that provide that functionality. Smarter solutions need to also deal with sparse files (it would be pretty stupid for a cloud storage solution to ignore the file system information that a large chunk of a file is all 0s and instead use bandwidth to transfer those) and possibly even hardlinks and reparse points.

Fortunately, this is simpler than it sounds, at least on Windows :). There is a set of Windows APIs that operate on a stream of data that describes a file, complete with alternate data streams information and sparse blocks information and security information and so on. This stream is formatted in a way that allows the caller to understand what type of data each part of the stream represents. The API set even allows skipping certain types of file data that are not relevant to the caller. The APIs I'm referring to are known as the Backup APIs: BackupRead, BackupSeek and BackupWrite.

There is also the WIN32_STREAM_ID structure, which describes the information that follows the structure in the stream and allows for figuring out whether the stream is interesting for the caller, how long it is and so on.
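To make the structure a bit more concrete, here is a minimal sketch in Python of decoding one WIN32_STREAM_ID header from a buffer of bytes that was previously produced by the Backup APIs. It assumes the serialized layout is the 20-byte fixed part (dwStreamId, dwStreamAttributes, Size, dwStreamNameSize) followed by dwStreamNameSize bytes of UTF-16 name. The names `StreamHeader`, `parse_stream_header` and `STREAM_TYPES` are made up for this example and are not part of any Windows API.

```python
import struct
from typing import NamedTuple

# Fixed part of WIN32_STREAM_ID as it appears in the stream (little-endian,
# no padding): dwStreamId (DWORD), dwStreamAttributes (DWORD),
# Size (LARGE_INTEGER), dwStreamNameSize (DWORD). cStreamName follows.
_HEADER = struct.Struct("<IIqI")  # 20 bytes

# Stream type values from the Windows headers.
STREAM_TYPES = {
    1: "BACKUP_DATA",
    2: "BACKUP_EA_DATA",
    3: "BACKUP_SECURITY_DATA",
    4: "BACKUP_ALTERNATE_DATA",
    5: "BACKUP_LINK",
    6: "BACKUP_PROPERTY_DATA",
    7: "BACKUP_OBJECT_ID",
    8: "BACKUP_REPARSE_DATA",
    9: "BACKUP_SPARSE_BLOCK",
    10: "BACKUP_TXFS_DATA",
}


class StreamHeader(NamedTuple):
    """Hypothetical container for one decoded header (illustration only)."""
    stream_id: int       # dwStreamId
    attributes: int      # dwStreamAttributes
    size: int            # Size: number of data bytes that follow the name
    name: str            # cStreamName, empty for unnamed streams
    header_length: int   # fixed part plus the name, in bytes


def parse_stream_header(buf: bytes, offset: int = 0) -> StreamHeader:
    """Decode the header that starts at `offset` in `buf`."""
    stream_id, attributes, size, name_size = _HEADER.unpack_from(buf, offset)
    name_start = offset + _HEADER.size
    # dwStreamNameSize is a byte count, not a character count.
    name = buf[name_start:name_start + name_size].decode("utf-16-le")
    return StreamHeader(stream_id, attributes, size, name,
                        _HEADER.size + name_size)


# Synthetic sample: an alternate data stream header followed by 5 data bytes.
sample_name = ":meta:$DATA".encode("utf-16-le")
sample = struct.pack("<IIqI", 4, 0, 5, len(sample_name)) + sample_name + b"hello"
header = parse_stream_header(sample)
print(STREAM_TYPES.get(header.stream_id, "unknown"), header.name, header.size)
```

With the header decoded, the caller knows the type of the data that follows and how many bytes to read or skip before the next header.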

These APIs allow the caller to read the file contents and metadata as one long stream of bytes, skip through the stream, and write all the information back as one stream of bytes. Moreover, since the stream is formatted, it's also possible to read back only a certain type of data. For example, consider a solution that archives data to some cloud storage and then wants to restore it into a file on an OS other than Windows. It's quite easy to record in some database where the main data stream begins and how long it is, so that the solution can download only that part for that platform.
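The bookkeeping described above can be sketched roughly like this in Python. It walks an archived backup stream header by header and records where each piece of data starts and how long it is, so the main data stream can later be fetched by offset and length alone. It assumes the same 20-byte header layout as WIN32_STREAM_ID; `index_streams`, `main_data_range` and `pack_stream` are hypothetical helpers for this example, and the sample bytes are synthetic rather than real BackupRead output.

```python
import struct

BACKUP_DATA = 1  # dwStreamId of the main (unnamed) data stream
_HEADER = struct.Struct("<IIqI")  # dwStreamId, dwStreamAttributes, Size, dwStreamNameSize


def index_streams(blob: bytes):
    """Return a list of (stream_id, name, data_offset, data_length) tuples."""
    entries = []
    pos = 0
    while pos + _HEADER.size <= len(blob):
        stream_id, _attrs, size, name_size = _HEADER.unpack_from(blob, pos)
        if size < 0:
            raise ValueError(f"negative stream size at offset {pos}")
        name_start = pos + _HEADER.size
        data_start = name_start + name_size
        name = blob[name_start:data_start].decode("utf-16-le")
        entries.append((stream_id, name, data_start, size))
        pos = data_start + size  # the next header follows the data
    return entries


def main_data_range(entries):
    """Return (offset, length) of the first BACKUP_DATA entry, or None."""
    for stream_id, _name, offset, length in entries:
        if stream_id == BACKUP_DATA:
            return offset, length
    return None


def pack_stream(stream_id: int, name: str, payload: bytes) -> bytes:
    """Build one synthetic header + name + data record (test data only)."""
    raw_name = name.encode("utf-16-le")
    return _HEADER.pack(stream_id, 0, len(payload), len(raw_name)) + raw_name + payload


# Synthetic archive: security data, main data, one alternate data stream.
archive = (pack_stream(3, "", b"\x01\x02")
           + pack_stream(1, "", b"main contents")
           + pack_stream(4, ":meta:$DATA", b"tag"))
offset, length = main_data_range(index_streams(archive))
print(archive[offset:offset + length])
```

The index is what would go into the database. A client on another platform then only needs a ranged download of `length` bytes starting at `offset`. Note that this sketch does not reassemble sparse files, which as far as I understand are described by additional BACKUP_SPARSE_BLOCK entries.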

There is quite a bit of documentation on these APIs and on what a backup solution should do with them and so on. There are some documents under the [MS-BKUP]: Microsoft NT Backup File Structure page and even a basic sample, Creating a Backup Application.

So please, next time you run across a solution that doesn't handle sparse files or alternate data streams or some other such features, feel free to point the developers to these APIs :)