Bolts: A Block-Level Interface for File-Based Deduplication
In recent years, multiple cloud and cluster service providers have implemented deduplication policies to trim the redundancy of storing the same information across multiple Virtual Machine Images (VMIs). These solutions operate at either the block level or the file level. Most use a block-level policy, for two main reasons: block-level deduplication usually offers somewhat better deduplication ratios when used aggressively, and emulators communicate with storage at the block level because of the file systems they rely upon.
However, recent studies have shown block deduplication to be too costly for the small improvement in saved space it gives over whole-file deduplication, even when its complexity is increased. Dutch T. et al. found that file-based deduplication achieves 75% of the savings of the most aggressive block scheme on live file systems, and comes close to it (90%) on backup / offline file systems.
These results encouraged us to propose a new solution that brings the advantages of a low-overhead file deduplication mechanism to live virtual machines through a block-level interface built around our deduplication scheme. Our project can be split into two parts:
The offline phase: during this phase the deduplication itself happens. It works by maintaining two files that represent the hash maps. The first is used online and maps both physical and logical block ranges to our unique files, wrapped in a vector of maps that is explained further in the implementation section. The second is used only offline (during this phase) to keep track of all our non-redundant files, as a map from SHA-1 sum keys to file identifiers.
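As a rough illustration of the two structures described above, the following Python sketch keeps a SHA-1 map for unique files and one sorted extent list per image. The names `DedupIndex`, `add_image`, `lookup`, and `BLOCK_SIZE` are hypothetical, the 4096-byte block size is an assumption, and the caller is assumed to already know the starting block of each file; this is not the authors' implementation.

```python
import hashlib
from bisect import bisect_right

BLOCK_SIZE = 4096  # assumed block size, not taken from the paper


class DedupIndex:
    """Toy offline index: SHA-1 -> file id, plus one extent list per image."""

    def __init__(self):
        self.sha1_to_id = {}  # offline-only map: SHA-1 hex digest -> file id
        self.store = []       # file id -> content of each unique file
        self.images = []      # "vector of maps": one extent list per image

    def add_image(self, files):
        """files: iterable of (start_block, content) pairs for one image."""
        extents = []
        for start_block, content in files:
            digest = hashlib.sha1(content).hexdigest()
            file_id = self.sha1_to_id.get(digest)
            if file_id is None:  # first time this content is seen
                file_id = len(self.store)
                self.sha1_to_id[digest] = file_id
                self.store.append(content)
            n_blocks = -(-len(content) // BLOCK_SIZE)  # ceiling division
            extents.append((start_block, n_blocks, file_id))
        extents.sort()
        self.images.append(extents)
        return len(self.images) - 1  # image id

    def lookup(self, image_id, block):
        """Return (file_id, block offset in that file), or None if unmapped."""
        extents = self.images[image_id]
        # Rightmost extent whose start block is <= the requested block.
        i = bisect_right(extents, (block, float("inf"), float("inf"))) - 1
        if i < 0:
            return None
        start, n_blocks, file_id = extents[i]
        if block < start + n_blocks:
            return file_id, block - start
        return None
```

Two images that contain the same file end up pointing at a single entry in `store`, while each image keeps its own block-range mapping for online lookups.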
The online phase: during this phase we start a server that grants access to one of our deduplicated virtual machines. The server first sets up the environment: the previously mentioned hash map file is read and our vector-of-maps structure is populated before any connection is accepted. Afterwards, we open a Unix socket and wait for connections and requests, which can be served fully in parallel (the work inside the server is also explained in further detail in the implementation section). This project does not yet support writing to the deduplicated image, so all clients first link a qcow2 snapshot to our Unix socket and then boot from that file; this is the moment when the connection is established.
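The online flow (load state first, then serve read requests in parallel over a Unix socket) might look roughly like the Python sketch below. The request layout in `REQUEST` (an 8-byte offset followed by a 4-byte length) is invented for illustration and is not the protocol Bolts speaks. The in-memory `image` bytes object stands in for lookups through the vector of maps, and `serve`, `ReadHandler`, and `read_blocks` are hypothetical names.

```python
import socket
import socketserver
import struct
import threading

REQUEST = struct.Struct("!QI")  # hypothetical request: byte offset, length


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closed early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return bytes(buf)


class ReadHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Serve read requests on this connection until the client disconnects.
        while True:
            header = recv_exact(self.request, REQUEST.size)
            if header is None:
                return
            offset, length = REQUEST.unpack(header)
            data = self.server.image[offset:offset + length]
            # Pad short reads with zeros so a reply always has `length` bytes.
            self.request.sendall(data.ljust(length, b"\0"))


def serve(path, image):
    """Attach the preloaded state, then accept connections on a Unix socket."""
    server = socketserver.ThreadingUnixStreamServer(path, ReadHandler)
    server.daemon_threads = True
    server.image = image  # read-only, so handler threads share it lock-free
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def read_blocks(path, offset, length):
    """Tiny client used to exercise the server."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall(REQUEST.pack(offset, length))
        return recv_exact(s, length)
```

Because the sketch is read-only, each connection gets its own thread and no locking is needed, which mirrors why read requests can be served fully in parallel while writes are left to the client-side snapshot.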
To test our solution we use two different setups. First, we run our disk access tests on a live virtual machine booting over Bolts and compare the results to a native live boot to see how much overhead we induce. Second, we gather deduplication data offline by observing the deduplication ratio over 1000 virtual machine images.
Our solution shows less than 1 second of overhead, while achieving a deduplication percentage of 83% over 1000 VMs.