Have fun with the Linux kernel

Storage systems
I'd like to investigate the performance, scalability, reliability and security of storage systems in the future. Currently, I am particularly interested in deduplication storage systems. My interest in deduplication was specifically inspired by the Venti paper.

The most interesting problem in deduplication storage systems is how to handle the index efficiently. Several approaches have been proposed. One is to use a Bloom filter to speed up the existence check for a fingerprint. A second is to store the index on SSD. A third, which can be called sampled indexing in general, inserts a new index entry only with some probability. So it seems to me that there is little room left for improvement on this problem.
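To make the first and third approaches concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular system does it: the class names, the filter size, the salted SHA-1 bit positions, and the `sample_rate` parameter are all hypothetical, and a plain dict stands in for the expensive on-disk index.

```python
import hashlib
import random


class BloomFilter:
    """Fixed-size Bloom filter over block fingerprints (bytes)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, fp):
        # Derive k bit positions by hashing the fingerprint with k salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + fp).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fp):
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fp):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fp))


class DedupIndex:
    """Bloom filter in front of a (hypothetical) slow fingerprint index."""

    def __init__(self, sample_rate=1.0, seed=0):
        self.bloom = BloomFilter()
        self.index = {}            # stands in for the on-disk index
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)
        self.index_lookups = 0     # how often the slow index was consulted

    def is_duplicate(self, block):
        fp = hashlib.sha1(block).digest()
        if not self.bloom.might_contain(fp):
            return False           # skip the slow index entirely
        self.index_lookups += 1
        return fp in self.index

    def insert(self, block, location):
        # Sampled indexing: only index a block with probability sample_rate.
        if self.rng.random() < self.sample_rate:
            fp = hashlib.sha1(block).digest()
            self.bloom.add(fp)
            self.index[fp] = location
```

With `sample_rate=1.0` every block is indexed. Lowering it shrinks the index at the cost of missing some duplicates, which is the trade-off sampled indexing accepts.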

Another problem in deduplication storage systems is data fragmentation. Blocks belonging to a data stream are no longer laid out in sequential order, but disks are only good at sequential access. How to optimize read performance for deduplication storage systems is a problem that has not received enough attention yet. Maybe that is just because it is not important at all. :)
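A toy model shows how this fragmentation arises. The sketch below assumes an append-only block log and made-up data (two hypothetical images where the second shares every other block with the first). The helper names are illustrative only.

```python
import hashlib


def store_stream(blocks, log, index):
    """Append only previously unseen blocks to the log.

    Returns the stream's recipe: the log positions needed to rebuild it.
    """
    recipe = []
    for block in blocks:
        fp = hashlib.sha1(block).digest()
        if fp not in index:
            index[fp] = len(log)
            log.append(block)
        recipe.append(index[fp])
    return recipe


def count_seeks(recipe):
    """Count reads whose position does not directly follow the previous one."""
    return sum(1 for prev, cur in zip(recipe, recipe[1:]) if cur != prev + 1)


log, index = [], {}
# Hypothetical data: image B shares every other block with image A.
image_a = [b"A%d" % i for i in range(8)]
image_b = [image_a[i] if i % 2 == 0 else b"B%d" % i for i in range(8)]

recipe_a = store_stream(image_a, log, index)
recipe_b = store_stream(image_b, log, index)
print(recipe_a, count_seeks(recipe_a))
print(recipe_b, count_seeks(recipe_b))
```

The first image is read back from positions 0 through 7 with no seeks. The second has the recipe [0, 8, 2, 9, 4, 10, 6, 11], so every one of its 7 block-to-block transitions is a non-sequential jump.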

Current projects

Improve the performance of deduplication storage systems
Deduplication storage systems need massive, parallel computation for hashing, index lookup, block existence checks, compression, and decompression. We propose to use GPUs to accelerate these computations, which reduces their overhead on read and write operations. In deduplication storage systems, files are stored on disk in nonsequential order. However, disks are only good at sequential access. As a result, disks in deduplication storage systems suffer significant performance degradation and increased load. For a set of Linux images we store in Venti, we observe a significant drop (82.04%) in read performance: from 34.43 MB/s to only 6.19 MB/s. We are investigating the reasons for such a huge drop and trying to optimize it.
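For orientation, the per-block operations listed above (hash, existence check, compression, decompression) can be sketched on the CPU as a tiny content-addressed store. This is only an illustration and does not show the GPU work. The `BlockStore` class, the 4096-byte block size, and the choice of SHA-1 and zlib are assumptions, and a dict stands in for the on-disk index.

```python
import hashlib
import zlib


class BlockStore:
    """In-memory content-addressed block store (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> compressed block
        self.stats = {"written": 0, "deduplicated": 0}

    def write(self, data):
        fp = hashlib.sha1(data).digest()           # hash
        if fp in self.blocks:                      # existence check
            self.stats["deduplicated"] += 1
        else:
            self.blocks[fp] = zlib.compress(data)  # compression
            self.stats["written"] += 1
        return fp

    def read(self, fp):
        data = zlib.decompress(self.blocks[fp])    # lookup + decompression
        if hashlib.sha1(data).digest() != fp:      # verify on read
            raise ValueError("corrupt block")
        return data


def write_file(store, payload, block_size=4096):
    """Split payload into fixed-size blocks; return the list of fingerprints."""
    return [store.write(payload[i:i + block_size])
            for i in range(0, len(payload), block_size)]


def read_file(store, recipe):
    """Rebuild a payload from its list of fingerprints."""
    return b"".join(store.read(fp) for fp in recipe)
```

Each block's hash and compression are independent of every other block's, which is what makes these steps candidates for parallel execution.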

OpenVZ vnode migration in Emulab
This project implements the mechanism to migrate an OpenVZ vnode to another host. We use the checkpoint and resume functions provided by OpenVZ to migrate the process image, and we use an enhanced LVM to migrate the disk delta for file systems.

Past projects

June 2011 ~ Jan. 2012
High-performance Disk Imaging With Deduplicated Storage
In clouds and network testbeds, a disk image deployment system is needed to quickly distribute and install virtual machine images or operating system images at host devices. Previous work has shown that for these images, deduplication can save a significant amount of disk space. However, the read and write performance in deduplication storage systems is poor relative to traditional filesystem storage. In this work, we demonstrate that we can use deduplication storage systems as the backend of a high-performance image deployment system with only a negligible drop in performance by carefully pipelining to produce a balanced system.
[short paper][poster]
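As a rough illustration of the pipelining idea, the sketch below runs each stage in its own thread and connects stages with bounded queues, so a slow stage applies back-pressure instead of letting work pile up. It is a generic pattern, not the design from the paper. The stage functions, the `depth` parameter, and the in-memory `store` dict are hypothetical.

```python
import queue
import threading
import zlib

_DONE = object()  # sentinel that marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs, depth=4):
    """Push items through funcs, one thread per stage, preserving order."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    results = []

    def drain():
        # Collect the output of the last stage until the sentinel arrives.
        while True:
            item = queues[-1].get()
            if item is _DONE:
                return
            results.append(item)

    threads.append(threading.Thread(target=drain))
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)  # blocks when the first stage falls behind
    queues[0].put(_DONE)
    for t in threads:
        t.join()
    return results


# Hypothetical two-stage read path: fetch a compressed chunk, then decompress.
store = {i: zlib.compress(bytes([i]) * 4096) for i in range(16)}
chunks = run_pipeline(range(16), [store.__getitem__, zlib.decompress])
```

Because each stage is a single thread reading from a FIFO queue, chunks come out in the order they went in.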

Jan. 2011 ~ June 2011
Refining the Utility Metric for Utility-Based Cache Partitioning
Miss rate is widely used to determine cache partitioning for multi-core systems. However, a well-recognized fact in the community is that MPKI can lead to sub-optimal cache partitioning. This project quantifies the extent of that sub-optimality for MPKI-based cache partitioning and proposes a simple scheme for CPI prediction.
[paper] [source code]

Dec. 2010
Linux physical memory deduplication
The main goal is to deduplicate identical pages in physical memory. We implemented a kernel module that calculates a hash for every physical page on both x86 and x86_64 Linux. We also implemented another kernel module that exports the content of a single specified physical page. After we found that Linux already implements this function in mm/ksm.c, we stopped this project.
[source code]
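The per-page hashing idea can be illustrated in user space. The sketch below is not the kernel module and makes no claim about how mm/ksm.c works. It treats a bytes buffer as "memory" and assumes 4096-byte pages and SHA-1. Candidate matches found by hash are confirmed with a full byte comparison.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size


def find_duplicate_pages(memory, page_size=PAGE_SIZE):
    """Return groups of page numbers whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for offset in range(0, len(memory) - page_size + 1, page_size):
        page = memory[offset:offset + page_size]
        by_hash[hashlib.sha1(page).digest()].append(offset // page_size)

    groups = []
    for pages in by_hash.values():
        if len(pages) < 2:
            continue
        # A matching hash is only a hint: confirm with a full comparison.
        first = memory[pages[0] * page_size:(pages[0] + 1) * page_size]
        same = [p for p in pages
                if memory[p * page_size:(p + 1) * page_size] == first]
        if len(same) > 1:
            groups.append(same)
    return groups
```

For a buffer made of two zero pages, one page of 0x01 bytes, and another zero page, the function reports the single group [0, 1, 3].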

Resources:
storage-related I/O traces:
Traces from UCSC, SNIA traces
open source deduplication storage systems:
Venti, ZFS, opendedup