Modern file systems, such as ext4, btrfs, and XFS, continue to evolve, introducing new features to meet ever-changing demands and to improve reliability. File system developers strive to eliminate all software bugs, yet the operating system community points out that file systems remain a hotbed of critical software bugs. This paper analyzes the code coverage of xfstests, a widely used suite of file system tests, on three major file systems (ext4, btrfs, and XFS). The coverage is 72.34%, and the uncovered code amounts to 23,232 lines. To understand why the coverage is low, we manually examined the uncovered code line by line. We identified three major causes, peculiar to file systems, that hinder higher coverage. First, covering all the features is difficult because each file system provides a wide variety of file-system-specific features, and some features can be tested only on special storage devices. Second, covering all the execution paths is difficult because they depend on file system configurations and internal on-disk states. Finally, the code for maintaining backward compatibility is executed only when a file system encounters old formats. Our findings will help file system developers improve the coverage of test suites and provide insights into fostering the development of new methodologies for testing file systems.
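As a rough sanity check on the scale involved (an illustrative computation using only the two figures quoted in the abstract, assuming both refer to the same measured code base), the numbers imply roughly 84,000 instrumented lines in total:

```python
# Back-of-the-envelope check derived from the abstract's own figures
# (72.34% coverage, 23,232 uncovered lines); not taken from the paper.
coverage = 0.7234
uncovered_lines = 23_232

total_lines = uncovered_lines / (1.0 - coverage)   # ~83,993 lines measured
covered_lines = total_lines * coverage             # ~60,761 lines exercised
print(f"total ~= {total_lines:,.0f} lines, covered ~= {covered_lines:,.0f} lines")
```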
Hadoop is a popular open-source MapReduce implementation. For jobs such as TeraSort, in which the output files of all relevant Map tasks are large and are transmitted to the Reduce tasks, the Reduce tasks become the bottleneck and are I/O-bound while processing many large output files. In most such cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. To improve the performance of these jobs, it is important to increase sequential access performance. In this paper, we propose methods for improving the performance of the Reduce tasks of such jobs by considering two facts: these files are accessed sequentially on an HDD, and each zone of an HDD has different sequential I/O performance. The proposed methods control where intermediate data are stored by modifying the block bitmap of the file system, which records which blocks of the HDD are free or in use. In addition, we propose a striping layout for applying these methods to virtualized environments that use image files. We then present a performance evaluation of the proposed methods and demonstrate that they improve Hadoop application performance.
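The core idea can be sketched in a few lines: by pre-marking blocks in the slow inner zone as "used" in a block bitmap, the allocator is steered toward the faster outer zone for intermediate data. The following is a minimal conceptual sketch with hypothetical block counts and zone boundaries, not the authors' actual implementation:

```python
# Conceptual sketch: steer allocation toward the outer (faster) disk zone by
# pre-marking inner-zone blocks as "used" in a block bitmap, releasing them later.
# Block numbers and zone boundaries are hypothetical.

TOTAL_BLOCKS = 1_000_000
OUTER_ZONE_END = 400_000          # assume blocks [0, 400000) lie in the outer zone

bitmap = bytearray(TOTAL_BLOCKS)  # 0 = free, 1 = used (one byte per block for clarity)

def reserve_inner_zone():
    """Mark inner-zone blocks as used so new allocations fall in the outer zone."""
    for blk in range(OUTER_ZONE_END, TOTAL_BLOCKS):
        bitmap[blk] = 1

def release_inner_zone():
    """Return the inner-zone blocks to the allocator once the job finishes."""
    for blk in range(OUTER_ZONE_END, TOTAL_BLOCKS):
        bitmap[blk] = 0

def allocate_block():
    """First-fit allocation over the bitmap; with the inner zone reserved,
    intermediate data always lands in the outer zone."""
    for blk, used in enumerate(bitmap):
        if not used:
            bitmap[blk] = 1
            return blk
    raise RuntimeError("no free blocks")

reserve_inner_zone()
print(allocate_block())   # -> 0, i.e. an outer-zone block
```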
Hadoop is a popular data-analytics platform based on Google's MapReduce programming model. Hard-disk drives (HDDs) are generally used in big-data analysis, and the effectiveness of the Hadoop platform can be improved by enhancing its I/O performance. HDD performance varies depending on whether the data are stored in the inner or outer disk zones. This paper proposes a method that uses knowledge of job characteristics to store data efficiently on HDDs, which in turn improves Hadoop performance. In the proposed method, files that need to be accessed frequently are stored on outer disk tracks, which provide higher sequential access speeds than inner tracks. Specifically, the method stores temporary and permanent files in the outer and inner zones, respectively, thereby providing fast access to frequently required data. Results of a performance evaluation demonstrate that the proposed method improves Hadoop performance by 15.4% compared to the normal case in which no file placement is used. Additionally, the proposed method outperforms a previously proposed placement approach by 11.1%.
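The placement policy itself reduces to a simple classification step: decide whether a file is temporary (hot, short-lived) or permanent, and pick the zone accordingly. Below is a minimal sketch of that decision; the file-name markers are hypothetical examples, and the paper's method applies the policy inside the storage layer rather than in application code:

```python
# Minimal sketch of the zone-selection policy (markers are hypothetical;
# the actual method places data at the file-system/HDD level).

from enum import Enum

class Zone(Enum):
    OUTER = "outer"   # faster sequential access: hot, temporary data
    INNER = "inner"   # slower sequential access: long-lived, permanent data

# Suffixes/markers that, in this sketch, identify Hadoop temporary files.
TEMP_MARKERS = (".tmp", "_temporary", "intermediate", "spill")

def choose_zone(path: str) -> Zone:
    """Place frequently accessed temporary files in the outer zone and
    permanent files in the inner zone."""
    lowered = path.lower()
    if any(marker in lowered for marker in TEMP_MARKERS):
        return Zone.OUTER
    return Zone.INNER

print(choose_zone("/hadoop/mapred/local/spill0.out"))  # Zone.OUTER
print(choose_zone("/hdfs/data/final/part-r-00000"))    # Zone.INNER
```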
File system workloads are increasingly write-heavy. The growing capacity of RAM in modern nodes allows many reads to be satisfied from memory, while writes must be persisted to disk. Today's sophisticated local file systems, such as Ext4, XFS, and Btrfs, optimize for reads but suffer under workloads dominated by microdata (metadata and tiny files). In this paper we present an LSM-tree-based file system, RFS, which aims to exploit the write optimization of LSM-trees to provide enhanced microdata performance while offering matching performance for large files. RFS incrementally partitions the namespace into several metadata columns on a per-directory basis, preserving disk locality for directories and reducing the write amplification of LSM-trees. A write-ordered log-structured layout is used to store small files efficiently, rather than embedding the contents of small files into inodes. We also propose an optimization based on global bloom filters for efficient point lookups. Experiments show that our library version of RFS handles microwrite-intensive workloads 2-10 times faster than existing solutions such as Ext4, Btrfs, and XFS.
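To illustrate how a global bloom filter speeds up point lookups over an LSM-style structure, here is a minimal sketch: a negative answer from the shared filter lets a lookup skip every sorted run. The structure below is hypothetical and greatly simplified; it is not RFS's on-disk format:

```python
# Sketch of a bloom-filter-guarded point lookup over LSM-tree runs
# (hypothetical, simplified structure; not the RFS implementation).

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, hashes: int = 4):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# A "global" filter shared across sorted runs: one negative check skips them all.
global_filter = BloomFilter()
runs = []                      # each run maps key -> metadata record

def put(key: str, value: dict):
    runs.append({key: value})  # simplified: one run per write batch
    global_filter.add(key)

def lookup(key: str):
    if not global_filter.may_contain(key):
        return None            # definite miss: no run needs to be searched
    for run in reversed(runs): # newest run first
        if key in run:
            return run[key]
    return None

put("/home/user/a.txt", {"inode": 42})
print(lookup("/home/user/a.txt"))   # {'inode': 42}
print(lookup("/home/user/b.txt"))   # None, usually without touching any run
```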
Protecting critical files in an operating system is very important to system security. With the increasing adoption of Virtual Machine Introspection (VMI), designing VMI-based monitoring tools has become a preferential choice, offering promising features such as isolation, stealthiness, and quick recovery from crashes. However, these tools inevitably introduce high overhead because of their operation-based design: they must intercept certain file operations and perform monitoring whenever those operations are executed, regardless of whether the accessed files are critical or not. Because file operations are frequent, operation-based methods often cause serious performance degradation. In this paper we therefore present CFWatcher, a target-based real-time monitoring solution that protects critical files by leveraging VMI techniques. As a target-based scheme, CFWatcher constrains monitoring to operations that access target files defined by users. Consequently, the overhead depends on how frequently the target files are accessed rather than on activity across the whole file system, which dramatically reduces it. To validate our solution, a prototype system was built on Xen with full virtualization; it is able to monitor both Linux and Windows virtual machines and can take actions to prevent unauthorized access according to predefined policies. Extensive evaluations demonstrate that the overhead introduced by CFWatcher is acceptable and, in particular, is very low when only a few target files are monitored.
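The target-based idea can be expressed as a simple filter applied to each intercepted operation: non-target files pass through immediately, and only accesses to user-defined critical files incur policy checks. The sketch below uses a hypothetical in-process API; a real VMI tool would intercept guest file operations through the hypervisor (e.g., Xen events), which is not shown here:

```python
# Minimal sketch of target-based filtering of intercepted file operations
# (hypothetical API; hypervisor-level interception is not modeled).

from dataclasses import dataclass

# User-defined critical files and the policy to apply to each.
TARGETS = {
    "/etc/passwd": "deny_write",
    "/etc/shadow": "deny_all",
}

@dataclass
class FileOp:
    path: str       # file the guest is accessing
    kind: str       # "read", "write", "unlink", ...
    process: str    # process name inside the guest

def on_intercepted_op(op: FileOp) -> str:
    """Non-target files return immediately, so monitoring cost scales with
    accesses to target files rather than with all file-system activity."""
    policy = TARGETS.get(op.path)
    if policy is None:
        return "allow"                       # not a critical file: no further work
    if policy == "deny_all":
        return "block"
    if policy == "deny_write" and op.kind in ("write", "unlink"):
        return "block"
    return "allow_and_log"

print(on_intercepted_op(FileOp("/tmp/a.log", "write", "app")))      # allow
print(on_intercepted_op(FileOp("/etc/passwd", "write", "malware"))) # block
```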
The fusion of the research field of high-performance computing (HPC) with that of big data, which has become known as the field of extreme big data, is problematic in that file creation in storage systems such as distributed file systems is not optimized: large workloads lead to the simultaneous creation of many files by many processes, for example when writing checkpoints. The need to improve file creation prompted us to design a scale-out distributed file system for post-petascale systems named PPFS. PPFS consists of PPMDS, a scale-out distributed metadata server, and PPOSS, a scalable distributed storage server for flash storage. The high file creation performance of PPMDS is achieved by using a key-value store for metadata storage and non-blocking distributed transactions to update multiple entries simultaneously. PPOSS relies on PPOST, an object storage system that manages the underlying low-level storage, such as Fusion IO ioDrive, a flash device connected through PCI Express that supports OpenNVM. High file creation performance is attained by implementing the PPFS prototype with a file creation optimization, termed bulk creation, that reduces the amount of communication between PPMDS and PPOSS. In addition, to enhance the I/O performance of PPOSS when the client process and PPOSS run on the same node, PPOSS accesses the local storage device directly. The prototype implementation of PPFS with a further file creation optimization, called object prefetching, achieves 138,000 operations per second for file creation when using five metadata servers and 128 client processes, exceeding the performance of IndexFS by 2.52 times. With the local access optimization, PPOSS reaches its limit at a block size of 16 KiB, an improvement of 1.5 times compared to the unoptimized case. Furthermore, the evaluation indicates that PPFS has the good scalability in file creation and I/O performance that is required for post-petascale systems.
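The benefit of bulk creation comes from amortizing round trips: instead of one metadata request per file, a client batches many creates into a single request. The sketch below illustrates only that batching effect with a hypothetical RPC stub; the real PPFS protocol between PPMDS and PPOSS is not shown:

```python
# Minimal sketch of the "bulk creation" idea (hypothetical RPC interface):
# batching creates cuts client/metadata-server round trips by the batch size.

def create_rpc(names):
    """Stand-in for one round trip to the metadata server; returns per-file ids."""
    return {name: hash(name) & 0xFFFFFFFF for name in names}

def create_files_naive(names):
    rpcs, ids = 0, {}
    for name in names:
        ids.update(create_rpc([name]))              # one round trip per file
        rpcs += 1
    return ids, rpcs

def create_files_bulk(names, batch=128):
    rpcs, ids = 0, {}
    for i in range(0, len(names), batch):
        ids.update(create_rpc(names[i:i + batch]))  # one round trip per batch
        rpcs += 1
    return ids, rpcs

names = [f"/checkpoint/rank{i:04d}.dat" for i in range(1024)]
_, naive_rpcs = create_files_naive(names)
_, bulk_rpcs = create_files_bulk(names)
print(naive_rpcs, bulk_rpcs)   # 1024 vs 8 round trips
```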