2021 Volume E104.D Issue 5 Pages 697-708
Emerging byte-addressable non-volatile memory devices attract much attention. A non-volatile main memory (NVMM) built on them enables larger memory size and lower power consumption than a traditional DRAM main memory. To fully utilize an NVMM, both software and hardware must be cooperatively optimized. Simultaneously, even focusing on a memory module, its micro architecture is still being developed though real non-volatile memory modules, such as Intel Optane DC persistent memory (DCPMM), have been on the market. Looking at existing NVMM evaluation environments, software simulators can evaluate various micro architectures with their long simulation time. Emulators can evaluate the whole system fast with less flexibility in their configuration than simulators. Thus, an NVMM emulator that can realize flexible and fast system evaluation still has an important role to explore the optimal system. In this paper, we introduce an NVMM emulator for embedded systems and explore a direction of optimization techniques for NVMMs by using it. It is implemented on an SoC-FPGA board employing three NVMM behaviour models: coarse-grain, fine-grain and DCPMM-based. The coarse and fine models enable NVMM performance evaluations based on extensions of traditional DRAM behaviour. The DCPMM-based model emulates the behaviour of a real DCPMM. Whole evaluation environment is also provided including Linux kernel modifications and several runtime functions. We first validate the developed emulator with an existing NVMM emulator, a cycle-accurate NVMM simulator and a real DCPMM. Then, the program behavior differences among three models are evaluated with SPEC CPU programs. As a result, the fine-grain model reveals the program execution time is affected by the frequency of NVMM memory requests rather than the cache hit ratio. Comparing with the fine-grain model and the coarse-grain model under the condition of the former's longer total write latency than the latter's, the former shows lower execution time for four of fourteen programs than the latter because of the bank-level parallelism and the row-buffer access locality exploited by the former model.