Comprehensive testing with latency, bandwidth, throughput, and application benchmarks
Optane 3D XPoint device

LLNL Evaluates Intel Optane 3D XPoint Performance

Monday, July 10, 2017

Testing Methodology | Fio Performance | LRIOT Performance | User Level Cache Performance | Appendix

LLNL’s evaluation team includes postdoctoral fellow Keita Iwabuchi and summer intern Joshua Landgraf under the guidance of Maya Gokhale and Robin Goldstone. (Image credit above: Intel Corporation.)

This study evaluates the performance of the Optane 3D XPoint device. We ran several input/output (I/O) benchmarks and application-level benchmarks on Xeon servers with two NVMe-attached solid state drives (SSDs): the Optane and the HGST SN150. The I/O benchmarks evaluate latency and bandwidth under various read/write scenarios. The application-level benchmarks use the memory map abstraction to access out-of-core data structures through either system mmap or a C++ User Level Cache (ULC) with smart pointers. System mmap caches pages at a 4K granularity; the ULC can cache at an arbitrary granularity set by the application, and the ULC evaluation uses several sizes from 4K to 1M.

Testing Methodology

Our XPoint test node was configured with Linux 4.11.0-rc5, 128GB of RAM, and two Xeon E5-2670 v3 (12 cores @ 2.3-3.1GHz with 30MB cache). In addition to the Optane P4800X, it also included a 1.6TB HGST Ultrastar SN150, which has an advertised throughput of 3,000 MB/s read and 1,600 MB/s write, 743,000 random read IOPS and 140,000 random write IOPS, and 20us 512B latency. Unless otherwise specified, all tests performed on an HGST SSD used this node.

We also used nodes from our Flash cluster. These have the same hardware as the XPoint node, except with Linux 3.10, 256GB of RAM, and hyperthreading on.

Unless otherwise specified, all tests were run with files on XFS (4K block size).

Unless otherwise specified, all tests were run with 128GB of RAM (many tests, including fio, used direct I/O, which should not be affected by available RAM). When performing tests with less than the maximum available RAM on the node, we wasted memory by allocating data on each NUMA node. For instance, to reduce RAM from 128GB to 64GB on the XPoint node, we allocated 32GB of data on each of the two NUMA nodes so that it would not be available during the tests.
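The memory-wasting step can be sketched as follows. This is a minimal illustration, not the exact code used in the study; it assumes NUMA placement is handled externally (e.g., by running one instance per node under `numactl --membind=<node>`):

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Allocate `bytes` of anonymous memory, touch every page so it is
// actually backed by physical RAM, and try to lock it so it cannot
// be reclaimed during the benchmark. Running one instance per NUMA
// node (bound with numactl) removes that much RAM from each node.
static char* waste_memory(std::size_t bytes) {
    char* p = static_cast<char*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (p == MAP_FAILED) return nullptr;
    std::memset(p, 0xA5, bytes);   // fault in every page
    mlock(p, bytes);               // best effort; may need raised RLIMIT_MEMLOCK
    return p;
}
```

In the actual runs, two such allocations of 32GB each (one per NUMA node) reduced usable RAM from 128GB to 64GB.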

Unless otherwise specified, all I/O was done on a 4KB granularity.

The tests were done primarily with the default Optane settings. We tried changing the sector size to 4K but did not observe a performance difference.

Fio Performance

This test measures latency on our test configuration and compares the XPoint's latency to that of our HGST drive. We did this through simple fio benchmarks on both drives, using different combinations of queue depth and number of processes. The “Mixed Read” and “Mixed Write” workloads are 50% read, 50% write, with latency reported for the read (resp. write) portion of the workload. The graphs below show average latency with 1 standard deviation error bars. More comprehensive data is included in tables in the appendix; the summary latency statistics there include the 50%, 90%, and 99.9% quantiles, and the final column shows bandwidth for the offered load.
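A fio job file of the kind swept here might look like the following. The device path, runtime, and job names are illustrative, not the exact job files used in the study:

```ini
[global]
filename=/dev/nvme0n1   ; hypothetical device path
ioengine=libaio
direct=1                ; bypass the page cache
bs=4k                   ; 4 KB I/O granularity
runtime=60
time_based=1
group_reporting=1

[randread]
rw=randread
iodepth=1               ; queue depth, swept across runs
numjobs=1               ; number of processes, swept across runs

[randrw-mixed]
stonewall               ; run after the previous job finishes
rw=randrw
rwmixread=50            ; 50% read / 50% write mixed workload
```

With `ioengine=libaio`, the `iodepth` parameter controls the number of outstanding requests per process; multiplying by `numjobs` gives the total offered load.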

Figure 1. Average random read latency. (Click to enlarge.)

Figure 2. Average random write latency. (Click to enlarge.)

Figure 3. Average random mixed read latency. (Click to enlarge.)

Figure 4. Average random mixed write latency. (Click to enlarge.)

From these results, we can see that the XPoint drive has substantially better read latencies than the HGST drive, sometimes 10x faster. Write latencies are also better, though the HGST drive still does quite well with writes. It is also worth noting that the XPoint drive is able to get much higher bandwidth than the HGST drive when using low queue depths (see tables in the appendix). With deeper queues, the HGST drive sacrifices latency for better throughput. We note that the HGST drive shows higher latency variance and greater reduction in bandwidth with mixed workloads than the Optane.

LRIOT Performance

LRIOT is a throughput-oriented open source benchmark developed at LLNL. We swept over a large number of test scenarios to determine each device’s maximum IOPS as well as the number of threads needed to achieve that level of performance. Below are the results for the XPoint and HGST drives using both mmap and direct I/O.
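The two access paths compared here differ in how a block reaches the application: direct I/O issues an explicit aligned read that bypasses the page cache, while mmap relies on page faults. A minimal sketch of the direct I/O path (not LRIOT's own code) looks like this; note that `O_DIRECT` requires an aligned buffer, and the sketch falls back to buffered I/O on filesystems that do not support it:

```cpp
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

// Read one 4 KB block at `offset` (which must be 4 KB aligned),
// bypassing the page cache when the filesystem supports O_DIRECT.
static ssize_t read_block_direct(const char* path, off_t offset, char* out) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) fd = open(path, O_RDONLY);   // fall back to buffered I/O
    if (fd < 0) return -1;
    void* buf = nullptr;
    // O_DIRECT requires the buffer to be aligned to the logical block size.
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return -1; }
    ssize_t n = pread(fd, buf, 4096, offset);
    if (n > 0) std::memcpy(out, buf, static_cast<std::size_t>(n));
    free(buf);
    close(fd);
    return n;
}
```

The mmap path instead maps the file once and lets each 4K page fault pull data in, which is why its thread-scaling behavior can differ from direct I/O in the results below.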

Figure 5. Results using mmap and direct I/O. (Click to enlarge.)

Here we can see that the XPoint drive performs exceptionally well at low thread counts and can achieve maximal performance with just 8-16 threads. It also exceeded Intel's performance specification by around 50,000 IOPS. When reading, the HGST drive needed many more threads to match the performance of the XPoint drive, and even more beyond that to achieve maximum performance. The same was true when writing to the HGST drive, except that the number of threads required to reach maximum bandwidth was significantly smaller than when reading. It is also worth noting that the HGST drive's write performance was far higher than advertised but still below that of the XPoint drive.

User Level Cache Performance

In this section, we report on two application benchmarks that use an in-house User Level Cache (ULC) library to cache blocks from a backing file, and compare performance to the same benchmarks using system mmap. Our goal is to use the SSD storage as an extended memory that the application transparently accesses, with the load or store memory request being managed either by the ULC library or through a pagefault in the mmapped region.
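The ULC library itself is in-house, so the following is only a minimal sketch of the idea it embodies: blocks of a backing file are faulted in on demand at an application-chosen granularity, rather than at the fixed 4K page granularity of system mmap. The class name and interface here are illustrative; the real ULC adds eviction, write-back, and smart-pointer access:

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Sketch of a user-level cache over a backing file. Each miss reads
// one whole block of `block_size_` bytes, so the application controls
// the caching granularity (e.g., 4K up to 1M in the evaluation below).
class BlockCache {
public:
    BlockCache(const char* path, std::size_t block_size)
        : file_(std::fopen(path, "rb")), block_size_(block_size) {}
    ~BlockCache() { if (file_) std::fclose(file_); }

    // Return the byte at a file offset, faulting in its block if needed.
    char at(std::size_t offset) {
        std::size_t blk = offset / block_size_;
        auto it = cache_.find(blk);
        if (it == cache_.end()) {
            std::vector<char> buf(block_size_);
            std::fseek(file_, static_cast<long>(blk * block_size_), SEEK_SET);
            std::fread(buf.data(), 1, block_size_, file_);
            it = cache_.emplace(blk, std::move(buf)).first;
        }
        return it->second[offset % block_size_];
    }

    std::size_t cached_blocks() const { return cache_.size(); }

private:
    std::FILE* file_;
    std::size_t block_size_;
    std::unordered_map<std::size_t, std::vector<char>> cache_;
};
```

Choosing a larger block size amortizes each SSD access over more bytes but wastes bandwidth when access locality is poor, which is the trade-off the BFS experiment below explores.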

Sort Benchmark

Below we show the time taken by both the ULC (configured with an I/O size of 1MB) and mmap when sorting a large amount of data (325GB) with multiple processes. Since sorting has a significant amount of locality, we have “wasted” 64GB of the node’s 128GB of RAM to result in a more realistic 5:1 SSD/RAM ratio.

Figure 6. Multiprocess sort. (Click to enlarge.)

Overall performance is better with HGST up to the crossover point at 32 processes; the HGST drive is rated for higher read bandwidth and IOPS than the Optane. However, at 64 processes and greater, the XPoint drive shows slightly better runtime.

Breadth-First Search Benchmark (Read-Only)

In another experiment with the ULC, we ran multi-threaded breadth-first search (BFS) on an RMAT scale-30 graph (264 GB), varying the I/O size of the ULC. The benchmark uses 24 threads (24 physical cores on the node, hyperthreading off). Graph data is stored on the SSD (HGST or 3D XPoint) and accessed read-only.

Figure 7. Multi-threaded BFS execution time. (Click to enlarge.)

Performance was uniformly better on Optane than HGST. On the HGST, using larger cache blocks improved performance substantially, as shown by the 16K and 64K bars. On the Optane, the 4K size delivered the best performance across the four block sizes tested. For 64K blocks in the ULC, Optane was very slightly faster than HGST.