Paper review: Write Prediction for Persistent Memory Systems (PACT ’21)

Paper Information

  • Title: Write Prediction for Persistent Memory Systems
  • Authors: Suyash Mahar, Sihang Liu, Korakit Seemakhupt, Vinson Young, Samira Khan
  • Venue: PACT 2021
  • Keywords: Persistent memory, PM-support operation, value prediction

Paper content


Background and Problem

Write-backs and PM-support operations degrade the I/O performance of persistent memory

Persistent memory (PM) is byte-addressable, non-volatile storage with latency comparable to DRAM. For the best performance, PM-based software usually runs in direct access (DAX) mode, which gives it direct load/store access to PM and bypasses OS indirections such as the filesystem and paging. The price of this performance is that the software must maintain recoverability itself by carefully ordering its writes to PM. The required write-backs and fences can block execution on the critical path and slow down I/O. PM performance is further degraded by so-called PM-support operations, which include security handling such as encryption and hashing, and wear-level management such as deduplication and compression.

Key Idea

Precompute the PM-support operations by predicting the write-backs

Software approaches to reducing the latency of PM-support operations already exist. This paper introduces PMWeaver, which it presents as the first hardware implementation, based on a simple but key observation: the address and value of a PM write-back are often carried by earlier store instructions. This relationship is frequent and stable enough that PM write-backs can be predicted from store instructions. Building on this finding, the paper precomputes the results of PM-support operations so that they are ready before the write-back arrives. According to the paper, 76.9% of write-backs happen within the average PM-support latency. So if the prediction succeeds, most of the latency of PM-support operations can be hidden.
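
To make the latency argument concrete, here is a toy model of how much PM-support latency stays on the critical path. It is only a sketch under my own assumptions: the function name, the parameters, and the idea that precomputation starts at the store and runs uninterrupted until the write-back are illustrative and not taken from the paper.

```python
def exposed_latency_ns(support_ns, gap_ns, predicted):
    """Latency of a PM-support operation left on the write-back's critical path.

    support_ns: time the PM-support operation takes (hypothetical value)
    gap_ns:     time between the store and its write-back (hypothetical value)
    predicted:  whether the write-back was correctly predicted at the store
    """
    if not predicted:
        # No precomputation: the whole operation runs after the write-back.
        return support_ns
    # Precomputation overlaps with the gap; only the remainder is exposed.
    return max(0, support_ns - gap_ns)
```

For example, with a 100 ns operation and a 40 ns gap, a correct prediction leaves 60 ns exposed in this model, while a misprediction leaves the full 100 ns.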


Learner: The first part of PMWeaver is a learner that identifies how write-back addresses and data are generated and how they relate to store instructions. The learner keeps recent stores in a Store History Buffer and matches the address and data of each write-back against it. On a match, it records the program counter (PC) of the matching store in the Prediction Table.
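
The learner's matching step could be sketched roughly as follows. This is a simplified software model, not the paper's hardware: the class and method names, the buffer size of 8, and the exact-equality matching (real write-backs are cacheline-sized) are all my assumptions.

```python
from collections import deque


class Learner:
    """Toy model of the learner (names and sizes are hypothetical)."""

    def __init__(self, history_size=8):
        # Store History Buffer: the most recent (pc, addr, value) stores.
        self.history = deque(maxlen=history_size)
        # Prediction Table: pc -> which parts of a write-back it predicts.
        self.prediction_table = {}

    def observe_store(self, pc, addr, value):
        self.history.append((pc, addr, value))

    def observe_writeback(self, addr, data):
        # Match the write-back against buffered stores and remember the
        # PCs of the stores that carried its address and/or its data.
        for pc, st_addr, st_value in self.history:
            if st_addr == addr:
                self.prediction_table.setdefault(pc, set()).add("addr")
            if st_value == data:
                self.prediction_table.setdefault(pc, set()).add("data")
```

After a few stores followed by a write-back, `prediction_table` holds only the PCs whose stores matched, which is what the predictor later looks up.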

Predictor: PMWeaver uses the execution path (called the PC-path in the paper; in practice an XOR hash of the 32 most recent PCs) to trigger predictions. When the predictor finds a pattern that matches the Prediction Table, it captures the value of the following store instruction and sends it to the PM-support unit, which precomputes the corresponding PM-support operation.
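
A minimal sketch of such a PC-path signature is shown below. The depth of 32 comes from the description above; the class name and the plain, unshifted XOR are my assumptions (a plain XOR ignores the order of the PCs, and the actual hardware hash may differ).

```python
from collections import deque
from functools import reduce
from operator import xor


class PCPath:
    """XOR signature over the most recent PCs (hypothetical helper)."""

    def __init__(self, depth=32):
        # Only the `depth` most recent PCs are kept; older ones fall out.
        self.pcs = deque(maxlen=depth)

    def push(self, pc):
        self.pcs.append(pc)

    def signature(self):
        # XOR all buffered PCs together; an empty path hashes to 0.
        return reduce(xor, self.pcs, 0)
```

The predictor would compare `signature()` against entries recorded by the learner and trigger a prediction on a match.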

Validator: To avoid corruption caused by mispredictions, precomputed results are buffered until they are validated against the actual write-back. To reduce overhead, the validator decouples address prediction from data prediction, so a prediction with the correct address but wrong data can still be used by some operations. The validator also includes several mechanisms for handling mispredictions, such as the use of the Bonsai Merkle Tree.
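
The decoupled validation can be illustrated with a small sketch. The function name and the operation names in the usage note are placeholders I made up; the paper's actual PM-support operations and their input dependencies are not modeled here.

```python
def usable_results(pred_addr, pred_data, actual_addr, actual_data, op_inputs):
    """Return the operations whose precomputed results survive validation.

    op_inputs maps a (hypothetical) operation name to the set of predicted
    inputs it depends on: {"addr"}, {"data"}, or {"addr", "data"}.
    """
    # Address and data are checked independently of each other.
    correct = set()
    if pred_addr == actual_addr:
        correct.add("addr")
    if pred_data == actual_data:
        correct.add("data")
    # A result is usable only if every input it depends on was predicted right.
    return [op for op, needs in op_inputs.items() if needs <= correct]
```

With `op_inputs = {"addr_only_op": {"addr"}, "addr_data_op": {"addr", "data"}}`, a prediction with the right address but wrong data still keeps `addr_only_op`, while a wrong address discards both.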

Design for special cases: Common data values lower prediction accuracy, so the authors simply treat the most common case as a zero value (called zero-value prediction in the paper). Another trade-off is that some write-backs need not be predicted: when there is already enough time to compute the PM-support operations, no extra latency is introduced. The authors therefore set a threshold so that only write-backs within 150 ns of the store instruction are handled by the predictor.
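
These two special cases might look like this in a software model. The 150 ns threshold is the number quoted above; the function names, the inclusive comparison, and the "fall back to zero when nothing was learned" policy are my own reading and may not match the paper's exact rule.

```python
THRESHOLD_NS = 150  # threshold quoted in this review


def should_predict(gap_ns):
    """True if the write-back follows its store within the threshold.

    The inclusive comparison is an assumption.
    """
    return gap_ns <= THRESHOLD_NS


def predicted_data(learned_value=None):
    """Zero-value fallback (assumed policy): predict 0 when no value was learned."""
    return 0 if learned_value is None else learned_value
```

In this sketch a write-back 500 ns after its store is left to the normal path, since the PM-support operations have time to finish anyway.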


The simulation shows that PMWeaver predicts 81.16% of the addresses and 49.90% of the data of PM write-backs in common PM workloads (including operations on trees, hashmaps, and linked lists), providing 1.63x and 1.26x speedups in the two types of tests in the authors' benchmark. PMWeaver is also power-efficient: prediction saves 20.4% to 38.5% of energy (depending on the workload) with only 2.9% overhead. As for area, PMWeaver costs about 3% of the CPU area.


Strengths

  • Transparent to software: PMWeaver is a transparent hardware implementation. It needs no modification to the original workload code or systems, which can be a big benefit for deployment.
  • Simple: The idea of precomputing PM-support operations to hide their latency is straightforward, and the implementation is also simple (it doesn't take much area).
  • Power-efficient: The power consumed by prediction is low compared with the energy saved on PM-support operations.


Weaknesses

  • The system should be tested under real-world workloads to ensure that the evaluation results are not limited to a particular data structure or I/O pattern.
  • The prediction relies on the assumption that write-back addresses and data have a frequent and stable relationship with store instructions, but this relationship is neither proved nor well explained.

Paper presentation

The presentation of this paper is very good; several details make it easy to read. For example:

  • Keywords are marked with different fonts.
  • Figure numbers and external links use distinct colors.
  • The hierarchy is well arranged.


How to make the paper better?

  • Explain concepts like PC-set more clearly, or avoid expressions that read like established terminology or proper nouns.
  • Explore what happens if PMWeaver shares the cache with the CPU, since its storage requirement is small compared with the CPU cache size.
  • Add a separate section that gives an overview of PMWeaver with an architecture figure.

Takeaways and questions

  • The Janus paper (the software implementation mentioned in this paper) is certainly worth reading.
  • Is the Bonsai Merkle Tree (mentioned in Section II-B, but without an explanation of its properties) well known? Yes.
  • Using a Bonsai Merkle Tree to hold the hierarchical hashes of the per-cacheline counters is pretty clever, since it takes advantage of this data structure.
  • Maybe we should read the Janus paper (the software implementation by the authors) first. Are the PM-support operations mentioned in the paper (e.g., deduplication and compression) already used in products?
  • What is the lifetime of today's PMs? (On the market, Intel Optane was said to be far more durable than common NAND-based SSDs, but I'm not sure whether all kinds of PM can be used as DRAM alternatives.)

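This site is licensed under the Creative Commons BY-NC-SA 4.0 License. You are welcome to copy, share, or build upon this article freely, provided that you credit the source and use it non-commercially.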