Paper review: A Top-Down Method for Performance Analysis and Counters Architecture (ISPASS ’14)

Paper information

  • Title: A Top-Down Method for Performance Analysis and Counters Architecture
  • Authors: Ahmad Yasin
  • Venue: ISPASS 2014
  • Keywords: Top-down Microarchitecture Analysis Method (TMAM), performance analysis, performance counter, pipeline, out-of-order processor

Paper content


What is the problem the paper is trying to solve?

Modern CPUs apply many techniques to improve performance, such as superscalar issue, out-of-order execution, speculative execution, and hardware prefetching. These techniques make CPUs very sophisticated. With hundreds of performance events exposed by the PMU (performance monitoring unit, an on-chip subsystem that helps analyze the processor), identifying the performance bottleneck is difficult.

Traditional methods for estimating stalls inside the CPU are not suitable for today's out-of-order processors. For example, a traditional method may calculate stalled cycles by simply multiplying the penalty latency by the number of miss events. This approach can fail for several reasons: stalls overlap because many units work in parallel, and a fixed penalty is unrealistic because cache hit rates differ. The Top-Down method was introduced to analyze the performance bottlenecks of modern CPUs more effectively.
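
The overlap problem described above can be illustrated with a toy model. This is only a sketch under simple assumptions: the function names, the 200-cycle penalty, and the miss start times are hypothetical and do not come from the paper. It compares the traditional "count times penalty" estimate with the number of cycles actually covered by at least one outstanding miss.

```python
from typing import Sequence


def naive_stall_cycles(miss_count: int, penalty_cycles: int) -> int:
    """Traditional estimate: every miss is charged its full penalty."""
    return miss_count * penalty_cycles


def overlapped_stall_cycles(miss_starts: Sequence[int], penalty_cycles: int) -> int:
    """Count cycles covered by at least one outstanding miss.

    Each miss occupies the interval [start, start + penalty_cycles).
    Overlapping intervals are merged so shared cycles are counted once.
    """
    total = 0
    covered_until = None  # end of the interval counted so far
    for start in sorted(miss_starts):
        end = start + penalty_cycles
        if covered_until is None or start >= covered_until:
            # No overlap with earlier misses: the full penalty is exposed.
            total += penalty_cycles
            covered_until = end
        elif end > covered_until:
            # Partial overlap: only the uncovered tail adds cycles.
            total += end - covered_until
            covered_until = end
    return total


# Hypothetical example: three misses issued close together, one much later.
starts = [0, 10, 20, 500]
print(naive_stall_cycles(len(starts), 200))   # 800
print(overlapped_stall_cycles(starts, 200))   # 420
```

Even the merged figure is an upper bound in this toy model, because an out-of-order core may still do useful work while a miss is outstanding.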

What are the key ideas and insights of the paper?

This paper introduces a new way to determine performance bottlenecks, called the Top-Down method. The method depends on a small set of performance event counters, some of which are already available in the PMUs of modern CPUs.

The author divides a modern out-of-order processor core into a frontend and a backend: the frontend is responsible for translating instructions into micro-ops, while the backend is responsible for reordering, executing, and committing them. The Top-Down categories described below are based on this partition.

At the top level of the Top-Down method, each pipeline slot (the hardware resources needed to process one micro-op) is classified into one of four categories: Frontend Bound, Backend Bound, Bad Speculation, and Retiring.
– Frontend Bound denotes that the frontend does not deliver enough micro-ops to the backend. Frontend Bound is further divided into bandwidth issues and latency issues.
– Bad Speculation means the slots are wasted because of incorrect speculation, usually caused by mispredicted branches and pipeline flushes.
– Retiring means the issued micro-ops eventually get retired, so the micro-ops in Retiring slots are doing useful work. A Retiring fraction of 100% means the microarchitecture reaches its maximal IPC.
– Backend Bound denotes that no micro-ops are delivered because of a shortage of backend resources. Backend Bound is further divided into Memory Bound (the pipeline is waiting for the memory system) and Core Bound (poor execution port utilization).
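
The four categories above can be read as a small decision tree applied to each pipeline slot. The sketch below follows the flow as I understand it from the paper's top-level breakdown figure. The function name and boolean parameters are hypothetical, and real hardware derives these fractions from counters rather than inspecting individual slots.

```python
def classify_slot(uop_issued: bool, uop_retires: bool, backend_stalled: bool) -> str:
    """Classify one pipeline slot into a top-level Top-Down category.

    uop_issued:      a micro-op was issued in this slot
    uop_retires:     that micro-op eventually retires (ignored if none was issued)
    backend_stalled: the backend could not accept a micro-op in this slot
    """
    if uop_issued:
        # An issued micro-op is either useful work or wasted speculation.
        return "Retiring" if uop_retires else "Bad Speculation"
    # An empty slot is blamed on whichever side failed to make progress.
    return "Backend Bound" if backend_stalled else "Frontend Bound"


print(classify_slot(uop_issued=True, uop_retires=False, backend_stalled=False))
# Bad Speculation
```

Note that an empty slot with a stalled backend is charged to Backend Bound regardless of what the frontend could have supplied, which matches the idea of attributing each slot to a single category.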

One of the most significant advantages of the Top-Down method is that it makes good use of the performance counters. With so many events provided by the PMU, it is not easy to figure out which ones are critical. The Top-Down method selects only a few of them, called Top-Down Events, to calculate the Top-Down Metrics. The metrics classify pipeline slots into the four categories mentioned above and quickly identify the bottleneck of the processor.
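
As a rough sketch of how a few events turn into top-level metrics, the function below uses the abstract event names TotalSlots, SlotsIssued, SlotsRetired, FetchBubbles, and RecoveryBubbles as I recall them from the paper. The function name and the sample values are hypothetical, and the mapping from these abstract events to concrete PMU events is model-specific and not shown here.

```python
def top_level_breakdown(total_slots: int, slots_issued: int, slots_retired: int,
                        fetch_bubbles: int, recovery_bubbles: int) -> dict:
    """Compute the four top-level Top-Down fractions from abstract event counts.

    total_slots is assumed to be pipeline width times clock cycles.
    """
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    frontend = fetch_bubbles / total_slots
    bad_spec = (slots_issued - slots_retired + recovery_bubbles) / total_slots
    retiring = slots_retired / total_slots
    # Backend Bound is whatever remains after the other three categories.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "Frontend Bound": frontend,
        "Bad Speculation": bad_spec,
        "Retiring": retiring,
        "Backend Bound": backend,
    }


# Hypothetical counts for a 4-wide core observed over 1000 cycles.
breakdown = top_level_breakdown(
    total_slots=4 * 1000,
    slots_issued=2200,
    slots_retired=2000,
    fetch_bubbles=600,
    recovery_bubbles=100,
)
for category, fraction in breakdown.items():
    print(f"{category}: {fraction:.1%}")
```

With these made-up counts the largest non-Retiring category is Backend Bound, so that is where a second-level analysis (Memory Bound versus Core Bound) would continue.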

What are the key mechanisms and contributions?

To evaluate the method, the author analyzed the SPEC CPU2006 benchmarks in single-thread and multi-copy modes. The results show clear differences between the tests. For example, integer applications are more likely to be Frontend Bound than floating-point applications, and multi-copy runs require more memory bandwidth, so they are more likely to be Memory Bound. The author also analyzed server workloads such as a DBMS (database management system), which indicated that real-world server workloads usually have lower IPC.

The author also applied the Top-Down method to typical performance-tuning cases, such as matrix multiply, vectorization, false sharing, and software prefetching, and showed how the application bottleneck shifts after each change. For example, the value of Stores Bound decreased significantly after the false sharing was fixed, which shows that the Top-Down metrics can be useful indicators when diagnosing performance issues.

What are the key conclusions?

The paper introduces the Top-Down Analysis method, a generic way to identify the bottleneck of modern out-of-order processors at low cost. The method is easier to apply than previous approaches, with no loss of accuracy.

Strengths

  • The Top-Down method is easy to understand and easy to put into practice at a relatively low cost (especially at large scale). Because of its multi-level structure, the method can be used either to get rough results by looking only at the top level, or to analyze in detail by diving into a deeper level when needed.
  • Mature tools such as VTune already make the Top-Down method easy to use. perf, the most popular profiling tool on Linux, also has built-in support for Top-Down.

Weaknesses

  • Simultaneous multithreading (SMT) and hardware-based virtualization (VT-x) need to be disabled during the analysis. These technologies are usually enabled in production, so deploying this analysis method on real workloads is restricted, and the analysis results can also be biased.
  • The calculation of the Top-Down Metrics depends heavily on the values of performance counters. These counters may be defined differently on different products, which makes the methodology less generic.
  • In Section 5.4, the author directly compares the results of DBMS workloads on Sandy Bridge EP with SPEC workloads on Ivy Bridge, which is not rigorous.

Paper presentation

  • “A matrix-multiply textbook kernel is analyzed with Top-Down” in Section 5.5 comes with no detail about the code. Readers have no way of knowing why multiply2 is faster, or why its bottleneck shifts from Memory Bound to Core Bound compared to multiply1.
  • The naming is not consistent in the paper: the author sometimes refers to Intel’s processors by codename (such as Ivy Bridge) and sometimes by generation (such as 3rd generation Core).
  • Some legends are badly placed in figures, such as Figure 7 and Figure 8. In Figure 8, the legend even covers the plot and the Y-axis.


The paper aims to analyze general out-of-order processors but focuses heavily on Intel’s products (4 out of 13 references are about Intel). This is not necessarily a weakness, but the background and related work sections could be enriched with more examples to make the conclusions more generic.

Takeaways and questions

This paper is highly related to “Profiling a warehouse-scale computer” (ISCA 2015).

Other materials from the author may help in understanding this paper, such as:
Ahmad Yasin: Top-down Microarchitecture Analysis through Linux perf and toplev tools (Haifa::C++ Meetup 2018)
Ahmad Yasin: Performance Analysis in Modern Multicores (Practical Parallel Programming, Haifa University)

The author has continued to update the Top-Down method for years since the paper was published. The SMT issue seems to have been solved in later versions of the method, but no specific paper appears to document these updates.

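This site is licensed under the Creative Commons BY-NC-SA 4.0 License. You are free to copy, share, or build upon this article, provided that you credit the source and use it non-commercially.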

