- Title: Profiling a Warehouse-scale Computer
- Authors: Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, David Brooks
- Venue: ISCA 2015
- Keywords: warehouse-scale computer, cloud computing, performance analysis, datacenter, Top-Down method, out-of-order processor, microarchitecture
What is the problem the paper is trying to solve?
Warehouse-scale computers (WSCs) are large-scale computing platforms for cloud computing. Their performance characteristics differ from those of ordinary computers, and any improvement in performance or utilization is magnified thousands of times at WSC scale, leading to significant cost savings. Because of this potential, more researchers have turned to WSC architecture, and datacenter benchmarks and system-level characterizations of WSCs have appeared in recent years. However, the authors found a lack of research on the interaction between applications and the underlying hardware, which they consider more important. This paper therefore presents a comprehensive profiling study of a live production warehouse-scale computer.
What are the key mechanisms and contributions?
Firstly, Google-Wide Profiling (GWP) was used to profile the machines. All jobs running on 20,000 randomly chosen Ivy Bridge machines were continuously profiled by GWP for about 3 years. GWP collectors sampled and symbolized call stacks at a randomly selected time each day. Performance counters from the performance monitoring units (PMUs) were also collected. The aggregated samples were then stored in a database for easy analysis. The authors selected the sampled data of 12 binaries written in C++ for detailed analysis. These binaries come from several different application classes and exhibit different microarchitectural behaviors.
Secondly, a performance analysis methodology called “Top-Down” was used to analyze the selected data. This method divides a modern out-of-order processor core into a frontend and a backend: the frontend is responsible for translating instructions into micro-ops, while the backend is responsible for reordering, executing and committing them. A “pipeline slot” in the paper represents the hardware resources needed to process one micro-op. For example, a 4-wide CPU has 40 slots in 10 clock cycles. Pipeline slots are split into 4 categories: Retiring, Frontend bound, Bad speculation and Backend bound. Only Retiring counts as “useful work”. A small set of performance events is enough to calculate the “Top-Down metrics”, which classify every pipeline slot into one of the 4 categories and quickly identify the bottleneck of the processor.
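The slot accounting described above can be sketched in a few lines. This is a minimal illustration that assumes the commonly published level-1 Top-Down formulas for a 4-wide core. The field names and the counter values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

PIPELINE_WIDTH = 4  # slots per cycle on a 4-wide core, as in the example above


@dataclass
class CounterSample:
    """Hypothetical raw counter values for one measurement interval."""
    cycles: int              # unhalted core clock cycles
    uops_issued: int         # micro-ops issued from the frontend to the backend
    uops_retired_slots: int  # slots whose micro-op eventually retired
    uops_not_delivered: int  # slots left empty because the frontend had nothing ready
    recovery_cycles: int     # cycles spent recovering from mis-speculation


def top_down_level1(s: CounterSample) -> dict:
    """Split all pipeline slots into the four level-1 Top-Down categories."""
    slots = PIPELINE_WIDTH * s.cycles
    retiring = s.uops_retired_slots / slots
    frontend_bound = s.uops_not_delivered / slots
    # Issued-but-never-retired micro-ops plus slots lost during recovery.
    bad_speculation = (
        s.uops_issued - s.uops_retired_slots + PIPELINE_WIDTH * s.recovery_cycles
    ) / slots
    # Whatever is left was blocked by the backend.
    backend_bound = 1.0 - (retiring + frontend_bound + bad_speculation)
    return {
        "retiring": retiring,
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "backend_bound": backend_bound,
    }


# Made-up numbers: 10 cycles on a 4-wide core gives 40 slots.
sample = CounterSample(
    cycles=10,
    uops_issued=14,
    uops_retired_slots=12,
    uops_not_delivered=8,
    recovery_cycles=1,
)
breakdown = top_down_level1(sample)
```

With these made-up inputs, 12 of the 40 slots retire, 8 are lost to the frontend, 6 to bad speculation, and the remaining 14 are attributed to the backend.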
What are the key ideas and insights of the paper?
According to the GWP analysis, there is no “killer workload”. In other words, the workloads in WSCs are diverse. For example, the “hottest” binary covers only 9.9% of WSC cycles, while the top 50 binaries together cover about 60%. It is therefore difficult to get significant performance gains by optimizing a single workload.
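The coverage numbers above are cumulative shares of sampled cycles. A minimal sketch of that calculation follows. The binary names and sample counts are invented for illustration and have nothing to do with Google's real data.

```python
from collections import Counter


def top_n_coverage(cycle_samples: dict, n: int) -> float:
    """Fraction of all sampled cycles attributed to the n hottest binaries."""
    total = sum(cycle_samples.values())
    hottest = Counter(cycle_samples).most_common(n)
    return sum(count for _, count in hottest) / total


# Made-up profile: three relatively hot binaries plus a long tail of small ones.
samples = {"binary_a": 99, "binary_b": 60, "binary_c": 41}
samples.update({f"tail_{i}": 8 for i in range(100)})

hottest_share = top_n_coverage(samples, 1)
top3_share = top_n_coverage(samples, 3)
```

In this toy profile the hottest binary accounts for 9.9% of cycles and the top three for 20%, so most of the cycles sit in the long tail, which is the shape of distribution the paper describes.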
Although there is no “silver bullet” in the form of a single bottleneck shared by all workloads, datacenter workloads do share common building blocks. The performance cost of these components is called the “datacenter tax”. The “tax” takes about 22-27% of cycles and includes protobuf management, remote procedure calls (RPCs), data movement, compression, memory allocation, hashing, and the kernel. Some of this overhead could be accelerated by hardware.
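One way to picture how such a “tax” share could be computed is to bucket sampled leaf functions into categories and sum their cycles. The sketch below is only an illustration: the substring patterns, function names and sample counts are all hypothetical, and it is not the authors' actual classification method. The kernel is left out because it cannot be recognized by a function-name pattern like this.

```python
from collections import Counter

# Hypothetical substrings used to recognize each tax category.
TAX_PATTERNS = {
    "protobuf": ("proto",),
    "rpc": ("rpc",),
    "data movement": ("memcpy", "memmove"),
    "compression": ("compress",),
    "allocation": ("malloc", "free", "operator new"),
    "hashing": ("hash",),
}


def classify(function_name: str):
    """Return the tax category of a function name, or None for application code."""
    lowered = function_name.lower()
    for category, needles in TAX_PATTERNS.items():
        if any(needle in lowered for needle in needles):
            return category
    return None


def tax_breakdown(leaf_samples: dict) -> dict:
    """Share of all sampled cycles spent in each tax category."""
    total = sum(leaf_samples.values())
    cycles_per_category = Counter()
    for function_name, cycles in leaf_samples.items():
        category = classify(function_name)
        if category is not None:
            cycles_per_category[category] += cycles
    return {cat: cycles / total for cat, cycles in cycles_per_category.items()}


# Made-up leaf-function samples (cycles per function).
leaf_samples = {
    "memcpy": 50,
    "my_malloc": 30,
    "SerializeProto": 40,
    "BusinessLogic": 880,
}
breakdown = tax_breakdown(leaf_samples)
total_tax = sum(breakdown.values())
```

Naive substring matching like this would misclassify many real symbols, so it should be read as a picture of the bookkeeping rather than a usable classifier.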
On the microarchitectural level, they compared WSC applications with SPEC benchmarks using the “Top-Down” method. The result showed that WSC applications have very low instructions per cycle (IPC) on average:
- One major overhead is frontend stalls: WSC applications spend much of their time stalled in the frontend (often 2x-3x more than typical SPEC benchmarks). The authors attribute this to instruction cache misses, because WSC applications have significantly larger instruction working sets than SPEC benchmarks (the i-cache footprints of some WSC applications are 4x larger than those of SPEC benchmarks like `400.perlbench`, and also larger than the typical L2 cache of current CPU architectures).
- Another major overhead at the microarchitecture level is backend stalls, caused by data cache stalls and low instruction-level parallelism. Between 50% and 60% of all cycles are stalled on caches, which accounts for more than 80% of all backend-bound pipeline slots. In 72% of execution cycles, only 1 or 2 of the 6 execution ports are used, which also indicates that many cycles are spent waiting for the caches.
- Further, these data cache stalls are not a bandwidth problem. Memory bandwidth utilization is low and the memory controllers are underused, because memory latency matters more than bandwidth for today’s WSC applications.
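The link between latency-bound execution and low bandwidth utilization in the last point can be illustrated with Little's law: sustained bandwidth equals the bytes in flight divided by the memory latency. The sketch below uses assumed values (64-byte cache lines, 100 ns latency, a handful of outstanding misses) that are not taken from the paper.

```python
def latency_bound_bandwidth_gbps(outstanding_misses: int,
                                 line_bytes: int,
                                 latency_ns: float) -> float:
    """Little's law: bytes in flight / latency. Bytes per ns equals GB/s."""
    return outstanding_misses * line_bytes / latency_ns


# Assumed values for illustration only.
LINE_BYTES = 64
LATENCY_NS = 100.0

low_mlp = latency_bound_bandwidth_gbps(2, LINE_BYTES, LATENCY_NS)    # few misses in flight
high_mlp = latency_bound_bandwidth_gbps(10, LINE_BYTES, LATENCY_NS)  # many misses in flight
```

Under these assumptions, a core that keeps only 2 misses in flight can pull about 1.28 GB/s, against 6.4 GB/s with 10 misses in flight. A workload with little memory-level parallelism therefore waits on latency long before it can saturate the memory controllers.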
It’s widely accepted that simultaneous multi-threading (SMT) is effective when workloads have complex bottlenecks. The authors therefore estimated the impact of SMT by comparing per-thread performance counters with aggregated per-core ones. The result shows that SMT alleviates both the frontend and the backend inefficiencies of WSC applications. Frontend-bound cycles decrease from 22% to 16% when accounting for SMT, and frontend starvation cycles (no micro-ops dispatched) decrease from 5% to 4%. Backend functional unit utilization increases: the share of cycles in which 3 or more of the 6 execution ports are used rises by 6 percentage points when accounting for SMT (from 28% to 34%). The authors also point out that the IPC with 2-wide SMT is still low compared to the theoretical maximum (1.2 vs. 4), so wider SMT has potential benefits.
What are the key conclusions?
- WSC workloads are extremely diverse, so datacenter architectures should be designed to avoid significant performance loss on any type of workload.
- Because of this diversity, it’s hard to improve performance by optimizing a single workload. But there are common building blocks, the “datacenter tax”, which could be accelerated by specialized hardware in the future.
- At the microarchitecture level, WSC workloads usually have low IPC, poor cache hit rates (especially for the instruction cache), and low utilization of memory bandwidth and functional units, all of which differ markedly from SPEC benchmarks.
- Some directions would help to address these issues: for example, wider SMT, i/d-cache partitioning, and rebalancing the silicon area between memory controllers and other components. Future datacenter processor designs should pay attention to these directions to fit WSC applications better.
What are the strengths of the paper?
- The paper gives a high-level generalization about the performance of warehouse-scale computers. It is valuable for other datacenter operators who are not in a position to do such costly profiling.
- The experiments and analysis are well designed, especially the comparison between WSC workloads and SPEC benchmarks at the microarchitecture level using the “Top-Down” method.
- It’s a good example of studying the interaction between software and the underlying hardware. The directions it identifies are valuable for chip designers like Intel who want to build processors that fit datacenter needs better.
What are the weaknesses of the paper?
- Readers are not familiar with the applications inside Google’s datacenters, so it would be better to explain the characteristics of each selected application to make the benchmark results easier to understand (for example, the authors could explain that `bigtable` is a database service, so it may require lower response times than an object storage service, which may in turn demand a higher cache hit rate, and so on).
- The impact of SMT in Section 8 comes from an estimate rather than a large-scale experiment, so it is not as convincing as the other parts.
- The concept of “instruction footprint” appears all over the paper but isn’t explained in detail (its definition, how it was measured, etc.).
- In Section 5, the paper compares the WSC workloads with `473.astar` without mentioning it beforehand. Figure 6 doesn’t include this benchmark either.
- Some adjectives like “brawny” and “wimpy” are not widely used, so it would be better to cite the original source of these words or choose more common ones.
Although the weaknesses of this paper are obvious, it’s not easy to do better, since experiments at this scale are subject to many restrictions and limitations. I’m really curious about how SMT affects performance in production, but I also understand the risk of disabling SMT at scale.
There are still some other things we could do. The paper doesn’t propose any ideas at the software level, such as reducing a WSC application’s working set to get a better cache hit rate. This may be because the paper focuses only on “profiling”, but such ideas could at least have been mentioned in the “future work” section.
The paper focuses only on C++ workloads, but there are also non-C++ workloads in Google’s datacenters, written in languages like Java, Python and Go. Different runtimes may behave differently at the microarchitecture level. It would be interesting to study these behaviors and to explore the possibilities for hardware acceleration, as the paper did for C++.
Takeaways and questions
I think this paper is a very good example of how industry and academia feed into each other. Some experiments (like the WSC profiling) are hard to carry out in a laboratory, and the research results can lead to better industrial products.
Google has published a series of studies on datacenter performance analysis. The GWP paper and some later work (like AsmDB) are worth reading to get a more comprehensive understanding of this paper.
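This site is licensed under the Creative Commons BY-NC-SA 4.0 License. You are free and welcome to copy, share, or build upon this article, provided that you credit the source and use it for non-commercial purposes.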