- Title: tako: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement
- Authors: Brian C. Schwedock, Piratach Yoovidhya, Jennifer Seibert, Nathan Beckmann
- Venue: ISCA 2022
- Keywords: cache hierarchy, data movement
Problem to be solved
Abstraction hides many details of data movement from software, which makes programming easier. But software’s inability to observe and control data movement is the main reason for the inefficiencies of today’s systems. Many prior studies specialize the memory hierarchy by adding custom accelerators. These studies achieved massive improvements in efficiency, but the authors believe that building custom hardware for each feature is too expensive to be realistic.
This paper argues that a single, general-purpose architecture, rather than a collection of custom accelerators, should be introduced to support more kinds of applications. The architecture should expose the details of data movement to programs above the abstraction layer so that software can manage data movement itself.
Tako uses callbacks to “tell” software about data movement in designated address ranges called “phantom addresses”. Phantom address ranges live only in the cache and are not backed by off-chip memory. The callbacks are short software threads, triggered by hardware, that can run in parallel with conventional software threads. Each core has an on-chip hardware unit near the cache, called an “engine”, that triggers and schedules the callbacks. The engine contains a scheduler and a programmable dataflow fabric with 16 simple processing elements, each executing a few instructions, which makes it possible to issue 16 callbacks concurrently.
In this paper, tako implements three types of callbacks: onMiss, onEviction, and onWriteback. The tako engine triggers a callback when the corresponding operation happens in a phantom address range. Software defines what to do in response.
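To make the callback idea concrete, here is a minimal software sketch of my own. It is not tako's actual API or hardware behavior. It models a tiny LRU cache over a phantom range with no backing memory, where a miss asks a callback to produce the data. The class `ToyPhantomCache`, its method names, and the rule that clean evictions trigger onEviction while dirty ones trigger onWriteback are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Optional


class ToyPhantomCache:
    """Toy LRU cache that fires software callbacks on data movement.

    Hypothetical model for illustration only; not tako's real interface.
    """

    def __init__(self, capacity: int,
                 on_miss: Callable[[int], int],
                 on_eviction: Optional[Callable[[int, int], None]] = None,
                 on_writeback: Optional[Callable[[int, int], None]] = None):
        self.capacity = capacity
        self.on_miss = on_miss
        self.on_eviction = on_eviction
        self.on_writeback = on_writeback
        self.lines = OrderedDict()  # addr -> (value, dirty), oldest first

    def _insert(self, addr: int, value: int) -> None:
        if len(self.lines) >= self.capacity:
            # Evict the least recently used line and notify software.
            victim, (val, dirty) = self.lines.popitem(last=False)
            # Assumption: dirty lines go to onWriteback, clean to onEviction.
            callback = self.on_writeback if dirty else self.on_eviction
            if callback is not None:
                callback(victim, val)
        self.lines[addr] = (value, False)

    def _touch(self, addr: int) -> None:
        if addr not in self.lines:
            # No backing memory: the onMiss callback produces the data.
            self._insert(addr, self.on_miss(addr))
        self.lines.move_to_end(addr)  # mark as most recently used

    def read(self, addr: int) -> int:
        self._touch(addr)
        return self.lines[addr][0]

    def write(self, addr: int, value: int) -> None:
        self._touch(addr)
        self.lines[addr] = (value, True)  # line is now dirty


log = []
cache = ToyPhantomCache(
    capacity=2,
    on_miss=lambda addr: addr * 10,  # compute data instead of loading it
    on_eviction=lambda addr, val: log.append(("evict", addr, val)),
    on_writeback=lambda addr, val: log.append(("writeback", addr, val)),
)
cache.read(1)       # miss: onMiss fills the line with 10
cache.write(2, 99)  # miss, then the line becomes dirty
cache.read(3)       # evicts clean line 1 -> onEviction
cache.read(4)       # evicts dirty line 2 -> onWriteback
print(log)
```

In this toy version the callbacks run inline on the access path. In the paper's design, as described above, they are scheduled by a hardware engine and run in parallel with ordinary threads, which this sketch does not attempt to model.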
By providing a polymorphic interface for data movement, tako enables many new possibilities. It achieved 1.4x–4.2x speedups on selected applications, which is very close to what ideal engines achieve. It can also be used to detect side-channel attacks and to help an NVM filesystem reduce logging overhead.
- The authors considered the design very comprehensively, covering virtualization with multi-tenancy, side-channel attacks, NVM support, and how to compete with specialized hardware.
- It’s interesting and surprising to see that tako helps NVM keep consistency on failure.
- It’s also interesting to learn that side-channel attacks for deriving keys were already published quite early on.
- User-space applications only need to interact with the polymorphic callbacks, so not much modification is needed.
- The callback is so lightweight (and is off the critical path) that it incurs negligible overheads compared to an ideal engine (in their benchmarks).
- The method of putting data I/O fully into cache may require a large cache size to have enough space for phantom data. The paper didn’t explain or evaluate whether different cache sizes influence the speed-up ratio of tako. I suppose there won’t be a good result if the workload doesn’t have good spatial locality or the cache size is not large enough for the workload. This occurred in parts of their benchmarks (e.g. the last paragraph of section 8.1).
- It’s not so easy to make use of tako (even if the programmer is an expert). The developers of tako-aware libraries should consider this deeply and carefully.
- Memory management (i.e. paging) and the process model require modifications to the OS.
The paper’s presentation itself is good; it’s easy to grasp the idea at a high level. The implementation is complex, though, and took me a long time to understand. I appreciate the style of writing that lists subheadings like “description, why tako, evaluation” in each section.
The idea and design of this paper are so comprehensive that I can barely come up with any improvement. If there were no section 4.5, I would say three types of callbacks are too few. More kinds of callbacks could be added and then combined into different sets, e.g. tako-light (the three existing callbacks) and tako-fine-grained (more callbacks for some special workloads).
Takeaways and questions
- It’s a natural thought that performance will increase if software knows more about the underlying state. But it’s difficult to choose which metrics to expose to the software, and also difficult to choose how to measure the software. I think the core innovation of tako is the “callback”, which doesn’t extend the critical path.
- I think the idea of “phantom data” is somewhat similar to a RAM disk. The core idea of phantom data is to place and operate on data directly in the cache to avoid DRAM I/O, while a RAM disk is used to avoid unnecessary hard drive I/O in some cases, such as compiling.
- Should we learn about near-data computing? I guess it’s been a hot topic in the past few years.
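This site is licensed under the Creative Commons BY-NC-SA 4.0 License. You are allowed and welcome to freely copy, share, or build upon this article, provided that you credit the source and use it non-commercially.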