Description: The von Neumann bottleneck, the performance gap between processors and the memory system, still limits the efficiency of conventional computer architectures. As a result, the performance of the vast majority of today's applications, particularly those with large data footprints, is bounded by the high latency of memory accesses. Traditional approaches, such as adding multiple levels of small, fast cache memory close to the processor, fall short because modern workloads have irregular memory access patterns and operate on data sets too large to fit in the caches.

New techniques are needed in modern microprocessors to meet the demands of today's memory-intensive workloads. Data prefetching, which predicts upcoming memory accesses and fetches the data into the caches before the processor demands it, is a well-known technique for improving performance by hiding the high latency of memory accesses. Unfortunately, the hardware prefetchers implemented in current processors cannot predict the complicated and irregular memory access patterns of modern data-driven applications. Developers therefore have to rely on software data prefetching, which uses algorithmic knowledge of the memory accesses to generate prefetch instructions.
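To make the idea of software prefetching concrete, the following is a minimal sketch of a manually inserted prefetch for an indirect access pattern (`data[idx[i]]`), which hardware prefetchers typically struggle to predict. It is an illustration only, not code from this work: the function name `sum_indirect_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical, and the sketch assumes a GCC- or Clang-compatible compiler that provides `__builtin_prefetch`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed distance; a suitable value depends on the machine
 * and on how long one loop iteration takes. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n). The index array is read sequentially,
 * but the data array is accessed indirectly through it. */
uint64_t sum_indirect_prefetch(const uint32_t *idx, const uint64_t *data,
                               size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Read the index PREFETCH_DISTANCE iterations ahead and request
             * the cache line it points to (0 = read access, 3 = keep the
             * line in all cache levels). */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result of the loop. Its benefit depends on the distance: if it is too small, the data arrives late, and if it is too large, the line may be evicted before it is used.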

Our initial results show that existing compiler-based software prefetching techniques fall short of delivering high performance improvements because of their static nature. In this work, we first investigate the main challenges to enabling effective automated software data prefetching. Then, based on our observations, we introduce APT-GET, a novel profile-guided technique that leverages dynamic execution-time information to ensure prefetch timeliness. To capture dynamic profiles while an application runs, APT-GET uses Intel’s Last Branch Record (LBR) hardware support, which adds negligible overhead. From the collected profiling information, APT-GET characterizes the execution time of load instructions that frequently miss in the cache. It then applies a novel analytical model to select the optimal prefetch distance and prefetch injection site for the generated prefetch instructions. We evaluate APT-GET on 10 real-world applications. The results show that APT-GET improves performance by 1.25× over the state-of-the-art software data prefetching mechanism and achieves a speedup of up to 2×, and 1.34× on average.
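The link between measured execution time and prefetch distance can be illustrated with the classic textbook rule of thumb: prefetch far enough ahead that the loop iterations in between cover the miss latency. The sketch below is not APT-GET's analytical model, which is not specified here. The function name `prefetch_distance` and both parameters are hypothetical, and the cycle counts are assumed to come from some profile.

```c
#include <stddef.h>

/* Rule of thumb: distance = ceil(miss_latency / iteration_time), with a
 * minimum of one iteration. Both inputs are assumed to be measured values
 * in cycles. */
size_t prefetch_distance(unsigned miss_latency_cycles, unsigned iter_cycles)
{
    if (iter_cycles == 0) {
        return 1; /* no usable timing information; fall back to 1 */
    }
    /* Widen before adding so the rounding term cannot overflow. */
    size_t d = ((size_t)miss_latency_cycles + iter_cycles - 1) / iter_cycles;
    return d > 0 ? d : 1;
}
```

For example, with an assumed miss latency of 200 cycles and a loop body of 10 cycles, the rule gives a distance of 20 iterations. A static compiler pass has to guess both numbers, whereas a profile-guided approach can measure them.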