Talk On The Data Access Issue (or Why Modern CPUs Are Starving)

Presented by Francesc Alted in Scientific Applications 2009 on 2009/07/25 from 09:15 to 10:15
Abstract

There is a trend for CPU performance to increase at a faster rate than memory bus throughput, which is creating a serious data-starvation problem in current CPUs. In the first part of my talk, I will describe how system builders are designing architectures that can help to cope with this in some circumstances (basically, when there is temporal or spatial data locality).

In the second part, I will suggest using advanced compression techniques to allow for better performance in scenarios where current architectures are not well suited (for instance, operating with large streaming datasets that have no temporal locality at all).
Part I: A bit of computing history
Introduction

Every computer scientist knows that, on many occasions, their computations will be limited by the throughput and latency of the memory subsystem. This is due to the trend over the past 30 years for CPU performance to increase at a faster rate than memory bus speeds. This trend will continue into the foreseeable future, forcing scientists to design their programs to minimize memory bottlenecks so as to squeeze maximum performance out of their processors.
The hierarchical memory model

In order to cope with the limitations of memory buses, computer architects have introduced a hierarchy of several levels of cache memories. Information stored in these caches can be delivered much more quickly than from slower media such as main memory or disk subsystems. Hence, cache memory helps to reduce undesired CPU stalls and thereby increase performance.
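As a rough illustration of why locality matters to caches, the following sketch (all names and sizes hypothetical) walks the same row-major buffer in two orders. On real hardware the row-wise walk wins because consecutive accesses fall on the same cache line; in pure Python the interpreter overhead masks much of that gap, but the access patterns are the ones that matter in compiled code.

```python
import time
from array import array

N = 512  # hypothetical matrix side; an N x N matrix stored row-major

# One flat, contiguous buffer, laid out as a C row-major matrix would be.
data = array("d", (float(i) for i in range(N * N)))

def sum_rowwise(a, n):
    # Sequential walk: consecutive accesses fall on the same cache line.
    return sum(a[i * n + j] for i in range(n) for j in range(n))

def sum_colwise(a, n):
    # Strided walk: each access jumps n elements, defeating spatial locality.
    return sum(a[j * n + i] for i in range(n) for j in range(n))

t0 = time.perf_counter(); s1 = sum_rowwise(data, N); t1 = time.perf_counter()
s2 = sum_colwise(data, N); t2 = time.perf_counter()
print(f"row-wise: {t1 - t0:.3f}s  col-wise: {t2 - t1:.3f}s  equal sums: {s1 == s2}")
```

Both traversals of course produce the same sum; only the order in which memory is touched differs.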

In this section, I will give an overview of the evolution of hierarchical memory models, with special emphasis on current designs. I will also offer a glimpse of foreseeable future models.
Techniques for fighting data starvation

Here, I'll survey some common techniques for dealing with CPU data starvation in current hierarchical memory models. Most of these techniques exploit the principles of temporal and spatial locality in order to use caches efficiently.
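One such technique, cache blocking, can be sketched in a few lines (block size and names here are illustrative, not taken from the talk): instead of making several full passes over a large array, process it in chunks small enough to stay cached, and do all the work on a chunk while it is still hot.

```python
from array import array

BLOCK = 8_192  # hypothetical block size, chosen to fit comfortably in cache

data = array("d", (float(i % 97) for i in range(100_000)))

def stats_blocked(a, block=BLOCK):
    """Compute min, max and sum of `a` in one blocked sweep.

    Each block is fetched from memory once; the three reductions then
    reuse it while it is hot in cache (temporal locality), instead of
    making three separate full passes over the whole array.
    """
    lo, hi, total = float("inf"), float("-inf"), 0.0
    for start in range(0, len(a), block):
        chunk = a[start:start + block]   # one cache-sized working set
        lo = min(lo, min(chunk))
        hi = max(hi, max(chunk))
        total += sum(chunk)
    return lo, hi, total

print(stats_blocked(data))
```

The blocked sweep computes exactly the same results as three whole-array passes; the gain is purely in how often each byte has to travel across the memory bus.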

However, we will see that such techniques are almost useless for tasks dealing with large amounts of data that are not reused anytime soon (i.e. lacking temporal locality). This is where compression can help, by broadening the scenarios where hierarchical memory models can be effectively used to improve performance.
Part II: The role of compression in the data access issue
Putting data redundancy away

The defining characteristic of a compressor is that it reduces the amount of space required to describe a dataset. Compression can be lossy or lossless. As we are only interested in describing datasets completely, I'll focus on lossless compression only.

An interesting aspect of this space reduction is that buses can transmit compressed datasets to the CPU faster than uncompressed ones. On the other hand, compression and decompression have traditionally required a lot of CPU time, preventing the overall transmit+decompress+compute+compress+retransmit process from being faster than the simpler transmit+compute+retransmit one.

However, given the trend of CPU performance increasing faster than memory bus speeds, the improved throughput of transmitting compressed data will eventually compensate for the overhead of compressing and decompressing it. As we will see in the next section, we are already reaching the point where fast CPUs and highly optimized algorithms can achieve this important milestone in computing science.
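The space side of the trade-off can be sketched with the standard library's zlib as a stand-in codec (Blosc-style compressors give up some ratio in exchange for far higher compression and decompression speed): a redundant binary dataset shrinks for transmission and is recovered bit-for-bit.

```python
import zlib
from array import array

# A highly regular binary dataset (a ramp of doubles): redundant, so it
# compresses well. zlib here is only a stand-in for a fast codec.
values = array("d", (float(i) for i in range(10_000)))
raw = values.tobytes()

compressed = zlib.compress(raw, 1)   # low level: favour speed over ratio
restored = zlib.decompress(compressed)

ratio = len(raw) / len(compressed)
print(f"{len(raw)} -> {len(compressed)} bytes (ratio {ratio:.1f}x), "
      f"lossless: {restored == raw}")
```

Whether the whole pipeline wins then depends on whether (de)compression time stays below the bus time saved by moving fewer bytes, which is exactly the balance the talk argues is now tipping.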
Introducing Blosc: A Blocked Shuffler & Compressor

Blosc is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, uncompressed, direct memory fetch approach. Blosc will be one of the first (if not the first) of a series of compressors that are meant not only to reduce the size of large datasets on disk or in memory, but also to accelerate computations that are currently memory-bound.

In this section, I'll discuss the principles that underlie the Blosc compression algorithm and show some real-life benchmarks.
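One of those principles, the shuffle filter, is easy to sketch in pure Python (the helper names are mine, and zlib again stands in for Blosc's fast codec): byte-transposing an array of fixed-size items groups together the highly similar most-significant bytes, which a generic compressor can then squeeze much better.

```python
import zlib
from array import array

def shuffle(buf: bytes, typesize: int) -> bytes:
    """Byte-transpose: emit byte 0 of every item, then byte 1, and so on.

    For arrays of similar numbers the high-order bytes barely change, so
    shuffling produces long runs that a generic codec compresses far better.
    """
    return bytes(buf[j] for i in range(typesize)
                 for j in range(i, len(buf), typesize))

def unshuffle(buf: bytes, typesize: int) -> bytes:
    # Inverse transpose: reassemble each item from its scattered bytes.
    n = len(buf) // typesize
    return bytes(buf[i * n + j] for j in range(n) for i in range(typesize))

data = array("d", (float(i) for i in range(10_000))).tobytes()

plain = zlib.compress(data, 1)
shuffled = zlib.compress(shuffle(data, 8), 1)
print(f"plain: {len(plain)} bytes  shuffled: {len(shuffled)} bytes")
assert unshuffle(shuffle(data, 8), 8) == data  # the filter is reversible
```

Blosc additionally works on cache-sized blocks, so that shuffling, compressing and decompressing all happen on data that is already resident in cache.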
An application of Blosc: The PyTables database

Here, I'll demonstrate a practical use case for Blosc in the context of the PyTables database. PyTables is a package meant to provide very high-speed access to binary data, which can be compressed transparently on the fly. In addition, it leverages Numexpr, a package that evaluates multiple-operator array expressions while making good use of the caches and branch prediction units of modern CPUs.

I hope to show how PyTables+Blosc+Numexpr can perform operations on large datasets at speeds that are near (if not beyond) those of optimized, but more traditional, C or Fortran algorithms. This will be another indication that, when proper programming techniques are applied, interpreted languages like Python can match, or even outperform, traditional C or Fortran codes.