Memory bandwidth optimization

jwatte's picture

Note that what I'm saying is sort of x86-centric, with specific illustration for Pentium II/III, although it'll also work on AMD and even on other platforms (like PowerPC). The more PC/workstation-like the platform, the truer this holds. DSPs with SRAM aren't anything like this, though -- programmer beware.

Cache control really is a very broad subject. The main point to remember is that the L1 cache sits inside the main memory shuffler, while L2 and DRAM sit outside. DRAM has very high latency before the first read, but can stream pretty well once you get it going; same thing for writing. The DRAM controller will keep streaming from/to DRAM only if there are no intervening reads/writes, and if the reads/writes hit sequential cache lines and are spaced VERY close together.

What this means is that reading A, then some table for A, then B, then some table for B, then writing out to C, then repeating, will cause 5 DRAM page-open stalls per loop iteration (well, actually one set per 4 or 8 iterations, depending on your cache line size). Instead, you could structure your code to work in blocks, and do something like:

  1. pre-read necessary coefficients/tables/what have you
  2. pre-read a block's worth of A
  3. pre-read a block's worth of B
  4. pre-read a block's worth of C (destination!)

The sum of 1+2+3+4 must be < 16 kB (L1 D-cache) on P-III, and < 8 kB on P-IV (waaa! that chip is meeeager!)

Once this is done, the data sits all in L1 cache, and operations on it are very efficient (on the order of 3 cycles of latency, which is hidden by the pipeline most of the time).
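The block structure above can be sketched like this. This is only a sketch under assumptions I'm making up for illustration: the block size, the float sample type, and the gain*(a+b) operation are all placeholders for whatever your real kernel does; the point is the outer loop that keeps each working set L1-sized.

```c
#include <stddef.h>

/* Chosen so the A, B, and C blocks (3 * 4 kB of floats here) plus any
   small coefficient table stay under the 16 kB P-III L1 D-cache.
   This number is an assumption -- tune it for your data and chip. */
#define BLOCK 1024

static void process_blocked(const float *a, const float *b,
                            float *c, size_t n, float gain)
{
    for (size_t i = 0; i < n; i += BLOCK) {
        size_t len = (n - i < BLOCK) ? n - i : BLOCK;
        /* steps 1-4 go here: pre-read the tables, then
           a[i..i+len), b[i..i+len), and c[i..i+len) */
        for (size_t j = 0; j < len; ++j)
            c[i + j] = gain * (a[i + j] + b[i + j]);
    }
}
```

Because the inner loop only touches data that the pre-reads just pulled into L1, the DRAM traffic happens in a few long sequential bursts per block instead of being scattered across every iteration.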

The only question that remains is how to get rid of the buffer you have now modified in C -- ideally, you'd want the controller to just dump all the data there at once, rather than evict one cache line at a time as you pre-read the next output buffer. There are two strategies (plus a variation):

  1. Because you won't be reading any RAM (it's all in L1), you don't really need to pre-read the output buffer. As you write to it, the memory controller will fill in cache lines when you start writing to them, and evict them as necessary. This is not ideal memory traffic, but the benefits from points 1-3 make it good enough. This is the state of the art on Pentium/PPro/P-II. The PPC has an additional "dcbz" instruction you can use instead of pre-reading for step 4); I think the Athlons have this too.
  2. You can use a temporary buffer for #4 that stays in L1 cache as long as you're processing buffers. After each block, you copy from the temporary buffer to the output buffer using "streaming writes" which bypass the cache; movntq is available on P-III and up as well as Athlon and up; movntps is P-III and Athlon XP.
  3. As a variation to 2), you can write directly to your output buffer using the streaming stores. This is usually good enough if you don't need to read back or re-use the output data you generate, and ideal if your inner loop fits in registers.
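Strategy 2 can be sketched with the SSE intrinsic wrapper for movntps, `_mm_stream_ps` (the function name `stream_out` and the buffer sizes are my own illustration, not anything from the post). This assumes an x86 compiler with SSE support:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Copy a scratch buffer (hot in L1) to the final destination using
   non-temporal stores that bypass the cache.  dst must be 16-byte
   aligned and n must be a multiple of 4 floats. */
static void stream_out(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
    _mm_sfence();  /* make the non-temporal stores globally visible */
}
```

Because the stores never allocate cache lines, the output buffer doesn't push your A/B blocks out of L1, and the writes combine into full-line bursts to DRAM.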

Now, as far as your performance problem goes, 11,000,000 samples per second times 2 bytes equals 22 MB per second, which should be sustainable on most EDO-and-better systems, but only if you get it streaming. Piecewise scattered accesses will cut the throughput of any memory subsystem to pieces. Chances are, none of your optimizations helped much, because you're already memory bound. I usually end up writing simple loops, no unrolling, etc, and put all my effort into memory management, and I usually hit my performance targets that way. Especially on x86, unrolling loops may actually hurt, because the unrolled loop needs more registers and puts more pressure on the dynamic-execution hardware, causing more stalls and worse performance.

Here's a routine that will pre-read a block of memory into the L1 cache in an attempt to take advantage of DRAM streaming:

#include <stddef.h>
#include <stdint.h>

void pre_read( void * base, int size )
{
  size_t const cache_line_size = 32; // P-II/III line size; works OK on others too
  // round base down to the start of its cache line
  char volatile * b = (char volatile *)((uintptr_t)base & ~(uintptr_t)(cache_line_size - 1));
  while( size > 0 ) {
    (void)*b; // force a read; volatile keeps the compiler from removing it
    size -= (int)cache_line_size;
    b += cache_line_size;
  }
}