
Note that what I'm saying is sort of x86-centric, with specific illustration for Pentium II/III, although it'll also work on AMD and even on other platforms (like PowerPC). The more PC/workstation-like the platform, the more this holds true. DSPs with SRAMs aren't anything like this, though -- programmer beware.
Cache control really is a very broad subject. The main point to remember is that the L1 cache sits inside the main memory shuffler, and L2 and DRAM sit outside. DRAM has a very high latency to start reading from it, but can stream pretty well once you get it going; same thing for writing. The DRAM controller will keep streaming from/to DRAM only if there are no intervening reads/writes, and only if the reads/writes are to/from sequential cache lines and spaced VERY close together.
What this means is that reading A, then some table for A, then B, then some table for B, then writing out to C, then repeating, will cause 5 DRAM open stalls per pass through the loop (well, actually per 4 or 8 passes, depending on your cache line size). Instead, you could structure your code to work in blocks, and do something like:
The sum of 1+2+3+4 must be < 16 kB (L1 D-cache) on P-III, and < 8 kB on P-IV (waaa! that chip is meeeager!)
Once this is done, the data sits all in L1 cache, and operations on it are very efficient (on the order of 3 cycles of latency, which is hidden by the pipeline most of the time).
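To make the size budget concrete, here is a rough sketch of how a block length could be derived from it. The helper name block_samples and all of its parameters are made up for illustration; the 16 kB figure is the P-III L1 D-cache size quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many samples per block fit in the L1 budget.
 *   l1_bytes     - L1 D-cache size (16384 on P-III, per the text)
 *   table_bytes  - total size of the lookup tables that stay resident
 *   streams      - sample buffers live at once (e.g. A, B and C = 3)
 *   sample_bytes - bytes per sample (2 for 16-bit samples)
 * Returns 0 when the tables alone use up the whole budget. */
size_t block_samples(size_t l1_bytes, size_t table_bytes,
                     size_t streams, size_t sample_bytes)
{
    assert(streams > 0 && sample_bytes > 0);
    if (table_bytes >= l1_bytes)
        return 0;
    return (l1_bytes - table_bytes) / (streams * sample_bytes);
}
```

With 4 kB of tables and three 16-bit streams, block_samples(16384, 4096, 3, 2) gives 2048 samples per block. Since the total has to stay strictly under the cache size, and the stack and other data want room too, it is probably wise to pass in a budget somewhat smaller than the real L1 size.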
The only question that remains is how to get rid of the buffer you now modified in C -- ideally, you'd want the controller to just dump all the data there at once, rather than evict one cache line at a time as you pre-read the next output buffer. There are two strategies:
Now, as far as your performance problem goes, 11,000,000 samples per second times 2 bytes equals 22 MB per second, which should be sustainable on most EDO-and-better systems, but only if you get it streaming. Piecewise scattered accesses will cut the throughput of any memory subsystem to pieces. Chances are, none of your optimizations helped much, because you're already so memory bound. I usually end up writing simple loops, no unrolling, etc., and put all my effort into memory management, and I usually hit my performance targets that way. Especially on x86, unrolling loops may actually hurt because the unrolled loop requires more registers and more dynamic execution hardware, causing more stalls and worse performance.
Here's a routine that will pre-read a block of memory into the L1 cache in an attempt to take advantage of DRAM streaming:
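#include <stdint.h> // uintptr_t

void
pre_read( void * base, int size )
{
    uintptr_t const cache_line_size = 32; // for P-II/III, works OK on others too
    if( size <= 0 ) return;
    // start at the cache line containing base, stop at the end of the buffer
    uintptr_t p = (uintptr_t)base & ~(cache_line_size - 1);
    uintptr_t const end = (uintptr_t)base + (uintptr_t)size;
    while( p < end ) {
        (void)*(char volatile *)p; // force a read
        p += cache_line_size;
    }
}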
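For what it's worth, here is a rough sketch of how a blocked loop might use this kind of pre-read. Everything in it is an assumption made for illustration: the names touch_lines and process_blocked, the 1024-sample block length, the 256-entry tables and the kernel itself. touch_lines is a compact variant of the pre-read idea that starts at the first byte of the buffer instead of rounding down, so the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE  ((uintptr_t)32)   /* cache line size assumed for P-II/III */
#define BLOCK ((size_t)1024)    /* hypothetical block length, in samples */

/* Touch one byte in every cache line of [base, base + size). */
static void touch_lines(const void *base, size_t size)
{
    uintptr_t p = (uintptr_t)base;
    uintptr_t end = p + size;
    while (p < end) {
        (void)*(const volatile char *)p;  /* force a read */
        p = (p | (LINE - 1)) + 1;         /* start of the next cache line */
    }
}

/* Hypothetical kernel: c[i] = table_a[a[i] & 255] + table_b[b[i] & 255]. */
void process_blocked(const int16_t *a, const int16_t *b, int16_t *c, size_t n,
                     const int16_t table_a[256], const int16_t table_b[256])
{
    /* Working set: three sample blocks plus both tables, under 16 kB. */
    assert(3 * BLOCK * sizeof(int16_t) + 2 * 256 * sizeof(int16_t) < 16 * 1024);

    touch_lines(table_a, 256 * sizeof *table_a);
    touch_lines(table_b, 256 * sizeof *table_b);

    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;

        /* Pull each stream in as one sequential burst... */
        touch_lines(a + start, len * sizeof *a);
        touch_lines(b + start, len * sizeof *b);
        touch_lines(c + start, len * sizeof *c);

        /* ...then do the actual work on data that should now sit in L1. */
        for (size_t i = 0; i < len; i++)
            c[start + i] = (int16_t)(table_a[a[start + i] & 255]
                                   + table_b[b[start + i] & 255]);
    }
}
```

The inner loop is deliberately plain, with no unrolling. The point is only that each stream gets touched in one sequential run before the compute loop starts, rather than interleaving five different memory regions on every sample. Whether this actually wins on a given machine is something to measure.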