The onward march of complexity (LRBni)

by jwatte

It used to be, a microprocessor was simple. It had instructions like "Load the accumulator from the address stored in the X register (LDA(X))." Or "Add register B to register A (ADD B, A)." And that was pretty much it. Time marched on, and we got more esoteric instructions, like "load effective address of scaled register indirect with offset" (LEA eax, [ebx*8+ecx]). Programmers adapted, and by then compilers were good enough that assembly was reserved for the most critical inner loops, and nothing else.
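Even that LEA is just pointer arithmetic in disguise. A minimal C sketch of what it computes (the function name is mine, for illustration):

```c
#include <assert.h>

/* The arithmetic behind "LEA eax, [ebx*8+ecx]": base + index*8,
 * computed in one instruction with no memory access. It's the same
 * math a compiler emits for &array[index] with 8-byte elements. */
static unsigned lea_scale8(unsigned ebx, unsigned ecx) {
    return ebx * 8 + ecx;
}
```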

Along came modern memory subsystems and execution units, where cache control started mattering a lot, and floating point performance went through the roof. DCBZ and DCBT were mainstays of the PowerPC, as well as EIEIO. At this point, both the function and the names of the instructions start becoming harder to decipher. "Data Cache Block Zero" would clear out 32 (or 64) bytes in a single instruction, as well as not cause unnecessary memory read traffic when filling an output buffer. "Data Cache Block Touch" would cause a memory pre-fetch, so that a cache line could be available at the point when the data would be needed. The latter also showed up in SSE instructions with PREFETCH. (Hey, that's an instruction name you can understand!) "Enforce In-order Execution of I/O" must be one of the best instruction names ever -- and also important to make sure that out-of-order, write-back memory architectures could still be used to write device drivers.
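In C, the same prefetch idea looks roughly like this, using the GCC/Clang __builtin_prefetch intrinsic. The lookahead distance of 16 is an arbitrary illustration, not a tuned value:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of what DCBT/PREFETCH buy you: ask the cache to start
 * fetching data a few iterations before we read it, so the line
 * has (hopefully) landed by the time we get there. */
double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}
```

The hint is purely advisory: the code computes the same result with or without it, which is exactly why it is safe to sprinkle into hot loops.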

Then came the MMX and SSE instructions from the Intel camp, where instructions like PUNPCKLBW [EBX + ECX], MM1 or SHUFPS xmm0, xmm0, 0x39 start stressing the programmer out for real. While some compilers can vectorize some code for you, and there are compiler intrinsics (almost as complex to use as the assembly itself), these kinds of instructions actually start requiring more assembly code again, because most modern programming languages do not lend themselves to expressing the specific vector computation accelerated by these instructions. Complexity increases.
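To see why these get hard to read, here is SHUFPS xmm0, xmm0, 0x39 modeled in plain C. This is a sketch of the same-register case only; with two different registers, the upper two lanes would come from the second operand:

```c
#include <assert.h>
#include <string.h>

/* Model of SHUFPS with both operands the same register: each 2-bit
 * field of the 8-bit immediate picks a source lane. imm = 0x39 is
 * binary 00 11 10 01, i.e. lanes 1,2,3,0 -- a rotate by one float. */
static void shufps_same_reg(float x[4], unsigned imm) {
    float src[4];
    memcpy(src, x, sizeof src);
    for (int lane = 0; lane < 4; lane++)
        x[lane] = src[(imm >> (2 * lane)) & 3];
}
```

One opaque hex immediate encodes an entire data-rearrangement pattern, which is precisely the kind of thing high-level languages have no syntax for.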

Larrabee, the chip-to-come from Intel, billed as the architecture for a zillion CPU cores AND a GPU all in one convenient plug-in card, is pushing this complexity to new levels. Not only does it introduce 32 new registers, each consisting of 16 (!) floating-point values, but it also adds 8 16-bit predication registers, as well as three-operand instructions (allowing a fused multiply-add to be a single instruction). Of course, the PPC and other CPUs have had that for a long time, but coupled with the other esoterica of the x86/x64 instruction set, this has a chance of pushing single assembly instructions past the line wrapping point in your text editor! Consider:
vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}

I'm not going to tell you what it does; I'll let you figure it out :-) It's not only amazing, but amazingly complicated. It's also moving further and further away from what higher level programming languages can currently express. Couple that with a few dozen execution cores running a handful of threads each, and trying to write code that runs zillions of little threadlets all in parallel, and the future seems to hold boundless opportunity for new and exciting complexity.
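If you'd rather check your answer than puzzle it out, here is my reading of that instruction, sketched as scalar C. This is an interpretation for illustration, not the official definition: k1 acts as a 16-bit write mask, and {4to16} broadcasts four floats from memory across the 16 lanes.

```c
#include <assert.h>

/* One reading of "vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}":
 * for each lane whose bit in k1 is set,
 *   v0[lane] = v5[lane] * mem[lane % 4] + v0[lane];
 * masked-off lanes keep their old values. The memory operand is
 * four floats, replicated to cover all sixteen lanes. */
static void vmadd231ps_sketch(float v0[16], const float v5[16],
                              const float mem[4], unsigned short k1) {
    for (int lane = 0; lane < 16; lane++)
        if (k1 & (1u << lane))
            v0[lane] = v5[lane] * mem[lane % 4] + v0[lane];
}
```

A multiply, an add, a predicated write, a scaled-index load, and a broadcast -- five operations folded into one mnemonic, which is the whole point of the complaint above.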