Consists of:
Instructions are:
variable length
Big endian: MSB at lowest address, e.g. MIPS, SPARC, PowerPC
Little endian: MSB at highest address, e.g. x86/x64
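A minimal C sketch to observe byte order at runtime (inspects which byte of a 32-bit value sits at the lowest address; names are illustrative):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 0x01020304;
    uint8_t *p = (uint8_t *)&x;   // view the same 4 bytes byte-by-byte
    // Little endian stores the LSB (0x04) at the lowest address,
    // big endian stores the MSB (0x01) there.
    printf("%s endian\n", p[0] == 0x04 ? "little" : "big");
    return 0;
}
```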
Instructions take multiple cycles
Instructions go through various stages
Idea: overlap instruction execution => same latency per instruction => increased throughput
Idea: subdivide complex operations (e.g. FP addition, multiplication) into simpler stages
Floating point sub-stages:
Wind-up/wind-down: time until the pipeline is full/empty again
Requires a large number of independent instructions
Scheduling by hardware or compiler
Cycle time defined by the longest stage => Pipeline stages need to take the same time => further split stages to achieve this
At least three stages: fetch, decode, execute
Pipeline stalls
Speculative execution
N operations, m pipeline stages: \(T_{pipe} = N + m - 1\) cycles vs. \(T_{seq} = mN\) cycles
Speedup: \(S = \frac{T_{seq}}{T_{pipe}} = \frac{mN}{N + m - 1}\); as \(N \to \infty\), \(S \to m\)
Throughput: \(\frac{N}{T_{pipe}} = \frac{N}{N + m - 1}\) results per cycle, approaching 1 instruction per cycle as \(N \to \infty\)
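Example: m = 5 stages, N = 1000 operations => \(T_{pipe} = 1000 + 5 - 1 = 1004\) cycles vs. \(T_{seq} = 5000\) cycles, i.e. speedup \(5000/1004 \approx 4.98\), close to the limit m = 5.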
Sequencing overhead
Hazards
CISC architecture:
C/C++ aliasing: -fno-alias, -fargument-noalias (compiler flags), restrict keyword
-> Fortran often faster than C because its language rules forbid aliasing
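A minimal C sketch of the aliasing problem and the restrict qualifier (function names are illustrative):

```c
// Without restrict the compiler must assume a, b, c may overlap,
// forcing it to reload b[i] and c[i] after every store to a[i].
void add(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

// restrict promises the pointers do not alias, so the compiler is
// free to keep values in registers, reorder, and vectorize the loop.
void add_restrict(float *restrict a, const float *restrict b,
                  const float *restrict c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```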
Software pipelining
Inlining might help
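A hand-written sketch of the idea behind software pipelining (normally the compiler's job; names and the scaling operation are illustrative): the load for iteration i+1 is issued while iteration i is computed, with explicit wind-up and wind-down code.

```c
// Overlap the memory latency of the next load with the current compute.
void scale(float *a, const float *b, int n, float s) {
    if (n <= 0) return;
    float next = b[0];              // prologue: first load (wind-up)
    for (int i = 0; i < n - 1; i++) {
        float cur = next;
        next = b[i + 1];            // load for the next iteration
        a[i] = cur * s;             // compute/store for the current one
    }
    a[n - 1] = next * s;            // epilogue (wind-down)
}
```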
Processor designed to execute multiple instructions per cycle
Additional hardware needed
A kind of ILP
Can be realized through vector instructions or superscalarity
SSE: 128-bit registers
AVX: 256-bit registers
Operations need to be independent
Vectorized code can be produced by the compiler
Compiler directives:
#pragma vector always
#pragma novector
#pragma vector aligned
#pragma omp simd
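A minimal sketch of directive-assisted vectorization with #pragma omp simd (function name is illustrative; assumes an OpenMP-capable compiler, e.g. built with -fopenmp or -fopenmp-simd):

```c
// The pragma asserts the loop iterations are independent,
// allowing the compiler to emit SIMD instructions for the loop.
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```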
Ways to utilize SIMD as a programmer:
Explicit vector programming (e.g. intrinsics)
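A minimal sketch of explicit vector programming with SSE intrinsics, using the 128-bit registers mentioned above (function name is illustrative; assumes n is a multiple of 4 for brevity):

```c
#include <xmmintrin.h>   // SSE intrinsics

// Adds two float arrays four elements at a time.
void vec_add(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; i += 4) {       // assumes n % 4 == 0
        __m128 vb = _mm_loadu_ps(&b[i]);   // load 4 floats (unaligned)
        __m128 vc = _mm_loadu_ps(&c[i]);
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
    }
}
```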
Memory bottleneck
Spatial and temporal locality
-> Caches
Types of caches:
Speed/capacity tradeoff
Levels of caches:
L1: usually separate data and instruction caches
L2, L3: unified, may be shared between cores
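A C sketch of why spatial locality matters for caches (names are illustrative): C stores matrices row-major, so row-wise traversal touches consecutive addresses and hits the cache, while column-wise traversal jumps a whole row ahead each step and misses far more often.

```c
#define N 1024

// Good: stride-1 accesses, each cache line is fully used.
double sum_rowwise(double m[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

// Bad: stride of N * sizeof(double) bytes, one element per cache line.
double sum_colwise(double m[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

For large N the row-wise version is typically several times faster, even though both functions do the same arithmetic.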