Stored Program Computer Architecture
Von Neumann Architecture
- Data and Instructions in same memory
- Instruction fetch
- Instruction decode
- fetch operands from memory
- perform operation
- write back
- next instruction
- Von Neumann bottleneck: the shared path to memory for both data and instructions limits performance
Instruction set architecture (ISA)
Consists of:
- opcodes
- registers
- operands
- addressing modes
Types of operations
- Arithmetic and logic
- Data transfer
- Control flow
- System
- Floating point
- String operations
- (Graphics)
- (Decimal)
CISC
Instructions are:
- powerful
- complex, e.g. modifying and storing data in one instruction
- variable length, depending on:
  - number of operands
  - operand types
  - addressing mode
RISC
- Load-Store architecture
- Large number of general-purpose registers -> simplifies compiler design
- Try to achieve same bit positions for different operations
Instructions are:
- fixed length
- uniform encoding -> eases pipelining & allows higher clock frequency
- simple
- usually one clock cycle
Explicitly Parallel Instruction Computing (EPIC)
- Increase number of instructions executed in parallel
- only major example: Intel Itanium (IA-64)
Endianness
Big endian: MSB at lowest address e.g. MIPS, SPARC, PowerPC
Little endian: MSB at highest address e.g. x86/x64
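A minimal C check: inspecting the byte at the lowest address of a known 32-bit pattern reveals the storage order of the machine it runs on.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x01020304;
    /* Reading the byte at the lowest address through a char-sized
     * pointer is well-defined and exposes the storage order. */
    uint8_t lowest = *(const uint8_t *)&word;
    /* Little endian stores the LSB (0x04) first; big endian the MSB (0x01). */
    printf("%s endian\n", lowest == 0x04 ? "little" : "big");
    return 0;
}
```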
Microarchitecture
Pipelining
Instructions take multiple cycles and pass through several stages.
Idea: overlap the execution of consecutive instructions => latency per instruction stays the same => throughput increases
Operation Pipelining
Idea: subdivide complex operations (e.g. FP addition, multiplication) into simpler stages. Floating-point sub-stages:
- Instruction decode
- Operand exponent alignment
- Actual operation
- Normalization
Wind-up/wind-down: time until the pipeline is full/empty again
Requires a large number of independent instructions
- Instruction-level parallelism (ILP)
- Scheduling by hardware or compiler
- Out-of-order execution
Cycle time defined by the longest stage => Pipeline stages need to take the same time => further split stages to achieve this
Instruction Pipelining
At least three stages:
- fetch
- decode
- execute
- Pipeline stalls
- Speculative execution
N operations, m pipeline stages:
\(T_{pipe} = N + m - 1\) cycles vs. \(T_{seq} = mN\) cycles
Speedup: \(S = \frac{T_{seq}}{T_{pipe}} = \frac{mN}{N + m - 1}\); as \(N \to \infty\), \(S \to m\)
Throughput: $$\frac{N}{T_{pipe}} = \frac{N}{N + m - 1}$$ approaches 1 instruction per cycle as \(N \to \infty\)
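For example, with assumed values \(m = 5\) stages and \(N = 1000\) operations:
$$ T_{pipe} = 1000 + 5 - 1 = 1004 \text{ cycles}, \quad T_{seq} = 5000 \text{ cycles}, \quad S = \frac{5000}{1004} \approx 4.98 $$
already close to the asymptotic speedup \(m = 5\).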
Pipeline issues
Sequencing overhead
- register delay
- clock skew
Hazards
CISC architecture:
- inconsistent instruction length
- inconsistent instruction complexity
C/C++ aliasing: compiler flags like -fno-alias and -fargument-noalias, or the restrict keyword, promise the compiler that pointers do not overlap -> Fortran is often faster than C because its language rules disallow such aliasing
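A minimal sketch of how `restrict` conveys the no-aliasing promise; the function and parameter names are illustrative:

```c
/* The restrict qualifiers promise that x and y never overlap, so the
 * compiler may keep values in registers and vectorize the loop without
 * runtime overlap checks or reloads after every store. */
void axpy(int n, float alpha, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```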
Pipeline Hazards
- Structural hazards: hardware limitation -> add hardware
- Data hazards: data dependencies -> out-of-order execution
- Control hazards: branches break the pipeline -> branch prediction
Pipeline optimization
Software pipelining
- utilize pipelined hardware better by reorganizing code (see sketch below)
- inlining might help
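A hand-written sketch of the idea for a scale loop (compilers usually apply this transformation themselves at higher optimization levels); the function name is illustrative:

```c
/* Manually software-pipelined copy-and-scale loop: the load for
 * iteration i+1 is issued while iteration i still computes, so the
 * memory latency overlaps with the multiply. */
void scale(const float *x, float *y, float a, int n)
{
    if (n <= 0) return;
    float next = x[0];
    for (int i = 0; i < n - 1; ++i) {
        float cur = next;
        next = x[i + 1];   /* start the next load early */
        y[i] = a * cur;
    }
    y[n - 1] = a * next;
}
```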
Superscalarity
Processor designed to execute multiple instructions per cycle
Additional hardware needed
A kind of ILP
- utilize with modulo variable expansion: compute partial results separately, add them up at the end (see sketch below)
- however, FP addition is not associative, so the result may differ slightly
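A minimal sketch of modulo variable expansion for a sum reduction, assuming n is a multiple of 4; the function name is illustrative:

```c
/* Four partial sums remove the loop-carried dependence on a single
 * accumulator, so independent adds can fill the FP pipeline slots.
 * Because FP addition is not associative, the result may differ
 * slightly from a straight left-to-right sum. */
float sum_mve(const float *x, int n)   /* assumes n % 4 == 0 */
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```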
SIMD
Can be realized through vector instructions or Superscalarity
SSE: 128-bit registers; AVX: 256-bit registers
Operations need to be independent
Vectorized code can be produced by compiler
- Loop unrolling
Compiler directives:
#pragma vector always
#pragma novectorize
#pragma vector aligned
#pragma omp simd
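A minimal example of the `#pragma omp simd` directive, assuming a compiler with OpenMP SIMD support (e.g. -fopenmp-simd for GCC/Clang); the function name is illustrative:

```c
/* The iterations are independent, so the directive asks the compiler
 * to generate SIMD instructions for the loop body. */
void vadd(const float *a, const float *b, float *c, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```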
Ways to utilize SIMD as a programmer:
- Compiler implicit vectorization
- Explicit vector programming
  - OpenMP
  - High-level classes mapped to SIMD functionality
- Compiler intrinsics (see sketch below)
- Inline assembly
- Assembly
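A sketch using SSE compiler intrinsics (x86, 128-bit registers, 4 floats each); `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` are standard SSE intrinsics from `<xmmintrin.h>`, the wrapper name is illustrative, and n is assumed to be a multiple of 4:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Element-wise addition, four floats per instruction. Unaligned
 * loads/stores are used, so no alignment guarantees are required. */
void vadd_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```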
Memory Hierarchies
Caches
Memory bottleneck
Spatial and temporal locality
-> Caches
Types of caches:
- unified caches -> data and instructions
- data caches
- instruction cache
- instruction trace cache / micro-op cache -> stores decoded instructions
Speed/capacity tradeoff
Levels of caches: L1 usually separate data and instruction caches; L2, L3 unified and may be shared between cores
Bandwidth \(B\), latency \(T_{lat}\), amount of data to be transferred \(N\):
overall transfer time \(T = T_{lat} + \frac{N}{B}\)
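For example, with assumed values \(T_{lat} = 100\,\mathrm{ns}\), \(B = 10\,\mathrm{GB/s}\), and a single 64-byte cache line (\(N = 64\,\mathrm{B}\)):
$$ T = 100\,\mathrm{ns} + \frac{64\,\mathrm{B}}{10\,\mathrm{GB/s}} \approx 100\,\mathrm{ns} + 6.4\,\mathrm{ns} \approx 106\,\mathrm{ns} $$
for small transfers, the latency term dominates.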
STREAM benchmark (measures sustainable memory bandwidth)
Hit rate \(\beta\); access times: memory \(T_m\), cache \(T_c\). Average access time: $$ T_{av} = \beta T_c + (1 - \beta) T_m $$
Speedup due to cache, with \(\tau = T_m / T_c\): $$ G(\tau, \beta) = \frac{T_m}{T_{av}} = \frac{\tau}{\beta + \tau (1 - \beta)} $$
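For example, with assumed values \(\beta = 0.9\) and \(\tau = 10\):
$$ G = \frac{10}{0.9 + 10 \cdot 0.1} = \frac{10}{1.9} \approx 5.3 $$
so a 90% hit rate on a cache ten times faster than memory roughly quintuples the average access speed.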