Stored Program Computer Architecture
Von Neumann Architecture
- Data and Instructions in same memory
- Instruction fetch
- Instruction decode
- fetch operands from memory
- perform operation
- write back
- next instruction
- Von Neumann bottleneck: the shared path to memory for both data and instructions limits performance
Instruction set architecture (ISA)
Consists of:
- opcodes
- registers
- operands
- addressing modes
Types of operations
- Arithmetic and logic
- Data transfer
- Control flow
- System
- Floating point
- String operations
- (Graphics)
- (Decimal)
CISC
Instructions are:
- powerful
- complex, e.g. modifying and storing data in one instruction
- variable length, depending on:
  - number of operands
  - operand types
  - addressing mode
RISC
- Load-Store architecture
- Large number of general-purpose registers -> simplifies compiler design
- Try to achieve same bit positions for different operations
Instructions are:
- fixed length
- uniform encoding -> eases pipelining & allows higher clock frequency
- simple
- usually one clock cycle
Explicitly Parallel Instruction Computing (EPIC)
- Increase number of instructions executed in parallel
- only major example: Intel Itanium (IA-64)
Endianness
Big endian: MSB at lowest address e.g. MIPS, SPARC, PowerPC
Little endian: MSB at highest address e.g. x86/x64
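A minimal C check: inspecting the byte at the lowest address of a known 32-bit pattern reveals the storage order of the machine it runs on.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x01020304;
    /* Reading the byte at the lowest address through a char-sized
     * pointer is well-defined and exposes the storage order. */
    uint8_t lowest = *(const uint8_t *)&word;
    /* Little endian stores the LSB (0x04) first; big endian the MSB (0x01). */
    printf("%s endian\n", lowest == 0x04 ? "little" : "big");
    return 0;
}
```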
Microarchitecture
Pipelining
Instructions take multiple cycles and pass through several stages.
Idea: overlap the execution of consecutive instructions => latency per instruction stays the same => throughput increases
Operation Pipelining
Idea: subdivide complex operations (e.g. FP addition, multiplication) into simpler stages. Floating-point sub-stages:
- Instruction decode
- Operand exponent alignment
- Actual operation
- Normalization
Wind-up/wind-down: time until the pipeline is full/empty again
Requires a large number of independent instructions
- Instruction-level parallelism (ILP)
- Scheduling by hardware or compiler
- Out-of-order execution
Cycle time defined by the longest stage => Pipeline stages need to take the same time => further split stages to achieve this
Instruction Pipelining
At least three stages:
- fetch
- decode
- execute
- Pipeline stalls
- Speculative execution
N operations, m pipeline stages:
\(T_{pipe} = N + m - 1\) cycles vs. \(T_{seq} = mN\) cycles
Speedup: \(S = \frac{T_{seq}}{T_{pipe}} = \frac{mN}{N + m - 1}\); as \(N \to \infty\), \(S \to m\)
Throughput: $$\frac{N}{T_{pipe}} = \frac{N}{N + m - 1}$$ approaches 1 instruction per cycle as \(N \to \infty\)
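For example, with assumed values \(m = 5\) stages and \(N = 1000\) operations:
$$ T_{pipe} = 1000 + 5 - 1 = 1004 \text{ cycles}, \quad T_{seq} = 5000 \text{ cycles}, \quad S = \frac{5000}{1004} \approx 4.98 $$
already close to the asymptotic speedup \(m = 5\).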
Pipeline issues
Sequencing overhead
- register delay
- clock skew
Hazards
CISC architecture:
- inconsistent instruction length
- inconsistent instruction complexity
C/C++ aliasing: compiler flags like -fno-alias and -fargument-noalias, or the restrict keyword, promise the compiler that pointers do not overlap -> Fortran is often faster than C because its language rules disallow such aliasing
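A minimal sketch of how `restrict` conveys the no-aliasing promise; the function and parameter names are illustrative:

```c
/* The restrict qualifiers promise that x and y never overlap, so the
 * compiler may keep values in registers and vectorize the loop without
 * runtime overlap checks or reloads after every store. */
void axpy(int n, float alpha, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```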
Pipeline Hazards
- Structural hazards: hardware limitation -> add hardware
- Data hazards: data dependencies -> out-of-order execution
- Control hazards: branches break the pipeline -> branch prediction
Pipeline optimization
Software pipelining
- utilize pipelined hardware better by reorganizing code (see sketch below)
- inlining might help
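A hand-written sketch of the idea for a scale loop (compilers usually apply this transformation themselves at higher optimization levels); the function name is illustrative:

```c
/* Manually software-pipelined copy-and-scale loop: the load for
 * iteration i+1 is issued while iteration i still computes, so the
 * memory latency overlaps with the multiply. */
void scale(const float *x, float *y, float a, int n)
{
    if (n <= 0) return;
    float next = x[0];
    for (int i = 0; i < n - 1; ++i) {
        float cur = next;
        next = x[i + 1];   /* start the next load early */
        y[i] = a * cur;
    }
    y[n - 1] = a * next;
}
```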
Superscalarity
Processor designed to execute multiple instructions per cycle
Additional hardware needed
A kind of ILP
- utilize with modulo variable expansion: compute partial results separately, add them up at the end (see sketch below)
- however, FP addition is not associative, so the result may differ slightly
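A minimal sketch of modulo variable expansion for a sum reduction, assuming n is a multiple of 4; the function name is illustrative:

```c
/* Four partial sums remove the loop-carried dependence on a single
 * accumulator, so independent adds can fill the FP pipeline slots.
 * Because FP addition is not associative, the result may differ
 * slightly from a straight left-to-right sum. */
float sum_mve(const float *x, int n)   /* assumes n % 4 == 0 */
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```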
SIMD
Can be realized through vector instructions or Superscalarity
SSE: 128-bit registers; AVX: 256-bit registers
Operations need to be independent
Vectorized code can be produced by compiler
- Loop unrolling
Compiler directives:
#pragma vector always
#pragma novectorize
#pragma vector aligned
#pragma omp simd
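A minimal example of the `#pragma omp simd` directive, assuming a compiler with OpenMP SIMD support (e.g. -fopenmp-simd for GCC/Clang); the function name is illustrative:

```c
/* The iterations are independent, so the directive asks the compiler
 * to generate SIMD instructions for the loop body. */
void vadd(const float *a, const float *b, float *c, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```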
Ways to utilize SIMD as a programmer:
- Compiler implicit vectorization
- Explicit vector programming
  - OpenMP
  - High-level classes mapped to SIMD functionality
- Compiler intrinsics (see sketch below)
- Inline assembly
- Assembly
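A sketch using SSE compiler intrinsics (x86, 128-bit registers, 4 floats each); `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` are standard SSE intrinsics from `<xmmintrin.h>`, the wrapper name is illustrative, and n is assumed to be a multiple of 4:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Element-wise addition, four floats per instruction. Unaligned
 * loads/stores are used, so no alignment guarantees are required. */
void vadd_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```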
Memory Hierarchies
Caches
Memory bottleneck
Spatial and temporal locality
-> Caches
Types of caches:
- unified caches -> data and instructions
- data caches
- instruction cache
- instruction trace cache / micro-op cache -> stores decoded instructions
Speed/capacity tradeoff
Levels of caches: L1 usually separate data and instruction caches; L2, L3 unified and may be shared between cores
Bandwidth \(B\), latency \(T_{lat}\), amount of data to be transferred \(N\):
overall transfer time \(T = T_{lat} + \frac{N}{B}\)
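For example, with assumed values \(T_{lat} = 100\,\mathrm{ns}\), \(B = 10\,\mathrm{GB/s}\), and a single 64-byte cache line (\(N = 64\,\mathrm{B}\)):
$$ T = 100\,\mathrm{ns} + \frac{64\,\mathrm{B}}{10\,\mathrm{GB/s}} \approx 100\,\mathrm{ns} + 6.4\,\mathrm{ns} \approx 106\,\mathrm{ns} $$
for small transfers, the latency term dominates.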
STREAM benchmark (measures sustainable memory bandwidth)
Hit rate \(\beta\); access times: memory \(T_m\), cache \(T_c\). Average access time: $$ T_{av} = \beta T_c + (1 - \beta) T_m $$
Speedup due to cache, with \(\tau = T_m / T_c\): $$ G(\tau, \beta) = \frac{T_m}{T_{av}} = \frac{\tau}{\beta + \tau (1 - \beta)} $$
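For example, with assumed values \(\beta = 0.9\) and \(\tau = 10\):
$$ G = \frac{10}{0.9 + 10 \cdot 0.1} = \frac{10}{1.9} \approx 5.3 $$
so a 90% hit rate on a cache ten times faster than memory roughly quintuples the average access speed.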