Stored Program Computer Architecture

Von Neumann Architecture

  • Data and instructions in the same memory
  • Execution cycle:

    1. Instruction fetch
    2. Instruction decode
    3. Fetch operands from memory
    4. Perform operation
    5. Write back
    6. Next instruction
  • Von Neumann bottleneck: the shared path to memory limits instruction and data throughput

Instruction set architecture (ISA)

Consists of:

  • opcodes
  • registers
  • operands
  • addressing modes

Types of operations

  • Arithmetic and logic
  • Data transfer
  • Control flow
  • System
  • Floating point
  • String operations
  • (Graphics)
  • (Decimal)

CISC

Instructions are:

  • powerful
  • complex, e.g. modifying and storing data in one instruction
  • variable length

    • number of operands
    • operand types
    • addressing mode

RISC

  • Load-store architecture
  • Large number of general-purpose registers -> simplifies compiler design
  • Same bit positions for instruction fields across operations -> simplifies decoding

Instructions are:

  • fixed length
  • uniform -> enables pipelining & higher clock frequencies
  • simple
  • usually execute in one clock cycle

Explicitly Parallel Instruction Computing (EPIC)

  • Compiler makes instruction-level parallelism explicit in the encoding, increasing the number of instructions executed in parallel
  • Only example: Intel Itanium

Endianness

Big endian: MSB at lowest address, e.g. MIPS, SPARC, PowerPC

Little endian: LSB at lowest address (MSB at highest), e.g. x86/x64
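
A minimal C check (illustrative, not from the source) that inspects the byte stored at the lowest address:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t x = 0x01020304;                 /* MSB is 0x01, LSB is 0x04 */
        const uint8_t *p = (const uint8_t *)&x;  /* byte at the lowest address */
        printf("%s endian\n", p[0] == 0x01 ? "big" : "little");
        return 0;
    }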

Microarchitecture

Pipelining

Instructions take multiple cycles and go through various stages.

Idea: overlap instruction execution => same latency per instruction, but increased throughput

Operation Pipelining

Idea: subdivide complex operations (e.g. FP addition, multiplication) into simpler stages.

Floating point sub-stages:

  1. Instruction decode
  2. Operand exponent alignment
  3. Actual operation
  4. Normalization

Wind-up/wind-down: time until the pipeline is full/empty again

Requires a large number of independent instructions

  • Instruction level parallelism (ILP)
  • Scheduling by Hardware or compiler

    • Out-of-Order Execution

Cycle time is defined by the longest stage => pipeline stages need to take the same time => split stages further to achieve this

Instruction Pipelining

At least three stages:

  1. fetch
  2. decode
  3. execute

  • Pipeline stalls
  • Speculative execution

For N operations and m pipeline stages: \(T_{pipe} = N + m - 1\) cycles vs. \(T_{seq} = mN\) cycles

Speedup: \(S = \frac{T_{seq}}{T_{pipe}} = \frac{mN}{N + m - 1}\). As \(N \to \infty\), \(S \to m\).

Throughput: $$\frac{N}{T_{pipe}} = \frac{N}{N + m - 1}$$ instructions per cycle; approaches 1 instruction per cycle as \(N \to \infty\).
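
For example (illustrative numbers): with \(m = 5\) stages and \(N = 100\) instructions, \(T_{pipe} = 100 + 5 - 1 = 104\) cycles versus \(T_{seq} = 500\) cycles, a speedup of about 4.8.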

Pipeline issues

Sequencing overhead

  • register delay
  • clock skew

Hazards

CISC architectures complicate pipelining:

  • inconsistent instruction length
  • inconsistent instruction complexity

C/C++ aliasing:

  • compiler flags -fno-alias, -fargument-noalias
  • restrict keyword (C99)
  • FORTRAN is often faster than C because its language rules forbid argument aliasing
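
A minimal sketch of the restrict keyword (illustrative function and names, not from the source): the qualifier promises the compiler that the pointers do not alias, which permits reordering and vectorization.

    #include <stddef.h>

    /* Without restrict, the compiler must assume c may overlap a or b
     * and reload memory on every iteration; with restrict it may
     * vectorize freely. */
    void vec_add(float *restrict c, const float *restrict a,
                 const float *restrict b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }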

Pipeline Hazards

  • Structural hazards: hardware resource limitations
    -> add hardware
  • Data hazards: data dependencies
    -> out-of-order execution
  • Control hazards: branches break the pipeline
    -> branch prediction

Pipeline optimization

Software pipelining

  • utilize pipelined hardware better by re-organizing code (see the sketch below)

inlining might help
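
A hand-written sketch of the idea (normally the compiler does this; work() and the other names are hypothetical): the load for iteration i+1 is issued while iteration i computes, so load latency overlaps with computation.

    float work(float x);  /* hypothetical per-element function */

    void process(float *out, const float *in, int n) {
        float cur = in[0];                /* wind-up: first load, assumes n >= 1 */
        for (int i = 0; i < n - 1; i++) {
            float next = in[i + 1];       /* start the next load early */
            out[i] = work(cur);           /* compute on the current element */
            cur = next;
        }
        out[n - 1] = work(cur);           /* wind-down: drain the last element */
    }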

Superscalarity

Processor designed to execute multiple instructions per cycle
Additional hardware needed

A kind of ILP

  • utilize with modulo variable expansion (see the sketch after this list)
  • compute partial results separately, add them up at the end
  • however, FP addition is not associative, so results may differ slightly
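
A minimal sketch of modulo variable expansion (illustrative names; assumes n is a multiple of 4): a sum reduction with four accumulators, so the adds are independent and can overlap in the pipeline and superscalar units.

    #include <stddef.h>

    float sum4(const float *a, size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (size_t i = 0; i < n; i += 4) {
            s0 += a[i];      /* four independent dependency chains */
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        /* combine partial sums at the end; reassociation may change the
         * result slightly because FP addition is not associative */
        return (s0 + s1) + (s2 + s3);
    }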

SIMD

Can be realized through vector instructions or superscalarity

  • SSE: 128-bit registers
  • AVX: 256-bit registers

Operations need to be independent

Vectorized code can be produced by the compiler

  • Loop unrolling

Compiler directives (usage sketch below):

#pragma vector always
#pragma novector
#pragma vector aligned
#pragma omp simd
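
A minimal usage sketch for the OpenMP directive (illustrative kernel; requires OpenMP 4.0+, e.g. -fopenmp-simd with GCC/Clang):

    /* asks the compiler to vectorize the loop */
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }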

Ways to utilize SIMD as a programmer:

  • Compiler implicit vectorization
  • Explicit Vector programming

    • OpenMP
    • High level classes mapped to SIMD functionality
  • Compiler intrinsics (see the sketch below)
  • Inline assembly
  • Assembly
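
A minimal sketch of the intrinsics approach (illustrative; assumes AVX support and that n is a multiple of 8):

    #include <immintrin.h>

    /* Adds two float arrays 8 elements (256 bits) at a time.
     * Unaligned loads/stores are used, so no alignment is required. */
    void vadd_avx(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
    }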

Hierarchies

Caches

Memory is the bottleneck; programs exhibit spatial and temporal locality
-> caches

Types of caches:

  • unified caches -> data and instructions
  • data caches
  • instruction caches
  • instruction trace cache / micro-op cache -> stores decoded instructions

Speed/capacity tradeoff

Levels of caches:

  • L1: usually separate data and instruction caches
  • L2, L3: unified, may be shared between cores

Bandwidth \(B\), latency \(T_{lat}\), data to be transferred \(N\):
overall transfer time \(T = T_{lat} + \frac{N}{B}\)
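
For example (illustrative numbers): with \(T_{lat} = 100\,\mathrm{ns}\) and \(B = 10\,\mathrm{GB/s}\), transferring \(N = 8\,\mathrm{B}\) takes about 100.8 ns (latency-dominated), while \(N = 8\,\mathrm{MB}\) takes about 0.8 ms (bandwidth-dominated).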

STREAM benchmark

Hit rate \(\beta\); access times: memory \(T_m\), cache \(T_c\). Average access time: $$ T_{av} = \beta T_c + (1 - \beta) T_m $$

Speedup due to cache, with \(\tau = T_m / T_c\): $$ G(\tau, \beta) = \frac{T_m}{T_{av}} = \frac{\tau}{\beta + \tau (1 - \beta)} $$
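
For example (illustrative numbers): with \(\tau = 10\) and \(\beta = 0.9\), \(G = \frac{10}{0.9 + 10 \cdot 0.1} = \frac{10}{1.9} \approx 5.3\).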

Prefetching

Cache mapping

Cache coherence