class4-pipeline-a.pdf

Overview
CS:APP Chapter 4
Computer Architecture
General Principles of Pipelining
Pipelined
Implementation
!
Goal
!
Difficulties
Creating a Pipelined Y86 Processor
Part I
!
Rearranging SEQ
!
Inserting pipeline registers
!
Problems with data and control hazards
Randal E. Bryant
Carnegie Mellon University
http://csapp.cs.cmu.edu
Real-World Pipelines: Car Washes
Sequential
CS:APP
–2–
CS:APP
Computational Example
Parallel
300 ps
20 ps
Combinational
logic
R
e
g
Delay = 320 ps
Throughput = 3.12 GOPS
Clock
System
Pipelined
Idea
–3–
!
Divide process into
independent stages
!
Move objects through stages
in sequence
!
At any given times, multiple
objects being processed
CS:APP
–4–
!
Computation requires total of 300 picoseconds
!
Additional 20 picoseconds to save result in register
!
Can must have clock cycle of at least 320 ps
CS:APP
3-Way Pipelined Version
100 ps
20 ps
100 ps
20 ps
100 ps
Comb.
logic
A
R
e
g
Comb.
logic
B
R
e
g
Comb.
logic
C
Pipeline Diagrams
20 ps
Unpipelined
R
Delay = 360 ps
e
Throughput = 8.33 GOPS
g
OP1
OP2
OP3
Clock
!
System
!
Divide combinational logic into 3 blocks of 100 ps each
!
Can begin new operation as soon as previous one passes
through stage A.
Cannot start new operation until previous one completes
3-Way Pipelined
OP1
OP2
OP3
" Begin new operation every 120 ps
!
Time
A
B
A
" 360 ps from start to finish
!
CS:APP
–5–
Operating a Pipeline
239
241 300 359
B
A
0
120
C
C
B
A
CS:APP
Limitations: Nonuniform Delays
C
B
240
Up to 3 operations in process simultaneously
–6–
Clock
A
C
B
Time
Overall latency increases
OP1
OP2
OP3
C
B
A
360
50 ps
20 ps
150 ps
20 ps
100 ps
Comb.
logic
R
e
g
Comb.
logic
B
R
e
g
Comb.
logic
C
A
C
480
Comb.
logic
A
20 ps
R
e
g
100 ps
Comb.
logic
B
20 ps
R
e
g
100 ps
Comb.
logic
C
OP1
OP2
OP3
20 ps
A
B
Clock
C
A
B
A
C
B
C
Time
R
e
g
Clock
–7–
R
Delay = 510 ps
e
Throughput = 5.88 GOPS
g
640
Time
100 ps
20 ps
CS:APP
–8–
!
Throughput limited by slowest stage
!
Other stages sit idle for much of the time
!
Challenging to partition system into balanced stages
CS:APP
Limitations: Register Overhead
Data Dependencies
50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps
Comb.
logic
R
e
g
Comb.
logic
R
e
g
Comb.
logic
R
e
g
Clock
R
e
g
Comb.
logic
R
e
g
Comb.
logic
Comb.
logic
As try to deepen pipeline, overhead of loading registers
becomes more significant
!
Percentage of clock cycle spent loading register:
" 1-stage pipeline:
" 3-stage pipeline:
" 6-stage pipeline:
High speeds of modern processor designs obtained through
very deep pipelining
CS:APP
Data Hazards
Comb.
logic
A
OP1
OP2
OP3
OP4
OP1
OP2
OP3
System
!
Each operation depends on result from preceding one
CS:APP
– 10 –
Data Dependencies in Processors
R
e
g
A
Clock
Time
6.25%
16.67%
28.57%
–9–
Comb.
logic
B
B
A
C
B
A
R
e
g
Comb.
logic
C
R
e
g
!
Clock
C
B
A
C
B
1
irmovl $50, %eax
2
addl %eax ,
3
mrmovl 100( %ebx ),
%ebx
%edx
Result from one instruction used as operand for another
" Read-after-write (RAW) dependency
C
!
Very common in actual programs
!
Must make sure our pipeline handles these properly
" Get correct results
Time
– 11 –
R
e
g
Combinational
logic
Delay = 420 ps, Throughput = 14.29 GOPS
!
!
R
e
g
" Minimize performance impact
!
Result does not feed back around in time for next operation
!
Pipelining has changed behavior of system
CS:APP
– 12 –
CS:APP
newPC
SEQ Hardware
valM
SEQ+ Hardware
New
PC
PC
data out
Mem.
control
Memory
read
Data
Data
memory
memory
write
Addr
!
!
Stages occur in sequence
One operation in process
at a time
!
valM
data out
read
Mem.
control
Memory
Data
Data
memory
memory
write
Addr
Execute
Bch
valE
CC
CC
ALU
ALU
ALU
A
!
Data
!
ALU
B
!
valB
dstE dstM srcA srcB
dstE dstM srcA srcB
Decode
A
B
Write back
Fetch
ifun
rA
rB
Instruction
Instruction
memory
memory
valC
valP
PC
PC
increment
increment
!
PC is no longer stored in
register
!
But, can determine PC
based on other stored
information
PC
CS:APP
– 13 –
Adding Pipeline Registers
valE, valM
W_icode, W_valM
Write back
valM
Task is to select PC for
current instruction
Based on results
computed by previous
instruction
Processor State
Register
Register M
file
file E
icode
Bch
valE
CC
CC
Execute
W_valE, W_valM, W_dstE, W_dstM
ALU
fun.
ALU
ALU
ALU
A
ALU
B
valA
Data
Data
memory
memory
Memory
A
Write back
icode
Fetch
ifun
rA
rB
valC
Instruction
Instruction
memory
memory
valP
PC
PC
increment
increment
PC
PC
PC
pIcode pBch
pValM
pValC
pValP
CS:APP
Pipeline Stages
Fetch
!
Addr, Data
W_icode, W_valM
W_valE, W_valM, W_dstE, W_dstM
W
Bch
Bch
ALU
ALU
valE
Read instruction
!
Compute incremented PC
aluA, aluB
Data
Data
memory
memory
M_icode,
M_Bch,
M_valA
Addr, Data
M
Bch
ALU
ALU
Decode
aluA, aluB
!
E
E
valA, valB
Execute
valA, valB
srcA, srcB
dstA, dstB
A
B
Register
Register M
file
file E
ALU
ALU
Read program registers
valA, valB
icode, valC
valP
valE
CC
CC
Execute
aluA, aluB
CC
CC
Execute
Memory
Select current PC
!
M
valE
CC
CC
B
Register
Register M
file
file E
– 14 –
Data
Data
memory
memory
M_icode,
M_Bch,
M_valA
Addr, Data
Execute
dstE dstM srcA srcB
valM
valM
Memory
valB
dstE dstM srcA srcB
Decode
valM
W
Decode
Data
PC Stage
ALU
fun.
valA
Still sequential
implementation
Reorder PC stage to put at
beginning
d_srcA,
d_srcB
Decode
A
B
Register M
Register
file
file
!
E
d_srcA,
d_srcB
Decode
A
B
Register
Register M
file
file
E
Write back
Operate ALU
D
valP
Write back
valP
icode, ifun
rA, rB
valC
Fetch
Instruction
Instruction
memory
memory
D
icode, ifun,
rA, rB, valC
PC
PC
increment
increment
Instruction
Instruction
memory
memory
Fetch
PC
PC
valP
Memory
valP
PC
PC
increment
increment
!
PC
PC
Write Back
f_PC
!
F
– 15 –
CS:APP
Instruction
Instruction
memory
memory
Read or write data memory
predPC
pState
icode, ifun,
rA, rB, valC
Fetch
– 16 –
valP
PC
PC
increment
increment
predPC
f_PC
F
Update register file
CS:APP
Write back
PIPE- Hardware
W
Write back
icode
valE
!
write
Memory
Pipeline registers hold
intermediate values
from instruction
execution
icode
M_valA
Bch
valE
valA
Predicted PC
E
icode
ifun
!
valC
valA
valB
D
through decode
d_rvalA
A
icode
ifun
rA
rB
valC
dstE dstM srcA srcB
W_valM
B
Jump taken/not-taken
!
Fall-through or target
address
PC
PC
increment
increment
!
M
icode
M_valA
Bch
valE
valA
dstE dstM
ALU
A
Execute
E
icode
ALU
fun.
ALU
ALU
CC
CC
ifun
ALU
B
valC
valA
!
f_PC
M_valA
F
rB
valC
valP
Predict
PC
Need
valC
Instr
valid
Need
regids
valB
Fetch
icode
d_rvalA
dstE dstM srcA srcB
A
dstE dstM srcA srcB
W_valM
B
Register
Register M
file
file E
ifun
rA
rB
Instruction
Instruction
memory
memory
valC
W_valE
valP
PC
PC
increment
increment
Predict
PC
f_PC
M_valA
Select
PC
W_valM
predPC
CS:APP
– 18 –
Our Prediction Strategy
Instructions that Don’
Don’t Transfer Control
PC
PC
increment
increment
Align
Align
Byte 0
D
F
M_icode
M_Bch
M_valA
W_icode
W_valM
Split
Split
To register file write
ports
predPC
CS:APP
rA
Decode
W_valM
– 17 –
ifun
Select
A
Read from memory
Register updates
Predict
PC
Select
PC
!
Predict next PC to be valP
!
Always reliable
Bytes 1-5
Call and Unconditional Jumps
Instruction
Instruction
memory
memory
Select
PC
!
Predict next PC to be valC (destination)
!
Always reliable
Conditional Jumps
predPC
Start fetch of new instruction after current one has completed
fetch stage
!
Predict next PC to be valC (destination)
!
Only correct if branch is taken
" Typically right 60% of time
" Not enough time to reliably determine next instruction
Return Instruction
Guess which instruction will follow
" Recover if prediction was incorrect
– 19 –
data in
d_srcA d_srcB
Return point
W_valE
valP
Instruction
Instruction
memory
memory
Fetch
!
Guess value of next PC
!
dstE dstM srcA srcB
Register
Register M
file
file E
" e.g., valC passes
!
Data
Data
memory
memory
Addr
Branch information
ALU
B
Decode
F
write
Memory
d_srcA d_srcB
Cannot jump past
stages
icode
dstE dstM
e_Bch
ALU
A
Execute
valM
data out
read
M_Bch
dstE dstM
ALU
fun.
ALU
ALU
Select
A
D
valE
e_Bch
CC
CC
Values passed from one
stage to next
Predicting the
PC
icode
Mem.
control
data in
Addr
M
W
Data
Data
memory
memory
M_Bch
Forward (Upward) Paths
!
Feedback Paths
dstE dstM
data out
read
Mem.
control
!
valM
!
CS:APP
– 20 –
Don’t try to predict
CS:APP
Recovering
from PC
Misprediction
M_icode
M_Bch
M_valA
W_icode
W_valM
D
icode
ifun
rA
rB
valC
Pipeline Demonstration
valP
Predict
PC
Need
valC
Instr
valid
Need
regids
Split
Split
PC
PC
increment
increment
Align
Align
Byte 0
Bytes 1-5
Instruction
Instruction
memory
memory
irmovl
$1,%eax
#I1
irmovl
$2,%ecx
#I2
irmovl
$3,%edx
#I3
irmovl
$4,%ebx
#I4
halt
1
2
3
4
5
6
7
8
9
F
D
F
E
D
F
M
E
D
F
W
M
E
D
F
W
M
E
D
W
M
E
W
M
W
#I5
Cycle 5
Select
PC
F
!
W
I1
File: demo-basic.ys
demo-basic.ys
predPC
M
I2
Mispredicted Jump
E
I3
" Will see branch flag once instruction reaches memory stage
" Can get fall-through PC from valA
!
D
I4
Return Instruction
" Will get return PC when ret reaches write-back stage
F
I5
CS:APP
– 21 –
Data Dependencies: 3 Nop’s
1
# demo-h3.ys
0x000: irmovl $10,%edx
0x006: irmovl
$3,%eax
0x00c: nop
0x00d: nop
F
2
D
F
3
E
D
F
4
5
M
E
D
F
W
M
E
D
F
0x00e: nop
0x00f: addl %edx,%eax
0x011: halt
6
W
M
E
D
F
7
Data Dependencies: 2 Nop’s
8
9
10
11
# demo-h2.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
F
E
D
F
M
E
D
F
W
M
E
D
F
0x006: irmovl
W
M
E
D
F
CS:APP
– 22 –
$3,%eax
0x00c: nop
0x00d: nop
W
M
E
D
W
M
E
0x00e: addl %edx,%eax
W
M
0x010: halt
6
7
8
9
10
W
M
E
D
F
W
M
E
D
W
M
E
W
M
W
W
Cycle 6
Cycle 6
W
R[%eax] ! 3
W
R[%eax] ! 3
•
•
•
– 23 –
Cycle 7
D
D
valA ! R[%edx] = 10
valB ! R[%eax] = 0
valA ! R[%edx] = 10
valB ! R[%eax]
=3
CS:APP
– 24 –
Error
CS:APP
Data Dependencies: 1 Nop
# demo-h1.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
F
E
D
F
M
E
D
F
W
M
E
D
F
0x006: irmovl
$3,%eax
0x00c: nop
0x00d: addl %edx,%eax
0x00f: halt
6
W
M
E
D
Data Dependencies: No Nop
7
8
9
# demo-h0.ys
1
2
3
4
5
6
7
8
0x000: irmovl $10,%edx
F
D
F
E
D
F
M
E
D
F
W
M
E
D
W
M
E
W
M
W
0x006: irmovl
W
M
E
$3,%eax
0x00c: addl %edx,%eax
W
M
0x00e: halt
W
Cycle 4
Cycle 5
M
W
M_valE = 10
M_dstE = %edx
R[%edx] ! 10
E
M
e_valE ! 0 + 3 = 3
E_dstE = %eax
M_valE = 3
M_dstE = %eax
D
•
•
•
D
valA ! R[%edx] = 0
valB ! R[%eax] = 0
– 25 –
Error
CS:APP
CS:APP
– 26 –
Branch Misprediction Trace
Branch Misprediction Example
# demo-j
demo-j.ys
0x000:
xorl %eax,%eax
0x002:
jne t
0x007:
irmovl $1, %eax
0x00d:
nop
0x00e:
nop
0x00f:
nop
0x010:
halt
0x011: t: irmovl $3, %edx
0x017:
irmovl $4, %ecx
0x01d:
irmovl $5, %edx
!
Error
valA ! R[ %edx] = 0
valB ! R[%eax] = 0
0x000:
xorl %eax,%eax
0x002:
jne t # Not taken
1
2
3
4
5
6
7
8
9
F
D
F
E
D
F
M
E
D
F
W
M
E
D
F
W
M
E
D
W
M
E
W
M
W
0x011: t: irmovl $3, %edx # Target
# Not taken
# Fall through
0x017:
irmovl $4, %ecx # Target+1
0x007:
irmovl $1, %eax # Fall Through
Cycle 5
M
!
# Target (Should not execute)
# Should not execute
# Should not execute
Incorrectly execute two
instructions at branch target
M_Bch = 0
M_valA = 0x007
E
valE ! 3
dstE = %edx
D
Should only execute first 8 instructions
valC = 4
dstE = %ecx
F
– 27 –
CS:APP
– 28 –
valC ! 1
rB ! %eax
CS:APP
Return Example
0x000:
0x006:
0x007:
0x008:
0x009:
0x00e:
0x014:
0x020:
0x020:
0x021:
0x022:
0x023:
0x024:
0x02a:
0x030:
0x036:
0x100:
0x100:
!
Incorrect Return Example
demo-ret.ys
# demo-ret
irmovl Stack,%esp # Intialize stack pointer
nop
# Avoid hazard on %esp
nop
nop
call p
# Procedure call
irmovl $5,%esi
# Return point
halt
.pos 0x20
p: nop
# procedure
nop
nop
ret
irmovl $1,%eax
# Should not be executed
irmovl $2,%ecx
# Should not be executed
irmovl $3,%edx
# Should not be executed
irmovl $4,%ebx
# Should not be executed
.pos 0x100
Stack:
# Stack: Stack pointer
– 29 –
Concept
Break instruction execution into 5 stages
!
Run instructions through in pipelined mode
Limitations
!
Can’t handle dependencies between instructions when
instructions follow too closely
!
Data dependencies
" One instruction writes register, later one reads it
Control dependency
" Instruction sets PC in way that pipeline did not predict correctly
" Mispredicted branch and return
Fixing the Pipeline
!
– 31 –
irmovl $1,%eax # Oops! F
0x02a:
irmovl $2,%ecx # Oops!
0x030:
irmovl $3,%edx # Oops!
0x00e:
irmovl $5,%esi # Return
Incorrectly execute 3
instructions following ret
F
D
E
D
F
M
E
D
F
W
M
E
D
F
W
M
E
D
W
M
E
W
M
W
W
M
valE = 1
dstE = %eax
E
valE ! 2
dstE = %ecx
D
valC = 3
dstE = %edx
F
Pipeline Summary
!
ret
0x024:
valM = 0x0e
Require lots of nops to avoid data hazards
CS:APP
!
!
0x023:
We’ll do that next time
CS:APP
– 30 –
valC ! 5
rB ! %esi
CS:APP