class4-pipeline-a.ppt

CS:APP Chapter 4
Computer Architecture
Pipelined Implementation, Part I
Randal E. Bryant
Carnegie Mellon University
http://csapp.cs.cmu.edu
CS:APP
Overview
General Principles of Pipelining
  Goal
  Difficulties
Creating a Pipelined Y86 Processor
  Rearranging SEQ
  Inserting pipeline registers
  Problems with data and control hazards

Real-World Pipelines: Car Washes
[Figure: three car-wash arrangements: Sequential, Parallel, Pipelined]
Idea
  Divide process into independent stages
  Move objects through stages in sequence
  At any given time, multiple objects are being processed

Computational Example
[Figure: combinational logic (300 ps) feeding a register (20 ps), driven by a clock. Delay = 320 ps, Throughput = 3.12 GOPS]
System
  Computation requires total of 300 picoseconds
  Additional 20 picoseconds to save result in register
  Must have clock cycle of at least 320 ps

3-Way Pipelined Version
[Figure: combinational logic blocks A, B, C (100 ps each), each followed by a register (20 ps), driven by a common clock. Delay = 360 ps, Throughput = 8.33 GOPS]
System
  Divide combinational logic into 3 blocks of 100 ps each
  Can begin new operation as soon as previous one passes through stage A
     Begin new operation every 120 ps
  Overall latency increases
     360 ps from start to finish

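The delay and throughput figures for the unpipelined and 3-way pipelined systems can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `pipeline_timing` and its parameters are hypothetical, and it assumes every stage ends in a register with the 20 ps delay used on the slides.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps=20):
    """Return (clock_ps, latency_ps, throughput_gops) for a pipeline.

    stage_delays_ps: combinational-logic delay of each stage, in ps.
    reg_delay_ps:    time to load a pipeline register (assumed 20 ps).
    """
    # The clock must be long enough for the slowest stage plus its register.
    clock_ps = max(stage_delays_ps) + reg_delay_ps
    # One operation passes through every stage, one clock cycle per stage.
    latency_ps = clock_ps * len(stage_delays_ps)
    # One operation completes per cycle; 1000 ps per ns gives giga-ops/s.
    throughput_gops = 1000.0 / clock_ps
    return clock_ps, latency_ps, throughput_gops


for name, stages in [("unpipelined", [300]), ("3-way", [100, 100, 100])]:
    clock, latency, gops = pipeline_timing(stages)
    print(f"{name}: clock {clock} ps, delay {latency} ps, {gops:.2f} GOPS")
```

Passing `[300]` models the single 300 ps block, and `[100, 100, 100]` models stages A, B, and C.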
Pipeline Diagrams
Unpipelined
  [Figure: OP1, OP2, OP3 placed one after another along the time axis]
  Cannot start new operation until previous one completes
3-Way Pipelined
        Time →
  OP1   A B C
  OP2     A B C
  OP3       A B C
  Up to 3 operations in process simultaneously

Operating a Pipeline
[Figure: 3-way pipeline (Comb. logic A, B, C at 100 ps each, each followed by a 20 ps register) with a timing diagram of OP1, OP2, OP3 against the clock. Time axis marked at 0, 120, 240, 360, 480, 600 ps; circuit state shown at times 239, 241, 300, and 359 ps]

Limitations: Nonuniform Delays
[Figure: combinational logic blocks A (50 ps), B (150 ps), C (100 ps), each followed by a register (20 ps), with a pipeline diagram of OP1, OP2, OP3. Delay = 510 ps, Throughput = 5.88 GOPS]
  Throughput limited by slowest stage
  Other stages sit idle for much of the time
  Challenging to partition system into balanced stages

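To see why the slowest stage dominates, the numbers on this slide can be worked out directly. This is a minimal sketch using the slide's stage delays; the variable names are illustrative only.

```python
REG_PS = 20                                # register load time from the slide
stage_ps = {"A": 50, "B": 150, "C": 100}   # combinational delay per stage

# Every stage shares one clock, so the slowest stage sets the period.
clock_ps = max(stage_ps.values()) + REG_PS
delay_ps = clock_ps * len(stage_ps)
throughput_gops = 1000 / clock_ps

# Time each stage's logic spends waiting for the clock edge every cycle.
idle_ps = {name: clock_ps - REG_PS - d for name, d in stage_ps.items()}

print(f"clock {clock_ps} ps, delay {delay_ps} ps, {throughput_gops:.2f} GOPS")
for name, idle in idle_ps.items():
    print(f"stage {name}: idle {idle} ps per cycle")
```

Stage B is never idle, while stage A waits for two thirds of the logic window; that idle time is the cost of the unbalanced partition.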
Limitations: Register Overhead
[Figure: 6-stage pipeline, each stage 50 ps of combinational logic followed by a 20 ps register, driven by a common clock. Delay = 420 ps, Throughput = 14.29 GOPS]
  As we try to deepen the pipeline, overhead of loading registers becomes more significant
  Percentage of clock cycle spent loading register:
     1-stage pipeline: 6.25%
     3-stage pipeline: 16.67%
     6-stage pipeline: 28.57%
  High speeds of modern processor designs obtained through very deep pipelining

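The register-overhead percentages follow from dividing the 300 ps of logic evenly among the stages. A short sketch, assuming perfectly balanced stages and the slide's 20 ps register delay (the function name is hypothetical):

```python
def register_overhead_pct(n_stages, total_logic_ps=300, reg_ps=20):
    """Percentage of each clock cycle spent loading the pipeline register."""
    # Balanced split: each stage gets an equal share of the logic delay.
    clock_ps = total_logic_ps / n_stages + reg_ps
    return 100 * reg_ps / clock_ps


for n in (1, 3, 6):
    print(f"{n}-stage pipeline: {register_overhead_pct(n):.2f}%")
```

The register time is fixed while the logic per stage shrinks, so the overhead fraction keeps growing as stages are added.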
Data Dependencies
[Figure: combinational logic and register driven by a clock; OP1, OP2, OP3 run one after another in time]
System
  Each operation depends on result from preceding one

Data Hazards
[Figure: 3-way pipeline (Comb. logic A, B, C with registers) and a pipeline diagram of OP1 through OP4 overlapping in stages A, B, C]
  Result does not feed back around in time for next operation
  Pipelining has changed behavior of system

Data Dependencies in Processors
    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx
  Result from one instruction used as operand for another
     Read-after-write (RAW) dependency
  Very common in actual programs
  Must make sure our pipeline handles these properly
     Get correct results
     Minimize performance impact

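The read-after-write relationships in the three-instruction example can be found mechanically. The sketch below is illustrative only: the register read and write sets are written out by hand rather than decoded from Y86 encodings, condition codes are ignored, and `raw_dependencies` is a hypothetical helper name.

```python
def raw_dependencies(program):
    """Return (writer, reader, register) triples for RAW dependencies.

    program: list of (text, registers_written, registers_read) tuples.
    Instructions are numbered from 1 in program order.
    """
    last_writer = {}   # register -> number of the latest instruction writing it
    deps = []
    for i, (_text, writes, reads) in enumerate(program, start=1):
        # Check reads first, so an instruction never depends on itself.
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        for reg in writes:
            last_writer[reg] = i
    return deps


program = [
    ("irmovl $50, %eax",       {"%eax"}, set()),
    ("addl %eax, %ebx",        {"%ebx"}, {"%eax", "%ebx"}),
    ("mrmovl 100(%ebx), %edx", {"%edx"}, {"%ebx"}),
]
for writer, reader, reg in raw_dependencies(program):
    print(f"instruction {reader} reads {reg} written by instruction {writer}")
```

For this program the helper reports that instruction 2 depends on instruction 1 through %eax, and instruction 3 depends on instruction 2 through %ebx.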
SEQ Hardware
[Figure: SEQ hardware structure. From bottom to top: Fetch (PC, instruction memory, PC increment; icode, ifun, rA, rB, valC, valP), Decode and Write back (register file; srcA, srcB, dstE, dstM; valA, valB), Execute (ALU, CC; valE, Bch), Memory (data memory; valM), New PC (newPC)]
  Stages occur in sequence
  One operation in process at a time

SEQ+ Hardware
[Figure: SEQ+ hardware structure. Same Fetch, Decode, Execute, Memory, and Write back stages as SEQ, with the PC logic at the bottom taking pIcode, pBch, pValM, pValC, pValP as inputs]
  Still sequential implementation
  Reorder PC stage to put at beginning
PC Stage
  Task is to select PC for current instruction
  Based on results computed by previous instruction
Processor State
  PC is no longer stored in register
  But, can determine PC based on other stored information

Adding Pipeline Registers
[Figure: SEQ+ hardware shown next to the pipelined version, in which pipeline registers F, D, E, M, W are inserted before the Fetch, Decode, Execute, Memory, and Write back stages. Signals shown include predPC and f_PC (F); icode, ifun, rA, rB, valC, valP (D); d_srcA, d_srcB, valA, valB (E); M_icode, M_Bch, M_valA (M); W_icode, W_valE, W_valM, W_dstE, W_dstM (W)]

Pipeline Stages
[Figure: pipelined hardware with pipeline registers F, D, E, M, W]
Fetch
  Select current PC
  Read instruction
  Compute incremented PC
Decode
  Read program registers
Execute
  Operate ALU
Memory
  Read or write data memory
Write Back
  Update register file

PIPE- Hardware
[Figure: PIPE- hardware structure with pipeline registers F, D, E, M, W]
  Pipeline registers hold intermediate values from instruction execution
Forward (Upward) Paths
  Values passed from one stage to next
  Cannot jump past stages
     e.g., valC passes through decode

Feedback Paths
[Figure: PIPE- hardware structure with the feedback paths highlighted]
Predicted PC
  Guess value of next PC
Branch information
  Jump taken/not-taken
  Fall-through or target address
Return point
  Read from memory
Register updates
  To register file write ports

Predicting the PC
[Figure: fetch-stage logic. Select PC takes predPC (from register F), M_icode, M_Bch, M_valA, W_icode, W_valM; instruction memory, Split, Align, and PC increment produce icode, ifun, rA, rB, valC, valP for register D; Predict PC produces the next predPC]
  Start fetch of new instruction after current one has completed fetch stage
     Not enough time to reliably determine next instruction
  Guess which instruction will follow
     Recover if prediction was incorrect

Our Prediction Strategy
Instructions that Don’t Transfer Control
  Predict next PC to be valP
  Always reliable
Call and Unconditional Jumps
  Predict next PC to be valC (destination)
  Always reliable
Conditional Jumps
  Predict next PC to be valC (destination)
  Only correct if branch is taken
     Typically right 60% of time
Return Instruction
  Don’t try to predict

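The prediction strategy amounts to a small selection function. The sketch below is one way to write it down, not the actual control logic of the processor: the instruction-class strings ("jXX", "call", "ret") and the function name `predict_pc` are assumptions made for illustration, and returning None for ret is simply this sketch's way of saying "no prediction".

```python
def predict_pc(icode, valC, valP):
    """Guess the address of the next instruction to fetch.

    icode: instruction class as a string (hypothetical encoding).
    valC:  constant word from the instruction (jump or call destination).
    valP:  address of the instruction that follows in memory.
    """
    if icode in ("jXX", "call"):
        # Jumps (conditional or not) and calls: predict the destination.
        return valC
    if icode == "ret":
        # The return address is not known at fetch time: no prediction.
        return None
    # Everything else falls through to the next sequential instruction.
    return valP


# jne at 0x002 with target 0x011 and fall-through address 0x007
print(hex(predict_pc("jXX", 0x011, 0x007)))
```

For conditional jumps this is an always-taken guess, so it is wrong whenever the branch falls through.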
Recovering from PC Misprediction
[Figure: fetch-stage logic. Select PC takes predPC (from register F), M_icode, M_Bch, M_valA, W_icode, W_valM]
Mispredicted Jump
   Will see branch flag once instruction reaches memory stage
   Can get fall-through PC from valA
Return Instruction
   Will get return PC when ret reaches write-back stage

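The two recovery cases can be folded into the choice of fetch address. The following sketch mirrors the signal names in the figure (M_icode, M_Bch, M_valA, W_icode, W_valM, predPC); the string encodings of the instruction codes and the function name `select_pc` are assumptions for illustration.

```python
def select_pc(M_icode, M_Bch, M_valA, W_icode, W_valM, predPC):
    """Choose the address to fetch from in the current cycle."""
    if M_icode == "jXX" and not M_Bch:
        # A jump predicted as taken turned out not taken:
        # restart at the fall-through address carried along in valA.
        return M_valA
    if W_icode == "ret":
        # The return address read from the stack is now available.
        return W_valM
    # No correction needed: continue with the predicted PC.
    return predPC


# jne in the memory stage, not taken, fall-through address 0x007
print(hex(select_pc("jXX", False, 0x007, "OPl", 0, 0x01d)))
```

Checking the memory-stage jump before the write-back-stage ret gives priority to the older... rather, to the condition listed first; the two cannot both need different answers for the same fetch in the examples on these slides, so the order here is a design choice of the sketch.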
Pipeline Demonstration
File: demo-basic.ys
                        1  2  3  4  5  6  7  8  9
irmovl $1,%eax  #I1     F  D  E  M  W
irmovl $2,%ecx  #I2        F  D  E  M  W
irmovl $3,%edx  #I3           F  D  E  M  W
irmovl $4,%ebx  #I4              F  D  E  M  W
halt            #I5                 F  D  E  M  W
Cycle 5
  W: I1
  M: I2
  E: I3
  D: I4
  F: I5

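The cycle-5 snapshot follows from a simple rule: with no stalls, instruction number i is fetched in cycle i and moves one stage per cycle. A minimal sketch of that rule (the function name `stage_contents` is hypothetical):

```python
STAGES = "FDEMW"   # fetch, decode, execute, memory, write back


def stage_contents(n_instructions, cycle):
    """Map each occupied stage to the instruction in it during `cycle`."""
    contents = {}
    for depth, stage in enumerate(STAGES):
        # The instruction fetched `depth` cycles ago is now in this stage.
        instr = cycle - depth
        if 1 <= instr <= n_instructions:
            contents[stage] = f"I{instr}"
    return contents


print(stage_contents(5, 5))
```

Calling it with other cycle numbers shows the pipeline filling (cycles 1 to 4) and draining (cycles 6 to 9).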
Data Dependencies: 3 Nop’s
# demo-h3.ys
                          1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: nop                      F  D  E  M  W
0x00d: nop                         F  D  E  M  W
0x00e: nop                            F  D  E  M  W
0x00f: addl %edx,%eax                    F  D  E  M  W
0x011: halt                                 F  D  E  M  W
Cycle 6
  W: R[%eax] ← 3
Cycle 7
  D: valA ← R[%edx] = 10
     valB ← R[%eax] = 3

Data Dependencies: 2 Nop’s
# demo-h2.ys
                          1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: nop                      F  D  E  M  W
0x00d: nop                         F  D  E  M  W
0x00e: addl %edx,%eax                 F  D  E  M  W
0x010: halt                              F  D  E  M  W
Cycle 6
  W: R[%eax] ← 3
  D: valA ← R[%edx] = 10
     valB ← R[%eax] = 0    Error

Data Dependencies: 1 Nop
# demo-h1.ys
                          1  2  3  4  5  6  7  8  9
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: nop                      F  D  E  M  W
0x00d: addl %edx,%eax              F  D  E  M  W
0x00f: halt                           F  D  E  M  W
Cycle 5
  W: R[%edx] ← 10
  M: M_valE = 3
     M_dstE = %eax
  D: valA ← R[%edx] = 0    Error
     valB ← R[%eax] = 0    Error

Data Dependencies: No Nop
# demo-h0.ys
                          1  2  3  4  5  6  7  8
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: addl %edx,%eax           F  D  E  M  W
0x00e: halt                        F  D  E  M  W
Cycle 4
  M: M_valE = 10
     M_dstE = %edx
  E: e_valE ← 0 + 3 = 3
     E_dstE = %eax
  D: valA ← R[%edx] = 0    Error
     valB ← R[%eax] = 0    Error

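The four nop experiments share one timing rule: a register written in write-back only holds its new value in the following cycle, so the reader's decode must come strictly after the writer's write-back cycle. The sketch below checks that rule; it assumes no stalling or forwarding, and `reads_stale_value` is a hypothetical name.

```python
def reads_stale_value(writer_fetch_cycle, nops_between):
    """True if the reader decodes before the writer's result is visible."""
    # F D E M W: write back happens four cycles after fetch.
    write_back_cycle = writer_fetch_cycle + 4
    # The reader is fetched after the writer and the intervening nops,
    # and reads its registers one cycle later, in decode.
    reader_decode_cycle = writer_fetch_cycle + 1 + nops_between + 1
    # The register file is updated at the end of the write-back cycle,
    # so a decode in that same cycle (or earlier) still sees the old value.
    return reader_decode_cycle <= write_back_cycle


# irmovl $3,%eax is fetched in cycle 2; addl follows after 0 to 3 nops.
for nops in range(4):
    status = "error" if reads_stale_value(2, nops) else "ok"
    print(f"{nops} nops: {status}")
```

Only the three-nop program separates the two instructions far enough for addl to read the updated %eax.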
Branch Misprediction Example
demo-j.ys
0x000:    xorl %eax,%eax
0x002:    jne t              # Not taken
0x007:    irmovl $1, %eax    # Fall through
0x00d:    nop
0x00e:    nop
0x00f:    nop
0x010:    halt
0x011: t: irmovl $3, %edx    # Target (Should not execute)
0x017:    irmovl $4, %ecx    # Should not execute
0x01d:    irmovl $5, %edx    # Should not execute
  Should only execute the first 7 instructions (through halt)

Branch Misprediction Trace
# demo-j
                                            1  2  3  4  5  6  7  8  9
0x000:    xorl %eax,%eax                    F  D  E  M  W
0x002:    jne t # Not taken                    F  D  E  M  W
0x011: t: irmovl $3, %edx # Target                F  D  E  M  W
0x017:    irmovl $4, %ecx # Target+1                 F  D  E  M  W
0x007:    irmovl $1, %eax # Fall Through                F  D  E  M  W
  Incorrectly execute two instructions at branch target
Cycle 5
  M: M_Bch = 0
     M_valA = 0x007
  E: valE ← 3
     dstE = %edx
  D: valC = 4
     dstE = %ecx
  F: valC ← 1
     rB ← %eax

Return Example
demo-ret.ys
0x000:    irmovl Stack,%esp  # Initialize stack pointer
0x006:    nop                # Avoid hazard on %esp
0x007:    nop
0x008:    nop
0x009:    call p             # Procedure call
0x00e:    irmovl $5,%esi     # Return point
0x014:    halt
0x020: .pos 0x20
0x020: p: nop                # procedure
0x021:    nop
0x022:    nop
0x023:    ret
0x024:    irmovl $1,%eax     # Should not be executed
0x02a:    irmovl $2,%ecx     # Should not be executed
0x030:    irmovl $3,%edx     # Should not be executed
0x036:    irmovl $4,%ebx     # Should not be executed
0x100: .pos 0x100
0x100: Stack:                # Stack: Stack pointer
  Require lots of nops to avoid data hazards

Incorrect Return Example
# demo-ret
0x023:    ret                       F  D  E  M  W
0x024:    irmovl $1,%eax # Oops!       F  D  E  M  W
0x02a:    irmovl $2,%ecx # Oops!          F  D  E  M  W
0x030:    irmovl $3,%edx # Oops!             F  D  E  M  W
0x00e:    irmovl $5,%esi # Return               F  D  E  M  W
  Incorrectly execute 3 instructions following ret
Snapshot with ret in write-back stage
  W: valM = 0x0e
  M: valE = 1
     dstE = %eax
  E: valE ← 2
     dstE = %ecx
  D: valC = 3
     dstE = %edx
  F: valC ← 5
     rB ← %esi

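The two traces (two wrong instructions after a mispredicted jump, three after ret) fit one counting rule: fetch keeps following the guessed path until the cycle in which the control instruction occupies the stage that supplies the correct PC. A small sketch of that count, with a hypothetical helper name and no stalling assumed:

```python
STAGES = ["F", "D", "E", "M", "W"]


def wrongly_fetched(resolve_stage):
    """Instructions fetched down the wrong path before the PC is corrected.

    resolve_stage: stage the control instruction occupies in the cycle
    when the correct PC is selected ("M" for a jump, "W" for ret).
    """
    # One instruction is fetched in each cycle strictly between the control
    # instruction's own fetch and the cycle it occupies resolve_stage.
    return STAGES.index(resolve_stage) - 1


print("mispredicted jump:", wrongly_fetched("M"))
print("ret:", wrongly_fetched("W"))
```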
Pipeline Summary
Concept
  Break instruction execution into 5 stages
  Run instructions through in pipelined mode
Limitations
  Can’t handle dependencies between instructions when instructions follow too closely
  Data dependencies
     One instruction writes register, later one reads it
  Control dependency
     Instruction sets PC in way that pipeline did not predict correctly
     Mispredicted branch and return
Fixing the Pipeline
  We’ll do that next time