Overview CS:APP Chapter 4 Computer Architecture General Principles of Pipelining Pipelined Implementation ! Goal ! Difficulties Creating a Pipelined Y86 Processor Part I ! Rearranging SEQ ! Inserting pipeline registers ! Problems with data and control hazards Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu Real-World Pipelines: Car Washes Sequential CS:APP –2– CS:APP Computational Example Parallel 300 ps 20 ps Combinational logic R e g Delay = 320 ps Throughput = 3.12 GOPS Clock System Pipelined Idea –3– ! Divide process into independent stages ! Move objects through stages in sequence ! At any given times, multiple objects being processed CS:APP –4– ! Computation requires total of 300 picoseconds ! Additional 20 picoseconds to save result in register ! Can must have clock cycle of at least 320 ps CS:APP 3-Way Pipelined Version 100 ps 20 ps 100 ps 20 ps 100 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C Pipeline Diagrams 20 ps Unpipelined R Delay = 360 ps e Throughput = 8.33 GOPS g OP1 OP2 OP3 Clock ! System ! Divide combinational logic into 3 blocks of 100 ps each ! Can begin new operation as soon as previous one passes through stage A. Cannot start new operation until previous one completes 3-Way Pipelined OP1 OP2 OP3 " Begin new operation every 120 ps ! Time A B A " 360 ps from start to finish ! CS:APP –5– Operating a Pipeline 239 241 300 359 B A 0 120 C C B A CS:APP Limitations: Nonuniform Delays C B 240 Up to 3 operations in process simultaneously –6– Clock A C B Time Overall latency increases OP1 OP2 OP3 C B A 360 50 ps 20 ps 150 ps 20 ps 100 ps Comb. logic R e g Comb. logic B R e g Comb. logic C A C 480 Comb. logic A 20 ps R e g 100 ps Comb. logic B 20 ps R e g 100 ps Comb. logic C OP1 OP2 OP3 20 ps A B Clock C A B A C B C Time R e g Clock –7– R Delay = 510 ps e Throughput = 5.88 GOPS g 640 Time 100 ps 20 ps CS:APP –8– ! Throughput limited by slowest stage ! Other stages sit idle for much of the time ! Challenging to partition system into balanced stages CS:APP Limitations: Register Overhead Data Dependencies 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps Comb. logic R e g Comb. logic R e g Comb. logic R e g Clock R e g Comb. logic R e g Comb. logic Comb. logic As try to deepen pipeline, overhead of loading registers becomes more significant ! Percentage of clock cycle spent loading register: " 1-stage pipeline: " 3-stage pipeline: " 6-stage pipeline: High speeds of modern processor designs obtained through very deep pipelining CS:APP Data Hazards Comb. logic A OP1 OP2 OP3 OP4 OP1 OP2 OP3 System ! Each operation depends on result from preceding one CS:APP – 10 – Data Dependencies in Processors R e g A Clock Time 6.25% 16.67% 28.57% –9– Comb. logic B B A C B A R e g Comb. logic C R e g ! Clock C B A C B 1 irmovl $50, %eax 2 addl %eax , 3 mrmovl 100( %ebx ), %ebx %edx Result from one instruction used as operand for another " Read-after-write (RAW) dependency C ! Very common in actual programs ! Must make sure our pipeline handles these properly " Get correct results Time – 11 – R e g Combinational logic Delay = 420 ps, Throughput = 14.29 GOPS ! ! R e g " Minimize performance impact ! Result does not feed back around in time for next operation ! Pipelining has changed behavior of system CS:APP – 12 – CS:APP newPC SEQ Hardware valM SEQ+ Hardware New PC PC data out Mem. control Memory read Data Data memory memory write Addr ! ! Stages occur in sequence One operation in process at a time ! valM data out read Mem. control Memory Data Data memory memory write Addr Execute Bch valE CC CC ALU ALU ALU A ! Data ! ALU B ! valB dstE dstM srcA srcB dstE dstM srcA srcB Decode A B Write back Fetch ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment ! PC is no longer stored in register ! But, can determine PC based on other stored information PC CS:APP – 13 – Adding Pipeline Registers valE, valM W_icode, W_valM Write back valM Task is to select PC for current instruction Based on results computed by previous instruction Processor State Register Register M file file E icode Bch valE CC CC Execute W_valE, W_valM, W_dstE, W_dstM ALU fun. ALU ALU ALU A ALU B valA Data Data memory memory Memory A Write back icode Fetch ifun rA rB valC Instruction Instruction memory memory valP PC PC increment increment PC PC PC pIcode pBch pValM pValC pValP CS:APP Pipeline Stages Fetch ! Addr, Data W_icode, W_valM W_valE, W_valM, W_dstE, W_dstM W Bch Bch ALU ALU valE Read instruction ! Compute incremented PC aluA, aluB Data Data memory memory M_icode, M_Bch, M_valA Addr, Data M Bch ALU ALU Decode aluA, aluB ! E E valA, valB Execute valA, valB srcA, srcB dstA, dstB A B Register Register M file file E ALU ALU Read program registers valA, valB icode, valC valP valE CC CC Execute aluA, aluB CC CC Execute Memory Select current PC ! M valE CC CC B Register Register M file file E – 14 – Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Execute dstE dstM srcA srcB valM valM Memory valB dstE dstM srcA srcB Decode valM W Decode Data PC Stage ALU fun. valA Still sequential implementation Reorder PC stage to put at beginning d_srcA, d_srcB Decode A B Register M Register file file ! E d_srcA, d_srcB Decode A B Register Register M file file E Write back Operate ALU D valP Write back valP icode, ifun rA, rB valC Fetch Instruction Instruction memory memory D icode, ifun, rA, rB, valC PC PC increment increment Instruction Instruction memory memory Fetch PC PC valP Memory valP PC PC increment increment ! PC PC Write Back f_PC ! F – 15 – CS:APP Instruction Instruction memory memory Read or write data memory predPC pState icode, ifun, rA, rB, valC Fetch – 16 – valP PC PC increment increment predPC f_PC F Update register file CS:APP Write back PIPE- Hardware W Write back icode valE ! write Memory Pipeline registers hold intermediate values from instruction execution icode M_valA Bch valE valA Predicted PC E icode ifun ! valC valA valB D through decode d_rvalA A icode ifun rA rB valC dstE dstM srcA srcB W_valM B Jump taken/not-taken ! Fall-through or target address PC PC increment increment ! M icode M_valA Bch valE valA dstE dstM ALU A Execute E icode ALU fun. ALU ALU CC CC ifun ALU B valC valA ! f_PC M_valA F rB valC valP Predict PC Need valC Instr valid Need regids valB Fetch icode d_rvalA dstE dstM srcA srcB A dstE dstM srcA srcB W_valM B Register Register M file file E ifun rA rB Instruction Instruction memory memory valC W_valE valP PC PC increment increment Predict PC f_PC M_valA Select PC W_valM predPC CS:APP – 18 – Our Prediction Strategy Instructions that Don’ Don’t Transfer Control PC PC increment increment Align Align Byte 0 D F M_icode M_Bch M_valA W_icode W_valM Split Split To register file write ports predPC CS:APP rA Decode W_valM – 17 – ifun Select A Read from memory Register updates Predict PC Select PC ! Predict next PC to be valP ! Always reliable Bytes 1-5 Call and Unconditional Jumps Instruction Instruction memory memory Select PC ! Predict next PC to be valC (destination) ! Always reliable Conditional Jumps predPC Start fetch of new instruction after current one has completed fetch stage ! Predict next PC to be valC (destination) ! Only correct if branch is taken " Typically right 60% of time " Not enough time to reliably determine next instruction Return Instruction Guess which instruction will follow " Recover if prediction was incorrect – 19 – data in d_srcA d_srcB Return point W_valE valP Instruction Instruction memory memory Fetch ! Guess value of next PC ! dstE dstM srcA srcB Register Register M file file E " e.g., valC passes ! Data Data memory memory Addr Branch information ALU B Decode F write Memory d_srcA d_srcB Cannot jump past stages icode dstE dstM e_Bch ALU A Execute valM data out read M_Bch dstE dstM ALU fun. ALU ALU Select A D valE e_Bch CC CC Values passed from one stage to next Predicting the PC icode Mem. control data in Addr M W Data Data memory memory M_Bch Forward (Upward) Paths ! Feedback Paths dstE dstM data out read Mem. control ! valM ! CS:APP – 20 – Don’t try to predict CS:APP Recovering from PC Misprediction M_icode M_Bch M_valA W_icode W_valM D icode ifun rA rB valC Pipeline Demonstration valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory irmovl $1,%eax #I1 irmovl $2,%ecx #I2 irmovl $3,%edx #I3 irmovl $4,%ebx #I4 halt 1 2 3 4 5 6 7 8 9 F D F E D F M E D F W M E D F W M E D W M E W M W #I5 Cycle 5 Select PC F ! W I1 File: demo-basic.ys demo-basic.ys predPC M I2 Mispredicted Jump E I3 " Will see branch flag once instruction reaches memory stage " Can get fall-through PC from valA ! D I4 Return Instruction " Will get return PC when ret reaches write-back stage F I5 CS:APP – 21 – Data Dependencies: 3 Nop’s 1 # demo-h3.ys 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop F 2 D F 3 E D F 4 5 M E D F W M E D F 0x00e: nop 0x00f: addl %edx,%eax 0x011: halt 6 W M E D F 7 Data Dependencies: 2 Nop’s 8 9 10 11 # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl W M E D F CS:APP – 22 – $3,%eax 0x00c: nop 0x00d: nop W M E D W M E 0x00e: addl %edx,%eax W M 0x010: halt 6 7 8 9 10 W M E D F W M E D W M E W M W W Cycle 6 Cycle 6 W R[%eax] ! 3 W R[%eax] ! 3 • • • – 23 – Cycle 7 D D valA ! R[%edx] = 10 valB ! R[%eax] = 0 valA ! R[%edx] = 10 valB ! R[%eax] =3 CS:APP – 24 – Error CS:APP Data Dependencies: 1 Nop # demo-h1.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl $3,%eax 0x00c: nop 0x00d: addl %edx,%eax 0x00f: halt 6 W M E D Data Dependencies: No Nop 7 8 9 # demo-h0.ys 1 2 3 4 5 6 7 8 0x000: irmovl $10,%edx F D F E D F M E D F W M E D W M E W M W 0x006: irmovl W M E $3,%eax 0x00c: addl %edx,%eax W M 0x00e: halt W Cycle 4 Cycle 5 M W M_valE = 10 M_dstE = %edx R[%edx] ! 10 E M e_valE ! 0 + 3 = 3 E_dstE = %eax M_valE = 3 M_dstE = %eax D • • • D valA ! R[%edx] = 0 valB ! R[%eax] = 0 – 25 – Error CS:APP CS:APP – 26 – Branch Misprediction Trace Branch Misprediction Example # demo-j demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t 0x007: irmovl $1, %eax 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx 0x017: irmovl $4, %ecx 0x01d: irmovl $5, %edx ! Error valA ! R[ %edx] = 0 valB ! R[%eax] = 0 0x000: xorl %eax,%eax 0x002: jne t # Not taken 1 2 3 4 5 6 7 8 9 F D F E D F M E D F W M E D F W M E D W M E W M W 0x011: t: irmovl $3, %edx # Target # Not taken # Fall through 0x017: irmovl $4, %ecx # Target+1 0x007: irmovl $1, %eax # Fall Through Cycle 5 M ! # Target (Should not execute) # Should not execute # Should not execute Incorrectly execute two instructions at branch target M_Bch = 0 M_valA = 0x007 E valE ! 3 dstE = %edx D Should only execute first 8 instructions valC = 4 dstE = %ecx F – 27 – CS:APP – 28 – valC ! 1 rB ! %eax CS:APP Return Example 0x000: 0x006: 0x007: 0x008: 0x009: 0x00e: 0x014: 0x020: 0x020: 0x021: 0x022: 0x023: 0x024: 0x02a: 0x030: 0x036: 0x100: 0x100: ! Incorrect Return Example demo-ret.ys # demo-ret irmovl Stack,%esp # Intialize stack pointer nop # Avoid hazard on %esp nop nop call p # Procedure call irmovl $5,%esi # Return point halt .pos 0x20 p: nop # procedure nop nop ret irmovl $1,%eax # Should not be executed irmovl $2,%ecx # Should not be executed irmovl $3,%edx # Should not be executed irmovl $4,%ebx # Should not be executed .pos 0x100 Stack: # Stack: Stack pointer – 29 – Concept Break instruction execution into 5 stages ! Run instructions through in pipelined mode Limitations ! Can’t handle dependencies between instructions when instructions follow too closely ! Data dependencies " One instruction writes register, later one reads it Control dependency " Instruction sets PC in way that pipeline did not predict correctly " Mispredicted branch and return Fixing the Pipeline ! – 31 – irmovl $1,%eax # Oops! F 0x02a: irmovl $2,%ecx # Oops! 0x030: irmovl $3,%edx # Oops! 0x00e: irmovl $5,%esi # Return Incorrectly execute 3 instructions following ret F D E D F M E D F W M E D F W M E D W M E W M W W M valE = 1 dstE = %eax E valE ! 2 dstE = %ecx D valC = 3 dstE = %edx F Pipeline Summary ! ret 0x024: valM = 0x0e Require lots of nops to avoid data hazards CS:APP ! ! 0x023: We’ll do that next time CS:APP – 30 – valC ! 5 rB ! %esi CS:APP
© Copyright 2025 Paperzz