Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems Hyoun Kyu Cho and Scott Mahlke University of Michigan, Ann Arbor December 2, 2012 1 University of Michigan Electrical Engineering and Computer Science Critical Path • Longest path between source and sink in DAG 2 University of Michigan Electrical Engineering and Computer Science Critical Path 3 2 5 2 5 3 10 14 2 2 5 9 3 3 2 [Saidi`08] 3 University of Michigan Electrical Engineering and Computer Science Critical Path for Multithreaded Programs Call Call Call StartLock ArBarrier Call Call ArBarrier ArBarrier Unlock EndLock T1 LvBarrier LvBarrier T2 T1 (a) Mutex Lock LvBarrier T3 T2 (b) Barrier [Hollingsworth`98] 4 University of Michigan Electrical Engineering and Computer Science Scalability of Multithreaded Programs Some benchmarks does not scale very well! 5 University of Michigan Electrical Engineering and Computer Science CPU Time Wasted on Synchronizations Synchronization is major bottleneck! 6 University of Michigan Electrical Engineering and Computer Science Arrival Time Variation 7 University of Michigan Electrical Engineering and Computer Science Accelerating Critical Path • ACS [Suleman et al. ASPLOS `09] – Critical sections • Voltage Boosting [Dreslinski `11] – Transactional bottlenecks • Booster [Miller et al. HPCA `12] – Alleviate performance variation – Reactive acceleration for barriers 8 University of Michigan Electrical Engineering and Computer Science Challenges and Opportunities of NTC • Poor single thread performance • Very sensitive to process variation – Running at the slowest one leads to severe loss – Likely to have performance heterogeneity • Potential for bigger frequency boosting 9 University of Michigan Electrical Engineering and Computer Science Objectives • Systematic way of identifying critical paths • Dealing with performance variation • Flexible control of core boosting 10 University of Michigan Electrical Engineering and Computer Science System Architecture offline instrumentation Instrumented Executable Compilation Target Program online Intermediate Representation Priority Weighted Probabilistic Priority Scheduler Monitor Observe Schedule Adjust Priority Parallelism Analysis Monitoring Logic 11 University of Michigan Electrical Engineering and Computer Science Lottery Scheduling • Each thread holds a number of tickets • Scheduler select fast mode thread by picking a ticket • Efficient implementation of proportional-share resource management • Responsive, flexible control over relative execution rate total = 20 random [0 .. 19] = 15 10 ∑ = 10 ∑ > 15? no 2 5 ∑ = 12 ∑ > 15? no 12 1 ∑ = 17 ∑ > 15? yes 2 [Waldspurger`94] University of Michigan Electrical Engineering and Computer Science Progress Monitoring • For data parallel threads • Slower threads are more likely to be in critical path • Divide task into multiple smaller chunks and instrument monitoring code • Monitoring code reduce number of tickets 13 University of Michigan Electrical Engineering and Computer Science Example of Progress Monitoring … pthread_barrier_wait(barrier); long PROGRESS_GRANULE = (k2 – k1) / NUM_STEPS; for ( i = k1 ; i < k2 ; i++ ) { float x_cost = dist(points->p[i],points->p[x],points->dim) * points->p[i].weight; float current_cost = points->p[i].cost; if ( x_cost < current_cost ) { switch_membership[i] = 1; cost_of_opening_x += x_cost – current_cost; } else { int assign = points->p[i].assign; lower[center_table[assign]] += current_cost – x_cost; } if ( (i – k1) % PROGRESS_GRANULE == 0 ) halve_priority_tickets(); } pthread_barrier_wait(barrier); … Loop Body 14 University of Michigan Electrical Engineering and Computer Science Priority Delegation • Thread holding a mutex is likely to be in critical path – Increase tickets when acquire mutex • More likely to be in critical path if other threads are waiting – Temporarily transfer waiting thread’s ticket to the thread holding mutex 15 University of Michigan Electrical Engineering and Computer Science Performance Evaluation • Post processing traces – Generated on 32-core machine • Four 8-core Intel Xeon X7560 • 24MB L3 cache per chip • 32GB Total memory – Augmented progress time indication • 1.5x, 2x, 5x, 10x acceleration for 1 fast mode core • Varying scheduling quantum from 1us to 1ms 16 University of Michigan Electrical Engineering and Computer Science Speedup for Streamcluster H/W User OS mode 17 University of Michigan Electrical Engineering and Computer Science Current Status instrumentation Instrumented Executable Compilation Target Program Monitor Observe Schedule Adjust Priority Parallelism Analysis Monitoring Logic Cores Intermediate Representation Priority Weighted Probabilistic Priority Scheduler 18 Normal Normal Turbo Turbo University of Michigan Electrical Engineering and Computer Science Conclusion & Future Work • Introduce S/W framework to improve multithreaded programs’ performance using core boosting • Combines static analysis, dynamic monitoring, and probabilistic priority scheduling to predict critical paths • Shows 5% ~ 27% performance improvement for streamcluster • Better model the tradeoff between performance and energy • Predicting critical paths on other type of parallelism 19 University of Michigan Electrical Engineering and Computer Science Thank you! 20 University of Michigan Electrical Engineering and Computer Science
© Copyright 2025 Paperzz