
Thomas J. Watson Research Center
Modeling Optimistic Concurrency
using Quantitative Dependence Analysis
Christoph von Praun
Rajesh Bordawekar
Calin Cascaval
PPoPP, February 2008
© 2007 IBM Corporation
Motivation
Which applications can benefit from optimistic concurrency?
How can the performance benefit be quantified?
1) Parallelization of sequential code: optimistic concurrency may simplify the
parallelization process.
– How much parallelism can a straightforward parallelization achieve?
– Given a set of tasks: How much do data dependences inhibit parallelism?
– Which data structures have to be changed or protected?
2) Optimization of parallel codes that synchronize through locks:
– Benefit of using optimistic vs. pessimistic concurrency control?
Quantitative dependence analysis can help to answer these questions.
Contributions
1) Quantitative dependence analysis on a task-based program
execution model. Key metrics are dependence density among
tasks and (algorithmically) available parallelism.
2) Tool that extracts input data for the model from a single-threaded program execution.
3) Case study on real applications using quantitative dependence
analysis as a guideline for the use of optimistic concurrency.
Outline
Motivation
Model
Experience
Conclusions
Example (1/2)
int filter(const double* arr, double* res, int len)
{
    int i, j = 0;
    double v;
    for (i = 0; i < len; ++i)
    {
        v = f(arr[i]);
        if (v < THRESH) {
            res[j] = v;
            ++j;
        }
    }
    return j;
}

Intended behavior: record every value v = f(arr[i]) that is below THRESH in the result array res. Entries in res can occur in any order.
Example (2/2)
Loop parallelization with optimistic concurrency
OpenTM [PACT07] notation
int filter(const double* arr, double* res, int len)
{
    int i, j = 0;
    double v;
    #pragma omp transfor private (i,v)
    for (i = 0; i < len; ++i)
    {
        v = f(arr[i]);
        if (v < THRESH) {
            res[j] = v;
            ++j;
        }
    }
    return j;
}
– Execution of method filter is a program phase.
– One loop iteration is a (speculative) task.
– Tasks may execute concurrently as unordered transactions.
– Memory-level dependences among tasks are resolved through the runtime system.
Model of a program execution
[Figure: a program is executed as a sequence of program phases; within a phase, tasks (e.g., a group of tasks corresponding to the same critical section) are connected by inter-task dependences, and a schedule maps tasks onto execution contexts (threads). Examples range from independent tasks with low dependence density, over critical sections with medium dependence density, to critical sections with high dependence density.]
Preliminary definitions
tasks: t, t1, t2, ...
set of tasks: T
all tasks in a program phase: Tp
length of a task: len(t)

read_set(t)  := { l | t reads from location l and t has not previously written to l }
write_set(t) := { l | t writes to location l }
flow_dep(t1, t2) := write_set(t1) ∩ read_set(t2)

For ordered tasks:
pred(t) := { s | execution of task s must precede t }
succ(t) := { s | execution of task s must follow t }
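These definitions map directly onto per-task access sets recorded by a profiler. A minimal sketch, assuming each task's read and write sets were captured as sorted, duplicate-free arrays of word addresses (the task_t layout and the helper name are illustrative, not from the paper):

#include <stdbool.h>
#include <stddef.h>

typedef unsigned long addr_t;

typedef struct {
    const addr_t *read_set;   /* sorted addresses read by the task            */
    size_t        n_read;     /* (excluding locations the task wrote earlier) */
    const addr_t *write_set;  /* sorted addresses written by the task         */
    size_t        n_write;
    size_t        len;        /* len(t), e.g. dynamic instruction count       */
} task_t;

/* flow_dep(t1, t2) is non-empty: does t1 write a location that t2 reads? */
static bool has_flow_dep(const task_t *t1, const task_t *t2)
{
    size_t i = 0, j = 0;
    while (i < t1->n_write && j < t2->n_read) {   /* merge-style scan of sorted sets */
        if (t1->write_set[i] == t2->read_set[j]) return true;
        if (t1->write_set[i] <  t2->read_set[j]) ++i;
        else                                     ++j;
    }
    return false;
}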
Data dependence
has_data_dep(t1, t2) := t1 ≠ t2 ∧ ( flow_dep(t1, t2) ≠ ∅ ∨ flow_dep(t2, t1) ≠ ∅ )

Dependence density

data_dep_dens(t) := ( Σ_{s ∈ Tp, has_data_dep(t,s)} len(s) ) / ( Σ_{s ∈ Tp \ {t}} len(s) )

data_dep_dens(T) := ( Σ_{t ∈ T} data_dep_dens(t) ) / |T|

Probability of flow dependence between t and some other randomly chosen task s ∈ Tp in the same program phase.

[Figure: example task graph t1 ... t7 with flow_dep edges.]
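Once the pairwise dependence relation has been extracted from the trace, both densities are simple aggregations. A sketch, assuming has_data_dep has already been evaluated into a symmetric boolean matrix dep and the task lengths into len (both names are ours):

#include <stdbool.h>
#include <stddef.h>

/* data_dep_dens(t): length-weighted fraction of the other tasks in the phase
 * that task t has a data dependence with.                                    */
static double task_dep_density(size_t t, size_t n_tasks,
                               const bool *dep /* n_tasks * n_tasks */,
                               const size_t *len)
{
    size_t dep_len = 0, other_len = 0;
    for (size_t s = 0; s < n_tasks; ++s) {
        if (s == t) continue;
        other_len += len[s];
        if (dep[t * n_tasks + s]) dep_len += len[s];
    }
    return other_len ? (double) dep_len / (double) other_len : 0.0;
}

/* data_dep_dens(T): average of the per-task densities over the phase.        */
static double phase_dep_density(size_t n_tasks, const bool *dep, const size_t *len)
{
    double sum = 0.0;
    for (size_t t = 0; t < n_tasks; ++t)
        sum += task_dep_density(t, n_tasks, dep, len);
    return n_tasks ? sum / (double) n_tasks : 0.0;
}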
Causal dependence (simple variant: ‘seq’)
has_causal_dep(t1, t2) := ( t2 ∈ succ(t1) ∧ has_causal_chain(t1, t2) ) ∨
                          ( t2 ∈ pred(t1) ∧ has_causal_chain(t2, t1) )

has_causal_chain(t1, t2) := flow_dep(t1, t2) ≠ ∅ ∨
                            ∃ t3 ∈ succ(t1) ∩ pred(t2) : flow_dep(t1, t3) ≠ ∅

Dependence density

causal_dep_dens(t) := ( Σ_{s ∈ Tp, has_causal_dep(t,s)} len(s) ) / ( Σ_{s ∈ Tp \ {t}} len(s) )

causal_dep_dens(T) := ( Σ_{t ∈ T} causal_dep_dens(t) ) / |T|

[Figure: example of ordered tasks t1 ... t6 with succ and flow_dep edges.]
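For ordered tasks numbered in program order (so that pred(t) are the lower-numbered tasks), the 'seq' causal-chain test reduces to a linear scan. A sketch, reusing an assumed has_flow_dep(t1, t2) predicate over the recorded access sets:

#include <stdbool.h>
#include <stddef.h>

/* Assumed helper (ours): flow-dependence test over the recorded access sets. */
extern bool has_flow_dep(size_t t1, size_t t2);

/* 'seq' variant: task t2 is causally delayed by t1 if t1 feeds t2 itself
 * or any task scheduled between them.                                        */
static bool has_causal_chain(size_t t1, size_t t2)   /* requires t1 < t2 */
{
    for (size_t t3 = t1 + 1; t3 <= t2; ++t3)
        if (has_flow_dep(t1, t3))
            return true;
    return false;
}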
Available parallelism
n1
avail _ par(T p ,n) :
 (1 dep _ dens(T p )) k
k 0

1 (1 dep _ dens(T p )) n
dep _ dens(T p )
Example
avail _dep
par(T
_ density
0.5 avail _ par(T p ,n)
p ) : lim
n
avail _ par(T p ,4)  1  0.5  0.25  0.125
1

contribution of useful dep _ dens(T p )



work (fraction of a task)
executed by thread k:
increasing probablity of conflict with additional concurrent tasks
diminishing return of additional execution units
PPoPP, February 2008
11 Corporation
© 2004 IBM
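A small sketch of the formula, handy for sanity-checking a measured dependence density (the helper name is ours, not tool output):

#include <math.h>
#include <stdio.h>

/* Available parallelism for a phase with dependence density d and n candidate tasks. */
static double avail_par(double d, int n)
{
    if (d <= 0.0)                       /* no dependences: every task is useful */
        return (double) n;
    /* closed form of the geometric sum  sum_{k=0}^{n-1} (1-d)^k               */
    return (1.0 - pow(1.0 - d, n)) / d;
}

int main(void)
{
    /* slide example: d = 0.5, n = 4  ->  1 + 0.5 + 0.25 + 0.125 = 1.875 */
    printf("avail_par(0.5, 4) = %.3f\n", avail_par(0.5, 4));
    printf("limit for d = 0.5 = %.3f\n", 1.0 / 0.5);
    return 0;
}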
Limitations
This model is idealized ...
– considers only memory-level dependences
– ignores shortage of, or contention on, other resources, e.g., execution threads or the memory access path
– does not model TM contention management
– for unordered tasks: the scheduler picks tasks at random
Outline
Motivation
Model
Experience
Conclusions
Methodology
1) Program annotation
   • Mark phase and task boundaries in the program code.
2) Recording
   • Dynamic binary instrumentation of a single-threaded program run.
   • Monitor the execution of phases and tasks (critical sections).
   • Sample 5% of the tasks: record the addresses of shared (non-stack) memory accesses.
3) Analysis
   • Compute the probability of a memory-level dependence among two randomly chosen tasks within a phase (word-granularity address disambiguation; coarser granularity is possible to model false sharing).
   • Compute dependence density and available parallelism.
Results - Overview
Unordered tasks

  program          src                  coverage [%-phase]   data-dep density   avail-par
  vacation-low     client.c:170         90.8                 0.0012             833
  vacation-high    client.c:170         86.7                 0.0026             385
  kmeans-low       normal.c:164         1.8                  0.0242             41
  kmeans-high      normal.c:164         3.5                  0.0497             20
  mysql-keycache   mf_keycache.c:1808   3.3                  0.2577             4
  mysql-keycache   mf_keycache.c:1863   3.5                  0.9946             1

Ordered tasks

  program   src             coverage [%-phase]   causal-dep density seq (avg)   causal-dep density win250 (avg)
  umt2k     snswp3d.c:357   100                  0.88–0.97 (0.91)               0.08–0.86 (0.31)
MySQL keycache (1/3)
MySQL keycache
– part of the MyISAM storage manager
– caches index blocks for database tables that reside on disk
– implementation: a thread-safe data structure protected by a single lock

ATIS SQL benchmark (serial execution)
1. create tables
2. insert records                                                        → phase-A
3. retrieve data (select, join, key prefix join, distinct, group join)   → phase-B: read-only database operation
4. drop tables
MySQL keycache (2/3)
Distribution of write-access probability in different critical sections:

[Figure: histogram; x-axis: probability of access in a specific critical section (bins <0.1, 0.1–0.2, ..., 0.8–0.9, >0.9); y-axis: fraction of total written locations [%], adding up to 100 per series; one series per critical section: mf_keycache.c:1808, mf_keycache.c:1863, mf_keycache.c:2083, mf_keycache.c:2284.]

The locations in the >0.9 bin are updated by almost every critical region; they are a potential scalability bottleneck for transactions and the reason for the high data-dependence density.
Example: keycache->global_cache_write++;
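A sketch of how such a per-location write-access probability could be derived from the recorded critical-section instances; the dense writes matrix is an assumption of this illustration, not the tool's actual data structure:

#include <stddef.h>

/* For one critical section: fraction of its dynamic instances that write a
 * given location.  writes[i * n_locs + l] is nonzero if instance i wrote
 * location l (both extracted from the recorded trace).                      */
static void write_access_probability(const unsigned char *writes,
                                     size_t n_instances, size_t n_locs,
                                     double *prob /* out, size n_locs */)
{
    for (size_t l = 0; l < n_locs; ++l) {
        size_t hits = 0;
        for (size_t i = 0; i < n_instances; ++i)
            hits += writes[i * n_locs + l] != 0;
        prob[l] = n_instances ? (double) hits / (double) n_instances : 0.0;
    }
}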
MySQL keycache (3/3)
Single lock is held during ~8% of the execution time of phase-B (retrieve)
– the 8% is split across ~500k critical section instances

Amdahl's Law: 8% of the workload is serial, 92% is ‘perfectly’ parallel
– assume, e.g., 92 processors
– maximum speedup: 100 / (8 + 92/92) = 100 / 9 ≈ 11.1
– simplifying assumption: the execution of the 500k critical section instances smooths out such that threads are not waiting ... this ideal model makes overly optimistic predictions
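The bound above is plain Amdahl's Law; a tiny worked check (the function name is ours):

#include <stdio.h>

/* Amdahl's Law: speedup with serial fraction s on p processors. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    /* slide example: 8% serial, 92 processors -> ~11.1 */
    printf("%.1f\n", amdahl_speedup(0.08, 92));
    return 0;
}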
Lessons learned:
– the read-only database operation is internally a read-write operation
– the keycache is an optimization designed for workloads with little or no concurrency
– the keycache may become a scalability bottleneck
Umt2k

– Simulation of energy transport/propagation across an object.
– The object is represented as a mesh (sparse, ‘unstructured’ data representation).
– Mesh traversal is 50% of the execution time.

[Figure: mesh, showing the start and the progress of a traversal. Graphic source: Wikipedia.]
Doacross loop: snswp3d.c 356-549
for (i = 1; i <= nelem; ++i) {
    ...
    /* Flux calculation */
    if (afezm > zero) {
        iexit = ixfez;
    for (ip = 1; ip <= npart; ++ip) {
        /* Compute sources */
        sigvx = sigvol_ref(ip, ix);
        ...
        /* Calculate average angular flux in each tet (PSIT) */
        psit_ref(ip) = stet + ybase * psi_inc_ref(ip, ix);
        psifez = sfez + xbase * psi_inc_ref(ip, ix);
        tpsic_ref(ip, ic) = tpsic_ref(ip, ic) + tetwtx * psit_ref(ip);
        psi_inc_ref(ip, iexit) = psi_inc_ref(ip, iexit) + two * afezm * psifez;
    ...
    ix = next_ref(i + 1);
    ixfez = konnect_ref(3, ix);
    ...
}

– The loop-carried dependence prevents doall parallelization; it is the potential point of conflict for speculative parallelization (in this case, the dependence occurs in almost every iteration).
– The accumulating updates are reductions.
RAW dependence distance in snswp3d.c 356-549
dependence distance 1: 98661 (37.2%), 2: 12173 (45.26%)

[Figure: histogram; x-axis: dependence distance [# iterations] (1 ... 121 and above); y-axis: count (0 ... 5000).]

Loop is doacross at runtime; 265K iterations, avg. dependence distance: ~12.5
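A sketch of how such a distance histogram could be computed from the recorded accesses, assuming each access carries the iteration that issued it and a compacted location index (the access_t layout is ours):

#include <stdlib.h>

typedef struct { size_t loc; size_t iter; int is_write; } access_t;

/* Histogram of RAW dependence distances for one loop.  'loc' is a compacted
 * location index in [0, n_locs); 'hist' has n_iters entries, zeroed by the
 * caller; hist[d] counts reads whose closest earlier writer is d iterations back. */
static void raw_distance_histogram(const access_t *trace, size_t n_acc,
                                   size_t n_locs, size_t n_iters, size_t *hist)
{
    long *last_writer = malloc(n_locs * sizeof *last_writer);
    for (size_t l = 0; l < n_locs; ++l)
        last_writer[l] = -1;                       /* no writer seen yet */

    for (size_t a = 0; a < n_acc; ++a) {
        const access_t *acc = &trace[a];
        if (acc->is_write) {
            last_writer[acc->loc] = (long) acc->iter;
        } else if (last_writer[acc->loc] >= 0) {
            size_t d = acc->iter - (size_t) last_writer[acc->loc];
            if (d > 0 && d < n_iters)              /* cross-iteration RAW only */
                ++hist[d];
        }
    }
    free(last_writer);
}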
Causal dependence density
Precise variant (win250): the scheduler considers a window of 250 consecutive tasks for execution.
Algorithmic considerations

– The iteration space is an unstructured graph.
– The iteration order is a linearization of that graph: a topological sort.
– Computation of the topo-sort is ~15% of the total umt2k runtime.
– Any topological sort is good (i.e., it respects the algorithmic dependences):
   • the one chosen is good for uniprocessor cache locality
   • a different one could be chosen that allows a larger speculation window
Scheduling experiment (1/2)
Schedule the iterations into k-wide execution buckets (a sketch of the packing follows after the algorithm):

[Figure: the topo-sorted iteration sequence is packed into k-wide buckets 1, 2, 3, 4, ...]

Algorithm
– from the tail of the topo-sort: compute the closest algorithmic dependences (RAW)
– from the head of the topo-sort: hoist each iteration into the bucket that
   • has a free slot
   • follows closest after the bucket that holds the iteration on which it depends
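A compact sketch of this bucket packing; the raw_pred array (closest earlier RAW predecessor of each iteration, or -1) is assumed to come from the dependence-distance analysis, and all names are ours:

#include <stdlib.h>

/* Pack n iterations (in topo-sort order) into k-wide buckets.  Each iteration
 * goes into the first bucket with a free slot that comes after the bucket of
 * its closest RAW predecessor.  Returns the bucket index per iteration.      */
int *schedule_buckets(const int *raw_pred, int n, int k)
{
    int *bucket = malloc(n * sizeof *bucket);   /* bucket index per iteration */
    int *fill   = calloc(n, sizeof *fill);      /* slots used per bucket      */

    for (int i = 0; i < n; ++i) {               /* head-to-tail over topo-sort */
        /* earliest legal bucket: right after the predecessor's bucket */
        int b = (raw_pred[i] >= 0) ? bucket[raw_pred[i]] + 1 : 0;
        while (fill[b] == k)                    /* skip full buckets */
            ++b;
        bucket[i] = b;
        ++fill[b];
    }
    free(fill);
    return bucket;                              /* caller frees */
}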
Scheduling experiment (2/2)

– Iterations in each bucket can execute in parallel (transactions are required to account for the occasional write-write dependence on reduction variables).
– Buckets execute in series.

How well did the buckets fill up? Bucket fill histogram for k=32:

[Figure: histogram; x-axis: bucket fill (1 ... 32); y-axis: percent of total buckets; the largest bin holds 98.2% of the buckets.]

Buckets fill up very well. Possible parallelism with k=32: 30.96.
Conclusion: turned a “doacross” loop into a “mostly doall” loop.
Outline
Motivation
Model
Experience
Conclusions
Concluding remarks

– We designed a simple execution model and analysis that estimates the performance potential of optimistic concurrency from a profiled, single-threaded program run.
– Key metrics: dependence density and available parallelism.
– The metrics capture application properties and abstract from runtime implementation artifacts.
– The methodology and tool proved useful in a project on transactional synchronization, where the goal was to identify and quantify opportunities for optimistic concurrency in today's programs:
   • little or no effort to adapt applications
   • no architecture simulation or STM overheads
   • quick turnaround time
praun@acm.org