Finding Highly Correlated Pairs Efficiently with Powerful Pruning
Jian Zhang, Joan Feigenbaum
CIKM’06
2007/5/3 Chen Yi-Chun

Outline
• Motivation
• TAPER
• Our approach
• Algorithm
• Experiment results
• Conclusion

Motivation
• We consider the problem of finding highly correlated pairs in a large data set.
• With massive data sets
  – The total number of pairs may exceed the main-memory capacity.
  – The computational cost of the naïve method is prohibitive.

TAPER
• Two passes:
  – Generate a set of candidate pairs whose correlation coefficients may be above the threshold.
  – Compute the correlation coefficients of the candidate pairs.

Cont.
• Advantage
  – Computational simplicity
    • To decide whether a pair (a, b) should be pruned, TAPER uses a pruning rule that considers only the frequencies of the individual items a and b.
• Disadvantage
  – A relatively large group of uncorrelated pairs is missed by the pruning rule.

Notation definition
• $R(a)$: the set of rows (baskets) that contain item a; $m$: the total number of rows.
• $sp(a) = |R(a)| / m$: the support of item a; $sp(ab) = |R(a) \cap R(b)| / m$.
• $\phi(a, b)$: the Pearson correlation coefficient of the pair (a, b); $\theta$: the correlation threshold.

Our approach
• Jaccard distance (JD), measured through the ratio
  $\frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|}$
  (the larger the ratio, the smaller the distance).
• A strong connection
  – If the pair (a, b) has a large correlation coefficient, then its JD must be small.

Cont.
• Pearson correlation coefficient:
  $\phi(a, b) = \frac{sp(ab) - sp(a)\,sp(b)}{\sqrt{sp(a)\,sp(b)\,(1 - sp(a))\,(1 - sp(b))}}$
• Assume that a and b are highly correlated, $\phi(a, b) \ge \theta$, and that $sp(a) \le sp(b)$.
• Because $sp(ab) \le sp(a)$, we can replace $sp(ab)$ with $sp(a)$ to obtain an upper bound:
  $\phi(a, b) \le \sqrt{\frac{sp(a)\,(1 - sp(b))}{sp(b)\,(1 - sp(a))}} = \frac{1}{S}$, where $S = \sqrt{\frac{sp(b)\,(1 - sp(a))}{sp(a)\,(1 - sp(b))}}$
• By the assumption that $sp(a) \le sp(b)$, $S \ge 1$.

Cont.
• $sp(ab) = \frac{|R(a) \cap R(b)|}{m}$
• For a pair with $\phi(a, b) \ge \theta$:
  $\frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|} = \frac{sp(ab)}{sp(a) + sp(b) - sp(ab)} \ge \frac{\theta}{S + 1/S - \theta} \ge \theta^2$
• The last inequality comes from the fact that $S \ge 1$: given $1 \le S \le 1/\theta$, this ratio achieves its minimum value of $\theta^2$ when $S = 1/\theta$.
• Our rule is thus to prune when
  $\frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|} < \theta^2$

Min-hash function
• $h_{min}(a) = \min_{r \in R(a)} \{h(r)\}$
• It has the following property:
  $\Pr(h_{min}(a) = h_{min}(b)) = \frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|}$

Cont.
• Ex1.
Assume that there are 10 rows (baskets) in total and we choose the following values of h:

  r     0   1   2   3   4   5   6   7   8   9
  h(r)  17  21  9   44  5   16  1   20  37  8

• Also assume that item 3 appears in baskets 2, 5, 8:
  $h_{min}(3) = \min\{h(2) = 9,\ h(5) = 16,\ h(8) = 37\} = 9$

False negative problem
• Note that this bound is tight.
  – Consider two items a and b with $R(a) \cap R(b) = R(a)$.
  – Assume sp(a) and sp(b) are very small.
• Then
  $\phi(a, b) \approx \sqrt{\frac{sp(a)}{sp(b)}}$, and $\frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|} = \frac{sp(a)}{sp(b)} \approx \phi(a, b)^2$
• Hence, if we prune a pair when $\frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|}$ falls below a threshold larger than $\theta^2$, we may have removed a pair whose $\phi(a, b) \ge \theta$.

Multiple min-hash function
• We use k independent min-hash functions and define an equivalence relation “$\sim$”
  – For two items a and b, $a \sim b$ if and only if a and b have the same min-hash values for all the k hash functions.
  – If $\Pr(a \sim b) = x$ with one min-hash function, then with k independent functions $\Pr(a \sim b) = x^k \le x$.
  – We repeat the whole process t times.

Cont.
• The probability that a and b belong to the same equivalence class in at least one of the t trials is
  $1 - (1 - x^k)^t$

Cont.
• Ex2. We show how candidates are generated after we obtain the min-hash values.
• In round 1, v1 of item 3 is equal to v1 of item 17. Hence (3, 17) is put in the candidate set.
• In round 2, no two vectors are equal.
• In round 3, v3 of item 9 is equal to v3 of item 17. Hence (9, 17) is put in the candidate set.

Algorithm

Experiment results

Conclusion
• Back to Ex2: (3, 9) is not in the candidate set. Does this agree with my substitution?
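
Sketch: min-hash candidate generation
The min-hash value of Ex1 and the k-function, t-round candidate generation of Ex2 can be sketched in Python as below. This is only an illustration under stated assumptions, not the algorithm from the slides or the paper: each hash function is modelled as a random permutation of the row ids, and the names `min_hash`, `candidate_pairs`, `hit_probability` and `H_EX1` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def min_hash(rows, h):
    """Return the smallest hash value over the rows (baskets) containing an item."""
    return min(h[r] for r in rows)


def candidate_pairs(item_rows, num_rows, k, t, seed=0):
    """Collect pairs whose k min-hash values all agree in at least one of t rounds.

    item_rows maps each item to the set of row ids R(item).
    Assumption: every hash function is a random permutation of the row ids.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(t):
        # Draw k fresh, independent hash functions for this round.
        hashes = [rng.sample(range(num_rows), num_rows) for _ in range(k)]
        # Items with identical k-value vectors fall into the same equivalence class.
        classes = defaultdict(list)
        for item, rows in item_rows.items():
            vector = tuple(min_hash(rows, h) for h in hashes)
            classes[vector].append(item)
        # Every pair inside a class becomes a candidate.
        for members in classes.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


def hit_probability(x, k, t):
    """Probability 1 - (1 - x^k)^t that a pair collides in at least one round."""
    return 1 - (1 - x ** k) ** t


# Ex1: the hash values h(0..9) from the slide; item 3 appears in baskets 2, 5, 8.
H_EX1 = [17, 21, 9, 44, 5, 16, 1, 20, 37, 8]
print(min_hash({2, 5, 8}, H_EX1))  # 9
```

Candidates produced this way still need the second pass, in which the exact correlation coefficient of each candidate pair is computed and compared with the threshold. Raising k lowers the chance that a weakly overlapping pair becomes a candidate, while raising t lowers the chance that a strongly overlapping pair is missed.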