Data Anonymization (1)

Outline
• Problem concepts
• Algorithms based on domain generalization hierarchies
• Algorithms on numerical data

The Massachusetts Governor Privacy Breach
[Figure: two released tables linked through a shared quasi-identifier]
• Medical data: SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, ...
• Voter list: Name, Address, Date Registered, Party Affiliation, Date Last Voted, ...
• Shared quasi-identifier: Zip Code, Birth Date, Sex
• 87% of the US population is uniquely identified by {Zip Code, Birth Date, Sex}
• Joining the two tables on the quasi-identifier linked the Governor's Name to his Diagnosis
[Sweeney, IJUFKS 2002]

Definitions
• Table: columns are attributes, rows are records
• Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
• K-anonymity: every QI value combination appearing in the table appears at least k times

Basic techniques
• Generalization: e.g., Zip {02138, 02139} → 0213*
• Domain generalization hierarchy: a chain A0 → A1 → ... → An,
  e.g., {02138, 02139} → 0213* → 021* → 02* → 0**; the hierarchy is a tree structure
• Suppression: remove a record entirely;
  suppression is required if we cannot find a k-anonymity group for a record

Balance
• Better privacy guarantee vs. lower data utility
• Many schemes satisfy a given k-anonymity specification; we want to minimize the
  distortion of the table in order to maximize data utility
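The two basic ideas above (generalization along a domain hierarchy, and the k-anonymity check on a quasi-identifier) can be sketched in code. This is a minimal illustration, not any of the cited algorithms; the table layout and the attribute names (`zip`, `sex`, `diag`) are hypothetical.

```python
from collections import Counter

def generalize_zip(zipcode: str, level: int) -> str:
    """Climb a ZIP-code generalization hierarchy by masking digits:
    level 0 -> 02138, level 1 -> 0213*, level 2 -> 021**, ..."""
    if level == 0:
        return zipcode
    return zipcode[:-level] + "*" * level

def is_k_anonymous(table, quasi_identifier, k):
    """Every QI value combination in the table must appear at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return all(c >= k for c in counts.values())

table = [
    {"zip": "02138", "sex": "M", "diag": "flu"},
    {"zip": "02139", "sex": "M", "diag": "cold"},
]
# Raw table: each QI combination appears once, so it is not 2-anonymous.
assert not is_k_anonymous(table, ["zip", "sex"], 2)

# Generalize zip to 0213* so the two records share a QI value.
for row in table:
    row["zip"] = generalize_zip(row["zip"], 1)
assert is_k_anonymous(table, ["zip", "sex"], 2)
```

If no generalization level makes a record's QI group reach size k, that record would have to be suppressed, as the slide notes.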
Criteria
• Minimal generalization: the minimal generalization that satisfies the k-anonymization specification
• Minimal table distortion: a minimal generalization with minimal utility loss;
  precision is used to evaluate the loss [Sweeney's papers]
• Application-specific utility

Complexity
• Finding the optimal generalization is NP-hard [Bayardo, ICDE05]
• So all proposed algorithms are approximation algorithms

Shared features of the different solutions
• All satisfy the k-anonymity specification; records that cannot be grouped are suppressed
• The differences lie in the utility loss/cost function:
  - Sweeney's precision metric
  - Discernibility & classification metrics
  - Information-privacy metric

Algorithms
• Assume the domain generalization hierarchy is given
• Goals: efficiency and utility maximization

Metrics to be optimized
Two cost metrics we want to minimize [Bayardo, ICDE05]:
• Discernibility: the number of items in each k-anonymity group
• Classification: when the dataset has a class label column, preserve the classification
  model — count the records in minority classes within each group

Information-privacy metric [Wang, ICDE04]
• A combination of information loss and anonymity gain

Information loss
• The dataset has class labels; entropy measures the impurity of the labels in a set S:
  Info(S) = -Σ_i p_i log p_i, where p_i is the fraction of records with label i
• Information loss of a generalization G: {c1, c2, ..., cn} → p:
  I(G) = Info(S_p) - Σ_i (N_ci / N_p) Info(S_ci)

Anonymity gain
• A(VID): the number of records sharing the value VID
• A^G(VID) >= A(VID): generalization improves or does not change A(VID)
• P(G) = x - A(VID), where x = A^G(VID) if A^G(VID) <= k, and x = k otherwise
• Once k-anonymity is satisfied, further generalization of the VID gains nothing

Information-privacy combined metric
• IP = information loss / anonymity gain = I(G) / P(G); we want to minimize IP
• If P(G) == 0, use I(G) only
• Either a small I(G) or a large P(G) reduces IP
• If the P(G) values are the same, pick the generalization with minimum I(G)

Domain-hierarchy based algorithms
• Sweeney's algorithm
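The entropy-based metrics above can be sketched as follows. This is a hedged reading of the slide, not Wang's exact formulation: child groups are represented as lists of class labels, and A^G(VID) is taken to be the total size of the merged child groups.

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(S) = -sum_i p_i log2 p_i over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_loss(children):
    """I(G) = Info(S_p) - sum_i (N_ci / N_p) * Info(S_ci), where the parent
    set S_p is the union of the child groups {c1, ..., cn} merged by G."""
    parent = [label for child in children for label in child]
    n_p = len(parent)
    return info(parent) - sum(len(c) / n_p * info(c) for c in children)

def ip_metric(children, a_vid, k):
    """IP = I(G) / P(G), with P(G) = min(A^G(VID), k) - A(VID).
    Falls back to I(G) alone when P(G) == 0, as the slide prescribes."""
    a_g = sum(len(c) for c in children)   # records sharing the VID after G (assumed)
    p_g = min(a_g, k) - a_vid
    i_g = info_loss(children)
    return i_g if p_g == 0 else i_g / p_g

# Merging two pure groups with opposite labels loses all class information:
pure = [["yes", "yes"], ["no", "no"]]
print(info_loss(pure))   # -> 1.0 (parent entropy 1.0, children entropy 0)
```

Capping x at k in P(G) encodes the slide's point that generalizing beyond the k-anonymity threshold earns no further anonymity credit.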
• Bayardo's tree-pruning algorithm
• Wang's top-down and bottom-up algorithms
• These are all dimension-by-dimension methods

Multidimensional techniques
• Categorical data? Categories are mapped to numbers (Bayardo's approach);
  does the ordering matter? (No research on that.)
• Numerical data: k-anonymization becomes n-dimensional space partitioning,
  so many existing spatial techniques can be applied
• Single-dimensional vs. multidimensional

The evolving procedure
• Categorical, domain hierarchy [Sweeney; top-down/bottom-up]
• → numerized categories, single-dimensional [Bayardo05]
• → numerized/numerical, multidimensional [Mondrian, spatial indexing, ...]

Method 1: Mondrian
• Numerize the categorical data
• Apply a top-down partitioning process (step 1, step 2.1, step 2.2, ...),
  making only allowable cuts: each side of a cut must keep at least k records

Method 2: spatial indexing
• Multidimensional spatial techniques:
  - kd-tree (similar to the Mondrian algorithm)
  - R-tree and its variations (R-tree, R+-tree): an upper layer of bounding
    boxes over a leaf layer
• Compacting bounds preserves information better;
  example: uncompacted: age [1-80], salary [10k-100k];
  compacted: age [20-40], salary [10k-50k]
• The original Mondrian does not consider compacting bounds;
  for the R+-tree, it is done automatically

Benefits of using the R+-tree
• Scalable: originally designed for indexing large disk-based data
• Multi-granularity k-anonymity via the tree layers
• Better performance and better quality than Mondrian

Utility metrics
• Discernibility penalty
• KL divergence: describes the difference between a pair of distributions
  (the original vs. the anonymized data distribution)
• Certainty penalty: for table T, record t, and m attributes, sum over the
  attributes the ratio |t.Ai| / |T.Ai| of the generalized range to the total range

Other issues
• Sparse high-dimensional data: transactional data as a boolean matrix
  ["On the anonymization of sparse high-dimensional data", ICDE08];
  this relates to the clustering problem of transactional data — the paper
  above uses matrix-based clustering; item-based clustering (?)
• Effect of numerizing categorical data: the ordering of the categories
  may have some impact on quality
• General-purpose utility metrics vs.
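A simplified Mondrian-style median-cut partitioner and the certainty penalty can be sketched together. This is a sketch under assumptions: records are dicts of numeric attributes (categorical data is assumed already numerized), and tie handling and other details of the real Mondrian algorithm are omitted.

```python
import statistics

def mondrian(records, attrs, k):
    """Mondrian-style top-down partitioning (simplified sketch):
    split on the attribute with the widest range, at the median, and
    accept a cut only if both sides keep at least k records."""
    def spread(a):
        vals = [r[a] for r in records]
        return max(vals) - min(vals)
    for a in sorted(attrs, key=spread, reverse=True):
        cut = statistics.median(r[a] for r in records)
        left = [r for r in records if r[a] <= cut]
        right = [r for r in records if r[a] > cut]
        if len(left) >= k and len(right) >= k:      # allowable cut
            return mondrian(left, attrs, k) + mondrian(right, attrs, k)
    return [records]   # no allowable cut: this partition is a k-anonymous group

def certainty_penalty(groups, attrs, totals):
    """Certainty penalty: for each record, sum over attributes the
    generalized range |t.Ai| divided by the total range |T.Ai|.
    Uses compacted bounds (min/max of each group), per the slide."""
    penalty = 0.0
    for g in groups:
        for a in attrs:
            vals = [r[a] for r in g]
            penalty += len(g) * (max(vals) - min(vals)) / totals[a]
    return penalty

data = [{"age": x, "salary": s} for x, s in
        [(25, 30), (28, 35), (60, 80), (62, 90)]]
groups = mondrian(data, ["age", "salary"], 2)
# Two groups of two: young/low-salary vs. old/high-salary.
```

Comparing the penalty of the partitioned groups against the unpartitioned table illustrates the slide's point that tighter (compacted) bounds preserve more information.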
special task-oriented utility metrics

Attacks on the k-anonymity definition