Privacy-Aware Computing

Data Anonymization (1)
Outline
 Problem
 Concepts
 Algorithms based on the domain generalization hierarchy
 Algorithms on numerical data
The Massachusetts Governor Privacy Breach
[Figure: linking attack between two released tables (Sweeney, IJUFKS 2002)]
 Medical data: SSN, Name, Visit Date, Diagnosis, Procedure, Medication, Total Charge
 Voter list: Name, Address, Date Registered, Party Affiliation, Date Last Voted
 Quasi-identifier (shared by both tables): ZIP, Birth Date, Sex
 87% of the US population is uniquely identified by ZIP code, birth date, and sex
 Linking the voter list to the medical data on the quasi-identifier connected the MA Governor's name to his diagnosis
Definition
 Table
 Columns: attributes; rows: records
 Quasi-identifier (QI)
 A set of attributes that, in combination, can potentially be used to identify individuals
 K-anonymity
 Every combination of QI values that appears in the table appears at least k times
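The definition above can be checked mechanically: group the records by their quasi-identifier values and verify that every group has at least k members. A minimal sketch (the table, attribute names, and values are illustrative, not from the slides):

```python
# Sketch: test whether a table satisfies k-anonymity for a given
# quasi-identifier. Rows and attribute names are made up for illustration.
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """True if every quasi-identifier value combination occurs >= k times."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(n >= k for n in counts.values())

rows = [
    {"zip": "0213*", "sex": "M", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "M", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "F", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["zip", "sex"], 2))  # False: the (0214*, F) group has size 1
```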
Basic techniques
 Generalization
 Zip: {02138, 02139} → 0213*
 Domain generalization hierarchy
 A0 → A1 → … → An
 E.g. 0213* → 021* → 02* → 0**
 The hierarchy is a tree structure
 Suppression
 Remove a record entirely; required when no k-anonymous group can be found for the record
 Balance
 Better privacy guarantee vs. lower data utility
 Many schemes satisfy the k-anonymity specification; we want to minimize the distortion of the table, in order to maximize data utility
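The ZIP-code hierarchy above can be sketched as a simple masking function. This is an assumption about the hierarchy's shape: each level masks one more trailing digit with '*' (the slides show the string shrinking instead; fixed-width masking is the more common convention):

```python
# Sketch of a domain generalization hierarchy for ZIP codes:
# level i replaces the last i digits with '*'.
def generalize_zip(zipcode, level):
    """Level 0 = original value; level i masks the last i digits."""
    if level == 0:
        return zipcode  # handled separately: zipcode[:-0] would be ""
    return zipcode[:-level] + "*" * level

print(generalize_zip("02139", 1))  # 0213*
print(generalize_zip("02139", 3))  # 02***
```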
Criteria
 Minimal generalization
 The least generalization that satisfies the k-anonymization specification
 Minimal table distortion
 Minimal generalization with minimal utility loss
 Use precision to evaluate the loss [Sweeney's papers]
 Application-specific utility
Complexity of finding the optimal generalization
 NP-hard (Bayardo, ICDE05)
 So all proposed algorithms are approximation or heuristic algorithms
Shared features in different solutions
 Always satisfy the k-anonymity specification
 Records that cannot be grouped are suppressed
 Differences lie in the utility loss/cost function
 Sweeney's precision metric
 Discernibility & classification metrics
 Information-privacy metric
 Algorithms
 Assume the domain generalization hierarchy is given
 Goals: efficiency and utility maximization
Metrics to be optimized
 Two cost metrics that we want to minimize (Bayardo, ICDE05)
 Discernibility
 Each record is penalized by the size of the k-anonymous group it falls into
 Classification
 The dataset has a class-label column, and we want to preserve the classification model
 Penalty: # of records in minority classes within each group
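A simplified sketch of the two cost metrics above, computed over the equivalence groups of an anonymization (assumptions: the suppression term of the discernibility metric is omitted, and each group is represented by its list of class labels):

```python
# Sketch of the two cost metrics to minimize (simplified; no suppression term).
from collections import Counter

def discernibility(groups):
    """Each record pays the size of its group, giving sum over groups of |g|^2."""
    return sum(len(g) ** 2 for g in groups)

def classification_penalty(groups):
    """Count records whose class label is not the majority label of their group."""
    penalty = 0
    for labels in groups:
        counts = Counter(labels)
        penalty += len(labels) - counts.most_common(1)[0][1]
    return penalty

groups = [["Y", "Y", "N"], ["N", "N"]]
print(discernibility(groups))          # 3^2 + 2^2 = 13
print(classification_penalty(groups))  # one minority record ('N' in group 1) = 1
```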
Metrics
 A combination of information loss and anonymity gain (Wang, ICDE04)
 The information-privacy metric
Metrics
 Information loss
 Dataset has class labels
 Entropy
 For a set S whose records are labeled with different classes, entropy measures the impurity of the labels:
 Info(S) = -Σi pi log pi, where pi is the fraction of records with label i
 Information loss of a generalization G: {c1, c2, …, cn} → p
 I(G) = Info(Sp) - Σi (Nci / Np) · Info(Sci)
 where Sp is the merged parent group, Sci are the child groups, and Np, Nci are their sizes
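The entropy and information-loss formulas can be sketched directly. Here each child group is represented by its list of class labels, and the parent is their union (an illustrative encoding, not from the slides):

```python
# Sketch of entropy-based information loss for a generalization that
# merges child groups c1..cn into one parent group p.
from math import log2
from collections import Counter

def info(labels):
    """Info(S) = -sum_i p_i * log2(p_i), p_i = fraction of records with label i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_loss(children):
    """I(G) = Info(S_p) - sum_i (N_ci / N_p) * Info(S_ci)."""
    parent = [label for child in children for label in child]
    n_p = len(parent)
    return info(parent) - sum(len(ch) / n_p * info(ch) for ch in children)

# Merging a pure 'Y' group with a pure 'N' group loses one full bit:
print(info_loss([["Y", "Y"], ["N", "N"]]))  # 1.0
# Merging two groups with identical label mixes loses nothing:
print(info_loss([["Y", "N"], ["Y", "N"]]))  # 0.0
```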
 Anonymity gain
 A(VID): # of records sharing the VID (the quasi-identifier value)
 AG(VID) >= A(VID): generalization improves, or leaves unchanged, A(VID)
 Anonymity gain:
 P(G) = x - A(VID), where
 x = AG(VID) if AG(VID) <= K
 x = K otherwise
 Once k-anonymity is satisfied, further generalization of the VID gains nothing
 Information-privacy combined metric
 IP = information loss / anonymity gain = I(G) / P(G)
 We want to minimize IP
 If P(G) == 0, use I(G) only
 Either a small I(G) or a large P(G) reduces IP
 If candidates have the same P(G), pick the one with minimum I(G)
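The selection rule can be sketched as a sort key. One detail is an assumption: the slides say "use I(G) only" when P(G) = 0 but do not say how zero-gain candidates rank against positive-gain ones; here they are ranked behind them:

```python
# Sketch: choose the candidate generalization minimizing IP = I(G)/P(G).
# Zero-gain candidates are compared by I(G) alone and ranked behind
# positive-gain candidates (an assumption; the slides leave this open).
def best_generalization(candidates):
    """candidates: list of (name, info_loss, anonymity_gain) tuples."""
    def key(c):
        _, i, p = c
        return (0, i / p) if p > 0 else (1, i)
    return min(candidates, key=key)[0]

print(best_generalization([("g1", 0.5, 2), ("g2", 0.2, 2), ("g3", 0.1, 0)]))  # g2
```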
Domain-hierarchy based algorithms
 Sweeney's algorithm
 Bayardo's tree-pruning algorithm
 Wang's top-down and bottom-up algorithms
 They are all dimension-by-dimension methods
Multidimensional techniques
 Categorical data?
 Numerize the categories, i.e., map each category to a number (Bayardo 95)
 Does the ordering matter? (no research on that)
 Numerical data
 K-anonymization → an n-dimensional space-partitioning problem
 Many existing partitioning techniques can be applied
Single-dimensional vs. multidimensional
The evolution of methods:
 categorical (domain hierarchy) [Sweeney; top-down/bottom-up] →
 numerized categories, single-dimensional [Bayardo05] →
 numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
Method 1: Mondrian
 Numerize categorical data
 Apply a top-down partitioning process
[Figure: recursive partitioning of the data space -- step 1, then steps 2.1 and 2.2; only "allowable cuts" are made]
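The top-down process above can be sketched as a recursive median split: cut the widest attribute at its median, but only if the cut is allowable, i.e., both halves keep at least k records (a simplified sketch; the published Mondrian algorithm normalizes attribute spans and handles ties more carefully):

```python
# Rough sketch of Mondrian-style top-down partitioning on numerized data.
def mondrian(records, k):
    """records: list of equal-length numeric tuples; returns leaf partitions,
    each of size >= k (assuming len(records) >= k to begin with)."""
    if not records:
        return []
    dims = len(records[0])
    spans = [(max(r[d] for r in records) - min(r[d] for r in records), d)
             for d in range(dims)]
    for _, d in sorted(spans, reverse=True):  # try the widest dimension first
        vals = sorted(r[d] for r in records)
        median = vals[len(vals) // 2]
        left = [r for r in records if r[d] < median]
        right = [r for r in records if r[d] >= median]
        if len(left) >= k and len(right) >= k:  # "allowable cut"
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut in any dimension: leaf partition

parts = mondrian([(25, 1), (30, 2), (35, 1), (60, 3)], k=2)
print(parts)  # two partitions of two records each
```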
Method 2: spatial indexing
 Multidimensional spatial indexing techniques
 Kd-tree (similar to the Mondrian algorithm)
 R-tree and its variations
[Figure: R-tree vs. R+-tree, each with an upper layer of bounding boxes over a leaf layer]
Compacting bounds
 Information is better preserved
 Example:
 uncompacted: age [1-80], salary [10k-100k]
 compacted: age [20-40], salary [10k-50k]
 The original Mondrian does not consider compacting bounds
 For the R+-tree, it is done automatically
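Compacting bounds just means publishing each group's minimum bounding box (as an R+-tree leaf would store) rather than the full cut region. A minimal sketch, using made-up records matching the example above:

```python
# Sketch: "compacting bounds" = shrink each partition's published ranges
# to the minimum bounding box of the records actually in the partition.
def compact_bounds(records):
    """records: list of numeric tuples; returns per-attribute (min, max)."""
    return [(min(vals), max(vals)) for vals in zip(*records)]

group = [(20, 10000), (33, 42000), (40, 50000)]  # (age, salary) records
print(compact_bounds(group))  # [(20, 40), (10000, 50000)]
```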
Benefits of using the R+-tree
 Scalable: originally designed for indexing large disk-based data
 Multi-granularity k-anonymity: the tree layers provide anonymized views at multiple granularities
 Better performance
 Better quality
Performance
[Figure: performance comparison with Mondrian]
Utility
 Metrics
 Discernibility penalty
 KL divergence: measures the difference between a pair of distributions (here, the original vs. the anonymized data distribution)
 Certainty penalty
 For a table T with records t and m attributes: sum over records and attributes of |t.Ai| / |T.Ai|, where t.Ai is the record's generalized range and T.Ai is the attribute's total range
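The certainty penalty can be sketched directly from the description above (one common formulation, an assumption; ranges and values below are illustrative and reuse the age/salary example):

```python
# Sketch of the (normalized) certainty penalty: for each record, sum over
# attributes the generalized range width divided by the attribute's total
# range width over the whole table, then sum over all records.
def certainty_penalty(gen_ranges, total_ranges):
    """gen_ranges: per-record list of (lo, hi) per attribute;
    total_ranges: one (lo, hi) per attribute for the whole table."""
    penalty = 0.0
    for record in gen_ranges:
        for (lo, hi), (tlo, thi) in zip(record, total_ranges):
            penalty += (hi - lo) / (thi - tlo)
    return penalty

total = [(1, 81), (10000, 100000)]            # age, salary ranges of table T
recs = [[(20, 40), (10000, 50000)]] * 2       # two records in the same group
print(certainty_penalty(recs, total))         # 2 * (20/80 + 40000/90000)
```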
Other issues
 Sparse high-dimensional data
 Transactional data → boolean matrix
 "On the anonymization of sparse high-dimensional data", ICDE08
 Related to the problem of clustering transactional data!
 The paper above uses matrix-based clustering
 Item-based clustering (?)
Other issues
 Effect of numerizing categorical data
 The ordering of categories may have some impact on quality
 General-purpose utility metrics vs. special task-oriented utility metrics
 Attacks on the k-anonymity definition