Privacy Preserving in Graph data publishing

Privacy Issues in Graph Data Publishing
Summer intern: Qing Zhang (from NC State University)
Mentors: Graham Cormode and Divesh Srivastava
Outline





Privacy in graph data publishing
Apply existing microdata anonymization techniques
A simple graph anonymization technique
Understanding attacks
Plan of future work
Microdata publishing

Data publishing (N.R. Adam et al., ACM Computing Surveys, 1989)
  Macrodata
    Pre-aggregated statistics
  Microdata
    Individual records
Concerns in microdata release

Privacy of individual tuples
  Privacy of atomic values (e.g. SSN)
  Association between a tuple’s attributes
Accuracy of aggregate query answering
Graph data

Relationship among entities
  No sensitive attributes
  Private information is the association
  Many graphs of interest are sparse
Examples:
  General graph
    social network, etc.
    Who talks to whom
  Bipartite graph (focus of our work)
    customer shopping record, etc.
    Who bought what
Example Graph Data
author
Author ID   Name
A1          Andy
A2          Bob
A3          Cathy

paper
Paper ID   Title   Conference   Year
P1         A       SIGMOD       2006
P2         B       SIGMOD       2007
P3         C       VLDB         2007
P4         D       ICDE         2007

(author, paper) Association
Author ID   Paper ID
A1          P1
A1          P2
A2          P2
A2          P3
A2          P4
A3          P1
A3          P4
[Figure: bipartite graph with Author nodes A1-A3 on one side and Paper nodes P1-P4 on the other]
Privacy-preserving microdata sharing

Current status

Focus on the protection of the association between quasi-identifiers and a single sensitive attribute
  Disease, salary, etc.
Related work

k-anonymity (Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems ’02)
l-diversity (A. Machanavajjhala et al., ICDE ’06)
t-closeness (N. Li et al., ICDE ’07)
(k,e)-anonymity (Q. Zhang, N. Koudas, D. Srivastava, T. Yu, ICDE ’07)
Related work

Attacking anonymized social networks (L. Backstrom et al., WWW ’07)
  Active attack: insert nodes/links
  Passive attack: collude and observe graph
Privacy risks of public mentions (D. Frankowski et al., SIGIR ’07)
  Link the movie score and movie review databases
How to break anonymity of the Netflix Prize dataset (A. Narayanan et al., UT Austin)
  Attack through background information
How to assemble pieces of a graph privately (K. Frikken et al., WPES ’06)
  Distributed graph construction via multi-party computation
Focus of our work

Privacy protection in bipartite graphs
  Protect individual link information across two parties
    e.g. (author, paper) association
  Maintain aggregate graph statistics
    e.g. average number of coauthors, diameter of graph, shortest path distribution, etc.
    Not considered by previous work
Dataset we are working on
DBLP (conference data only)
  402023 distinct authors, 541243 distinct papers
  1401349 author-paper pairs
  Largest number of papers by one author: 290
  Largest number of authors on one paper: 115
Graph statistics we are looking at

1st order statistics (node degree)
  Number of papers of each author
  Number of authors of each paper
2nd order statistics
  Coauthors of each author
  Copapers of each paper
Higher order statistics
  Walking more steps along the bipartite graph
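
The 1st and 2nd order statistics listed here can be sketched on the small (author, paper) example from the earlier slide. This is only an illustration: the function names are hypothetical, and "coauthors" is taken to mean authors reachable in two steps (author, paper, author) along the bipartite graph.

```python
from collections import defaultdict

# (author, paper) pairs from the example association table.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def adjacency(edges):
    """Return (papers_of_author, authors_of_paper) as dicts of sets."""
    papers_of, authors_of = defaultdict(set), defaultdict(set)
    for a, p in edges:
        papers_of[a].add(p)
        authors_of[p].add(a)
    return papers_of, authors_of

def first_order(edges):
    """1st order: node degree on each side of the bipartite graph."""
    papers_of, authors_of = adjacency(edges)
    return ({a: len(ps) for a, ps in papers_of.items()},
            {p: len(au) for p, au in authors_of.items()})

def coauthors(edges):
    """2nd order: for each author, the other authors two steps away."""
    papers_of, authors_of = adjacency(edges)
    return {a: set().union(*(authors_of[p] for p in ps)) - {a}
            for a, ps in papers_of.items()}

author_deg, paper_deg = first_order(EDGES)
co = coauthors(EDGES)
print(author_deg)  # {'A1': 2, 'A2': 3, 'A3': 2}
print(co["A2"])    # the set containing A1 and A3
```

Swapping the two columns of each pair gives the copaper statistics with the same code.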
Anonymization by permutation


Publish the (author, paper) relation
Permute paper w.r.t. author



Global permutation
Various partition mechanisms
Study the graph statistics
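
A minimal sketch of the permutation idea, assuming the relation is held as a list of (author, paper) tuples. The helper names, the `seed` parameter, and the partition key function are hypothetical and not the implementation used in the experiments; the `YEAR` lookup reuses the small example tables.

```python
import random
from collections import Counter

EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]
YEAR = {"P1": 2006, "P2": 2007, "P3": 2007, "P4": 2007}

def permute_globally(edges, seed=0):
    """Shuffle the paper column against the author column."""
    authors = [a for a, _ in edges]
    papers = [p for _, p in edges]
    random.Random(seed).shuffle(papers)
    return list(zip(authors, papers))

def permute_within_partitions(edges, key, seed=0):
    """Shuffle papers only among tuples that share the same partition key."""
    rng = random.Random(seed)
    groups = {}
    for edge in edges:
        groups.setdefault(key(edge), []).append(edge)
    out = []
    for group in groups.values():
        papers = [p for _, p in group]
        rng.shuffle(papers)
        out.extend(zip((a for a, _ in group), papers))
    return out

global_perm = permute_globally(EDGES, seed=1)
by_year = permute_within_partitions(EDGES, key=lambda e: YEAR[e[1]], seed=1)
```

Either variant keeps the number of tuples per author and per paper unchanged, since only the pairing is shuffled. Note that a shuffle can pair the same author with the same paper twice; this sketch does not remove such duplicates.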

2nd order statistics studied

coauthor times
  Avg number of papers coauthored by each coauthor pair
coauthor distribution
  Avg number of coauthors of each author
copaper times
  Avg number of authors shared by each copaper pair
copaper distribution
  Avg number of copapers of each paper
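
One way these statistics could be computed from the (author, paper) relation is sketched below, using the small example relation. The function names are hypothetical; the histograms correspond to the "coauthor times" and "coauthor distribution" plots, and the copaper versions follow by swapping the two columns.

```python
from collections import Counter, defaultdict
from itertools import combinations

EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def coauthor_pair_counts(edges):
    """Map each unordered author pair to the number of papers they share."""
    authors_of = defaultdict(set)
    for a, p in edges:
        authors_of[p].add(a)
    pairs = Counter()
    for authors in authors_of.values():
        pairs.update(combinations(sorted(authors), 2))
    return pairs

def coauthor_times(edges):
    """Histogram: number of shared papers -> number of coauthor pairs."""
    return Counter(coauthor_pair_counts(edges).values())

def coauthor_distribution(edges):
    """Histogram: number of coauthors -> number of authors with that many."""
    n_coauthors = Counter({a: 0 for a, _ in edges})
    for a, b in coauthor_pair_counts(edges):
        n_coauthors[a] += 1
        n_coauthors[b] += 1
    return Counter(n_coauthors.values())

times = coauthor_times(EDGES)
dist = coauthor_distribution(EDGES)
# Copaper statistics: treat papers as the "authors" by swapping columns.
copaper = coauthor_times([(p, a) for a, p in EDGES])
```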
coauthor times
Source statistics
[Figure: number of coauthor pairs who write a certain number of papers together, log-scale axes; x-axis: number of papers; y-axis: number of coauthor pairs]

After global permutation
Number of papers coauthored   Number of coauthor pairs
1                             3411970
2                             1377
3                             8
coauthor distribution
[Figure: number of authors who have a certain number of coauthors, log-scale axes; x-axis: number of coauthors of each author; y-axis: number of authors; series: source, global permutation]
More bad news

Source distribution
  The author with the most coauthors (363) has 247 publications (7th most)
  The author with the most publications (290) has only 44 coauthors
  Correlation is weak
After global permutation
  The largest number of coauthors is 779
    The corresponding author has 287 papers (2nd most)
  The author with the most papers (290) has 722 coauthors
  False correlation created!
Other experiments

Results are the same for
  copaper statistics
  other partitioning mechanisms
    On authorCt, year, conference, etc.
Observations

Permutation of the (author, paper) relation guarantees preservation of 1st order statistics
  Degrees are just counts
Cannot maintain even 2nd order graph statistics
  Breaks the clustering properties
    Removes links within clusters
    Introduces fake links among clusters
Need other anonymization techniques to maintain graph statistics
Publish tuple-level statistics

Publish two tables
  AuthorDegree(authorID, degree)
  coAuthor(authorID, coAuthorID)
From these two tables, we can get
  the 1st-order degree D1 of any author
  the set of 2nd-order degrees {D2}
    By joining the two tables
We may leak more information!
  D1 and the set {D2} can serve as signatures to identify entities
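
A sketch of how the signature could be derived from the two published tables. It assumes {D2} means the collection of degrees of an author's coauthors, obtained by joining coAuthor with AuthorDegree; that reading, the function name, and the toy tables (built from the small example relation) are assumptions.

```python
from collections import defaultdict

# AuthorDegree(authorID, degree), taken from the small example.
AUTHOR_DEGREE = {"A1": 2, "A2": 3, "A3": 2}
# coAuthor(authorID, coAuthorID), stored in both directions.
CO_AUTHOR = [("A1", "A2"), ("A2", "A1"), ("A1", "A3"),
             ("A3", "A1"), ("A2", "A3"), ("A3", "A2")]

def signatures(author_degree, co_author):
    """Return {authorID: (D1, sorted tuple of the coauthors' degrees)}."""
    d2 = defaultdict(list)
    for a, b in co_author:
        # The join: look up the degree of each coauthor.
        d2[a].append(author_degree[b])
    return {a: (d1, tuple(sorted(d2[a]))) for a, d1 in author_degree.items()}

sigs = signatures(AUTHOR_DEGREE, CO_AUTHOR)
print(sigs["A2"])  # (3, (2, 2))
```

In this toy data A1 and A3 share the signature (2, (2, 3)), while A2's signature is unique.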
Privacy Risk

k-identifiable
  An entity shares the same signature with k-1 other entities
  1-identifiable means uniquely identifiable
Count the number of authors who have the same signature
  maximum k=20015
    coming from the authors who have D1=1 and D2=0 (single author – single paper pairs)
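
Counting k-identifiability amounts to grouping entities by signature. A minimal sketch follows; the function names and the toy signatures are hypothetical stand-ins for the {D1, {D2}} signatures.

```python
from collections import Counter

# Toy signatures (D1, sorted coauthor degrees); illustrative values only.
SIGNATURES = {"A1": (2, (2, 3)), "A2": (3, (2, 2)), "A3": (2, (2, 3)),
              "A4": (1, ()), "A5": (1, ())}

def k_identifiability(signature_of):
    """Map each entity to k, the number of entities sharing its signature."""
    freq = Counter(signature_of.values())
    return {e: freq[sig] for e, sig in signature_of.items()}

def uniquely_identifiable(signature_of):
    """Entities that are 1-identifiable."""
    k_of = k_identifiability(signature_of)
    return sorted(e for e, k in k_of.items() if k == 1)

k_of = k_identifiability(SIGNATURES)
unique = uniquely_identifiable(SIGNATURES)
print(unique)  # ['A2']
```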
Attack simulation-author
[Figure: sets of authors which are k-identifiable, log-scale axes; x-axis: k; y-axis: sets of authors]
[Figure: number of authors who are k-identifiable, log-scale axes; x-axis: k; y-axis: number of authors]
Observations

Too many entities can be uniquely identified
  37 by D1 only
  134426 by {D1, {D2}}
In order to protect your privacy
  You’d better not publish any paper
  Or, just publish one paper, without collaborating with anyone else
    Because many others are doing the same
Understanding attacks

Given an anonymization scheme that preserves statistics, explore the attacker’s ability
  What background information is available
  What strategy to take
  How much knowledge can he gain
  What’s the cost of attack
Starting point: publishing complete statistics
  Publish complete author sets of each paper, and complete paper sets of each author
Publishing complete statistics

Example
  Author sets: {{a1,a4}, {a1,a2}, {a2,a3}, {a3,a4}}
  Paper sets: {{p1,p2}, {p2,p3}, {p3,p4}, {p1,p4}}
[Figure: bipartite graph with authors a1-a4 on one side and papers p1-p4 on the other]
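
The example can be reproduced with a short sketch. The edge list is the one implied by the two published collections (p1 has authors {a1,a4}, a1 has papers {p1,p2}, and so on); the function name is hypothetical.

```python
from collections import defaultdict

# Edges implied by the example's author sets and paper sets.
EDGES = [("a1", "p1"), ("a1", "p2"), ("a2", "p2"), ("a2", "p3"),
         ("a3", "p3"), ("a3", "p4"), ("a4", "p4"), ("a4", "p1")]

def publish_complete(edges):
    """Return (author_sets, paper_sets) with the owning labels dropped."""
    authors_of, papers_of = defaultdict(set), defaultdict(set)
    for a, p in edges:
        authors_of[p].add(a)
        papers_of[a].add(p)
    # Keeping only the values discards which paper (or author) owns each set.
    author_sets = sorted(sorted(s) for s in authors_of.values())
    paper_sets = sorted(sorted(s) for s in papers_of.values())
    return author_sets, paper_sets

author_sets, paper_sets = publish_complete(EDGES)
print(author_sets)  # [['a1', 'a2'], ['a1', 'a4'], ['a2', 'a3'], ['a3', 'a4']]
print(paper_sets)   # [['p1', 'p2'], ['p1', 'p4'], ['p2', 'p3'], ['p3', 'p4']]
```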
Graph theoretic analysis

Can be seen as publishing two isomorphic bipartite graphs
  Each graph removes labels on one side
  Bipartite graph isomorphism problem
[Figure: bipartite graph over authors a1-a4 and papers p1-p4]
Solution to the problem

Hardness of bipartite isomorphism is unknown
  Effective solutions may exist for graphs with specific properties
Previous nth-order signature can serve as a greedy solution
  n is bounded by the diameter of the graph
More information leakage when background information is available
  node information
  edge information
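
One possible reading of the nth-order signature is iterative refinement: start from the node degree and repeatedly append the sorted signatures of the neighbors. The sketch below follows that reading, which is an assumption rather than the definition used in the study; all names are hypothetical.

```python
def build_neighbors(edges):
    """Undirected adjacency over both sides of the bipartite graph."""
    neighbors = {}
    for a, p in edges:
        neighbors.setdefault(a, set()).add(p)
        neighbors.setdefault(p, set()).add(a)
    return neighbors

def refine_signatures(neighbors, rounds):
    """Round 0 is the degree; each round adds the neighbors' signatures."""
    sig = {v: len(ns) for v, ns in neighbors.items()}
    for _ in range(rounds):
        sig = {v: (sig[v], tuple(sorted(sig[u] for u in neighbors[v])))
               for v in neighbors}
    return sig

# The symmetric 8-cycle example: refinement cannot separate any nodes.
CYCLE = [("a1", "p1"), ("a1", "p2"), ("a2", "p2"), ("a2", "p3"),
         ("a3", "p3"), ("a3", "p4"), ("a4", "p4"), ("a4", "p1")]
cycle_sig = refine_signatures(build_neighbors(CYCLE), rounds=4)

# The earlier (author, paper) example: A2 stands out, A1 and A3 do not.
DBLP_TOY = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
            ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]
toy_sig = refine_signatures(build_neighbors(DBLP_TOY), rounds=2)
```

Nodes with a unique signature can be matched greedily across the two published graphs; nodes related by a symmetry of the graph, such as A1 and A3 here, keep identical signatures however many rounds are run.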
Attacker with background information

With node information
  Node a3 is known in previous example
    {a3, p3}, {a3, p4} are known to the attacker
  a1 can then be uniquely identified
    It’s the only node at distance 4 from a3
  a1 can further help labeling of the isomorphic matching
[Figure: example graph over a1-a4 and p1-p4, with a1 and a3 labeled in the matching]
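
The distance claim can be checked with a breadth-first search on the example graph. This is a small sketch with hypothetical helper names; the edge list is the one implied by the published author and paper sets.

```python
from collections import deque

EDGES = [("a1", "p1"), ("a1", "p2"), ("a2", "p2"), ("a2", "p3"),
         ("a3", "p3"), ("a3", "p4"), ("a4", "p4"), ("a4", "p1")]

def build_neighbors(edges):
    """Undirected adjacency over both sides of the bipartite graph."""
    neighbors = {}
    for a, p in edges:
        neighbors.setdefault(a, set()).add(p)
        neighbors.setdefault(p, set()).add(a)
    return neighbors

def bfs_distances(neighbors, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in neighbors[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

dist = bfs_distances(build_neighbors(EDGES), "a3")
farthest = sorted(v for v, d in dist.items() if d == 4)
print(farthest)  # ['a1']
```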
Attacker with background information

With edge information
  Edge {a3,p3} is known
    {a1, p1} can be recovered
      By enumerating all possible worlds
      Disjunctive reasoning
  It’s a finer-grained attack model
[Figure: example graph over a1-a4 and p1-p4, with candidate labelings a1’, a1’’, a3’, a3’’]
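
The possible-worlds reasoning can be sketched by brute force on the example: try every assignment of paper labels to the published author sets, keep the assignments consistent with the published paper sets and the known edge, and intersect the surviving edge sets. The function names are hypothetical, and exhaustive enumeration is only feasible for toy inputs like this one.

```python
from itertools import permutations

AUTHOR_SETS = [{"a1", "a4"}, {"a1", "a2"}, {"a2", "a3"}, {"a3", "a4"}]
PAPER_SETS = [{"p1", "p2"}, {"p2", "p3"}, {"p3", "p4"}, {"p1", "p4"}]
PAPERS = ["p1", "p2", "p3", "p4"]

def possible_worlds(author_sets, paper_sets, papers, known_edges=()):
    """All edge sets consistent with both publications and the known edges."""
    target = sorted(sorted(s) for s in paper_sets)
    worlds = set()
    for labels in permutations(papers):
        # Candidate world: the i-th author set belongs to paper labels[i].
        edges = frozenset((a, p) for s, p in zip(author_sets, labels) for a in s)
        papers_of = {}
        for a, p in edges:
            papers_of.setdefault(a, []).append(p)
        if sorted(sorted(ps) for ps in papers_of.values()) != target:
            continue  # contradicts the published paper sets
        if all(e in edges for e in known_edges):
            worlds.add(edges)
    return worlds

def certain_edges(worlds):
    """Edges present in every possible world."""
    return frozenset.intersection(*worlds) if worlds else frozenset()

all_worlds = possible_worlds(AUTHOR_SETS, PAPER_SETS, PAPERS)
with_edge = possible_worlds(AUTHOR_SETS, PAPER_SETS, PAPERS, [("a3", "p3")])
print(len(all_worlds), len(with_edge))  # 8 4
print(sorted(certain_edges(with_edge)))  # [('a1', 'p1'), ('a3', 'p3')]
```

Knowing {a3,p3} halves the number of worlds, and {a1,p1} appears in all four that remain, which is the disjunctive argument in miniature.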
Plan of future work(1)

Detailed study of the “set of all isomorphisms” problem
  Algorithm and hardness
  How different background information helps
Publish other statistics
  Binary/triple/… sets of authors/papers
    For {a1, a2, a3}, publish {a1, a2}, {a2, a3}, {a1, a3}
    Maintain more statistics than permutation
    Maintain more privacy than publishing complete author sets
  How to evaluate it quantitatively
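
The binary/triple sets idea can be illustrated with `itertools.combinations`. This is a minimal sketch with a hypothetical function name; how the subsets from different papers would be combined or published is left open, as in the plan above.

```python
from itertools import combinations

def k_subsets(author_set, k=2):
    """All size-k subsets of one paper's author set, in sorted order."""
    return [set(c) for c in combinations(sorted(author_set), k)]

pairs = k_subsets({"a1", "a2", "a3"}, k=2)
print(len(pairs))  # 3
```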
Plan of future work(2)

Other possible signatures

Shortest path to other nodes
  Compute pairwise shortest paths
  Sort the vector as signature
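
A sketch of the shortest-path signature on the small (author, paper) example, assuming unweighted hop distances and ignoring unreachable nodes. Both choices, and the function names, are assumptions.

```python
from collections import deque

EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def build_neighbors(edges):
    """Undirected adjacency over both sides of the bipartite graph."""
    neighbors = {}
    for a, p in edges:
        neighbors.setdefault(a, set()).add(p)
        neighbors.setdefault(p, set()).add(a)
    return neighbors

def shortest_path_signature(neighbors, source):
    """Sorted vector of hop distances from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in neighbors[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return sorted(d for v, d in dist.items() if v != source)

nbrs = build_neighbors(EDGES)
sig_a2 = shortest_path_signature(nbrs, "A2")
sig_p3 = shortest_path_signature(nbrs, "P3")
print(sig_a2)  # [1, 1, 1, 2, 2, 3]
print(sig_p3)  # [1, 2, 2, 3, 3, 4]
```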
Other datasets

IMDB data
Thanks!