Privacy Issues in Graph Data Publishing
Summer intern: Qing Zhang (from NC State University)
Mentors: Graham Cormode and Divesh Srivastava

Outline
- Privacy in graph data publishing
- Apply existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Microdata publishing
Data publishing (N.R. Adam et al., ACM Computing Surveys, 1989)
- Macrodata: pre-aggregated statistics
- Microdata: individual records
Concerns in microdata release
- Privacy of individual tuple
- Privacy of atomic values (e.g. SSN)
- Association between tuple's attributes
- Accuracy of aggregate query answering

Graph data
- Relationship among entities
- No sensitive attributes
- Private information is the association
- Many graphs of interest are sparse
Examples:
- General graph: social network, etc. (who talks to whom)
- Bipartite graph (focus of our work): customer shopping record, etc. (who bought what)

Example Graph Data

author
  Author ID   Name
  A1          Andy
  A2          Bob
  A3          Cathy

paper
  Paper ID   Title   Conference   Year
  P1         A       SIGMOD       2006
  P2         B       SIGMOD       2007
  P3         C       VLDB         2007
  P4         D       ICDE         2007

(author, paper) association
  Author ID   Paper ID
  A1          P1
  A1          P2
  A2          P2
  A2          P3
  A2          P4
  A3          P1
  A3          P4

[Figure: bipartite graph of authors A1-A3 and papers P1-P4]

Privacy-preserving microdata sharing
Current status
- Focus on the protection of the association between quasi identifiers and a single sensitive attribute (disease, salary, etc.)
Related work
- k-anonymity (Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems '02)
- l-diversity (A. Machanavajjhala et al., ICDE '06)
- t-closeness (N. Li et al., ICDE '07)
- (k,e)-anonymity (Q. Zhang, N. Koudas, D. Srivastava, T. Yu, ICDE '07)

Related work
- Attacking anonymized social network (L. Backstrom et al., WWW '07)
  - Active attack: insert nodes/links
  - Passive attack: collude and observe graph
- Privacy risks of public mentions (D. Frankowski et al., SIGIR '07)
  - Link the movie score and movie review databases
- How to break anonymity of the Netflix prize dataset (A. Narayanan et al., UT Austin)
  - Attack through background information
- How to assemble pieces of a graph privately (K. Frikken et al., WPES '06)
  - Distributed graph construction via multi-party computation

Focus of our work
- Privacy protection in bipartite graph
- Protect individual link information across two parties, e.g. (author, paper) association
- Maintain aggregate graph statistics, e.g. average number of coauthors, diameter of graph, shortest path distribution, etc.
- Not considered by previous work

Dataset working on
- DBLP (conference data only)
- 402023 distinct authors, 541243 distinct papers
- 1401349 author-paper pairs
- Largest number of papers by one author: 290
- Largest number of authors on one paper: 115

Graph statistics we are looking at
- 1st order statistics (node degree)
  - Number of papers of each author
  - Number of authors of each paper
- 2nd order statistics
  - Coauthors of each author
  - Copapers of each paper
- Higher order statistics
  - Walking more steps along the bipartite graph

Outline
- Privacy in graph data sharing
- Apply existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Anonymization by permutation
- Publish the (author, paper) relation
- Permute paper w.r.t. author
  - Global permutation
  - Various partition mechanisms
- Study the graph statistics

2nd order statistics studied
- coauthor times: avg number of papers coauthored by each coauthor pair
- coauthor distribution: avg number of coauthors of each author
- copaper times: avg number of authors shared by each copaper pair
- copaper distribution: avg number of copapers of each paper

coauthor times
Source statistics
[Figure: number of coauthor pairs who write a certain number of papers together. x-axis: number of papers; y-axis: number of coauthor pairs]
After global permutation
  Number of papers coauthored   Number of coauthor pairs
  1                             3411970
  2                             1377
  3                             8

coauthor distribution
[Figure: number of authors who have a certain number of coauthors. x-axis: number of coauthors of each author; y-axis: number of authors; series: source, global permutation]

More bad news
Source distribution
- The author with the most coauthors (363) has 247 publications (7th most)
- The author with the most publications (290) has only 44 coauthors
- Correlation is weak
After global permutation
- The largest number of coauthors is 779; the corresponding author has 287 papers (2nd most)
- The author with the most papers (290) has 722 coauthors
- False correlation created!

Other experiments
- Results are the same for
  - copaper statistics
  - other partitioning mechanisms: on authorCt, year, conference, etc.
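
The global permutation experiment can be illustrated on the small example graph from the earlier slide. This is only a minimal sketch, not the code used for the DBLP experiments: it assumes that "permute paper w.r.t. author" means shuffling the paper column of the (author, paper) relation against the author column, and it counts degree as the number of rows per author or paper. The function names and the seed parameter are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

# (author, paper) association from the example graph data slide.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]


def global_permutation(edges, seed=0):
    """Shuffle the paper column against the author column.

    Assumed reading of "permute paper w.r.t. author"; the shuffle may
    produce duplicate (author, paper) rows.
    """
    authors = [a for a, _ in edges]
    papers = [p for _, p in edges]
    random.Random(seed).shuffle(papers)
    return list(zip(authors, papers))


def degrees(edges):
    """1st order statistics: number of rows per author and per paper."""
    author_deg, paper_deg = defaultdict(int), defaultdict(int)
    for a, p in edges:
        author_deg[a] += 1
        paper_deg[p] += 1
    return dict(author_deg), dict(paper_deg)


def coauthor_times(edges):
    """2nd order statistic: number of papers shared by each coauthor pair."""
    authors_of = defaultdict(set)
    for a, p in edges:
        authors_of[p].add(a)
    times = defaultdict(int)
    for group in authors_of.values():
        for pair in combinations(sorted(group), 2):
            times[pair] += 1
    return dict(times)


permuted = global_permutation(EDGES)
print(degrees(EDGES) == degrees(permuted))  # True: row counts are unchanged
print(coauthor_times(EDGES))
print(coauthor_times(permuted))
```

Because the shuffle only rearranges one column, the per-author and per-paper row counts cannot change, which matches the observation that permutation preserves 1st order statistics. The coauthor pairs, by contrast, depend on which rows share a paper, so nothing forces them to survive the shuffle.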
Observations
- Permutation of the (author, paper) relation
  - Guarantees preservation of 1st order statistics
  - Cannot maintain even 2nd order graph statistics
- Break the clustering properties
  - Degrees are just counts
  - Remove links within cluster
  - Introduce fake links among clusters
- Need other anonymization techniques to maintain graph statistics

Outline
- Privacy in graph data sharing
- Apply existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Publish tuple-level statistics
- Publish two tables
  - AuthorDegree(authorID, degree)
  - coAuthor(authorID, coAuthorID)
- From these two tables, we can get
  - the 1st-order degree D1 of any author
  - the set of 2nd-order degrees {D2}, by joining the two tables
- We may leak more information!
  - D1 and the set {D2} can serve as signatures to identify entities

Privacy Risk
- k-identifiable: an entity shares the same signature with k-1 other entities
  - 1-identifiable means uniquely identifiable
- Count the number of authors who have the same signature
  - maximum k = 20015, coming from the authors who have D1=1 and D2=0 (single author, single paper pairs)
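
The signature count can be sketched on the example graph from the earlier slide. The sketch assumes that D1 is an author's number of papers and that {D2} is the multiset of D1 values of that author's coauthors (the result of joining coAuthor with AuthorDegree). The slides do not spell out this definition, so treat it as one plausible reading; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# (author, paper) association from the example graph data slide.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]


def signatures(edges):
    """Map each author to (D1, sorted D1 values of its coauthors).

    Assumption: {D2} is the multiset of coauthor degrees.
    """
    papers_of, authors_of = defaultdict(set), defaultdict(set)
    for a, p in edges:
        papers_of[a].add(p)
        authors_of[p].add(a)
    d1 = {a: len(ps) for a, ps in papers_of.items()}
    sig = {}
    for a, ps in papers_of.items():
        coauthors = set().union(*(authors_of[p] for p in ps)) - {a}
        sig[a] = (d1[a], tuple(sorted(d1[c] for c in coauthors)))
    return sig


def k_identifiability(sig):
    """k for each author: how many authors share its signature."""
    counts = Counter(sig.values())
    return {a: counts[s] for a, s in sig.items()}


sig = signatures(EDGES)
print(sig)
# {'A1': (2, (2, 3)), 'A2': (3, (2, 2)), 'A3': (2, (2, 3))}
print(k_identifiability(sig))
# {'A1': 2, 'A2': 1, 'A3': 2}
```

In this toy graph A2 is 1-identifiable (its signature is unique), while A1 and A3 share a signature and are therefore 2-identifiable.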
Attack simulation - author
[Figure: sets of authors which are k-identifiable. x-axis: k; y-axis: sets of authors]
[Figure: number of authors who are k-identifiable. x-axis: k; y-axis: number of authors]

Observations
- Too many entities can be uniquely identified
  - 37 by D1 only, 134426 by {D1, {D2}}
- In order to protect your privacy
  - You'd better not publish any paper
  - Or just publish one paper, without collaborating with anyone else
  - Because many others are doing the same

Outline
- Privacy in graph data sharing
- Apply existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Understanding attacks
- Given an anonymization scheme that preserves statistics, explore attacker's ability
  - What background information is available
  - What strategy to take
  - How much knowledge can he gain
  - What's the cost of attack
- Starting point: publishing complete statistics
  - Publish complete author sets of each paper, and complete paper sets of each author

Publishing complete statistics
Example
- Author set: {{a1,a4}, {a1,a2}, {a2,a3}, {a3,a4}}
- Paper set: {{p1,p2}, {p2,p3}, {p3,p4}, {p1,p4}}
[Figure: bipartite graph over authors a1-a4 and papers p1-p4]

Graph theoretic analysis
- Can be seen as publishing two isomorphic bipartite graphs
  - Each graph removes labels on one side
- Bipartite graph isomorphism problem
[Figure: the two published graphs, with labels a1-a4 and p1-p4]

Solution to the problem
- Hardness of bipartite isomorphism is unknown
- Previous nth-order signature can serve as a greedy solution
  - An effective solution may exist for graphs with specific properties
  - n is bounded by the diameter of the graph
- More information leakage when background information is available
  - node information
  - edge information

Attacker with background information
With node information
- Node a3 is known in the previous example
  - {a3, p3}, {a3, p4} are known to the attacker
- a1 can then be uniquely identified
  - It's the only node at distance 4 from a3
- a1 can further help labeling of the isomorphic matching
[Figure: the two graphs with nodes a1 and a3 labeled]

Attacker with background information
With edge information
- Edge {a3, p3} is known
- {a1, p1} can be recovered
  - By enumerating all possible worlds
  - Disjunctive reasoning
- It's a finer-grained attack model
[Figure: candidate labelings a1', a1'', a3', a3'' against papers p1-p4]

Plan of future work (1)
- Detailed study of the "set of all isomorphism" problem
  - Algorithm and hardness
  - How different background information helps
- Publish other statistics
  - Binary/triple/... sets of authors/papers
    - For {a1, a2, a3}, publish {a1, a2}, {a2, a3}, {a1, a3}
  - Maintain more statistics than permutation
  - Maintain more privacy than publishing complete author sets
  - How to evaluate it quantitatively

Plan of future work (2)
- Other possible signatures
  - Shortest path to other nodes
    - Compute pairwise shortest path
    - Sort the vector as signature
- Other datasets
  - IMDB data

Thanks!