Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
Paper ID : 02
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Motivation
• Entities on the Web can be present in multiple datasets, e.g. HTML tables, text documents etc.
• Traditional systems : each entity is a sparse vector of the Ids of the documents in which it occurs.
• We propose a low-dimensional representation for such entities.
• It helps to efficiently perform different tasks with a small number of primitive operations :
    Semi-Supervised Learning (SSL)
    Set Expansion (SE)
    Automatic Set Instance Acquisition (ASIA)

Entities in HTML tables
Example tables :
    Country    Capital City
    India      Delhi
    USA        Washington DC
    Canada     Ottawa
    France     Paris

    Country    Sports
    India      Hockey
    UK         Cricket
    USA        Tennis
[Figure: Entity-Column bipartite graph linking entities (USA, India, Hockey, Cricket, Tennis) to table-columns (TC-1, TC-2, TC-3, TC-4)]

Entities in unstructured text
Example sentences :
    Countries such as India are developing rapidly in terms of infrastructure.
    Outdoor sports include Tennis and Cricket.
[Figure: “Such as” bipartite graph linking such-as concepts (Country, Location, Sports) to entities (USA, India, Hockey, Cricket, Tennis)]

Resultant Tripartite Graph
[Figure: the “Such as” bipartite graph and the Entity-Column bipartite graph joined on their shared entity nodes. Such-as concepts (Country, Location, Sports) on one side, entities (USA, India, Hockey, Cricket, Tennis) in the middle, table-columns (TC-1, TC-2, TC-3, TC-4) on the other side]

Encoding the graph
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)
“Entity-Column” bipartite graph :
    Entity     X1      X2
    USA        0.43    0.66
    India      0.41    0.69
    Hockey     0.36    0.80
    Cricket    0.35    0.82
    Tennis     0.34    0.79
Entities with similar X1/X2 values should be ontologically similar - values summarize tabular co-occurrence

Encoding the graph
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)
“Such as” bipartite graph :
    Entity     Y1      Y2
    USA        0.23    0.76
    India      0.21    0.79
    Hockey     0.66    0.35
    Cricket    0.16    0.92
    Tennis     0.14    0.89
Entities with similar Y1/Y2 values should be ontologically similar - values summarize “such as pattern” co-occurrence

Low-dimensional PIC3 embedding
    n * t entity-tableColumn bipartite graph  -> PIC ->  n * m PIC embedding (m << t)
    n * s entity-suchas bipartite graph       -> PIC ->  n * m PIC embedding (m << s)
    Concatenate the two PIC embeddings        ->         n * 2m PIC3 embedding
    Entity     X1      X2      Y1      Y2
    USA        0.43    0.66    0.23    0.76
    India      0.41    0.69    0.21    0.79
    Hockey     0.36    0.80    0.66    0.35
    Cricket    0.35    0.82    0.16    0.92
    Tennis     0.34    0.79    0.14    0.89

Using PIC3 Representation
• Semi-Supervised Learning : Given few seed examples for each class, predict class-labels for unlabeled data-points.
• Set Expansion : Given a set of seed entities, find more entities similar to seed entities.
• Automatic Set Instance Acquisition (ASIA) : Given a concept name automatically find instances of that concept.
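
The PIC3 construction on the embedding slides (run PIC on each bipartite graph, then concatenate) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name pic_embedding, the toy edge lists, the fixed iteration count (used here in place of PIC's early-stopping test) and the use of m random start vectors to obtain m dimensions are all assumptions made for the example.

```python
import random


def pic_embedding(edges, entities, m=2, iters=10, seed=0):
    """Hypothetical helper: m-dimensional PIC-style embedding of `entities`.

    `edges` is a list of (entity, feature) pairs from a bipartite graph.
    The entity-entity affinity A = F F^T is never materialized; each step
    computes v <- D^-1 F (F^T v) and then rescales v to unit L1 norm.
    """
    rng = random.Random(seed)
    features_of = {e: [] for e in entities}
    entities_of = {}
    for entity, feature in edges:
        features_of[entity].append(feature)
        entities_of.setdefault(feature, []).append(entity)

    # Row sums of A: for each entity, total size of the features it touches.
    degree = {e: sum(len(entities_of[f]) for f in fs)
              for e, fs in features_of.items()}

    embedding = {e: [] for e in entities}
    for _ in range(m):  # one random start vector per output dimension
        v = {e: rng.random() for e in entities}
        for _ in range(iters):  # assumption: fixed count, no early stopping
            feature_mass = {f: sum(v[e] for e in es)
                            for f, es in entities_of.items()}
            new_v = {
                e: sum(feature_mass[f] for f in features_of[e]) / degree[e]
                if degree[e] else 0.0
                for e in entities
            }
            total = sum(new_v.values()) or 1.0
            v = {e: x / total for e, x in new_v.items()}
        for e in entities:
            embedding[e].append(v[e])
    return embedding


# Toy edges are illustrative only; they are not the graphs from the slides.
entities = ["USA", "India", "Hockey", "Cricket", "Tennis"]
table_edges = [("USA", "TC-1"), ("India", "TC-1"), ("USA", "TC-2"),
               ("India", "TC-2"), ("Hockey", "TC-3"), ("Cricket", "TC-3"),
               ("Cricket", "TC-4"), ("Tennis", "TC-4")]
suchas_edges = [("USA", "Country"), ("India", "Country"),
                ("USA", "Location"), ("India", "Location"),
                ("Hockey", "Sports"), ("Cricket", "Sports"),
                ("Tennis", "Sports")]

x = pic_embedding(table_edges, entities, m=2, seed=0)   # n * m
y = pic_embedding(suchas_edges, entities, m=2, seed=1)  # n * m
pic3 = {e: x[e] + y[e] for e in entities}               # n * 2m
```

In this toy graph USA and India share exactly the same table-columns and such-as concepts, so they receive identical coordinates, which mirrors the intuition that entities with similar X/Y values should be ontologically similar.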

Quantitative Evaluation : Datasets
                                         Toy_Apple    Delicious_Sports
    #entities                               14,996                 438
    #table-columns                             156                 925
    #entity-table column edges             176,598               9,192
    #suchas concepts                         2,348               1,649
    #entity-suchas edges                     7,683               4,799
    #general entity classes (NELL KB)           11                   3
    #entities in general classes               419                  39
    #hand-coded column types                    31                  30
    #columns in labeled types                  156                 925
Link to dataset : http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online

SSL using PIC3
Input : Few seed examples for each class label
Output : Class-labels for unlabeled data-points
    Task : Semi-Supervised Learning
    Training : PIC3 + train SVM classifier
    Testing : Predict using learnt SVM model
PIC clusters similar entities together -> better SVM classifier on unlabeled data (use of background data)

SSL Task - I
# dimensions : 2504 -> 10

SSL Task - II
# dimensions : 2574 -> 10

Set Expansion using PIC3
Input : Few seed entities e.g. Football, Hockey, Tennis
Output : More entities of same type as seeds e.g. Baseball, Badminton, Cricket, Golf ...
    Task : Set Expansion
    Training : PIC3
    Testing : Centroid(entity set) + K-NN(centroid)
K-NN operation is extremely efficient using KD-trees.

Query Times
• PIC3 preprocessing : 0.02 sec
• # SE queries = 881
    Method           Total Query Time (s)
    K-NN + PIC3      12.7
    K-NN-Baseline    80.1
    MAD              38.2
• Precision Recall Curve : K-NN+PIC3 consistently beats K-NN-Baseline. Modified Adsorption (MAD) method is better on 2/5 query classes at the expense of larger query time.
• Modified Adsorption : Graph based label propagation algorithm

Automatic Set Instance Acquisition (ASIA) : using PIC3
Input : Class label e.g. Country
Output : Entities belonging to the given class label e.g. India, China, USA, Canada, Japan ...
    Task : Automatic Set Instance Acquisition
    Training : PIC3 + Inverted index (suchasConcept -> entities)
    Testing : seeds = top-k-entities (lookup concept in index) + Set Expansion (seeds)
Previously described Set Expansion algorithm is used as a subroutine here.
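
The Set Expansion and ASIA procedures described on these slides can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the contents and ordering of suchas_index, and the parameters n_seeds and k are hypothetical, and a brute-force nearest-neighbour scan is swapped in for the KD-tree mentioned on the slide to keep the example dependency-free. The embedding values are the PIC3 rows shown on the embedding slide.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def set_expansion(embedding, seeds, k):
    """Return the k non-seed entities closest to the centroid of the seeds.

    Brute-force Euclidean k-NN; a KD-tree would replace this scan when the
    number of entities is large.
    """
    c = centroid([embedding[s] for s in seeds])
    candidates = [e for e in embedding if e not in seeds]
    candidates.sort(key=lambda e: math.dist(embedding[e], c))
    return candidates[:k]


def asia(embedding, suchas_index, concept, n_seeds, k):
    """Look up seeds for `concept` in the inverted index, then expand them."""
    seeds = suchas_index.get(concept, [])[:n_seeds]
    if not seeds:
        return []
    return seeds + set_expansion(embedding, seeds, k)


# PIC3 rows (X1, X2, Y1, Y2) taken from the embedding slide.
pic3_table = {
    "USA":     [0.43, 0.66, 0.23, 0.76],
    "India":   [0.41, 0.69, 0.21, 0.79],
    "Hockey":  [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis":  [0.34, 0.79, 0.14, 0.89],
}

# Hypothetical inverted index: concept -> entities, best-ranked first.
suchas_index = {"Country": ["India", "USA"], "Sports": ["Cricket", "Tennis"]}

print(set_expansion(pic3_table, ["Cricket"], k=1))               # ['Tennis']
print(asia(pic3_table, suchas_index, "Country", n_seeds=1, k=1))  # ['India', 'USA']
```

With only five entities the example is deliberately tiny; asking for more neighbours than there are same-class entities will pull in entities of other classes.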

Query Times
• PIC3 preprocessing : 0.02 sec
• # ASIA queries = 25
    Method           Total Query Time (s)
    K-NN + PIC3      0.5
    K-NN-Baseline    1.4
    MAD              150.0
• Precision Recall Curve : K-NN+PIC3 consistently beats K-NN-Baseline. Modified Adsorption method is better on 2/4 query classes at the expense of much larger query time.

Conclusions & Future Work
• Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
• Simple primitive operations on PIC3 perform the following tasks :
    Semi-Supervised Learning
    Set Expansion
    Automatic Set Instance Acquisition
• Future work : Use PIC3 representation for
    Named entity disambiguation
    Unsupervised class-instance pair acquisition

Thank You !!
Please visit our poster ID : 02

Backup slides
[Figure: Examples : Set Expansion]
[Figure: Examples : ASIA]
[Figure: Set Expansion]
[Figure: ASIA Task]