Iterative Set Expansion of Named Entities using the Web
Richard C. Wang and William W. Cohen
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA 15213 USA

Outline
  Introduction to Set Expansion
  SE System – SEAL
  Current Issue with SEAL
  Proposed Solution
  Iterative SEAL (iSEAL)
  Evaluation Setting
  Experimental Results
  Conclusion

Set Expansion (SE)
  For example, given a query (seeds): { survivor, amazing race }
  The answer is: { american idol, big brother, etc. }
  A well-known example of an SE system is Google Sets™
    http://labs.google.com/sets

SE System: SEAL
  Features
    Independent of human/markup language
      Supports seeds in English, Chinese, Japanese, Korean, ...
      Accepts documents in HTML, XML, SGML, TeX, WikiML, …
    Does not require pre-annotated training data
      Utilizes a readily available corpus: the World Wide Web
  Based on two research contributions (Wang & Cohen, ICDM 2007)
    Automatically constructs wrappers for extracting candidate items
    Ranks candidates using random walk
  Try it out for yourself at www.BooWa.com

SEAL’s Pipeline
  Example seeds: Canon, Nikon, Olympus
  Example expanded items: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
  Fetcher: Download web pages containing all seeds
  Extractor: Construct wrappers for extracting candidate items
  Ranker: Rank candidate items using Random Walk

How to Build a Graph?
  [Figure: example graph with document nodes (northpointcars.com, curryauto.com), wrapper nodes (Wrapper #1 to Wrapper #4), and item nodes with their weights: “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo” 8.4%, “bmw” 8.4%. Edges are labeled "contain" and "extract".]
  A graph consists of a fixed set of…
    Node Types: { document, wrapper, item }
    Labeled Directed Edges: { contain, extract }
      Each edge asserts that a binary relation r holds
      Each edge has an inverse relation r⁻¹ (graph is cyclic)

Limitation of SEAL
  Preliminary Study on Seed Sizes
    Evaluated using Mean Average Precision on 36 datasets
    For each dataset, we randomly pick n seeds (and repeat 3 times)
  [Figure: Preliminary Study on Seed Sizes. x-axis: # Seeds (Seed Size), 2 to 6; y-axis: Mean Average Precision, 75% to 85%; series: RW, PR, BS, WL.]
  Performance drops significantly when given more than 5 seeds
    The Fetcher downloads web pages that contain all seeds
    However, not many pages have more than 5 seeds

Motivation
  1. Can SEAL be made to handle many seeds?
  2. Can SEAL bootstrap given only a few seeds?
  3. How well does SEAL’s ranker perform?

Proposed Solution: Iterative SEAL
  iSEAL makes several calls to SEAL
  In each call (iteration)
    Expands a few seeds
    Aggregates statistics
  We evaluated iSEAL using…
    Two iterative processes
    Two seeding strategies
    Five ranking methods

Iterative Process & Seeding Strategy
  Iterative Processes
    1. Supervised
       At every iteration, seeds are obtained from a reliable source (e.g. human)
    2. Bootstrapping
       At every iteration, seeds are selected from candidate items (except the 1st iteration)
  Seeding Strategies
    1. Fixed Seed Size
       Uses 2 seeds at every iteration
    2. Increasing Seed Size
       Starts with 2 seeds, then 3 seeds for the next iteration, and fixed at 4 seeds afterwards

Ranking Methods
  1. Random Walk with Restart (RW)
     H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.
  2. PageRank (PR)
     L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
  3. Bayesian Sets (BS)
     Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
  4. Wrapper Length (WL)
     Weights each item based on the length of the common contextual string of that item and the seeds
  5. Wrapper Frequency (WF)
     Weights each item based on the number of wrappers that extract the item

Evaluation Datasets
  [Table: evaluation datasets; contents not preserved in this text.]

Evaluation Metric / Procedure
  Evaluation metric: Mean Average Precision (MAP)
    Contains recall and precision-oriented aspects
    Sensitive to the entire ranking
  Evaluation procedure:
    For every combination of iterative process, seeding strategy, and ranking method
      1. Perform 10 iterative expansions for each of the 36 datasets (and repeat 3 times)
      2. At every iteration, compute and report MAP

Fixed Seed Size (Supervised)
  [Figure: seed usage across iterations; label: Initial Seeds.]
  [Figure: x-axis: # Iterations (Cumulative Expansions), 1 to 10; y-axis: Mean Average Precision, 89% to 98%; series: RW, PR, BS, WL, WF.]

Fixed Seed Size (Bootstrap)
  [Figure: seed usage across iterations; label: Initial Seeds.]
  [Figure: x-axis: # Iterations (Cumulative Expansions), 1 to 10; y-axis: Mean Average Precision, 86% to 92%; series: RW, PR, BS, WL, WF.]

Increasing Seed Size (Bootstrap)
  [Figure: seed usage across iterations; labels: Initial Seeds, Used Seeds.]
  [Figure: x-axis: # Iterations (Cumulative Expansions), 1 to 10; y-axis: Mean Average Precision, 89% to 94%; series: RW, PR, BS, WL, WF.]

Conclusion
  1. Can SEAL be made to handle many seeds?
     Yes, by Fixed Seed Size (Supervised).
  2. Can SEAL bootstrap given only a few seeds?
     Yes, by Increasing Seed Size (Bootstrapping).
  3. How well does SEAL’s ranker perform?
     In supervised, RW is comparable to the best (BS)
     In bootstrapping, RW outperforms others
       Robust to noisy seeds

The End – Thank You!
  Try out Boo!Wa! at www.BooWa.com
    A SEAL-based list extractor for many languages
  Send any feedback to: rcwang@cs.cmu.edu
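
The graph slide ("How to Build a Graph?") and the Ranker step can be illustrated with a small random walk with restart over document, wrapper, and item nodes. This is a minimal sketch and not SEAL's implementation: the toy edges, the node names such as doc:a, the restart probability of 0.15, and the assumption that documents link to wrappers and wrappers link to items are all illustrative choices that the slides do not specify.

```python
from collections import defaultdict


def build_graph(edges):
    """Adjacency map in which every edge is also stored in the inverse direction."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)  # inverse relation r^-1
    return adjacency


def random_walk_with_restart(adjacency, seeds, restart=0.15, iterations=50):
    """Return the probability of the walker being at each node after `iterations` steps."""
    nodes = list(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}  # jump back to a seed
        for n in nodes:
            share = (1.0 - restart) * prob[n] / len(adjacency[n])
            for neighbor in adjacency[n]:
                nxt[neighbor] += share  # follow one edge, chosen uniformly
        prob = nxt
    return prob


# Hypothetical toy graph: node names and edges are illustrative only.
edges = [
    ("doc:a", "wrapper:1"), ("doc:a", "wrapper:2"), ("doc:b", "wrapper:3"),
    ("wrapper:1", "item:honda"), ("wrapper:1", "item:acura"), ("wrapper:1", "item:bmw"),
    ("wrapper:2", "item:honda"), ("wrapper:2", "item:acura"),
    ("wrapper:2", "item:bmw"), ("wrapper:2", "item:volvo"),
    ("wrapper:3", "item:acura"), ("wrapper:3", "item:chevrolet"),
]
graph = build_graph(edges)
seeds = {"item:honda", "item:acura"}
scores = random_walk_with_restart(graph, seeds)

# Candidate items (non-seed item nodes), best first.
candidates = sorted(
    (n for n in graph if n.startswith("item:") and n not in seeds),
    key=scores.get, reverse=True,
)
```

In this toy graph, "item:bmw" is extracted by two wrappers that also extract the seeds while "item:volvo" is extracted by only one, so the walk gives bmw the higher score.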
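
The two seeding strategies and the bootstrapping process from the "Iterative Process & Seeding Strategy" slide can be sketched as a loop around a single expansion call. The slides state only the seed counts (2 at every iteration, or 2, then 3, then 4 afterwards) and that bootstrapped seeds come from candidate items; everything else here is an assumption made for illustration: the function names, summing scores as the way to aggregate statistics, choosing the top-ranked unused candidates as the next seeds, and toy_expand, which is a hypothetical stand-in for one SEAL call.

```python
def seed_size(strategy, iteration):
    """Seeds used at a 1-based iteration: 'fixed' is always 2; 'increasing' is 2, 3, then 4."""
    if strategy == "fixed":
        return 2
    if strategy == "increasing":
        return min(iteration + 1, 4)
    raise ValueError(f"unknown strategy: {strategy}")


def bootstrap(initial_seeds, expand, strategy, iterations):
    """Call `expand` repeatedly, summing candidate scores across iterations."""
    seeds = list(initial_seeds)
    used = set(seeds)
    totals = {}
    for iteration in range(1, iterations + 1):
        if not seeds:
            break  # no unused candidates left to seed the next call
        for item, score in expand(seeds).items():
            totals[item] = totals.get(item, 0.0) + score
        ranked = sorted(totals, key=totals.get, reverse=True)
        # Assumption: the next seeds are the top-ranked candidates not used before.
        fresh = [item for item in ranked if item not in used]
        seeds = fresh[:seed_size(strategy, iteration + 1)]
        used.update(seeds)
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical stand-in for one SEAL call: toy lists play the role of web pages.
TOY_LISTS = [
    ["survivor", "amazing race", "american idol"],
    ["american idol", "big brother", "survivor"],
]


def toy_expand(seeds):
    """Score each non-seed item by the number of toy lists that contain all seeds."""
    scores = {}
    for items in TOY_LISTS:
        if all(seed in items for seed in seeds):
            for item in items:
                if item not in seeds:
                    scores[item] = scores.get(item, 0.0) + 1.0
    return scores


ranking = bootstrap(["survivor", "amazing race"], toy_expand, "fixed", iterations=2)
```

With the toy data, the first call finds only "american idol"; using it as the next seed reaches the second list and brings in "big brother", which is the behaviour bootstrapping relies on.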
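
The evaluation metric named on the "Evaluation Metric / Procedure" slide can be illustrated with the textbook definition of average precision. The slides do not give the formula, so this sketch assumes the standard one (precision at each rank where a correct item appears, averaged over the number of correct items); the function names and the example ranking are illustrative.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per dataset."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Illustrative ranking: correct items at ranks 1 and 3.
ranked = ["american idol", "the office", "big brother"]
relevant = {"american idol", "big brother"}
ap = average_precision(ranked, relevant)  # (1/1 + 2/3) / 2
```

Because every correct item contributes a term and missing items contribute zero, the score reflects both precision and recall, and it changes whenever any correct item moves in the ranking.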