Slide 1 - Richard Wang

Iterative Set Expansion
of Named Entities
using the Web
Richard C. Wang and William W. Cohen
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
Iterative Set Expansion of Named Entities
Richard C. Wang
Outline
Introduction to Set Expansion
  SE System – SEAL
Current Issue with SEAL
Proposed Solution
  Iterative SEAL (iSEAL)
Evaluation Setting
Experimental Results
Conclusion
Language Technologies Institute, Carnegie Mellon University
2 / 21
Set Expansion (SE)
For example,
  Given a query (seeds): { survivor, amazing race }
  The answer is: { american idol, big brother, etc. }
A well-known example of an SE system is Google Sets™
  http://labs.google.com/sets
SE System: SEAL (Wang & Cohen, ICDM 2007)
Features
  Independent of human/markup language
    Support seeds in English, Chinese, Japanese, Korean, ...
    Accept documents in HTML, XML, SGML, TeX, WikiML, …
  Does not require pre-annotated training data
    Utilize readily-available corpus: World Wide Web
Based on two research contributions
  Automatically construct wrappers for extracting candidate items
  Rank candidates using random walk
Try it out for yourself at www.BooWa.com
SEAL’s Pipeline
[Figure: example run. Input seeds: Canon, Nikon, Olympus. Output items: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …]
Fetcher: Download web pages containing all seeds
Extractor: Construct wrappers for extracting candidate items
Ranker: Rank candidate items using Random Walk
How to Build a Graph?
[Figure: example graph. Documents: northpointcars.com, curryauto.com. Wrappers: #1, #2, #3, #4. Items with the scores shown in the figure: “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo” 8.4%, “bmw” 8.4%. Edges are labeled “contain” and “extract”.]
A graph consists of a fixed set of…
  Node Types: { document, wrapper, item }
  Labeled Directed Edges: { contain, extract }
    Each edge asserts that a binary relation r holds
    Each edge has an inverse relation r^-1 (graph is cyclic)
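
The graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which wrapper extracts which item, so the edge lists below are made up; "contain" is assumed to link a document to a wrapper built from it; and the restart probability, the number of steps, and the choice to restart at hypothetical seed items are all assumptions rather than details from the slides.

```python
from collections import defaultdict

def build_graph(contains, extracts):
    """Adjacency map over document, wrapper, and item nodes.
    Every edge is stored together with its inverse relation."""
    adj = defaultdict(set)
    for doc, wrapper in contains:
        adj[doc].add(wrapper)      # document --contain--> wrapper (assumed direction)
        adj[wrapper].add(doc)      # inverse relation contain^-1
    for wrapper, item in extracts:
        adj[wrapper].add(item)     # wrapper --extract--> item
        adj[item].add(wrapper)     # inverse relation extract^-1
    return adj

def random_walk_with_restart(adj, seeds, restart=0.15, steps=50):
    """Power iteration: at each step the walker jumps back to a seed node
    with probability `restart`, otherwise moves to a uniformly chosen neighbor."""
    nodes = list(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(steps):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * prob[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        prob = nxt
    return prob

# Hypothetical toy graph; node names echo the figure, the edges are invented.
contains = [
    ("northpointcars.com", "Wrapper #1"), ("northpointcars.com", "Wrapper #2"),
    ("curryauto.com", "Wrapper #3"), ("curryauto.com", "Wrapper #4"),
]
extracts = [
    ("Wrapper #1", "acura"), ("Wrapper #1", "honda"),
    ("Wrapper #2", "acura"), ("Wrapper #2", "chevrolet"),
    ("Wrapper #3", "acura"), ("Wrapper #3", "honda"), ("Wrapper #3", "chevrolet"),
    ("Wrapper #4", "volvo"), ("Wrapper #4", "bmw"),
]
graph = build_graph(contains, extracts)
scores = random_walk_with_restart(graph, seeds={"acura", "honda"})

# Rank only the item nodes by their probability mass.
items = {item for _, item in extracts}
ranked = sorted(items, key=scores.get, reverse=True)
print(ranked)
```

Because every edge has an inverse, the walker can move from an item back to the wrappers that extract it, which is how probability mass reaches items that share wrappers with the seeds.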
Limitation of SEAL
[Figure: Preliminary Study on Seed Sizes. Y-axis: Mean Average Precision (75% to 85%). X-axis: # Seeds (Seed Size), 2 to 6. Series: RW, PR, BS, WL.]
Evaluated using Mean Average Precision on 36 datasets
For each dataset, we randomly pick n seeds (and repeat 3 times)
Performance drops significantly when given more than 5 seeds
  The Fetcher downloads web pages that contain all seeds
  However, not many pages have more than 5 seeds
Motivation
1. Can SEAL be made to handle many seeds?
2. Can SEAL bootstrap given only a few seeds?
3. How well does SEAL’s ranker perform?
Proposed Solution: Iterative SEAL
iSEAL makes several calls to SEAL
  In each call (iteration)
    Expands a few seeds
    Aggregates statistics
We evaluated iSEAL using…
  Two iterative processes
  Two seeding strategies
  Five ranking methods
Iterative Process & Seeding Strategy
Iterative Processes
1. Supervised
   At every iteration, seeds are obtained from a reliable source (e.g. human)
2. Bootstrapping
   At every iteration, seeds are selected from candidate items (except the 1st iteration)
Seeding Strategies
1. Fixed Seed Size
   Uses 2 seeds at every iteration
2. Increasing Seed Size
   Starts with 2 seeds, then 3 seeds for next iteration, and fixed at 4 seeds afterwards
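
The two iterative processes and two seeding strategies can be combined in one loop. The sketch below is a rough illustration, not the authors' implementation: `expand` is a hypothetical stand-in for one call to SEAL, `oracle` is a hypothetical stand-in for the reliable source, statistics are aggregated by simply summing scores, and the bootstrapping branch assumes that new seeds are the top-ranked candidates not yet used as seeds.

```python
from collections import Counter

def seed_size(strategy, iteration):
    """Number of seeds to use at a 1-based iteration."""
    if strategy == "fixed":
        return 2                       # 2 seeds at every iteration
    if strategy == "increasing":
        return min(1 + iteration, 4)   # 2, then 3, then 4 afterwards
    raise ValueError(f"unknown strategy: {strategy}")

def iterative_expand(expand, initial_seeds, process, strategy,
                     iterations=10, oracle=None):
    """Call `expand` repeatedly and aggregate its scores.

    expand(seeds) -> {item: score}   hypothetical stand-in for one SEAL call
    oracle(n)     -> list of n seeds hypothetical reliable source (supervised)
    """
    totals = Counter()   # aggregated statistics across iterations
    used = []            # seeds already consumed (assumption: never reused)
    history = []
    for i in range(1, iterations + 1):
        n = seed_size(strategy, i)
        if process == "supervised":
            seeds = list(oracle(n))
        elif i == 1:
            seeds = list(initial_seeds)[:n]
        else:
            # Bootstrapping: pick seeds from the current candidate ranking.
            ranked = [item for item, _ in totals.most_common() if item not in used]
            seeds = ranked[:n]
        if len(seeds) < n:
            break        # not enough seeds left to continue
        used.extend(seeds)
        totals.update(expand(seeds))
        history.append(seeds)
    return totals, history

def toy_expand(seeds):
    """Toy expansion that always returns the same eight scored candidates."""
    return {f"c{k}": float(9 - k) for k in range(1, 9)}

totals, history = iterative_expand(
    toy_expand, ["c1", "c2"], process="bootstrapping", strategy="fixed", iterations=3
)
print(history)
```

With a real expander, the aggregated `totals` would be re-ranked by one of the ranking methods after every iteration rather than summed as done here.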
Ranking Methods
1. Random Walk with Restart
   H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.
2. PageRank
   L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
3. Bayesian Sets
   Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
4. Wrapper Length
   Weights each item based on the length of common contextual string of that item and the seeds
5. Wrapper Frequency
   Weights each item based on the number of wrappers that extract the item
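
The two wrapper-based baselines are simple enough to sketch directly. The wrapper representation below (a dict with `left`, `right`, and `items`) is hypothetical, and the Wrapper Length function is only one possible reading of the description above: it assumes the score is the summed length of the left and right context strings of every wrapper that extracts the item. The exact weighting used in the original system is not given on the slide.

```python
from collections import Counter

def wrapper_frequency(wrappers):
    """Score each item by the number of wrappers that extract it."""
    scores = Counter()
    for w in wrappers:
        for item in w["items"]:
            scores[item] += 1
    return scores

def wrapper_length(wrappers):
    """Score each item by the total context length of the wrappers that
    extract it (assumed interpretation, see the note above)."""
    scores = Counter()
    for w in wrappers:
        context_len = len(w["left"]) + len(w["right"])
        for item in w["items"]:
            scores[item] += context_len
    return scores

# Hypothetical wrappers: each has a left/right context and the items it extracts.
wrappers = [
    {"left": "<li>", "right": "</li>", "items": {"honda", "acura", "bmw"}},
    {"left": '<td class="make">', "right": "</td>", "items": {"honda", "acura"}},
    {"left": "<b>", "right": "</b>", "items": {"honda"}},
]
wf = wrapper_frequency(wrappers)
wl = wrapper_length(wrappers)
print(wf.most_common())
print(wl.most_common())
```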
Evaluation Datasets
Evaluation Metric / Procedure
Evaluation metric: Mean Average Precision
  Contains recall and precision-oriented aspects
  Sensitive to the entire ranking
Evaluation procedure:
  For every combination of iterative process, seeding strategy, and ranking method
  1. Perform 10 iterative expansions for each of the 36 datasets (and repeat 3 times)
  2. At every iteration, compute and report MAP
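
For reference, the metric can be computed as follows. This is the common textbook definition of average precision (precision at each rank where a correct item appears, divided by the number of correct items); the slides do not spell out details such as how duplicates or synonyms are handled, so the duplicate handling here is an assumption. The toy rankings are made up.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of correct items."""
    hits = 0
    total = 0.0
    seen = set()
    for rank, item in enumerate(ranked, start=1):
        # Count each correct item once (assumed handling of duplicates).
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank       # precision at this rank
    # Dividing by all correct items penalizes the ones never retrieved.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, correct set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["a", "x", "b"], {"a", "b"}),   # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a", "b"}),        # AP = (1/2) / 2, "b" is never retrieved
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

Dividing by the number of correct items is what gives the metric its recall-oriented aspect, while the per-rank precision terms make it sensitive to the entire ranking.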
Fixed Seed Size (Supervised)
Initial Seeds
Fixed Seed Size (Supervised)
[Figure: Mean Average Precision (y-axis, 89% to 98%) vs. # Iterations (Cumulative Expansions) (x-axis, 1 to 10). Series: RW, PR, BS, WL, WF.]
Fixed Seed Size (Bootstrap)
Initial Seeds
Fixed Seed Size (Bootstrap)
[Figure: Mean Average Precision (y-axis, 86% to 92%) vs. # Iterations (Cumulative Expansions) (x-axis, 1 to 10). Series: RW, PR, BS, WL, WF.]
Increasing Seed Size (Bootstrap)
Initial Seeds
Used Seeds
Increasing Seed Size (Bootstrapping)
[Figure: Mean Average Precision (y-axis, 89% to 94%) vs. # Iterations (Cumulative Expansions) (x-axis, 1 to 10). Series: RW, PR, BS, WL, WF.]
Conclusion
1. Can SEAL be made to handle many seeds?
   Yes, by Fixed Seed Size (Supervised).
2. Can SEAL bootstrap given only a few seeds?
   Yes, by Increasing Seed Size (Bootstrapping).
3. How well does SEAL’s ranker perform?
   In supervised, RW is comparable to the best (BS)
   In bootstrapping, RW outperforms others
     Robust to noisy seeds
The End – Thank You!
Try out Boo!Wa! at www.BooWa.com
  A SEAL-based list extractor for many languages
Send any feedback to: rcwang@cs.cmu.edu