Multilingual Wikipedia:
Editors of Primary Language Contribute to More Complex Articles
WORK IN PROGRESS
Sungjoon Park, Suin Kim*, Scott A. Hale, Sooyoung Kim, Jeongmin Byun, Alice Oh
{sungjoon.park, suin.kim, sooyoungkim, jmbyun}@kaist.ac.kr, scott.hale@oii.ox.ac.uk, alice.oh@kaist.edu
Primary/Non-Primary Users
Research Questions
Data
[Figure: Primary Languages of Users Editing this Language Edition. For each language edition (EN, DE, ES), counts of editors by primary language (EN, DE, FR, ES, RU, IT, NL, PT, CA, OTHERS).]

Language Edition    #Editors    #Article Edit Sessions    #Edits
English               11,616                   237,849   350,541
German                 3,271                   120,123   160,126
Spanish                2,506                    69,557   112,099

Quantifying Language Complexity

Preprocessing: Wiki Markup → HTML → Plain Text → Tokenized + Lowercase

Example (article section "History"):

Wiki markup:
    == History ==
    [[File:Sidney Hall - Urania's Mirror - Canis Major, Lepus, Columba Noachi & Cela Sculptoris.jpg|thumb|left|Seen as "Cela Sculptoris" in the ...

Rendered text:
    History
    Seen as "Cela Sculptoris" in the lower right of this 1825 star chart from Urania's Mirror
    Caelum was first introduced in the ...

Tokenized and lowercased:
    history seen as " cela sculptoris " in the lower right of this 1825 star chart from urania ' s mirror caelum was first introduced in the eighteenth century by nicolas louis de lacaille , a french ...

Measure: d_i → f(d_i) → max(f(d_i)) → Aggregate per User

Basic Features
• Number of characters
• Number of words
• Number of unique words
• Number of sentences
• Average word length in characters
• Average sentence length in words

Lexical Diversity
• Entropy of word frequency
• Average word rank
• Average word occurrence frequency
• Error rate

Syntactic Structure
• Entropy of POS frequency
• Mean phrase length (NP, VP)
• Mean parse tree depth
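The basic features and the word-frequency entropy listed above can be illustrated with a minimal sketch. The regular-expression tokenizer, the naive sentence splitter, and the sample text are assumptions made for illustration; the poster does not state which tools were used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (assumption, not the poster's method)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def basic_features(text):
    """Compute the six basic features for one document."""
    words = tokenize(text)
    sentences = split_sentences(text)
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
    }


def entropy(items):
    """Shannon entropy (in bits) of the empirical frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative sample text; the second sentence is made up for this example.
sample = "Caelum was first introduced in the eighteenth century. It is a faint constellation."
feats = basic_features(sample)
word_entropy = entropy(tokenize(sample))
```

The same `entropy` function could be applied to a sequence of part-of-speech tags to obtain the entropy of POS frequency, given tags from whatever tagger is available.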
Results (per Paragraph)
[Figure: Primary vs. Non-Primary editors for English, German, and Spanish. Panel titles: Basic Features, Lexical Diversity, Syntactic Structure. Measures shown: Number of Characters, Number of Words, Number of Sentences, Average Word Length, Average Occurrence of Words in Reference, Entropy of Part-of-Speech Unigrams.]

Topic Clusters
• Articles in history cluster are more complex
• Articles in football cluster are less complex
• The complexity measures used are consistent
• Different topics show different complexity
[Figure: Length of Noun Phrases and Number of Articles per topic cluster.]

LDA+DBSCAN Algorithm
• Fit LDA over entire data with 100 topics
  • Each document is reduced to 100 dimensions
• Cluster documents into 20 clusters with DBSCAN
• Cluster-level rank of language complexity is consistent over measures
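The clustering step can be sketched as follows, assuming the document-topic vectors from LDA are already available. The toy 3-dimensional vectors (instead of 100), the Euclidean distance, and the `eps` and `min_samples` values are all illustrative assumptions; the poster does not report its DBSCAN parameters. Note that DBSCAN does not take a target number of clusters, so a result of 20 clusters would follow from the chosen parameters.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, p in enumerate(points) if euclidean(points[i], p) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical document-topic vectors (3 topics here for readability).
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.88, 0.07, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.07, 0.88, 0.05],
    [0.34, 0.33, 0.33],
]
labels = dbscan(doc_topics, eps=0.1, min_samples=2)
# labels == [0, 0, 0, 1, 1, 1, -1]: two clusters and one noise document
```

A density-based method leaves documents without a clear topical neighborhood unassigned (label -1), which is one plausible reason to prefer it over a method that forces every document into a cluster.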
Topic clusters: Arabic; Football (Soccer); Olympics; City; Animals/Plants; Transportation; Sports: Teams; Music: Song, album; Names of Places; Entertainment; Sports: American; Politics; TV; Music: musician; Science; Computer; Film; Military; Education; History
Research Questions
• Is it possible to quantify the complexity of language in Wikipedia articles?
• Do primary users tend to edit parts of the Wikipedia articles with higher language complexity than non-primary users?
• Do we observe more natural language in the articles after primary users’ edits compared to the articles after non-primary users’ edits?

Contributions
• We tested 20 measures for language complexity and showed they are highly consistent.
• Multilinguals in Wikipedia show relatively high levels of proficiency in their primary languages.
Primary/Non-Primary Users
• Multilingual editors have edited different language editions.
• We define a user’s primary language as the language that the user edited the most (not necessarily the user’s native language).
• For each language edition X, we define primary users as those whose primary language is X; all other editors of that edition are non-primary users.
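The definition above can be sketched in a few lines. The edit log, the user names, and the alphabetical tie-breaking rule are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter, defaultdict

# Hypothetical edit log: one (user, language_edition) pair per edit.
edits = [
    ("alice", "en"), ("alice", "en"), ("alice", "de"),
    ("bob", "de"), ("bob", "de"), ("bob", "es"),
    ("carol", "es"), ("carol", "es"), ("carol", "en"),
]


def primary_languages(edits):
    """Map each user to the language edition they edited the most."""
    counts = defaultdict(Counter)
    for user, lang in edits:
        counts[user][lang] += 1
    # Ties are broken alphabetically (an assumption; the poster does not say).
    return {
        user: min(c, key=lambda lang: (-c[lang], lang))
        for user, c in counts.items()
    }


def split_editors(edits, edition):
    """Return (primary_users, non_primary_users) for one language edition."""
    primary = primary_languages(edits)
    editors = {user for user, lang in edits if lang == edition}
    primary_users = {u for u in editors if primary[u] == edition}
    return primary_users, editors - primary_users


primary_en, non_primary_en = split_editors(edits, "en")
# primary_en == {"alice"}, non_primary_en == {"carol"}
```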
Results

Basic Features
• Primary editors edit longer articles.
• Number of characters, words, average word length, and sentences are always higher for the articles edited by primary users.
[Figure: Number of Sentences (English, German, Spanish).]

Lexical Diversity
• Primary editors edit articles with higher lexical diversity.
• Entropy of n-gram is always higher in primary edits.
• Primary users edit articles that use a frequent yet diverse set of words.
[Figure: Average Occurrence of Words in Reference (English, German, Spanish).]

Syntactic Structure
• Primary editors edit articles with more complex syntactic structure.
• Parse-tree based measures
  • Length of longest noun/verb phrase
  • Depth of parse tree
• Articles edited by primary users have complex syntactic structure.
[Figure: NP length, VP length, and Parse Tree Depth (English, German, Spanish).]
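The parse-tree based measures can be illustrated on a bracketed constituency tree. This is a minimal sketch: the Penn-Treebank-style input string is hand-written, the depth convention (words count as depth 0) is an assumption, and in practice the tree would come from a constituency parser that the poster does not name.

```python
def parse_bracketed(s):
    """Parse a bracketed tree string into (label, children) tuples; leaves are word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def depth(node):
    """Words have depth 0; a node is one level deeper than its deepest child."""
    if isinstance(node, str):
        return 0
    return 1 + max((depth(c) for c in node[1]), default=0)


def leaves(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]


def longest_phrase(node, label):
    """Length in words of the longest constituent with the given label (0 if none)."""
    if isinstance(node, str):
        return 0
    own = len(leaves(node)) if node[0] == label else 0
    return max([own] + [longest_phrase(c, label) for c in node[1]])


# Hand-written example tree (hypothetical parser output).
tree = parse_bracketed(
    "(S (NP (DT the) (JJ faint) (NN constellation))"
    " (VP (VBD was) (VP (VBN introduced) (PP (IN by) (NP (NNP Lacaille))))))"
)
tree_depth = depth(tree)               # 6
longest_np = longest_phrase(tree, "NP")  # 3 words: "the faint constellation"
longest_vp = longest_phrase(tree, "VP")  # 4 words: "was introduced by Lacaille"
```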