
Quality: Traditional and Possible Mechanisms
CS 502 – 2002-03-12
Carl Lagoze – Cornell University
Cornell CS 502
The Problem
• Build a large-scale digital library from web resources but
maintain “quality”
– National Science Digital Library
• Traditional library technique
– Acquisitions librarians
– Trusted Sources
• Professional societies, publishers
– Patron request
• Problems in NSDL environment
– Unfocused audience
– Scale
– Variability of resources
– COPPA
(http://www.cdt.org/legislation/105th/privacy/coppa.html)
– What is quality?
General observations of quality
• Is there such a thing as a shared notion of
quality?
– Few good studies
• Studies with popular culture
– Amento, Terveen, Hill ACM TOIS study of web sites on
“Buffy the Vampire Slayer”, “Simpsons”, “Smashing
Pumpkins”, “Tori Amos”
– Studies show that expert agreement on quality is around
75%
• Expert agreement differs across different
categories of information
• Often a confusion between relevance and quality,
which are different
What is quality on the Web?
• Factors
– Site Layout
– Site Organization
– Uniqueness of information
– Reputation of “publisher”
Current Quality Strategy 1:
The Reader Looks for Clues
Internal clues can inform an
experienced reader
All that glisters is not gold.
And vice versa.
Considerations
Publisher, ACM, is a well-known scientific
society that follows standard procedures for
peer review.
Editor-in-chief is a well-known professor in a
strong department.
(http://www.acm.org/jacm/Editors.html)
Papers in theoretical computer science can be
reviewed from their content.
Gold
Considerations
Looks the same as the Journal of the ACM.
but ...
Procedures for selecting and reviewing
conference papers are loosely controlled.
Papers in applications research are difficult to
evaluate by superficial reading.
Not gold
Considerations
The appearance looks like a draft.
Nothing technical from 1981 is current.
Who is DARPA anyway?
yet ...
This is the official definition of IP.
http://www.ietf.org/rfc/rfc0791.txt?number=791
Gold
Considerations
The appearance looks like a joke.
URL looks suspicious (strange spelling).
What’s with the graphic?
yet ...
This is the working literature of physics research.
http://arxiv.org
Gold
Current Quality Strategy 2:
The Publisher as Creator
Materials are written by authors or
selected by curators who are employed
by the publisher.
Quality is tied to the reputation of the
publisher.
Current Quality Strategy 3:
External Readers Chosen by
the Publisher
Publishers ask external experts to
review materials
Observations about Peer Review
At its best, it is superb.
At its worst, it validates junk.
Some topics can be reviewed from a
paper, e.g., mathematics.
Some topics cannot be reviewed from a
paper, e.g., computer systems.
"Whatever you do, write a paper. Some
journal will publish it." Advice to young
faculty member, University of Sussex, 1972.
Current Quality Strategy 4:
Independent Reviews
Reviewers, hopefully independent of the
author and publisher, describe their
opinion of the item.
Value of the review to the user depends on
(a) the reputation of where the review is
published and (b) how well it is done.
Citation Analysis
• Understanding citation patterns among scholarly
journals
– Quality metric on journals (not on individual articles or
scholars)
– Cost/benefit analysis – what “basic” journals should a
library have in its holdings
• Eugene Garfield – “Father of citation analysis”
• Science Citation Index
– Origins circa the 1950s
– Hand analysis of printed journals showing patterns of
citations into and out from journals
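The journal-level quality metric that grew out of this work is Garfield's two-year journal impact factor: citations received in year Y to items published in the previous two years, divided by the number of citable items published in those two years. A minimal sketch with illustrative data structures (not any real index's schema):

```python
# Hedged sketch of a Garfield-style two-year journal impact factor.
# Both dictionaries below are illustrative toy structures.

def impact_factor(citations_by_year, items_by_year, year):
    """citations_by_year[(citing_year, cited_pub_year)] -> citation count
    items_by_year[pub_year] -> number of citable items published."""
    cites = (citations_by_year.get((year, year - 1), 0)
             + citations_by_year.get((year, year - 2), 0))
    items = items_by_year.get(year - 1, 0) + items_by_year.get(year - 2, 0)
    return cites / items if items else 0.0

# Toy example: a journal's 2002 impact factor.
citations = {(2002, 2001): 150, (2002, 2000): 90}
published = {2001: 60, 2000: 40}
print(impact_factor(citations, published, 2002))  # 240 / 100 = 2.4
```

Note the metric is defined on journals, not on individual articles or authors, as the slide emphasizes.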
Concepts – References and Citations
[Diagram: Doc1 links to Doc2 and Doc3; Doc2 links to Doc1]
Doc1 references: (Doc2, Doc3)
Doc1 citations: (Doc2)
Concepts – References and Citations
• # of references of a document is finite, stable,
and easy to determine/compute
• # of citations of a document is dynamic and
impossible to compute completely (open-ended)
• Generally, references are at the work, or
manifestation level, NOT at the item level
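The asymmetry between the two counts can be made concrete with a small link graph, using the Doc1/Doc2/Doc3 example from the previous slide:

```python
# References vs. citations over a directed link graph.
# References are a document's outgoing edges (finite, fixed once
# published); citations are its incoming edges (grow as new documents
# appear, so a complete count needs the whole, ever-growing corpus).

graph = {                       # doc -> docs it references (toy data)
    "Doc1": ["Doc2", "Doc3"],
    "Doc2": ["Doc1"],
    "Doc3": [],
}

def reference_count(graph, doc):
    # Local to the document: just read its own reference list.
    return len(graph.get(doc, []))

def citation_count(graph, doc):
    # Requires scanning every known document; in an open corpus this
    # set is never complete, which is why citation counts are dynamic.
    return sum(doc in refs for refs in graph.values())

print(reference_count(graph, "Doc1"))  # 2 (Doc2, Doc3)
print(citation_count(graph, "Doc1"))   # 1 (Doc2)
```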
Results of citation analysis
Acks: Garfield, Science, 1972
Citation analysis in the digital age
• Automatic citation linking among papers in arXiv
– Citebase (Open Citations Project)
– http://citebase.eprints.org/cgi-bin/search?submit=1&author=Hawking%2C%20S%20W%20
• Scientometrics - Automation of methods reveals
lots of data
– Longevity of interest in paper
– Journal and ePrint citation patterns
• Automatic citation analysis as a reviewing tool?
Are papers downloaded then cited, or cited then
downloaded?
[Chart: “What came first, the Citation or the Download?” showing the
frequency of (age of paper at download minus age of paper at citation),
over roughly -300 to 2700 days]
• Plotting all these time differences produces the graph above.
Acks: S. Harnad
Citation Latencies
[Chart: “Frequency of Citation Latencies: 1992-1999” showing citation
counts against time difference in months (0 to 96), one curve per year
from 1992 to 1999]
• The raw data show that the latency of the citation peak has been
decreasing over the period of the archive
Acks: S. Harnad
Author Impact Quartiles
Quartile    Total    % Total   Citations   Papers   Mean Citations/Author/Paper   Deposits   Updates/Author
High 25%      798     2.09%      240,092    2,732   0.11                             6,720   0.48
Med 50%     9,262    24.20%      733,272   37,318   0.00212                         93,671   0.37
Low 25%    28,211    73.71%      251,925   67,951   0.000131                       165,971   0.27
• High impact authors update more than medium or low impact authors
• High and medium impact authors deposit more papers than low impact
authors
Acks: S. Harnad
Citation Quality
[Chart: “Do Papers Cite Papers of Like Impact?” showing number of
citations by source impact and destination impact (low, medium, high
on each axis)]
• Papers generally cite papers of like impact
Acks: S. Harnad
Citation Spread
[Chart: “Histogram of Citations per Paper (by author impact)” showing
papers binned by citation count (none, 1, 2/3, 4/5/6, 7/8/9/10, 11 or
more), split by author impact: High (2.53%), Medium (34.55%), Low
(62.92%); 30,000 papers were by authors with no citations]
• A small number of papers receive a very large number of citations
Acks: S. Harnad
How Paper Impact Affects Usage
[Chart: download frequency density against age of paper in days (0 to
about 2400) for all papers, split by impact: High (2.0%), Medium (7.7%),
Low (46.5%), Unknown (39.6%)]
• Higher impact papers have a longer download life expectancy.
Acks: S. Harnad
What is the correlation between citations and downloads?
Download type                     r          n
All Papers                        0.11155    63,671
High Impact Papers (2.0%)         0.27293     1,981
Medium Impact Papers (7.7%)       0.01288     5,937
Low Impact Papers (46.5%)        -0.01412    30,163
• There is a significant positive correlation between citations and
downloads for high impact papers.
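The statistic in the table is Pearson's correlation coefficient r between per-paper citation counts and download counts. A self-contained sketch of the computation, using toy numbers rather than the arXiv figures above:

```python
# Pearson's r between two equal-length samples; data are illustrative.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

citations = [0, 1, 2, 5, 12, 40]       # hypothetical per-paper citations
downloads = [10, 15, 20, 60, 150, 400] # hypothetical per-paper downloads
print(round(pearson_r(citations, downloads), 3))
```

r ranges from -1 (perfect inverse relation) through 0 (no linear relation) to 1 (perfect direct relation), which is why the table's 0.27293 for high impact papers is read as a significant positive correlation while the near-zero values for medium and low are not.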
Acks: S. Harnad
Automatic Reviewing Techniques
• Traditional Collaborative Filtering
– Estimate what score a reviewer might give to an item
that he/she has not scored yet
– Frequently used by recommender systems
– Use of user profiles
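The estimation step above can be sketched as user-based collaborative filtering: predict an unseen score as a similarity-weighted average of other reviewers' scores. Names and data below are illustrative, not from any real recommender:

```python
# Minimal user-based collaborative filtering sketch (toy data).
import math

ratings = {                      # reviewer -> {item: score}
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 2, "c": 4, "d": 5},
    "cat": {"a": 1, "b": 5, "d": 2},
}

def cosine_sim(u, v):
    """Cosine similarity over the items both profiles have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(u[i] ** 2 for i in shared))
    nv = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' scores for item."""
    num = den = 0.0
    for other, scores in ratings.items():
        if other == user or item not in scores:
            continue
        w = cosine_sim(ratings[user], scores)
        num += w * scores[item]
        den += abs(w)
    return num / den if den else None

print(predict(ratings, "ann", "d"))  # a weighted blend of bob's 5 and cat's 2
```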
• Collaborative quality filtering
– http://www.cs.berkeley.edu/~tracyr/project/
– Attempts to automatically determine which reviewers are
"good" in an open reviewing system, in order to provide
the same (or better) benefits as peer review
Collaborative Quality Filtering Algorithm
• Assume true value of an item is the asymptotic
average of review scores
• Good reviewers are those who consistently predict
this average
• Normalize according to # of reviews of an item, #
of reviews by reviewer, review latency
• Adjust by “expertise”
– Use similarity of term vectors of items reviewed
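The core of the algorithm above can be sketched as follows; this simplified version takes an item's "true" value to be the mean of its review scores and rates reviewers by how closely they track that consensus (the normalization and expertise-weighting steps from the slide are omitted):

```python
# Hedged sketch of collaborative quality filtering (toy data).
from statistics import mean

reviews = {                     # item -> {reviewer: score}
    "paper1": {"r1": 4, "r2": 5, "r3": 1},
    "paper2": {"r1": 3, "r2": 3, "r3": 5},
    "paper3": {"r1": 5, "r2": 4, "r3": 2},
}

def reviewer_error(reviews, reviewer):
    """Mean absolute deviation of a reviewer's scores from each item's
    average score. Lower error = more consistent with the consensus,
    i.e. a 'better' reviewer under this simplified model."""
    errs = []
    for scores in reviews.values():
        if reviewer in scores:
            errs.append(abs(scores[reviewer] - mean(scores.values())))
    return mean(errs) if errs else None

for r in ("r1", "r2", "r3"):
    print(r, round(reviewer_error(reviews, r), 2))
```

In a full system, reviewers with low error would then be weighted more heavily when aggregating scores for new items.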
Annotation Systems
• Worked successfully in many cases in Web
environment
– Amazon
• Most successful when combined with reputation
systems
– eBay
• Problems with existing systems
– Natural language
– Closed/private systems
– Non-extensible
Annotea: Open Web Infrastructure for Shared Web
Annotations
• http://www.w3.org/2001/Annotea/
• Annotations as class of metadata
• External to the document and stored on an annotation server
• Primitive RDF class: Annotation
– Sub-classed in various ways: Advice, Change, Example,
Explanation, Question, See Also
• Ratings can be formally expressed and machine readable
• Annotations stored in RDF databases on annotation servers that can
be queried
• Information in multiple annotation servers can be merged
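An annotation under this model might look like the RDF/XML sketch below. The annotation namespace is Annotea's, but the resource URLs are invented for illustration, and the exact property set should be checked against the Annotea specification:

```xml
<!-- Illustrative Annotea-style annotation (simplified sketch). -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <a:Annotation rdf:about="http://annotations.example.org/anno1">
    <!-- The annotated document stays untouched; the link is external. -->
    <a:annotates rdf:resource="http://www.example.org/some-page.html"/>
    <dc:creator>A. Reviewer</dc:creator>
    <a:created>2002-03-12T10:00:00Z</a:created>
    <!-- The annotation body is itself a web resource on the server. -->
    <a:body rdf:resource="http://annotations.example.org/anno1/body"/>
  </a:Annotation>
</rdf:RDF>
```

Because annotations are just RDF triples about a target URI, results from multiple annotation servers can be merged by simple graph union.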
Annotea System Architecture
Annotea RDF data model
Annotations in the NSDL
[Diagram: a search service harvests annotations via OAI from org1 and
org2 annotation services; example user query: “I only want resources
endorsed by org1”]