Quality: Traditional and Possible Mechanisms
CS 502 – 2002-03-12
Carl Lagoze – Cornell University
Cornell CS 502

The Problem
• Build a large-scale digital library from web resources while maintaining "quality"
  – National Science Digital Library (NSDL)
• Traditional library techniques
  – Acquisitions librarians
  – Trusted sources
    • Professional societies, publishers
  – Patron requests
• Problems in the NSDL environment
  – Unfocused audience
  – Scale
  – Variability of resources
  – COPPA (http://www.cdt.org/legislation/105th/privacy/coppa.html)
  – What is quality?

General observations of quality
• Is there such a thing as a shared notion of quality?
  – Few good studies
• Studies with popular culture
  – Amento, Terveen, Hill: ACM TOIS study of web sites on "Buffy the Vampire Slayer", "The Simpsons", "Smashing Pumpkins", and "Tori Amos"
  – Studies show that expert agreement on quality is around 75%
• Expert agreement differs across different categories of information
• Relevance and quality are often confused, but they are different

What is quality on the Web?
• Factors
  – Site layout
  – Site organization
  – Uniqueness of information
  – Reputation of the "publisher"

Current Quality Strategy 1: The Reader Looks for Clues
Internal clues can inform an experienced reader.
"All that glisters is not gold." And vice versa.

Considerations (Journal of the ACM)
The publisher, ACM, is a well-known scientific society that follows standard procedures for peer review.
The editor-in-chief is a well-known professor in a strong department. (http://www.acm.org/jacm/Editors.html)
Papers in theoretical computer science can be reviewed from their content.
Gold

Considerations (a conference proceedings)
Looks the same as the Journal of the ACM, but...
Procedures for selecting and reviewing conference papers are loosely controlled.
Papers in applications research are difficult to evaluate by superficial reading.
Not gold

Considerations (RFC 791)
The appearance looks like a draft.
Nothing technical from 1981 is current.
Who is DARPA anyway?
yet...
This is the official definition of IP.
http://www.ietf.org/rfc/rfc0791.txt?number=791
Gold

Considerations (arXiv)
The appearance looks like a joke.
The URL looks suspicious (strange spelling).
What's with the graphic?
yet...
This is the working literature of physics research.
http://arxiv.org
Gold

Current Quality Strategy 2: The Publisher as Creator
Materials are written by authors or selected by curators who are employed by the publisher.
Quality is tied to the reputation of the publisher.

Current Quality Strategy 3: External Readers Chosen by the Publisher
Publishers ask external experts to review materials.

Observations about Peer Review
At its best, it is superb. At its worst, it validates junk.
Some topics can be reviewed from a paper, e.g., mathematics.
Some topics cannot be reviewed from a paper, e.g., computer systems.
"Whatever you do, write a paper. Some journal will publish it."
– Advice to a young faculty member, University of Sussex, 1972

Current Quality Strategy 4: Independent Reviews
Reviewers, ideally independent of the author and publisher, describe their opinion of the item.
The value of the review to the user depends on (a) the reputation of where the review is published and (b) how well it is done.
Citation Analysis
• Understanding citation patterns among scholarly journals
  – Quality metric on journals (not on individual articles or scholars)
  – Cost/benefit analysis: which "basic" journals should a library have in its holdings?
• Eugene Garfield – "father of citation analysis"
• Science Citation Index
  – Origins circa the 1950s
  – Hand analysis of printed journals showing patterns of citations into and out of journals

Concepts – References and Citations
[Diagram: Doc1 references Doc2 and Doc3; Doc1 is cited by Doc2]
Doc1 references: (Doc2, Doc3)
Doc1 citations: (Doc2)

Concepts – References and Citations
• The number of references of a document is finite, stable, and easy to determine/compute
• The number of citations of a document is dynamic and cannot be computed exhaustively (potentially unbounded)
• Generally, references are at the work, or manifestation, level, NOT at the item level

Results of citation analysis
[Table of journal citation patterns, from Garfield, Science, 1972]

Citation analysis in the digital age
• Automatic citation linking among papers in arXiv
  – Citebase (Open Citations Project)
  – http://citebase.eprints.org/cgi-bin/search?submit=1&author=Hawking%2C%20S%20W%20
• Scientometrics – automation of methods reveals lots of data
  – Longevity of interest in a paper
  – Journal and ePrint citation patterns
• Automatic citation analysis as a reviewing tool?

Are papers downloaded then cited, or cited then downloaded? (2)
[Histogram: "What came first, the Citation or the Download?" – frequency vs. age of paper at download minus age of paper at citation]
• If all these time differences are plotted, the above graph is produced.
Acks: S. Harnad

Citation Latencies
[Histogram: frequency of citation latencies, 1992–1999 – citations vs. time difference in months, one curve per year]
• The raw data show that the latency of the citation peak has been decreasing over the period of the archive.
Acks: S. Harnad

Author Impact Quartiles

Quartile    Total    % Total   Citations   Papers   Mean Citations/Author/Paper   Deposits   Updates/Author
High 25%      798     2.09%     240,092     2,732   0.11                             6,720   0.48
Med 50%     9,262    24.20%     733,272    37,318   0.00212                         93,671   0.37
Low 25%    28,211    73.71%     251,925    67,951   0.000131                       165,971   0.27

• High-impact authors update more than medium- or low-impact authors
• High- and medium-impact authors deposit more papers than low-impact authors
Acks: S. Harnad

Citation Quality
[Chart: "Do Papers Cite Papers of Like Impact?" – number of citations by source impact vs. destination impact (low/medium/high)]
• Papers generally cite papers of like impact
Acks: S. Harnad

Citation Spread
[Histogram of citations per paper, by author impact – High (2.53%), Medium (34.55%), Low (62.92%); bins: no citations, 1 citation, 2–3, 4–6, 7–10, 11 or more citations; 30,000 papers were by authors with no citations]
• A small number of papers receive a very large number of citations
Acks: S. Harnad

How Paper Impact Affects Usage
[Chart: download frequency density vs. age of paper in days, for all papers – High (2.0%), Medium (7.7%), Low (46.5%), Unknown (39.6%)]
• Higher-impact papers have a longer download life expectancy.
Acks: S. Harnad

What is the correlation between citations and downloads?
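One way to answer this question is to compute Pearson's r between per-paper download counts and citation counts. A minimal sketch follows; the data here are synthetic and purely illustrative (they are not the Harnad dataset, and the coupling between the two variables is invented for the example):

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic stand-in for per-paper (downloads, citations) counts.
random.seed(0)
downloads = [random.randint(0, 500) for _ in range(1000)]
# Citations loosely coupled to downloads, plus uniform noise.
citations = [d // 50 + random.randint(0, 10) for d in downloads]

print(round(pearson_r(downloads, citations), 3))
```

With real data, r near zero (as for the low-impact papers below) indicates that downloads and citations move essentially independently.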
Download type                  r          n
All Papers                     0.11155    63,671
High Impact Papers (2.0%)      0.27293     1,981
Medium Impact Papers (7.7%)    0.01288     5,937
Low Impact Papers (46.5%)     -0.01412    30,163

• There is a significant positive correlation between citations and downloads for high-impact papers.
Acks: S. Harnad

Automatic Reviewing Techniques
• Traditional collaborative filtering
  – Estimate what score a reviewer might give to an item that he or she has not yet scored
  – Frequently used by recommender systems
  – Uses user profiles
• Collaborative quality filtering
  – http://www.cs.berkeley.edu/~tracyr/project/
  – Attempts to automatically determine which reviewers are "good" in an open reviewing system, in order to provide the same (or better) benefits as peer review

Collaborative Quality Filtering Algorithm
• Assume the true value of an item is the asymptotic average of its review scores
• Good reviewers are those who consistently predict this average
• Normalize according to the number of reviews of an item, the number of reviews by the reviewer, and review latency
• Adjust by "expertise"
  – Use similarity of term vectors of items reviewed

Annotation Systems
• Have worked successfully in many cases in the Web environment
  – Amazon
• Most successful when combined with reputation systems
  – eBay
• Problems with existing systems
  – Natural language
  – Closed/private systems
  – Non-extensible

Annotea: Open Web Infrastructure for Shared Web Annotations
• http://www.w3.org/2001/Annotea/
• Annotations as a class of metadata
• External to the document and stored on an annotation server
• Primitive RDF class: annotation
  – Sub-classed in various ways: Advice, Change, Example, Explanation, Question, See Also
• Ratings can be formally expressed and machine readable
• Annotations are stored in RDF databases on annotation servers that can be queried
• Information in multiple annotation servers can be merged

Annotea System Architecture
[Diagram: Annotea system architecture]

Annotea RDF data model
[Diagram: Annotea RDF data model]

Annotations in the NSDL
[Diagram: a search service harvests annotations via OAI from the org1 and org2 annotation services; a user asks, "I only want resources endorsed by org1"]
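The NSDL scenario above — merging annotations harvested from several servers, then filtering search results to resources a given organization endorses — can be sketched in a few lines. This is a hypothetical illustration, not Annotea's actual API: the server contents, resource URLs, and the `endorsed_by` helper are all invented for the example.

```python
# Each (hypothetical) annotation server maps a resource URL to a list of
# (annotator, annotation_type) pairs, as might be harvested via OAI.
org1_server = {
    "http://example.org/resA": [("org1", "Endorsement")],
    "http://example.org/resB": [("org1", "Question")],
}
org2_server = {
    "http://example.org/resB": [("org2", "Endorsement")],
    "http://example.org/resC": [("org2", "Endorsement")],
}

def merge_servers(*servers):
    """Merge annotation records from multiple servers into one map."""
    merged = {}
    for server in servers:
        for resource, annotations in server.items():
            merged.setdefault(resource, []).extend(annotations)
    return merged

def endorsed_by(merged, org):
    """Resources carrying an Endorsement annotation from the given organization."""
    return {res for res, anns in merged.items()
            if "Endorsement" in {t for a, t in anns if a == org}}

merged = merge_servers(org1_server, org2_server)
search_results = ["http://example.org/resA", "http://example.org/resB",
                  "http://example.org/resC"]
# "I only want resources endorsed by org1":
print([r for r in search_results if r in endorsed_by(merged, "org1")])
```

Note that resB is excluded even though org1 annotated it: the annotation was a Question, not an Endorsement — the filter keys on annotation type as well as annotator.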