Informetrics IR - PowerPoint PPT Presentation
PowerPoint Slideshow about 'Informetrics IR' - reid
Presentation Transcript
Informetrics & IR

  • Presentation

  • Readings Discussion & Review

  • Projects & Papers


Why use metrics?

  • Apply theory from another field to solve IS problems

  • We need new modeling techniques or metaphors to examine these complex systems

  • An attempt to apply some new models and metaphors to complex systems

  • Bibliometrics

    • Direct Citation Counting

    • Bibliographic Coupling

    • Co-Citation Analysis

    • Bibliometric Laws

      • Web Servers

      • Server Log

      • Log Analysis


    How do Informetrics impact IR?

    • Measures of:

      • Content & subject area

      • Relationships

      • Use & popularity

    • An information-based view of communications, focused on documents

      • Instead of the text in a document, focus on the document properties (metadata?)

        • Author(s)

        • Dates

        • Publication source(s)

        • Front Matter: Titles & Contact info

        • Back Matter: Citations & Support


    What are these metrics?

    • Bibliometrics

      • “series of techniques that seek to quantify the process of written communication.” Ikpaahindi

      • counting and analyzing citations

      • consistently observable patterns

      • referenced in key places: Science Citation Index, Social Science Citation Index, Arts and Humanities Citation Index

    • Webometrics

      • Applying bibliometric methods to Web pages & Web sites

    • Informetrics

      • Wider scale application of methods to networked information sources


    Citing & Linking

    • paying homage to pioneers

    • giving credit for related work (homage to peers)

    • identifying methodology, equipment, etc.

    • background reading

    • correcting one’s own work

    • correcting the work of others

    • criticizing previous work

    • substantiating claims

    • alerting to forthcoming work

    • providing leads to poorly disseminated, poorly indexed, or un-cited work

    • authenticating data and classes of fact - physical constants, etc.

    • identifying original publications in which an idea or concept was discussed

    • identifying the original publication or other work describing an eponymic concept or term (e.g., Hodgkin’s Disease)

    • disclaiming work or ideas of others (negative claims)

    • disputing priority claims of others (negative homage)


    Direct Citation Counting

    • How many citations over a given period of time.

    • Impact formula:

      • n citations to the journal / n citable articles published

    • Immediacy index:

      • n citations received during the year by articles published that year / total number of citable articles published that year
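
Both measures above are simple ratios, which a short Python sketch can make concrete. The function names and the counts below are hypothetical, chosen only for illustration; they are not taken from any citation index.

```python
def impact_factor(journal_citations: int, citable_articles: int) -> float:
    """Citations to the journal divided by citable articles published."""
    if citable_articles <= 0:
        raise ValueError("need at least one citable article")
    return journal_citations / citable_articles


def immediacy_index(same_year_citations: int, articles_this_year: int) -> float:
    """Citations received during the year by that year's articles,
    divided by the number of citable articles published that year."""
    if articles_this_year <= 0:
        raise ValueError("need at least one citable article")
    return same_year_citations / articles_this_year


# Hypothetical journal: 300 citations to 120 citable articles,
# and 30 same-year citations to 60 articles published this year.
print(impact_factor(300, 120))    # 2.5
print(immediacy_index(30, 60))    # 0.5
```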


    Bibliographic Coupling

    • “a number of papers bear a meaningful relation to each other when they have one or more references in common” Kessler

    • What’s the Web equivalent?
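
Kessler's definition can be read as a set intersection: the coupling strength of two papers is the number of references they share. The sketch below assumes each paper is represented by a set of reference identifiers; the paper names and references are made up. One possible Web reading (an assumption, not a claim from the slides) is to treat a page's outbound links as its reference list.

```python
from itertools import combinations


def coupling_strengths(references: dict[str, set[str]]) -> dict[tuple[str, str], int]:
    """Map each pair of papers to the number of references they have in common."""
    return {
        (p, q): len(references[p] & references[q])
        for p, q in combinations(sorted(references), 2)
    }


# Hypothetical papers and their reference lists.
references = {
    "paper_a": {"r1", "r2", "r3"},
    "paper_b": {"r2", "r3", "r4"},
    "paper_c": {"r5"},
}

strengths = coupling_strengths(references)
print(strengths[("paper_a", "paper_b")])  # 2 shared references
print(strengths[("paper_a", "paper_c")])  # 0 shared references
```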


    Co-Citation Analysis

    • If two references are cited together in later literature, the two references are themselves related. The greater the number of times they are cited together, the greater their co-citation strength. (Proposed independently by Marshakova and Small, 1973.)

    • How about Web citations?

    • What’s a set of Web pages? A Site, a long page?
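
Co-citation runs in the opposite direction from coupling: instead of comparing two reference lists, count how often two items appear together in the reference lists of later documents. A minimal sketch, with hypothetical citing documents; for the Web, a "citing document" could be a page or a whole site, which is exactly the open question raised above.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing: dict[str, list[str]]) -> Counter:
    """Count, for every pair of cited items, how many citing documents cite both."""
    counts: Counter = Counter()
    for cited in citing.values():
        # Deduplicate and sort so each pair is counted once per citing document.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical later documents and what they cite.
citing = {
    "later_1": ["x", "y", "z"],
    "later_2": ["x", "y"],
    "later_3": ["y", "z"],
}

counts = cocitation_counts(citing)
print(counts[("x", "y")])  # 2
print(counts[("x", "z")])  # 1
```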


    Finer Points

    • Classification of references:

      • is the reference conceptual or operational

      • is the reference organic or perfunctory

      • is the reference evolutionary or juxtapositional (built on a preceding or an alternative to it)

      • is the reference confirmative or negational

  • Citation reference errors:

    • multiple authors (non-primary authors or “et al.”): what contribution or influence does the order of names imply?

    • self-citations

    • like-names, initial/full names, different fields

    • field variation of citation amounts/purposes

    • fluctuation of influence/use

    • typos


    Bibliometric Laws

    • Seek to describe the workings of science by mathematical means. In general, they find that a few entities account for most of the citations.

      • Bradford’s Law of Scattering

      • Lotka’s Law

      • Zipf’s Law


    Bradford’s Law of Scattering

    • How literature on a subject is distributed across journals.

      • “If scientific journals are arranged in order of decreasing productivity of articles on a given subject, they may be divided into a nucleus of periodicals more particularly devoted to the subject and several other groups of zones containing the same number of articles as the nucleus.”

        • 9 journals had 429 articles, the next 59 had 499, the last 258 had 404.

    • Bradford discovered this regularity by calculating the number of titles in each of the three groups: 9 titles, 9x5 titles, 9x5x5 titles.

      • Can be influenced by sample size, area of specialization and journal policies.
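
To see how closely the quoted figures follow the idealized 9 : 9x5 : 9x5x5 pattern, the zone sizes can be compared directly. This is only an arithmetic check on the numbers given above; the function name is hypothetical.

```python
def ideal_zone_sizes(nucleus: int, multiplier: int, n_zones: int) -> list[int]:
    """Journal counts per zone if each zone is `multiplier` times the previous one."""
    return [nucleus * multiplier**k for k in range(n_zones)]


# (journals, articles) per zone, as quoted on the slide.
observed = [(9, 429), (59, 499), (258, 404)]

ideal = ideal_zone_sizes(nucleus=9, multiplier=5, n_zones=3)
ratios = [observed[i + 1][0] / observed[i][0] for i in range(len(observed) - 1)]

print(ideal)                          # [9, 45, 225]
print([round(r, 2) for r in ratios])  # [6.56, 4.37]
```

The observed multipliers (about 6.6 and 4.4) bracket the idealized value of 5, and the article counts per zone are only roughly equal, which fits the caveat about sample size and journal policies.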


    Brookes on Bradford’s Formula

    • “The index terms assigned to documents also follow a Bradford distribution because those terms most frequently assigned become less and less specific and therefore increasingly ineffective in retrieval.”


    Bradford’s Formula Itself

    • Bradford’s Formula makes it possible to estimate how many of the most productive sources would yield any specified fraction p of the total number of items. The formula is:

    • $R(n) = N \log(n/s)$, for $1 \le n \le N$

    • where $R(n)$ = cumulative total of items contributed by the sources of rank 1 to $n$.

    • $N$ = total number of contributing sources

    • $s$ = a constant characteristic of the literature

    • then

    • $R(N) = N \log(N/s)$

    • is the total number of items contributed by $N$ sources.
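
Reading the formula as R(n) = N log(n/s), the number of top-ranked sources needed for a fraction p of all items follows by setting R(n) = p R(N), which gives log(n/s) = p log(N/s) and so n = s (N/s)^p. The sketch below implements that rearrangement. The value of s is a made-up placeholder (the slides do not give one), and N = 326 is simply the sum of the journal counts quoted for Bradford's three zones.

```python
import math


def cumulative_items(n: float, N: int, s: float) -> float:
    """R(n) = N * log(n / s); only meaningful for n > s."""
    return N * math.log(n / s)


def sources_for_fraction(p: float, N: int, s: float) -> float:
    """Solve R(n) = p * R(N) for n, giving n = s * (N / s) ** p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return s * (N / s) ** p


N = 326   # 9 + 59 + 258 journals from the Bradford example
s = 2.0   # hypothetical constant, for illustration only

n_half = sources_for_fraction(0.5, N, s)
share = cumulative_items(n_half, N, s) / cumulative_items(N, N, s)
print(round(n_half, 1), round(share, 3))
```

Because only the ratio R(n)/R(N) is used, the result does not depend on the base of the logarithm.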


    More Bradford’s Law

    • Citations originally counted year by year can be expressed as the geometric sequence:

    • $R, Ra, Ra^2, Ra^3, Ra^4, \ldots, Ra^{t-1}$

      • where $R$ = presumed number of citations during the first year, some of which do not immediately emerge in publication. Because $a < 1$, the sum of the sequence converges to the finite limit $R/(1-a)$.
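
The convergence claim can be checked numerically by summing the first t terms and comparing with R/(1-a). The values of R and a below are hypothetical.

```python
def cumulative_citations(R: float, a: float, t: int) -> float:
    """Sum of the first t terms R, R*a, ..., R*a**(t-1)."""
    return sum(R * a**k for k in range(t))


def citation_limit(R: float, a: float) -> float:
    """Limit of the sum as t grows, valid only for 0 <= a < 1."""
    if not 0 <= a < 1:
        raise ValueError("the series converges only for 0 <= a < 1")
    return R / (1 - a)


R, a = 100, 0.5  # hypothetical first-year citations and annual decay factor
print(cumulative_citations(R, a, 3))   # 175.0
print(citation_limit(R, a))            # 200.0
```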


    Lotka’s Law

    • An inverse square law: for every 100 authors contributing one article, 25 will contribute 2, 11 will contribute 3, and 6 will contribute 4.

    • The formula is $1 : n^2$.

    • Voos found $1 : n^{3.5}$ for Info Science (1974).

    • What are other similar analysis tasks you could use Lotka’s law for?

    • Are users, browsers, bloggers like authors?
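
The 100 / 25 / 11 / 6 figures follow directly from dividing the number of single-article authors by n squared. A small sketch, with the exponent left as a parameter so that Voos's 3.5 can be tried as well; the function name is hypothetical.

```python
def lotka_authors(single_article_authors: float, n: int, exponent: float = 2.0) -> float:
    """Expected number of authors contributing n articles under a 1 : n**exponent law."""
    return single_article_authors / n**exponent


for n in range(1, 5):
    print(n, round(lotka_authors(100, n), 2))
# 1 100.0
# 2 25.0
# 3 11.11
# 4 6.25

# Voos's exponent for information science gives a much steeper drop-off.
print(round(lotka_authors(100, 2, exponent=3.5), 2))  # 8.84
```

The same function could be pointed at counts of posts per blogger or requests per user to explore the questions above, though whether those populations behave like authors is an open question.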


    Zipf’s Law

    • The distribution which applied to word frequency in a text states that the nth ranking word will appear k/n times, where k is a constant for that text.

      • It is easier to choose and use familiar words, so the probability of occurrence of familiar words is higher: $r f = C$, where $r$ is rank and $f$ is frequency.

      • This can be applied by counting all of the words in a document (minus some words in a stop list - common words (the, therefore...)) with the most frequent occurrences representing the subject matter of the document. Could also use relative frequency (more often than expected) instead of absolute frequency.
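
The counting procedure described above (tokenize, drop stop-list words, rank by frequency) fits in a few lines. The stop list and the sample text here are tiny, made-up placeholders; a real analysis would use a fuller stop list and a much longer document.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "therefore"}


def ranked_frequencies(text: str, stop_words: set[str] = STOP_WORDS) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, excluding stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common()


sample = (
    "the citation links the citation index to the web "
    "and the web to the index of the citation"
)

ranked = ranked_frequencies(sample)
for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq would stay roughly constant in a large text.
    print(rank, word, freq, rank * freq)
```

On a sample this short the rank-frequency product is not expected to be constant; the point is only to show the mechanics.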


    Wyllys on Zipf’s Law

    • Surprisingly constrained relationship between rank and frequency in natural language.

    • Zipf held that the fundamental reason for human behavior is the striving to minimize effort.

    • Mandelbrot - further refinement of Zipf’s law: $(r+m)^B f = c$, where $r$ is the rank of a word, $f$ is its frequency, and $m$, $B$ and $c$ are constants dependent on the corpus. $m$ has the greatest effect when $r$ is small.
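
Solving Mandelbrot's form for frequency gives f = c / (r + m)^B. The sketch below uses hypothetical constants to illustrate the remark that m matters most at small ranks: adding m = 2 cuts the predicted frequency at rank 1 by about two thirds, but by only about 2% at rank 100.

```python
def mandelbrot_frequency(r: int, c: float, m: float, B: float) -> float:
    """Predicted frequency of the word at rank r: f = c / (r + m) ** B."""
    return c / (r + m) ** B


def relative_drop(r: int, c: float, m: float, B: float) -> float:
    """Fractional reduction in predicted frequency caused by m, relative to m = 0."""
    plain = mandelbrot_frequency(r, c, 0.0, B)
    shifted = mandelbrot_frequency(r, c, m, B)
    return 1 - shifted / plain


# Hypothetical corpus constants.
c, m, B = 1000.0, 2.0, 1.0

print(round(relative_drop(1, c, m, B), 3))    # 0.667
print(round(relative_drop(100, c, m, B), 3))  # 0.02
```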


    Optimum utility of articles?

    • The most compact library is not the least costly: articles are discarded more quickly, so more must be bought.

    • Conversely, if fewer articles are acquired and kept longer, more shelf space and maintenance are needed.

    • the challenge is to keep the most frequently accessed available.


    Goffman’s Theory

    • His General Theory of Information Systems

    • Ideas are “endemic” with minor outbreaks occurring from time to time. Cycles of use. Like memes and paradigm shifts (Kuhn). Based on epidemiology and Shannon’s communications theory.


    Online Article Life

    • Burton proposed a measure for the decay in citations to older literature, a “half-life”

      • How is this different on the net?

        • a shorter life?

        • older sites referred less, more?

        • commercial sites vs. private sites.

        • advertised vs word of mouth?

        • linked from popular pages?
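
One way to make "half-life" concrete is to assume that citations (or page requests) decay geometrically, with each year receiving a fixed fraction a of the previous year's count. Under that assumption, which is a simplification and not necessarily Burton's own procedure, the half-life is the t that solves a^t = 0.5. The decay factors below are hypothetical.

```python
import math


def half_life_years(a: float) -> float:
    """Years until the annual count halves, assuming geometric decay by factor a."""
    if not 0 < a < 1:
        raise ValueError("decay factor must be strictly between 0 and 1")
    return math.log(0.5) / math.log(a)


print(round(half_life_years(0.9), 2))  # slow decay, e.g. a print literature
print(round(half_life_years(0.5), 2))  # fast decay, e.g. a short-lived Web page
```

Comparing estimated decay factors for commercial versus private sites, or for pages linked from popular hubs, would be one way to approach the questions listed above.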


    Price’s Law

    • “half of the scientific papers are contributed by the square root of the total number of scientific authors”

  • Leads to:

    • bibliographic coupling - the number of references two papers have in common serves as a measure of their similarity; clustering based on this measure yields meaningful groupings of papers for information retrieval.
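
Price's law as quoted on this slide is easy to put into numbers: in a field with N authors, roughly the square root of N of them account for half the papers. A minimal sketch with hypothetical field sizes:

```python
import math


def price_core_size(total_authors: int) -> float:
    """Number of authors expected to contribute half of all papers (sqrt of N)."""
    if total_authors < 0:
        raise ValueError("author count cannot be negative")
    return math.sqrt(total_authors)


for n_authors in (100, 10_000):
    core = price_core_size(n_authors)
    # The share shrinks as the field grows: 10% of 100 authors, 1% of 10,000.
    print(n_authors, core, f"{core / n_authors:.0%}")
```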


    Cumulative advantage model

    • Price noticed this advantage

    • Success breeds success. This also implies that an obsolescence factor is at work. If you get mentioned a lot, you get mentioned in more and more cited papers.

    • Polya describes this as “contagion”
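
"Success breeds success" can be simulated with an urn-style process: each new citation goes to a paper with probability proportional to the citations it already has. This is a generic sketch of that idea with made-up parameters, not a reconstruction of Price's or Polya's exact models.

```python
import random


def simulate_cumulative_advantage(n_papers: int, n_citations: int, seed: int = 0) -> list[int]:
    """Distribute citations one at a time, favoring already-cited papers.

    Every paper starts with a count of 1 so that each has a nonzero chance.
    """
    rng = random.Random(seed)
    counts = [1] * n_papers
    for _ in range(n_citations):
        # Probability of receiving the next citation is proportional to the current count.
        winner = rng.choices(range(n_papers), weights=counts, k=1)[0]
        counts[winner] += 1
    return counts


counts = simulate_cumulative_advantage(n_papers=20, n_citations=1000)
print(sorted(counts, reverse=True))
```

Runs of this kind typically end with a few papers holding a large share of the citations, which is the skew the bibliometric laws describe.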


    Bibliometrics on the Web

    • We can use these techniques, rules and formulas to analyze Web usage.

      • Like a bibliometric index for historical analysis.

    • Key question: are citations like page browsing/using?

    • Using Web Servers Effectively

    • Server Logs give us much data to mine

    • Studies on the Web


    Understanding the Web

    • User-based data collection

    • Surveys

      • GVU, Nielsen and GNN

        • Qualitative questions

          • phone

          • web forms

        • Self-selected sample problems

          • random selection

          • oversample


    Understanding the Web

    • Web Servers

      • Serve:

        • text

        • graphics

        • CGI

        • XMLHttpRequest (REST, AJAX)

        • Web services (SOAP)

      • other MIME types

    • Server Logs represent this activity

      • A lot of empirical, quantitative data on use


    Problems with Web Servers

    • Not as Foolproof as Print

    • No State Information

      • Interaction with Web pages or Web apps is difficult to log & analyze

    • Server Hits not Representative

      • Counters inaccurate

      • Different, non-HTTP requests & effects

    • Floods/Bandwidth can Stop “intended” usage

    • Robots, Spam, (D)DoS, Caching, etc.


    Web Server Records

    • Server-based

    • Proxy-based

    • Client-based

    • Network-based


    Clever Web Content Setup

    • unique file and directory names

    • clear, consistent structure

    • FTP server for file transfer

      • frees up logs and server!

    • Judicious use of links

    • Wise MIME types

      • some hard/impossible to log


    Clever Web Server Setup

    • Redirect CGI to find referrer

    • Use a database

      • store web content

      • record usage data

    • create state information with programming

      • NSAPI

      • ActiveX

    • Have contact information

    • Have purpose statements

    • Bibliometric Servlets?


    Managing Log Files

    • Backup

    • Store Results or Logs?

    • Beginning New Logs

    • Posting Results


    Log File Format

    • see Appendix

    • key advantage:

      • computer storage cost decreases while paper cost rises

    • every server generates slightly different logs
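
Since the appendix with the actual format is not part of this transcript, the sketch below assumes the widely used NCSA Common Log Format (host, identity, user, timestamp, request line, status, bytes). The sample lines are invented, and the way statuses are grouped into successful, redirected and failed is a simplification that a real analysis tool may handle differently.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD path protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarize(lines: list[str]) -> dict[str, int]:
    """Tally requests by outcome, plus distinct hosts, distinct files and bytes sent."""
    hosts: set[str] = set()
    files: Counter = Counter()
    summary = {"successful": 0, "redirected": 0, "failed": 0, "corrupt": 0, "bytes": 0}
    for line in lines:
        match = CLF_PATTERN.match(line)
        if match is None:
            summary["corrupt"] += 1  # line does not fit the assumed format
            continue
        hosts.add(match["host"])
        status = int(match["status"])
        if 200 <= status < 300:
            summary["successful"] += 1
            files[match["path"]] += 1
            if match["size"] != "-":
                summary["bytes"] += int(match["size"])
        elif 300 <= status < 400:
            summary["redirected"] += 1
        else:
            summary["failed"] += 1
    summary["distinct_hosts"] = len(hosts)
    summary["distinct_files"] = len(files)
    return summary


# Invented sample lines (addresses are from the documentation range 192.0.2.0/24).
sample_log = [
    '192.0.2.1 - - [02/Dec/2003:23:59:01 -0600] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [02/Dec/2003:23:59:05 -0600] "GET /missing.html HTTP/1.0" 404 -',
    '192.0.2.1 - - [02/Dec/2003:23:59:09 -0600] "GET /old HTTP/1.0" 301 -',
    'not a log line',
]

result = summarize(sample_log)
print(result)
```

Because every server generates slightly different logs, the regular expression is the part most likely to need adjusting.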


    Extended Log File Formats

    • WWW Consortium Standards

    • Will automatically record much of what is programmatically done now.

      • faster

      • more accurate

      • standard baselines for comparison

      • graphics standards


    Log Analysis Tools

    • Analog

    • WWWStat

    • GetStats

    • Perl Scripts

    • Commercial Tools


    Log Analysis Cumulative Sample

    Program started at Tue-03-Dec-2006 01:20 local time.

    Analysed requests from Thu-28-Jul-2003 20:31 to Mon-02-Dec-2003 23:59 (858.1 days).

    Total successful requests: 4 282 156 (88 952)

    Average successful requests per day: 4 990 (12 707)

    Total successful requests for pages: 1 058 526 (17 492)

    Total failed requests: 88 633 (1 649)

    Total redirected requests: 14 457 (197)

    Number of distinct files requested: 9 638 (2 268)

    Number of distinct hosts served: 311 878 (11 284)

    Number of new hosts served in last 7 days: 7 020

    Corrupt logfile lines: 262

    Unwanted logfile entries: 976

    Total data transferred: 23 953 Mbytes (510 619 kbytes)

    Average data transferred per day: 28 582 kbytes (72 946 kbytes)


    Downie and Web Usage

    • User-based analyses

      • who

      • where

      • what

    • File-based analyses

      • amount

    • Request analyses

      • conform (loosely) to Zipf’s Law

    • Byte-based analyses


    Neat Bibliometric Web Tricks

    • use a search engine to find references

      • “link:www.ischool.utexas/~donturn”

        • key to using unique names

      • use many engines

        • update times different

        • blocking mechanisms are different

    • use Google News (and the like)

      • look for references

      • look for IP addresses of users


    Neat Tricks, cont.

    • Walking up the Links

      • follow URLs upward

    • Reverse Sort

      • look for relations

    • Use your own robot to index

      • test
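
"Walking up the links" can be read as trimming one path segment at a time from a referring URL to find the pages above it. A minimal sketch using the standard library; the example URL is a placeholder, and whether each parent URL actually resolves would still have to be checked by fetching it.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url: str) -> list[str]:
    """Return the directory URLs above `url`, nearest parent first, ending at the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"  # directory-style URL for every level except the root
        # Query string and fragment are dropped on purpose.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


for parent in parent_urls("http://example.org/a/b/page.html"):
    print(parent)
# http://example.org/a/b/
# http://example.org/a/
# http://example.org/
```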


    Projects

    • capture current and previous user information seeking behavior and modify interface and content to meet needs

    • Dynamic Web Publishing System

      • anticipate information seeking behavior

      • based on recorded preferences and pre-supplied rules, generate and guide users through a document space.


    Summary

    • Bibliometrics, now Informetrics

      • Bradford’s - distribution of documents in a specific discipline

      • Lotka’s - number of authors of varying productivity

      • Zipf’s - word frequency rankings

  • The Web

    • out of control in growth = opportunities

    • wise setup can help

    • use good analysis tools


    Projects & Papers

    • Does everyone have a topic or project?

    • Let’s talk more (via email too) about ideas and projects

