
Terminology in Statistical Information Integration Tasks: What’s the Problem?


Presentation Transcript


  1. Terminology in Statistical Information Integration Tasks: What’s the Problem? Open Forum 2003 on Metadata Registries Thursday, January 23, 2003 2:00-2:45 pm Sheila O. Denn

  2. Introduction
  • Work undertaken as part of an NSF grant (EIA 0131824) to study the integration of data and interfaces, working toward a statistical knowledge network (SKN).
  • This talk focuses on results from the first phase of a metadata user study to determine what kinds of problems users have with terminology and metadata on government statistical web sites.

  3. Current Situation: each agency has its own backend data and provides its own intermediary (reports, tables, "planned" DB queries). The end user has little opportunity for interaction or active manipulation. The burden of finding information and integrating it across agencies (and occasionally within one agency) is on the user.
  [Diagram: several agency backends behind a firewall, each feeding its own intermediary of reports, tables, and "planned" DB queries; the end user is a generally passive reader with little interaction who must do all integration.]

  4. Goal: In the SKN, each agency has its own backend data, which feeds into a common public intermediary (PI) outside of the firewall. User interfaces link to the PI under user control.
  [Diagram: agency backends behind the firewall feed a public intermediary (variable/concept level, XML-based, a single point of access to information from all agencies); a statistical ontology, domain ontologies, domain experts, and end-user communities inform the PI; end users interact with data from an information/concept perspective, not an agency perspective.]
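To make the PI's role concrete, here is a minimal sketch of how a concept-level intermediary might fan one user query out to several agency backends. The names, record layout, and values are hypothetical illustrations, not the actual SKN design.

```python
# Hypothetical sketch only: a concept-level public intermediary (PI) that
# answers one concept query by checking every agency backend. All names and
# values below are invented placeholders.

AGENCY_BACKENDS = {
    "AgencyA": {"unemployment rate": {"value": 5.7, "unit": "percent"}},
    "AgencyB": {"unemployment rate": {"value": 5.9, "unit": "percent"},
                "median income": {"value": 42000, "unit": "USD"}},
}

def query_concept(concept: str) -> list:
    """Return every agency record matching a concept, tagged by source."""
    results = []
    for agency, variables in AGENCY_BACKENDS.items():
        record = variables.get(concept)
        if record is not None:
            results.append({"agency": agency, "concept": concept, **record})
    return results

# The user asks once, by concept; the PI does the per-agency legwork.
print(query_concept("unemployment rate"))
```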

  5. What kinds of problems does terminology cause for users?
  [Diagram: user terms and agency terms side by side, illustrating three problem types: a miss (a user term with no corresponding agency term), a term collision (Term-user vs. Term-agency), and a categorization mismatch (user term categories vs. agency term categories).]

  6. What kinds of problems does terminology cause for users?
  Misses
  • There is no agency term or concept that is linked to a term or concept that the user is interested in, or
  • The user encounters a term on the system with which she is unfamiliar or about which she has only a vague understanding.
  Examples:
  • Seasonal adjustment
  • Consumption vs. production
  • Farm profits vs. market value of agricultural products

  7. What kinds of problems does terminology cause for users?
  Collisions
  • A user has an understanding of a concept that differs from the way the concept is expressed by the agency.
  • The same term is used differently by different agencies, making integration of data difficult (illustrated in the sketch after this slide).
  • Can also apply to clusters of terms where the distinction between them is unclear.
  Examples:
  • Labor, labor force, labor supply, workforce, labor force participation rate, labor market
  • Full-time employment
  • Sector
  Categorization: when category groupings do not make sense to the user.
  Example:
  • Soybeans
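One way to see a collision operationally: record one sense of a term per agency and flag terms with divergent senses. This is an illustrative sketch; the registry structure and the entries are invented stand-ins, not real agency definitions.

```python
# Illustrative sketch: a registry that keeps one sense of a term per agency,
# so a "collision" (same term, different meanings) becomes detectable.
# All entries are invented placeholders, not actual agency definitions.

TERM_SENSES = {
    "labor force": {
        "AgencyA": "Persons 16 and over who are employed or seeking work.",
        "AgencyB": "Persons who worked for pay during the reference week.",
    },
    "sector": {
        "AgencyA": "An industry grouping within a classification scheme.",
    },
}

def has_collision(term: str) -> bool:
    """A term collides when two or more agencies define it differently."""
    senses = set(TERM_SENSES.get(term, {}).values())
    return len(senses) > 1

print(has_collision("labor force"))  # True: the two senses differ
print(has_collision("sector"))       # False: only one sense is registered
```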

  8. Data Collection
  • In previous work: transaction logs, user queries, interviews.
  • In the first phase of the current study: interviews with agency and non-agency domain experts.
  • These sources of evidence yielded categories of terms that can cause difficulty.

  9. Categories of Terms
  • Statistical terms
  • Date/currency/time
  • Geography
  • Domain terms
  • User terms

  10. Implications for Vocabulary Support Tools
  Goals:
  • Provide a basic level of statistical literacy.
  • Not intended to be a highly technical or comprehensive resource.
  • Include terms users frequently encounter while browsing statistical agency sites.
  Sources of evidence:
  • Terminology used on frequently visited pages
  • Anecdotal evidence from agency and non-agency consultants
  • Metadata user study
  • Web crawl of agency sites

  11. Implications for Vocabulary Support Tools
  • We need to explore how we can use metadata to map between user terms and agency terms, and between terms as used by different agencies (see the sketch after this slide).
  • Users are not likely to browse the glossary as a distinct activity, so they need "just-in-time" vocabulary support.
  • Vocabulary support should allow users to remain in context, not lose sight of the task they are working on.
  • Context specificity: explanations should be provided at varying levels of specificity:
    • General (context-free or "universal")
    • Agency or context-specific (term as used by a particular agency or within a particular domain)
    • Table or statistic-specific (term as it relates to a particular row, column, or statistic)
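A minimal sketch of the user-to-agency mapping idea, assuming a hand-built crosswalk: the user's everyday term is translated into the term each agency actually uses, so a query can be expanded per agency. All entries here are hypothetical.

```python
# Hypothetical crosswalk mapping a user's everyday term onto each agency's
# preferred term. Entries are illustrative, not real agency vocabulary.

CROSSWALK = {
    "jobs": {"AgencyA": "employment level", "AgencyB": "employment status"},
    "pay":  {"AgencyA": "average hourly earnings", "AgencyB": "earnings"},
}

def agency_terms(user_term: str) -> dict:
    """Translate a user term into the term each agency actually uses."""
    return CROSSWALK.get(user_term.strip().lower(), {})

print(agency_terms("Jobs"))
# {'AgencyA': 'employment level', 'AgencyB': 'employment status'}
```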

  12. Implications for Vocabulary Support Tools
  • Provide explanations of a term or concept that are as relevant to the user's current context as possible.
  • The most specific explanations available should be offered at the time a user first invokes help (see the sketch after this slide).
  • If there are no explanations appropriate for a specific statistic, row, or column, offer an explanation one level up in generality.
  • Pathways from specific to general will be based on a statistical ontology currently under development.
  • The ontology will also be used to provide patterns (templates) for definitions at each level of specificity.
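The fallback behavior described above can be sketched directly: try the statistic-level explanation first, then the agency level, then the general definition. The levels and entries below are illustrative placeholders, not the ontology under development.

```python
# Sketch of the specific-to-general fallback: walk from the most specific
# explanation level toward the general one until something is found.
# Levels and entries are illustrative placeholders.

LEVELS = ["statistic", "agency", "general"]  # most to least specific

EXPLANATIONS = {
    ("seasonal adjustment", "agency"):
        "Placeholder agency-level note on this agency's adjustment method.",
    ("seasonal adjustment", "general"):
        "A method of removing predictable seasonal swings from a data series.",
}

def explain(term, start_level="statistic"):
    """Walk up the specificity ladder until some explanation is found."""
    for level in LEVELS[LEVELS.index(start_level):]:
        text = EXPLANATIONS.get((term, level))
        if text:
            return level, text
    return None

# No statistic-level entry exists, so this falls back to the agency level.
print(explain("seasonal adjustment"))
```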

  13. Vocabulary Support Tool Examples
  • The tools we are working on will provide a basic level of explanation of statistical terms.
  • Tools may include:
    • Definitions
    • Examples
    • Brief tutorials
    • Demonstrations
    • Interactive simulations
    • Pointers to related terms/concepts
    • Pointers to more complete (or more technical) explanations

  14. Index
  An index combines numbers measuring different things into a single number. The single number represents all the different measures in a compact, easy-to-use form. Values for an index can be compared to each other, for example, over time.
  [Diagram: six input measures (6, 24.7, 59, 103, 42, 10.1) flow into a "combiner," which outputs index = 12.3.]
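The slide leaves the combining rule unspecified; a weighted average is one common choice, sketched here with invented weights (so the result will not match the slide's 12.3, which comes from an unstated rule).

```python
# Weighted-average combiner: many measures in, one index value out.
# The weights are illustrative only.

def combine(values, weights):
    """Combine several measures into a single index value."""
    assert len(values) == len(weights) and sum(weights) > 0
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

measures = [6, 24.7, 59, 103, 42, 10.1]      # the six inputs on the slide
weights  = [0.3, 0.2, 0.05, 0.05, 0.1, 0.3]  # invented weights, sum to 1.0
print(round(combine(measures, weights), 1))  # prints 22.1 with these weights
```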

  15. [Diagram: combiners for Jan., Apr., Jul., and Oct. produce index values 14.3, 13.1, 12.3, and 13.9.] The index has increased this year.

  16. Consumer Price Index (CPI)
  • The Consumer Price Index (CPI) represents changes in the prices of goods and services purchased for consumption by urban households. It combines prices into a single number that can be compared over time.
  • Items are classified into 8 major groups:
    • Food and Beverages
    • Housing
    • Apparel
    • Transportation
    • Medical Care
    • Recreation
    • Education and Communication
    • Other
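A CPI-style combination of the eight groups can be sketched the same way as the generic combiner above. The weights and price changes below are invented; actual CPI weights come from consumer expenditure data and are not given on the slide.

```python
# Illustrative only: weighted combination of the eight major CPI groups.
# Weights and price changes are invented placeholders.

WEIGHTS = {  # hypothetical percent weights, summing to 100
    "Food and Beverages": 15.0, "Housing": 42.0, "Apparel": 3.0,
    "Transportation": 17.0, "Medical Care": 7.0, "Recreation": 6.0,
    "Education and Communication": 6.0, "Other": 4.0,
}

price_change = {group: 2.0 for group in WEIGHTS}  # pretend all groups rose 2%
price_change["Medical Care"] = 5.0                # except medical care

all_items = sum(WEIGHTS[g] * price_change[g] for g in WEIGHTS) / 100
print(f"All-items change: {all_items:.2f}%")      # 2.21% with these inputs
```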

  17. [Diagram: the eight major groups (food & beverage, housing, apparel, transportation, medical care, recreation, education & communication, other) feed the CPI combiner, which outputs the Consumer Price Index.]

  18. [Diagram: annual CPI combiners for 1997, 1998, 1999, 2000, and 2001.] The Consumer Price Index has increased since 1995.

  19. Antiknock Index, also known as Octane Rating
  A number used to indicate gasoline's antiknock performance in motor vehicle engines. The two recognized laboratory engine test methods for determining the antiknock (octane) rating of gasolines are the Research method and the Motor method. In the United States, the antiknock index (R+M)/2, the average of the Research and Motor octane numbers, was developed to provide a single number as guidance to the consumer.
  Source: http://www.eia.doe.gov/glossary/glossary_a.htm

  20. Antiknock Index, also known as Octane Rating
  [Diagram: the Research method and Motor method feed the antiknock combiner, which outputs (R + M)/2.]
  • Regular: 85-88
  • Midrange: 88-90
  • Premium: 90 or above
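A worked example of the slide's formula: the antiknock index is the average of the Research (R) and Motor (M) octane numbers, graded with the ranges shown above. The R and M values here are hypothetical lab results.

```python
# Antiknock index per the slide: average of Research and Motor octane numbers,
# then graded using the ranges listed above. Inputs are hypothetical.

def antiknock_index(research: float, motor: float) -> float:
    return (research + motor) / 2

def grade(index: float) -> str:
    if index >= 90:
        return "Premium"
    if index >= 88:
        return "Midrange"
    return "Regular"  # the slide lists Regular as 85-88

r, m = 92.0, 84.0            # hypothetical lab results
idx = antiknock_index(r, m)  # (92 + 84) / 2 = 88.0
print(idx, grade(idx))       # 88.0 Midrange
```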

  21. Evaluation
  What do we need to evaluate?
  • Technical accuracy
  • Usability of interface
  • "Effectiveness":
    • Is it attractive enough to entice people to use it?
    • Is it helpful?
    • Is it informative?
    • Does it help the user in completion of task?
  How do we measure these things? What other kinds of vocabulary support issues do we need to address?

  22. Other Issues
  • Implementation
  • Ongoing maintenance/responsibility

  23. Project Teams
  Metadata User Study Team: Carol Hert, Stephanie Haas, Jenny Fry, Lydia Harris, Sheila Denn
  Vocabulary Support Team: Stephanie Haas, Ron Brown, Cristina Pattuelli, Jesse Wilbur
  GovStat PIs:
  • Gary Marchionini, UNC-CH
  • Stephanie Haas, UNC-CH
  • Carol Hert, Syracuse
  • Catherine Plaisant, UMd
  • Ben Shneiderman, UMd

  24. For More Information
  Sheila O. Denn
  School of Information and Library Science
  University of North Carolina at Chapel Hill
  denns@ils.unc.edu
  http://ils.unc.edu/govstat/
