Concordancing the Web with KWiCFinder. William H. Fletcher, United States Naval Academy. American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001.

Concordancing the Web with KWiCFinder

William H. Fletcher

United States Naval Academy

American Association for Applied Corpus Linguistics

Third North American Symposium on

Corpus Linguistics and Language Teaching, Boston, MA,

23-25 March 2001

How Big is the Web?

  • Now 2-4 billion webpages accessible via public links (Cyveillance estimates and projection, July 2000; Inktomi estimates are more modest.)

  • “Invisible web” / restricted sites several times larger

  • Estimated 80%-95% content in English, but…

  • Since mid 2000, non-Anglophones outnumber English speakers online

  • Anglophones projected to be < 30% of 850 million users in 2005

  • Percentage of new users fluent in English decreasing

  • For many regions / languages, still no data available

Search Purposes

  • General users typically seek…

    • a specific site

    • any well-stocked site meeting their needs

  • Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources

  • Educators want to foster similar online research behavior in their students

Typical Search Behaviors

  • Marked preference for directories with pre-selected links organized by topic over full-text search engines

  • Simple queries – single word or phrase – predominate (80%-90%)

  • 10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed

  • Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link

Typical Search Outcomes

  • Users follow up only first few links, then settle on a page after browsing from these

  • Usual outcome is a match, not the best match

Ways to Use the Web for Instruction and Research

  • Micro level

    • Discover eloquent examples

    • Verify current / possible usage, with rough indication of prevalence

    • Acquire vocabulary not (yet) in dictionaries

      • Timeliness is essential -- “off-the-shelf corpora” often cannot help here!

    • Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)

Ways to Use the Web for Instruction and Research (2)

  • Macro level

    • Find authentic texts accessible to students

    • Locate relevant online resources for research projects

      • Student reports

      • Scholarly research

Impediments to Finding Relevant Resources Online

  • Reliance on commercial search engines (SEs) essential due to Web’s size

  • SEs’ priorities match ours only by coincidence

  • Link rot

    • Pages move or disappear

    • Page content changes

Challenges to Responsible Research

  • Online there is too much ephemeral content of unknown reliability

    • Preponderance of journalistic, commercial and personal texts of unknown authorship and authority

    • Details of sources and research methodology haphazard

    • Even student papers (gasp) and machine translated texts (groan choke)

Challenges to Responsible Research (2)

  • Representativity of Web as Corpus

    • Much ill-formed or fragmentary language

    • Domain only a rough clue to provenance

  • Numbers vs. Statistics

    • Search engines report the number of pages matching a query, not the number of actual citations

    • One page may contain alternate usages

    • Narrower filters may eliminate some pages
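
The difference between page counts and citation counts can be illustrated with a minimal Python sketch. It assumes the pages have already been downloaded as plain strings; the function name `page_and_citation_counts` and the sample data are hypothetical and are not part of KWiCFinder.

```python
import re

def page_and_citation_counts(pages, variants):
    """For each variant, count matching pages and total occurrences (citations)."""
    counts = {}
    for variant in variants:
        # Whole-word, case-insensitive match of the literal variant.
        pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        hits = [len(pattern.findall(text)) for text in pages]
        counts[variant] = {
            "pages": sum(1 for h in hits if h),  # what a search engine reports
            "citations": sum(hits),              # what a concordancer counts
        }
    return counts

# Hypothetical sample: the first page contains both usages.
SAMPLE_PAGES = [
    "Shop online today. Our on-line catalogue is also online.",
    "Register on-line.",
    "No relevant usage here.",
]
```

On this sample, "online" appears on one page but in two citations, while "on-line" appears on two pages with one citation each, so ranking the variants by page count alone would reverse their order by actual frequency.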

Webidence as Evidence

Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.

The Web is not a corpus in the classical sense…

…but it does offer an inexhaustible body of linguistic and cultural information for research and use.

Why KWiCFinder?

  • Automate process of search and retrieval

  • Expedite evaluation of webpages

  • Provide specific enhancements for foreign language users and linguists

  • Encourage students and colleagues to take full advantage of online resources

Why AltaVista?

  • All words are indexed, including "stopwords"

  • Distinguishes case and "special characters"

  • Supports Boolean operators, bracketing, and wildcards

  • True world-wide coverage, with search by language

  • No limits to length or complexity of the query

  • Literal text search, without "second-guessing"

KWiCFinder Enhances AltaVista with…

  • Intuitive input for foreign characters, bracketing, operators, dates

  • Inclusion / exclusion criteria that focus the search without appearing in the KWiC report

  • Automatic search and retrieval in the background returning KWiC abstracts
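
For readers unfamiliar with the format, a KWiC (keyword in context) abstract can be sketched in a few lines of Python. This is only an illustration of the general technique, not KWiCFinder's own code; the function name `kwic` and the `width` parameter are assumptions made for the example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

if __name__ == "__main__":
    sample = "The web is big.\nThe Web is not a corpus in the classical sense."
    for line in kwic(sample, "web", width=20):
        print(line)
```

Aligning the keyword in a fixed column is what lets a reader scan many citations quickly and judge a page without opening it.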

KWiCFinder Enhances AltaVista with… (2)

  • Restricted wildcards ? (exactly 1 char) and % (0-1 char) vs. AltaVista's * (0-5 chars)

  • “Sic” option so “plain” or lower-case char does not match “special” or upper-case variants:

    • By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ
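
Both enhancements on this slide can be approximated in Python. The sketch below is an interpretation of the slide, not KWiCFinder's implementation: the names `wildcard_to_regex`, `fold` and `word_matches` are hypothetical, a "character" is assumed to mean a word character (`\w`), and the folding relies on Unicode decomposition, so ligatures such as æ, which have no decomposition, are not matched by a plain a here.

```python
import re
import unicodedata

def wildcard_to_regex(pattern):
    """Translate ? (exactly 1 char), % (0-1 char) and * (0-5 chars) to a regex."""
    table = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}
    return re.compile("".join(table.get(ch, re.escape(ch)) for ch in pattern))

def fold(s):
    """Strip diacritics and case: 'Á' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", s)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return plain.casefold()

def word_matches(query, word, sic=False):
    """Default: a plain lower-case query char matches its accented and
    upper-case variants. With sic=True every char must match exactly."""
    query = unicodedata.normalize("NFC", query)
    word = unicodedata.normalize("NFC", word)
    if len(query) != len(word):
        return False
    for q, w in zip(query, word):
        if sic or fold(q) != q:   # "special" or upper-case query chars are literal
            if q != w:
                return False
        elif fold(w) != q:        # plain query char: compare after folding
            return False
    return True
```

Use `wildcard_to_regex(...).fullmatch(word)` to test a whole word. With these definitions, `colo%r` accepts "color" and "colour" but not longer strings, and the query "papa" matches "Papá" unless `sic=True` is set.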

KWiCFinder Enhances AltaVista with… (3)

“Tamecards” -- User inputs pattern, KF generates variants:

  • on-line matches on-line, on line, online

  • s[iau]ng matches sing, sang, sung

  • {me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan
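
One way to read the three examples above is sketched below in Python. The syntax rules are inferred from the slide and are assumptions, not KWiCFinder's documented behavior: a hyphen yields hyphenated, spaced and solid forms; square brackets list free alternatives (single characters, or comma-separated strings with an empty option); curly braces list alternatives that are paired by position across groups, which is what would restrict the Spanish pattern to matching pronoun and ending. The names `parse` and `expand` are hypothetical.

```python
import itertools
import re

# {a,b} indexed group | [ab] or [a,] free group | hyphen | literal text
TOKEN = re.compile(r"\{([^}]*)\}|\[([^\]]*)\]|(-)|([^{}\[\]-]+)")

def parse(pattern):
    """Split a pattern into (kind, alternatives) segments."""
    segments = []
    for m in TOKEN.finditer(pattern):
        braces, brackets, hyphen, literal = m.groups()
        if braces is not None:
            segments.append(("indexed", braces.split(",")))
        elif brackets is not None:
            alts = brackets.split(",") if "," in brackets else list(brackets)
            segments.append(("free", alts))
        elif hyphen:
            segments.append(("free", ["-", " ", ""]))
        else:
            segments.append(("free", [literal]))
    return segments

def expand(pattern):
    """Generate the candidate search strings for a pattern."""
    segments = parse(pattern)
    lengths = {len(alts) for kind, alts in segments if kind == "indexed"}
    if len(lengths) > 1:
        raise ValueError("indexed {...} groups must have equal length")
    rows = range(lengths.pop()) if lengths else range(1)
    variants = []
    for k in rows:
        # Indexed groups contribute their k-th item; free groups vary fully.
        choices = [[alts[k]] if kind == "indexed" else alts
                   for kind, alts in segments]
        for combo in itertools.product(*choices):
            variant = "".join(combo)
            if variant not in variants:
                variants.append(variant)
    return variants
```

Under these assumptions `expand("on-line")` yields the three spellings and `expand("s[iau]ng")` the three verb forms. The Spanish pattern yields twelve candidates, because `[i,]` varies freely within each person; the six forms listed on the slide are among them, and the remaining candidates (such as "nos despiertamos") are simply strings for which a search would be expected to find few or no pages.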

How Does XML Enhance KWiCFinder?

  • Search results become a dynamic database for end user to manipulate:

    • categorize, annotate, delete, merge / split searches, citations and documents

  • Free tools permit developer or end-user to restyle and add interactivity to reports

    • Layouts

    • Languages

    • Data format
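
To make the idea of a report as a manipulable database concrete, here is a small Python sketch using the standard `xml.etree.ElementTree` module. The element and attribute names (`search`, `document`, `citation`, `url`, `note`) and the sample report are invented for illustration; they are not KWiCFinder's actual XML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical report structure: one search, several documents, their citations.
REPORT = """<search query="on-line">
  <document url="http://example.org/a">
    <citation>shop on-line today</citation>
    <citation>our online store</citation>
  </document>
  <document url="http://example.org/b">
    <citation>on line at last</citation>
  </document>
</search>"""

def annotate(root, url, note):
    """Attach a user note to the document with the given URL."""
    for doc in root.findall("document"):
        if doc.get("url") == url:
            doc.set("note", note)

def delete_document(root, url):
    """Remove a document, and with it all its citations, from the report."""
    for doc in root.findall("document"):
        if doc.get("url") == url:
            root.remove(doc)

def citations(root):
    """List the citation texts that remain in the report."""
    return [c.text for c in root.iter("citation")]

if __name__ == "__main__":
    root = ET.fromstring(REPORT)
    annotate(root, "http://example.org/a", "commercial site")
    delete_document(root, "http://example.org/b")
    print(ET.tostring(root, encoding="unicode"))
```

Because the report stays structured data after the edits, the same file could then be restyled into a different layout or language without repeating the search.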

Why WebKWiC?

  • Original hope: cross-platform, cross-browser solution

  • Minimal entry threshold: small download of HTML pages + JavaScript

  • Support for non-Western European languages

Why Google?

  • Link popularity ranking puts relevant sites at or near top of list

  • Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently

  • Largest number of pages analyzed

  • Matching pages always* available in cache with KWiC markup

How Does WebKWiC Complement Google?

  • Focuses and enhances interface for language learners

  • Provides tools to navigate among citations and documents

  • Simplifies management of multiple windows

Future of Web Concordancing

  • Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites

  • Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues

Pleas(e) Visit http://miniappolis.com

  • Download and try KWiCFinder and WebKWiC

  • View bibliography as well as this and related presentations

  • Use these tools with your students

  • Send feedback and suggestions to [email protected]