Concordancing the Web with KWiCFinder. William H. Fletcher, United States Naval Academy. American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001.


Concordancing the Web with KWiCFinder

William H. Fletcher

United States Naval Academy

American Association for Applied Corpus Linguistics

Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA,

23-25 March 2001


How Big is the Web?

  • Now 2-4 billion webpages accessible via public links (Cyveillance estimate and projection, July 2000; Inktomi estimates are more modest)

  • “Invisible web” / restricted sites several times larger

  • Estimated 80%-95% content in English, but…

  • Since mid 2000, non-Anglophones outnumber English speakers online

  • Anglophones projected to be < 30% of 850 million users in 2005

  • Percentage of new users fluent in English decreasing

  • For many regions / languages, still no data available


Search Purposes

  • General users typically seek…

    • a specific site

    • any well-stocked site meeting their needs

  • Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources

  • Educators want to foster similar online research behavior in their students


Typical Search Behaviors

  • Marked preference for directories with pre-selected links organized by topic over full-text search engines

  • Simple queries – single word or phrase – predominate (80%-90%)

  • 10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed

  • Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link


Typical Search Outcomes

  • Users follow up only first few links, then settle on a page after browsing from these

  • Usual outcome is a match, not the best match


Ways to Use the Web for Instruction and Research

  • Micro level

    • Discover eloquent examples

    • Verify current / possible usage, with rough indication of prevalence

    • Acquire vocabulary not (yet) in dictionaries

      • Timeliness is essential -- “off-the-shelf corpora” often cannot help here!

    • Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)


Ways to Use the Web for Instruction and Research (2)

  • Macro level

    • Find authentic texts accessible to students

    • Locate relevant online resources for research projects

      • Student reports

      • Scholarly research


Impediments to Finding Relevant Resources Online

  • Reliance on commercial search engines (SEs) essential due to Web’s size

  • SEs’ priorities match ours only by coincidence

  • Link rot

    • Pages move or disappear

    • Page content changes


Challenges to Responsible Research

  • Online there is too much ephemeral content of unknown reliability

    • Preponderance of journalistic, commercial and personal texts of unknown authorship and authority

    • Details of sources and research methodology haphazard

    • Even student papers (gasp) and machine translated texts (groan choke)


Challenges to Responsible Research (2)

  • Representativity of Web as Corpus

    • Much ill-formed or fragmentary language

    • Domain only a rough clue to provenance

  • Numbers vs. Statistics

    • Search engines report the number of pages matching a query, not actual citations

    • One page may contain alternate usages

    • Narrower filters may eliminate some pages
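The gap between page counts and citation counts can be illustrated with a minimal sketch. The function name `page_and_citation_counts` and the sample pages are invented for this illustration; a real search engine's counting rules are not known from the slides.

```python
import re

def page_and_citation_counts(pages, phrase):
    """Return (number of pages containing phrase, total occurrences)."""
    # Whole-word, case-insensitive match; an assumption made for this sketch.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    per_page = [len(pattern.findall(text)) for text in pages]
    matching_pages = sum(1 for n in per_page if n > 0)
    return matching_pages, sum(per_page)

# Invented sample pages: the first one mixes two spellings.
pages = [
    "We shop online. Our on-line store is open; buy online today.",
    "The on line catalogue is new.",
    "Nothing relevant here.",
]

for variant in ("online", "on-line", "on line"):
    hits, citations = page_and_citation_counts(pages, variant)
    print(f"{variant!r}: {hits} page(s), {citations} citation(s)")
```

In this toy data the first page counts once as a "hit" for "online" although it contains two citations, and the same page also counts for "on-line", so page totals for competing variants neither add up nor reflect relative frequency.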


Webidence as Evidence

Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.


The Web is not a corpus in the classical sense…

…but it does offer an inexhaustible body of linguistic and cultural information for research and use.


Why KWiCFinder?

  • Automate process of search and retrieval

  • Expedite evaluation of webpages

  • Provide specific enhancements for foreign language users and linguists

  • Encourage students and colleagues to take full advantage of online resources


Why AltaVista?

  • All words are indexed, including "stopwords"

  • Distinguishes case and "special characters"

  • Supports Boolean operators, bracketing, and wildcards

  • True world-wide coverage, with search by language

  • No limits to length or complexity of the query

  • Literal text search, without "second-guessing"


KWiCFinder Enhances AltaVista with…

  • Intuitive input for foreign characters, bracketing, operators, dates

  • Inclusion / exclusion criteria not included in KWiC report to focus search

  • Automatic search and retrieval in the background returning KWiC abstracts
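For readers unfamiliar with the format, a KWiC (keyword in context) abstract lists each occurrence of the search term with a fixed window of text on either side. The following is only a rough sketch of that idea, not KWiCFinder's actual output format; the function name `kwic` and the `width` parameter are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one aligned line per occurrence of keyword in text."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = "The Web is not a corpus in the classical sense.\nStill, the web offers data."
for line in kwic(sample, "web", width=20):
    print(line)
```

Aligning the keyword in one column is what lets a reader scan many citations quickly and judge a page before opening it.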


KWiCFinder Enhances AltaVista with… (2)

  • Restricted wildcards ? (exactly 1 char) and % (0-1 char) vs. AltaVista * (0-5 chars)

  • “Sic” option so “plain” or lower-case char does not match “special” or upper-case variants:

    • By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ
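One way to picture the restricted wildcards and the "sic" option is as a filter applied to candidate words. The sketch below is an assumption-laden simplification, not KWiCFinder's implementation: the names `fold` and `kf_match` are hypothetical, and the default mode folds the whole candidate word instead of handling each query character separately.

```python
import re
import unicodedata

# Wildcard meanings as given on the slide, expressed as regex fragments.
WILDCARDS = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}

def fold(text):
    """Strip diacritics and case. Note: ligatures such as 'æ' do not
    decompose under NFD, so this sketch does not fold them to 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def kf_match(query, word, sic=False):
    """Does word satisfy query? With sic=True only literal matches count."""
    regex = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in query)
    if re.fullmatch(regex, word):
        return True
    if sic:
        return False
    # Default behavior: a plain lower-case query also matches
    # accented and upper-case variants of the word.
    return re.fullmatch(regex, fold(word)) is not None

print(kf_match("colo%r", "colour"))      # % allows zero or one character
print(kf_match("wom?n", "women"))        # ? requires exactly one character
print(kf_match("a", "Á"))                # default: plain 'a' matches variants
print(kf_match("a", "Á", sic=True))      # sic: literal match only
```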


KWiCFinder Enhances AltaVista with… (3)

“Tamecards” -- User inputs pattern, KF generates variants:

  • on-line matches on-line, on line, online

  • s[iau]ng matches sing, sang, sung

  • {me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan
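The three examples above suggest how variant generation might work. The sketch below encodes one reading of the notation, inferred only from those examples: a hyphen stands for hyphen, space, or nothing; square brackets offer free alternatives (single characters, or comma-separated strings where an empty item is allowed); curly-brace lists are paired by position, so the first pronoun goes with the first ending. The function name `expand_tamecard` is hypothetical, and unbalanced brackets are simply ignored.

```python
import itertools
import re

# One token per match: {a,b,c} | [abc] or [a,b,] | hyphen | literal text
TOKEN = re.compile(r"\{([^{}]*)\}|\[([^\[\]]*)\]|(-)|([^{}\[\]-]+)")

def expand_tamecard(pattern):
    """Generate the literal search variants for a tamecard pattern."""
    slots = []  # each slot is (is_paired, list_of_alternatives)
    for m in TOKEN.finditer(pattern):
        braces, brackets, hyphen, literal = m.groups()
        if braces is not None:
            slots.append((True, braces.split(",")))
        elif brackets is not None:
            alts = brackets.split(",") if "," in brackets else list(brackets)
            slots.append((False, alts))
        elif hyphen is not None:
            slots.append((False, ["-", " ", ""]))
        else:
            slots.append((False, [literal]))

    paired_sizes = {len(alts) for paired, alts in slots if paired}
    if len(paired_sizes) > 1:
        raise ValueError("all {...} lists must have the same length")
    rows = paired_sizes.pop() if paired_sizes else 1

    variants = []
    for i in range(rows):
        # Paired slots contribute only their i-th item; free slots vary fully.
        choices = [[alts[i]] if paired else alts for paired, alts in slots]
        variants.extend("".join(combo) for combo in itertools.product(*choices))
    return list(dict.fromkeys(variants))  # drop duplicates, keep order

print(expand_tamecard("on-line"))   # ['on-line', 'on line', 'online']
print(expand_tamecard("s[iau]ng"))  # ['sing', 'sang', 'sung']
print(expand_tamecard("{me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an}"))
```

Under this reading the Spanish pattern also yields non-forms such as "me desperto"; these are harmless as search strings, while mismatched pairs such as "me despiertas" are never generated.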


How Does XML Enhance KWiCFinder?

  • Search results become a dynamic database for end user to manipulate:

    • categorize, annotate, delete, merge / split searches, citations and documents

  • Free tools permit developer or end-user to restyle and add interactivity to reports

    • Layouts

    • Languages

    • Data format
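To make the "dynamic database" idea concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`search`, `citation`, `url`, `category`, `note`) and the example.org URLs are invented for illustration; KWiCFinder's real report schema is not described in these slides.

```python
import xml.etree.ElementTree as ET

def build_report(query, citations):
    """Build a report from (url, context) pairs."""
    root = ET.Element("search", {"query": query})
    for url, context in citations:
        item = ET.SubElement(root, "citation", {"url": url})
        item.text = context
    return root

def annotate(root, url, category, note=""):
    """Categorize (and optionally annotate) every citation from one page."""
    for item in root.findall("citation"):
        if item.get("url") == url:
            item.set("category", category)
            if note:
                item.set("note", note)

def delete(root, url):
    """Remove all citations from one page."""
    for item in root.findall("citation"):  # findall returns a list, safe to mutate
        if item.get("url") == url:
            root.remove(item)

def merge(first, second):
    """Combine two searches into one report."""
    merged = ET.Element("search", {"query": f"{first.get('query')} | {second.get('query')}"})
    merged.extend(list(first) + list(second))
    return merged

report = build_report("online", [
    ("http://example.org/a", "we shop online every day"),
    ("http://example.org/b", "buy online now!!!"),
])
annotate(report, "http://example.org/a", "personal", note="natural usage")
delete(report, "http://example.org/b")
print(ET.tostring(report, encoding="unicode"))
```

Because the data stay separate from their presentation, the same file could be restyled into different layouts or interface languages without rerunning the search.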


Why WebKWiC?

  • Original hope: cross-platform, cross-browser solution

  • Minimal entry threshold: small download of HTML pages + JavaScript

  • Support for non-Western European languages


Why Google?

  • Link popularity ranking puts relevant sites at or near top of list

  • Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently

  • Largest number of pages analyzed

  • Matching pages always* available in cache with KWiC markup


How Does WebKWiC Complement Google?

  • Focuses and enhances interface for language learners

  • Provides tools to navigate among citations and documents

  • Simplifies management of multiple windows


Future of Web Concordancing

  • Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites

  • Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues


Pleas(e) Visit http://miniappolis.com/

  • Download and try KWiCFinder and WebKWiC

  • View bibliography as well as this and related presentations

  • Use these tools with your students

  • Send feedback and suggestions to [email protected]

