
LIS 7450, Searching Electronic Databases




  1. LIS 7450, Searching Electronic Databases. Basic: Database Structure & Database Construction. Dialog: Database Construction for Dialog (FYI). Deborah A. Torres

  2. Database Structure: Organization of Data Elements and Records

  3. Database Record • Record – basic unit of information in a database (file). • Example: A bibliographic record contains descriptive information, e.g. author, title, publisher, etc.

  4. Fields • Field – a distinct part or section of a record (a unit of information within the record) • Example of personnel record fields: employee’s name, special identifier number, address, date of hire etc.

  5. Field Design Decisions • For each field • Decide what information is placed within that field & format for that information (text, numeric) • Should there be subfields within a field? • What to call the fields? • Field codes (abbreviations, numbering) • Order of the fields

  6. Example: MARC Record (a type of record you should be familiar with) • Record Fields & Codes • The 100 field contains author information. • The 245 field contains main title information.
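The field-code idea on this slide can be illustrated with a small sketch. The author and title values are made-up placeholders, and the mapping is deliberately simplified: real MARC fields also carry indicators and subfield codes, which are left out here.

```python
# Hypothetical, simplified record: MARC field tags mapped to plain strings.
# Real MARC fields also have indicators and subfields (omitted in this sketch).
record = {
    "245": "An Example Title",  # 245 = main title information
    "100": "Doe, Jane",         # 100 = author information
}

# Human-readable labels for the two tags named on the slide.
FIELD_LABELS = {"100": "Author", "245": "Title"}


def describe(rec):
    """Return 'Label (tag): value' strings, ordered by field tag."""
    return [f"{FIELD_LABELS[tag]} ({tag}): {value}" for tag, value in sorted(rec.items())]


for line in describe(record):
    print(line)
```

Sorting by tag reflects one of the design decisions from the previous slide (the order of the fields); a real system might instead keep the order in which fields were entered.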

  7. Other Design Decisions • Hyphenated words • Home-school • Stop words • High frequency words not useful for searching • Single words and phrases • Library, library science, color of money • Alternative spellings of words • Color, colour

  8. Types of Databases • Bibliographic – references and abstracts of published documents • Fulltext – complete text of articles, dictionary entry, code of law, or other such document. • Directory – factual information about organizations, companies, products, people, or materials.

  9. Types of Databases • Numeric – data in a tabular or statistically manipulated form, often with some added text. • Hybrid – a mix of record types. For example, a database may have full-text records for some publications and citations and abstracts for other source documents.

  10. Database Construction: Basic Steps for automatic indexing of text documents

  11. Six Basic Steps • Step 1: Parse text into words • Step 2: Compare to stoplist and eliminate stop words • Step 3: Stem content words (reduce to root words); skip this step if you decide not to stem • Step 4: Count stemmed word occurrences • Step 5: Create union list of terms • Step 6: Create data structure for specific retrieval techniques (e.g. an inverted file)

  12. Example: Simple Set of 5 One-Sentence Documents (“D” stands for document) • D1: It is a dog eat dog world! • D2: While the world sleeps. • D3: Let sleeping dogs lie. • D4: I will eat my hat. • D5: My dog wears a hat.

  13. Step 1: Parse Text into Words • Note: Some databases remove punctuation from words, such as the apostrophe in possessives; others preserve it. What difference would this make?
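One way to see the difference the punctuation choice makes is a small parsing sketch. The `parse` function and its `keep_apostrophes` flag are illustrative names, not part of any particular database system, and lowercasing is an added assumption.

```python
import re


def parse(text, keep_apostrophes=False):
    """Split text into lowercase words.

    With keep_apostrophes=False, an apostrophe acts as a word break,
    so a possessive such as "dog's" becomes two tokens: "dog" and "s".
    """
    pattern = r"[a-z']+" if keep_apostrophes else r"[a-z]+"
    return re.findall(pattern, text.lower())


print(parse("It is a dog eat dog world!"))
print(parse("The dog's hat"))
print(parse("The dog's hat", keep_apostrophes=True))
```

In the first setting a search for `dog` also matches the possessive form; in the second, `dog's` is indexed as its own term and a searcher would have to enter it separately (or rely on stemming).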

  14. Step 2: Eliminate Stop Words Stop words are content-free words – those not useful in determining the content of the document. Examples: pronouns (I, my), prepositions (of, by, on), articles (a, the, this)
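A minimal sketch of stop-word removal follows. The slides do not give the stoplist used for the five sample sentences, so the `STOPLIST` below is an assumption, chosen to cover the pronouns, articles, and other content-free words that appear in them.

```python
# Assumed stoplist for the five sample sentences (not taken from the slides).
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}


def remove_stopwords(words, stoplist=STOPLIST):
    """Keep only the words that are not on the stoplist, preserving order."""
    return [w for w in words if w not in stoplist]


d1 = ["it", "is", "a", "dog", "eat", "dog", "world"]
d4 = ["i", "will", "eat", "my", "hat"]
print(remove_stopwords(d1))
print(remove_stopwords(d4))
```

Every database chooses its own stoplist, so a word that is searchable in one system may be unsearchable in another.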

  15. Step 3: Stemming (remember not all databases stem words)

  16. Types of Stemming Decisions • No Stemming: contract, contracts, contracted, contracting, contractor, contraction, contractual, contracture • Weak Stemming: Inflections: -s, -es, -ed, -ing, -’s • Strong Stemming: Derivations: -tion, -ly, -ally • Reduce words to a root variant; there are different stemming algorithms
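Weak stemming can be sketched as simple suffix stripping. This is a deliberately naive illustration, not one of the published stemming algorithms: the `weak_stem` name, the suffix order, and the minimum stem length of 3 are all assumptions, and a straight apostrophe stands in for the curly one on the slide.

```python
# Inflectional suffixes from the slide, checked in this order.
SUFFIXES = ("'s", "ing", "ed", "es", "s")


def weak_stem(word, min_stem=3):
    """Strip one inflectional suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["contracts", "contracted", "contracting", "contractor", "contraction"]:
    print(w, "->", weak_stem(w))
```

The inflected forms collapse to `contract`, while `contractor` and `contraction` are left alone because their endings are derivations, which only a strong stemmer would remove. Naive stripping also makes mistakes on some words, which is one reason real stemmers use more elaborate rules.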

  17. A bit more about stemming for searching… • Some databases automatically search for all of the words that come from the same stem/root word unless you indicate that you only want the word you entered. • Example: if you entered computer, the database would also search for computing, computers, computation, etc.

  18. Step 4: Sort Words, Count Duplicates • Sort into alphabetical order • Count any duplicates
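Sorting and counting can be sketched with the standard library. The `stemmed` dictionary below is an assumed result of applying Steps 1 to 3 to the five sample sentences; it is hard-coded so the example stands on its own.

```python
from collections import Counter

# Assumed output of Steps 1-3 for the five sample documents.
stemmed = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

# Sort every word from every document into alphabetical order.
all_words = sorted(word for words in stemmed.values() for word in words)

# Count how many times each word occurs across the collection.
counts = Counter(all_words)

for word in sorted(counts):
    print(word, counts[word])
```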

  19. Step 5: Create Union List of Unique Terms

  20. Step 6: Create Inverted Index (inverted file) • Union list of unique terms: dog, eat, hat, let, lie, sleep, wear, world • Inverted index: dog: D1 D3 D5; eat: D1 D4; hat: D4 D5; let: D3; lie: D3; sleep: D2 D3; wear: D5; world: D1 D2 • The inverted index has pointers to the documents in which each word occurs
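The union list and inverted index for the five sample sentences can be rebuilt with a short sketch. The `stemmed` dictionary is an assumed result of the earlier steps (parsing, stop-word removal, weak stemming), hard-coded here so the example is self-contained.

```python
from collections import defaultdict

# Assumed stemmed content words for each sample document.
stemmed = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

# Inverted index: each term points to the documents in which it occurs.
index = defaultdict(list)
for doc_id in sorted(stemmed):
    for term in set(stemmed[doc_id]):  # set() so each document is listed once per term
        index[term].append(doc_id)

# Union list: every unique term, in alphabetical order.
union_list = sorted(index)

for term in union_list:
    print(f"{term}: {' '.join(index[term])}")
```

Looking up a term is then a single dictionary access, for example `index["sleep"]`, instead of a scan through every document. That is the point of inverting the file.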

  21. Dialog Database Construction FYI: For those interested in Dialog

  22. Dialog Database Construction • Step 1: Create a linear file of records received from the Information Provider. Assign sequential accession numbers to the records. • Step 2: Label the fields within the records: AU for Author, TI for Title, etc. If a field is word-indexed, also label the words within each field. Exclude stop words: AN, FOR, THE, AND, FROM, TO, BY, WITH

  23. Dialog Database Construction Step 3: Create the Basic Index: all words and phrases from fields containing subject-related terms. Step 4: Create the Additional Indexes: all terms from all remaining fields.
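The four Dialog steps can be sketched roughly as follows. This is only an illustration of the idea, not Dialog's actual implementation: the two records are invented, treating TI as the only subject-related field is an assumption, and the `AU=` style key for the additional index is used here as an illustrative convention.

```python
from collections import defaultdict

# Stop words listed on the slide.
STOP_WORDS = {"AN", "FOR", "THE", "AND", "FROM", "TO", "BY", "WITH"}
# Assumption: only the title field carries subject-related terms.
SUBJECT_FIELDS = {"TI"}

# Hypothetical records as received from an information provider.
incoming = [
    {"AU": "Doe, Jane", "TI": "Searching for the Answers"},
    {"AU": "Roe, Richard", "TI": "Answers from Databases"},
]

# Step 1: linear file with sequential accession numbers.
linear_file = {n: rec for n, rec in enumerate(incoming, start=1)}

basic_index = defaultdict(set)         # Step 3: words from subject fields
additional_indexes = defaultdict(set)  # Step 4: terms from remaining fields

# Step 2 is implicit here: each value is already labeled with its field code.
for accession, rec in linear_file.items():
    for field, value in rec.items():
        if field in SUBJECT_FIELDS:
            for word in value.upper().split():
                if word not in STOP_WORDS:
                    basic_index[word].add(accession)
        else:
            additional_indexes[f"{field}={value.upper()}"].add(accession)

print(dict(basic_index))
print(dict(additional_indexes))
```

The split mirrors the slide: subject words are searchable without naming a field, while terms from the remaining fields are reached through their field label.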
