1 / 26

Web Programming Week 13

Web Programming Week 13. Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein<mklein@cs.odu.edu> 11/23/10. Relational Data Model is a Special Case…. SELECT pi.fname, m.aces, m.unforced_errors, m.winners FROM player_info pi, matches m

wilma
Download Presentation

Web Programming Week 13

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Web ProgrammingWeek 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein<mklein@cs.odu.edu> 11/23/10

  2. Relational Data Model is a Special Case… SELECT pi.fname, m.aces, m.unforced_errors, m.winners FROM player_info pi, matches m WHERE pi.fname = “Andre” AND m.opponent_name = “Sampras” AND m.year = “2002”;

  3. Unstructured Datais More Common…

  4. Precision and Recall • how much extra stuff did you get? • how much did you miss? source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html

  5. Precision and Recall 10 documents in the index are relevant search returns 20 documents 5 of which are relevant half of the relevant documents were retrieved 1 out of 4 retrieved documents are relevant source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html

  6. Precision and Recall 1 Precision figure 1.2 in FBY 0 1 Recall

  7. Why Isn’t Recall Always 100%? Virginia Agricultural and Mechanical College? Virginia Agricultural and Mechanical College and Polytechnic Institute? Virginia Polytechnic Institute? Virginia Polytechnic Institute and State University? Virginia Tech?

  8. Precision and Recall - Literature

  9. CREATE Table mysql> CREATE TABLE ODUtennis( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> ); Query OK, 0 rows affected (0.00 sec)

  10. INSERT mysql> INSERT INTO ODUtennis (title, body) VALUES -> ('Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals', 'The fourth-seeded Monarchs ...'), -> ('Monarchs Close Out Season With 4-3 Win Over South Alabama', 'ODU closes out the 2010 tennis schedule...'), -> ('ODU Edged By DePaul in Mens Tennis Action, 4-3', 'Junior Tobias Fanselow was the other Monarch…'), -> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action', 'The Old Dominion mens tennis team…doubles…Monarchs...'), -> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3', 'The two teams split the first two doubles matches…Monarchs…doubles...'); Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0

  11. LIKE & REGEXP • We can search rows with the “LIKE” (or “REGEXP”) operator • http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html • for tables of any size, this will be s-l-o-w

  12. Example 1 mysql> SELECT id, title FROM ODUtennis WHERE title LIKE 'Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals'; mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP 'Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals'; +----+--------------------------------------------------------------------------------------+ | id | title | +----+--------------------------------------------------------------------------------------+ | 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals | +----+--------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)

  13. Example 2 mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP 'Monarchs'; +----+---------------------------------------------------------------------------------------+ | id | title | +----+---------------------------------------------------------------------------------------+ | 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals | | 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama | +----+----------------------------------------------------------------------------------------+ 2 rows in set (0.01 sec) BUT mysql> SELECT id, title FROM ODUtennis WHERE title LIKE 'Monarchs'; Empty set (0.00 sec)

  14. Full-Text Search – The Better Way • MATCH()…AGAINST() • Performs a natural language search over index • Index = set of one ore more columns of the same table • Index as argument to MATCH() • Search string as argument to AGAINST() • If used in WHERE clause result returned in order of relevance score • Relevance: similarity between search string and index row

  15. CREATE Table mysql> CREATE TABLE ODUtennis ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ); Query OK, 0 rows affected (0.00 sec) • can only create FULLTEXT on CHAR, VARCHAR or TEXT columns • “title” and “body” still available as regular columns • if you want to search only on “title”, you need to create a separate index

  16. INSERT mysql> INSERT INTO ODUtennis (title, body) VALUES -> ('Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals', 'The fourth-seeded Monarchs ...'), -> ('Monarchs Close Out Season With 4-3 Win Over South Alabama', 'ODU closes out the 2010 tennis schedule...'), -> ('ODU Edged By DePaul in Mens Tennis Action, 4-3', 'Junior Tobias Fanselow was the other Monarch…'), -> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action', 'The Old Dominion mens tennisteam…doubles…Monarchs...'), -> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3', 'The two teams split the first two doubles matches…Monarchs…doubles...'); Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0

  17. MATCH .. AGAINST mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('fanselow'); +----+------------------------------------------------+------------------------------------------------------------------------------------+ | id | title | body | +----+------------------------------------------------+------------------------------------------------------------------------------------+ | 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow was the other … | +----+------------------------------------------------+------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('Monarchs'); Empty set (0.00 sec) why?!

  18. Ranking • If the word appears in > 50% of the rows then the word is considered a “stop word” and is not matched (unless you are in Boolean mode) • this makes sense for large collections (the word is not a good discriminator of records), but can lead to unexpected results for small collections

  19. Stopwords • Stopwords exist in stoplists or negative dictionaries • Idea: remove low semantic content • index should only have “important stuff” • What not to index is domain dependent, but often includes: • “small” words: a, and, the, but, of, an, very, etc. • NASA ADS example: • http://adsabs.harvard.edu/abs_doc/stopwords.html • MySQL full-text index: • http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

  20. Stopwords • Punctuation, numbers often stripped or treated as stopwords • precision suffers on searches for: • NASA TM-3389 • F-15 • X.500 • .NET • Tree::Suffix • MySQL also treats words < 4 characters as stopwords • too bad for: “Liu”, “ORF”, “DEA”, etc.

  21. Getting the Rank mysql> SELECT id, MATCH(title,body) AGAINST('doubles') from ODUtennis; +----+---------------------------------------------------+ | id | MATCH(title,body) AGAINST('doubles') | +----+---------------------------------------------------+ | 1 | 0 | | 2 | 0 | | 3 | 0 | | 4 | 0.28169098496437 | | 5 | 0.61670464277267 | +----+---------------------------------------------------+ 5 rows in set (0.00 sec)

  22. Getting the Rank in Order mysql> SELECT id, title, MATCH(title,body) AGAINST('doubles') AS score FROM ODUtennis WHERE MATCH(title,body) AGAINST('doubles'); +----+-------------------------------------------------------------------------+----------------------------+ | id | title | score | +----+-------------------------------------------------------------------------+----------------------------+ | 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3 | 0.61670464277267 | | 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | 0.28169098496437 | +----+-------------------------------------------------------------------------+----------------------------+ 2 rows in set (0.00 sec)

  23. Boolean Mode mysql> SELECT id, title FROM ODUtennis WHERE MATCH(title,body) AGAINST('+Monarchs' IN BOOLEAN MODE); +----+-----------------------------------------------------------------------------------------+ | id | title | +----+-----------------------------------------------------------------------------------------+ | 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals | | 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama | | 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | | 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3 | +----+-----------------------------------------------------------------------------------------+ 4 rows in set (0.00 sec) • Does not use the 50% threshold • Does use stopwords, length limitation • Operator list: • http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html

  24. Blind Query Expansion (AKA Automatic Relevance Feedback) • General assumption: user query is insufficient • Too short • Too generic • Too many results • How does one keep up with Virginia Tech’s multiple names / nicknames? • Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT • Idea: • run the query with the requested terms • then take the results and • re-run the query with the most relevant terms from the initial results

  25. Blind Query Expansion (AKA Automatic Relevance Feedback) mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('fanselow' WITH QUERY EXPANSION); +----+-------------------------------------------------------------------------+----------------------------------------------------------------+ | id | title | body | +----+-------------------------------------------------------------------------+-----------------------------------------------------------------+ | 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow was the other Monarch | | | to win, earning the win at No. 1 singles over | | | Alasdair Graetz, 6-2, 7-6 | ----------------------------------------------------------------------------------------------------------------------------------------------------+ | 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | The Old Dominion mens tennis team improved | | | to 14-8 overall and 2-3 in the CAA with a 5-2 | | | victory over the Delaware Fighting Blue Hens | | | on Wednesday. After dropping all three doubles | | | matches, the Monarchs went on to win five of | | | the six singles matches for the victory. | +----+----------------------------------------------------+--------------------------------------------------------------------------------------+ 2 rows in set (0.00 sec)

  26. For More Information… • MySQL documentation: • http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html • Chapter 12/13 “Building a Content Management System” • CS 751/851 “Introduction to Digital Libraries” • http://www.cs.odu.edu/~mln/teaching/ • esp. “Information Retrieval Concepts” lecture • Is MySQL the right tool for your job? • http://lucene.apache.org/ MySQL examples in this lecture based on those found at dev.mysql.com content snippets taken from www.odusports.com

More Related