Timedex.org
Brandon Bell, Alex Loddengaard, Robert Gay, Sean McCarthy
Abstract
• Extract events from Wikipedia
• Index the events based on date of occurrence
• Display the events in a useful webapp
Step 1: Importing Wikipedia into MySQL
• Large dataset
  • 8 GB page links table
  • ~5 GB page table
• Difficulties:
  • Altering tables (adding columns or indexes)
• Lessons learned:
  • Be careful with alter operations on large tables
  • Use Postgres
Step 2: Extracting Sentences and Page Hierarchy
• Used the Lingpipe API to find sentences
• Parsed Wikipedia tags to create a heading/sentence tree of each page
• Difficulties:
  • Many Wikipedia sentences are terminated by newlines
  • Periods in abbreviations can be confusing
• Lessons learned:
  • Third-party packages never do exactly what you need
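The heading/sentence tree can be sketched as follows. This is a minimal illustration, not the project's code: it swaps in the JDK's `java.text.BreakIterator` for Lingpipe, assumes wikitext headings of the form `== Heading ==`, and the class names `Section` and `PageTree` are hypothetical. Splitting on newlines before sentence detection is one way to handle sentences that end at a line break rather than at a period.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical node type: one heading plus the sentences directly under it.
class Section {
    final String heading;
    final int level; // 1 = page root, 2 = "== h ==", 3 = "=== h ===", ...
    final Section parent;
    final List<String> sentences = new ArrayList<>();
    final List<Section> children = new ArrayList<>();

    Section(String heading, int level, Section parent) {
        this.heading = heading;
        this.level = level;
        this.parent = parent;
    }
}

class PageTree {
    // Assumed heading syntax: the same run of 2 to 6 '=' on both sides.
    private static final Pattern HEADING =
        Pattern.compile("^(={2,6})\\s*(.+?)\\s*\\1\\s*$");

    static Section parse(String title, String wikitext) {
        Section root = new Section(title, 1, null);
        Section current = root;
        // Treat every newline as a hard boundary, since many sentences
        // end at a line break without a period.
        for (String raw : wikitext.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            Matcher m = HEADING.matcher(line);
            if (m.matches()) {
                int level = m.group(1).length();
                // Climb until we reach a section shallower than the new heading.
                while (current.level >= level) {
                    current = current.parent;
                }
                Section child = new Section(m.group(2), level, current);
                current.children.add(child);
                current = child;
            } else {
                addSentences(current, line);
            }
        }
        return root;
    }

    // JDK sentence splitter used as a stand-in for Lingpipe.
    private static void addSentences(Section section, String line) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(line);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = line.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                section.sentences.add(sentence);
            }
        }
    }
}
```

Calling `PageTree.parse(title, wikitext)` returns the root `Section`; each sentence keeps its position in the heading hierarchy, which later steps can use for summaries and ranking. Abbreviations with periods remain a weak spot for any generic splitter, including this one.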
Step 3: Detecting Dates
• Used a set of regular expressions to check for dates
• Difficulties:
  • Deciding which date formats to accept so that date-like constructs that are not dates are rarely matched
• Lessons learned:
  • Regular expressions are easy to control and tune, so use them if possible
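A regular-expression date detector of this kind might look like the sketch below. The accepted formats are assumptions chosen for illustration (the slides do not list the project's actual patterns), and `DateFinder` is a hypothetical name. The point is the trade-off: each alternative is narrow, so tuning means adding or removing one pattern at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the patterns are illustrative, not the project's own.
class DateFinder {
    private static final String MONTH =
        "(?:January|February|March|April|May|June|July|August"
        + "|September|October|November|December)";

    // Accepted forms (assumed): "July 4, 1776", "4 July 1776", "July 1776",
    // and a bare four-digit year only when it directly follows "in ".
    // Requiring "in " keeps counts such as "1948 people" from matching.
    private static final Pattern DATE = Pattern.compile(
        "\\b" + MONTH + " \\d{1,2}, \\d{3,4}\\b"
        + "|\\b\\d{1,2} " + MONTH + " \\d{3,4}\\b"
        + "|\\b" + MONTH + " \\d{3,4}\\b"
        + "|(?<=\\b[Ii]n )\\d{4}\\b");

    // Returns every date-like substring found in the sentence, in order.
    static List<String> findDates(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

With these patterns, a sentence such as "He had two children by her in 1948 and 1951." yields only "1948": the second year does not follow "in", so it is missed. Loosening the bare-year rule would catch it but would also admit more non-dates.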
Step 4: Event Summaries
• Given a sentence containing a date, find a short phrase describing it
• We first thought this would be done best by training an HMM or CRF to extract the events
• Switched to a much simpler system that uses headings
• Difficulties:
  • Most sentences do not contain good, complete information!
    • E.g. “He had two children by her in 1948 and 1951.”
• Lessons learned:
  • Try basic methods first, then experiment with more elaborate schemes
  • English is hard; pronouns suck
Step 5: Ranking Events
• We have more events than anyone would ever want
• Run PageRank at page level using link table
  • Assume highly linked-to pages contain better events
• Count heading levels
  • Assume events at deep subheading levels are less important
• Empirically, first sentence of page often has most useful event
• Difficulties:
  • PageRank can take until the end of time (177 million links)
• Lessons learned:
  • In rare cases, efficiency does matter!
  • Buy more memory
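The ranking ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: pages are numbered 0..n-1 and the link table is already loaded as adjacency arrays, `PageRankSketch` is a hypothetical name, and the way `eventScore` combines page rank, heading depth, and the first-sentence signal uses arbitrary weights that the slides do not specify.

```java
import java.util.Arrays;

class PageRankSketch {
    // Standard power-iteration PageRank.
    // links[i] holds the indexes of the pages that page i links to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (int page = 0; page < n; page++) {
                if (links[page].length == 0) {
                    dangling += rank[page];
                    continue;
                }
                double share = rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share;
                }
            }
            // Random-jump term plus dangling rank spread evenly over all pages.
            double base = (1.0 - damping) / n + damping * dangling / n;
            for (int page = 0; page < n; page++) {
                next[page] = base + damping * next[page];
            }
            rank = next;
        }
        return rank;
    }

    // Assumed scoring rule: deeper headings are penalized and a page's first
    // sentence is boosted. The factor 2.0 and the 1/(1+depth) penalty are
    // placeholders, not values from the project.
    static double eventScore(double pageRank, int headingDepth, boolean firstSentence) {
        double depthPenalty = 1.0 / (1 + headingDepth);
        double firstBoost = firstSentence ? 2.0 : 1.0;
        return pageRank * depthPenalty * firstBoost;
    }
}
```

Each iteration touches every link once, so with 177 million links the cost is dominated by how fast the link arrays can be scanned. Keeping them in memory as primitive `int` arrays, rather than querying the database per page, is the kind of choice that the "buy more memory" lesson points at.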
Step 6: Writing the Webapp
• Makes an asynchronous JS request
  • Results returned in JSON
• Uses Lucene to query for keywords
• Difficulties:
  • Tuning the distribution of ranks so that the “hide/show” experience is interesting
  • Creating a good-looking site
• Lessons learned:
  • Use JS libraries whenever possible
Technologies Used
• Mallet
  • Machine learning API
• Lingpipe
  • Linguistic analysis API
• Hibernate
  • Java persistence layer
• Spring
  • Java MVC framework
• MySQL + Tomcat + Apache
• Scriptaculous
  • JS library
• Lucene
  • Java indexing API
Who Did What
• Brandon
  • Sentence and event extraction, date extraction, Lucene semantics
• Alex
  • MySQL import, PageRank, webapp, data access layer
• Robert
  • Sentence and event extraction, webapp, data access layer
• Sean
  • Sentence and event extraction, hierarchy parsing