
Phrase Hierarchy Inference

Gordon Paynter, UC Riverside

Craig Nevill-Manning, Google

Ian Witten, University of Waikato

Outline
  • Overlapping vs non-overlapping phrases
  • Memory-based algorithm
    • Suffix trees
    • Suffix arrays
  • Multipass algorithm
Non-overlapping phrases
  • Given a text, parse it into a tree of repeated phrases
  • Advantage
    • Based on existing data compression algorithms
  • Disadvantage
    • Sometimes arbitrary association of words

In the beginning, God created the heaven and the earth

Overlapping Phrases
  • Instead, we count all repeating phrases, even if two phrases overlap
  • Limit phrase length to, say, ten words
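
Counting overlapping phrases can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation; the function names and the sample text are assumptions made for the example.

```python
from collections import Counter

def count_phrases(tokens, max_len=10):
    """Count every contiguous phrase of 1..max_len words, overlaps included."""
    counts = Counter()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            counts[tuple(tokens[start:start + length])] += 1
    return counts

def repeated_phrases(tokens, max_len=10):
    """Keep only the phrases that occur more than once."""
    return {p: c for p, c in count_phrases(tokens, max_len).items() if c > 1}

# Hypothetical sample text, chosen only for illustration.
tokens = "the old man and the sea and the old man".split()
repeated = repeated_phrases(tokens)
```

Here "the old man" and "old man" are both counted even though they overlap, which is the difference from a non-overlapping parse.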
Memory-based Algorithm
  • For each word w:
    • Everywhere that word occurs, consider the phrase formed by the word plus the word to the left (aw)
    • Similarly for words to the right (wa)
    • If the phrase is always preceded or followed by the same word, extend the phrase
    • If the phrase begins or ends with a stopword, extend the phrase
    • Add all the extended phrases to the list of expansions for w
  • For each phrase p:
Memory-based Algorithm
  • Problem:
    • How to efficiently find words to the right and left for every occurrence of a word or a phrase?
  • Solution:
    • Suffix trees
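
The extension rule of the memory-based algorithm ("if the phrase is always followed by the same word, extend the phrase") can be sketched with a naive scan. The helper names are hypothetical, the requirement of at least two occurrences is an assumption added so that a unique phrase does not extend to the end of the text, and the stopword rule is left out.

```python
def occurrences(tokens, phrase):
    """Start positions of phrase in tokens, found by a linear scan."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def extend_right(tokens, phrase):
    """Extend phrase while every occurrence is followed by the same word."""
    phrase = list(phrase)
    while True:
        starts = occurrences(tokens, phrase)
        followers = set()
        for i in starts:
            j = i + len(phrase)
            followers.add(tokens[j] if j < len(tokens) else None)
        # Stop on a unique phrase, on differing followers, or at the end of the text.
        if len(starts) < 2 or len(followers) != 1 or None in followers:
            return phrase
        phrase.append(followers.pop())

def extend_left(tokens, phrase):
    """Same rule to the left, by running it on the reversed sequence."""
    return extend_right(tokens[::-1], list(phrase)[::-1])[::-1]

tokens = "gone with the wind and gone with the wind again".split()
right = extend_right(tokens, ["gone"])
left = extend_left(tokens, ["wind"])
```

Every call to `occurrences` rescans the whole text, which is the cost the suffix tree is meant to avoid.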
Suffix Tree
  • A compacted trie of suffixes
  • Trie: a tree containing a set of strings

[Figure: trie of the words she, sells, sea, shells, on, the, sea, shore, one letter per node]

Suffix Tree
  • Compacted trie: no nodes with only one child

[Figure: the same trie before and after compaction; chains of single-child nodes collapse into single edges labelled "lls", "ore", "on", and "the"]
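
A trie and its compaction can be sketched with nested dictionaries. The `END` marker and the function names are assumptions made for this example; the sketch merges every chain of single-child nodes into one edge with a longer label.

```python
END = "$"  # hypothetical end-of-word marker

def build_trie(words):
    """Plain trie: one character per edge, END marks where a word stops."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def compact(node):
    """Merge chains of single-child nodes into one edge with a longer label."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            next_label, grandchild = next(iter(child.items()))
            label += next_label
            child = grandchild
        out[label] = compact(child)
    return out

words = "she sells sea shells on the sea shore".split()
compacted = compact(build_trie(words))
```

After compaction no node below the root has exactly one child, and the edges read "s", "h", "e", "lls$", "ore$", "a$", "on$", and "the$".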

Suffix Tree
  • Compacted trie of all suffixes

she·sells·sea·shells·on·the·sea·shore

he·sells·sea·shells·on·the·sea·shore

e·sells·sea·shells·on·the·sea·shore

·sells·sea·shells·on·the·sea·shore

sells·sea·shells·on·the·sea·shore

ells·sea·shells·on·the·sea·shore

lls·sea·shells·on·the·sea·shore

ls·sea·shells·on·the·sea·shore

s·sea·shells·on·the·sea·shore

·sea·shells·on·the·sea·shore

sea·shells·on·the·sea·shore

Two Surprising Facts
  • Even though the suffixes contain O(n²) characters in total:
    • Suffix trees consume O(n) space
    • Suffix trees take O(n) time to compute
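
The quadratic size of the explicit suffix list is easy to check. This small sketch writes out every suffix of the example sentence and sums their lengths, which gives n(n+1)/2 characters for a text of length n.

```python
text = "she sells sea shells on the sea shore"
n = len(text)

# One suffix per start position, each stored as a separate string.
suffixes = [text[i:] for i in range(n)]
total_chars = sum(len(s) for s in suffixes)

print(n, total_chars)
```

A suffix tree can stay linear in size because an edge label can be stored as a pair of offsets into the original text rather than as a copy of the characters.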
Suffix Tree
  • How does the suffix tree help us?
    • Build a suffix tree of words (instead of single letters)
    • For any word, words to the right are children in the tree
    • Compaction means that the longest unique sequence is already computed
  • For words to the left, build a suffix tree for the reverse sequence
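
The idea of reading neighbouring words off the tree can be illustrated with a simplified stand-in: an uncompacted, depth-limited trie of word suffixes, built naively rather than in linear time. The function names and the `max_depth` parameter are assumptions made for this sketch.

```python
def word_suffix_trie(tokens, max_depth=3):
    """Insert the first max_depth words of every suffix into a nested dict."""
    root = {}
    for i in range(len(tokens)):
        node = root
        for word in tokens[i:i + max_depth]:
            node = node.setdefault(word, {})
    return root

def next_words(trie, phrase):
    """Children of the node reached by phrase, i.e. the words that follow it."""
    node = trie
    for word in phrase:
        node = node.get(word)
        if node is None:
            return set()
    return set(node)

tokens = "she sells sea shells on the sea shore".split()
right = word_suffix_trie(tokens)        # words to the right
left = word_suffix_trie(tokens[::-1])   # reversed sequence: words to the left

after_sea = next_words(right, ["sea"])
before_sea = next_words(left, ["sea"])
```

In a true compacted suffix tree the single-child chain after "the" would already be one edge, so the longest unique continuation is available without further work.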
Suffix Array
  • Sorted list of suffixes

·sea·shells·on·the·sea·shore

·sells·sea·shells·on·the·sea·shore

e·sells·sea·shells·on·the·sea·shore

ells·sea·shells·on·the·sea·shore

he·sells·sea·shells·on·the·sea·shore

lls·sea·shells·on·the·sea·shore

ls·sea·shells·on·the·sea·shore

s·sea·shells·on·the·sea·shore

sea·shells·on·the·sea·shore

sells·sea·shells·on·the·sea·shore

she·sells·sea·shells·on·the·sea·shore
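
A suffix array can be built in a few lines by sorting start positions. This is a naive sketch: the sort key slices the text, so it is simple rather than fast, and the function name is an assumption.

```python
def suffix_array(text):
    """Return suffix start positions, ordered by the suffix each one begins."""
    return sorted(range(len(text)), key=lambda i: text[i:])

text = "she sells sea shells on the sea shore"
sa = suffix_array(text)
sorted_suffixes = [text[i:] for i in sa]
```

Only the array of positions needs to be stored; each suffix is recovered on demand as `text[i:]`.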

Suffix Array
  • Advantages
    • Simple: 10 lines of code
    • Space efficient: one array of pointers
  • Disadvantages
    • More expensive to create: O(n log n)
    • More expensive to operate on (linear scans instead of following an edge)
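
Operating on a suffix array can be sketched as a binary search for the block of suffixes that start with a phrase, followed by a scan over that block. The sketch works on words rather than characters; the function names are assumptions, and the prefix list is materialised only to keep the code short.

```python
from bisect import bisect_left, bisect_right

def build(tokens):
    """Word-level suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find(tokens, sa, phrase):
    """Start positions of phrase, located by binary search over sorted suffixes."""
    k = len(phrase)
    prefixes = [tokens[i:i + k] for i in sa]  # sorted because the suffixes are
    lo = bisect_left(prefixes, phrase)
    hi = bisect_right(prefixes, phrase)
    return sorted(sa[lo:hi])

def followers(tokens, sa, phrase):
    """Scan the matching block to collect the words that follow phrase."""
    k = len(phrase)
    return {tokens[i + k] for i in find(tokens, sa, phrase) if i + k < len(tokens)}

tokens = "the old man and the sea and the old man".split()
sa = build(tokens)
```

Collecting the followers visits every matching entry, whereas a suffix tree would expose them directly as the children of one node.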
Multi-pass Algorithm
  • Disk seeks dominate
    • minimize disk seeks
    • fit within available memory
  • Disk reads are cheap, seeks are expensive
  • Make multiple passes over the data, using as little memory as possible
Three Phases
  • Phase 1: count all single words, two word phrases, three word phrases…
  • Phase 2: make expansion lists for each phrase
  • Phase 3: delete uninteresting phrases
Phase 1: Count Phrases
  • Make one pass over the data, counting individual words
  • Write out all words that appear more than once
  • Make a second pass over the data, counting pairs of words, where both words appear more than once
  • Write out all pairs that appear more than once
  • Make a third pass over the data, counting triples of words, where both overlapping pairs appear more than once
  • Write out all triples that appear more than once
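
The passes above can be sketched as one counting function applied once per phrase length. In this sketch `read_tokens` is a hypothetical callable that re-reads the data sequentially on each pass, and the results are kept in memory instead of being written to disk.

```python
from collections import Counter

def count_pass(read_tokens, n, frequent_shorter):
    """One pass: count n-word phrases whose two (n-1)-word parts are frequent."""
    counts = Counter()
    window = []
    for word in read_tokens():
        window.append(word)
        if len(window) > n:
            window.pop(0)
        if len(window) == n:
            gram = tuple(window)
            if n == 1 or (gram[:-1] in frequent_shorter
                          and gram[1:] in frequent_shorter):
                counts[gram] += 1
    # Keep only phrases that appear more than once.
    return {g: c for g, c in counts.items() if c > 1}

def phase1(read_tokens, max_len=3):
    """Run one pass per phrase length, pruning with the previous pass."""
    levels, frequent = [], set()
    for n in range(1, max_len + 1):
        kept = count_pass(read_tokens, n, frequent)
        levels.append(kept)
        frequent = set(kept)
    return levels

# Hypothetical sample text standing in for a file on disk.
tokens = "the old man and the sea and the old man".split()
words, pairs, triples = phase1(lambda: iter(tokens))
```

Each pass holds only a sliding window and the counters for one phrase length, which is what keeps memory use low at the price of reading the data several times.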
Phase 1: Output

  • words
    • and 31
    • Gone 2
    • man 4
    • old 12
    • sea 8
    • the 57
    • Wind 3
    • with 17
  • pairs of words
    • and the 25
    • Gone with 2
    • man and 3
    • old man 2
    • The old 5
    • the sea 3
    • the Wind 2
    • with the 13
  • triples of words
    • and the sea 3
    • Gone with the 2
    • man and the 2
    • old man and 2
    • The old man 2
    • with the Wind 2

Phase 2: Make Expansion Lists
  • Read all pairs of words that appear more than once (from phase 1)
  • Insert each pair in the list for each word
  • Read all frequent triples
  • Insert each triple in the list for each overlapping pair
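
Building the expansion lists can be sketched as a single loop over the frequent pairs and triples. The function name is an assumption, and the sample phrases are taken from the Phase 1 output.

```python
from collections import defaultdict

def expansion_lists(*levels):
    """Map each phrase to the frequent phrases one word longer that contain it."""
    expansions = defaultdict(list)
    for level in levels:
        for phrase in level:
            if len(phrase) < 2:
                continue
            # The two overlapping sub-phrases: drop the last word, drop the first.
            for shorter in {phrase[:-1], phrase[1:]}:
                expansions[shorter].append(phrase)
    return dict(expansions)

pairs = [("Gone", "with"), ("with", "the"), ("the", "Wind")]
triples = [("Gone", "with", "the"), ("with", "the", "Wind")]
expansions = expansion_lists(pairs, triples)
```

Each pair lands in the list of both of its words, and each triple in the list of both of its overlapping pairs.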
Phase 2: Output

[Figure: the word, pair, and triple lists shown under Phase 1: Output, repeated here without change]

Phase 3
  • Delete each phrase in the hierarchy if
    • it begins or ends in a stopword (“man and”)
    • it occurs in a particular longer phrase more than 75% of the time (“theoretical computer”)
  • Pointers to that phrase now point to that phrase’s expansions
  • Process is recursive
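
The two deletion rules and the recursive re-pointing can be sketched as follows. The stopword list, the function names, and the counts for "theoretical computer" are assumptions invented for the example; only the 75% threshold and the "man and" count come from the slides.

```python
STOPWORDS = {"and", "the", "with", "of"}  # hypothetical stopword list

def should_delete(phrase, counts, expansions, threshold=0.75):
    """True if phrase starts or ends with a stopword, or is mostly one longer phrase."""
    if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
        return True
    return any(counts[longer] > threshold * counts[phrase]
               for longer in expansions.get(phrase, []))

def resolve(phrase, counts, expansions):
    """Expansions of phrase, with deleted phrases replaced by their own expansions."""
    result = []
    for longer in expansions.get(phrase, []):
        if should_delete(longer, counts, expansions):
            result.extend(resolve(longer, counts, expansions))
        else:
            result.append(longer)
    return result

tcs = ("theoretical", "computer", "science")
counts = {
    ("man", "and"): 3,
    ("theoretical",): 10,               # invented count
    ("theoretical", "computer"): 8,     # invented count
    tcs: 8,                             # invented count
}
expansions = {
    ("theoretical",): [("theoretical", "computer")],
    ("theoretical", "computer"): [tcs],
}
kept = resolve(("theoretical",), counts, expansions)
```

With these counts "theoretical computer" is deleted because it almost always continues to "theoretical computer science", so the list for "theoretical" ends up pointing at the three-word phrase.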
Phase 3: Output

[Figure: the word, pair, and triple lists shown under Phase 1: Output, repeated here without change]