
Efficient Approximate Entity Extraction with Edit Distance Constraints

Efficient Approximate Entity Extraction with Edit Distance Constraints. Presented by: Aneeta Kolhe. Introduction. Named Entity Recognition finds approximate matches in text. Important task for information extraction and integration, text mining and also for web search.


Presentation Transcript


  1. Efficient Approximate Entity Extraction with Edit Distance Constraints Presented by: Aneeta Kolhe

  2. Introduction • Named Entity Recognition finds approximate matches in text. • Important task for information extraction and integration, text mining and also for web search.

  3. Problem • Approximate dictionary matching. • Previous solution: token-based similarity constraints. • Proposed solution: neighborhood generation method.

  4. Limitations of the token-based solution • It uses Jaccard coefficient similarity. • It may miss some matches. • It may return too many matches.

  5. For example, given the dictionary entity “al-qaida”: • “al-qaeda” or “al-qa’ida” won’t be matched unless a low Jaccard similarity threshold of 0.33 is used. • At that threshold, “al-qaeda” also matches “al gore” as well as “al pacino”. • Hence we use edit distance.
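
The contrast on this slide can be checked with a small sketch. The tokenizer below (split on anything that is not a letter, digit, or apostrophe) is an assumption, since the slides do not say how strings are tokenized. Under that assumption each of the four strings shares exactly one of three distinct tokens with "al-qaida", so Jaccard cannot separate the spelling variants from the unrelated names, while edit distance can.

```python
import re

def tokens(s):
    """Assumed tokenizer: lower-case runs of letters, digits, or apostrophes."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def jaccard(a, b):
    """Jaccard coefficient of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

entity = "al-qaida"
for other in ["al-qaeda", "al-qa'ida", "al gore", "al pacino"]:
    print(other, round(jaccard(entity, other), 2), edit_distance(entity, other))
```

With an edit distance threshold of 1, only the two spelling variants qualify.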

  6. Problem definition • Given: a document D, a dictionary E of entities, and an edit distance threshold τ. • To find: all substrings of D that are within edit distance τ of one of the entities in E. • Solution: iterate through all the valid substrings of the document D. • Consider each substring as a query segment. • For each query segment, issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
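
The problem definition can be written down directly as a brute-force baseline. This is only a sketch of the naive "iterate over every substring and query the dictionary" formulation, not the method proposed in the talk; the function name `naive_extract` and the linear scan over the dictionary are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def naive_extract(document, dictionary, tau):
    """Return (start, substring, entity) for every substring within tau edits."""
    if not dictionary:
        return []
    # Only substring lengths within tau of some entity length can match.
    lo = max(1, min(len(e) for e in dictionary) - tau)
    hi = max(len(e) for e in dictionary) + tau
    results = []
    for start in range(len(document)):
        for length in range(lo, hi + 1):
            if start + length > len(document):
                break
            seg = document[start:start + length]
            for e in dictionary:
                if abs(len(e) - length) <= tau and edit_distance(seg, e) <= tau:
                    results.append((start, seg, e))
    return results

print(naive_extract("the al-qaeda cell", ["al-qaida"], 1))
```

Every query segment is compared against every entity of similar length, which is the cost the neighborhood generation method is meant to avoid.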

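  7. Neighborhood generation method using partitioning • Split each string into kτ = ⌈(τ + 1)/2⌉ partitions. • If two strings are within edit distance τ, at least one partition has at most one edit error; otherwise every partition would contain at least two errors, for a total of 2kτ ≥ τ + 1 > τ. • Example: s = [abcdefghijkl], s′ = [axxbcdefghxijkl], τ = 3, kτ = 2 • s = [abcdef], [ghijkl] • s′ = [axxbcde], [fghxijkl]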
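
A minimal sketch of the partitioning step follows. The number of partitions is taken as the ceiling of (τ + 1)/2, which is what the pigeonhole argument requires. How leftover characters are distributed when the length is not divisible is an assumption: here the extra characters go to the last partitions, which happens to reproduce the split shown on the slide.

```python
import math

def num_partitions(tau):
    """Smallest k with 2k > tau, so some partition has at most one error."""
    return math.ceil((tau + 1) / 2)

def partition(s, tau):
    """Split s into k contiguous parts; returns (start_offset, substring) pairs."""
    k = num_partitions(tau)
    base, extra = divmod(len(s), k)
    parts, start = [], 0
    for i in range(k):
        # Assumption: the last `extra` partitions get one additional character.
        length = base + (1 if i >= k - extra else 0)
        parts.append((start, s[start:start + length]))
        start += length
    return parts

print(partition("abcdefghijkl", 3))
print(partition("axxbcdefghxijkl", 3))
```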

  8. Shifting and scaling • Example: shifting the first partition of s, [abcdef], by 2 gives [cdefgh]; scaling the result by −1 gives [cdefg]. • Transformation rules • First partition: we only need to consider scaling within the range [−2, 2]. • Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring). • For the rest of the partitions, we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].

  9. Partitioned variant filtering • First partition: 5 variations • Each intermediate partition: 5 · (2τ + 1) variations • Last partition: 2τ + 1 variations • Total number of 1-variants generated = O(mτ²).
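
The per-partition counts on this slide can be added up with a few lines of arithmetic. This sketch assumes kτ = ⌈(τ + 1)/2⌉ partitions and at least two of them; the single-partition case is not covered by the slide, so the hypothetical helper `count_variations` rejects it.

```python
import math

def count_variations(tau):
    """Number of partition variations implied by the three transformation rules."""
    k = math.ceil((tau + 1) / 2)
    if k < 2:
        raise ValueError("the slide's counts assume at least two partitions")
    first = 5                               # scaling in [-2, 2]
    middle = (k - 2) * 5 * (2 * tau + 1)    # shifting in [-tau, tau] times scaling in [-2, 2]
    last = 2 * tau + 1                      # coupled shift and scale in [-tau, tau]
    return first + middle + last

print(count_variations(3))
print(count_variations(5))
```

Because the number of partitions grows linearly in τ and each intermediate partition contributes a number of variations linear in τ, the number of variations grows quadratically in τ.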

  10. Example: s = [abcdef], [ghijkl] and s′ = [axxbcde], [fghxijkl], with variations written as <variation, partition id> • Variations of the first partition of s: <[abcd], 1> <[abcde], 1> <[abcdef], 1> <[abcdefg], 1> <[abcdefgh], 1> • Variations of the second partition of s: <[jkl], 2> <[ijkl], 2> <[hijkl], 2> <[ghijkl], 2> <[fghijkl], 2> <[efghijkl], 2> <[defghijkl], 2> • The second partition of s′, [fghxijkl], has a 1-variant match with the variation [fghijkl] generated from the second partition of s.
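
The variation lists in this example can be regenerated from the three transformation rules. The sketch below reads "shifting" as moving both ends of a partition by the same offset and "scaling" as changing its length at the right end; that reading is inferred from the example and is an assumption, as is the function name `partition_variations`.

```python
def partition_variations(s, spans, tau):
    """spans: list of (start, length). Returns (substring, partition_id) pairs."""
    out = []
    for pid, (start, length) in enumerate(spans, 1):
        if pid == 1:
            # First partition: scaling only, within [-2, 2].
            moves = [(0, d) for d in range(-2, 3)]
        elif pid == len(spans):
            # Last partition: shift by x and scale by -x, so the end stays fixed.
            moves = [(x, -x) for x in range(-tau, tau + 1)]
        else:
            # Other partitions: shift within [-tau, tau], scale within [-2, 2].
            moves = [(x, d) for x in range(-tau, tau + 1) for d in range(-2, 3)]
        for shift, scale in moves:
            begin, n = start + shift, length + scale
            if begin >= 0 and n >= 1 and begin + n <= len(s):
                out.append((s[begin:begin + n], pid))
    return out

s = "abcdefghijkl"
for variation, pid in partition_variations(s, [(0, 6), (6, 6)], tau=3):
    print(variation, pid)
```

For τ = 3 this yields the 5 variations of the first partition and the 7 variations of the second partition listed on the slide.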

  11. Prefix pruning method • If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants. • Assume lp is set to 3. Then 1-variants are generated only from the following prefixes: <[abc], 1> <[ghi], 2> <[hij], 2> <[fgh], 2> • Set lp ≤ m/kτ − 2. • The total number of 1-variants generated is further reduced to O(lp τ²).
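
The effect of prefix pruning on one variation can be sketched as follows. The slides do not define how a variant is formed, so this sketch assumes the deletion-neighborhood reading: the k-variant family of a string is every string obtained by deleting at most k characters. Both `gen_variants` and `pruned_variants` are illustrative names.

```python
from itertools import combinations

def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            out.add("".join(c for i, c in enumerate(s) if i not in drop))
    return out

def pruned_variants(variation, lp):
    """1-variant family of the lp-prefix only, as prefix pruning prescribes."""
    return gen_variants(variation[:lp], 1)

print(sorted(gen_variants("abcdefgh", 1)))    # grows with the variation length
print(sorted(pruned_variants("abcdefgh", 3))) # bounded by the prefix length
```

Under this assumption the number of 1-variants per variation depends on lp rather than on the length of the variation.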

  12. Indexing the entities • Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong. • For each entity whose length is smaller than kτ·lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort. • For each entity whose length is at least kτ·lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which is indexed in Ilong.

  13. Algorithm: BuildIndex(E, τ, lp)
        for each e ∈ E do
          if |e| < kτ·lp + τ then
            V ← GenVariants(e[1 .. min(lp, |e|)], τ);
            /* GenVariants(s, k) generates the k-variant family of string s */
            for each v ∈ V do
              Ishort[v] ← Ishort[v] ∪ { e };
          if |e| ≥ kτ·lp then
            P ← the set of kτ partitions of e;
            for each i-th partition p ∈ P do
              PT ← TransformPartition(p);
              /* according to the three transformation rules in Section 3.1 */
              for each partition variation pT ∈ PT do
                V ← GenVariants(pT[1 .. lp], 1);
                for each v ∈ V do
                  Ilong[v] ← Ilong[v] ∪ { <e, i> };
        return (Ishort, Ilong)
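
The BuildIndex pseudocode can be turned into a runnable sketch under several assumptions that the slides leave open: a k-variant is the string with at most k characters deleted, partitions are contiguous with leftover characters given to the last ones, shifting moves both ends of a partition, and both indexes are plain dictionaries from variant to a set of postings. The helper names are illustrative, not the authors' API.

```python
import math
from collections import defaultdict
from itertools import combinations

def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            out.add("".join(c for i, c in enumerate(s) if i not in drop))
    return out

def partition_spans(n, k):
    """Split a string of length n into k contiguous (start, length) spans."""
    base, extra = divmod(n, k)
    spans, start = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)
        spans.append((start, length))
        start += length
    return spans

def transform(spans, n, tau):
    """Apply the three transformation rules; yields (start, length, pid)."""
    out = []
    for pid, (start, length) in enumerate(spans, 1):
        if pid == 1:
            moves = [(0, d) for d in range(-2, 3)]
        elif pid == len(spans):
            moves = [(x, -x) for x in range(-tau, tau + 1)]
        else:
            moves = [(x, d) for x in range(-tau, tau + 1) for d in range(-2, 3)]
        for shift, scale in moves:
            b, m = start + shift, length + scale
            if b >= 0 and m >= 1 and b + m <= n:
                out.append((b, m, pid))
    return out

def build_index(entities, tau, lp):
    k = math.ceil((tau + 1) / 2)
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:
            # Short entity: index the tau-variant family of its lp-prefix.
            for v in gen_variants(e[:lp], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:
            # Long entity: index 1-variants of each variation's lp-prefix.
            for b, m, pid in transform(partition_spans(len(e), k), len(e), tau):
                for v in gen_variants(e[b:b + m][:lp], 1):
                    i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl", "abcd"], tau=3, lp=3)
print(sorted(i_long["fgh"]), sorted(i_short["abc"]))
```

As in the pseudocode, the two conditions are independent, so an entity whose length falls in [kτ·lp, kτ·lp + τ) is posted to both indexes.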

  14. Algorithm: MatchDocument(D, E, τ)
        for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
          SearchLong(D[p .. p + lp − 1], E, τ);   /* matching entities no shorter than kτ·lp */
          SearchShort(D[p .. p + lp − 1], E, τ);  /* matching entities of length in [Lmin, kτ·lp) */

  15. Algorithm: SearchLong(s)
        R ← ∅;  /* holds results */
        C ← ∅;  /* holds candidates */
        V ← GenVariants(s, 1);  /* generate the 1-variant family */
        for each v ∈ V do
          for each <e, pid> ∈ Ilong[v] do
            C ← C ∪ { <e, pid> };  /* duplicates removed */
        for each <e, pid> ∈ C do
          S ← QuerySegmentInstantiation(e, pid);  /* returns the set of query segment candidates for e */
          for each seg ∈ S do
            if Verify(seg, e) = true then
              R ← R ∪ { <seg, e> };
        return R
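
The probe-then-verify structure of SearchLong can be sketched as below. The slides do not describe QuerySegmentInstantiation, so this sketch swaps in a deliberately naive stand-in: every document window that covers the probe position and has a length within τ of the entity. Verification is a plain edit distance check. The index is a hand-built dictionary for illustration, and the deletion-based `gen_variants` is an assumption.

```python
from itertools import combinations

def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            out.add("".join(c for i, c in enumerate(s) if i not in drop))
    return out

def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search_long(doc, p, lp, i_long, tau):
    """Probe i_long with the 1-variants of doc[p:p+lp], then verify candidates."""
    candidates = set()
    for v in gen_variants(doc[p:p + lp], 1):
        candidates |= i_long.get(v, set())  # set union removes duplicates
    results = set()
    for e, pid in candidates:
        # Naive stand-in for QuerySegmentInstantiation (pid is unused here).
        for n in range(max(1, len(e) - tau), len(e) + tau + 1):
            for b in range(max(0, p - n + 1), min(p, len(doc) - n) + 1):
                seg = doc[b:b + n]
                if edit_distance(seg, e) <= tau:
                    results.add((b, seg, e))
    return results

doc = "zz axxbcdefghxijkl zz"
toy_index = {"fgh": {("abcdefghijkl", 2)}}  # hand-built posting for illustration
for hit in sorted(search_long(doc, 10, 3, toy_index, 3)):
    print(hit)
```

A real implementation would use the partition id to narrow the query segments instead of scanning every window around the probe position.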

  16. SearchShort(s) • We need to generate the τ-variant families for each possible length l between Lmin − τ and lp. • If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified. • Otherwise, we need to perform verification for 2τ + 1 possible query segments.

  17. Reducing the amount of enumeration • For example, enumerate the 1-variants of the string [abcdef] from left to right. • Suppose no variant in the index starts with abc. • The algorithm still enumerates the other three 1-variants containing abc. • To avoid this, set the parameter lpp to lp/2.

  18. Consider 4 possible cases:

  19. Conclusion • Successfully reduced the size of neighborhood • Proposed an efficient query processing algorithm • Optimized the algorithm to share computation • Avoid unnecessary variant enumeration

  20. ?? Questions ??

  21. Thank You !!
