# Compares H to C. They are different. The distance for this comparison is 1.


#### Presentation Transcript

Compares H to C. They are different. The distance for this comparison is 1.

Looking at the neighboring cells, the best distance so far is 0 (as seen in the top-left cell).

So, if we add this distance of 1 to the best previous distance, this cell gets a value of 1 (which is 0+1).

Final distance is 3.
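
The cell-by-cell filling described above can be sketched in Python. This is a minimal sketch of the standard Levenshtein recurrence, not the presenter's own code; the function name `levenshtein` and the example strings are assumptions chosen for illustration, since the transcript does not name the two words being compared.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, filling the whole grid cell by cell."""
    rows, cols = len(a) + 1, len(b) + 1
    # grid[i][j] = distance between the first i chars of a and the first j chars of b
    grid = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        grid[i][0] = i  # delete all i characters
    for j in range(cols):
        grid[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            # Distance for this single comparison: 0 if equal, 1 if different.
            cost = 0 if a[i - 1] == b[j - 1] else 1
            grid[i][j] = min(
                grid[i - 1][j] + 1,         # cell above: deletion
                grid[i][j - 1] + 1,         # cell to the left: insertion
                grid[i - 1][j - 1] + cost,  # top-left cell: match or substitution
            )
    # The bottom-right cell holds the final distance.
    return grid[-1][-1]


print(levenshtein("H", "C"))            # 1, as in the first cell described above
print(levenshtein("kitten", "sitting")) # 3
```

The value returned is the bottom-right cell of the grid, which corresponds to the "final distance" read off at the end of the walk-through.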

In Boyer-Moore, the comparison is done from right to left, starting with the last character in the pattern.

The first comparison is between X and C, which do not match.

But since X does not appear anywhere in the search pattern, we can now rule out a match anywhere in the first 3 characters. So the skip value for X will be initialized to 3, the length of the search pattern.

Again we start from the right by comparing B and C, which again do not match.

However, this time B does occur within the search pattern. The skip value for B will be 1 in order to line up with the last B in the search pattern.

Traditionally, implementations of this algorithm have created a 256-byte table to hold the skip value for all possible characters.
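
A skip table of this kind can be sketched as follows. The transcript does not state the search pattern; the pattern `b"ABC"` below is an assumption chosen to be consistent with the description (length 3, ending in C, with a B one position from the end), and `build_skip_table` is a hypothetical helper name.

```python
def build_skip_table(pattern: bytes) -> list[int]:
    """Return a 256-entry table: how far to slide for each possible byte value."""
    m = len(pattern)
    # Bytes that never occur in the pattern allow a full slide of m positions.
    skip = [m] * 256
    # For every position except the last, record the distance to the pattern's end.
    # Later positions overwrite earlier ones, so the last occurrence wins.
    for i in range(m - 1):
        skip[pattern[i]] = m - 1 - i
    return skip


table = build_skip_table(b"ABC")
print(table[ord("X")])  # 3: X is not in the pattern
print(table[ord("B")])  # 1: slide by one to line up with the B
```

The last pattern character is deliberately left out of the loop; if it were included, its skip value would be 0 and the search could stop advancing.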

Example

pattern = "STING"

string = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"

Try to match the pattern against the first m characters of the string ("A STR").

This fails. Slide the pattern right to look for other matches. Note that R, the rightmost character of the current window, does not occur in the pattern, so the pattern can slide past R.

Fails again. The rightmost character of the window, S, occurs in the pattern precisely once, so slide until the two S's line up.

There is no C in the pattern. Slide past it.

There is no space in the pattern. Slide past it.

There is no O in the pattern. Slide past it.

The rightmost character is T. There is exactly one T in the pattern. Slide to align them.

The pattern now lines up with the STING in CONSISTING: match.
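
The walk-through can be replayed with a short Python sketch. It uses only the skip-table rule (the simplified variant usually called Boyer-Moore-Horspool), not the full Boyer-Moore algorithm with its second shift rule; the function name `horspool_trace` and its return shape are assumptions made for illustration.

```python
def horspool_trace(text: str, pattern: str):
    """Search for pattern; return (match index or -1, rightmost char of each window tried)."""
    m, n = len(pattern), len(text)
    # Skip value per character, taken from every pattern position except the last.
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    checked = []
    pos = 0
    while pos <= n - m:
        last = text[pos + m - 1]  # rightmost character of the current window
        checked.append(last)
        if text[pos:pos + m] == pattern:
            return pos, checked
        # Characters absent from the pattern allow a full slide of m.
        pos += skip.get(last, m)
    return -1, checked


text = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"
index, checked = horspool_trace(text, "STING")
print(index)    # 32
print(checked)  # ['R', 'S', 'C', ' ', 'P', 'O', 'T', 'G']
```

Run on the example strings, the sketch tries eight windows in a 45-character string. It also visits a window ending in P, between the space and the O, which the slides do not show as a separate step.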

• Complexity is O(n), where n is the length of the string being searched and m is the length of the pattern.

• The execution time can actually be sub-linear: the algorithm doesn't need to check every character of the string to be searched but rather skips over some of them (check the rightmost character of the block of m first; if it is not found in the pattern, the entire rest of the block can be skipped).

• Best-case performance is O(n/m). In the best case, only one in m characters needs to be checked.

• Actually works better (on average) with longer m!
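
The best case can be made concrete by counting character comparisons. The sketch below is an illustration under assumptions: it uses the simplified skip-table-only search, and the name `count_comparisons` and the test strings are invented for the example.

```python
def count_comparisons(text: str, pattern: str):
    """Return (match index or -1, number of character comparisons performed)."""
    m, n = len(pattern), len(text)
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare right to left, stopping at the first mismatch.
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        pos += skip.get(text[pos + m - 1], m)
    return -1, comparisons


text = "a" * 1000  # no character of the text occurs in either pattern
print(count_comparisons(text, "b" * 5))   # (-1, 200)
print(count_comparisons(text, "b" * 10))  # (-1, 100)
```

With a text of 1000 characters, the pattern of length 5 needs 200 comparisons and the pattern of length 10 needs 100, which is the n/m behaviour: doubling m halves the work in this best case.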

Text Editor, Digital Library and Search Engine:

Every person who uses a text editor, and every user of a digital library or search engine, needs to find patterns in a text.

The Boyer-Moore algorithm is directly implemented in the search command of practically all text editors. The longest common subsequence dynamic programming algorithm is implemented in system commands that test differences between files.
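
As an illustration of the longest common subsequence idea, here is a minimal Python sketch that computes only the length of the LCS. It is not the code of any particular file-comparison command; the name `lcs_length` is an assumption, and the inputs can be strings or lists of lines.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    # prev[j] = LCS length of the items of a seen so far and the first j items of b
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop one item from a or b
        prev = curr
    return prev[-1]


print(lcs_length("ABCBDAB", "BDCABA"))  # 4

old_lines = ["alpha", "beta", "gamma"]
new_lines = ["alpha", "gamma", "delta"]
print(lcs_length(old_lines, new_lines))  # 2 lines are unchanged
```

When the inputs are lists of lines, the lines outside the common subsequence are the ones a difference report would mark as added or removed.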

Multimedia and Computational Biology:

Computational biology: in finding a close mutation,

Communications: to adjust for transmission noise,

Texts: to detect common typing errors.

Multimedia: to adjust for lossy compression, occlusions, scaling, affine transformations or dimension loss.

DNA sequencing: The largest overlap heuristic for finding the shortest common superstring.

Medical Texts:

The Boyer-Moore-Horspool (BMH) algorithm achieves the best overall results when used with medical texts. This algorithm usually performs at least twice as fast as the other algorithms tested. The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used. Considering the growing amount of text handled in the electronic patient record, it is worth implementing this efficient algorithm.

Retrieving Music Pattern from Musical Database:

When musical notes are to be retrieved from a musical database, string matching is needed. There are four similarity techniques for this: edit distance, Dice similarity, Jaccard similarity and cosine similarity. The musical notes are retrieved by a QBE (query by example) approach. The best scheme for this problem is Levenshtein distance combined with Jaccard similarity, which gives an approximate music search technique, because Jaccard similarity performs excellently in handling a query when a pitch change scenario is selected.
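
Jaccard similarity itself is simple to sketch. The example below compares two note sequences through their sets of adjacent note pairs; representing a melody as a list of note names and using pairs (2-grams) are assumptions made for illustration, not the scheme of the system described above.

```python
def ngrams(notes, n=2):
    """Set of all runs of n consecutive notes."""
    return {tuple(notes[i:i + n]) for i in range(len(notes) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


query = ["C", "D", "E", "F", "G"]   # hypothetical query melody
stored = ["C", "D", "E", "F", "A"]  # hypothetical database entry
score = jaccard(ngrams(query), ngrams(stored))
print(score)  # 0.6: three shared pairs out of five distinct pairs
```

A score of 1.0 means the two sets of pairs are identical and 0.0 means they share nothing, so database entries can be ranked by score against the query.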

Intrusion Detection:

Intrusion detection systems fall into two basic categories: signature-based intrusion detection systems and anomaly detection systems.

Signature-based intrusion detection systems:

Intruders have signatures, like computer viruses.

Find data packets that contain any known intrusion-related signatures or anomalies related to Internet protocols.

Based upon a set of signatures and rules, the detection system is able to find and log suspicious activity and generate alerts.
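
The signature check reduces to exact string matching of known byte patterns against packet payloads. The sketch below is a toy illustration: the signature names and byte strings are hypothetical, and production systems typically use a multi-pattern algorithm and a rule language rather than one substring test per signature.

```python
# Hypothetical signatures, invented for this example only.
SIGNATURES = {
    "path-traversal": b"../..",
    "shell-probe": b"/bin/sh",
}


def match_signatures(payload: bytes, signatures=SIGNATURES) -> list[str]:
    """Return the names of all signatures that occur in the payload."""
    return [name for name, sig in signatures.items() if sig in payload]


packet = b"GET /../../etc/passwd HTTP/1.1"
alerts = match_signatures(packet)
print(alerts)  # ['path-traversal']
```

Each returned name would be logged and raised as an alert; an empty list means no known signature was found in the payload.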

Anomaly-based intrusion detection systems:

Anomaly-based intrusion detection usually depends on packet anomalies present in protocol header parts.

String matching may become the performance bottleneck in deep packet inspection.

Detecting Plagiarism:

Plagiarism detection is composed of a structural phase and a syntactic phase:

In the structural phase, documents are decomposed into components by their syntax and compared at the coarse level.

The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level.

The structural mapping can be applied in a hierarchical way based on the structural organization of a document.

Secondly, the syntactic matching algorithm uses a heuristic look-ahead algorithm for matching consecutive tokens with a verification patch.

Bioinformatics:

Approximate matching of a search pattern to a target (called the “text” in string algorithms) is a fundamental tool in molecular biology.

The pattern is often called the “query” and the text is called a “sequence database”, but we will use “pattern” and “text” to be consistent.

The importance of approximate matching is that biological sequences change and evolve.

Related genes in different organisms, or even similar genes within the same organism, most commonly have similar, but not identical sequences.

Determining which sequences of known function are most similar to a new gene of unknown function is often the first step in finding out what the new gene does.
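
A very small form of approximate matching can be sketched by allowing up to k mismatched characters at each position of the text. This covers substitutions only; real sequence comparison tools also handle insertions and deletions and use scoring schemes. The function name `approx_find` and the DNA strings are assumptions made for illustration.

```python
def approx_find(text: str, pattern: str, k: int):
    """Return (start, mismatches) for every window of text within k mismatches of pattern."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too different, give up on this window
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


print(approx_find("TTACCTGG", "ACGT", 1))  # [(2, 1)]
```

Here the pattern ACGT is found at position 2 as ACCT, with a single substitution, even though it never occurs exactly in the text.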

Digital Forensics:

Digital forensic text string searches are designed to search every byte of the digital evidence, at the physical level, to locate specific text strings of interest to the investigation.

Given the nature of the data sets typically encountered, text string search results are extremely noisy, which results in inordinately high levels of information retrieval (IR) overhead and information overload.

Text Mining:

Information extraction,

topic tracking,

content summarization,

information visualization,

text categorization/ classification, and

text clustering

Video Retrieval:

String-based video retrieval first converts the unstructured video into a curve and marks its feature string.

Approximate string matching is then used to retrieve video quickly.

The characteristic curve of the key frame sequence is first extracted followed by marking the feature string and then approximate string matching is used on the feature string to get fast video retrieval.

Introduction to Algorithms, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.

http://shaunwagner.com/writings_computer_levenshtein.html

R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, vol. 20 (10), pp. 762-772, 1977.

Questions?

THANK YOU!