Data Structures and Algorithms

Data Structures and Algorithms, 2004.2-5

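String: a string is a finite sequence of n (n ≥ 0) characters, written S : "c1c2c3…cn", where S is the name of the string, "c1c2c3…cn" is its value, ci is a character of the string, and n is its length.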

Abstract data type and class definition of a string

const int maxLen = 128;

class String {
    int curLen;                         // current length of the string
    char *ch;                           // array that stores the string
public:
    String ( const String &ob );
    String ( const char *init );
    String ( );
    ~String ( ) { delete [] ch; }
    int Length ( ) const { return curLen; }
    String &operator() ( int pos, int len );
    int operator== ( const String &ob ) const { return strcmp ( ch, ob.ch ) == 0; }
    int operator!= ( const String &ob ) const { return strcmp ( ch, ob.ch ) != 0; }
    int operator! ( ) const { return curLen == 0; }
    String &operator= ( const String &ob );
    String &operator+= ( const String &ob );
    char &operator[] ( int i );
    int Find ( String pat ) const;
};

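Implementation of some string operations

#include <cstdlib>      // exit
#include <cstring>      // strlen, strcpy, strcat, strcmp
#include <iostream>     // cout, cerr
#include <new>          // std::nothrow
using namespace std;

String::String ( const String &ob ) {
    // Copy constructor: copy from the existing string ob.
    ch = new (std::nothrow) char[maxLen + 1];   // nothrow, so that the null check below is meaningful
    if ( !ch ) { cerr << "Allocation Error\n"; exit(1); }
    curLen = ob.curLen;
    strcpy ( ch, ob.ch );
}

String::String ( const char *init ) {
    // Constructor: copy from the existing character array init.
    ch = new (std::nothrow) char[maxLen + 1];
    if ( !ch ) { cerr << "Allocation Error\n"; exit(1); }
    curLen = strlen ( init );
    if ( curLen > maxLen ) { cerr << "String too long\n"; exit(1); }   // added guard against overflowing ch
    strcpy ( ch, init );
}

String::String ( ) {
    // Constructor: create an empty string.
    ch = new (std::nothrow) char[maxLen + 1];
    if ( !ch ) { cerr << "Allocation Error\n"; exit(1); }
    curLen = 0;
    ch[0] = '\0';
}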

Substring extraction: example
[Figure: the last index of the requested substring, pos+len-1, compared with curLen-1 and curLen]

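String &String::operator() ( int pos, int len ) {
    // Extract len consecutive characters starting at position pos
    // and return them as a substring.
    String *temp = new String;                  // dynamically allocated, initially empty
    if ( pos < 0 || pos >= curLen || len < 0 || pos + len - 1 >= maxLen ) {
        temp->curLen = 0;                       // return an empty string
        temp->ch[0] = '\0';
    } else {                                    // extract the substring
        if ( pos + len - 1 >= curLen ) len = curLen - pos;   // clamp to the end of the string
        temp->curLen = len;                     // length of the substring
        for ( int i = 0, j = pos; i < len; i++, j++ )
            temp->ch[i] = ch[j];                // copy the characters
        temp->ch[len] = '\0';                   // terminate the substring
    }
    return *temp;                               // heap object: the caller is responsible for releasing it
}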
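The clamping rule used by operator() can be tried in isolation. The following is a minimal sketch on top of std::string rather than the String class of these slides; the function name substring is an assumption made for illustration. It returns a value, so no heap object is handed to the caller.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the String class in the slides):
// returns at most len characters of s starting at index pos.
std::string substring(const std::string &s, int pos, int len) {
    const int curLen = static_cast<int>(s.size());
    // Invalid requests yield the empty string, as in operator().
    if (pos < 0 || len < 0 || pos >= curLen) return std::string();
    // Clamp len when the requested range runs past the end of s.
    if (len > curLen - pos) len = curLen - pos;
    return s.substr(static_cast<std::size_t>(pos), static_cast<std::size_t>(len));
}
```

For example, substring("Beijing", 5, 10) is clamped to the two remaining characters.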

String &String::operator= ( const String &ob ) {
    // String assignment: copy from the existing string ob.
    if ( &ob != this ) {
        delete [] ch;
        ch = new (std::nothrow) char[maxLen + 1];   // reallocate
        if ( !ch ) { cerr << "Out Of Memory!\n"; exit(1); }
        curLen = ob.curLen;                         // copy the string
        strcpy ( ch, ob.ch );
    }
    else cout << "Attempted assignment of a String to itself!\n";
    return *this;
}

char &String::operator[] ( int i ) {
    // Access the i-th character of the string by index.
    if ( i < 0 || i >= curLen ) {                   // || rather than &&: either bound may be violated
        cout << "Out Of Boundary!\n";
        exit(1);
    }
    return ch[i];
}

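String &String::operator+= ( const String &ob ) {
    // String concatenation.
    if ( curLen + ob.curLen > maxLen ) {            // added guard: the result must fit in maxLen characters
        cerr << "String too long!\n"; exit(1);
    }
    char *temp = ch;                                // keep the original array
    curLen += ob.curLen;                            // accumulate the length
    ch = new (std::nothrow) char[maxLen + 1];
    if ( !ch ) { cerr << "Out Of Memory!\n"; exit(1); }
    strcpy ( ch, temp );                            // copy the original string
    strcat ( ch, ob.ch );                           // append the array of ob
    delete [] temp;
    return *this;
}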
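The concatenation above always works inside a fixed buffer of maxLen+1 characters. As a contrast, here is a small sketch, under the assumption that exact-size allocation is acceptable, of a free function that sizes the result from the two operands. The name concat is hypothetical and does not appear in the slides.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical free function: returns a newly allocated C string holding
// a followed by b. The caller owns the result and must delete[] it.
char *concat(const char *a, const char *b) {
    const std::size_t lenA = std::strlen(a);
    const std::size_t lenB = std::strlen(b);
    char *out = new char[lenA + lenB + 1];   // exact size, +1 for '\0'
    std::memcpy(out, a, lenA);               // copy a without its terminator
    std::memcpy(out + lenA, b, lenB + 1);    // copy b including its terminator
    return out;
}
```

Because the buffer is sized from the operands, no length limit has to be checked before copying.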

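Pattern matching on strings
• Definition: finding the position at which a substring (its first character) occurs in a string.
• Terminology: in pattern matching, the substring is called the pattern and the string is called the target.
• Example: target T : "Beijing", pattern P : "jin", match result = 3.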
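The example result can be reproduced with the standard library. The sketch below wraps std::string::find so that it reports -1 when there is no match, the convention used for Find in these slides; the wrapper name match_position is an assumption.

```cpp
#include <string>

// Returns the position of the first occurrence of pattern in target,
// or -1 when there is none (std::string::find reports npos in that case).
int match_position(const std::string &target, const std::string &pattern) {
    const std::string::size_type pos = target.find(pattern);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

Positions are counted from 0, which is why "jin" is found at 3 in "Beijing".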

Brute-force pattern matching process

Pass 1   T: a b b a b a
         P: a b a
Pass 2   T: a b b a b a
         P:   a b a
Pass 3   T: a b b a b a
         P:     a b a
Pass 4   T: a b b a b a
         P:       a b a

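int String::Find ( String pat ) const {
    // Brute-force pattern matching.
    char *p = pat.ch, *s = ch;
    int i = 0;
    if ( *p && *s )                             // both strings are non-empty
        while ( i <= curLen - pat.curLen ) {
            if ( *p++ == *s++ ) {               // compare characters
                if ( !*p ) return i;            // pattern exhausted: match at position i
            } else {                            // mismatch: align the pattern with the
                i++; s = ch + i; p = pat.ch;    // next target position and compare again
            }
        }
    return -1;
}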
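The same brute-force strategy can be written with indices instead of pointers. This is a sketch on std::string, not the member function of the slides; naive_find is a hypothetical name, and returning -1 for an empty pattern or target mirrors the behaviour of Find.

```cpp
#include <string>

// Brute-force matching: try every alignment i of pattern against target.
int naive_find(const std::string &target, const std::string &pattern) {
    const int n = static_cast<int>(target.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || n == 0) return -1;            // nothing to match, as in Find
    for (int i = 0; i + m <= n; ++i) {          // i: alignment in the target
        int j = 0;
        while (j < m && target[i + j] == pattern[j]) ++j;
        if (j == m) return i;                   // the whole pattern matched
    }
    return -1;
}
```

With the target "abbaba" and the pattern "aba", the fourth alignment (i = 3) is the first that matches.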

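Improved pattern matching

Pattern aligned at target position 0:
T:   t0 t1 t2 … tm-1 … tn-1
pat: p0 p1 p2 … pm-1

Pattern aligned at target position 1:
T:   t0 t1 t2 … tm-1 tm … tn-1
pat:    p0 p1 … pm-2 pm-1

Pattern aligned at target position i, every character matching:
T:   t0 t1 … ti ti+1 … ti+m-2 ti+m-1 … tn-1
             ‖  ‖        ‖      ‖
pat:         p0 p1  …   pm-2   pm-1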

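Time cost of brute-force pattern matching: in the worst case there are n-m+1 passes of m comparisons each, so the total number of comparisons reaches (n-m+1)*m.
• The cause is that the scan pointer into the target has to back up at the start of every new pass. An improved algorithm can avoid backing up the target pointer altogether.
• Time cost of the improved (KMP) pattern matching algorithm:
• If every pass fails on its first character, there are n-m+1 passes and the total number of comparisons is at most (n-m)+m = n.
• If every pass fails on its m-th character, the total number of comparisons is also at most n.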
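The (n-m+1)*m bound can be observed by counting comparisons. The sketch below instruments a brute-force matcher; the function name naive_comparisons and the test input (a target of n copies of 'a' against a pattern ending in a different character) are assumptions chosen to trigger the worst case.

```cpp
#include <string>

// Counts the character comparisons performed by brute-force matching.
// Counting stops as soon as a full match is found.
long naive_comparisons(const std::string &target, const std::string &pattern) {
    const int n = static_cast<int>(target.size());
    const int m = static_cast<int>(pattern.size());
    long count = 0;
    for (int i = 0; i + m <= n; ++i) {
        int j = 0;
        while (j < m) {
            ++count;                                // one comparison
            if (target[i + j] != pattern[j]) break; // mismatch ends the pass
            ++j;
        }
        if (j == m) break;                          // match found
    }
    return count;
}
```

With n = 10 and m = 3, a target "aaaaaaaaaa" and a pattern "aab" cost (10-3+1)*3 = 24 comparisons, while a target "bbbbbbbbbb" fails on the first character of each pass and costs only n-m+1 = 8.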

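Suppose the pattern is aligned with $t_s$ and the first mismatch occurs at $p_{j+1}$:

T: t0 t1 … ts-1 ts ts+1 ts+2 … ts+j-1 ts+j ts+j+1 … tn-1
                ‖  ‖    ‖        ‖      ‖    ≠
P:              p0 p1   p2   …   pj-1   pj   pj+1

Then
$t_s t_{s+1} t_{s+2} \dots t_{s+j} = p_0 p_1 p_2 \dots p_j$    (1)
For the pattern P to match the target T in the next pass, it must hold that
$p_0 p_1 p_2 \dots p_{j-1} \dots p_{m-1} = t_{s+1} t_{s+2} t_{s+3} \dots t_{s+j} \dots t_{s+m}$
If
$p_0 p_1 \dots p_{j-1} \ne p_1 p_2 \dots p_j$    (2)
then, by (1), we can conclude at once that
$p_0 p_1 \dots p_{j-1} \ne t_{s+1} t_{s+2} \dots t_{s+j}$
so the next pass cannot match.

Similarly, if
$p_0 p_1 \dots p_{j-2} \ne p_2 p_3 \dots p_j$
then the pass after that cannot match either, because
$p_0 p_1 \dots p_{j-2} \ne t_{s+2} t_{s+3} \dots t_{s+j}$
This continues until some value k for which
$p_0 p_1 \dots p_{k+1} \ne p_{j-k-1} p_{j-k} \dots p_j$ and $p_0 p_1 \dots p_k = p_{j-k} p_{j-k+1} \dots p_j$
Then, since $t_{s+j-k} t_{s+j-k+1} \dots t_{s+j} = p_{j-k} p_{j-k+1} \dots p_j$ by (1),
$p_0 p_1 \dots p_k = t_{s+j-k} t_{s+j-k+1} \dots t_{s+j}$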

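Determining k
When a mismatch occurs at the j-th character of the pattern, the value of k depends only on the first j characters of the pattern and not on the target. It can be described by a failure function f(j).
Matching with the failure function f(j):
If j = 0, advance the target pointer by 1 and reset the pattern pointer to $p_0$.
If j > 0, leave the target pointer where it is and move the pattern pointer back to $p_{f(j-1)+1}$.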

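Let the pattern be $P = p_0 p_1 \dots p_{m-2} p_{m-1}$. Following the condition on k derived above, its failure function is
$f(j) = k$, where k is the largest integer with $0 \le k < j$ such that $p_0 p_1 \dots p_k = p_{j-k} p_{j-k+1} \dots p_j$, and $f(j) = -1$ if no such k exists.
Example: determining the failure function f(j)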
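To make the definition concrete, here is a brute-force sketch that evaluates it literally. It assumes that f(j) is the largest k < j with p0…pk = pj-k…pj, and -1 when no such k exists, which is how k is used in the derivation above. The function name failure_by_definition and the sample pattern "abaabcac" (the pattern used in the KMP slides of this deck) are choices made for illustration.

```cpp
#include <string>
#include <vector>

// f[j] = largest k with 0 <= k < j and p[0..k] == p[j-k..j], else -1.
// Deliberately naive: it checks the definition directly, candidate by candidate.
std::vector<int> failure_by_definition(const std::string &p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> f(m, -1);
    for (int j = 0; j < m; ++j) {
        for (int k = j - 1; k >= 0; --k) {      // try the largest k first
            // Compare the prefix p[0..k] with the substring p[j-k..j].
            if (p.compare(0, k + 1, p, j - k, k + 1) == 0) {
                f[j] = k;
                break;
            }
        }
    }
    return f;
}
```

For "abaabcac" this yields f = -1, -1, 0, 0, 1, -1, 0, -1. The sketch is only meant for checking small examples by machine, since each f[j] is recomputed from scratch.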

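Naive matching algorithm

#include <string.h>

int index_naive ( char S[], char T[] ) {
    /* S is the main string, T is the pattern */
    int S_len = (int) strlen ( S ), T_len = (int) strlen ( T );
    int i = 0, j = 0;
    while ( i < S_len && j < T_len ) {
        if ( S[i] == T[j] ) { i++; j++; }
        else { i = i - (j - 1); j = 0; }    /* i backs up to the next alignment */
    }
    if ( j == T_len ) return ( i - T_len );
    else return -1;
}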

The naive matching algorithm is inefficient
[Figure: matching the pattern T = "abcac" against the main string S = "ababcabcacbab"; six passes are needed in total]

Improving pattern matching
[Figure: the same pattern T = "abcac" and main string S = "ababcabcacbab"; only three passes are needed]

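Knuth-Morris-Pratt algorithm: demo

S: a c a b a a b a a b c a c      i = 7
T:     a b a a b c                j = 5
k = 2 = next[5]

When S[i] in the main string and T[j] in the pattern mismatch, i does not back up; only j falls back to a position k that lies as far to the right as possible. The core problem of the KMP algorithm is therefore how to determine k = next[j].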

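KMP algorithm analysis (I)
After j falls back to k, the first k pattern characters must agree with the target: S[i-k ... i-1] = T[0 ... k-1]
[Figure: S = a c a b a a b a a b c a c and T = a b a a b c, with i, j and k marked]

KMP algorithm analysis (II)
From the partial match already obtained: S[i-k ... i-1] = T[j-k ... j-1]
[Figure: S = a c a b a a b a a b c a c and T = a b a a b c, with i, j and k marked]

KMP algorithm analysis (III)
With k = next[j]: S[i-k ... i-1] = T[0 ... next[j]-1]
[Figure: S = a c a b a a b a a b c a c and T = a b a a b c, with k and next[j] marked]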

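KMP algorithm analysis (IV)
From (I), (II) and (III) we obtain:
T[0 ... k-1] = T[j-k ... j-1] = T[0 ... next[j]-1]
This gives the definition of k = next[j] (note the index ranges):
$$\mathit{next}[j] = \begin{cases} -1, & j = 0 \\ \max\{\, k \mid 0 < k < j,\ T[0 \dots k-1] = T[j-k \dots j-1] \,\}, & \text{if this set is non-empty} \\ 0, & \text{otherwise} \end{cases}$$
The definition also shows that next[j] does not depend on the main string S.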

Example of the next[j] function

j:        0  1  2  3  4  5  6  7
T:        a  b  a  a  b  c  a  c
next[j]: -1  0  0  1  1  2  0  1

The value -1 makes the main-string pointer i move forward.

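KMP algorithm

#include <string.h>

int index_kmp ( char S[], char T[], int next[] ) {
    /* next[] must already hold the next values of T (passed in explicitly here) */
    int S_len = (int) strlen ( S ), T_len = (int) strlen ( T );
    int i = 0, j = 0;
    while ( i < S_len && j < T_len ) {
        if ( j == -1 || S[i] == T[j] ) { i++; j++; }
        else { j = next[j]; }               /* i stays; only j falls back */
    }
    if ( j == T_len ) return ( i - T_len );
    else return -1;
}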

Computing the next[j] function
• By definition, next[0] = -1.
• Let next[j] = k; find next[j+1].
• If T[j] = T[k], then next[j+1] = k + 1 = next[j] + 1.
• Otherwise (T[j] ≠ T[k]):
• if T[j] = T[next[k]], then next[j+1] = next[k] + 1;
• otherwise ......

Example with j = 5:
j:        0  1  2  3  4  5  6  7
T:        a  b  a  a  b  c  a  c
next[j]: -1  0  0  1  1  2  0  1
next[6] = next[5+1] = next[next[next[5]]] + 1 = next[next[2]] + 1 = next[0] + 1 = -1 + 1 = 0

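The next[j] function

#include <string.h>

void get_next ( char T[], int next[] ) {
    int T_len = (int) strlen ( T );
    int i = 0, j = -1;
    next[0] = -1;
    while ( i < T_len - 1 ) {               /* next[] has T_len entries, so stop before writing next[T_len] */
        if ( j == -1 || T[i] == T[j] ) {
            i++; j++;
            next[i] = j;
        }
        else j = next[j];                   /* fall back within the pattern (next[j], not next[i]) */
    }
}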
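Putting the two routines together gives a complete matcher. The following is a sketch using std::string and std::vector instead of raw arrays; the names build_next and kmp_index are assumptions, and the logic follows get_next and index_kmp as given in these slides.

```cpp
#include <string>
#include <vector>

// Builds the next table of pattern t, with next[0] = -1.
std::vector<int> build_next(const std::string &t) {
    const int m = static_cast<int>(t.size());
    std::vector<int> next(m, -1);
    int i = 0, j = -1;
    while (i < m - 1) {
        if (j == -1 || t[i] == t[j]) {
            ++i; ++j;
            next[i] = j;
        } else {
            j = next[j];                // fall back within the pattern
        }
    }
    return next;
}

// Returns the index of the first occurrence of t in s, or -1.
int kmp_index(const std::string &s, const std::string &t) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    const std::vector<int> next = build_next(t);
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == t[j]) { ++i; ++j; }
        else j = next[j];               // i never moves backwards
    }
    return j == m ? i - m : -1;
}
```

For the pattern "abaabcac" the table comes out as -1, 0, 0, 1, 1, 2, 0, 1, and the pattern is found at index 5 of "acabaabaabcac".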

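Matching process with the KMP algorithm

Pass 1   target:  a c a b a a b a a b c a c a a b c
         pattern: a b a a b c a c
         mismatch at j = 1; j = f(j-1)+1 = 0
Pass 2   target:  a c a b a a b a a b c a c a a b c
         pattern:     a b a a b c a c
         mismatch at j = 5; j = f(j-1)+1 = 2
Pass 3   target:  a c a b a a b a a b c a c a a b c
         pattern:           a b a a b c a c
         the leading (a b) is not compared again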

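int String::fastFind ( String pat ) const {
    // KMP matching with the failure function.
    // pat.f is the failure-function array of the pattern, computed by fail().
    int posP = 0, posT = 0;
    int lengthP = pat.curLen, lengthT = curLen;
    while ( posP < lengthP && posT < lengthT )
        if ( pat.ch[posP] == ch[posT] ) {
            posP++; posT++;                     // equal: keep comparing
        }
        else if ( posP == 0 ) posT++;           // mismatch on the first pattern character
        else posP = pat.f[posP - 1] + 1;        // mismatch: only the pattern pointer falls back
    if ( posP < lengthP ) return -1;
    else return posT - lengthP;
}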

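Computing the failure function f[j]
First set f[0] = -1, then obtain f[j+1] from f[j]:
$f[j+1] = f^{(m)}[j] + 1$, where m is the smallest positive integer such that $p_{f^{(m)}[j]+1} = p_{j+1}$; if no such m exists, $f[j+1] = -1$.
Here $f^{(1)}[j] = f[j]$ and $f^{(m)}[j] = f[\,f^{(m-1)}[j]\,]$.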

f[0] = -1;
j = 1: f[0]+1 = 0, $p_0 \ne p_1$, so f[1] = -1;
j = 2: f[1]+1 = 0, $p_0 = p_2$, so f[2] = f[1]+1 = 0;
j = 3: f[2]+1 = 1, $p_1 \ne p_3$; f[f[2]]+1 = f[0]+1 = 0, $p_0 = p_3$, so f[3] = f[0]+1 = 0;
j = 4: f[3]+1 = 1, $p_1 = p_4$, so f[4] = f[3]+1 = 1;

void String::fail ( ) {
    // Compute the failure function into the member array f.
    int lengthP = curLen;
    f[0] = -1;                                  // assigned directly
    for ( int j = 1; j < lengthP; j++ ) {       // compute f[j] in turn
        int i = f[j-1];
        while ( *(ch+j) != *(ch+i+1) && i >= 0 )
            i = f[i];                           // follow the recurrence
        if ( *(ch+j) == *(ch+i+1) ) f[j] = i + 1;
        else f[j] = -1;
    }
}
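The failure-function version can also be exercised outside the String class. Below is a sketch with free functions on std::string; build_failure and fast_find are hypothetical names, while the logic follows fail and fastFind as given in these slides.

```cpp
#include <string>
#include <vector>

// Computes the failure function f of pattern p, with f[0] = -1.
std::vector<int> build_failure(const std::string &p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> f(m, -1);
    for (int j = 1; j < m; ++j) {
        int i = f[j - 1];
        while (i >= 0 && p[j] != p[i + 1]) i = f[i];   // follow the recurrence
        f[j] = (p[j] == p[i + 1]) ? i + 1 : -1;
    }
    return f;
}

// KMP matching driven by the failure function; returns -1 if not found.
int fast_find(const std::string &target, const std::string &pattern) {
    const std::vector<int> f = build_failure(pattern);
    const int lengthP = static_cast<int>(pattern.size());
    const int lengthT = static_cast<int>(target.size());
    int posP = 0, posT = 0;
    while (posP < lengthP && posT < lengthT) {
        if (pattern[posP] == target[posT]) { ++posP; ++posT; }
        else if (posP == 0) ++posT;          // mismatch on the first character
        else posP = f[posP - 1] + 1;         // target pointer stays put
    }
    return posP < lengthP ? -1 : posT - lengthP;
}
```

On the target "acabaabaabcacaabc" and the pattern "abaabcac" from the matching-process slide, this returns 5, the alignment reached in the third pass.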

Examples of applying string operations
• Text editing
• Building a word index table

Text editing
• Microsoft NotePad
• Microsoft Word (WYSIWYG)
• Unix's VI
• Emacs and Emacsen: GNU Emacs, XEmacs

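Basic data structure for text editing: the "line"

#define MAXLINE 65535
typedef struct {
    char *line;         /* pointer to the characters of the line */
    int length;         /* length of the line */
} Line;
Line text[MAXLINE];

[Figure: the entries text[i] for i = 100 … 105, each holding a line pointer and a line length]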
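A table of lines like text[] can be derived from a single character buffer. The sketch below records an offset and a length per line instead of a raw pointer; the names LineRef and index_lines, and the choice of '\n' as the line separator, are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical counterpart of the Line record: where a line starts
// in the buffer and how many characters it has (newline excluded).
struct LineRef {
    std::size_t start;
    std::size_t length;
};

// Scans the buffer once and emits one LineRef per line.
std::vector<LineRef> index_lines(const std::string &buffer) {
    std::vector<LineRef> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < buffer.size(); ++i) {
        if (buffer[i] == '\n') {
            lines.push_back({start, i - start});
            start = i + 1;                      // next line begins after '\n'
        }
    }
    if (start < buffer.size())                  // last line without a trailing '\n'
        lines.push_back({start, buffer.size() - start});
    return lines;
}
```

An editor can then reach line i directly through the table without rescanning the text.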

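Building a word index table
• A word index table speeds up information retrieval.
[Figure: components of a retrieval system: index-building program, index table, database, retrieval program, user interface, user request, retrieval result]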

Bibliographic retrieval example (building a keyword index table for a book list)

Book file
Book no.   Title
005        Computer Data Structure
010        Introduction to Data Structure
023        Fundamentals of Data Structure
034        The Design and Analysis of Computer Algorithms
050        Introduction to Numerical Analysis
067        Numerical Analysis

Keyword index table
Keyword        Book numbers
algorithms     034
analysis       034, 050, 067
computer       005, 034
data           005, 010, 023
design         034
fundamentals   023
introduction   010, 050
numerical      050, 067
structures     005, 010, 023

Data structure of the keyword index table
Each entry pairs a keyword with its list of book numbers, for example analysis: 034, 050, 067.

#define MaxKeyNum 2500
typedef struct {
    HString key;                        /* the keyword */
    LinkList bnolist;                   /* linked list of book numbers */
} IdxTermType;
typedef struct {
    IdxTermType item[MaxKeyNum+1];
    int last;
} IdxListType;
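The index-building step can be sketched with standard containers in place of HString and LinkList. Everything named here is an assumption for illustration: the alias KeywordIndex, the function build_index, the lower-casing of words, and the small stop list of words that are left out of the index.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// keyword -> sorted set of book numbers
using KeywordIndex = std::map<std::string, std::set<std::string>>;

// books: (book number, title) pairs.
KeywordIndex build_index(const std::vector<std::pair<std::string, std::string>> &books) {
    // Common words that are not indexed (an assumed stop list).
    static const std::set<std::string> stop = {"a", "an", "and", "of", "the", "to"};
    KeywordIndex index;
    for (const auto &book : books) {
        std::istringstream words(book.second);
        std::string word;
        while (words >> word) {                 // split the title on whitespace
            for (char &c : word)
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            if (stop.count(word) == 0) index[word].insert(book.first);
        }
    }
    return index;
}
```

std::map keeps the keywords in alphabetical order and std::set keeps each book-number list sorted, which matches the layout of the keyword index table. The sketch indexes words exactly as they appear in the titles, so it does no singular/plural normalisation.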
