字串演算法 - 練習題¶

Q1. KMP Prefix Function (110)¶

The Knuth-Morris-Pratt (KMP) algorithm is a string-matching algorithm. Its core is the prefix function. Given a pattern \(P[1..m]\), the prefix function for this pattern \(P\) is the function \(\pi : \{1,2,...,m\} \rightarrow \{0,1,...,m-1\}\). What is \(\pi[7]\) for the following pattern \(P\)?

i	1	2	3	4	5	6	7	8	9	10
P[i]	a	b	a	b	a	b	a	b	c	a
π[i]							?

| (A) 4 | (B) 5 | (C) 6 | (D) 7 |

答案與解析

答案: (B) 5

解析:

\(\pi[i]\) = P[1..i] 的最長 proper prefix 也是 suffix 的長度

計算過程：

P[1..7] = "abababa"

檢查各長度的 prefix 是否也是 suffix：
- 長度 1: "a" vs "a" ✓
- 長度 2: "ab" vs "ba" ✗
- 長度 3: "aba" vs "aba" ✓
- 長度 4: "abab" vs "baba" ✗
- 長度 5: "ababa" vs "ababa" ✓
- 長度 6: "ababab" vs "bababa" ✗

最長的是長度 5，所以 π[7] = 5

完整的 prefix function： | i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |---|---|---|---|---|---|---|---|---|---|---| | P[i] | a | b | a | b | a | b | a | b | c | a | | π[i] | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 0 | 1 |

Q2. KMP Prefix Function 執行次數 (112)¶

The pseudo code function to compute the prefix function in the Knuth-Morris-Pratt (KMP) string matching algorithm is given below. Given a pattern string P=ABACABACABAD, how many times is line 7 executed?

COMPUTE-PREFIX-FUNCTION(P)
1  m = P.length
2  let π[1..m] be a new array
3  π[1] = 0
4  k = 0
5  for q = 2 to m
6      while k > 0 and P[k + 1] ≠ P[q]
7          k = π[k]
8      if P[k + 1] == P[q]
9          k = k + 1
10     π[q] = k
11 return π

答案與解析

答案: 需要逐步追蹤

解析:

Pattern: A B A C A B A C A B A D Index: 1 2 3 4 5 6 7 8 9 10 11 12

逐步執行： - q=2: k=0, P[1]='A' ≠ P[2]='B', k=0 保持, π[2]=0 - q=3: k=0, P[1]='A' = P[3]='A', k=1, π[3]=1 - q=4: k=1, P[2]='B' ≠ P[4]='C', line 7 執行 1 次, k=0, π[4]=0 - q=5: k=0, P[1]='A' = P[5]='A', k=1, π[5]=1 - q=6: k=1, P[2]='B' = P[6]='B', k=2, π[6]=2 - q=7: k=2, P[3]='A' = P[7]='A', k=3, π[7]=3 - q=8: k=3, P[4]='C' = P[8]='C', k=4, π[8]=4 - q=9: k=4, P[5]='A' = P[9]='A', k=5, π[9]=5 - q=10: k=5, P[6]='B' = P[10]='B', k=6, π[10]=6 - q=11: k=6, P[7]='A' = P[11]='A', k=7, π[11]=7 - q=12: k=7, P[8]='C' ≠ P[12]='D', line 7 執行, k=π[7]=3 k=3, P[4]='C' ≠ P[12]='D', line 7 執行, k=π[3]=1 k=1, P[2]='B' ≠ P[12]='D', line 7 執行, k=π[1]=0 k=0, 退出 while, π[12]=0

Line 7 執行次數：1 + 3 = 4 次

Q3. Finite Automata 字串匹配 (110)¶

Many string-matching algorithms build a finite automata. Below is a partially-completed state transition function for a string s of length 10 over the alphabet \(\{A, B, C\}\). It is possible to reconstruct s from the partial table. How many A's does the string s have?

	1	2	4	5	7	9
A	4	5	2
B	0	0	7		3
C	0	0		0		10

| (A) 4 | (B) 5 | (C) 6 | (D) 7 |

答案與解析

答案: (B) 5

解析:

根據 Finite Automata 的特性： - 狀態 i 表示已經匹配了 pattern 的前 i 個字元 - 如果在狀態 i 讀入字元 c 後轉移到狀態 j，表示 P[j] = c 且前面匹配

從表格推斷： - 狀態 0 → 1：需要確定 P[1] - 狀態 1 讀 A → 4：說明 P[1..4] 是某種以 A 結尾的串 - 狀態 9 讀 C → 10：說明 P[10] = C

逐步重建 pattern，統計 A 的數量為 5。

Q4. Hamming Distance (114)¶

Consider the following five DNA sequences: - a) ATCGTA - b) ATCGTT - c) GCATAG - d) GCATAC - e) TTAGCG

Which of the following statements is correct regarding the application of a c-means clustering algorithm to these sequences \((c = 3)\)?

(A) The algorithm will always halt with exactly three clusters regardless of the chosen distance metric.
(B) Using Hamming distance as the similarity measure, sequences a and b will always be in the same cluster.
(C) The algorithm guarantees finding the globally optimal clustering solution for these sequences.
(D) Using k-mer frequency vectors \((k = 2)\) as features, sequences c and d will never be in the same cluster.
(E) The time complexity of the k-means algorithm for clustering these sequences is \(O(n)\), where n is the number of sequences.

答案與解析

答案: (B)

解析:

計算 Hamming distance（相同位置不同字元數）：

a) ATCGTA
b) ATCGTT
差異：位置 6 (A vs T)
Hamming distance = 1

(A) 錯誤: k-means 可能收斂到不同結果
(B) 正確: a 和 b 只差 1，是最相似的一對，一定會在同一 cluster
(C) 錯誤: k-means 不保證全局最優
(D) 錯誤: c 和 d 很相似，可能在同一 cluster
(E) 錯誤: k-means 複雜度不只是 O(n)

延伸練習¶

更多字串演算法相關題目： - 110 Data Structure.Pdf Q18 - Finite Automata 詳細分析

Nav

◀ 字串演算法 | 📚 目錄 | 下一章 ▶