跳轉到

字串演算法 - 練習題

Q1. KMP Prefix Function (110)

The Knuth-Morris-Pratt (KMP) algorithm is a string-matching algorithm. Its core is the prefix function. Given a pattern \(P[1..m]\), the prefix function for this pattern \(P\) is the function \(\pi : \{1,2,...,m\} \rightarrow \{0,1,...,m-1\}\). What is \(\pi[7]\) for the following pattern \(P\)?

i 1 2 3 4 5 6 7 8 9 10
P[i] a b a b a b a b c a
π[i] ?

| (A) 4 | (B) 5 | (C) 6 | (D) 7 |

答案與解析

答案: (B) 5

解析:

\(\pi[i]\) = P[1..i] 的最長 proper prefix 也是 suffix 的長度

計算過程:

P[1..7] = "abababa"

檢查各長度的 prefix 是否也是 suffix:
- 長度 1: "a" vs "a" ✓
- 長度 2: "ab" vs "ba" ✗
- 長度 3: "aba" vs "aba" ✓
- 長度 4: "abab" vs "baba" ✗
- 長度 5: "ababa" vs "ababa" ✓
- 長度 6: "ababab" vs "bababa" ✗

最長的是長度 5,所以 π[7] = 5

完整的 prefix function: | i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |---|---|---|---|---|---|---|---|---|---|---| | P[i] | a | b | a | b | a | b | a | b | c | a | | π[i] | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 0 | 1 |


Q2. KMP Prefix Function 執行次數 (112)

The pseudo code function to compute the prefix function in the Knuth-Morris-Pratt (KMP) string matching algorithm is given below. Given a pattern string P=ABACABACABAD, how many times is line 7 executed?

COMPUTE-PREFIX-FUNCTION(P)
1  m = P.length
2  let π[1..m] be a new array
3  π[1] = 0
4  k = 0
5  for q = 2 to m
6      while k > 0 and P[k + 1] ≠ P[q]
7          k = π[k]
8      if P[k + 1] == P[q]
9          k = k + 1
10     π[q] = k
11 return π
答案與解析

答案: 需要逐步追蹤

解析:

Pattern: A B A C A B A C A B A D Index: 1 2 3 4 5 6 7 8 9 10 11 12

逐步執行: - q=2: k=0, P[1]='A' ≠ P[2]='B', k=0 保持, π[2]=0 - q=3: k=0, P[1]='A' = P[3]='A', k=1, π[3]=1 - q=4: k=1, P[2]='B' ≠ P[4]='C', line 7 執行 1 次, k=0, π[4]=0 - q=5: k=0, P[1]='A' = P[5]='A', k=1, π[5]=1 - q=6: k=1, P[2]='B' = P[6]='B', k=2, π[6]=2 - q=7: k=2, P[3]='A' = P[7]='A', k=3, π[7]=3 - q=8: k=3, P[4]='C' = P[8]='C', k=4, π[8]=4 - q=9: k=4, P[5]='A' = P[9]='A', k=5, π[9]=5 - q=10: k=5, P[6]='B' = P[10]='B', k=6, π[10]=6 - q=11: k=6, P[7]='A' = P[11]='A', k=7, π[11]=7 - q=12: k=7, P[8]='C' ≠ P[12]='D', line 7 執行, k=π[7]=3 k=3, P[4]='C' ≠ P[12]='D', line 7 執行, k=π[3]=1 k=1, P[2]='B' ≠ P[12]='D', line 7 執行, k=π[1]=0 k=0, 退出 while, π[12]=0

Line 7 執行次數:1 + 3 = 4 次


Q3. Finite Automata 字串匹配 (110)

Many string-matching algorithms build a finite automata. Below is a partially-completed state transition function for a string s of length 10 over the alphabet \(\{A, B, C\}\). It is possible to reconstruct s from the partial table. How many A's does the string s have?

0 1 2 3 4 5 6 7 8 9
A 4 5 2
B 0 0 7 3
C 0 0 0 10

| (A) 4 | (B) 5 | (C) 6 | (D) 7 |

答案與解析

答案: (B) 5

解析:

根據 Finite Automata 的特性: - 狀態 i 表示已經匹配了 pattern 的前 i 個字元 - 如果在狀態 i 讀入字元 c 後轉移到狀態 j,表示 P[j] = c 且前面匹配

從表格推斷: - 狀態 0 → 1:需要確定 P[1] - 狀態 1 讀 A → 4:說明 P[1..4] 是某種以 A 結尾的串 - 狀態 9 讀 C → 10:說明 P[10] = C

逐步重建 pattern,統計 A 的數量為 5。


Q4. Hamming Distance (114)

Consider the following five DNA sequences: - a) ATCGTA - b) ATCGTT - c) GCATAG - d) GCATAC - e) TTAGCG

Which of the following statements is correct regarding the application of a c-means clustering algorithm to these sequences \((c = 3)\)?

  • (A) The algorithm will always halt with exactly three clusters regardless of the chosen distance metric.
  • (B) Using Hamming distance as the similarity measure, sequences a and b will always be in the same cluster.
  • (C) The algorithm guarantees finding the globally optimal clustering solution for these sequences.
  • (D) Using k-mer frequency vectors \((k = 2)\) as features, sequences c and d will never be in the same cluster.
  • (E) The time complexity of the k-means algorithm for clustering these sequences is \(O(n)\), where n is the number of sequences.
答案與解析

答案: (B)

解析:

計算 Hamming distance(相同位置不同字元數):

a) ATCGTA
b) ATCGTT
差異:位置 6 (A vs T)
Hamming distance = 1

  • (A) 錯誤: k-means 可能收斂到不同結果
  • (B) 正確: a 和 b 只差 1,是最相似的一對,一定會在同一 cluster
  • (C) 錯誤: k-means 不保證全局最優
  • (D) 錯誤: c 和 d 很相似,可能在同一 cluster
  • (E) 錯誤: k-means 複雜度不只是 O(n)

延伸練習

更多字串演算法相關題目: - 110 Data Structure.Pdf Q18 - Finite Automata 詳細分析