一、题目
给定一个连续不包含空格字符的字符串,该字符串仅包含英文小写字母及英文标点符号(逗号、句号、分号),同时给定词库,对该字符串进行精确分词。
说明:
- 精确分词:字符串分词后,不会出现重叠。例如 “ilovechina”,不同切分后可得到 “i”, “love”, “china”。
- 标点符号不分词,仅用于断句。
- 词库:根据常识及词库统计出来的常用词汇。例如:dictionary={“i”,”love”,”china”,”ilovechina”,”lovechina”}。
- 分词原则: 采用分词顺序优先且最长匹配原则。“ilovechina”,假设分词结果[i,ilove,lo,love,ch,china,lovechina] 则输出 [ilove,china]
- 错误输出:[i, lovechina],原因:”ilove” > 优先于 “lovechina” 成词。
- 错误输出:[i, love, china],原因:”ilove” > “i”,遵循最长匹配原则。
二、输入
字符串长度限制:0 < length < 256
词库长度限制:0 < length < 100000
第一行输入待分词语句 “ilovechina”
第二行输入中文词库 “i, love, china, ch, na, ve, lo, this, is, the, word”
三、输出
按顺序输出分词结果 “i, love, china”
四、示例
示例一
1 2 3 4 5
| 输入: ilovechina i,love,china,ch,na,ve,lo,this,is,the,word 输出: i,love,china
|
说明:输入的字符串按最长匹配原则分为 “i”, “love”, “china”。
示例二
1 2 3 4 5
| 输入: ilovechina,thewordisbeautiful i,love,china,ch,na,ve,lo,this,is,the,word,beauti,tiful,ful 输出: i,love,china,the,word,is,beauti,ful
|
五、题解
- Tried树(前缀搜素树)解决最长字符搜索问题
- 最长匹配失败时只取一个字符(包括标点符号、不在词典的字符串)加入结果,继续从下一个字符匹配
- 分词结果中不包括标点符号
5.1 Java 实现
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
| package org.stone.study.algo.ex202412;
import java.util.ArrayList; import java.util.List; import java.util.Scanner;
public class EnglishWordSegmet {
public static void main(String[] args) { Scanner scanner = new Scanner(System.in); String text = scanner.nextLine(); String[] dict = scanner.nextLine().split(",");
List<String> result = wordBreak(text, dict); System.out.println(String.join(",", result)); }
public static List<String> wordBreak(String text, String[] wordDict) { TrieNode root = buildTrie(wordDict); List<String> result = new ArrayList<>(); int n = text.length(); for (int i = 0; i < n; ) { TrieNode node = root;
int longestMatch = -1; for (int j = i; j < n; ++j) { char c = text.charAt(j); if (c < 'a' || c > 'z') { break; } if (node.children[c - 'a'] == null) { break; }
node = node.children[c - 'a']; if (node.isWord) { longestMatch = j; } } if (longestMatch == - 1) { String c = text.substring(i, i + 1); if (c.matches("[a-z]")) { result.add(c); } i++; } else {
result.add(text.substring(i, longestMatch + 1)); i = longestMatch + 1; } }
return result; }
private static TrieNode buildTrie(String[] wordDict) { TrieNode root = new TrieNode(); for (String word : wordDict) { TrieNode node = root; for (char c : word.toCharArray()) { if (node.children[c - 'a'] == null) { node.children[c - 'a'] = new TrieNode(); } node = node.children[c - 'a']; } node.isWord = true; }
return root; }
static class TrieNode { TrieNode[] children; boolean isWord; public TrieNode() { children = new TrieNode[26]; isWord = false; } } }
|
5.2 Python实现
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
|
class TrieNode: def __init__(self) -> None: self.arr = [None] * 26 self.is_word = False
def buildTree(word_dict): root = TrieNode() for word in word_dict: node = root for c in word: idx = ord(c) - ord('a') if not node.arr[idx]: node.arr[idx] = TrieNode() node = node.arr[idx] node.is_word = True return root
def wordBreak(text, word_dict): root = buildTree(word_dict)
n = len(text) res = [] i = 0 while i < n: node = root longest_match = -1 for j in range(i, n): idx = ord(text[j]) - ord('a')
if idx < 0 or idx >= 26: break; if not node.arr[idx]: break
node = node.arr[idx] if node.is_word : longest_match = j
if longest_match == -1: c = text[i : i+1] if 'a' <= c <= 'z': res.append() i += 1 else: res.append(text[i:longest_match + 1]) i = longest_match + 1
return res
if __name__ == '__main__': text = input() word_dict = input().split(',') res = wordBreak(text, word_dict) print(','.join(res))
|