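word is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can identify out-of-vocabulary words such as person names, place names, and organization names. Plugins for Lucene, Solr, and ElasticSearch are also provided.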
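To give a feel for what "dictionary-based" segmentation means, here is a minimal sketch of forward maximum matching, one classic algorithm of this family. It is not the word library's implementation: the class name, the toy dictionary, and the maximum word length are all assumptions made for illustration, and no n-gram disambiguation is performed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaximumMatchingSketch {
    // Hypothetical toy dictionary; a real segmenter loads a large lexicon.
    private static final Set<String> DICTIONARY = new HashSet<>(
            Arrays.asList("我", "是", "一个", "诗人", "生活", "在", "唐朝"));

    // Assumed upper bound on word length, in characters.
    private static final int MAX_WORD_LENGTH = 3;

    // Scans left to right, always taking the longest dictionary word
    // that starts at the current position. Falls back to a single
    // character when no dictionary word matches.
    public static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + MAX_WORD_LENGTH, text.length());
            // Shrink the candidate until it is a known word or one character long.
            while (end > start + 1 && !DICTIONARY.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是一个诗人"));
    }
}
```

Greedy matching like this can pick the wrong split when several dictionary words overlap, which is the kind of ambiguity a language model over neighboring words is meant to resolve.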
Add the dependency (version 1.3):
<dependency>
    <groupId>org.apdplat</groupId>
    <artifactId>word</artifactId>
    <version>1.3</version>
</dependency>
Test:
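import java.util.List;

// JSON is assumed to be fastjson's com.alibaba.fastjson.JSON,
// which must be added as a separate dependency.
import com.alibaba.fastjson.JSON;
import org.apdplat.word.WordSegmenter;
import org.apdplat.word.segmentation.Word;

public class WordFilter {

    public static void automaticSelection(String title) {
        // Segment the text with stop words removed
        List<Word> list = WordSegmenter.seg(title);
        System.out.println(JSON.toJSONString(list));

        // Segment the text with stop words retained
        List<Word> lists = WordSegmenter.segWithStopWords(title);
        System.out.println(JSON.toJSONString(lists));
    }

    public static void main(String[] args) {
        WordFilter.automaticSelection("子查询中的返回结果字段组合是一个索引");
    }
}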
Output (the tokens shown come from a different input sentence than the one passed in main):
[{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"我"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"叫"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"李太白"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"我"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"是"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"一个"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"诗人"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"我"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"生活"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"在"},{"acronymPinYin":"","antonym":[],"frequency":0,"fullPinYin":"","synonym":[],"text":"唐朝"}]
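The two calls in the test differ only in whether stop words are kept in the result. As a rough illustration of that difference, the sketch below filters a list of token strings against a small stop-word set. The class name and the set contents are assumptions made for this example; the library's own stop-word list and filtering logic may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilterSketch {
    // Hypothetical stop-word set, chosen only for this example.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("我", "是", "在", "一个"));

    // Returns the tokens that are not in the stop-word set, preserving order.
    public static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The "text" values from the output above, in order.
        List<String> tokens = Arrays.asList(
                "我", "叫", "李太白", "我", "是", "一个", "诗人", "我", "生活", "在", "唐朝");
        System.out.println(removeStopWords(tokens));
    }
}
```

With this assumed set, only the content-bearing tokens (the name, "诗人", "生活", "唐朝", and the verb "叫") remain, which is typically what a search index wants to keep.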