08. Introduction to ES Built-in Analyzers and the IK Analyzer (ElasticSearch Basics Column)
1. The simple analyzer
The simple analyzer splits text on any non-letter character and lowercases the resulting tokens.
Analyze the given text with the simple analyzer:
POST _analyze
{
"analyzer": "simple",
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["our","usual","study","and","experience","are","our","most","powerful ","support"," at"," a"," critical"," moment"]二、simple_pattern分词器
2. The simple_pattern tokenizer

A tokenizer that emits each contiguous match of a regular expression as a separate token.
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern",
"pattern": "[0123456789]{3}" // 正则表达式表示,如果连续有3个数字在一起,则可以被当作一个单词
}
}
}
}
}对指定内容根据"my_analyzer"分词器进行分词
POST myindex/_analyze
{
"analyzer": "my_analyzer",
"text": "fd-123-4567-890-xxd9-689-x987"
}

The resulting tokens are:
["123","456","890","689","987"]
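As a side note, you don't need to create an index just to experiment: the _analyze API also accepts an inline tokenizer definition (a minimal sketch, assuming a reasonably recent Elasticsearch version):

POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern",
    "pattern": "[0123456789]{3}"
  },
  "text": "fd-123-4567-890-xxd9-689-x987"
}

This returns the same five tokens without touching any index settings.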
3. The simple_pattern_split tokenizer

The simple_pattern_split tokenizer splits the input at pattern matches instead of emitting the matches themselves; like simple_pattern, it supports a more limited regular-expression feature set than the pattern tokenizer, but tokenizes faster.
Its default pattern is the empty string, which yields the entire input as a single term, so in practice you should always configure a pattern that fits your data rather than rely on the default.
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": "-" // 当遇到"-"符号就进行分词
}
}
}
}
}对指定内容根据"-"分隔符匹配规则进行分词
POST myindex/_analyze
{
"analyzer": "my_analyzer",
"text": "fd-123-4567896-890-xxd9-689-x987"
}

The resulting tokens are:
["fd","123","4567896","890","xxd9","689","x987"]
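Note the contrast with simple_pattern on this same input: simple_pattern would keep only the digit-run matches and drop everything else, whereas simple_pattern_split kept "fd", "xxd9", and the full "4567896". You can verify this with the inline form shown earlier (a sketch, not tied to any index):

POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern",
    "pattern": "[0123456789]{3}"
  },
  "text": "fd-123-4567896-890-xxd9-689-x987"
}

which returns ["123","456","789","890","689","987"].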
4. The standard analyzer

The standard analyzer is Elasticsearch's default analyzer. It splits text according to the Unicode Text Segmentation algorithm and lowercases the resulting tokens.
POST _analyze
{
"analyzer": "standard",
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["Our","usual","study","and","experience","are","our","most","powerful ","support"," at"," a"," critical"," moment"]standard分词器还提供了两种参数
- max_token_length: the maximum length of a token; anything longer is split at that length and the remainder becomes additional tokens. Defaults to 255.
- stopwords: a pre-defined stop-word list such as _english_, or an array of stop words. Defaults to _none_ (no stop words).
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"english_analyzer":{
"type":"standard",
"max_token_length":6,
"stopwords":"_english_"
}
}
}
}
}

Analyze the given text under the rules above:
POST myindex/_analyze
{
"analyzer": "english_analyzer",
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["our","usual","study","experi","ence","our","most","powerf","ul","suppor","t","critic","al","moment"]。五、自定义分词器
5. Custom analyzers

If you want a custom analyzer that behaves like standard, you can assemble one from the same building blocks in the index settings:
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"rebuild_analyzer":{
"type":"keyword", // 根据关键字类型分词
"tokenizer":"standard",
"filter":["lowercase"] // 单词都转成小写
}
}
}
}
}

Analyze the given text with the custom analyzer defined above:
POST myindex/_analyze
{
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["our","usual","study","and","experience","are","our","most","powerful ","support"," at"," a"," critical"," moment"]六、IK分词器
6. The IK analyzer

IK is a widely used third-party Chinese analysis plugin for Elasticsearch. Analyze a Chinese sentence with its ik_max_word analyzer:

POST _analyze
{
"analyzer": "ik_max_word",
"text": "内心没有分别心,就是真正的苦行"
}

The resulting tokens are:
["内心","没有","分别","心","就是","真正","的","苦行"]

If the IK plugin is installed but you do not specify one of its analyzers (that is, you omit the "analyzer": "ik_max_word" line), the result is the same as without IK: the Chinese text is split into individual characters.
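You can see that single-character behavior directly by running the same sentence through the default standard analyzer:

POST _analyze
{
  "analyzer": "standard",
  "text": "内心没有分别心,就是真正的苦行"
}

which returns one token per character: ["内","心","没","有","分","别","心","就","是","真","正","的","苦","行"].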
IK provides two analysis modes:
- ik_max_word: splits the text at the finest granularity
- ik_smart: splits the text at the coarsest granularity
6.1 The ik_max_word mode
POST _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国国歌"
}

The resulting tokens are:
["中华人民共和国","中华人民","中华","华人","人民共和国","人民","共和国","共和","国","国歌"]

6.2 The ik_smart mode
POST _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国国歌"
}

The resulting tokens are:
["中华人民共和国","国歌"]

6.3 Specifying the IK analyzer

Define an IK-based custom analyzer in the index settings and apply it to the mapped text fields:
PUT myindex
{
"settings":{
"analysis":{
"analyzer":{
"ik":{
"tokenizer":"ik_max_word"
}
}
}
},
"mappings":{
"properties":{
"field1":{
"type":"text"
},
"field2":{
"type":"integer"
},
"field3":{
"type":"text"
},
"field4":{
"type":"text"
}
}
}
}

Alternatively, specify an analyzer for each field individually:
PUT myindex
{
"mappings":{
"properties":{
"field1":{
"type":"text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word" // 设置搜索分词模式
},
"field2":{
"type":"integer"
},
"field3":{
"type":"text",
"analyzer": "standard",
"search_analyzer": "standard"
},
"field4":{
"type":"text",
"analyzer": "ik_max_word",
"search_analyzer": " ik_smart"
}
}
}
}
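To check which analyzer a mapped field actually uses, the _analyze API also accepts a field parameter; run against the field1 mapping above, this applies its ik_max_word analyzer:

POST myindex/_analyze
{
  "field": "field1",
  "text": "中华人民共和国国歌"
}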
To combine a fine-grained index-time analyzer with a coarser search-time analyzer, set both in the mapping:

PUT /myindex
{
"mappings": {
"properties": {
"name":{
"type": "keyword"
},
"address":{
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart" // 设置搜索分词模式
},
"age":{
"type": "integer"
}
}
}
}

Run a full-text search for documents whose address matches "魏国", specifying the analysis mode explicitly:
POST myindex/_search
{
"query":{
"match": {
"address": {
"query": "魏国",
"analyzer": "ik_smart" // 这句可以不写,因为默认就是这种模式
}
}
}
}

As a rule of thumb, index with the finest-grained analysis available; that leaves you free at search time to choose whichever analysis mode the project actually requires.
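To round the section out, a minimal end-to-end sketch against the name/address/age mapping above (the document content is invented):

POST myindex/_doc
{
  "name": "张三",
  "address": "魏国都城",
  "age": 30
}

POST myindex/_search
{
  "query": {
    "match": { "address": "魏国" }
  }
}

The address was indexed with ik_max_word, which should include "魏国" among its tokens, so the ik_smart query term "魏国" matches.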