08. Elasticsearch Built-in Analyzers and the IK Analyzer (Elasticsearch Basics Series)

2022-09-08

1. The simple analyzer

The simple analyzer breaks text into tokens at any character that is not a letter, and lowercases every token.

Analyze the following text with the simple analyzer:

POST _analyze
{
  "analyzer": "simple",
  "text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:

["our","usual","study","and","experience","are","our","most","powerful","support","at","a","critical","moment"]

2. The simple_pattern tokenizer

The simple_pattern tokenizer emits a token for every match of a regular expression:

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}" // 正则表达式表示,如果连续有3个数字在一起,则可以被当作一个单词
        }
      }
    }
  }
}

Analyze the following text with the custom my_analyzer:

POST myindex/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-123-4567-890-xxd9-689-x987"
}

The resulting tokens are:

["123","456","890","689","987"]
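The non-overlapping matching of simple_pattern behaves much like Python's re.findall, which makes the output above easy to reproduce as a sketch:

```python
import re

def simple_pattern_tokens(pattern, text):
    # simple_pattern emits each non-overlapping match of the pattern as a
    # token, like re.findall. Note how "4567" yields only "456": after three
    # digits are consumed, matching resumes at the lone "7", which cannot
    # form another 3-digit run.
    return re.findall(pattern, text)

tokens = simple_pattern_tokens(r"[0123456789]{3}", "fd-123-4567-890-xxd9-689-x987")
```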

3. The simple_pattern_split tokenizer

The simple_pattern_split tokenizer splits the input wherever a regular expression matches. It is more limited than simple_pattern, but splitting is faster than matching.

Its default pattern is the empty string, which never splits anything, so you should always configure a pattern that fits your data rather than rely on the default:

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"  // 当遇到"-"符号就进行分词
        }
      }
    }
  }
}

Analyze the following text, splitting on "-":

POST myindex/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-123-4567896-890-xxd9-689-x987"
}

The resulting tokens are:

["fd","123","4567896","890","xxd9","689","x987"]
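Splitting on a pattern and discarding the delimiters is exactly what re.split does, so the behavior can be sketched locally:

```python
import re

def simple_pattern_split_tokens(pattern, text):
    # simple_pattern_split cuts the input at every match of the pattern and
    # keeps the pieces in between; empty pieces are discarded.
    return [t for t in re.split(pattern, text) if t]

tokens = simple_pattern_split_tokens("-", "fd-123-4567896-890-xxd9-689-x987")
```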

4. The standard analyzer

The standard analyzer is Elasticsearch's default. It tokenizes text according to the Unicode Text Segmentation algorithm and lowercases the tokens:

POST _analyze
{
  "analyzer": "standard",
  "text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:

["our","usual","study","and","experience","are","our","most","powerful","support","at","a","critical","moment"]

The standard analyzer accepts two parameters:

  • max_token_length: the maximum token length; a longer token is split at max_token_length intervals and the remainder is emitted as additional tokens. Defaults to 255.
  • stopwords: a predefined stop-word list such as _english_, or an array of stop words. Defaults to _none_.

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer":{
          "type":"standard",
          "max_token_length":6,
          "stopwords":"_english_"
        }
      }
    }
  }
}

Analyze the following text with this analyzer:

POST myindex/_analyze
{
  "analyzer": "english_analyzer",
  "text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:

["our","usual","study","experi","ence","our","most","powerf","ul","suppor","t","critic","al","moment"]

5. Custom analyzers

To build a variant of the standard analyzer, define a custom analyzer and configure the equivalent tokenizer and filters. Note that the type must be "custom":

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_analyzer": {
          "type": "custom",          // assemble the analyzer from parts
          "tokenizer": "standard",   // tokenize with the standard tokenizer
          "filter": ["lowercase"]    // lowercase every token
        }
      }
    }
  }
}

Analyze the following text with the custom analyzer:

POST myindex/_analyze
{
  "analyzer": "rebuild_analyzer",
  "text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:

["our","usual","study","and","experience","are","our","most","powerful","support","at","a","critical","moment"]

6. The IK analyzer

The IK plugin adds Chinese word segmentation to Elasticsearch. Analyze a Chinese sentence with its ik_max_word mode:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "内心没有分别心,就是真正的苦行"
}

The resulting tokens are:

["内心","没有","分别","心","就是","真正","的","苦行"]

If the IK plugin is installed but you do not request it (i.e. you omit "analyzer": "ik_max_word"), the result is the same as without IK installed: the Chinese text is split into individual characters.
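That default single-character behavior for Chinese can be illustrated with a rough Python sketch (it keeps only characters in the CJK Unified Ideographs range; the real standard analyzer also handles punctuation, Latin runs, and other scripts):

```python
def single_char_cjk_tokens(text):
    # Without a Chinese word-segmentation plugin, each CJK ideograph is
    # emitted as its own token; this toy version keeps ideographs and
    # drops everything else (e.g. punctuation).
    return [c for c in text if "\u4e00" <= c <= "\u9fff"]

tokens = single_char_cjk_tokens("内心没有分别心")
```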

The IK analyzer offers two analysis modes:

  • ik_max_word: splits the text at the finest granularity
  • ik_smart: splits the text at the coarsest granularity

6.1 The ik_max_word mode

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

The resulting tokens are:

["中华人民共和国","中华人民","中华","华人","人民共和国","人民","共和国","共和","国","国歌"]

6.2 The ik_smart mode

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

The resulting tokens are:

["中华人民共和国","国歌"]
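IK's segmentation is dictionary-driven. Its actual algorithm is considerably more elaborate, but the coarse-grained idea behind ik_smart can be illustrated with a toy forward-maximum-matching pass over a tiny hypothetical dictionary:

```python
# Toy dictionary for illustration only; IK ships with a much larger one.
DICTIONARY = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
              "人民", "共和国", "共和", "国歌"}

def forward_max_match(text, dictionary, max_len=7):
    # Greedily take the longest dictionary word starting at each position,
    # producing non-overlapping, coarse-grained tokens. Single characters
    # are emitted as a fallback when no dictionary word matches.
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in dictionary:
                tokens.append(word)
                i += length
                break
    return tokens

tokens = forward_max_match("中华人民共和国国歌", DICTIONARY)
```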

6.3 Specifying the IK analyzer

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "ik"  // use the custom IK analyzer defined above
      },
      "field2": {
        "type": "integer"
      },
      "field3": {
        "type": "text",
        "analyzer": "ik"
      },
      "field4": {
        "type": "text",
        "analyzer": "ik"
      }
    }
  }
}

Or specify an analyzer for each field individually:

PUT myindex
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"  // analyzer used at search time
      },
      "field2": {
        "type": "integer"
      },
      "field3": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      },
      "field4": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Querying with a specified analyzer. Create an index whose address field is indexed with ik_max_word and searched with ik_smart:

PUT /myindex
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "address": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"  // analyzer used at search time
      },
      "age": {
        "type": "integer"
      }
    }
  }
}

Run a full-text search for documents whose address matches "魏国", optionally specifying the analyzer:

POST myindex/_search
{
  "query": {
    "match": {
      "address": {
        "query": "魏国",
        "analyzer": "ik_smart"  // optional: ik_smart is already this field's search_analyzer
      }
    }
  }
}

As a rule of thumb, index with the finest-grained analysis, so that at search time you can choose whichever analysis mode fits the project's requirements.
Copyright: IT技术分享

Original link: https://idunso.com/archives/2897/ (please credit the source and include this link when reposting)