08. Introduction to ES Built-in Analyzers and the IK Analyzer (ElasticSearch Basics Column)
1. The simple analyzer
The simple analyzer splits text on any non-letter character and lowercases the resulting tokens.
Analyze the given text with the simple analyzer:
POST _analyze
{
"analyzer": "simple",
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["our","usual","study","and","experience","are","our","most","powerful ","support"," at"," a"," critical"," moment"]二、simple_pattern分词器
2. The simple_pattern tokenizer

A tokenizer that emits each contiguous match of a regular expression as a separate token.
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern",
"pattern": "[0123456789]{3}" // 正则表达式表示,如果连续有3个数字在一起,则可以被当作一个单词
}
}
}
}
}对指定内容根据"my_analyzer"分词器进行分词
POST myindex/_analyze
{
"analyzer": "my_analyzer",
"text": "fd-123-4567-890-xxd9-689-x987"
}

The resulting tokens are:
["123","456","890","689","987"]
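As a side note, you don't need to create an index just to experiment: the _analyze API also accepts an inline tokenizer definition (a minimal sketch, assuming a reasonably recent Elasticsearch version):

POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern",
    "pattern": "[0123456789]{3}"
  },
  "text": "fd-123-4567-890-xxd9-689-x987"
}

This returns the same five tokens without touching any index settings.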
3. The simple_pattern_split tokenizer

The simple_pattern_split tokenizer splits the input at pattern matches instead of emitting the matches themselves; like simple_pattern, it supports a more limited regular-expression feature set than the pattern tokenizer, but tokenizes faster.
Its default pattern is the empty string, which yields the entire input as a single term, so in practice you should always configure a pattern that fits your data rather than rely on the default.
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": "-" // 当遇到"-"符号就进行分词
}
}
}
}
}对指定内容根据"-"分隔符匹配规则进行分词
POST myindex/_analyze
{
"analyzer": "my_analyzer",
"text": "fd-123-4567896-890-xxd9-689-x987"
}

The resulting tokens are:
["fd","123","4567896","890","xxd9","689","x987"]
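Note the contrast with simple_pattern on this same input: simple_pattern would keep only the digit-run matches and drop everything else, whereas simple_pattern_split kept "fd", "xxd9", and the full "4567896". You can verify this with the inline form shown earlier (a sketch, not tied to any index):

POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern",
    "pattern": "[0123456789]{3}"
  },
  "text": "fd-123-4567896-890-xxd9-689-x987"
}

which returns ["123","456","789","890","689","987"].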
4. The standard analyzer

The standard analyzer is Elasticsearch's default analyzer. It splits text according to the Unicode Text Segmentation algorithm and lowercases the resulting tokens.
POST _analyze
{
"analyzer": "standard",
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["Our","usual","study","and","experience","are","our","most","powerful ","support"," at"," a"," critical"," moment"]standard分词器还提供了两种参数
- max_token_length: the maximum length of a token; anything longer is split at that length and the remainder becomes additional tokens. Defaults to 255.
- stopwords: a pre-defined stop-word list such as _english_, or an array of stop words. Defaults to _none_ (no stop words).
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"english_analyzer":{
"type":"standard",
"max_token_length":6,
"stopwords":"_english_"
}
}
}
}
}

Analyze the given text under the rules above:
POST myindex/_analyze
{
"analyzer": "english_analyzer",
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["our","usual","study","experi","ence","our","most","powerf","ul","suppor","t","critic","al","moment"]。五、自定义分词器
5. Custom analyzers

If you want a custom analyzer that behaves like standard, you can assemble one from the same building blocks in the index settings:
PUT myindex
{
"settings": {
"analysis": {
"analyzer": {
"rebuild_analyzer":{
"type":"keyword", // 根据关键字类型分词
"tokenizer":"standard",
"filter":["lowercase"] // 单词都转成小写
}
}
}
}
}

Analyze the given text with the custom analyzer defined above:
POST myindex/_analyze
{
"text": "Our usual study and experience are our most powerful support at a critical moment"
}

The resulting tokens are:
["our","usual","study","and","experience","are","our","most","powerful ","support"," at"," a"," critical"," moment"]六、IK分词器
6. The IK analyzer

IK is a widely used third-party Chinese analysis plugin for Elasticsearch. Analyze a Chinese sentence with its ik_max_word analyzer:

POST _analyze
{
"analyzer": "ik_max_word",
"text": "内心没有分别心,就是真正的苦行"
}

The resulting tokens are:
["内心","没有","分别","心","就是","真正","的","苦行"]

If the IK plugin is installed but you do not specify one of its analyzers (that is, you omit the "analyzer": "ik_max_word" line), the result is the same as without IK: the Chinese text is split into individual characters.
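You can see that single-character behavior directly by running the same sentence through the default standard analyzer:

POST _analyze
{
  "analyzer": "standard",
  "text": "内心没有分别心,就是真正的苦行"
}

which returns one token per character: ["内","心","没","有","分","别","心","就","是","真","正","的","苦","行"].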
IK provides two analysis modes:
- ik_max_word: splits the text at the finest granularity
- ik_smart: splits the text at the coarsest granularity
6.1 The ik_max_word mode
POST _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国国歌"
}

The resulting tokens are:
["中华人民共和国","中华人民","中华","华人","人民共和国","人民","共和国","共和","国","国歌"]

6.2 The ik_smart mode
POST _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国国歌"
}

The resulting tokens are:
["中华人民共和国","国歌"]

6.3 Specifying the IK analyzer

Define an IK-based custom analyzer in the index settings and apply it to the mapped text fields:
PUT myindex
{
"settings":{
"analysis":{
"analyzer":{
"ik":{
"tokenizer":"ik_max_word"
}
}
}
},
"mappings":{
"properties":{
"field1":{
"type":"text"
},
"field2":{
"type":"integer"
},
"field3":{
"type":"text"
},
"field4":{
"type":"text"
}
}
}
}

Alternatively, specify an analyzer for each field individually:
PUT myindex
{
"mappings":{
"properties":{
"field1":{
"type":"text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word" // 设置搜索分词模式
},
"field2":{
"type":"integer"
},
"field3":{
"type":"text",
"analyzer": "standard",
"search_analyzer": "standard"
},
"field4":{
"type":"text",
"analyzer": "ik_max_word",
"search_analyzer": " ik_smart"
}
}
}
}
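To check which analyzer a mapped field actually uses, the _analyze API also accepts a field parameter; run against the field1 mapping above, this applies its ik_max_word analyzer:

POST myindex/_analyze
{
  "field": "field1",
  "text": "中华人民共和国国歌"
}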
To combine a fine-grained index-time analyzer with a coarser search-time analyzer, set both in the mapping:

PUT /myindex
{
"mappings": {
"properties": {
"name":{
"type": "keyword"
},
"address":{
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart" // 设置搜索分词模式
},
"age":{
"type": "integer"
}
}
}
}

Run a full-text search for documents whose address matches "魏国", specifying the analysis mode explicitly:
POST myindex/_search
{
"query":{
"match": {
"address": {
"query": "魏国",
"analyzer": "ik_smart" // 这句可以不写,因为默认就是这种模式
}
}
}
}

As a rule of thumb, index with the finest-grained analysis available; that leaves you free at search time to choose whichever analysis mode the project actually requires.
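To round the section out, a minimal end-to-end sketch against the name/address/age mapping above (the document content is invented):

POST myindex/_doc
{
  "name": "张三",
  "address": "魏国都城",
  "age": 30
}

POST myindex/_search
{
  "query": {
    "match": { "address": "魏国" }
  }
}

The address was indexed with ik_max_word, which should include "魏国" among its tokens, so the ik_smart query term "魏国" matches.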