Integrating the IK Analyzer with Elasticsearch

后端大苹果 2023-12-27 21:58

ElasticSearch

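Preface

To search and query Chinese text effectively, Elasticsearch needs a suitable analyzer plugin. The IK analyzer is a plugin that provides word segmentation for Chinese.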

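Integrating the IK Analyzer

Download

IK analyzer download
Note that the plugin version must match the Elasticsearch version. At the time of writing there is no IK release for 8.3.3, so we download Elasticsearch 8.2.3 instead.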

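Installation

Extract the downloaded IK archive into its own folder under the plugins directory of elasticsearch-8.2.3 (here, elasticsearch-analysis-ik-8.2.3), then restart ES.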

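Using the IK Analyzer

The IK analyzer provides two segmentation modes:

▶️ ik_smart: performs the coarsest-grained segmentation; suited to phrase queries

▶️ ik_max_word: performs the finest-grained segmentation, exhausting all possible word combinations; suited to term queries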

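Setting IK as the Default Analyzer for an Index

With a default analyzer configured, we do not have to declare every field in the index. Dynamic field mapping maps String fields to the text type, and those fields are then analyzed with ik_max_word.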

PUT ik_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}

Usage

GET your_index/_analyze
{
    "analyzer": "ik_smart",
    "text": ["分词词语测试"]
}
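
To compare the two analyzers from Java without extra dependencies, the request above can be sketched with the JDK's built-in HTTP client (Java 11+). This is only a sketch under assumptions that are not in the original post: the node listens on http://localhost:9200 with security disabled, the index ik_index already exists, and AnalyzeDemo, analyzeBody and analyzeRequest are illustrative names. POST is used to carry the JSON body; _analyze accepts both GET and POST.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class AnalyzeDemo {
    // Assumption: a local node with security disabled, as in the curl example of this post.
    static final String BASE_URL = "http://localhost:9200";

    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body of an _analyze request for one analyzer and one text. */
    static String analyzeBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + jsonEscape(analyzer)
                + "\",\"text\":\"" + jsonEscape(text) + "\"}";
    }

    /** Builds (but does not send) the _analyze request against the given index. */
    static HttpRequest analyzeRequest(String index, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/" + index + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        analyzeBody(analyzer, text), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Run the same text through both IK analyzers and print the raw responses.
        for (String analyzer : new String[] {"ik_smart", "ik_max_word"}) {
            HttpRequest request = analyzeRequest("ik_index", analyzer, "分词词语测试");
            HttpResponse<String> response = client.send(
                    request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
            System.out.println(analyzer + " -> " + response.body());
        }
    }
}
```

Running main prints the raw JSON token list returned for each analyzer, which makes the difference in granularity easy to see side by side.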

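Customizing Segmentation

When using the IK analyzer, we sometimes find that the segmentation is not what we expect: some specialized terms get split apart when we would rather keep them whole.
The IK plugin supports custom dictionaries, so we can add the terms we want to keep intact.

Custom Dictionary

First, create a file named my_self.dic in the config directory of elasticsearch-analysis-ik-8.2.3 and list the terms you do not want split, one per line, for example 分词词语.
Next, edit the configuration file config/IKAnalyzer.cfg.xml: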

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
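	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here; separate multiple files with ';' -->
	<entry key="ext_dict">my_self.dic</entry>
	<!-- Configure your own extension stopword dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->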
</properties>
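
Because a wrongly encoded dictionary is a common reason for custom terms being ignored, the dictionary file can also be generated programmatically. The following is a minimal sketch, assuming Java 8+; DictWriter and writeDict are hypothetical names, and the output path is taken from the first argument (defaulting to my_self.dic in the working directory), so point it at your own plugin config directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class DictWriter {
    /**
     * Writes one term per line, encoded as UTF-8 without a BOM.
     * Blank entries are skipped and surrounding whitespace is trimmed.
     */
    static void writeDict(Path file, List<String> terms) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) {
                sb.append(trimmed).append('\n');
            }
        }
        Files.write(file, sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real location of my_self.dic as the first argument.
        Path dict = Paths.get(args.length > 0 ? args[0] : "my_self.dic");
        writeDict(dict, Arrays.asList("分词词语"));
        System.out.println("Wrote " + dict.toAbsolutePath());
    }
}
```

After regenerating the file, restart ES so the local dictionary is loaded again.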

Hot-Updating the IK Dictionary

<!-- Configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- Configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">location</entry>

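Here location is a URL, for example http://yoursite.com/getCustomDict. The endpoint only has to satisfy the following two requirements for hot updates to work.

1. The HTTP response must carry two headers, Last-Modified and ETag. Both are strings; whenever either one changes, the plugin fetches the new word list and updates its dictionary.
2. The response body contains one term per line, with \n as the line separator.

Once both requirements are met, the dictionary is hot-updated without restarting the ES instance.

You can keep the hot words in a UTF-8 encoded .txt file and serve it from nginx or another simple HTTP server. When the .txt file changes, the server automatically returns the corresponding Last-Modified and ETag values the next time a client requests the file. A separate tool can extract relevant terms from your business system and update this .txt file.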
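
If nginx is not at hand, the two requirements above can be met with the small HTTP server that ships with the JDK. The following is a rough sketch, not the plugin's reference setup: the class name HotDictServer, port 8080, the file name hot_words.txt and the CRC32-based ETag are all assumptions made for illustration. The handler answers HEAD as well as GET, on the assumption that the plugin may poll the headers before downloading the body.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.zip.CRC32;

class HotDictServer {
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
            .withZone(ZoneOffset.UTC);

    /** Quoted ETag derived from the file content, so it changes when the content changes. */
    static String etagOf(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content);
        return "\"" + Long.toHexString(crc.getValue()) + "\"";
    }

    /** Formats a timestamp as an HTTP date for the Last-Modified header. */
    static String httpDate(Instant instant) {
        return HTTP_DATE.format(instant);
    }

    /** Serves the dictionary file with Last-Modified and ETag set on every response. */
    static void handle(HttpExchange exchange, Path dict) throws IOException {
        byte[] body = Files.readAllBytes(dict);
        exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
        exchange.getResponseHeaders().set("Last-Modified",
                httpDate(Files.getLastModifiedTime(dict).toInstant()));
        exchange.getResponseHeaders().set("ETag", etagOf(body));
        if ("HEAD".equals(exchange.getRequestMethod())) {
            exchange.sendResponseHeaders(200, -1); // headers only, no body
        } else {
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        }
        exchange.close();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file: one term per line, UTF-8 encoded, "\n" line endings.
        Path dict = Paths.get(args.length > 0 ? args[0] : "hot_words.txt");
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/getCustomDict", exchange -> handle(exchange, dict));
        server.start();
        System.out.println("Serving " + dict + " at http://localhost:8080/getCustomDict");
    }
}
```

With this running, remote_ext_dict would point at http://localhost:8080/getCustomDict; editing hot_words.txt changes both headers on the next request.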

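FAQ

1. Why does my custom dictionary not take effect?

Make sure the extension dictionary file is encoded as UTF-8.

2. The segmentation test fails. Call the _analyze API under a specific index rather than calling _analyze directly, for example: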

curl -XGET "http://localhost:9200/your_index/_analyze" -H 'Content-Type: application/json' -d'
{
   "text": "分词词语测试", "tokenizer": "ik_max_word"
}'

Java API

This example builds on the Elasticsearch 8.x Java API client.

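import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.mapping.IntegerNumberProperty;
import co.elastic.clients.elasticsearch._types.mapping.Property;
import co.elastic.clients.elasticsearch._types.mapping.TextProperty;
import co.elastic.clients.elasticsearch.indices.CreateIndexResponse;
import lombok.extern.slf4j.Slf4j;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * @version 1.0.0
 * @className: IndexTest
 * @description: Creates an index whose String field uses the IK analyzer
 * @author: smartzeng
 * @create: 2022/8/8 10:03
 */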
@SpringBootTest
@Slf4j
public class IndexTest
{
    @Autowired
    private ElasticsearchClient elasticsearchClient;

    /**
     * Creates an index and configures a field to use the IK analyzer.
     *
     * @throws IOException if the request to Elasticsearch fails
     */
    @Test
    void createIndexAndIk() throws IOException {
        Map<String, Property> documentMap = new HashMap<>();
        documentMap.put("userName",Property.of(p -> p
                .text(TextProperty.of(textProperty ->
                        textProperty.index(true).analyzer("ik_max_word")))));

        documentMap.put("age", Property.of(property ->
                        property.integer(IntegerNumberProperty.of(integerNumberProperty
                                -> integerNumberProperty.index(true))
                        )
                )
        );
        CreateIndexResponse response = elasticsearchClient.indices().create(createIndexBuilder ->
                createIndexBuilder.index("user").mappings(mappings ->
                                mappings.properties(documentMap))
                        .aliases("User",aliases ->
                                aliases.isWriteIndex(true))
        );
        // Response status
        boolean acknowledged = response.acknowledged();
        boolean shardsAcknowledged = response.shardsAcknowledged();
        String index = response.index();
        log.info("Index creation acknowledged: {}", acknowledged);
        log.info("Shards acknowledged: {}", shardsAcknowledged);
        log.info("Index name: {}", index);
    }
}

About the author

大苹果

github: https://github.com/zyw327

Homepage: http://www.okgoes.com/user?uid=1&type=1