
Installing and Using the Mecab-ko Korean Tokenizer

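Environment:

Linux: CentOS 7 (building on Debian ran into many dependency problems that had to be debugged one at a time)

The same problems occur under Docker, so the centos7 image is recommended.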

1. Installing prerequisites

shell
# Install build dependencies
yum install -y gcc gcc-c++ wget automake autoconf autogen make

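2. Building and installing mecab-ko

mecab-ko: the core library

mecab-ko-dic: the dictionary

mecab-java: the Java binding, which must be built on Linux. Its main purpose here is to produce libMeCab.so and MeCab.jar

The build outputs are installed to /usr/local/lib and /usr/local/bin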

shell
# 1. Install mecab-ko
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
# Extract
tar xvfz mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2
# Build and install (&& stops if the build fails; a single & would run make in the background)
./configure
make && make install
# Verify the installation
mecab --version
cd ..

# 2. Install mecab-ko-dic
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
# Extract
tar xvfz mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720
# Build and install
./autogen.sh
./configure
make && make install
ldconfig
cd ..

# 3. Install a JDK. The download below is Zulu JDK 11.0.16.1; the rest of this guide
#    assumes the JDK is unpacked to /opt/jdk-11.0.2, so adjust the paths to your own location
wget https://cdn.azul.com/zulu/bin/zulu11.58.23-ca-jdk11.0.16.1-linux_x64.tar.gz

# 4. Install mecab-java
wget https://bitbucket.org/eunjeon/mecab-java/downloads/mecab-java-0.996.tar.gz
# Extract
tar zxvf mecab-java-0.996.tar.gz
cd mecab-java-0.996
# Edit the Makefile and point it at your JDK
vi Makefile
# ----------------------
# Replace the Java settings with:
INCLUDE=/opt/jdk-11.0.2/include
JAVAC=/opt/jdk-11.0.2/bin/javac
JAR=/opt/jdk-11.0.2/bin/jar
# -----------------------
# Build libMeCab.so and MeCab.jar
make

# Copy the build outputs to /usr/local/lib
sudo cp libMeCab.so /usr/local/lib
sudo cp MeCab.jar /usr/local/lib


# Set the environment variable
echo "export LD_LIBRARY_PATH=/usr/local/lib" >> ~/.bash_profile
# or edit the file by hand
vim ~/.bash_profile
#------------------
export LD_LIBRARY_PATH=/usr/local/lib
#-----------------
source ~/.bash_profile
# After the steps above, mecab-java can be tested from its source directory
cd /mecab-java
# Compile with javac
/opt/jdk-11.0.2/bin/javac test.java
# Run with java
/opt/jdk-11.0.2/bin/java test

# Alternative: pass the classpath and the native library path explicitly
#/opt/jdk-11.0.2/bin/javac -cp MeCab.jar test.java
#/opt/jdk-11.0.2/bin/java -Djava.library.path=/usr/local/bin/mecab-java -cp .:MeCab.jar test
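
If the test fails with an UnsatisfiedLinkError, the usual cause is that libMeCab.so is not in any directory the JVM searches. The following is a minimal sketch of a check for that, assuming the library was copied to a directory listed in LD_LIBRARY_PATH or java.library.path. The class name NativeLibraryCheck and its helper methods are made up for this illustration and are not part of mecab-java.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of mecab-java.
final class NativeLibraryCheck {

    /** Splits a path list such as LD_LIBRARY_PATH into directories; null yields an empty list. */
    static List<Path> splitPathList(String pathList, String separator) {
        List<Path> dirs = new ArrayList<>();
        if (pathList == null) {
            return dirs;
        }
        for (String part : pathList.split(Pattern.quote(separator))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                dirs.add(Paths.get(trimmed));
            }
        }
        return dirs;
    }

    /** Returns the first file in dirs that matches the platform file name for libName. */
    static Optional<Path> findLibrary(String libName, List<Path> dirs) {
        // On Linux, mapLibraryName("MeCab") is "libMeCab.so".
        String fileName = System.mapLibraryName(libName);
        for (Path dir : dirs) {
            Path candidate = dir.resolve(fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Path> dirs = splitPathList(System.getenv("LD_LIBRARY_PATH"), File.pathSeparator);
        dirs.addAll(splitPathList(System.getProperty("java.library.path"), File.pathSeparator));
        Optional<Path> found = findLibrary("MeCab", dirs);
        System.out.println(found.map(p -> "Found " + p).orElse("MeCab library not found in " + dirs));
    }
}
```

Running it with the same environment as the test program shows which directories are searched and whether the library is in one of them.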

References
1. [MeCabのインストール(研究室編)] https://hase1031.hatenadiary.org/entry/20111005/1317808636
2. [Mac에서 MeCab Ko의 Java API 사용하기] https://velog.io/@nocode/Mac%EC%97%90%EC%84%9C-MeCab-Ko%EC%9D%98-Java-API-%EC%82%AC%EC%9A%A9%ED%95%98%EA%B8%B0
3. [MECAB 한글 형태소 분석기 플러그인 설치하기] http://guruble.com/mecab-%ED%95%9C%EA%B8%80-%ED%98%95%ED%83%9C%EC%86%8C-%EB%B6%84%EC%84%9D%EA%B8%B0-%ED%94%8C%EB%9F%AC%EA%B7%B8%EC%9D%B8-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0/
4. [Elasticsearch 5.5.0 은전한닢 형태소 분석기 연동 설치하기] https://kogun82.tistory.com/173
5. [GitHub] https://github.com/SOMJANG/Mecab-ko-for-Google-Colab/blob/master/install_mecab-ko_on_colab_light_220429.sh

3. Dockerfile

First prepare the following files and place them in the Docker build directory:

shell

[root@localhost docker]# ll
total 4
drwxr-xr-x. 9 root root  141 Oct 14 15:05 build
-rw-r--r--. 1 root root 1891 Oct 14 12:53 Dockerfile
drwxr-xr-x. 2 root root  183 Oct 14 10:42 public
[root@localhost docker]#

# build is the output directory of the Java build
# public holds the resource files

# Contents of public:
total 244684
-rw-r--r--. 1 root root       697 Oct 14 10:42 Makefile
-rw-r--r--. 1 root root   1414979 Oct 14 10:42 mecab-0.996-ko-0.9.2.tar.gz
-rw-r--r--. 1 root root     28653 Oct 14 10:42 mecab-java-0.996.tar.gz
-rw-r--r--. 1 root root  49775061 Oct 14 10:42 mecab-ko-dic-2.1.1-20180720.tar.gz
-rw-r--r--. 1 root root 199324356 Oct 14 10:42 zulu11.58.23-ca-jdk11.0.16.1-linux_x64.tar.gz
dockerfile
FROM centos:7
USER root
#==============
# Build mecab
#==============
# Parent directory of the JDK install
ENV JAVA_PATH /opt
# Temporary directory for the source archives
ENV TMP_PATH /opt/tmp
ENV LD_LIBRARY_PATH /usr/local/lib
ENV JDK_PACKAGE zulu11.58.23-ca-jdk11.0.16.1-linux_x64
ENV MECAB_KO_PACKAGE mecab-0.996-ko-0.9.2
ENV MECAB_KO_DIC mecab-ko-dic-2.1.1-20180720
ENV MECAB_JAVA mecab-java-0.996

# Install build dependencies
RUN yum install -y gcc gcc-c++ wget automake autoconf autogen make
COPY /public $TMP_PATH
WORKDIR $TMP_PATH

# Install mecab-ko
RUN tar xvfz $MECAB_KO_PACKAGE.tar.gz
WORKDIR $MECAB_KO_PACKAGE
RUN echo "mecab-ko.........."
RUN ./configure --with-charset=utf8 \
    && make -j4 \
    && make install \
    && ldconfig \
    && mecab --version

# Install mecab-ko-dic
WORKDIR $TMP_PATH
RUN tar xvfz $MECAB_KO_DIC.tar.gz
WORKDIR $MECAB_KO_DIC
RUN ./autogen.sh \
    && ./configure --with-charset=utf8 \
    && make -j4 \
    && make install \
    && ldconfig
RUN echo "Preparing the Java build..."

# Install the JDK
WORKDIR /opt
RUN mv $TMP_PATH/$JDK_PACKAGE.tar.gz /opt \
    && tar zxvf $JDK_PACKAGE.tar.gz \
    && mv $JDK_PACKAGE jdk11

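# Build the Java binding
WORKDIR $TMP_PATH
RUN ls -l
RUN tar zxvf $MECAB_JAVA.tar.gz
WORKDIR $MECAB_JAVA
# Overwrite the Makefile with the edited copy from public/
# (its INCLUDE, JAVAC and JAR settings should point at /opt/jdk11)
# COPY /public/Makefile $TMP_PATH/mecab-java-0.996
RUN mv -f $TMP_PATH/Makefile $TMP_PATH/$MECAB_JAVA \
    && make \
    && cp libMeCab.so /usr/local/lib \
    && cp MeCab.jar /usr/local/lib

# Remove the temporary directory and the JDK archive (which was moved to /opt)
RUN rm -rf $TMP_PATH /opt/$JDK_PACKAGE.tar.gz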

WORKDIR /
COPY build/libs/*.jar app.jar
RUN ls -l
# USER java
EXPOSE 8000
# CMD /opt/jdk11/bin/java -Dfile.encoding=utf-8 -Djava.security.egd=file:/dev/./urandom -jar /app.jar --spring.profiles.active=${APPS_ENV}
CMD /opt/jdk11/bin/java -Dfile.encoding=utf-8 -Djava.security.egd=file:/dev/./urandom -jar /app.jar

4. Common Docker commands

shell
systemctl restart docker

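# Docker test
# Directory the build resources were uploaded to
cd /opt/docker

# Build the image
docker build -t mecab-ko:v2.4 .

# Run a container from the image
docker run -d --name mecab-ko -p 8000:8000 mecab-ko:v2.4

# Variant: image v2.5, mapping host port 8000 to container port 18000
# (remove the previous container first, because the name is reused)
docker run -d --name mecab-ko -p 8000:18000 mecab-ko:v2.5

# Open a shell in the container (replace the ID with your own)
docker exec -it d4d09122009d /bin/bash

# Stop all containers
docker stop $(docker ps -a -q)

# Remove all containers
docker rm $(docker ps -a -q)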


docker run -d --name go-img4  -p 9902:9902 go-img:v1.3

docker run -d --name go-img  -p 9902:9902 go-img:v1.0

ab -n 100 -c 20 -T "application/json" -H "Content-Type: application/json"  -p /opt/ab-test/post.txt http://10.15.1.32:3000/img/hash

go tool pprof --seconds=180 -inuse_space https://cc-go-img-dev.cafe24.com/debug/pprof/heap

go tool pprof --seconds=180 -inuse_space http://10.15.1.32:3000/debug/pprof/heap

5. Core Spring Boot code

Copy the entire org.chasen.mecab package, with all of its classes, from the mecab-java-0.996 project into your Spring Boot project.

java
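import org.chasen.mecab.MeCab;
import org.chasen.mecab.Node;
import org.chasen.mecab.Tagger;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * @author: curyu
 * @date: 2022/10/13 10:51
 * @description: REST endpoint that tokenizes Korean text with MeCab.
 */
@RestController
@RequestMapping("mecab")
public class MecabController {

    static {
        try {
            // Loads libMeCab.so from the JVM's native library search path.
            System.loadLibrary("MeCab");
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Cannot load the MeCab native library.\nMake sure LD_LIBRARY_PATH contains the directory of libMeCab.so\n" + e);
            System.exit(1);
        }
    }

    @GetMapping("/api")
    public List<Map<String, Object>> mecabKeywords(String text) {
        List<Map<String, Object>> resultList = new ArrayList<>();
        // Tokenize each input line separately; accept both CRLF and LF line endings.
        for (String line : text.split("\\r?\\n")) {
            resultList.add(doMecab(line));
        }
        return resultList;
    }

    private Map<String, Object> doMecab(String text) {
        Map<String, Object> result = new HashMap<>();
        List<String> surfaces = new ArrayList<>();
        System.out.println(MeCab.VERSION);
        Tagger tagger = new Tagger();
        // Full analysis as text (surface plus features); not returned by this endpoint.
        String parsed = tagger.parse(text);
        result.put("sourceText", text);
        // result.put("targetParse", parsed);
        for (Node node = tagger.parseToNode(text); node != null; node = node.getNext()) {
            String surface = node.getSurface();
            // Skip nodes with an empty surface, such as the sentence boundary nodes.
            if (surface != null && !surface.isEmpty()) {
                surfaces.add(surface);
            }
        }
        // Always present, so the test page can call join() on it.
        result.put("targetNode", surfaces);
        return result;
    }

}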
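
The controller above returns only the surface forms. The text that tagger.parse produces also carries the part-of-speech tag: in MeCab's default output format each token is one line of the form surface, a tab, then comma-separated features, and the text ends with an EOS line. Below is a minimal sketch of reading that text without the native library. The class MecabOutputParser is made up for this illustration, and the sample string in main is written by hand in that format rather than captured from a real run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of mecab-java.
final class MecabOutputParser {

    /** One token: the surface form and the first feature field (the POS tag). */
    static final class Token {
        final String surface;
        final String tag;

        Token(String surface, String tag) {
            this.surface = surface;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return surface + "/" + tag;
        }
    }

    /** Parses "surface<TAB>feature,feature,..." lines and ignores EOS and blank lines. */
    static List<Token> parse(String mecabOutput) {
        List<Token> tokens = new ArrayList<>();
        for (String line : mecabOutput.split("\\r?\\n")) {
            if (line.isEmpty() || line.equals("EOS")) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // not a token line
            }
            String surface = line.substring(0, tab);
            String features = line.substring(tab + 1);
            int comma = features.indexOf(',');
            String tag = comma < 0 ? features : features.substring(0, comma);
            tokens.add(new Token(surface, tag));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hand-written sample in MeCab's default output format.
        String sample = "아버지\tNNG,*,F,아버지,*,*,*,*\n"
                + "가\tJKS,*,F,가,*,*,*,*\n"
                + "EOS\n";
        System.out.println(parse(sample));
    }
}
```

In the controller, the same parser could be fed the string returned by tagger.parse(text) if the tags are wanted in the JSON response.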

Test page (HTML)

html
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Tokenization test</title>
    <script src="jquery.mini.js"></script>
</head>
<body>
<form action="/mecab/api" target="hidden_frame">
    keywords: <textarea style="width:1000px;height:500px" name="text"></textarea>
    <br>
    <input type="submit">
</form>
<iframe id="hidden_frame" name="hidden_frame" style="display:none;"></iframe>
<div class="innerDiv">
<h1>Tokens are separated by | for readability</h1>
</div>
</body>
</html>

<script type="text/javascript" charset="UTF-8">
    $('#hidden_frame').on('load', function () {
        var resultText = $("#hidden_frame")[0].contentDocument.body.getElementsByTagName("pre")[0].innerHTML;
        var result = $.parseJSON(resultText);
        console.log("result ==", result)
        var innerHtmlText = "";
        result.forEach((item, index, arr) => {
            var sourceText = item.sourceText;
            var targetNode = item.targetNode;
            console.log(sourceText);
            console.log("----------------")
            console.log(targetNode)

            innerHtmlText += "    <h1>Source text</h1>\n" +
                "    <span>" + sourceText + "</span>\n" +
                "    <br>\n" +
                "    <h2>Tokens</h2>\n" +
                "    <span>" + targetNode.join(" <span style='color: red'>|</span>  ") + "</span>\n" +
                "    <br>\n" +
                "    <br>"
        })
        $(".innerDiv").html(innerHtmlText)
    });
</script>
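
The form submits the textarea as a GET parameter named text, and the browser URL-encodes it, line breaks included. A client other than a browser has to do that encoding itself. Here is a minimal sketch of building the request URL in Java 10 or later. The class name MecabApiUrl is made up, and the base URL http://localhost:8000 is an assumption taken from the port mapping used in the docker run example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for calling the /mecab/api endpoint from Java.
final class MecabApiUrl {

    /** Builds the GET URL, form-encoding the text (space becomes +, CR LF becomes %0D%0A). */
    static String build(String baseUrl, String text) {
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        return baseUrl + "/mecab/api?text=" + encoded;
    }

    public static void main(String[] args) {
        // Two input lines separated by CR LF, as a browser textarea would send them.
        String text = "첫 번째 문장\r\n두 번째 문장";
        System.out.println(build("http://localhost:8000", text));
    }
}
```

The resulting string can be passed to any HTTP client; the response is the JSON array the controller returns, one object per input line.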

6. Results

[Animated GIF: Mecab-ko tokenization demo]

7. Extension: adding entries to the dictionary

Reference links:

  1. Ubuntu에서 Mecab-ko-dic 설치 및 사용자사전 추가 (작성자: 고구마기사)

  2. [Mecab] 사용자 사전 추가하기 (작성자: IML)

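mecab-ko-dic part-of-speech tags:

Tag     Description
NNG     General noun
NNP     Proper noun
NNB     Bound (dependent) noun
NNBC    Unit (counting) bound noun
NR      Numeral
NP      Pronoun
VV      Verb
VA      Adjective
VX      Auxiliary predicate
VCP     Positive copula
VCN     Negative copula
MM      Determiner (adnominal)
MAG     General adverb
MAJ     Conjunctive adverb
IC      Interjection
JKS     Subject case particle
JKC     Complement case particle
JKG     Adnominal (genitive) case particle
JKO     Object case particle
JKB     Adverbial case particle
JKV     Vocative case particle
JKQ     Quotative case particle
JX      Auxiliary particle
JC      Conjunctive particle
EP      Pre-final ending
EF      Sentence-final ending
EC      Connective ending
ETN     Nominalizing ending
ETM     Adnominalizing ending
XPN     Noun prefix
XSN     Noun-derivational suffix
XSV     Verb-derivational suffix
XSA     Adjective-derivational suffix
XR      Root
SF      Period, question mark, exclamation mark
SE      Ellipsis …
SSO     Opening bracket (, [
SSC     Closing bracket ), ]
SC      Separator , · / :
SY      Other symbols
SL      Foreign word
SH      Chinese characters (Hanja)
SN      Number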
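
The first letter of a tag identifies its family, which is handy when only some token types are wanted (for example, nouns for keyword extraction). The following is a rough sketch of such a grouping. The class PosCategory and its category names are made up for this illustration. It also assumes that tokens may carry compound tags joined with + (such as EP+EF), which it classifies by their first component.

```java
// Hypothetical helper that groups mecab-ko-dic tags by family.
final class PosCategory {

    /** Maps a tag such as "NNG" or "EP+EF" to a coarse category name. */
    static String categoryOf(String tag) {
        if (tag == null || tag.isEmpty()) {
            return "unknown";
        }
        // Assumption: compound tags are joined with '+'; use the first component.
        int plus = tag.indexOf('+');
        String first = plus < 0 ? tag : tag.substring(0, plus);
        if (first.isEmpty()) {
            return "unknown";
        }
        if (first.equals("IC")) {
            return "interjection";
        }
        switch (first.charAt(0)) {
            case 'N': return "nominal";       // NNG, NNP, NNB, NNBC, NR, NP
            case 'V': return "predicate";     // VV, VA, VX, VCP, VCN
            case 'M': return "modifier";      // MM, MAG, MAJ
            case 'J': return "particle";      // JKS ... JX, JC
            case 'E': return "ending";        // EP, EF, EC, ETN, ETM
            case 'X': return "affix-or-root"; // XPN, XSN, XSV, XSA, XR
            case 'S': return "symbol";        // punctuation, foreign words, Hanja, numbers
            default:  return "unknown";
        }
    }

    /** True for the two tags usually kept as keywords: general and proper nouns. */
    static boolean isKeywordNoun(String tag) {
        return "NNG".equals(tag) || "NNP".equals(tag);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NNG", "JKS", "EP+EF", "IC", "SL"}) {
            System.out.println(tag + " -> " + categoryOf(tag));
        }
    }
}
```

Filtering on isKeywordNoun keeps general and proper nouns and drops particles, endings, and punctuation.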
shell
# Fixing garbled characters in vi on Linux
# Create ~/.vimrc if it does not exist, then edit it
vi ~/.vimrc
# Add the following settings

set fileencodings=utf-8
set termencoding=utf-8
set encoding=utf-8
set fileencoding=utf-8
# Save and quit
:wq
shell
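# Go to the dictionary source directory
cd /opt/tmp/mecab-ko-dic-2.1.1-20180720
# Go to the user-dic directory
cd user-dic
# Edit nnp.csv and add the test word below
# (7th field: T if the last syllable has a final consonant, otherwise F; 우 has none, so F)
vi nnp.csv
류천우,,,,NNP,*,F,류천우,*,*,*,*,*
:wq

# Rebuild so the change takes effect
cd ../tools/
ls
./add-userdic.sh
cd ..
make clean
make install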
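
When many words have to be added, the final-consonant flag in each row is easy to get wrong by hand. It can be derived from the Unicode code point of the last Hangul syllable: precomposed syllables start at U+AC00 and each block of 28 code points cycles through the possible final consonants, with offset 0 meaning none. Below is a minimal sketch that generates rows in the same shape as the one above. The class UserDicEntry is made up for this illustration, and words that do not end in a Hangul syllable are simply flagged F here, which may not match their pronunciation.

```java
// Hypothetical helper that generates mecab-ko-dic user dictionary rows.
final class UserDicEntry {

    /** True when the last character is a Hangul syllable that ends in a final consonant. */
    static boolean hasFinalConsonant(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        char last = word.charAt(word.length() - 1);
        if (last < 0xAC00 || last > 0xD7A3) {
            return false; // not a precomposed Hangul syllable
        }
        // Offset 0 within each group of 28 means "no final consonant".
        return (last - 0xAC00) % 28 != 0;
    }

    /** Builds a CSV row: surface, three empty fields, tag, *, T/F flag, reading, then five * fields. */
    static String toCsvRow(String word, String tag) {
        String flag = hasFinalConsonant(word) ? "T" : "F";
        return word + ",,,," + tag + ",*," + flag + "," + word + ",*,*,*,*,*";
    }

    public static void main(String[] args) {
        // 구글 ends in a final consonant (T); 대우 does not (F).
        System.out.println(toCsvRow("구글", "NNP"));
        System.out.println(toCsvRow("대우", "NNP"));
    }
}
```

The printed rows can be appended to user-dic/nnp.csv before running add-userdic.sh.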