Elastic Search

[Elastic Search] Applying the Nori Tokenizer & Filter

Tempo 2022. 2. 9. 21:40

While studying Elastic Search queries in the previous post, I wanted to query the data in more detail. So I looked for a way to apply a Korean morphological analyzer to the stored text and make search more fine-grained.

Elastic Search Korean morphological analyzer

From version 7.0 onward, Elastic Search can use a Korean morphological analyzer called Nori (officially provided since version 6.6). To install Nori, follow the link below.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html

Current situation

Previously, I had configured the index and analyzer as follows.

{
      "settings": {
          "index": {
              "number_of_shards": 3,
              "number_of_replicas": 1
          },
          "analysis": {
              "tokenizer": {
                  "korean_nori_tokenizer": {
                      "type": "nori_tokenizer",
                      "decompound_mode": "mixed"
                  }
              },
              "analyzer": {
                  "nori_analyzer": {
                      "type": "custom",
                      "tokenizer": "korean_nori_tokenizer"
                  }
              },
              "filter": {
                  "nori_posfilter": {"type": "nori_part_of_speech",
                                     "stoptags": ["E", "IC", "J", "MAG", "MM", "NA", "NR", "SC", "SE", "SF", "SH",
                                                  "SL", "SN", "SP", "SSC", "SSO", "SY", "UNA", "UNKNOWN", "VA", "VCN",
                                                  "VCP", "VSV", "VV", "VX", "XPN", "XR", "XSA", "XSN", "XSV"]}
              }
          }
      },
      "mappings": {
          "properties": {
              "prd_id": {
                  "type": "text"
              },
              "review_id": {
                  "type": "text"
              },
              "review": {
                  "type": "text",
                  "fields": {
                      "full": {
                          "type": "keyword"
                      },
                      "nori_mixed": {
                          "type": "text",
                          "analyzer": "nori_analyzer",
                          "search_analyzer": "standard"
                      }
                  }
              },
              "genders": {
                  "type": "rank_features"
              }
          }
      }
  }

I created this index from Python and added the data.
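As a rough sketch of that step, the index could be created and a document added over the REST API with only the Python standard library. The base URL `http://localhost:9200`, the helper names, and the index name `review` in the usage line are assumptions for illustration, not the code actually used for this post.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local cluster without authentication


def build_request(method, path, body):
    """Build (but do not send) a JSON request for the Elasticsearch REST API."""
    return urllib.request.Request(
        url=f"{ES_URL}/{path}",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def create_index_request(index, index_body):
    # PUT /<index> creates the index with the given settings and mappings.
    return build_request("PUT", index, index_body)


def index_document_request(index, doc_id, document):
    # PUT /<index>/_doc/<id> stores one document under an explicit id.
    return build_request("PUT", f"{index}/_doc/{doc_id}", document)


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a cluster running, `send(create_index_request("review", index_body))` would create the index, where `index_body` is the settings and mappings object shown above.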

However, querying the term vectors of nori_mixed (the list of terms extracted from a document for that field) returns nothing.
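For reference, a term vectors lookup for one document is a `GET /<index>/_termvectors/<id>` request whose body names the fields to inspect. The sketch below only builds that request. The index name `review` and the id `DOC_ID` are placeholders, since the post does not show the actual call.

```python
import json


def termvectors_request(index, doc_id, fields):
    """Return (method, path, body) for the term vectors API."""
    path = f"/{index}/_termvectors/{doc_id}"
    body = {
        "fields": fields,          # sub-fields to inspect, e.g. review.nori_mixed
        "term_statistics": False,  # skip index-wide statistics
        "positions": True,         # include token positions
        "offsets": False,          # skip character offsets
    }
    return "GET", path, body


# Placeholder index name and document id.
method, path, body = termvectors_request("review", "DOC_ID", ["review.nori_mixed"])
print(method, path)
print(json.dumps(body, indent=2))
```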

Let's find out what is wrong with the index configuration above.

Looking at the analysis section first, a 'tokenizer' and an 'analyzer' are declared, and a filter is declared as well. But nothing in the analyzer or anywhere else in the index configuration actually uses the filter (nori_posfilter).

"analysis": {
        "tokenizer": {
            "korean_nori_tokenizer": {
                "type": "nori_tokenizer",
                "decompound_mode": "mixed"
            }
        },
        "analyzer": {
            "nori_analyzer": {
                "type": "custom",
                "tokenizer": "korean_nori_tokenizer"
            }
        },
        "filter": {
            "nori_posfilter": {"type": "nori_part_of_speech",
                               "stoptags": ["E", "IC", "J", "MAG", "MM", "NA", "NR", "SC", "SE", "SF", "SH",
                                            "SL", "SN", "SP", "SSC", "SSO", "SY", "UNA", "UNKNOWN", "VA", "VCN",
                                            "VCP", "VSV", "VV", "VX", "XPN", "XR", "XSA", "XSN", "XSV"]}
        }
    }
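This kind of oversight can be caught mechanically before the index is created. The helper below is a small sketch of my own (`unused_filters` is a hypothetical name, not an Elasticsearch API): it lists the filters declared under `analysis.filter` that no analyzer references. The stoptags list is shortened for brevity.

```python
def unused_filters(analysis):
    """Return names of declared token filters that no analyzer references."""
    declared = set(analysis.get("filter", {}))
    used = set()
    for analyzer in analysis.get("analyzer", {}).values():
        filters = analyzer.get("filter", [])
        if isinstance(filters, str):  # "filter" may be a single name or a list
            filters = [filters]
        used.update(filters)
    return sorted(declared - used)


analysis = {
    "tokenizer": {
        "korean_nori_tokenizer": {"type": "nori_tokenizer", "decompound_mode": "mixed"}
    },
    "analyzer": {
        "nori_analyzer": {"type": "custom", "tokenizer": "korean_nori_tokenizer"}
    },
    "filter": {
        # stoptags shortened here; the full list is in the configuration above
        "nori_posfilter": {"type": "nori_part_of_speech", "stoptags": ["E", "J"]}
    },
}
print(unused_filters(analysis))  # ['nori_posfilter']
```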

So I decided to apply things step by step, starting with the analyzer's tokenizer.

Nori-Tokenizer official documentation

https://www.elastic.co/guide/en/elasticsearch/plugins/7.17/analysis-nori-tokenizer.html

Applying Nori-Mixed

The Nori tokenizer has a setting called decompound_mode. It determines how the tokenizer splits compound words and which of the resulting tokens it stores.

 

Reference: Elastic Search official documentation

decompound_mode selects how the nori morphological analyzer handles compounds. To extract as many tokens as possible from the analysis, I created an index with the mixed mode.
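According to the Nori tokenizer documentation, decompound_mode accepts `none`, `discard` (the default), and `mixed`. Its example is 가곡역: `none` keeps 가곡역 as one token, `discard` yields 가곡 and 역, and `mixed` yields 가곡역, 가곡, and 역. One way to compare the modes on your own text is the `_analyze` API with an inline tokenizer definition. The sketch below only builds the request bodies, and the helper name is my own.

```python
import json

DECOMPOUND_MODES = ("none", "discard", "mixed")


def analyze_body(text, mode):
    """Body for POST /_analyze using an inline nori_tokenizer definition."""
    if mode not in DECOMPOUND_MODES:
        raise ValueError(f"unknown decompound_mode: {mode}")
    return {
        "tokenizer": {"type": "nori_tokenizer", "decompound_mode": mode},
        "text": text,
    }


# One request body per mode; send each to POST /_analyze to compare the tokens.
for mode in DECOMPOUND_MODES:
    print(json.dumps(analyze_body("가곡역", mode), ensure_ascii=False))
```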

PUT review_token
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nori_mixed": {
          "tokenizer": "nori_t_mixed",
          "filter": "shingle"
        }
      },
      "tokenizer": {
        "nori_t_mixed": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed"
        }
      }
    }
  },
  "mappings": {
    "properties": {
        "prd_id": {
            "type": "text"
        },
        "review_id": {
            "type": "text"
        },
        "review": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword"
                },
                "nori_mixed": {
                    "type": "text",
                    "analyzer": "nori_mixed",
                    "search_analyzer": "standard"
                }
            }
        },
        "genders": {
            "type": "rank_features"
        }
    }
  }
}

Querying the term vectors of a document created in this index shows that the nori morphological analyzer is now being applied.

Applying the Nori filter in more detail

์œ„์˜ nori ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ์€ ์„ฑ๊ณต์ ์œผ๋กœ ์ง„ํ–‰ํ–ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด, ํ˜•ํƒœ์†Œ์˜ pos ๋”ฐ๋ผ ๋ณ„๋„์˜ ์นผ๋Ÿผ์œผ๋กœ ์ €์žฅํ•˜๊ณ  ๋˜ํ•œ nori filter ๋ฅผ ์ ์šฉํ•œ ์นผ๋Ÿผ์— pos๋ฅผ ๊ฐ™์ด ํ™•์ธํ•  ์ˆ˜ ์žˆ์„๊นŒ? ๊ทธ๋ฆฌ๊ณ  pos๋กœ ๊ฒ€์ƒ‰๋„ ๊ฐ€๋Šฅํ• ๊นŒ??

🔥 To-Do

  • [ ] Show each token's POS alongside the text filtered by the Nori filter -> not possible
  • [x] Extract only specific POS (nouns or adjectives) into a separate field
  • [ ] Search by POS alone within the index (create a POS field and search it) -> not possible
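Although the POS cannot be stored next to the indexed text, it can at least be inspected at analysis time. The `_analyze` API accepts `explain: true`, and for the nori tokenizer the per-token detail includes attributes such as `leftPOS` and `rightPOS`. The sketch below builds such a request and pulls (token, POS) pairs out of a response. The `sample` dictionary is a hand-written fragment in the documented shape, not output from a real cluster, and the helper names are my own.

```python
def explain_body(text):
    """Body for POST /_analyze that asks for per-token POS attributes."""
    return {
        "tokenizer": "nori_tokenizer",
        "text": text,
        "attributes": ["posType", "leftPOS", "rightPOS"],
        "explain": True,
    }


def token_pos_pairs(response):
    """Extract (token, leftPOS) pairs from an explain response."""
    tokens = response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t.get("leftPOS")) for t in tokens]


# Hand-written response fragment for illustration only, not real cluster output.
sample = {
    "detail": {
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {"token": "상품", "leftPOS": "NNG(General Noun)"},
                {"token": "을", "leftPOS": "J(Ending Particle)"},
            ],
        }
    }
}
print(token_pos_pairs(sample))
```

This only helps with debugging an analyzer. It does not make POS searchable inside the index, so the To-Do items marked not possible still stand.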

Principles of Mecab - https://gritmind.blog/2020/07/22/nori_deep_dive/
Mecab POS tag table - https://joonable.tistory.com/33

Applying a custom analyzer

Create a custom analyzer | Elasticsearch Guide [7.16] | Elastic

 

In the index configuration below, nori_pos_noun is an analyzer that extracts only nouns (common nouns, proper nouns, and so on) so that they can be stored in a separate field.

PUT review_pos
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nori_mixed": {
          "tokenizer": "nori_t_mixed",
          "filter": "shingle"
        },
        "nori_pos_noun": {
          "type": "custom",
          "tokenizer": "nori_t_mixed",
          "filter": "pos_filter"
        }
      },
      "tokenizer": {
        "nori_t_mixed": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed"
        }
      },
      "filter": {
        "pos_filter": {
          "type": "nori_part_of_speech",
          // Keeps only noun-tagged tokens: the stoptags below remove the non-noun morphemes
          "stoptags": [
            "VV", "VA", "VX", "VCP", "VCN", "MM", "MAG", "MAJ", 
            "IC", "J", "E", 
            "XPN", "XSA", "XSN", "XSV", 
            "SP", "SSC", "SSO", "SC", "SE",
            "UNA"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
        "prd_id": {
            "type": "text"
        },
        "review_id": {
            "type": "text"
        },
        "review": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword"
                },
                "nori_mixed": {
                    "type": "text",
                    "analyzer": "nori_mixed",
                    "search_analyzer": "standard"
                },
                "nori_noun": {
                  "type": "text",
                  "analyzer": "nori_pos_noun",
                  "search_analyzer": "standard"
                }
            }
        },
        "genders": {
            "type": "rank_features"
        }
    }
  }
}
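To see what this filter actually lets through, the stoptags can be subtracted from the full tag set. The sketch below assumes the tag list of Lucene's Korean `POS.Tag` enum as I understand it, so verify `ALL_TAGS` against your version before relying on it.

```python
# Assumption: the tag set of Lucene's Korean POS.Tag enum; verify for your version.
ALL_TAGS = {
    "E", "IC", "J", "MAG", "MAJ", "MM",
    "NA", "NNB", "NNBC", "NNG", "NNP", "NP", "NR",
    "SC", "SE", "SF", "SH", "SL", "SN", "SP", "SSC", "SSO", "SY",
    "UNA", "UNKNOWN",
    "VA", "VCN", "VCP", "VSV", "VV", "VX",
    "XPN", "XR", "XSA", "XSN", "XSV",
}

# stoptags copied from the pos_filter definition above
STOPTAGS = {
    "VV", "VA", "VX", "VCP", "VCN", "MM", "MAG", "MAJ",
    "IC", "J", "E",
    "XPN", "XSA", "XSN", "XSV",
    "SP", "SSC", "SSO", "SC", "SE",
    "UNA",
}

# Tags that survive the filter and therefore end up in review.nori_noun.
kept = sorted(ALL_TAGS - STOPTAGS)
print(kept)
```

Under that assumption the surviving set contains the noun tags (NNG, NNP, NNB, NNBC, NP, NR), but also tags such as SL (foreign language) and SN (number), which are not in the stoptags list. Add them to stoptags if the field should hold nouns only.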

Searching the index configured this way:

GET review_pos/_search
{
  "query": {
    "match": {
      "review.nori_noun": "예쁘"
    }
  }
}

// Result
"hits" : [
      {
        "_index" : "review_pos",
        "_type" : "_doc",
        "_id" : "4746d199f2869751e5350a3a17fbc055659ceed8",
        "_score" : 1.645329,
        "_source" : {
          "gender" : {
            "female" : 0.5363630652427673,
            "male" : 0.4636369347572326
          },
          "prd_id" : 2071204,
          "review" : "핏도 생각했던 것보다 더 예쁘고 기모여서 따뜻합니다",
          "review_id" : 22857733
        }
      },
      {
        "_index" : "review_pos",
        "_type" : "_doc",
        "_id" : "244abf90dca42137b33296ae19448ea4f0a5d238",
        "_score" : 1.2024188,
        "_source" : {
          "gender" : {
            "female" : 0.8708744049072266,
            "male" : 0.1291255950927734
          },
          "prd_id" : 2071204,
          "review" : "길이가 조금 더 길줄 알았는데 상품 받고 입어보니 생각보다 짧아서 당황스러웠어요 그래도 예쁘게 잘 입고 있어요",
          "review_id" : 22642259
        }
      }
    ]

With this setup, I was able to search separately on only the nouns produced by the nori morphological analyzer.

 

๋ฐ˜์‘ํ˜•