While studying Elasticsearch queries in the previous post, I wanted to query the data in more detail. So I looked for a way to make search more precise by applying a Korean morphological analyzer to the stored text.
Elasticsearch Korean morphological analyzer
From version 7.0 onward, Elasticsearch can use a Korean morphological analyzer called Nori. (Officially, it has been available since version 6.4.) I installed Nori by following the link below.
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html
Current situation
Previously, I had configured the index and analyzer as follows.
{
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"analysis": {
"tokenizer": {
"korean_nori_tokenizer": {
"type": "nori_tokenizer",
"decompound_mode": "mixed"
}
},
"analyzer": {
"nori_analyzer": {
"type": "custom",
"tokenizer": "korean_nori_tokenizer"
}
},
"filter": {
"nori_posfilter": {"type": "nori_part_of_speech",
"stoptags": ["E", "IC", "J", "MAG", "MM", "NA", "NR", "SC", "SE", "SF", "SH",
"SL", "SN", "SP", "SSC", "SSO", "SY", "UNA", "UNKNOWN", "VA", "VCN",
"VCP", "VSV", "VV", "VX", "XPN", "XR", "XSA", "XSN", "XSV"]}
}
}
},
"mappings": {
"properties": {
"prd_id": {
"type": "text"
},
"review_id": {
"type": "text"
},
"review": {
"type": "text",
"fields": {
"full": {
"type": "keyword"
},
"nori_mixed": {
"type": "text",
"analyzer": "nori_analyzer",
"search_analyzer": "standard"
}
}
},
"genders": {
"type": "rank_features"
}
}
}
}
I created this index from Python and added data to it.
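The loading step can be sketched roughly as follows. This is a minimal sketch, not the code actually used: the index name `reviews`, the sample document, and the SHA-1 based `doc_id` scheme are all hypothetical. It only builds the newline-delimited body that the `_bulk` API expects and sends nothing.

```python
import hashlib
import json


def doc_id(review_id):
    """Hypothetical ID scheme: SHA-1 hex digest of the review_id."""
    return hashlib.sha1(str(review_id).encode("utf-8")).hexdigest()


def build_bulk_body(index, docs):
    """Build an NDJSON payload for POST /_bulk: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": doc_id(doc["review_id"])}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample document shaped like the mapping above.
sample_docs = [
    {"prd_id": 1, "review_id": 10, "review": "예시 리뷰",
     "genders": {"female": 0.5, "male": 0.5}},
]
bulk_body = build_bulk_body("reviews", sample_docs)
print(bulk_body)
```

The resulting string would be sent with the `Content-Type: application/x-ndjson` header, whether through an HTTP client or a client library's bulk helper.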
However, requesting the term vectors for nori_mixed (the terms that the field was broken into for a document) returned nothing.

Let's look for what is wrong in the index configuration above.
First, in the analysis settings, a "tokenizer" and an "analyzer" are declared, and a filter is declared as well. However, nothing in the analyzer or anywhere else in the index configuration actually uses the filter (nori_posfilter).
"analysis": {
"tokenizer": {
"korean_nori_tokenizer": {
"type": "nori_tokenizer",
"decompound_mode": "mixed"
}
},
"analyzer": {
"nori_analyzer": {
"type": "custom",
"tokenizer": "korean_nori_tokenizer"
}
},
"filter": {
"nori_posfilter": {"type": "nori_part_of_speech",
"stoptags": ["E", "IC", "J", "MAG", "MM", "NA", "NR", "SC", "SE", "SF", "SH",
"SL", "SN", "SP", "SSC", "SSO", "SY", "UNA", "UNKNOWN", "VA", "VCN",
"VCP", "VSV", "VV", "VX", "XPN", "XR", "XSA", "XSN", "XSV"]}
}
}
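This kind of mistake can be caught mechanically before the index is created. The helper below is a small sketch: `unused_components` is a hypothetical name, not an Elasticsearch API. It walks an `analysis` settings dict and reports filters and tokenizers that are declared but never referenced by any analyzer. The stoptags list is shortened for readability.

```python
def _as_list(value):
    """Normalize a setting that may be missing, a single string, or a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def unused_components(analysis):
    """Return declared filters/tokenizers that no analyzer refers to."""
    used_filters, used_tokenizers = set(), set()
    for analyzer in analysis.get("analyzer", {}).values():
        used_filters.update(_as_list(analyzer.get("filter")))
        used_tokenizers.update(_as_list(analyzer.get("tokenizer")))
    return {
        "filter": sorted(set(analysis.get("filter", {})) - used_filters),
        "tokenizer": sorted(set(analysis.get("tokenizer", {})) - used_tokenizers),
    }


analysis = {
    "tokenizer": {
        "korean_nori_tokenizer": {"type": "nori_tokenizer", "decompound_mode": "mixed"}
    },
    "analyzer": {
        "nori_analyzer": {"type": "custom", "tokenizer": "korean_nori_tokenizer"}
    },
    "filter": {
        # stoptags shortened here; the full list is in the settings above
        "nori_posfilter": {"type": "nori_part_of_speech", "stoptags": ["E", "IC", "J"]}
    },
}

result = unused_components(analysis)
print(result)  # {'filter': ['nori_posfilter'], 'tokenizer': []}
```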
So I decided to apply things step by step, starting with the analyzer and the tokenizer.
Nori-Tokenizer official documentation
https://www.elastic.co/guide/en/elasticsearch/plugins/7.17/analysis-nori-tokenizer.html
Applying Nori-Mixed
The Nori tokenizer has a setting called decompound_mode. This option determines how the tokenizer splits compound words before they are stored.
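To compare the three modes without creating an index each time, the `_analyze` API can be called with an inline tokenizer definition. The sketch below only builds the request bodies; `analyze_body` is a hypothetical helper, and sending the body to `GET /_analyze` is left to whatever client is in use. The token lists in `DOCUMENTED_TOKENS` are the examples given in the official nori_tokenizer documentation for the word 가곡역, not output captured from this cluster.

```python
import json

DECOMPOUND_MODES = ("none", "discard", "mixed")

# Example outputs for "가곡역" as listed in the official nori_tokenizer docs.
DOCUMENTED_TOKENS = {
    "none": ["가곡역"],                   # compound kept as-is
    "discard": ["가곡", "역"],            # compound split, original dropped (default)
    "mixed": ["가곡역", "가곡", "역"],    # compound split, original kept
}


def analyze_body(text, mode):
    """Build an _analyze request body using an inline nori_tokenizer definition."""
    if mode not in DECOMPOUND_MODES:
        raise ValueError(f"unknown decompound_mode: {mode!r}")
    return {
        "tokenizer": {"type": "nori_tokenizer", "decompound_mode": mode},
        "text": text,
    }


for mode in DECOMPOUND_MODES:
    body = analyze_body("가곡역", mode)
    print(mode, json.dumps(body, ensure_ascii=False), "->", DOCUMENTED_TOKENS[mode])
```

Since mixed keeps both the original compound and its parts, it yields the most tokens per document, at the cost of a larger index.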

decompound_mode selects how the nori analyzer decomposes compound tokens. To extract as many tokens as possible from the analysis, I created an index with the mixed mode.
PUT review_token
{
"settings": {
"analysis": {
"analyzer": {
"nori_mixed": {
"tokenizer": "nori_t_mixed",
"filter": "shingle"
}
},
"tokenizer": {
"nori_t_mixed": {
"type": "nori_tokenizer",
"decompound_mode": "mixed"
}
}
}
},
"mappings": {
"properties": {
"prd_id": {
"type": "text"
},
"review_id": {
"type": "text"
},
"review": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"nori_mixed": {
"type": "text",
"analyzer": "nori_mixed",
"search_analyzer": "standard"
}
}
},
"genders": {
"type": "rank_features"
}
}
}
}
Querying the term vectors of a document indexed into this index gives the following result.

This confirms that the nori morphological analyzer has been applied.
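For reference, the term vector lookup can be sketched as below. This is an illustration under assumptions: `termvectors_request` and `terms_by_field` are hypothetical helpers, the document ID is a placeholder, and `sample_response` is hand-written in the shape of a `_termvectors` response rather than copied from the cluster.

```python
from urllib.parse import quote


def termvectors_request(index, doc_id, fields):
    """Return (path, body) for GET /<index>/_termvectors/<id>."""
    path = f"/{quote(index)}/_termvectors/{quote(doc_id)}"
    body = {"fields": list(fields), "positions": False, "offsets": False}
    return path, body


def terms_by_field(response):
    """Map each field in a _termvectors response to its sorted list of terms."""
    return {
        field: sorted(data.get("terms", {}))
        for field, data in response.get("term_vectors", {}).items()
    }


path, body = termvectors_request("review_token", "some-doc-id", ["review.nori_mixed"])
print(path, body)

# Hand-written sample, shaped like a real response (values are illustrative).
sample_response = {
    "_index": "review_token",
    "found": True,
    "term_vectors": {
        "review.nori_mixed": {
            "terms": {
                "가곡역": {"term_freq": 1},
                "가곡": {"term_freq": 1},
                "역": {"term_freq": 1},
            }
        }
    },
}
print(terms_by_field(sample_response))
```

An empty `term_vectors` object in the response is the symptom described earlier: the field produced no terms for that document.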
Applying the Nori filter in more detail
Applying the nori morphological analyzer went well. So, can morphemes be stored in separate columns according to their part of speech (POS), and can the POS be shown alongside a column that has the nori filter applied? And is it possible to search by POS?
To-Do
- [ ] Show each token's POS alongside the text filtered by the Nori filter -> not possible
- [x] Extract only specific POS (nouns or adjectives) into a separate column
- [ ] Search within the index by POS alone (create a POS column and search on it) -> not possible
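Although the POS cannot be stored or searched in the index this way, it can at least be inspected at analysis time: the `_analyze` API accepts `"explain": true` together with an `attributes` list, and for nori tokens the response then includes attributes such as `leftPOS` and `rightPOS`. The sketch below builds such a request and reads the tags out of a response. The helper names are hypothetical, and `sample_response` is hand-written in the shape shown in the official nori documentation, not real output.

```python
def explain_body(text):
    """Build an _analyze body that asks nori_tokenizer to expose POS attributes."""
    return {
        "tokenizer": "nori_tokenizer",
        "text": text,
        "attributes": ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
        "explain": True,
    }


def token_pos_pairs(response):
    """Extract (token, leftPOS) pairs from an explain-style _analyze response."""
    tokens = response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t.get("leftPOS")) for t in tokens]


# Hand-written sample in the documented response shape (illustrative values).
sample_response = {
    "detail": {
        "custom_analyzer": True,
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {"token": "뿌리", "leftPOS": "NNG(General Noun)", "rightPOS": "NNG(General Noun)"},
                {"token": "가", "leftPOS": "J(Ending Particle)", "rightPOS": "J(Ending Particle)"},
            ],
        },
    }
}

print(explain_body("뿌리가"))
print(token_pos_pairs(sample_response))
```

This only helps with debugging an analyzer. It does not change what gets indexed, so the two unchecked items above still stand.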
How Mecab works - https://gritmind.blog/2020/07/22/nori_deep_dive/
Mecab POS tag table - https://joonable.tistory.com/33
Applying a custom analyzer
Create a custom analyzer | Elasticsearch Guide [7.16] | Elastic
In the index configuration below, nori_pos_noun is an analyzer that extracts only nouns (including proper nouns) so that they can be stored in a separate field.
PUT review_pos
{
"settings": {
"analysis": {
"analyzer": {
"nori_mixed": {
"tokenizer": "nori_t_mixed",
"filter": "shingle"
},
"nori_pos_noun": {
"type": "custom",
"tokenizer": "nori_t_mixed",
"filter": "pos_filter"
}
},
"tokenizer": {
"nori_t_mixed": {
"type": "nori_tokenizer",
"decompound_mode": "mixed"
}
},
"filter": {
"pos_filter": {
"type": "nori_part_of_speech",
// Keep only tokens tagged as nouns: the stoptags below drop the non-noun morphemes.
"stoptags": [
"VV", "VA", "VX", "VCP", "VCN", "MM", "MAG", "MAJ",
"IC", "J", "E",
"XPN", "XSA", "XSN", "XSV",
"SP", "SSC", "SSO", "SC", "SE",
"UNA"
]
}
}
}
},
"mappings": {
"properties": {
"prd_id": {
"type": "text"
},
"review_id": {
"type": "text"
},
"review": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"nori_mixed": {
"type": "text",
"analyzer": "nori_mixed",
"search_analyzer": "standard"
},
"nori_noun": {
"type": "text",
"analyzer": "nori_pos_noun",
"search_analyzer": "standard"
}
}
},
"genders": {
"type": "rank_features"
}
}
}
}
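Since `stoptags` lists what to remove rather than what to keep, it is easy to leave a tag out by accident. One way to check is to derive the list as the complement of the tags to keep. The sketch below rests on two assumptions: `ALL_TAGS` is the nori POS tag set as I understand it, and `NOUN_TAGS` is my own choice of what counts as a noun. `ARTICLE_STOPTAGS` is copied from the filter above.

```python
# Assumption: the full set of POS tags nori can assign.
ALL_TAGS = {
    "E", "IC", "J", "MAG", "MAJ", "MM", "NA", "NNB", "NNBC", "NNG", "NNP",
    "NP", "NR", "SC", "SE", "SF", "SH", "SL", "SN", "SP", "SSC", "SSO", "SY",
    "UNA", "UNKNOWN", "VA", "VCN", "VCP", "VSV", "VV", "VX", "XPN", "XR",
    "XSA", "XSN", "XSV",
}

# Assumption: tags treated as "nouns" (general, proper, dependent, pronoun, numeral).
NOUN_TAGS = {"NNG", "NNP", "NNB", "NNBC", "NP", "NR"}

# stoptags used by pos_filter in the index definition above.
ARTICLE_STOPTAGS = [
    "VV", "VA", "VX", "VCP", "VCN", "MM", "MAG", "MAJ",
    "IC", "J", "E",
    "XPN", "XSA", "XSN", "XSV",
    "SP", "SSC", "SSO", "SC", "SE",
    "UNA",
]


def stoptags_keeping(keep):
    """Return the sorted stoptags list that removes everything except `keep`."""
    unknown = set(keep) - ALL_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return sorted(ALL_TAGS - set(keep))


# Non-noun tags that the filter above does not remove.
not_filtered = sorted(set(stoptags_keeping(NOUN_TAGS)) - set(ARTICLE_STOPTAGS))
print(not_filtered)
# ['NA', 'SF', 'SH', 'SL', 'SN', 'SY', 'UNKNOWN', 'VSV', 'XR']
```

Under these assumptions, tokens such as foreign words (SL) and numbers (SN) would still pass through the filter. That may be intended, but it is worth deciding explicitly.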
Searching the index configured this way:
GET review_pos/_search
{
"query": {
"match": {
"review.nori_noun": "예쁘"
}
}
}
// Result
"hits" : [
{
"_index" : "review_pos",
"_type" : "_doc",
"_id" : "4746d199f2869751e5350a3a17fbc055659ceed8",
"_score" : 1.645329,
"_source" : {
"gender" : {
"female" : 0.5363630652427673,
"male" : 0.4636369347572326
},
"prd_id" : 2071204,
"review" : "핏도 생각했던 것보다 더 예쁘고 기모여서 따뜻합니다",
"review_id" : 22857733
}
},
{
"_index" : "review_pos",
"_type" : "_doc",
"_id" : "244abf90dca42137b33296ae19448ea4f0a5d238",
"_score" : 1.2024188,
"_source" : {
"gender" : {
"female" : 0.8708744049072266,
"male" : 0.1291255950927734
},
"prd_id" : 2071204,
"review" : "길이가 조금 더 길줄 알았는데 상품 받고 입어보니 생각보다 짧아서 당황스러웠어요 그래도 예쁘게 잘 입고 있어요",
"review_id" : 22642259
}
}
]
I was able to search separately on only the nouns produced by the nori morphological analyzer.
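The same search can be prepared from Python and its hits reduced to the parts of interest. This is a minimal sketch: `match_query` and `reviews_from` are hypothetical helpers, the search term 길이 is just an example noun, and `sample_response` is a trimmed, hand-assembled response in the usual `hits.hits` shape.

```python
def match_query(field, text):
    """Build a match query body for a single field."""
    return {"query": {"match": {field: text}}}


def reviews_from(response):
    """Return (score, review text) pairs from a search response."""
    hits = response.get("hits", {}).get("hits", [])
    return [(hit["_score"], hit["_source"]["review"]) for hit in hits]


body = match_query("review.nori_noun", "길이")
print(body)

# Trimmed, hand-assembled sample response (illustrative).
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "review_pos",
                "_score": 1.2024188,
                "_source": {"review_id": 22642259, "review": "길이가 조금 더 길줄 알았는데"},
            }
        ]
    }
}
for score, review in reviews_from(sample_response):
    print(f"{score:.3f}  {review}")
```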