Elastic Search

[Elastic Search] MBTI Search Project - 2. Emoji Search and Aggregation (Part 3)

Tempo 2022. 4. 24. 21:19

Re-Index

Now that emoji search is working, let's find out how many emojis each post contains and which emoji appears most often.

Before that, as preparation, we configure an index that lets us extract keywords (called terms inside Elasticsearch) from the data stored in text fields, and look for a way to pull out data based on keyword frequency across all documents.

Index Configuration

PUT /mbti_term
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nori_mixed": {
          "tokenizer": "nori_t_mixed",
          "filter": "shingle"
        },
        "nori_pos_noun": {
          "type": "custom",
          "tokenizer": "nori_t_mixed",
          "filter": "pos_filter"
        }
      },
      "tokenizer": {
        "nori_t_mixed": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed"
        }
      },
      "filter": {
        "pos_filter": {
          "type": "nori_part_of_speech",
          "stoptags": [
            "VV", "VA", "VX", "VCP", "VCN", "MM", "MAG", "MAJ",
            "IC", "J", "E",
            "XPN", "XSA", "XSN", "XSV",
            "SP", "SSC", "SSO", "SC", "SE",
            "UNA"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "contents": {
        "type": "text",
        "fields": {
          "full": {
            "type": "keyword"
          },
          "nori_mixed": {
            "type": "text",
            "analyzer": "nori_mixed",
            "search_analyzer": "standard",
            "fielddata": true,
            "term_vector": "yes"
          },
          "nori_noun": {
            "type": "text",
            "analyzer": "nori_pos_noun",
            "search_analyzer": "standard",
            "fielddata": true,
            "term_vector": "yes"
          }
        },
        "fielddata": true,
        "term_vector": "yes"
      },
      "keyword": {
        "type": "keyword"
      },
      "platform": {
        "type": "keyword"
      },
      "published_at": {
        "type": "date"
      },
      "doc_url": {
        "type": "text"
      },
      "comment_cnt": {
        "type": "integer"
      },
      "like_cnt": {
        "type": "integer"
      }
    }
  }
}

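The key settings here are term_vector and fielddata on the target field. Elasticsearch disables fielddata on text fields by default, so "fielddata": true is what allows a terms aggregation to run on the analyzed contents field and its sub-fields, at the cost of loading the terms into heap memory. "term_vector": "yes" additionally stores the terms of each document for that field.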

"contents": {
      "type": "text",
      "fields": {
          "full": {
            "type": "keyword"
          },
          "nori_mixed": {
              "type": "text",
              "analyzer": "nori_mixed",
              "search_analyzer": "standard",
              "fielddata": true,
              "term_vector": "yes"
          },
          "nori_noun": {
            "type": "text",
            "analyzer": "nori_pos_noun",
            "search_analyzer": "standard",
            "fielddata": true,
            "term_vector": "yes"
          }
      },
      "fielddata": true,
      "term_vector": "yes"
  }

With that in place, the following aggregation returns the most frequent terms together with the number of documents that contain each one.

GET mbti_term/_search
{
  "size": 0,
  "aggs": {
    "term_cnt": {
      "terms": {
        // contents.nori_mixed, contents.nori_noun, etc. also work here
        "field": "contents",
        "size": 1000
      }
    }
  }
}

The buckets can also be sorted by term length (link), although the linked page currently returns a 502 Bad Gateway error.
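Since the linked method is unavailable, here is a minimal client-side sketch of the same idea: sort the buckets of the aggregation response by the length of their key after the response arrives. This is an assumption-laden substitute, not the approach from the link. The function name `sort_buckets_by_key_length` and the sample buckets are made up for illustration.

```python
def sort_buckets_by_key_length(buckets, longest_first=True):
    """Sort terms-aggregation buckets by key length, then by doc_count."""
    return sorted(
        buckets,
        key=lambda bucket: (len(bucket["key"]), bucket["doc_count"]),
        reverse=longest_first,
    )


# Made-up sample in the shape of response["aggregations"]["term_cnt"]["buckets"]
sample_buckets = [
    {"key": "a", "doc_count": 5},
    {"key": "mbti", "doc_count": 3},
    {"key": "test", "doc_count": 7},
    {"key": "of", "doc_count": 9},
]

sorted_keys = [b["key"] for b in sort_buckets_by_key_length(sample_buckets)]
print(sorted_keys)
```

Sorting on the client only reorders the buckets that were returned, so the "size" of the aggregation still decides which terms are available to sort.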

The query above returns the following result (truncated).

{
  "took" : 555,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "term_cnt" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 11058130,
      "buckets" : [
        {
          "key" : "이",
          "doc_count" : 15136
        },
        {
          "key" : "하",
          "doc_count" : 15088
        },
        {
          "key" : "는",
          "doc_count" : 14448
        },
        {
          "key" : "ᆫ",
          "doc_count" : 13909
        },
        {
          "key" : "고",
          "doc_count" : 13348
        },
        {
          "key" : "아",
          "doc_count" : 13032
        },
        {
          "key" : "은",
          "doc_count" : 13015
        },
        {
          "key" : "가",
          "doc_count" : 12940
        },
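Note that doc_count in each bucket is the number of documents containing the term, not the total number of occurrences. The following sketch emulates that counting rule in plain Python on a few made-up token lists. It is only an illustration of the semantics, not how Elasticsearch computes the aggregation internally, and the name `terms_doc_counts` is hypothetical.

```python
from collections import Counter


def terms_doc_counts(docs_tokens, size=10):
    """Count, for each term, how many documents contain it at least once."""
    counts = Counter()
    for tokens in docs_tokens:
        # set() ensures a term repeated inside one document is counted once
        counts.update(set(tokens))
    # Highest doc_count first; ties ordered by term for deterministic output
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:size]


# Made-up tokenized documents standing in for analyzed "contents" values
docs = [
    ["mbti", "test", "mbti"],
    ["mbti", "type"],
    ["test", "result"],
]

top_terms = terms_doc_counts(docs, size=2)
print(top_terms)  # [('mbti', 2), ('test', 2)]
```

Even though "mbti" occurs three times in total, it is reported with a count of 2 because it appears in two documents.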

 

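Emoji Field Configuration

How can we extract only the emojis from the post body?

The query below retrieves only the posts whose body contains an emoji. A regexp query is matched against individual terms, so the pattern is a single character class covering the emoji code point ranges.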

GET mbti_term/_search
{
  "query": {
    "regexp": {
      "contents": "[\uD83D\uDE00-\uD83D\uDE4F\uD83C\uDF00-\uD83D\uDDFF\uD83D\uDE80-\uD83D\uDEFF\uD83C\uDDE0-\uD83C\uDDFF]"
    }
  }  
}
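The \uD83D\uDE00-style escapes in the pattern are UTF-16 surrogate pairs, which makes the ranges hard to read. As a reading aid, the sketch below converts each pair to its Unicode code point using the standard surrogate formula. The helper name `surrogate_pair_to_code_point` is made up for this example.

```python
def surrogate_pair_to_code_point(high, low):
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid UTF-16 surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# (start pair, end pair) for each range in the regexp character class
PATTERN_RANGES = [
    ((0xD83D, 0xDE00), (0xD83D, 0xDE4F)),
    ((0xD83C, 0xDF00), (0xD83D, 0xDDFF)),
    ((0xD83D, 0xDE80), (0xD83D, 0xDEFF)),
    ((0xD83C, 0xDDE0), (0xD83C, 0xDDFF)),
]

decoded_ranges = [
    (surrogate_pair_to_code_point(*start), surrogate_pair_to_code_point(*end))
    for start, end in PATTERN_RANGES
]

for start, end in decoded_ranges:
    print(f"U+{start:X}-U+{end:X}")
```

The four ranges come out as U+1F600-U+1F64F (Emoticons), U+1F300-U+1F5FF (Miscellaneous Symbols and Pictographs), U+1F680-U+1F6FF (Transport and Map Symbols), and U+1F1E0-U+1F1FF (which includes the regional indicator symbols used for flags).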

That takes care of retrieving only the posts that contain emojis. To analyze the emojis themselves, however, they need to be stored in a separate field.

In the next post, I will configure the index so that a dedicated field stores only the emoji values.
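One possible way to get such a field, sketched here purely as an assumption, is to extract the emojis on the client before indexing. This is not the index-level configuration planned for the next post. The field name "emojis", the helper `add_emoji_field`, and the sample documents are all hypothetical. The character class reuses the same four code point ranges as the regexp query.

```python
import re
from collections import Counter

# Same four code point ranges as the regexp query, written as code points
EMOJI_PATTERN = re.compile(
    "[\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF]"
)


def add_emoji_field(doc):
    """Return a copy of doc with a hypothetical 'emojis' list field."""
    enriched = dict(doc)
    enriched["emojis"] = EMOJI_PATTERN.findall(doc.get("contents", ""))
    return enriched


# Made-up documents with a "contents" field like the one in the mapping
docs = [
    {"contents": "I am INFP \U0001F600\U0001F600"},
    {"contents": "ENTJ here \U0001F680 \U0001F600"},
    {"contents": "no emoji in this one"},
]

enriched_docs = [add_emoji_field(doc) for doc in docs]

# Total occurrences of each emoji across all documents
emoji_counts = Counter(e for doc in enriched_docs for e in doc["emojis"])
most_common_emoji, occurrences = emoji_counts.most_common(1)[0]
print(most_common_emoji, occurrences)
```

If the extracted list were indexed as a keyword field, a terms aggregation on it could answer "which emoji appears in the most posts" without enabling fielddata on the full text.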
