[Elastic Search] MBTI Search Project - 2. Emoji Search and Aggregation (Part 2)

Tempo 2022. 4. 21. 20:29

I parsed only the emoji out of the existing content and collected that data in an RDB.

(Doing the analysis with a regex filter applied to the ES analyzer is something I'll try next time!)
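
A minimal sketch of that parsing step, assuming the emoji of interest fall inside a few common Unicode blocks. The `EMOJI_PATTERN` ranges and the `count_emoji` helper are illustrative assumptions, not the code actually used for this post.

```python
import re
from collections import Counter

# Assumption: a couple of common emoji blocks are enough for illustration.
# This does not cover every emoji (for example flags or ZWJ sequences).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def count_emoji(text: str) -> Counter:
    """Count each emoji code point found in `text`."""
    return Counter(EMOJI_PATTERN.findall(text))


counts = count_emoji("good morning 😘😘 see you soon 😀")
print(counts)
```

Each `(emoji, count)` pair, together with the MBTI type of the post it came from, would then become one row in the table described below.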

I set up the schema as follows and inserted the data (RDB).

Table: t_emoji_dashboard

Columns: emoji, mbti_type (the MBTI type), emoji_count (the number of occurrences in each document)

SELECT emoji, mbti_type, SUM(emoji_count) AS total_count
FROM t_emoji_dashboard
WHERE emoji = '😘'
GROUP BY emoji, mbti_type
ORDER BY mbti_type, total_count DESC
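
To try the same aggregation without a database server, here is a small sketch using Python's built-in `sqlite3` module. The rows are made up for illustration and are not data from this project; only the table and column names come from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_emoji_dashboard (emoji TEXT, mbti_type TEXT, emoji_count INTEGER)"
)
# Hypothetical rows: one row per (document, emoji) pair.
conn.executemany(
    "INSERT INTO t_emoji_dashboard VALUES (?, ?, ?)",
    [
        ("😘", "ESTJ", 2),
        ("😘", "ESTJ", 1),
        ("😘", "ESFP", 4),
        ("😀", "ESTJ", 5),
    ],
)

# Same shape as the query above: total occurrences of one emoji per MBTI type.
rows = conn.execute(
    """
    SELECT emoji, mbti_type, SUM(emoji_count) AS total_count
    FROM t_emoji_dashboard
    WHERE emoji = ?
    GROUP BY emoji, mbti_type
    ORDER BY mbti_type, total_count DESC
    """,
    ("😘",),
).fetchall()
print(rows)
```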

This is the result of the query above (only one specific emoji was queried). The right-hand side shows a query over the whole table (GROUP BY emoji).

By storing the data in the RDB first, I got a picture of which emoji the current documents contain.

Now let's run the equivalent lookup in Elasticsearch, mirroring the RDS query above.

GET mbti/_search
{
  "size": 0,
  "query": {
    "terms": {
      "contents": ["😘"]
    }
  },
  "aggregations": {
    "significant_mbti_type": {
      "significant_terms": {
        "field": "keyword",
        "min_doc_count": 0
      }
    }
  }
}
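
If the same request has to be issued for several emoji, the body can be built programmatically. The sketch below only constructs the JSON body; `build_emoji_query` is a hypothetical helper name, and how the body is sent (HTTP client, official client library) is left open on purpose.

```python
import json


def build_emoji_query(emoji: str, agg_field: str = "keyword") -> dict:
    """Build the search body: filter by emoji token, then bucket by `agg_field`."""
    return {
        "size": 0,  # return aggregations only, no hits
        "query": {"terms": {"contents": [emoji]}},
        "aggregations": {
            "significant_mbti_type": {
                "significant_terms": {"field": agg_field, "min_doc_count": 0}
            }
        },
    }


body = build_emoji_query("😘")
print(json.dumps(body, ensure_ascii=False, indent=2))
```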

The result is as follows.

{
  "aggregations" : {
    "significant_mbti_type" : {
      "doc_count" : 136,
      "bg_count" : 27146,
      "buckets" : [
        {
          "key" : "ESTJ",
          "doc_count" : 40,
          "score" : 0.28670597307320017,
          "bg_count" : 4043
        },
        {
          "key" : "ESTP",
          "doc_count" : 24,
          "score" : 0.12469690824841013,
          "bg_count" : 2807
        },
        {
          "key" : "ESFJ",
          "doc_count" : 25,
          "score" : 0.11545638190983143,
          "bg_count" : 3065
        },
        {
          "key" : "ESFP",
          "doc_count" : 5,
          "score" : 0.007764077040010751,
          "bg_count" : 824
        }
      ]
    }
  }
}
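
The `score` values in this response can be reproduced by hand. They are consistent with the JLH heuristic that Elasticsearch documents as the default for `significant_terms`: the absolute change in a term's frequency between the foreground and background sets, multiplied by the relative change. The sketch below uses only the numbers from the response above; the `jlh_score` function name is my own.

```python
# Values copied from the response above.
FG_TOTAL = 136    # documents matching the emoji query (foreground set)
BG_TOTAL = 27146  # documents in the background set

buckets = [
    {"key": "ESTJ", "doc_count": 40, "bg_count": 4043, "score": 0.28670597307320017},
    {"key": "ESTP", "doc_count": 24, "bg_count": 2807, "score": 0.12469690824841013},
    {"key": "ESFJ", "doc_count": 25, "bg_count": 3065, "score": 0.11545638190983143},
    {"key": "ESFP", "doc_count": 5, "bg_count": 824, "score": 0.007764077040010751},
]


def jlh_score(fg_count: int, fg_total: int, bg_count: int, bg_total: int) -> float:
    """(foreground rate - background rate) * (foreground rate / background rate)."""
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    # A term that is not more frequent in the foreground is treated as not significant here.
    if fg_rate <= bg_rate:
        return 0.0
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


for b in buckets:
    computed = jlh_score(b["doc_count"], FG_TOTAL, b["bg_count"], BG_TOTAL)
    fg_rate = b["doc_count"] / FG_TOTAL
    bg_rate = b["bg_count"] / BG_TOTAL
    print(f'{b["key"]}: foreground {fg_rate:.1%}, background {bg_rate:.1%}, score {computed:.6f}')
```

For ESTJ, for example, about 29.4% of the emoji-matching documents are ESTJ posts versus about 14.9% of all documents, which is why it ranks first.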

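In the query above, contents is a text field, so its values are tokenized at index time. The terms query does not analyze its input and only matches exact tokens, so it worked here because the emoji was indexed as a token of its own!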
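
To make that matching behavior concrete, here is a toy model in Python. The `analyze` function is a deliberately simplified stand-in for an analyzer (it is not Lucene's actual tokenizer), and `terms_match` mimics the exact-token comparison of a terms query; both names are assumptions made for this sketch.

```python
import re

# Toy tokenizer: each emoji becomes its own token, runs of word characters become tokens.
TOKEN_PATTERN = re.compile("[\U0001F300-\U0001FAFF]|\\w+")


def analyze(text: str) -> list:
    """Split text into lowercase tokens, roughly like an index-time analyzer."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def terms_match(text: str, terms: list) -> bool:
    """True if any search term equals an indexed token exactly (no analysis of the terms)."""
    tokens = set(analyze(text))
    return any(term in tokens for term in terms)


doc = "Love you😘 ESTJ friends"
print(analyze(doc))
print(terms_match(doc, ["😘"]))          # the emoji is a token of its own
print(terms_match(doc, ["Love you😘"]))  # the raw phrase is not a single token
print(terms_match(doc, ["Love"]))        # the indexed token is lowercase "love"
```

The last two lookups fail because the search terms are compared as-is against the indexed tokens, which is the reason a terms query is usually aimed at keyword fields or at values already in their tokenized form.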

 

Significant terms aggregation | Elasticsearch Guide [8.1] | Elastic

www.elastic.co

To explain the query: it fetches the posts whose contents field contains the given emoji.
It then aggregates those documents and shows the document counts: the top-level doc_count is the total number of matching documents, and each bucket's doc_count is the number for one MBTI type.

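Looking at this, we can see that the numbers are similar to those in the RDB, where I parsed and inserted the data myself. They need not match exactly, since the SQL sums emoji occurrences while doc_count counts documents.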

 

In the next post, I'll use a regex filter in the ES analyzer to build a field that stores only the emoji found in the content, and then query things such as:

1) the number of emoji stored in the documents

2) the MBTI types associated with each emoji
