
[Elastic Search] MBTI Search Project - 2. Emoji Search and Aggregation

Tempo 2022. 4. 13. 23:17

I'm working on a project to identify the characteristics of each MBTI type. Looking through the collected texts, I noticed that emoji are used a lot.

###Q: Can emoji be searched in Elasticsearch?
###A: Yes. Emoji are treated as text too, so they search just fine!

However, searching for every possible emoji to find out how many documents contain each one is not easy. So let's parse out only the emoji from the data that has already been collected.
I extracted the emoji with a regular expression in Python.

import re

import elasticsearch
import pandas as pd

# ... (the part that loads the data from the DB is omitted)
# df has the columns contents (the collected text) and doc_url (the URL of the text)

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

# Collect every emoji parsed from the source texts into emoji_set
result, emoji_set = [], []
for idx, row in df.iterrows():
    cont, url = row['contents'], row['doc_url']
    emoji_list = emoji_pattern.findall(cont)
    if not emoji_list:
        continue
    # Join the matched runs, then keep each character once per document
    emj_set = list(set(''.join(emoji_list)))
    result.append({url: emj_set})
    emoji_set.extend(emj_set)

# Remove duplicates across documents with set()
emoji_set = list(set(emoji_set))

# Connect to Elasticsearch
es = elasticsearch.Elasticsearch(["<URL>"])

# Function that parses the aggregation result for one emoji
def parse_emoji(t, e):
    """
    :param t: Elasticsearch search response
    :param e: emoji
    """
    
    result_list = t.body.get('aggregations', {}).get('mbti_types', {}).get('buckets', [])
    parse_meta = {r['key']:r['doc_count'] for r in result_list}
    parse_meta = dict({'emoji': e}, **parse_meta)
    
    # {"emoji": "emoji", "MBTI-TYPE": 1 ...}
    return parse_meta

# Put each emoji into the query and collect the result per emoji
parse_list = []
for e in emoji_set:
    query = {
        "size": 0, 
        "query": {
          "match": {
            "contents": e
          }
          },
        "aggs": {
          "mbti_types": {
            "terms": {
              "field": "keyword",
              "size": 16
            }
          }
        }
    }
    t = es.search(index='mbti', body=query)
    parse_list.append(parse_emoji(t, e))

Source: https://studyprogram.tistory.com/1
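The extraction step can be tried without a database or an Elasticsearch cluster. The following is a minimal sketch that uses the same four Unicode ranges as the pattern above; the helper names `extract_emoji` and `count_docs_per_emoji` and the sample texts are made up for illustration and are not part of the project.

```python
import re
from collections import Counter

# Same four ranges as the pattern above; emoji outside them are not matched.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "]+"
)


def extract_emoji(text):
    """Return the sorted unique emoji characters found in text."""
    runs = EMOJI_PATTERN.findall(text)
    return sorted(set("".join(runs)))


def count_docs_per_emoji(texts):
    """Count how many texts contain each emoji at least once."""
    counts = Counter()
    for text in texts:
        counts.update(extract_emoji(text))
    return counts


# Hypothetical sample texts, not data from the project.
samples = [
    "INFP mood today \U0001F600\U0001F600\U0001F680",
    "no emoji here",
    "ENTJ \U0001F680 launch",
]
doc_counts = count_docs_per_emoji(samples)
```

Because the runs are split into single code points, a flag (two regional indicator characters) ends up as two separate entries, and emoji outside the four listed ranges are skipped entirely.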

 


The results obtained this way are as follows.

These are only rough results, but it looks like I will need to insert them into an RDB to consolidate them.
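As a rough illustration of loading such results into a relational table, the sketch below flattens dictionaries in the shape returned by `parse_emoji` into one row per (emoji, MBTI type) pair. SQLite stands in for the RDB here, and the table name `emoji_counts`, the helper `to_rows`, and the counts are all hypothetical.

```python
import sqlite3


def to_rows(parse_list):
    """Flatten {"emoji": e, "<MBTI type>": count, ...} dicts into (emoji, mbti_type, doc_count) rows."""
    rows = []
    for item in parse_list:
        emoji = item["emoji"]
        for key, count in item.items():
            if key != "emoji":
                rows.append((emoji, key, count))
    return rows


# Hypothetical values in the shape that parse_emoji returns.
parse_list = [
    {"emoji": "\U0001F600", "INFP": 12, "ENTJ": 3},
    {"emoji": "\U0001F680", "ENTJ": 7},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emoji_counts ("
    "emoji TEXT, mbti_type TEXT, doc_count INTEGER, "
    "PRIMARY KEY (emoji, mbti_type))"
)
# INSERT OR REPLACE overwrites the stored count when the same
# (emoji, mbti_type) pair is loaded again.
conn.executemany(
    "INSERT OR REPLACE INTO emoji_counts VALUES (?, ?, ?)",
    to_rows(parse_list),
)
conn.commit()

total_entj = conn.execute(
    "SELECT SUM(doc_count) FROM emoji_counts WHERE mbti_type = 'ENTJ'"
).fetchone()[0]
```

A long format like this keeps the table schema fixed even when new emoji or types appear, at the cost of rewriting rows every time the counts change.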

 

There is something to think about, though. Content will keep being collected, and for an emoji that already exists the count has to keep being updated. In an environment where updates happen this often, I don't think an RDB is a good choice.

So there are three directions to consider.

1. Use a database other than an RDB (a document DB or NoSQL).

2. Aggregate the number of documents per emoji by date and by MBTI type.

3. Solve it with aggregation functions in the Elasticsearch query.
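For the third direction, one possible shape (an assumption, not the author's eventual solution) is a single request that uses a `filters` aggregation with one named filter per emoji and a `terms` sub-aggregation on the MBTI field. The helper names below are hypothetical, the field names `contents` and `keyword` are taken from the query above, and the response is a hand-written mock rather than real output.

```python
def build_emoji_agg_query(emojis, mbti_field="keyword", size=16):
    """Build one search body that counts documents per emoji and per MBTI type."""
    return {
        "size": 0,
        "aggs": {
            "emojis": {
                # One named filter per emoji; the emoji itself is the bucket name.
                "filters": {
                    "filters": {e: {"match": {"contents": e}} for e in emojis}
                },
                "aggs": {
                    "mbti_types": {"terms": {"field": mbti_field, "size": size}}
                },
            }
        },
    }


def parse_filters_response(body):
    """Turn the named buckets into the same dict shape that parse_emoji returns."""
    buckets = body.get("aggregations", {}).get("emojis", {}).get("buckets", {})
    parsed = []
    for emoji, bucket in buckets.items():
        counts = {
            b["key"]: b["doc_count"]
            for b in bucket.get("mbti_types", {}).get("buckets", [])
        }
        parsed.append(dict({"emoji": emoji}, **counts))
    return parsed


query = build_emoji_agg_query(["\U0001F600", "\U0001F680"])

# Hand-written mock of a response body, for illustration only.
mock_body = {
    "aggregations": {
        "emojis": {
            "buckets": {
                "\U0001F600": {
                    "doc_count": 15,
                    "mbti_types": {"buckets": [{"key": "INFP", "doc_count": 12}]},
                },
                "\U0001F680": {"doc_count": 0, "mbti_types": {"buckets": []}},
            }
        }
    }
}
parsed = parse_filters_response(mock_body)
```

This replaces one request per emoji with one request in total, but the number of filters grows with the number of distinct emoji, so a very large emoji set may need to be sent in batches.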

 

Since I am currently studying Elasticsearch, next time I will try to build the third solution.
