6-2. Natural Language Analysis (Bayes' Theorem)

ihl 2020. 9. 19. 16:19

1. Bayes' Theorem

Bayes' theorem is a statement about conditional probability: P(B|A) = P(A|B) * P(B) / P(A).

In words: the probability of B given that A has occurred = the probability of A given that B has occurred * the probability of B / the probability of A.
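As a quick numeric check of the formula, here is a minimal sketch. The function name `bayes_posterior` and all the probabilities are made up for illustration: B is "the mail is an ad" and A is "the title contains the word sale".

```python
def bayes_posterior(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Hypothetical numbers: 50% of mails are ads (P(B)), 60% of ads contain
# "sale" (P(A|B)), and 32% of all mails contain "sale" (P(A)).
p_ad_given_sale = bayes_posterior(p_a_given_b=0.6, p_b=0.5, p_a=0.32)
print(round(p_ad_given_sale, 4))
```

With these assumed numbers, seeing "sale" in the title raises the probability of "ad" from the prior 0.5 to about 0.94.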

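Using Bayes' theorem, we can decide whether an observation A belongs to a category B. In other words, we can implement supervised learning (classification).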

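Naive Bayes classification examines how often words occur in a text and uses those rates to decide which category the text belongs to. The occurrence rate of a word is computed as (number of occurrences of the word in the category) / (total number of words in the category).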
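The occurrence rate described above can be sketched in a few lines. The token list and the helper name `occurrence_rate` are assumptions for illustration, not part of the original code.

```python
from collections import Counter

# Hypothetical, already-tokenized words collected from one category.
category_words = ["sale", "discount", "sale", "coupon", "free", "sale"]

counts = Counter(category_words)
total = sum(counts.values())  # total number of words in the category

def occurrence_rate(word):
    """Occurrences of the word divided by all words in the category."""
    return counts[word] / total

print(occurrence_rate("sale"))     # 3 of the 6 words
print(occurrence_rate("meeting"))  # never seen in this category
```

Note that a word that never appears gets a rate of exactly 0 here, which is why the class below adds 1 to every count.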

import math, sys
# Note: newer KoNLPy releases rename this tagger to Okt; Twitter remains as a deprecated alias
from konlpy.tag import Twitter

class BayesianFilter:
    def __init__(self):
        self.words = set()
        self.word_dict = {}
        self.category_dict = {}

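    def split(self, text):
        # Morphological analysis: normalize and stem, then keep the tokens
        # that are not particles (Josa), endings (Eomi), or punctuation
        results = []
        twitter = Twitter()
        malist = twitter.pos(text, norm=True, stem=True)
        for word in malist:
            if word[1] not in ["Josa", "Eomi", "Punctuation"]:
                results.append(word[0])
        return results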

    def inc_word(self, word, category):
        # Count how often each word occurs in each category
        if category not in self.word_dict:
            self.word_dict[category] = {}
        if word not in self.word_dict[category]:
            self.word_dict[category][word] = 0
        self.word_dict[category][word] += 1
        self.words.add(word)

    def inc_category(self, category):
        # Count how many texts were trained for each category
        if category not in self.category_dict:
            self.category_dict[category] = 0
        self.category_dict[category] += 1

    # Train on one text
    def fit(self, text, category):
        word_list = self.split(text)
        for word in word_list:
            self.inc_word(word, category)
        self.inc_category(category)

    # Score a list of words for a category
    def score(self, words, category):
        # Sum logarithms instead of multiplying probabilities: the product of
        # many small values can underflow to 0
        score = math.log(self.category_prob(category))
        for word in words:
            score += math.log(self.word_prob(word, category))
        return score

    def predict(self, text):
        best_category = None
        max_score = -sys.maxsize
        words = self.split(text)
        score_list = []
        for category in self.category_dict.keys():
            score = self.score(words, category)
            score_list.append((category, score))
            if score > max_score:
                max_score = score
                best_category = category
        return best_category, score_list

    # Get the number of occurrences of a word within a category
    def get_word_count(self, word, category):
        if word in self.word_dict[category]:
            return self.word_dict[category][word]
        else:
            return 0

    # Compute the prior probability of a category
    def category_prob(self, category):
        sum_categories = sum(self.category_dict.values())
        category_v = self.category_dict[category]
        return category_v / sum_categories

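    # Compute the occurrence rate of a word within a category
    def word_prob(self, word, category):
        # Add 1 to the count: with 0, a word never seen in the category
        # would get probability 0 and zero out the whole product
        n = self.get_word_count(word, category) + 1
        d = sum(self.word_dict[category].values()) + len(self.words)
        return n / d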

# Test script: assumes the class above is saved as bayes.py
from bayes import BayesianFilter
bf = BayesianFilter()

# Train on texts (categories: 광고 = "ad", 중요 = "important")
bf.fit("파격 세일!! 오늘까지만 30% 할인", "광고")
bf.fit("쿠폰 선물 & 무료 배송", "광고")
bf.fit("현대 백화점 세일", "광고")
bf.fit("봄과 함께 찾아온 따뜻한 신제품 소식", "광고")
bf.fit("인기 제품 기간 한정 세일", "광고")
bf.fit("오늘 일정 확인", "중요")
bf.fit("프로젝트 진행 상황 보고","중요")
bf.fit("계약 잘 부탁드립니다","중요")
bf.fit("회의 일정이 등록되었습니다.","중요")
bf.fit("오늘 일정이 없습니다.","중요")

# Predict
pre, scorelist = bf.predict("재고 정리 할인, 무료배송")
print("Result: ", pre)
print(scorelist)

After creating a BayesianFilter instance, call fit with a title and its category (광고 "ad" / 중요 "important") to train it.

Each title is split into morphemes, and inc_word stores the per-category usage count of each word in word_dict.

For example: [광고][세일]=3, [광고][인기]=1, [중요][일정]=3, and so on.

inc_category stores in category_dict how many times each category has appeared.

With the training data above, that is [광고]=5 and [중요]=5, since five titles are fitted for each category.
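The two counting structures can be sketched without the morphological analyzer. In the snippet below the tokens are split by hand, which is an assumption: the real tagger may split or stem them differently. The helper name `fit_tokens` is hypothetical, and `defaultdict` simply stands in for the explicit "if not in dict" checks of the class.

```python
from collections import defaultdict

word_dict = defaultdict(lambda: defaultdict(int))  # category -> word -> count
category_dict = defaultdict(int)                   # category -> number of texts

def fit_tokens(tokens, category):
    """Same bookkeeping as fit(), but for an already-tokenized title."""
    for token in tokens:
        word_dict[category][token] += 1
    category_dict[category] += 1

# Hand-tokenized titles (assumed tokenization)
fit_tokens(["현대", "백화점", "세일"], "광고")
fit_tokens(["인기", "제품", "기간", "한정", "세일"], "광고")
fit_tokens(["오늘", "일정", "확인"], "중요")

print(dict(word_dict["광고"]))
print(dict(category_dict))
```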

After that, passing test data (a title) to predict scores it against every category, using the category-word counts accumulated through fit, and returns the category with the highest score as the result.
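The scoring and selection step can be illustrated with a compact, self-contained sketch. All counts below are invented, the category names are written in English for readability, and the titles are assumed to be tokenized already. It mirrors the logic of score and predict but is not the class itself.

```python
import math

# Hypothetical trained state
word_counts = {
    "ad": {"sale": 3, "free": 1},
    "important": {"schedule": 3, "meeting": 1},
}
doc_counts = {"ad": 5, "important": 5}
vocab = {w for counts in word_counts.values() for w in counts}

def score(tokens, category):
    """Log prior plus the sum of add-one smoothed log word probabilities."""
    total_docs = sum(doc_counts.values())
    s = math.log(doc_counts[category] / total_docs)
    denom = sum(word_counts[category].values()) + len(vocab)
    for token in tokens:
        s += math.log((word_counts[category].get(token, 0) + 1) / denom)
    return s

def predict(tokens):
    """Return the category with the highest score."""
    return max(doc_counts, key=lambda c: score(tokens, c))

print(predict(["free", "sale"]))
print(predict(["meeting"]))
```

Using `max` with a key function is just a shorter way to write the loop that tracks `max_score` and `best_category`.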

category_prob is the probability that an incoming mail belongs to a given category ('광고' or '중요'), computed as that category's share of all trained titles.

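word_prob is the occurrence rate (probability) of the given word within that category, with 1 added to the count so that a word unseen in the category does not get probability 0.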
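The add-one adjustment in word_prob is commonly called Laplace smoothing. A minimal sketch with invented numbers follows; the function name `smoothed_word_prob` and its parameters are hypothetical.

```python
def smoothed_word_prob(count, total_in_category, vocab_size):
    """(count + 1) / (words in category + vocabulary size), as in word_prob."""
    return (count + 1) / (total_in_category + vocab_size)

# Hypothetical category with 20 words in total and a vocabulary of 30 distinct words
seen = smoothed_word_prob(count=3, total_in_category=20, vocab_size=30)
unseen = smoothed_word_prob(count=0, total_in_category=20, vocab_size=30)
print(seen)    # (3 + 1) / (20 + 30)
print(unseen)  # (0 + 1) / (20 + 30), small but not zero
```

Adding the vocabulary size to the denominator keeps the smoothed probabilities of all known words summing to 1 within a category.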

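In short, the score of a title for category A combines 1) the probability of category A itself and 2) for each word in the title, the probability of that word appearing in the category.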
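The score method adds logarithms instead of multiplying the probabilities directly. The sketch below, using an invented list of 100 tiny probabilities, shows why: the plain product underflows to 0.0, while the sum of logs stays a usable finite number.

```python
import math

probs = [1e-5] * 100  # hypothetical word probabilities

product = 1.0
for p in probs:
    product *= p      # mathematically 1e-500, below the float range

log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # finite negative number
```

Because the logarithm is monotonic, comparing log scores picks the same category as comparing the original products would.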