Python Tokenizer - Counting Words

inu 2020. 7. 17. 10:49

Counter

from collections import Counter

words = ['Have', 'you', 'ever', 'fallen', 'head', '.', 'over', 'heels', 'for', 'somebody', '.', 'Not', 'just', 'somebody', '.', 'No', 'no', '.', 'Rex', 'you', 'did', 'it', 'again', '.', 'Have', 'you', 'ever', 'fallen', 'head', '.', 'over', 'heels', 'for', 'somebody', '.', 'That', 'made', 'promises', '.', 'to', 'give', 'you', 'the', 'world', 'Um', '.', 'I', 'really', 'hope', 'they', 'held', 'you', 'down', '.', 'I', 'really', 'hope', 'it', 'was', 'no', 'lying', '.', 'Cause', 'when', 'heart', 'breaks', 'it', '.', 'feel', 'like', 'the', 'world', "'s", 'gone', '.', 'But', 'if', 'the', 'love', "'s", 'real', '.']

# Tokens to exclude from the count (punctuation only).
sw = ['.', ',']
removed_list = []
for word in words:
  if word.lower() not in sw:
    removed_list.append(word)


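# Count how many times each remaining token appears.
count_list = Counter(removed_list)

print(type(count_list))
print(count_list)
print()
common_cl = count_list.most_common(10)  # the 10 most frequent (word, count) pairs
print(type(common_cl))
print(common_cl)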

==Result==
<class 'collections.Counter'>
Counter({'you': 5, 'somebody': 3, 'it': 3, 'the': 3, 'Have': 2, 'ever': 2, 'fallen': 2, 'head': 2, 'over': 2, 'heels': 2, 'for': 2, 'no': 2, 'world': 2, 'I': 2, 'really': 2, 'hope': 2, "'s": 2, 'Not': 1, 'just': 1, 'No': 1, 'Rex': 1, 'did': 1, 'again': 1, 'That': 1, 'made': 1, 'promises': 1, 'to': 1, 'give': 1, 'Um': 1, 'they': 1, 'held': 1, 'down': 1, 'was': 1, 'lying': 1, 'Cause': 1, 'when': 1, 'heart': 1, 'breaks': 1, 'feel': 1, 'like': 1, 'gone': 1, 'But': 1, 'if': 1, 'love': 1, 'real': 1})

<class 'list'>
[('you', 5), ('somebody', 3), ('it', 3), ('the', 3), ('Have', 2), ('ever', 2), ('fallen', 2), ('head', 2), ('over', 2), ('heels', 2)]

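This uses Counter from the collections module.
It counts how many times each element appears in the given list. (The word list would normally be produced by tokenization; this code uses a list that was already tokenized.) Counting is case-sensitive, so 'No' and 'no' are separate entries.
The return type is a collections.Counter object.
Calling most_common on that object returns a list of (word, count) tuples, starting with the most frequent word and cut off at the requested number of entries.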
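
The word list above was tokenized ahead of time. As a minimal sketch of the full flow, the block below tokenizes and counts in one go. It assumes a simple regex tokenizer (a stand-in, not the tokenizer that produced the list above) and a hypothetical sample_text:

```python
import re
from collections import Counter

# Hypothetical input text, used only for illustration.
sample_text = "Have you ever fallen head over heels for somebody. Not just somebody."

# Assumption: a simple regex tokenizer that keeps runs of letters and apostrophes.
# It drops punctuation, so no separate stop-list step is needed here.
tokens = re.findall(r"[A-Za-z']+", sample_text)

token_counts = Counter(tokens)

# most_common(n) returns a list of (word, count) tuples, highest count first.
top_two = token_counts.most_common(2)
print(top_two)
```

A regex like this is only good enough for quick experiments; a dedicated tokenizer handles contractions and punctuation more carefully.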

Example 1: Building a dictionary ordered by frequency

common_cl_dict = {}
# Assign indices 0 .. (number of words - 1) in order of frequency.

for i, (word, _) in enumerate(common_cl):
  common_cl_dict[word] = i

print(common_cl_dict)

==Result==
{'you': 0, 'somebody': 1, 'it': 2, 'the': 3, 'Have': 4, 'ever': 5, 'fallen': 6, 'head': 7, 'over': 8, 'heels': 9}
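
A mapping like this can be used to turn tokens into integer indices. The sketch below assumes a hypothetical helper named encode and an assumed convention that words outside the top 10 map to -1; neither comes from the original code.

```python
# Rank dictionary copied from the result above.
common_cl_dict = {'you': 0, 'somebody': 1, 'it': 2, 'the': 3, 'Have': 4,
                  'ever': 5, 'fallen': 6, 'head': 7, 'over': 8, 'heels': 9}

def encode(tokens, vocab, unknown=-1):
    """Map each token to its rank in vocab; unseen tokens get `unknown` (assumed convention)."""
    return [vocab.get(token, unknown) for token in tokens]

encoded = encode(['Have', 'you', 'ever', 'loved', 'somebody'], common_cl_dict)
print(encoded)  # [4, 0, 5, -1, 1]
```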

Example 2: Building a list of one-hot vectors from the dictionary in Example 1

oh_vector_list = []

for value in common_cl_dict.values():
    oh_vector = [0] * len(common_cl_dict)
    oh_vector[value] = 1
    oh_vector_list.append(oh_vector)

print(oh_vector_list)

==Result==
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
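
The list above holds one vector per word, in rank order. To look up the vector for a single word instead, something like the following would work. The helper name one_hot is hypothetical, and the sketch assumes the word is present in the dictionary.

```python
# Rank dictionary copied from the result of Example 1.
common_cl_dict = {'you': 0, 'somebody': 1, 'it': 2, 'the': 3, 'Have': 4,
                  'ever': 5, 'fallen': 6, 'head': 7, 'over': 8, 'heels': 9}

def one_hot(word, vocab):
    """Return a one-hot list for `word`; raises KeyError if the word is not in vocab."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

print(one_hot('it', common_cl_dict))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```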

Example 3: Creating a word cloud image from count_list

from wordcloud import WordCloud
import matplotlib.pyplot as plt

my_wc = WordCloud(background_color = 'white')
plt.imshow(my_wc.generate_from_frequencies(count_list))
plt.show()
