In natural language processing (NLP), machine learning algorithms need text data converted into numbers before they can work with it. A common way to do this is the Bag of Words (BoW) model. It turns a piece of text such as a sentence, paragraph, or document into a collection of words and counts how often each word appears, ignoring word order and grammar and focusing only on frequency.
This makes it useful for tasks such as text classification, sentiment analysis, and clustering.
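As a quick illustration of this idea (a minimal sketch with made-up sentences, not taken from the implementation below), Python's built-in collections.Counter can turn a sentence into word counts, and two sentences containing the same words in a different order produce the same bag of words:
from collections import Counter

# two sentences containing the same words in a different order
s1 = "the cat sat on the mat"
s2 = "on the mat the cat sat"

# bag-of-words view: word -> frequency, word order is discarded
print(Counter(s1.split()))                         # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(Counter(s1.split()) == Counter(s2.split()))  # True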
Core Components of BoW
- Vocabulary: the list of all unique words in the entire dataset. Each word in the vocabulary corresponds to one feature of the model.
- Document representation: each document is represented as a vector whose elements record how often each vocabulary word appears in that document. These frequencies serve as the model's features.
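To make these two components concrete, here is a small hand-rolled sketch (illustrative only, using a toy corpus rather than the dataset used below) that builds the vocabulary of a tiny corpus and represents each document as a frequency vector over that vocabulary:
# toy corpus of three short "documents"
docs = ["i love nlp", "nlp is fun", "i love fun projects"]

# vocabulary: every unique word in the corpus, one feature per word
vocabulary = sorted({word for doc in docs for word in doc.split()})

# document representation: one frequency vector per document
vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)  # ['fun', 'i', 'is', 'love', 'nlp', 'projects']
print(vectors[0])  # [0, 1, 0, 1, 1, 0]  -> vector for "i love nlp"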
Steps to Implement the Bag of Words (BoW) Model
Let's look at how to implement the BoW model in Python. Here we will use the NLTK, heapq, Matplotlib, WordCloud, NumPy, pandas, and Seaborn libraries.
Step 1: Preprocess the Text
Before applying the BoW model, we need to preprocess the text. This involves:
- Converting the text to lowercase
- Removing non-word characters
- Removing extra whitespace
import nltk
import re

# download the tokenizer models needed by sent_tokenize / word_tokenize
nltk.download('punkt')

text = """Beans. I was trying to explain to somebody as we were flying in, that's corn.
That's beans. And they were very impressed at my agricultural knowledge.
Please give it up for Amaury once again for that outstanding introduction.
I have a bunch of good friends here today, including somebody who I served with,
who is one of the finest senators in the country, and we're lucky to have him,
your Senator, Dick Durbin is here. I also noticed, by the way,
former Governor Edgar here, who I haven't seen in a long time, and
somehow he has not aged and I have. And it's great to see you, Governor.
I want to thank President Killeen and everybody at the U of I System for
making it possible for me to be here today. And I am deeply honored at the Paul
Douglas Award that is being given to me. He is somebody who set the path for so
much outstanding public service here in Illinois. Now, I want to start by
addressing the elephant in the room. I know people are still wondering why
I didn't speak at the commencement."""

# split the text into sentences, then clean each one
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()                # convert to lowercase
    dataset[i] = re.sub(r'\W', ' ', dataset[i])    # remove non-word characters
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])   # collapse extra whitespace

for i, sentence in enumerate(dataset):
    print(f"Sentence {i+1}: {sentence}")
Output:
The preprocessed sentences are printed to the console, one lowercase, punctuation-free sentence per line.
Step 2: Count Word Frequencies
In this step we count how often each word appears in the preprocessed text and store the counts in a pandas DataFrame so they can be viewed as a table.
- We initialize a dictionary to hold the word counts.
- We then tokenize each sentence into words.
- For each word, we check whether it already exists in the dictionary. If it does, we increment its count; if not, we add it with a count of 1.
- Finally, we remove common English stop words and load the remaining counts into a DataFrame sorted by frequency.
from nltk.corpus import stopwords
import pandas as pd

# download the English stop word list used below
nltk.download('stopwords')

# count how often each word appears across all sentences
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

# remove common stop words before building the frequency table
stop_words = set(stopwords.words('english'))
filtered_word2count = {word: count for word, count in word2count.items() if word not in stop_words}

# load the counts into a DataFrame sorted by descending frequency
word_freq_df = pd.DataFrame(list(filtered_word2count.items()), columns=['Word', 'Frequency'])
word_freq_df = word_freq_df.sort_values(by='Frequency', ascending=False)
print(word_freq_df)
Output:
A table of words sorted by descending frequency is printed.
Step 3: Select the Most Frequent Words
Now that we have the word counts, we select the N most frequent words (for example, the top 10) to use in the BoW model. We can visualize these frequent words with a bar chart to see how words are distributed across the dataset, as sketched below.
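As a sketch of what this step can look like (using the heapq and Matplotlib libraries mentioned earlier; the variable names follow the code above, and the plotting choices are illustrative rather than prescribed), heapq.nlargest picks the top 10 words from the filtered frequency dictionary and plt.bar draws them:
import heapq
import matplotlib.pyplot as plt

# select the 10 most frequent (non-stop-word) words
freq_words = heapq.nlargest(10, filtered_word2count, key=filtered_word2count.get)
counts = [filtered_word2count[word] for word in freq_words]

# bar chart of the most frequent words
plt.figure(figsize=(10, 5))
plt.bar(freq_words, counts)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top 10 Most Frequent Words')
plt.xticks(rotation=45)
plt.show()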