How to Do Text Analysis with sklearn

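This article is about gathering statistics from text in order to extract the useful information it contains.

When choosing a method, pay attention to the evaluation metrics; when analyzing text, pay attention to the conclusions obtained. Because the emphasis differs, it is worth trying several reasonably good methods.

To repeat, the focus of this article is extracting information.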

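Counting word frequencies and extracting keywords

Extract high-frequency words and keywords.

This step can use the built-in collections.Counter directly; it counts word frequencies and returns the most frequent words.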
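
As a minimal sketch of this step, the snippet below counts words with collections.Counter. The sample sentence and the naive regex tokenizer are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical sample text; replace with your own document.
text = "the cat sat on the mat and the dog sat on the log"

# Naive tokenizer (an assumption): lowercase, keep runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

word_counts = Counter(tokens)

# most_common(n) returns the n most frequent (word, count) pairs.
top_words = word_counts.most_common(3)
print(top_words)
```

For real text, a proper tokenizer and a stop-word list would usually replace the regex used here.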

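Visualizing the word frequency distribution

Count how often each word occurs in the text and show the top N most common words in a bar chart.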


import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample texts
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Extract word counts with CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
word_counts = X.sum(axis=0).A1  # total count of each word over all documents

# Build a DataFrame
df = pd.DataFrame({'word': feature_names, 'count': word_counts})
# Sort by count in descending order and keep the top 10
top_words = df.sort_values(by='count', ascending=False).head(10)

# Draw the bar chart
plt.figure(figsize=(10, 6))
plt.bar(top_words['word'], top_words['count'])
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Frequent Words')
plt.xticks(rotation=45)
plt.show()

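Visualizing text classification results
A confusion matrix can show how a classification model performs; it displays directly how the samples of each class were classified.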
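
A small sketch of computing a confusion matrix with sklearn follows. The labels and predictions are hypothetical values made up for illustration; in practice they would come from a trained classifier and a held-out test set.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a three-class problem.
labels = ["sports", "politics", "tech"]
y_true = ["sports", "sports", "politics", "tech", "tech", "politics"]
y_pred = ["sports", "politics", "politics", "tech", "sports", "politics"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

The diagonal holds the correctly classified samples. To draw the matrix as a heat map, `ConfusionMatrixDisplay(cm, display_labels=labels).plot()` from sklearn.metrics can be used.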

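Visualizing text clustering results
For text clustering results, a scatter plot can show how the dimensionality-reduced text vectors are distributed in a two-dimensional plane, with a different color for each cluster.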

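Extracting keywords

We want keywords to reflect and distinguish the topic of each document.

The high-frequency words of a document can serve as its keywords. However, common words such as articles cannot be used to tell documents apart.

tf-idf can be used to correct the raw word frequency by reducing the weight of words that appear across many documents.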

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

docs = [
    "I love natural language processing",
    "Natural language processing is fun",
    "I love machine learning"
]

# Compute TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Convert to a DataFrame for easier handling
df = pd.DataFrame(tfidf_matrix.toarray(), columns=terms)
print(df)


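# Print the keywords of each document, sorted by weight
for i, row in df.iterrows():
    sorted_terms = row.sort_values(ascending=False)
    keywords = [(term, round(score, 3)) for term, score in sorted_terms.items() if score > 0]
    # Format each (term, score) pair as text; str.join only accepts strings
    print(f"Document {i+1} top keywords:", ", ".join(f"{term} ({score})" for term, score in keywords[:3]))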

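Text classification

Examples include spam filtering and sentiment analysis (for instance on IMDB reviews).

Often the goal is only to score or filter data, and there is no need to train a model ourselves.

Keep in mind that the training data is effectively part of a model and shapes the model's biases.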

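Using text keywords

Regular expressions can be used.
Keywords can act as blocked words: reject the text as soon as one appears.
A threshold can also be set on the number of occurrences.
Each word can also be given its own score.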
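
The blocked-word and per-word scoring ideas above can be sketched as follows. The keyword weights, the blocked word, and the threshold are hypothetical values chosen only for illustration.

```python
import re

# Hypothetical keyword weights and threshold, chosen only for illustration.
KEYWORD_SCORES = {"free": 2.0, "prize": 3.0, "urgent": 1.5, "click": 1.0}
BLOCKED_WORDS = {"viagra"}
THRESHOLD = 3.0

def score_text(text):
    """Sum the weights of all keyword occurrences in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(KEYWORD_SCORES.get(token, 0.0) for token in tokens)

def is_rejected(text):
    """Reject on any blocked word, or when the total score reaches the threshold."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    if tokens & BLOCKED_WORDS:
        return True
    return score_text(text) >= THRESHOLD

print(is_rejected("Click now for a FREE prize"))
print(is_rejected("Meeting moved to Monday"))
```

Hand-set weights like these are easy to audit, but they have to be tuned manually.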

Using import regex as re (the third-party regex package) adds better Unicode support; for example, \p{Han} matches Chinese characters.

import regex as re
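spam_regex = re.compile(r"""
    # English spam signals (each alternative below counts as one signal)
    free\s+(gift|offer|trial)  # 1. Free-offer bait: free gift/offer/trial
    | win\s+(cash|prize|award) # 2. Prize bait: win cash/prize/award
    | urgent|immediate|hurry   # 3. Urgency: keywords that create pressure
    | click\s+here             # 4. Click bait: a common redirect phrase
    | visit\s+our\s+site       # 4. Click bait: another redirect phrase
    | www\.\w+\.\w+            # 5. Suspicious link: URL without a scheme (e.g. www.fake.com)
    | http(s)?://\w+           # 5. Suspicious link: URL with http/https
    | unsubscribe\s+by         # 6. Unsubscribe wording typical of spam
    | reply\s+to\s+remove      # 6. Unsubscribe wording: another variant
        """, re.IGNORECASE | re.VERBOSE)  # two flags: ignore case + allow whitespace and comments in the pattern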

test_emails = [
    "FREE Gift! Click here: http://fake.com",
    "Urgent! You won $500 cash - reply to remove if not interested",
    "Hi, let's confirm the meeting time for next Monday"
]

for i, email in enumerate(test_emails, 1):
    # Count non-overlapping matches; two or more signals mark the email as spam
    n_signals = len(spam_regex.findall(email))
    print(f"\n--- Email {i} ---")
    print(f"Content: {email}")
    print(f"Is Spam? {'YES' if n_signals >= 2 else 'NO'}")

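Text classification with TF-IDF

Documents can be classified by their keywords. The difference from the approach above is that the thresholds and the combination of parameters can be learned by a model.

TF (term frequency) captures whether a keyword appears in a document and how many times; this information can be used to train a machine learning model. IDF (inverse document frequency) lowers the weight of words that appear in many documents.

TF-IDF can be used for text classification and is a fairly classic method.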

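Other improvements

Remove stop words.
Use n-grams to take word combinations into account.
Use dimensionality reduction to discover topics.
For the model, naive Bayes works; SVM or xgboost are also options.

In practice, test each of these to see whether it actually helps.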
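
As a rough sketch of how some of these options fit together, the pipeline below removes English stop words, adds bigrams, and uses LinearSVC as the SVM. The four training texts and their labels are hypothetical toy data, far too small for a meaningful evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now",
    "free gift click here",
    "meeting agenda for monday",
    "please review the project report",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    # Drop English stop words and use unigrams plus bigrams as features.
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    # Any classifier that accepts sparse input can be swapped in here.
    ("clf", LinearSVC()),
])
model.fit(texts, labels)

vocabulary = model.named_steps["tfidf"].vocabulary_
print(sorted(vocabulary))
print(model.predict(["free prize inside"]))
```

Whether bigrams or a different classifier help depends on the data, so each variant should be compared on a held-out set.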

News classification

# Text classification with TF-IDF
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline

# 1. Load the dataset (the 20 newsgroups dataset)
# Select a few categories to speed up training
categories = ['rec.sport.hockey', 'talk.politics.misc', 
              'comp.graphics', 'sci.med']
newsgroups = fetch_20newsgroups(subset='all', categories=categories,
                                remove=('headers', 'footers', 'quotes'))

X = newsgroups.data  # text data
y = newsgroups.target  # labels

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

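# 3. Build a pipeline that combines TF-IDF and the classifier
# This ensures the same vectorizer settings are applied in cross-validation and at test time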
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),
    ('clf', MultinomialNB()),
])

# 4. Train the model
text_clf.fit(X_train, y_train)

# 5. Predict on the test set
y_pred = text_clf.predict(X_test)

# 6. Evaluate model performance
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

# 7. Example predictions
sample_texts = [
    "The goalkeeper made a great save in the last minute of the game",
    "The new computer graphics card has 8GB of VRAM",
    "The president announced new healthcare policies yesterday",
    "The patient was diagnosed with a rare form of cancer"
]

predicted_categories = text_clf.predict(sample_texts)
for text, category in zip(sample_texts, predicted_categories):
    print(f"\nText: {text}")
    print(f"Predicted category: {newsgroups.target_names[category]}")

IMDB sentiment analysis

Other methods

Text classification with FastText/BERT

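fastText ships with a command-line tool. Because its classifier lacks unsupervised pretraining, the results may not be as good.
These models can be implemented by hand in PyTorch, or the Hugging Face implementations can be used.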
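
As a small sketch of preparing data for the fastText command-line tool: its supervised mode reads one example per line, with the label marked by the `__label__` prefix. The helper name, the example texts, and the file name train.txt are assumptions made for illustration.

```python
def to_fasttext_line(label, text):
    """Format one example for fastText's supervised mode:
    a `__label__<name>` prefix followed by the text on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {single_line}"

# Hypothetical labeled examples.
examples = [
    ("spam", "FREE gift!\nClick here"),
    ("ham", "Let's confirm the meeting time"),
]

with open("train.txt", "w", encoding="utf-8") as f:
    for label, text in examples:
        f.write(to_fasttext_line(label, text) + "\n")

print(to_fasttext_line(*examples[0]))
```

The resulting file can then be passed to the tool, for example `fasttext supervised -input train.txt -output model`.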

Text classification with ChatGPT

This is done by writing a prompt.
When necessary, a few examples can be included in the prompt.
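
A minimal sketch of building such a prompt, with optional few-shot examples, is shown below. The function name, the prompt wording, and the sample labels are all assumptions; sending the prompt to a model is left out because it depends on the client library in use.

```python
def build_prompt(text, labels, examples=()):
    """Build a classification prompt with optional few-shot examples."""
    lines = [
        "Classify the text into exactly one of these categories: "
        + ", ".join(labels) + ".",
        "Answer with the category name only.",
        "",
    ]
    # Each few-shot example is shown in the same format as the final query.
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Category: {example_label}", ""]
    lines += [f"Text: {text}", "Category:"]
    return "\n".join(lines)

# Hypothetical labels and examples.
prompt = build_prompt(
    "The goalkeeper made a great save",
    labels=["sports", "politics"],
    examples=[("The senate passed the bill", "politics")],
)
print(prompt)
```

Restricting the answer to a fixed list of category names makes the model's reply easier to parse.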

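Clustering and dimensionality reduction for text

Clustering and dimensionality reduction

These are used to discover the topics of documents.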

KMeans

A scatter plot shows the result. KMeans groups documents into clusters without using labels.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Load the dataset
categories = ['alt.atheism', 'soc.religion.christian']
data = fetch_20newsgroups(subset='all', categories=categories)

# Feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data.data)

# Clustering
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)

# Dimensionality reduction
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)

# Visualize the clustering result
plt.figure(figsize=(10, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=kmeans.labels_, cmap='viridis')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('Text Clustering Results')
plt.show()

PCA

Dimensionality reduction can reveal which topic a keyword belongs to.

LDA

A bar chart can show the top N keywords of each topic together with their weights.

# LDA topic modeling, which can be seen as a kind of text clustering
import numpy as np
from sklearn.datasets import fetch_20newsgroups  
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


# Use the newsgroups dataset provided by sklearn
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
texts = newsgroups_data.data[:1000]  # keep the first 1000 documents
print(len(texts))

# Vectorize the text
vectorizer = CountVectorizer(stop_words='english')  # stop_words='english' removes English stop words
X = vectorizer.fit_transform(texts)

# Train the LDA model
lda = LatentDirichletAllocation(n_components=5, random_state=0)  # 5 topics; fixed seed for reproducibility
lda.fit(X)

# Show the top keywords of each topic
n_top_words = 5
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:")
    top_words_idx = topic.argsort()[:-n_top_words - 1:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(", ".join(top_words))
    print([topic[i] for i in top_words_idx])

# Infer the topic distribution of each document
doc_topic_distribution = lda.transform(X)
print("Topic distribution of each document:")
print(doc_topic_distribution)

Other tasks

Spell checking, using edit distance.
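
A minimal sketch of this idea: compute the Levenshtein edit distance with dynamic programming and suggest the closest word from a vocabulary. The vocabulary and the helper names are hypothetical; a real spell checker would also take word frequency into account.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def suggest(word, vocabulary):
    """Return the vocabulary word closest to `word`."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

# Hypothetical vocabulary.
vocabulary = ["document", "language", "learning", "processing"]
print(edit_distance("kitten", "sitting"))  # 3
print(suggest("langauge", vocabulary))     # language
```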

Appendix

Tokenization can be done with NLTK or spaCy. For Chinese word segmentation, use jieba.

Visualizing text information

Other tools

huggingface

pytorch + fasttext