Business Data Analysis: Product Reviews

姜智浩

Note

All code for this post is available at
https://github.com/super-213/business_data_analysis
Feel free to download it if you need it.

Inspecting the data

import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report


df = pd.read_excel('产品评价.xlsx')
df.head()
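Before going further it can help to check the size of the data, the label balance, and whether any reviews are missing. The sketch below is only an illustration: the `sample` frame and the `summarize` helper are hypothetical stand-ins, and the only thing taken from this post is the pair of column names `评论` and `评价`.

```python
import pandas as pd

# Hypothetical stand-in for 产品评价.xlsx, using the two columns from this post.
sample = pd.DataFrame({
    "评论": ["很好用", "质量差", None, "物流快"],
    "评价": [1, 0, 1, 1],
})

def summarize(frame, text_col="评论", label_col="评价"):
    """Return the row count, the number of missing reviews, and per-label counts."""
    return {
        "rows": len(frame),
        "missing_text": int(frame[text_col].isna().sum()),
        "label_counts": frame[label_col].value_counts().to_dict(),
    }

summary = summarize(sample)
print(summary)
```

Missing reviews matter here because the later `clean_text` step calls `str(text)`, which would turn an empty cell into the literal string `nan` rather than dropping it.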

Word segmentation

def clean_text(text):
    # Segment the text with jieba and join the tokens with spaces
    return ' '.join(jieba.cut(str(text)))

# '评论' is the review-text column
df['评论_clean'] = df['评论'].apply(clean_text)

TF-IDF

# TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['评论_clean'])
# '评价' is the label column (0 or 1)
y = df['评价']
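One detail worth knowing with segmented Chinese text: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more characters, so single-character words such as 好 or 差 are dropped. The sketch below shows the difference on two made-up reviews; whether relaxing the pattern helps on the real data is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two already-segmented toy reviews (made up for illustration).
docs = ["手机 很 好", "质量 太 差"]

# Default settings: tokens must match \w\w+, i.e. at least two characters.
default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Relax the regex to \w+ so single-character tokens are kept as well.
single_char_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(docs)

default_vocab = set(default_vec.vocabulary_)
single_char_vocab = set(single_char_vec.vocabulary_)
print(sorted(default_vocab))
print(sorted(single_char_vocab))
```

With the relaxed pattern the vocabulary also contains 很, 好, 太 and 差, which are often the words that carry the sentiment.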

Train/test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
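The code above fits the TF-IDF vectorizer on all reviews before splitting, so the test reviews contribute to the vocabulary and IDF weights. A possible alternative, sketched here on made-up reviews, is to split the raw text first (with `stratify` to keep the class ratio) and let a scikit-learn pipeline fit TF-IDF on the training part only. This is not what the post ran, and the reported numbers below come from the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already-segmented reviews; 1 = positive, 0 = negative.
texts = ["质量 很好", "非常 满意", "物流 很快", "值得 购买", "性价比 很高",
         "质量 太差", "非常 失望", "物流 太慢", "不 值得 购买", "容易 损坏"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw texts first; stratify keeps the class ratio in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))
pipeline.fit(texts_train, y_train)

test_predictions = pipeline.predict(texts_test)
print(sorted(y_test), len(test_predictions))
```

The same pipeline object can then be passed to `cross_val_score` if a less split-dependent estimate is wanted.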

Models

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

# Candidate classifiers, all trained on the same TF-IDF features
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Support Vector Machine': LinearSVC(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'MLP': MLPClassifier(random_state=42)
}

# Fit each model and report its accuracy and per-class metrics on the test set
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f'\nModel: {name}')
    print('Accuracy:', acc)
    print(classification_report(y_test, y_pred))

Model: Naive Bayes
Accuracy: 0.875
              precision    recall  f1-score   support

           0       1.00      0.70      0.83        91
           1       0.82      1.00      0.90       125

    accuracy                           0.88       216
   macro avg       0.91      0.85      0.86       216
weighted avg       0.90      0.88      0.87       216
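To see how the numbers in the Naive Bayes report fit together, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in this post; they are inferred from the reported accuracy, recall and support values (64 of the 91 class-0 reviews and all 125 class-1 reviews classified correctly), so treat them as an assumption.

```python
# Counts inferred from the Naive Bayes report (an assumption, not printed output).
tn, fp = 64, 27    # true class 0: 64 predicted as 0, 27 predicted as 1
fn, tp = 0, 125    # true class 1: 0 predicted as 0, 125 predicted as 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_0 = tn / (tn + fn)   # of everything predicted 0, how much is truly 0
recall_0 = tn / (tn + fp)      # of everything truly 0, how much was found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_0 = f1(precision_0, recall_0)
f1_1 = f1(precision_1, recall_1)

print(f"accuracy {accuracy:.3f}")
print(f"class 0  {precision_0:.2f} {recall_0:.2f} {f1_0:.2f}")
print(f"class 1  {precision_1:.2f} {recall_1:.2f} {f1_1:.2f}")
```

Under these counts the model never mislabels a positive review, but it sends 27 negative reviews to the positive class, which is why class-0 recall is low while class-0 precision is perfect.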

Model: Logistic Regression
Accuracy: 0.9814814814814815
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        91
           1       0.99      0.98      0.98       125

    accuracy                           0.98       216
   macro avg       0.98      0.98      0.98       216
weighted avg       0.98      0.98      0.98       216

Model: Support Vector Machine
Accuracy: 0.9629629629629629
              precision    recall  f1-score   support

           0       0.93      0.99      0.96        91
           1       0.99      0.94      0.97       125

    accuracy                           0.96       216
   macro avg       0.96      0.97      0.96       216
weighted avg       0.96      0.96      0.96       216

Model: Random Forest
Accuracy: 0.9768518518518519
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        91
           1       0.99      0.97      0.98       125

    accuracy                           0.98       216
   macro avg       0.97      0.98      0.98       216
weighted avg       0.98      0.98      0.98       216

Model: XGBoost
Accuracy: 0.9675925925925926
              precision    recall  f1-score   support

           0       0.94      0.99      0.96        91
           1       0.99      0.95      0.97       125

    accuracy                           0.97       216
   macro avg       0.96      0.97      0.97       216
weighted avg       0.97      0.97      0.97       216

Model: MLP
Accuracy: 0.9583333333333334
              precision    recall  f1-score   support

           0       0.93      0.98      0.95        91
           1       0.98      0.94      0.96       125

    accuracy                           0.96       216
   macro avg       0.96      0.96      0.96       216
weighted avg       0.96      0.96      0.96       216
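With only 216 test reviews, a one-point gap in accuracy is about two reviews, so the ranking among the top models should be read with care. The sketch below puts a rough 95% interval around each accuracy using the normal approximation to the binomial. The correct-prediction counts are the reported accuracies multiplied by 216, and the `accuracy_interval` helper is hypothetical. The approximation is crude this close to 1.0, and because all models share one test set this is not a proper paired comparison.

```python
import math

n_test = 216  # test-set size shown in the reports above

# Correct predictions per model, derived as reported accuracy * 216.
correct = {
    "Naive Bayes": 189,
    "Logistic Regression": 212,
    "Support Vector Machine": 208,
    "Random Forest": 211,
    "XGBoost": 209,
    "MLP": 207,
}

def accuracy_interval(n_correct, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n samples."""
    acc = n_correct / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

intervals = {name: accuracy_interval(c, n_test) for name, c in correct.items()}
for name, (low, high) in intervals.items():
    print(f"{name:<24} {low:.3f} to {high:.3f}")
```

Read this way, Naive Bayes is clearly behind, while the intervals of the other five models overlap, so this single split does not separate them.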

CNN

import pandas as pd
import jieba
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Turn the segmented text into integer sequences with Tokenizer
max_words = 10000  # vocabulary size
max_len = 100      # maximum length of each review, in tokens

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['评论_clean'])

X = tokenizer.texts_to_sequences(df['评论_clean'])
X = pad_sequences(X, maxlen=max_len)
y = np.array(df['评价'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Embedding -> 1D convolution -> global max pooling -> dense classifier
model = Sequential([
    Embedding(max_words, 128, input_length=max_len),
    Conv1D(128, 3, activation='relu'),
    GlobalMaxPooling1D(),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# Stop when validation loss has not improved for 3 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=64,
    callbacks=[early_stop],
    verbose=1
)

loss, acc = model.evaluate(X_test, y_test)
print(f"\nTest accuracy: {acc:.4f}")

def predict_comment(text):
    # Apply the same preprocessing as in training: segment, index, pad
    text_cut = ' '.join(jieba.cut(text))
    seq = tokenizer.texts_to_sequences([text_cut])
    seq = pad_sequences(seq, maxlen=max_len)
    pred = model.predict(seq)[0][0]
    # Sigmoid output above 0.5 is treated as a positive review
    label = "positive" if pred > 0.5 else "negative"
    return label, float(pred)

# "This phone feels great in the hand and runs fast"
print(predict_comment("这款手机手感非常棒,运行速度快"))
# "The quality is terrible, it broke after two days"
print(predict_comment("质量太差了,用了两天就坏"))

Test accuracy: 0.9676
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 681ms/step
('positive', 0.999203622341156)
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
('negative', 0.03876214474439621)
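The CNN sees every review as exactly `max_len` token ids, and it is easy to forget which end gets padded or cut. By default `pad_sequences` pads with zeros at the front and, for long inputs, drops tokens from the front as well. The pure-Python helper below is a hypothetical re-implementation of that default behaviour for a single sequence, written only to make the rule visible; it is not the Keras code.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence.

    Long sequences keep their last `maxlen` ids (truncating='pre');
    short sequences are left-padded with `value` (padding='pre').
    """
    kept = sequence[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

short_review = pad_pre([5, 8, 2], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```

So with `max_len = 100`, a review longer than 100 tokens loses its opening words, not its closing ones.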

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()

[Figure: training and validation accuracy per epoch]

  • Title: Business Data Analysis: Product Reviews
  • Author: 姜智浩
  • Created at : 2025-10-09 11:45:14
  • Updated at : 2025-10-09 12:21:14
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/10/09/20251009商业数据分析--产品评价/
  • License: This work is licensed under CC BY-NC-SA 4.0.