Zhihao's Blog

Business Data Analysis: Product Reviews

姜智浩

Created: 2025-10-09 11:45:14
Updated: 2025-10-09 12:21:14

Study Notes


Notice

All code for this post is available at
https://github.com/super-213/business_data_analysis
Download it from there if you need it.

Inspecting the data

import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report


# Reading .xlsx files with pandas requires the openpyxl package
df = pd.read_excel('产品评价.xlsx')
df.head()

Word segmentation

def clean_text(text):
    # Segment the text with jieba and join the tokens with spaces
    return ' '.join(jieba.cut(str(text)))

df['评论_clean'] = df['评论'].apply(clean_text)
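
One detail worth checking before vectorizing: clean_text joins the tokens with spaces, and scikit-learn's TfidfVectorizer then re-tokenizes that string with its default token_pattern, (?u)\b\w\w+\b, which only keeps tokens of two or more word characters. Single-character words such as 好 or 差 would therefore be dropped. The sketch below applies the same regular expression with the standard library only; the sample string is a made-up, already-segmented review, not a row from the dataset.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"

# Hypothetical segmented review (tokens separated by spaces).
segmented = "手机 很 好 用 质量 差"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, segmented)
all_tokens = re.findall(KEEP_SINGLE_CHARS, segmented)

print(default_tokens)  # ['手机', '质量']
print(all_tokens)      # ['手机', '很', '好', '用', '质量', '差']
```

If single-character words matter for sentiment, passing token_pattern=r"(?u)\b\w+\b" to TfidfVectorizer is one option to try.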

TF-IDF

# TF-IDF features; max_features=5000 keeps the 5,000 terms with the highest corpus frequency
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['评论_clean'])
y = df['评价']
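
To make the weights less of a black box, here is a minimal sketch of the formula TfidfVectorizer uses with its default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The tfidf helper and the three sample documents are hypothetical, and the helper simply splits on whitespace instead of applying scikit-learn's token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (scikit-learn's default formula)."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical segmented reviews, not taken from the dataset.
docs = ["手机 速度 很快", "手机 质量 太差", "质量 很好 速度 很快"]
rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second document, 太差 appears in only one review while 手机 and 质量 each appear in two, so 太差 ends up with the largest weight in that row.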

Train/test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
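
For intuition, test_size=0.2 holds out a fifth of the rows and random_state=42 fixes the shuffle so the same rows are held out on every run. A rough standard-library sketch of that idea follows; split_indices is a hypothetical helper and not scikit-learn's actual implementation.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
_, test_idx_again = split_indices(10)

print(len(train_idx), len(test_idx))  # 8 2
print(test_idx == test_idx_again)     # True: same seed, same split
```

If the classes are imbalanced, passing stratify=y to train_test_split keeps the label proportions the same in both parts.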

Models

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Support Vector Machine': LinearSVC(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'MLP': MLPClassifier(random_state=42)
}

# Fit every model on the same split and report its test-set metrics
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f'\nModel: {name}')
    print('Accuracy:', acc)
    print(classification_report(y_test, y_pred))

Model: Naive Bayes
Accuracy: 0.875
              precision    recall  f1-score   support

           0       1.00      0.70      0.83        91
           1       0.82      1.00      0.90       125

    accuracy                           0.88       216
   macro avg       0.91      0.85      0.86       216
weighted avg       0.90      0.88      0.87       216

Model: Logistic Regression
Accuracy: 0.9814814814814815
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        91
           1       0.99      0.98      0.98       125

    accuracy                           0.98       216
   macro avg       0.98      0.98      0.98       216
weighted avg       0.98      0.98      0.98       216

Model: Support Vector Machine
Accuracy: 0.9629629629629629
              precision    recall  f1-score   support

           0       0.93      0.99      0.96        91
           1       0.99      0.94      0.97       125

    accuracy                           0.96       216
   macro avg       0.96      0.97      0.96       216
weighted avg       0.96      0.96      0.96       216

Model: Random Forest
Accuracy: 0.9768518518518519
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        91
           1       0.99      0.97      0.98       125

    accuracy                           0.98       216
   macro avg       0.97      0.98      0.98       216
weighted avg       0.98      0.98      0.98       216

Model: XGBoost
Accuracy: 0.9675925925925926
              precision    recall  f1-score   support

           0       0.94      0.99      0.96        91
           1       0.99      0.95      0.97       125

    accuracy                           0.97       216
   macro avg       0.96      0.97      0.97       216
weighted avg       0.97      0.97      0.97       216

Model: MLP
Accuracy: 0.9583333333333334
              precision    recall  f1-score   support

           0       0.93      0.98      0.95        91
           1       0.98      0.94      0.96       125

    accuracy                           0.96       216
   macro avg       0.96      0.96      0.96       216
weighted avg       0.96      0.96      0.96       216
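
The first report above (Naive Bayes, accuracy 0.875) is worth unpacking, because the single accuracy number hides a lopsided error pattern. Its figures are consistent with 189 of 216 test reviews being correct: all 125 class-1 reviews, but only 64 of the 91 class-0 reviews. That confusion matrix is reconstructed here from the rounded report, not printed by the original code. Assuming 1 marks a positive review, as predict_comment later in the post treats it, the model sends 27 negative reviews to the positive class. The sketch recomputes the reported metrics from the matrix.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Rows are true labels, columns are predicted labels.
# Reconstructed from the rounded Naive Bayes report (an inference, not program output).
confusion = [[64, 27],
             [0, 125]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Class 0: true positives 64, false positives 0, false negatives 27.
p0, r0, f0 = per_class_metrics(confusion[0][0], confusion[1][0], confusion[0][1])
# Class 1: true positives 125, false positives 27, false negatives 0.
p1, r1, f1_pos = per_class_metrics(confusion[1][1], confusion[0][1], confusion[1][0])

print(f"accuracy {accuracy:.3f}")                  # accuracy 0.875
print(f"class 0  {p0:.2f} {r0:.2f} {f0:.2f}")      # class 0  1.00 0.70 0.83
print(f"class 1  {p1:.2f} {r1:.2f} {f1_pos:.2f}")  # class 1  0.82 1.00 0.90
```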

CNN

import pandas as pd
import jieba
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Convert text to integer sequences with Tokenizer
max_words = 10000  # vocabulary size
max_len = 100      # maximum length of each review, in tokens

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['评论_clean'])

X = tokenizer.texts_to_sequences(df['评论_clean'])
X = pad_sequences(X, maxlen=max_len)
y = np.array(df['评价'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
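
What the Tokenizer and pad_sequences steps above produce can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it skips lowercasing and punctuation filtering, and fit_word_index and to_padded_sequence are hypothetical helpers. It does follow the Keras defaults that matter here: word indices start at 1 in order of frequency, words outside the vocabulary are dropped, and both padding and truncation happen at the front of the sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, num_words, max_len):
    """Map words to indices, drop unknown or rare words, then left-pad with 0."""
    seq = [word_index[w] for w in text.split()
           if w in word_index and word_index[w] < num_words]
    seq = seq[-max_len:]                     # truncate from the front
    return [0] * (max_len - len(seq)) + seq  # pad at the front

# Hypothetical segmented reviews, not taken from the dataset.
texts = ["手机 很 好", "手机 质量 差"]
word_index = fit_word_index(texts)

# "新词" was never seen during fitting, so it is silently dropped.
padded = to_padded_sequence("手机 很 好 新词", word_index, num_words=10000, max_len=6)
print(padded)  # [0, 0, 0, 1, 2, 3]
```

Index 0 is reserved for padding, which is why the word indices start at 1.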

model = Sequential([
    Embedding(max_words, 128, input_length=max_len),
    Conv1D(128, 3, activation='relu'),
    GlobalMaxPooling1D(),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
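
The post does not reproduce the output of model.summary(), but the parameter counts can be worked out by hand from the layer sizes in the code. The arithmetic below assumes the Keras defaults for Conv1D (padding='valid', stride 1); Dropout and GlobalMaxPooling1D have no trainable parameters.

```python
max_words, embed_dim, max_len = 10000, 128, 100
filters, kernel_size = 128, 3

# Embedding: one 128-dimensional vector per vocabulary entry.
embedding_params = max_words * embed_dim
# Conv1D: kernel_size * input_channels weights per filter, plus one bias each.
conv_params = kernel_size * embed_dim * filters + filters
# With 'valid' padding and stride 1, the sequence shrinks by kernel_size - 1.
conv_steps = max_len - kernel_size + 1
# GlobalMaxPooling1D reduces (conv_steps, filters) to a vector of length `filters`.
dense64_params = filters * 64 + 64
output_params = 64 * 1 + 1

total_params = embedding_params + conv_params + dense64_params + output_params

print(embedding_params, conv_params, dense64_params, output_params)
# 1280000 49280 8256 65
print(conv_steps, total_params)
# 98 1337601
```

Almost all of the roughly 1.34 million parameters sit in the embedding layer, so shrinking max_words or the embedding dimension is the most direct way to make the model smaller.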

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=64,
    callbacks=[early_stop],
    verbose=1
)
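
Two settings in this call are easy to overlook. validation_split=0.2 sets aside the last 20% of the training rows for validation, taken before any shuffling. EarlyStopping with patience=3 halts training once val_loss has gone three epochs without improving, and restore_best_weights=True rolls the model back to the best epoch. A minimal sketch of the patience rule follows; the loss values are made up for illustration and are not from the post's training run, and stop_and_best_epoch is a hypothetical helper rather than the Keras callback itself.

```python
def stop_and_best_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based, under a patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses for 10 epochs.
val_losses = [0.60, 0.35, 0.22, 0.20, 0.21, 0.23, 0.25, 0.24, 0.26, 0.27]
stop_epoch, best_epoch = stop_and_best_epoch(val_losses)
print(stop_epoch, best_epoch)  # 6 3
```

In this example training stops after the seventh epoch (index 6), and the weights from the fourth epoch (index 3) are the ones that would be kept.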

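loss, acc = model.evaluate(X_test, y_test)
print(f"\nTest accuracy: {acc:.4f}")

def predict_comment(text):
    # Apply the same segmentation and padding that were used for training
    text_cut = ' '.join(jieba.cut(text))
    seq = tokenizer.texts_to_sequences([text_cut])
    seq = pad_sequences(seq, maxlen=max_len)
    pred = model.predict(seq)[0][0]
    # A sigmoid output above 0.5 is treated as a positive review
    return ("positive" if pred > 0.5 else "negative"), float(pred)

# "This phone feels great in the hand and runs fast"
print(predict_comment("这款手机手感非常棒，运行速度快"))
# "The quality is terrible, it broke after two days"
print(predict_comment("质量太差了，用了两天就坏"))

Test accuracy: 0.9676
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 681ms/step
('positive', 0.999203622341156)
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
('negative', 0.03876214474439621)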

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()

Title: Business Data Analysis: Product Reviews
Author: 姜智浩
Created at : 2025-10-09 11:45:14
Updated at : 2025-10-09 12:21:14
Link: https://super-213.github.io/zhihaojiang.github.io/2025/10/09/20251009商业数据分析--产品评价/
License: This work is licensed under CC BY-NC-SA 4.0.
