Business Data Analysis: Product Pricing Model

姜智浩

Note

All code for this post is available at
https://github.com/super-213/business_data_analysis
Download it from there if you need it.

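Inspecting the data

import pandas as pd

# Load the pricing dataset and preview the first five rows
df = pd.read_excel('产品定价模型.xlsx')
df.head()

页数 类别 彩印 纸张 价格
207 技术类 0 双胶纸 60
210 技术类 0 双胶纸 62
206 技术类 0 双胶纸 62
218 技术类 0 双胶纸 64
209 技术类 0 双胶纸 60

The columns are 页数 (page count), 类别 (category), 彩印 (color printing), 纸张 (paper type) and 价格 (price). In the rows shown, 技术类 is the technical category and 双胶纸 is offset paper.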

Handling missing values

# Count the missing values in each column
df.isnull().sum()

Column  Missing values
页数    0
类别    0
彩印    0
纸张    0
价格    0

Handling outliers

import matplotlib.pyplot as plt

# Box plot for each numeric column, bar chart of value counts for the rest
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        plt.boxplot(df[col])
        plt.title(col)
        plt.show()
    else:
        df[col].value_counts().plot(kind='bar')
        plt.title(col)
        plt.show()

[Figures: one plot per column, box plots for the numeric columns and bar charts for the categorical ones]

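The plots show outliers in 页数 (page count), so that column is winsorized: the lowest 1% and the highest 1% of its values are clipped to the nearest remaining value.

from scipy.stats import mstats

# limits=[0.01, 0.01] clips 1% of the values in each tail
df['页数'] = mstats.winsorize(df['页数'], limits=[0.01, 0.01])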
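
To make the effect of `limits=[0.01, 0.01]` concrete, here is a minimal sketch on made-up page counts. The array `pages` is hypothetical and is not the post's data; with 100 points, 1% per tail means exactly one value is replaced at each end.

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical page counts: 1..99 plus one extreme value.
pages = np.arange(1, 101, dtype=float)
pages[-1] = 5000.0

# Winsorizing replaces the most extreme values in each tail with the
# nearest value that is kept, instead of dropping the rows.
clipped = np.asarray(mstats.winsorize(pages, limits=[0.01, 0.01]))

print("before:", pages.min(), pages.max())
print("after: ", clipped.min(), clipped.max())
```

The row count is unchanged, which is why winsorizing is often preferred over deleting outliers when the dataset is small.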

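Feature encoding

from sklearn.preprocessing import LabelEncoder

# Inspect the two categorical columns before encoding them
df['类别'].describe()
df['纸张'].describe()

# One encoder per column, so that each fitted mapping stays available
le_category = LabelEncoder()
le_paper = LabelEncoder()
df['类别'] = le_category.fit_transform(df['类别'])
df['纸张'] = le_paper.fit_transform(df['纸张'])

Summary of 类别 (category):
Statistic  Value
count      1000
unique     3
top        技术类
freq       336

Summary of 纸张 (paper type):
Statistic  Value
count      1000
unique     3
top        双胶纸
freq       615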
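
As a small illustration of what `LabelEncoder` produces, the sketch below encodes a toy list and recovers the label-to-code mapping. The category names are hypothetical placeholders, not values from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only.
categories = ["tech", "novel", "tech", "textbook"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; a label's code is its index there.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# The fitted encoder can translate codes back into labels.
print(encoder.inverse_transform(codes).tolist())
```

The integer codes follow sort order, not any business meaning. Tree models can split around that, but distance-based and linear models read the codes as magnitudes.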

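Train/test split

from sklearn.model_selection import train_test_split

# Features are every column except the price; the price is the target
X = df.drop(['价格'], axis=1)
y = df['价格']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)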

GBDT

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

# Fit a gradient-boosted tree regressor and predict on the held-out set
GBDT = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
GBDT.fit(X_train, y_train)
y_pred = GBDT.predict(X_test)

# np.sqrt of the MSE gives the RMSE without the squared=False argument,
# which is deprecated since scikit-learn 1.4 and absent from recent releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", rmse)
r2 = r2_score(y_test, y_pred)
print(r2)
rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
print("GBDT RMSLE:", rmsle)

Root Mean Squared Error: 8.252320312081766
0.8465042353222849
GBDT RMSLE: 0.14476982287151097
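
The same three metrics are reported for every model in this post, so they can be wrapped in one helper. The sketch below is one possible way to do that. The function name `regression_report` and the toy arrays are illustrative assumptions. Taking the square root with NumPy sidesteps the `squared` argument of `mean_squared_error`, which was deprecated in scikit-learn 1.4.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score


def regression_report(y_true, y_pred):
    """Return RMSE, R^2 and RMSLE for one set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    # RMSLE compares log(1 + y), so both arrays must be non-negative.
    rmsle = float(np.sqrt(mean_squared_log_error(y_true, y_pred)))
    return {"RMSE": rmse, "R2": r2, "RMSLE": rmsle}


# Toy values, not taken from the post's dataset.
y_true = np.array([60.0, 62.0, 64.0, 80.0])
y_pred = np.array([58.0, 65.0, 64.0, 75.0])
report = regression_report(y_true, y_pred)
print(report)
```

With such a helper, each model section reduces to fitting, predicting, and one call such as `regression_report(y_test, y_pred)`.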

XGBoost

import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score
from xgboost import XGBRegressor

# Same hyperparameters as the GBDT model, for a like-for-like comparison
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

# RMSE and RMSLE via np.sqrt, which works on any scikit-learn version
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
print("XGBoost Root Mean Squared Error:", rmse_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(r2_xgb)
rmsle_xgb = np.sqrt(mean_squared_log_error(y_test, y_pred_xgb))
print("XGBoost RMSLE:", rmsle_xgb)

XGBoost Root Mean Squared Error: 8.176196042228552
0.8493230448316168
XGBoost RMSLE: 0.14297996456087966

LGBM

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

# Same hyperparameters as the GBDT model, for a like-for-like comparison
lgbm = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm.fit(X_train, y_train)
y_pred_lgbm = lgbm.predict(X_test)

# RMSE and RMSLE via np.sqrt, which works on any scikit-learn version
rmse_lgbm = np.sqrt(mean_squared_error(y_test, y_pred_lgbm))
print("LightGBM RMSE:", rmse_lgbm)
r2_lgbm = r2_score(y_test, y_pred_lgbm)
print(r2_lgbm)
rmsle_lgbm = np.sqrt(mean_squared_log_error(y_test, y_pred_lgbm))
print("LightGBM RMSLE:", rmsle_lgbm)

LightGBM RMSE: 8.222406796622195
0.847615020120697
LightGBM RMSLE: 0.1428256145540605

SVM

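import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score
from sklearn.svm import SVR

# Support vector regression with a linear kernel on the unscaled features
svr = SVR(kernel='linear', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)

# RMSE and RMSLE via np.sqrt, which works on any scikit-learn version
rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr))
print("SVR RMSE:", rmse_svr)
r2_svr = r2_score(y_test, y_pred_svr)
print(r2_svr)
rmsle_svr = np.sqrt(mean_squared_log_error(y_test, y_pred_svr))
print("SVR RMSLE:", rmsle_svr)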

SVR RMSE: 15.248610756339428
0.47591129445611624
SVR RMSLE: 0.2390376383177629

KNN

import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score
from sklearn.neighbors import KNeighborsRegressor

# Predict each price as the mean price of the 5 nearest training samples
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

# RMSE and RMSLE via np.sqrt, which works on any scikit-learn version
rmse_knn = np.sqrt(mean_squared_error(y_test, y_pred_knn))
print("KNN RMSE:", rmse_knn)
r2_knn = r2_score(y_test, y_pred_knn)
print(r2_knn)
rmsle_knn = np.sqrt(mean_squared_log_error(y_test, y_pred_knn))
print("KNN RMSLE:", rmsle_knn)

KNN RMSE: 14.38134207923586
0.5338313360332646
KNN RMSLE: 0.2380403734142886

Regression results for the product pricing model

Model comparison

Model     RMSE    R²     RMSLE
GBDT      8.252   0.847  0.145
XGBoost   8.176   0.849  0.143
LightGBM  8.222   0.848  0.143
SVR       15.249  0.476  0.239
KNN       14.381  0.534  0.238

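Interpreting the results

Ensemble tree models (GBDT / XGBoost / LightGBM)

  • Characteristics:
    • Capture nonlinear relationships and feature interactions.
    • Need no feature standardization and are relatively insensitive to outliers in the input features.
    • Boosting fits each new tree to the residual errors of the trees before it.
  • Results:
    • RMSE ≈ 8, R² ≈ 0.85, RMSLE ≈ 0.14: the best results of the five models.
    • XGBoost and LightGBM, which add more refined optimization and parallelism, edge out scikit-learn's GBDT, although the gap is small (RMSE 8.18 and 8.22 versus 8.25).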
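
The point that tree ensembles need no feature standardization can be checked with a small experiment. This is a sketch on synthetic data (the features, the step-like target, and the factor of 1000 are all assumptions made for illustration): rescaling one feature changes the split thresholds but not the ordering of the samples, so the fitted predictions should agree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, for illustration only: two features and a step-like target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 50.0 + 20.0 * (X[:, 0] > 0.5) + 10.0 * X[:, 1]

# The same data with the first feature expressed in units 1000 times larger.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

params = dict(n_estimators=50, max_depth=3, random_state=42)
pred_raw = GradientBoostingRegressor(**params).fit(X, y).predict(X)
pred_rescaled = GradientBoostingRegressor(**params).fit(X_rescaled, y).predict(X_rescaled)

# Splits depend only on the ordering of feature values within each column.
print("predictions agree:", np.allclose(pred_raw, pred_rescaled))
```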

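SVR (support vector regression)

  • Characteristics:
    • Uses a kernel to map the inputs into a higher-dimensional space and fits a linear function there.
    • Best suited to continuous, smooth relationships.
    • Sensitive to feature scale and to discrete, label-encoded features; needs standardization and a suitable kernel. The run above used a linear kernel on unscaled features.
  • Results:
    • RMSE ≈ 15, R² ≈ 0.48: large errors and an underfit model.
    • Struggles to capture "step-like" pricing rules.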
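
One way to act on the standardization and kernel points is to put a scaler and an RBF-kernel SVR in a single pipeline. The sketch below shows the mechanics on a synthetic stand-in for the pricing data. The data, the RBF kernel, and `C=10.0` are illustrative assumptions, not settings from the post, and no claim is made about how this would score on the real dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in (hypothetical values): a page count on a scale of
# hundreds and a 0/1 flag, with a price that jumps when the flag is set.
rng = np.random.default_rng(0)
pages = rng.integers(100, 400, size=300)
flag = rng.integers(0, 2, size=300)
X = np.column_stack([pages, flag]).astype(float)
y = 30.0 + 0.1 * pages + 25.0 * flag

# The scaler is fitted on the training rows only, then reused at predict time.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X[:240], y[:240])
pred = model.predict(X[240:])
print(pred[:5])
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.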

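KNN (k-nearest-neighbors regression)

  • Characteristics:
    • Rests on the assumption that similar samples have similar prices.
    • A local model that is prone to the "curse of dimensionality" on high-dimensional data.
  • Results:
    • RMSE ≈ 14, R² ≈ 0.53: mediocre.
    • The nearest neighbors are not necessarily the truly similar samples, since distances here are computed on unscaled, label-encoded features.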
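
Which sample counts as "nearest" depends on the scale of each feature and on the integer codes assigned to categories. The tiny example below uses hypothetical rows and assumed standard deviations to show the nearest neighbor changing once the features are rescaled.

```python
import numpy as np

# Hypothetical rows: [page count, encoded category].
query = np.array([200.0, 0.0])
candidates = np.array([
    [205.0, 0.0],  # same category, 5 pages apart
    [201.0, 2.0],  # different category, 1 page apart
])

# Assumed per-feature standard deviations, for illustration only.
scale = np.array([50.0, 0.8])

raw_dist = np.linalg.norm(candidates - query, axis=1)
scaled_dist = np.linalg.norm((candidates - query) / scale, axis=1)

nearest_raw = int(np.argmin(raw_dist))
nearest_scaled = int(np.argmin(scaled_dist))
print("nearest on raw features:   ", nearest_raw)
print("nearest on scaled features:", nearest_scaled)
```

On the raw features the different-category book is closest; after scaling, the same-category book is. Neither answer is right in general, which is why the choice of scaling and encoding matters for KNN.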

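Summary

  • Best models: XGBoost / LightGBM, the most stable, with the best generalization on this test split
  • Viable alternative: GBDT, slightly behind
  • Poor fit: SVR and KNN, with large errors and underfitting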

  • Title: Business Data Analysis: Product Pricing Model
  • Author: 姜智浩
  • Created at : 2025-09-25 11:45:14
  • Updated at : 2025-09-25 09:49:24
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/09/25/20250925商业数据分析--产品定价模型/
  • License: This work is licensed under CC BY-NC-SA 4.0.