Business Data Analysis: Credit Card Transaction Data

姜智浩

Note

All code for this post is available at
https://github.com/super-213/business_data_analysis
Download it from there if you need it.

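Inspecting the data

import pandas as pd

# Load the transaction data and preview the first five rows
df = pd.read_excel('信用卡交易数据.xlsx')
df.head()

The columns are 换设备次数 (device changes), 支付失败次数 (payment failures), 换IP次数 (IP changes), 换IP国次数 (IP country changes), 交易金额 (transaction amount), and 欺诈标签 (fraud label, the prediction target).

Device changes  Payment failures  IP changes  IP country changes  Transaction amount  Fraud label
             0                11           3                   5               28836            1
             5                 6           1                   4               21966            1
             6                 2           0                   0               18199            1
             5                 8           2                   2               24803            1
             7                10           5                   0               26277            1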

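Handling missing values

# Count missing values in each column
df.isnull().sum()

Column          Missing values
换设备次数      0
支付失败次数    0
换IP次数        0
换IP国次数      0
交易金额        0
欺诈标签        0

No column has missing values, so no imputation is needed.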

Handling outliers

import matplotlib.pyplot as plt

# Draw one box plot per column to look for outliers
for col in df.columns:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()

[Figures: six box plots, one per column]
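The box plots only show outliers visually. As a minimal sketch, assuming the common 1.5 × IQR rule (the same rule matplotlib's default box plot whiskers use), the number of points outside the whiskers can be counted per column. The helper name `count_iqr_outliers` and the toy DataFrame are hypothetical and are not part of the original code.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series on the column names
    outside = (numeric < lower) | (numeric > upper)
    return outside.sum()

# Toy data for illustration only (not the post's dataset)
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data this would be called as `count_iqr_outliers(df.drop(columns=['欺诈标签']))`, since the binary label column is not meaningful to check for outliers.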

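Splitting the data

from sklearn.model_selection import train_test_split

# Features are every column except the fraud label; the label is the target
X = df.drop(columns=['欺诈标签'])
y = df['欺诈标签']
# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)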
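The split above is a plain random split. As an optional variant (not what the post uses), `train_test_split` accepts a `stratify` argument that keeps the fraud ratio the same in the training and test sets. The sketch below uses a made-up toy frame with hypothetical column names `feature` and `label` so that it runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the post's DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "feature": range(100),
    "label": [1] * 40 + [0] * 60,
})
X_toy = toy.drop(columns=["label"])
y_toy = toy["label"]

# stratify=y_toy keeps the share of positives the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Switching the post's split to a stratified one would change which rows land in the test set, so the metrics reported in the following sections would not be reproduced exactly.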

XGBoost

from xgboost import XGBClassifier
import matplotlib.pyplot as plt

# Gradient-boosted trees with row and column subsampling
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    eval_metric='auc',
    random_state=42
)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
# Predicted probability of class 1, used for the ROC curve
y_pred_proba = xgb.predict_proba(X_test)[:, 1]

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC - XGBoost')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Confusion matrix (rows: actual, columns: predicted)

Actual \ Predicted     0     1
0                    119     0
1                     18    63

Classification report

Class         Precision  Recall  F1-score  Support
0                  0.87    1.00      0.93      119
1                  1.00    0.78      0.88       81
accuracy                             0.91      200
macro avg          0.93    0.89      0.90      200
weighted avg       0.92    0.91      0.91      200

[Figure: ROC curve for XGBoost]
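To make the link between the two tables explicit, the class 1 figures in the classification report can be recomputed by hand from the XGBoost confusion matrix reported above. This is a small worked check; the numbers are copied from the post, and only the variable names are new.

```python
import numpy as np

# XGBoost confusion matrix reported above: rows = actual, columns = predicted
cm = np.array([[119, 0],
               [18, 63]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # 63 / (63 + 0)
recall = tp / (tp + fn)           # 63 / (63 + 18)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()   # (63 + 119) / 200

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The 18 false negatives are what hold recall at about 0.78: every transaction the model flags is fraudulent, but 18 of the 81 fraudulent test transactions are not flagged.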

LGBM

from lightgbm import LGBMClassifier

# Same settings as the XGBoost model where the parameters overlap.
# Note: in LightGBM, subsample only takes effect when subsample_freq > 0
# (the default is 0), so row subsampling is not active with these settings.
lgbm = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary',
    random_state=42
)
lgbm.fit(X_train, y_train)
y_pred_lgbm = lgbm.predict(X_test)
y_pred_proba_lgbm = lgbm.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred_lgbm))
print(classification_report(y_test, y_pred_lgbm))

fpr_lgbm, tpr_lgbm, thresholds_lgbm = roc_curve(y_test, y_pred_proba_lgbm)
roc_auc_lgbm = auc(fpr_lgbm, tpr_lgbm)
plt.plot(fpr_lgbm, tpr_lgbm, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_lgbm)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC - LGBM')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Confusion matrix (rows: actual, columns: predicted)

Actual \ Predicted     0     1
0                    118     1
1                     18    63

Classification report

Class         Precision  Recall  F1-score  Support
0                  0.87    0.99      0.93      119
1                  0.98    0.78      0.87       81
accuracy                             0.91      200
macro avg          0.93    0.88      0.90      200
weighted avg       0.91    0.91      0.90      200

[Figure: ROC curve for LightGBM]

Random forest

from sklearn.ensemble import RandomForestClassifier

# Random forest with 100 trees and default settings otherwise
rm = RandomForestClassifier(n_estimators=100, random_state=42)
rm.fit(X_train, y_train)
y_pred_rm = rm.predict(X_test)
y_pred_proba_rm = rm.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred_rm))
print(classification_report(y_test, y_pred_rm))

fpr_rm, tpr_rm, thresholds_rm = roc_curve(y_test, y_pred_proba_rm)
roc_auc_rm = auc(fpr_rm, tpr_rm)
plt.plot(fpr_rm, tpr_rm, color='green', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_rm)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC - Random Forest')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Confusion matrix (rows: actual, columns: predicted)

Actual \ Predicted     0     1
0                    118     1
1                     17    64

Classification report

Class         Precision  Recall  F1-score  Support
0                  0.87    0.99      0.93      119
1                  0.98    0.79      0.88       81
accuracy                             0.91      200
macro avg          0.93    0.89      0.90      200
weighted avg       0.92    0.91      0.91      200

[Figure: ROC curve for random forest]

Naive Bayes

from sklearn.naive_bayes import GaussianNB

# Gaussian naive Bayes with default settings
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
y_pred_proba_nb = nb.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))

fpr_nb, tpr_nb, thresholds_nb = roc_curve(y_test, y_pred_proba_nb)
roc_auc_nb = auc(fpr_nb, tpr_nb)
plt.plot(fpr_nb, tpr_nb, color='red', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_nb)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC - NB')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Confusion matrix (rows: actual, columns: predicted)

Actual \ Predicted     0     1
0                    114     5
1                     24    57

Classification report

Class         Precision  Recall  F1-score  Support
0                  0.83    0.96      0.89      119
1                  0.92    0.70      0.80       81
accuracy                             0.85      200
macro avg          0.87    0.83      0.84      200
weighted avg       0.86    0.85      0.85      200

[Figure: ROC curve for naive Bayes]

KNN

from sklearn.neighbors import KNeighborsClassifier

# K-nearest neighbors with default settings (5 neighbors, unscaled features)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
y_pred_proba_knn = knn.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, y_pred_proba_knn)
roc_auc_knn = auc(fpr_knn, tpr_knn)
plt.plot(fpr_knn, tpr_knn, color='purple', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_knn)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC - KNN')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Confusion matrix (rows: actual, columns: predicted)

Actual \ Predicted     0     1
0                     86    33
1                     57    24

Classification report

Class         Precision  Recall  F1-score  Support
0                  0.60    0.72      0.66      119
1                  0.42    0.30      0.35       81
accuracy                             0.55      200
macro avg          0.51    0.51      0.50      200
weighted avg       0.53    0.55      0.53      200

[Figure: ROC curve for KNN]
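The post does not diagnose why KNN does so poorly. One plausible cause, stated here as an assumption rather than a verified finding, is feature scale: KNN uses Euclidean distance, and the transaction amount is in the tens of thousands while the count features are single or double digits. The sketch below illustrates the general effect on synthetic data (not the post's dataset) by comparing plain KNN with the same classifier behind a `StandardScaler`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative small-scale feature and one pure-noise
# feature on a much larger scale.
rng = np.random.default_rng(0)
informative = rng.normal(0.0, 1.0, size=400)
noise = rng.normal(0.0, 10_000.0, size=400)
X_syn = np.column_stack([informative, noise])
y_syn = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Plain KNN: distances are dominated by the large-scale noise column
knn_raw = KNeighborsClassifier().fit(X_tr, y_tr)
acc_raw = knn_raw.score(X_te, y_te)

# Same classifier behind a StandardScaler, so both columns have unit variance
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)
acc_scaled = knn_scaled.score(X_te, y_te)

print(f"raw accuracy={acc_raw:.2f}, scaled accuracy={acc_scaled:.2f}")
```

Whether scaling helps on the credit card data would need to be checked by fitting the same pipeline on `X_train` and `y_train`. Tree-based models are insensitive to feature scale, which would be consistent with their much better results.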

Model comparison

The table collects the test-set results of the five models (200 test rows). TN, FP, FN, and TP come from each model's confusion matrix with class 1 as the positive class; precision, recall, and F1 are for class 1.

Model            TN   FP   FN   TP   Accuracy   Precision   Recall   F1     Macro F1   Weighted F1
XGBoost         119    0   18   63   0.91       1.00        0.78     0.88   0.90       0.91
LightGBM        118    1   18   63   0.91       0.98        0.78     0.87   0.90       0.90
Random forest   118    1   17   64   0.91       0.98        0.79     0.88   0.90       0.91
Naive Bayes     114    5   24   57   0.85       0.92        0.70     0.80   0.84       0.85
KNN              86   33   57   24   0.55       0.42        0.30     0.35   0.50       0.53
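The per-model sections repeat the same fit, predict, and report steps. As a rough sketch of how such a comparison could be generated in one pass, the hypothetical helper `evaluate_models` below fits each model and returns one row of class 1 metrics per model. It runs on synthetic data from `make_classification`, so its numbers are unrelated to the results reported in this post.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return one row of test-set metrics per model."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:, 1]
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
            "roc_auc": roc_auc_score(y_test, y_proba),
        })
    return pd.DataFrame(rows).set_index("model")

# Synthetic stand-in data, for illustration only
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

summary = evaluate_models(
    {
        "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "Naive Bayes": GaussianNB(),
    },
    X_tr, X_te, y_tr, y_te,
)
print(summary.round(2))
```

With the post's data, the dictionary would hold the five classifiers defined earlier and the call would use `X_train, X_test, y_train, y_test`.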

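Summary

  • Tree models (XGBoost / LightGBM / random forest) → best performance: accuracy 0.91, high ROC-AUC, and fraud-class precision close to 1 (0.98–1.00). Recall is only about 0.78–0.79, so some fraudulent transactions are still missed.
  • Naive Bayes → simple and fast, but recall drops to 0.70, so more fraud is missed.
  • KNN → worst performance (accuracy 0.55, fraud-class recall 0.30); not suitable for this task.

Conclusion: for financial scenarios, XGBoost / LightGBM / random forest are recommended, and recall can be improved further through threshold adjustment, class weights, SMOTE, and similar methods.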
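Of the options named in the conclusion, threshold adjustment is the simplest to try because it needs no retraining. The sketch below is a minimal illustration on synthetic data (not the post's dataset): instead of the default 0.5 cutoff, class 1 is predicted whenever its probability reaches a lower threshold. The helper `metrics_at` and the value 0.3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 40% positives
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=5, weights=[0.6, 0.4], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

def metrics_at(threshold):
    """Return (recall, precision) for class 1 when predicting 1 if proba >= threshold."""
    y_hat = (proba >= threshold).astype(int)
    return recall_score(y_te, y_hat), precision_score(y_te, y_hat, zero_division=0)

results = {t: metrics_at(t) for t in (0.5, 0.3)}
for t, (rec, prec) in results.items():
    print(f"threshold={t}: recall={rec:.2f}, precision={prec:.2f}")
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision, so the cutoff should be chosen on a validation set according to how costly a missed fraud is compared with a false alarm. For the class-weight route, `RandomForestClassifier` and `LGBMClassifier` accept `class_weight='balanced'` and `XGBClassifier` accepts `scale_pos_weight`. SMOTE is provided by the separate imbalanced-learn package.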

  • Title: Business Data Analysis: Credit Card Transaction Data
  • Author: 姜智浩
  • Created at : 2025-09-26 11:45:14
  • Updated at : 2025-09-28 14:02:53
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/09/26/20250926商业数据分析--信用卡交易数据/
  • License: This work is licensed under CC BY-NC-SA 4.0.