Note

All code for this post is available at
https://github.com/super-213/business_data_analysis
Feel free to download it if you need it.

Inspecting the data

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('信用卡交易数据.xlsx')
df.head()

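换设备次数 (device changes)	支付失败次数 (failed payments)	换IP次数 (IP changes)	换IP国次数 (IP country changes)	交易金额 (transaction amount)	欺诈标签 (fraud label)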
0	11	3	5	28836	1
5	6	1	4	21966	1
6	2	0	0	18199	1
5	8	2	2	24803	1
7	10	5	0	26277	1

Handling missing values

df.isnull().sum()

Column	Missing values
换设备次数	0
支付失败次数	0
换IP次数	0
换IP国次数	0
交易金额	0
欺诈标签	0

No column has missing values, so no imputation is needed.

Handling outliers

# Draw one box plot per column to spot outliers visually
for col in df.columns:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()
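
The box plots above only show outliers visually. As a complement, here is a minimal sketch that counts them per column using the same 1.5 × IQR rule that matplotlib's box plot whiskers use by default. The helper name `iqr_outlier_counts` and the toy data are assumptions for illustration, not part of the original code or dataset.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # The bounds are indexed by column name, so the comparison aligns per column
    outside = (numeric < lower) | (numeric > upper)
    return outside.sum()


# Toy data, made up for illustration (not the blog's dataset)
demo = pd.DataFrame({
    "amount": [100, 110, 105, 95, 102, 5000],
    "failures": [1, 2, 1, 2, 1, 2],
})
print(iqr_outlier_counts(demo))
```

On the real data this would be called as `iqr_outlier_counts(df.drop(columns=['欺诈标签']))`. Whether flagged rows should be removed is a separate decision: in fraud data, extreme values may be exactly the signal the model needs.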

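Splitting the data

from sklearn.model_selection import train_test_split
# Features are every column except the fraud label
X = df.drop(columns=['欺诈标签'])
y = df['欺诈标签']
# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)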
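
The split above is purely random. One possible variation, sketched here on synthetic data, is to pass `stratify=y` so that the train and test sets keep the same fraud ratio as the full dataset. The arrays `X_demo` and `y_demo` are made up for illustration; the original post does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 40% positives (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 400 + [0] * 600)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```

With the real data the only change would be adding `stratify=y` to the existing call. Note that doing so changes which rows land in the test set, so the reported metrics would shift slightly.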

XGBoost

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    eval_metric='auc',
    random_state=42
)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
y_pred_proba = xgb.predict_proba(X_test)[:, 1]

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.title('ROC - XGBoost')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Actual \ Predicted	0	1
0	119	0
1	18	63

Class	Precision	Recall	F1-Score	Support
0	0.87	1.00	0.93	119
1	1.00	0.78	0.88	81
accuracy			0.91	200
macro avg	0.93	0.89	0.90	200
weighted avg	0.92	0.91	0.91	200
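
To make the link between the confusion matrix and the classification report explicit, the following sketch recomputes the XGBoost figures by hand from the four counts reported above. The helper `class_metrics` is a hypothetical name introduced here; it only does arithmetic and uses no library.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its TP / FP / FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# XGBoost confusion matrix from above: rows = actual, columns = predicted
tn, fp, fn, tp = 119, 0, 18, 63

# Fraud class (label 1): positives are the fraudulent transactions
p1, r1, f1_1 = class_metrics(tp, fp, fn)
# Legitimate class (label 0): the roles of the counts are swapped
p0, r0, f1_0 = class_metrics(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={p1:.3f} recall={r1:.3f} f1={f1_1:.3f}")
print(f"class 0: precision={p0:.3f} recall={r0:.3f} f1={f1_0:.3f}")
print(f"accuracy={accuracy:.3f}")
```

The recall of class 1 is 63 / 81, which is where the 0.78 in the report comes from: 18 of the 81 fraudulent transactions in the test set were predicted as legitimate.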

LightGBM

from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,  # note: ignored unless subsample_freq > 0 (default 0 disables bagging)
    colsample_bytree=0.8,
    objective='binary',
    random_state=42
)
lgbm.fit(X_train, y_train)
y_pred_lgbm = lgbm.predict(X_test)
y_pred_proba_lgbm = lgbm.predict_proba(X_test)[:,1]

print(confusion_matrix(y_test, y_pred_lgbm))
print(classification_report(y_test, y_pred_lgbm))

fpr_lgbm, tpr_lgbm, thresholds_lgbm = roc_curve(y_test, y_pred_proba_lgbm)
roc_auc_lgbm = auc(fpr_lgbm, tpr_lgbm)
plt.plot(fpr_lgbm, tpr_lgbm, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_lgbm)
plt.title('ROC - LGBM')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Actual \ Predicted	0	1
0	118	1
1	18	63

Class	Precision	Recall	F1-Score	Support
0	0.87	0.99	0.93	119
1	0.98	0.78	0.87	81
accuracy			0.91	200
macro avg	0.93	0.88	0.90	200
weighted avg	0.91	0.91	0.90	200

Random forest

from sklearn.ensemble import RandomForestClassifier

rm = RandomForestClassifier(n_estimators=100, random_state=42)
rm.fit(X_train, y_train)
y_pred_rm = rm.predict(X_test)
y_pred_proba_rm = rm.predict_proba(X_test)[:,1]

print(confusion_matrix(y_test, y_pred_rm))
print(classification_report(y_test, y_pred_rm))

fpr_rm, tpr_rm, thresholds_rm = roc_curve(y_test, y_pred_proba_rm)
roc_auc_rm = auc(fpr_rm, tpr_rm)
plt.plot(fpr_rm, tpr_rm, color='green', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_rm)
plt.title('ROC - Random Forest')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Actual \ Predicted	0	1
0	118	1
1	17	64

Class	Precision	Recall	F1-Score	Support
0	0.87	0.99	0.93	119
1	0.98	0.79	0.88	81
accuracy			0.91	200
macro avg	0.93	0.89	0.90	200
weighted avg	0.92	0.91	0.91	200

Naive Bayes

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
y_pred_proba_nb = nb.predict_proba(X_test)[:,1]

print(confusion_matrix(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))

fpr_nb, tpr_nb, thresholds_nb = roc_curve(y_test, y_pred_proba_nb)
roc_auc_nb = auc(fpr_nb, tpr_nb)
plt.plot(fpr_nb, tpr_nb, color='red', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_nb)
plt.title('ROC - NB')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Actual \ Predicted	0	1
0	114	5
1	24	57

Class	Precision	Recall	F1-Score	Support
0	0.83	0.96	0.89	119
1	0.92	0.70	0.80	81
accuracy			0.85	200
macro avg	0.87	0.83	0.84	200
weighted avg	0.86	0.85	0.85	200

KNN

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
y_pred_proba_knn = knn.predict_proba(X_test)[:,1]

print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, y_pred_proba_knn)
roc_auc_knn = auc(fpr_knn, tpr_knn)
plt.plot(fpr_knn, tpr_knn, color='purple', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_knn)
plt.title('ROC - KNN')
plt.legend(loc='lower right')  # without this call the AUC label is never drawn
plt.show()

Actual \ Predicted	0	1
0	86	33
1	57	24

Class	Precision	Recall	F1-Score	Support
0	0.60	0.72	0.66	119
1	0.42	0.30	0.35	81
accuracy			0.55	200
macro avg	0.51	0.51	0.50	200
weighted avg	0.53	0.55	0.53	200
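
A likely contributor to the weak KNN result is that the model was fitted on unscaled features: KNN relies on distances, and a column in the tens of thousands (such as the transaction amount) can swamp columns that range from 0 to about 10. The sketch below illustrates the effect on synthetic data by comparing a plain `KNeighborsClassifier` with one wrapped in a `StandardScaler` pipeline. The data is made up, and whether scaling helps as much on the real dataset is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only): one informative small-scale
# feature and one uninformative feature on the scale of a transaction amount.
rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)
informative = y_demo + rng.normal(0.0, 0.3, size=n)
amount_like = rng.normal(20000.0, 5000.0, size=n)
X_demo = np.column_stack([informative, amount_like])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same classifier, with and without standardising the features first
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print(f"raw accuracy:    {raw_acc:.2f}")
print(f"scaled accuracy: {scaled_acc:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into the scaling statistics.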

Model results comparison

Confusion-matrix counts, fraud-class (label 1) metrics, and overall accuracy on the 200-row test set:

Model	TN	FP	FN	TP	Precision	Recall	F1-Score	Accuracy
XGBoost	119	0	18	63	1.00	0.78	0.88	0.91
LightGBM	118	1	18	63	0.98	0.78	0.87	0.91
Random forest	118	1	17	64	0.98	0.79	0.88	0.91
Naive Bayes	114	5	24	57	0.92	0.70	0.80	0.85
KNN	86	33	57	24	0.42	0.30	0.35	0.55
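
A comparison like this can also be produced in one pass instead of copying numbers by hand. The following is a rough sketch, assuming every model exposes `fit`, `predict` and `predict_proba`; the function name `compare_models` and the synthetic data from `make_classification` are illustrative choices, not part of the original post.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and collect positive-class metrics into one table."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]
        rows.append({
            "model": name,
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred),
            "f1": f1_score(y_test, pred),
            "accuracy": accuracy_score(y_test, pred),
            "roc_auc": roc_auc_score(y_test, proba),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}
table = compare_models(models, X_tr, X_te, y_tr, y_te)
print(table.round(2))
```

The `XGBClassifier` and `LGBMClassifier` instances defined earlier could be added to the `models` dictionary in the same way, since both follow the scikit-learn estimator interface.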

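Summary

Tree-based models (XGBoost / LightGBM / random forest) → the best performers, with accuracy 0.91, high ROC-AUC, and fraud-class Precision of 0.98 to 1.00. Recall is only about 0.78 to 0.79, however, so roughly one in five fraudulent transactions in the test set goes undetected.
Naive Bayes → simple and fast, but Recall drops to 0.70, so even more fraud is missed.
KNN → the worst performer (accuracy 0.55); not suitable for this task.

Conclusion: for a financial setting like this one, XGBoost, LightGBM, or random forest is recommended, and Recall can be improved further through threshold adjustment, class weights, SMOTE, and similar methods.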
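
Of the options mentioned above, threshold adjustment is the simplest to try because it needs no retraining. `predict` on these classifiers effectively labels a row as fraud when the predicted fraud probability exceeds 0.5; lowering that cut-off trades Precision for Recall. Below is a minimal sketch with made-up labels and scores; the helper `precision_recall_at` is a hypothetical name, and the thresholds 0.5 and 0.3 are arbitrary examples.

```python
import numpy as np


def precision_recall_at(y_true, proba, threshold):
    """Precision and recall of the positive class at a given probability cut-off."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(proba) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predicted fraud probabilities, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.9, 0.7, 0.4, 0.35, 0.45, 0.2, 0.1, 0.05]

for t in (0.5, 0.3):
    p, r = precision_recall_at(y_true, proba, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

With the models from this post, the same function could be applied to `y_test` and `y_pred_proba`. In practice the threshold is better chosen on a separate validation split than on the test set, so that the final test metrics stay unbiased.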

智浩的Blog

Business Data Analysis: Credit Card Transaction Data
