Business Data Analysis: Stock-Brokerage Customer Churn

姜智浩

Note

All code in this post is available at
https://github.com/super-213/business_data_analysis
Feel free to download it if you need it.

Inspecting the data

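import pandas as pd

# Load the churn dataset (reading .xlsx files requires an Excel engine such as openpyxl)
df = pd.read_excel('股票客户流失.xlsx')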

df.head()
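账户资金(元)  最后一次交易距今时间(天)  上月交易佣金(元)  累计交易佣金(元)  本券商使用时长(年)  是否流失
22686.5   297  149.25  2029.85  0  0
190055.0  42   284.75  3889.50  2  0
29733.5   233  269.25  2108.15  0  1
185667.5  44   211.50  3840.75  3  0
33648.5   213  353.50  2151.65  0  1

Column meanings: 账户资金(元) is the account balance (CNY); 最后一次交易距今时间(天) is the number of days since the last trade; 上月交易佣金(元) is last month's trading commission (CNY); 累计交易佣金(元) is the cumulative trading commission (CNY); 本券商使用时长(年) is the number of years with this brokerage; 是否流失 is the churn label (1 = churned, 0 = retained).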

Checking for missing values

df.isnull().sum()
Column                      Missing values
账户资金(元)                  0
最后一次交易距今时间(天)        0
上月交易佣金(元)               0
累计交易佣金(元)               0
本券商使用时长(年)             0
是否流失                      0

Checking for outliers

import matplotlib.pyplot as plt

# Draw one box plot per column to spot outliers
for col in df.columns:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()

[Figures: six box plots, one per column]
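
Box plots flag outliers visually; a numeric count can complement them. The sketch below applies the 1.5 × IQR whisker rule that matplotlib's boxplot uses by default. The helper name `count_iqr_outliers` and the sample values are hypothetical and do not come from the original analysis.

```python
import numpy as np

def count_iqr_outliers(values, whis=1.5):
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Illustrative data only: one value far above the rest
sample = [10, 12, 11, 13, 12, 11, 10, 95]
print(count_iqr_outliers(sample))
```

On the real data, `df.apply(count_iqr_outliers)` would give one count per column.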

Splitting the data

from sklearn.model_selection import train_test_split

X = df.drop(columns=['是否流失'])
y = df['是否流失']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
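
Because the churn class is the minority, a purely random split can leave the training and test sets with slightly different churn rates. A minimal sketch, on synthetic data, of the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The `*_demo` names and the 25% positive rate are assumptions for illustration; the split used in this post does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 25% positive labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 250 + [0] * 750)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep roughly the same positive rate
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```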

Logistic regression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7949

Confusion matrix:

Actual \ Predicted      0      1
0                     952     84
1                     205    168

Classification report:

Class          Precision  Recall  F1-score  Support
0                   0.82    0.92      0.87     1036
1                   0.67    0.45      0.54      373
accuracy                              0.79     1409
macro avg           0.74    0.68      0.70     1409
weighted avg        0.78    0.79      0.78     1409

SVM

import numpy as np
from sklearn.svm import SVC

svc = SVC(random_state=42, probability=True)
svc.fit(X_train, y_train)

y_svc_pred = svc.predict(X_test)
y_svc_proba = svc.predict_proba(X_test)[:, 1]

def calculate_ks(y_true, y_proba):
    """Return the KS statistic: the largest gap between cumulative TPR and FPR."""
    # Sort samples by predicted probability, highest first
    data = np.column_stack([y_proba, y_true])
    sorted_data = data[data[:, 0].argsort()[::-1]]

    total_pos = np.sum(y_true == 1)
    total_neg = np.sum(y_true == 0)

    cum_pos = 0
    cum_neg = 0
    max_ks = 0

    # Walk down the ranking, updating TPR and FPR after each sample
    for prob, label in sorted_data:
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        tpr = cum_pos / total_pos
        fpr = cum_neg / total_neg
        ks = abs(tpr - fpr)
        if ks > max_ks:
            max_ks = ks
    return max_ks

ks_score = calculate_ks(y_test, y_svc_proba)

print("SVC Accuracy:", accuracy_score(y_test, y_svc_pred))
print("KS Score:", ks_score)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_svc_pred))
print("Classification Report:\n", classification_report(y_test, y_svc_pred))

Accuracy: 0.7353
KS score: 0.3043

Confusion matrix:

Actual \ Predicted      0      1
0                    1036      0
1                     373      0

Classification report:

Class          Precision  Recall  F1-score  Support
0                   0.74    1.00      0.85     1036
1                   0.00    0.00      0.00      373
accuracy                              0.74     1409
macro avg           0.37    0.50      0.42     1409
weighted avg        0.54    0.74      0.62     1409
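
The confusion matrix shows the SVC predicting class 0 for every test sample. One plausible cause (an assumption, not verified against this dataset) is that the features sit on very different scales, which the default RBF kernel is sensitive to. A minimal sketch, on synthetic data, of putting a scaler and the classifier in one pipeline; the feature values and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: a large-scale noise feature and a small-scale informative one
rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)
noise = rng.normal(100_000, 50_000, size=n)        # balance-like column, no signal
signal = rng.normal(loc=4.0 * y_demo, scale=1.0)   # separates the two classes
X_demo = np.column_stack([noise, signal])

# The scaler inside the pipeline is fitted on the training rows only
scaled_svc = make_pipeline(StandardScaler(), SVC(random_state=42))
scaled_svc.fit(X_demo[:300], y_demo[:300])
print("Accuracy with scaling:", scaled_svc.score(X_demo[300:], y_demo[300:]))
```

On the real data the same pipeline would be fitted with `scaled_svc.fit(X_train, y_train)`; whether it removes the all-zero predictions would still need to be checked.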

import matplotlib.pyplot as plt
import numpy as np

def plot_ks_curve(y_true, y_proba):
    """Plot cumulative TPR, FPR, and their gap (KS) against the probability threshold."""
    # Sort samples by predicted probability, highest first
    data = np.column_stack([y_proba, y_true])
    sorted_data = data[data[:, 0].argsort()[::-1]]

    total_pos = np.sum(y_true == 1)
    total_neg = np.sum(y_true == 0)

    cum_pos = 0
    cum_neg = 0
    tpr_list = []
    fpr_list = []
    thresholds = []

    for prob, label in sorted_data:
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        tpr_list.append(cum_pos / total_pos)
        fpr_list.append(cum_neg / total_neg)
        thresholds.append(prob)

    ks_list = np.abs(np.array(tpr_list) - np.array(fpr_list))
    max_ks_idx = np.argmax(ks_list)

    plt.figure(figsize=(8, 6))
    plt.plot(thresholds, tpr_list, label='TPR', color='blue')
    plt.plot(thresholds, fpr_list, label='FPR', color='red')
    plt.plot(thresholds, ks_list, label='KS', color='green', linestyle='--')

    # Mark the threshold at which the KS gap is largest
    plt.axvline(x=thresholds[max_ks_idx], color='gray', linestyle=':',
                label=f'Max KS = {ks_list[max_ks_idx]:.3f}')
    plt.xlabel('Predicted-probability threshold')
    plt.ylabel('Rate')
    plt.title('KS curve')
    plt.legend()
    plt.grid(True)
    plt.gca().invert_xaxis()  # show thresholds from high to low
    plt.show()

plot_ks_curve(y_test, y_svc_proba)

[Figure: KS curve for the SVC, showing TPR, FPR, and KS against the predicted-probability threshold]

KNN

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_knn_pred = knn.predict(X_test)
y_knn_proba = knn.predict_proba(X_test)[:, 1]

ks_score_knn = calculate_ks(y_test, y_knn_proba)
print("KNN Accuracy:", accuracy_score(y_test, y_knn_pred))
print("KNN KS Score:", ks_score_knn)
print("KNN Confusion Matrix:\n", confusion_matrix(y_test, y_knn_pred))
print("KNN Classification Report:\n", classification_report(y_test, y_knn_pred))

Accuracy: 0.7580
KS score: 0.3060

Confusion matrix:

Actual \ Predicted      0      1
0                     940     96
1                     245    128

Classification report:

Class          Precision  Recall  F1-score  Support
0                   0.79    0.91      0.85     1036
1                   0.57    0.34      0.43      373
accuracy                              0.76     1409
macro avg           0.68    0.63      0.64     1409
weighted avg        0.73    0.76      0.74     1409
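
With n_neighbors=5 and the default uniform weights, KNN can only output six distinct probabilities (0, 0.2, ..., 1), so many test samples share the same score. calculate_ks updates the TPR/FPR gap after every single sample, so inside a group of tied scores the result depends on the order in which the ties happen to be sorted. A minimal sketch of a tie-aware variant that evaluates the gap only where the score changes; the name `ks_statistic` is hypothetical, and this is not the function that produced the numbers reported in this post.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """KS statistic evaluated only at distinct score thresholds."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # Rank samples by score, highest first
    order = np.argsort(-y_score, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = y_score[order]

    cum_pos = np.cumsum(y_sorted == 1)
    cum_neg = np.cumsum(y_sorted == 0)

    # Last index of each group of tied scores, plus the final sample
    last = np.append(np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1)

    tpr = cum_pos[last] / cum_pos[-1]
    fpr = cum_neg[last] / cum_neg[-1]
    return float(np.max(np.abs(tpr - fpr)))

# Illustrative labels and scores only
print(ks_statistic([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]))
# All scores tied: no threshold separates the classes
print(ks_statistic([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```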

Random forest

from sklearn.ensemble import RandomForestClassifier

rm = RandomForestClassifier(n_estimators=100, random_state=42)
rm.fit(X_train, y_train)
y_rm_pred = rm.predict(X_test)
y_rm_proba = rm.predict_proba(X_test)[:, 1]

ks_score_rm = calculate_ks(y_test, y_rm_proba)
print("Random Forest Accuracy:", accuracy_score(y_test, y_rm_pred))
print("Random Forest KS Score:", ks_score_rm)
print("Random Forest Confusion Matrix:\n", confusion_matrix(y_test, y_rm_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_rm_pred))

Accuracy: 0.7686
KS score: 0.4445

Confusion matrix:

Actual \ Predicted      0      1
0                     909    127
1                     199    174

Classification report:

Class          Precision  Recall  F1-score  Support
0                   0.82    0.88      0.85     1036
1                   0.58    0.47      0.52      373
accuracy                              0.77     1409
macro avg           0.70    0.67      0.68     1409
weighted avg        0.76    0.77      0.76     1409

Neural network

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
# Assumes the Keras bundled with TensorFlow; adjust the imports for standalone Keras
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

# Standardize the features; the scaler is fitted on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = Sequential([
    Dense(32, activation='relu', input_shape=(5,)),  # first layer, 32 units
    Dropout(0.3),                                    # guards against overfitting
    Dense(16, activation='relu'),                    # second layer, 16 units
    Dropout(0.3),
    Dense(1, activation='sigmoid')                   # output layer; sigmoid for binary classification
])

# Compile the model
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',  # binary classification loss
    metrics=['accuracy']
)

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,                # stop if validation loss has not improved for 10 epochs
    restore_best_weights=True   # restore the best weights
)

# Note: the test set doubles as the validation set here, so early stopping
# sees the same data that the metrics below are computed on
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),
    epochs=100,                 # train for at most 100 epochs
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)

y_pred_proba = model.predict(X_test_scaled).flatten()  # predicted probabilities
y_pred = (y_pred_proba > 0.5).astype(int)              # convert to 0/1 labels

print("=== Neural network evaluation ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_pred_proba))

# calculate_ks is the function defined in the SVM section
ks_score = calculate_ks(y_test, y_pred_proba)
print("KS Score:", ks_score)

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Plot the training history: loss on the left, accuracy on the right
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

Accuracy: 0.7984
AUC: 0.8338
KS score: 0.5174

Confusion matrix:

Actual \ Predicted      0      1
0                     959     77
1                     207    166

Classification report:

Class          Precision  Recall  F1-score  Support
0                   0.82    0.93      0.87     1036
1                   0.68    0.45      0.54      373
accuracy                              0.80     1409
macro avg           0.75    0.69      0.70     1409
weighted avg        0.79    0.80      0.78     1409

[Figure: training and validation loss (left) and accuracy (right) per epoch]

Side-by-side comparison

Metric            SVC     KNN     Random Forest   Neural Network
Accuracy          0.735   0.758   0.769           0.798
AUC               -       -       -               0.834
KS score          0.304   0.306   0.444           0.517
Class 1 recall    0.00    0.34    0.47            0.45
Class 1 F1        0.00    0.43    0.52            0.54

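Accuracy looks acceptable across the board, but the best class-1 recall is only 0.47, meaning that even the best model misses more than half of the customers who actually churned. Since the goal is to predict churn among stock-brokerage customers, recall on the churn class is what we most need to improve.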
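
As a quick check, the class-1 recall and F1 rows of the comparison table can be recomputed from the confusion matrices reported in each section. The counts below are copied from those matrices; the helper names `recall` and `f1` are just for illustration.

```python
# Class-1 counts from each confusion matrix:
# (false positives, false negatives, true positives)
counts = {
    "SVC": (0, 373, 0),
    "KNN": (96, 245, 128),
    "Random Forest": (127, 199, 174),
    "Neural Network": (77, 207, 166),
}

def recall(fp, fn, tp):
    """Share of actual churners that the model flags: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1(fp, fn, tp):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

for name, c in counts.items():
    print(f"{name}: recall={recall(*c):.2f}, F1={f1(*c):.2f}")
```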

Taking the neural network as an example, we can pass class_weight when fitting:

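from sklearn.utils.class_weight import compute_class_weight

# Compute class weights inversely proportional to class frequency
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, class_weights))

print("Class weights:", class_weight_dict)

# Note: if the model from the previous section is reused as-is, fit() continues
# from its trained weights; rebuild and recompile it to train from scratch
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),
    epochs=100,                      # train for at most 100 epochs
    batch_size=32,
    callbacks=[early_stop],
    class_weight=class_weight_dict,  # apply the class weights
    verbose=1
)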

Accuracy: 0.7459
AUC: 0.8336
KS score: 0.5153

Confusion matrix:

Actual \ Predicted      0      1
0                     778    258
1                     100    273

Classification report:

Class          Precision  Recall  F1-score  Support
0                   0.89    0.75      0.81     1036
1                   0.51    0.73      0.60      373
accuracy                              0.75     1409
macro avg           0.70    0.74      0.71     1409
weighted avg        0.79    0.75      0.76     1409

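Accuracy drops, but class-1 recall rises sharply to 0.73: the model now identifies a much larger share of the customers who actually churned (273 of 373, up from 166), at the cost of more false alarms (class-1 precision falls from 0.68 to 0.51).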
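
For reference, scikit-learn's 'balanced' mode weights each class by n_samples / (n_classes * class_count), so the rarer class gets the larger weight. A minimal sketch of that formula in plain NumPy; the function name `balanced_class_weights` and the 750/250 label split are illustrative assumptions, not the real training distribution.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Illustrative labels only: 750 non-churners and 250 churners
y_demo = np.array([0] * 750 + [1] * 250)
print(balanced_class_weights(y_demo))
```

In this illustration each churner counts three times as much as a non-churner in the loss, which is what pushes the model toward higher recall on class 1.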

XGBoost

For imbalanced samples, we can also use gradient-boosting libraries such as XGBoost or LightGBM.

from xgboost import XGBClassifier

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
y_xgb_pred = xgb.predict(X_test)
y_xgb_proba = xgb.predict_proba(X_test)[:, 1]

ks_score_xgb = calculate_ks(y_test, y_xgb_proba)
print("XGB Accuracy:", accuracy_score(y_test, y_xgb_pred))
print("XGB KS Score:", ks_score_xgb)
print("XGB Confusion Matrix:\n", confusion_matrix(y_test, y_xgb_pred))
print("XGB Classification Report:\n", classification_report(y_test, y_xgb_pred))

Accuracy: 0.812
KS score: 0.538

Confusion matrix (values marked "e.g." are placeholders in the original, not measured results):

Actual \ Predicted            0            1
0                    [e.g. 920]   [e.g. 116]
1                    [e.g. 150]   [e.g. 223]

Classification report:

Class          Precision     Recall        F1-score      Support
0              [e.g. 0.86]   [e.g. 0.89]   [e.g. 0.87]   1036
1              [e.g. 0.66]   [e.g. 0.60]   [e.g. 0.63]   373
accuracy                                   [e.g. 0.81]   1409
macro avg      [e.g. 0.76]   [e.g. 0.74]   [e.g. 0.75]   1409
weighted avg   [e.g. 0.80]   [e.g. 0.81]   [e.g. 0.80]   1409
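
The XGBClassifier call in this section uses default settings, so it does not by itself compensate for the class imbalance. XGBoost exposes a `scale_pos_weight` parameter for this, and its documentation suggests the ratio of negative to positive samples as a typical starting value. A minimal sketch of computing that ratio; the helper name `scale_pos_weight_for` and the 750/250 label split are hypothetical, and this setting was not used for the results in this post.

```python
def scale_pos_weight_for(y):
    """Ratio of negative to positive samples, for XGBoost's scale_pos_weight."""
    labels = list(y)
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Illustrative labels only: 750 non-churners and 250 churners
y_demo = [0] * 750 + [1] * 250

xgb_params = {
    "scale_pos_weight": scale_pos_weight_for(y_demo),  # 750 / 250
    "eval_metric": "logloss",
    "random_state": 42,
}
print(xgb_params)
# These would be passed as XGBClassifier(**xgb_params) before calling fit()
```
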
  • Title: Business Data Analysis: Stock-Brokerage Customer Churn
  • Author: 姜智浩
  • Created at : 2025-09-23 11:45:14
  • Updated at : 2025-09-23 10:02:49
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/09/23/20250923商业数据分析--股票客户流失/
  • License: This work is licensed under CC BY-NC-SA 4.0.