Statement

All code for this post is available at
https://github.com/super-213/business_data_analysis
Feel free to download it if you need it.

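Inspecting the data

import pandas as pd

df = pd.read_excel('股票客户流失.xlsx')

df.head()

The column names translate, in order, as: account balance (CNY), days since last trade, last month's commission (CNY), cumulative commission (CNY), years with this broker, and churned (the target; 1 is taken here to mean churned).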

账户资金（元）	最后一次交易距今时间（天）	上月交易佣金（元）	累计交易佣金（元）	本券商使用时长（年）	是否流失
22686.5	297	149.25	2029.85	0	0
190055.0	42	284.75	3889.50	2	0
29733.5	233	269.25	2108.15	0	1
185667.5	44	211.50	3840.75	3	0
33648.5	213	353.50	2151.65	0	1

Checking for missing values

df.isnull().sum()

Column	Missing values
账户资金（元）	0
最后一次交易距今时间（天）	0
上月交易佣金（元）	0
累计交易佣金（元）	0
本券商使用时长（年）	0
是否流失	0

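Checking for outliers

import matplotlib.pyplot as plt

# One box plot per column; the column names are Chinese, so a CJK-capable
# font may be needed for the titles to render correctly
for col in df.columns:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()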
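
Box plots show outliers visually; to count them, the same 1.5 × IQR rule that matplotlib uses by default for its whiskers can be applied directly. The following is a minimal sketch under that assumption. The helper name `iqr_outlier_mask` and the sample array are hypothetical and stand in for one numeric column of `df`.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample standing in for one numeric column of df
sample = np.array([10, 12, 11, 13, 12, 11, 10, 95])
mask = iqr_outlier_mask(sample)
n_outliers = int(mask.sum())
print("Outliers found:", n_outliers)
```

On the real data, something like `iqr_outlier_mask(df[col]).sum()` inside the loop would give a per-column count.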

Splitting the data

from sklearn.model_selection import train_test_split

X = df.drop(columns=['是否流失'])
y = df['是否流失']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
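
Because the churn class is the minority (373 of 1409 test samples in the reports in this post), it can help to keep the class ratio identical in both splits. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the post's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 5 features, 30% positive labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 300 + [0] * 700)

# stratify=y_demo keeps the positive rate the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("Positive rate (train):", train_rate)
print("Positive rate (test):", test_rate)
```

For the post's data the equivalent call would presumably add `stratify=y` to the existing split; the reported numbers were produced without it.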

Logistic regression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7949

Confusion matrix:

Actual \ Predicted	0	1
0	952	84
1	205	168

Classification report:

Class	Precision	Recall	F1-Score	Support
0	0.82	0.92	0.87	1036
1	0.67	0.45	0.54	373
accuracy			0.79	1409
macro avg	0.74	0.68	0.70	1409
weighted avg	0.78	0.79	0.78	1409
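
To see how the report follows from the confusion matrix, the class 1 metrics can be recomputed by hand. This is a small worked example using the four counts reported above; only the variable names are introduced here.

```python
# Confusion matrix reported above for logistic regression
# (rows: actual class, columns: predicted class)
tn, fp = 952, 84
fn, tp = 205, 168

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The recall line is the one that matters most for churn: 205 of the 373 customers who actually churned are missed.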

SVM

from sklearn.svm import SVC

svc = SVC(random_state=42, probability=True)
svc.fit(X_train, y_train)

y_svc_pred = svc.predict(X_test)
y_svc_proba = svc.predict_proba(X_test)[:, 1]

import numpy as np

def calculate_ks(y_true, y_proba):
    # Pair each predicted probability with its label, then sort by probability (descending)
    data = np.column_stack([y_proba, y_true])
    sorted_data = data[data[:, 0].argsort()[::-1]]

    total_pos = np.sum(y_true == 1)
    total_neg = np.sum(y_true == 0)

    cum_pos = 0
    cum_neg = 0
    max_ks = 0

    # Walk down the ranking, tracking cumulative TPR and FPR; KS is their largest gap
    for prob, label in sorted_data:
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        tpr = cum_pos / total_pos
        fpr = cum_neg / total_neg
        ks = abs(tpr - fpr)
        if ks > max_ks:
            max_ks = ks
    return max_ks

ks_score = calculate_ks(y_test, y_svc_proba)

print("SVC Accuracy:", accuracy_score(y_test, y_svc_pred))
print("KS Score:", ks_score)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_svc_pred))
print("Classification Report:\n", classification_report(y_test, y_svc_pred))

Accuracy: 0.7353
KS Score: 0.3043

Confusion matrix:

Actual \ Predicted	0	1
0	1036	0
1	373	0

Classification report:

Class	Precision	Recall	F1-Score	Support
0	0.74	1.00	0.85	1036
1	0.00	0.00	0.00	373
accuracy			0.74	1409
macro avg	0.37	0.50	0.42	1409
weighted avg	0.54	0.74	0.62	1409
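
The confusion matrix above shows the SVC predicting class 0 for every test sample. One plausible cause, not verified against this dataset, is that the features are on very different scales (account balance in the tens of thousands, tenure in single digits), which an RBF kernel is sensitive to. The sketch below shows the usual remedy, standardizing inside a pipeline, on synthetic data. `X_demo`, `y_demo` and `scaled_svc` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: one informative feature on a small scale,
# one pure-noise feature on a very large scale
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(0.0, 1.0, size=n)
large_scale_noise = rng.normal(0.0, 1e5, size=n)
X_demo = np.column_stack([informative, large_scale_noise])
y_demo = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC(random_state=42))
scaled_svc.fit(X_tr, y_tr)
scaled_accuracy = scaled_svc.score(X_te, y_te)
print("Scaled SVC accuracy on synthetic data:", scaled_accuracy)
```

On the post's data the same pipeline could be fitted on `X_train` and `y_train`; whether it fixes the all-zero predictions would need to be checked by rerunning.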

import matplotlib.pyplot as plt

def plot_ks_curve(y_true, y_proba):
    data = np.column_stack([y_proba, y_true])
    sorted_data = data[data[:, 0].argsort()[::-1]]
    
    total_pos = np.sum(y_true == 1)
    total_neg = np.sum(y_true == 0)
    
    cum_pos = 0
    cum_neg = 0
    tpr_list = []
    fpr_list = []
    thresholds = []
    
    for prob, label in sorted_data:
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        tpr_list.append(cum_pos / total_pos)
        fpr_list.append(cum_neg / total_neg)
        thresholds.append(prob)
    
    ks_list = np.abs(np.array(tpr_list) - np.array(fpr_list))
    max_ks_idx = np.argmax(ks_list)
    
    plt.figure(figsize=(8, 6))
    plt.plot(thresholds, tpr_list, label='TPR', color='blue')
    plt.plot(thresholds, fpr_list, label='FPR', color='red')
    plt.plot(thresholds, ks_list, label='KS', color='green', linestyle='--')
    

    # Mark the threshold at which KS is largest
    plt.axvline(x=thresholds[max_ks_idx], color='gray', linestyle=':', label='Max KS')
    plt.xlabel('Predicted probability threshold')
    plt.ylabel('Rate')
    plt.title('KS curve')
    plt.legend()
    plt.grid(True)
    plt.gca().invert_xaxis()
    plt.show()

plot_ks_curve(y_test, y_svc_proba)
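
The loop-based `calculate_ks` updates TPR and FPR one sample at a time, so when several samples share the same predicted probability (common with KNN, whose probabilities take only a few distinct values) the result depends on how the ties happen to be ordered. A tie-aware variant evaluates the gap only after each group of equal scores. The following is a sketch under the assumption that both classes are present; `ks_statistic` is a hypothetical name, not part of the original code.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Max |TPR - FPR| over distinct score thresholds (ties handled as one group)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # Sort by score, highest first
    order = np.argsort(-y_score, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = y_score[order]

    cum_pos = np.cumsum(y_sorted == 1)
    cum_neg = np.cumsum(y_sorted == 0)

    # Indices of the last sample in each run of equal scores
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]

    tpr = cum_pos[last] / cum_pos[-1]
    fpr = cum_neg[last] / cum_neg[-1]
    return float(np.max(np.abs(tpr - fpr)))

# Perfectly separated scores
ks_separated = ks_statistic([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
# All scores tied: the ranking carries no information
ks_all_tied = ks_statistic([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(ks_separated, ks_all_tied)
```

With all scores tied, the per-sample loop would report a nonzero KS purely from row order, whereas the grouped version reports no separation.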

KNN

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_knn_pred = knn.predict(X_test)
y_knn_proba = knn.predict_proba(X_test)[:, 1]

ks_score_knn = calculate_ks(y_test, y_knn_proba)
print("KNN Accuracy:", accuracy_score(y_test, y_knn_pred))
print("KNN KS Score:", ks_score_knn)
print("KNN Confusion Matrix:\n", confusion_matrix(y_test, y_knn_pred))
print("KNN Classification Report:\n", classification_report(y_test, y_knn_pred))

Accuracy: 0.7580
KS Score: 0.3060

Confusion matrix:

Actual \ Predicted	0	1
0	940	96
1	245	128

Classification report:

Class	Precision	Recall	F1-Score	Support
0	0.79	0.91	0.85	1036
1	0.57	0.34	0.43	373
accuracy			0.76	1409
macro avg	0.68	0.63	0.64	1409
weighted avg	0.73	0.76	0.74	1409

Random forest

from sklearn.ensemble import RandomForestClassifier

rm = RandomForestClassifier(n_estimators=100, random_state=42)
rm.fit(X_train, y_train)
y_rm_pred = rm.predict(X_test)
y_rm_proba = rm.predict_proba(X_test)[:, 1]

ks_score_rm = calculate_ks(y_test, y_rm_proba)
print("Random Forest Accuracy:", accuracy_score(y_test, y_rm_pred))
print("Random Forest KS Score:", ks_score_rm)
print("Random Forest Confusion Matrix:\n", confusion_matrix(y_test, y_rm_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_rm_pred))

Accuracy: 0.7686
KS Score: 0.4445

Confusion matrix:

Actual \ Predicted	0	1
0	909	127
1	199	174

Classification report:

Class	Precision	Recall	F1-Score	Support
0	0.82	0.88	0.85	1036
1	0.58	0.47	0.52	373
accuracy			0.77	1409
macro avg	0.70	0.67	0.68	1409
weighted avg	0.76	0.77	0.76	1409

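Neural network

# Imports inferred from the names used below (tf.keras API assumed)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = Sequential([
    Dense(32, activation='relu', input_shape=(5,)),  # first hidden layer, 32 units
    Dropout(0.3),                                   # guards against overfitting
    Dense(16, activation='relu'),                   # second hidden layer, 16 units
    Dropout(0.3),
    Dense(1, activation='sigmoid')                  # output layer; sigmoid for binary classification
])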

# Compile the model
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',   # binary classification loss
    metrics=['accuracy']
)

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,           # stop if validation loss has not improved for 10 epochs
    restore_best_weights=True  # roll back to the best weights seen
)

history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),
    epochs=100,            # train for at most 100 epochs
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)

y_pred_proba = model.predict(X_test_scaled).flatten()  # predicted probabilities
y_pred = (y_pred_proba > 0.5).astype(int)              # convert to 0/1 labels

print("=== Neural network evaluation ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_pred_proba))

# calculate_ks is the function already defined in the SVM section
ks_score = calculate_ks(y_test, y_pred_proba)
print("KS Score:", ks_score)

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

Accuracy: 0.7984
AUC: 0.8338
KS Score: 0.5174

Confusion matrix:

Actual \ Predicted	0	1
0	959	77
1	207	166

Classification report:

Class	Precision	Recall	F1-Score	Support
0	0.82	0.93	0.87	1036
1	0.68	0.45	0.54	373
accuracy			0.80	1409
macro avg	0.75	0.69	0.70	1409
weighted avg	0.79	0.80	0.78	1409
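
The training call above passes the test set as `validation_data`, so early stopping picks its best epoch using the same data the final scores are computed on, which can make those scores slightly optimistic. A common alternative is to carve a validation set out of the training data. The sketch below shows only the splitting and scaling step on synthetic data; all `*_demo`, `X_fit` and `X_val` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for X and y
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# First split off the test set, then split the remainder into fit / validation
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42, stratify=y_trainval
)

# Fit the scaler on the fitting portion only
scaler_demo = StandardScaler().fit(X_fit)
X_fit_s = scaler_demo.transform(X_fit)
X_val_s = scaler_demo.transform(X_val)
X_test_s = scaler_demo.transform(X_test_demo)

# The Keras call would then use the validation split for early stopping, e.g.
# model.fit(X_fit_s, y_fit, validation_data=(X_val_s, y_val), callbacks=[early_stop], ...)
print(len(X_fit), len(X_val), len(X_test_demo))
```

The test set is then touched only once, for the final evaluation.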

Side-by-side comparison

Metric	SVC	KNN	Random Forest	Neural Network
Accuracy	0.735	0.758	0.769	0.798
AUC	-	-	-	0.834
KS Score	0.304	0.306	0.444	0.517
Class 1 Recall	0.00	0.34	0.47	0.45
Class 1 F1	0.00	0.43	0.52	0.54
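
The AUC row is filled in only for the neural network, but every model above also produces class 1 probabilities, so the same `roc_auc_score` call could complete it. A minimal illustration on a toy example follows; the `*_demo` names are hypothetical.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# AUC = share of (positive, negative) pairs ranked in the right order: 3 of 4 here
auc_demo = roc_auc_score(y_true_demo, y_score_demo)
print("AUC:", auc_demo)

# With the post's variables the calls would presumably be:
# roc_auc_score(y_test, y_svc_proba)
# roc_auc_score(y_test, y_knn_proba)
# roc_auc_score(y_test, y_rm_proba)
```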

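Accuracy looks acceptable across the board, but the best class 1 recall is only 0.47. Since the goal is to predict which stock customers will churn, recall on the churn class is what we most need to improve.

Taking the neural network as an example, we use class_weight.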

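from sklearn.utils.class_weight import compute_class_weight

# Compute class weights inversely proportional to class frequency
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, class_weights))

print("Class weights:", class_weight_dict)

# Note: calling fit on an already-trained model continues from its current weights;
# rebuild and recompile the model first to train from scratch
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),
    epochs=100,            # train for at most 100 epochs
    batch_size=32,
    callbacks=[early_stop],
    class_weight=class_weight_dict,  # apply the class weights
    verbose=1
)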

Accuracy: 0.7459
AUC: 0.8336
KS Score: 0.5153

Confusion matrix:

Actual \ Predicted	0	1
0	778	258
1	100	273

Classification report:

Class	Precision	Recall	F1-Score	Support
0	0.89	0.75	0.81	1036
1	0.51	0.73	0.60	373
accuracy			0.75	1409
macro avg	0.70	0.74	0.71	1409
weighted avg	0.79	0.75	0.76	1409
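
The `'balanced'` option of `compute_class_weight` uses the formula n_samples / (n_classes × count of each class), so the rarer class gets the larger weight. The sketch below reproduces that formula with NumPy on a hypothetical label array; the 700/300 split is illustrative and not the post's training distribution.

```python
import numpy as np

# Hypothetical labels: 700 non-churners, 300 churners
y_demo = np.array([0] * 700 + [1] * 300)

classes_demo = np.unique(y_demo)
counts = np.bincount(y_demo)

# 'balanced' weighting: n_samples / (n_classes * class_count)
weights = len(y_demo) / (len(classes_demo) * counts)

# Plain Python ints and floats as keys and values
class_weight_demo = {int(c): float(w) for c, w in zip(classes_demo, weights)}
print(class_weight_demo)
```

Each error on a churner then counts roughly 2.3 times as much in the loss as an error on a non-churner in this toy case.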

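Accuracy drops from 0.80 to 0.75, but class 1 recall rises sharply from 0.45 to 0.73, so the model now catches a much larger share of the customers who actually churn. The price is lower class 1 precision (0.68 down to 0.51).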
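
A related lever, shown here only as a sketch and not something the original post does, is to lower the decision threshold below 0.5 when turning probabilities into labels. This also trades precision for recall. The helper `recall_precision_at` and the toy arrays are hypothetical.

```python
import numpy as np

def recall_precision_at(y_true, y_proba, threshold):
    """Class 1 recall and precision when predicting 1 for proba >= threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_proba) >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return float(recall), float(precision)

# Toy labels and predicted probabilities
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_proba_demo = np.array([0.9, 0.6, 0.4, 0.3, 0.7, 0.35, 0.32, 0.2, 0.1, 0.05])

for t in (0.5, 0.3):
    r, p = recall_precision_at(y_true_demo, y_proba_demo, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

In this toy case, moving the threshold from 0.5 to 0.3 catches every positive at the cost of more false alarms. On real data the threshold should be chosen on a validation set, not the test set.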

XGBoost

For imbalanced samples we can also use models such as XGBoost or LightGBM.

from xgboost import XGBClassifier

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
y_xgb_pred = xgb.predict(X_test)
y_xgb_proba = xgb.predict_proba(X_test)[:, 1]

ks_score_xgb = calculate_ks(y_test, y_xgb_proba)
print("XGB Accuracy:", accuracy_score(y_test, y_xgb_pred))
print("XGB KS Score:", ks_score_xgb)
print("XGB Confusion Matrix:\n", confusion_matrix(y_test, y_xgb_pred))
print("XGB Classification Report:\n", classification_report(y_test, y_xgb_pred))

Accuracy: 0.812
KS Score: 0.538

Confusion matrix:

Actual \ Predicted	0	1
0	[e.g. 920]	[e.g. 116]
1	[e.g. 150]	[e.g. 223]

Classification report:

Class	Precision	Recall	F1-Score	Support
0	[e.g. 0.86]	[e.g. 0.89]	[e.g. 0.87]	1036
1	[e.g. 0.66]	[e.g. 0.60]	[e.g. 0.63]	373
accuracy			[e.g. 0.81]	1409
macro avg	[e.g. 0.76]	[e.g. 0.74]	[e.g. 0.75]	1409
weighted avg	[e.g. 0.80]	[e.g. 0.81]	[e.g. 0.80]	1409
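
The XGBoost call above uses default settings and does nothing specific about class imbalance. XGBoost's own knob for this is the `scale_pos_weight` parameter, for which a commonly recommended starting value is the ratio of negative to positive training samples. The sketch below computes that ratio on a hypothetical label array; `y_demo` and `xgb_params` are illustrative names, and the model call is left as a comment because it was not run here.

```python
import numpy as np

# Hypothetical training labels: 700 non-churners, 300 churners
y_demo = np.array([0] * 700 + [1] * 300)

n_neg = int(np.sum(y_demo == 0))
n_pos = int(np.sum(y_demo == 1))

# Starting point: weight positives by (number of negatives / number of positives)
scale_pos_weight = n_neg / n_pos

xgb_params = {
    "scale_pos_weight": scale_pos_weight,
    "eval_metric": "logloss",
    "random_state": 42,
}
print(xgb_params)

# With the post's data this would presumably become:
# XGBClassifier(**xgb_params).fit(X_train, y_train)
```

As with `class_weight`, this should raise class 1 recall at some cost in precision; the actual effect on this dataset would need to be measured.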

智浩的Blog (Zhihao's Blog)

Business Data Analysis: Stock Customer Churn
