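Insurance has become a favoured choice among high-net-worth individuals. With few attractive investment opportunities at present and elevated risk in the stock market, the property market and the real economy, the insurance industry is booming: many insurers have launched life, medical and investment-type products, and competition is fierce. Traditional insurance models tend to rely on historical data and rules of thumb, which easily leads to mismatched product recommendations and over-selling, triggering a crisis of trust and resistance among customers and damaging the company's brand reputation.
This case study uses machine learning to analyse historical data on household insurance purchases, to help insurers better understand customers' purchasing behaviour and risk preferences. It works through data cleaning, feature selection, model building, model evaluation, model tuning and model interpretation to uncover the key factors behind customers' purchase of mobile-home insurance and to build a model that predicts purchase propensity, improving recommendation accuracy and giving the insurer an edge in a highly competitive market.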

Importing the basic libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from matplotlib import font_manager
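# Path to a CJK-capable font (this path is macOS-specific)
font_path = '/System/Library/Fonts/STHeiti Medium.ttc'

# Register the file so the family name resolves, then load the font
font_manager.fontManager.addfont(font_path)
my_font = font_manager.FontProperties(fname=font_path)

# Use it as the default font
plt.rcParams['font.family'] = my_font.get_name()
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly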

colors = ['#d7fbe8','#9df3c4','#62d2a2','#1fab89','#a6d0e4', '#f9ffea', '#ffecda', '#d4a5a5', '#fbafaf', '#f2c6b4', '#f3e8cb', '#99e1e5']

Loading the data

df = pd.read_excel('train.xlsx')
df_test = pd.read_excel('test.xlsx')

df.info()

df.shape

df.head()

df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1756 entries, 0 to 1755
Data columns (total 86 columns):

#  Column  Non-Null Count  Dtype

0 客户次类别 1756 non-null int64
1 房产数 1756 non-null int64
2 每房人数 1756 non-null int64
3 平均年龄 1756 non-null int64
4 客户主类别 1756 non-null int64
5 罗马天主教比例 1756 non-null int64
6 新教比例 1756 non-null int64
7 其它宗教比例 1756 non-null int64
8 无宗教比例 1756 non-null int64
9 已婚占比 1756 non-null int64
10 同居占比 1756 non-null int64
11 其它关系占比 1756 non-null int64
12 单身占比 1756 non-null int64
13 无子女 1756 non-null int64
14 有子女 1756 non-null int64
15 高等教育 1756 non-null int64
16 中等教育 1756 non-null int64
17 低等教育 1756 non-null int64
18 高管 1756 non-null int64
19 企业家 1756 non-null int64
…
84 投保社会安全险数量 1756 non-null int64
85 移动房车险数量 1756 non-null int64
dtypes: int64(86)
memory usage: 1.2 MB

(1756, 86)

sns.countplot(x='移动房车险数量', data=df, palette=colors)
plt.title('移动房车险数量分布', fontproperties=my_font, fontsize=16)
plt.xlabel('移动房车险数量', fontproperties=my_font, fontsize=14)
plt.ylabel('数量', fontproperties=my_font, fontsize=14)
plt.show()

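Looking at the target distribution in the training set, the two classes are fairly evenly represented. (We also checked the test set, where buyers are far fewer than non-buyers.)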
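To put numbers on the class balance, a minimal sketch follows. It does not read train.xlsx; it assumes the class counts equal the supports printed in this post's classification reports (1060 and 696), and the variable names are illustrative only.

```python
import pandas as pd

# Toy target column: the counts 1060 / 696 are copied from the supports in the
# classification reports of this post, not recomputed from train.xlsx.
target = pd.Series([0] * 1060 + [1] * 696, name='移动房车险数量')

counts = target.value_counts().sort_index()                 # absolute count per class
shares = target.value_counts(normalize=True).sort_index()   # proportion per class

print(counts.to_dict())
print(shares.round(3).to_dict())
```

On the real data the same two calls can be applied to df['移动房车险数量'] and to the corresponding column of the test frame to compare the two distributions.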

Data cleaning

Handling missing values

df.isnull().sum()
print("Columns with missing values:")
print(df.columns[df.isnull().any()])

Columns with missing values:
Index([], dtype='object')

The data contains no missing values.

Outlier handling

Inspecting the data distribution

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

numeric_cols = df.select_dtypes(include='number').columns

for col in numeric_cols:
    plt.figure(figsize=(14, 5))

    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True, bins=30, color='skyblue')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')

    plt.subplot(1, 2, 2)
    stats.probplot(df[col], dist="norm", plot=plt)
    plt.title(col)

    plt.tight_layout()
    plt.show()

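We plotted the distribution of every feature. The features in the first half of the table (the non-insurance ones) look reasonably well behaved, while the insurance-related features are very clearly skewed. An outlier detector would treat many of those values as anomalies. But since this dataset is used to predict the propensity to buy mobile-home insurance, those extreme values may well belong to the very customers who buy. So instead of removing them, we flag them.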

for col in df.columns:
    if col == '移动房车险数量' or df[col].dtype == 'object':
        continue
    
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    
    df[col + '_outlier'] = df[col].apply(lambda x: 1 if (x < lower or x > upper) else 0)
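The same 1.5 × IQR rule can also be written without a Python-level loop. The sketch below is one possible vectorised form on a toy frame; the helper name flag_iqr_outliers and the toy columns are hypothetical and not part of the original notebook.

```python
import pandas as pd

def flag_iqr_outliers(frame: pd.DataFrame, skip=()) -> pd.DataFrame:
    """Return 0/1 indicator columns named '<col>_outlier' using the 1.5 * IQR rule."""
    cols = [c for c in frame.select_dtypes(include='number').columns if c not in skip]
    q1 = frame[cols].quantile(0.25)
    q3 = frame[cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # lower/upper are Series indexed by column name, so the comparisons
    # are aligned column by column.
    flags = (frame[cols].lt(lower) | frame[cols].gt(upper)).astype(int)
    return flags.add_suffix('_outlier')

# Toy data: one obviously extreme value in column 'a'; 'target' is skipped.
toy = pd.DataFrame({'a': [1, 2, 2, 3, 50], 'target': [0, 0, 1, 0, 1]})
flags = flag_iqr_outliers(toy, skip=('target',))
toy = pd.concat([toy, flags], axis=1)
print(toy)
```

Building all indicator columns at once and attaching them with a single concat also avoids inserting columns into the frame one at a time.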

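Feature selection

We compute each feature's variance and its Pearson correlation with the target, then drop the features whose variance is too small (at most 0.1) and the features whose correlation p-value is greater than 0.05.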

low_var = df.var(numeric_only=True)
df = df.drop(columns=low_var[low_var <= 0.1].index)

from scipy.stats import pearsonr

target_col = "移动房车险数量"
numeric_cols = df.select_dtypes(include='number').columns

for col in df.columns:
    if col == target_col:
        continue
    if col not in numeric_cols:
        df = df.drop(columns=[col])
        continue
    try:
        corr, p = pearsonr(df[col], df[target_col])
        if p > 0.05:
            df = df.drop(columns=[col])
    except Exception as e:
        print(f"Error on {col}: {e}")
        df = df.drop(columns=[col])

print(df.columns.tolist())
print("Number of remaining columns:", len(df.columns))

['客户次类别', '每房人数', '客户主类别', '新教比例', '无宗教比例', '已婚占比', '同居占比', '其它关系占比', '单身占比', '高等教育', '中等教育', '低等教育', '高管', '农场主', '中层管理者', '技术工人', '非熟练劳工', '社会阶层A', '社会阶层B1', '社会阶层C', '社会阶层D', '租房子', '房主', '一辆车', '无车', '公共社保', '私人社保', '收入低于30', '收入45-75', '收入75-122', '平均收入', '购买力水平', '个人第三方保险', '投保车险', '投保机动自行车险', '投保身残险', '投保火险', '投保船险', '投保社会安全险', '第三方私人险数量', '投保车险数量', '投保寿险数量', '投保火险数量', '移动房车险数量', '社会阶层D_outlier']
Number of remaining columns: 45

selected_cols = df.columns.tolist()
# Note: this selects from df (the training frame), so from here on df_test holds the training rows.
df_test = df[selected_cols]
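If the goal is to carry the selected columns over to the separate test file, the test frame first needs the same derived columns as the training frame, with thresholds taken from the training data. The following is a sketch of that idea on toy frames, not the author's code; select_for_test and all toy column names are hypothetical.

```python
import pandas as pd

def select_for_test(test_frame: pd.DataFrame, train_columns) -> pd.DataFrame:
    """Keep the training columns in training order; fail loudly if one is missing."""
    missing = [c for c in train_columns if c not in test_frame.columns]
    if missing:
        raise KeyError(f'test frame lacks columns: {missing}')
    return test_frame.loc[:, list(train_columns)]

# Toy stand-ins for the training and test frames.
toy_train = pd.DataFrame({'x': [1, 2, 2, 3, 50], 'target': [0, 0, 1, 0, 1]})
toy_test = pd.DataFrame({'x': [2, 40], 'unused': [7, 8], 'target': [0, 1]})

# IQR bounds are computed on the training frame only, then applied to both frames.
q1, q3 = toy_train['x'].quantile(0.25), toy_train['x'].quantile(0.75)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
for frame in (toy_train, toy_test):
    frame['x_outlier'] = ((frame['x'] < lower) | (frame['x'] > upper)).astype(int)

toy_selected_cols = ['x', 'x_outlier', 'target']   # as if produced by feature selection
toy_test_selected = select_for_test(toy_test, toy_selected_cols)
print(toy_test_selected)
```

Reusing the training thresholds keeps the meaning of each outlier flag identical in both frames, and the explicit check turns a silently missing column into an immediate error.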

Splitting the data

X = df.drop(columns=['移动房车险数量'])
y = df['移动房车险数量']

X_train = df.drop(columns=['移动房车险数量'])
y_train = df['移动房车险数量']
X_test = df_test.drop(columns=['移动房车险数量'])
y_test = df_test['移动房车险数量']

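Model building

We trained and tuned two models. The first is a naive Bayes classifier (BernoulliNB). After grid search it reaches a class-1 recall of 0.73 and a macro-average F1 of 0.67; compared with the untuned naive Bayes model, class-1 recall rises from 0.62 to 0.73, about 11 percentage points.
The second is a multilayer perceptron trained with the Adam solver and early stopping. Its accuracy and macro-average recall both reach 0.96. The figure we care about most is the recall for customers predicted to buy the insurance, which is 0.98.
Note that every report lists 1060 + 696 = 1756 samples, the size of the training set, because df_test was built from df during feature selection. These figures therefore measure fit on the training rows; whether the model has really learned the profile of insurance buyers still has to be confirmed on the separate test file.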

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, recall_score, classification_report


pipeline = Pipeline([
    ('nb', BernoulliNB())
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print("Classification report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.6822323462414579
Recall: 0.617816091954023
Classification report:
              precision    recall  f1-score   support

           0       0.74      0.72      0.73      1060
           1       0.60      0.62      0.61       696

    accuracy                           0.68      1756
   macro avg       0.67      0.67      0.67      1756
weighted avg       0.68      0.68      0.68      1756

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('nb', BernoulliNB())
])

param_grid = {
    'nb__alpha': [0.1, 0.5, 1.0, 2.0],
    'nb__binarize': [0.0, 0.5, 1.0, 2.0],
    'nb__class_prior': [None, [0.3, 0.7], [0.4, 0.6]]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)

print("Best parameters:", grid_search.best_params_)
print("Best model score:", grid_search.best_score_)
print("\nClassification report:")
print(classification_report(y_test, y_pred))

Best parameters: {'nb__alpha': 0.1, 'nb__binarize': 0.0, 'nb__class_prior': [0.4, 0.6]}
Best model score: 0.6235479451448764

Classification report:
              precision    recall  f1-score   support

           0       0.78      0.63      0.70      1060
           1       0.57      0.73      0.64       696

    accuracy                           0.67      1756
   macro avg       0.67      0.68      0.67      1756
weighted avg       0.70      0.67      0.68      1756
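The F1 values in the two naive Bayes reports can be checked by hand, since F1 is the harmonic mean of precision and recall. The short sketch below redoes that arithmetic from the rounded class-1 figures printed above; the helper name f1_from is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class-1 precision and recall as printed in the two BernoulliNB reports.
baseline_f1 = f1_from(0.60, 0.62)   # untuned model
tuned_f1 = f1_from(0.57, 0.73)      # after grid search

# Macro-average F1 of the tuned model: plain mean of the two per-class F1 values.
tuned_macro_f1 = (0.70 + 0.64) / 2

print(round(baseline_f1, 2), round(tuned_f1, 2), round(tuned_macro_f1, 2))
```

The tuned model trades some class-1 precision (0.60 to 0.57) for a larger gain in class-1 recall (0.62 to 0.73), which is why its class-1 F1 still improves.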

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report
import matplotlib.pyplot as plt

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                          solver='adam', early_stopping=True,
                          validation_fraction=0.1, n_iter_no_change=10,
                          random_state=42))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Recall (macro average):', recall_score(y_test, y_pred, average='macro'))
print("Classification report:")
print(classification_report(y_test, y_pred))

plt.plot(pipeline.named_steps['mlp'].loss_curve_)
plt.title("Loss Curve")
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.grid(True)
plt.show()

Accuracy: 0.9595671981776766
Recall (macro average): 0.9635491216655823
Classification report:
              precision    recall  f1-score   support

           0       0.99      0.94      0.97      1060
           1       0.92      0.98      0.95       696

    accuracy                           0.96      1756
   macro avg       0.95      0.96      0.96      1756
weighted avg       0.96      0.96      0.96      1756
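The support column of this report adds up to 1756, the number of training rows. A common way to tell memorisation apart from generalisation is to compare the score on rows used for fitting with the score on rows held out. The sketch below shows that comparison on synthetic data from make_classification; the data and all names ending in _demo are assumptions for illustration, and the MLP settings simply mirror the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance data (hypothetical, roughly 60/40 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.6, 0.4], random_state=42,
)

# Stratified split keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

pipeline_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                          solver='adam', early_stopping=True,
                          validation_fraction=0.1, n_iter_no_change=10,
                          random_state=42)),
])
pipeline_demo.fit(X_tr, y_tr)

# Class-1 recall on rows the model was fitted on versus rows it never saw.
train_recall = recall_score(y_tr, pipeline_demo.predict(X_tr))
test_recall = recall_score(y_te, pipeline_demo.predict(X_te))
print('recall on rows seen in training:', round(train_recall, 3))
print('recall on held-out rows:', round(test_recall, 3))
```

A large gap between the two numbers would suggest overfitting; for this case study the natural held-out set is test.xlsx, prepared with the same columns as the training frame.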

Feature importance

import shap

scaler = pipeline.named_steps['scaler']
model = pipeline.named_steps['mlp']

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

explainer = shap.KernelExplainer(model.predict, X_train_scaled[:100])
shap_values = explainer.shap_values(X_test_scaled[:100])

shap.summary_plot(shap_values, X_test_scaled[:100], feature_names=X.columns)
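As a cross-check on the SHAP ranking, permutation importance from scikit-learn is a different and cheaper technique: it shuffles one feature at a time and records how much the score drops. The sketch below applies it to synthetic data with a small MLP; it is not part of the original analysis, and the data, the names ending in _demo and the feature names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features, of which 3 carry signal (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names_demo = [f'feature_{i}' for i in range(X_demo.shape[1])]

model_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
])
model_demo.fit(X_demo, y_demo)

# Shuffle each feature 5 times and measure the mean drop in class-1 recall.
# For brevity this scores on the fitting rows; held-out rows are preferable.
result = permutation_importance(
    model_demo, X_demo, y_demo, scoring='recall', n_repeats=5, random_state=0
)

ranking = sorted(zip(feature_names_demo, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)
for name, drop in ranking:
    print(f'{name}: {drop:.3f}')
```

Because the pipeline is passed as a whole, the scaler is applied inside each evaluation, so the raw feature matrix can be used directly. Features that rank highly under both methods are the more trustworthy candidates for the factors that drive purchases.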

智浩的Blog

Predictive Analysis of Mobile-Home Insurance Purchase Propensity