Notice

All code in this post is stored at
https://github.com/super-213/business_data_analysis
Feel free to download it if you need it.

Inspecting the data

import pandas as pd

# Load the tumor dataset from the Excel file
df = pd.read_excel('肿瘤数据.xlsx')

df.head()

最大周长	最大凹陷度	平均凹陷度	最大面积	最大半径	平均灰度值	肿瘤性质
184.60	0.2654	0.14710	2019.0	25.38	17.33	0
158.80	0.1860	0.07017	1956.0	24.99	23.41	0
152.50	0.2430	0.12790	1709.0	23.57	25.53	1
98.87	0.2575	0.10520	567.7	14.91	26.50	0
152.20	0.1625	0.10430	1575.0	22.54	16.67	0

Columns: 最大周长 (maximum perimeter), 最大凹陷度 (maximum concavity), 平均凹陷度 (mean concavity), 最大面积 (maximum area), 最大半径 (maximum radius), 平均灰度值 (mean gray value), 肿瘤性质 (tumor type, the target).

Missing values

df.isnull().sum()

Column	Missing values
最大周长	0
最大凹陷度	0
平均凹陷度	0
最大面积	0
最大半径	0
平均灰度值	0
肿瘤性质	0

Outlier handling

import matplotlib.pyplot as plt

# Draw one box plot per column
for col in df.columns:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()
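
As a rough numeric companion to the box plots, the sketch below counts, per column, how many values fall outside the 1.5 × IQR whiskers (matplotlib's default whisker rule). The helper name `iqr_outlier_counts` and the toy data are assumptions made for illustration; they are not part of the original code.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data (made up): column "a" contains one obviously extreme value.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the post's data this could be called as `iqr_outlier_counts(df.drop(columns=['肿瘤性质']))`.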

The quartile-based box plots reveal some outliers. Because Naive Bayes is sensitive to outliers, we visualize how the data is distributed.

import seaborn as sns
cols = ['最大周长', '平均凹陷度', '最大面积', '最大半径', '平均灰度值']
target_col = '肿瘤性质'

n_cols = len(cols)
fig, axes = plt.subplots(n_cols, 2, figsize=(14, n_cols * 4))
fig.suptitle('Feature distributions vs. target', fontsize=16, fontweight='bold')

for i, col in enumerate(cols):
    ax1 = axes[i, 0]
    ax2 = axes[i, 1]

    sns.histplot(
        data=df,
        x=col,
        hue=target_col,
        kde=True,
        ax=ax1,
        bins=30,
        alpha=0.7,
        palette='Set2'
    )
    ax1.set_title(f'{col} - histogram & KDE', fontsize=14)
    ax1.grid(True, linestyle='--', alpha=0.6)

    sns.boxplot(
        data=df,
        x=target_col,
        y=col,
        ax=ax2,
        palette='Set2'
    )
    ax2.set_title(f'{col} - box plot', fontsize=14)
    ax2.grid(True, linestyle='--', alpha=0.6)
    ax2.set_xlabel('Target class')

plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
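
A numeric check can complement the histograms: sample skewness is positive for right-skewed data, and a log transform typically pulls it toward zero. The sketch below shows this on synthetic log-normal data, which is an assumption standing in for a real column; it is not the tumor dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (log-normal), standing in for a real column.
feature = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_raw = feature.skew()            # positive for right-skewed data
skew_log = np.log1p(feature).skew()  # log1p(x) computes log(1 + x)

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

For the real data, `df[cols].skew()` would give the per-feature skewness in one call.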

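The data is clearly right-skewed, which would normally call for a log transform. To be more concrete, though, we test several preprocessing options.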

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
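import numpy as np

# Features and target (the original snippet uses X and y without defining them here)
X = df.drop(columns=['肿瘤性质'])
y = df['肿瘤性质']

score_raw = cross_val_score(GaussianNB(), X, y, cv=5, scoring='f1').mean()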

X_scaled = StandardScaler().fit_transform(X)
score_scaled = cross_val_score(GaussianNB(), X_scaled, y, cv=5, scoring='f1').mean()

X_robust = RobustScaler().fit_transform(X)
score_robust = cross_val_score(GaussianNB(), X_robust, y, cv=5, scoring='f1').mean()

X_log = np.log(X + 1)
score_log = cross_val_score(GaussianNB(), X_log, y, cv=5, scoring='f1').mean()

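print(f"Raw: {score_raw:.4f}")
print(f"Standardized: {score_scaled:.4f}")
print(f"Robust scaling: {score_robust:.4f}")
print(f"Log transform: {score_log:.4f}")

Raw: 0.9711
Standardized: 0.9651
Robust scaling: 0.9651
Log transform: 0.9692

Skipping preprocessing actually scores best here, although the margin over the log transform is small (0.9711 vs. 0.9692).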
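
One caveat about the comparison above: the scalers are fit on the full dataset before cross-validation, so every validation fold influences the scaling. A minimal sketch of a leak-free variant wraps the scaler and the model in a pipeline, so the scaler is refit inside each fold. It runs on synthetic data from `make_classification` (an assumption standing in for `X` and `y`), so the scores it prints are not comparable to the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic stand-in for the post's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

candidates = {
    "raw": GaussianNB(),
    "standard": make_pipeline(StandardScaler(), GaussianNB()),
    "robust": make_pipeline(RobustScaler(), GaussianNB()),
}

scores = {}
for name, model in candidates.items():
    # cross_val_score clones the estimator and fits it on each training fold only,
    # so the scaler never sees the validation fold.
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    print(f"{name}: {scores[name]:.4f}")
```

With the real data, replacing `X_demo` and `y_demo` by `X` and `y` would rerun the same comparison without the leakage.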

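Naive Bayes

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hold out 20% for testing. The original snippet omits this step; the split is
# assumed to match the one used for the later models (test_size=0.2, random_state=42).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)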

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9649

Actual \ Predicted	0	1
0	39	3
1	1	71

Class	Precision	Recall	F1-Score	Support
0	0.97	0.93	0.95	42
1	0.96	0.99	0.97	72
accuracy			0.96	114
macro avg	0.97	0.96	0.96	114
weighted avg	0.97	0.96	0.96	114
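
The report can be reproduced by hand from the confusion matrix, which is a useful sanity check. The sketch below recomputes accuracy and the per-class precision, recall and F1 from the four Naive Bayes counts shown above; the function and variable names are illustrative.

```python
# Confusion matrix from the Naive Bayes run: rows = actual, columns = predicted.
cm = [[39, 3],
      [1, 71]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def prf(matrix, cls):
    """Precision, recall and F1 for class index `cls` of a 2x2 confusion matrix."""
    tp = matrix[cls][cls]
    predicted = matrix[0][cls] + matrix[1][cls]  # column sum: predicted as cls
    actual = matrix[cls][0] + matrix[cls][1]     # row sum: truly cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(f"accuracy: {accuracy:.4f}")
for cls in (0, 1):
    p, r, f = prf(cm, cls)
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The same function applies unchanged to the confusion matrices of the later models.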

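Logistic regression

For the remaining models, we add outlier indicator columns to the data: each feature gets a 0/1 flag marking values outside the 1.5 × IQR range.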

df = pd.read_excel('肿瘤数据.xlsx')

cols = ['最大周长', '平均凹陷度', '最大面积', '最大半径', '平均灰度值']

for col in cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df[f'{col}_is_outlier'] = ((df[col] < lower_bound) | (df[col] > upper_bound)).astype(int)

df.head()

X = df.drop(columns=['肿瘤性质'])
y = df['肿瘤性质']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9561

Actual \ Predicted	0	1
0	39	3
1	2	70

Class	Precision	Recall	F1-Score	Support
0	0.95	0.93	0.94	42
1	0.96	0.97	0.97	72
accuracy			0.96	114
macro avg	0.96	0.95	0.95	114
weighted avg	0.96	0.96	0.96	114

Decision tree

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9211

Actual \ Predicted	0	1
0	37	5
1	4	68

Class	Precision	Recall	F1-Score	Support
0	0.90	0.88	0.89	42
1	0.93	0.94	0.94	72
accuracy			0.92	114
macro avg	0.92	0.91	0.91	114
weighted avg	0.92	0.92	0.92	114

XGBoost

from xgboost import XGBClassifier

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
y_xgb_pred = xgb.predict(X_test)

print("XGB Accuracy:", accuracy_score(y_test, y_xgb_pred))
print("XGB Confusion Matrix:\n", confusion_matrix(y_test, y_xgb_pred))
print("XGB Classification Report:\n", classification_report(y_test, y_xgb_pred))

XGB Accuracy: 0.9561

Actual \ Predicted	0	1
0	39	3
1	2	70

Class	Precision	Recall	F1-Score	Support
0	0.95	0.93	0.94	42
1	0.96	0.97	0.97	72
accuracy			0.96	114
macro avg	0.96	0.95	0.95	114
weighted avg	0.96	0.96	0.96	114

Random forest

from sklearn.ensemble import RandomForestClassifier

rm = RandomForestClassifier(n_estimators=100, random_state=42)
rm.fit(X_train, y_train)
y_rm_pred = rm.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_rm_pred))
print("Random Forest Confusion Matrix:\n", confusion_matrix(y_test, y_rm_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_rm_pred))

Random Forest Accuracy: 0.9561

Actual \ Predicted	0	1
0	39	3
1	2	70

Class	Precision	Recall	F1-Score	Support
0	0.95	0.93	0.94	42
1	0.96	0.97	0.97	72
accuracy			0.96	114
macro avg	0.96	0.95	0.95	114
weighted avg	0.96	0.96	0.96	114
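
All five models are scored on a single 80/20 split with 114 test rows, so one misclassified sample moves accuracy by about 0.9 percentage points, and the gap between 0.9561 and 0.9649 is exactly one sample. A hedged sketch of a more stable comparison scores every model with stratified cross-validation. The data here is synthetic (`make_classification`), and only scikit-learn models are included to keep the block self-contained; an `XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tumor features and labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same shuffled, stratified folds are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold variation.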

智浩的Blog

Business Data Analysis: Tumor Data