Business Data Analysis: Tumor Data

姜智浩

Note

All code for this post is available at
https://github.com/super-213/business_data_analysis
Download it from there if you need it.

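Inspecting the Data

import pandas as pd

# Load the tumor dataset from the Excel workbook
df = pd.read_excel('肿瘤数据.xlsx')

# Preview the first five rows
df.head()

Columns: 最大周长 (max perimeter), 最大凹陷度 (max concavity), 平均凹陷度 (mean concavity), 最大面积 (max area), 最大半径 (max radius), 平均灰度值 (mean gray value), 肿瘤性质 (tumor type, the target).

最大周长  最大凹陷度  平均凹陷度  最大面积  最大半径  平均灰度值  肿瘤性质
184.60    0.2654      0.14710     2019.0    25.38     17.33       0
158.80    0.1860      0.07017     1956.0    24.99     23.41       0
152.50    0.2430      0.12790     1709.0    23.57     25.53       1
98.87     0.2575      0.10520     567.7     14.91     26.50       0
152.20    0.1625      0.10430     1575.0    22.54     16.67       0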

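Handling Missing Values

# Count missing values in each column
df.isnull().sum()

Column      Missing values
最大周长    0
最大凹陷度  0
平均凹陷度  0
最大面积    0
最大半径    0
平均灰度值  0
肿瘤性质    0

No column has missing values, so nothing needs to be imputed.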

Handling Outliers

import matplotlib.pyplot as plt

# Draw one box plot per column to look for outliers
for col in df.columns:
    plt.boxplot(df[col])
    plt.title(col)
    plt.show()

[Figures: seven box plots, one per column]

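The quartile-based box plots show that some columns contain outliers. Gaussian naive Bayes estimates a mean and variance for each feature, so it is sensitive to outliers. We therefore visualize how the features are distributed.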

import matplotlib.pyplot as plt
import seaborn as sns

# Features to plot and the target column.
# The column names are Chinese, so matplotlib needs a CJK-capable font to render them.
cols = ['最大周长', '平均凹陷度', '最大面积', '最大半径', '平均灰度值']
target_col = '肿瘤性质'

# One row per feature: histogram with KDE on the left, box plot on the right
n_cols = len(cols)
fig, axes = plt.subplots(n_cols, 2, figsize=(14, n_cols * 4))
fig.suptitle('Feature distributions vs. target', fontsize=16, fontweight='bold')

for i, col in enumerate(cols):
    ax1 = axes[i, 0]
    ax2 = axes[i, 1]

    # Distribution of the feature, colored by tumor type
    sns.histplot(
        data=df,
        x=col,
        hue=target_col,
        kde=True,
        ax=ax1,
        bins=30,
        alpha=0.7,
        palette='Set2'
    )
    ax1.set_title(f'{col} - Histogram & KDE', fontsize=14)
    ax1.grid(True, linestyle='--', alpha=0.6)

    # Spread of the feature within each tumor type
    sns.boxplot(
        data=df,
        x=target_col,
        y=col,
        ax=ax2,
        palette='Set2'
    )
    ax2.set_title(f'{col} - Box plot', fontsize=14)
    ax2.grid(True, linestyle='--', alpha=0.6)
    ax2.set_xlabel('Target class')

plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()

[Figure: histogram with KDE and box plot for each feature, split by tumor type]
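One way to put a number on the asymmetry seen in these histograms is the sample skewness, which is positive for a right-skewed distribution. The sketch below uses synthetic log-normal data rather than the tumor dataset, and the names `demo`, `skew_raw`, and `skew_log` are hypothetical. It only illustrates how a log transform tends to reduce right skew.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumption: stands in for a skewed feature column)
rng = np.random.default_rng(0)
demo = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# Skewness > 0 indicates a longer right tail
skew_raw = demo.skew()

# log1p(x) = log(1 + x), the same transform as np.log(X + 1)
skew_log = np.log1p(demo).skew()

print(f"skewness before log1p: {skew_raw:.2f}")
print(f"skewness after log1p:  {skew_log:.2f}")
```

On a real DataFrame, the same check would be `df[cols].skew()` before and after the transform.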

The data are right-skewed, which would normally call for a log transform. To be more concrete, though, we test several preprocessing options.

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np

# Separate the features from the target column
X = df.drop(columns=['肿瘤性质'])
y = df['肿瘤性质']

# Baseline: raw features, 5-fold cross-validated F1
score_raw = cross_val_score(GaussianNB(), X, y, cv=5, scoring='f1').mean()

# Standardization (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(X)
score_scaled = cross_val_score(GaussianNB(), X_scaled, y, cv=5, scoring='f1').mean()

# Robust scaling (median and IQR)
X_robust = RobustScaler().fit_transform(X)
score_robust = cross_val_score(GaussianNB(), X_robust, y, cv=5, scoring='f1').mean()

# Log transform, log(1 + x)
X_log = np.log(X + 1)
score_log = cross_val_score(GaussianNB(), X_log, y, cv=5, scoring='f1').mean()

print(f"Raw: {score_raw:.4f}")
print(f"Standardized: {score_scaled:.4f}")
print(f"Robust scaling: {score_robust:.4f}")
print(f"Log transform: {score_log:.4f}")

Raw: 0.9711
Standardized: 0.9651
Robust scaling: 0.9651
Log transform: 0.9692

The raw features score slightly higher than every preprocessed variant, so we skip preprocessing.
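The comparison above fits each scaler on the full dataset before cross-validating. A variant that refits the transform inside each fold can be sketched with a scikit-learn pipeline. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `candidates` are assumptions for illustration. This is not the tumor dataset, and the scores it prints say nothing about the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler, StandardScaler

# Synthetic stand-in data (assumption); exponentiating makes every feature
# positive and right-skewed, so log1p is well defined
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
X_demo = np.exp(X_demo)

# Each candidate bundles its preprocessing with the classifier, so the
# transform is fitted on the training folds only
candidates = {
    "raw": GaussianNB(),
    "standardized": make_pipeline(StandardScaler(), GaussianNB()),
    "robust": make_pipeline(RobustScaler(), GaussianNB()),
    "log1p": make_pipeline(FunctionTransformer(np.log1p), GaussianNB()),
}

scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`.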

Naive Bayes

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hold out 20% of the rows for testing (same split settings as used later in this post)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Gaussian naive Bayes on the raw features and predict on the test set
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9649

Actual \ Predicted  0   1
0                   39  3
1                   1   71

Class         Precision  Recall  F1-Score  Support
0             0.97       0.93    0.95      42
1             0.96       0.99    0.97      72
accuracy                         0.96      114
macro avg     0.97       0.96    0.96      114
weighted avg  0.97       0.96    0.96      114
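The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts printed above (rows are actual classes, columns are predicted classes). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the naive Bayes run: rows = actual, columns = predicted
cm = np.array([[39, 3],
               [1, 71]])

# Accuracy: correct predictions (the diagonal) over all samples
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions for a class over everything predicted as that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions for a class over everything that truly is that class
recall = np.diag(cm) / cm.sum(axis=1)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")    # 110 / 114
print(f"precision: {precision.round(4)}")
print(f"recall:    {recall.round(4)}")
print(f"f1:        {f1.round(4)}")
```

For class 0 this gives precision 39/40, recall 39/42, and F1 78/82, which round to the values in the report.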

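Logistic Regression

For the remaining models, we add outlier indicator columns to the data.

import pandas as pd
from sklearn.model_selection import train_test_split

# Reload the data so we start from the original columns
df = pd.read_excel('肿瘤数据.xlsx')

cols = ['最大周长', '平均凹陷度', '最大面积', '最大半径', '平均灰度值']

for col in cols:
    # Quartiles and interquartile range of the column
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Values beyond 1.5 * IQR from the quartiles count as outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Add a 0/1 indicator column marking the outliers
    df[f'{col}_is_outlier'] = ((df[col] < lower_bound) | (df[col] > upper_bound)).astype(int)

df.head()

# Features now include the original columns plus the outlier indicators
X = df.drop(columns=['肿瘤性质'])
y = df['肿瘤性质']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)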
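The flags above use quartiles computed on the full dataset, before the train/test split. If the test rows should stay unseen during feature construction, the bounds can instead be derived from the training portion only and then applied to both parts. A minimal sketch, where `iqr_bounds`, `add_outlier_flags`, and the toy frames are hypothetical names and data rather than anything from the original code:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def add_outlier_flags(train, test, cols):
    """Flag outliers in both frames using fences learned from `train` only."""
    train, test = train.copy(), test.copy()
    for col in cols:
        lower, upper = iqr_bounds(train[col])
        for part in (train, test):
            part[f"{col}_is_outlier"] = (
                (part[col] < lower) | (part[col] > upper)
            ).astype(int)
    return train, test


# Toy data (assumption): one feature with an obvious outlier in each part
train_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
test_demo = pd.DataFrame({"x": [2.5, 50.0]})

train_flagged, test_flagged = add_outlier_flags(train_demo, test_demo, ["x"])
print(train_flagged)
print(test_flagged)
```

Here the training quartiles are 2 and 4, so the fences are -1 and 7, and both 100 and 50 are flagged.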
from sklearn.linear_model import LogisticRegression

# Fit logistic regression with default settings and predict on the test set
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9561

Actual \ Predicted  0   1
0                   39  3
1                   2   70

Class         Precision  Recall  F1-Score  Support
0             0.95       0.93    0.94      42
1             0.96       0.97    0.97      72
accuracy                         0.96      114
macro avg     0.96       0.95    0.95      114
weighted avg  0.96       0.96    0.96      114

Decision Tree

from sklearn.tree import DecisionTreeClassifier

# Fit an unconstrained decision tree; random_state makes the run reproducible
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9211

Actual \ Predicted  0   1
0                   37  5
1                   4   68

Class         Precision  Recall  F1-Score  Support
0             0.90       0.88    0.89      42
1             0.93       0.94    0.94      72
accuracy                         0.92      114
macro avg     0.92       0.91    0.91      114
weighted avg  0.92       0.92    0.92      114

XGBoost

from xgboost import XGBClassifier

# use_label_encoder only matters on older XGBoost releases; newer ones no longer need it
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
y_xgb_pred = xgb.predict(X_test)

print("XGB Accuracy:", accuracy_score(y_test, y_xgb_pred))
print("XGB Confusion Matrix:\n", confusion_matrix(y_test, y_xgb_pred))
print("XGB Classification Report:\n", classification_report(y_test, y_xgb_pred))

XGB Accuracy: 0.9561

Actual \ Predicted  0   1
0                   39  3
1                   2   70

Class         Precision  Recall  F1-Score  Support
0             0.95       0.93    0.94      42
1             0.96       0.97    0.97      72
accuracy                         0.96      114
macro avg     0.96       0.95    0.95      114
weighted avg  0.96       0.96    0.96      114

Random Forest

from sklearn.ensemble import RandomForestClassifier

# Fit a random forest of 100 trees and predict on the test set
rm = RandomForestClassifier(n_estimators=100, random_state=42)
rm.fit(X_train, y_train)
y_rm_pred = rm.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_rm_pred))
print("Random Forest Confusion Matrix:\n", confusion_matrix(y_test, y_rm_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_rm_pred))

Random Forest Accuracy: 0.9561

Actual \ Predicted  0   1
0                   39  3
1                   2   70

Class         Precision  Recall  F1-Score  Support
0             0.95       0.93    0.94      42
1             0.96       0.97    0.97      72
accuracy                         0.96      114
macro avg     0.96       0.95    0.95      114
weighted avg  0.96       0.96    0.96      114
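
All five models above are scored on a single 114-row test split, where each individual prediction moves accuracy by roughly 0.9 percentage points. Cross-validation is one way to get a steadier comparison. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification`, hypothetical names (`X_demo`, `y_demo`, `models`, `results`), and only the scikit-learn models so that it runs without extra packages. It does not reproduce any result from this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the tumor dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=42
)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same stratified folds are reused for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean, std) in results.items():
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

The standard deviation across folds gives a rough sense of whether a gap between two models is larger than the noise from the split.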
  • Title: Business Data Analysis: Tumor Data
  • Author: 姜智浩
  • Created at : 2025-09-23 11:45:14
  • Updated at : 2025-09-23 19:48:58
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/09/23/20250923商业数据分析--肿瘤数据/
  • License: This work is licensed under CC BY-NC-SA 4.0.