Spaceship Titanic

姜智浩

Data Source

https://www.kaggle.com/competitions/spaceship-titanic/data

Data Description

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away, and things aren't looking good.

The Spaceship Titanic is an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage, transporting emigrants from our solar system to three newly discovered habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination, the torrid 55 Cancri E, the careless Spaceship Titanic collided with a spacetime anomaly hidden in a dust cloud. Sadly, it met a fate similar to its namesake from a thousand years before. Though the ship stayed intact, almost half of the passengers were transported to another dimension!

To help rescue crews retrieve the lost passengers, you need to predict which passengers were transported by the anomaly, using records recovered from the spaceship's damaged computer system.

Help save them and change history!

Process

Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('train.csv')

df.head()

[output: preview of the first rows of train.csv]

df.info()

[output: column dtypes and non-null counts]

Let's walk through what each field means:

  • PassengerId: a unique ID for each passenger
  • HomePlanet: the planet the passenger departed from, typically their planet of permanent residence
  • CryoSleep: indicates whether the passenger elected to be put into suspended animation for the duration of the voyage; passengers in cryosleep are confined to their cabins
  • Cabin: the cabin number where the passenger is staying, in the format deck/num/side, where side is P for Port or S for Starboard
  • Destination: the planet the passenger will be debarking to
  • Age: the age of the passenger
  • VIP: whether the passenger has paid for special VIP service during the voyage
  • RoomService: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • FoodCourt: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • ShoppingMall: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • Spa: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • VRDeck: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • Name: the first and last names of the passenger
  • Transported: whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

Handling Missing Values

We start by handling the missing values.

df.isnull().sum()

[output: missing-value count per column]

For numeric features, we fill with the mean.

For categorical features, we fill with the mode.

missing_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Fill numeric features with the column mean
for feature in missing_features:
    df[feature] = df[feature].fillna(df[feature].mean())

categorical_features = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

# Fill categorical features with the column mode
for feature in categorical_features:
    df[feature] = df[feature].fillna(df[feature].mode()[0])
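
As a quick sanity check (my addition, not part of the original notebook), we can confirm that no missing values remain in the imputed columns:

print(df[missing_features + categorical_features].isnull().sum())  # expect all zeros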

Next we drop columns that aren't useful for modeling, namely the ID and the name.

df.drop(columns=['PassengerId', 'Name'], inplace=True)

df.info()

[output: df.info() after the drops]

I noticed that the Cabin feature packs several pieces of information into one string, so we split it into three columns; for example, 'B/0/P' becomes Cabin_deck='B', Cabin_num='0', Cabin_side='P'.

df[['Cabin_deck', 'Cabin_num', 'Cabin_side']] = df['Cabin'].str.split('/', expand=True)

df.drop(columns=['Cabin'], inplace=True)

df.info()

[output: df.info() after splitting Cabin]

Handling Outliers

We first visualize the outliers with boxplots.

features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for col in features:
    sns.boxplot(x=df[col])
    plt.show()

[boxplots of Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck]

Now we treat the outliers. The spend features are strongly right-skewed, so we first apply a log transform (np.log1p) to compress the tail, then clip whatever still falls outside the 1.5 * IQR fences.

features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for col in features:
    # Log-transform to compress the long right tail
    df[col] = np.log1p(df[col])

    # Clip to the 1.5 * IQR fences
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    df[col] = df[col].clip(lower=lower, upper=upper)

# Age is clipped directly with the IQR fences (no log transform)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df['Age'] = df['Age'].clip(lower=lower, upper=upper)
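
The same log1p + 1.5 * IQR clipping recipe appears three times in this walkthrough (the spend features, Age, and again for the test set), so it can be handy to factor it into a helper. A minimal sketch; the name iqr_clip is mine, not from the notebook:

def iqr_clip(s, k=1.5):
    """Clip a numeric pandas Series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# e.g. df[col] = iqr_clip(np.log1p(df[col])), or df['Age'] = iqr_clip(df['Age'])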

features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Re-plot to confirm the clipping took effect
for col in features:
    sns.boxplot(x=df[col])
    plt.show()

[boxplots after the log transform and clipping]

Visual Analysis

features = ['HomePlanet', 'CryoSleep', 'Destination', 'Cabin_deck', 'Cabin_side']

# Counts of Transported vs. not, broken down by each category
for feature in features:
    sns.countplot(x=feature, hue='Transported', data=df)
    plt.title(feature)
    plt.show()

[countplots of each categorical feature, split by Transported]

Feature Encoding

from sklearn.preprocessing import OneHotEncoder

one_hot_features = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP', 'Cabin_deck', 'Cabin_side']

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded_array = ohe.fit_transform(df[one_hot_features])
encoded_cols = ohe.get_feature_names_out(one_hot_features)

encoded_df = pd.DataFrame(encoded_array, columns=encoded_cols, index=df.index)

df_cleaned_encoded = pd.concat([df.drop(columns=one_hot_features), encoded_df], axis=1)
df_cleaned_encoded.drop(columns=['Cabin_num'], inplace=True)

Correlation Matrix

y = df_cleaned_encoded['Transported']
X = df_cleaned_encoded.drop(['Transported'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

correlation_matrix = pd.concat([X_train, y_train], axis=1).corr()
# Set a correlation threshold; mask out cells with |corr| below it
threshold = 0.2
mask = np.abs(correlation_matrix) > threshold

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', mask=~mask, center=0)
plt.show()

[correlation heatmap, showing only cells with |corr| > 0.2]
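
If you prefer the strongly correlated pairs as a list instead of reading them off the heatmap, a small sketch using the same correlation_matrix and threshold:

# Keep the upper triangle only, then list pairs with |corr| above the threshold
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > threshold].sort_values(ascending=False))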

Feature Selection

from sklearn.feature_selection import SelectKBest, chi2

# Use the chi-square test to keep the 9 best features
selector = SelectKBest(chi2, k=9)
X_new = selector.fit_transform(X_train, y_train)
print("Shape after selection:", X_new.shape)
print("Score of each feature:", selector.scores_)
print("Selected:", selector.get_support())

# Rank features by chi-square score
scores = pd.Series(selector.scores_, index=X_train.columns)
scores = scores.sort_values(ascending=False)
print("Highest chi-square scores:\n", scores.head(29))

[output: chi-square scores and selection mask]

print("Lowest chi-square scores:\n", scores.tail(15))

[output: the 15 lowest-scoring features]

We drop the features with the lowest scores.

df_cleaned_encoded.drop(columns=[
    'Cabin_deck_A',
    'Destination_PSO J318.5-22',
    'VIP_False',
    'Cabin_deck_T',
    'Cabin_deck_G',
    'HomePlanet_Mars',
    'Cabin_deck_D',
    'VIP_True',
    'Destination_TRAPPIST-1e',
    'Cabin_side_S',
    'Cabin_side_P',
    'Cabin_deck_F',
    'Cabin_deck_E',
    'Destination_55 Cancri e',
    'Cabin_deck_C'], inplace=True)
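
Hardcoding the drop list is error-prone; the same set can be derived from the fitted selector. A sketch, assuming the selector from the previous step is still in scope:

# Columns the chi-square selector did NOT keep
to_drop = X_train.columns[~selector.get_support()].tolist()
print(to_drop)  # should match the hardcoded list above
# df_cleaned_encoded.drop(columns=to_drop, inplace=True)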

Model Selection

y = df_cleaned_encoded['Transported']
X = df_cleaned_encoded.drop(['Transported'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

y_pred = rf.predict(X_test_scaled)

print(classification_report(y_test, y_pred))

[output: RandomForest classification report, accuracy 0.77]
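
A single 80/20 split can be noisy, so before comparing models it may be worth checking cross-validated scores. A minimal sketch (5-fold accuracy), not part of the original notebook:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                            X_train_scaled, y_train, cv=5, scoring='accuracy')
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))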

import lightgbm as lgb
from sklearn.metrics import classification_report

model = lgb.LGBMClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Feature importance
lgb.plot_importance(model, max_num_features=20)
plt.title("Feature Importance")
plt.show()

[output: LightGBM classification report (accuracy 0.78) and feature-importance plot]

Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()

[confusion matrix heatmap]
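
The headline numbers in the classification report can be recovered directly from this matrix; a small sketch for the binary case (sklearn lays out a binary confusion matrix as [[tn, fp], [fn, tp]]):

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted True, how many were actually True
recall = tp / (tp + fn)      # of actual True, how many we caught
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")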

Predicting on the Test Set

# Prepare the test set with the same steps as the training set
tdf = pd.read_csv('test.csv')

missing_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Fill numeric features with the column mean
for feature in missing_features:
    tdf[feature] = tdf[feature].fillna(tdf[feature].mean())

categorical_features = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

# Fill categorical features with the column mode
for feature in categorical_features:
    tdf[feature] = tdf[feature].fillna(tdf[feature].mode()[0])

tdf.drop(columns=['PassengerId', 'Name'], inplace=True)

tdf[['Cabin_deck', 'Cabin_num', 'Cabin_side']] = tdf['Cabin'].str.split('/', expand=True)

tdf.drop(columns=['Cabin'], inplace=True)

features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for col in features:
    tdf[col] = np.log1p(tdf[col])

    Q1 = tdf[col].quantile(0.25)
    Q3 = tdf[col].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    tdf[col] = tdf[col].clip(lower=lower, upper=upper)

# Note: the Age fences come from the training frame df, so the test set
# is clipped with training statistics
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

tdf['Age'] = tdf['Age'].clip(lower=lower, upper=upper)

from sklearn.preprocessing import OneHotEncoder

one_hot_features = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP', 'Cabin_deck', 'Cabin_side']

# Caveat: the encoder is refitted on the test set; this only lines up with the
# training columns because the test data happens to contain the same categories
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded_array = ohe.fit_transform(tdf[one_hot_features])
encoded_cols = ohe.get_feature_names_out(one_hot_features)

encoded_df = pd.DataFrame(encoded_array, columns=encoded_cols, index=tdf.index)

tdf_cleaned_encoded = pd.concat([tdf.drop(columns=one_hot_features), encoded_df], axis=1)

tdf_cleaned_encoded.drop(columns=['Cabin_num'], inplace=True)

tdf_cleaned_encoded.drop(columns=[
    'Cabin_deck_A',
    'Destination_PSO J318.5-22',
    'VIP_False',
    'Cabin_deck_T',
    'Cabin_deck_G',
    'HomePlanet_Mars',
    'Cabin_deck_D',
    'VIP_True',
    'Destination_TRAPPIST-1e',
    'Cabin_side_S',
    'Cabin_side_P',
    'Cabin_deck_F',
    'Cabin_deck_E',
    'Destination_55 Cancri e',
    'Cabin_deck_C'], inplace=True)

# Scale the test features with the scaler fitted on the training set
test_scaled = scaler.transform(tdf_cleaned_encoded)

model = lgb.LGBMClassifier()
model.fit(X_train_scaled, y_train)

# Predict on the scaled features, since the model was trained on scaled data
y_pred = model.predict(test_scaled)

tdf = pd.read_csv('test.csv')
tdf['pred'] = y_pred

result = pd.DataFrame({
    'PassengerId': tdf['PassengerId'],
    'Transported': tdf['pred']
})

result.to_csv("submission.csv", index=False)
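
Mirroring every preprocessing step by hand for the test set is exactly where mistakes creep in (refitting the encoder, or forgetting to scale before predicting). A safer pattern is to wrap imputation, encoding, scaling, and the model in a scikit-learn Pipeline fitted once on the training data. A sketch under that assumption; it omits the log1p/clipping steps for brevity, and X_train_raw / test_raw are hypothetical frames holding the raw columns listed below:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import lightgbm as lgb

numeric = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
categorical = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP', 'Cabin_deck', 'Cabin_side']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), numeric),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

clf = Pipeline([('prep', preprocess), ('model', lgb.LGBMClassifier())])
clf.fit(X_train_raw, y_train)        # every transform is fitted on training data only
pred = clf.predict(test_raw)         # and reused, never refitted, at prediction time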

References

ChatGPT 4o
Questions asked:

  1. I have a dataset where one feature takes the form "B/0/P". How do I split this into three feature columns on "/"?
  2. How should bool-typed data be handled?
  3. Do bool columns need to be converted to a numeric type?
  4. I want to inspect the relationship between the categorical features and the target.
  5. In my analysis, one column is categorical but its contents are digits. How do I convert it to a numeric type?
  6. How do I include object-typed columns in the correlation matrix to check their correlations?
  7. Which model is a good choice for a classification problem? How about GBDT?
  8. How do I use this model to predict on the test set?
  9. So I need to re-process the test set the same way as the training set?
  10. Suppose I have two columns named id and pred. How do I save them to a new CSV file?
  11. Code problems:
ValueError: Classification metrics can't handle a mix of continuous-multioutput and binary targets

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

model = lgb.LGBMClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(X_test_scaled, y_pred))

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

# Feature importance
import matplotlib.pyplot as plt
lgb.plot_importance(model, max_num_features=20)
plt.title("Feature Importance")
plt.show()

ValueError                                Traceback (most recent call last)
Cell In[67], line 9
6 model.fit(X_train_scaled, y_train)
8 y_pred = model.predict(X_test_scaled)
----> 9 print(classification_report(X_test, y_pred))
11 mse = mean_squared_error(y_test, y_pred)
12 r2 = r2_score(y_test, y_pred)

File /opt/anaconda3/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we replace
217 # the name of the estimator by the name of the function in the error
218 # message to avoid confusion.
219 msg = re.sub(
220 r"parameter of \w+ must be",
221 f"parameter of {func.__qualname__} must be",
...
116 )
118 # We can't have more than one value on y_type => The set is no more needed
119 y_type = y_type.pop()

ValueError: Classification metrics can't handle a mix of continuous-multioutput and binary targets

TypeError                                 Traceback (most recent call last)
Cell In[68], line 11
8 y_pred = model.predict(X_test_scaled)
9 print(classification_report(y_test, y_pred))
---> 11 mse = mean_squared_error(y_test, y_pred)
12 r2 = r2_score(y_test, y_pred)
14 print("Mean Squared Error:", mse)

File /opt/anaconda3/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we replace
217 # the name of the estimator by the name of the function in the error
218 # message to avoid confusion.
219 msg = re.sub(
220 r"parameter of \w+ must be",
221 f"parameter of {func.__qualname__} must be",
...
--> 510 output_errors = np.average((y_true - y_pred) ** 2, axis=0, weights=sample_weight)
512 if isinstance(multioutput, str):
513 if multioutput == "raw_values":

TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
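
Both failures come down to argument types: the first call passes the feature matrix to classification_report where the true labels belong, and the second feeds boolean labels into a regression metric that subtracts arrays. A minimal fix sketch (casting the boolean target to int; MSE and R² are odd metrics for a classifier in the first place):

from sklearn.metrics import classification_report, mean_squared_error, r2_score

print(classification_report(y_test, y_pred))  # true labels first, not X_test

mse = mean_squared_error(y_test.astype(int), y_pred.astype(int))
r2 = r2_score(y_test.astype(int), y_pred.astype(int))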

This is with RandomForestClassifier:

              precision    recall  f1-score   support

       False       0.78      0.75      0.76       861
        True       0.76      0.79      0.77       878

    accuracy                           0.77      1739
   macro avg       0.77      0.77      0.77      1739
weighted avg       0.77      0.77      0.77      1739

This is with GBDT (LightGBM):

[LightGBM] [Info] Number of positive: 3500, number of negative: 3454
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000422 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1356
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503307 -> initscore=0.013230
[LightGBM] [Info] Start training from score 0.013230

              precision    recall  f1-score   support

       False       0.80      0.75      0.77       861
        True       0.77      0.82      0.79       878

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739

Are these results acceptable? Which one is better?

Kaggle Link

My Kaggle notebook for this project:
https://www.kaggle.com/code/super213/randomforest-gbdt-f1-0-77

  • Title: Spaceship Titanic
  • Author: 姜智浩
  • Created at : 2025-04-18 11:45:14
  • Updated at : 2025-04-18 21:26:55
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/04/18/20250418Spaceship Titanic/
  • License: This work is licensed under CC BY-NC-SA 4.0.