Spaceship Titanic

姜智浩

Data Source

https://www.kaggle.com/competitions/spaceship-titanic/data

Data Description

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away, and things aren't looking good.

The Spaceship Titanic is an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage, transporting emigrants from our solar system to three newly discovered habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination, the torrid 55 Cancri E, the careless Spaceship Titanic collided with a spacetime anomaly hidden in a dust cloud. Sadly, it met a fate similar to its namesake from a thousand years before. Though the ship stayed intact, almost half of the passengers were transported to another dimension!

To help rescue crews retrieve the lost passengers, you need to predict which passengers were transported by the anomaly, using records recovered from the spaceship's damaged computer system.

Help save them and change history!

Process

Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('train.csv')

df.head()

[output: preview of the first rows of train.csv]

df.info()

[output: column dtypes and non-null counts]

Let's walk through what each field means:

  • PassengerId: a unique ID for each passenger
  • HomePlanet: the planet the passenger departed from, typically their planet of permanent residence
  • CryoSleep: indicates whether the passenger elected to be put into suspended animation for the duration of the voyage; passengers in cryosleep are confined to their cabins
  • Cabin: the cabin number where the passenger is staying, in the format deck/num/side, where side is P for Port or S for Starboard
  • Destination: the planet the passenger will be debarking to
  • Age: the age of the passenger
  • VIP: whether the passenger has paid for special VIP service during the voyage
  • RoomService: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • FoodCourt: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • ShoppingMall: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • Spa: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • VRDeck: amount the passenger has billed at the Spaceship Titanic's many luxury amenities
  • Name: the first and last names of the passenger
  • Transported: whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

Handling Missing Values

We start by handling the missing values.

df.isnull().sum()

[output: missing-value count per column]

For numeric features, we fill with the mean.

For categorical features, we fill with the mode.

missing_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Fill numeric features with the column mean
for feature in missing_features:
    df[feature] = df[feature].fillna(df[feature].mean())

categorical_features = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

# Fill categorical features with the column mode
for feature in categorical_features:
    df[feature] = df[feature].fillna(df[feature].mode()[0])
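
As a quick sanity check (my addition, not part of the original notebook), we can confirm that no missing values remain in the imputed columns:

print(df[missing_features + categorical_features].isnull().sum())  # expect all zeros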

Next we drop columns that aren't useful for modeling, namely the ID and the name.

df.drop(columns=['PassengerId', 'Name'], inplace=True)

df.info()

[output: df.info() after the drops]

I noticed that the Cabin feature packs several pieces of information into one string, so we split it into three columns; for example, 'B/0/P' becomes Cabin_deck='B', Cabin_num='0', Cabin_side='P'.

df[['Cabin_deck', 'Cabin_num', 'Cabin_side']] = df['Cabin'].str.split('/', expand=True)

df.drop(columns=['Cabin'], inplace=True)

df.info()

[output: df.info() after splitting Cabin]

Handling Outliers

We first visualize the outliers with boxplots.

features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for col in features:
    sns.boxplot(x=df[col])
    plt.show()

[boxplots of Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck]

Now we treat the outliers. The spend features are strongly right-skewed, so we first apply a log transform (np.log1p) to compress the tail, then clip whatever still falls outside the 1.5 * IQR fences.

features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for col in features:
    # Log-transform to compress the long right tail
    df[col] = np.log1p(df[col])

    # Clip to the 1.5 * IQR fences
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    df[col] = df[col].clip(lower=lower, upper=upper)

# Age is clipped directly with the IQR fences (no log transform)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df['Age'] = df['Age'].clip(lower=lower, upper=upper)
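
The same log1p + 1.5 * IQR clipping recipe appears three times in this walkthrough (the spend features, Age, and again for the test set), so it can be handy to factor it into a helper. A minimal sketch; the name iqr_clip is mine, not from the notebook:

def iqr_clip(s, k=1.5):
    """Clip a numeric pandas Series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# e.g. df[col] = iqr_clip(np.log1p(df[col])), or df['Age'] = iqr_clip(df['Age'])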

features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Re-plot to confirm the clipping took effect
for col in features:
    sns.boxplot(x=df[col])
    plt.show()

[boxplots after the log transform and clipping]

Visual Analysis

features = ['HomePlanet', 'CryoSleep', 'Destination', 'Cabin_deck', 'Cabin_side']

# Counts of Transported vs. not, broken down by each category
for feature in features:
    sns.countplot(x=feature, hue='Transported', data=df)
    plt.title(feature)
    plt.show()

[countplots of each categorical feature, split by Transported]

Feature Encoding

from sklearn.preprocessing import OneHotEncoder

one_hot_features = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP', 'Cabin_deck', 'Cabin_side']

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded_array = ohe.fit_transform(df[one_hot_features])
encoded_cols = ohe.get_feature_names_out(one_hot_features)

encoded_df = pd.DataFrame(encoded_array, columns=encoded_cols, index=df.index)

df_cleaned_encoded = pd.concat([df.drop(columns=one_hot_features), encoded_df], axis=1)
df_cleaned_encoded.drop(columns=['Cabin_num'], inplace=True)

Correlation Matrix

y = df_cleaned_encoded['Transported']
X = df_cleaned_encoded.drop(['Transported'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

correlation_matrix = pd.concat([X_train, y_train], axis=1).corr()
# Set a correlation threshold; mask out cells with |corr| below it
threshold = 0.2
mask = np.abs(correlation_matrix) > threshold

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', mask=~mask, center=0)
plt.show()

[correlation heatmap, showing only cells with |corr| > 0.2]
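
If you prefer the strongly correlated pairs as a list instead of reading them off the heatmap, a small sketch using the same correlation_matrix and threshold:

# Keep the upper triangle only, then list pairs with |corr| above the threshold
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > threshold].sort_values(ascending=False))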

Feature Selection

from sklearn.feature_selection import SelectKBest, chi2

# Use the chi-square test to keep the 9 best features
selector = SelectKBest(chi2, k=9)
X_new = selector.fit_transform(X_train, y_train)
print("Shape after selection:", X_new.shape)
print("Score of each feature:", selector.scores_)
print("Selected:", selector.get_support())

# Rank features by chi-square score
scores = pd.Series(selector.scores_, index=X_train.columns)
scores = scores.sort_values(ascending=False)
print("Highest chi-square scores:\n", scores.head(29))

[output: chi-square scores and selection mask]

print("Lowest chi-square scores:\n", scores.tail(15))

[output: the 15 lowest-scoring features]

We drop the features with the lowest scores.

df_cleaned_encoded.drop(columns=[
    'Cabin_deck_A',
    'Destination_PSO J318.5-22',
    'VIP_False',
    'Cabin_deck_T',
    'Cabin_deck_G',
    'HomePlanet_Mars',
    'Cabin_deck_D',
    'VIP_True',
    'Destination_TRAPPIST-1e',
    'Cabin_side_S',
    'Cabin_side_P',
    'Cabin_deck_F',
    'Cabin_deck_E',
    'Destination_55 Cancri e',
    'Cabin_deck_C'], inplace=True)
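
Hardcoding the drop list is error-prone; the same set can be derived from the fitted selector. A sketch, assuming the selector from the previous step is still in scope:

# Columns the chi-square selector did NOT keep
to_drop = X_train.columns[~selector.get_support()].tolist()
print(to_drop)  # should match the hardcoded list above
# df_cleaned_encoded.drop(columns=to_drop, inplace=True)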

Model Selection

y = df_cleaned_encoded['Transported']
X = df_cleaned_encoded.drop(['Transported'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

y_pred = rf.predict(X_test_scaled)

print(classification_report(y_test, y_pred))

[output: RandomForest classification report, accuracy 0.77]
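
A single 80/20 split can be noisy, so before comparing models it may be worth checking cross-validated scores. A minimal sketch (5-fold accuracy), not part of the original notebook:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                            X_train_scaled, y_train, cv=5, scoring='accuracy')
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))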

import lightgbm as lgb
from sklearn.metrics import classification_report

model = lgb.LGBMClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Feature importance
lgb.plot_importance(model, max_num_features=20)
plt.title("Feature Importance")
plt.show()

[output: LightGBM classification report (accuracy 0.78) and feature-importance plot]

Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()

[confusion matrix heatmap]
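
The headline numbers in the classification report can be recovered directly from this matrix; a small sketch for the binary case (sklearn lays out a binary confusion matrix as [[tn, fp], [fn, tp]]):

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted True, how many were actually True
recall = tp / (tp + fn)      # of actual True, how many we caught
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")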

Predicting on the Test Set

# Prepare the test set with the same steps as the training set
tdf = pd.read_csv('test.csv')

missing_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Fill numeric features with the column mean
for feature in missing_features:
    tdf[feature] = tdf[feature].fillna(tdf[feature].mean())

categorical_features = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

# Fill categorical features with the column mode
for feature in categorical_features:
    tdf[feature] = tdf[feature].fillna(tdf[feature].mode()[0])

tdf.drop(columns=['PassengerId', 'Name'], inplace=True)

tdf[['Cabin_deck', 'Cabin_num', 'Cabin_side']] = tdf['Cabin'].str.split('/', expand=True)

tdf.drop(columns=['Cabin'], inplace=True)

features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for col in features:
    tdf[col] = np.log1p(tdf[col])

    Q1 = tdf[col].quantile(0.25)
    Q3 = tdf[col].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    tdf[col] = tdf[col].clip(lower=lower, upper=upper)

# Note: the Age fences come from the training frame df, so the test set
# is clipped with training statistics
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

tdf['Age'] = tdf['Age'].clip(lower=lower, upper=upper)

from sklearn.preprocessing import OneHotEncoder

one_hot_features = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP', 'Cabin_deck', 'Cabin_side']

# Caveat: the encoder is refitted on the test set; this only lines up with the
# training columns because the test data happens to contain the same categories
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded_array = ohe.fit_transform(tdf[one_hot_features])
encoded_cols = ohe.get_feature_names_out(one_hot_features)

encoded_df = pd.DataFrame(encoded_array, columns=encoded_cols, index=tdf.index)

tdf_cleaned_encoded = pd.concat([tdf.drop(columns=one_hot_features), encoded_df], axis=1)

tdf_cleaned_encoded.drop(columns=['Cabin_num'], inplace=True)

tdf_cleaned_encoded.drop(columns=[
    'Cabin_deck_A',
    'Destination_PSO J318.5-22',
    'VIP_False',
    'Cabin_deck_T',
    'Cabin_deck_G',
    'HomePlanet_Mars',
    'Cabin_deck_D',
    'VIP_True',
    'Destination_TRAPPIST-1e',
    'Cabin_side_S',
    'Cabin_side_P',
    'Cabin_deck_F',
    'Cabin_deck_E',
    'Destination_55 Cancri e',
    'Cabin_deck_C'], inplace=True)

# Scale the test features with the scaler fitted on the training set
test_scaled = scaler.transform(tdf_cleaned_encoded)

model = lgb.LGBMClassifier()
model.fit(X_train_scaled, y_train)

# Predict on the scaled features, since the model was trained on scaled data
y_pred = model.predict(test_scaled)

tdf = pd.read_csv('test.csv')
tdf['pred'] = y_pred

result = pd.DataFrame({
    'PassengerId': tdf['PassengerId'],
    'Transported': tdf['pred']
})

result.to_csv("submission.csv", index=False)
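
Mirroring every preprocessing step by hand for the test set is exactly where mistakes creep in (refitting the encoder, or forgetting to scale before predicting). A safer pattern is to wrap imputation, encoding, scaling, and the model in a scikit-learn Pipeline fitted once on the training data. A sketch under that assumption; it omits the log1p/clipping steps for brevity, and X_train_raw / test_raw are hypothetical frames holding the raw columns listed below:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import lightgbm as lgb

numeric = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
categorical = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP', 'Cabin_deck', 'Cabin_side']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), numeric),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

clf = Pipeline([('prep', preprocess), ('model', lgb.LGBMClassifier())])
clf.fit(X_train_raw, y_train)        # every transform is fitted on training data only
pred = clf.predict(test_raw)         # and reused, never refitted, at prediction time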

References

ChatGPT 4o
Questions asked:

  1. I have a dataset where one feature takes the form "B/0/P". How do I split this into three feature columns on "/"?
  2. How should bool-typed data be handled?
  3. Do bool columns need to be converted to a numeric type?
  4. I want to inspect the relationship between the categorical features and the target.
  5. In my analysis, one column is categorical but its contents are digits. How do I convert it to a numeric type?
  6. How do I include object-typed columns in the correlation matrix to check their correlations?
  7. Which model is a good choice for a classification problem? How about GBDT?
  8. How do I use this model to predict on the test set?
  9. So I need to re-process the test set the same way as the training set?
  10. Suppose I have two columns named id and pred. How do I save them to a new CSV file?
  11. Code problems:
ValueError: Classification metrics can't handle a mix of continuous-multioutput and binary targets

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

model = lgb.LGBMClassifier()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(X_test_scaled, y_pred))

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

# Feature importance
import matplotlib.pyplot as plt
lgb.plot_importance(model, max_num_features=20)
plt.title("Feature Importance")
plt.show()

ValueError                                Traceback (most recent call last)
Cell In[67], line 9
6 model.fit(X_train_scaled, y_train)
8 y_pred = model.predict(X_test_scaled)
----> 9 print(classification_report(X_test, y_pred))
11 mse = mean_squared_error(y_test, y_pred)
12 r2 = r2_score(y_test, y_pred)

File /opt/anaconda3/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we replace
217 # the name of the estimator by the name of the function in the error
218 # message to avoid confusion.
219 msg = re.sub(
220 r"parameter of \w+ must be",
221 f"parameter of {func.__qualname__} must be",
...
116 )
118 # We can't have more than one value on y_type => The set is no more needed
119 y_type = y_type.pop()

ValueError: Classification metrics can't handle a mix of continuous-multioutput and binary targets

TypeError                                 Traceback (most recent call last)
Cell In[68], line 11
8 y_pred = model.predict(X_test_scaled)
9 print(classification_report(y_test, y_pred))
---> 11 mse = mean_squared_error(y_test, y_pred)
12 r2 = r2_score(y_test, y_pred)
14 print("Mean Squared Error:", mse)

File /opt/anaconda3/lib/python3.12/site-packages/sklearn/utils/_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we replace
217 # the name of the estimator by the name of the function in the error
218 # message to avoid confusion.
219 msg = re.sub(
220 r"parameter of \w+ must be",
221 f"parameter of {func.__qualname__} must be",
...
--> 510 output_errors = np.average((y_true - y_pred) ** 2, axis=0, weights=sample_weight)
512 if isinstance(multioutput, str):
513 if multioutput == "raw_values":

TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
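
Both failures come down to argument types: the first call passes the feature matrix to classification_report where the true labels belong, and the second feeds boolean labels into a regression metric that subtracts arrays. A minimal fix sketch (casting the boolean target to int; MSE and R² are odd metrics for a classifier in the first place):

from sklearn.metrics import classification_report, mean_squared_error, r2_score

print(classification_report(y_test, y_pred))  # true labels first, not X_test

mse = mean_squared_error(y_test.astype(int), y_pred.astype(int))
r2 = r2_score(y_test.astype(int), y_pred.astype(int))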

This is with RandomForestClassifier:

              precision    recall  f1-score   support

       False       0.78      0.75      0.76       861
        True       0.76      0.79      0.77       878

    accuracy                           0.77      1739
   macro avg       0.77      0.77      0.77      1739
weighted avg       0.77      0.77      0.77      1739

This is with GBDT (LightGBM):

[LightGBM] [Info] Number of positive: 3500, number of negative: 3454
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000422 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1356
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503307 -> initscore=0.013230
[LightGBM] [Info] Start training from score 0.013230

              precision    recall  f1-score   support

       False       0.80      0.75      0.77       861
        True       0.77      0.82      0.79       878

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739

Are these results acceptable? Which one is better?

Kaggle Link

My Kaggle notebook for this project:
https://www.kaggle.com/code/super213/randomforest-gbdt-f1-0-77

  • Title: Spaceship Titanic
  • Author: 姜智浩
  • Created at : 2025-04-18 11:45:14
  • Updated at : 2025-04-18 21:26:55
  • Link: https://super-213.github.io/zhihaojiang.github.io/2025/04/18/20250418Spaceship Titanic/
  • License: This work is licensed under CC BY-NC-SA 4.0.