Data Source

https://www.kaggle.com/datasets/sonalshinde123/eco-driving-behavior-dataset/data

Overview

About Dataset
🚗 Introduction
The Eco-Driving Behavior Dataset is a large-scale synthetic dataset consisting of 30,000 rows, designed to simulate realistic vehicle driving behavior and fuel efficiency patterns. It represents trip-level driving metrics commonly captured by vehicle telematics systems, on-board diagnostics (OBD), and CAN-based sensors.

The dataset is especially useful for machine learning modeling, statistical analysis, and benchmarking, where real-world driving data is often difficult to obtain due to privacy, cost, and accessibility constraints.

🌱 Objective
The primary goal of this dataset is to help analyze how different driving behaviors impact fuel consumption and an overall eco-driving score, which reflects driving efficiency and environmental friendliness.

The dataset allows users to:

Study correlations between driving habits and eco-efficiency
Build predictive models for eco-driving scores
Experiment with feature engineering techniques
Compare regression and classification approaches on the same data

🧩 Feature-Level Explanation

1.rpm_variation
Captures how much the engine RPM fluctuates during a trip. Higher values typically indicate aggressive throttle usage, frequent gear changes, or unstable driving patterns.

2.harsh_braking_count
Represents the number of sudden braking events detected during the trip. Frequent harsh braking is usually associated with unsafe and inefficient driving behavior.

3.idling_time
Measures the total duration (in minutes) when the vehicle engine was running without movement. Excessive idling leads to unnecessary fuel consumption and lower eco-efficiency.

4.fuel_consumption
Indicates the fuel consumption rate for the trip. Higher values generally correspond to inefficient driving or heavy traffic conditions.

5.acceleration_smoothness
A normalized score (0–1) indicating how smoothly the driver accelerates. Higher values reflect smoother acceleration, reduced mechanical stress, and better fuel economy.

6.eco_score
A composite score derived from all behavioral factors. It reflects overall driving efficiency, environmental impact, and smoothness of driving behavior. This variable is ideal as a regression target or can be discretized for classification tasks (e.g., Poor / Average / Good driving).
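
As a rough illustration of the discretization mentioned above, the sketch below bins a handful of made-up eco_score values into Poor / Average / Good with pd.cut. The 0 to 100 score range and the cut points at 40 and 70 are assumptions for the example, not values documented by the dataset.

```python
import pandas as pd

# Hypothetical eco_score values; the real column comes from eco_driving_score.csv.
scores = pd.Series([12.0, 35.5, 48.0, 61.2, 79.9, 93.4], name="eco_score")

# Assumed cut points (0-40 Poor, 40-70 Average, 70-100 Good); adjust to the data.
bins = [0, 40, 70, 100]
labels = ["Poor", "Average", "Good"]

# include_lowest=True keeps a score of exactly 0 inside the first bin.
eco_class = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)
print(eco_class.value_counts().sort_index())
```

If balanced classes matter more than fixed thresholds, pd.qcut with q=3 is a possible alternative, since it splits on quantiles instead of fixed edges.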

🧪 Synthetic Data Generation Approach

The dataset was generated using:

Controlled statistical distributions
Feature inter-dependencies (e.g., harsh braking impacting fuel consumption)
Randomized noise to avoid overly deterministic patterns

Special care was taken to ensure:

Realistic feature ranges
Natural variance across trips
No perfect correlations
ML-friendly structure for experimentation
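
To make the generation approach above more concrete, here is a minimal sketch of how such a dataset could be simulated with NumPy. It is not the dataset author's generator: every distribution, coefficient, and the 0 to 100 score range below is an assumption chosen only to show controlled distributions, feature inter-dependencies, and added noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# All distributions and coefficients below are illustrative assumptions.
rpm_variation = rng.normal(500, 120, n).clip(min=0)
harsh_braking_count = rng.poisson(3, n)
idling_time = rng.gamma(shape=2.0, scale=3.0, size=n)
acceleration_smoothness = rng.beta(5, 2, n)  # stays inside [0, 1]

# Inter-dependency: fuel use rises with braking, idling, and RPM swings, plus noise.
fuel_consumption = (
    5.0
    + 0.3 * harsh_braking_count
    + 0.1 * idling_time
    + 0.004 * rpm_variation
    + rng.normal(0, 0.5, n)
)

# Composite score with its own noise term, so no feature correlates perfectly with it.
raw_score = (
    100
    - 4.0 * fuel_consumption
    - 2.0 * harsh_braking_count
    - 0.02 * rpm_variation
    + 10.0 * acceleration_smoothness
    + rng.normal(0, 3, n)
)

synthetic = pd.DataFrame({
    "rpm_variation": rpm_variation,
    "harsh_braking_count": harsh_braking_count,
    "idling_time": idling_time,
    "fuel_consumption": fuel_consumption,
    "acceleration_smoothness": acceleration_smoothness,
    "eco_score": raw_score.clip(0, 100),
})
print(synthetic.corr()["eco_score"].round(2))
```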

Import Libraries

import pandas as pd

df = pd.read_csv('eco_driving_score.csv')
df.head()

EDA

import matplotlib.pyplot as plt
import seaborn as sns

for col in df.columns:
    plt.hist(df[col], bins=100)
    plt.title(col)
    plt.show()

sns.pairplot(df)
plt.show()

sns.heatmap(df.corr())
plt.show()

cols = ['rpm_variation', 'harsh_braking_count', 'idling_time', 'fuel_consumption', 'acceleration_smoothness']
for col in cols:
    df[col + '_bin'] = pd.cut(df[col], bins=20)
    binned = df.groupby(col + '_bin')['eco_score'].mean()
    bin_centers = [interval.mid for interval in binned.index]

    df.drop(columns=[col + '_bin'], inplace=True)

    plt.plot(bin_centers, binned.values, marker='o')
    plt.title('Mean Eco Score VS ' + col + ' Bin')
    plt.xlabel(col)
    plt.ylabel('Mean Eco Score')
    plt.grid(True)
    plt.show()

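As idling time increases, the eco score drops. There is one anomalous point, though: once idling time exceeds 27 minutes, the mean score goes up instead.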
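
One thing worth checking before reading much into that last point is how many trips fall into each bin, because a mean over a handful of rows is unstable. The sketch below uses made-up stand-in data (in the notebook you would use df directly) and a threshold of 30 rows that is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; in the notebook, use the loaded df instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"idling_time": rng.gamma(2.0, 3.0, 5_000)})
demo["eco_score"] = 80 - 1.5 * demo["idling_time"] + rng.normal(0, 5, len(demo))

# Same 20 equal-width bins as the plot, but keep the row count next to the mean.
bins = pd.cut(demo["idling_time"], bins=20)
summary = demo.groupby(bins, observed=False)["eco_score"].agg(["mean", "count"])

# Bins with very few trips give unstable means; flag them before interpreting.
summary["low_support"] = summary["count"] < 30  # 30 is an arbitrary threshold
print(summary)
```

If the bins beyond 27 minutes turn out to hold only a few trips, the upturn may simply be noise in the tail.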

plt.boxplot(df['idling_time'])
plt.show()

Q1 = df["idling_time"].quantile(0.25)
Q3 = df["idling_time"].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df["idling_time"] < (Q1 - 1.5 * IQR)) | (df["idling_time"] > (Q3 + 1.5 * IQR)))]

plt.boxplot(df['idling_time'])
plt.show()
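
The IQR rule used above can also be wrapped in a small helper so it is easy to reuse on other columns. This is a sketch, and the name iqr_filter and the toy data are hypothetical. One difference from the inline version: between() drops rows where the column is NaN, while the inline mask keeps them.

```python
import pandas as pd

def iqr_filter(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends; NaN values are not kept.
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

# Toy data with one obvious outlier (40).
demo = pd.DataFrame({"idling_time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]})
filtered = iqr_filter(demo, "idling_time")
print(len(demo), "->", len(filtered))
```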

df['idling_time_bin'] = pd.cut(df['idling_time'], bins=20)
binned = df.groupby('idling_time_bin')['eco_score'].mean()
bin_centers = [interval.mid for interval in binned.index]

df.drop(columns=['idling_time_bin'], inplace=True)

plt.plot(bin_centers, binned.values, marker='o')
plt.title('Mean Eco Score VS Idling Time Bin')
plt.xlabel('Idling Time Bin')
plt.ylabel('Mean Eco Score')
plt.grid(True)
plt.show()

Train/Test Split

from sklearn.model_selection import train_test_split
X = df.drop('eco_score', axis=1)
y = df['eco_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Selection

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rm = RandomForestRegressor()
rm.fit(X_train, y_train)
r2 = rm.score(X_test, y_test)
r2

0.9250779240975352

xgb = XGBRegressor()
xgb.fit(X_train, y_train)

xgb_r2 = xgb.score(X_test, y_test)
xgb_r2

0.9244345372632315
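
The two R2 values above come from a single train/test split and differ only in the fourth decimal place, so a cross-validated comparison with a second metric may give a steadier picture. The sketch below is an assumption-laden example: it runs on synthetic make_regression data, and it swaps in scikit-learn's GradientBoostingRegressor for XGBRegressor so the snippet depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X and y; replace with the real features and eco_score.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    cv = cross_validate(
        model, X_demo, y_demo, cv=5,
        scoring=("r2", "neg_mean_absolute_error"),
    )
    results[name] = {
        "r2": cv["test_r2"].mean(),
        # The scorer returns negated MAE, so flip the sign back.
        "mae": -cv["test_neg_mean_absolute_error"].mean(),
    }
    print(name, results[name])
```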

Random Forest Tuning (RandomizedSearchCV)

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    rm,
    param_distributions=param_dist,
    n_iter=100,         # try 100 random parameter combinations
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print("Best params (Random):", random_search.best_params_)

print(f"Best R2 (Random): {random_search.best_estimator_.score(X_test, y_test)}")

Best params (Random): {'max_depth': 44, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 12, 'n_estimators': 200}
Best R2 (Random): 0.9286245296105547
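
Note that the search above is scored with neg_mean_squared_error while the printed number is a test-set R2. The cross-validated score that actually drove the selection is stored in best_score_. The sketch below shows how to read it back as an RMSE-like value. It uses synthetic data and a deliberately tiny search space, both of which are assumptions to keep the example fast.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train and y_train.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(20, 60), "max_depth": randint(3, 10)},
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate.
# With neg_mean_squared_error it is negative, so flip the sign before the root.
cv_rmse = np.sqrt(-search.best_score_)
print("Best params:", search.best_params_)
print("Root of mean CV MSE for the best candidate:", cv_rmse)
```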

rm = RandomForestRegressor(
    max_depth=44,
    min_samples_split=12,
    min_samples_leaf=1,
    max_features='sqrt',
    n_estimators=200,
)
rm.fit(X_train, y_train)

r2_score = rm.score(X_test, y_test)
print(f"R2 Score: {r2_score}")

R2 Score: 0.928480167672461
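
The refit score differs slightly from the one printed after the search. A likely reason is that the forest is built without a fixed random_state, so bootstrap samples and feature subsets change between runs. The sketch below, on synthetic data with hypothetical seeds, shows that fixing random_state makes the score repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and eco_score.
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def fit_and_score(seed: int) -> float:
    """Train a small forest with the given seed and return its test R2."""
    model = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=seed)
    return model.fit(X_tr, y_tr).score(X_te, y_te)

score_a = fit_and_score(0)
score_b = fit_and_score(0)  # same seed: identical forest, identical score
score_c = fit_and_score(1)  # different seed: usually a slightly different score
print(score_a, score_b, score_c)
```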

Feature Importance

import shap
explainer = shap.TreeExplainer(rm)
shap_values = explainer.shap_values(X_train, approximate=True)

shap.summary_plot(shap_values, X_train, plot_type="dot")
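
As a cross-check on the SHAP ranking, permutation importance on the held-out set is one option. It is a different technique from SHAP: it measures how much the test score drops when a single feature is shuffled. The sketch below runs on synthetic data with generic feature indices, so the printed ranking says nothing about the eco-driving features themselves.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 features, only 3 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, n_informative=3, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the mean drop in R2.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```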

Analysis

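1 `fuel_consumption`

It has the widest range of SHAP values, so it has the strongest effect on the model output.
Red points (high feature values) cluster on the right → when fuel consumption is high, the model tends to predict a higher output.
Blue points (low feature values) cluster on the left → low fuel consumption → lower predicted value.
Conclusion: the higher the fuel consumption, the larger the predicted value → this represents "more fuel-hungry", "less economical" driving.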

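2 `harsh_braking_count`

Second-largest effect.
High values (red) cluster on the right → frequent harsh braking → higher predicted value → may represent "dangerous driving".
Low values (blue) cluster on the left → little harsh braking → lower predicted value → safer, smoother driving.
Conclusion: the more harsh braking events, the higher the driving risk/cost as judged by the model.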

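3 `rpm_variation`

Moderate effect.
The distribution is fairly symmetric, but high values lean slightly to the right → large RPM fluctuation → slightly higher predicted value.
Conclusion: unstable RPM may be treated by the model as a "bad driving habit".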

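4 `acceleration_smoothness`

Smaller effect (SHAP values of roughly -10 to +10).
Interestingly, high values cluster on the left → the smoother the acceleration, the more negative the SHAP value → the predicted value goes down.
Conclusion: smoother acceleration → lower predicted value → represents better driving behavior.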

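5 `idling_time`

Smallest effect (SHAP values of roughly -8 to +8).
High values (red) sit on the right → long idling time → higher predicted value → represents "wasted fuel" or "low efficiency".
Conclusion: the longer the idling time, the higher the predicted value → undesirable behavior.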

Summary of Overall Model Behavior

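Feature	Direction of effect on model output	Meaning
fuel_consumption	↑ high → output ↑	More fuel use, higher risk/cost
harsh_braking_count	↑ frequent → output ↑	Frequent harsh braking, higher driving risk
rpm_variation	↑ large fluctuation → output ↑	Unstable RPM, poor driving practice
acceleration_smoothness	↑ smooth → output ↓	Smoother acceleration, better driving
idling_time	↑ long → output ↑	Long idling, low efficiency/waste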
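
Because the model is trained to predict eco_score, the directions in the table can be cross-checked against the sign of each feature's rank correlation with eco_score in the raw data. The sketch below shows the mechanics on made-up data. The two features and their assumed effects are hypothetical, and in the notebook you would call df.corr(method="spearman") on the real frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical stand-in data with two of the feature names from the post.
demo = pd.DataFrame({
    "fuel_consumption": rng.normal(8, 2, n),
    "acceleration_smoothness": rng.uniform(0, 1, n),
})
# Assumed target: lower with fuel use, higher with smooth acceleration, plus noise.
demo["eco_score"] = (
    90
    - 4 * demo["fuel_consumption"]
    + 10 * demo["acceleration_smoothness"]
    + rng.normal(0, 3, n)
)

# Spearman correlation captures monotonic direction without assuming linearity.
directions = demo.corr(method="spearman")["eco_score"].drop("eco_score")
print(directions.round(2))
```

A positive value means eco_score tends to rise with the feature, and a negative value means it tends to fall. If a sign here disagrees with the SHAP direction read off the plot, the plot is worth a second look.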

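Recommendations

Improve driving behavior:

Reduce harsh braking, limit RPM fluctuation, and accelerate more smoothly → this can significantly lower the model's predicted "risk score".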

智浩的Blog

Eco-Driving Behavior
