GitHub repository: https://github.com/super-213/dataanalysistools

information

summarize_df()

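Summarizes the basic information of a DataFrame, including head, describe, info, and related output.

Parameters:
df (pd.DataFrame): the DataFrame to summarize
head_rows (int): number of leading rows to show; defaults to 5

Returns:
dict: a dictionary with the keys head, tail, describe, info, dtypes, and missing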
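To make the shape of the returned dictionary concrete, here is a minimal sketch built on plain pandas. The function name `summarize_sketch` is hypothetical, and the way each key is computed is an assumption; the library's own `summarize_df()` may differ in detail.

```python
import io

import pandas as pd


def summarize_sketch(df: pd.DataFrame, head_rows: int = 5) -> dict:
    """Hypothetical stand-in for summarize_df(); not the library's code."""
    # DataFrame.info() prints to stdout by default, so capture it in a buffer.
    buffer = io.StringIO()
    df.info(buf=buffer)
    return {
        "head": df.head(head_rows),
        "tail": df.tail(head_rows),
        "describe": df.describe(include="all"),
        "info": buffer.getvalue(),
        "dtypes": df.dtypes,
        "missing": df.isna().sum(),  # assumed: count of missing values per column
    }


demo = pd.DataFrame({"age": [25, 31, None, 47], "city": ["A", "B", "B", None]})
summary = summarize_sketch(demo, head_rows=2)
print(summary["missing"])
```

Returning a dictionary rather than printing everything lets a caller pick out just the piece it needs, for example `summary["missing"]`.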

clean

drop_columns()

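Removes the specified columns, or all columns of the specified dtypes, from a DataFrame.

Parameters
df : pd.DataFrame
The DataFrame to process
cols_to_drop : iterable of str, optional
Names of the columns to drop (e.g. ['id', 'name'])
dtypes_to_drop : iterable of str, optional
Dtypes whose columns should be dropped (e.g. ['object', 'datetime64[ns]'])
verbose : bool, default False
Whether to print the names of the dropped columns

Returns
pd.DataFrame
A copy with the columns removed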
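The following sketch shows one way to combine dropping by name and dropping by dtype. `drop_columns_sketch` is a hypothetical name, and using `select_dtypes` for the dtype lookup is an assumption about how such a function could work, not a description of the library's code.

```python
import pandas as pd


def drop_columns_sketch(df, cols_to_drop=None, dtypes_to_drop=None, verbose=False):
    """Hypothetical stand-in for drop_columns(); not the library's code."""
    # Ignore names that are not present so the call does not raise.
    to_drop = [c for c in (cols_to_drop or []) if c in df.columns]
    if dtypes_to_drop:
        typed = df.select_dtypes(include=list(dtypes_to_drop)).columns
        to_drop.extend(c for c in typed if c not in to_drop)
    if verbose:
        print(f"Dropping columns: {to_drop}")
    # DataFrame.drop returns a new object; the input is left untouched.
    return df.drop(columns=to_drop)


demo = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.7]})
result = drop_columns_sketch(demo, cols_to_drop=["id"], dtypes_to_drop=["float64"])
print(list(result.columns))
```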

impute()

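General-purpose missing-value imputation function.

Parameters
df : DataFrame
The original data
strategy : {"mean", "median", "mode", "constant"}
Imputation strategy
constant_value : any
Fill value used when strategy="constant"

Returns
DataFrame
An imputed copy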
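As an illustration of the four strategies, here is a minimal sketch. `impute_sketch` is a hypothetical name, and details such as applying mean and median only to numeric columns are assumptions rather than documented behavior of `impute()`.

```python
import pandas as pd


def impute_sketch(df, strategy="mean", constant_value=None):
    """Hypothetical stand-in for impute(); not the library's code."""
    result = df.copy()
    if strategy == "constant":
        if constant_value is None:
            raise ValueError("constant_value is required when strategy='constant'")
        return result.fillna(constant_value)
    numeric_cols = result.select_dtypes(include="number").columns
    if strategy == "mean":
        fill = result[numeric_cols].mean()
    elif strategy == "median":
        fill = result[numeric_cols].median()
    elif strategy == "mode":
        # mode() can return several rows on ties; take the first one.
        fill = result.mode().iloc[0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # fillna with a Series fills each column with the value under its label.
    return result.fillna(fill)


demo = pd.DataFrame({"x": [1.0, None, 3.0], "y": [10.0, 20.0, None]})
filled_mean = impute_sketch(demo, strategy="mean")
filled_zero = impute_sketch(demo, strategy="constant", constant_value=0)
print(filled_mean)
```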

detect_outliers_zscore()

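Detects outliers with the Z-score method and can optionally remove them.

Parameters
df : pd.DataFrame
The original data
cols : list[str], optional
Columns to check (defaults to all numeric columns)
threshold : float
Z-score threshold; values beyond it are treated as outliers
method : {'both', 'high', 'low'}
Direction of outlier detection:
- 'both': check both tails (default)
- 'high': flag only high outliers (z > threshold)
- 'low': flag only low outliers (z < -threshold)
return_zscore : bool
Whether to return the Z-scores of the outliers (appended to the result)
drop : bool
Whether to remove the outlier rows directly (returns the cleaned DataFrame)
verbose : bool
Whether to print the number of outliers per column

Returns
If drop=False and return_zscore=False:
the outlier rows (pd.DataFrame)
If drop=False and return_zscore=True:
a DataFrame of the outliers that includes their Z-scores
If drop=True:
the DataFrame with the outliers removed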
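The direction logic described above can be sketched as follows. `zscore_outliers_sketch` is a hypothetical name, and both the default threshold of 3.0 and the use of the population standard deviation (`ddof=0`) are assumptions; the library may choose differently.

```python
import pandas as pd


def zscore_outliers_sketch(df, cols=None, threshold=3.0, method="both", drop=False):
    """Hypothetical stand-in for detect_outliers_zscore(); not the library's code."""
    if cols is None:
        cols = list(df.select_dtypes(include="number").columns)
    # Column-wise Z-scores; ddof=0 is an assumption.
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    if method == "high":
        flags = z > threshold
    elif method == "low":
        flags = z < -threshold
    elif method == "both":
        flags = z.abs() > threshold
    else:
        raise ValueError(f"unknown method: {method!r}")
    # A row is an outlier if any checked column is flagged.
    is_outlier = flags.any(axis=1)
    return df[~is_outlier] if drop else df[is_outlier]


demo = pd.DataFrame({"value": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]})
outliers = zscore_outliers_sketch(demo, threshold=2.5)
cleaned = zscore_outliers_sketch(demo, threshold=2.5, drop=True)
low_only = zscore_outliers_sketch(demo, threshold=2.5, method="low")
print(outliers)
```

The demo uses a threshold of 2.5 because in a sample of n points the largest attainable Z-score (with `ddof=0`) is (n - 1) / sqrt(n), about 2.85 for n = 10, so a threshold of 3 could never fire on so few rows.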

detect_outliers_iqr()

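Detects outliers with the IQR (interquartile range) method and can optionally remove them.

Parameters
df : pd.DataFrame
The original data
cols : list[str], optional
Columns to check (defaults to all numeric columns)
multiplier : float
IQR multiplier used as the threshold (default 1.5)
method : {"both", "high", "low"}
Controls the outlier direction:
- "both": check both the upper and lower ends
- "high": flag only upper outliers (value > Q3 + multiplier × IQR)
- "low": flag only lower outliers (value < Q1 - multiplier × IQR)
return_bounds : bool
Whether to return the lower and upper bounds of each column (the decision lines derived from Q1 and Q3)
drop : bool
Whether to remove the outlier rows directly (returns the cleaned df)
verbose : bool
Whether to print the number of outliers per column

Returns
If drop=False:
a DataFrame of the outliers
If drop=True:
the DataFrame with the outliers removed
If return_bounds=True:
a tuple (outlier DataFrame or cleaned df, bounds dictionary)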
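A minimal sketch of the IQR rule, including the optional bounds, might look like this. `iqr_outliers_sketch` is a hypothetical name, and the layout of the bounds dictionary (column name mapped to a `(lower, upper)` tuple) is an assumption.

```python
import pandas as pd


def iqr_outliers_sketch(df, cols=None, multiplier=1.5, method="both",
                        drop=False, return_bounds=False):
    """Hypothetical stand-in for detect_outliers_iqr(); not the library's code."""
    if cols is None:
        cols = list(df.select_dtypes(include="number").columns)
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Compare each column against its own bound.
    high = df[cols].gt(upper, axis="columns")
    low = df[cols].lt(lower, axis="columns")
    if method == "high":
        flags = high
    elif method == "low":
        flags = low
    elif method == "both":
        flags = high | low
    else:
        raise ValueError(f"unknown method: {method!r}")
    is_outlier = flags.any(axis=1)
    result = df[~is_outlier] if drop else df[is_outlier]
    if return_bounds:
        # Assumed layout: {column: (lower, upper)}.
        bounds = {c: (float(lower[c]), float(upper[c])) for c in cols}
        return result, bounds
    return result


demo = pd.DataFrame({"value": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]})
outliers, bounds = iqr_outliers_sketch(demo, return_bounds=True)
cleaned = iqr_outliers_sketch(demo, drop=True)
print(bounds)
```

Unlike the Z-score rule, the quartiles are barely affected by the extreme value itself, which is why this rule still works on small samples.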

detect_outliers_boxplot()

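Visualizes outliers in bulk using box plots.
df: pandas DataFrame
features: list of (numeric) features to inspect
max_per_page: maximum number of subplots per page
figsize: figure size of each page
dpi: figure resolution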

handle_outliers()

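Handles outliers by removing, replacing, or clipping them.
Parameters:
- df : the original DataFrame
- cols : columns to process (defaults to the numeric columns)
- method : outlier detection method (zscore or iqr)
- strategy : handling strategy (remove, replace, or clip)
- z_thresh : Z-score threshold
- iqr_mult : IQR multiplier
- direction : direction of outlier detection (upper or lower)
- verbose : whether to print information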
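The three strategies can be illustrated with IQR-based detection as below. This is only a sketch: `handle_outliers_sketch` is a hypothetical name, the strategy strings "remove", "replace", and "clip" are assumed spellings, and replacing outliers with the median of the remaining values is an assumed choice, since the text does not say what the replacement value is.

```python
import pandas as pd


def handle_outliers_sketch(df, cols=None, strategy="clip", iqr_mult=1.5):
    """Hypothetical stand-in for handle_outliers(); not the library's code."""
    result = df.copy()
    if cols is None:
        cols = list(result.select_dtypes(include="number").columns)
    for col in cols:
        q1 = result[col].quantile(0.25)
        q3 = result[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - iqr_mult * iqr, q3 + iqr_mult * iqr
        mask = (result[col] < lower) | (result[col] > upper)
        if strategy == "remove":
            # Drop the rows that hold an outlier in this column.
            result = result[~mask].copy()
        elif strategy == "replace":
            # Assumed replacement value: median of the non-outlier values.
            result.loc[mask, col] = result.loc[~mask, col].median()
        elif strategy == "clip":
            # Pull extreme values back to the nearest bound.
            result[col] = result[col].clip(lower, upper)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return result


demo = pd.DataFrame(
    {"value": [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 100.0]}
)
clipped = handle_outliers_sketch(demo, strategy="clip")
replaced = handle_outliers_sketch(demo, strategy="replace")
removed = handle_outliers_sketch(demo, strategy="remove")
print(clipped["value"].max())
```

Clipping and replacing keep the row count intact, which matters when other columns of the same row are still useful; removing is simpler but loses data.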

EAD

pairwise_plot()

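Plots pairwise combinations of features in batches; each figure holds at most 9 subplots.

Parameters:
- df: the DataFrame holding the data
- features: list of feature column names to combine
- plot_type: chart type: 'scatter', 'box', 'violin', or 'hist'
- hue: categorical variable (optional)
- max_per_figure: maximum number of subplots per page
- figsize: overall size of each figure
- save: whether to save the figures
- save_prefix: filename prefix for the saved figures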
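The batching behind "at most 9 subplots per figure" can be sketched without any plotting code. `paginate_pairs` is a hypothetical helper, and treating the pairs as unordered combinations (so that (a, b) and (b, a) count once) is an assumption.

```python
from itertools import combinations


def paginate_pairs(features, max_per_figure=9):
    """Hypothetical helper: split feature pairs into per-figure batches."""
    # Unordered pairs, assumed; n features give n * (n - 1) / 2 pairs.
    pairs = list(combinations(features, 2))
    return [
        pairs[i:i + max_per_figure]
        for i in range(0, len(pairs), max_per_figure)
    ]


pages = paginate_pairs(["age", "bmi", "glucose", "insulin", "pressure"])
for number, page in enumerate(pages, start=1):
    print(f"figure {number}: {len(page)} subplots")
```

Each batch would then be drawn on its own figure, which keeps the subplots readable when the feature list is long.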

plot_all_barplots()

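Plots every column of the DataFrame (other than the excluded columns) as a bar chart in its own subplot, grouped by category.

Parameters:
- df: pandas DataFrame containing the data and the category column
- hue: str, name of the column that holds the categories
- n_cols: int, number of plots per row

Returns:
- None (the whole figure is displayed directly)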

plot_feature_distributions()

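Plots feature distributions in batches (histogram + KDE).

Parameters:
- df: pandas DataFrame
- features: columns whose distributions to inspect (defaults to the numeric columns)
- hue: grouping variable (name of a categorical column)
- bins: number of histogram bins
- kde: whether to draw the KDE curve
- max_per_figure: number of plots per page
- figsize: figure size of each page
- save: whether to save the figures
- save_prefix: filename prefix for the saved figures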

plot_all_boxplots()

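Plots every numeric column of the DataFrame as a box plot in its own subplot.

Parameters:
- df: pandas DataFrame
- n_cols: number of plots per row
- exclude_columns: list of column names to exclude, e.g. category columns

Returns:
- None (the whole figure is displayed directly)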

features

bin_continuous_variable()

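Bins a continuous variable.

Parameters:
df (pd.DataFrame): input DataFrame.
column (str): name of the column to bin.
method (str): binning method: 'equal_width', 'equal_freq', or 'custom'.
bins (int or List[float]): number of bins, or custom bin edges.
labels (List[str], optional): bin labels. If omitted, default intervals or integer codes are used.
return_interval (bool): whether to show intervals (only applies to the default labels).
new_col_name (str): name of the new binned column; defaults to the original column name + "_binned".
drop_original (bool): whether to drop the original column.
inplace (bool): whether to modify the original df.

Returns:
pd.DataFrame: the processed DataFrame.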
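The three binning methods map naturally onto `pd.cut` and `pd.qcut`. The sketch below covers only a subset of the parameters; `bin_sketch` is a hypothetical name, and the mapping of each method to a pandas call is an assumption about how such a function could be built.

```python
import pandas as pd


def bin_sketch(df, column, method="equal_width", bins=3, labels=None,
               new_col_name=None):
    """Hypothetical stand-in for bin_continuous_variable(); not the library's code."""
    result = df.copy()
    new_col = new_col_name or f"{column}_binned"
    if method == "equal_width":
        # bins is an int: intervals of equal width over the value range.
        result[new_col] = pd.cut(result[column], bins=bins, labels=labels)
    elif method == "equal_freq":
        # bins is an int: quantile-based intervals with similar row counts.
        result[new_col] = pd.qcut(result[column], q=bins, labels=labels)
    elif method == "custom":
        # bins is a list of edges supplied by the caller.
        result[new_col] = pd.cut(result[column], bins=bins, labels=labels,
                                 include_lowest=True)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return result


demo = pd.DataFrame({"age": [5, 17, 30, 45, 62, 80]})
custom = bin_sketch(demo, "age", method="custom", bins=[0, 18, 40, 65, 100],
                    labels=["child", "young", "middle", "senior"])
by_freq = bin_sketch(demo, "age", method="equal_freq", bins=3,
                     labels=["low", "mid", "high"])
print(custom)
```

Equal-width bins suit roughly uniform data, while equal-frequency bins keep group sizes balanced when the distribution is skewed.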

Usage example:

https://super-213.github.io/zhihaojiang.github.io/2025/05/18/20250518%E7%B3%96%E5%B0%BF%E7%97%85%E9%A2%84%E6%B5%8B%E5%88%86%E6%9E%90/

See the GitHub repository for details.
