Industrial Steam Prediction

Competition Background
The basic principle of thermal power generation is that burning fuel heats water to produce steam, the steam pressure drives a turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the key factor for generation efficiency is the boiler's combustion efficiency, i.e. how effectively the burning fuel heats the water into high-temperature, high-pressure steam. Combustion efficiency is affected by many factors, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, return-material air and feed-water volume, as well as the boiler's operating conditions, such as bed temperature and pressure, furnace temperature and pressure, and superheater temperature.
Problem Description
Given anonymized boiler sensor data (collected at minute-level frequency), predict the amount of steam produced from the boiler's operating conditions (a regression task).
Data Description
The data are split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0"-"V37" are the feature variables and "target" is the target variable. Contestants train a model on the training data, predict the target for the test data, and are ranked by the MSE (mean squared error) of their predictions.
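For reference, the MSE used for ranking is the average squared difference between predictions and true targets:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the true target of sample $i$, $\hat{y}_i$ is the predicted value and $n$ is the number of test samples; lower is better.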

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# data path
train_data_file = "./data/zhengqi_train.txt"
test_data_file = "./data/zhengqi_test.txt"

EDA Stage

Basic data exploration

train_data = pd.read_csv(train_data_file, sep="\t")
train_data.head()
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V29 V30 V31 V32 V33 V34 V35 V36 V37 target
0 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.360 -0.436 -2.114 ... 0.136 0.109 -0.615 0.327 -4.627 -4.789 -5.101 -2.608 -3.508 0.175
1 0.968 0.437 0.066 0.566 0.194 -0.893 -1.566 -2.360 0.332 -2.114 ... -0.128 0.124 0.032 0.600 -0.843 0.160 0.364 -0.335 -0.730 0.676
2 1.013 0.568 0.235 0.370 0.112 -0.797 -1.367 -2.360 0.396 -2.114 ... -0.009 0.361 0.277 -0.116 -0.843 0.160 0.364 0.765 -0.589 0.633
3 0.733 0.368 0.283 0.165 0.599 -0.679 -1.200 -2.086 0.403 -2.114 ... 0.015 0.417 0.279 0.603 -0.843 -0.065 0.364 0.333 -0.112 0.206
4 0.684 0.638 0.260 0.209 0.337 -0.454 -1.073 -2.086 0.314 -2.114 ... 0.183 1.078 0.328 0.418 -0.843 -0.215 0.364 -0.280 -0.028 0.384

5 rows × 39 columns

# no missing values
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 3   V3      2888 non-null   float64
 4   V4      2888 non-null   float64
 5   V5      2888 non-null   float64
 6   V6      2888 non-null   float64
 7   V7      2888 non-null   float64
 8   V8      2888 non-null   float64
 9   V9      2888 non-null   float64
 10  V10     2888 non-null   float64
 11  V11     2888 non-null   float64
 12  V12     2888 non-null   float64
 13  V13     2888 non-null   float64
 14  V14     2888 non-null   float64
 15  V15     2888 non-null   float64
 16  V16     2888 non-null   float64
 17  V17     2888 non-null   float64
 18  V18     2888 non-null   float64
 19  V19     2888 non-null   float64
 20  V20     2888 non-null   float64
 21  V21     2888 non-null   float64
 22  V22     2888 non-null   float64
 23  V23     2888 non-null   float64
 24  V24     2888 non-null   float64
 25  V25     2888 non-null   float64
 26  V26     2888 non-null   float64
 27  V27     2888 non-null   float64
 28  V28     2888 non-null   float64
 29  V29     2888 non-null   float64
 30  V30     2888 non-null   float64
 31  V31     2888 non-null   float64
 32  V32     2888 non-null   float64
 33  V33     2888 non-null   float64
 34  V34     2888 non-null   float64
 35  V35     2888 non-null   float64
 36  V36     2888 non-null   float64
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB
train_data.describe().T
count mean std min 25% 50% 75% max
V0 2888.0 0.123048 0.928031 -4.335 -0.29700 0.3590 0.72600 2.121
V1 2888.0 0.056068 0.941515 -5.122 -0.22625 0.2725 0.59900 1.918
V2 2888.0 0.289720 0.911236 -3.420 -0.31300 0.3860 0.91825 2.828
V3 2888.0 -0.067790 0.970298 -3.956 -0.65225 -0.0445 0.62400 2.457
V4 2888.0 0.012921 0.888377 -4.742 -0.38500 0.1100 0.55025 2.689
V5 2888.0 -0.558565 0.517957 -2.182 -0.85300 -0.4660 -0.15400 0.489
V6 2888.0 0.182892 0.918054 -4.576 -0.31000 0.3880 0.83125 1.895
V7 2888.0 0.116155 0.955116 -5.048 -0.29500 0.3440 0.78225 1.918
V8 2888.0 0.177856 0.895444 -4.692 -0.15900 0.3620 0.72600 2.245
V9 2888.0 -0.169452 0.953813 -12.891 -0.39000 0.0420 0.04200 1.335
V10 2888.0 0.034319 0.968272 -2.584 -0.42050 0.1570 0.61925 4.830
V11 2888.0 -0.364465 0.858504 -3.160 -0.80325 -0.1120 0.24700 1.455
V12 2888.0 0.023177 0.894092 -5.165 -0.41900 0.1230 0.61600 2.657
V13 2888.0 0.195738 0.922757 -3.675 -0.39800 0.2895 0.86425 2.475
V14 2888.0 0.016081 1.015585 -2.455 -0.66800 -0.1610 0.82975 2.558
V15 2888.0 0.096146 1.033048 -2.903 -0.66225 -0.0005 0.73000 4.314
V16 2888.0 0.113505 0.983128 -5.981 -0.30000 0.3060 0.77425 2.861
V17 2888.0 -0.043458 0.655857 -2.224 -0.36600 0.1650 0.43000 2.023
V18 2888.0 0.055034 0.953466 -3.582 -0.36750 0.0820 0.51325 4.441
V19 2888.0 -0.114884 1.108859 -3.704 -0.98750 -0.0005 0.73725 3.431
V20 2888.0 -0.186226 0.788511 -3.402 -0.67550 -0.1565 0.30400 3.525
V21 2888.0 -0.056556 0.781471 -2.643 -0.51700 -0.0565 0.43150 2.259
V22 2888.0 0.302893 0.639186 -1.375 -0.06300 0.2165 0.87200 2.018
V23 2888.0 0.155978 0.978757 -5.542 0.09725 0.3380 0.36825 1.906
V24 2888.0 -0.021813 1.033403 -1.344 -1.19100 0.0950 0.93125 2.423
V25 2888.0 -0.051679 0.915957 -3.808 -0.55725 -0.0760 0.35600 7.284
V26 2888.0 0.072092 0.889771 -5.131 -0.45200 0.0750 0.64425 2.980
V27 2888.0 0.272407 0.270374 -1.164 0.15775 0.3250 0.44200 0.925
V28 2888.0 0.137712 0.929899 -2.435 -0.45500 -0.4470 0.73000 4.671
V29 2888.0 0.097648 1.061200 -2.912 -0.66400 -0.0230 0.74525 4.580
V30 2888.0 0.055477 0.901934 -4.507 -0.28300 0.0535 0.48800 2.689
V31 2888.0 0.127791 0.873028 -5.859 -0.17025 0.2995 0.63500 2.013
V32 2888.0 0.020806 0.902584 -4.053 -0.40725 0.0390 0.55700 2.395
V33 2888.0 0.007801 1.006995 -4.627 -0.49900 -0.0400 0.46200 5.465
V34 2888.0 0.006715 1.003291 -4.789 -0.29000 0.1600 0.27300 5.110
V35 2888.0 0.197764 0.985675 -5.695 -0.20250 0.3640 0.60200 2.324
V36 2888.0 0.030658 0.970812 -2.608 -0.41300 0.1370 0.64425 5.238
V37 2888.0 -0.130330 1.017196 -3.630 -0.79825 -0.1855 0.49525 3.000
target 2888.0 0.126353 0.983966 -3.044 -0.35025 0.3130 0.79325 2.538
test_data = pd.read_csv(test_data_file, sep="\t")
test_data.head()
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V28 V29 V30 V31 V32 V33 V34 V35 V36 V37
0 0.368 0.380 -0.225 -0.049 0.379 0.092 0.550 0.551 0.244 0.904 ... -0.449 0.047 0.057 -0.042 0.847 0.534 -0.009 -0.190 -0.567 0.388
1 0.148 0.489 -0.247 -0.049 0.122 -0.201 0.487 0.493 -0.127 0.904 ... -0.443 0.047 0.560 0.176 0.551 0.046 -0.220 0.008 -0.294 0.104
2 -0.166 -0.062 -0.311 0.046 -0.055 0.063 0.485 0.493 -0.227 0.904 ... -0.458 -0.398 0.101 0.199 0.634 0.017 -0.234 0.008 0.373 0.569
3 0.102 0.294 -0.259 0.051 -0.183 0.148 0.474 0.504 0.010 0.904 ... -0.456 -0.398 1.007 0.137 1.042 -0.040 -0.290 0.008 -0.666 0.391
4 0.300 0.428 0.208 0.051 -0.033 0.116 0.408 0.497 0.155 0.904 ... -0.458 -0.776 0.291 0.370 0.181 -0.040 -0.290 0.008 -0.140 -0.497

5 rows × 38 columns

# no missing values
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 3   V3      1925 non-null   float64
 4   V4      1925 non-null   float64
 5   V5      1925 non-null   float64
 6   V6      1925 non-null   float64
 7   V7      1925 non-null   float64
 8   V8      1925 non-null   float64
 9   V9      1925 non-null   float64
 10  V10     1925 non-null   float64
 11  V11     1925 non-null   float64
 12  V12     1925 non-null   float64
 13  V13     1925 non-null   float64
 14  V14     1925 non-null   float64
 15  V15     1925 non-null   float64
 16  V16     1925 non-null   float64
 17  V17     1925 non-null   float64
 18  V18     1925 non-null   float64
 19  V19     1925 non-null   float64
 20  V20     1925 non-null   float64
 21  V21     1925 non-null   float64
 22  V22     1925 non-null   float64
 23  V23     1925 non-null   float64
 24  V24     1925 non-null   float64
 25  V25     1925 non-null   float64
 26  V26     1925 non-null   float64
 27  V27     1925 non-null   float64
 28  V28     1925 non-null   float64
 29  V29     1925 non-null   float64
 30  V30     1925 non-null   float64
 31  V31     1925 non-null   float64
 32  V32     1925 non-null   float64
 33  V33     1925 non-null   float64
 34  V34     1925 non-null   float64
 35  V35     1925 non-null   float64
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB

Note: neither the training set nor the test set contains missing values. The feature fields have been anonymized, so they carry no concrete physical meaning; there are 38 features in total.

Visualizing outliers and inspecting data distributions

# Boxplots
fig = plt.figure(figsize=(80, 60))
feature_columns = train_data.columns.to_list()[:39]  # all 39 columns (38 features + target)

for i, column in enumerate(feature_columns):
    plt.subplot(7, 8, i+1)  # note: subplot indices start at 1
    sns.boxplot(train_data[column], orient="vertical", width=0.5)
    plt.ylabel(column)
plt.show()

[Figure: output_11_0.png]

Conclusion: most of the feature variables contain outliers, so outlier handling needs to be considered.
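For comparison with the model-based method in the next subsection, the boxplot whiskers correspond to the classic IQR rule; a minimal sketch (column "V0" is used purely as an illustration):

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (illustrative sketch)
q1, q3 = train_data["V0"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (train_data["V0"] < q1 - 1.5 * iqr) | (train_data["V0"] > q3 + 1.5 * iqr)
print("IQR outliers in V0:", int(outlier_mask.sum()))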

Model-based outlier detection

The model-based approach

def find_outliers(model, X, y, sigma=3):
    """Find outliers based on the model's residuals and visualize them."""
    
    # Predict with the model; fit it first if it has not been fitted yet
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    except Exception:
        model.fit(X, y)
        y_pred = pd.Series(model.predict(X), index=y.index)

    print("MSE:{}, R^2:{}".format(round(mean_squared_error(y, y_pred), 4), round(model.score(X, y), 4)))
    # Residual statistics
    resid = y-y_pred
    resid_mean = resid.mean()
    resid_std = resid.std()
    print("resid_mean:{}, resid_std:{}".format(round(resid_mean, 4), round(resid_std, 4)))

    # Standardize the residuals (z-scores)
    z = (resid - resid_mean)/resid_std

    # Outlier positions: samples whose residual exceeds sigma (default 3) standard deviations
    outliers = z[abs(z) > sigma].index
    print("outlier index:",outliers.tolist())
    # Visualize the outliers (the more closely y_pred tracks y along a positive trend, the better)
    plt.figure(figsize=(15, 3))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, ".", label="Accepted")
    plt.plot(y.loc[outliers], y_pred[outliers], "ro", label="Outlier")
    plt.ylabel("y_pred")
    plt.xlabel("y")
    plt.legend()
    
    # Residuals: the closer to zero, the better
    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y-y_pred, ".", label="Accepted")
    plt.plot(y.loc[outliers], y.loc[outliers ] - y_pred.loc[outliers], "ro", label="Outlier")
    plt.ylabel("y-y_pred")
    plt.xlabel("y")
    plt.legend()

    # Histogram of the z-scores
    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(bins=50, color="r", ax=ax_133)
    plt.legend(["Accepted", "Outlier"])
    plt.xlabel("z")
    
    plt.savefig("./imgs/Outlier.png")
    
    return outliers
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

X_train = train_data.iloc[:, 0:-1]
y_train = train_data.iloc[:, -1]

# Ridge regression
model = Ridge()
outliers = find_outliers(model, X_train, y_train)
MSE:0.1073, R^2:0.8891
resid_mean:0.0, resid_std:0.3277
outlier index: [321, 348, 376, 777, 884, 1145, 1164, 1310, 1458, 1466, 1484, 1523, 1704, 1874, 1879, 1979, 2002, 2279, 2528, 2620, 2645, 2647, 2667, 2668, 2669, 2696, 2767, 2769, 2807, 2842, 2863]

[Figure: output_15_1.png]

Histograms and Q-Q plots

# If the data are normally distributed, all points of the Q-Q plot fall on a straight line
plt.figure(figsize=(10, 5))

ax = plt.subplot(1, 2, 1)
sns.distplot(train_data["V1"], fit=stats.norm)

ax = plt.subplot(1, 2, 2)
res = stats.probplot(train_data["V1"], plot=plt)

[Figure: output_18_0.png]

# Plot all columns
train_cols = 6
train_rows = len(train_data.columns)

plt.figure(figsize=(4*train_cols, 4*train_rows))

i = 0
for col in train_data.columns:
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    sns.distplot(train_data[col], fit=stats.norm)
    
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    res = stats.probplot(train_data[col], plot=plt)
plt.tight_layout()


[Figure: output_19_0.png]

Conclusion: several feature variables (e.g. V1, V23, V24, V35) are not normally distributed.

KDE plots

Goal: compare the distribution of each feature variable between the training and test sets, which helps reveal features whose distributions differ across the two sets.

# Kernel density estimate: a smoothed histogram
plt.figure(figsize=(8, 4))

# Check whether train and test share the same feature distribution
ax = sns.kdeplot(train_data["V1"], color="red", shade=True)
ax = sns.kdeplot(test_data["V1"], color="blue", shade=True)

ax.set_xlabel("V1")
ax.set_ylabel("Frequency")
ax.legend(["train", "test"])

[Figure: output_22_1.png]

# Plot all features
dist_cols = 6
dist_rows = len(test_data.columns)

plt.figure(figsize=(4*dist_cols, 4*dist_rows))

i = 1
for col in test_data.columns:
    ax = plt.subplot(dist_rows, dist_cols, i)
    ax = sns.kdeplot(train_data[col], color="red", shade=True)
    ax = sns.kdeplot(test_data[col], color="blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequence")
    ax.legend(["train", "test"])
    i += 1
    
plt.tight_layout()

[Figure: output_23_0.png]

Conclusion: features V5, V9, V11, V17, V22 and V28 have inconsistent distributions between the training and test sets, which would hurt the model's generalization ability, so these features should be removed.

Linear regression plots (qualitative analysis)

Note: examine the relationship between each feature variable and the target.

fcols = 2
frows = 1

plt.figure(figsize=(8, 4))

ax = plt.subplot(frows, fcols, 1)
sns.regplot(x="V0", y="target", data=train_data, ax=ax
            , scatter_kws={"marker":".", "s":3, "alpha":0.3}
            , line_kws={"color":"k", "alpha":0.7}
           )

ax = plt.subplot(frows, fcols, 2)
sns.distplot(train_data["V0"].dropna())

plt.show()

[Figure: output_26_0.png]

fcols = 6
frows = len(test_data.columns)

plt.figure(figsize=(5*fcols, 4*frows))

index  = 0
for col in test_data.columns:
    index += 1
    ax = plt.subplot(frows, fcols, index)
    sns.regplot(x=col, y="target", data=train_data, ax=ax
                , scatter_kws={"marker":".", "s":3, "alpha":0.3}
                , line_kws={"color":"k", "alpha":0.7}
               )
    index += 1
    ax = plt.subplot(frows, fcols, index)
    sns.distplot(train_data[col].dropna())

plt.show()

[Figure: output_27_0.png]

Correlation between feature variables (quantitative analysis)

# Drop the unnecessary columns
pd.set_option("display.max_columns", 10)
pd.set_option("display.max_rows", 10)
train_data1 = train_data.drop(columns=["V5", "V9", "V11", "V17", "V22", "V28"])
# Compute the correlation matrix
train_corr = train_data1.corr()
train_corr
V0 V1 V2 V3 V4 ... V34 V35 V36 V37 target
V0 1.000000 0.908607 0.463643 0.409576 0.781212 ... -0.019342 0.138933 0.231417 -0.494076 0.873212
V1 0.908607 1.000000 0.506514 0.383924 0.657790 ... -0.029115 0.146329 0.235299 -0.494043 0.871846
V2 0.463643 0.506514 1.000000 0.410148 0.057697 ... -0.025620 0.043648 0.316462 -0.734956 0.638878
V3 0.409576 0.383924 0.410148 1.000000 0.315046 ... -0.031898 0.080034 0.324475 -0.229613 0.512074
V4 0.781212 0.657790 0.057697 0.315046 1.000000 ... 0.028659 0.100010 0.113609 -0.031054 0.603984
... ... ... ... ... ... ... ... ... ... ... ...
V34 -0.019342 -0.029115 -0.025620 -0.031898 0.028659 ... 1.000000 0.233616 -0.019032 -0.006854 -0.006034
V35 0.138933 0.146329 0.043648 0.080034 0.100010 ... 0.233616 1.000000 0.025401 -0.077991 0.140294
V36 0.231417 0.235299 0.316462 0.324475 0.113609 ... -0.019032 0.025401 1.000000 -0.039478 0.319309
V37 -0.494076 -0.494043 -0.734956 -0.229613 -0.031054 ... -0.006854 -0.077991 -0.039478 1.000000 -0.565795
target 0.873212 0.871846 0.638878 0.512074 0.603984 ... -0.006034 0.140294 0.319309 -0.565795 1.000000

33 rows × 33 columns

# Heatmap: makes it easy to see feature-feature and feature-target correlations
plt.figure(figsize=(20, 16))
sns.heatmap(train_corr, vmax=.8, square=True, annot=True)

[Figure: output_31_1.png]

The top-K features most correlated with the target

k = 10
nlargest_f = train_corr.nlargest(k, columns="target")["target"]
nlargest_f
target    1.000000
V0        0.873212
V1        0.871846
V8        0.831904
V27       0.812585
V31       0.750297
V2        0.638878
V4        0.603984
V12       0.594189
V16       0.536748
Name: target, dtype: float64
# n_largest corr
cols = nlargest_f.index
cm = np.corrcoef(train_data[cols].values.T)
hm = plt.figure(figsize=(10, 10))
sns.heatmap(train_data[cols].corr(), annot=True, square=True)
plt.show()

[Figure: output_34_0.png]

# Include more variables by lowering the correlation threshold
threshold = 0.5

corrmat = train_data.corr()
corr_cols = corrmat[corrmat["target"] > threshold]["target"].index
hm = plt.figure(figsize=(10, 10))
sns.heatmap(train_data[corr_cols].corr(), annot=True, square=True, cmap="RdYlGn")  # colormap: RdYlGn
plt.show()

[Figure: output_35_0.png]

# Set a threshold and keep only suitable feature variables
threshold = 0.5
corr_matrix = train_data1.corr().abs()
drop_cols = corr_matrix[corr_matrix["target"] < threshold].index
train_data1 = train_data1.drop(columns=drop_cols)

Conclusion: the larger a feature's correlation coefficient with the target, the stronger its linear influence on the target variable can be assumed to be.
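For reference, the coefficient computed by .corr() above is the Pearson correlation:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Values close to ±1 indicate a strong linear relationship with the target; values near 0 indicate little linear association.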

Box-Cox Transform

Linear regression assumes approximately normally distributed data, so before the statistical analysis the data are transformed toward normality. The Box-Cox transform helps the linear regression model satisfy linearity, normality, independence and homoscedasticity without losing information.
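For positive $y$, the Box-Cox transform used by scipy.stats.boxcox below is

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}$$

where $\lambda$ is chosen (by maximum likelihood in scipy) to make the transformed data as close to normal as possible; because $y$ must be strictly positive, the code below shifts the min-max scaled features by +1 first.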

Full data set

drop_columns=["V5", "V9", "V11", "V17", "V22", "V28"]
train_x = train_data.drop(columns="target")

# Offline: concatenate the training and test sets; online: training set only
all_data = pd.concat([train_x, test_data])
all_data.drop(columns = drop_columns, inplace=True)
all_data.head()
V0 V1 V2 V3 V4 ... V33 V34 V35 V36 V37
0 0.566 0.016 -0.143 0.407 0.452 ... -4.627 -4.789 -5.101 -2.608 -3.508
1 0.968 0.437 0.066 0.566 0.194 ... -0.843 0.160 0.364 -0.335 -0.730
2 1.013 0.568 0.235 0.370 0.112 ... -0.843 0.160 0.364 0.765 -0.589
3 0.733 0.368 0.283 0.165 0.599 ... -0.843 -0.065 0.364 0.333 -0.112
4 0.684 0.638 0.260 0.209 0.337 ... -0.843 -0.215 0.364 -0.280 -0.028

5 rows × 32 columns

# Normalize in advance (over all data; recommended when the data volume is large)
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
all_cols = all_data.columns.tolist()
all_data  = mms.fit_transform(all_data)

all_data  = pd.DataFrame(all_data, columns=all_cols)
all_data.describe().T
count mean std min 25% 50% 75% max
V0 4813.0 0.694172 0.144198 0.0 0.626676 0.729488 0.790195 1.0
V1 4813.0 0.721357 0.131443 0.0 0.679416 0.752497 0.799553 1.0
V2 4813.0 0.602300 0.140628 0.0 0.514414 0.617072 0.700464 1.0
V3 4813.0 0.603139 0.152462 0.0 0.503888 0.614270 0.710474 1.0
V4 4813.0 0.523743 0.106430 0.0 0.478182 0.535866 0.585036 1.0
... ... ... ... ... ... ... ... ...
V33 4813.0 0.458493 0.099095 0.0 0.409037 0.454518 0.500000 1.0
V34 4813.0 0.483790 0.101020 0.0 0.454490 0.499949 0.511365 1.0
V35 4813.0 0.762873 0.102037 0.0 0.727273 0.800020 0.800020 1.0
V36 4813.0 0.332385 0.127456 0.0 0.270584 0.347056 0.414861 1.0
V37 4813.0 0.545795 0.150356 0.0 0.445647 0.539317 0.643061 1.0

32 rows × 8 columns

### Separate processing
# Normalize the training set and the test set separately
cols_numeric = test_data.columns
train_data_process = train_data[cols_numeric]
# Min-max normalization of the training features
train_data_process = pd.DataFrame(mms.fit_transform(train_data_process), columns=cols_numeric)

test_data_process = test_data[cols_numeric]
test_data_process = pd.DataFrame(mms.fit_transform(test_data_process), columns=cols_numeric)
# Box-Cox transform
cols_numeric_left = cols_numeric[:13]
cols_numeric_right = cols_numeric[13:]
train_data_process = pd.concat([train_data_process, train_data["target"]], axis=1)
fcols = 6
frows = len(cols_numeric_left)

plt.figure(figsize=(4*fcols, 4*frows))

i = 0

for var in cols_numeric_left:
    dat =  train_data_process[[var, "target"]].dropna()
    i += 1
    plt.subplot(frows, fcols, i)
    sns.distplot(dat[var], fit=stats.norm)
    plt.title(var + " Original")
    plt.xlabel("")
    
    i += 1
    plt.subplot(frows, fcols, i)
    _ = stats.probplot(dat[var], plot=plt)
    plt.xlabel("")
    plt.ylabel("")
    plt.title("skew=" + "{:.4f}".format(stats.skew(dat[var])))
    
    # Correlation scatter plot
    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(dat[var], dat["target"], ".",alpha=0.8)
    plt.title("corr=" + "{:.2f}".format(np.corrcoef(dat[var], dat["target"])[0][1]))
    
    # After the Box-Cox transform
    i += 1
    plt.subplot(frows, fcols, i)
    trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)  # boxcox requires strictly positive values, hence the +1 shift
    trans_var = pd.DataFrame(mms.fit_transform(trans_var.reshape(-1, 1)))[0]
    
    sns.distplot(trans_var, fit=stats.norm)
    plt.title(var + " Transformed")
    plt.xlabel("")
    
    i += 1
    plt.subplot(frows, fcols, i)
    _ = stats.probplot(trans_var, plot=plt)
    plt.xlabel("")
    plt.ylabel("")
    plt.title("skew=" + "{:.4f}".format(stats.skew(trans_var)))
    
    # Correlation scatter plot (after the transform)
    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(trans_var, dat["target"], ".",alpha=0.8)
    plt.title("corr=" + "{:.2f}".format(np.corrcoef(trans_var, dat["target"])[0][1]))

[Figure: output_48_0.png]

Key point
Skewness is measured relative to the normal distribution, whose skewness is 0: a symmetric distribution has zero skewness. A skewness greater than 0 means the distribution is right-skewed (a long tail on the right); a skewness less than 0 means it is left-skewed (a long tail on the left); the larger the absolute value, the more severe the asymmetry. (>0: right-skewed; <0: left-skewed; =0: symmetric/normal.) Note that "left-skewed" and "right-skewed" refer to the direction of the long tail, not the position of the peak.
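Concretely, the skewness reported by stats.skew is the standardized third moment of the sample:

$$g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}$$

so $g_1>0$ means right-skewed, $g_1<0$ means left-skewed, and $g_1=0$ for a symmetric (e.g. normal) distribution.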

Feature Engineering

Outlier analysis

# Boxplot
plt.figure(figsize=(18, 10))
plt.boxplot(train_data.values, labels=train_data.columns)
plt.hlines([-7.5, 7.5], 0, 40, "red")
plt.show()


[Figure: output_52_0.png]

# Feature V9 shows obvious outliers (values below -7.5); drop those rows
train_data = train_data[train_data["V9"] > -7.5]
test_data = test_data[test_data["V9"] > -7.5]
display(train_data.describe())
display(test_data.describe())
V0 V1 V2 V3 V4 ... V34 V35 V36 V37 target
count 2886.000000 2886.000000 2886.000000 2886.000000 2886.000000 ... 2886.000000 2886.000000 2886.000000 2886.000000 2886.000000
mean 0.123725 0.056856 0.290340 -0.068364 0.012254 ... 0.006959 0.198513 0.030099 -0.131957 0.127451
std 0.927984 0.941269 0.911231 0.970357 0.888037 ... 1.003411 0.985058 0.970258 1.015666 0.983144
min -4.335000 -5.122000 -3.420000 -3.956000 -4.742000 ... -4.789000 -5.695000 -2.608000 -3.630000 -3.044000
25% -0.292000 -0.224250 -0.310000 -0.652750 -0.385000 ... -0.290000 -0.199750 -0.412750 -0.798750 -0.347500
50% 0.359500 0.273000 0.386000 -0.045000 0.109500 ... 0.160000 0.364000 0.137000 -0.186000 0.314000
75% 0.726000 0.599000 0.918750 0.623500 0.550000 ... 0.273000 0.602000 0.643750 0.493000 0.793750
max 2.121000 1.918000 2.828000 2.457000 2.689000 ... 5.110000 2.324000 5.238000 3.000000 2.538000

8 rows × 39 columns

V0 V1 V2 V3 V4 ... V33 V34 V35 V36 V37
count 1925.000000 1925.000000 1925.000000 1925.000000 1925.000000 ... 1925.000000 1925.000000 1925.000000 1925.000000 1925.000000
mean -0.184404 -0.083912 -0.434762 0.101671 -0.019172 ... -0.011433 -0.009985 -0.296895 -0.046270 0.195735
std 1.073333 1.076670 0.969541 1.034925 1.147286 ... 0.989732 0.995213 0.946896 1.040854 0.940599
min -4.814000 -5.488000 -4.283000 -3.276000 -4.921000 ... -4.627000 -4.789000 -7.477000 -2.608000 -3.346000
25% -0.664000 -0.451000 -0.978000 -0.644000 -0.497000 ... -0.460000 -0.290000 -0.349000 -0.593000 -0.432000
50% 0.065000 0.195000 -0.267000 0.220000 0.118000 ... -0.040000 0.160000 -0.270000 0.083000 0.152000
75% 0.549000 0.589000 0.278000 0.793000 0.610000 ... 0.419000 0.273000 0.364000 0.651000 0.797000
max 2.100000 2.120000 1.946000 2.603000 4.475000 ... 5.465000 5.110000 1.671000 2.861000 3.021000

8 rows × 38 columns

Data normalization

# Normalize the feature variables
feature_columns = test_data.columns.to_list()

mms = MinMaxScaler()
train_data_scalar = mms.fit_transform(train_data[feature_columns])
train_data_scalar = pd.DataFrame(train_data_scalar, columns=feature_columns)
train_data_scalar["target"] = train_data["target"].values  # .values: train_data's index is no longer contiguous after dropping rows

test_data_scalar = mms.transform(test_data[feature_columns])  # reuse the scaler fitted on the training set
test_data_scalar = pd.DataFrame(test_data_scalar, columns=feature_columns)

Checking the data distributions

# EDA showed that "V5", "V9", "V11", "V17", "V22", "V28" have clearly different train/test distributions, so they are dropped
drop_cols = 6
drop_rows = 1

plt.figure(figsize=(5*drop_cols, 5*drop_rows))

for i, col in enumerate(["V5", "V9", "V11", "V17", "V22", "V28"]):
    
    ax = plt.subplot(drop_rows, drop_cols, i+1)
    ax = sns.kdeplot(train_data[col], shade=True, color="Red")
    ax = sns.kdeplot(test_data[col], shade=True, color="Blue")
    ax.set_ylabel("Frequency")
    ax.set_xlabel(col)
    ax = ax.legend(["train", "test"])

plt.show()

[Figure: output_58_0.png]

Feature correlations

plt.figure(figsize=(20, 16))

columns = train_data_process.columns.to_list()
mcorr = train_data_scalar[columns].corr(method="spearman")
mask = np.zeros_like(mcorr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt="0.2f")

plt.show()

[Figure: output_60_0.png]

Feature dimensionality reduction

mcorr = mcorr.abs()
num_mcoor = mcorr[mcorr["target"] > 0.1]["target"]
index = num_mcoor.sort_values(ascending=False).index
index
Index(['target', 'V0', 'V31', 'V1', 'V8', 'V27', 'V2', 'V16', 'V3', 'V4',
       'V12', 'V10', 'V36', 'V37', 'V24', 'V5', 'V6', 'V20', 'V11', 'V15',
       'V29', 'V7', 'V19', 'V18', 'V13', 'V17', 'V22', 'V30'],
      dtype='object')
train_data_scalar[index].corr(method="spearman")
target V0 V31 V1 V8 ... V18 V13 V17 V22 V30
target 1.000000 0.712403 0.711636 0.682909 0.679469 ... 0.149741 0.149199 0.126262 -0.112743 0.101378
V0 0.712403 1.000000 0.739116 0.894116 0.832151 ... 0.132143 0.173861 0.055024 -0.076698 0.099242
V31 0.711636 0.739116 1.000000 0.807585 0.841469 ... 0.094678 0.071517 0.115056 -0.106450 0.131453
V1 0.682909 0.894116 0.807585 1.000000 0.849034 ... 0.093688 0.134595 0.081446 -0.072848 0.109216
V8 0.679469 0.832151 0.841469 0.849034 1.000000 ... 0.079592 0.105380 0.102544 -0.078333 0.165204
... ... ... ... ... ... ... ... ... ... ... ...
V18 0.149741 0.132143 0.094678 0.093688 0.079592 ... 1.000000 0.242008 -0.073678 0.016819 0.133708
V13 0.149199 0.173861 0.071517 0.134595 0.105380 ... 0.242008 1.000000 -0.108020 0.348432 -0.097178
V17 0.126262 0.055024 0.115056 0.081446 0.102544 ... -0.073678 -0.108020 1.000000 0.363785 0.057480
V22 -0.112743 -0.076698 -0.106450 -0.072848 -0.078333 ... 0.016819 0.348432 0.363785 1.000000 -0.054570
V30 0.101378 0.099242 0.131453 0.109216 0.165204 ... 0.133708 -0.097178 0.057480 -0.054570 1.000000

28 rows × 28 columns

## PCA dimensionality reduction

Goal: eliminate multicollinearity among the features.
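As a quick sanity check (a minimal sketch; pca refers to the PCA object fitted in the next cell), the variance retained by the kept components can be inspected via explained_variance_ratio_:

# Inspect how much variance each retained principal component explains
print(pca.explained_variance_ratio_)           # ratio per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative ratio, should reach about 0.9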

from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)  # keep enough components to explain 90% of the variance

# train
train_data_pca_90 = pca.fit_transform(train_data_scalar.iloc[:, :-1])
train_data_pca_90 = pd.DataFrame(train_data_pca_90)
train_data_pca_90["target"] = train_data["target"].values

# test: project with the PCA fitted on the training data
test_data_pca_90 = pca.transform(test_data_scalar.iloc[:, :])
test_data_pca_90 = pd.DataFrame(test_data_pca_90)

display(train_data_pca_90)
display(test_data_pca_90)
0 1 2 3 4 ... 12 13 14 15 target
0 0.183626 0.075444 0.076502 0.613128 0.609231 ... -0.042831 0.015694 -0.096437 0.320443 0.175
1 0.012796 0.026542 0.011350 0.405590 0.422800 ... 0.136143 0.043224 0.066051 0.177521 0.676
2 0.034287 -0.054796 0.123225 0.277690 0.431320 ... 0.181925 0.042951 0.160700 0.148086 0.633
3 0.141906 -0.089136 0.078565 0.262941 0.433194 ... 0.176589 0.031108 0.077063 0.067650 0.206
4 0.143749 0.009386 0.246991 0.171165 0.327496 ... 0.141855 0.115849 0.157205 0.029145 0.384
... ... ... ... ... ... ... ... ... ... ... ...
2881 -0.111495 0.140241 0.200300 -0.283676 -0.281734 ... -0.059668 0.007330 -0.003344 0.115570 0.892
2882 -0.189328 0.113281 0.247078 -0.245414 -0.226019 ... -0.155242 0.075538 0.045501 0.065343 0.812
2883 0.077546 0.201220 0.183691 -0.277095 -0.431218 ... -0.095667 0.010671 -0.038339 0.068913 0.235
2884 0.021993 0.224493 0.112763 -0.199688 -0.458057 ... -0.133675 0.068059 -0.007461 -0.027665 1.042
2885 -0.022576 0.195085 0.194000 -0.306873 -0.299947 ... -0.077414 0.093005 -0.009916 -0.009609 0.005

2886 rows × 17 columns

0 1 2 3 4 ... 9 10 11 12 13
0 0.014675 -0.348751 0.237515 -0.191205 0.081670 ... -0.046190 -0.013370 0.038045 0.099094 -0.062470
1 -0.056760 -0.357766 0.259784 -0.279175 0.155516 ... -0.071004 -0.024006 0.047927 0.174885 -0.117769
2 -0.045538 -0.334789 0.183349 -0.145120 0.241489 ... -0.055299 -0.018512 0.024006 0.133972 -0.162479
3 -0.106663 0.025139 -0.011816 -0.196600 -0.031321 ... -0.140342 -0.075948 -0.068304 0.119366 0.032868
4 -0.179450 -0.022747 -0.127998 -0.186130 0.068751 ... -0.160952 -0.043351 0.019240 0.141839 -0.083498
... ... ... ... ... ... ... ... ... ... ... ...
1920 1.441162 0.580629 0.751858 0.202756 0.097667 ... -0.029617 0.122375 0.083911 0.327381 -0.380743
1921 1.420970 0.803316 0.922584 -0.303722 -0.108873 ... -0.110200 0.023644 0.040485 0.172820 -0.223728
1922 1.161142 0.984069 0.579733 -0.370852 -0.092587 ... -0.163546 -0.037952 -0.177113 0.268723 -0.001955
1923 1.117736 0.956606 0.529063 -0.451184 -0.031903 ... -0.129908 -0.082961 -0.153917 0.265543 -0.010352
1924 1.024624 0.780777 0.418843 -0.280930 0.135125 ... -0.254216 -0.198561 -0.094339 0.373396 -0.002877

1925 rows × 14 columns

Model Validation

Data processing

train_data = pd.read_csv(train_data_file, sep="\t")
test_data = pd.read_csv(test_data_file, sep="\t")

# Normalize the data
feature_columns = test_data.columns.to_list()

mms = MinMaxScaler()
train_data_scalar = mms.fit_transform(train_data[feature_columns])
train_data_scalar = pd.DataFrame(train_data_scalar, columns=feature_columns)
train_data_scalar["target"] = train_data["target"]

test_data_scalar = mms.transform(test_data[feature_columns])  # reuse the scaler fitted on the training set
test_data_scalar = pd.DataFrame(test_data_scalar, columns=feature_columns)

# PCA dimensionality reduction
pca = PCA(n_components=16)  # keep 16 principal components

# train 
train_data_pca_16 = pca.fit_transform(train_data_scalar.iloc[:, :-1])
train_data_pca_16 = pd.DataFrame(train_data_pca_16)
train_data_pca_16["target"] = train_data["target"]

# test: project with the PCA fitted on the training data
test_data_pca_16 = pca.transform(test_data_scalar.iloc[:, :])
test_data_pca_16 = pd.DataFrame(test_data_pca_16)

# Train/validation split
from sklearn.model_selection import train_test_split

train_data_pca_16 = train_data_pca_16.fillna(0)
X = train_data_pca_16.iloc[:,:-1]
y = train_data_pca_16.iloc[:, -1]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

Models

# L2 regularization
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error as MSE
poly = PolynomialFeatures(degree=3)

Poly_X_train = poly.fit_transform(X_train)
Poly_X_val = poly.fit_transform(X_val)

clf = SGDRegressor(max_iter=1000, tol=0.001, penalty="L2", alpha=0.00001)
clf.fit(Poly_X_train, y_train)
SGDRegressor(alpha=1e-05, penalty='L2')
print("Training mse:", MSE(y_train, clf.predict(Poly_X_train)))
print("Testing mse:", MSE(y_val, clf.predict(Poly_X_val)))
Training mse: 0.13411813527459393
Testing mse: 0.14244273147224495
# L1 regularization
poly = PolynomialFeatures(degree=3)

Poly_X_train = poly.fit_transform(X_train)
Poly_X_val = poly.fit_transform(X_val)

clf = SGDRegressor(max_iter=1000, tol=0.001, penalty="L1", alpha=0.00001)
clf.fit(Poly_X_train, y_train)
SGDRegressor(alpha=1e-05, penalty='L1')
print("Training mse:", MSE(y_train, clf.predict(Poly_X_train)))
print("Testing mse:", MSE(y_val, clf.predict(Poly_X_val)))
Training mse: 0.13443580962219054
Testing mse: 0.14261036582932843

Offline model validation with LightGBM

# Prepare the data
train_data_2_f = train_data.copy()
test_data_2_f = test_data.copy()

train_data_2 = train_data_2_f[test_data_2_f.columns]
train_data_2_target = train_data_2_f.iloc[:, -1]
# 5-fold cross-validation
from sklearn.model_selection import KFold
import lightgbm as lgb


Folds = 5
kf = KFold(n_splits=Folds, shuffle=True, random_state=2021)

# Record the training and validation MSE
MSE_DICT = {"train mse":[], "test mse":[]}

# Cross-validation loop
for i, (train_index, test_index) in enumerate(kf.split(train_data_2)):
    
    lgb_reg = lgb.LGBMRegressor(learning_rate=0.01
                            , max_depth=-1
                            , n_estimators=5000
                            , boosting_type="gbdt"
                            , random_state=2021
                            , objective="regression"
                           )
    
    X_train_KFold, X_test_KFold = train_data_2.iloc[train_index, :], train_data_2.iloc[test_index, :]
    Y_train_KFold, Y_test_KFold = train_data_2_target[train_index], train_data_2_target[test_index]
    
    # Train the model
    lgb_reg.fit(X=X_train_KFold, y=Y_train_KFold
            , eval_set=[(X_train_KFold, Y_train_KFold), (X_test_KFold, Y_test_KFold)]
            , eval_names=["Train", "Test"]
            , eval_metric="MSE"
            , early_stopping_rounds=100
            , verbose=100
           )

    # Evaluate on the training and validation folds
    X_train_KFold_pred = MSE(Y_train_KFold, lgb_reg.predict(X_train_KFold, num_iteration=lgb_reg.best_iteration_))
    X_test_KFold_pred = MSE(Y_test_KFold, lgb_reg.predict(X_test_KFold, num_iteration=lgb_reg.best_iteration_))
    
    print("第{}折 训练和测试 训练MSE 预测MSE".format(i+1))
    print("-------\n训练MSE:{}".format(X_train_KFold_pred))
    print("测试MSE:{}\n-------\n".format(X_test_KFold_pred))
    
    # 求平均
    MSE_DICT["train mse"].append(X_train_KFold_pred)
    MSE_DICT["test mse"].append(X_test_KFold_pred)
    print("-------\nTrain数值:", MSE_DICT["train mse"])
    print("Test 数值:", MSE_DICT["test mse"])
    print("训练MSE(mean):{}".format(np.mean(MSE_DICT["train mse"])))
    print("测试MSE(mean):{}\n-------\n".format(np.mean(MSE_DICT["test mse"])))

Learning curve

Note: a learning curve plots the model's scores on the training set and on the cross-validation set as the size of the training set varies. It shows how the model is likely to behave on new data, helps judge whether the model's variance or bias is too high, and indicates whether enlarging the training set would reduce overfitting.

def plot_learning_curve(estimator, title, X, y, 
                        ylim = None,
                        cv=None,
                        n_jobs = None
                       ):
    train_size, train_scores, test_scores = learning_curve(estimator=estimator
                                                           ,X = X, y = y, cv = cv, n_jobs=n_jobs)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.plot(train_size, train_scores_mean, "o-", label = "Training Score", color="r")
    plt.fill_between(train_size, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, color="r", alpha=0.1)
    
    plt.plot(train_size, test_scores_mean, "o-", label = "Cross_validation score", color="g")
    plt.fill_between(train_size, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, color="g", alpha=0.1)
    
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.ylabel("Score")
    plt.xlabel("Training examples")
    plt.grid()
    plt.legend(loc="best")
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=2021)
model = SGDRegressor()
plot_learning_curve(model, "LinearRegression", X, y, ylim=(0.7, 0.9), cv=cv, n_jobs=-1)


[Figure: output_102_0.png]

Conclusion: increasing the amount of training data effectively helps prevent overfitting.

Validation curve

Note: used to evaluate how a single hyperparameter affects the model.

from sklearn.model_selection import validation_curve

X = train_data_2.copy()
y = train_data_2_target.copy()

model = SGDRegressor(max_iter=1000, tol=1e-3, penalty="L2")
# Vary the regularization coefficient alpha
param_range = [0.1, 0.01, 0.001, 0.00001, 0.000001, 0.0000001]

train_scores, test_scores = validation_curve(model, X, y
                                             , param_name="alpha"
                                             , param_range=param_range
                                             , scoring="r2"
                                             , cv=10
                                             , n_jobs=-1
                                            )

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)

test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)


plt.semilogx(param_range, train_scores_mean, "o-", label = "Training Score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, color="r", alpha=0.1)

plt.semilogx(param_range, test_scores_mean, "o-", label = "Cross-validation score", color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, color="g", alpha=0.1)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.6, 1.0)
plt.legend(loc="best")
plt.show()

[Figure: output_105_0.png]

Model Ensembling

Loading the data

train_data = pd.read_csv(train_data_file, sep="\t")
test_data = pd.read_csv(test_data_file, sep="\t")
# # Normalize the data
# feature_columns = test_data.columns.to_list()

# mms = MinMaxScaler()
# train_data_scalar = mms.fit_transform(train_data[feature_columns])
# train_data_scalar = pd.DataFrame(train_data_scalar, columns=feature_columns)

# test_data_scalar = mms.fit_transform(test_data[feature_columns])
# test_data_scalar = pd.DataFrame(test_data_scalar, columns=feature_columns)

# Remove features whose train/test distributions are inconsistent
# training
# train_data_scalar = train_data.drop(columns=["V5", "V9", "V11", "V17", "V22", "V28"])
corr_col = ['V0', 'V31', 'V1', 'V8', 'V27', 'V2', 'V16', 'V3', 'V4',
       'V12', 'V10', 'V36', 'V37', 'V24', 'V5', 'V6', 'V20', 'V11', 'V15',
       'V29', 'V7', 'V19', 'V18', 'V13', 'V17', 'V22', 'V30']
train_data_scalar = train_data[corr_col]
train_data_scalar["target"] = train_data["target"]

# testing
# test_data_scalar = test_data.drop(columns =["V5", "V9", "V11", "V17", "V22", "V28"] )
test_data_scalar = test_data[corr_col]
train_data_scalar["origin"] = "train"
test_data_scalar["origin"] = "test"
data_all = pd.concat([train_data_scalar, test_data_scalar], axis=0, ignore_index=True)
def get_training_data(all_data=None):
    """获取训练数据"""
    
    from sklearn.model_selection import train_test_split
    

    df_train = all_data[all_data["origin"] == "train"]
    df_train["label"] = all_data["target"]
    
    # Assemble features and target
    y = df_train["target"]
    X = df_train.drop(columns=["label", "target", "origin"])
    X_train, X_val, y_train, y_val = train_test_split(X, y
                                                        , test_size=0.2
                                                        , random_state=2021
                                                       )
    
    return X_train, X_val, y_train, y_val
def get_test_data():
    """获取测试数据"""
    
    df_test = data_all[data_all["origin"] == "test"].reset_index(drop=True)
    
    return df_test.drop(columns=["origin"])

Evaluation metrics

from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    
    diff = y_pred - y_true
    sum_sq = sum(diff**2)
    n = len(y_pred)
    
    return np.sqrt(sum_sq/n)

def mse(y_true, y_pred):
    
    return mean_squared_error(y_true, y_pred)
# Loss scorers
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)

Data preparation

# Use a model to find and remove outliers
from sklearn.linear_model import Ridge
model = Ridge()
X_train, X_val, y_train, y_val = get_training_data(data_all)
outliers = find_outliers(model, X_train, y_train)
MSE:0.1111, R^2:0.8862
resid_mean:0.0, resid_std:0.3335
outlier index: [2620, 1310, 777, 1874, 2264, 2667, 2279, 348, 2767, 1905, 884, 1523, 2645, 2647, 2696, 1879, 2842, 1704, 2668, 1458, 2669]

[Figure: output_124_1.png]

X_outliers=X_train.loc[outliers]
y_outliers=y_train.loc[outliers]
X_t=X_train.drop(index=outliers)
y_t=y_train.drop(index=outliers)
def get_trainning_data_omitoutliers():

    y=y_t.copy()
    X=X_t.copy()
    
    return X,y

Training the models

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, GridSearchCV, cross_val_score

def train_model(model, param_grid=[], X=[], y=[], 
                splits=5, repeats=5):

    # Get the data
    if len(y)==0:
        X,y = get_trainning_data_omitoutliers()
        
    # Repeated K-fold cross-validation
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)
    
    # Grid search for the best parameters
    if len(param_grid)>0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)

        # Fit
        gsearch.fit(X,y)

        # Best model
        model = gsearch.best_estimator_        
        best_idx = gsearch.best_index_

        # Cross-validation metrics
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx,'mean_test_score'])
        cv_std = grid_results.loc[best_idx,'std_test_score']

    # No grid search
    else:
        # Plain repeated cross-validation
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)
    
    # Collect the CV score
    cv_score = pd.Series({'mean':cv_mean,'std':cv_std})

    # Predict on the training data
    y_pred = model.predict(X)
    
    # Model performance statistics        
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=',model.score(X, y))
    print('rmse=',rmse(y, y_pred))
    print('mse=',mse(y, y_pred))
    print('cross_val: mean=',cv_mean,', std=',cv_std)
    
    # Residual analysis and visualization
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid)/std_resid    
    n_outliers = sum(abs(z)>3)
    outliers = z[abs(z)>3].index
    
    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y,y_pred,'.')
#     plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y_pred');
    plt.title('corr = {:.3f}'.format(np.corrcoef(y,y_pred)[0][1]))
    
    ax_132=plt.subplot(1, 3, 2)
    plt.plot(y,y-y_pred,'.')
#     plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y - y_pred');
    plt.title('std resid = {:.3f}'.format(std_resid))
    
    ax_133=plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
#     z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.xlabel('z')
    plt.title('{:.0f} samples with z>3'.format(n_outliers))

    return model, cv_score, grid_results
opt_models = dict()
score_models = pd.DataFrame(columns=["mean", "std"])
splits = 5
repeats = 5

Single models

Ridge (L2 regularization)

model = "Ridge"

opt_models[model] = Ridge()

# Tune parameters
alpha_range = np.arange(0.25, 6, 0.25)
param_grid = {"alpha": alpha_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(alpha_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )
plt.xlabel("alpha")
plt.ylabel("score")
----------------------
Ridge(alpha=0.5)
----------------------
score= 0.9030109271008152
rmse= 0.3043510385681425
mse= 0.09262955467750701
cross_val: mean= 0.0956725874884991 , std= 0.006621695786068565

[Figure: output_132_5.png]

[Figure: output_132_6.png]

mean_squared_error(y_val, opt_models["Ridge"].predict(X_val))
0.10705413929393254

Lasso (L1 regularization)

from sklearn.linear_model import Lasso

model = "Lasso"

opt_models[model] = Lasso()

# Tune parameters (the result is very sensitive to alpha)
alpha_range = np.arange(1e-4, 1e-3, 4e-5)
param_grid = {"alpha": alpha_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)


plt.figure()
plt.errorbar(alpha_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )

plt.xlabel("alpha")
plt.ylabel("score")
----------------------
Lasso(alpha=0.00022000000000000003)
----------------------
score= 0.9031250717142573
rmse= 0.3046339142675125
mse= 0.09280182172194627
cross_val: mean= 0.09659329022777374 , std= 0.006394970336172911

[Figure: output_136_4.png]

[Figure: output_136_5.png]

ElasticNet regression

from sklearn.linear_model import ElasticNet

model = "ElasticNet"

opt_models[model] = ElasticNet()

# Tune parameters

param_grid = {"alpha": np.arange(1e-4, 1e-3, 1e-4), "l1_ratio": np.arange(0.1, 1.0, 0.1), "max_iter":[100000]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
ElasticNet(alpha=0.00030000000000000003, l1_ratio=0.9, max_iter=100000)
----------------------
score= 0.903108934683549
rmse= 0.3046592855488974
mse= 0.09281728027116438
cross_val: mean= 0.09687683618331633 , std= 0.005227508188395059

[Figure: output_138_3.png]

SVR (LinearSVR)

from sklearn.svm import LinearSVR
model = "LinearSVR"

opt_models[model] = LinearSVR()

# Tune parameters
c_range = np.arange(0.1, 1, 0.1)
param_grid = {"C": c_range
              , "max_iter":[1000]
             }
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(c_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )

plt.xlabel("C")
plt.ylabel("score")
----------------------
LinearSVR(C=0.7000000000000001)
----------------------
score= 0.9018854105633738
rmse= 0.30657684406544794
mse= 0.09398936131712997
cross_val: mean= 0.0968415393834378 , std= 0.006284252976090882

[Figure: output_140_4.png]

[Figure: output_140_5.png]

K-nearest neighbors

from sklearn.neighbors import KNeighborsRegressor

model = "KNeighbors"

opt_models[model] = KNeighborsRegressor()

# Tune parameters
k_range = np.arange(3, 11, 1)
param_grid = {"n_neighbors": k_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(k_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )

plt.xlabel("n_neighbors")
plt.ylabel("score")
----------------------
KNeighborsRegressor(n_neighbors=7)
----------------------
score= 0.8671509432245664
rmse= 0.35673998640960447
mse= 0.12726341790352505
cross_val: mean= 0.17977332255159276 , std= 0.015313175226783347

[Figure: output_142_4.png]

[Figure: output_142_5.png]

Boosting methods

GBDT

from sklearn.ensemble import GradientBoostingRegressor

model = "GBDT"

opt_models[model] = GradientBoostingRegressor()

# Tune parameters
param_grid = {"n_estimators":[150, 250, 350]
              , "max_depth":[3, 4, 5, 6]
              , "subsample":[0.6, 0.7, 0.8]
              , "min_samples_split":[5, 6, 7]
             }

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
GradientBoostingRegressor(max_depth=4, min_samples_split=5, n_estimators=350,
                          subsample=0.7)
----------------------
score= 0.9908065332807071
rmse= 0.09370290133201399
mse= 0.008780233718037152
cross_val: mean= 0.09280908808257234 , std= 0.0026463691062500556

[Figure: output_145_3.png]

mean_squared_error(y_val, opt_models["GBDT"].predict(X_val))
0.1144506129075825

XGB

from xgboost import XGBRegressor

model = "XGB"

opt_models[model] = XGBRegressor()

# Tune parameters
param_grid = {"n_estimators":[100, 200, 300, 400, 500]
              , "max_depth":[1, 2, 3]
             }

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=2,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=200, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
----------------------
score= 0.9557014165347619
rmse= 0.2060001199760511
mse= 0.04243604943014754
cross_val: mean= 0.10432026650623527 , std= 0.007917387050248647

[Figure: output_150_3.png]

LightGBM

  1. Decide the task type (regression, classification, ranking, etc.) and the type of base learner (dart, gbdt, rf)
  2. Start with a relatively high learning rate, around 0.1, to speed up convergence; this makes the subsequent tuning much faster
  3. Tune the basic tree parameters (see the sketch after this list)
  4. Tune the regularization parameters
  5. Finally, lower the learning rate to squeeze out a bit more accuracy
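A minimal sketch of step 3 (tuning the basic tree parameters) with the sklearn API; the grid values are illustrative only, and X_train / y_train are assumed to come from the earlier split:

# Illustrative grid search over the tree-structure parameters (step 3 above)
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

lgb_est = lgb.LGBMRegressor(boosting_type="gbdt", objective="regression",
                            learning_rate=0.1, n_estimators=200)
param_grid = {"max_depth": [3, 5, 7], "num_leaves": [15, 31, 63]}
gs = GridSearchCV(lgb_est, param_grid, scoring="neg_mean_squared_error", cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, -gs.best_score_)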
from time import time
import lightgbm as lgb
import datetime
data_train = lgb.Dataset(X_train, y_train)
data_val = lgb.Dataset(X_val, y_val)
Initial state (before tuning)
# Parameters essentially at their defaults
params1 = {
      "boosting_type": "gbdt"
    , "objective": "regression"  # regression task
    , "metric": {"mse", "rmse"}
    
    , "nthread": 4
#     , "device": "gpu"
#     , "gpu_device_id": 1
    , "verbose": 1

    , "learning_rate": 0.1
    
    , "max_depth": 5
    , "num_leaves": 31  # LightGBM grows trees leaf-wise; the docs recommend num_leaves < 2^max_depth
    , "subsample": 1.0  # row subsampling
    , "colsample_bytree": 1.0  # feature subsampling
    
    , 'reg_alpha': 0.0  # L1
    , 'reg_lambda': 0.0  # L2
}
t0 = time()
cv_result1 = lgb.cv(params=params1, train_set=data_train
                    , early_stopping_rounds = 20
                    , nfold=5
                    , stratified=False
                    , shuffle=True
#                     , metrics="binary_logloss",
                    , seed=0
                      )
print("参数处理时间:",datetime.datetime.fromtimestamp(time()-t0).strftime("%M:%S:%f"))
Tuned parameters (lower learning rate, shallower trees)
num_boost_round = 1000
params2 = {
      "boosting_type": "gbdt"
    , "objective": "regression"  # regression task
    , "metric": {"mse", "rmse"}
    , "silent": True
    , "nthread": 4
#     , "device": "gpu"
#     , "gpu_device_id": 1
    , "verbose": 1

    , "learning_rate": 0.01
    
    , "max_depth": 3
    , "num_leaves": 31  # LightGBM grows trees leaf-wise; the docs recommend num_leaves < 2^max_depth
    , "subsample": 1.0  # row subsampling
    , "colsample_bytree": 1.0  # feature subsampling
    
    , 'reg_alpha': 0.0  # L1
    , 'reg_lambda': 0.0  # L2
}
t0 = time()
cv_result2 = lgb.cv(params=params2, train_set=data_train
                    , num_boost_round= num_boost_round
                    , early_stopping_rounds = 20
                    , nfold=5
                    , stratified=False
                    , shuffle=True
#                     , metrics="binary_logloss",
                    , seed=0
                      )
print("参数处理时间:",datetime.datetime.fromtimestamp(time()-t0).strftime("%M:%S:%f"))
# Select the best number of estimators
print("Best_n_estimators: %d\nBest_cv_score: %.4f" 
      % (np.array(list(cv_result2.values())).shape[1],
         min(np.array(list(cv_result2.values()))[0]))
     )  
Best_n_estimators: 170
Best_cv_score: 0.3413
Tuned parameters with row and feature subsampling
params3 = {
      "boosting_type": "gbdt"
    , "objective": "regression"  # regression task
    , "metric": {"mse", "rmse"}
    
    , "nthread": 4
#     , "device": "gpu"
#     , "gpu_device_id": 1
    , "verbose": 0

    , "learning_rate": 0.01
    
    , "max_depth": 3
    , "num_leaves": 31  # LightGBM grows trees leaf-wise; the docs recommend num_leaves < 2^max_depth
    , "subsample": 0.8  # row subsampling
    , "colsample_bytree": 0.8  # feature subsampling
    
    , 'reg_alpha': 0.0  # L1
    , 'reg_lambda': 0.0  # L2
}
t0 = time()
cv_result3 = lgb.cv(params=params3, train_set=data_train
                    , num_boost_round=num_boost_round 
                    , early_stopping_rounds = 20
                    , nfold=5
                    , stratified=False
                    , shuffle=True
#                     , metrics="binary_logloss",
                    , verbose_eval=True
                    , seed=0
                      )
print("处理时间:",datetime.datetime.fromtimestamp(time()-t0).strftime("%M:%S:%f"))
Visualizing the metrics
RMSE metric
fig, ax = plt.subplots(1, 2, figsize = (14,5))

length1 = np.array(list(cv_result1.values())).shape[1]
length2 = np.array(list(cv_result2.values())).shape[1]
length3 = np.array(list(cv_result3.values())).shape[1]

ax[0].plot(range(length1), cv_result1[list(cv_result1.keys())[0]], label="param1", c="red")
ax[1].plot(range(length1), cv_result1[list(cv_result1.keys())[1]], label="param1", c="green")

ax[0].plot(range(length2), cv_result2[list(cv_result2.keys())[0]], label="param2", c="magenta")
ax[1].plot(range(length2), cv_result2[list(cv_result2.keys())[1]], label="param2", c="blue")

ax[0].plot(range(length3), cv_result3[list(cv_result3.keys())[0]], label="param3", c="yellow")
ax[1].plot(range(length3), cv_result3[list(cv_result3.keys())[1]], label="param3", c="c")

ax[0].set_xlabel("num_round", fontsize=12)
ax[1].set_xlabel("num_round", fontsize=12)
ax[0].set_ylabel(list(cv_result1.keys())[0], fontsize=12)
ax[1].set_ylabel(list(cv_result1.keys())[1], fontsize=12)
ax[0].set_ylim((0.09, 0.4))
ax[0].legend()
ax[1].legend()
plt.show()

[Figure: output_167_0.png]

MSE metric
fig, ax = plt.subplots(1, 2, figsize = (14,5))

length1 = np.array(list(cv_result1.values())).shape[1]
length2 = np.array(list(cv_result2.values())).shape[1]
length3 = np.array(list(cv_result3.values())).shape[1]

ax[0].plot(range(length1), cv_result1[list(cv_result1.keys())[2]], label="param1", c="red")
ax[1].plot(range(length1), cv_result1[list(cv_result1.keys())[3]], label="param1", c="green")

ax[0].plot(range(length2), cv_result2[list(cv_result2.keys())[2]], label="param2", c="magenta")
ax[1].plot(range(length2), cv_result2[list(cv_result2.keys())[3]], label="param2", c="blue")

ax[0].plot(range(length3), cv_result3[list(cv_result3.keys())[2]], label="param3", c="yellow")
ax[1].plot(range(length3), cv_result3[list(cv_result3.keys())[3]], label="param3", c="c")

ax[0].set_xlabel("num_round", fontsize=12)
ax[1].set_xlabel("num_round", fontsize=12)
ax[0].set_ylabel(list(cv_result1.keys())[2], fontsize=12)
ax[1].set_ylabel(list(cv_result1.keys())[3], fontsize=12)
ax[0].set_ylim((0.1, 0.13))
ax[0].legend()
ax[1].legend()
plt.show()

[Figure: output_169_0.png]

rs = lgb.train(params3, num_boost_round=num_boost_round, train_set=data_train, valid_sets=[data_val])
mean_squared_error(y_val, rs.predict(X_val))
0.10717527815511706

Random forest

from sklearn.ensemble import RandomForestRegressor

model = "RandomForest"

opt_models[model] = RandomForestRegressor()

# Tune parameters
param_grid = {"n_estimators":[100, 150, 200]
              , "max_features":[1, 2, 3]
#               , "max_depth": [2, 3, 4, 5]
              , "min_samples_split": [2, 4, 6]
             }
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
RandomForestRegressor(max_features=3, min_samples_split=4, n_estimators=200)
----------------------
score= 0.9796795029577018
rmse= 0.13930944030785755
mse= 0.01940712015888852
cross_val: mean= 0.10995452070949858 , std= 0.012431532597183166

[Figure: output_175_3.png]

mean_squared_error(y_val, opt_models["RandomForest"].predict(X_val))
0.12410701776059764

Multi-model fusion: Bagging (averaging)

def model_predict(test_data, test_y = [], stack=True):
    """Average the predictions of the individual models (voting)."""
    
    i = 0
    y_predict_total = np.zeros((test_data.shape[0], ))
    
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
            if len(test_y) > 0:
                print("{}-mse: ".format(model), mean_squared_error(test_y, y_predict))
    
    y_predict_mean = np.round(y_predict_total/i, 3)
    
    if len(test_y)>0:
        print("mean_mse: ", mean_squared_error(test_y, y_predict_mean))
    else:
#         y_predict_mean = pd.Series(y_predict_mean)
        return y_predict_mean
    
model_predict(X_val, y_val)

Stacking

Model selection

  • The first-level base models should be strong models, while the second-level model can be a simple learner, which helps prevent overfitting.
  • The number of first-level base models should not be too small, because it equals the feature dimension seen by the second-level learner. You can loosely picture it as the number of neurons in the first layer of a neural network; of course, more is not always better.
  • The first-level base models must be "accurate but diverse"; a single poorly performing model in the first level will seriously hurt the whole ensemble. (A compact sklearn sketch follows this list.)
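A compact sklearn illustration of these points (the notebook below instead builds the out-of-fold predictions by hand in stacking_reg); the chosen estimators and CV folds are illustrative only:

# Minimal stacking sketch: strong, diverse first-level models, simple second-level learner
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200)),
                ("gbdt", GradientBoostingRegressor())],  # first-level base models
    final_estimator=LinearRegression(),                  # simple meta-learner to limit overfitting
    cv=5)                                                # out-of-fold predictions feed the meta-learner
stack.fit(X_train, y_train)
print("stacking mse:", mean_squared_error(y_val, stack.predict(X_val)))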

Building the data for the second-level learner

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm

from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor,ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def stacking_reg(clf,train_x,train_y,test_x,clf_name,kf,label_split=None):
    train=np.zeros((train_x.shape[0],1))
    test=np.zeros((test_x.shape[0],1))
    test_pre=np.empty((folds,test_x.shape[0],1))
    cv_scores=[]
    for i,(train_index,test_index) in enumerate(kf.split(train_x,label_split)):       
        tr_x=train_x[train_index]
        tr_y=train_y[train_index]
        te_x=train_x[test_index]
        te_y = train_y[test_index]
        
        if clf_name in ["rf","ada","gb","et","lr","lsvc","knn"]:
            clf.fit(tr_x,tr_y)
            pre=clf.predict(te_x).reshape(-1,1)
            train[test_index]=pre
            test_pre[i,:]=clf.predict(test_x).reshape(-1,1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2021,
                      'nthread': 12
                      }
            num_round = 10000
            early_stopping_rounds = 20
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')
                         ]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(test_matrix,ntree_limit=model.best_ntree_limit).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))

        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                      'boosting_type': 'gbdt',
                      'objective': 'regression_l2',
                      'metric': 'mse',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2021,
                      'nthread': 12,
                      'silent': True,
                      }
            num_round = 10000
            early_stopping_rounds = 20
            if test_matrix:
                model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(te_x,num_iteration=model.best_iteration).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(test_x, num_iteration=model.best_iteration).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:"%clf_name,cv_scores)
    test[:]=test_pre.mean(axis=0)
    print("%s_score_list:"%clf_name,cv_scores)
    print("%s_score_mean:"%clf_name,np.mean(cv_scores))
    return train.reshape(-1,1),test.reshape(-1,1)
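
For each fold, stacking_reg fits the model on the other K-1 folds, writes its predictions on the held-out fold into `train` (the out-of-fold features for the second level), and predicts the full test set; the K test-set predictions are then averaged into `test`. Note that the function reads the number of folds from the global variable `folds`, so it must be set before calling. A hypothetical stand-alone call (using the x_train / x_val arrays built further below) would look like:

# Sketch: out-of-fold train predictions and fold-averaged test predictions for one model
from sklearn.model_selection import KFold

folds = 5  # stacking_reg reads this global directly
kf = KFold(n_splits=folds, shuffle=True, random_state=2021)
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=2021)
oof_train, avg_test = stacking_reg(rf, x_train, y_train, x_val, "rf", kf)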

Building the base learners

def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2021, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test,"rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2021, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test,"ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2021,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test,"gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto", n_jobs=-1, random_state=2021,verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test,"et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg=LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test,"xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test,"lgb_reg"

Prediction

def stacking_pred(x_train, y_train, x_valid, kf, clf_list, label_split=None, clf_fin="lgb", if_concat_origin=True):
    
    # First-level (base) learners: collect out-of-fold train predictions and test predictions.
    # The result lists are initialised once, before the loop, so every base learner is kept.
    column_list = []
    train_data_list = []
    test_data_list = []
    for clf in clf_list:
        # Call each base-learner wrapper defined above
        train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=label_split)
        train_data_list.append(train_data)
        test_data_list.append(test_data)
        column_list.append("clf_%s" % (clf_name))
            
    # Assemble the second-level training data by stacking the base-learner outputs column-wise
    train = np.concatenate(train_data_list, axis=1)
    test = np.concatenate(test_data_list, axis=1)
    
    # Experience shows that concatenating the original features with the first-layer predictions for the second-layer model tends to improve performance and helps guard against overfitting
    if if_concat_origin:
        train = np.concatenate([x_train, train], axis=1)
        test = np.concatenate([x_valid, test], axis=1)
    print(x_train.shape)
    print(train.shape)
    print(clf_name)
    print(clf_name in ["lgb"])
    
    # Second-level (meta) learner
    if clf_fin in ["rf","ada","gb","et","lr","lsvc","knn"]:
        if clf_fin in ["rf"]:
            clf = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2021, max_features="auto",verbose=1)
        elif clf_fin in ["ada"]:
            clf = AdaBoostRegressor(n_estimators=30, random_state=2021, learning_rate=0.01)
        elif clf_fin in ["gb"]:
            clf = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2021,max_depth=5,verbose=1)
        elif clf_fin in ["et"]:
            clf = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto", n_jobs=-1, random_state=2021,verbose=1)
        elif clf_fin in ["lr"]:
            clf = LinearRegression(n_jobs=-1)
            
        clf.fit(train, y_train)
        pred = clf.predict(test).reshape(-1,1)
        return pred
    
    elif clf_fin in ["xgb"]:
        clf = xgboost
        train_matrix = clf.DMatrix(train, label=y_train, missing=-1)
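        # The "eval" matrix below is built from the training data, since labels for the external test set are unknown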
        test_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        params = {'booster': 'gbtree',
                  'eval_metric': 'rmse',
                  'gamma': 1,
                  
                  'min_child_weight': 1.5,
                  'max_depth': 5,
                  'lambda': 10,
                  
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'eta': 0.03,
                  'tree_method': 'exact',
                  'seed': 2021,
                  'nthread': 12
                  }
        num_round = 10000
        early_stopping_rounds = 20
        watchlist = [(train_matrix, 'train'),
                     (test_matrix, 'eval')
                     ]
        model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                          early_stopping_rounds=early_stopping_rounds
                          )
        pre = model.predict(clf.DMatrix(test), ntree_limit=model.best_ntree_limit).reshape(-1,1)
        return pre
    elif clf_fin in ["lgb"]:
        print(clf_name)
        clf = lightgbm
        train_matrix = clf.Dataset(train, label=y_train)
        test_matrix = clf.Dataset(train, label=y_train)
        params = {
                  'boosting_type': 'gbdt',
                  'objective': 'regression_l2',
                  'metric': 'mse',
                  'min_child_weight': 1.5,
                  'num_leaves': 2**5,
                  'lambda_l2': 10,
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'learning_rate': 0.03,
                  'tree_method': 'exact',
                  'seed': 2021,
                  'nthread': 12,
                  'silent': True,
                  }
        num_round = 10000
        early_stopping_rounds = 20
        model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                          early_stopping_rounds=early_stopping_rounds
                          )
        print('pred')
        pre = model.predict(test,num_iteration=model.best_iteration).reshape(-1,1)
        print(pre)
        return pre
# Load the data
train_data = pd.read_csv(train_data_file, sep="\t")
test_data = pd.read_csv(test_data_file, sep="\t")
corr_col = ['V0', 'V31', 'V1', 'V8', 'V27', 'V2', 'V16', 'V3', 'V4',
       'V12', 'V10', 'V36', 'V37', 'V24', 'V5', 'V6', 'V20', 'V11', 'V15',
       'V29', 'V7', 'V19', 'V18', 'V13', 'V17', 'V22', 'V30']

train_data_scalar = train_data[corr_col]
train_data_scalar = pd.DataFrame(mms.fit_transform(train_data_scalar), columns=corr_col)
train_data_scalar["target"] = train_data["target"]
test_data_scalar = test_data[corr_col]
test_data_scalar = pd.DataFrame(mms.transform(test_data_scalar), columns=corr_col)  # scale the test set with the scaler fitted on the training data
folds = 5
seed = 1
kf = KFold(n_splits=folds, shuffle=True, random_state=2021)

x_train = train_data_scalar[corr_col].values
x_val = test_data_scalar.values
y_train = train_data_scalar["target"].values

clf_list = [lgb_reg, gb_reg, rf_reg]

pred = stacking_pred(x_train, y_train, x_val, kf, clf_list, label_split=None, clf_fin="xgb", if_concat_origin=True)
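
Finally, the stacked predictions can be written out for submission. A minimal sketch (the file name and the one-value-per-line format are assumptions about the platform's requirements):

# Write one prediction per line, no header or index (assumed submission format)
pd.Series(pred.ravel()).to_csv("submission.txt", index=False, header=False)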

