Kaggle Tabular Playground Series - Jan 2022 Study Notes 1 (Data Analysis)

Competition page: Tabular Playground Series - Jan 2022

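Overview: the dataset gives the daily sales of three products at two stores in three countries from 2015 through 2018. The task is to predict the sales for 2019.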

This post follows the notebook TPSFEB22-01 EDA which makes sense.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

Load the data

train_data=pd.read_csv('../datas/train.csv')
test_data=pd.read_csv('../datas/test.csv')

for df in [train_data, test_data]:
    df['date'] = pd.to_datetime(df.date)
    df.set_index('date', inplace=True, drop=False)

# Shape and preview
print('Training data df shape:',train_data.shape)
print('Test data df shape:',test_data.shape)
train_data.head()

Check for missing values

print('Number of missing values in training set:',train_data.isna().sum().sum())
print('')
print('Number of missing values in test set:',test_data.isna().sum().sum())

Check the number of distinct values in each column

print('Training cardinalities:\n', train_data.nunique())
print('')
print('Test cardinalities:\n', test_data.nunique())

Check the date range

print('Training data:')
print('Min date', train_data['date'].min())
print('Max date', train_data['date'].max())
print('')
print('Test data:')
print('Min date', test_data['date'].min())
print('Max date', test_data['date'].max())

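To summarize: there are three countries, two stores, and three products, which gives 18 combinations. The training data covers 2015 to 2018, and the test data asks us to predict 2019. Neither the training data nor the test data contains missing values.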
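The figure of 18 combinations also implies a row count that can be checked against the shape printed earlier. The sketch below assumes one row per combination per calendar day, which the source does not state explicitly.

```python
import pandas as pd

# Calendar days in the training period (2016 is a leap year).
days = pd.date_range('2015-01-01', '2018-12-31', freq='D')

# countries x stores x products
n_combinations = 3 * 2 * 3

# Assumption: exactly one row per combination per day.
expected_rows = len(days) * n_combinations

print(len(days), n_combinations, expected_rows)  # 1461 18 26298
```

If `train_data.shape[0]` differs from `expected_rows`, some combination is missing days or has duplicates.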

Plot the daily sales for each combination

plt.figure(figsize=(18, 15))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    ax.plot(df.index, df.num_sold)
    ax.set_title(combi)
    # Major ticks at each year, minor ticks at each month
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%y/%m/%d'))
    ax.xaxis.set_minor_locator(mdates.MonthLocator())

plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for 2015-2018', y=1.03)
plt.show()

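The plots above show that sales of every product rise far above the usual level at the end of each year, so the year-end dates may need to be extracted as separate features. Kaggle Hat and Kaggle Mug also appear to be seasonal, so Fourier features are worth considering.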
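As a rough illustration of what Fourier features for a yearly cycle could look like, here is a minimal sketch. The helper name `fourier_features`, the default `order=2`, and the period of 365.25 days are assumptions made for this example and do not come from the referenced notebook.

```python
import numpy as np
import pandas as pd

def fourier_features(dates, order=2, period=365.25):
    """Hypothetical helper: sin/cos pairs up to `order` for a yearly cycle."""
    # Position within the year, expressed as a fraction of the period.
    t = dates.dayofyear.to_numpy() / period
    features = {}
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * t
        features[f'sin_{k}'] = np.sin(angle)
        features[f'cos_{k}'] = np.cos(angle)
    return pd.DataFrame(features, index=dates)

dates = pd.date_range('2015-01-01', '2015-12-31', freq='D')
X = fourier_features(dates, order=2)
print(X.head())
```

A higher `order` lets a linear model follow sharper seasonal shapes, at the cost of more columns and a higher risk of overfitting.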

plt.figure(figsize=(18, 20))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    # Sum only num_sold per month; summing the datetime column fails on recent pandas versions
    resampled = df[['num_sold']].resample('MS').sum()
    # One line per year, plotted against the month number
    for year, dfm in resampled.groupby(resampled.index.year):
        ax.plot(range(1, 13), dfm.num_sold, label=year)
    ax.legend()
    ax.set_xticks(ticks=range(1, 13))
    ax.set_title(combi)
plt.suptitle('Monthly sales for 2015-2018', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()

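The month-to-month pattern is very similar from year to year. However, sales do not rise every year: most visibly, monthly sales in Norway are lower in 2016 than in 2015, so other factors are probably at work.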

Next, check for weekly seasonality

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    
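    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    # Mean sales for each day of the week (0 = Monday).
    # Select num_sold first: averaging the string columns fails on recent pandas versions.
    resampled = df.groupby(df.index.dayofweek)[['num_sold']].mean()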
    ax.bar(range(7), resampled.num_sold, 
           color=['b']*4 + ['g'] + ['orange']*2)
    ax.set_title(combi)
    ax.set_xticks(range(7))
    ax.set_xticklabels(['M', 'T', 'W', 'T', 'F', 'S', 'S'])
    ax.set_ylim(0, resampled.num_sold.max())
plt.suptitle('Sales per day of the week', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()

Sales rise noticeably on weekends, so weekly seasonal indicators are worth adding.
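One possible way to build weekly seasonal indicators is one-hot encoding of the day of the week. The sketch below uses a small made-up date range. The frame `demo` and the column names `dow_*` and `is_weekend` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Two weeks of dates, starting on a Thursday.
dates = pd.date_range('2015-01-01', periods=14, freq='D')
demo = pd.DataFrame({'date': dates})

# Monday = 0 ... Sunday = 6
dow = demo['date'].dt.dayofweek

# One indicator per weekday; drop the first (Monday) to avoid
# perfect collinearity with an intercept in a linear model.
indicators = pd.get_dummies(dow, prefix='dow', drop_first=True)
demo = pd.concat([demo, indicators], axis=1)

# A coarser alternative: a single weekend flag.
demo['is_weekend'] = dow >= 5

print(demo.head(7))
```

The six indicator columns let a model learn a separate level for each weekday, while `is_weekend` only separates Saturday and Sunday from the rest.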

Next, look at sales in December and January

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    ax.bar(range(1, 32),
           df.num_sold[df.date.dt.month==12].groupby(df.date.dt.day).mean(),
           color=['b'] * 25 + ['orange'] * 6)
    ax.set_title(combi)
    ax.set_xticks(ticks=range(5, 31, 5))
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for December', y=1.03)
plt.show()

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    ax.bar(range(1, 32),
           df.num_sold[df.date.dt.month==1].groupby(df.date.dt.day).mean(),
           color=['orange'] * 5 + ['b'] * 26)  # orange marks the holiday period, as in the December plot
    ax.set_title(combi)
    ax.set_xticks(ticks=range(5, 31, 5))
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for January', y=1.03)
plt.show()

Sales start to climb around December 25 and are mostly back to normal by January 5.
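The window from December 25 to January 5 could be turned into a simple binary feature. This is a minimal sketch. The function name `year_end_flag` is hypothetical, and the exact boundaries are read off the plots rather than derived.

```python
import pandas as pd

def year_end_flag(dates):
    """Hypothetical helper: True for Dec 25-31 and Jan 1-5, else False."""
    month = dates.month
    day = dates.day
    return ((month == 12) & (day >= 25)) | ((month == 1) & (day <= 5))

dates = pd.date_range('2018-12-20', '2019-01-10', freq='D')
flag = year_end_flag(dates)

print(pd.Series(flag, index=dates))
```

A single flag treats the whole window alike. Since the bars differ from day to day, separate indicators for each day in the window may fit better.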

Earlier we saw that monthly sales in Norway were lower in 2016 than in 2015. This may be related to GDP.

Reference discussion 1

Reference discussion 2

gdp_df = pd.read_csv('../datas/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv')

gdp_df.set_index('year', inplace=True)
gdp_df

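This completes the initial data analysis. Next, we will try to fit the data using time series features and linear regression. Next part: Kaggle Tabular Playground Series - Jan 2022 Study Notes 2 (Linear Regression with Time Series)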
