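Background and approach

Background

Aggregating the data

Data for analysis

Analysis workflow

Analysis and processing

1.1 Importing packages and reading the data

1.2 Correlation analysis between variables with corr

1.3 Business regression model: modeling

1.4 Handling missing values

1.5 Evaluating the regression model

1.6 Iterating on and optimizing the model

The final linear regression model: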

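Background and approach

Background
For a fast-moving consumer goods (FMCG) company such as P&G, the important data applications are:

1. Accurately forecasting the sales revenue of supermarket and retail stores

2. Quantifying the effect of each promotional factor the company can control

3. Planning marketing resources sensibly

Aggregating the data
In this example, regression analysis is used to evaluate the input-output return of each factor.

Data for analysis
Promotional spending on TV advertising, online, offline, in-store and WeChat channels, together with sales revenue.

All of the data below use one month as the observation window:

Revenue: store sales revenue
Reach: number of WeChat push messages
Local_tv: local TV advertising spend
Online: online advertising spend
Instore: in-store spend such as posters and displays
Person: in-store sales staff investment
Event: promotional event: cobranding (co-branded promotion), holiday (public holiday), special (store-specific promotion), non_event (no promotion)

Analysis workflow
Data overview, univariate analysis, correlation and visualization, regression model

Analysis and processing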

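1.1 Importing packages and reading the data

import pandas as pd
# Read the data
# index_col=0 uses the first column of the CSV as the index, so no extra "Unnamed: 0" column appears
store=pd.read_csv('w2_store_rev.csv',index_col=0)
store.head()

Output (first five rows)

store.info()

Column information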


Int64Index: 985 entries, 845 to 26
Data columns (total 7 columns):
revenue     985 non-null float64
reach       985 non-null int64
local_tv    929 non-null float64
online      985 non-null int64
instore     985 non-null int64
person      985 non-null int64
event       985 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 61.6+ KB

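The basic information shows that:

event has dtype object (string-like), i.e. it is a categorical variable, which linear regression cannot use directly
local_tv has 56 null values (929 of its 985 entries are non-null)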
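The two checks above can also be done programmatically rather than by reading the info() output. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not the real `store` data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `store` frame
toy = pd.DataFrame({
    "revenue": [100.0, 120.0, 90.0, 150.0],
    "local_tv": [30.0, np.nan, 25.0, np.nan],
    "event": ["holiday", "special", "non_event", "holiday"],
})

missing_count = toy.isna().sum()    # number of null values per column
missing_ratio = toy.isna().mean()   # share of null values per column

# Columns that are not numeric need encoding before a linear regression
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]

print(missing_count)
print(missing_ratio)
print(non_numeric)
```

Applied to the real data, the same calls would report the 56 missing local_tv values and flag event as the only non-numeric column.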

Problem 1: the event column

First look at the distinct values of event

store.event.unique()

Output

array(['non_event', 'special', 'cobranding', 'holiday'], dtype=object)

Check how each of these categories relates to revenue

revenue for each category

store.groupby('event')['revenue'].describe()

local_tv for each category

store.groupby('event')['local_tv'].describe()

Convert the event variable into numeric variables (one dummy column per category)

store=pd.get_dummies(store)
store.info()

Output


Int64Index: 985 entries, 845 to 26
Data columns (total 10 columns):
revenue             985 non-null float64
reach               985 non-null int64
local_tv            929 non-null float64
online              985 non-null int64
instore             985 non-null int64
person              985 non-null int64
event_cobranding    985 non-null uint8
event_holiday       985 non-null uint8
event_non_event     985 non-null uint8
event_special       985 non-null uint8
dtypes: float64(2), int64(4), uint8(4)
memory usage: 57.7 KB
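To make the effect of get_dummies concrete, here is a small sketch on made-up rows (the `toy` frame is an assumption for illustration). Each category of event becomes its own indicator column, and exactly one indicator is set per row. Depending on the pandas version, the indicator columns may come out as bool rather than the uint8 shown in the output above.

```python
import pandas as pd

# Hypothetical rows, one per event category
toy = pd.DataFrame({
    "revenue": [100.0, 120.0, 90.0, 150.0],
    "event": ["non_event", "special", "cobranding", "holiday"],
})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy, columns=["event"])
event_cols = [c for c in dummies.columns if c.startswith("event_")]

print(dummies)
print(event_cols)
```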

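1.2 Correlation analysis between variables with corr

store.corr()

The correlations with revenue can be pulled out on their own, and sort_values orders the variables by the size of their correlation with revenue. ascending=True (the default) sorts in ascending order; ascending=False sorts in descending order.

(store.corr()[['revenue']]).sort_values(by='revenue',inplace=False,ascending=False)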
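As a sketch of what the sorted correlation column looks like, the example below uses a tiny made-up frame (`toy` and its numbers are assumptions chosen so the correlations are easy to check by hand):

```python
import pandas as pd

# Hypothetical data: person rises with revenue, online mostly falls,
# reach has no linear relation to revenue
toy = pd.DataFrame({
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
    "person":  [1, 2, 3, 4, 5],
    "online":  [5, 3, 4, 1, 2],
    "reach":   [2, 1, 2, 1, 2],
})

# Correlation of every other column with revenue, strongest positive first
ranked = toy.corr()["revenue"].drop("revenue").sort_values(ascending=False)
print(ranked)
```

Sorting in descending order puts the most positively correlated variable first and the most negatively correlated one last; sorting by absolute value would be an alternative when only the strength matters.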

For regression, forecasting and similar tasks, the finer the data granularity and the more complete the data, the better.

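1.3 Business regression model: modeling

Visual analysis: visualize the linear relationships; the slope of the fitted line is related to the correlation coefficient

Visualize the correlation of three variables with revenue, one at a time

Linear relationship between local_tv and revenue

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# x, y and data are passed by keyword; recent seaborn versions do not accept them positionally
sns.regplot(x='local_tv',y='revenue',data=store)

Linear relationship between person and revenue

sns.regplot(x='person',y='revenue',data=store)

Linear relationship between instore and revenue

sns.regplot(x='instore',y='revenue',data=store)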

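Problem 2: the 56 null values in local_tv

1.4 Handling missing values

When missing values appear, first look at their proportion and choose a treatment accordingly

If few values are missing, such as the roughly 5.7% in this case (56 of 985), the gap is not serious and the values can simply be filled in
If most values are missing, say 80%-90%, the column used to be considered meaningless and was dropped; in the big-data era, however, missingness is itself information, so such values should also be filled

Several filling methods

Fill with 0
Fill with the mean
Fill with the median
Model-based fill: fit a linear model of local_tv on the other variables and fill with its predictions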
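The four filling methods listed above can be sketched side by side. This is a minimal illustration on made-up data, not the article's pipeline: the `toy` frame is hypothetical, and the model-based fill uses only person as a predictor to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data with two missing local_tv values
toy = pd.DataFrame({
    "local_tv": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
    "person":   [1, 2, 3, 4, 5, 6],
})

zero_filled = toy["local_tv"].fillna(0)
mean_filled = toy["local_tv"].fillna(toy["local_tv"].mean())
median_filled = toy["local_tv"].fillna(toy["local_tv"].median())

# Model-based fill: learn local_tv from person on the complete rows,
# then predict local_tv for the rows where it is missing
known = toy[toy["local_tv"].notna()]
unknown = toy[toy["local_tv"].isna()]
imputer = LinearRegression().fit(known[["person"]], known["local_tv"])
model_filled = toy["local_tv"].copy()
model_filled.loc[unknown.index] = imputer.predict(unknown[["person"]])

print(pd.DataFrame({
    "zero": zero_filled, "mean": mean_filled,
    "median": median_filled, "model": model_filled,
}))
```

In this toy data local_tv is exactly 10 times person, so the model-based fill recovers the missing values while the constant fills do not; on real data the difference is usually less clear-cut.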

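# Linear regression analysis
from sklearn.linear_model import LinearRegression as LR
# Create an empty (unfitted) linear regression model
model=LR()
# Set the dependent and independent variables
y=store['revenue']
x=store[['local_tv','person','instore']]

LinearRegression cannot be fit on data that contains missing values, so as a temporary first pass the missing values are filled with 0. Keep in mind that fitting and scoring on the same data flatters the model; its real quality only shows once new data comes in.

store=store.fillna(0)
store.info()

# Alternative: fill with the mean instead of 0
# (run this in place of fillna(0); once the NaN values are filled it has no effect)
# store=store.fillna(store.local_tv.mean())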


Int64Index: 985 entries, 845 to 26
Data columns (total 10 columns):
revenue             985 non-null float64
reach               985 non-null int64
local_tv            985 non-null float64
online              985 non-null int64
instore             985 non-null int64
person              985 non-null int64
event_cobranding    985 non-null uint8
event_holiday       985 non-null uint8
event_non_event     985 non-null uint8
event_special       985 non-null uint8
dtypes: float64(2), int64(4), uint8(4)
memory usage: 57.7 KB

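Inspect the coefficients of the independent variables

x=store[['local_tv','person','instore']]  # re-select after filling; an x taken before fillna still contains NaN
model.fit(x,y)  # fit on the full data set; with no hold-out set, the scores below are in-sample

model.coef_

Output

array([4.00189943e-01, 2.13224898e+03, 3.56098623e+00])

# Coefficients in the order 'local_tv','person','instore'

Now look at the intercept, i.e. the predicted revenue when all variables are 0. It is clearly negative, which is not a plausible revenue, so the data (or the fit) is not behaving normally at that point

model.intercept_

Output: -8967.736870300461 (the intercept: the prediction when all variables are 0)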
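To show how the coefficients and the intercept combine into a prediction, here is a small worked sketch in plain Python. The coefficients and intercept are the ones reported above; the store inputs and the helper name `predict_revenue` are hypothetical and chosen only for illustration.

```python
# Coefficients and intercept as reported above (order: local_tv, person, instore)
coef = [4.00189943e-01, 2.13224898e+03, 3.56098623e+00]
intercept = -8967.736870300461

def predict_revenue(local_tv, person, instore):
    """Linear prediction: intercept + sum of coefficient * input."""
    inputs = [local_tv, person, instore]
    return intercept + sum(c * v for c, v in zip(coef, inputs))

# Hypothetical store, values chosen only for illustration
base = predict_revenue(local_tv=30000, person=10, instore=3000)
one_more_person = predict_revenue(local_tv=30000, person=11, instore=3000)

print(base)
print(one_more_person - base)    # effect of one extra unit of person
print(predict_revenue(0, 0, 0))  # all inputs at 0 gives the intercept
```

Read this way, each coefficient is the change in predicted revenue for a one-unit increase in that input while the others are held fixed, which is what makes the model usable for comparing promotional levers.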

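1.5 Evaluating the regression model

The core is to look at the residuals
Next comes MAE, the mean absolute error
Then RMSE, the root mean squared error

The difference between MAE and RMSE is that RMSE squares each error before averaging, so it amplifies large errors

Computation

Score the model with score(x,y), which returns R-squared
Compute the predicted y from x
Compute the error as the difference between the predicted and the actual values
Apply the RMSE and MAE formulas to the error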
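For reference, the standard definitions of these metrics are written out below. The symbols are introduced here for illustration: $y_i$ is the actual value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the actual values and $n$ the number of observations. The sign of the error follows the code in this section (prediction minus actual).

```latex
e_i = \hat{y}_i - y_i

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}}

R^{2} = 1 - \frac{\sum_{i=1}^{n} e_i^{2}}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^{2}}
```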

# Model evaluation; x is 'local_tv','person','instore'
score=model.score(x,y)  # R-squared of the model on x, y
predictions=model.predict(x)  # predicted values of y
error=predictions-y  # error
rmse=(error**2).mean()**.5  # RMSE, root mean squared error
mae=abs(error).mean()  # MAE, mean absolute error

print('RMSE (root mean squared error): {}'.format(rmse))
print('MAE (mean absolute error): {}'.format(mae))

Output

RMSE (root mean squared error): 8321.491623472051
MAE (mean absolute error): 6556.036999600782
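The way RMSE amplifies large errors compared with MAE can be seen on two made-up error lists with the same total absolute error. This is a plain-Python sketch; the numbers are hypothetical.

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [5, 5, 5, 5]     # every prediction is off by the same amount
spiky_errors = [0, 0, 0, 20]   # one large miss, same total absolute error

print(mae(even_errors), rmse(even_errors))    # 5.0 5.0
print(mae(spiky_errors), rmse(spiky_errors))  # 5.0 10.0
```

Both lists have the same MAE, but the single large miss doubles the RMSE. This is also why the RMSE reported above (about 8321) is larger than the MAE (about 6556).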

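To inspect the model in more detail, the ols tool from the statsmodels library can be used

from statsmodels.formula.api import ols
x=store[['local_tv','person','instore']]
y=store['revenue']
# In the formula, y and x refer to the variables defined above;
# the columns of x are reported as x[0], x[1], x[2] in the summary
model=ols('y~x',store).fit()

print(model.summary())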
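An alternative way to write the formula is to name the columns directly, so that the summary lists local_tv, person and instore instead of x[0], x[1], x[2]. The sketch below assumes statsmodels is installed and uses synthetic data (the `toy` frame and its generating coefficients are made up), so its numbers are unrelated to the article's results.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with known coefficients, for illustration only
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "local_tv": rng.uniform(20000, 40000, n),
    "person": rng.integers(5, 20, n),
    "instore": rng.uniform(1000, 5000, n),
})
toy["revenue"] = (1000 + 2.0 * toy["local_tv"] + 50.0 * toy["person"]
                  + 3.0 * toy["instore"] + rng.normal(0, 10, n))

# Column names in the formula become the row labels of the summary
named_model = ols("revenue ~ local_tv + person + instore", data=toy).fit()
print(named_model.params)
print(named_model.rsquared)
```

With named terms, each row of the coefficient table can be read without having to remember the column order of x.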

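1.6 Iterating on and optimizing the model

Two points need attention while iterating:

Use statistical metrics to evaluate how much of the y variable the x variables explain
From a business perspective, check how the coefficients (the contributions to the target) of the existing variables change after a new x variable is added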
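Both points can be illustrated on synthetic data. In the sketch below (all data and generating coefficients are made up), online is correlated with local_tv, so leaving it out inflates the local_tv coefficient; adding it raises R-squared and moves the local_tv coefficient back toward its true value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: online is correlated with local_tv, both drive revenue
rng = np.random.default_rng(42)
n = 300
local_tv = rng.normal(30, 5, n)
online = 0.8 * local_tv + rng.normal(0, 2, n)
revenue = 2.0 * local_tv + 3.0 * online + rng.normal(0, 1, n)

X_small = np.column_stack([local_tv])           # model without online
X_large = np.column_stack([local_tv, online])   # model with online

small = LinearRegression().fit(X_small, revenue)
large = LinearRegression().fit(X_large, revenue)

print("R2 without online:", small.score(X_small, revenue))
print("R2 with online:   ", large.score(X_large, revenue))
print("local_tv coefficient without online:", small.coef_[0])
print("local_tv coefficient with online:   ", large.coef_[0])
```

A large shift in an existing coefficient after adding a variable is a hint that the two variables overlap, which matters when the coefficients are used to decide where to spend the marketing budget.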

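Result after adding 'online'

import pandas as pd
# Read the data
# index_col=0 uses the first column of the CSV as the index, so no extra "Unnamed: 0" column appears
store=pd.read_csv('w2_store_rev.csv',index_col=0)
store.head()

store=pd.get_dummies(store)
store.info()

store=store.fillna(0)
store.info()

# Linear regression analysis
from sklearn.linear_model import LinearRegression as LR
# Create an empty (unfitted) linear regression model
model=LR()
# Set the dependent and independent variables
y=store['revenue']
x=store[['local_tv','person','instore','online']]
model.fit(x,y)  # fit on the full data set; the scores below are in-sample
# Model evaluation; x is 'local_tv','person','instore','online'
score=model.score(x,y)  # R-squared of the model on x, y
predictions=model.predict(x)  # predicted values of y
error=predictions-y  # error
rmse=(error**2).mean()**.5  # RMSE, root mean squared error
mae=abs(error).mean()  # MAE, mean absolute error

print('RMSE (root mean squared error): {}'.format(rmse))
print('MAE (mean absolute error): {}'.format(mae))

Output

RMSE (root mean squared error): 8106.512169325369
MAE (mean absolute error): 6402.20288344189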

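The missing values can also be filled more carefully than with 0, for example with the mean

import pandas as pd
# Read the data
# index_col=0 uses the first column of the CSV as the index, so no extra "Unnamed: 0" column appears
store=pd.read_csv('w2_store_rev.csv',index_col=0)
store.head()

store=pd.get_dummies(store)
store.info()

# Only local_tv has missing values, so this fills them with the local_tv mean
store=store.fillna(store.local_tv.mean())
store.info()

# Linear regression analysis
from sklearn.linear_model import LinearRegression as LR
# Create an empty (unfitted) linear regression model
model=LR()
# Set the dependent and independent variables
y=store['revenue']
x=store[['local_tv','person','instore']]
model.fit(x,y)  # fit on the full data set; the scores below are in-sample
# Model evaluation; x is 'local_tv','person','instore'
score=model.score(x,y)  # R-squared of the model on x, y
predictions=model.predict(x)  # predicted values of y
error=predictions-y  # error
rmse=(error**2).mean()**.5  # RMSE, root mean squared error
mae=abs(error).mean()  # MAE, mean absolute error

print('RMSE (root mean squared error): {}'.format(rmse))
print('MAE (mean absolute error): {}'.format(mae))

Output

RMSE (root mean squared error): 5884.181391972567
MAE (mean absolute error): 4717.8947664817115

The errors drop markedly: with the same three variables, RMSE falls from about 8321 to about 5884 and MAE from about 6556 to about 4718.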
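All of the error figures in this section are computed on the same rows the model was fit on. One common way to check how a model behaves on unseen data is to hold out part of the rows. The sketch below shows the idea on synthetic data; the data, the 25% split and the helper name `rmse_of` are assumptions for illustration and not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three inputs, linear signal plus noise with standard deviation 1
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 400)

# Keep 25% of the rows aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
fitted = LinearRegression().fit(X_train, y_train)

def rmse_of(model, X_part, y_part):
    """Root mean squared error of model on the given rows."""
    error = model.predict(X_part) - y_part
    return float(np.sqrt(np.mean(error ** 2)))

train_rmse = rmse_of(fitted, X_train, y_train)
test_rmse = rmse_of(fitted, X_test, y_test)
print("train RMSE:", train_rmse)
print("test RMSE: ", test_rmse)
```

If the test error is much larger than the training error, the model is fitting noise in the training rows rather than a relationship that carries over to new data.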

from statsmodels.formula.api import ols
x=store[['local_tv','person','instore']]
y=store['revenue']
model=ols('y~x',store).fit()
print(model.summary())

The final linear regression model (x[0], x[1], x[2] correspond to local_tv, person, instore):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.746
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     959.2
Date:                Sat, 27 Nov 2021   Prob (F-statistic):          4.09e-291
Time:                        12:55:11   Log-Likelihood:                -9947.5
No. Observations:                 985   AIC:                         1.990e+04
Df Residuals:                     981   BIC:                         1.992e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -5.288e+04   1804.489    -29.305      0.000   -5.64e+04   -4.93e+04
x[0]           1.7515      0.049     35.857      0.000       1.656       1.847
x[1]        2050.5749     61.866     33.146      0.000    1929.171    2171.979
x[2]           4.0903      0.193     21.229      0.000       3.712       4.468
==============================================================================
Omnibus:                        0.352   Durbin-Watson:                   2.056
Prob(Omnibus):                  0.839   Jarque-Bera (JB):                0.402
Skew:                           0.043   Prob(JB):                        0.818
Kurtosis:                       2.951   Cond. No.                     3.05e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.05e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

