E-commerce: Taobao User Behavior Analysis (work in progress)

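1. Project Introduction

The data for this exercise is a Taobao user behavior dataset provided by Alibaba for research on implicit-feedback recommendation.
Data download: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1
The full dataset is very large and my computer is limited, so I took only the first 10 million rows for the experiment. The goal is mainly to learn the method.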
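
Taking the first N rows can be done without loading the whole file. The following is a minimal sketch, assuming the raw file is a headerless CSV with the five columns described in the next section; the helper name `take_head` and the file paths are hypothetical.

```python
import pandas as pd

COLUMNS = ["user_id", "item_id", "category_id", "behavior", "time"]


def take_head(src_path, dst_path, n_rows):
    """Copy the first n_rows records of a headerless CSV to dst_path."""
    # nrows stops the parser early, so the rest of the file is never read
    head = pd.read_csv(src_path, header=None, names=COLUMNS, nrows=n_rows)
    # Write without a header so the output keeps the same layout as the input
    head.to_csv(dst_path, header=False, index=False)
    return len(head)
```

For example, `take_head("UserBehavior.csv", "subset.csv", 10_000_000)` would write a 10-million-row subset, with both file names standing in for whatever paths are actually used.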

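2. Understanding the Data

The dataset contains all behaviors (clicks, purchases, add-to-cart, favorites) of about one million randomly selected users who were active between November 25 and December 3, 2017. It is organized like MovieLens-20M: each row is one user behavior, made up of the user ID, item ID, item category ID, behavior type, and timestamp, separated by commas. The columns, in order, are:

user_id: user ID
item_id: item ID
category_id: item category ID
behavior: behavior type
time: timestamp of the behavior

The four behavior types:

pv: click (page view)
buy: purchase
cart: add to cart
fav: favorite the item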

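3. Data Cleaning

3.1 Load the data with pandas, adding column names on import

import pandas as pd

# Assumes the file has no header row (as in the raw UserBehavior data).
# header=None with names=... keeps the first record as data; assigning
# data.columns after a plain read_csv would silently drop that first record.
data = pd.read_csv(
    'UserBehavior_1千万条.csv',
    encoding='utf-8',
    header=None,
    names=['user_id', 'item_id', 'category_id', 'behavior', 'time'],  # column names
)
# Columns: user ID, item ID, item category ID, behavior type, timestamp
# Behavior types:
#   pv:   click
#   buy:  purchase
#   cart: add to cart
#   fav:  favorite the item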

3.2 Convert the timestamp to a datetime and split it into date and time columns

# Convert the Unix timestamp to a datetime, then split it into date and time columns
# Note: unit='s' yields UTC; no timezone offset is applied here
ts = pd.to_datetime(data["time"], unit='s')
data = data.drop("time", axis=1)
data["date"] = ts.dt.strftime("%Y-%m-%d")
data["time"] = ts.dt.strftime("%H:%M:%S")
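
The conversion above produces UTC dates and times. If local hours matter for the analysis, an offset can be applied before splitting. The following is a sketch only: the helper name `split_timestamp` is hypothetical, and the default of 8 hours assumes the activity is recorded for users in China Standard Time (UTC+8, no daylight saving), which the dataset description does not state explicitly.

```python
import pandas as pd


def split_timestamp(df, utc_offset_hours=8):
    """Return a copy of df whose epoch-second 'time' column is replaced
    by local 'date' and 'time' string columns."""
    # to_datetime(unit="s") gives UTC; shift by a fixed offset to get local time
    local = pd.to_datetime(df["time"], unit="s") + pd.Timedelta(hours=utc_offset_hours)
    out = df.drop(columns="time")
    out["date"] = local.dt.strftime("%Y-%m-%d")
    out["time"] = local.dt.strftime("%H:%M:%S")
    return out


sample = pd.DataFrame({"user_id": [1], "behavior": ["pv"], "time": [1511539200]})
print(split_timestamp(sample))
```

With the offset set to 0 the function reproduces the UTC behavior of the original code, so the two variants can be compared on the same data.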

3.3 Handling duplicates

# Drop exact duplicate rows
data = data.drop_duplicates()
# print(data.duplicated().sum())  # check how many duplicates remain

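3.4 Handling outliers
Keep only the records dated from November 25 to December 4.

# Outlier handling: drop rows whose date falls outside 2017-11-25 .. 2017-12-04
data = data.drop(data[(data["date"] < '2017-11-25') | (data["date"] > '2017-12-04')].index)
# print("After filtering:", data['date'].value_counts())
data = data.sort_values(by="date", ascending=True)  # sort by date, ascending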

Preprocessing is now complete. Save the cleaned data so it can be imported into the database later.

data.to_csv("已经预处理的.csv", index=False)
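
The article does not show how the cleaned data reaches the `user` table that the later queries read. One possible route is `DataFrame.to_sql`. The sketch below uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; the article's own setup is MySQL, and the helper name `load_into_table` is hypothetical.

```python
import sqlite3

import pandas as pd


def load_into_table(df, conn, table="user"):
    """Write df into a SQL table (replacing it if present) and return the row count."""
    df.to_sql(table, conn, if_exists="replace", index=False)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


demo = pd.DataFrame(
    {
        "user_id": [1, 2],
        "item_id": [10, 20],
        "category_id": [100, 200],
        "behavior": ["pv", "buy"],
        "date": ["2017-11-25", "2017-11-26"],
        "time": ["08:00:00", "09:30:00"],
    }
)
conn = sqlite3.connect(":memory:")
print(load_into_table(demo, conn))
```

For a file of 10 million rows, passing `chunksize` to `to_sql`, or using the database's own bulk CSV import, is likely to be more practical than a single insert.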

4. Data Analysis

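The analysis follows the AARRR model. The charts are drawn with the pyecharts framework.
4.1 User acquisition (Acquisition)
Using daily new users (DNU) as the metric, this section looks at daily new users and daily visits.

Daily new users:
As the chart shows, December 1 has a very large number of new users (this is based only on the first 10 million rows; the point of the exercise is the method).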

from pymysql import Connect
import pyecharts.options as opts
from pyecharts.charts import Line
from pyecharts.commons.utils import JsCode

# Connect to the database
conn = Connect(host='localhost', port=3306, user='root', password='123456', database='dianshang')
# Get a cursor
cursor = conn.cursor()
sql = """
select date, count(behavior) as daily_pv
from user
where behavior = 'pv'
group by date
order by date;
"""
# Run the query
cursor.execute(sql)

# Read the result: one (date, PV count) row per day
rows = cursor.fetchall()
cursor.close()
conn.close()
dates = [str(row[0]) for row in rows]
y_data = [int(row[1]) for row in rows]

# "Daily new" is computed here as the day-over-day increase in PV, floored at 0.
# Note: this is a traffic-based proxy, not a count of first-time users.
add = [max(curr - prev, 0) for prev, curr in zip(y_data, y_data[1:])]
# "MM-DD" labels from the second day onward, so labels and values line up
x_data = [d[5:] for d in dates[1:]]

# Visualization
background_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#c86589'}, {offset: 1, color: '#06a7ff'}], false)"
)
area_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#eb64fb'}, {offset: 1, color: '#3fbbff0d'}], false)"
)

c = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js)))
    .add_xaxis(xaxis_data=x_data)
    .add_yaxis(
        series_name="Day-over-day PV increase",
        y_axis=add,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Daily new",
            pos_bottom="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=False,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="right",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
    .render("每日新增.html")
)
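
The script above derives its "daily new" series from day-over-day differences in PV. A more direct DNU definition counts each user on the first date they appear in the data. The sketch below is one way to do that in pandas, assuming the cleaned frame from section 3 with `user_id` and `date` columns; the function name `daily_new_users` is hypothetical. "New" here only means first seen within the observation window, so the first day will always look inflated.

```python
import pandas as pd


def daily_new_users(df):
    """Count users by the date of their first recorded behavior."""
    # ISO "YYYY-MM-DD" strings sort chronologically, so min() is the first date
    first_seen = df.groupby("user_id")["date"].min()
    return first_seen.value_counts().sort_index()


demo = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3],
        "date": ["2017-11-25", "2017-11-26", "2017-11-26", "2017-11-25", "2017-11-27"],
    }
)
print(daily_new_users(demo))
```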

Daily visits:

from pymysql import Connect
import pyecharts.options as opts
from pyecharts.charts import Line
from pyecharts.commons.utils import JsCode

# Connect to the database
conn = Connect(host='localhost', port=3306, user='root', password='123456', database='dianshang')
# Get a cursor
cursor = conn.cursor()
sql = """
select date, count(behavior) as daily_pv
from user
where behavior = 'pv'
group by date
order by date;
"""
# Run the query
cursor.execute(sql)

# Read the result: one (date, PV count) row per day
rows = cursor.fetchall()
cursor.close()
conn.close()
print(rows)
x_data = [str(row[0])[5:] for row in rows]  # "MM-DD" labels
y_data = [int(row[1]) for row in rows]      # PV count for each day
# print(x_data)

# Visualization
background_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#c86589'}, {offset: 1, color: '#06a7ff'}], false)"
)
area_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#eb64fb'}, {offset: 1, color: '#3fbbff0d'}], false)"
)

c = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js)))
    .add_xaxis(xaxis_data=x_data)
    .add_yaxis(
        series_name="Daily PV",
        y_axis=y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Daily visits, 11-25 to 12-04",
            pos_bottom="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=False,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="right",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
    .render("11-25到12-04日均访问量.html")
)

4.2 User activation (Activation)
First look at PV (page views, or clicks), UV (unique visitors), and average views per user (PV/UV).
These are computed with SQL; `user` here is the name of the table.

# Compute PV, UV, and average views per user
select count(distinct user_id) as uv,
       (select count(user_id) from user where behavior = 'pv') as pv,
       (select count(user_id) from user where behavior = 'pv') / count(distinct user_id) as pv_per_user
from user;
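
The same three figures can be cross-checked in pandas before the data goes into the database. This is a minimal sketch assuming the cleaned frame from section 3; the function name `pv_uv_summary` is hypothetical. As in the SQL, UV counts every user with any behavior, while PV counts only `pv` rows.

```python
import pandas as pd


def pv_uv_summary(df):
    """Return total PV, total UV, and average views per user (PV / UV)."""
    pv = int((df["behavior"] == "pv").sum())
    uv = int(df["user_id"].nunique())
    return {"pv": pv, "uv": uv, "pv_per_user": pv / uv}


demo = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3],
        "behavior": ["pv", "pv", "pv", "buy", "cart"],
    }
)
print(pv_uv_summary(demo))
```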

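Daily PV (page views, or clicks) and UV (unique visitors):

The chart shows that both PV and UV rise noticeably once December begins. December has a major e-commerce festival, Double 12 (December 12), so a reasonable guess is that its approach drew in a large number of users.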

from pymysql import Connect
import pyecharts.options as opts
from pyecharts.charts import Line

# PV (page views, or clicks) and UV (unique visitors) per day
conn = Connect(host='localhost', port=3306, user='root', password='123456', database='dianshang')
# Get a cursor
cursor = conn.cursor()
sql = """
select date, count(distinct user_id) as daily_uv, count(user_id) as daily_pv
from user where behavior = 'pv' group by date order by date
"""
# Run the query
cursor.execute(sql)
# Read the result: one (date, uv, pv) row per day
rows = cursor.fetchall()
cursor.close()
conn.close()
# print(rows)
date_data = [str(row[0])[5:] for row in rows]  # keep only the "MM-DD" part
uv_data = [int(row[1]) for row in rows]
pv_data = [int(row[2]) for row in rows]
print(date_data)
print(uv_data)
print(pv_data)
# Visualization
c = (
    Line()
    .add_xaxis(date_data)
    .add_yaxis(
        "Daily PV",
        y_axis=pv_data,
        markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_="min")]),
    )
    .add_yaxis(
        "Daily UV",
        y_axis=uv_data,
        markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_="max")]),
    )
    .set_global_opts(title_opts=opts.TitleOpts(title="Daily PV and UV by date"))
    .render("不同日期的日pv和uv.html")
)
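
For comparison, the daily PV and UV query can be expressed as a pandas group-by. This is a sketch under the assumption that the cleaned frame from section 3 is available; the function name `daily_pv_uv` is hypothetical. Like the SQL, it counts only `pv` rows, so UV here means users who viewed at least one page that day.

```python
import pandas as pd


def daily_pv_uv(df):
    """Return a frame indexed by date with daily PV and UV columns."""
    views = df[df["behavior"] == "pv"]
    # Named aggregation: pv counts rows, uv counts distinct users
    return views.groupby("date")["user_id"].agg(pv="count", uv="nunique")


demo = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 1, 2],
        "behavior": ["pv", "pv", "pv", "pv", "buy"],
        "date": ["2017-11-25", "2017-11-25", "2017-11-25", "2017-11-26", "2017-11-26"],
    }
)
print(daily_pv_uv(demo))
```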

Next, user behavior is analyzed along the time dimension.

By date:

from pymysql import Connect
import pyecharts.options as opts
from pyecharts.charts import Line

# Connect to the database
conn = Connect(host='localhost', port=3306, user='root', password='123456', database='dianshang')
# Get a cursor
cursor = conn.cursor()
sql = """
select date,
       sum(case when behavior = 'pv' then 1 else 0 end) as pv,
       sum(case when behavior = 'buy' then 1 else 0 end) as buy,
       sum(case when behavior = 'fav' then 1 else 0 end) as fav,
       sum(case when behavior = 'cart' then 1 else 0 end) as cart
from user
group by date
order by date
"""
# Run the query
cursor.execute(sql)
# Read the result: one (date, pv, buy, fav, cart) row per day
rows = cursor.fetchall()
cursor.close()
conn.close()
print(rows)
date_data = [str(row[0])[5:] for row in rows]  # "MM-DD" labels
pv_data = [int(row[1]) for row in rows]
buy_data = [int(row[2]) for row in rows]
fav_data = [int(row[3]) for row in rows]
cart_data = [int(row[4]) for row in rows]

print(date_data)
print(pv_data)
print(buy_data)
print(fav_data)
print(cart_data)

# Visualization: one stacked line per behavior type
line = Line().add_xaxis(xaxis_data=date_data)
for name, values in [
    ("Views", pv_data),
    ("Purchases", buy_data),
    ("Favorites", fav_data),
    ("Cart adds", cart_data),
]:
    line.add_yaxis(
        series_name=name,
        stack="total",
        y_axis=values,
        label_opts=opts.LabelOpts(is_show=False),
    )
line.set_global_opts(
    title_opts=opts.TitleOpts(title="User activity by date"),
    tooltip_opts=opts.TooltipOpts(trigger="axis"),
    yaxis_opts=opts.AxisOpts(
        type_="value",
        axistick_opts=opts.AxisTickOpts(is_show=True),
        splitline_opts=opts.SplitLineOpts(is_show=True),
    ),
    xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
)
line.render("用户活跃时间段-日期.html")

By day of the week:

from pymysql import Connect
import pyecharts.options as opts
from pyecharts.charts import Line

# Connect to the database
conn = Connect(host='localhost', port=3306, user='root', password='123456', database='dianshang')
# Get a cursor
cursor = conn.cursor()
# date_format(..., '%W') returns English weekday names under MySQL's default
# lc_time_names (en_US), so the ordering list must use the same names.
# The column order (pv, buy, fav, cart) must match the indexes read below.
sql = """
select date_format(date, '%W') as weeks,
       sum(case when behavior = 'pv' then 1 else 0 end) as pv,
       sum(case when behavior = 'buy' then 1 else 0 end) as buy,
       sum(case when behavior = 'fav' then 1 else 0 end) as fav,
       sum(case when behavior = 'cart' then 1 else 0 end) as cart
from user
group by weeks
order by field(weeks, 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
"""
# Run the query
cursor.execute(sql)
# Read the result: one row per weekday present in the data (up to seven)
rows = cursor.fetchall()
cursor.close()
conn.close()
# print(rows)
week_data = [row[0] for row in rows]
pv_data = [int(row[1]) for row in rows]
buy_data = [int(row[2]) for row in rows]
fav_data = [int(row[3]) for row in rows]
cart_data = [int(row[4]) for row in rows]

print(week_data)
print(pv_data)
print(buy_data)
print(fav_data)
print(cart_data)

# Visualization: one stacked line per behavior type
line = Line().add_xaxis(xaxis_data=week_data)
for name, values in [
    ("Views", pv_data),
    ("Purchases", buy_data),
    ("Favorites", fav_data),
    ("Cart adds", cart_data),
]:
    line.add_yaxis(
        series_name=name,
        stack="total",
        y_axis=values,
        label_opts=opts.LabelOpts(is_show=False),
    )
line.set_global_opts(
    title_opts=opts.TitleOpts(title="User activity by day of the week"),
    tooltip_opts=opts.TooltipOpts(trigger="axis"),
    yaxis_opts=opts.AxisOpts(
        type_="value",
        axistick_opts=opts.AxisTickOpts(is_show=True),
        splitline_opts=opts.SplitLineOpts(is_show=True),
    ),
    xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
)
line.render("用户活跃时间段-周.html")
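
The weekday breakdown can also be computed in pandas, which avoids depending on the database's locale for weekday names. The following is a minimal sketch, assuming the cleaned frame from section 3 with `date` and `behavior` columns; the function name `weekday_counts` is hypothetical. Because the data covers only about ten days, some weekdays are sampled twice and others once, so raw totals per weekday are not directly comparable.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
BEHAVIORS = ["pv", "buy", "fav", "cart"]


def weekday_counts(df):
    """Count each behavior type per weekday, ordered Monday to Sunday."""
    day = pd.to_datetime(df["date"]).dt.day_name()
    table = pd.crosstab(day, df["behavior"])
    # reindex fixes the row/column order and fills absent combinations with 0
    return table.reindex(index=WEEKDAYS, columns=BEHAVIORS, fill_value=0)


demo = pd.DataFrame(
    {
        "date": ["2017-11-25", "2017-11-26", "2017-11-27", "2017-12-02"],
        "behavior": ["pv", "pv", "buy", "pv"],
    }
)
print(weekday_counts(demo))
```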

By hour:

from pymysql import Connect
import pyecharts.options as opts
from pyecharts.charts import Line

# Connect to the database
conn = Connect(host='localhost', port=3306, user='root', password='123456', database='dianshang')
# Get a cursor
cursor = conn.cursor()
# Select hour(time) rather than the raw time column, so the query stays
# valid when MySQL's ONLY_FULL_GROUP_BY mode is enabled
sql = """
select hour(time) as hr,
       sum(case when behavior = 'pv' then 1 else 0 end) as pv,
       sum(case when behavior = 'buy' then 1 else 0 end) as buy,
       sum(case when behavior = 'fav' then 1 else 0 end) as fav,
       sum(case when behavior = 'cart' then 1 else 0 end) as cart
from user
group by hr
order by hr;
"""
# Run the query
cursor.execute(sql)
# Read the result: one (hour, pv, buy, fav, cart) row per hour of the day
rows = cursor.fetchall()
cursor.close()
conn.close()
print(rows)
hour_data = [f"{int(row[0]):02d}" for row in rows]  # "00" .. "23"
pv_data = [int(row[1]) for row in rows]
buy_data = [int(row[2]) for row in rows]
fav_data = [int(row[3]) for row in rows]
cart_data = [int(row[4]) for row in rows]

print(hour_data)
print(pv_data)
print(buy_data)
print(fav_data)
print(cart_data)

# Visualization: one stacked line per behavior type
line = Line().add_xaxis(xaxis_data=hour_data)
for name, values in [
    ("Views", pv_data),
    ("Purchases", buy_data),
    ("Favorites", fav_data),
    ("Cart adds", cart_data),
]:
    line.add_yaxis(
        series_name=name,
        stack="total",
        y_axis=values,
        label_opts=opts.LabelOpts(is_show=False),
    )
line.set_global_opts(
    title_opts=opts.TitleOpts(title="User activity by hour"),
    tooltip_opts=opts.TooltipOpts(trigger="axis"),
    yaxis_opts=opts.AxisOpts(
        type_="value",
        axistick_opts=opts.AxisTickOpts(is_show=True),
        splitline_opts=opts.SplitLineOpts(is_show=True),
    ),
    xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
)
line.render("用户活跃时间段-小时.html")

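By hour, total traffic starts to rise at 10:00 each day and peaks at 14:00, while 06:00 to 09:00 is fairly flat. 12:00 to 14:00 is roughly lunchtime, when most people check their phones. After 14:00 traffic drops quickly and reaches its lowest point at 20:00, as the afternoon's work begins and students go back to class. The decline slows noticeably around 18:00, the rush hour after work and school. After 20:00, once dinner and the day's work are done, traffic starts rising again until about 02:00, when most people go to sleep. One caveat: the hour values come from pd.to_datetime(..., unit='s') in section 3.2, which yields UTC. If the activity is mostly from users in China (UTC+8), the 14:00 peak corresponds to roughly 22:00 local time and the 20:00 low to roughly 04:00, so the lunchtime and commute explanations should be treated with caution.

By date, views and cart additions grow substantially from November 30 to December 2, and favorites also trend clearly upward, but purchases do not show a clear increase. This may be because Double 12 is approaching: users prefer to favorite items or add them to the cart first, then place the order on the day of the festival to get the discounts.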
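
The observation that favorites and cart additions grow while purchases stay flat can be checked numerically by looking at daily ratios rather than raw counts. The sketch below assumes the cleaned frame from section 3; the function name `daily_rates` and the two ratio names are hypothetical choices for illustration, not metrics defined in the article.

```python
import pandas as pd

BEHAVIORS = ["pv", "buy", "fav", "cart"]


def daily_rates(df):
    """Per-day purchases per view, and cart adds plus favorites per view."""
    counts = pd.crosstab(df["date"], df["behavior"]).reindex(columns=BEHAVIORS, fill_value=0)
    rates = pd.DataFrame(index=counts.index)
    # A day with zero views would give inf or NaN here; real data should not have one
    rates["buy_per_pv"] = counts["buy"] / counts["pv"]
    rates["cart_fav_per_pv"] = (counts["cart"] + counts["fav"]) / counts["pv"]
    return rates


demo = pd.DataFrame(
    {
        "date": ["2017-11-30"] * 7,
        "behavior": ["pv", "pv", "pv", "pv", "buy", "cart", "fav"],
    }
)
print(daily_rates(demo))
```

If the pre-festival explanation holds, `cart_fav_per_pv` would be expected to rise over the last days of the window while `buy_per_pv` stays flat or falls.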

4.3 User retention (Retention)
