
Customer Personality Analysis: Clustering with Big Data

Author: 桂Sir  Contact: 1052656099@qq.com

Consumers differ in age, gender, social group, occupation, and ethnicity, as well as in lifestyle, interests, hobbies, and personality, so they show different psychological tendencies when choosing the same product. Customer personality analysis helps a company understand its customers better and makes it easier to tailor products to the specific needs, behaviors, and concerns of different customer types. For example, instead of spending money marketing a new product to every customer in its database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

Our dataset contains customer records from a grocery company. We will apply unsupervised clustering to group the customers and explore how important each group is to the business, so that products can be adapted to the distinct needs and behaviors of each customer type.

项目链接: https://www.kaggle.com/imakash3011/customer-personality-analysis

import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
np.random.seed(42)
Dataset preview

First, preview the dataset; we observe 29 columns in total.

data = pd.read_csv("D:/Project/BIGdata/Homework/marketing_campaign.csv", sep="\t")  # the file is tab-separated
data.head()
(Output: the first 5 rows of the 29 columns, including ID, Year_Birth, Education, Marital_Status, Income, Kidhome, Teenhome, Dt_Customer, Recency, MntWines, ..., NumWebVisitsMonth, AcceptedCmp1-5, Complain, Z_CostContact, Z_Revenue, Response)

5 rows × 29 columns

  • Inspect the dtypes of the columns; we observe missing values (in Income).
data.info()

RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
Data preprocessing
#Drop rows with missing Income values
data = data.dropna()
#Use "Dt_Customer" to create a feature giving how long each customer has been in the company database
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)  # dates are recorded as dd-mm-yyyy (e.g. 21-08-2013)
dates = []
for i in data["Dt_Customer"]:
    i = i.date()
    dates.append(i)  
days = []
d1 = max(dates) #taking it to be the newest customer
for i in dates:
    delta = d1 - i
    days.append(delta)
data["Customer_For"] = days
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
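The loop above can be collapsed into a single vectorized expression, which also yields the tenure directly in days instead of nanoseconds. A minimal sketch on a hypothetical three-date sample standing in for the real `Dt_Customer` column:

```python
import pandas as pd

# Hypothetical sample standing in for the real 2216-row Dt_Customer column
df = pd.DataFrame({"Dt_Customer": pd.to_datetime(["2012-09-04", "2014-03-08", "2013-08-21"])})

# Days between each customer's enrollment and the newest enrollment, vectorized
df["Customer_For"] = (df["Dt_Customer"].max() - df["Dt_Customer"]).dt.days
print(df["Customer_For"].tolist())  # → [550, 0, 199]
```

The `.dt.days` accessor converts the timedeltas to integers, so no `pd.to_numeric` pass is needed afterwards.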

Next we engineer new variables to simplify the later steps.

#Engineer the remaining variables
data["Age"] = 2021-data["Year_Birth"]

data["Spent"] = data["MntWines"]+ data["MntFruits"]+ data["MntMeatProducts"]+ data["MntFishProducts"]+ data["MntSweetProducts"]+ data["MntGoldProds"]

data["Living_With"]=data["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

data["Children"]=data["Kidhome"]+data["Teenhome"]

data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner":2})+ data["Children"]

data["Is_Parent"] = np.where(data.Children> 0, 1, 0)

data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})

data=data.rename(columns={"MntWines": "Wines","MntFruits":"Fruits","MntMeatProducts":"Meat","MntFishProducts":"Fish","MntSweetProducts":"Sweets","MntGoldProds":"Gold"})

to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
data = data.drop(to_drop, axis=1)
data.describe()
(Output: summary statistics of the 28 numeric columns, with count, mean, std, min, 25%, 50%, 75%, max per column; note Income max = 666666 and Age max = 128, which already hint at outliers)

8 rows × 28 columns

#Check for outliers
sns.set(rc={"axes.facecolor":"#FFF9ED","figure.facecolor":"#FFF9ED"})
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
cmap = colors.ListedColormap(["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
#Plotting following features
To_Plot = [ "Income", "Recency", "Customer_For", "Age", "Spent", "Is_Parent"]
print("Relative Plot Of Some Selected Features: A Data Subset")
plt.figure()
sns.pairplot(data[To_Plot], hue= "Is_Parent",palette= (["#682F2F","#F3AB60"]))
#Taking hue 
plt.show()
Relative Plot Of Some Selected Features: A Data Subset



The pair plot shows outliers in both Age and Income, which we remove.

data = data[(data["Age"]<90)]
data = data[(data["Income"]<600000)]
print("The total number of data-points after removing the outliers are:", len(data))
The total number of data-points after removing the outliers are: 2212
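The cutoffs above (Age < 90, Income < 600000) were chosen by eye from the pair plot; a rule-based alternative is Tukey's IQR fence. A minimal sketch on a hypothetical income series (the function name `iqr_bounds` is our own, not part of the original notebook):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical incomes with one extreme value, like the 666666 in this dataset
s = pd.Series([30000, 45000, 52000, 60000, 666666])
lo, hi = iqr_bounds(s)
print(s[(s >= lo) & (s <= hi)].tolist())  # → [30000, 45000, 52000, 60000]
```

The fence width `k = 1.5` is the conventional default; a larger `k` keeps more of the tail.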
#Check pairwise correlations between variables
corrmat = data.corr(numeric_only=True)  # restrict to numeric columns (Education and Living_With are still strings here)
plt.figure(figsize=(20,20))  
sns.heatmap(corrmat,annot=True, cmap=cmap, center=0)
 

No variables are so strongly correlated that they need to be pruned.

#Encode the categorical variables as integers
s = (data.dtypes == 'object')
object_cols = list(s[s].index)
LE=LabelEncoder()
for i in object_cols:
    data[i]=data[[i]].apply(LE.fit_transform)
ds = data.copy()
# Drop the promotion-related variables so that the remaining features are better suited to dimensionality reduction and clustering
cols_del = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
ds = ds.drop(cols_del, axis=1)
#Standardize each variable
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns=ds.columns)
scaled_ds.head()
(Output: the first 5 rows of the 23 standardized columns, Education, Income, Kidhome, Teenhome, Recency, Wines, ..., Family_Size, Is_Parent, each with mean 0 and unit variance)

5 rows × 23 columns
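One caveat about the encoding step above: `LabelEncoder` assigns codes alphabetically, so Education becomes Graduate = 0, Postgraduate = 1, Undergraduate = 2, which scrambles the natural ordering of the levels. An explicit ordinal map preserves it; a sketch on a hypothetical mini-column (the `edu_order` dict is our own choice, not from the original notebook):

```python
import pandas as pd

# Explicit ordinal codes: Undergraduate < Graduate < Postgraduate
edu_order = {"Undergraduate": 0, "Graduate": 1, "Postgraduate": 2}

df = pd.DataFrame({"Education": ["Graduate", "Undergraduate", "Postgraduate"]})
df["Education"] = df["Education"].map(edu_order)
print(df["Education"].tolist())  # → [1, 0, 2]
```

For distance-based clustering this matters because the encoded values feed directly into the standardizer and PCA.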

Dimensionality reduction

We use PCA to extract principal components; for ease of visualization we keep three.

pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=(["col1","col2", "col3"]))
PCA_ds.describe().T
       count   mean          std       min        25%        50%        75%       max
col1   2212.0  1.670354e-16  2.878377  -5.969394  -2.538494  -0.780421  2.383290  7.444305
col2   2212.0  2.569775e-17  1.706839  -4.312196  -1.328316  -0.158123  1.242289  6.142721
col3   2212.0  4.336495e-17  1.221956  -3.530416  -0.829067  -0.022692  0.799895  6.611222
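Before settling on three components, it is worth checking how much of the total variance they actually retain via `explained_variance_ratio_`. A sketch on synthetic standardized data of the same shape as `scaled_ds` (the real ratios depend on the actual data, so no specific numbers are claimed here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for scaled_ds: samples x 23 standardized features
X = rng.normal(size=(200, 23))

pca = PCA(n_components=3).fit(X)
# Fraction of total variance captured by each of the 3 components, and their sum
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the retained fraction is low, the 3-D scatter and the clusters built on it only reflect part of the structure in the full feature space.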
x =PCA_ds["col1"]
y =PCA_ds["col2"]
z =PCA_ds["col3"]
#Plot the projected data in 3D
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

Clustering
#Use the elbow method to choose the number of clusters
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_ds)
Elbow_M.show()
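The elbow plot suggests k = 4. The silhouette score (from `sklearn.metrics`, which is already imported above) gives an independent check: it peaks at the k whose clusters are most compact and well separated. A sketch on synthetic 3-D blobs standing in for `PCA_ds` (the blob centers are our own choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for PCA_ds: 3-D points drawn from 4 well-separated blobs
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=42)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 4 for these well-separated blobs
```

On the real `PCA_ds`, the scores would be flatter, but agreement between the elbow and the silhouette peak strengthens the case for k = 4.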

 
#Cluster with K-Means (k = 4, as suggested by the elbow plot)
AC = KMeans(n_clusters=4)
# Fit the model and predict the cluster labels
yhat_AC = AC.fit_predict(PCA_ds)
PCA_ds["Clusters"] = yhat_AC
#Add the cluster labels to the original dataframe
data["Clusters"]= yhat_AC
fig = plt.figure(figsize=(10,8))
ax = plt.subplot(111, projection='3d', label="bla")
ax.scatter(x, y, z, s=40, c=PCA_ds["Clusters"], marker='o', cmap = cmap )
ax.set_title("The Plot Of The Clusters")
plt.show()

Analysis of the clustering results
#Plot the cluster counts
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=data["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()

pl = sns.scatterplot(data = data,x=data["Spent"], y=data["Income"],hue=data["Clusters"], palette= pal)
pl.set_title("Cluster's Profile based On Income And Spending")
# Cluster profiles based on income and spending
plt.legend()
plt.show()
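Beyond the scatter plot, a per-cluster summary table makes the profiles explicit. A sketch on a hypothetical mini-frame mirroring the `Income`, `Spent`, and `Clusters` columns used above (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-frame with the same columns as the scatter plot above
df = pd.DataFrame({
    "Clusters": [0, 0, 1, 1, 2, 2],
    "Income":   [70000, 75000, 30000, 28000, 55000, 60000],
    "Spent":    [1200, 1400, 100, 150, 600, 700],
})

# Mean income and spending per cluster, plus cluster sizes
profile = df.groupby("Clusters").agg(
    mean_income=("Income", "mean"),
    mean_spent=("Spent", "mean"),
    size=("Income", "size"),
)
print(profile)
```

Applied to the real `data`, the same `groupby` gives the numeric backbone for the "star customer" claims made in the conclusions below.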

#Acceptance of the promotion campaigns
data["Total_Promos"] = data["AcceptedCmp1"]+ data["AcceptedCmp2"]+ data["AcceptedCmp3"]+ data["AcceptedCmp4"]+ data["AcceptedCmp5"]
plt.figure()
pl = sns.countplot(x=data["Total_Promos"],hue=data["Clusters"], palette= pal)
pl.set_title("Count Of Promotion Accepted")
pl.set_xlabel("Number Of Total Accepted Promotions")
plt.show()

#Plot the number of deal purchases
plt.figure()
pl=sns.boxenplot(y=data["NumDealsPurchases"],x=data["Clusters"], palette= pal)
pl.set_title("Number of Deals Purchased")
plt.show()

Personal = [ "Kidhome","Teenhome","Customer_For", "Age", "Children", "Family_Size", "Is_Parent", "Education","Living_With"]

for i in Personal:
    plt.figure()
    sns.jointplot(x=data[i], y=data["Spent"], hue =data["Clusters"], kind="kde", palette=pal)
    plt.show()

Customer personality profiles

Finally, we summarize the personality traits of the different customer segments.

Marketing strategy analysis

From these results we can derive marketing strategies.

Clusters 1 and 0 are our star customers. Their buying habits:

Cluster 1 appears to favor catalog purchases
Cluster 0 appears to favor shopping in physical stores
Cluster 1 rarely visits the company website
Clusters 2 and 3 rarely shop online
Clusters 0, 2 and 4 pay close attention to the company website

Clusters 1 and 0 are both fond of buying wine and meat products

Advertising campaigns have little pull with any cluster
Clusters 1 and 3 are very willing to take part in deal promotions

Recommendations:

  • Rework the advertising campaigns; the current ones are largely ineffective
  • Discounts on wine are an effective way to attract cluster 0
  • Focusing on catalog sales helps retain star cluster 1; the company website should also be actively promoted to this cluster
  • Online shopping should be promoted to clusters 2 and 3 to encourage their spending
Reprinted from www.mshxw.com. Original article: https://www.mshxw.com/it/604835.html