
Customer Personality Analysis: Clustering with Big Data

Author: 桂Sir  Contact: 1052656099@qq.com

Consumers differ in age, gender, social group, occupation, and ethnicity, as well as in lifestyle, interests, hobbies, and personality, so they show different psychological tendencies when choosing the same product. Customer personality analysis helps a company understand its customers better and makes it easier to tailor products to the specific needs, behaviors, and concerns of different customer types. For example, instead of spending money marketing a new product to every customer in its database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

Our dataset contains customer records from a grocery company. We will apply unsupervised clustering to group the customers and explore how important each group is to the business, so that products can be adapted to the distinct needs and behaviors of each customer type.

项目链接: https://www.kaggle.com/imakash3011/customer-personality-analysis

import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
np.random.seed(42)
Dataset preview

First, preview the dataset; we observe 29 columns in total.

data = pd.read_csv("D:/Project/BIGdata/Homework/marketing_campaign.csv", sep="\t")  # the file is tab-separated
data.head()
(Output: the first 5 rows of the 29 columns, including ID, Year_Birth, Education, Marital_Status, Income, Kidhome, Teenhome, Dt_Customer, Recency, MntWines, ..., NumWebVisitsMonth, AcceptedCmp1-5, Complain, Z_CostContact, Z_Revenue, Response)

5 rows × 29 columns

  • Inspect the dtypes of the columns; we observe missing values (in Income).
data.info()

RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
Data preprocessing
#Drop rows with missing Income values
data = data.dropna()
#Use "Dt_Customer" to create a feature giving how long each customer has been in the company database
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)  # dates are recorded as dd-mm-yyyy (e.g. 21-08-2013)
dates = []
for i in data["Dt_Customer"]:
    i = i.date()
    dates.append(i)  
days = []
d1 = max(dates) #taking it to be the newest customer
for i in dates:
    delta = d1 - i
    days.append(delta)
data["Customer_For"] = days
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
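The loop above can be collapsed into a single vectorized expression, which also yields the tenure directly in days instead of nanoseconds. A minimal sketch on a hypothetical three-date sample standing in for the real `Dt_Customer` column:

```python
import pandas as pd

# Hypothetical sample standing in for the real 2216-row Dt_Customer column
df = pd.DataFrame({"Dt_Customer": pd.to_datetime(["2012-09-04", "2014-03-08", "2013-08-21"])})

# Days between each customer's enrollment and the newest enrollment, vectorized
df["Customer_For"] = (df["Dt_Customer"].max() - df["Dt_Customer"]).dt.days
print(df["Customer_For"].tolist())  # → [550, 0, 199]
```

The `.dt.days` accessor converts the timedeltas to integers, so no `pd.to_numeric` pass is needed afterwards.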

Next we engineer new variables to simplify the later steps.

#Engineer the remaining variables
data["Age"] = 2021-data["Year_Birth"]

data["Spent"] = data["MntWines"]+ data["MntFruits"]+ data["MntMeatProducts"]+ data["MntFishProducts"]+ data["MntSweetProducts"]+ data["MntGoldProds"]

data["Living_With"]=data["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

data["Children"]=data["Kidhome"]+data["Teenhome"]

data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner":2})+ data["Children"]

data["Is_Parent"] = np.where(data.Children> 0, 1, 0)

data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})

data=data.rename(columns={"MntWines": "Wines","MntFruits":"Fruits","MntMeatProducts":"Meat","MntFishProducts":"Fish","MntSweetProducts":"Sweets","MntGoldProds":"Gold"})

to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
data = data.drop(to_drop, axis=1)
data.describe()
(Output: summary statistics of the 28 numeric columns, with count, mean, std, min, 25%, 50%, 75%, max per column; note Income max = 666666 and Age max = 128, which already hint at outliers)

8 rows × 28 columns

#Check for outliers
sns.set(rc={"axes.facecolor":"#FFF9ED","figure.facecolor":"#FFF9ED"})
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
cmap = colors.ListedColormap(["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
#Plotting following features
To_Plot = [ "Income", "Recency", "Customer_For", "Age", "Spent", "Is_Parent"]
print("Relative Plot Of Some Selected Features: A Data Subset")
plt.figure()
sns.pairplot(data[To_Plot], hue= "Is_Parent",palette= (["#682F2F","#F3AB60"]))
#Taking hue 
plt.show()
Relative Plot Of Some Selected Features: A Data Subset



The pair plot shows outliers in both Age and Income, which we remove.

data = data[(data["Age"]<90)]
data = data[(data["Income"]<600000)]
print("The total number of data-points after removing the outliers are:", len(data))
The total number of data-points after removing the outliers are: 2212
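The cutoffs above (Age < 90, Income < 600000) were chosen by eye from the pair plot; a rule-based alternative is Tukey's IQR fence. A minimal sketch on a hypothetical income series (the function name `iqr_bounds` is our own, not part of the original notebook):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical incomes with one extreme value, like the 666666 in this dataset
s = pd.Series([30000, 45000, 52000, 60000, 666666])
lo, hi = iqr_bounds(s)
print(s[(s >= lo) & (s <= hi)].tolist())  # → [30000, 45000, 52000, 60000]
```

The fence width `k = 1.5` is the conventional default; a larger `k` keeps more of the tail.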
#Check pairwise correlations between variables
corrmat = data.corr(numeric_only=True)  # restrict to numeric columns (Education and Living_With are still strings here)
plt.figure(figsize=(20,20))  
sns.heatmap(corrmat,annot=True, cmap=cmap, center=0)
 

No variables are so strongly correlated that they need to be pruned.

#Encode the categorical variables as integers
s = (data.dtypes == 'object')
object_cols = list(s[s].index)
LE=LabelEncoder()
for i in object_cols:
    data[i]=data[[i]].apply(LE.fit_transform)
ds = data.copy()
# Drop the promotion-related variables so that the remaining features are better suited to dimensionality reduction and clustering
cols_del = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
ds = ds.drop(cols_del, axis=1)
#Standardize each variable
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns=ds.columns)
scaled_ds.head()
(Output: the first 5 rows of the 23 standardized columns, Education, Income, Kidhome, Teenhome, Recency, Wines, ..., Family_Size, Is_Parent, each with mean 0 and unit variance)

5 rows × 23 columns
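One caveat about the encoding step above: `LabelEncoder` assigns codes alphabetically, so Education becomes Graduate = 0, Postgraduate = 1, Undergraduate = 2, which scrambles the natural ordering of the levels. An explicit ordinal map preserves it; a sketch on a hypothetical mini-column (the `edu_order` dict is our own choice, not from the original notebook):

```python
import pandas as pd

# Explicit ordinal codes: Undergraduate < Graduate < Postgraduate
edu_order = {"Undergraduate": 0, "Graduate": 1, "Postgraduate": 2}

df = pd.DataFrame({"Education": ["Graduate", "Undergraduate", "Postgraduate"]})
df["Education"] = df["Education"].map(edu_order)
print(df["Education"].tolist())  # → [1, 0, 2]
```

For distance-based clustering this matters because the encoded values feed directly into the standardizer and PCA.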

Dimensionality reduction

We use PCA to extract principal components; for ease of visualization we keep three.

pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=(["col1","col2", "col3"]))
PCA_ds.describe().T
       count   mean          std       min        25%        50%        75%       max
col1   2212.0  1.670354e-16  2.878377  -5.969394  -2.538494  -0.780421  2.383290  7.444305
col2   2212.0  2.569775e-17  1.706839  -4.312196  -1.328316  -0.158123  1.242289  6.142721
col3   2212.0  4.336495e-17  1.221956  -3.530416  -0.829067  -0.022692  0.799895  6.611222
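Before settling on three components, it is worth checking how much of the total variance they actually retain via `explained_variance_ratio_`. A sketch on synthetic standardized data of the same shape as `scaled_ds` (the real ratios depend on the actual data, so no specific numbers are claimed here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for scaled_ds: samples x 23 standardized features
X = rng.normal(size=(200, 23))

pca = PCA(n_components=3).fit(X)
# Fraction of total variance captured by each of the 3 components, and their sum
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the retained fraction is low, the 3-D scatter and the clusters built on it only reflect part of the structure in the full feature space.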
x =PCA_ds["col1"]
y =PCA_ds["col2"]
z =PCA_ds["col3"]
#Plot the projected data in 3D
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

Clustering
#Use the elbow method to choose the number of clusters
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_ds)
Elbow_M.show()
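The elbow plot suggests k = 4. The silhouette score (from `sklearn.metrics`, which is already imported above) gives an independent check: it peaks at the k whose clusters are most compact and well separated. A sketch on synthetic 3-D blobs standing in for `PCA_ds` (the blob centers are our own choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for PCA_ds: 3-D points drawn from 4 well-separated blobs
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=42)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 4 for these well-separated blobs
```

On the real `PCA_ds`, the scores would be flatter, but agreement between the elbow and the silhouette peak strengthens the case for k = 4.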

 
#Cluster with K-Means (k = 4, as suggested by the elbow plot)
AC = KMeans(n_clusters=4)
# Fit the model and predict the cluster labels
yhat_AC = AC.fit_predict(PCA_ds)
PCA_ds["Clusters"] = yhat_AC
#Add the cluster labels to the original dataframe
data["Clusters"]= yhat_AC
fig = plt.figure(figsize=(10,8))
ax = plt.subplot(111, projection='3d', label="bla")
ax.scatter(x, y, z, s=40, c=PCA_ds["Clusters"], marker='o', cmap = cmap )
ax.set_title("The Plot Of The Clusters")
plt.show()

Analysis of the clustering results
#Plot the cluster counts
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=data["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()

pl = sns.scatterplot(data = data,x=data["Spent"], y=data["Income"],hue=data["Clusters"], palette= pal)
pl.set_title("Cluster's Profile based On Income And Spending")
# Cluster profiles based on income and spending
plt.legend()
plt.show()
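Beyond the scatter plot, a per-cluster summary table makes the profiles explicit. A sketch on a hypothetical mini-frame mirroring the `Income`, `Spent`, and `Clusters` columns used above (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-frame with the same columns as the scatter plot above
df = pd.DataFrame({
    "Clusters": [0, 0, 1, 1, 2, 2],
    "Income":   [70000, 75000, 30000, 28000, 55000, 60000],
    "Spent":    [1200, 1400, 100, 150, 600, 700],
})

# Mean income and spending per cluster, plus cluster sizes
profile = df.groupby("Clusters").agg(
    mean_income=("Income", "mean"),
    mean_spent=("Spent", "mean"),
    size=("Income", "size"),
)
print(profile)
```

Applied to the real `data`, the same `groupby` gives the numeric backbone for the "star customer" claims made in the conclusions below.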

#Acceptance of the promotion campaigns
data["Total_Promos"] = data["AcceptedCmp1"]+ data["AcceptedCmp2"]+ data["AcceptedCmp3"]+ data["AcceptedCmp4"]+ data["AcceptedCmp5"]
plt.figure()
pl = sns.countplot(x=data["Total_Promos"],hue=data["Clusters"], palette= pal)
pl.set_title("Count Of Promotion Accepted")
pl.set_xlabel("Number Of Total Accepted Promotions")
plt.show()

#Plot the number of deal purchases
plt.figure()
pl=sns.boxenplot(y=data["NumDealsPurchases"],x=data["Clusters"], palette= pal)
pl.set_title("Number of Deals Purchased")
plt.show()

Personal = [ "Kidhome","Teenhome","Customer_For", "Age", "Children", "Family_Size", "Is_Parent", "Education","Living_With"]

for i in Personal:
    plt.figure()
    sns.jointplot(x=data[i], y=data["Spent"], hue =data["Clusters"], kind="kde", palette=pal)
    plt.show()

Customer personality profiles

Finally, we summarize the personality traits of the different customer segments.

Marketing strategy analysis

From these results we can derive marketing strategies.

Clusters 1 and 0 are our star customers. Their buying habits:

Cluster 1 appears to favor catalog purchases
Cluster 0 appears to favor shopping in physical stores
Cluster 1 rarely visits the company website
Clusters 2 and 3 rarely shop online
Clusters 0, 2 and 4 pay close attention to the company website

Clusters 1 and 0 are both fond of buying wine and meat products

Advertising campaigns have little pull with any cluster
Clusters 1 and 3 are very willing to take part in deal promotions

Recommendations:

  • Rework the advertising campaigns; the current ones are largely ineffective
  • Discounts on wine are an effective way to attract cluster 0
  • Focusing on catalog sales helps retain star cluster 1; the company website should also be actively promoted to this cluster
  • Online shopping should be promoted to clusters 2 and 3 to encourage their spending
Reprinted from www.mshxw.com. Original article: https://www.mshxw.com/it/604835.html