- Hands-on exercise
- Merging data
- Grouping and aggregation in pandas
- Indexing in pandas
Problem: analyze how the movies in a dataset are distributed across genres
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
file_path = r"F:NLP项目IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
# print(df["Genre"].head(3))
# Build the list of all distinct genres
temp_list = df["Genre"].str.split(",").tolist()
genre_list = list(set([i for j in temp_list for i in j]))
# Build an all-zero DataFrame with one column per genre
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)
# print(zeros_df)
# Set 1 at the position of every genre a movie belongs to
for i in range(df.shape[0]):
    zeros_df.loc[i, temp_list[i]] = 1
# Sum each column to count the movies in each genre
genre_count = zeros_df.sum(axis=0)
# Sort in ascending order
genre_count = genre_count.sort_values()
_x = genre_count.index
_y = genre_count.values
# Plot a bar chart
plt.figure(figsize=(16, 9), dpi=144)
plt.bar(range(len(_x)), _y)
plt.xticks(range(len(_x)), _x)
plt.savefig("./电影分类.jpg")
plt.show()
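Output: a bar chart of the number of movies in each genre, sorted in ascending order (saved as ./电影分类.jpg).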
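The genre-counting loop above can also be written without an explicit loop. The following is a minimal sketch, assuming a small hypothetical `genres` Series in place of the real `Genre` column; it uses `Series.str.get_dummies`, which builds the 0/1 indicator table directly from the comma-separated strings.

```python
import pandas as pd

# Hypothetical stand-in for df["Genre"] from the movie CSV
genres = pd.Series(
    ["Action,Sci-Fi", "Drama", "Action,Drama,Thriller"], name="Genre"
)

# One indicator column per genre: 1 if the movie has that genre, else 0
dummies = genres.str.get_dummies(sep=",")

# Column sums give the number of movies per genre
genre_count = dummies.sum(axis=0).sort_values()
print(genre_count)
```

On a large dataset this is usually faster than assigning row by row with `.loc`, and the resulting `genre_count` can be plotted in the same way.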
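Merging data
In pandas, two datasets can be combined with the join() method or the **merge()** method, as follows:
Example 1: join() aligns on the index. The rows of the calling DataFrame are kept and the columns of the other DataFrame are attached to them; where the other DataFrame has no matching row, NaN is filled in.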
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((2, 4)), index=['A', 'B'], columns=list("abcd"))
df2 = pd.DataFrame(np.zeros((3, 3)), index=['A', 'B', 'C'], columns=list("xyz"))
print(df2.join(df1))
Output:
     x    y    z    a    b    c    d
A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
C  0.0  0.0  0.0  NaN  NaN  NaN  NaN
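To see that the calling DataFrame decides which rows survive, the join can be reversed. This is a small sketch that reuses the same two frames; the variable names `left` and `outer` are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
df2 = pd.DataFrame(np.zeros((3, 3)), index=["A", "B", "C"], columns=list("xyz"))

# Default how="left": only the rows of df1 (A and B) are kept
left = df1.join(df2)
print(left)

# how="outer": union of both indexes, so row C appears with NaN in a..d
outer = df1.join(df2, how="outer")
print(outer)
```

With `df1` as the caller, row C of `df2` is dropped unless `how="outer"` (or `how="right"`) is requested.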
Example 2: using merge()
Code:
import pandas as pd
import numpy as np
# First create two DataFrames
t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("zxcv"))
t2 = pd.DataFrame(np.array(range(16, 32)).reshape(4, 4), index=list("abcd"), columns=list("zxbn"))
print(t1)
print(t2)
# merge() defaults to an inner join (intersection) on the shared column names
t3 = t1.merge(t2)
print(t3)
# Outer join (union)
t4 = t1.merge(t2, how="outer")
print(t4)
# Left join: keep every row of t1, matching t1's z against t2's x
t5 = t1.merge(t2, left_on="z", right_on="x", how="left")
print(t5)
# Right join: keep every row of t2, matching on z
t6 = t1.merge(t2, left_on="z", right_on="z", how="right")
print(t6)
Output:
   z  x   c   v
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
    z   x   b   n
a  16  17  18  19
b  20  21  22  23
c  24  25  26  27
d  28  29  30  31
Empty DataFrame
Columns: [z, x, c, v, b, n]
Index: []
    z   x     c     v     b     n
0   0   1   2.0   3.0   NaN   NaN
1   4   5   6.0   7.0   NaN   NaN
2   8   9  10.0  11.0   NaN   NaN
3  16  17   NaN   NaN  18.0  19.0
4  20  21   NaN   NaN  22.0  23.0
5  24  25   NaN   NaN  26.0  27.0
6  28  29   NaN   NaN  30.0  31.0
   z_x  x_x   c   v  z_y  x_y    b    n
0    0    1   2   3  NaN  NaN  NaN  NaN
1    4    5   6   7  NaN  NaN  NaN  NaN
2    8    9  10  11  NaN  NaN  NaN  NaN
    z  x_x    c    v  x_y   b   n
0  16  NaN  NaN  NaN   17  18  19
1  20  NaN  NaN  NaN   21  22  23
2  24  NaN  NaN  NaN   25  26  27
3  28  NaN  NaN  NaN   29  30  31
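The inner merge above is empty because t1 and t2 share no values in their common columns. The sketch below, built on two small hypothetical frames with an overlapping `key` column (not part of the original data), shows what a non-empty inner merge looks like and how `indicator=True` reports where each row of an outer merge came from.

```python
import pandas as pd

# Hypothetical frames whose "key" columns overlap on 2 and 3
left = pd.DataFrame({"key": [1, 2, 3], "c": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [200, 300, 400]})

# Inner merge on the shared column name "key": only keys 2 and 3 survive
inner = left.merge(right)
print(inner)

# Outer merge keeps every key; the added _merge column says whether a row
# came from the left frame only, the right frame only, or both
outer = left.merge(right, how="outer", indicator=True)
print(outer)
```

The `_merge` column is a quick way to check why a merge returned fewer (or more) rows than expected.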
Grouping and aggregation in pandas
Problem: count the number of Starbucks stores in China and in the United States
Code:
import pandas as pd
import numpy as np
file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
grouped = df.groupby(by='Country')
# Call an aggregation method on the grouped data
country_count = grouped['Brand'].count()
print(country_count['US'])
print(country_count['CN'])
Output:
13608
2734
Indexing in pandas
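Basic index operations:
1. Get the index: df.index
2. Assign a new index: df.index = ['x', 'y']
3. Conform the data to a new index: df.reindex()
4. Use a column as the index: df.set_index()
5. Get the unique values of the index: df.set_index().index.unique()
6. Swap the inner and outer levels of a MultiIndex: df.swaplevel()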
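The operations listed above can be tried on a tiny frame. This is a minimal sketch on a hypothetical three-row DataFrame; the column names `a`, `b`, `c` and the row labels are made up for illustration.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"], "c": [7, 8, 9]})

print(df.index)                      # 1. the default RangeIndex

df.index = ["r0", "r1", "r2"]        # 2. assign new row labels in place

# 3. reindex selects/reorders by label; unknown labels give NaN rows
reindexed = df.reindex(["r0", "r2", "r9"])

by_b = df.set_index("b")             # 4. column b becomes the index
unique_b = by_b.index.unique()       # 5. unique index values

multi = df.set_index(["a", "b"])     # two-level MultiIndex (a outer, b inner)
swapped = multi.swaplevel()          # 6. b becomes outer, a becomes inner
print(swapped)
```

Note that `reindex` and `set_index` return new DataFrames, while assigning to `df.index` changes `df` itself.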
Exercise: rank the top ten countries by number of Starbucks stores
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
file_path = "./starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)
# Prepare the data: store count per country, ten largest first
data1 = df.groupby(by='Country')['Brand'].count().sort_values(ascending=False)[:10]
_x = data1.index
_y = data1.values
# Plot a bar chart
plt.figure(figsize=(16, 9), dpi=144)
plt.bar(_x, _y)
plt.savefig("./sort_country.jpg")
plt.show()
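Since the CSV file is not included here, the grouping step can be checked on a few made-up rows. The sketch below assumes only the two columns used above (`Brand` and `Country`); the data itself is hypothetical. It also shows `value_counts()` as a shorter route to the same ranking.

```python
import pandas as pd

# Hypothetical rows standing in for starbucks_store_worldwide.csv
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 6,
    "Country": ["US", "CN", "US", "JP", "US", "CN"],
})

# Group by country and count the non-null Brand entries in each group
country_count = df.groupby(by="Country")["Brand"].count()
print(country_count["US"], country_count["CN"])

# Largest groups first, then keep the top two
top2 = country_count.sort_values(ascending=False)[:2]
print(top2)

# value_counts() counts and sorts in descending order in one step
same = df["Country"].value_counts().head(2)
print(same)
```

`count()` skips missing values in the selected column, so if `Brand` could contain NaN, `size()` would be the safer way to count rows per group.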



