数据分析实战100例(基于SQL&Pandas)

文章目录

前言
1、探索Chipotle快餐数据
2、问题一览
3、题解
- 3.1、基于SQL
- 3.2、基于pandas
小结

前言

最近发现一个不错的数据实战小demo，总共有10个业务场景大约100个例题组成，笔者主要通过SQL与python的pandas包进行问题的解决，不足之处望读者多多指正。

1、探索Chipotle快餐数据

学习主题：了解自己的数据
对应数据集： chipotle.tsv
数据简介：
chipotle数据
字段信息 order_id（订单id） quantity （数量） item_name （类目） choice_description item_price（价格）

（图片源自网络，如有侵权，请联系删除）

2、问题一览

导入数据集
打印出全部的列名称
打印出前10条信息
被下单数最多商品(item)是什么？
在item_name这一列中，一共有多少种商品被下单？
在choice_description中，下单次数最多的商品是什么？
每一单(order)对应的平均总价是多少？
一共有多少商品被下单？
将item_price转换为浮点数
在该数据集对应的时期内，收入(revenue)是多少？
在该数据集对应的时期内，一共有多少订单？

3、题解 3.1、基于SQL

SELECt * FROM `chipotle`
-- 业务练习100例子.SQL
-- 了解chipotle数据
-- 字段信息  order_id quantity （数量） item_name  （类目）  choice_description  item_price

-- 1、导入数据集
--   见图

-- 2、打印出全部的列名称
desc `chipotle`

-- 3、打印出前10条信息
SELECt *
  from chipotle
 limit 10


-- 4、被下单数最多商品(item)是什么?
SELECt item_name
      ,max(num) AS max_num
  from (
		SELECt item_name
		     , sum(quantity) AS num
		  from chipotle
		 group by item_name
       ) a 

-- 5、在item_name这一列中，一共有多少种商品被下单？
SELECt count(distinct item_name)
  from chipotle

-- 6、在choice_description中，下单次数最多的商品是什么？
SELECt choice_description
      ,max(num) AS max_num
  from (
		SELECt choice_description
		     , sum(quantity) AS num
		  from chipotle
		 where choice_description !='NULL'
		 group by choice_description
       ) a 
 

-- 7、每一单(order)对应的平均总价是多少？
SELECt round(avg(jg),2)
  from (
		SELECt order_id
		      ,sum(cast(quantity as double))*cast(SUBSTR(item_price,2) as DOUBLE) as jg 
		  from chipotle
		group by order_id
        ) a
-- 8、一共有多少商品被下单？

SELECt count(1) as cnt 
  from chipotle

-- 9、将item_price转换为浮点数
SELECt ROUND(cast(SUBSTR(item_price,2) as float),2) as item_price
  from chipotle


-- 10、在该数据集对应的时期内，收入(revenue)是多少
SELECt ROUND(sum(quantity*item_price),2) as revenue
  FROM(
		SELECt order_id
					,sum(cast(quantity as double)) as quantity 
				,cast(SUBSTR(item_price,2) as DOUBLE) as item_price
			from chipotle
		group by order_id
						,cast(SUBSTR(item_price,2) as DOUBLE)
		  ) a 
			
-- 11、在该数据集对应的时期内，一共有多少订单？
SELECt count(distinct order_id) as order_id
  from chipotle

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--

3.2、基于pandas

# 了解chipotle数据
# 字段信息  order_id quantity （数量） item_name  （类目）  choice_description  item_price
# 1、导入数据
import pandas as pd 
chipo= pd.read_csv(r"E:codejupyter_notebookpandas_exerciseexercise_datachipotle.csv",
                  engine = 'python')

# 2、打印出全部的列名称
chipo.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

# 3、打印出前10条信息
chipo.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	Izze	[Clementine]	$3.39
2	1	1	Nantucket Nectar	[Apple]	$3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	$10.98
6	3	1	Side of Chips	NaN	$1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	$11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	$9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	$9.25

# 4、被下单数最多商品(item)是什么?
q4 = chipo[['item_name','quantity']].groupby(['item_name'],as_index=False).agg({'quantity':sum})
q4.sort_values(['quantity'],ascending=False,inplace=True)
q4.head(1)

	item_name	quantity
17	Chicken Bowl	761

# 5、在item_name这一列中，一共有多少种商品被下单？
chipo['item_name'].nunique()

# 6、在choice_description中，下单次数最多的商品是什么？
chipo['choice_description'].value_counts().head(1)

[Diet Coke]    134
Name: choice_description, dtype: int64

# 7、每一单(order)对应的平均总价是多少？
chipo[['order_id','sub_total']].groupby(by=['order_id']
).agg({'sub_total':'sum'})['sub_total'].mean()

21.394231188658654

# 8、一共有多少商品被下单？
total_items_orders = chipo['quantity'].sum()
total_items_orders

# 9、将item_price转换为浮点数
dollarizer = lambda x: float(x[1:-1])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)

# 10、在该数据集对应的时期内，收入(revenue)是多少
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()

39237.02

# 11、在该数据集对应的时期内，一共有多少订单？
chipo['order_id'].nunique()

小结

日常工作hive用习惯了，搞这个pandas来写有点不习惯，反而没觉得方便多少，但确实是比较省代码的，暂时就玩玩当做对比SQL学习预习复习pandas的一种方式，除了数据比赛和割韭菜的培训班暂时还无法脑补啥样子的业务场景会用pandas，不足之处，望多多指正。

数据分析实战100例(基于SQL&Pandas)

Python相关栏目本月热门文章