[pandas练习册] 如何在pandas中重写SQL命令实现同样的数据查询效果

前言

python必知必会的数据处理知识1(pandas))
python必知必会的数据处理知识2(pandas))
pandas中高频函数详细说明

之前有文章介绍pandas这个包，以上为对应链接。本文使用pandas实现SQL的语言功能，可以当做是pandas的练习。

建议本文的使用方法：

首先看SQL这一列想要取出来的数据，然后在jupyter上使用pandas操作取出对应数据。最后与答案进行核对。

首先准备数据：

所有数据从该链接下载： http://ourairports.com/data/

import pandas as pd
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')

1. SELECt, WHERe, DISTINCT, LIMIT

在SQL中的一些SELECT命令：我们使用where过滤数据，使用limit截断数据，使用distinct移除重复数据。那么在于pandas的对应中：

SQL	Pandas
select * from airports	airports
select * from airports limit 3	airports.head(3)
select id from airports where ident = ‘KLAX’	airports[airports.ident==‘KLAX’].id
select distinct type from airport	airports.type.unique()

2. SELECt with multiple conditions

SQL	Pandas
select * from airports where iso_region = ‘US-CA’ and type=‘seaplane_base’	airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘seaplane_base’)]
select ident, name, municipality from airports where iso_region = ‘US-CA’ and type=‘large_airport’	airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘large_airport’)][[‘ident’,‘name’,‘municipality’]]

3. ORDER BY

默认pandas会按照升序排序。关键函数sort_values()

SQL	Pandas
select * from airport_freq where airport_ident = ‘KLAX’ order by type	airports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’)
select * from airport_freq where airport_ident = ‘KLAX’ order by type desc	airports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’, ascending=False)

4. IN … NOT IN

pandas的.isin() 能够同样起到效果。取反的话，使用 ~

SQL	Pandas
select * from airport where type in (‘heliport’, ‘balloonport’)	airports[airports.type.isin([‘heliport’, ‘balloonport’])]
select * from airports where type not in (‘heliport’, ‘balloonport’)	airports[~airports.type.isin([‘heliport’, ‘balloonport’])]

5. GROUP BY, COUNT, ORDER BY

group by 很直接，在pandas中直接用.groupby()

count 有区别，pandas的count统计的是NaN value 的个数，如果想实现SQL中同样的语义，应该使用.size()

SQL	Pandas
select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type	airports.groupby([‘iso_country’, ‘type’]).size()
select iso_country, type, count() from airports group by iso_country, type order by iso_country, count() desc	airports.groupby([‘iso_country’,‘type’]).size().to_frame(‘size’).reset_index().sort_values([‘iso_country’,‘size’],ascending=[True, False])

第二个稍微复杂些，这里做些详细解释：

因为我们想要使用size的大小排序，所以.size()输出的结果需要成为Dataframe数据结构的一部分。而.groupby()操作之后，我们得到的是一个GroupByObject的数据类型。所以我们需要使用.to_frame()将其转换为Dataframe，使用.reset_index()重新设置索引。

6. HAVINg

SQL在过滤数据的时候，可以使用having
Pandas 中可以使用 .filter()并提供func（lambda）

SQL	Pandas
select type, count() from airports where iso_country = ‘US’ group by type having count() > 1000 order by count(*) desc	airports[airports.iso_country == ‘US’].groupby(‘type’).filter(lambda g: len(g) > 1000).groupby(‘type’).size().sort_values(ascending=False)

7. Top N records

by_country = airports.groupby(['iso_country']).size().to_frame('airport_count').reset_index()

首先通过如上命令获取一个by_country的dataframe

如下例子中，首先依据count选取最多的十个国家；然后选出第十一到第二十的国家。

SQL	Pandas
select iso_country from by_country order by size desc limit 10	by_country.nlargest(10, columns=‘airport_count’)
select iso_country from by_country order by size desc limit 10 offset 10	by_country.nlargest(20, columns=‘airport_count’).tail(10)

8. aggregate functions(MIN, MAX, MEAN)

SQL	Pandas
select max(length_ft), min(length_ft),avg(length_ft), median(length_ft) from runways	runways.agg({‘length_ft’:[‘min’,‘max’,‘mean’,‘median’]}).T

9. JOIN

Pandas对应的是.merge()函数，之前的文章有详细说明函数具体参数的含义。

SQL	Pandas
select airport_ident, type, description, frequency_mhz from airport_freq join airports on airport_freq.airport_ref = airports.id where airports.ident=‘KLAX’	airports_freq.merge(airports[airports.ident == ‘KLAX’][[‘id’]], left_on=‘airport_ref’,right_on=‘id’,how=‘inner’）[[‘airport_ident’,‘type’,‘description’,‘frequency_mhz’]]

10. UNIOn ALL and UNIOn

使用pd.concat()进行对应操作。

SQL	Pandas
select name, municipality from airports where ident = ‘KLAX’ union all select name, municipality from airports where ident=‘KLGB’	pd.concat([airports[airports.ident == ‘KLAX’][[‘name’,‘municipality’]], airports[airports.ident==‘KLGB’][[‘name’,‘municipality’]]])

11. INSERT

SQL	Pandas
create table heroes(id integer, name text);	df1 = pd.Dataframe({‘id’:[1,2],‘name’:[‘harry potter’, ‘ron weasley’]})
insert into heroes values(3, ‘hermione granger’)	pd.concat([df1, pd.Dataframe({‘id’:[3],‘name’:[‘hermione granger’]})]).reset_index(drop=True)

12. UPDATE

SQL	Pandas
update airports set home_link=‘http://www.lawa.org/welcomelax.aspx’ where ident=='KLAX	airports.loc[airports.ident==‘KLAX’,‘home_link’] = ‘http://www.lawa.org/welcomelax.aspx’

13. DELETE

SQL	Pandas
delete from lax_freq where type=‘MISC’	lax_freq=lax_freq[lax_freq[‘type’] !=‘MISC’]
lax_freq.drop(lax_freq[lax_freq.type==‘MISC’].index)

两种方法，一种直接重写，一种使用drop

14. and more

导出到其格式：

df.to_csv(...)  # csv file
df.to_hdf(...)  # HDF5 file
df.to_pickle(...)  # serialized object
df.to_sql(...)  # to SQL database
df.to_excel(...)  # to Excel sheet
df.to_json(...)  # to JSON string
df.to_html(...)  # render as HTML table
df.to_feather(...)  # binary feather-format
df.to_latex(...)  # tabular environment table
df.to_stata(...)  # Stata binary data files
df.to_msgpack(...)	# msgpack (serialize) object
df.to_gbq(...)  # to a Google BigQuery table.
df.to_string(...)  # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel

绘图：

top_10.plot(
    x='iso_country', 
    y='airport_count',
    kind='barh',
    figsize=(10, 7),
    title='Top 10 countries with most airports')

[pandas练习册] 如何在pandas中重写SQL命令实现同样的数据查询效果

大数据系统相关栏目本月热门文章