python必知必会的数据处理知识1(pandas))
python必知必会的数据处理知识2(pandas))
pandas中高频函数详细说明
之前有文章介绍pandas这个包,以上为对应链接。本文使用pandas实现SQL的语言功能,可以当做是pandas的练习。
建议本文的使用方法:
首先看SQL这一列想要取出来的数据,然后在jupyter上使用pandas操作取出对应数据。最后与答案进行核对。
首先准备数据:
所有数据从该链接下载: http://ourairports.com/data/
import pandas as pd
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
1. SELECt, WHERe, DISTINCT, LIMIT
在SQL中的一些SELECT命令:我们使用where过滤数据,使用limit截断数据,使用distinct移除重复数据。那么在于pandas的对应中:
| SQL | Pandas |
|---|---|
| select * from airports | airports |
| select * from airports limit 3 | airports.head(3) |
| select id from airports where ident = ‘KLAX’ | airports[airports.ident==‘KLAX’].id |
| select distinct type from airport | airports.type.unique() |
| SQL | Pandas |
|---|---|
| select * from airports where iso_region = ‘US-CA’ and type=‘seaplane_base’ | airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘seaplane_base’)] |
| select ident, name, municipality from airports where iso_region = ‘US-CA’ and type=‘large_airport’ | airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘large_airport’)][[‘ident’,‘name’,‘municipality’]] |
默认pandas会按照升序排序。关键函数sort_values()
| SQL | Pandas |
|---|---|
| select * from airport_freq where airport_ident = ‘KLAX’ order by type | airports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’) |
| select * from airport_freq where airport_ident = ‘KLAX’ order by type desc | airports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’, ascending=False) |
pandas的.isin() 能够同样起到效果。取反的话,使用 ~
| SQL | Pandas |
|---|---|
| select * from airport where type in (‘heliport’, ‘balloonport’) | airports[airports.type.isin([‘heliport’, ‘balloonport’])] |
| select * from airports where type not in (‘heliport’, ‘balloonport’) | airports[~airports.type.isin([‘heliport’, ‘balloonport’])] |
group by 很直接,在pandas中直接用.groupby()
count 有区别,pandas的count统计的是NaN value 的个数,如果想实现SQL中同样的语义,应该使用.size()
| SQL | Pandas |
|---|---|
| select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type | airports.groupby([‘iso_country’, ‘type’]).size() |
| select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc | airports.groupby([‘iso_country’,‘type’]).size().to_frame(‘size’).reset_index().sort_values([‘iso_country’,‘size’],ascending=[True, False]) |
第二个稍微复杂些,这里做些详细解释:
因为我们想要使用size的大小排序,所以.size()输出的结果需要成为Dataframe数据结构的一部分。而.groupby()操作之后,我们得到的是一个GroupByObject的数据类型。所以我们需要使用.to_frame()将其转换为Dataframe,使用.reset_index()重新设置索引。
6. HAVINgSQL在过滤数据的时候,可以使用having
Pandas 中可以使用 .filter()并提供func(lambda)
| SQL | Pandas |
|---|---|
| select type, count(*) from airports where iso_country = ‘US’ group by type having count(*) > 1000 order by count(*) desc | airports[airports.iso_country == ‘US’].groupby(‘type’).filter(lambda g: len(g) > 1000).groupby(‘type’).size().sort_values(ascending=False) |
by_country = airports.groupby(['iso_country']).size().to_frame('airport_count').reset_index()
首先通过如上命令获取一个by_country的dataframe
如下例子中,首先依据count选取最多的十个国家;然后选出第十一到第二十的国家。
| SQL | Pandas |
|---|---|
| select iso_country from by_country order by size desc limit 10 | by_country.nlargest(10, columns=‘airport_count’) |
| select iso_country from by_country order by size desc limit 10 offset 10 | by_country.nlargest(20, columns=‘airport_count’).tail(10) |
| SQL | Pandas |
|---|---|
| select max(length_ft), min(length_ft),avg(length_ft), median(length_ft) from runways | runways.agg({‘length_ft’:[‘min’,‘max’,‘mean’,‘median’]}).T |
Pandas对应的是.merge()函数,之前的文章有详细说明函数具体参数的含义。
| SQL | Pandas |
|---|---|
| select airport_ident, type, description, frequency_mhz from airport_freq join airports on airport_freq.airport_ref = airports.id where airports.ident=‘KLAX’ | airports_freq.merge(airports[airports.ident == ‘KLAX’][[‘id’]], left_on=‘airport_ref’,right_on=‘id’,how=‘inner’)[[‘airport_ident’,‘type’,‘description’,‘frequency_mhz’]] |
使用pd.concat()进行对应操作。
| SQL | Pandas |
|---|---|
| select name, municipality from airports where ident = ‘KLAX’ union all select name, municipality from airports where ident=‘KLGB’ | pd.concat([airports[airports.ident == ‘KLAX’][[‘name’,‘municipality’]], airports[airports.ident==‘KLGB’][[‘name’,‘municipality’]]]) |
| SQL | Pandas |
|---|---|
| create table heroes(id integer, name text); | df1 = pd.Dataframe({‘id’:[1,2],‘name’:[‘harry potter’, ‘ron weasley’]}) |
| insert into heroes values(3, ‘hermione granger’) | pd.concat([df1, pd.Dataframe({‘id’:[3],‘name’:[‘hermione granger’]})]).reset_index(drop=True) |
| SQL | Pandas |
|---|---|
| update airports set home_link=‘http://www.lawa.org/welcomelax.aspx’ where ident=='KLAX | airports.loc[airports.ident==‘KLAX’,‘home_link’] = ‘http://www.lawa.org/welcomelax.aspx’ |
| SQL | Pandas |
|---|---|
| delete from lax_freq where type=‘MISC’ | lax_freq=lax_freq[lax_freq[‘type’] !=‘MISC’] |
| lax_freq.drop(lax_freq[lax_freq.type==‘MISC’].index) |
两种方法,一种直接重写,一种使用drop
14. and more导出到其格式:
df.to_csv(...) # csv file df.to_hdf(...) # HDF5 file df.to_pickle(...) # serialized object df.to_sql(...) # to SQL database df.to_excel(...) # to Excel sheet df.to_json(...) # to JSON string df.to_html(...) # render as HTML table df.to_feather(...) # binary feather-format df.to_latex(...) # tabular environment table df.to_stata(...) # Stata binary data files df.to_msgpack(...) # msgpack (serialize) object df.to_gbq(...) # to a Google BigQuery table. df.to_string(...) # console-friendly tabular output. df.to_clipboard(...) # clipboard that can be pasted into Excel
绘图:
top_10.plot(
x='iso_country',
y='airport_count',
kind='barh',
figsize=(10, 7),
title='Top 10 countries with most airports')


![[pandas练习册] 如何在pandas中重写SQL命令实现同样的数据查询效果 [pandas练习册] 如何在pandas中重写SQL命令实现同样的数据查询效果](http://www.mshxw.com/aiimages/31/671149.png)
