Pandas & NumPy Simple Practice Usage

(This blog is only for my own reading and review, and for sharing with fellow programming lovers. It is not for any commercial use.)

Recently a friend of mine desperately asked me for help. It was his homework, which had to be done with pandas and NumPy (two Python libraries). They are frequently used for data cleansing and data aggregation. This post is split into a brief tutorial on pandas and a practice session that works on the actual data.

First and foremost, I need to introduce the basic terms and the common structure in pandas.

This is a DataFrame, a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. And by the way, each column in a DataFrame is a Series.
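To make the "dict of Series" idea concrete, here is a minimal sketch. The values and the variable names (doses, regions, df) are made up for illustration and are not taken from my friend's data.

```python
import pandas as pd

# Hypothetical toy data: two Series that share the same default index.
doses = pd.Series([10, 20, 30], name="FirstDose")
regions = pd.Series(["AT", "BE", "CZ"], name="Region")

# A DataFrame can be built from a dict that maps column names to Series.
df = pd.DataFrame({"Region": regions, "FirstDose": doses})

# Selecting a single column gives back a Series.
first_dose = df["FirstDose"]
print(type(first_dose).__name__)
print(df.dtypes)  # each column keeps its own type
```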

It is seriously powerful. The figure beneath shows how many file formats pandas can import. And most of the data operations you are likely to think of are available in pandas.

Next comes the practice session.

In this practice, I need to aggregate a table of 172282 rows x 9 columns. My friend gave me a csv file containing the data. It has to be imported in the command line or in a similar environment such as IDLE (Integrated Development and Learning Environment):

But first, we should check where the working directory is:

>>>import os
>>>os.chdir('C://Users/p1907831/desktop')    #I fancy operating in Desktop.
>>>os.getcwd()                               #To check if the path has changed to the proper place.
>>>import pandas as pd
>>>import numpy as np
>>>a=pd.read_csv('data.csv')                 #The csv file has been saved to Desktop, so read it directly.
>>>a                                         #Obviously you want to see the DataFrame first.
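If you want to try read_csv without the real file, the sketch below reads a tiny csv from a string instead. Only the column names Region, Vaccine, FirstDose and SecondDose come from this post; the three rows are invented.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv; the values are made up.
csv_text = """Region,Vaccine,FirstDose,SecondDose
AT,COM,100,80
AT,MOD,50,40
BE,COM,200,150
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.head())   # only the first rows, handy for a large table
print(sample.dtypes)   # column types inferred by pandas
```

With a table of 172282 rows, looking at shape and head() first is usually quicker than printing the whole DataFrame.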

Then it should look like this:

This is a sample of a DataFrame in Windows cmd. It would be really cumbersome to aggregate such a huge amount of data in Excel. Hence, we don't operate in Excel but in cmd with pandas.

>>>b=pd.read_csv('data.csv',usecols=['FirstDose'])
>>>c=pd.read_csv('data.csv',usecols=['SecondDose'])
>>>b
>>>c

Because we need to add FirstDose and SecondDose, we extract them into separate variables, 'b' and 'c'. Then we convert them into NumPy arrays.

>>>d=b.to_numpy()
>>>e=c.to_numpy()
>>>d
>>>e

They are arrays now, not DataFrames. Then add d and e:

>>>f=d+e

>>>f
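Here is a small sketch of what this addition does, using an invented three-row table named toy. Because each array comes from a one-column DataFrame, it has shape (n, 1), and the sum is element-wise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses the two column names from the post.
toy = pd.DataFrame({"FirstDose": [100, 50, 200], "SecondDose": [80, 40, 150]})

# Selecting with a one-item list keeps a DataFrame, so to_numpy() has shape (n, 1).
d = toy[["FirstDose"]].to_numpy()
e = toy[["SecondDose"]].to_numpy()

f = d + e           # element-wise addition, the result also has shape (n, 1)
print(f.shape)
print(f.ravel())    # the same values flattened to one dimension
```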

The total has been calculated. Now change the structure again and rename the column:

>>>g= pd.DataFrame(f)	#pd.DataFrame converts an array to a DataFrame.
>>>g
>>>h=g.rename(columns={0:'TotalDose'})   #Rename multiple columns: DF.rename(columns={0:'TotalDose','str':'str',int:'str'}), where each key is an old name (str or int) and each value is the new name.
>>>h

But there is no need to hurry, because what we insert into the original DataFrame is a column, not the DataFrame 'h' itself. So we should first change 'h' into a list:

>>>h.values.tolist() #This wraps every value in its own single-element list and puts them all in one outer list.

Apparently, if we write that into the csv file, there will be square brackets around the value in every cell. Hence, we should input this instead:

>>>i = h.stack().tolist()
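The difference between the two conversions is easier to see on a tiny example. The sketch below uses made-up values; selecting the column and calling tolist() is shown as one more way that, as far as I can tell, gives the same flat list.

```python
import pandas as pd

h = pd.DataFrame({"TotalDose": [180, 90, 350]})  # hypothetical values

nested = h.values.tolist()             # one inner list per row
flat = h.stack().tolist()              # one flat list of scalars
also_flat = h["TotalDose"].tolist()    # same flat list, taken from the column

print(nested)
print(flat)
```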

Then insert the list into the original DataFrame:

>>>a['TotalDose'] = i

Goddamn hell. The assignment accepts an array too!!

>>>a['TotalDose'] =array    #array stands for a NumPy array with one value per row of a
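In hindsight the whole detour can probably be shortened. The sketch below is an assumption about an equivalent shortcut, not what I actually ran: it adds the two columns directly and also shows the assignment of a 1-D array. The three rows and the column name TotalFromArray are invented.

```python
import pandas as pd

a = pd.DataFrame({
    "Region": ["AT", "AT", "BE"],
    "FirstDose": [100, 50, 200],
    "SecondDose": [80, 40, 150],
})

# Adding two Series is element-wise and aligned on the index.
a["TotalDose"] = a["FirstDose"] + a["SecondDose"]

# Assigning a 1-D NumPy array of matching length works as well.
a["TotalFromArray"] = a["FirstDose"].to_numpy() + a["SecondDose"].to_numpy()

print(a)
```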

Never mind. Let’s keep on.

We need the complete DataFrame as it is now, so let's write it to a csv file first:

>>>a.to_csv('utter.csv') #If the csv file does not exist yet, this function creates it in the current path.
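One detail worth knowing: by default to_csv also writes the row index as an unnamed first column. The sketch below, on invented data, compares the default with index=False. Calling to_csv without a path returns the csv text as a string, which is convenient for a quick look.

```python
import io
import pandas as pd

a = pd.DataFrame({"Region": ["AT", "BE"], "TotalDose": [180, 350]})  # made-up rows

with_index = a.to_csv()                 # no path given, so a string is returned
without_index = a.to_csv(index=False)   # leave the row index out

# Reading the text back gives the same columns again.
round_trip = pd.read_csv(io.StringIO(without_index))

print(with_index)
print(without_index)
print(round_trip.columns.tolist())
```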

Then we also need a csv file holding the sum of all integer data for each region. Consequently:

>>>k=a.groupby('Region').sum(numeric_only=True)	#case sensitive warning!! numeric_only=True sums only the numeric columns.
>>>k
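A minimal sketch of this grouping on a few invented rows. The numeric_only=True argument is my addition: it restricts the sum to numeric columns, so a text column such as Vaccine does not show up in the result.

```python
import pandas as pd

a = pd.DataFrame({
    "Region": ["AT", "AT", "BE"],
    "Vaccine": ["COM", "MOD", "COM"],
    "FirstDose": [100, 50, 200],
    "SecondDose": [80, 40, 150],
})

# One row per region; Region becomes the index of the result.
k = a.groupby("Region").sum(numeric_only=True)
print(k)
```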

"""
    The index has now changed to Region, every integer column has been summed up, and the non-numeric columns have been left out. Then write the result to a csv file.

	We also need a csv file holding the sums of all integer data grouped by region and by vaccine brand, i.e. every vaccine brand's purchases in every region:

"""
>>>l=a.groupby(['Region','Vaccine']).sum(numeric_only=True)	#case sensitive warning!!

Then write it to a csv file.
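Grouping by two columns gives a result with a two-level index. The sketch below uses invented rows; reset_index() is an optional extra step, not something I did in the original session, that turns the two levels back into ordinary columns before exporting.

```python
import pandas as pd

a = pd.DataFrame({
    "Region": ["AT", "AT", "AT", "BE"],
    "Vaccine": ["COM", "COM", "MOD", "COM"],
    "FirstDose": [100, 20, 50, 200],
    "SecondDose": [80, 10, 40, 150],
})

# One row per (Region, Vaccine) pair, indexed by a MultiIndex.
l = a.groupby(["Region", "Vaccine"]).sum(numeric_only=True)

# Optional: move Region and Vaccine back into regular columns.
flat_l = l.reset_index()

print(l)
print(flat_l)
```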

Over.

"""Supplement"""

"""To read a single column, or several columns, from a csv file or an Excel file"""

pd.read_csv('abaaba.csv',usecols=['abaaba','abaaba'])
pd.read_excel(""" abaabaaba """)

"""To select a single column, or several columns, from an existing DataFrame"""

a=pd.DataFrame({
'apple':['a','b','c'],
'banana':[1,2,3],
'carrot':['I','II','III']})

  apple  banana carrot
0     a       1      I
1     b       2     II
2     c       3    III

a.loc[:,'apple']        # : (colon) selects all rows; 'apple' is the column label
'''or'''
a.iloc[:,0]             # 0 is the integer position of the column


0    a
1    b
2    c
Name: apple, dtype: object

'''To access multiple columns in a DataFrame'''

a.loc[:,['apple','banana']]
'''or'''
a.iloc[:,[0,1]]

  apple  banana
0     a       1
1     b       2
2     c       3
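As a small check that the label-based and position-based forms agree, here is a sketch on the same fruit table. The row slices at the end are an extra illustration: loc includes the end label of a slice, while iloc excludes the end position.

```python
import pandas as pd

a = pd.DataFrame({
    "apple": ["a", "b", "c"],
    "banana": [1, 2, 3],
    "carrot": ["I", "II", "III"],
})

by_label = a.loc[:, ["apple", "banana"]]   # select columns by name
by_position = a.iloc[:, [0, 1]]            # select columns by integer position
print(by_label.equals(by_position))

# Rows can be restricted in the same call.
rows_loc = a.loc[0:1, "banana"]    # labels 0 and 1, end label included
rows_iloc = a.iloc[0:1, 1]         # position 0 only, end position excluded
print(rows_loc.tolist(), rows_iloc.tolist())
```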
