# Data Science Posts and Resources

Articles on Data Science

## Pandas (Python)

Pandas is a software module for the Python programming language for the purpose of data manipulation and analysis. It provides data structures and operations to manipulate numerical tables and time series data. It is build on top of numpy. It is applied for fast analysis and data cleaning and preparation.In this post we will learn various wasy to work with Pandas DataFrames.

Laxmi K Soni # Pandas

Pandas is probably the most powerful library. It provides high-performance tools for data manipulation and analysis. Furthermore, it is very effective at converting data formats and querying data out of databases. The two main data structures of Pandas are the series and the data frame. To work with Pandas, we need to import the module.

import pandas as pd

## Pandas Series

A series in Pandas is a one-dimensional array which is labeled. You can imagine it to be the equivalent of an ordinary Python dictionary.

series = pd.Series([ 10 , 20 , 30 , 40 ],
[ 'A' , 'B' , 'C' , 'D' ])

In order to create a series, we use the constructor of the Series class. The first parameter that we pass is a list full of values (in this case numbers). The second parameter is the list of the indices or keys (in this case strings). When we now print our series, we can see what the structure looks like.

print(series)
## A    10
## B    20
## C    30
## D    40
## dtype: int64

The first column represents the indices, whereas the second column represents the actual values.

#### ACCESSING VALUES

The accessing of values works in the same way that it works with dictionaries. We need to address the respective index or key to get our desired value.

print (series[ 'C' ])
## 30
print (series[ 1 ])
## 20

As you can see, we can choose how we want to access our elements. We can either address the key or the position that the respective element is at.

CONVERTING DICTIONARIES

Since series and dictionaries are quite similar, we can easily convert our Python dictionaries into Pandas series.

myDict = { 'A' : 10 , 'B' : 20 , 'C' : 30 }
series = pd.Series(myDict)

Now the keys are our indices and the values remain values. But what we can also do is, to change the order of the indices.

myDict = { 'A' : 10 , 'B' : 20 , 'C' : 30 }
series = pd.Series(myDict, index =[ 'C' , 'A' , 'B' ])

Our series now looks like this:

print(series)
## C    30
## A    10
## B    20
## dtype: int64

## PANDAS DataFrame

DataFrame is the main thing on which we’ll be mostly working on. Most manipulation or operation on the data will be applied by means of DataFrame.

#### Creating DataFrame using dictionary data

This is a simple process in which we just need to pass the json data to the DataFrame method.

cars = {'Brand':['Honda','Toyota','Ford','Audi'],'Price':[22000,21000,27000,35000]}
df = pd.DataFrame(cars)
df
##     Brand  Price
## 0   Honda  22000
## 1  Toyota  21000
## 2    Ford  27000
## 3    Audi  35000
data = { 'Name' : [ 'Anna' , 'Bob' , 'Charles' ], 'Age' : [ 24 , 32 , 35 ], 'Height' : [ 176 , 187 , 175 ]}
df = pd.DataFrame(data)

To create a Pandas data frame, we use the constructor of the class. In this case, we first create a dictionary with some data about three persons. We feed that data into our data frame. It then looks like this:

df
##       Name  Age  Height
## 0     Anna   24     176
## 1      Bob   32     187
## 2  Charles   35     175

As you can see, without any manual work, we already have a structured data frame and table.

To now access the values is a bit more complicated than with series. We have multiple columns and multiple rows, so we need to address two values.

print (df[ 'Name' ][ 1 ])
## Bob

So first we choose the column Name and then we choose the second element (index one) of this column. In this case, this is Bob .

When we omit the last index, we can also select only the one column. This is useful when we want to save specific columns of our data frame into a new one. What we can also do in this case is to select multiple columns.

print (df[[ 'Name' , 'Height' ]])
##       Name  Height
## 0     Anna     176
## 1      Bob     187
## 2  Charles     175

#### DATA FRAME FUNCTIONS

For data frames we have a couple of basic functions and attributes that we already know from lists or NumPy arrays.

BASIC FUNCTIONS AND ATTRIBUTES
df.T Transposes the rows and columns of the data frame
df.dtypes Returns data types of the data frame
df.ndim Returns the number of dimensions of the data frame
df.shape Returns the shape of the data frame
df.size Returns the number of elements in the data frame
df.head(n) Returns the first n rows of the data frame (default is five)
df.tail(n) Returns the last n rows of the data frame (default is five)

#### STATISTICAL FUNCTIONS

For the statistical functions, we will now extend our data frame a little bit and add some more persons.

data = { 'Name' : [ 'Anna' , 'Bob' , 'Charles' ,
'Daniel' , 'Evan' , 'Fiona' ,
'Gerald' , 'Henry' , 'India' ],
'Age' : [ 24 , 32 , 35 , 45 , 22 , 54 , 55 , 43 , 25 ],
'Height' : [ 176 , 187 , 175 , 182 , 176 ,
189 , 165 , 187 , 167 ]}
df = pd.DataFrame(data)
STATISTICAL FUNCTIONS
FUNCTION DESCRIPTION
count() Count the number of non-null elements
sum() Returns the sum of values of the selected columns
mean() Returns the arithmetic mean of values of the selected columns
median() Returns the median of values of the selected columns
mode() Returns the value that occurs most often in the columns selected
std() Returns standard deviation of the values
min() Returns the minimum value
max() Returns the maximum value
abs() Returns the absolute values of the elements
prod() Returns the product of the selected elements
describe() Returns data frame with all statistical values summarized

Now, we are not going to dig deep into every single function here. But let’s take a look at how to apply some of them.

print (df[ 'Age' ].mean())
## 37.22222222222222
print (df[ 'Height' ].median())
## 176.0

Here we choose a column and then apply the statistical functions on it. What we get is just a single scalar with the desired value.

We can also apply the functions to the whole data frame. In this case, we get returned another data frame with the results for each column.

print (df.mean())
## Age        37.222222
## Height    178.222222
## dtype: float64

#### APPLYING NUMPY FUNCTIONS

Instead of using the built-in Pandas functions, we can also use the methods we already know. For this, we just use the apply function of the data frame and then pass our desired method.

import numpy as np
print (df[ 'Age' ].apply(np.sin))
## 0   -0.905578
## 1    0.551427
## 2   -0.428183
## 3    0.850904
## 4   -0.008851
## 5   -0.558789
## 6   -0.999755
## 7   -0.831775
## 8   -0.132352
## Name: Age, dtype: float64

In this example, we apply the sine function onto our ages. It doesn’t make any sense but it demonstrates how this works.

#### LAMBDA EXPRESSIONS

A very powerful in Python are lambda expression . They can be thought of as nameless functions that we pass as a parameter.

print (df[ 'Age' ].apply( lambda x: x * 100 ))
## 0    2400
## 1    3200
## 2    3500
## 3    4500
## 4    2200
## 5    5400
## 6    5500
## 7    4300
## 8    2500
## Name: Age, dtype: int64

By using the keyword lambda we create a temporary variable that represents the individual values that we are applying the operation onto. After the colon, we define what we want to do. In this case, we multiply all values of the column Age by 100.

df = df[[ 'Age' , 'Height' ]]
print (df.apply( lambda x: x.max() - x.min()))
## Age       33
## Height    24
## dtype: int64

Here we removed the Name column, so that we only have numerical values. Since we are applying our expression on the whole data frame now, x refers to the whole columns. What we do here is calculating the difference between the maximum value and the minimum value.

#### ITERATING

Iterating over data frames is quite easy with Pandas. We can either do it in the classic way or use specific functions for it.

for x in df[ 'Age' ]:
print (x)
## 24
## 32
## 35
## 45
## 22
## 54
## 55
## 43
## 25

As you can see, iterating over a column’s value is very simple and nothing new. This would print all the ages. When we iterate over the whole data frame, our control variable takes on the column names.

STATISTICAL FUNCTIONS
iteritems() Iterator for key-value pairs
iterrows() Iterator for the rows (index, series)
itertuples() Iterator for the rows as named tuples

Let’s take a look at some practical examples.

for key, value in df.iteritems():
print ( '{}: {}' .format(key, value))
## Age: 0    24
## 1    32
## 2    35
## 3    45
## 4    22
## 5    54
## 6    55
## 7    43
## 8    25
## Name: Age, dtype: int64
## Height: 0    176
## 1    187
## 2    175
## 3    182
## 4    176
## 5    189
## 6    165
## 7    187
## 8    167
## Name: Height, dtype: int64

Here we use the iteritems function to iterate over key-value pairs. What we get is a huge output of all rows for each column.

On the other hand, when we use iterrows , we can print out all the column-values for each row or index.

for index, value in df.iterrows():
print (index,value)
## 0 Age        24
## Height    176
## Name: 0, dtype: int64
## 1 Age        32
## Height    187
## Name: 1, dtype: int64
## 2 Age        35
## Height    175
## Name: 2, dtype: int64
## 3 Age        45
## Height    182
## Name: 3, dtype: int64
## 4 Age        22
## Height    176
## Name: 4, dtype: int64
## 5 Age        54
## Height    189
## Name: 5, dtype: int64
## 6 Age        55
## Height    165
## Name: 6, dtype: int64
## 7 Age        43
## Height    187
## Name: 7, dtype: int64
## 8 Age        25
## Height    167
## Name: 8, dtype: int64

#### SORTING

One very powerful thing about Pandas data frames is that we can easily sort them.

#### SORT BY INDEX

df = pd.DataFrame(np.random.rand( 10 , 2 ),
index =[ 1 , 5 , 3 , 6 , 7 , 2 , 8 , 9 , 0 , 4 ],
columns =[ 'A' , 'B' ])

Here we create a new data frame, which is filled with random numbers. We specify our own indices and as you can see, they are completely unordered.

print (df.sort_index())
##           A         B
## 0  0.661836  0.395690
## 1  0.297976  0.352147
## 2  0.762745  0.633820
## 3  0.962743  0.967597
## 4  0.589327  0.829333
## 5  0.804942  0.239460
## 6  0.233210  0.009424
## 7  0.840167  0.355551
## 8  0.585224  0.235954
## 9  0.118757  0.958975

By using the method sort_index , we sort the whole data frame by the index column. The result is now sorted:

#### INPLACE PARAMETER

When we use functions that manipulate our data frame, we don’t actually change it but we return a manipulated copy. If we wanted to apply the changes on the actual data frame, we would need to do it like this:

df = df.sort_index()

But Pandas offers us another alternative as well. This alternative is the parameter inplace . When this parameter is set to True , the changes get applied to our actual data frame

df.sort_index( inplace = True )

#### SORT BY COLUMNS

Now, we can also sort our data frame by specific columns.

data = { 'Name' : [ 'Anna' , 'Bob' , 'Charles' ,
'Daniel' , 'Evan' , 'Fiona' ,
'Gerald' , 'Henry' , 'India' ],
'Age' : [ 24 , 24 , 35 , 45 , 22 , 54 , 54 , 43 , 25 ],
'Height' : [ 176 , 187 , 175 , 182 , 176 ,
189 , 165 , 187 , 167 ]}

df = pd.DataFrame(data)

df.sort_values( by =[ 'Age' , 'Height' ], inplace = True )

print (df)
##       Name  Age  Height
## 4     Evan   22     176
## 0     Anna   24     176
## 1      Bob   24     187
## 8    India   25     167
## 2  Charles   35     175
## 7    Henry   43     187
## 3   Daniel   45     182
## 6   Gerald   54     165
## 5    Fiona   54     189

Here we have our old data frame slightly modified. We use the function sort_values to sort our data frames. The parameter by states the columns that we are sorting by. In this case, we are first sorting by age and if two persons have the same age, we sort by height.

#### JOINING AND MERGING

Another powerful concept in Pandas is joining and merging data frames.

names = pd.DataFrame({
'id' : [ 1 , 2 , 3 , 4 , 5 ],
'name' : [ 'Anna' , 'Bob' , 'Charles' ,
'Daniel' , 'Evan' ],
})
ages = pd.DataFrame({
'id' : [ 1 , 2 , 3 , 4 , 5 ],
'age' : [ 20 , 30 , 40 , 50 , 60 ]
})
df = pd.merge(names,ages, on = 'id' )
df.set_index( 'id' , inplace = True )

First we use the method merge and specify the column to merge on. We then have a new data frame with the combined data but we also want our id column to be the index. For this, we use the set_index method. Now when we have two separate data frames which are related to one another, we can combine them into one data frame. It is important that we have a common column that we can merge on. In this case, this is id .

#### JOINS

It is not necessarily always obvious how we want to merge our data frames. This is where joins come into play. We have four types of joins.

JOIN MERGE TYPES
left Uses all keys from left object and merges with right
right Uses all keys from right object and merges with left
outer Uses all keys from both objects and merges them
inner Uses only the keys which both objects have and merges them
(default)

Now let’s change our two data frames a little bit.

names = pd.DataFrame({
'id' : [ 1 , 2 , 3 , 4 , 5 , 6 ],
'name' : [ 'Anna' , 'Bob' , 'Charles' ,
'Daniel' , 'Evan' , 'Fiona' ],
})
ages = pd.DataFrame({
'id' : [ 1 , 2 , 3 , 4 , 5 , 7 ],
'age' : [ 20 , 30 , 40 , 50 , 60 , 70 ],
'Height' : [ 176 , 187 , 175 , 182 , 176 ,
189 ]
})

Our names frame now has an additional index 6 and an additional name. And our ages frame has an additional index 7 with an additional name.

df = pd.merge(names,ages, on = 'id' , how = 'inner' )
df.set_index( 'id' , inplace = True )

If we now perform the default inner join , we will end up with the same data frame as in the beginning. We only take the keys which both objects have. This means one to five.

df = pd.merge(names,ages, on = 'id' , how = 'left' )
df.set_index( 'id' , inplace = True )

When we use the left join , we get all the keys from the names data frame but not the additional index 7 from ages. This also means that Fiona won’t be assigned any age.

The same principle goes for the right join just the other way around

df = pd.merge(names,ages, on = 'id' , how = 'right' )
df.set_index( 'id' , inplace = True )

Now, we only have the keys from the ages frame and the 6 is missing. Finally, if we use the outer join , we combine all keys into one data frame.

df = pd.merge(names,ages, on = 'id' , how = 'outer' )
df.set_index( 'id' , inplace = True )

#### QUERYING DATA

Like in databases with SQL, we can also query data from our data frames in Pandas. For this, we use the function loc , in which we put our expression.

print (df.loc[df[ 'age' ] == 24 ])
## Empty DataFrame
## Columns: [name, age, Height]
## Index: []
print (df.loc[(df[ 'age' ] == 24 ) &
(df[ 'Height' ] > 180 )])
## Empty DataFrame
## Columns: [name, age, Height]
## Index: []
print (df.loc[df[ 'age' ] > 30 ][ 'name' ])
## id
## 3    Charles
## 4     Daniel
## 5       Evan
## 7        NaN
## Name: name, dtype: object

Here we have some good examples to explain how this works. The first expression returns all rows where the value for Age is 24.

The second query is a bit more complicated. Here we combine two conditions. The first one is that the age needs to be 24 but we then combine this with the condition that the height is greater than 180. This leaves us with one row.

In the last expression, we can see that we are only choosing one column to be returned. We want the names of all people that are older than 30.

Similar to NumPy, we can also easily read data from external files into Pandas. Let’s say we have an CSV-File like this (opened in Excel): The only thing that we need to do now is to use the function read_csv to import our data into a data frame.

df = pd.read_csv( 'data.csv' )
df.set_index( 'id' , inplace = True )
print (df)

We also set the index to the id column again. This is what we have imported:

This of course, also works the other way around. By using the method to_csv , we can also save our data frame into a CSV-file.

data = { 'Name' : [ 'Anna' , 'Bob' , 'Charles' ,
'Daniel' , 'Evan' , 'Fiona' ,
'Gerald' , 'Henry' , 'India' ],
'Age' : [ 24 , 24 , 35 , 45 , 22 , 54 , 54 , 43 , 25 ],
'Height' : [ 176 , 187 , 175 , 182 , 176 ,
189 , 165 , 187 , 167 ]}
df = pd.DataFrame(data)
df.to_csv( 'mydf.csv' )

#### PLOTTING DATA

Since Pandas builds on Matplotlib, we can easily visualize the data from our data frame.

data = { 'Name' : [ 'Anna' , 'Bob' , 'Charles' ,
'Daniel' , 'Evan' , 'Fiona' ,
'Gerald' , 'Henry' , 'India' ],
'Age' : [ 24 , 24 , 35 , 45 , 22 , 54 , 54 , 43 , 25 ],
'Height' : [ 176 , 187 , 175 , 182 , 176 ,
189 , 165 , 187 , 167 ]}
df = pd.DataFrame(data)
df.sort_values( by =[ 'Age' , 'Height' ])
df.hist()
plt.show()

In this example, we use the method hist to plot a histogram of our numerical columns. Without specifying anything more, this is what we end up with: But we can also just use the function plot to plot our data frame or individual columns.

df.plot()
plt.show()

The result is the following: 