Data Science Posts and Resources

Articles on Data Science

Understanding frequency tables

Frequency tables are a great tool to help explore datasets and get an idea about relationships between variables

Laxmi K Soni

6-Minute Read

Introduction

To discover the relationship between variables is the main task of data analysis. Tools like frequency tables helps to explore the data and get an idea of the relationships between variables. A frequency table is just a data table that shows the counts of one or more categorical variables.

To explore frequency tables, we will take the titanic dataset


import numpy as np
import pandas as pd
import os

tit_train = pd.read_csv("../data/titanic/train.csv")

cabin_as_text = tit_train['Cabin'].astype(str)

new_cabin = np.array([cabin[0] for cabin in cabin_as_text])

tit_train["Cabin"] = pd.Categorical(new_cabin)

One-Way Tables

In pandas frequency tables are known as crosstabs. Using the pd.crosstab() function we can get the frequency tables. This function takes one or more array type objects as indexes or columns and then constructs a new dataframe of the variable counts based on the supplied arrays.


cross_tab_survived = pd.crosstab(index=tit_train["Survived"], columns="count")      
print(cross_tab_survived)
## col_0     count
## Survived       
## 0           549
## 1           342

We can make a more crosstabs to explore other variables

cross_tab_plcass = pd.crosstab(index=tit_train["Pclass"], columns="count")   
print(cross_tab_plcass)
## col_0   count
## Pclass       
## 1         216
## 2         184
## 3         491
cross_tab_sex = pd.crosstab(index=tit_train["Sex"], columns="count")   
print(cross_tab_sex)
## col_0   count
## Sex          
## female    314
## male      577
cross_tab_cab = pd.crosstab(index=tit_train["Cabin"], columns="count")   
print(cross_tab_cab)
## col_0  count
## Cabin       
## A         15
## B         47
## C         59
## D         33
## E         32
## F         13
## G          4
## T          1
## n        687

The one-way tables give us useful insights. We can see the distribution of records across the categories. For example we find that males are more than females by a significant margin and there were more third class travelers than first and second class combined.

Since the crosstab function produces DataFrames, the DataFrame operations we’ve learned work on crosstabs:

print (cross_tab_cab.sum(), "\n")   # Sum the counts
## col_0
## count    891
## dtype: int64
print (cross_tab_cab.shape, "\n")   # Check number of rows and cols
## (9, 1)

One of the most useful feature of frequency tables is that they allow to extract the proportion of the data that belongs to each category. With a one-way table, we can do this by dividing each table value by the total number of records in the table.

cross_tab_cab / cross_tab_cab.sum()
## col_0     count
## Cabin          
## A      0.016835
## B      0.052750
## C      0.066218
## D      0.037037
## E      0.035915
## F      0.014590
## G      0.004489
## T      0.001122
## n      0.771044
cross_tab_cab
## col_0  count
## Cabin       
## A         15
## B         47
## C         59
## D         33
## E         32
## F         13
## G          4
## T          1
## n        687

Two-Way Tables

Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give us insight into the relationship between two variables. To create a two way table, pass two variables to the pd.crosstab() function instead of one.

Table of survival vs. sex

survive_sex = pd.crosstab(index=tit_train["Survived"], 
                           columns=tit_train["Sex"])

survive_sex.index= ["died","survived"]

survive_sex
## Sex       female  male
## died          81   468
## survived     233   109

Table of survival vs passenger class

survive_class = pd.crosstab(index=tit_train["Survived"], 
                            columns=tit_train["Pclass"])

survive_class.columns = ["class1","class2","class3"]
survive_class.index= ["died","survived"]

survive_class
##           class1  class2  class3
## died          80      97     372
## survived     136      87     119

Table of survival vs passenger class


survived_class = pd.crosstab(index=tit_train["Survived"], 
                            columns=tit_train["Pclass"],
                             margins=True)   # Include row and column totals

survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]

survived_class
##           class1  class2  class3  rowtotal
## died          80      97     372       549
## survived     136      87     119       342
## coltotal     216     184     491       891

To get the proportion of counts along each column (in this case, the survival rate within each passenger class) divide by the column totals:

survived_class/survived_class.loc["coltotal"]
##            class1    class2    class3  rowtotal
## died      0.37037  0.527174  0.757637  0.616162
## survived  0.62963  0.472826  0.242363  0.383838
## coltotal  1.00000  1.000000  1.000000  1.000000

To get the proportion of counts along each row divide by the row totals. Use the df.div() to the get division to on a column by column basis:

survived_class.div(survived_class["rowtotal"],
                   axis=0)
##             class1    class2    class3  rowtotal
## died      0.145719  0.176685  0.677596       1.0
## survived  0.397661  0.254386  0.347953       1.0
## coltotal  0.242424  0.206510  0.551066       1.0

Alternatively, you can transpose the table with df.T to swap rows and columns and perform row by row division as normal:

survived_class.T/survived_class["rowtotal"]
##               died  survived  coltotal
## class1    0.145719  0.397661  0.242424
## class2    0.176685  0.254386  0.206510
## class3    0.677596  0.347953  0.551066
## rowtotal  1.000000  1.000000  1.000000

Higher Dimensional Tables

The crosstab() function lets you create tables out of more than two categories. Higher dimensional tables can be a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple variables. Let’s create a 3-way table inspecting survival, sex and passenger class:

surv_sex_class = pd.crosstab(index=tit_train["Survived"], 
                             columns=[tit_train["Pclass"],
                                      tit_train["Sex"]],
                             margins=True)   # Include row and column totals

surv_sex_class
## Pclass        1           2           3       All
## Sex      female male female male female male     
## Survived                                         
## 0             3   77      6   91     72  300  549
## 1            91   45     70   17     72   47  342
## All          94  122     76  108    144  347  891

Notice that by passing a second variable to the columns argument, the resulting table has columns categorized by both Pclass and Sex.

surv_sex_class[2]        
## Sex       female  male
## Survived              
## 0              6    91
## 1             70    17
## All           76   108
surv_sex_class[2]["female"]   
## Survived
## 0       6
## 1      70
## All    76
## Name: female, dtype: int64

Due to the convenient hierarchical structure of the table, we still use one division to get the proportion of survival across each column:

surv_sex_class/surv_sex_class.loc["All"]    # Divide by column totals
## Pclass           1                   2                3                 All
## Sex         female      male    female      male female      male          
## Survived                                                                   
## 0         0.031915  0.631148  0.078947  0.842593    0.5  0.864553  0.616162
## 1         0.968085  0.368852  0.921053  0.157407    0.5  0.135447  0.383838
## All       1.000000  1.000000  1.000000  1.000000    1.0  1.000000  1.000000

Here we see that over 90% of women in 1st class and 2nd class survived, but only 50% of women in 3rd class survived. Men in 1st class also survived with a greater rate than men in lower classes. Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature in a predictive model.

Summary

Frequency tables are effective tools for understanding relationships between features of a dataset. It is easy to inspect the data in the frequency tables. Sometimes creating plots from the frequency tables helps in detecting the patterns in the data.

Say Something

Comments

Nothing yet.

Recent Posts

Categories

About

about