Understanding frequency tables
Frequency tables are a great tool to help explore datasets and get an idea about relationships between variables
Introduction
To discover the relationship between variables is the main task of data analysis. Tools like frequency tables helps to explore the data and get an idea of the relationships between variables. A frequency table is just a data table that shows the counts of one or more categorical variables.
To explore frequency tables, we will take the titanic dataset
import numpy as np
import pandas as pd
import os
tit_train = pd.read_csv("../data/titanic/train.csv")
cabin_as_text = tit_train['Cabin'].astype(str)
new_cabin = np.array([cabin[0] for cabin in cabin_as_text])
tit_train["Cabin"] = pd.Categorical(new_cabin)
One-Way Tables
In pandas frequency tables are known as crosstabs. Using the pd.crosstab() function we can get the frequency tables. This function takes one or more array type objects as indexes or columns and then constructs a new dataframe of the variable counts based on the supplied arrays.
cross_tab_survived = pd.crosstab(index=tit_train["Survived"], columns="count")
print(cross_tab_survived)
## col_0 count
## Survived
## 0 549
## 1 342
We can make a more crosstabs to explore other variables
cross_tab_plcass = pd.crosstab(index=tit_train["Pclass"], columns="count")
print(cross_tab_plcass)
## col_0 count
## Pclass
## 1 216
## 2 184
## 3 491
cross_tab_sex = pd.crosstab(index=tit_train["Sex"], columns="count")
print(cross_tab_sex)
## col_0 count
## Sex
## female 314
## male 577
cross_tab_cab = pd.crosstab(index=tit_train["Cabin"], columns="count")
print(cross_tab_cab)
## col_0 count
## Cabin
## A 15
## B 47
## C 59
## D 33
## E 32
## F 13
## G 4
## T 1
## n 687
The one-way tables give us useful insights. We can see the distribution of records across the categories. For example we find that males are more than females by a significant margin and there were more third class travelers than first and second class combined.
Since the crosstab function produces DataFrames, the DataFrame operations we’ve learned work on crosstabs:
print (cross_tab_cab.sum(), "\n") # Sum the counts
## col_0
## count 891
## dtype: int64
print (cross_tab_cab.shape, "\n") # Check number of rows and cols
## (9, 1)
One of the most useful feature of frequency tables is that they allow to extract the proportion of the data that belongs to each category. With a one-way table, we can do this by dividing each table value by the total number of records in the table.
cross_tab_cab / cross_tab_cab.sum()
## col_0 count
## Cabin
## A 0.016835
## B 0.052750
## C 0.066218
## D 0.037037
## E 0.035915
## F 0.014590
## G 0.004489
## T 0.001122
## n 0.771044
cross_tab_cab
## col_0 count
## Cabin
## A 15
## B 47
## C 59
## D 33
## E 32
## F 13
## G 4
## T 1
## n 687
Two-Way Tables
Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give us insight into the relationship between two variables. To create a two way table, pass two variables to the pd.crosstab() function instead of one.
Table of survival vs. sex
survive_sex = pd.crosstab(index=tit_train["Survived"],
columns=tit_train["Sex"])
survive_sex.index= ["died","survived"]
survive_sex
## Sex female male
## died 81 468
## survived 233 109
Table of survival vs passenger class
survive_class = pd.crosstab(index=tit_train["Survived"],
columns=tit_train["Pclass"])
survive_class.columns = ["class1","class2","class3"]
survive_class.index= ["died","survived"]
survive_class
## class1 class2 class3
## died 80 97 372
## survived 136 87 119
Table of survival vs passenger class
survived_class = pd.crosstab(index=tit_train["Survived"],
columns=tit_train["Pclass"],
margins=True) # Include row and column totals
survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]
survived_class
## class1 class2 class3 rowtotal
## died 80 97 372 549
## survived 136 87 119 342
## coltotal 216 184 491 891
To get the proportion of counts along each column (in this case, the survival rate within each passenger class) divide by the column totals:
survived_class/survived_class.loc["coltotal"]
## class1 class2 class3 rowtotal
## died 0.37037 0.527174 0.757637 0.616162
## survived 0.62963 0.472826 0.242363 0.383838
## coltotal 1.00000 1.000000 1.000000 1.000000
To get the proportion of counts along each row divide by the row totals. Use the df.div() to the get division to on a column by column basis:
survived_class.div(survived_class["rowtotal"],
axis=0)
## class1 class2 class3 rowtotal
## died 0.145719 0.176685 0.677596 1.0
## survived 0.397661 0.254386 0.347953 1.0
## coltotal 0.242424 0.206510 0.551066 1.0
Alternatively, you can transpose the table with df.T to swap rows and columns and perform row by row division as normal:
survived_class.T/survived_class["rowtotal"]
## died survived coltotal
## class1 0.145719 0.397661 0.242424
## class2 0.176685 0.254386 0.206510
## class3 0.677596 0.347953 0.551066
## rowtotal 1.000000 1.000000 1.000000
Higher Dimensional Tables
The crosstab() function lets you create tables out of more than two categories. Higher dimensional tables can be a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple variables. Let’s create a 3-way table inspecting survival, sex and passenger class:
surv_sex_class = pd.crosstab(index=tit_train["Survived"],
columns=[tit_train["Pclass"],
tit_train["Sex"]],
margins=True) # Include row and column totals
surv_sex_class
## Pclass 1 2 3 All
## Sex female male female male female male
## Survived
## 0 3 77 6 91 72 300 549
## 1 91 45 70 17 72 47 342
## All 94 122 76 108 144 347 891
Notice that by passing a second variable to the columns argument, the resulting table has columns categorized by both Pclass and Sex.
surv_sex_class[2]
## Sex female male
## Survived
## 0 6 91
## 1 70 17
## All 76 108
surv_sex_class[2]["female"]
## Survived
## 0 6
## 1 70
## All 76
## Name: female, dtype: int64
Due to the convenient hierarchical structure of the table, we still use one division to get the proportion of survival across each column:
surv_sex_class/surv_sex_class.loc["All"] # Divide by column totals
## Pclass 1 2 3 All
## Sex female male female male female male
## Survived
## 0 0.031915 0.631148 0.078947 0.842593 0.5 0.864553 0.616162
## 1 0.968085 0.368852 0.921053 0.157407 0.5 0.135447 0.383838
## All 1.000000 1.000000 1.000000 1.000000 1.0 1.000000 1.000000
Here we see that over 90% of women in 1st class and 2nd class survived, but only 50% of women in 3rd class survived. Men in 1st class also survived with a greater rate than men in lower classes. Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature in a predictive model.
Summary
Frequency tables are effective tools for understanding relationships between features of a dataset. It is easy to inspect the data in the frequency tables. Sometimes creating plots from the frequency tables helps in detecting the patterns in the data.