Exploring and preparing data
Exploring and preparing data with the Titanic dataset
Introduction
The first part of any data analysis or predictive modeling task is an initial exploration of the dataset. Even if we collected the data ourselves and already have a list of questions in mind that we want to answer, it is important to explore the dataset before doing any serious analysis, since oddities in the data can cause bugs and muddle results. Before exploring deeper questions, we need to answer many simpler ones about the form and quality of the data. That said, it is important to go into initial data exploration with a big-picture question in mind.
This post raises some of the questions we ought to consider when we look at a new dataset for the first time and shows how to perform various Python operations related to those questions.
In this post, we’ll explore the Titanic disaster training set available from Kaggle.com. The dataset consists of 891 passengers who rode aboard the Titanic.
Getting the dataset
To get the dataset into a pandas DataFrame, simply call the read_csv function:
import pandas as pd
import numpy as np
tit_train = pd.read_csv("../data/titanic/train.csv") # Read the data
Check the dimensions of the dataset with df.shape and the variable data types with df.dtypes:
tit_train.shape
## (891, 12)
tit_train.dtypes
## PassengerId int64
## Survived int64
## Pclass int64
## Name object
## Sex object
## Age float64
## SibSp int64
## Parch int64
## Ticket object
## Fare float64
## Cabin object
## Embarked object
## dtype: object
The output shows that we’re working with a set of 891 records and 12 columns. Most of the column variables are encoded as numeric data types (ints and floats), but some of them are encoded as “object”.
Check the head of the data to get a better sense of what the variables look like:
tit_train.head(5)
## PassengerId Survived Pclass ... Fare Cabin Embarked
## 0 1 0 3 ... 7.2500 NaN S
## 1 2 1 1 ... 71.2833 C85 C
## 2 3 1 3 ... 7.9250 NaN S
## 3 4 1 1 ... 53.1000 C123 S
## 4 5 0 3 ... 8.0500 NaN S
##
## [5 rows x 12 columns]
We have a combination of numeric columns and columns with text data.
In dataset analysis, variables or features that split records into a fixed number of unique categories, such as Sex, are called categorical variables.
Pandas won’t necessarily interpret categorical variables as such when we load the dataset, but we can convert a variable to the categorical type if necessary.
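As a quick illustration (working on a copy here, so the dataframe itself is left unchanged for now; sex_cat is just a throwaway name), a column can be converted with astype("category"):
sex_cat = tit_train["Sex"].astype("category")  # convert the column to the pandas categorical type
sex_cat.cat.categories                         # the inferred levels
## Index(['female', 'male'], dtype='object')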
After getting a sense of the dataset’s structure, it is good practice to look at a statistical summary of the features with df.describe():
tit_train.describe().transpose()
## count mean std ... 50% 75% max
## PassengerId 891.0 446.000000 257.353842 ... 446.0000 668.5 891.0000
## Survived 891.0 0.383838 0.486592 ... 0.0000 1.0 1.0000
## Pclass 891.0 2.308642 0.836071 ... 3.0000 3.0 3.0000
## Age 714.0 29.699118 14.526497 ... 28.0000 38.0 80.0000
## SibSp 891.0 0.523008 1.102743 ... 0.0000 1.0 8.0000
## Parch 891.0 0.381594 0.806057 ... 0.0000 0.0 6.0000
## Fare 891.0 32.204208 49.693429 ... 14.4542 31.0 512.3292
##
## [7 rows x 8 columns]
We notice that the non-numeric columns are omitted from the statistical summary provided by df.describe().
We can find the summary of the categorical variables by passing only those columns to describe():
cat_vars = tit_train.dtypes[tit_train.dtypes == "object"].index
print(cat_vars)
## Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
tit_train[cat_vars].describe().transpose()
## count unique top freq
## Name 891 891 Oreskovic, Mr. Luka 1
## Sex 891 2 male 577
## Ticket 891 681 1601 7
## Cabin 204 147 C23 C25 C27 4
## Embarked 889 3 S 644
The summary of the categorical features shows the count of non-NaN records, the number of unique categories, the most frequently occurring value, and the number of occurrences of that value.
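To dig into one categorical variable in particular, value_counts() gives the full frequency table; for example, passing dropna=False also surfaces the two missing Embarked values:
tit_train["Embarked"].value_counts(dropna=False)
## S      644
## C      168
## Q       77
## NaN      2
## Name: Embarked, dtype: int64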
Although describe() gives a concise overview of each variable, it does not necessarily give us enough information to determine what each variable means.
Certain features like “Age” and “Fare” are easy to understand, while others like “SibSp” and “Parch” are not. The details of these are provided by Kaggle on the data download page:
# VARIABLE DESCRIPTIONS:
# survival Survival
# (0 = No; 1 = Yes)
# pclass Passenger Class
# (1 = 1st; 2 = 2nd; 3 = 3rd)
# name Name
# sex Sex
# age Age
# sibsp Number of Siblings/Spouses Aboard
# parch Number of Parents/Children Aboard
# ticket Ticket Number
# fare Passenger Fare
# cabin Cabin
# embarked Port of Embarkation
# (C = Cherbourg; Q = Queenstown; S = Southampton)
After looking at the data, we should ask ourselves a few questions:

- Do we require all of the variables?
- Should we transform any variables?
- Are there any null values, outliers or other garbage values?
- Should we create new variables?
Do we Require All of the Variables?
Removing unnecessary variables is a good first step when dealing with any dataset, since dropping variables reduces complexity and can make computation on the data faster.
Whether we should get rid of a variable depends on the size of the dataset and the goal of the analysis. With a dataset as small as the Titanic data, there is no need to remove variables from a computing perspective, but it can still be helpful to drop variables that will only distract from your goal.
Let’s go through each variable and consider whether we should keep it or not in the context of survival prediction.
“PassengerId” is just a number assigned to each passenger, so we can remove it:
del tit_train['PassengerId']
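As an aside, the same effect can be achieved without del by using DataFrame.drop; a minimal sketch (errors="ignore" makes this a no-op here, since the column was already deleted above):
tit_train = tit_train.drop(columns=["PassengerId"], errors="ignore")  # drop by name; skip silently if absent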
Variable “Survived” shows whether each passenger lived or died. Since survival prediction is our goal, we definitely need to keep it.
Features describing passengers numerically or grouping them into a few broad categories could be useful for survival prediction. Therefore variables Pclass, Sex, Age, SibSp, Parch, Fare and Embarked can be kept.
Furthermore, “Name” is a character string holding the name of each passenger. It will also help in identifying passengers, so we can keep it.
Next, let’s look at “Ticket”:
tit_train['Ticket'][0:10]
## 0 A/5 21171
## 1 PC 17599
## 2 STON/O2. 3101282
## 3 113803
## 4 373450
## 5 330877
## 6 17463
## 7 349909
## 8 347742
## 9 237736
## Name: Ticket, dtype: object
tit_train['Ticket'].describe()
## count 891
## unique 681
## top 1601
## freq 7
## Name: Ticket, dtype: object
Ticket has 681 unique values: almost as many as there are passengers. Categorical variables with this many levels are generally not very useful for prediction, so let’s remove it:
del tit_train['Ticket']
Lastly, let’s look at the “Cabin” variable:
tit_train['Cabin'][0:10]
## 0 NaN
## 1 C85
## 2 NaN
## 3 C123
## 4 NaN
## 5 NaN
## 6 E46
## 7 NaN
## 8 NaN
## 9 NaN
## Name: Cabin, dtype: object
tit_train['Cabin'].describe()
## count 204
## unique 147
## top C23 C25 C27
## freq 4
## Name: Cabin, dtype: object
Cabin also has a large number of unique values (147), which suggests it may not be useful for prediction as-is. On the other hand, the names of the different levels of the Cabin variable have some structure: each starts with a capital letter followed by a number. We can use that structure to reduce the number of levels, making categories large enough that they might be useful for prediction later on. So let’s keep Cabin for now.
Should we Transform Any Variables?
Pclass is an integer variable that indicates a passenger’s class, with 1 being first class, 2 being second class and 3 being third class. We can transform it into an ordered categorical variable:
pclass_new = pd.Categorical(tit_train['Pclass'], ordered=True)
pclass_new = pclass_new.rename_categories(["class1","class2","class3"])
pclass_new.describe()
## counts freqs
## categories
## class1 216 0.242424
## class2 184 0.206510
## class3 491 0.551066
tit_train['Pclass'] = pclass_new
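One benefit of an ordered categorical is that comparisons respect the category order. A small sketch: a boolean mask of the first- and second-class passengers, whose total matches the counts above:
upper = tit_train["Pclass"] < "class3"  # ordered comparison against a category level
upper.sum()                             # 216 first-class + 184 second-class passengers
## 400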
Now let’s look at the Cabin variable. It appears that each cabin is in a general section of the ship, indicated by the capital letter at the start of each factor level:
tit_train['Cabin'].unique()
## array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
## 'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
## 'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
## 'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
## 'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
## 'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
## 'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
## 'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
## 'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
## 'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
## 'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
## 'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
## 'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
## 'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
## 'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
## 'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
## 'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
## 'C148'], dtype=object)
If we grouped the cabins just by this letter, we could reduce the number of levels while keeping some useful information.
chr_cabin = tit_train["Cabin"].astype(str)             # cast to str so NaN becomes the string "nan"
n_Cabin = np.array([cabin[0] for cabin in chr_cabin])  # keep only the first letter of each cabin
n_Cabin = pd.Categorical(n_Cabin)                      # encode the letters as a categorical
n_Cabin.describe()
## counts freqs
## categories
## A 15 0.016835
## B 47 0.052750
## C 59 0.066218
## D 33 0.037037
## E 32 0.035915
## F 13 0.014590
## G 4 0.004489
## T 1 0.001122
## n 687 0.771044
The output of describe() shows that we can group Cabin into broader categories, but we also discovered something interesting: 687 of the records have a Cabin value of “n”, the first letter of “nan”. In other words, more than 2/3 of the passengers do not have a cabin recorded.
A missing cabin variable could be an indication that a passenger died.
Let’s keep the new Cabin variable:
tit_train["Cabin"] = n_Cabin
Are there any Null Values, Outliers or Other Garbage Values?
To check for missing values, we use the pd.isnull() function. For example:
mock_vector = pd.Series([1,None,3,None,7,8])
mock_vector.isnull()
## 0 False
## 1 True
## 2 False
## 3 True
## 4 False
## 5 False
## dtype: bool
If the missing values are numeric, the affected records can simply be deleted. If the missing values are categorical, they can be treated as an additional category with the value NA.
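As minimal sketches of those two strategies, applied to copies so the working dataframe stays untouched (imputing a central value, shown here with the median, is a common alternative to deleting numeric records):
age_complete = tit_train.dropna(subset=["Age"])                   # drop records with missing Age
age_imputed = tit_train["Age"].fillna(tit_train["Age"].median())  # or impute the median instead
embarked_na = tit_train["Embarked"].fillna("NA")                  # treat missing category as its own level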
To check whether there are missing age values in the Titanic dataset:
tit_train["Age"].describe()
## count 714.000000
## mean 29.699118
## std 14.526497
## min 0.420000
## 25% 20.125000
## 50% 28.000000
## 75% 38.000000
## max 80.000000
## Name: Age, dtype: float64
We see that the count of Age (714) is less than the total row count of the dataset (891).
To find the indexes of the missing ages, we use np.where():
missvalues = np.where(tit_train["Age"].isnull() == True)
missvalues
## (array([ 5, 17, 19, 26, 28, 29, 31, 32, 36, 42, 45, 46, 47,
## 48, 55, 64, 65, 76, 77, 82, 87, 95, 101, 107, 109, 121,
## 126, 128, 140, 154, 158, 159, 166, 168, 176, 180, 181, 185, 186,
## 196, 198, 201, 214, 223, 229, 235, 240, 241, 250, 256, 260, 264,
## 270, 274, 277, 284, 295, 298, 300, 301, 303, 304, 306, 324, 330,
## 334, 335, 347, 351, 354, 358, 359, 364, 367, 368, 375, 384, 388,
## 409, 410, 411, 413, 415, 420, 425, 428, 431, 444, 451, 454, 457,
## 459, 464, 466, 468, 470, 475, 481, 485, 490, 495, 497, 502, 507,
## 511, 517, 522, 524, 527, 531, 533, 538, 547, 552, 557, 560, 563,
## 564, 568, 573, 578, 584, 589, 593, 596, 598, 601, 602, 611, 612,
## 613, 629, 633, 639, 643, 648, 650, 653, 656, 667, 669, 674, 680,
## 692, 697, 709, 711, 718, 727, 732, 738, 739, 740, 760, 766, 768,
## 773, 776, 778, 783, 790, 792, 793, 815, 825, 826, 828, 832, 837,
## 839, 846, 849, 859, 863, 868, 878, 888], dtype=int64),)
len(missvalues)
## 1
Note that np.where() returns a tuple of arrays, so len(missvalues) is 1 here; the actual number of missing ages is len(missvalues[0]), which is 177.
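If we only need the count of missing values rather than their positions, there is a shorter route:
tit_train["Age"].isnull().sum()
## 177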
Before we do anything with the missing values, it is good to check the distribution of Age to understand its central tendency:
tit_train.hist(column='Age',
figsize=(9,6),
bins=20)
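One caveat: when this code is run as a plain script rather than in a Jupyter notebook, matplotlib may need an explicit call to render the figure:
import matplotlib.pyplot as plt  # pandas plotting is built on matplotlib
plt.show()                       # render the pending figure in non-interactive sessions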
The histogram shows that a couple of passengers are near age 80.
To check the Fare variable, we create a box plot:
tit_train["Fare"].plot(kind="box",
figsize=(9,9))
In a box plot, the box spans the middle 50% of the data and the line inside it marks the median. There are clear outliers: a few passengers paid roughly double the fare of any other passenger. We can find them using the np.where() function:
ind = np.where(tit_train["Fare"] == max(tit_train["Fare"]) )
tit_train.loc[ind]
## Survived Pclass ... Cabin Embarked
## 258 1 class1 ... n C
## 679 1 class1 ... B C
## 737 1 class1 ... B C
##
## [3 rows x 10 columns]
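The box plot flags outliers using the usual 1.5 × IQR rule; we can reproduce that rule by hand to count how many fares fall above the upper fence. A minimal sketch:
q1, q3 = tit_train["Fare"].quantile([0.25, 0.75])  # first and third quartiles of Fare
upper_fence = q3 + 1.5 * (q3 - q1)                 # standard 1.5*IQR upper fence
fare_outliers = tit_train[tit_train["Fare"] > upper_fence]
len(fare_outliers)                                 # how many fares the rule flags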
Before modeling a dataset with machine learning models, it is best to address missing values, outliers, mislabeled data and other bad data, because they can corrupt the analysis and lead to wrong results.
Should we Create New Variables?
The decision to create new variables should be made while preparing the data. A new variable can represent an aggregate of existing variables; for example, in the Titanic dataset we can create a new variable called Family that stores the number of family members each passenger had aboard.
tit_train["Family"] = tit_train["SibSp"] + tit_train["Parch"]
We can check who had the most family members aboard:
mostfamily = np.where(tit_train["Family"] == max(tit_train["Family"]))
tit_train.loc[mostfamily]
## Survived Pclass Name ... Cabin Embarked Family
## 159 0 class3 Sage, Master. Thomas Henry ... n S 10
## 180 0 class3 Sage, Miss. Constance Gladys ... n S 10
## 201 0 class3 Sage, Mr. Frederick ... n S 10
## 324 0 class3 Sage, Mr. George John Jr ... n S 10
## 792 0 class3 Sage, Miss. Stella Anna ... n S 10
## 846 0 class3 Sage, Mr. Douglas Bullen ... n S 10
## 863 0 class3 Sage, Miss. Dorothy Edith "Dolly" ... n S 10
##
## [7 rows x 11 columns]
Summary
There are questions that should be answered while investigating any dataset. Once these basic questions are answered, one can move on to finding relationships between the variables/features and building machine learning models.