Dealing with missing values in datasets

If you've worked with data regularly, you know that almost every dataset has missing values. In this article, I want to shed some light on some common practices for dealing with them.

Missing values most commonly come in the form of None, NaN, NULL, or a cell simply left blank. However, sometimes you might also find missing values disguised in other forms, like 0. So it's important to keep in mind that not every dataset's missing values are represented alike, and to approach the problem objectively, understanding that each dataset comes with its own complications and roadblocks.


Domain Knowledge 

I remember once working with a dataset that contained information about patients diagnosed with tuberculosis. One of the features was the patient's age. For some patients, the age was 0 or even -1. At first sight, I was deceived into thinking these were missing values or simply incorrect input.

After putting all data-related work aside and just reading about tuberculosis for the better part of a day, I learned that some children are born with tuberculosis. Furthermore, some are diagnosed with it before birth, hence the 0 and negative values in the dataset.

It is extremely important to gain as much domain knowledge as possible about the dataset's topic before applying any data science/ML practice. You might find this preliminary step slow, especially if it's your first time learning about the field, so patience is key; data science isn't an exact science and relies heavily on intuition. A value missing because it doesn't exist isn't the same as a value missing because it wasn't recorded. The latter invites predicting an engineered value, while for the former it doesn't make sense to guess.


Let's start working on a real-life dataset; examples always demonstrate practice much better. I've chosen a Kaggle dataset, the core_dataset included in the Human Resources Data Set found here.

I'll begin by importing the necessary libraries and loading the dataset.
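A sketch of that setup (the `core_dataset.csv` filename is an assumption about where you saved the Kaggle download; the inline CSV is an invented stand-in so the snippet runs without the file):

```python
import io
import pandas as pd

# With the Kaggle download in your working directory, this would be enough
# (the filename is an assumption):
# df = pd.read_csv("core_dataset.csv")

# A tiny invented stand-in so the snippet runs without the file:
csv_data = """Employee Name,Date of Termination,Reason For Term
Ana,,NA-still employed
Ben,5/1/2016,career change
Cleo,,NA-still employed
Dev,,NA-Has not started yet
Eli,9/12/2015,hours
"""
df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)  # (5, 3)
```

Note that `read_csv` turns the empty cells into NaN automatically.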

Kaggle provides information about the dataset, and you should always read all of the information provided. In industry practice, however, data is most of the time thrown at you without any attached information. The next step would be to look at a sample of the data, as well as the features of the dataset.
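In pandas, `head()` shows a sample and `info()` lists the features along with their dtypes and non-null counts; a sketch on an invented stand-in frame (column names taken from the article):

```python
import pandas as pd

# Invented stand-in for the HR dataset (column names from the article)
df = pd.DataFrame({
    "Employee Name": ["Ana", "Ben", "Cleo"],
    "Date of Termination": [None, "5/1/2016", None],
    "Reason For Term": ["NA-still employed", "career change", "NA-still employed"],
})

print(df.head())  # first few rows -- a quick sample of the data
df.info()         # column names, dtypes, and non-null counts
```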

Both of these give you a rough picture of the dataset's structure. Go through individual samples and read each feature; start comparing and understanding.

Proceed with counting the missing values. 
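The usual one-liner for this is `isnull().sum()`, sketched here on an invented stand-in frame (on the real dataset, 'Date of Termination' shows 199 missing values):

```python
import pandas as pd

# Invented stand-in rows; column names come from the article
df = pd.DataFrame({
    "Employee Name": ["Ana", "Ben", "Cleo", "Dev", "Eli"],
    "Date of Termination": [None, "5/1/2016", None, None, "9/12/2015"],
    "Reason For Term": ["NA-still employed", "career change",
                        "NA-still employed", "NA-Has not started yet", "hours"],
})

missing_counts = df.isnull().sum()  # number of NaN cells per column
print(missing_counts)
```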

Date of Termination has noticeably more missing values than the rest of the features. The number 199 might be high for a dataset with a small sample size and not so much for one with a larger sample size. To comprehend the magnitude of the situation, let's calculate the percentage.
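`isnull().mean()` gives the missing fraction per column directly, and multiplying by 100 turns it into a percentage; a sketch on an invented single-column frame:

```python
import pandas as pd

# Invented stand-in: 3 missing out of 5 values
df = pd.DataFrame({
    "Date of Termination": [None, "5/1/2016", None, None, "9/12/2015"],
})

missing_pct = df.isnull().mean() * 100  # share of NaN per column, in percent
print(missing_pct["Date of Termination"])  # 60.0 on this toy frame
```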

Almost 66 percent of the feature's values are NaN; that's two-thirds!

At this point, I would like to reiterate the idea of domain knowledge, or data intuition. Stop looking at numbers and try to understand why the data is missing. Construct your own intuition and apply it to your work.

As I was reading through this dataset, the feature 'Reason For Term' stood out, so I took a deeper look.
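`value_counts()` is the standard way to inspect a categorical feature's unique values and their counts; a sketch on an invented frame (on the real dataset, the two 'NA-' values counted 188 and 11):

```python
import pandas as pd

# Invented stand-in rows for the 'Reason For Term' feature
df = pd.DataFrame({
    "Reason For Term": ["NA-still employed", "career change",
                        "NA-still employed", "NA-Has not started yet", "hours"],
})

reason_counts = df["Reason For Term"].value_counts()  # each unique value with its count
print(reason_counts)
```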

After looking at the feature's unique values and their counts, two values stood out the most: 'NA-still employed' and 'NA-Has not started yet'.

And while the sum of their counts (188 + 11 = 199) exactly matched the count of missing values in the 'Date of Termination' feature, I still had to make sure they were in the same rows.

I began by constructing a new DataFrame that isolated the samples where 'Date of Termination' was null. After that, all I had to do was look at the unique values of 'Reason For Term'.
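That two-step check might look like this sketch (invented rows, column names from the article):

```python
import pandas as pd

df = pd.DataFrame({
    "Date of Termination": [None, "5/1/2016", None, None, "9/12/2015"],
    "Reason For Term": ["NA-still employed", "career change",
                        "NA-still employed", "NA-Has not started yet", "hours"],
})

# Isolate the rows where 'Date of Termination' is null...
terminated_null = df[df["Date of Termination"].isnull()]
# ...then check which termination reasons appear in exactly those rows
print(terminated_null["Reason For Term"].unique())
```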

Now I was confident about the reason behind the missing values: they are missing because they simply don't exist. The employee either was still employed or had not started yet.

Deciding what to do with missing values is subject to several other factors and is largely a matter of opinion, so I'll leave that up to you. Nevertheless, here are some useful shortcuts to help you achieve whatever you decide.

Filling the missing values 

Pandas has a method fillna() (read the doc here), which can automatically fill all the missing values in a variety of ways. One way is to pass a value as the first parameter; doing so fills every missing value with that specified value. Another approach is to select a filling method (read the doc) to do the filling for you.
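A sketch of the variants on a small Series (note that newer pandas versions prefer the `ffill()`/`bfill()` methods over the deprecated `fillna(method=...)` form):

```python
import pandas as pd

s = pd.Series([None, 1.0, None, 4.0, None])

print(s.fillna(0).tolist())          # [0.0, 1.0, 0.0, 4.0, 0.0]
print(s.ffill().tolist())            # forward fill; nothing precedes the first value, so it stays NaN
print(s.bfill().tolist())            # back fill; nothing follows the last value, so it stays NaN
print(s.ffill().fillna(0).tolist())  # [0.0, 1.0, 1.0, 4.0, 4.0]
```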

The first command fills all missing values with zero. The second fills missing values using the ffill method, which propagates the last valid value in the column forward; bfill instead uses the next valid value.

The last command applies ffill and then fills any remaining missing values with 0.

Dropping the missing values

Deciding whether to drop missing values depends on factors such as the effect on the model's score, among other attributes. I've worked on projects where the population size was considerably large compared to the subset of samples with missing values, and excluding that subset wouldn't harm the outcome of the model. However, there are many cases in which the population size is small, and discarding the subset with missing values will hurt how well the dataset generalizes, and potentially the model's performance.

Pandas provides the dropna() method as a shortcut for quickly dropping missing values along either axis.
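A sketch of both axes on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [4.0, 5.0, None],
    "c": [7.0, 8.0, 9.0],
})

print(df.dropna())        # default axis=0
print(df.dropna(axis=0))  # explicit: drop every row containing a NaN
print(df.dropna(axis=1))  # drop every column containing a NaN; only 'c' survives here
```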

The axis parameter defaults to 0, which refers to rows, so the second command explicitly asks for all rows with missing values to be dropped. However, if we change the axis to 1, all columns with a missing value will be dropped. In the case of this dataset, every column had at least one missing value, so all of them would be dropped.

Conclusion

How to deal with missing values is entirely specific to the situation at hand; experience with other datasets should inform your decision, not dictate it. Furthermore, domain knowledge is an extremely relevant factor when weighing your options. In this article, I showed some common ways of dealing with missing values. However, there are more diligent, industry-grade practices, such as imputation, and I will soon be writing a separate article on those techniques.