As a data analyst or data professional, how do you handle missing data? In data-centric fields such as statistical analysis and machine learning, missing data is a common problem. Left unaddressed, it can degrade both your data quality and the value of your analysis.

Missing data can come from many sources. These include incomplete fields in a customer survey form, missing records in a database system, and data-entry errors. In practice, many such datasets have very few missing values (less than 5%), which the right machine learning algorithms and other tools can manage automatically.


However, rather than treating missing data (or values) as a limitation, you can treat it as an opportunity: with the right modeling tool, it can be interpreted to deliver business value. One effective method is the automatic imputation of missing data, which can be performed on datasets with a large share of missing values (more than 5%).

In the next few sections, we shall introduce how automatic imputation of missing data works, its various techniques, and its importance for better data analysis. 

What is Missing Data Imputation?

In essence, missing data imputation comprises various techniques, all aimed at generating replacement values for missing data. For example, in automatic imputation of missing data, analytics tools create multiple “complete” versions of each dataset.

Why is this beneficial? During statistical analysis, the results from each of the imputed datasets (with their imputed values) can be pooled together. This gives a more accurate estimate than one derived from a single imputed dataset.
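The article does not say how the pooling is done. One common choice is Rubin's rules, sketched below under that assumption. The function name and the numbers are purely illustrative.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool per-dataset results with Rubin's rules (assumed pooling method)."""
    m = len(estimates)
    pooled = mean(estimates)           # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates and variances from m = 3 imputed datasets
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.0, 1.0]
pooled, total_var = pool_estimates(estimates, variances)
print(pooled, total_var)
```

The total variance is larger than the average within-dataset variance because it also reflects how much the estimates disagree across imputed datasets.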


What are some of the common techniques used for missing data imputations?

Here are a few:

Mean, Median, and Mode technique

In this technique, the mean, median, or mode of a variable is used as the imputed value for its missing data. For example, in the mean method, the mean of a numeric variable replaces each missing value of that variable.

Similarly, in the median technique, the median of a variable (typically one with a skewed distribution) replaces the missing value. In the mode technique, the missing value is replaced with the most frequently occurring value of the variable (typically a non-numerical one).
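A minimal sketch of the three variants using pandas, assuming a small made-up survey table. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical survey data with one gap per column
df = pd.DataFrame({
    "age": [25.0, None, 35.0, 30.0],
    "income": [30000.0, 32000.0, None, 250000.0],
    "city": ["Pune", None, "Pune", "Delhi"],
})

imputed = df.copy()
# Mean for a roughly symmetric numeric variable
imputed["age"] = df["age"].fillna(df["age"].mean())
# Median for a skewed numeric variable
imputed["income"] = df["income"].fillna(df["income"].median())
# Mode (most frequent value) for a non-numeric variable
imputed["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(imputed)
```

The median is used for income here because the single very large value would pull the mean far from a typical response.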

Last Observation Carried Forward (LOCF)

This is a common imputation technique for time-based datasets. A missing value is replaced with the most recent value observed before it.

Next Observation Carried Backward (NOCB)

This imputation technique works in the reverse direction to the LOCF method: the missing value is replaced with the next value observed after it.
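The difference between carrying forward and carrying backward is easiest to see side by side. Below is a small sketch using pandas on a made-up daily series; the dates and readings are hypothetical.

```python
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [20.0, None, None, 23.0, None],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

locf = readings.ffill()  # carry the last observed value forward
nocb = readings.bfill()  # carry the next observed value backward

print(pd.DataFrame({"raw": readings, "locf": locf, "nocb": nocb}))
```

Note that the final gap stays empty under the backward fill because no later observation exists, just as a leading gap would stay empty under the forward fill.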

Multiple imputations

In this imputation technique, the distribution of the observed data is used to estimate several plausible values for each missing value. Multiple completed datasets are created, each is analyzed individually, and the results are then combined.

An example of multiple imputation is the Multiple Imputation by Chained Equations (MICE) method. It runs a series of regression models in which each variable with missing values is modeled conditionally on the other variables in the dataset.
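As a rough illustration, scikit-learn's experimental IterativeImputer follows a chained-equations approach inspired by MICE. The sketch below assumes scikit-learn is installed and uses synthetic data; it is not the tooling named in this article's case study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the third column depends on the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out about 10% of the entries at random
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Draw several completed datasets by sampling from the posterior
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputed_sets.append(imputer.fit_transform(X_missing))

print(len(imputed_sets), imputed_sets[0].shape)
```

Each completed dataset would then be analyzed separately and the results pooled.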

Implementing Missing Data Imputation Using Python – A Case Study

Here is a real-world case study of implementing missing data imputation using Python.

The client stores online survey response data in a Google BigQuery table. Because some survey questions were skipped, many survey responses have missing data. The client’s goal is to retrieve selected variables from BigQuery, perform automatic imputation on the variables with missing values, and then save the output to a BigQuery table.

Why the need for automatic imputation? So that the imputation service can be scheduled to run automatically on a daily or weekly basis. Besides using Python, the imputation process needs to run on Google Cloud services and Google App Engine.

Why use the Python imputation method?

  • The “Autoimpute” package in Python enables imputation execution and analysis.
  • The “missingpy” library in Python supports missing data imputation.

The solution: Using Python, an advanced imputation algorithm was implemented that, rather than relying only on the “mean” values in the dataset, draws on all the available data variables. At runtime, multiple instances of this imputation service were created after specifying the following:

  • The source database table
  • The variables to be fetched
  • The variables with missing data that required the imputation
  • The destination database table that would store all the selected variables including those with imputation values.
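The four runtime parameters above could be captured in a small configuration object. This is only a sketch: the class, field, and table names are hypothetical, and the BigQuery read and write steps are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImputationJob:
    """Hypothetical settings for one instance of the imputation service."""
    source_table: str
    fetch_variables: tuple
    impute_variables: tuple
    destination_table: str

    def __post_init__(self):
        # Every variable to impute must also be among those fetched
        unknown = set(self.impute_variables) - set(self.fetch_variables)
        if unknown:
            raise ValueError(f"Not in fetch_variables: {sorted(unknown)}")


# Illustrative values only
job = ImputationJob(
    source_table="survey.responses",
    fetch_variables=("age", "income", "satisfaction"),
    impute_variables=("income", "satisfaction"),
    destination_table="survey.responses_imputed",
)
print(job)
```

Validating the settings up front means a scheduled run fails early, before any data is read or written.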

In the following section, we shall look at how to impute the missing values in a dataset.

How to Impute Missing Values

How missing values should be imputed depends on the mechanism behind the missingness, which is generally categorized into the following three main classes:

Missing Completely At Random (or MCAR)


This mechanism assumes that the missing data (or missingness) is not related to any of the variables, whether observed (X) or missing (Y). For example, a study of the causes of obesity in primary school children may have missing data on some children who could not attend the clinical study in the first place.

Missing At Random (or MAR)

This mechanism is based on the premise that the missing data (or missingness) is partially related to the observed variables (X) but not to the missing values (Y) themselves. An example is a study that monitors students leaving a school and has missing data because the parents moved to a different city.

Missing Not At Random (or MNAR)

This mechanism applies when the dataset does not meet the criteria of the previous two. Under MNAR, the missingness is related to the missing values (Y) themselves, possibly in addition to the observed variables (X).

For example, in the school obesity study, data goes missing when parents voluntarily withdraw their children from the study because they find it offensive.
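To make the three classes concrete, the sketch below simulates each one with NumPy on synthetic data. The variables x and y, the thresholds, and the probabilities are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)        # fully observed variable (X)
y = x + rng.normal(size=n)    # variable that will receive gaps (Y)

# MCAR: every value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: only rows with above-median X can be missing, regardless of Y
mar_mask = (x > np.median(x)) & (rng.random(n) < 0.4)

# MNAR: only rows with above-median Y can be missing, so the
# missingness depends on the very values that go unobserved
mnar_mask = (y > np.median(y)) & (rng.random(n) < 0.4)

y_mcar = np.where(mcar_mask, np.nan, y)
y_mar = np.where(mar_mask, np.nan, y)
y_mnar = np.where(mnar_mask, np.nan, y)

print(np.nanmean(y_mcar), np.nanmean(y_mar), np.nanmean(y_mnar))
```

Comparing the mean of the remaining values in each case against the full-data mean of y shows how the non-random mechanisms distort simple summaries.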

How does automatic imputation of missing data contribute to better data analysis? Let’s understand that in the next section.

Key Benefits of Missing Data Imputation

Imputation of missing data has key use cases and benefits when it comes to effective data analysis. As a data analyst, you can design statistical or predictive data models that work with missing data imputation.

Here are a few use cases for missing data imputation:

Linear Regression

Based on linear regression, this imputation method predicts the missing value from the other variables in the dataset and then substitutes the predicted value for the missing one. The method is valuable in data analytics because it can draw on a large volume of data to make accurate predictions. Additionally, it avoids a significant change in the mean deviation value.
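A minimal sketch of regression imputation with NumPy, assuming a single predictor and made-up data. The variable names and values are hypothetical.

```python
import numpy as np

# Hypothetical data: exam score depends on hours studied, one score is missing
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 54.0, np.nan, 58.0, 60.0])

# Fit a straight line on the rows where the score is observed
observed = ~np.isnan(score)
slope, intercept = np.polyfit(hours[observed], score[observed], deg=1)

# Substitute the prediction for the missing score
score_imputed = score.copy()
score_imputed[~observed] = slope * hours[~observed] + intercept

print(score_imputed)
```

With several predictors, the same idea applies using a multiple regression fitted on the complete rows.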

Random Forest

This is another imputation method that works well with either MAR or MNAR data. Random forest models use multiple decision trees to estimate missing values. Along with reducing estimation error, random forest works effectively on large datasets.
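One simple way to apply this idea is to train a random forest on the complete rows and predict the missing entries. The sketch below assumes scikit-learn and synthetic data; it is not the missingpy implementation mentioned earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends non-linearly on two fully observed predictors
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

# Remove about 15% of the y values
missing = rng.random(300) < 0.15
y_missing = np.where(missing, np.nan, y)

# Train on rows where y is observed, then predict the gaps
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[~missing], y_missing[~missing])

y_imputed = y_missing.copy()
y_imputed[missing] = model.predict(X[missing])

print(np.abs(y_imputed[missing] - y[missing]).mean())
```

The printed value is the mean absolute error on the removed entries, which can only be computed here because the true values are known in a simulation.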

Machine learning

Irrespective of the imputation technique used, some level of bias is bound to remain. While multiple imputation (using several datasets) is a safe bet, machine learning models are well equipped to reduce potential bias in missing data imputation.

A combination of machine learning and multiple imputation is the best approach to handling missing data and enabling better analysis of the output data.

Conclusion

In any dataset, some amount of missing data is considered normal and can be automatically handled by the analytical engine. The emergence of several techniques that enable automatic imputation of missing data is transforming how data analysts and scientists are approaching this challenge.

With its extensive expertise in artificial intelligence, machine learning and BigQuery automation, Countants is poised to provide the right technological support for client projects in data analytics. Looking for the right approach for missing data imputation in your business? Contact us now at our website and we will get back to you.