Data Wrangling Best Practices in R - Part 2

Advanced File Transformation & Reshaping – Part 2

Part 2 of the R data-wrangling series, covering advanced techniques such as pivoting, joining tables, string manipulation, and functional programming with purrr.
R
Data Wrangling
Author

Bilal Mustafa

Published

September 21, 2023

Introduction:

Data wrangling is a crucial step in the data analysis process, and R provides a powerful set of tools and packages to help you clean, transform, and prepare your data for analysis. In this blog post, we will explore some best practices for effective data wrangling in R. Whether you are a beginner or an experienced data analyst, these tips will help you streamline your data preparation workflow and ensure the reliability of your analysis.

Now continuing from the previous post (Best Practices for Data Wrangling in R - Part 1), we will use the airquality data from datasets library to further understand how data wrangling helps us to get the deeper and significant insights of our data.


Reading Data

Raw Input Data

Firstly read your data into R. For this exercise I will be using a simulated dataset.

```{r}
library(datasets)

data(airquality)
```

Understand Your Data

It’s imperative to have a thorough understanding of your dataset before getting started with data wrangling. Knowing your data’s structure, the significance of each variable, and any potential problems or abnormalities is part of this. To obtain an understanding of the data you’re working with, start by analyzing it using functions like head(), summary(), and str().

```{r}
# Example: Inspect the first few rows of a dataset
head(airquality)
```
ABCDEFGHIJ0123456789
 
 
Ozone
<int>
Solar.R
<int>
Wind
<dbl>
Temp
<int>
Month
<int>
Day
<int>
1411907.46751
2361188.07252
31214912.67453
41831311.56254
5NANA14.35655
628NA14.96656
```{r}
# Example: Get a summary of the dataset
summary(airquality)
```
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               
```{r}
# Example: Display the structure of the dataset
str(airquality)
```
'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

This will help you make informed decisions during the wrangling process.


Data Cleaning

Data cleaning involves checking the headers, handling missing values, outliers, and errors in your dataset. Here are some best practices for data cleaning in R:

Handle Missing Values:

  • Identify missing values using functions like is.na() or complete.cases().

  • Decide whether to impute missing values, remove rows with missing data, or keep them, depending on the context.

  • Use packages like dplyr or tidyr to perform missing data operations.

```{r}
library(dplyr)
library(tidyr)

# Example: Remove rows with missing values
airquality_clean <- na.omit(airquality)

head(airquality_clean)
```
ABCDEFGHIJ0123456789
 
 
Ozone
<int>
Solar.R
<int>
Wind
<dbl>
Temp
<int>
Month
<int>
Day
<int>
1411907.46751
2361188.07252
31214912.67453
41831311.56254
7232998.66557
8199913.85958

Other methods to remove NA’s include na.omit() , complete.cases(), rowSums(), drop_na(), and filter().

```{r}
# #Remove rows with NA's using na.omit()
# airquality_clean <- na.omit(airquality)
# 
# #Remove rows with NA's using complete.cases
# airquality_clean <- airquality[complete.cases(airquality),]
# 
# #Remove rows with NA's using rowSums()
# airquality_clean <- airquality[rowSums(is.na(airquality)) == 0,]
# 
# #Import the tidyr package
# library("tidyr")
# 
# #Remove rows with NA's using drop_na()
# airquality_clean <- airquality %>% drop_na()
# 
# #Remove rows that contains all NA's
# airquality_clean <-
#   airquality[rowSums(is.na(airquality)) != ncol(airquality),]
# 
# #Load the dplyr package
# library("dplyr")
# 
# #Remove rows that contains all NA's
# airquality_clean <-
#   filter(airquality, rowSums(is.na(airquality)) != ncol(airquality))
# 
# airquality_clean <- airquality %>% filter(!is.na(Ozone))
```

Manage Outliers

  • Visualize data using boxplots, histograms, or scatter plots to detect outliers.

  • Consider using statistical methods or domain knowledge to handle outliers, such as winsorization or transformation.

```{r}
#| label: Boxplot of air wauality data set
#| fig-alt: "Boxplot of air wauality data set"

# Example: Visualize outliers using a boxplot
boxplot(airquality_clean)


# # Remove outliers from the 'income' variable
# airquality_clean <- airquality_clean %>%
#   filter(Ozone >= 0)
```

Boxplot of air wauality data set

Correct Errors

  • Check for data entry errors and inconsistencies.

  • Use data validation rules or regular expressions to identify and correct errors.


Data Transformation

To make your data acceptable for analysis, you must shape and reformat it through data transformation. The following are a few excellent practices for R’s data transformation:

Use Tidy Data Principles

  • Follow the principles of tidy data, where each variable is a column, each observation is a row, and each type of observational unit is a table.

  • The tidyr package provides functions like gather() and spread() for reshaping data.

```{r}
# # Example: Convert data from wide to long format
airquality_clean_long <- airquality_clean %>%
  gather(key = "Column_name", value = "value")   

head(airquality_clean_long)
```
ABCDEFGHIJ0123456789
 
 
Column_name
<chr>
value
<dbl>
1Ozone41
2Ozone36
3Ozone12
4Ozone18
5Ozone23
6Ozone19

Apply Data Type Conversions

  • Ensure that variables have the correct data types (e.g., numeric, character, factor) for analysis.

  • Use functions like as.numeric(), as.character(), or as.factor() to convert data types.

```{r}
# Example: Convert a variable to numeric
airquality_clean$Temp_numeric <- as.numeric(airquality_clean$Temp) 

head(airquality_clean)
```
ABCDEFGHIJ0123456789
 
 
Ozone
<int>
Solar.R
<int>
Wind
<dbl>
Temp
<int>
Month
<int>
Day
<int>
Temp_numeric
<dbl>
1411907.4675167
2361188.0725272
31214912.6745374
41831311.5625462
7232998.6655765
8199913.8595859

Data Validation

To ensure that your processed data satisfies the criteria of your study, validation is a crucial stage in the data wrangling process. Here are a few guidelines for using R’s data validation features.

Perform Sanity Checks

  • Check summary statistics, distributions, and relationships between variables to ensure they align with your expectations.
```{r}
# Example: Check summary statistics
summary(airquality_clean)
```
     Ozone          Solar.R           Wind            Temp      
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
     Month            Day         Temp_numeric  
 Min.   :5.000   Min.   : 1.00   Min.   :57.00  
 1st Qu.:6.000   1st Qu.: 9.00   1st Qu.:71.00  
 Median :7.000   Median :16.00   Median :79.00  
 Mean   :7.216   Mean   :15.95   Mean   :77.79  
 3rd Qu.:9.000   3rd Qu.:22.50   3rd Qu.:84.50  
 Max.   :9.000   Max.   :31.00   Max.   :97.00  

Validate Data Integrity

  • Verify that data transformations have not introduced errors.

  • Compare original and transformed data to identify discrepancies.

Document Your Steps

For reproducibility and collaboration, it’s essential to record your data manipulation procedures. To write a narrative that details the choices you made while handling the data, think about utilizing R Markdown or Jupyter Notebooks. Make your work accessible and understandable to others by using code comments, explanations, and visualizations.


Conclusion

A crucial first step in data analysis is data wrangling; by using R’s best practices, you can speed up the process and guarantee the accuracy of your findings. You may improve the efficiency of your data wrangling workflow and provide more reliable analyses by comprehending your data, putting effective cleaning and transformation strategies into practice, testing your findings, and documenting your approach.

Always tailor these best practices to your unique needs, keeping in mind that the individual methods and packages you use may vary depending on your dataset and research goals.


Back to top
1,228 Total Pageviews