Data Wrangling Best Practices in R

Introduction:

Data wrangling is a crucial step in the data analysis process, and R provides a powerful set of tools and packages to help you clean, transform, and prepare your data for analysis. In this blog post, we will explore some best practices for effective data wrangling in R. Whether you are a beginner or an experienced data analyst, these tips will help you streamline your data preparation workflow and ensure the reliability of your analysis.

Now continuing from the previous post (Best Practices for Data Wrangling in R - Part 1), we will use the airquality data from datasets library to further understand how data wrangling helps us to get the deeper and significant insights of our data.

Reading Data

Raw Input Data

Firstly read your data into R. For this exercise I will be using a simulated dataset.

```{r}
library(datasets)

data(airquality)
```

Understand Your Data

It’s imperative to have a thorough understanding of your dataset before getting started with data wrangling. Knowing your data’s structure, the significance of each variable, and any potential problems or abnormalities is part of this. To obtain an understanding of the data you’re working with, start by analyzing it using functions like head(), summary(), and str().

```{r}
# Example: Inspect the first few rows of a dataset
head(airquality)
```

ABCDEFGHIJ0123456789

	Ozone <int>	Solar.R <int>	Wind <dbl>	Temp <int>	Month <int>	Day <int>
1	41	190	7.4	67	5	1
2	36	118	8.0	72	5	2
3	12	149	12.6	74	5	3
4	18	313	11.5	62	5	4
5	NA	NA	14.3	56	5	5
6	28	NA	14.9	66	5	6

```{r}
# Example: Get a summary of the dataset
summary(airquality)
```

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0

```{r}
# Example: Display the structure of the dataset
str(airquality)
```

'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

This will help you make informed decisions during the wrangling process.

Data Cleaning

Data cleaning involves checking the headers, handling missing values, outliers, and errors in your dataset. Here are some best practices for data cleaning in R:

Handle Missing Values:

Identify missing values using functions like is.na() or complete.cases().
Decide whether to impute missing values, remove rows with missing data, or keep them, depending on the context.
Use packages like dplyr or tidyr to perform missing data operations.

```{r}
library(dplyr)
library(tidyr)

# Example: Remove rows with missing values
airquality_clean <- na.omit(airquality)

head(airquality_clean)
```

ABCDEFGHIJ0123456789

	Ozone <int>	Solar.R <int>	Wind <dbl>	Temp <int>	Month <int>	Day <int>
1	41	190	7.4	67	5	1
2	36	118	8.0	72	5	2
3	12	149	12.6	74	5	3
4	18	313	11.5	62	5	4
7	23	299	8.6	65	5	7
8	19	99	13.8	59	5	8

Other methods to remove NA’s include na.omit() , complete.cases(), rowSums(), drop_na(), and filter().

```{r}
# #Remove rows with NA's using na.omit()
# airquality_clean <- na.omit(airquality)
# 
# #Remove rows with NA's using complete.cases
# airquality_clean <- airquality[complete.cases(airquality),]
# 
# #Remove rows with NA's using rowSums()
# airquality_clean <- airquality[rowSums(is.na(airquality)) == 0,]
# 
# #Import the tidyr package
# library("tidyr")
# 
# #Remove rows with NA's using drop_na()
# airquality_clean <- airquality %>% drop_na()
# 
# #Remove rows that contains all NA's
# airquality_clean <-
#   airquality[rowSums(is.na(airquality)) != ncol(airquality),]
# 
# #Load the dplyr package
# library("dplyr")
# 
# #Remove rows that contains all NA's
# airquality_clean <-
#   filter(airquality, rowSums(is.na(airquality)) != ncol(airquality))
# 
# airquality_clean <- airquality %>% filter(!is.na(Ozone))
```

Manage Outliers

Visualize data using boxplots, histograms, or scatter plots to detect outliers.
Consider using statistical methods or domain knowledge to handle outliers, such as winsorization or transformation.

```{r}
#| label: Boxplot of air wauality data set
#| fig-alt: "Boxplot of air wauality data set"

# Example: Visualize outliers using a boxplot
boxplot(airquality_clean)


# # Remove outliers from the 'income' variable
# airquality_clean <- airquality_clean %>%
#   filter(Ozone >= 0)
```

Boxplot of air wauality data set

Correct Errors

Check for data entry errors and inconsistencies.
Use data validation rules or regular expressions to identify and correct errors.

Data Transformation

To make your data acceptable for analysis, you must shape and reformat it through data transformation. The following are a few excellent practices for R’s data transformation:

Use Tidy Data Principles

Follow the principles of tidy data, where each variable is a column, each observation is a row, and each type of observational unit is a table.
The tidyr package provides functions like gather() and spread() for reshaping data.

```{r}
# # Example: Convert data from wide to long format
airquality_clean_long <- airquality_clean %>%
  gather(key = "Column_name", value = "value")   

head(airquality_clean_long)
```

ABCDEFGHIJ0123456789

	Column_name <chr>	value <dbl>
1	Ozone	41
2	Ozone	36
3	Ozone	12
4	Ozone	18
5	Ozone	23
6	Ozone	19

Apply Data Type Conversions

Ensure that variables have the correct data types (e.g., numeric, character, factor) for analysis.
Use functions like as.numeric(), as.character(), or as.factor() to convert data types.

```{r}
# Example: Convert a variable to numeric
airquality_clean$Temp_numeric <- as.numeric(airquality_clean$Temp) 

head(airquality_clean)
```

ABCDEFGHIJ0123456789

	Ozone <int>	Solar.R <int>	Wind <dbl>	Temp <int>	Month <int>	Day <int>	Temp_numeric <dbl>
1	41	190	7.4	67	5	1	67
2	36	118	8.0	72	5	2	72
3	12	149	12.6	74	5	3	74
4	18	313	11.5	62	5	4	62
7	23	299	8.6	65	5	7	65
8	19	99	13.8	59	5	8	59

Data Validation

To ensure that your processed data satisfies the criteria of your study, validation is a crucial stage in the data wrangling process. Here are a few guidelines for using R’s data validation features.

Perform Sanity Checks

Check summary statistics, distributions, and relationships between variables to ensure they align with your expectations.

```{r}
# Example: Check summary statistics
summary(airquality_clean)
```

     Ozone          Solar.R           Wind            Temp      
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
     Month            Day         Temp_numeric  
 Min.   :5.000   Min.   : 1.00   Min.   :57.00  
 1st Qu.:6.000   1st Qu.: 9.00   1st Qu.:71.00  
 Median :7.000   Median :16.00   Median :79.00  
 Mean   :7.216   Mean   :15.95   Mean   :77.79  
 3rd Qu.:9.000   3rd Qu.:22.50   3rd Qu.:84.50  
 Max.   :9.000   Max.   :31.00   Max.   :97.00

Validate Data Integrity

Verify that data transformations have not introduced errors.
Compare original and transformed data to identify discrepancies.

Document Your Steps

For reproducibility and collaboration, it’s essential to record your data manipulation procedures. To write a narrative that details the choices you made while handling the data, think about utilizing R Markdown or Jupyter Notebooks. Make your work accessible and understandable to others by using code comments, explanations, and visualizations.

Conclusion

A crucial first step in data analysis is data wrangling; by using R’s best practices, you can speed up the process and guarantee the accuracy of your findings. You may improve the efficiency of your data wrangling workflow and provide more reliable analyses by comprehending your data, putting effective cleaning and transformation strategies into practice, testing your findings, and documenting your approach.

Always tailor these best practices to your unique needs, keeping in mind that the individual methods and packages you use may vary depending on your dataset and research goals.