Bilal Mustafa
August 12, 2024
In previous sections, we looked at measures of centrality and variability, which are important for summarizing data. However, when drawing conclusions about a population from sample data, it is critical to quantify the uncertainty of those estimates. A confidence interval gives a range of values within which the true population parameter is expected to fall, and so indicates how precise an estimate is. In this post, we'll look at how to use R to construct confidence intervals for several kinds of data: simple vectors, grouped data, nominal data, and multinomial data.
A confidence interval is a range of values calculated from sample data that is likely to contain the population parameter at a given confidence level (typically 95%). The width of the interval measures the precision of the estimate: narrower intervals indicate more precise estimates.
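To make the meaning of "95% confidence" concrete, here is a minimal simulation sketch. It assumes a hypothetical normal population with mean 50 and standard deviation 10 (values chosen only for illustration) and uses the common "mean plus or minus 1.96 standard errors" interval. It counts how often the interval captures the true mean, then shows how the interval width shrinks as the sample size grows.

```r
# Sketch: how often does a 95% interval capture a known population mean?
set.seed(42)                     # arbitrary seed, for reproducibility
true_mean <- 50                  # hypothetical population mean
n_sims <- 2000                   # number of simulated samples

covers <- replicate(n_sims, {
  x <- rnorm(100, mean = true_mean, sd = 10)   # one sample of size 100
  se <- sd(x) / sqrt(length(x))                # standard error of the mean
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower <= true_mean && true_mean <= upper     # TRUE if the interval covers the mean
})
coverage <- mean(covers)         # share of intervals containing the true mean
coverage

# Interval width is 2 * 1.96 * sd / sqrt(n), so quadrupling n halves the width
width_for_n <- function(n, sd = 10) 2 * 1.96 * sd / sqrt(n)
widths <- sapply(c(25, 100, 400), width_for_n)
widths
```

The coverage should land close to 0.95, and each fourfold increase in sample size halves the width.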
The confidence interval for a simple vector (a single set of numeric values) is derived from the standard error of the mean. Assuming the data are approximately normally distributed, a 95% confidence interval can be calculated as:

$$CI = \bar{x} \pm Z \times SE$$

where $\bar{x}$ is the sample mean, $Z$ is the critical value from the standard normal distribution (1.96 for 95% confidence), and $SE$ is the standard error of the mean, $s / \sqrt{n}$. In R, you can compute this as follows:
data <- seq(10, 200, 5)
mean_value <- mean(data)
se_value <- sd(data) / sqrt(length(data))
ci_lower <- mean_value - 1.96 * se_value
ci_upper <- mean_value + 1.96 * se_value
ci <- c(ci_lower, ci_upper)
ci
[1] 87.10773 122.89227
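The value 1.96 is the 97.5th percentile of the standard normal distribution. When the population standard deviation is estimated from the sample, a t-based interval is a common alternative, especially for small samples. The sketch below is an alternative to the calculation above, not a replacement for it. It derives both critical values and checks the hand-computed t interval against base R's `t.test()`.

```r
data <- seq(10, 200, 5)
n <- length(data)
se_value <- sd(data) / sqrt(n)

z_crit <- qnorm(0.975)            # about 1.96, the value used above
t_crit <- qt(0.975, df = n - 1)   # slightly larger than z_crit for finite n

# t-based 95% confidence interval computed by hand
ci_t <- mean(data) + c(-1, 1) * t_crit * se_value

# t.test() reports the same t-based interval
ci_builtin <- as.numeric(t.test(data, conf.level = 0.95)$conf.int)

ci_t
ci_builtin
```

Because `t_crit` exceeds `z_crit`, the t-based interval is a little wider than the z-based one. The gap narrows as the sample size increases.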
For grouped data (data divided into categories or groups), you may want to generate confidence intervals for each group’s mean. This entails determining the mean, standard error, and confidence intervals independently for each group. In R, this may be done using the tapply() function and the same steps as above:
# Create example data
set.seed(123) # For reproducibility
data <- data.frame(
values = rnorm(30, mean = 10, sd = 2), # 30 random values with mean 10 and sd 2
group = rep(c("A", "B", "C"), each = 10) # 3 groups: A, B, and C
)
group_means <- tapply(data$values, data$group, mean)
group_ses <- tapply(data$values, data$group, function(x) sd(x) / sqrt(length(x)))
ci_lower <- group_means - 1.96 * group_ses
ci_upper <- group_means + 1.96 * group_ses
ci <- data.frame(Group = names(group_means), CI_Lower = ci_lower, CI_Upper = ci_upper)
ci
Group CI_Lower CI_Upper
A A 8.966928 11.33157
B B 9.130435 11.70405
C C 7.997039 10.30473
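With only 10 observations per group, the normal critical value of 1.96 may understate the uncertainty. As a hedged variation on the code above, the sketch below wraps a t-based interval in a small helper and applies it to each group. The function name `mean_ci` and its `conf_level` argument are hypothetical, introduced here only for illustration.

```r
# Recreate the example data from above
set.seed(123)
data <- data.frame(
  values = rnorm(30, mean = 10, sd = 2),
  group = rep(c("A", "B", "C"), each = 10)
)

# Hypothetical helper: t-based confidence interval for one numeric vector
mean_ci <- function(x, conf_level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  c(Mean = mean(x),
    CI_Lower = mean(x) - t_crit * se,
    CI_Upper = mean(x) + t_crit * se)
}

# Split the values by group, apply the helper, and stack the results row-wise
ci_by_group <- t(sapply(split(data$values, data$group), mean_ci))
ci_by_group
```

Each row holds one group's mean and interval. These intervals come out wider than the z-based ones because the t critical value for 9 degrees of freedom is larger than 1.96.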
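Nominal data consist of categories that lack intrinsic ordering (e.g., gender, eye color). Confidence intervals for proportions in nominal data are based on the binomial distribution. The code below uses its normal approximation (the Wald interval), which is the simplest option but can be inaccurate for small samples or for proportions near 0 or 1. For example, to find the confidence interval for the proportion of a specific category, you can use the following R code: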
data <- c("Category1", "Category2", "Category1", "Category3", "Category1",
"Category2", "Category1", "Category3", "Category1", "Category2")
prop <- sum(data == "Category1") / length(data)
se_prop <- sqrt((prop * (1 - prop)) / length(data))
ci_lower <- prop - 1.96 * se_prop
ci_upper <- prop + 1.96 * se_prop
ci <- c(ci_lower, ci_upper)
ci
[1] 0.1900968 0.8099032
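With only 10 observations, the normal approximation used above is rough. As an alternative sketch, base R's `binom.test()` gives an exact (Clopper-Pearson) interval, and `prop.test()` with `correct = FALSE` gives the Wilson score interval. Both are shown here for the same data.

```r
data <- c("Category1", "Category2", "Category1", "Category3", "Category1",
          "Category2", "Category1", "Category3", "Category1", "Category2")
successes <- sum(data == "Category1")   # observations in the category of interest
n <- length(data)

# Exact (Clopper-Pearson) interval from the binomial distribution
ci_exact <- as.numeric(binom.test(successes, n, conf.level = 0.95)$conf.int)

# Wilson score interval (prop.test without continuity correction)
ci_wilson <- as.numeric(
  prop.test(successes, n, conf.level = 0.95, correct = FALSE)$conf.int
)

ci_exact
ci_wilson
```

Both intervals always stay within 0 and 1, which the simple plus-or-minus formula does not guarantee when the proportion is close to either boundary.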
In multinomial data, each observation falls into exactly one of several possible categories. Because the category proportions must sum to 1 and are estimated together, their confidence intervals need to be considered simultaneously, which makes the calculation more complex. One popular approach is the DescTools package, which provides the MultinomCI() function for multinomial confidence intervals:
data <- c("Category1", "Category2", "Category1", "Category3", "Category1",
"Category2", "Category1", "Category3", "Category1", "Category2")
# install.packages("DescTools")
library(DescTools)
counts <- table(data) # Count occurrences of each category
ci <- MultinomCI(counts, conf.level = 0.95)
ci
est lwr.ci upr.ci
Category1 0.5 0.3 0.8715862
Category2 0.3 0.1 0.6715862
Category3 0.2 0.0 0.5715862
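The MultinomCI() function returns the estimated proportion (est) of each category together with the lower (lwr.ci) and upper (upr.ci) limits of its simultaneous confidence interval. By default it uses the Sison-Glaz method, and other methods can be selected through its method argument.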
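If DescTools is not available, a rough base-R alternative is to compute an exact binomial interval for each category and apply a Bonferroni adjustment, so that the intervals hold jointly with at least 95% confidence. This is a different technique from the one MultinomCI() uses and will generally give different, often wider, intervals. The sketch below only illustrates the idea of simultaneous intervals.

```r
data <- c("Category1", "Category2", "Category1", "Category3", "Category1",
          "Category2", "Category1", "Category3", "Category1", "Category2")
counts <- table(data)
counts_vec <- setNames(as.integer(counts), names(counts))  # plain named vector

k <- length(counts_vec)   # number of categories
n <- sum(counts_vec)      # total number of observations
alpha <- 0.05             # overall error rate for the whole set of intervals

# Bonferroni adjustment: each interval is built at level 1 - alpha / k
ci_bonf <- t(sapply(counts_vec, function(x) {
  ci <- binom.test(x, n, conf.level = 1 - alpha / k)$conf.int
  c(est = x / n, lwr.ci = ci[1], upr.ci = ci[2])
}))
ci_bonf
```

The column names mirror the MultinomCI() output so the two results are easy to compare side by side.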
Confidence intervals are a critical tool in descriptive statistics: they quantify the uncertainty of your estimates by giving a range within which the true population parameter is likely to fall. Whether you're working with simple numeric data, grouped data, or categorical data, knowing how to compute and interpret confidence intervals in R lets you report each estimate together with a measure of its precision.