What is the interval within which a population parameter is expected to lie?

Suppose you know the mean value of a sample and you wish to use the sample mean to estimate the interval in which the population’s mean will lie. The Interval Estimation technique can be used to arrive at this estimate at some specified confidence level. This technique can be easily extended to estimating the interval for other population-level parameters such as the variance.

To illustrate, suppose you have polled one hundred randomly selected households in Boston, Massachusetts, and you have found their average annual income to be $65,713. Can you use this single number to estimate, with some quantifiable measure of confidence such as 90%, the range in which the average annual income of all three hundred thousand households in Boston will lie? It turns out, you can do this! 

The following figure illustrates this situation:

A 90% confidence interval [63863, 675902] for the mean Boston area household income (in 2021 dollars) (Image by Author)

In the above illustration, you may notice that we have assumed a certain probability distribution for the annual incomes of all 300K households in Boston. While it helps to have some idea what this distribution might be, it is not a strict requirement. You can get a decent interval estimate of the unknown population mean even if you have absolutely no idea what distribution the population follows!
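To get a feel for this claim, here is a minimal simulation sketch. It assumes a deliberately skewed, made-up population (an Exponential distribution with mean 1, which is not the Boston income data), draws many samples of size 100 from it, builds a 90% interval around each sample mean, and counts how often the interval contains the true mean. The names `TRUE_MEAN`, `trials` and `coverage` are illustrative, and the standard library's `statistics.NormalDist` is used only to keep the sketch free of dependencies.

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(42)                      # fixed seed so the run is repeatable

TRUE_MEAN = 1.0                      # mean of the assumed Exponential(1) population
n = 100                              # size of each sample
trials = 2000                        # number of repeated samples
z = NormalDist().inv_cdf(0.95)       # z-value for a 90% interval (each tail = 0.05)

hits = 0
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    y_bar = fmean(sample)
    half_width = z * stdev(sample) / n ** 0.5
    # Does this sample's interval contain the true population mean?
    if y_bar - half_width <= TRUE_MEAN <= y_bar + half_width:
        hits += 1

coverage = hits / trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

If the technique works as advertised, the printed fraction should land close to 0.90 even though the population is far from normally distributed.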

In real life decision making, interval estimates can be more useful than relying on a point estimate. For example, next time you hear a news story on some drug having shown an efficacy of 70% in curing some civilization-threatening disease, you may want to check the 95% confidence interval that the drug maker would have reported. If that interval turns out to be a wide one, say 25%–90%, then the 70% point estimate may not be a good indication of the true efficacy.

Estimating the interval for the population mean

Let’s illustrate the process of interval estimation using a real world data set.

We’ll use the following data set of 23.7K water samples taken from various beaches in the New York City metro area from 2005 to 2021. But instead of using the entire data set, we’ll use only the water samples data for Midland beach in Staten Island, New York.

Water quality data samples taken from the beaches of New York City between 2005 and 2021 (Source: NYC OpenData under Terms of Use) (Image by Author)

You may download the data either straight from the NYC OpenData site, or (preferably) download the curated version of the data set from over here.

We’ll use the Pandas library to load the data set into memory. We’ll start by importing all the Python packages that we will need:

import math
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm


#Load the data file, parsing the Sample_Date column as dates
df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, parse_dates=['Sample_Date'])

Our focus will be on the water quality data gathered for MIDLAND BEACH:

df_midland = df[df['Beach_Name']=='MIDLAND BEACH']

#print the data frame
print(df_midland)

We see the following output:

Water quality data for Midland beach, Staten Island (Image by Author)

This is our sample. Before we proceed further, notice that there are NaNs in the DataFrame in rows containing microbial values that were too small for the measuring apparatus to detect. We’ll replace all these NaNs with 0.

df_midland = df_midland.fillna(value=0)

Let’s print out the summary statistics for the sample:

print(df_midland['Enterococci_Results'].describe())

Summary statistics of the sample from Midland beach (Image by Author)

We’ll print out one more statistic, which is the most frequently occurring value a.k.a. the mode:

print(df_midland['Enterococci_Results'].mode())

We get a value of 0 for the mode, which means that the most common reading among the samples taken from Midland beach is a zero (or undetectable) level of Enterococci.

The following plot shows the frequency distribution of sample values:

plt.hist(df_midland['Enterococci_Results'], bins=100)
plt.xlabel('Number of Enterococci detected in the sample')
plt.ylabel('Number of samples')
plt.show()

Frequency distribution of sample values at Midland beach (Image by Author)

We see that the samples from Midland beach have a large variance.

Let’s review what we have learned so far about the data:

  • Our sample size n is 969
  • The sample mean Y_bar is 25.311042
  • The sample standard deviation S is 149.939190
  • The sample is highly skewed in favor of zero values (as indeed it should be for any well-maintained public beach!)

What we do not know are:

  • The population mean, µ. The population mean is the mean of the infinite number of samples that could theoretically have been collected at Midland beach from 2005 to 2021.
  • The population standard deviation, σ
  • The frequency distribution of the population.

What we want to find out is:

Can we create a probabilistic estimate for the interval in which the population mean µ might lie? Specifically, can we identify an interval [µ_low, µ_high] so that µ lies in [µ_low, µ_high] with some probability such as 95%?

Let’s see how to estimate this interval.

We will define a random variable Y which represents the quantity we are measuring, namely the Enterococci count in the water sample.

We know that Y takes values such as 0, 10, 100, 4, 8 etc. and its units are MPN per 100 ml (the units are immaterial as long as they are all in the same units).

Given a specified probability ρ, we wish to find two numbers µ_low and µ_high such that:

P(µ_low ≤ Y ≤ µ_high) = ρ
An interval estimate for random variable Y (Image by Author)

It is customary to use the notation (1-α) in place of ρ.

P(µ_low ≤ Y ≤ µ_high) = (1 − α)
An interval estimate for random variable Y (Image by Author)

What we are saying here is that we want to carve out a region under the Probability Density Function of Y, such that:

  1. The area of this region is (1-α), and,
  2. This region extends from µ_low to µ_high on the X-axis.

The figure below illustrates this situation in the context of a purely hypothetical distribution for the random variable Y. Remember, we don’t know the actual distribution of the population Y! Note that we are using the customary notation x and f(x) in the figure below because showing a variable called Y on the X-axis might make one’s head spin! So just think of Y, when you see x in this figure.

An illustration of an interval estimate for the random variable x (Image by Author)

Let’s look at the following expression again:

P(µ_low ≤ Y ≤ µ_high) = (1 − α)
An interval estimate for random variable Y (Image by Author)

We know that the total area under the curve of the Probability Density Function is 1.0. So the un-shaded area is equal to α. We can re-express the above equation as follows:

P(Y < µ_low) + P(Y > µ_high) = α
Confidence interval for the population mean (Image by Author)

We could slide µ_low and µ_high back and forth along the X-axis of the PDF curve until each un-shaded region has area α/2. The following figure illustrates this:

A reformulation of interval estimate for the random variable x (Image by Author)

Thus we have:

P(Y ≤ µ_low) = α/2 and P(Y ≤ µ_high) = 1 − α/2
The interval estimate of Y re-expressed in terms of tail probabilities α/2 (Image by Author)

P(Y ≤ µ_low) and P(Y ≤ µ_high) are simply the cumulative probabilities denoted by the Cumulative Distribution Function F(.) of Y. Thus, we have:

µ_low = F⁻¹(α/2) and µ_high = F⁻¹(1 − α/2)
The interval bounds µ_low and µ_high expressed as inverse CDFs
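As a quick sanity check of the inverse-CDF idea, here is a small sketch that assumes, purely for illustration, that Y follows an Exponential distribution with a made-up rate of 0.5 (chosen only because its CDF and inverse CDF have simple closed forms). The helper names `cdf` and `inverse_cdf` are hypothetical. The sketch picks the two bounds so that each tail holds α/2 and then confirms that the area between them is (1-α).

```python
import math

alpha = 0.10          # we want a 90% interval
rate = 0.5            # hypothetical rate of the assumed Exponential distribution

def cdf(x):
    """F(x) = P(Y <= x) for the assumed Exponential(rate) distribution."""
    return 1.0 - math.exp(-rate * x)

def inverse_cdf(q):
    """F^-1(q): the value x at which the cumulative probability equals q."""
    return -math.log(1.0 - q) / rate

mu_low = inverse_cdf(alpha / 2)         # left tail holds alpha/2
mu_high = inverse_cdf(1 - alpha / 2)    # right tail holds alpha/2

area = cdf(mu_high) - cdf(mu_low)       # probability mass between the two bounds
print(f"mu_low={mu_low:.4f}, mu_high={mu_high:.4f}, area={area:.4f}")
```

The same two lines that compute `mu_low` and `mu_high` work for any distribution whose inverse CDF is available; only `inverse_cdf` changes.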

We will now (temporarily) assume that Y follows the standard normal distribution N(0, 1), i.e. it is normally distributed with a zero mean and a unit standard deviation. 

You may be surprised that we are assuming that Y is N(0,1) distributed since we have seen that Y is nowhere close to being even normally distributed, let alone N(0,1) distributed! Here’s a reminder of what the distribution of the sample values of Y looks like:

Frequency distribution of sample values Y at Midland beach (Image by Author)

But worry not.

You will see that we will take this rather bold risk, and we will still get away with it!

Now, an important consequence of assuming that Y ~ N(0,1) is that the CDF of Y, i.e. F(.), becomes the CDF of the standard normal distribution, denoted by Φ (pronounced as Phi). Thus, F(x) = Φ(x), and therefore we have:

µ_low = Φ⁻¹(α/2) and µ_high = Φ⁻¹(1 − α/2)
The low and high interval estimates in terms of the inverse CDF of N(0,1) distribution (Image by Author)

Let us take another small leap and define a real number p such that:

p = α/2
Introducing the relationship between p and α (Image by Author)

Since 0≤ α ≤ 1, we have: 0 ≤ p ≤ 1

Next, we define a quantity z_p as follows:

z_p = Φ⁻¹(1 − p), or equivalently Φ(z_p) = 1 − p
Definition of the z-value for a standard normal distribution with CDF Φ(x) (Image by Author)

In the above definition, (1-p) is interpreted as the cumulative probability. For example:

z_{0.95} = Φ⁻¹(1 − 0.95) = Φ⁻¹(0.05) ≈ −1.645
z-value for p=0.95 (Image by Author)

This is illustrated below on the PDF of the N(0,1) distribution:

z-value for p=0.95 (Image by Author)

Similarly:

z_{0.05} = Φ⁻¹(1 − 0.05) = Φ⁻¹(0.95) ≈ 1.645
z-value for p=0.05 (Image by Author)

Again, illustrated below:

z-value for p=0.05 (Image by Author)

Thus, we see that:

z_{0.95} = −z_{0.05}
The symmetry of the z-value (Image by Author)

Or, in general due to the symmetry of N(0,1) distribution around 0, we have:

z_{1−p} = −z_p
For a given p, the two corresponding z-values are symmetric around 0 (Image by Author)
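The symmetry is easy to verify numerically. The sketch below assumes the definition z_p = Φ⁻¹(1 − p), which is also how the z-value is computed with `norm.ppf(1-p)` later in this article. The helper name `z_value` is illustrative, and `statistics.NormalDist().inv_cdf` from the Python standard library plays the role of `scipy.stats.norm.ppf`.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)   # the N(0,1) distribution

def z_value(p):
    """Return z_p, the point whose cumulative probability under N(0,1) is (1 - p)."""
    return std_normal.inv_cdf(1.0 - p)

z_005 = z_value(0.05)    # cumulative probability 0.95, so it lies right of zero
z_095 = z_value(0.95)    # cumulative probability 0.05, so it lies left of zero

print(f"z_0.05 = {z_005:.4f}")
print(f"z_0.95 = {z_095:.4f}")
print(f"z_0.05 + z_0.95 = {z_005 + z_095:.2e}")   # close to zero by symmetry
```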

Now, recollect that:

µ_low = Φ⁻¹(α/2) and µ_high = Φ⁻¹(1 − α/2)
The low and high interval estimates in terms of the inverse CDF of N(0,1) distribution (Image by Author)

And since:

p = α/2, z_p = Φ⁻¹(1 − p) and z_{1−p} = −z_p

We have, after combining the above equations, the following result:

µ_low = −z_{α/2} and µ_high = z_{α/2}
The high and low interval levels for the random variable Y expressed in terms of z-values (Image by Author)

We can now express the (1-α)*100% confidence interval for the Enterococci count Y in terms of z-values as follows:

P(−z_{α/2} ≤ Y ≤ z_{α/2}) = (1 − α)
The (1-α) confidence interval for the random variable Y expressed in terms of z-values (Image by Author)

The following figure illustrates the two z-values for the 90% confidence interval around the zero mean of a standard normal distribution:

The z-values corresponding to the 90% confidence interval for the N(0,1) distribution (Image by Author)
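For reference, here is a short sketch that tabulates the z-value z_{α/2} for a few commonly used confidence levels, again assuming Y ~ N(0,1). The dictionary name `z_by_level` is illustrative, and the standard library's `NormalDist` stands in for `scipy.stats.norm`. It also checks that the area between −z and +z really equals the requested confidence level.

```python
from statistics import NormalDist

std_normal = NormalDist()          # N(0,1)
z_by_level = {}

for confidence in (0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    z = std_normal.inv_cdf(1.0 - alpha / 2)        # z-value with alpha/2 in the right tail
    covered = std_normal.cdf(z) - std_normal.cdf(-z)
    z_by_level[confidence] = z
    print(f"{confidence:.0%} interval: [-{z:.3f}, {z:.3f}], area covered = {covered:.4f}")
```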

Let’s keep in mind that all of the above assumes that the Enterococci count Y has a standard normal distribution even though we know very well that it is not. We have to find a way to deal with this Achilles’ heel.

At the risk of mixing some metaphors, we will now play our Ace of Spades and ‘cure’ our Achilles’ heel, in the following two steps:

  • We will redefine Y as the mean count of Enterococci found in n samples. Recollect that earlier, Y was a single measurement, namely the count found in one water sample. Now, we redefine Y to be the mean count as follows:

Y = (Y_1 + Y_2 + … + Y_n) / n
The mean Enterococci count from n samples (Image by Author)

Here, Y_1, Y_2,…Y_n are the Enterococci counts that were measured in samples 1,2,3,…n. Note that Y_1, Y_2,…Y_n are themselves independent, identically distributed random variables. And therefore, the redefined Y, being the mean of Y_1, Y_2,…Y_n, is also a random variable. To see why the mean count is also a random variable, imagine collecting another sample of size n from Midland beach. It will yield a different mean value. Now collect a third sample of size n, and you’ll get a third, possibly different value for the mean Y, and so on. Thus, the mean count of Enterococci, Y, is itself a random variable. And therefore, everything that we have derived so far regarding the confidence intervals of Y still holds true.

  • Now, let’s define another random variable Z such that:

Z = (Y − µ) / (σ/√n)
Z as a function of Y (Image by Author)

In the above formula:

  • n is the sample size (in our data set, n=969)
  • µ and σ are respectively the population mean and population standard deviation.

It can be shown that if n is ‘large enough’, then by the Central Limit Theorem, Z is approximately N(0, 1) distributed.

It can also be shown that for large n, the sample standard deviation S is a consistent estimate of the population standard deviation σ, i.e. S gets arbitrarily close to σ as n grows. Simply put, for large n we can replace σ with S in the above formula for Z.

Z = (Y − µ) / (S/√n) ~ N(0, 1) (approximately, for large n)
Z ~ N(0,1) (Image by Author)
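Here is a small simulation sketch of the Central Limit Theorem at work. It assumes a made-up, heavily skewed population (Exponential with µ = 1 and σ = 1, not the Midland beach data) so that µ and σ are known exactly, computes Z = (Y − µ)/(σ/√n) for many repeated samples, and checks that the resulting Z values behave like draws from N(0,1). The names `z_scores` and `inside` are illustrative.

```python
import random
from statistics import fmean, stdev

random.seed(7)            # fixed seed so the run is repeatable

mu, sigma = 1.0, 1.0      # mean and standard deviation of the assumed Exponential(1) population
n = 200                   # sample size
trials = 2000             # number of repeated samples

z_scores = []
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    y_bar = fmean(sample)                              # Y: the sample mean
    z_scores.append((y_bar - mu) / (sigma / n ** 0.5)) # Z as defined above

# For N(0,1) we expect mean 0, standard deviation 1, and about 95% of values within +/-1.96
inside = sum(1 for z in z_scores if abs(z) <= 1.96) / trials
print(f"mean of Z = {fmean(z_scores):.3f}")
print(f"std of Z  = {stdev(z_scores):.3f}")
print(f"fraction of |Z| <= 1.96 = {inside:.3f}")
```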

Just like Y, the variable Z is also a random variable, and Z has an approximately standard normal distribution. Therefore, we can formulate the (1-α) confidence interval estimate for Z in the same way as for Y:

P(−z_{α/2} ≤ Z ≤ z_{α/2}) = (1 − α)
The (1-α) confidence interval estimate for Z in terms of z values (Image by Author)

Substituting the formula for Z in the above equation:

P(−z_{α/2} ≤ (Y − µ) / (S/√n) ≤ z_{α/2}) = (1 − α)

Rearranging, we have the following result:

P(Y − z_{α/2}·S/√n ≤ µ ≤ Y + z_{α/2}·S/√n) = (1 − α), i.e. µ_low = Y − z_{α/2}·S/√n and µ_high = Y + z_{α/2}·S/√n
(1-α) confidence interval estimate for the population mean (Image by Author)

What we have accomplished is to arrive at the interval estimate [µ_low, µ_high] for the population mean µ at a (1-α) confidence level.

Let’s use the above formula to calculate the interval estimate for µ at 95% confidence level for the samples collected at Midland beach:

#sample size n
n = len(df_midland['Enterococci_Results'])

#sample mean Y
Y = df_midland['Enterococci_Results'].mean()

#sample standard deviation
S = df_midland['Enterococci_Results'].std()

#significance level alpha, so that (1-alpha)*100 = 95%
alpha = 0.05

#tail probability p corresponding to the required alpha
p = alpha / 2

#z value for the specified tail probability
z_p=norm.ppf(1-p)

#mu_low
mu_low = Y - z_p*S/math.sqrt(n)

#mu_high
mu_high = Y + z_p*S/math.sqrt(n)

print('95% Confidence intervals for the population mean (mu)='+str((mu_low, mu_high)))

We see the following output:

95% Confidence intervals for the population mean (mu)=(15.870403932430401, 34.751680690892634)

The rather wide confidence interval is due to the highly skewed distribution of the sample values which has in turn led to a large value of sample standard deviation S=149.94.
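To see how the width of the interval responds to the confidence level and to the sample size, here is a rough sketch built on the half-width z·S/√n from the formula for µ_low and µ_high. It plugs in the values of S and n reported earlier for Midland beach; the function name `margin_of_error` is illustrative, and the last line assumes, hypothetically, that S would stay the same if four times as many samples were collected.

```python
import math
from statistics import NormalDist

S = 149.939190    # sample standard deviation reported earlier
n = 969           # sample size reported earlier

def margin_of_error(s, n, confidence):
    """Half-width of the interval: z * s / sqrt(n), with alpha/2 in each tail."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2)
    return z * s / math.sqrt(n)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence: +/- {margin_of_error(S, n, confidence):.3f}")

moe_95 = margin_of_error(S, n, 0.95)
moe_95_4n = margin_of_error(S, 4 * n, 0.95)   # hypothetical: same S, four times the samples
print(f"95% half-width with n samples: {moe_95:.3f}, with 4n samples: {moe_95_4n:.3f}")
```

Because the half-width shrinks with √n, quadrupling the sample size only halves the interval, while a higher confidence level always widens it.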


Applicability of Interval Estimation to Regression Modeling

When you train a regression model, the coefficients of the regression variables acquire their ‘fitted’ values as follows:

Y_obs = f(X, β_cap) + ε
The observed values Y_obs are some function of the regression matrix X, the fitted coefficients vector β_cap and the residual errors of regression ε (Image by Author)

The bold notation indicates that the values are matrices.

For example, the equation of a trained linear regression model can be expressed as follows:

Y_obs = X·β_cap + ε
Equation of the fitted linear regression model (Image by Author)

Each one of the fitted coefficients in β_cap is a random variable. To see why, imagine that the model is trained on another training sample of size n. On the second sample, the fitted coefficients are likely to take on a different vector of values. On a third training sample of size n, β_cap would take on yet another vector of values, and so on. Thus, each β_cap_j in β_cap is a random variable having some unknown distribution.

Thus, whenever you train your regression model on a training data set, what you are getting are mere point estimates of the true population values of the coefficients.

Given this situation, it would be useful to know, at some level of confidence, the interval in which the ‘true’ population value of each coefficient lies. In other words, it would be useful to know the interval estimate for the true value of each β_cap_j at some confidence level (1-α)*100%.

Whenever you train a regression model, your modeling software will usually report these confidence intervals for all regression coefficients.
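To make the idea concrete before turning to the beach data, here is a minimal sketch on synthetic data. It assumes a simple straight-line model with one regression variable and made-up true coefficients (`TRUE_INTERCEPT`, `TRUE_SLOPE`), fits it by ordinary least squares using only the standard library, and builds a large-sample 95% interval for the slope as fitted value ± z × standard error. This is an illustration of the principle, not the exact procedure any particular modeling package follows.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(0)                                  # fixed seed so the run is repeatable

TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0           # made-up 'population' coefficients
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, 1) for xi in x]

# Ordinary least squares fit of y = intercept + slope * x
x_bar, y_bar = fmean(x), fmean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
intercept_hat = y_bar - slope_hat * x_bar

# Standard error of the fitted slope, from the residual variance
residuals = [yi - (intercept_hat + slope_hat * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(r * r for r in residuals) / (n - 2)
se_slope = math.sqrt(sigma2_hat / s_xx)

# Large-sample 95% interval estimate for the true slope
z = NormalDist().inv_cdf(0.975)
slope_low, slope_high = slope_hat - z * se_slope, slope_hat + z * se_slope
print(f"fitted slope = {slope_hat:.4f}, 95% interval = ({slope_low:.4f}, {slope_high:.4f})")
```

Re-running the sketch with a different seed gives a different fitted slope and a different interval, which is exactly why the fitted coefficient is treated as a random variable.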

Let us see a quick illustration of the application of interval estimates by building a regression model on the Enterococci counts data set.

Our variables are as follows:
Y = Enterococci_Results
X = [Sample_Location, MEASUREMENT_DAY_OF_WEEK, MEASUREMENT_MONTH], i.e. 3 regression variables.

Start by importing all the required packages:

import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm

Load the data set into memory using Pandas:

df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, parse_dates=['Sample_Date'])

Carve out the data for the Midland beach:

df_midland = df[df['Beach_Name']=='MIDLAND BEACH']

#print the data frame
print(df_midland)

Add the dummy variables: MEASUREMENT_DAY_OF_WEEK, MEASUREMENT_MONTH

#derive the day of week and the month from the parsed Sample_Date column
df_midland = df_midland.copy()
df_midland['MEASUREMENT_DAY_OF_WEEK'] = df_midland['Sample_Date'].dt.dayofweek
df_midland['MEASUREMENT_MONTH'] = df_midland['Sample_Date'].dt.month

Form the regression expression:

#Enterococci_Results is the dependent variable; the three regression variables are on the right
#(the variable names expr, y_train, X_train and poisson_training_results are illustrative)
expr = 'Enterococci_Results ~ Sample_Location + MEASUREMENT_DAY_OF_WEEK + MEASUREMENT_MONTH'

Use Patsy to carve out the X and y matrices:

y_train, X_train = dmatrices(expr, df_midland, return_type='dataframe')

Build and train a Poisson Generalized Linear Model (it uses a log link function by default):

poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()

Print the fitted model’s summary:

print(poisson_training_results.summary())

We see the following output. I have highlighted the fitted coefficients of the model. Notice also that statsmodels has printed out the interval estimates for the true population level values of the coefficients at the 95% confidence level:

GLM Poisson Regression Model’s training output showing the fitted coefficients and the corresponding interval estimates at the 95% confidence level (Image by Author)


#Complete source code for the interval estimate of the population mean and for the PDF plots
import math
import matplotlib.pyplot as plt
from scipy.stats import invweibull
from scipy.stats import norm
import numpy as np
import pandas as pd

#Load the data file
df = pd.read_csv('DOHMH_Beach_Water_Quality_Data.csv', header=0, parse_dates=['Sample_Date'])

#filter out the data for our beach of interest, which is the MIDLAND BEACH
df_midland = df[df['Beach_Name']=='MIDLAND BEACH']

#print the data frame
print(df_midland)

#replace all these NaNs with 0
df_midland = df_midland.fillna(value=0)

#print out the summary statistics for the sample
print(df_midland['Enterococci_Results'].describe())

#print out one more statistic which is the most frequently occurring value a.k.a. the mode
print(df_midland['Enterococci_Results'].mode())

#The following plot shows the frequency distribution of sample values:
plt.hist(df_midland['Enterococci_Results'], bins=100)
plt.xlabel('Number of Enterococci detected in the sample')
plt.ylabel('Number of samples')
plt.show()

#Calculate the interval estimate for the population mean mu

#sample size n
n = len(df_midland['Enterococci_Results'])

#sample mean Y
Y = df_midland['Enterococci_Results'].mean()

#sample standard deviation
S = df_midland['Enterococci_Results'].std()

#significance level alpha, so that (1-alpha)*100 = 95%
alpha = 0.05

#tail probability p corresponding to the required alpha
p = alpha / 2

#z value for the specified tail probability
z_p=norm.ppf(1-p)

#mu_low
mu_low = Y - z_p*S/math.sqrt(n)

#mu_high
mu_high = Y + z_p*S/math.sqrt(n)

print('95% Confidence intervals for the population mean (mu)='+str((mu_low, mu_high)))

#############################################################################
#plot the pdf of the inverse Weibull distribution.
fig = plt.figure()
fig.suptitle('Probability Density Function f(x)')
plt.xlabel('x')
plt.ylabel('Probability density')
c = 100
x = np.linspace(invweibull.ppf(0.00000001, c), invweibull.ppf(0.999999999, c), 10000)
y = invweibull.pdf(x, c)
plt.plot(x, y, '-', linewidth=1, alpha=0.6, color='black')
mu_l = 0.99
mu_h = 1.035
#shaded_x = x[np.logical_and(x >= mu_l, x <= mu_h)]
#plt.fill_between(shaded_x, invweibull.pdf(shaded_x, c), color='blue', alpha=0.65, linewidth=0)
#shade the two tails that lie outside [mu_l, mu_h]
shaded_x_low = x[np.logical_and(x >= 0, x <= mu_l)]
plt.fill_between(shaded_x_low, invweibull.pdf(shaded_x_low, c), color='blue', alpha=0.65, linewidth=0)
shaded_x_high = x[np.logical_and(x >= mu_h, x <= 10000)]
plt.fill_between(shaded_x_high, invweibull.pdf(shaded_x_high, c), color='blue', alpha=0.65, linewidth=0)
plt.show()

#############################################################################
#plot the pdf of the Normal distribution.
fig = plt.figure()
fig.suptitle('Probability Density Function f(x)')
plt.xlabel('x')
plt.ylabel('Probability density')
x = np.linspace(norm.ppf(0.00000001), norm.ppf(0.999999999), 10000)
y = norm.pdf(x)
plt.plot(x, y, '-', linewidth=1, alpha=0.6, color='black')
#shade everything to the left of z_h (a large negative z_l stands in for minus infinity)
z_l = -1000
z_h = 1.645
shaded_x = x[np.logical_and(x >= z_l, x <= z_h)]
plt.fill_between(shaded_x, norm.pdf(shaded_x), color='blue', alpha=0.65, linewidth=0)
plt.show()


References and Copyrights

Data set

DOHMH Beach Water Quality Data taken from NYC OpenData under their Terms of Use

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

What is an interval estimate of a population parameter?

Interval estimation, in statistics, is the evaluation of a parameter (for example, the mean) of a population by computing an interval, or range of values, within which the parameter is most likely to be located.

What is the range within which the population parameter is expected to fall for a given level of confidence?

In other words, an interval estimate is a range in which the population parameter probably falls for a given level of confidence. We have to establish a level of confidence because we can never be 100 percent sure that the population mean falls within our confidence interval.

What is the interval for the population mean?

The general format of a confidence interval estimate of a population mean is: Sample mean ± Multiplier × Standard error of mean. For a variable x_j, a confidence interval estimate of its population mean is x̄_j ± Multiplier × s_j/√n.

Does a 95% confidence interval always contain the population parameter?

No. If we drew many samples and computed a 95% confidence interval from each one, about 95% of these intervals would include the population parameter. However, without any additional information we cannot say which ones! Thus with only one sample, and no other information about the population parameter, we can say that we are 95% confident that our interval includes the parameter.