Which of the conditions below must be met in order to conduct a Z test?
Just about every statistics student I've ever tutored has asked me this question at some point. When I first started tutoring I'd explain that it depends on the problem, and start rambling on about the central limit theorem until their eyes glazed over. Then I realized it's easier to understand if I just make a flowchart. So, here it is!
Basically, it depends on four things:
When you're working on a statistics word problem, these are the things you need to look for.

Proportion problems are never t-test problems - always use z! However, you need to check that \(np_{0}\) and \(n(1-p_{0})\) are both at least 10, where \(n\) is your sample size and \(p_{0}\) is your hypothesized population proportion. This is basically saying that both groups (for example, male and female) should be expected to appear often enough in the sample to be adequately represented.

Generally speaking, the problem will explicitly tell you if the population standard deviation is known - if it doesn't say, assume that it's unknown. The same goes for a normally distributed population - if the problem doesn't say "assume the population is normally distributed", or something to that effect, then do not just make up that assumption. Fortunately, if the sample size is large enough, it doesn't matter!

The coronavirus pandemic has made a statistician out of us all. We are constantly checking the numbers, making our own assumptions on how the pandemic will play out, and generating hypotheses on when the "peak" will happen. And it's not just us building hypotheses; the media is thriving on it. A few days back I was reading a news article that mentioned this outbreak "could potentially be seasonal" and relent in warmer conditions.

So I started wondering: what else can we hypothesize about the coronavirus? Are adults more likely to be affected by the outbreak? How does relative humidity impact the spread of the virus? What is the evidence to support these claims? How can we test these hypotheses?

As a statistics enthusiast, all these questions dig up my old knowledge about the fundamentals of Hypothesis Testing. In this article, we will discuss the concept of Hypothesis Testing and the difference between the Z test and the t-test. We will then conclude our Hypothesis Testing learning using a COVID-19 case study.
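Fundamentals of Hypothesis Testing

Let's take an example to understand the concept of Hypothesis Testing. A person is on trial for a criminal offense and the judge needs to provide a verdict on his case. Now, there are four possible combinations in such a case:

- The person is innocent and the judge finds him not guilty (correct decision).
- The person is innocent and the judge finds him guilty (error).
- The person is guilty and the judge finds him guilty (correct decision).
- The person is guilty and the judge finds him not guilty (error).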
As you can clearly see, there can be two types of error in the judgment: a Type 1 error, when the verdict goes against the person although he was innocent, and a Type 2 error, when the verdict is in favor of the person although he was guilty.

According to the Presumption of Innocence, the person is considered innocent until proven guilty. That means the judge must find evidence which convinces him "beyond a reasonable doubt". "Beyond a reasonable doubt" can be understood as requiring that Probability (Judge Decided Guilty | Person is Innocent) be small.

The basic concepts of Hypothesis Testing are quite analogous to this situation. We consider the Null Hypothesis to be true until we find strong evidence against it. Then we accept the Alternate Hypothesis. We also determine the Significance Level (α), which can be understood as the Probability (Judge Decided Guilty | Person is Innocent) in the previous example. Thus, the smaller α is, the more evidence we require to reject the Null Hypothesis. Don't worry, we'll cover all of this using a case study later.

Steps to Perform Hypothesis Testing

There are four steps to perform Hypothesis Testing:

1. Set the Null and Alternate Hypotheses.
2. Set the Significance Level (α).
3. Compute the test statistic and its p-value.
4. Make a decision.
Steps 1 to 3 are quite self-explanatory, but on what basis can we make a decision in step 4? What does this p-value indicate?

We can understand the p-value as a measurement of the Defense Attorney's argument. If the p-value is less than α, we reject the Null Hypothesis; if the p-value is greater than α, we fail to reject the Null Hypothesis.

Critical Value, p-value

Let's understand the logic of Hypothesis Testing with the graphical representation for the Normal Distribution. Typically, we set the Significance Level at 10%, 5%, or 1%. If our test score lies in the acceptance zone, we fail to reject the Null Hypothesis. If our test score lies in the critical zone, we reject the Null Hypothesis and accept the Alternate Hypothesis.
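The two decision rules described here (test score against critical value, and p-value against the significance level) can be sketched in a few lines of Python. This is only an illustration for a right-tailed z test: the function name and the test score of 2.0 are made up for the example, and it uses `statistics.NormalDist` from the standard library (Python 3.8+).

```python
from statistics import NormalDist


def right_tailed_z_decision(z_score, alpha=0.05):
    """Decide a right-tailed z test in two equivalent ways (illustrative helper)."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

    # Boundary between the acceptance zone and the critical zone.
    critical_value = std_normal.inv_cdf(1 - alpha)

    # Probability of a test score at least this large if the null is true.
    p_value = 1 - std_normal.cdf(z_score)

    reject_by_critical_value = z_score > critical_value
    reject_by_p_value = p_value < alpha
    return critical_value, p_value, reject_by_critical_value, reject_by_p_value


# Hypothetical test score, checked at the three usual significance levels.
for alpha in (0.10, 0.05, 0.01):
    cv, p, by_cv, by_p = right_tailed_z_decision(2.0, alpha)
    print(f"alpha={alpha:.2f}  critical value={cv:.3f}  p-value={p:.4f}  reject={by_p}")
```

Both rules always give the same decision. Note that the critical value changes with every significance level, while the p-value is computed once and simply compared against each level.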
But why do we need the p-value when we can reject or accept hypotheses based on the test score and the critical value?

The p-value has the benefit that we only need one value to make a decision about the hypothesis. We don't need to compute two different values, the critical value and the test score. Another benefit is that we can test at any desired level of significance by comparing the p-value directly with the significance level. This way we don't need to compute a new critical value for each significance level.

Directional Hypothesis

In a Directional Hypothesis test, the null hypothesis is rejected if the test score is too large (for a right-tailed test) or too small (for a left-tailed test). Thus, the rejection region for such a test consists of one part: to the right of the center for a right-tailed test, or to the left for a left-tailed test.

Non-Directional Hypothesis

In a Non-Directional Hypothesis test, the Null Hypothesis is rejected if the test score is either too small or too large. Thus, the rejection region for such a test consists of two parts: one on the left and one on the right.

What is the Z Test?

Z tests are a statistical way of testing a hypothesis when either:

- we know the population variance, or
- we do not know the population variance but our sample size is large (n ≥ 30).
If we have a sample size of less than 30 and do not know the population variance, then we must use a t-test.

One-Sample Z Test

We perform the One-Sample Z test when we want to compare a sample mean with the population mean.

Here's an Example to Understand a One-Sample Z Test

Let's say we need to determine if girls on average score higher than 600 in the exam. We have the information that the standard deviation for girls' scores is 100. So, we collect a random sample of 20 girls and record their marks. Finally, we also set our α value (significance level) to be 0.05.

In this example, the null hypothesis is that the mean score is 600 and the alternate hypothesis is that it is higher than 600, with \(n = 20\), \(\sigma = 100\) and α = 0.05. The test score is \(z = (\bar{x} - \mu_{0}) / (\sigma / \sqrt{n})\), where \(\bar{x}\) is the sample mean and \(\mu_{0} = 600\).
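A minimal Python sketch of this one-sample Z test follows. The hypothesized mean (600), the standard deviation (100) and the sample size (20) come from the example; the sample mean of 641 is a made-up value used only for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, hypothesized_mean, population_sd, n):
    """Right-tailed one-sample z test; returns (z score, p-value)."""
    standard_error = population_sd / sqrt(n)
    z_score = (sample_mean - hypothesized_mean) / standard_error
    # Right-tailed: probability of a z score at least this large under the null.
    p_value = 1 - NormalDist().cdf(z_score)
    return z_score, p_value


# sample_mean=641 is an assumed value, not data from the article.
z_score, p_value = one_sample_z_test(
    sample_mean=641, hypothesized_mean=600, population_sd=100, n=20
)
alpha = 0.05
print(f"z = {z_score:.4f}, p-value = {p_value:.4f}")
print("Reject the null hypothesis" if p_value < alpha else "Fail to reject the null hypothesis")
```

With this assumed sample mean the p-value falls below 0.05, which matches the kind of outcome the example describes.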
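Since the p-value is less than 0.05, we can reject the null hypothesis and conclude based on our result that girls on average scored higher than 600.

Two-Sample Z Test

We perform a Two-Sample Z test when we want to compare the means of two samples.

Here's an Example to Understand a Two-Sample Z Test

Here, let's say we want to know if girls on average score 10 marks more than the boys. We have the information that the standard deviation for girls' scores is 100 and for boys' scores is 90. Then we collect random samples of 20 girls and 20 boys and record their marks. Finally, we also set our α value (significance level) to be 0.05.

In this example, the null hypothesis is that the difference between the mean scores (girls minus boys) is 10 and the alternate hypothesis is that it is greater than 10, with \(n_{1} = n_{2} = 20\), \(\sigma_{1} = 100\), \(\sigma_{2} = 90\) and α = 0.05. The test score is \(z = \left((\bar{x}_{1} - \bar{x}_{2}) - 10\right) / \sqrt{\sigma_{1}^{2}/n_{1} + \sigma_{2}^{2}/n_{2}}\).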
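A minimal Python sketch of this two-sample Z test follows. The standard deviations (100 and 90), sample sizes (20 each) and hypothesized difference (10) come from the example; the two sample means (641 and 613.3) are made-up values for illustration only, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean_1, mean_2, sd_1, sd_2, n_1, n_2, hypothesized_diff=0.0):
    """Right-tailed two-sample z test; returns (z score, p-value)."""
    # Standard error of the difference between two independent sample means.
    standard_error = sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    z_score = ((mean_1 - mean_2) - hypothesized_diff) / standard_error
    p_value = 1 - NormalDist().cdf(z_score)
    return z_score, p_value


# The sample means 641 (girls) and 613.3 (boys) are assumed values.
z_score, p_value = two_sample_z_test(
    mean_1=641, mean_2=613.3, sd_1=100, sd_2=90, n_1=20, n_2=20, hypothesized_diff=10
)
alpha = 0.05
print(f"z = {z_score:.4f}, p-value = {p_value:.4f}")
print("Reject the null hypothesis" if p_value < alpha else "Fail to reject the null hypothesis")
```

With these assumed means the p-value is well above 0.05, so the sketch ends in a "fail to reject" decision.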
Thus, based on the p-value, we fail to reject the Null Hypothesis. We don't have enough evidence to conclude that girls on average score 10 marks more than the boys. Pretty simple, right?

What is the t-Test?

t-tests are a statistical way of testing a hypothesis when:

- we do not know the population variance, and
- our sample size is small (n < 30).
One-Sample t-Test

We perform a One-Sample t-test when we want to compare a sample mean with the population mean. The difference from the Z test is that we do not have the information on the population variance here. We use the sample standard deviation instead of the population standard deviation in this case.

Here's an Example to Understand a One-Sample t-Test

Let's say we want to determine if on average girls score more than 600 in the exam. We do not have the information related to the variance (or standard deviation) of girls' scores. To perform a t-test, we randomly collect the data of 10 girls with their marks and choose our α value (significance level) to be 0.05 for Hypothesis Testing.

In this example, the null hypothesis is that the mean score is 600 and the alternate hypothesis is that it is higher than 600, with \(n = 10\) (so 9 degrees of freedom) and α = 0.05. The test score is \(t = (\bar{x} - \mu_{0}) / (s / \sqrt{n})\), where \(s\) is the sample standard deviation.
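A minimal Python sketch of this one-sample t-test follows. The ten scores are made-up values for illustration, not the article's data, and the function name is hypothetical. Python's standard library has no t-distribution, so this sketch compares the t score with the tabulated one-tailed critical value for 9 degrees of freedom at α = 0.05 (about 1.833) rather than computing a p-value; the decision is the same either way.

```python
from math import sqrt
from statistics import mean, stdev

# One-tailed critical value from a standard t table: 9 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF9_ALPHA_05 = 1.833


def one_sample_t_score(sample, hypothesized_mean):
    """t score for a one-sample t-test, using the sample standard deviation."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev() divides by n - 1
    return (mean(sample) - hypothesized_mean) / standard_error


# Hypothetical marks of 10 girls (assumed data for illustration).
scores = [587, 602, 627, 610, 619, 622, 605, 608, 596, 592]

t_score = one_sample_t_score(scores, hypothesized_mean=600)
reject_null = t_score > T_CRITICAL_DF9_ALPHA_05
print(f"t = {t_score:.3f}, critical value = {T_CRITICAL_DF9_ALPHA_05}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

For these assumed scores the t score stays below the critical value, so the sketch ends in a "fail to reject" decision. If SciPy is available, `scipy.stats.ttest_1samp` can return the p-value directly.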
Our p-value is greater than 0.05, thus we fail to reject the null hypothesis and don't have enough evidence to support the hypothesis that on average, girls score more than 600 in the exam.

Two-Sample t-Test

We perform a Two-Sample t-test when we want to compare the means of two samples.

Here's an Example to Understand a Two-Sample t-Test

Here, let's say we want to determine if on average, boys score 15 marks more than girls in the exam. We do not have the information related to the variance (or standard deviation) of girls' scores or boys' scores. To perform a t-test, we randomly collect the data of 10 girls and boys with their marks. We choose our α value (significance level) to be 0.05 as the criterion for Hypothesis Testing. In this example:
Thus, the p-value is less than 0.05, so we can reject the null hypothesis and conclude that on average boys score 15 marks more than girls in the exam.

Deciding between Z Test and t-Test

So when should we perform the Z test and when should we perform the t-test? It's a key question we need to answer if we want to master statistics.

If the sample size is large enough, then the Z test and the t-test will reach the same conclusion. For a large sample size, the sample variance is a better estimate of the population variance, so even if the population variance is unknown, we can use the Z test with the sample variance. Similarly, a large sample gives us many degrees of freedom. Since the t-distribution approaches the normal distribution as the degrees of freedom grow, the difference between the z score and the t score becomes negligible.

Case Study: Hypothesis Testing for Coronavirus using Python

Now let's implement the Two-Sample Z test for a coronavirus dataset. Let's put our theoretical knowledge into practice and see how well we can do. You can download the dataset here. This dataset has been taken from the Johns Hopkins repository and you can find the link here for it.

This dataset has the below features:
We have also added Temperature and Humidity features for each Latitude and Longitude using Python's weather API, Pyweatherbit. A common perception about COVID-19 is that a warm climate is more resistant to the outbreak, and we need to verify this using Hypothesis Testing. So what will our null and alternate hypotheses be?

- Null Hypothesis: temperature doesn't affect the COVID-19 outbreak.
- Alternate Hypothesis: temperature does affect the COVID-19 outbreak.
Note: We are considering a Temperature below 24 as Cold Climate and above 24 as Hot Climate in our dataset.

Thus, we do not have evidence to reject our Null Hypothesis that temperature doesn't affect the COVID-19 outbreak. Although we could not find an impact of temperature on COVID-19, this problem has only been taken up for a conceptual understanding of what we have learned in this article. There are certain limitations of the Z test for COVID-19 datasets:
So, we need to be more cautious and research more to identify the pattern of this pandemic.

End Notes

In this article, we followed a step-by-step procedure to understand the fundamentals of Hypothesis Testing, Type 1 Error, Type 2 Error, Significance Level, Critical Value, p-value, Non-Directional Hypothesis, Directional Hypothesis, Z Test and t-Test, and finally implemented a Two-Sample Z Test for a coronavirus case study.

Always remember: "Statistics is the Grammar of Data Science". Did you find this article useful? Can you think of any other applications of different statistical tests? Let me know in the comments section below and we can come up with more ideas.

What conditions must be met for the hypothesis test to be valid?

- Each observation must be independent of all other observations.
- There must be at least 3 levels of the categorical variable.
- There must be at least 10 "success" and 10 "failure" observations.
- There must be an expected count of at least 5 for each level of the categorical variable.
What conditions are necessary in order to use the z

In order to use the two-proportion Z test, the sampling distribution of the difference between the two sample proportions must be approximately normal, and the two samples must be independent and randomly sampled from the two populations.
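What conditions must be met to use Z procedures in a significance test about a population proportion?

In order to use Z procedures in a significance test about a population proportion, the population must be at least 10 times the sample size. In addition, as noted above, \(np_{0}\) and \(n(1-p_{0})\) should both be large enough, where \(n\) is the sample size and \(p_{0}\) is the hypothesized proportion.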
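The checks commonly applied before a one-proportion Z test can be sketched as a small Python helper. This is only an illustration: the function name, the returned keys and the example numbers are made up, and the threshold of at least 10 expected successes and failures is a common rule of thumb (some textbooks use a different cutoff). Whether the sample was drawn randomly cannot be checked from the numbers alone.

```python
def check_one_proportion_z_conditions(n, p0, population_size=None, min_count=10):
    """Check the usual numeric conditions for a one-proportion z test (illustrative helper)."""
    expected_successes = n * p0
    expected_failures = n * (1 - p0)

    # Large-counts condition: both expected counts should reach the threshold.
    large_counts_ok = expected_successes >= min_count and expected_failures >= min_count

    # 10% condition: the population should be at least 10 times the sample size.
    # Left as None when the population size is not given.
    ten_percent_ok = None if population_size is None else population_size >= 10 * n

    return {
        "expected_successes": expected_successes,
        "expected_failures": expected_failures,
        "large_counts_ok": large_counts_ok,
        "ten_percent_ok": ten_percent_ok,
    }


# Hypothetical problem: sample of 50 from a population of 2000, hypothesized proportion 0.3.
print(check_one_proportion_z_conditions(n=50, p0=0.3, population_size=2000))
```

If either check fails, the normal approximation behind the Z test is doubtful and an exact method such as a binomial test is the usual fallback.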