
Chapter 6: Statistics


Statistics is a core component of any data scientist’s toolkit. Since many commercial layers of a data science pipeline are built from statistical foundations (for example, A/B testing), knowing foundational topics of statistics is essential.

Interviewers love to test a candidate's knowledge of statistics basics, starting with topics like the Central Limit Theorem and the Law of Large Numbers, and then progressing to the concepts underlying hypothesis testing, particularly p-values and confidence intervals, as well as Type I and Type II errors and their interpretations. All of these topics play an important role in the statistical underpinnings of A/B testing. Additionally, derivations and manipulations involving random variables of various probability distributions are also common, particularly in finance interviews. Lastly, more technical interviews commonly involve applying MLE and/or MAP.

Topics to Review Before Your Interview

Properties of Random Variables

For any given random variable X, the following properties hold (we assume below that X is continuous, but analogous results hold for discrete random variables). The expectation (average value, or mean) of a random variable is given by integrating the value of X against its probability density function (PDF) f_X(x): µ = E[X] = ∫ x f_X(x) dx

and the variance is given by: Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

The variance is always non-negative, and its square root is called the standard deviation, which is used heavily in statistics: σ = √Var(X) = √(E[(X - E[X])²]) = √(E[X²] - (E[X])²)

The conditional expectation and conditional variance are defined analogously. For example, the conditional expectation of X, given that Y = y, is: E[X|Y = y] = ∫ x f_(X|Y)(x|y) dx

For any given random variables X and Y, the covariance, a linear measure of the relationship between the two variables, is defined by the following: Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

and the normalized version of covariance, represented by the Greek letter ρ (rho), is the correlation between X and Y: ρ(X,Y) = Cov(X,Y) / √(Var(X)Var(Y))

All of these properties are commonly tested in interviews, so it helps to understand the mathematical details behind each and to be able to walk through an example of each. For example, assume X follows a Uniform distribution on the interval [a, b]. Then its PDF is: f_X(x) = 1/(b-a) for x in [a, b], and the expectation of X is: E[X] = ∫ x f_X(x) dx = ∫ (from a to b) x/(b-a) dx = [x²/(2(b-a))] (from a to b) = (b² - a²)/(2(b-a)) = (a+b)/2
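As a sanity check, the uniform-distribution expectation above can be verified by simulation. A minimal sketch (the endpoints a = 2, b = 10, the seed, and the sample size are all arbitrary choices for illustration):

```python
import random

# Monte Carlo check of E[X] = (a + b) / 2 for X ~ Uniform(a, b).
random.seed(0)
a, b = 2.0, 10.0           # arbitrary endpoints for illustration
n = 200_000
samples = [random.uniform(a, b) for _ in range(n)]
empirical_mean = sum(samples) / n
print(empirical_mean)      # lands close to (a + b) / 2 = 6.0
```

The same pattern (sample, average, compare to the closed form) works for checking any expectation or variance derivation.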

Although it is not necessary to memorize the derivations for all the different probability distributions, you should be comfortable deriving them as needed, as it is a common request in more technical interviews. To this end, you should make sure to understand the formulas given above and be able to apply them to some of the common probability distributions like the exponential or uniform distribution.

Law of Large Numbers

The Law of Large Numbers (LLN) states that if you sample a random variable independently a large number of times, the measured average value should converge to the random variable’s true expectation. Stated more formally, X̄n = (X₁ + … + Xn) / n → µ, as n → ∞

This is important in studying the longer-term behavior of random variables over time. As an example, a coin might land on heads 5 times in a row, but over a much larger n we would expect the proportion of heads to be approximately half of the total flips. Similarly, a casino might experience a loss on any individual game, but over the long run should see a predictable profit over time.
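The coin-flip example can be sketched in a few lines; the seed and the checkpoints below are arbitrary:

```python
import random

# LLN sketch: the running proportion of heads (1 = heads) in fair-coin
# flips converges toward the true expectation 0.5 as n grows.
random.seed(1)
flips = [random.randint(0, 1) for _ in range(100_000)]

for n in (10, 1_000, 100_000):
    print(n, sum(flips[:n]) / n)   # running average at increasing n
```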

Central Limit Theorem

The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large number of times, the distribution of the sample mean approaches a normal distribution, regardless of the initial distribution of the random variable. Recall from the probability chapter that the normal distribution has PDF: f_X(x) = (1 / (σ√(2π))) exp( -(x-µ)² / (2σ²) ) with mean µ and standard deviation σ. The CLT states that, for large n: X̄n = (X₁ + … + Xn) / n is approximately N(µ, σ²/n), and hence (X̄n - µ) / (σ/√n) is approximately N(0,1)

The CLT provides the basis for much of hypothesis testing, which is discussed shortly. At a very basic level, you can consider the implications of this theorem for coin flipping: the probability of getting some number of heads over a large n of flips should be approximately that of a normal distribution. Whenever you're asked to reason about any particular distribution over a large sample size, you should remember to think of the CLT, whether the underlying distribution is Binomial, Poisson, or anything else.

Hypothesis Testing

General setup

The process of testing whether or not a sample of data supports a particular hypothesis is called hypothesis testing. Generally, hypotheses concern properties of interest for a given population, such as its parameters, like µ (for example, the mean conversion rate among a set of users). The steps in testing a hypothesis are as follows:

  1. State a null hypothesis and an alternative hypothesis. Either the null hypothesis will be rejected (in favor of the alternative hypothesis) or it will fail to be rejected (although failing to reject the null hypothesis does not necessarily mean it is true, but rather that there is not sufficient evidence to reject it).
  2. Use a particular test statistic of the null hypothesis to calculate the corresponding p-value.
  3. Compare the p-value to a chosen significance level α. Since the null hypothesis typically represents a baseline (e.g., the marketing campaign did not increase conversion rates), the goal is usually to reject it with statistical significance. Hypothesis tests are either one-tailed or two-tailed. A one-tailed test has hypotheses of the form: H₀: µ = µ₀ versus H₁: µ < µ₀ or H₁: µ > µ₀ whereas a two-tailed test has: H₀: µ = µ₀ versus H₁: µ ≠ µ₀ where H₀ is the null hypothesis, H₁ is the alternative hypothesis, and µ is the parameter of interest.

Understanding hypothesis testing is the basis of A/B testing, a topic commonly covered in tech companies' interviews. In A/B testing, different versions of a feature are shown to samples of users, and each variant is tested to determine whether it produced an uplift in core engagement metrics. Say, for example, that you are working for Uber Eats, which wants to determine whether email campaigns will increase its product's conversion rates. To conduct an appropriate hypothesis test, you would need two roughly equal groups (equal with respect to dimensions like age, gender, location, etc.). One group would receive the email campaign and the other group would not. The null hypothesis in this case would be that the two groups exhibit equal conversion rates, and the hope is that the null hypothesis would be rejected.

Test Statistics

A test statistic is a numerical summary designed to help determine whether the null hypothesis should be rejected in favor of the alternative hypothesis. More specifically, the test statistic is assumed to follow a particular sampling distribution under the null hypothesis. For example, the number of heads in a series of coin flips follows a binomial distribution, but with a large enough sample size, the sampling distribution is approximately normal. Hence, the sampling distribution of the total number of heads in a large series of coin flips can be treated as normal. Common test statistics and their distributions include the following:

  1. Z-test: assumes the test statistic follows a normal distribution under the null hypothesis
  2. t-test: uses a student’s t-distribution rather than a normal distribution
  3. Chi-squared test: used to assess goodness of fit and to check whether two categorical variables are independent

Z-Test

Generally the Z-test is used when the sample size is large (to invoke the CLT) or when the population variance is known, and a t-test is used when the sample size is small and when the population variance is unknown. The Z-test for a population mean is formulated as: z = (X̄ - µ₀) / (σ/√n) ~ N(0,1) in the case where the population variance σ² is known.
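The Z-statistic above can be computed directly. A minimal sketch, where the sample mean, µ₀, σ, and n are all made-up numbers for illustration, and the standard normal CDF is built from the error function so no external library is needed:

```python
import math

# One-sample Z-test sketch (population σ known).
def z_statistic(sample_mean, mu0, sigma, n):
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Made-up example: sample mean 103 vs. hypothesized mean 100,
# known σ = 15, n = 100 observations.
z = z_statistic(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
p_two_sided = 2 * (1 - normal_cdf(abs(z)))
print(z, p_two_sided)   # z = 2.0, p ≈ 0.0455
```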

t-Test

The t-test is structured similarly to the Z-test but uses the sample variance s² in place of the population variance. The t-distribution is parametrized by its degrees of freedom, the number of independent pieces of information available for estimating variability, which here is n - 1: t = (X̄ - µ₀) / (s/√n) ~ t_(n-1) where s² = Σ(xᵢ - X̄)² / (n-1)

As stated earlier, the t-distribution is similar to the normal distribution in appearance but has heavier tails (i.e., extreme events occur more frequently than a normal distribution would predict), a common phenomenon in fields such as economics and the earth sciences.
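The t-statistic can be sketched with the standard library alone. The data and µ₀ below are made up, and the final p-value step (which needs the t_(n-1) CDF, e.g. from a stats library) is omitted to keep the sketch dependency-free:

```python
import math
import statistics

# One-sample t-statistic sketch: small sample, unknown population variance.
data = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2]   # made-up measurements
mu0 = 5.0                                     # hypothesized mean
n = len(data)

xbar = statistics.mean(data)
s = statistics.stdev(data)        # sample std dev (n - 1 denominator)
t = (xbar - mu0) / (s / math.sqrt(n))
print(xbar, s, t)                 # compare t against t_{n-1} = t_6
```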

Chi-squared Test

The Chi-squared test statistic, used to assess goodness of fit, is calculated as follows: χ² = Σ ((Oᵢ - Eᵢ)² / Eᵢ) where Oᵢ is the observed value of interest and Eᵢ is its expected value. A Chi-squared test statistic has a particular number of degrees of freedom, based on the number of categories in the distribution. To use the Chi-squared test to check whether two categorical variables are independent, create a table of counts (called a contingency table) with the values of one variable forming the rows and the values of the other forming the columns, and then compare the observed count in each cell with the count expected under independence, using the same style of Chi-squared test statistic as given above.
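A goodness-of-fit sketch for a six-sided die, where the observed counts are invented for illustration and the expected counts assume a fair die:

```python
# Chi-squared goodness-of-fit sketch for a six-sided die.
observed = [22, 17, 19, 26, 21, 15]   # made-up counts over 120 rolls
expected = [sum(observed) / 6] * 6    # 20 per face under fairness

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)   # compare against χ² with 6 - 1 = 5 degrees of freedom
```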

Hypothesis Testing for Population Proportions

Note that, due to the CLT, the Z-test can be applied to random variables of any underlying distribution, provided the sample size is large enough. For example, when estimating the proportion of a population having a characteristic of interest, we can view the members of the population as Bernoulli random variables, with those having the characteristic represented by 1s and those lacking it represented by 0s. Viewing the sample proportion as the sum of these Bernoulli random variables divided by the sample size n, we can compute its mean and variance and form the following hypotheses: H₀: p = p₀ versus H₁: p ≠ p₀ and the corresponding test statistic for a Z-test is: z = (p̂ - p₀) / √(p₀(1-p₀)/n) In practice, these test statistics form the core of A/B testing. For instance, consider the previously discussed case in which we seek to measure conversion rates within groups A and B, where A is the control group and B receives the treatment (in this case, a marketing campaign). Adopting the same null hypothesis as before, we can use a Z-test to assess the difference in empirical conversion rates and test its statistical significance at a pre-determined level. When asked about A/B testing or related topics, you should always cite the relevant test statistic and the reason for its validity (usually the CLT).
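A two-proportion Z-test for the A/B setup above might be sketched as follows. The conversion counts are made up, and the pooled proportion is used for the standard error under the null hypothesis of equal rates:

```python
import math

# Two-proportion Z-test sketch for an A/B test on conversion rates.
def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under H0: p_A = p_B.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up counts: 200/2000 conversions in control A, 250/2000 in treatment B.
z = two_proportion_z(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(z)   # positive z favors the treatment group B
```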

p-values and confidence intervals

Both p-values and confidence intervals are commonly covered topics during interviews. Put simply, a p-value is the probability of observing a test statistic at least as extreme as the one calculated, under the null hypothesis assumptions. Usually, the p-value is assessed relative to some pre-determined significance level (0.05 is often chosen). In conducting a hypothesis test, an α, the acceptable probability of rejecting a true null hypothesis, is typically chosen before the test is run. A confidence interval can then also be calculated. This is a range of values constructed so that, over many repeated samples, the interval would contain the true parameter value (1-α)·100% of the time. For instance, a 95% confidence interval would contain the true value in 95% of repeated samples. When the parameter of interest is a difference (as in an A/B test), if 0 is included in the confidence interval, then you cannot reject the null hypothesis (and vice versa). The general form of a confidence interval for a population mean is: X̄ ± z_(α/2) · (σ/√n) where z_(α/2) is the critical value (for the standard normal distribution). In the earlier example of A/B testing on conversion rates, the confidence interval for a population proportion is p̂ ± z_(α/2) · √(p̂(1-p̂)/n) since our estimate of the true proportion, when approximated as Gaussian, has parameters: µ_p̂ = p, σ²_p̂ = np(1-p)/n² = p(1-p)/n As long as the sampling distribution of a random variable is known, the appropriate p-values and confidence intervals can be computed. Knowing how to explain p-values and confidence intervals in both technical and nontechnical terms is very useful during interviews, so be sure to practice these explanations. If asked about the technical details, always make sure you correctly identify the mean and variance at hand.
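The proportion confidence interval can be computed directly. A minimal sketch, where the 130-conversions-out-of-1,000 figure is invented for illustration:

```python
import math

# 95% confidence interval sketch for a sample proportion.
conversions, n = 130, 1000        # made-up A/B test counts
p_hat = conversions / n
z_crit = 1.96                     # z_{α/2} for α = 0.05

margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - margin, p_hat + margin
print(lo, hi)                     # interval around p̂ = 0.13
```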

Type I and II errors

There are two errors that are frequently assessed: Type I error, also known as a "false positive," and Type II error, also known as a "false negative." Specifically, a Type I error occurs when one rejects the null hypothesis when it is correct, and a Type II error occurs when the null hypothesis is not rejected when it is incorrect. The probability of a Type I error is α and the probability of a Type II error is β. Usually, 1-α is referred to as the confidence level, whereas 1-β is referred to as the power. If you plot sample size versus power, a larger sample size generally corresponds to larger power, so power analysis is useful for gauging the sample size needed to detect a significant effect. Generally, tests are set up so that both 1-α and 1-β are relatively high (say, 0.95 and 0.8 respectively). In testing multiple hypotheses, it is possible that if you run many experiments, even if a particular outcome for any one experiment is very unlikely, you will see a statistically significant outcome at least once. For example, if you set α = 0.05 and run 100 hypothesis tests, then by pure chance you would expect 5 of the tests to be statistically significant. A more desirable outcome is to have the overall α across the 100 tests be 0.05, which can be achieved by setting the per-test significance level to α/n, where n is the number of hypothesis tests (in this case, α/n = 0.05/100 = 0.0005). This is known as the Bonferroni correction, and it helps ensure that the overall rate of false positives is controlled in a multiple-testing framework.
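The Bonferroni correction amounts to a single division; the p-values below are made up for illustration:

```python
# Bonferroni correction sketch: with m tests at overall level α,
# each individual test is judged against the threshold α / m.
alpha, m = 0.05, 100
threshold = alpha / m             # 0.0005

p_values = [0.0001, 0.003, 0.0004, 0.02, 0.04]   # made-up test results
significant = [p for p in p_values if p < threshold]
print(threshold, significant)     # only the very small p-values survive
```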

Generally most interview questions concerning Type I and II errors are qualitative in nature, for instance, requesting explanations of terms or of how you would go about assessing errors/power in an experimental setup.

MLE and MAP

Any parametric probability distribution has parameters, so fitting parameters is a crucial part of data analysis. There are two general methods for doing so. In maximum likelihood estimation (MLE), the goal is to find the parameter values that maximize the likelihood function: θ_MLE = arg max L(θ), where L(θ) = f(x₁, …, xₙ|θ) Since the observations are assumed to be i.i.d., the likelihood function factors as: L(θ) = Π (from i=1 to n) f(xᵢ|θ) The natural log of L(θ) is typically taken before maximizing; since log is a monotonically increasing function, maximizing the log-likelihood log L(θ) is equivalent to maximizing the likelihood: log L(θ) = Σ (from i=1 to n) log f(xᵢ|θ) Another way of fitting parameters is maximum a posteriori estimation (MAP), which assumes a prior distribution over the parameters: θ_MAP = arg max g(θ) f(x₁, …, xₙ|θ) where the log is again typically taken before maximizing, and g(θ) is the prior density of θ. Both MLE and MAP are especially relevant in statistics and machine learning, and knowing them is recommended, especially for more technical interviews. For instance, a common question in such interviews is to derive the MLE for a particular probability distribution. Thus, understanding the above steps, along with the details of the relevant probability distributions, is crucial.
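As a worked MLE example: for an Exponential(λ) sample, the log-likelihood is log L(λ) = n log λ - λ Σxᵢ, and setting its derivative to zero gives λ̂ = n/Σxᵢ = 1/x̄. A simulation sketch (true λ = 2 and the seed are arbitrary choices):

```python
import random

# MLE sketch for the exponential rate: maximizing
# log L(λ) = n log λ - λ Σxᵢ gives λ̂ = n / Σxᵢ = 1 / x̄.
random.seed(3)
true_lambda = 2.0
data = [random.expovariate(true_lambda) for _ in range(100_000)]

lambda_mle = len(data) / sum(data)
print(lambda_mle)   # close to the true λ = 2.0
```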

40 Real Statistics Interview Questions

Easy

  • 6.1. Uber: Explain the Central Limit Theorem. Why is it useful?

  • 6.2. Facebook: How would you explain a confidence interval to a non-technical audience?

  • 6.3. Twitter: What are some common pitfalls encountered in A/B testing?

  • 6.4. Lyft: Explain both covariance and correlation formulaically, and compare and contrast them.

  • 6.5. Facebook: Say you flip a coin 10 times and observe only one heads. What would be your null hypothesis and p-value for testing whether the coin is fair or not?

  • 6.6. Uber: Describe hypothesis testing and p-values in layman's terms.

  • 6.7. Groupon: Describe what Type I and Type II errors are, and the tradeoffs between them.

  • 6.8. Microsoft: Explain the statistical background behind power.

  • 6.9. Facebook: What is a Z-test and when would you use it versus a t-test?

  • 6.10. Amazon: Say you are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?

Medium

  • 6.11. Google: How would you derive a confidence interval for the probability of flipping heads from a series of coin tosses?
  • 6.12. Two Sigma: What is the expected number of coin flips needed to get two consecutive heads?
  • 6.13. Citadel: What is the expected number of rolls needed to see all 6 sides of a fair die?
  • 6.14. Akuna Capital: Say you’re rolling a fair six-sided die. What is the expected number of rolls until you roll two consecutive 5s?
  • 6.15. D.E. Shaw: A coin was flipped 1000 times, and 550 times it showed heads. Do you think the coin is biased? Why or why not?
  • 6.16. Quora: You are drawing from a normally distributed random variable X ~ N(0, 1) once a day. What is the approximate expected number of days until you get a value greater than 2?
  • 6.17. Akuna Capital: Say you have two random variables X and Y, each with a standard deviation of 1. What is the variance of aX + bY for constants a and b?
  • 6.18. Google: Say we have X ~ Uniform(0, 1) and Y ~ Uniform(0, 1) and the two are independent. What is the expected value of the minimum of X and Y?
  • 6.19. Morgan Stanley: Say you have an unfair coin which lands on heads 60% of the time. How many coin flips are needed to detect that the coin is unfair?
  • 6.20. Uber: Say you have n numbers 1…n, and you uniformly sample from this distribution with replacement n times. What is the expected number of distinct values you would draw?
  • 6.21. Goldman Sachs: There are 100 noodles in a bowl. At each step, you randomly select two noodle ends from the bowl and tie them together. What is the expectation on the number of loops formed?
  • 6.22. Morgan Stanley: What is the expected value of the max of two dice rolls?
  • 6.23. Lyft: Derive the mean and variance of the uniform distribution U(a, b).
  • 6.24. Citadel: How many cards would you expect to draw from a standard deck before seeing the first ace?
  • 6.25. Spotify: Say you draw n samples from a uniform distribution U(a, b). What are the MLE estimates of a and b?

Hard

  • 6.26. Google: Assume you are drawing from an infinite set of i.i.d random variables that are uniformly distributed from (0, 1). You keep drawing as long as the sequence you are getting is monotonically increasing. What is the expected length of the sequence you draw?
  • 6.27. Facebook: There are two games involving dice that you can play. In the first game, you roll two dice at once and receive a dollar amount equivalent to the product of the rolls. In the second game, you roll one die and get the dollar amount equivalent to the square of that value. Which has the higher expected value and why?
  • 6.28. Google: What does it mean for an estimator to be unbiased? What about consistent? Give examples of an unbiased but not consistent estimator, and a biased but consistent estimator.
  • 6.29. Netflix: What are MLE and MAP? What is the difference between the two?
  • 6.30. Uber: Say you are given a random Bernoulli trial generator. How would you generate values from a standard normal distribution?
  • 6.31. Facebook: Derive the expectation for a geometric random variable.
  • 6.32. Goldman Sachs: Say we have a random variable X ~ D, where D is an arbitrary distribution. What is the distribution F(X) where F is the CDF of X?
  • 6.33. Morgan Stanley: Describe what a moment generating function (MGF) is. Derive the MGF for a normally distributed random variable X.
  • 6.34. Tesla: Say you have N independent and identically distributed draws of an exponential random variable. What is the best estimator for the parameter λ?
  • 6.35. Citadel: Assume that log X ~ N(0, 1). What is the expectation of X?
  • 6.36. Google: Say you have two distinct subsets of a dataset for which you know their means and standard deviations. How do you calculate the blended mean and standard deviation of the total dataset? Can you extend it to K subsets?
  • 6.37. Two Sigma: Say we have two random variables X and Y. What does it mean for X and Y to be independent? What about uncorrelated? Give an example where X and Y are uncorrelated but not independent.
  • 6.38. Citadel: Say we have X ~ Uniform(-1, 1) and Y = X². What is the covariance of X and Y?
  • 6.39. Lyft: How do you uniformly sample points at random from a circle with radius R?
  • 6.40. Two Sigma: Say you continually sample from some i.i.d. uniformly distributed (0, 1) random variables until the sum of the variables exceeds 1. How many samples do you expect to make?

40 Real Statistics Interview Solutions

Solution #6.1 The Central Limit Theorem (CLT) states that if any random variable, regardless of its distribution, is sampled a large enough number of times, the sample mean will be approximately normally distributed. This allows the properties of any statistical distribution to be studied, as long as the sample size is large enough. The mathematical statement of the CLT is as follows: for any given random variable X, as n grows large, X̄n = (X₁ + … + Xn) / n is approximately N(µ, σ²/n) At any company with a lot of data, like Uber, this concept is core to the various experimentation platforms used in the product. For a real-world example, consider testing whether adding a new feature increases rides booked on the Uber platform, where each Xᵢ is an individual ride and is a Bernoulli random variable (i.e., the rider books or does not book a ride). Then, if the sample size is sufficiently large, we can assess the statistical properties of the total number of bookings, as well as the booking rate (rides booked / rides opened on app). These statistical properties play a key role in hypothesis testing, allowing companies like Uber to decide whether or not to add new features in a data-driven manner.

Solution #6.2 Suppose we want to estimate some parameter of a population. For example, we might want to estimate the average height of males in the US. Given some data from a sample, we can compute a sample mean for what we think the value is, as well as a range of values around that mean. Following the previous example, we could obtain the heights of 1,000 random males in the U.S. and compute the average height, or the sample mean. This sample mean is a type of point estimate and, while useful, will vary from sample to sample. A point estimate alone tells us nothing about the uncertainty around the estimate, which is why we need a range of values: a confidence interval. A confidence interval is a range of values with a lower and an upper bound, constructed so that if you were to repeat the sampling procedure a large number of times, a 95% confidence interval would contain the true value of the parameter 95% of the time. We can construct a confidence interval using the sample mean and sample standard deviation; the margin of error is determined by the confidence level chosen beforehand. The narrower the confidence interval, the more precise the estimate, since there is less uncertainty associated with the point estimate of the mean.

Solution #6.3 A/B testing has many possible pitfalls, depending on the particular experiment and setup. One common pitfall is that the groups may not be balanced, possibly producing highly skewed results. Note that balance is needed across all dimensions of the groups - like user demographics or device used - because, otherwise, any statistically significant result from the test may simply be due to factors that were not controlled for. Two types of errors are frequently assessed: Type I error, also known as a “false positive,” and Type II error, also known as a “false negative.” Specifically, a Type I error is rejecting a null hypothesis when that hypothesis is correct, whereas a Type II error is failing to reject a null hypothesis when its alternative hypothesis is correct. Another common pitfall is not running an experiment for long enough. Generally speaking, experiments are run with particular power and significance thresholds, and should not be stopped immediately upon detecting an effect. For an extreme example, assume you’re at Uber or Lyft and run a test for only two days, when the metric of interest (e.g., rides booked) is subject to weekly seasonality. Lastly, dealing with multiple tests is important, because there may be interactions between the results of tests you are running, and so properly attributing results may be impossible in simple A/B tests. In addition, as the number of variations you run increases, so does the required sample size. In practice, while it may seem technically feasible to test 1,000 variations of a button when optimizing for click-through rate, variations in tests are usually based on some intuitive hypothesis concerning core behavior.

Solution #6.4 For any given random variables X and Y, the covariance, a linear measure of the relationship between the two variables, is defined as: Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] Specifically, covariance indicates the direction of the linear relationship between X and Y and can take on any value from negative infinity to infinity. The units of covariance are the product of the units of X and Y, which may differ. The correlation between X and Y is the normalized version of covariance, which accounts for the variances of X and Y: ρ(X,Y) = Cov(X,Y) / √(Var(X)Var(Y)) Since correlation results from scaling covariance, it is dimensionless (unlike covariance) and always lies between -1 and 1 (also unlike covariance).

Solution #6.5 The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is biased: H₀: p = 0.5, H₁: p ≠ 0.5 Note that, since the sample size here is only 10, you cannot apply the Central Limit Theorem and so cannot approximate the binomial with a normal distribution; instead, compute the probability exactly. The p-value is the probability of observing a result at least as extreme as the one obtained, given that the null hypothesis is true, i.e., under the assumption that the coin is fair. In total, for 10 flips of a fair coin there are 2¹⁰ = 1024 equally likely outcomes, and in only 10 of them are there 9 tails and one heads. Counting all outcomes at least as extreme in both tails (0 or 1 heads, and 9 or 10 heads) gives a two-sided p-value of 22/1024 ≈ 0.0215. Therefore, with a significance level set, for example, at 0.05, we can reject the null hypothesis.
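The exact binomial arithmetic can be checked in a couple of lines:

```python
from math import comb

# Exact binomial sketch for 10 fair-coin flips: P(exactly 1 head),
# and the two-sided tail counting all outcomes at least as extreme
# (0 or 1 heads, 9 or 10 heads).
n = 10
total = 2 ** n                    # 1024 equally likely outcomes
p_exact_one = comb(n, 1) / total
p_two_sided = sum(comb(n, k) for k in (0, 1, 9, 10)) / total
print(p_exact_one, p_two_sided)   # 10/1024 ≈ 0.0098, 22/1024 ≈ 0.0215
```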

Solution #6.6 The process of testing whether data supports a particular hypothesis is called hypothesis testing, and it involves measuring parameters of a population’s probability distribution. This process typically employs at least two groups: a control group, which receives no treatment, and one or more treatment groups, which receive the treatment(s) of interest. Examples could be the heights of two groups of people, the conversion rates of particular user flows in a product, etc. Testing also involves two hypotheses: the null hypothesis, which assumes no significant difference between the groups, and the alternative hypothesis, which assumes a significant difference in the measured parameter(s) as a consequence of the treatment. A p-value is the probability of observing results at least as extreme as those given, under the null hypothesis assumptions. The lower this probability, the stronger the evidence against the null hypothesis. If the p-value is lower than the pre-determined significance level α, generally set at 0.05, then the null hypothesis should be rejected in favor of the alternative hypothesis. Otherwise, the null hypothesis cannot be rejected, and it cannot be concluded that the treatment has any significant effect.

Solution #6.7 Both errors are relevant in the context of hypothesis testing. Type I error is when one rejects the null hypothesis when it is correct, and is known as a false positive. Type II error is when the null hypothesis is not rejected when the alternative hypothesis is correct; this is known as a false negative. In layman’s terms, a type I error is when we detect a difference, when in reality there is no significant difference in an experiment. Similarly, a type II error occurs when we fail to detect a difference, when in reality there is a significant difference in an experiment. Type I error is given by the level of significance α, whereas the type II error is given by β. Usually, 1-α is referred to as the confidence level, whereas 1-β is referred to as the statistical power of the test being conducted. Note that, in any well-conducted statistical procedure, we want to have both α and β be small. However, based on the definition of the two, it is impossible to make both errors small simultaneously: the larger α is, the smaller β is. Based on the experiment and the relative importance of false positives and false negatives, a Data Scientist must decide what thresholds to adopt for any given experiment. Note that experiments are set up so as to have both 1-α and 1-β relatively high (say at .95, and .8 respectively).

Solution #6.8 Power is the probability of rejecting the null hypothesis when it is, in fact, false; equivalently, it is the probability of avoiding a Type II error. A Type II error occurs when the null hypothesis is not rejected even though the alternative hypothesis is correct. Power matters because we want to detect significant effects during experiments: the higher the statistical power of a test, the higher the probability of detecting a genuine effect (i.e., accepting the alternative hypothesis and rejecting the null hypothesis). A minimum sample size can be calculated for any given level of power - for example, a power level of 0.8. An analysis of the statistical power of a test is usually performed with respect to the test’s level of significance (α) and the effect size (i.e., the magnitude of the result being detected).

Solution #6.9 In a Z-test, the test statistic follows a normal distribution under the null hypothesis. In a t-test, the sampling distribution is instead a student’s t-distribution. For a population mean, we can use either a Z-test or a t-test only if the sample mean is normally distributed, which happens in two cases: the underlying population is normally distributed, or the sample size is large enough (n ≥ 30) that we can apply the Central Limit Theorem. If that condition is satisfied, we then decide which type of test is more appropriate: in general, use a Z-test if the population variance is known, and a t-test if the population variance is unknown. Additionally, if the sample size is very large (n > 200), the Z-test can be used in either case, since at such large degrees of freedom the t-distribution is practically indistinguishable from the standard normal distribution. For a population proportion, we can use a Z-test (but not a t-test) when np₀ ≥ 10 and n(1-p₀) ≥ 10, i.e., when the expected numbers of successes and failures are each at least 10.

Solution #6.10 The primary consideration is that, as the number of tests increases, the chance that a stand-alone p-value for any of the t-tests is statistically significant becomes very high due to chance alone. As an example, with 100 tests performed and a significance threshold of α = 0.05, you would expect 5 of the experiments to be statistically significant due only to chance. That is, you have a very high probability of observing at least one significant outcome. Therefore, the chance of incorrectly rejecting a null hypothesis (i.e., committing Type I error) increases. To correct for this effect, we can use a method called the Bonferroni correction, wherein we set the significance threshold to α/m, where m is the number of tests being performed. In the above scenario having 100 tests, we can set the significance threshold to instead be 0.05/100 = 0.0005. While this correction helps to protect from Type I error, it is still prone to Type II error (i.e., failing to reject the null hypothesis when it should be rejected). In general, the Bonferroni correction is mostly useful when there is a smaller number of multiple comparisons of which a few are significant. If the number becomes sufficiently high that many tests yield statistically significant results, the number of Type II errors may also increase significantly.

Solution #6.11 A confidence interval (CI) for a population proportion is an interval that contains the true population proportion with a certain degree of confidence 1 - α. For the case of flipping heads in a series of coin tosses, the number of heads follows a binomial distribution. If the series is large enough (the number of successes and the number of failures are each at least 10), we can invoke the Central Limit Theorem and use the normal approximation to the binomial, meaning the sample proportion p̂ is approximately N(p, p(1-p)/n), which we estimate as N(p̂, p̂(1-p̂)/n), where p̂ is the proportion of heads tossed in the series and n is the series size. The CI is centered at the sample proportion, plus or minus a margin of error: p̂ ± z_(α/2) · √(p̂(1-p̂)/n) where z_(α/2) is the appropriate critical value from the standard normal distribution for the desired confidence level. For example, for the most commonly used confidence level of 95%, z_(α/2) = 1.96.