class: center, middle, inverse, title-slide

# Sampling
## seeing what samples can do
### Matthew Crump
### 2018/07/20 (updated: 2019-02-25)

---

class: pink, center, middle, clear

# Make sure you understand sampling distributions

---

class: pink, center, middle, clear

# Make sure you understand this next graph

---

class: center, middle, clear

<img src="figs/distribution/pop_samp.png" width="2245" />

---

# Big ideas for this course

1. Psychology interprets patterns in data to draw conclusions about psychological processes

--

2. Chance can produce "patterns" in data

--

3. **Problem**: How can we know if the pattern is real, or simply a random accident produced by chance?

---

# Issues for this class

1. **Sampling distributions**

2. **Normal distributions and central limit theorem**

3. **Estimation**

---

class: pink, center, middle, clear

# Samples and populations

---

# Samples and populations

- Population: A defined set of things

- Sample: A subset of the population

---

# Random Sampling

- A process for generating a sample (taking things from a population)

--

- Random samples ensure that each value in a sample is drawn **independently** from other values

--

- All values in the population have a chance of being in the sample

---

# Example: Sampling heights of people

Let's say we wanted to know something about how tall people are. We can't measure the entire population (it's too big). So we take a sample.

--

What would happen if:

1. We only measured really tall people (biased sample)

--

2. We randomly measured a bunch of people?

---

# Population statistics

Populations have statistics. For example, the population of all people has:

1. A distribution of heights

2. The distribution has a mean (mean height of all people)

3. The distribution has a standard deviation

---

# The population problem

In the real world, we usually do not have all of the data for the entire population. So, we never actually know:

1. The population distribution

2. The population mean

3.
The population standard deviation, etc.

---

# The sampling solution

Unknown: The population

Solution: Take a sample of the population

1. Samples will tend to look like the population they came from, especially when sample-size (N) is large.

2. We can use the sample to **estimate** the population.

---

# The sampling problem

We take samples, and use them to estimate things. This works well when we have large, representative samples.

But, how do we know if the sample we obtained is "normal", or happens to be "weird"?

Solution: We need to learn how the process of sampling works. We can use R to simulate the process of sampling. Then we can see how samples behave.

---

# Samples become populations

- As sample-size increases, the sample becomes more like the population.

- As sample N approaches the population N, the sample becomes the population.

---

# Law of large numbers

- As sample-size increases, properties of the sample become more like properties of the population

Example:

- As sample-size increases, the mean of the sample becomes more like the mean of the population

---

# Simulation: Population mean=100

<!-- -->

---

# The sampling problem

We take samples, and use them to estimate things. This works well when we have large, representative samples.

**But, how do we know if the sample we obtained is "normal", or happens to be "weird"?**

Solution: **Sampling Distributions**

---

class: pink, center, middle, clear

# Sampling distributions

---

# What are sampling distributions?
- Definition: The distribution of a sample statistic

- Example:
  - Many samples are drawn from the same distribution
  - A statistic (e.g., mean, standard deviation) is computed for each sample, and saved
  - The sampling distribution is the distribution of the measured statistic across all of the samples

- Sampling distributions can be simulated in R

---

# Begin with a distribution

<!-- -->

---

# Take many samples

Save a sample statistic (e.g., mean) for each sample

<!-- -->

---

# Plot distribution of sample statistic

<img src="figs/distribution/4unifmany-1.png" width="1792" />

---

# Sampling distribution is bell-shaped

Notice that the sampling distribution of the mean is bell-shaped. This shape is called a **Normal Distribution**.

<img src="figs/distribution/4unifmany-1.png" width="50%" />

---

# Sampling distributions for anything

A sampling distribution can be found for any sample statistic.

1. Choose a statistic to measure (e.g., mean, median, variance, standard deviation, max, min, etc.)

2. Measure the statistic for each sample

3. Plot the sampling distribution

---

# A few sampling distributions

<img src="figs/distribution/4samplestats-1.png" width="1792" />

---

# Use for sampling distributions?

Question: What does a sampling distribution tell us?

Answer:

- The distribution of values a sample statistic can take, for a sample of a particular size

In other words,

- It gives us information about the range and probability of obtaining particular sample statistics

---

# Sampling distribution of the mean

<img src="figs/distribution/4unifmany-1.png" width="1792" />

---

# Standard error of the mean (SEM)

- Definition: the standard deviation of the sampling distribution of the sample means

Formulas: The SEM can be computed directly for samples of any size if you know the standard deviation of the population distribution.
`\(\text{SEM}=\frac{\text{standard deviation}}{\sqrt{N}}\)`

`\(\text{SEM}=\frac{\sigma}{\sqrt{N}}\)`

`\(\sigma\)` = population standard deviation

---

# SEM

What does the SEM (standard error of the mean) tell you?

- Let's say your sample mean was 5, and the SEM was 2.

- The SEM is the standard deviation of the sampling distribution of the sample mean

- Now you know that your sample mean is 5, but as an estimate of the population mean, that number varies from sample to sample. The SEM tells you how much, in standard deviation units.

---

# Central limit theorem

With a large enough sample size, the sampling distribution of the mean is approximately a **normal distribution**

- The sampling distribution of the mean approaches the shape of a normal distribution, even when the distribution that the samples came from does not have a normal shape.

---

class: pink, center, middle, clear

# Normal Distributions

---

# Normal distributions are bell-shaped

<img src="figs/distribution/standardNormal-eps-converted-to.png" width="80%" />

---

# Normal distribution formula

<img src="figs/distribution/Normal_formula.png" width="715" />

---

# Normal distribution parameters

Normal distributions have two important parameters that change their shape:

1. The mean (where the peak of the distribution is centered)

2. The standard deviation (how spread out the distribution is)

---

# Normal: Changing the mean

<!-- -->

---

# Normal: Changing standard deviation

<!-- -->

---

# rnorm()

R has a function for generating numbers from a normal distribution.
- n = number of values to generate (the sample size)
- mean = mean of distribution
- sd = standard deviation of distribution

```r
rnorm(n = 100, mean = 50, sd = 25)
```

---

# plotting a sample from a normal

```r
hist(rnorm(n = 100, mean = 50, sd = 25))
```

<img src="4b_Sampling_files/figure-html/unnamed-chunk-14-1.png" width="60%" />

---

# increasing N

```r
hist(rnorm(n = 1000, mean = 50, sd = 25))
```

<img src="4b_Sampling_files/figure-html/unnamed-chunk-15-1.png" width="60%" />

---

# Animating the central limit theorem

<!-- -->

---

# Normal & central limit

<img src="figs/distribution/4sampledistmeannorm-1.png" width="1792" />

---

# Uniform & central limit

<img src="figs/distribution/4samplemeanunif-1.png" width="1792" />

---

# Exponential & central limit

<img src="figs/distribution/4samplemeanExp-1.png" width="1792" />

---

# Importance of central limit theorem

1. We see that our sample means are distributed approximately normally

2. We can use our knowledge of normal distributions to help us make inferences about our samples.

Question:

A. What do we need to know about the normal distribution to make use of it?

---

# Normal distributions and probability

<img src="figs/distribution/4normalSDspercents-1.png" width="1792" />

---

# Normal distributions and probability

<img src="figs/distribution/4normalSDspercentsB-1.png" width="1792" />

---

# pnorm()

Use the `pnorm()` function to determine the proportion of numbers up to a particular value

q = quantile

What proportion of values are smaller than 0, for a normal distribution with mean = 0, and sd = 1?

```r
pnorm(q = 0, mean = 0, sd = 1)
```

```
## [1] 0.5
```

---

# pnorm() continued

What proportion of values are between 0 and 1, for a normal distribution with mean = 0, and sd = 1?
```r
lower_value <- pnorm(q = 0, mean = 0, sd = 1)
higher_value <- pnorm(q = 1, mean = 0, sd = 1)
higher_value - lower_value
```

```
## [1] 0.3413447
```

---

class: pink, center, middle, clear

# Estimation

---

# Goals of estimation

- Use statistics of samples to estimate the statistics of the population (parent distribution) they came from

- Use statistics of samples to estimate "error" in the sample

---

# Biased vs. unbiased estimators

Biased estimators: Sample statistics that give systematically wrong estimates of a population parameter

Unbiased estimators: Sample statistics that are not biased estimates of a population parameter

---

# Sample means are unbiased

- The mean of a sample is an unbiased estimator of the population mean

---

# Sample demonstration

<!-- -->

---

# Standard deviation is biased

- The standard deviation formula (dividing by N), when applied to a sample, is a **biased** estimator of the population standard deviation.

Formula for Population Standard Deviation

`\(\text{Standard Deviation} = \sqrt{\frac{\sum{(x_{i}-\bar{X})^2}}{N}}\)`

---

# Sample demonstration

<!-- -->

---

# Sample Standard Deviation

- If we divide by N-1, which is the formula for a sample standard deviation, we get an **unbiased** estimate of the population variance, and a less biased estimate of the population standard deviation

Formula for **Sample Standard Deviation**

`\(\text{Standard Deviation} = \sqrt{\frac{\sum{(x_{i}-\bar{X})^2}}{N-1}}\)`

---

# Sample demonstration

<!-- -->

---

class: center, middle, clear

<img src="figs/distribution/pop_samp.png" width="2245" />

---

# sd() and SEM in R

`sd()` computes the standard deviation using N-1

```r
x <- c(4, 6, 5, 7, 8)
sd(x)
```

```
## [1] 1.581139
```

The estimated SEM is the sample standard deviation divided by the square root of N

```r
sd(x) / sqrt(length(x))
```

```
## [1] 0.7071068
```

---

# Questions for yourself

1. What is the difference between a population mean and sample mean?

2. What is the difference between a population standard deviation and sample standard deviation?

3.
There are two standard deviation formulas for a sample: one divides by N, and the other divides by N-1. What is the difference between the two?

---

# More questions

1. What is a sampling distribution, and how is it different from a single sample?

2. What is the sampling distribution of the sample means?

3. What is the standard error of the mean (SEM), and how does it relate to the sampling distribution of the sample means?

---

# Even more questions

1. What is the difference between the standard error of the mean, and the estimated standard error of the mean?

---

# Next class: Inference

1. Wednesday, February 27th: We explore foundational ideas for statistical inference

2. Correlations quiz due today @ 11:59pm

3. Distribution/Sample quiz begins today, due Next Monday @ 11:59pm

---
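
# Sketch: simulating a sampling distribution

The slides say that sampling distributions can be simulated in R and that the SEM is the standard deviation of the sampling distribution of the mean. The chunk below is one minimal sketch of that idea, not the code used to make the figures in this deck. The uniform(0, 1) population, the seed, and the names `n_samples`, `sample_size`, and `sample_means` are all assumptions chosen for illustration.

```r
set.seed(1)            # fixed seed so the sketch is repeatable (arbitrary choice)

n_samples   <- 10000   # how many samples to draw (assumed value)
sample_size <- 25      # N, the number of values in each sample (assumed value)

# Draw many samples from a uniform(0, 1) population and save each sample mean
sample_means <- replicate(n_samples,
                          mean(runif(sample_size, min = 0, max = 1)))

# The saved means form the sampling distribution of the mean
hist(sample_means)

# SEM two ways: from the simulation, and from sigma / sqrt(N)
# (the standard deviation of a uniform(0, 1) distribution is sqrt(1/12))
simulated_sem   <- sd(sample_means)
theoretical_sem <- sqrt(1 / 12) / sqrt(sample_size)
c(simulated = simulated_sem, theoretical = theoretical_sem)
```

The two SEM values should come out close to each other. Changing `sample_size` and re-running shows how the spread of the sample means shrinks as N grows.

---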