Statistics

The power behind statistical techniques and metrics like standard deviation, z-scores, confidence intervals, probability estimation, and hypothesis testing begins by getting data for analysis. Once you've structured and measured your data, you can move on to visualizing large amounts of data and finding patterns.

To get the most out of this sampling lesson, you'll need to be comfortable programming in Python, and you'll need to be familiar with the pandas library.

Here are a couple of takeaways you can expect by the end of this lesson:

  • How population and samples work

  • Various sampling methods you can use

In statistics, we call the set of all individuals relevant to a particular statistical question a population. For our analyst's question, everyone at the company was relevant. So, the population in this case consisted of everyone at the company.

We call a smaller group selected from a population a sample. When we select a smaller group from a population we're sampling. In our example, the data analyst took a sample of 100 people from a population of over 50,000 people.

Whether a dataset is a sample or a population depends on the question we're trying to answer. For our analyst's question, the population consisted of everyone at the company. However, if we change the question, the same group of individuals can become a sample.

For instance, if we tried to determine if people at international companies are satisfied at work, then our company of over 50,000 employees would be just one sample in a much larger population. The population would consist of all employees working at all international companies. Our company's employees alone would not be representative of this larger population.

Populations don't necessarily consist of people. Behavioral scientists, for instance, often try to answer questions about populations of monkeys, rats, or other lab animals. Similarly, other people try to answer questions about countries, companies, vegetables, soils, factory-produced equipment, etc.

The individual elements of a population or a sample go by many names. You'll often see the elements of a population referred to as individuals, units, events, and observations. These are all interchangeable, and they refer to the same thing: the individual parts of a population. When we use the term "population individuals," we don't necessarily mean "people." "Individuals" can refer to people, needles, frogs, stars, etc.

In the case of a sample, you'll often see the following terminology used interchangeably: sample unit, sample point, sample individual, or sample observation.

It's important to note that when sampling, statisticians often refer to the Central Limit Theorem, which recommends a minimum sample size of 30. This applies even in stratified sampling. With a sample size of at least 30, the sample mean is generally closer to the population mean. For example:

This code is performing random sampling from the PTS column of a DataFrame called wnba, and computing the sample mean (average) for each sample. Here's what each line does:

These larger sample sizes help ensure that our samples are more representative of their corresponding populations.

For every statistical question we want to answer, we should try to use the population. In practice, that's not always possible because the populations of interest usually vary from large to extremely large. Also, getting data is generally not an easy task, so small populations often pose problems too.

We can solve these problems by sampling from the population that interests us. Although it's not as good as working with the entire population, working with a sample is the next best thing.

When we sample, the data we get might be more or less similar to the data in the population. For instance, let's say we know that the average salary in our company is $34,500, and the proportion of women is 60 percent. We take two samples and find these results:

As you can see, the metrics of the two samples are different than the metrics of the population. A sample is, by definition, an incomplete dataset for the question we're trying to answer. For this reason, there's almost always some difference between the metrics of a population and the metrics of a sample. We can see this difference as an error, and because it's the result of sampling, we call it sampling error.

We call a metric specific to a population a parameter, and we call one specific to a sample a statistic. In our example above, the average salary of all the employees is a parameter because it's a metric that describes the entire population. The average salaries from our two samples are examples of statistics because they only describe the samples.

Another way to think of the concept of the sampling error is as the difference between a parameter and a statistic:

sampling error=parameter−statisticsampling error=parameter−statistic

At this point in the lesson, we'll move from the tech company example to working with a real-world dataset. Our first challenge will be to measure sampling error using this dataset.

The dataset is about basketball players in the WNBA (Women's National Basketball Association), and contains general information about players, along with their metrics for the 2016-2017 season.

Simple Random Sampling

When we sample, we want to minimize the sampling error as much as we can. We want our sample to represent the population as accurately as possible.

If we sampled to measure the mean height of adults in the U.S., we'd like our sample statistic (sample mean height) to get as close as possible to the population's parameter (population mean height). For this to happen, we need the individuals in our sample to form a group that is similar in structure with the group forming the population.

The U.S. adult population is diverse, comprised of people of various heights. If we sampled 100 individuals from various basketball teams, then we'd almost certainly get a sample with a structure that is significantly different than the population. As a consequence, we should expect a large sampling error (a large discrepancy between our sample's statistic (sample mean height) and the population's parameter (population mean height)).

In statistical terms, we want our samples to be representative of their corresponding populations. If a sample is representative, then the sampling error is low. The more representative a sample is, the smaller the sampling error. The less representative a sample is, the greater the sampling error.

To make our samples representative, we can try to give every individual in the population an equal chance for selection in our samples. We want a very tall individual to have the same chance of being selected as a short individual. To give every individual an equal chance at selection, we need to sample randomly.

One way to perform random sampling is to generate random numbers and use them to select a few sample units from the population. In statistics, this sampling method is called simple random sampling, often abbreviated as SRS.

In our previous exercise, we used Series.sample() to sample. This method performs simple random sampling by generating an array of random numbers, and then using those numbers to select values from a Series at the indices corresponding to those random numbers. We can also extend this method for DataFrame objects, where we can sample random rows or columns.

When we use the random_state parameter, like we did in the previous exercise with Series.sample(30, random_state = 1), we make the generation of random numbers predictable. This is because Series.sample() uses a pseudorandom number generator. A pseudorandom number generator uses an initial value to generate a sequence of numbers that has properties similar to those of a sequence that is truly random. With random_state, we specify that initial value used by the pseudorandom number generator.

If we want to generate a sequence of five numbers using a pseudorandom generator, and begin from an initial value of 1, we'll get the same five numbers no matter how many times we run the code. If we ran wnba['Games Played'].sample(5, random_state = 1) we'd get the same sample every time we ran the code.

Pseudorandom number generators are useful in scientific research where reproducible work is necessary. In our case, pseudorandom number generators allow us to work with the same samples as you do in the exercises, which, in turn, allows for meaningful answer-checking.

On the scatter plot from the previous screen, we can see that the sample means vary a lot around the population mean. With a minimum sample mean of 115 points, a maximum of 301.4, and a population mean of roughly 201.8, we can tell that the sampling error is quite large for some of the cases.

Because sample means vary a lot around the population mean, there's a good chance we get a sample that isn't representative of the population:

We can solve this problem by increasing the sample size. As we increase the sample size, the sample means vary less around the population mean, and the chances of getting an unrepresentative sample decrease.

In the previous exercise, we took 100 samples, and each had a sample size of 10 units. Here is what happens when we repeat the procedure but increase the size of the samples:

We can see how sample means tend to vary less around the population mean as we increase the sample size. From this observation we can make two conclusions:

  • Simple random sampling isn't a reliable sampling method when the sample size is small. Because sample means vary a lot around the population mean, there's a good chance we'll get an unrepresentative sample.

  • When we do simple random sampling, we should try to get a sample that is as large as possible. A large sample decreases the variability of the sampling process, which, in turn, decreases the chances that we'll get an unrepresentative sample.

Stratified Sampling

Because simple random sampling is entirely random, it can exclude certain population individuals who are relevant to some of our questions.

For example, players in basketball play in different positions on the court. The metrics of a player (number of points, number of assists, etc.) depend on their position, and we might want to analyze the patterns for each individual position. If we perform simple random sampling, there's a chance that our sample won't include some categories. In other words, it's not guaranteed that we'll have a representative sample that has observations for every position we want to analyze.

There are five unique positions in our data set:

Let's quickly decipher each abbreviation:

Abbreviation
Full name

F

Forward

G

Guard

C

Center

G/F

Guard/Forward

F/C

Forward/Center

The downside of simple random sampling is that it can exclude individuals playing a certain position. Visually, and on a smaller scale, this is what could happen:

To ensure we end up with a sample that has observations for all the categories of interest, we can change the sampling method. We can organize our data set into different groups and then do simple random sampling for every group. We can group our data set by player position, and then sample randomly from each group.

Visually, and on a smaller scale, we need to do this:

We call this sampling method stratified sampling, and we call each stratified group a stratum.

Proportional Stratified Sampling

Earlier in this lesson, we performed simple random sampling 100 times on the original dataset, and for each sample, we computed the mean number of total points a player scores in a season. The problem is that the number of games played, which ranges from 2 to 32, influences the number of total points:

ExplainCopy

Approximately 72.7 percent of the players had more than 23 games for the 2016-2017 season, which means that this category of players who played many games probably influenced the mean. Let's take a look at the other percentages:

As a side note on the output above, (1.969, 12.0], (12.0, 22.0] and (22.0, 32.0] are number intervals. The ( character indicates that the beginning of the interval isn't included, and the ] indicates that the endpoint is included. For example, (22.0, 32.0] means that 22.0 isn't included, while 32.0 is, and the interval contains this array of numbers: [23, 24, 25, 26, 27, 28, 29, 30, 31, 32].

Getting back to our discussion, when we compute the mean of the total points using the population (the entire dataset), those 72.7 percent of players who played more than 23 games will probably significantly influence the mean. However, when we sample randomly, we can end up with a sample where the proportions are different than in the population.

For instance, we might end up with a sample where only 2 percent of the players played more than 23 games. This will result in a sample mean that underestimates the population mean. Or we could have a sample where more than 95 percent of the players had 23 games in the 2016-2017 season. This will result in overestimating the population mean. This scenario of under- or overestimation is common for small samples.

One solution to this problem is to use stratified sampling while being mindful of the proportions in the population. We can stratify our data set by the number of games played, and then sample randomly from each stratum a proportional number of observations.

In the diagram above, we can see the following from a population of 20 individuals:

  • 14 individuals played more than 22 games.

  • 4 individuals played between 13 and 22 games.

  • 2 individuals played below 13 games.

Transforming these figures to percentages, 70 percent of the individuals played more than 22 games, 20 percent played between 13 and 22 games, and 10 percent played below 13 games. Because we sampled proportionally, the same percentages (70 percent, 20 percent, 10 percent) are preserved in the sample (even though the absolute values are different): 70 percent played more than 22 games, 20 percent played between 13 and 22 games, and 10 percent played less than 13 games.

Choosing the Right Strata

You might not have been impressed by the results we got by sampling proportionally. The variability of the sampling was quite large, and many sample means were unrepresentative, being far from the population mean. In fact, this sampling method doesn't seem to perform better than simple random sampling:

img

The poor performance is a result of choosing bad strata. We stratified the data by the number of games played, but this isn't a good approach. A player has technically played one game even if she only played for one or two minutes, even though others play 30 or 40 minutes, and it still only qualifies as one game.

It makes more sense to stratify the data by number of minutes played, rather than by number of games played. The minutes played are a much better indicator of how much a player scored in a season than the number of games played.

Our dataset contains the total number of minutes played by each player for the entire season. If we make strata based on minutes played, and then sample proportionally using stratified sampling, we get something visibly better than simple random sampling (especially in terms of variability):

imgWe increased the sample size to 12 so that we can create a better proportional sampling for the strata organized by minutes played.

Here are a few guidelines for choosing good strata:

1. Minimize the variability within each stratum.

For instance, avoid having in the same stratum a player that has scored 10 points and a player that has scored 500. If the variability is high, it might be a sign that you either need more granular stratification (you need more strata), or you need to change the criterion of stratification (an example of criterion is minutes played).

2. Maximize the variability between strata.

Good strata are different from one another. If you have strata that are similar to one another with respect to what you want to measure, you might need a more granular stratification, or you might need to change the stratification criterion. On the previous screen, stratifying the data by games played resulted in strata that were similar to each other regarding the distribution of the total points. We managed to increase the variability between strata by changing the criterion of stratification to minutes played.

3. The stratification criterion should correlate strongly with the property you're trying to measure.

For instance, the column describing minutes played (the criterion) should correlate strongly with the number of total points (the property we want to measure).

We've left the code editor open for you to experiment with the different sampling methods we've learned so far. One thing you can try is to replicate the last graph above. You can then play with sample sizes and try to get insights into how variability and sampling error change.

Cluster Sampling

The dataset we've been working with was scraped from the WNBA's website. The website centralizes data on basketball games and players in the WNBA. Let's suppose for a moment that such a site didn't exist, and the data were instead scattered across each individual team's website. There are twelve unique teams in our data set, which means we'd have to scrape twelve different websites, each of which requires its own scraping script.

This scenario is quite common in the data science workflow: you want to answer some questions about a population, but the data is scattered in such a way that data collection is either time-consuming or close to impossible. For instance, let's say you want to analyze how people review and rate movies as a function of movie budget. There are many websites that can help with data collection, but how can you go about it so that you can spend a day or two on getting the data you need, rather than a month or two?

One way is to list all the data sources you can find, and then randomly pick only a few of them from which to collect. Then you can individually sample each of the sources you've randomly picked. We call this sampling method cluster sampling, and we call each of the individual data sources a cluster.

In our case, we'd first list all the possible data sources. Assuming that all the teams in our dataset have a website from which we can take data, we end up with this list of clusters (each team's website is a cluster) :

Then we need to find a way to randomly pick a few clusters from our listing. There are many ways to do that, but the important thing to keep in mind is that we should avoid picking a cluster twice. Here's one way to sample four clusters randomly:

Once we pick the clusters, we move to collecting the data. We can collect all the data from each cluster, but we can also perform sampling on each. It's actually possible to use different sampling methods for different clusters. For instance, we can use stratified sampling on the first two clusters, and simple random sampling on the other two.

Practical statistical analysis revolves entirely around the distinction between a population and a sample. When we're doing statistics in practice, our goal is either to describe a sample or a population or to use a sample to draw conclusions about the population to which it belongs (or a mix of these two goals).

When we describe a sample or a population (by measuring averages, proportions, and other metrics; by visualizing properties of the data through graphs; etc.), we do descriptive statistics.

When we try to use a sample to draw conclusions about a population, we do inferential statistics (we infer information from the sample about the population).

Last updated