Statistics is essential in data science, laying the groundwork for meaningful analysis. It goes beyond basic charts to offer a detailed examination of data, enabling us to draw solid conclusions instead of mere guesses. This article will discuss some key statistics concepts vital for data analysts.
1. Population and Samples

Population (N)
A population is the entire group that is the subject of a statistical study. Think of this as a complete collection of data points.
Sample (n)
A sample is a subset of the population selected for analysis. This distinction is significant because, in many cases, studying an entire population is impractical or impossible, necessitating the use of samples.
Population (N) > Sample (n)
2. Parameters and Statistics

Parameters (μ, σ²)
These measures, like the mean (μ) or variance (σ²), describe the entire population. They are often unknown because examining every data point in a population is impractical. Through sample analysis, we can estimate these parameters.
Sample Statistics (x̅, s²)
These are estimates derived from the sample, such as the sample mean (x̅) and variance (s²), aimed at estimating their population counterparts.
Inferential statistics use sample statistics to make educated guesses about the population, relying on proper sampling methods to validate these guesses. Techniques like hypothesis testing and confidence intervals are crucial for ensuring the accuracy of our conclusions.
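As a quick sketch of this parameter-versus-statistic distinction (all numbers below are invented for illustration), we can simulate a population with Python's standard library, treat its mean and variance as the "unknown" parameters, and estimate them from a sample:

```python
import random
import statistics

random.seed(0)

# Hypothetical population of 10,000 measurements.
# In a real study we would never observe all of these.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Population parameters (mu, sigma^2) -- normally unknown
mu = statistics.mean(population)
sigma2 = statistics.pvariance(population)

# Draw a sample of n = 200 and compute sample statistics (x-bar, s^2)
sample = random.sample(population, 200)
x_bar = statistics.mean(sample)
s2 = statistics.variance(sample)  # n-1 denominator

print(f"mu = {mu:.2f}, x-bar = {x_bar:.2f}")
print(f"sigma^2 = {sigma2:.2f}, s^2 = {s2:.2f}")
```

The sample statistics will land close to, but rarely exactly on, the population parameters; that gap is what inferential statistics quantifies.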
3. Estimators

An estimator is a rule or formula for estimating a population parameter from sample data, usually written with a hat (for example, p̂ for a proportion). The sample mean is a commonly used estimator for the population mean. Effective estimators are both unbiased and precise.
The bias of an estimator
The bias of an estimator refers to the difference between the estimator’s expected value and the actual value of the parameter it aims to estimate.
Bias(θ̂) = E[θ̂] − θ

- If Bias = 0, the estimator is unbiased, meaning that on average it correctly estimates the parameter.
- If Bias ≠ 0, the estimator is biased, indicating a systematic deviation from the true parameter value.
The precision of an estimator
Precision relates to the estimator’s variance and reflects how closely the estimated values cluster around the true value. Low variance means high precision, indicating consistent estimates across different samples.
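Bias can be made concrete with a small Monte Carlo sketch (the setup below is invented for illustration): dividing the sum of squared deviations by n systematically underestimates the true variance, while dividing by n − 1 does not.

```python
import random
import statistics

random.seed(1)

TRUE_VAR = 4.0   # population variance (sigma = 2)
n = 5            # a small sample size makes the bias visible
trials = 20_000

biased_vals, unbiased_vals = [], []
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased_vals.append(ss / n)          # divide by n: biased estimator
    unbiased_vals.append(ss / (n - 1))  # divide by n-1: unbiased estimator

bias_biased = statistics.mean(biased_vals) - TRUE_VAR
bias_unbiased = statistics.mean(unbiased_vals) - TRUE_VAR
print(f"bias with /n:     {bias_biased:+.3f}")   # clearly negative
print(f"bias with /(n-1): {bias_unbiased:+.3f}") # near zero
```

This is exactly why the sample variance formula later in this article uses n − 1 in the denominator.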
4. Sampling Techniques

- Simple Random Sampling: This method ensures that every member of the population has an equal chance of being selected, thereby guaranteeing fair representation.
- Stratified Sampling: The population is segmented into subgroups, and samples are drawn from each to maintain proportional representation.
- Cluster Sampling: The population is divided into clusters, and entire clusters are randomly selected for study.
- Systematic Sampling: Involves selecting samples at regular intervals, providing a streamlined approach to random sampling.
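Three of these techniques can be sketched in a few lines of standard-library Python (the tiny labelled population below is made up for illustration):

```python
import random

random.seed(7)

# Hypothetical population of 12 labelled units across 3 strata (A, B, C)
population = ([("A", i) for i in range(6)]
              + [("B", i) for i in range(4)]
              + [("C", i) for i in range(2)])

# Simple random sampling: every unit has the same chance of selection
srs = random.sample(population, 4)

# Systematic sampling: every k-th unit after a random start
k = 3
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample within each stratum (50% of each, here)
stratified = []
for stratum in ("A", "B", "C"):
    members = [u for u in population if u[0] == stratum]
    stratified.extend(random.sample(members, len(members) // 2))

print(srs, systematic, stratified, sep="\n")
```

Note how the stratified sample is guaranteed to contain units from every stratum, which a simple random sample of the same size is not.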
4. Variable Types

Variables are classified as categorical (nominal, ordinal) or numerical (discrete, continuous).
Examples:
- Nominal: categories with no order, such as colors (red, blue).
- Ordinal: categories with an inherent order, such as education levels (high school < bachelor’s < master’s).
- Discrete: a numerical variable that takes only integer values, such as a count.
- Continuous: a numerical variable that can take any real value, such as a height.
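The distinction matters when you encode data for analysis. A minimal sketch, with all labels and values invented for illustration:

```python
# Nominal: labels with no order
colors = {"red", "blue", "green"}

# Ordinal: labels with an inherent order; encode that order explicitly
education_order = {"high school": 0, "bachelor's": 1, "master's": 2}
people = ["master's", "high school", "bachelor's"]
ranked = sorted(people, key=education_order.get)  # low to high

# Discrete: an integer count.  Continuous: any real value.
num_children = 3
height_cm = 172.4

print(ranked)
```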
5. Measures of Central Tendency

Mean
The sample mean is used to estimate the true mean (μ) of the whole population’s distribution. The sample mean is often noted x̅ and is defined as follows:
x̅ = (X1 + X2 + … + Xn) / n

where X1, …, Xn is a sample of n independent measurements.
Median
Acting as the dataset’s midpoint, the median ensures a fair representation by placing exactly half the data above and half below it. This is especially useful for skewed distributions, where it better captures a “central” value.
Mode
The mode is the value that appears most often in a set of data values. It is the value that is most likely to be sampled.
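All three measures are available in Python's built-in statistics module. The small dataset below is made up to show how an outlier drags the mean while the median stays put:

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 100]  # 100 is an outlier

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # robust midpoint: 5
mode = statistics.mode(data)      # most frequent value: 3

print(mean, median, mode)
```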
6. Measures of Dispersion

Range
The range represents the difference between the largest and smallest values in a sample of data.
Interquartile range (IQR)
The IQR tells us about the middle chunk of our data, spanning from the first quartile (Q1) to the third quartile (Q3). Imagine lining up all your data points and picking out the middle section: Q1 marks the start (25% in from the lowest values) and Q3 marks the end (25% in from the highest values). This range, the IQR, covers the middle 50% of your data, helping you see where most of your data lies while ignoring the outliers.
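A quick sketch using the standard library's statistics.quantiles (the data is made up; note that different quantile methods can give slightly different cut points):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, q1, q3, iqr)  # 10 3.0 9.0 6.0
```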
Variance and Standard Deviation
The sample variance is used to estimate the true variance σ² of a distribution. It measures how far a sample of measurements is spread out from their average value. It is often noted s² and is defined as follows:
s² = (1/(n − 1)) Σ (Xi − x̅)²

The standard deviation is the square root of the variance. Denoted s for the sample estimate, it is more interpretable than the variance because it is in the same units as the data.
Z-Score
For each individual in your sample, you can calculate a Z-score. This measure tells you how many standard deviations that individual lies from the mean.
z = (x − x̅) / s

A Z-score is positive for all values that are above average and negative for all values that are below average. For example, a Z-score of 1.5 for an individual in your sample means that the individual is 1.5 standard deviations above the mean.
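Variance, standard deviation, and Z-scores together, on a small made-up dataset chosen so the numbers come out round:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 6]

s2 = statistics.variance(data)  # sample variance (n-1 denominator): 4.0
s = statistics.stdev(data)      # sample standard deviation: 2.0
x_bar = statistics.mean(data)   # 6

# Z-score: how many standard deviations each value lies from the mean
z_scores = [(x - x_bar) / s for x in data]
print(s2, s)
print(z_scores)  # [-1.0, 1.0, 0.0, -0.5, -1.5, 0.5, 1.5, 0.0]
```

Note that Z-scores always sum to zero, since deviations from the mean cancel out.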
7. Correlation

Correlation measures the strength and direction of a linear relationship between two variables.
Example: In a study, we find a strong positive correlation (0.9) between study hours and exam scores, indicating that more study hours correlate with higher scores.
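The study-hours example can be sketched by computing Pearson's correlation coefficient by hand (the hours and scores below are invented to produce a strong positive relationship):

```python
import math
import statistics

# Hypothetical data: study hours vs exam scores
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]

mx, my = statistics.mean(hours), statistics.mean(scores)

# Pearson's r: covariance divided by the product of spreads
cov = sum((x - mx) * (y - my) for x, y in zip(hours, scores))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in hours)
                    * sum((y - my) ** 2 for y in scores))

print(round(r, 3))  # close to +1: strong positive linear relationship
```

Values of r range from −1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive).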
8. Normal Distribution

A normal curve, or bell curve, shows how common different values are. Imagine you line up everyone by height. Most people would be in the middle, being of average height. Very tall and very short people would be at the ends. This pattern, where most things are average and only a few are extreme, looks like a bell. That’s why we call it a bell curve. The bell curve, or normal distribution, is important because it helps us understand how data are spread around the average.
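One handy consequence is the empirical (68–95–99.7) rule: roughly 68% of normal data falls within one standard deviation of the mean and about 95% within two. A simulation sketch (the mean and standard deviation for "heights" are made-up illustrative numbers):

```python
import random

random.seed(3)

# Simulated heights (cm): normal with mean 170 and sd 8 (hypothetical)
heights = [random.gauss(170, 8) for _ in range(100_000)]

within_1sd = sum(162 <= h <= 178 for h in heights) / len(heights)
within_2sd = sum(154 <= h <= 186 for h in heights) / len(heights)

print(f"within 1 sd: {within_1sd:.3f}  (theory: ~0.683)")
print(f"within 2 sd: {within_2sd:.3f}  (theory: ~0.954)")
```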
9. Skewness and Kurtosis

Skewness and kurtosis are two statistics that measure different aspects of a distribution’s shape, further detailing how it differs from a perfect bell curve or normal distribution.
Skewness
Skewness measures the asymmetry of a distribution. If a distribution has a long tail to the right (more high values), it’s positively skewed. If the tail is to the left (more low values), it’s negatively skewed. This tells us if the distribution leans a certain way.
Kurtosis
Kurtosis measures the “tailedness” of a distribution, or how heavy or light the tails are compared to a normal distribution. High kurtosis means more of the data is in the tails and peaks, while low kurtosis means the data is more evenly spread out.
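Both are standardized moments. A sketch using the simple moment-based definitions (some statistics packages apply additional bias corrections; the simulated data here is purely illustrative):

```python
import random
import statistics

random.seed(5)

def skewness(data):
    """Third standardized moment: asymmetry."""
    m, s, n = statistics.mean(data), statistics.pstdev(data), len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    """Fourth standardized moment minus 3: tailedness vs the normal."""
    m, s, n = statistics.mean(data), statistics.pstdev(data), len(data)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3

# A long right tail gives positive skew and heavy-tailed (high) kurtosis
right_tailed = [random.expovariate(1.0) for _ in range(50_000)]
# A normal sample is symmetric with excess kurtosis near zero
symmetric = [random.gauss(0, 1) for _ in range(50_000)]

print(round(skewness(right_tailed), 2), round(skewness(symmetric), 2))
print(round(excess_kurtosis(right_tailed), 2),
      round(excess_kurtosis(symmetric), 2))
```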
10. Confidence Intervals

A confidence interval is a range of values, derived from sample data, that is likely to include the true, unknown population parameter.
Example: Estimating a 95% confidence interval for the mean height of adult males.
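A sketch of that example using the large-sample normal approximation (the sample itself is simulated, and 1.96 is the usual 95% critical value; for small samples you would use a t critical value instead):

```python
import math
import random
import statistics

random.seed(11)

# Hypothetical sample: heights (cm) of 400 adult males
sample = [random.gauss(175, 7) for _ in range(400)]

n = len(sample)
x_bar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# Large-sample 95% CI: x-bar +/- 1.96 * standard error
low, high = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI for the mean height: ({low:.2f}, {high:.2f})")
```

Interpretation: if we repeated this sampling procedure many times, about 95% of the intervals built this way would contain the true population mean.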
If you want to know more about confidence intervals, check out my article here.
Conclusion

Remember, statistics is like a toolkit for a detective. It helps you uncover insights and solve mysteries hiding in data. These concepts are the keys to unlocking the magic of numbers and turning them into valuable insights.
Thank you for reading!
If you found this article informative and helpful, please don’t hesitate to 👏 and follow me on Medium | LinkedIn.
10 Basic Statistics Concepts For Data Analysts was originally published in Code Like A Girl on Medium, where people are continuing the conversation by highlighting and responding to this story.