Top 10 Statistics Terms To Know

Michael Pyrcz, PhD, P.Eng (daytum Founding Advisor)

Hey everyone! Statistics concepts can sometimes seem intimidating and esoteric. However, we know that they don’t need to be.

One thing we wanted to do was simplify some commonly used statistics vocabulary by pairing each formal definition with an “Explain Like I’m 5” version. Below is a list of the top 10 terms we believe will boost your geostatistics skills.

Feel free to comment below with requests for more terms or any other suggestions you have for this list!


1) Bootstrap — The practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution.

Calculating the uncertainty in a sample statistic by resampling from the sample itself!
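To make the resampling idea concrete, here is a minimal sketch using only the Python standard library. The porosity values and the function name `bootstrap_means` are made up for illustration; the statistic being bootstrapped is the sample mean.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Resample `sample` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the sample itself, with replacement.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.fmean(resample))
    return means

porosity = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16, 0.13, 0.17]  # hypothetical data
means = bootstrap_means(porosity)

# The spread of the bootstrap means approximates the uncertainty in the sample mean.
std_error = statistics.stdev(means)
print(f"sample mean = {statistics.fmean(porosity):.3f}")
print(f"bootstrap standard error = {std_error:.4f}")
```

The same loop works for other statistics (median, variance, a percentile) by swapping out `statistics.fmean`.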

2) Clustering — Grouping of samples into sets known as clusters, such that the differences within clusters are minimized and the differences between clusters are maximized.

Grouping data points by their characteristics (think separating apples and oranges by color).
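As one possible illustration, the sketch below runs a bare-bones k-means on a single feature. K-means is just one common clustering algorithm, chosen here as an assumption; the colour values, starting centers, and the name `kmeans_1d` are all hypothetical.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Basic k-means on one feature, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical colour measurements: apples near 0.02, oranges near 0.10.
hue = [0.01, 0.03, 0.02, 0.09, 0.11, 0.10]
centers, groups = kmeans_1d(hue, centers=[0.0, 0.15])
print("cluster centers:", centers)
print("cluster members:", groups)
```

With real data the result depends on the starting centers and on the number of clusters, both of which are choices the analyst has to make.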

3) Heteroscedasticity — The statistical dispersion is not consistent across all subpopulations of a variable.

The spread (variance) of the data changes from one part of the data set to another.
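A quick way to see this is to generate synthetic data whose noise grows with the predictor and compare the variance in two subpopulations. Everything in this sketch (the noise model, the split at x = 10, the variable names) is an assumption made for illustration.

```python
import random
import statistics

rng = random.Random(42)
x = [i / 10 for i in range(1, 201)]  # 0.1, 0.2, ..., 20.0

# Synthetic residuals: the noise standard deviation is proportional to x.
noise = [rng.gauss(0, 0.5 * xi) for xi in x]

# Split into two subpopulations and compare their dispersion.
low = [e for xi, e in zip(x, noise) if xi <= 10]
high = [e for xi, e in zip(x, noise) if xi > 10]
var_low = statistics.variance(low)
var_high = statistics.variance(high)

print(f"variance for x <= 10: {var_low:.2f}")
print(f"variance for x > 10:  {var_high:.2f}")
```

If the two variances were roughly equal, the data would instead be called homoscedastic.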

4) Imputation — The process of replacing missing data with substituted values.

Filling in missing data with representative values.
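The simplest version is mean imputation, sketched below with missing entries marked as `None`. The permeability values and the helper name `impute_mean` are hypothetical, and mean imputation is only one of many possible choices.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

permeability = [120.0, None, 95.0, 110.0, None, 105.0]  # hypothetical data
filled = impute_mean(permeability)
print(filled)
```

Note that filling every gap with the same value shrinks the variance of the data set, which is one reason more careful methods are often preferred.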

5) Monte Carlo Simulation — Repeated random sampling to solve a numerical problem, often applied to represent uncertainty.

A brute force method that combines individual distributions to find an overall population distribution.
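The sketch below combines three input distributions into a distribution for their product. The inputs (area, thickness, porosity), their distribution shapes, and their parameters are all invented for illustration.

```python
import random
import statistics

rng = random.Random(7)
n = 10_000
pore_volume = []
for _ in range(n):
    # Draw one realization from each (hypothetical) input distribution.
    area = rng.uniform(900.0, 1100.0)
    thickness = rng.triangular(8.0, 12.0, 10.0)  # low, high, mode
    porosity = rng.gauss(0.15, 0.02)
    # Combine the draws through the transfer function.
    pore_volume.append(area * thickness * porosity)

pore_volume.sort()
p10, p50, p90 = (pore_volume[int(q * n)] for q in (0.10, 0.50, 0.90))
mean_volume = statistics.fmean(pore_volume)
print(f"P10 = {p10:.0f}, P50 = {p50:.0f}, P90 = {p90:.0f}, mean = {mean_volume:.0f}")
```

The sorted list of outcomes is the output distribution; the percentiles summarize the uncertainty that the inputs carry through to the result.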

6) Multicollinearity — A state of high intercorrelation among the independent variables.

Two or more predictor variables in a regression model are highly linearly related.
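One simple check is the correlation between two predictors and the resulting variance inflation factor (VIF). For the two-predictor case the VIF is 1 / (1 - r²). The porosity and density values below are made up, and `pearson_r` is a helper written for this sketch.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictors that carry almost the same information.
porosity = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22]
density = [2.50, 2.46, 2.41, 2.37, 2.32, 2.29]

r = pearson_r(porosity, density)
vif = 1.0 / (1.0 - r ** 2)  # valid for a model with exactly these two predictors
print(f"r = {r:.3f}, VIF = {vif:.0f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.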

7) Overfit — A statistical model that contains more parameters than can be justified by the data.

The model starts to fit the nuisance and noise in the data, so it will predict poorly away from the data.
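The sketch below compares a two-parameter straight line with a six-parameter polynomial that passes through every data point. The data are invented (roughly y = 2x plus small noise), and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every point (one parameter per data point)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.8, 4.3, 5.9, 8.2, 9.8]  # roughly y = 2x, with noise
line = fit_line(xs, ys)
poly = fit_interpolating_polynomial(xs, ys)

x_new, y_true = 7, 14.0  # a location away from the data
print(f"line error at x=7:       {abs(line(x_new) - y_true):.2f}")
print(f"polynomial error at x=7: {abs(poly(x_new) - y_true):.2f}")
```

The polynomial reproduces the training data exactly, noise included, which is precisely why it extrapolates so much worse than the simpler line.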

8) Principal Component Analysis (PCA) — Dimensionality-reduction method for large data sets that transforms a large set of variables into a smaller one that still contains most of the information in the large set.

A rotation that finds the combinations of features with the most information.
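For two features the rotation can be written in closed form, which makes the idea easy to see without a linear algebra library. This is a minimal sketch with made-up data; `pca_2d` is a hypothetical helper, and "information" is taken to mean variance.

```python
import math

def pca_2d(xs, ys):
    """PCA for two features using the closed-form 2x2 eigen-solution."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Rotation angle of the first principal component.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    # Eigenvalues of the covariance matrix (variance along each component).
    half_gap = math.hypot((sxx - syy) / 2, sxy)
    lam1 = (sxx + syy) / 2 + half_gap
    lam2 = (sxx + syy) / 2 - half_gap
    # Project the centered data onto the first component.
    pc1 = [(x - mx) * math.cos(theta) + (y - my) * math.sin(theta)
           for x, y in zip(xs, ys)]
    return theta, lam1, lam2, pc1

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical, strongly correlated features
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
theta, lam1, lam2, pc1 = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)
print(f"rotation = {math.degrees(theta):.1f} degrees")
print(f"variance explained by first component = {explained:.3f}")
```

Because the two features are nearly redundant, a single rotated feature keeps almost all of the variance, and the second component could be dropped.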

9) Stationarity — The statistic of interest is invariant under translation.

The statistic stays the same when you shift to a different location.
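One informal check is to compute the statistic in separate windows and see whether it stays put. The sketch below does this for the mean, using a synthetic series with and without an added trend; the series, window width, and function name are assumptions for illustration.

```python
import random
import statistics

def window_means(series, width):
    """Mean of each consecutive, non-overlapping window."""
    return [statistics.fmean(series[i:i + width])
            for i in range(0, len(series) - width + 1, width)]

rng = random.Random(3)
stationary = [rng.gauss(10.0, 1.0) for _ in range(300)]
# Same values plus a linear trend, so the mean depends on location.
trending = [v + 0.02 * i for i, v in enumerate(stationary)]

means_stationary = window_means(stationary, 100)
means_trending = window_means(trending, 100)
spread_stationary = max(means_stationary) - min(means_stationary)
spread_trending = max(means_trending) - min(means_trending)
print(f"range of window means, no trend:   {spread_stationary:.2f}")
print(f"range of window means, with trend: {spread_trending:.2f}")
```

This only checks the mean; a series can have a stationary mean while its variance or variogram still changes with location.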

10) Variogram — A measurement of dissimilarity vs. distance. Calculated as one half the average squared difference of values separated by a spatial lag vector.

A measure of how much your variable changes over distance.
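The definition translates almost directly into code for regularly spaced samples along a line. This is a minimal sketch with invented values; `semivariogram_1d` is a hypothetical helper, and lags are measured in sample spacings.

```python
def semivariogram_1d(values, max_lag):
    """Experimental semivariogram for regularly spaced 1D data."""
    gamma = {}
    for lag in range(1, max_lag + 1):
        # All pairs of values separated by exactly `lag` sample spacings.
        pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
        # One half the average squared difference over those pairs.
        gamma[lag] = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))
    return gamma

z = [2.0, 3.0, 2.0, 4.0, 5.0, 4.0]  # hypothetical measurements along a well
gamma = semivariogram_1d(z, max_lag=3)
for lag, value in gamma.items():
    print(f"lag {lag}: gamma = {value:.2f}")
```

In this small example the variogram rises with lag, meaning values farther apart are less alike. With irregularly spaced or 2D/3D data, pairs are usually grouped into lag bins with a distance and direction tolerance.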

————————————

John Foster, PhD, P.Eng (daytum Co-Founder & Chief Technology Officer) 

When John is not writing code or helping students, he’s an avid outdoorsman, pilot, and all-around adventurist. 

 Michael Pyrcz, PhD, P.Eng (daytum Founding Advisor) 

When Michael is not building Python packages or mentoring students, he’s either running, out in his Jeep, or kayaking around Lake Austin. You can find him on Twitter here, and his YouTube channel here. 
