Brief Musings on Subsurface Data Analytics and Machine Learning

Michael Pyrcz, PhD, P.Eng (daytum Founding Advisor)

Subsurface is Unique
The subsurface is unique due to (1) sparse data, (2) a heterogeneous spatial system, (3) a high degree of uncertainty, (4) a thick layer of unavoidable interpretation, and (5) extremely high-value development decisions.

We must go beyond the data!
Data analytics is the application of data cleaning and statistical analysis to support decision making. Robust use of statistics and domain knowledge (geoscience and engineering) remains critical!

Garbage In, Garbage Out!
Data cleaning is 80-90% of the effort, and data curation with a massive variety and volume of metadata remains challenging. The principles of “Garbage In, Garbage Out” and “Correlation is not Causation” remain in effect.

Machine Learning Model Accuracy
Base model accuracy on testing error, which has three components: (1) model variance, the sensitivity of the model to the limited training data; (2) model bias, the error due to an inability to fit the complexity of the system; and (3) irreducible error, due to missing variables or to ranges of the variables that are missing from the training dataset.
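As a rough illustration of model variance and model bias, the sketch below refits two very different models on many small synthetic training sets and looks at how their predictions at a single location scatter (variance) and how far their average lands from the truth (bias squared). Everything here is an assumption made for illustration: the sine-shaped `truth` function, the noise level, the training set size, and the two toy models are hypothetical and not taken from any real subsurface dataset.

```python
import math
import random
import statistics

def truth(x):
    """Hypothetical 'true' system, known only because the data are synthetic."""
    return math.sin(2.0 * math.pi * x)

def sample_training_set(rng, n=10, noise_sd=0.5):
    """Draw a small, noisy training set (sparse data)."""
    xs = [rng.random() for _ in range(n)]
    ys = [truth(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Very simple model: ignore x0 and predict the training mean."""
    return statistics.fmean(ys)

def predict_nearest(xs, ys, x0):
    """Very flexible model: copy the response of the nearest training sample."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def variance_and_bias2(predict, x0=0.25, n_sets=2000, seed=13):
    """Refit on many training sets and summarize the predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    variance = statistics.pvariance(preds)               # scatter across training sets
    bias2 = (statistics.fmean(preds) - truth(x0)) ** 2   # squared offset from the truth
    return variance, bias2

for name, model in [("mean", predict_mean), ("nearest", predict_nearest)]:
    var, bias2 = variance_and_bias2(model)
    print(f"{name:8s} variance={var:.3f}  bias^2={bias2:.3f}")
```

With these assumed settings the simple mean model should show the larger bias and the nearest-sample model the larger variance. The irreducible error is the noise added in `sample_training_set`, which neither model can remove.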

Model Complexity and Accuracy: Don’t Jump to Complicated!
Due to the trade-off between model variance and model bias, it is common for lower-complexity models to be more accurate in testing than complicated models.
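The trade-off can be sketched with a complexity sweep. The example below is a minimal illustration under stated assumptions, not a recommended workflow: it uses synthetic data and a hand-written k-nearest-neighbour regressor, where a smaller `k` means a more complex model. The function names, sample sizes, noise level, and the list of `k` values are all hypothetical choices.

```python
import math
import random

def make_data(rng, n, noise_sd=0.5):
    """Synthetic data: an assumed sine-shaped truth plus Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x0, k):
    """Average the responses of the k training samples nearest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model over the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(7)
train_x, train_y = make_data(rng, 40)
test_x, test_y = make_data(rng, 500)

results = {}  # k -> (training MSE, testing MSE)
for k in (1, 3, 5, 10, 20, 40):
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(train_x, train_y, test_x, test_y, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

The most complex setting, `k=1`, reproduces the training data exactly yet is not the most accurate in testing, and the least complex setting, `k=40`, simply predicts the training mean. Under these assumptions the lowest testing error falls somewhere in between.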

Modeling Complexity and Interpretability
More complicated models are generally more difficult to interrogate and communicate. The model may work, but we may fail to learn from it or to trust it!

Non-Parametric Models
These are typically parameter-rich, requiring a large number of implicit parameters and therefore a larger amount of training data, with a greater risk of overfit.

Overfit Models
Overfit models explain almost all of the variance in the training data, expressing high confidence, but perform poorly in testing with new data that was not used to train the model. The overfit model fits your data idiosyncrasies!

Overfit is Insidious!
Model parameters are set to maximize the fit with the training data. Model hyperparameters determine the model complexity and are set by tuning with the withheld testing data to avoid overfit.
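To make the distinction between parameters and hyperparameters concrete, here is a small sketch using a binned-mean regressor on synthetic data. The bin means are the parameters, fit to the training data only. The number of bins is the hyperparameter, chosen by its error on withheld data. The model, the data generator, and the candidate bin counts are hypothetical, picked only to keep the example short.

```python
import math
import random

def make_data(rng, n, noise_sd=0.5):
    """Synthetic data: an assumed sine-shaped truth plus Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_bin_means(xs, ys, n_bins):
    """Parameters: one mean per bin on [0, 1), estimated from training data only."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall training mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def predict(bin_means, x):
    """Look up the mean of the bin that contains x."""
    n_bins = len(bin_means)
    return bin_means[min(int(x * n_bins), n_bins - 1)]

def mse(bin_means, xs, ys):
    return sum((predict(bin_means, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(21)
train_x, train_y = make_data(rng, 60)
held_x, held_y = make_data(rng, 300)  # withheld data, never used to fit bin means

scores = {}  # number of bins -> (training MSE, withheld MSE)
for n_bins in (1, 2, 4, 8, 16, 32, 64):
    params = fit_bin_means(train_x, train_y, n_bins)
    scores[n_bins] = (mse(params, train_x, train_y), mse(params, held_x, held_y))
    print(f"bins={n_bins:2d}  train MSE={scores[n_bins][0]:.3f}  withheld MSE={scores[n_bins][1]:.3f}")

# Hyperparameter tuning: pick the complexity with the lowest withheld error.
best = min(scores, key=lambda b: scores[b][1])
print("selected number of bins:", best)
```

Training error keeps falling as bins are added, so selecting on training error alone would always favour the most complex model. Because the selected bin count has itself seen the withheld data, a further untouched subset is often kept for the final accuracy check.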


When Michael is not building Python packages or mentoring students, he’s either running, out on his Jeep, or kayaking around Lake Austin. You can find him on Twitter and on his YouTube channel.
