Classical statisticians aren't wrong to be curious about learning Machine Learning (ML)
A pragmatic approach to solving problems starts with learning the tools to that end.
The most recent buzz around ML, in my own observation, began when I was a Statistics student in Zürich around 2019. I remember scrambling to find a job upon graduation, during a biomedical PhD that wasn't quite the right fit for me. Needless to say, I was feeling unprepared for a world of impending uncertainties, compounded by the 2020 pandemic and a moving target of what Data Scientists were for. As I've learned, even classical statisticians have their gaps. It's called other-statisticians'-specialisations. Let's digress back into ML.
Pun intended, it's an ensemble of methods that mostly finds its basis in linear regression. Some demonstrations absorb as many features as they can find, without so much as checking for actual linear relationships, independence of covariates, collinearity or omitted variable bias, all assumptions of linear regression modelling. Others incorporate orthogonalisation to scrupulously pick features that don't show dependence, for the convenience of model interpretation and performance (Askanazi & Grinberg, 2024). I stan for this rigour. That's not a complete review, but one observation from a classical statistician diversifying she/her skills.
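To make orthogonalisation concrete, here is a minimal sketch in R with entirely hypothetical variables (x1, x2, y): residualise a candidate feature against an existing one, so the part you add to the model is, by construction, uncorrelated with what is already there.

```r
# Hypothetical data: two correlated candidate features and an outcome
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # x2 overlaps heavily with x1
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

# Orthogonalise x2 against x1: keep only the part of x2 that x1 cannot explain
x2_orth <- residuals(lm(x2 ~ x1))

# The orthogonalised feature is uncorrelated with x1 by construction
cor(x1, x2_orth)

# Fitting on x1 and the orthogonalised x2 sidesteps the collinearity between them
fit <- lm(y ~ x1 + x2_orth)
summary(fit)
```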
How I am being curious
To elaborate on my initial and current qualms, I struggle to find the connection between method and contribution to scientific answers with ML. I feel that is a healthy struggle, one which fuels my curiosity. I come from health care as well, and with this duality I feel an accountability, as the analyst, to explain the connection between the choice of model and parameters and their contribution to the scientific question. Neural networks are transformative for computer vision questions, and while I have dabbled with them in the Julia language, I am limited in explaining why my choice of number of layers, type of layers and learning rate, for example, should be used, other than the fact that it gives nice diagnostic results such as accuracy, precision, positive predictive value, sensitivity or false positive rate.
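To make those diagnostics concrete, here is a minimal base R sketch, with made-up labels and predictions, showing how all of them fall out of a simple confusion matrix:

```r
# Hypothetical binary labels and model predictions
truth     <- factor(c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0), levels = c(0, 1))

# Confusion matrix: rows = predicted, columns = truth
cm <- table(predicted, truth)
tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy    <- (tp + tn) / sum(cm)
precision   <- tp / (tp + fp)          # positive predictive value
sensitivity <- tp / (tp + fn)          # true positive rate
fpr         <- fp / (fp + tn)          # false positive rate

c(accuracy = accuracy, precision = precision,
  sensitivity = sensitivity, fpr = fpr)
```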
But what about the possibility that Computer Vision can predict and prevent full-blown cancer metastases? As an Oncology RN myself, I would support the complete shutdown of these precursory processes. The pain, suffering and absorption of resources caused by cancer in the developed world far outweigh the actual solutions we have for it. We have urgent reasons to explore further.
In the Commercial and Healthcare realm, if that dichotomy exists for you, I see the benefits of the favourable options given by an ML model's ROC, which can potentially cover all budget scenarios. Good ML models affect money and the allocation of health care resources. It can't get more IRL (in real life) than that. I accept assumptions, and I acknowledge that the arbitrariness of parameter choices affects my confidence in reproducibility, and therefore in inference for RL things. I question: can a model trained on data, which will inevitably be older than the predictions we need today, be reproducible? As we observe in the news, the past seems to repeat itself. Scary as it sounds, trained models have already done so much good and bad, so why not take the positive forces and go with them?
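As a small illustration of that budget angle, here is a sketch with entirely made-up scores and a hypothetical follow-up cost per flagged case: sweeping the classification threshold traces out the ROC operating points, each with its own resource implication.

```r
# Hypothetical predicted probabilities and true labels
set.seed(42)
n     <- 500
truth <- rbinom(n, 1, 0.3)
score <- plogis(2 * truth - 1 + rnorm(n))   # scores loosely related to truth

# Sweep thresholds and record the trade-off at each operating point
thresholds <- seq(0.1, 0.9, by = 0.1)
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(score >= th)
  tpr  <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  fpr  <- sum(pred == 1 & truth == 0) / sum(truth == 0)
  # Hypothetical budget: each flagged case costs 100 units to follow up
  cost <- 100 * sum(pred == 1)
  c(threshold = th, tpr = tpr, fpr = fpr, followup_cost = cost)
}))
roc_points
```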
Feature selection, and the important features
I've learned over time the elegance of these words, while summary statistics from classical regression can already point to which features seem to contribute to the variation of the outcome's mean, which contribute the most (the important features), and with what uncertainty (variance or standard deviation). Learning ML truly lends itself to Data Scientists becoming more astute with elegant commercial terms.
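A minimal sketch of that classical view, with hypothetical variables, is just the regression summary itself:

```r
# Hypothetical data: outcome y with two candidate features
set.seed(7)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 3 * x1 + 0.2 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Coefficients answer "which features contribute, and by how much";
# standard errors answer "with what uncertainty"
summary(fit)$coefficients
confint(fit)   # interval view of the same uncertainty
```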
In the same commercial world, I've learned that ML models ingest tens of thousands of features, without any evidence of sense-checking whether they meet linear regression assumptions. Furthermore, classical model evaluation techniques such as the AIC (Akaike's information criterion) and BIC (Bayesian information criterion) don't have main character energy in the ML literature I've come across in the last five years. Then again, five years is not a lot, and there are more semantically sound diagnostics such as the ones mentioned before. Here are the respective formulas of the aforementioned classical techniques, with L̂ being the likelihood, k the number of parameters and n the number of observation points:
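$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$$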
One can see that the choice between the two is based on how the Data Scientist wants the model penalised: both criteria penalise the number of parameters, but BIC's penalty also grows with the number of observations. I posit that Data Science in Big Tech would tend towards BIC owing to the number of rows their data may have, while life sciences would tend towards AIC, especially in experimental settings where the number of variables is even smaller, if these criteria are used at all. That is also a broad stroke on a topic that I have no doubt has finer layers to its story.
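In R, comparing candidate models on both criteria is a one-liner each; here is a sketch with hypothetical data, where a noise variable is added to an otherwise adequate model:

```r
# Hypothetical data: y depends on x1, while x2 is pure noise
set.seed(3)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + rnorm(n)

m_small <- lm(y ~ x1)
m_large <- lm(y ~ x1 + x2)

# Lower is better for both criteria; BIC penalises the extra parameter
# more heavily because its penalty grows with log(n)
AIC(m_small, m_large)
BIC(m_small, m_large)
```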
Back to the topic of feature selection. Linear regression assumptions cover the systematic approach to ensuring we minimise multicollinearity (the numeric teaming up of independent variables), autocorrelation (errors between data points trend together) and omitted variable bias (making sure we didn't forget an important feature, which would show up as inflated error). Reservations against ML can rather come from a place of concern for model integrity and interpretability, and personally for me, the accountability we have for the decisions we make because of it. I hope that those who need "Data Scientists" can be faithful to this scrupulous process that has benefited the good decisions we have already made based on good practice.
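Two of those checks are a couple of lines in R. Here is a minimal sketch, assuming the car and lmtest packages are installed, again with hypothetical, deliberately collinear data:

```r
# Diagnostics for the assumptions above, assuming the `car` and `lmtest`
# packages are installed; data and model are hypothetical
library(car)      # vif()
library(lmtest)   # dwtest()

set.seed(11)
n  <- 120
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

vif(fit)      # variance inflation factors; values well above ~5 flag multicollinearity
dwtest(fit)   # Durbin-Watson test for autocorrelation in the residuals
```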
The importance of code engineering
I insist that writing code is tied to the development and learning of even classical statisticians. Writing code is a doorway to furthering technical knowledge, just by the sheer efficiency of analysis that software alone affords. If there are gaps in our technical knowledge, writing code makes us fail faster. I have observed this myself as a lead developer of the R package phase1b. I would argue that being a hands-on analyst is THE duality that is a Statistician or Data Scientist. Thus I propose that one of the barriers to ML is the plethora of software available even within one language. R, Python and Julia each have their own families of ML software, and how we learn of them seems to be at the mercy of such things as word of mouth or the power of our prompt engineering (search and/or AI tools), just to name a few. Downloads of ML-related R packages currently place some of them around the top 500 most downloaded of the over 20'000 packages on CRAN; see "yardstick" and "tidymodels". That is surely confounded by the buzz, but ML skills are a ubiquitous part of the market forces, and learning about a multitude of software to that end is increasingly a barrier.
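For a flavour of what that one family looks like in practice, here is a minimal sketch of a tidymodels-style fit and evaluation, assuming the parsnip and yardstick packages are installed and using entirely hypothetical data:

```r
# A minimal tidymodels-style sketch, assuming the `parsnip` and `yardstick`
# packages are installed; the data here are hypothetical
library(parsnip)
library(yardstick)

set.seed(21)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(n)

# Specify and fit a plain linear regression through the parsnip interface
spec      <- linear_reg() |> set_engine("lm")
model_fit <- fit(spec, y ~ x1 + x2, data = df)

# Evaluate with yardstick's regression metrics
df$pred <- predict(model_fit, new_data = df)$.pred
rmse(df, truth = y, estimate = pred)
rsq(df, truth = y, estimate = pred)
```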
If classical statisticians want to finally find comfort in ML tools, more can be done to clarify their qualms. One such qualm of mine is connecting the choice of neural network parameters to the scientific question. I marvel at the results yet cannot come to terms with the how. The search for improving the reliability of prediction error is another such qualm. Overall, coding skills are a pathway to understanding what these qualms are and to attempting solutions, rather than stagnating at problems.
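On the reliability of prediction error, one coding habit is simply to estimate it out of sample. Here is a minimal k-fold cross-validation sketch in base R, with hypothetical data and a plain linear model standing in for whatever model one actually cares about:

```r
# Minimal 5-fold cross-validation in base R to estimate out-of-sample
# prediction error; data and model are hypothetical
set.seed(5)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n)

k     <- 5
folds <- sample(rep(1:k, length.out = n))

fold_rmse <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))   # RMSE on the held-out fold
})

mean(fold_rmse)   # cross-validated estimate of prediction error
```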
For real though, why classical statisticians aren't wrong to be curious
The dichotomy of classical Statistics and ML comes across as more pronounced to classical Statisticians. This may be the basis for our curiosity. The earliest documentation I have found is from 1959, when the term was coined and tied to such things as Artificial Intelligence and Computer Vision. As per the Wikipedia article:
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions
An observation is that the single distinction that would tend a classical statistician toward ML is an emphasis on prediction. Yet there is no shame in wanting to predict. Off-hand comments among classical statisticians, for lack of a better word, boycott ML terminology. But this is limiting in so many ways. I get the things the Montagues are saying about the Capulets, but I respectfully question persistent trauma bonds. Any RL question can be asked in a scientific framework. Many business questions are translated into Data Science terms, or Statistical terms, and then solved with a bunch of caveats. While the off-hand comments are sometimes reasonable, the truth is that there are real contributions from Machine Learning. The invisible hand of the market is pointing us to ML. And it is not necessarily wrong to do so.
Classical Statisticians towards pragmatism
Let's first address our barriers: the choice of ML methodology, the moving parts of parameter options such as learning rate, the multitude of ways to cross-validate, the overwhelming number of ways to handle features, and substantiating the rationale for those chosen paths are merely new learnings. If this "extra stuff" has been useful in the past, shouldn't we include and improve on it? It seems there is room for tools to become sharper as technology creates ways for efficiency gains (DeepSeek just entered the chat). Furthermore, learning ML is already a great achievement for classical statisticians, given the evolving knowledge and software that make these efforts kind of prohibitive. Let's be real about that. Finally, I posit that learning Machine Learning, if that is the choice of a classical statistician, should be approached with a united focus on solving the actual problems we have.