Top Writer in AI | Writing “I wish I found this earlier” posts about Data Science and Machine Learning

Get the same performance even after dropping 50 features

Intro to Feature Selection With Variance Thresholding

Today, it is common for datasets to have hundreds if not thousands of features. On the surface, this might seem like a good thing — more features give more information about each sample. But more often than not, these additional features don’t provide that much value and introduce unnecessary complexity.

The biggest challenge of Machine Learning is to create models with robust predictive power using as few features as possible. But given the massive sizes of today’s datasets, it is easy to lose sight of which features are important and which ones aren’t.

That’s why there is…
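The idea in the teaser can be sketched in a few lines with scikit-learn's VarianceThreshold, which drops features whose variance falls below a cutoff. The data and the threshold value below are illustrative assumptions, not from the article:

```python
# Minimal sketch: dropping near-constant features with scikit-learn's
# VarianceThreshold (synthetic data; the 0.01 threshold is an assumption).
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
X[:, 2] = 1.0                            # constant feature: zero variance
X[:, 4] = 0.001 * rng.normal(size=100)   # near-constant feature: tiny variance

selector = VarianceThreshold(threshold=0.01)  # drop features with variance < 0.01
X_reduced = selector.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 3)
```

The two uninformative columns are removed while the genuinely varying ones survive, which is the same effect the headline describes at the scale of dozens of features.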

Successive halving completely crushes GridSearch and RandomSearch


For a while now, the GridSearchCV and RandomizedSearchCV classes of Scikit-learn have been the go-to choice for hyperparameter tuning. Given a grid of possible parameters, both use a brute-force approach to figure out the best set of hyperparameters for any given model. Though they provide pretty robust results, tuning heavier models on large datasets can take too much time (we are talking hours here). This meant that unless you had a machine with 16+ cores, you were in trouble.

But in December 2020, version 0.24 of Scikit-learn came out along with two new classes for hyperparameter tuning — HalvingGridSearch and…
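The halving search classes mentioned above shipped as experimental in 0.24, so they need an explicit enabling import. A minimal sketch, with a synthetic dataset and an illustrative parameter grid of my choosing:

```python
# Successive halving, as added in scikit-learn 0.24. HalvingGridSearchCV is
# experimental, so its import must be enabled first.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
param_grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 5, 10]}

# Each round trains the surviving candidates on more samples and keeps only
# the best-performing fraction, instead of exhaustively fitting every combo.
search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    factor=2,        # keep the top 1/2 of candidates each round
    random_state=0,
).fit(X, y)

print(search.best_params_)
```

Because weak candidates are eliminated on small resource budgets, most of the grid never gets a full-data fit, which is where the speedup over plain GridSearchCV comes from.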

Utilize the hottest ML library for state-of-the-art performance

What is XGBoost and why is it so popular?

Let me introduce you to the hottest Machine Learning library in the ML community: XGBoost. In recent years, it has been the main driving force behind the algorithms that win massive ML competitions. Its speed and performance are unparalleled, and it consistently outperforms any other algorithm aimed at supervised learning tasks.

The library is parallelizable which means the core algorithm can run on clusters of GPUs or even across a network of computers. This makes it feasible to solve ML tasks by training on hundreds of millions of training examples with high performance.

Originally, it was written in C++…

Make interactive plots without having to learn a new library

Introduction To Plotly

Libraries in the SciPy stack work seamlessly together. In terms of visualization, the relationship between pandas and matplotlib clearly stands out. Without even importing it, you can generate matplotlib plots with the plotting API of pandas. Just call the .plot method on any pandas DataFrame or Series and you get access to most of matplotlib’s functionality:
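A quick sketch of that delegation (the explicit backend selection is only there so the example runs headless; in a notebook you would skip it):

```python
# pandas .plot delegates to matplotlib under the hood: you never import
# pyplot yourself. The Agg backend line is only for headless environments.
import matplotlib
matplotlib.use("Agg")
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})
ax = df.plot(x="x", y="y", kind="line", title="y = x^2")  # returns a matplotlib Axes

print(type(ax).__name__)
```

The returned object is a plain matplotlib Axes, so any matplotlib customization (labels, limits, annotations) can be applied to it afterwards.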

Easily learn what is only learned by hours of search and exploration

About This Project

Kaggle is a wonderful place. It is a gold mine of knowledge for data scientists and ML engineers. There are not many platforms where you can find high-quality, efficient, reproducible, awesome code written by experts in the field, all in one place.

It has hosted 164+ competitions since its launch, attracting experts and professionals from around the world. As a result, there are many high-quality notebooks and scripts available for each competition, as well as for the massive number of open datasets Kaggle provides.

At the beginning of my data science journey, I would go to Kaggle…

Deep dive into GridSearch and RandomSearch classes of Scikit-learn

What is a hyperparameter?

Today, algorithms that hide a world of math under the hood can be trained with only a few lines of code. Their success depends first on the data they are trained on, and then on the hyperparameters chosen by the user. So, what are these hyperparameters?

Hyperparameters are user-defined values like k in kNN and alpha in Ridge and Lasso regression. They strictly control the fit of the model, which means that for each dataset there is a unique set of optimal hyperparameters to be found. The most basic way of finding this perfect set would be randomly trying out different values…
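The k-in-kNN example above can be made concrete with a tiny experiment: k is set by the user before training, not learned from the data, and each value yields a different fit. The dataset and the k values tried are my illustrative choices:

```python
# A hyperparameter in action: k in kNN is chosen by the user, and each
# choice produces a different model quality on the same data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

for k in (1, 5, 15):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:>2}: mean CV accuracy = {score:.3f}")
```

Manually looping like this is the "randomly trying out different values" approach the teaser alludes to; the search classes covered in the article automate exactly this loop.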

You might as well ditch Linear Regression

Problems of Linear Regression

Linear Regression, a.k.a. Ordinary Least Squares, is one of the easiest and most widely used ML algorithms. But it suffers from a fatal flaw: it is super easy for the algorithm to overfit the training data.

For the simplest case, 2D data, everything just clicks visually: the line of best fit is the one that minimizes the sum of squared residuals (SSR):
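The 2D case can be sketched numerically: fit a line and compute the SSR that OLS minimizes. The synthetic data (true slope 3, intercept 2, Gaussian noise) is an assumption for illustration:

```python
# The 2D picture described above: OLS finds the line minimizing the
# sum of squared residuals (SSR). Data is synthetic for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=1.0, size=50)  # y = 3x + 2 + noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
ssr = np.sum(residuals ** 2)  # the quantity OLS minimizes

print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, SSR={ssr:.2f}")
```

With noiseless data the SSR of the true line would be zero; the nonzero value here is what remains after OLS has made it as small as any straight line can.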

Everything I love about Scikit-Learn, in one place

Why Do You Need a Pipeline?

Data cleaning and preparation is easily the most time-consuming and boring task in machine learning. All ML algorithms are really fussy: some want normalized or standardized features, some want encoded variables, and some want both. Then there is the ever-present issue of missing values.

Dealing with them is no fun at all, not to mention the “bonus” of repeating the same cleaning operations on the training, validation, and test sets. Fortunately, Scikit-learn’s Pipeline is a major productivity tool that facilitates this process, cleaning up code and collapsing all preprocessing and modeling steps into…
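A minimal sketch of that collapse, assuming a synthetic dataset with injected missing values; the particular steps (mean imputation, standardization, logistic regression) are illustrative choices, not the article's:

```python
# Collapsing preprocessing and modeling into one Pipeline object
# (synthetic data; the chosen steps are illustrative).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X[::10, 0] = np.nan  # simulate the ever-present missing values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill missing values
    ("scale", StandardScaler()),                 # standardize features
    ("model", LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe.fit(X_train, y_train)  # one call runs every step, in order
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Because the imputer and scaler are fitted only on the training split and merely applied to the test split, the pipeline also removes the repeat-on-every-set drudgery, and quietly prevents data leakage while it is at it.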

Learn what it takes to master them

What Is KNN?

Raise your hand if kNN is the first algorithm you were introduced to in a machine learning course ✋🤚

The k Nearest Neighbors algorithm is one of the most commonly used algorithms in machine learning. Because of its simplicity, many beginners start their wonderful ML journey with it. It is also one of the few algorithms that can be used smoothly for both regression and classification.

So, what makes kNN so versatile and easy at the same time? The answer is hidden in how it works under the hood.

Imagine you have a variable with 2 categories which are visualized…
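A toy version of the two-category picture described above, with made-up 2D points: kNN classifies a new point by majority vote among its k nearest training neighbors.

```python
# kNN under the hood: a new point gets the label held by the majority
# of its k closest training points (toy 2D data, two categories).
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],   # category 0: a cluster near the origin
     [8, 8], [8, 9], [9, 8]]   # category 1: a cluster far away
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))  # -> [0 1]
```

The same neighbor lookup powers regression too: swap in KNeighborsRegressor and the vote becomes an average of the neighbors' target values, which is why the algorithm handles both task types so smoothly.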

There is more to it than vanilla train_test_split function


Arguably, the first function you learned from Scikit-learn is train_test_split. It performs the most basic yet crucial task: dividing the data into train and test sets. You fit the relevant model to the training set and test its accuracy on the test set. Sounds simple. But let me tell you, it is not simple. Not simple at all.

Scikit-learn provides a whopping 15 different functions to split your data, depending on the use case. Some of them you have never heard of; others you use daily. But don’t worry, we won’t cover all of them here. …
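To ground the "not simple at all" claim, here is the basic split plus one of the options that complicates it, stratification; the dataset and parameter values are my illustrative choices:

```python
# train_test_split beyond the vanilla call: stratify keeps class
# proportions identical in both splits (illustrative parameters).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,   # hold out 20% for testing
    stratify=y,      # preserve the 1/3-1/3-1/3 class balance in both sets
    random_state=0,  # make the split reproducible
)
print(len(X_train), len(X_test))  # 120 30
```

Without stratify, an unlucky split can leave a class under-represented in the test set and silently distort the measured accuracy, one of many pitfalls the article goes on to cover.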
