For a while now, the GridSearchCV and RandomizedSearchCV classes of Scikit-learn have been the go-to choice for hyperparameter tuning. Given a grid of possible parameters, both use a brute-force approach to figure out the best set of hyperparameters for any given model. Though they provide pretty robust results, tuning heavier models on large datasets can take far too long (we are talking hours here). That meant that unless you had a machine with 16+ cores, you were in trouble.
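To make the brute-force idea concrete, here is a minimal sketch with GridSearchCV; the dataset, model, and parameter grid are illustrative, not from the article:

```python
# A minimal sketch of brute-force tuning with GridSearchCV;
# the dataset, model, and parameter grid are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
}

# GridSearchCV exhaustively tries every combination in the grid,
# fitting one model per combination per fold (3 * 3 * 5 = 45 fits here).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,  # spread the fits across all cores; this is where 16+ cores help
)
search.fit(X, y)
print(search.best_params_)
```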
Let me introduce you to the hottest Machine Learning library in the ML community: XGBoost. In recent years, it has been the main driving force behind the algorithms that win massive ML competitions. Its speed and performance are unparalleled, and it consistently outperforms other algorithms on supervised learning tasks.
The library is parallelizable, which means the core algorithm can run on clusters of GPUs or even across a network of computers. This makes it feasible to train on hundreds of millions of examples with high performance.
Originally, it was written in C++…
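As a quick taste, here is a minimal sketch using XGBoost's scikit-learn-compatible wrapper; the dataset and parameter values are illustrative (on recent versions of the library, GPU training can be enabled via the device parameter):

```python
# A minimal sketch of training XGBoost via its scikit-learn wrapper;
# the dataset and parameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 parallelizes tree construction across all CPU cores;
# tree_method="hist" is the fast histogram-based split-finding algorithm.
model = XGBClassifier(n_estimators=300, tree_method="hist", n_jobs=-1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```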
Libraries in the SciPy Stack work seamlessly together. In terms of visualization, the relationship between matplotlib and pandas clearly stands out. Without even importing it, you can generate matplotlib plots with the plotting API of pandas. Just use the .plot method on any pandas DataFrame or Series and you will get access to most of the functionality of matplotlib…
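A quick sketch of what that looks like; the data is made up for illustration:

```python
# pandas wraps matplotlib, so no explicit matplotlib import is needed;
# the DataFrame here is made up for illustration.
import pandas as pd

df = pd.DataFrame(
    {"height": [1.6, 1.7, 1.8, 1.9], "weight": [55, 65, 75, 85]}
)

# .plot returns a regular matplotlib Axes object,
# so the full matplotlib API is available afterwards.
ax = df.plot(kind="scatter", x="height", y="weight", title="Height vs. weight")
ax.figure.savefig("scatter.png")
```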
Kaggle is a wonderful place. It is a gold mine of knowledge for data scientists and ML engineers. There are not many platforms where you can find high-quality, efficient, reproducible, awesome code written by experts in the field all in the same place.
It has hosted 164+ competitions since its launch. These competitions attract experts and professionals from around the world to the platform. As a result, there are many high-quality notebooks and scripts for each competition and for the massive number of open-source datasets Kaggle provides.
At the beginning of my data science journey, I would go to Kaggle…
Today, algorithms that hide a world of math under the hood can be trained with only a few lines of code. Their success depends first on the data they are trained on and then on the hyperparameters chosen by the user. So, what are these hyperparameters?
Hyperparameters are user-defined values like k in kNN and alpha in Ridge and Lasso regression. They directly control the fit of the model, which means that for each dataset there is a unique set of optimal hyperparameters to be found. The most basic way of finding this perfect set would be randomly trying out different values…
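Here is a small sketch of where those values show up in code; the numbers are illustrative, and the best ones depend entirely on the dataset:

```python
# The hyperparameters named above, as they appear in Scikit-learn;
# the values are illustrative, not tuned.
from sklearn.linear_model import Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier

# k (n_neighbors) sets how many neighbors vote on each prediction.
knn = KNeighborsClassifier(n_neighbors=5)

# alpha sets the regularization strength in Ridge and Lasso regression.
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)
```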
Linear Regression, a.k.a. Ordinary Least Squares, is one of the easiest and most widely used ML algorithms. But it suffers from a fatal flaw: it is very easy for the algorithm to overfit the training data.
For the simplest case of 2D data, everything clicks visually: the line of best fit is the line that minimizes the sum of squared residuals (SSR):
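In symbols (the standard least-squares objective, where $y_i$ are the observed values and $\hat{y}_i = \beta_0 + \beta_1 x_i$ are the line's predictions):

$$\mathrm{SSR} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$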
Data cleaning and preparation is easily the most time-consuming and boring task in machine learning. All ML algorithms are really fussy: some want normalized or standardized features, some want encoded variables, and some want both. Then there is the issue of missing values, which is always there.
Dealing with them is no fun at all, not to mention the added bonus of repeating the same cleaning operations on the training, validation, and test sets. Fortunately, Scikit-learn's Pipeline is a major productivity tool that facilitates this process, cleaning up code and collapsing all preprocessing and modeling steps into…
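A minimal sketch of the idea; the steps and model here are illustrative, not the article's exact setup:

```python
# A minimal scikit-learn Pipeline: impute -> scale -> model;
# the steps and estimator are illustrative.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),    # fill missing values
        ("scale", StandardScaler()),                   # standardize features
        ("model", LogisticRegression(max_iter=1000)),  # final estimator
    ]
)

# A single fit call runs every preprocessing step and the model in order,
# and the fitted pipeline applies the same transforms to validation/test data:
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)
```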
Raise your hand if kNN is the first algorithm you were introduced in a machine learning course ✋🤚
The k Nearest Neighbors algorithm is one of the most commonly used algorithms in machine learning. Because of its simplicity, many beginners start their wonderful journey into ML with this algorithm. It is one of the few algorithms that can be used smoothly for both regression and classification.
So, what makes kNN so versatile and easy at the same time? The answer is hidden in how it works under the hood.
Imagine you have a variable with 2 categories which are visualized…
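To hint at that under-the-hood simplicity, here is a from-scratch sketch of kNN classification; it assumes numeric features and plain Euclidean distance, purely for illustration:

```python
# A from-scratch sketch of the kNN idea for classification;
# assumes numeric features and Euclidean distance, for illustration only.
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Measure the distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Grab the labels of the k closest training points.
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # 3. Predict by majority vote (for regression, average the values instead).
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 8.0], [9.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 2.5])))  # -> 0
```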
Arguably, the first function you learned from Scikit-learn is train_test_split. It performs the most basic yet crucial task: dividing the data into train and test sets. You fit the relevant model to the training set and test its accuracy on the test set. Sounds simple. But let me tell you, it is not simple. Not simple at all.
Scikit-learn provides a whopping 15 different functions to split your data depending on the use case. Some of them you have never heard of; some are used daily. But don't worry, we won't cover all of them here. …
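A quick sketch of the basic call; the split sizes are illustrative:

```python
# A quick sketch of train_test_split; the sizes are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps class proportions identical in both sets,
# which matters for imbalanced classification problems.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```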
Machine Learning has gained so much popularity that it has become one of those topics people assume you know everything about. It is now common to throw around fancy terms like training/fitting a model, train set, validating a model, a cost function, and many others without a second thought as to whether people understand them. Sometimes, it feels like you should have been born knowing these things…
Look at me too, rambling about machine learning without explaining what it is in the first place.
Today, I am here to break this knowledge bias by explaining some of the…