
Top Writer in AI | Writing “I wish I found this earlier” posts about Data Science and Machine Learning

You don’t have to write all the code yourself

Photo by Arek Socha on Pixabay

Whatever you are coding, chances are someone already did something similar. Instead of reinventing the wheel, you can customize someone else’s code to suit your needs. After all, that’s the whole point of the open-source community. This vast pool of code written by millions of people is available at your fingertips if you know just one programming concept — inheritance.

Inheritance is a must for any object-oriented programming language, and Python is no exception. Inheritance allows you to reuse the code others have written and tweak it to your needs. …
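To give a flavor of the idea, here is a minimal sketch; the CSVLoader and TSVLoader classes are hypothetical stand-ins for code you might pull from someone else's package.

```python
class CSVLoader:                      # imagine this class comes from a third-party package
    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path) as f:
            return [line.strip().split(",") for line in f]


class TSVLoader(CSVLoader):           # inherit everything from CSVLoader...
    def load(self):                   # ...and override only the behavior that differs
        with open(self.path) as f:
            return [line.strip().split("\t") for line in f]
```

The subclass gets the constructor and everything else for free and only rewrites the one method it needs to change.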


Your favorite models choosing features themselves

Photo by Andrea Piacquadio on Pexels

Introduction

There are many feature selection methods in Machine Learning. Each one may give different results depending on how you use them, so it is hard to trust a single method entirely. Wouldn’t it be cool to have multiple methods cast their own vote on whether we should keep a feature or not? It would be just like the Random Forest algorithm, which combines the predictions of multiple weak learners to form a strong one. It turns out Sklearn has already given us the tools to build such a feature selector on our own.

Together, using those tools, we will…
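As a rough sketch of the voting idea, the snippet below collects keep-or-drop votes from a few stock Sklearn selectors and keeps a feature when a majority vote for it; the particular selectors, dataset, and majority rule are illustrative choices, not the article's exact recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, SelectFromModel, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

# Each selector "votes" on which features to keep (True = keep).
selectors = [
    SelectKBest(f_classif, k=10),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
    SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42)),
]

votes = np.array([s.fit(X, y).get_support() for s in selectors])

# Keep a feature only if a majority of the selectors voted for it.
keep = votes.sum(axis=0) >= 2
X_selected = X[:, keep]
print(f"Kept {keep.sum()} of {X.shape[1]} features")
```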


Learn from the same resources that put people in FAANG companies

Photo by Nina Uhlíková on Pexels

Data Science Can Be Overwhelming

Learning data science from scratch can be pretty intimidating and overwhelming, especially if you accidentally come across some article on the internet that lists a bunch of technical things that you must know to get a job in data science. You are not alone.


Get the same model performance even after dropping 93 features

Photo by Victoriano Izquierdo on Unsplash

The basic feature selection methods mostly look at individual properties of features and how they interact with each other. Variance thresholding and pairwise feature selection are two examples that remove unnecessary features based on variance and the correlation between them. However, a more pragmatic approach selects features based on how they affect a particular model’s performance. One such technique offered by Sklearn is Recursive Feature Elimination (RFE). It reduces model complexity by removing features one by one until only the desired number of features is left.

It is one of the most popular feature selection algorithms due to its…
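A minimal usage sketch of RFE is below; the breast cancer dataset, the logistic regression estimator, and the target of 10 features are illustrative choices, not the article's setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10, step=1)
rfe.fit(X, y)

print(X.columns[rfe.support_])   # the surviving features
```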


The crucial lesson to understand tricky nuances

Photo by Karolina Grabowska on Pexels

Introduction

The world of OOP is vast and rich and takes a while to master. Once you have the basics down, it is time to learn the 3 core principles of OOP: Inheritance, Polymorphism, and Encapsulation. Along the way, however, there are many filler concepts and prerequisites you need to learn. One of them is differentiating between class-level and instance-level data. This differentiation is crucial for understanding Inheritance, one of the strong pillars of OOP.

In this article, we will discuss how instance attributes differ from global class attributes and categorize methods into class-level and instance-level types.
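As a preview, here is a small sketch with a hypothetical Employee class that puts both levels side by side.

```python
class Employee:
    # Class-level attribute: shared by every instance.
    company = "Acme Corp"

    def __init__(self, name, salary):
        # Instance-level attributes: unique to each object.
        self.name = name
        self.salary = salary

    @classmethod
    def from_string(cls, text):
        # Class-level method: works with the class itself, not one instance.
        name, salary = text.split("-")
        return cls(name, float(salary))

    def give_raise(self, amount):
        # Instance-level method: operates on a single object's data.
        self.salary += amount


alice = Employee("Alice", 50000)
bob = Employee.from_string("Bob-45000")
print(alice.company, bob.company)   # both read the shared class attribute
```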

Instance-level attributes

Remember that self


Add the time-tested method to your arsenal

Photo by Pixabay on Pexels

What Is the Correlation Coefficient?

In my last article on the topic of Feature Selection, we focused on a technique to remove features based on their individual properties. In this post, we will look at a more reliable, more robust method that lets us see the connection between features and decide if they are worth keeping. This method, as you have read from the title, uses Pairwise Correlation.

First of all, let’s briefly touch on Pearson’s correlation coefficient — commonly denoted as r. This coefficient can be used to quantify the linear relationship between two distributions (or features) in a single metric. …
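To make that concrete, here is one common way to act on pairwise correlations with pandas; the random DataFrame and the 0.95 cutoff are illustrative, not the article's exact numbers.

```python
import numpy as np
import pandas as pd

# A toy DataFrame of numeric features, with one deliberately correlated column.
df = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
df["e"] = df["a"] * 0.9 + np.random.rand(100) * 0.1

# Pairwise Pearson correlation matrix (|r| for every pair of features).
corr = df.corr().abs()

# Look only at the upper triangle so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose |r| exceeds the chosen threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```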


Implement a simple Linear Regression with OOP basics on your own

Photo by Pixabay on Pexels

What Is Object-Oriented Programming?

There is no doubt that data scientists spend the majority of their time doing procedural programming — writing code that executes as a sequence of steps. Doing data analysis in Jupyter Notebooks and writing Python scripts to clean data are both examples of this. This way of thinking and working comes to people naturally. After all, that is how we go through each day, one step — one process at a time.

However, as data scientists, we should be eternally grateful that there is another, much better system of coding — Object-Oriented Programming (OOP). OOP gives us massively powerful…
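As a taste of what that looks like, here is a minimal, illustrative linear regression class built from nothing more than OOP basics and the normal equation; it is a sketch, not the article's exact implementation.

```python
import numpy as np

class SimpleLinearRegression:
    """A toy least-squares linear regression written with OOP basics."""

    def fit(self, X, y):
        # Add a column of ones so the intercept is learned with the weights.
        X_b = np.c_[np.ones(len(X)), X]
        # Normal equation: w = (X^T X)^-1 X^T y
        self.weights_ = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
        return self

    def predict(self, X):
        X_b = np.c_[np.ones(len(X)), X]
        return X_b @ self.weights_


# Quick check on synthetic data: y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=0.1, size=100)
model = SimpleLinearRegression().fit(X, y)
print(model.weights_)   # roughly [2, 3]
```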


Get the same performance even after dropping 50 features

Photo by Billel Moula on Pexels

Intro to Feature Selection With Variance Thresholding

Today, it is common for datasets to have hundreds if not thousands of features. On the surface, this might seem like a good thing — more features give more information about each sample. But more often than not, these additional features don’t provide that much value and introduce unnecessary complexity.

The biggest challenge of Machine Learning is to create models with robust predictive power using as few features as possible. But given the massive sizes of today’s datasets, it is easy to lose sight of which features are important and which ones aren’t.

That’s why there is…
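For a flavor of the technique this article builds on, here is a minimal VarianceThreshold sketch on synthetic data; the 0.05 threshold is an arbitrary, dataset-dependent choice.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:, :4] = 0.01 * rng.normal(size=(1000, 4))   # four nearly constant columns

# Drop every feature whose variance falls below the chosen threshold;
# near-constant columns carry almost no information for a model.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (1000, 10) -> (1000, 6)
```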


Successive halving completely crushes GridSearch and RandomSearch

Photo by Karolina Grabowska on Pexels

Introduction

For a while now, the GridSearchCV and RandomizedSearchCV classes of Scikit-learn have been the go-to choice for hyperparameter tuning. Given a grid of possible parameters, both use a brute-force approach to figure out the best set of hyperparameters for any given model. Though they provide pretty robust results, tuning heavier models on large datasets can take too much time (we are talking hours here). This meant that unless you had a machine with 16+ cores, you were in trouble.

But in December 2020, version 0.24 of Scikit-learn came out along with two new classes for hyperparameter tuning — HalvingGridSearchCV and…
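A minimal sketch of the new API is below; the estimator and parameter grid are illustrative, and note the experimental import that version 0.24 requires before the class can be used.

```python
# Successive halving is still experimental in 0.24, so it must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {"max_depth": [3, 5, 10, None], "min_samples_split": [2, 5, 10]}

# All candidates start on a small slice of the data; only the best-performing
# ones survive each round and are refit on progressively more samples.
search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    factor=3,          # keep roughly one third of the candidates at each round
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```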


Utilize the hottest ML library for state-of-the-art performance

Photo by Dom Gould on Pexels

What is XGBoost and why is it so popular?

Let me introduce you to the hottest Machine Learning library in the ML community — XGBoost. In recent years, it has been the main driving force behind the algorithms that win massive ML competitions. Its speed and performance are unparalleled, and it consistently outperforms other algorithms on supervised learning tasks.

The library is parallelizable, which means the core algorithm can run on clusters of GPUs or even across a network of computers. This makes it feasible to train on hundreds of millions of examples while keeping performance high.

Originally, it was written in C++…
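For reference, here is a minimal sketch of the library's scikit-learn style interface on synthetic data; the hyperparameters are illustrative, not tuned values.

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted trees; n_jobs=-1 uses all available CPU cores.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, n_jobs=-1)
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))
```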
