Top Writer in AI | Writing “I wish I found this earlier” posts about Data Science and Machine Learning

You might as well ditch Linear Regression


Problems of Linear Regression

Linear Regression, a.k.a. Ordinary Least Squares, is one of the simplest and most widely used ML algorithms. But it suffers from a fatal flaw: it is super easy for the algorithm to overfit the training data.

For the simplest case of 2D data, everything just clicks visually: the line of best fit is the line that minimizes the sum of squared residuals (SSR):

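The formula itself is cut off in this preview; for reference, the standard definition of the SSR for a 2D fit y = β₀ + β₁x is:

SSR = \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2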

Everything I love about Scikit-Learn, in one place


Why Do You Need a Pipeline?

Data cleaning and preparation is easily the most time-consuming and boring task in machine learning. ML algorithms are really fussy: some want normalized or standardized features, some want encoded variables, and some want both. Then there is the ever-present issue of missing values.

Dealing with them is no fun at all, not to mention the added chore of repeating the same cleaning operations on the training, validation, and test sets. Fortunately, Scikit-learn’s Pipeline is a major productivity tool that facilitates this process, cleaning up the code and collapsing all preprocessing and modeling steps into…

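As a taste of what that looks like, here is a minimal sketch that chains imputation, scaling, encoding, and a model into a single estimator. The column names are hypothetical, not from the article:

# Minimal scikit-learn Pipeline sketch (hypothetical column names)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_features = ["age", "income"]   # hypothetical numeric columns
categorical_features = ["city"]        # hypothetical categorical column

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_features),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
])

model = Pipeline([("prep", preprocessor),
                  ("clf", LogisticRegression(max_iter=1000))])

# model.fit(X_train, y_train); model.score(X_test, y_test)  # same preprocessing applied to every split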

Learn what it takes to master them


What Is KNN?

Raise your hand if kNN is the first algorithm you were introduced to in a machine learning course ✋🤚

The k-Nearest Neighbors algorithm is one of the most commonly used algorithms in machine learning. Because of its simplicity, many beginners start their wonderful journey into ML with this algorithm. It is also one of the few algorithms that can smoothly be used for both regression and classification.

So, what makes kNN so versatile and easy at the same time? The answer is hidden in how it works under the hood.

Imagine you have a variable with 2 categories which are visualized…

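To make the “under the hood” part concrete, here is a minimal scikit-learn sketch on a synthetic two-class dataset (not the example from the article), where each test point is labeled by a majority vote of its k nearest training points:

# kNN classification sketch on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 neighbors
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))           # accuracy on the held-out points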

There is more to it than the vanilla train_test_split function


Introduction

Arguably, the first function you learned from Scikit-learn is train_test_split. It performs the most basic yet crucial task: dividing the data into train and test sets. You fit the relevant model to the training set and test its accuracy on the test set. Sounds simple. But let me tell you, it is not simple. Not simple at all.

Scikit-learn provides a whopping 15 different functions to split your data, depending on the use case. Some of them you have never heard of; some are used daily. But don’t worry, we won’t cover all of them here. …

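As a refresher, the vanilla version looks roughly like this (toy arrays, hypothetical parameters):

# Basic train/test split sketch; stratify keeps class proportions intact
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # toy feature matrix
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)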

Get the terminology right for the most exciting field out there


The Curse of Knowledge

Machine Learning has gained so much popularity that it has become one of those topics people assume you know everything about. It is now common to use fancy terms like training/fitting a model, train set, validating a model, a cost function, and many others without giving a second thought to whether people understand them or not. Sometimes, it feels like you should have been born knowing these things…

Look at me too, rambling about machine learning without explaining what it is in the first place.

Today, I am here to break this knowledge bias by explaining some of the…


EDA — done right…


Why Are You Stuck?

It is hard to get started when that blank void of a Jupyter notebook is staring at you. You have a dataset with hundreds of features, and you have no idea where to begin. Your gut feeling tells you: “Normal, start with a feature that is normally distributed”. As always…

You dive head-first into the data, moving from feature to feature, until you find yourself on a wild-goose chase through a forest without any purpose whatsoever. So, what is the reason?

Well, to start, it means you don’t have a clear process. Many say Exploratory Data Analysis is the most important…

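If a process is what’s missing, a minimal first-pass sketch in pandas (on a small demo DataFrame, standing in for your own data) might look like:

# First-pass EDA sketch on a small demo DataFrame (swap in your own data)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 35],
    "income": [40_000, 52_000, 61_000, np.nan, 45_000, 58_000],
    "city": ["Rome", "Paris", "Rome", "Berlin", None, "Paris"],
})

df.info()                                             # dtypes and non-null counts
print(df.describe())                                  # summary stats for numeric columns
print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column
print(df.select_dtypes("number").skew())              # rough check for skew / normality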

The biggest concepts of statistics explained for you not to use


Demotivation

Let me lower your hopes from the beginning. You won’t be using the concepts from this article until you get a job in data science. You may practice them a few times on public datasets, and that’s it.

Then, why learn them? Well, I don’t want you to be like a newly hatched bird thrown from Everest when you do get a real job. So, what are we talking about?

We are talking about the icing on the cake🎂, the thing that ends any data-related project in business and scientific research — hypothesis testing. …

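For a taste of what that looks like in code, here is a minimal one-sample t-test sketch with SciPy (simulated data and a hypothetical null value, not an example from the article):

# One-sample t-test sketch: is the sample mean different from 100? (simulated data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=102, scale=10, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(t_stat, p_value)  # reject the null hypothesis if p_value < 0.05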

And how you can use it in practice right now


Introduction

You did your best and managed to measure the speed of sound in your room 50 times using an electronic device. Feeling very proud and excited, you look at your results:

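The measurements themselves are not shown in this preview, but a sketch of summarizing 50 such readings (simulated here around the textbook value of roughly 343 m/s) could look like:

# Summarizing 50 simulated speed-of-sound measurements (m/s)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
measurements = rng.normal(loc=343, scale=5, size=50)  # simulated, not real readings

mean = measurements.mean()
sem = stats.sem(measurements)                         # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(measurements) - 1, loc=mean, scale=sem)
print(mean, (ci_low, ci_high))                        # point estimate and 95% confidence interval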

Become ridiculously good at Normal Distribution


Introduction

You wake up a statistician in the middle of the night and ask them for the formula of the Normal Distribution. Half-asleep, half-dreaming, they will recite this formula to the letter:

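The formula itself does not survive in this preview; for reference, the probability density function of the Normal Distribution with mean μ and standard deviation σ is:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}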

Thankfully, there is the Poisson distribution for these cases


The Story

You have been freelancing for 10 years now. So far, your average annual income has been about $80,000. This year, you feel like you are stuck in a rut and decide to hit 6 figures. To do that, you want to start by calculating the probability of this exciting achievement happening, but you don’t know how to do so.

Turns out, you are not alone. …

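As a hint of where the article is heading, here is a minimal SciPy sketch of a Poisson tail probability; the rate and threshold are hypothetical, not the figures from the story:

# Poisson sketch: P(X >= 12) when the average count is 10 per year (hypothetical numbers)
from scipy import stats

mu = 10   # hypothetical average number of events per year
k = 12    # hypothetical target count

p = stats.poisson.sf(k - 1, mu)  # sf(k-1) gives P(X > k-1), i.e. P(X >= k)
print(p)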