Linear Regression, a.k.a. Ordinary Least Squares (OLS), is one of the simplest and most widely used ML algorithms. But it suffers from a fatal flaw: it is remarkably easy for the algorithm to overfit the training data.

For the simplest case, 2D data, everything just clicks visually: the line of best fit is the line that minimizes the sum of squared residuals (SSR).
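To make the idea concrete, here is a minimal sketch of fitting a best-fit line and computing its SSR with NumPy; the data points are made up for illustration:

```python
import numpy as np

# Toy 2D data; the numbers are illustrative, not from the article
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary least squares picks the slope a and intercept b that minimize
# SSR = sum((y_i - (a*x_i + b))**2)
a, b = np.polyfit(x, y, deg=1)
residuals = y - (a * x + b)
ssr = np.sum(residuals ** 2)
print(f"slope={a:.2f}, intercept={b:.2f}, SSR={ssr:.4f}")
```

Any other slope or intercept would give a larger SSR on these points; that is the whole definition of "best fit" here.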

Data cleaning and preparation is easily the most time-consuming and tedious task in machine learning. ML algorithms are really fussy: some want normalized or standardized features, some want encoded variables, and some want both. And then there is the ever-present issue of missing values.

Dealing with them is no fun at all, not to mention the added chore of repeating the same cleaning operations on the training, validation, and test sets. Fortunately, Scikit-learn's `Pipeline` is a major productivity tool that streamlines this process, cleaning up code and collapsing all preprocessing and modeling steps into…
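As a minimal sketch of the idea, a `Pipeline` chains imputation, scaling, and a model into one object, so the same preprocessing is applied identically wherever the pipeline is fit or used to predict; the tiny dataset and the choice of steps here are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with a deliberately missing value
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 240.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

# One object that runs impute -> scale -> model, in order
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill missing values
    ("scale", StandardScaler()),                 # standardize features
    ("model", LogisticRegression()),             # final estimator
])
pipe.fit(X, y)
print(pipe.predict(X))
```

Calling `pipe.predict` on new data reuses the imputer statistics and scaler parameters learned during `fit`, which is exactly what prevents train/test preprocessing mismatches.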

Raise your hand if kNN is the first algorithm you were introduced to in a machine learning course ✋🤚

The *k*-Nearest Neighbors algorithm is one of the most commonly used algorithms in machine learning. Because of its simplicity, many beginners start their wonderful journey into ML with this algorithm. It is also one of the few algorithms that can be used smoothly for both regression and classification.

So, what makes kNN so versatile and easy at the same time? The answer is hidden in how it works under the hood.

Imagine you have a variable with 2 categories which are visualized…
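A minimal sketch of those mechanics, with two made-up clusters standing in for the two categories: a query point is assigned the majority label among its *k* closest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two made-up clusters representing the two categories
X = np.array([[1, 1], [1, 2], [2, 1],      # category 0
              [6, 6], [6, 7], [7, 6]])     # category 1
y = np.array([0, 0, 0, 1, 1, 1])

# kNN simply stores the training points; prediction is a majority
# vote among the k nearest neighbors of each query point
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.5, 1.5], [6.5, 6.5]]))  # one query near each cluster
```

Swapping the classifier for `KNeighborsRegressor` averages the neighbors' targets instead of voting, which is why the same idea covers regression too.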

Arguably, the first function you learned from Scikit-learn is `train_test_split`. It performs the most basic yet crucial task: dividing the data into train and test sets. You fit the relevant model to the training set and test its accuracy on the test set. Sounds simple. But let me tell you, it is not simple. Not simple at all.

Scikit-learn provides a whopping 15 different functions for splitting your data, depending on the use case. Some of them you have never heard of; some are used daily. But don’t worry, we won’t cover all of them here. …
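Even the basic call has details worth getting right. A small sketch with made-up data, showing two of the parameters that trip people up, `stratify` and `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)     # 10 illustrative samples, 2 features
y = np.array([0] * 5 + [1] * 5)      # balanced binary labels

# stratify=y keeps the class ratio identical in both splits;
# random_state makes the split reproducible run to run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Without `stratify`, a small test set can easily end up with a badly skewed class balance, which silently distorts your accuracy numbers.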

Machine Learning has gained so much popularity that it has become one of those topics people assume you know everything about. It is now common to use fancy terms like training/fitting a model, train set, validating a model, a cost function, and many others without a second thought as to whether people actually understand them. Sometimes, it feels like you should have been born knowing these things…

Look at me too, rambling about machine learning without explaining what it is in the first place.

Today, I am here to break this knowledge bias by explaining some of the…

It is hard to get started when the blank void of a Jupyter notebook is staring at you. You have a dataset with hundreds of features and no idea where to begin. Your gut feeling tells you: “Normal, start with a feature that is normally distributed”. As always…

You dive head-first into the data, moving from feature to feature until you find yourself on a wild goose chase without any purpose whatsoever. So, why does this happen?

Well, to start, it means you don’t have a clear process. Many say Exploratory Data Analysis is the most important…
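To make "a clear process" concrete, a repeatable first pass might look something like this; the tiny DataFrame is a made-up stand-in for your real dataset:

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for a real dataset
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [40_000, 55_000, 82_000, 61_000, np.nan],
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# A fixed opening checklist instead of wandering feature to feature:
print(df.shape)          # how big is the problem
print(df.dtypes)         # which features are numeric vs categorical
print(df.isna().sum())   # where the missing values live
print(df.describe())     # distribution summary of the numeric features
```

The point is not these four calls specifically; it is that answering the same basic questions first, every time, keeps the exploration from turning aimless.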

Let me bring your hopes down from the start. You won’t be using the concepts from this article until you land a job in data science. You may practice them a few times on public datasets, and that’s it.

Then why learn them? Well, I don’t want you to be like a newly hatched bird thrown off Everest when you *do* get a real job. So, what are we talking about?

We are talking about the icing on the cake🎂, the thing that concludes any data-related project in business and scientific research: **hypothesis testing**. …

You did your best and managed to measure the speed of sound in your room 50 times using an electronic device. Feeling very proud and excited, you look at your results:
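In lieu of the actual readings, here is a sketch of the kind of hypothesis test such data invites, using synthetic measurements; the 343 m/s reference value is the textbook speed of sound, and the 2.01 critical value is the approximate two-sided 5% cutoff for a t distribution with 49 degrees of freedom:

```python
import random
import statistics

random.seed(42)
# 50 synthetic measurements standing in for the real readings (m/s)
measurements = [random.gauss(343.0, 1.5) for _ in range(50)]

# H0: the true mean of our measurements equals the textbook 343 m/s
mu0 = 343.0
n = len(measurements)
mean = statistics.mean(measurements)
sd = statistics.stdev(measurements)        # sample standard deviation
t_stat = (mean - mu0) / (sd / n ** 0.5)    # one-sample t statistic

# ~2.01 is the two-sided 5% critical value for df = n - 1 = 49
reject = abs(t_stat) > 2.01
print(f"t = {t_stat:.2f}, reject H0: {reject}")
```

The machinery is the same whether the question is about a physics measurement or an A/B test: a null hypothesis, a test statistic, and a threshold for deciding.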

You wake up a statistician in the middle of the night and ask them about the formula of the Normal Distribution. Half-asleep, half-dreaming, they will recite this formula to the letter:
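That formula is the probability density function of the Normal Distribution, where $\mu$ is the mean and $\sigma$ the standard deviation:

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
$$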

You have been freelancing for 10 years now. So far, your average annual income has been about $80,000. This year, you feel stuck in a rut and decide to hit six figures. To do that, you want to start by calculating the probability of this exciting achievement, but you don’t know how.

Turns out, you are not alone. …
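One way to frame the calculation, assuming annual income is roughly normally distributed around the $80,000 average; the $15,000 standard deviation below is a made-up figure, since the scenario only gives the mean:

```python
import math

def normal_sf(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

mean_income = 80_000    # from the scenario
sigma_income = 15_000   # hypothetical spread, not given in the text
p = normal_sf(100_000, mean_income, sigma_income)
print(f"P(income > $100k) \u2248 {p:.3f}")
```

With these numbers, hitting six figures is a roughly one-in-eleven event; a wider assumed spread would make it more likely, a narrower one less.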