XGBoost is a real beast.
It is a tree-based powerhouse behind the winning solutions of many tabular competitions and datathons. Currently, it is the “hottest” ML framework of the “sexiest” job in the world.
While basic modeling with XGBoost can be straightforward, you need to master the nitty-gritty to achieve maximum performance.
With that said, I present you with this article, which is the result of…
On Kaggle, everyone knows that to win a tabular competition, you need to out-feature-engineer everyone else. Almost anyone can perform awesome EDA, develop a validation strategy, and tune hyperparameters to squeeze out every bit of model performance.
The key to the top is always feature engineering, and it is not something taught in tutorials, books, or courses. It is all about creativity, experience, and domain knowledge.
With the addition of the time component, feature engineering becomes even more important in time-series forecasting challenges. This has been proven once again by the top players participating in this month’s (July) TPS Playground competition.
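To make that concrete, here is a hedged sketch of the kind of time-based feature engineering meant here: calendar attributes, lags, and trailing windows built with pandas (the column names, lags, and dates are invented for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative daily series; in a competition this would be the real target.
idx = pd.date_range("2021-07-01", periods=60, freq="D")
df = pd.DataFrame({"sales": np.arange(60, dtype=float)}, index=idx)

# Calendar features derived from the time index
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

# Lag and rolling-window features: past values the model can learn from
df["lag_7"] = df["sales"].shift(7)  # value one week ago
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()  # trailing weekly mean
```

Note the `shift(1)` before `rolling`: without it, the rolling mean would include the current day's value and leak the target into its own features.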
As noted in Forbes, more than 80% of data in organizations today is unstructured. Traditionally, companies have ignored this type of data because of the challenges they face analyzing it and generating meaningful insights. However, the landscape is rapidly changing as other types of storage systems are being invented, such as block-, file-, and object-based storage systems.
Among the three, object storage seems the most promising, as evidenced by the fact that Goliaths like Amazon, Google, and IBM already offer enterprise solutions for object-based data repositories.
Unlike ordinary machine learning problems, time series forecasting requires extra preprocessing steps.
On top of normality assumptions, most ML algorithms expect a static relationship between the input features and the output.
A static relationship requires inputs and outputs whose statistical properties, such as the mean and variance, stay constant over time. In other words, algorithms perform best when the inputs and outputs are stationary.
This is not the case in time series forecasting. Distributions that change over time can have unique properties such as seasonality and trend. …
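For instance, a trending series violates stationarity because its mean keeps drifting, and first-differencing is the standard remedy. A small numpy sketch (synthetic data, with an assumed linear trend):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
trend_series = 0.05 * np.arange(n) + rng.normal(size=n)  # mean drifts upward

# Non-stationary: the mean of the two halves differs markedly (~10 units here)
first, second = trend_series[: n // 2], trend_series[n // 2 :]

# First differences remove the linear trend, restoring a roughly constant mean
diffed = np.diff(trend_series)
d_first, d_second = diffed[: len(diffed) // 2], diffed[len(diffed) // 2 :]
```

After differencing, the two halves of `diffed` have nearly identical means, which is exactly the kind of preprocessing step forecasting pipelines add before modeling.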
No matter how powerful, machine learning cannot predict everything. A well-known area where it can be pretty helpless is time series forecasting.
Despite the availability of a large suite of autoregressive models and many other algorithms for time series, you cannot predict the target distribution if it is white noise or follows a random walk.
So, you must detect such distributions before you make further efforts.
In this article, you will learn what white noise and random walk are and explore proven statistical techniques to detect them.
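One such proven technique is the Ljung-Box test, which checks whether a series' autocorrelations are jointly zero, as white noise's should be. Below is a from-scratch sketch of the statistic (statsmodels also ships this test as `acorr_ljungbox`; the lag count and significance level here are assumptions):

```python
import numpy as np
from scipy.stats import chi2

def ljung_box_q(x, h=10):
    """Ljung-Box Q statistic over the first h sample autocorrelations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    # r[k-1] is the lag-k sample autocorrelation
    r = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, h + 1)))

rng = np.random.default_rng(42)
noise = rng.normal(size=500)  # white noise: no autocorrelation
walk = np.cumsum(noise)       # random walk: heavy autocorrelation

crit = chi2.ppf(0.95, df=10)       # 5% significance level, 10 lags
print(ljung_box_q(noise) > crit)   # typically False for white noise
print(ljung_box_q(walk) > crit)    # expect True for a random walk
```

If Q exceeds the chi-square critical value, the series is not white noise; a random walk fails the test spectacularly, while its first differences pass.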
This post is about the core processes that make up an in-depth time series analysis. Specifically, we will talk about:
and whether seasonality or trends among multiple series affect each other.
Most importantly, we will build some very cool visualizations, and this image should be…
Recently, the Optiver Realized Volatility Prediction competition was launched on Kaggle. As the name suggests, it is a time series forecasting challenge.
I wanted to participate, but it turned out my knowledge of time series didn't even begin to suffice for a competition of such magnitude. So, I accepted this as the ‘kick in the pants’ I needed to start paying serious attention to this large sphere of ML.
As the first step, I wanted to learn and teach every single Pandas function you can use to manipulate time-series data. …
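A taste of those functions, in a hedged sketch (the dates and values are invented for illustration): `resample`, `shift`, and `pct_change` are among the workhorses.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=14, freq="D")
ts = pd.Series(np.arange(1, 15, dtype=float), index=idx)

weekly = ts.resample("W").mean()  # downsample daily data to weekly averages
lagged = ts.shift(1)              # align each value with the previous day's
pct = ts.pct_change()             # day-over-day percentage change
```

Each of these respects the DatetimeIndex, so operations like lagging or downsampling stay calendar-aware rather than purely positional.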
Soccer fans have an annoying habit. Whenever a young but admittedly exceptional player emerges, they compare him to legends like Messi or Ronaldo.
They choose to forget that the legends had been dominating the game since before the newcomers had grown their adult teeth.
In the beginning, comparing Plotly to Matplotlib was similar, in a sense. Matplotlib had been in heavy use since 2003, while Plotly had just come out in 2014.
Many were bored with Matplotlib by this time, so Plotly was warmly welcomed for its freshness and interactivity. …
What can a $1 billion investment buy?
On Tuesday this week, OpenAI and GitHub answered this question boldly with the preview of a new AI tool: GitHub Copilot. It can write code tailored to the user and is much better at the task than its predecessor, GPT-3.
Copilot autocompletes code snippets, suggests new lines of code, and can even write whole functions based on the description provided. According to the GitHub blog, the tool is not just a language-generating algorithm that reacts to user input; it is a virtual pair programmer.
It learns and adapts to the user’s coding habits, analyzes…
For the past few years, there has been an explosion of interest in interactive plots. People were bored out of their minds with the old, static plots that had been around since before they were born. This was understandable because, unlike other aspects of data science, which change rapidly, there had not been many advancements in data visualization. Matplotlib, Seaborn, and ggplot had been dominating the game, and many people wanted a change.
Then, Plotly came.
In 2015, the widely adopted Plotly.js data visualization framework was open-sourced for Python and R, making Plotly the most downloaded graphing library in the…