EXTREME LEARNING MACHINES
ELM is a Chinese invention. Imagine a classic feed-forward neural network with one hidden layer, subtract backpropagation, and you have an ELM. The input-hidden weights are constant - they are apparently initialized analytically, so even though they are semi-random, the thing works. The model learns only the hidden-output weights, which amounts to …
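The idea above can be sketched in a few lines of NumPy - a minimal sketch, not the reference ELM implementation; here the hidden weights are simply drawn at random (rather than initialized analytically as the post describes), and the output weights are solved by least squares via the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y depends non-linearly on the first feature.
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

n_hidden = 50

# Input-to-hidden weights are fixed and never trained.
W = rng.normal(size=(5, n_hidden))
b = rng.normal(size=n_hidden)

# Hidden-layer activations.
H = np.tanh(X @ W + b)

# Only the hidden-to-output weights are learned,
# by least squares via the Moore-Penrose pseudoinverse.
beta = np.linalg.pinv(H) @ y

mse = float(np.mean((H @ beta - y) ** 2))
```

Because the only "training" is one linear solve, the model has the speed of a linear one while staying non-linear in the inputs.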
CONVERTING CATEGORICAL DATA INTO NUMBERS WITH PANDAS AND SCIKIT-LEARN
This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep …
PIPING IN R AND IN PANDAS
In the R community, there's this one guy, Hadley Wickham, who by himself made R great again. One of the many, many things he came up with - so many they call it a hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by allowing a user to take a data frame and apply to it a pipeline of operations resulting in a desired outcome (an example in just a minute).
CLASSIFYING TEXT WITH BAG-OF-WORDS: A TUTORIAL
WHAT YOU WANTED TO KNOW ABOUT MEAN AVERAGE PRECISION
First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean of Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP.
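The "sum APs and divide by the number of users" arithmetic can be sketched directly; `average_precision` and `mean_average_precision` are hypothetical helper names, and AP@k is computed in its common recommender-systems form (precision at each hit, normalized by min(#relevant, k)):

```python
def average_precision(recommended, relevant, k):
    """AP@k for one user: mean of precision@i taken at each hit."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(all_recs, all_relevant, k):
    """MAP@k: sum the per-user APs and divide by the number of users."""
    aps = [average_precision(recs, rel, k)
           for recs, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

# Two users, three recommendations each.
print(mean_average_precision([[1, 2, 3], [4, 5, 6]],
                             [{1, 3}, {6}], 3))  # 7/12 ≈ 0.583
```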
GOODBOOKS-10K: A NEW DATASET FOR BOOK RECOMMENDATIONS
CLASSIFIER CALIBRATION WITH PLATT'S SCALING AND ISOTONIC … SEE MORE ON FASTML.COM
A VERY FAST DENOISING AUTOENCODER
WHAT IS BETTER: GRADIENT-BOOSTED TREES, OR A RANDOM FOREST … SEE MORE ON FASTML.COM
FASTML
GOOGLE'S PRINCIPLES ON AI WEAPONS, MASS SURVEILLANCE, AND SIGNING OUT
2020-05-12. The most common form of cheating in first person shooter games is wall-hacking, or seeing enemy players through obstacles. We propose a solution to this problem building on a mechanism already used in some professional e-sports matches: taking random screenshots during gameplay. If a game takes screenshots and uploads them to …
POPULAR: TOP TEN MOST VIEWED PAGES, AS REPORTED BY GOOGLE
These are the pages that received the most unique pageviews ever, as of January 2017. Unique pageviews is the number of visits during which the specified page was viewed at least once. The order is basically the same as when counting pageviews, or hits. The numbers in parens show positions in the …
ADVERSARIAL VALIDATION, PART ONE
Many data science competitions suffer from a test set being markedly different from a training set (a violation of the "identically distributed" assumption). It is then difficult to make a representative validation set. We propose a method for selecting training examples most similar to test examples and …
HOW TO USE PD.GET_DUMMIES() WITH THE TEST SET
Two solutions come to mind. One is to pd.concat((train, test)), get_dummies(), and then split the set back. If column sets in train and test differ, you can extract and concatenate just the categorical columns to encode. Another way is to add the missing columns, filled …
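The first get_dummies() solution - concatenate, encode, split back - can be sketched like this; the toy frames are hypothetical, with `SF` appearing only in the test set to show why encoding separately would produce mismatched columns:

```python
import pandas as pd

# Hypothetical train/test frames with differing category sets.
train = pd.DataFrame({"city": ["NY", "LA"], "x": [1, 2]})
test = pd.DataFrame({"city": ["LA", "SF"], "x": [3, 4]})

# Encode both together so the dummy columns match, then split back.
both = pd.concat([train, test], ignore_index=True)
both = pd.get_dummies(both, columns=["city"])

train_enc = both.iloc[:len(train)].reset_index(drop=True)
test_enc = both.iloc[len(train):].reset_index(drop=True)

print(list(train_enc.columns))  # same columns in both frames
```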
TUNING HYPERPARAMS AUTOMATICALLY WITH SPEARMINT
Enter Spearmint, a piece of software to automatically tune hyperparams. Now you can see the promise it offers. We will concentrate on how to use it in practice, because the learning curve might be quite steep, even though the README is pretty good.
INTERACTIVE IN-BROWSER 3D VISUALIZATION OF DATASETS
In this post we'll be looking at 3D visualization of various datasets using the data-projector software from Datacratic. The original demo didn't impress us initially as much as it could, because the data there is synthetic - it shows a bunch of small spheres in rainbow colors.
KAGGLE JOB RECOMMENDATION CHALLENGE
2012-08-27. This is an introduction to the Kaggle job recommendation challenge. It looks a lot like a typical collaborative filtering thing (with a lot of extra information), but not quite. Spot these two big differences: there are no explicit ratings. Instead, there's info about which jobs a user applied to. This is known as one-class …
MICHAEL JORDAN ON DEEP LEARNING
2014-09-14. On September 10th Michael Jordan, a renowned statistician from Berkeley, did Ask Me Anything on Reddit. These are his thoughts on deep learning. My first and main reaction is that I'm totally happy that any area of machine learning (aka, statistical inference and decision-making; see my other post :-) is beginning to make an impact on real-world problems.
NUMERAI - LIKE KAGGLE, BUT WITH A CLEAN DATASET, TOP TEN
Numerai is an attempt at a hedge fund crowd-sourcing stock market predictions. It presents a Kaggle-like competition, but with a few welcome twists.
EXTREME LEARNING MACHINES
What do you get when you take backpropagation out of a multilayer perceptron? You get an extreme learning machine, a non-linear model with the speed of a linear one.
CONVERTING CATEGORICAL DATA INTO NUMBERS WITH PANDAS AND SCIKIT-LEARN
Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features.
INTERACTIVE IN-BROWSER 3D VISUALIZATION OF DATASETS
PIPING IN R AND IN PANDAS
HOW TO USE PD.GET_DUMMIES() WITH THE TEST SET
It turns out that Converting categorical data into numbers with Pandas and Scikit-learn has become the most popular article on this site. Let's revisit the topic and look at Pandas' get_dummies() more closely. Using the function is straightforward - you specify which columns you want encoded and get a dataframe with the original columns replaced with one-hot encodings.
WHAT YOU WANTED TO KNOW ABOUT MEAN AVERAGE PRECISION
Let's say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user.
CLASSIFYING TIME SERIES USING FEATURE EXTRACTION
When you want to classify a time series, there are two options. One is to use a time series specific method. An example would be LSTM, or a recurrent neural network in general.
CLASSIFIER CALIBRATION WITH PLATT'S SCALING AND ISOTONIC … SEE MORE ON FASTML.COM
ABOUT - FASTML
About. This site is brought to you by the letters "M" and "L". It is meant to tackle interesting topics in machine learning while being entertaining and easy to read and understand.
ADVERSARIAL VALIDATION, PART ONE
UPDATE: It's a-live! Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file. Let's see if validation scores translate into leaderboard scores, then.
INTERACTIVE IN-BROWSER 3D VISUALIZATION OF DATASETS
In this post we'll be looking at 3D visualization of various datasets using the data-projector software from Datacratic. The original demo didn't impress us initially as much as it could, because the data there is synthetic - it shows a bunch of small spheres in rainbow colors. Real datasets look better.
CLASSIFYING TEXT WITH BAG-OF-WORDS: A TUTORIAL
There is a Kaggle training competition where you attempt to classify text, specifically movie reviews. No other data - this is a perfect opportunity to do some experiments with text classification.
GO NON-LINEAR WITH VOWPAL WABBIT
Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are: a neural network with a single hidden layer; automatic creation of polynomial, specifically quadratic and cubic, features …
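A bag-of-words text classifier of the kind described above can be sketched with scikit-learn; the four-review corpus and labels are hypothetical stand-ins for the real movie-review data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini corpus standing in for movie reviews;
# 1 = positive, 0 = negative.
texts = ["great movie, loved it",
         "terrible film, boring",
         "loved the acting, great fun",
         "boring and terrible"]
labels = [1, 0, 1, 0]

# Bag-of-words: each document becomes a vector of word counts,
# discarding word order.
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["great acting"])))
```

A linear model on top of word counts is a standard, hard-to-beat baseline for this kind of task.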
CLASSIFYING TIME SERIES USING FEATURE EXTRACTION
2018-10-09. When you want to classify a time series, there are two options. One is to use a time series specific method. An example would be LSTM, or a recurrent neural network in general. The other one is to extract features from the series and use them with normal supervised learning.
ONE WEIRD REGULARITY OF THE STOCK MARKET
GOODBOOKS-10K: A NEW DATASET FOR BOOK RECOMMENDATIONS
LARGE SCALE L1 FEATURE SELECTION WITH VOWPAL WABBIT
IMPUTE MISSING VALUES WITH AMELIA
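The second option - extract features from each series, then use an ordinary supervised learner - can be sketched as follows; the data is synthetic (two classes that differ in noise level) and the four summary statistics are just illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 100 series of length 50; class 1 is noisier.
n = 100
y = rng.integers(0, 2, size=n)
series = rng.normal(scale=1 + y[:, None], size=(n, 50))

# Turn each series into a fixed-length feature vector.
def extract_features(s):
    return [s.mean(), s.std(), s.min(), s.max()]

X = np.array([extract_features(s) for s in series])

# Any normal supervised learner can now consume the features.
clf = LogisticRegression().fit(X, y)
print(round(clf.score(X, y), 2))
```

Once the series is reduced to fixed-length features, the time-series aspect disappears and the whole standard toolbox applies.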
REVISITING NUMERAI
In this article, we revisit Numerai and their weekly data science tournament. New developments include a much larger dataset, tougher requirements for models, and bigger payouts. Let's start with data. The training set has roughly half a million examples, each with 50 features. Then there's validation, test, and live.
ONE WEIRD REGULARITY OF THE STOCK MARKET
2018-12-11. Everybody has had the fantasy of predicting the stock market. We investigated the subject in Are stocks predictable?. In short, they are not, at least the prices. The next step would be to go from prices to volatility measures. The reason is that one can use the volatility to properly price …
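Going "from prices to volatility measures" can be sketched in its simplest form - annualized realized volatility of log returns; the price path here is a hypothetical random walk, not real market data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily prices: a geometric random walk,
# 252 trading days, 2% daily noise.
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.02, size=252)))

# Daily log returns, then annualized realized volatility.
returns = np.diff(np.log(prices))
vol = float(returns.std(ddof=1) * np.sqrt(252))
print(round(vol, 3))
```

Unlike prices themselves, volatility estimated this way is persistent over time, which is what makes it a more promising prediction target.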
INTERACTIVE IN-BROWSER 3D VISUALIZATION OF DATASETS
Interactive in-browser 3D visualization of datasets. In this post we'll be looking at 3D visualization of various datasets using the data-projector software from Datacratic. The original demo didn't initially impress us as much as it could have, because the data there is synthetic - it shows a bunch of small spheres in rainbow colors.

DEEP LEARNING MADE EASY
Deep learning made easy. 2013-05-01. As usual, there's an interesting competition at Kaggle: The Black Box. It's connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple of benchmarks for this competition, and the best one is unusually hard to beat.

MICHAEL JORDAN ON DEEP LEARNING
Michael Jordan on deep learning. 2014-09-14. On September 10th Michael Jordan, a renowned statistician from Berkeley, did an Ask Me Anything on Reddit. These are his thoughts on deep learning. My first and main reaction is that I'm totally happy that any area of machine learning (aka statistical inference and decision-making; see my other post :-) is beginning to make an impact on real-world problems.

KAGGLE JOB RECOMMENDATION CHALLENGE
2012-08-27. This is an introduction to the Kaggle job recommendation challenge. It looks a lot like a typical collaborative filtering task (with a lot of extra information), but not quite. Spot these two big differences: there are no explicit ratings. Instead, there's info about which jobs a user applied to. This is known as one-class

2020-05-12. The most common form of cheating in first-person shooter games is wall-hacking, or seeing enemy players through obstacles. We propose a solution to this problem building on a mechanism already used in some professional e-sports matches: taking random screenshots during gameplay. If a game takes screenshots and uploads them to "the
PIPING IN R AND IN PANDAS
In the R community, there's this one guy, Hadley Wickham, who by himself made R great again. One of the many, many things he came up with - so many that they call it the hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by allowing a user to take a data frame and apply to it a pipeline of operations resulting in a desired outcome (an example in just a minute).

CONVERTING CATEGORICAL DATA INTO NUMBERS WITH PANDAS AND SCIKIT-LEARN
This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep
HOW MUCH DATA IS ENOUGH?
A Reddit reader asked how much data is needed for a machine learning project to get meaningful results. Prof. Yaser Abu-Mostafa from Caltech answered this very question in his online course. The answer is that, as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model.

CLASSIFIER CALIBRATION WITH PLATT'S SCALING AND ISOTONIC REGRESSION
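The dplyr-style pipelines from the piping post have a pandas counterpart in method chaining: each step returns a new frame that the next step consumes. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
})

# filter -> mutate -> group_by -> summarise, dplyr-style
result = (
    df[df["value"] > 1]                        # keep rows with value > 1
      .assign(double=lambda d: d["value"] * 2)  # add a derived column
      .groupby("group", as_index=False)["double"]
      .sum()                                    # one row per group
)
```

Wrapping the chain in parentheses lets each operation sit on its own line, which reads much like a dplyr pipeline.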
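Classifying time series via extracted features, as described earlier, can be as simple as computing a few summary statistics per series and handing them to any standard classifier. A toy sketch (the feature set is illustrative, not from any specific post):

```python
import numpy as np

def extract_features(series):
    """Summary statistics usable with any ordinary classifier."""
    return np.array([
        series.mean(),
        series.std(),
        series.min(),
        series.max(),
        np.abs(np.diff(series)).mean(),  # average step size
    ])

rng = np.random.default_rng(0)
# Two toy classes: pure noise vs. a noisy sine wave
flat = [rng.normal(size=100) for _ in range(10)]
wavy = [np.sin(np.linspace(0, 20, 100)) + 0.1 * rng.normal(size=100)
        for _ in range(10)]

# One fixed-length feature vector per series, regardless of series length
X = np.array([extract_features(s) for s in flat + wavy])
y = np.array([0] * 10 + [1] * 10)
```

`X` and `y` can now go straight into any supervised learner; the noisy series show a visibly larger average step size than the sine waves.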
MATH FOR MACHINE LEARNING
2014-08-18. Sometimes people ask what math they need for machine learning. The answer depends on what you want to do, but in short, our opinion is that it is good to have some familiarity with linear algebra and multivariate differentiation. Linear algebra is a cornerstone because everything in machine learning is a vector or a matrix.
CLASSIFYING TEXT WITH BAG-OF-WORDS: A TUTORIAL
TF-IDF. TF-IDF stands for "term frequency / inverse document frequency" and is a method for emphasizing words that occur frequently in a given document, while at the same time de-emphasizing words that occur frequently in many documents. Our score with TfidfVectorizer and 20k features was 95.6%, a big improvement.

A VERY FAST DENOISING AUTOENCODER
A very fast denoising autoencoder. Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called marginalized Stacked Denoising Autoencoder, and the author claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders
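The TF-IDF weighting described in the bag-of-words excerpt is a single call in scikit-learn. A minimal sketch on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Words frequent in one document but rare across the corpus score highest;
# words appearing in many documents (like "the") get a low IDF weight
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse document-term matrix
```

Feeding `X` to a linear classifier is the classic bag-of-words baseline the tutorial describes.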
TUNING HYPERPARAMS FAST WITH HYPERBAND
Tuning hyperparams fast with Hyperband. Hyperband is a relatively new method for tuning iterative algorithms. It performs random sampling and attempts to gain an edge by using the time spent optimizing in the best way. We explain a few things that were not clear to us right away, and try the algorithm in
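Hyperband builds on successive halving: start many random configurations on a tiny budget, and repeatedly keep only the better half on a doubled budget. A toy sketch of that building block (the `evaluate` function here is a made-up stand-in for actual training, not part of Hyperband itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(config, budget):
    """Hypothetical stand-in for training `config` for `budget` iterations:
    loss shrinks toward a config-specific floor as budget grows."""
    return config + 1.0 / budget + 0.01 * rng.normal()

# Successive halving: many configs, small budget, keep the best half each round
configs = list(rng.uniform(0, 1, size=16))
budget = 1
while len(configs) > 1:
    losses = [evaluate(c, budget) for c in configs]
    order = np.argsort(losses)
    configs = [configs[i] for i in order[: len(configs) // 2]]
    budget *= 2

best = configs[0]
```

Hyperband's extra trick is to run several such brackets with different trade-offs between "many configs, small budget" and "few configs, large budget", hedging against either extreme being wrong.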
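The two solutions from the pd.get_dummies() excerpt above, sketched on a toy frame (the column names are made up):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue"]})
test = pd.DataFrame({"color": ["blue", "green"]})  # "green" unseen in train

# Option 1: encode train and test together, then split back
both = pd.concat([train, test], keys=["train", "test"])
dummies = pd.get_dummies(both)
train_d = dummies.loc["train"]
test_d = dummies.loc["test"]  # same columns as train_d, by construction

# Option 2: encode separately, then align test to train's columns -
# missing columns are filled with zeros, extra columns are dropped
train_d2 = pd.get_dummies(train)
test_d2 = pd.get_dummies(test).reindex(columns=train_d2.columns, fill_value=0)
```

Either way, train and test end up with identical column sets, which is what a model fitted on one and applied to the other requires.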
WHAT YOU WANTED TO KNOW ABOUT MEAN AVERAGE PRECISION
First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean of the Average Precision values, hence Mean Average Precision. If we have 1000 users, we sum the APs for each user and divide the sum by 1000. This is MAP.
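The definition above in code. Note this implements one common AP@k variant (normalizing by min(number of relevant items, k)); conventions vary between evaluation scripts:

```python
import numpy as np

def average_precision(recommended, relevant, k=10):
    """AP@k for one user: precision at each hit position, averaged."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)  # precision at this cutoff
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(rec_lists, rel_sets, k=10):
    # MAP is just the mean of the per-user APs
    return np.mean([average_precision(r, s, k)
                    for r, s in zip(rec_lists, rel_sets)])

rec = [["a", "b", "c"], ["x", "y", "z"]]
rel = [{"a", "c"}, {"z"}]
map_score = mean_average_precision(rec, rel, k=3)  # (5/6 + 1/3) / 2 = 7/12
```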
ABOUT - FASTML
About. This site is brought to you by the letters "M" and "L". It is meant to tackle interesting topics in machine learning while being entertaining and easy to read and understand. FastML probably grew out of a frustration with papers you need a PhD in math to understand, and with either no code or a half-baked Matlab implementation of
EVALUATING RECOMMENDER SYSTEMS
One of the primary decision factors here is the quality of recommendations. You estimate it through validation, and validation for recommender systems might be tricky. There are a few things to consider, including the formulation of the task, the form of available feedback, and the metric to optimize for. We address these issues and present an example.
RUNNING THINGS ON A GPU
Running Cudamat. You can compare CPU/GPU speed by running the RBM implementations provided for both in the example programs. On our setup, times per iteration are as follows: CPU: 160 seconds. GeForce 9600 GT: 27 seconds (six times faster). GeForce GTX 550 Ti: 8 seconds (20 times faster than the CPU).
GOODBOOKS-10K: A NEW DATASET FOR BOOK RECOMMENDATIONS
Goodbooks-10k: a new dataset for book recommendations. 2017-11-29. There have been a few recommendation datasets for movies (Netflix, Movielens) and music (Million Songs), but not for books. That is, until now. The dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). There are also:

WHAT IS BETTER: GRADIENT-BOOSTED TREES, OR A RANDOM FOREST
Folks know that gradient-boosted trees generally perform better than a random forest, although there is a price for that: GBT have a few hyperparams to tune, while a random forest is practically tuning-free. Let's look at what the literature says about how these two methods compare.
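A quick way to run the comparison yourself on synthetic data. This uses scikit-learn defaults only; as the excerpt notes, GBT usually needs hyperparameter tuning to show its edge, so treat this as a harness rather than a verdict:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# A synthetic binary classification task
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
gbt = GradientBoostingClassifier(random_state=0)  # defaults, untuned

# 5-fold cross-validated accuracy for each model
rf_acc = cross_val_score(rf, X, y, cv=5).mean()
gbt_acc = cross_val_score(gbt, X, y, cv=5).mean()
```

Swapping in a real dataset and a tuning loop for the GBT hyperparams turns this into the comparison the post discusses.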
DEEP LEARNING MADE EASY Deep learning made easy. 2013-05-01. As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple benchmarks for this competition and the best one is unusually hard to beat. FASTMLBACKGROUND IMAGESLINKSREVISITING NUMERAICONTENTSGOODBOOKS-10KABOUT 2020-05-12. The most common form of cheating in first person shooter games is wall-hacking, or seeing enemy players through obstacles. We propose a solution to this problem building on a mechanism already used in some professional e-sports matches: taking random screenshots during gameplay. If a game takes screenshots and uploads them to“the
EXTREME LEARNING MACHINES ELM is a Chinese invention. Imagine a classic feed-forward neural network with one hidden layer, subtract backpropagation and you have an ELM. The input-hidden weights are constant - they are apparently initialized analytically so even though they are semi-random the thing works. The model learns only the hidden-output weights, which amountsto
PIPING IN R AND IN PANDAS In R community, there’s this one guy, Hadley Wickam, who by himself made R great again. One of the many, many things he came up with - so many they call it a hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by allowing a user to take a data frame and apply to it a pipeline of operations resulting in a desired outcome (an example in just a minute). CLASSIFYING TIME SERIES USING FEATURE EXTRACTION Classifying time series using feature extraction. 2018-10-09. When you want to classify a time series, there are two options. One is to use a time series specific method. An example would be LSTM, or a recurrent neural network in general. The other one is to extract features from the series and use them with normal supervised learning. CLASSIFIER CALIBRATION WITH PLATT'S SCALING AND ISOTONICSEE MORE ONFASTML.COM
WHAT YOU WANTED TO KNOW ABOUT MEAN AVERAGE PRECISION First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean for Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This isMAP.
TUNING HYPERPARAMS AUTOMATICALLY WITH SPEARMINT ONE WEIRD REGULARITY OF THE STOCK MARKET KAGGLE JOB RECOMMENDATION CHALLENGE 2012-08-27. This is an introduction to Kaggle job recommendation challenge. It looks a lot like a typical collaborative filtering thing (with a lot of extra information), but not quite. Spot these two big differences: There are no explicit ratings. Instead, there’s info about which jobs user applied to. This is known as one-class MICHAEL JORDAN ON DEEP LEARNING Michael Jordan on deep learning. 2014-09-14. On September 10th Michael Jordan, a renowned statistician from Berkeley, did Ask Me Anything on Reddit. These are his thoughts on deep learning. My first and main reaction is that I’m totally happy that any area of machine learning (aka, statistical inference and decision-making; see my other post FASTMLBACKGROUND IMAGESLINKSREVISITING NUMERAICONTENTSGOODBOOKS-10KABOUT 2020-05-12. The most common form of cheating in first person shooter games is wall-hacking, or seeing enemy players through obstacles. We propose a solution to this problem building on a mechanism already used in some professional e-sports matches: taking random screenshots during gameplay. If a game takes screenshots and uploads them to“the
EXTREME LEARNING MACHINES ELM is a Chinese invention. Imagine a classic feed-forward neural network with one hidden layer, subtract backpropagation and you have an ELM. The input-hidden weights are constant - they are apparently initialized analytically so even though they are semi-random the thing works. The model learns only the hidden-output weights, which amountsto
PIPING IN R AND IN PANDAS In R community, there’s this one guy, Hadley Wickam, who by himself made R great again. One of the many, many things he came up with - so many they call it a hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by allowing a user to take a data frame and apply to it a pipeline of operations resulting in a desired outcome (an example in just a minute). CLASSIFYING TIME SERIES USING FEATURE EXTRACTION Classifying time series using feature extraction. 2018-10-09. When you want to classify a time series, there are two options. One is to use a time series specific method. An example would be LSTM, or a recurrent neural network in general. The other one is to extract features from the series and use them with normal supervised learning. CLASSIFIER CALIBRATION WITH PLATT'S SCALING AND ISOTONICSEE MORE ONFASTML.COM
WHAT YOU WANTED TO KNOW ABOUT MEAN AVERAGE PRECISION
First, we will get M out of the way. MAP is just an average of APs, or average precisions, for all users. In other words, we take the mean of Average Precision, hence Mean Average Precision. If we have 1000 users, we sum the APs for each user and divide the sum by 1000. This is MAP.
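The averaging described above can be sketched in plain Python. The helper names are ours; AP here is the common formulation of precision@k averaged over the relevant items.

```python
def average_precision(recommended, relevant):
    """AP for one user: precision@k at each hit, averaged over relevant items."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(recommended_per_user, relevant_per_user):
    """MAP: the plain mean of per-user APs."""
    aps = [average_precision(r, t)
           for r, t in zip(recommended_per_user, relevant_per_user)]
    return sum(aps) / len(aps)

# two users: a perfect ranking (AP = 1.0) and a half-right one (AP = 0.5)
score = mean_average_precision([["a", "b"], ["x", "y"]],
                               [["a", "b"], ["y"]])
print(score)  # → 0.75
```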
ABOUT - FASTML
About. This site is brought to you by the letters “M” and “L”. It is meant to tackle interesting topics in machine learning while being entertaining and easy to read and understand. FastML probably grew out of a frustration with papers you need a PhD in math to understand, and with either no code or a half-baked Matlab implementation of…
REVISITING NUMERAI
Revisiting Numerai. In this article, we revisit Numerai and their weekly data science tournament. New developments include a much larger dataset, tougher requirements for models, and bigger payouts. Let’s start with data. The training set has roughly half a million examples, each with 50 features. Then there’s validation, test, and live.

HOW TO USE PD.GET_DUMMIES() WITH THE TEST SET
Two solutions come to mind. One is to pd.concat((train, test)), get_dummies(), and then split the set back. If the column sets in train and test differ, you can extract and concatenate just the categorical columns to encode. Another way is to add the missing columns, filled with zeros, and delete any extra columns.

EVALUATING RECOMMENDER SYSTEMS
One of the primary decision factors here is the quality of recommendations. You estimate it through validation, and validation for recommender systems might be tricky. There are a few things to consider, including the formulation of the task, the form of available feedback, and a metric to optimize for. We address these issues and present an example.
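Both pd.get_dummies() fixes mentioned above can be sketched on toy data (the column name is hypothetical):

```python
import pandas as pd

# 'blue' never appears in train - exactly what breaks a naive
# per-frame get_dummies(): the two frames end up with different columns
train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["green", "blue"]})

# solution one: concat, encode, split back
both = pd.concat((train, test), ignore_index=True)
dummies = pd.get_dummies(both, columns=["color"])
train_d = dummies.iloc[:len(train)]
test_d = dummies.iloc[len(train):]

# solution two: encode separately, then align test to train's columns,
# filling missing ones with zeros and dropping any extras
train_d2 = pd.get_dummies(train)
test_d2 = pd.get_dummies(test).reindex(columns=train_d2.columns, fill_value=0)
```

Solution one guarantees identical columns by construction; solution two is handy when the test set only arrives later.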
RUNNING THINGS ON A GPU
Running Cudamat. You can compare CPU/GPU speed by running the RBM implementations for both provided in the example programs. On our setup, times per iteration are as follows: CPU: 160 seconds. GeForce 9600 GT: 27 seconds (six times faster). GeForce GTX 550 Ti: 8 seconds (20 times faster than the CPU).
GOODBOOKS-10K: A NEW DATASET FOR BOOK RECOMMENDATIONS
2017-11-29. There have been a few recommendation datasets for movies (Netflix, Movielens) and music (Million Songs), but not for books. That is, until now. The dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). There are also:

WHAT IS BETTER: GRADIENT-BOOSTED TREES, OR A RANDOM FOREST?
Folks know that gradient-boosted trees generally perform better than a random forest, although there is a price for that: GBT have a few hyperparams to tune, while a random forest is practically tuning-free. Let’s look at what the literature says about how these two methods compare.
DEEP LEARNING MADE EASY
2013-05-01. As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple of benchmarks for this competition and the best one is unusually hard to beat.

FASTML
MACHINE LEARNING MADE EASY | RSS
* Home
* Contents
* Popular
* Links
* Backgrounds
* About
ONE WEIRD REGULARITY OF THE STOCK MARKET
2018-12-11
Everybody has had the fantasy of predicting the stock market. We investigated the subject in Are stocks predictable?. In short, they are not, at least not the prices. The next step would be to go from prices to volatility measures. The reason is that one can use the volatility to properly price stock options using the Black-Scholes model. Wikipedia says that the formula has only one parameter that cannot be directly observed in the market: the average future volatility of the underlying asset. Therefore, the question is, can one predict that volatility?
Read on →
CLASSIFYING TIME SERIES USING FEATURE EXTRACTION
2018-10-09
When you want to classify a time series, there are two options. One is to use a time series specific method. An example would be LSTM, or a recurrent neural network in general. The other one is to extract features from the series and use them with normal supervised learning. In this article, we look at how to automatically extract relevant features with a Python package called tsfresh.
Read on →
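To make the second option concrete, here is a hand-rolled sketch of what feature extraction means here - not tsfresh itself, just the kind of per-series summary features it computes automatically (and then filters for relevance):

```python
import statistics

def extract_features(series):
    """Collapse one variable-length series into a fixed-length feature vector."""
    m = statistics.fmean(series)
    return {
        "mean": m,
        "std": statistics.pstdev(series),
        "min": min(series),
        "max": max(series),
        "n_above_mean": sum(v > m for v in series),
    }

# each series in a dataset gets one such vector, and then any ordinary
# classifier (logistic regression, gradient boosting...) takes over
features = extract_features([1.0, 2.0, 3.0, 4.0])
```

tsfresh does the same thing with hundreds of such features, which is why the automatic relevance filtering matters.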
GOOGLE’S PRINCIPLES ON AI WEAPONS, MASS SURVEILLANCE, AND SIGNING OUT
2018-07-02
In June Google published its “AI principles”, the post signed by the CEO himself. It talks about AI sensors for predicting the risk of wildfires. Of farmers using AI to monitor the health of their herds. Of doctors starting to use AI to help diagnose cancer and prevent blindness. Great stuff! We take a look at the context.
Read on →
HOW TO USE THE PYTHON DEBUGGER
2018-02-28
This article is not about machine learning, but about a piece of software engineering that often comes in handy in data science practice. When writing code, everybody gets errors. Sometimes it is difficult to debug them. Using a debugger may help, but can also be intimidating. This is a TLDR tutorial on using pdb in IPython, focused on looking at variables inside functions.
Read on →
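A minimal sketch of the idea (the function is a made-up example): drop a trace inside a function, and when execution pauses there you can print any local variable, step through lines, and continue.

```python
def buggy_mean(values):
    total = sum(values)
    # import pdb; pdb.set_trace()  # uncomment to pause here and inspect
    #                              # `total` and `values` interactively;
    #                              # since Python 3.7, breakpoint() does the same
    return total / len(values)

print(buggy_mean([1, 2, 3]))  # → 2.0
```

At the pdb prompt, `p total` prints a variable, `n` steps to the next line, and `c` continues; in IPython, `%debug` after an exception drops you into the same prompt post-mortem.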
PREPARING CONTINUOUS FEATURES FOR NEURAL NETWORKS WITH GAUSSRANK
2018-01-22
We present a novel method for feature transformation, akin to standardization. The method comes from Michael Jahrer, who recently won another competition and afterwards shared the approach he used.
Read on →
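For a taste, here is our reconstruction of the GaussRank idea as commonly described - not Jahrer's exact code: rank the values, squash the ranks into the open interval (0, 1), then map them through the inverse normal CDF so the result is approximately Gaussian regardless of the input distribution.

```python
from statistics import NormalDist

def gauss_rank(values, eps=1e-6):
    """Rank-based transform to an approximately Gaussian distribution."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = (r + 0.5) / len(values)          # strictly inside (0, 1)
    inv = NormalDist().inv_cdf                       # inverse normal CDF
    return [inv(min(1.0 - eps, max(eps, p))) for p in ranks]

# the order of the values is preserved; only their spacing changes
transformed = gauss_rank([10.0, -3.0, 0.5, 99.0])
```

Unlike standardization, this is insensitive to outliers: the 99.0 above ends up no further from the rest than a rank apart.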
TWO FACES OF OVERFITTING
2017-12-05
Overfitting is one of the primary problems, if not THE primary problem, in machine learning. There are many aspects to it, but in a general sense, overfitting means that estimates of performance on unseen test examples are overly optimistic. That is, a model generalizes worse than expected.
We explain two common cases of overfitting: including information from a test set in training, and the more insidious form: overusing a validation set.
Read on →
GOODBOOKS-10K: A NEW DATASET FOR BOOK RECOMMENDATIONS
2017-11-29
There have been a few recommendation datasets for movies (Netflix, Movielens) and music (Million Songs), but not for books. That is, until now.
Read on →
REVISITING NUMERAI
2017-10-17
In this article, we revisit Numerai and their weekly data science tournament. New developments include a much larger dataset, tougher requirements for models, and bigger payouts.
Read on →
IT’S EMBARRASSING, REALLY
2017-09-18
In August, we published the first version of goodbooks-10k, a new dataset for book recommendations. By pure chance, that coincided with the proclamation of the Kaggle Datasets Awards. Oh, how we hoped to get one!
Read on →
INTRODUCTION TO POINTER NETWORKS
2017-07-03
Pointer networks are a variation of the sequence-to-sequence model with attention. Instead of translating one sequence into another, they yield a succession of pointers to the elements of the input series. The most basic use of this is ordering the elements of a variable-length sequence or set.
Read on →
← Older Contents
RECENT POSTS
* One weird regularity of the stock market
* Classifying time series using feature extraction
* Google’s principles on AI weapons, mass surveillance, and signing out
* How to use the Python debugger
* Preparing continuous features for neural networks with GaussRank
* Two faces of overfitting
* Goodbooks-10k: a new dataset for book recommendations

Follow @fastml
Also check out @fastml_extra for things related to machine learning and data science in general.

GITHUB
Most articles come with some code. We push it to Github:
https://github.com/zygmuntz
Copyright © 2019 - Zygmunt Z. - Powered by Octopress