
SHIRIN'S PLAYGROUND

DESEQ2 COURSE WORK

AUTOENCODERS AND ANOMALY DETECTION WITH MACHINE LEARNING

Autoencoders and anomaly detection with machine learning in fraud analytics. 01 May 2017. All my previous posts on machine learning have dealt with supervised learning. But we can also use machine learning for unsupervised learning. The latter is used, e.g., for clustering and (non-linear) dimensionality reduction.

DEALING WITH UNBALANCED DATA IN MACHINE LEARNING

SHIRING.GITHUB.IO

NETWORK ANALYSIS OF GAME OF THRONES FAMILY TIES

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones. Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.

PLOTTING TREES FROM RANDOM FOREST MODELS WITH GGRAPH

Preparing the data and modeling. The data set I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository. The first data set looks at the predictor classes:

FEATURE SELECTION IN MACHINE LEARNING (BREAST CANCER DATASETS)

Machine learning uses so-called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model

HOW TO BUILD A SHINY APP FOR DISEASE- & TRAIT-ASSOCIATED

This app is based on the gwascat R package and its ebicat38 database and shows trait-associated SNP locations of the human genome. You can visualize and compare the genomic locations of up to 8 traits simultaneously. The National Human Genome Research Institute (NHGRI) catalog of Genome-Wide Association Studies (GWAS) is a curated resource of

R VS PYTHON


ABOUT ME

Welcome to my page! I’m Shirin, a biologist turned bioinformatician turned data scientist. I’m especially interested in machine learning and data visualization.

DATA SCIENCE FOR BUSINESS

Training and test data. My input data is the tibble retail_p_day, which was created in my last post. I am splitting this dataset into training (all data points before/on Nov. 1st 2011) and test samples (all data points after Nov. 1st 2011).

DATA ON TOUR: PLOTTING 3D MAPS AND LOCATION TRACKS

Hiking tracks. The hiking tracks we followed came mostly from a German hiking guidebook, the Rother Wanderführer, 7th edition from 2016. They were in standard .gpx format and could be read with readGPX(). Only one of our hikes did not come from this book, but from Wikiloc. It could be treated the same way as the other hiking tracks, though, so I combined all hiking tracks.

CHARACTERIZING TWITTER FOLLOWERS WITH TIDYTEXT

Now, we can access information from Twitter, like timeline tweets, user timelines, mentions, tweets & retweets, followers, etc. All the following datasets were retrieved on June 7th 2017, converted to a data frame for tidy analysis and saved for later use:


HOW TO MAP YOUR GOOGLE LOCATION HISTORY WITH R

##     timestampMs latitudeE7 longitudeE7 accuracy activitys
## 1 1482393378938  519601402    76004708       29 NULL
## 2 1482393333953  519601402    76004708       29 NULL
## 3 1482393033893  519603616    76002628       20 1482393165600, still, 100
## 4 1482392814435  519603684    76001572       20 1482392817678, still, 100
## 5 1482392734911  519603684    76001572       20 NULL
## 6
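The raw columns in the output above are not directly readable: timestampMs holds milliseconds since the Unix epoch, and latitudeE7 / longitudeE7 hold decimal degrees multiplied by 10^7. The following is a minimal sketch of that conversion, written in Python for illustration rather than in the R used by the post itself. The function names (e7_to_degrees, ms_to_utc, convert) are hypothetical, and only the first three rows of the output are used.

```python
from datetime import datetime, timedelta, timezone

# First three rows of the output above: (timestampMs, latitudeE7, longitudeE7).
ROWS = [
    (1482393378938, 519601402, 76004708),
    (1482393333953, 519601402, 76004708),
    (1482393033893, 519603616, 76002628),
]


def e7_to_degrees(value_e7):
    """Convert an E7-scaled coordinate (degrees * 10^7) to decimal degrees."""
    return value_e7 / 1e7


def ms_to_utc(timestamp_ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    seconds, millis = divmod(timestamp_ms, 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base + timedelta(milliseconds=millis)


def convert(rows):
    """Return (datetime, latitude, longitude) tuples for the raw rows."""
    return [
        (ms_to_utc(ts), e7_to_degrees(lat), e7_to_degrees(lon))
        for ts, lat, lon in rows
    ]


if __name__ == "__main__":
    for when, lat, lon in convert(ROWS):
        print(when.isoformat(), round(lat, 7), round(lon, 7))
```

Splitting the timestamp with divmod keeps the conversion in integer arithmetic until the final timedelta, which avoids floating-point rounding of the millisecond part.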

SOCIAL NETWORK ANALYSIS AND TOPIC MODELING OF CODECENTRIC

I have written the following post about Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers for codecentric’s blog: Recently, Matthias Radtke has written a very nice blog post on Topic Modeling of the codecentric Blog Articles, where he is giving a comprehensive introduction to Topic Modeling.

CAN WE PREDICT FLU DEATHS WITH MACHINE LEARNING AND R?

Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, one of which is from the 2013 outbreak of influenza A H7N9 in China, as


CONDITIONAL GGPLOT2 GEOMS IN FUNCTIONS (QTL PLOTS)

The first example uses the hyper data set and builds a simple QTL model with three modeling functions: the EM algorithm, Haley-Knott regression and multiple imputation. The genome-wide LOD threshold is calculated with permutation. Feeding this LOD threshold into the summary output gives us the markers with a significant phenotype association (i.e. the QTL).

18 Dec 2016 » How to build a Shiny app for disease- & trait-associated locations of the human genome. This app is based on the gwascat R package and its ebicat38 database and shows trait-associated SNP locations of the human genome. You can visualize and compare the genomic locations of up to 8 traits simultaneously.

CATEGORIES - GITHUB PAGES

Dealing with unbalanced data in machine learning. Building meaningful machine learning models for disease prediction. Plotting trees from Random Forest models with ggraph. Hyper-parameter Tuning with Grid Search for Deep Learning. Building deep neural nets with h2o and rsparkling that predict arrhythmia of

EXPLORING THE HUMAN GENOME (PART 1)

The narrow traditional definition of a gene is that it is a hereditary unit of information, which meant that it is a unit of DNA that encodes the production of a protein. The Human Genome Project has estimated that the human genome comprises 20,000 to 25,000 genes. However, if we take the definition of gene more liberally, we could also

MIGRATING FROM GITHUB TO GITLAB WITH RSTUDIO (TUTORIAL)

GitHub vs. GitLab. Git is a distributed implementation of version control. Many people have written very eloquently about why it is a good idea to use version control, not only if you collaborate in a team but also if you work on your own; one example is this article from RStudio’s Support pages. In short, its main feature is that version control allows you to keep track of the changes you

DATA SCIENCE FOR FRAUD DETECTION

I have written the following post about Data Science for Fraud Detection at my company codecentric’s blog: Fraud can be defined as “the crime of getting money by deceiving people” (Cambridge Dictionary); it is as old as humanity: whenever two parties exchange goods or conduct business there is the potential for one party scamming the other.

HYPER-PARAMETER TUNING WITH GRID SEARCH FOR DEEP LEARNING

Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy. Keep in mind, though, that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the

EXPLORE PREDICTIVE MAINTENANCE WITH FLEXDASHBOARD

I have written the following post about Predictive Maintenance and flexdashboard at my company codecentric’s blog: Predictive Maintenance is an increasingly popular strategy associated with Industry 4.0; it uses advanced analytics and machine learning to optimize machine costs and output (see Google Trends plot below).

EXTREME GRADIENT BOOSTING AND PREPROCESSING IN MACHINE

In last week’s post I explored whether machine learning models can be applied to predict flu deaths from the 2013 outbreak of influenza A H7N9 in China. There, I compared random forests, elastic-net regularized generalized linear models, k-nearest neighbors, penalized discriminant analysis, stabilized linear discriminant analysis, nearest shrunken centroids, single C5.0 tree and partial


DESEQ2 COURSE WORK

DESeq2 Course Work. 29 September 2016. The following workflow has been designed as teaching instructions for an introductory course to RNA-seq data analysis with DESeq2. The course is designed for PhD students and will be given at the University of Münster from 10th to 21st of October 2016. For questions or other comments, please

EXPRANALYSIS PACKAGE

exprAnalysis package. I created the R package exprAnalysis designed to streamline my RNA-seq data analysis pipeline. Below you find the vignette for installation and usage of the package. This package combines functions from various packages used to analyze and visualize expression data from NGS or expression chips.

R VS PYTHON

I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative. That’s why I wanted to see how R and Python fare in a one-on-one comparison of an analysis that’s representative of what I would typically work with.

BUILDING DEEP NEURAL NETS WITH H2O AND RSPARKLING THAT

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O can be integrated with Apache Spark (Sparkling Water) and therefore allows the implementation of complex or big models in a fast and scalable manner. H2O distributes a wide range of common machine learning

BUILDING MEANINGFUL MACHINE LEARNING MODELS FOR DISEASE

Webinar for the ISDS R Group. This document presents the code I used to produce the example analysis and figures shown in my webinar on building meaningful machine learning models for disease prediction.

SCRATCHING THE SURFACE OF GENDER BIASES

The world map. The map has been downloaded from the Natural Earth Data website. The country borders were reduced by 200 meters with ArcGIS Pro, so that clicking within any country on the map would show the corresponding country’s border as the nearest point. ArcGIS Pro was also used to convert the map to Mercator projection. The changed shapefiles can be downloaded from my Github

ARCHIVE - SHIRING.GITHUB.IO

May 28, 2017 » Data Science for Business - Time Series Forecasting Part 1: EDA & Data Preparation.

May 20, 2017 » New R Users group in Münster!

May 15, 2017 » Network analysis of Game of Thrones family ties.

May 2, 2017 » Update to autoencoders and anomaly detection with machine learning in

EXPLAINING COMPLEX MACHINE LEARNING MODELS WITH LIME

How LIME works

1. Permutation of each test case to explain

2. Complex model predicts all permuted test cases

3. Distance between permutations and original test case is



SHIRIN'S PLAYGROUND EXPLORING AND PLAYING WITH DATA IN R

*

02 NOV 2017 » EXPLORE PREDICTIVE MAINTENANCE WITH FLEXDASHBOARD

Shirin Glander

I have written the following post about Predictive Maintenance and flexdashboard at my company codecentric’s blog:

Continue reading...

*

28 SEP 2017 » BLOCKCHAIN & DISTRIBUTED ML - MY REPORT FROM THE DATA2DAY CONFERENCE

Shirin Glander

Continue reading...

*

20 SEP 2017 » FROM BIOLOGY TO INDUSTRY. A BLOGGER’S JOURNEY TO DATA SCIENCE.

Shirin Glander

Today, I have given a webinar for the Applied Epidemiology Didactic of the University of Wisconsin - Madison titled “From Biology to Industry. A Blogger’s Journey to Data Science.”

Continue reading...

*

19 SEP 2017 » WHY I USE R FOR DATA SCIENCE - AN ODE TO R

Shirin Glander

I have written a blog post about why I love R and prefer it to other languages. The post is on my new site, but since it isn’t on R-bloggers yet, I am also posting the link here:

Continue reading...

*

14 SEP 2017 » MOVING MY BLOG TO BLOGDOWN

Shirin Glander

It’s been a long time coming, but I finally moved my blog from Jekyll/Bootstrap on GitHub Pages to blogdown, Hugo and Netlify! I also now have my own domain name, www.shirin-glander.de. :-)

Continue reading...

*

06 SEP 2017 » DATA SCIENCE FOR FRAUD DETECTION

Shirin Glander

I have written the following post about Data Science for Fraud Detection at my company codecentric’s blog:

Continue reading...

*

04 SEP 2017 » MIGRATING FROM GITHUB TO GITLAB WITH RSTUDIO (TUTORIAL)

Shirin Glander

GITHUB VS. GITLAB

Continue reading...

*

28 JUL 2017 » SOCIAL NETWORK ANALYSIS AND TOPIC MODELING OF CODECENTRIC’S TWITTER FRIENDS AND FOLLOWERS

Shirin Glander

I have written the following post about Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers for codecentric’s blog:

Continue reading...

*

17 JUL 2017 » HOW TO DO OPTICAL CHARACTER RECOGNITION (OCR) OF NON-ENGLISH DOCUMENTS IN R USING TESSERACT?

Shirin Glander

One of the many great packages of rOpenSci has implemented the open source engine Tesseract.

Continue reading...

*

28 JUN 2017 » CHARACTERIZING TWITTER FOLLOWERS WITH TIDYTEXT

Shirin Glander

Lately, I have been more and more taken with tidy principles of data analysis. They are elegant and make analyses clearer and easier to comprehend. Following the tidyverse and ggraph, I have been quite intrigued by applying tidy principles to text analysis with Julia Silge and David Robinson’s tidytext.

Continue reading...

*

13 JUN 2017 » DATA SCIENCE FOR BUSINESS - TIME SERIES FORECASTING PART 3: FORECASTING WITH FACEBOOK'S PROPHET

Shirin Glander

In my last two posts (Part 1 and Part 2), I explored time series forecasting with the timekit package.

Continue reading...

*

09 JUN 2017 » DATA SCIENCE FOR BUSINESS - TIME SERIES FORECASTING PART 2: FORECASTING WITH TIMEKIT

Shirin Glander

In my last post, I prepared and visually explored time series data.

Continue reading...

*

28 MAY 2017 » DATA SCIENCE FOR BUSINESS - TIME SERIES FORECASTING PART 1: EDA & DATA PREPARATION

Shirin Glander

Data Science is a fairly broad term and encompasses a wide range of techniques from data visualization to statistics and machine learning models. But the techniques are only tools in a - sometimes very messy - toolbox. And while it is important to know and understand these tools, here, I want to go at it from a different angle: What is the task at hand that data science tools can help tackle, and what question do we want to have answered?

Continue reading...

*

20 MAY 2017 » NEW R USERS GROUP IN MÜNSTER!

Shirin Glander

This is to announce that Münster now has its very own R users group!

Continue reading...

*

15 MAY 2017 » NETWORK ANALYSIS OF GAME OF THRONES FAMILY TIES

Shirin Glander

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

Continue reading...

*

02 MAY 2017 » UPDATE TO AUTOENCODERS AND ANOMALY DETECTION WITH MACHINE LEARNING IN FRAUD ANALYTICS

Shirin Glander

This is a reply to Wojciech Indyk’s comment on yesterday’s post on autoencoders and anomaly detection with machine learning in fraud analytics:

Continue reading...

*

01 MAY 2017 » AUTOENCODERS AND ANOMALY DETECTION WITH MACHINE LEARNING IN FRAUD ANALYTICS

Shirin Glander

All my previous posts on machine learning have dealt with supervised learning. But we can also use machine learning for unsupervised learning. The latter are e.g. used for clustering and (non-linear) dimensionality reduction.

Continue reading...

*

23 APR 2017 » DOES MONEY BUY HAPPINESS AFTER ALL? MACHINE LEARNING WITH ONE RULE

Shirin Glander

This week, I am exploring Holger K. von Jouanne-Diedrich’s OneR package for machine learning. I am running an example analysis on world happiness data and comparing the results with other machine learning models (decision trees, random forest, gradient boosting trees and neural nets).

Continue reading...

*

23 APR 2017 » EXPLAINING COMPLEX MACHINE LEARNING MODELS WITH LIME

Shirin Glander

The classification decisions made by machine learning models are usually difficult - if not impossible - to understand by our human brains. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well - often with better results than achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists.

Continue reading...

*

16 APR 2017 » HAPPY EASTER: PLOTTING HARE POPULATIONS IN GERMANY

Shirin Glander

For Easter, I wanted to have a look at the number of hares in Germany. Wild hare populations have been rapidly declining over the last 10 years but during the last three years they have at least been stable.

Continue reading...

*

09 APR 2017 » DATA ON TOUR: PLOTTING 3D MAPS AND LOCATION TRACKS

Dr. Shirin Glander

Recently, I was on Gran Canaria for a vacation. So, what better way to keep up the holiday spirit a while longer than to visualize all the places we went in R!?

Continue reading...

*

02 APR 2017 » DEALING WITH UNBALANCED DATA IN MACHINE LEARNING

Shirin Glander

In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data.

Continue reading...

*

31 MAR 2017 » BUILDING MEANINGFUL MACHINE LEARNING MODELS FOR DISEASE PREDICTION

Shirin Glander

WEBINAR FOR THE ISDS R GROUP

Continue reading...

*

16 MAR 2017 » PLOTTING TREES FROM RANDOM FOREST MODELS WITH GGRAPH

Shirin Glander

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.

Continue reading...

*

07 MAR 2017 » HYPER-PARAMETER TUNING WITH GRID SEARCH FOR DEEP LEARNING

Shirin Glander

Last week I showed how to build a deep neural network with H2O and RSPARKLING. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.

Continue reading...

*

27 FEB 2017 » BUILDING DEEP NEURAL NETS WITH H2O AND RSPARKLING THAT PREDICT ARRHYTHMIA OF THE HEART

Shirin Glander

Last week, I introduced how to run machine learning applications on Spark from within R, using the SPARKLYR package. This week, I am showing how to build feed-forward deep neural networks or multilayer perceptrons. The models in this example are built to classify ECG data as coming either from _healthy_ hearts or from someone suffering from _arrhythmia_. I will show how to prepare a dataset for modeling, how to set weights and other modeling parameters and, finally, how to evaluate model performance with the H2O package via RSPARKLING.

Continue reading...

*

19 FEB 2017 » PREDICTING FOOD PREFERENCES WITH SPARKLYR (MACHINE LEARNING)

Shirin Glander

This week I want to show how to run machine learning applications on a Spark cluster. I am using the SPARKLYR package, which provides a handy interface to access Apache Spark functionalities via R.

Continue reading...

*

12 FEB 2017 » CONDITIONAL GGPLOT2 GEOMS IN FUNCTIONS (QTL PLOTS)

Shirin Glander

When running an analysis, I am usually combining functions from multiple packages. Most of these packages come with their own plotting functions. And while they are certainly convenient in that they allow me to get a quick glance at the data or the output, they all have their own style. If I want to prepare a report, proposal or a paper though, I want all my plots to come from a single cast so that they give a consistent feel to the story I want to tell with my data.

Continue reading...

*

06 FEB 2017 » SCRATCHING THE SURFACE OF GENDER BIASES

Shirin Glander

Today, I want to share my analysis of the World Gender Statistics dataset.

Continue reading...

*

30 JAN 2017 » NEW FEATURES IN WORLD GENDER STATISTICS APP

Shirin Glander

In my last post, I built a shiny app to explore World Gender Statistics.

Continue reading...

*

29 JAN 2017 » EXPLORING WORLD GENDER STATISTICS WITH SHINY

Shirin Glander

This week I explored the World Gender Statistics dataset. You can look at 160 measurements over 56 years with my Shiny app here.

Continue reading...

*

22 JAN 2017 » R VS PYTHON - A ONE-ON-ONE COMPARISON

Shirin Glander

I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative.

Continue reading...

*

15 JAN 2017 » FEATURE SELECTION IN MACHINE LEARNING (BREAST CANCER DATASETS)

Shirin Glander

Machine learning uses so-called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means they need less time and computing power to run and are easier to understand.

Continue reading...

*

05 JAN 2017 » GENE HOMOLOGY PART 3 - VISUALIZING GENE ONTOLOGY OF CONSERVED GENES

Shirin Glander

WHICH GENES HAVE HOMOLOGS IN MANY SPECIES?

Continue reading...

*

30 DEC 2016 » HOW TO MAP YOUR GOOGLE LOCATION HISTORY WITH R

Shirin Glander

It’s no secret that Google Big Brothers most of us. But at least they allow us to access quite a lot of the data they have collected on us. Among this is the Google location history.

Continue reading...

*

22 DEC 2016 » ANIMATING PLOTS OF BEER INGREDIENTS AND SIN TAXES OVER TIME

Shirin Glander

With the upcoming holidays, I thought it fitting to finally explore the ttbbeer package. It contains data on beer ingredients used in US breweries from 2006 to 2015 and on the (sin) tax rates for beer, champagne, distilled spirits, wine and various tobacco items since 1862.

Continue reading...

*

18 DEC 2016 » HOW TO BUILD A SHINY APP FOR DISEASE- & TRAIT-ASSOCIATED LOCATIONS OF THE HUMAN GENOME

Shirin Glander

This app is based on the gwascat R package and its _ebicat38_ database and shows trait-associated SNP locations of the human genome. You can visualize and compare the genomic locations of up to 8 traits simultaneously.

Continue reading...

*

14 DEC 2016 » GENE HOMOLOGY PART 2 - CREATING DIRECTED NETWORKS WITH IGRAPH

Shirin Glander

In my last post I created a gene homology network for human genes. In this post I want to extend the network to include edges for other species.

Continue reading...

*

11 DEC 2016 » CREATING A NETWORK OF HUMAN GENE HOMOLOGY WITH R AND D3

Shirin Glander

EDITED ON 20 DECEMBER 2016

Continue reading...

*

04 DEC 2016 » HOW TO SET UP YOUR OWN R BLOG WITH GITHUB PAGES AND JEKYLL BOOTSTRAP

Shirin Glander

THIS POST IS IN REPLY TO A REQUEST: HOW DID I SET UP THIS R BLOG?

Continue reading...

*

02 DEC 2016 » EXTREME GRADIENT BOOSTING AND PREPROCESSING IN MACHINE LEARNING - ADDENDUM TO PREDICTING FLU OUTCOME WITH R

Shirin Glander

In last week’s post I explored whether machine learning models can be applied to predict flu deaths from the 2013 outbreak of influenza A H7N9 in China. There, I compared random forests, elastic-net regularized generalized linear models, k-nearest neighbors, penalized discriminant analysis, stabilized linear discriminant analysis, nearest shrunken centroids, single C5.0 tree and partial least squares.

Continue reading...

*

27 NOV 2016 » CAN WE PREDICT FLU DEATHS WITH MACHINE LEARNING AND R?

Shirin Glander

EDITED ON 26 DECEMBER 2016

Continue reading...

*

20 NOV 2016 » ANALYSING THE GILMORE GIRLS' COFFEE ADDICTION WITH R

Shirin Glander

Last week’s post showed how to create a Gilmore Girls character network.

Continue reading...

*

13 NOV 2016 » CREATING A GILMORE GIRLS CHARACTER NETWORK WITH R

Shirin Glander

With the impending (and by many - including me - much awaited) Gilmore Girls Revival, I wanted to take a somewhat different look at our beloved characters from Stars Hollow.

Continue reading...

*

06 NOV 2016 » IS 'YEAH' JOSH AND CHUCK'S FAVORITE WORD?

Shirin Glander

TEXT MINING AND SENTIMENT ANALYSIS OF A STUFF YOU SHOULD KNOW PODCAST

Continue reading...

*

01 NOV 2016 » EXPLORING THE HUMAN GENOME (PART 2) - TRANSCRIPTS

Shirin Glander

HOW MANY TRANSCRIPTS AND PROTEINS DO GENES HAVE?

Continue reading...

*

23 OCT 2016 » EXPLORING THE HUMAN GENOME (PART 1) - GENE ANNOTATIONS

Shirin Glander

When working with any type of genome data, we often look for annotation information about genes, e.g. what’s the gene’s full name, what’s its abbreviated symbol, what ID it has in other databases, what functions have been described, how many and which transcripts exist, etc.

Continue reading...

*

16 OCT 2016 » USA/ CANADA ROADTRIP 2016

Shirin Glander

MAPPING GPS DATA FROM OUR USA/ CANADA ROADTRIP

Continue reading...

*

29 SEP 2016 » DESEQ2 COURSE WORK

Shirin Glander

-------------------------

Continue reading...

*

28 SEP 2016 » EXPRANALYSIS PACKAGE

Shirin Glander

I created the R package EXPRANALYSIS designed to streamline my RNA-seq data analysis pipeline. Below you find the vignette for installation and usage of the package.

Continue reading...


------------------------- Also check out R-bloggers for lots of cool R stuff!

© 2019 Shirin Elsinghorst with help from Jekyll Bootstrap and Bootstrap
