R For Machine Learning
Your one-stop shop for machine learning algorithms. Each algorithm comes with a short description and links to examples.
R and Python are both open-source programming languages with large communities. New libraries and tools are added continuously to their respective catalogs. R is mainly used for statistical analysis, while Python provides a more general approach to data science.
R and Python are state of the art among programming languages oriented towards data science. Learning both is, of course, the ideal solution, but each requires a time investment, and not everyone has that luxury. Python is a general-purpose language with a readable syntax. R, however, is built by statisticians and reflects their specific vocabulary.
R
Academics and statisticians have developed R over two decades. R now has one of the richest ecosystems for data analysis. There are around 12,000 packages available on CRAN (its open-source repository). It is possible to find a library for almost any analysis you want to perform. This rich variety of libraries makes R the first choice for statistical analysis, especially for specialized analytical work.
The key difference between R and other statistical products is the output. R has fantastic tools to communicate results. RStudio comes with the knitr library, written by Yihui Xie, which makes reporting trivial and elegant. Communicating findings with a presentation or a document is easy.
Python
Python can do much the same tasks as R: data wrangling, engineering, feature selection, web scraping, app development and so on. Python is a tool to deploy and implement machine learning at large scale. Python code is generally easier to maintain and more robust than R code. Years ago, Python didn't have many data analysis and machine learning libraries. Recently, Python has been catching up and now provides cutting-edge APIs for machine learning and artificial intelligence. Most data science work can be done with five Python libraries: NumPy, pandas, SciPy, scikit-learn and seaborn.
Python, on the other hand, makes replicability and accessibility easier than R. In fact, if you need to use the results of your analysis in an application or website, Python is the best choice.
Popularity index
The IEEE Spectrum ranking is a metric that quantifies the popularity of programming languages. The left column shows the ranking in 2017 and the right column the ranking in 2016. In 2017, Python took first place, up from third a year before. R is in 6th place.
Job Opportunity
The picture below shows the number of jobs related to data science by programming language. SQL is far ahead, followed by Python and Java. R ranks 5th.
If we focus on the long-term trend between Python (in yellow) and R (in blue), we can see that Python is mentioned in job descriptions more often than R.
Analysis done by R and Python
However, if we look at data analysis jobs, R is by far the best tool.
Percentage of people switching
There are two key points in the picture below.
- Python users are more loyal than R users
- The percentage of R users switching to Python is twice as large as the percentage of Python users switching to R.
Difference between R and Python
Parameter | R | Python |
---|---|---|
Objective | Data analysis and statistics | Deployment and production |
Primary users | Scholars and R&D | Programmers and developers |
Flexibility | Easy to use available libraries | Easy to construct new models from scratch, e.g., matrix computation and optimization |
Learning curve | Difficult at the beginning | Linear and smooth |
Popularity of programming language (percentage change) | 4.23% in 2018 | 21.69% in 2018 |
Average salary | $99,000 | $100,000 |
Integration | Runs locally | Well integrated with apps |
Task | Easy to get primary results | Good for deploying algorithms |
Database size | Handles huge sizes | Handles huge sizes |
IDE | RStudio | Spyder, IPython Notebook |
Important packages and libraries | tidyverse, ggplot2, caret, zoo | pandas, scipy, scikit-learn, TensorFlow |
Disadvantages | Slow; steep learning curve; dependencies between libraries | Not as many libraries as R |
Advantages | | |
R or Python Usage
Python was developed by Guido van Rossum, a programmer, circa 1991. Python has influential libraries for math, statistics and artificial intelligence. You can think of Python as a pure player in machine learning. However, Python is not entirely mature (yet) for econometrics and communication. Python is the best tool for machine learning integration and deployment, but not for business analytics.
The good news is that R is developed by academics and scientists. It is designed to address problems in statistics, machine learning, and data science. R is the right tool for data science because of its powerful communication libraries. Besides, R is equipped with many packages for time series analysis, panel data and data mining. On top of that, for these tasks there are no better tools than R.
In our opinion, if you are a beginner in data science with the necessary statistical foundation, you need to ask yourself the following two questions:
- Do I want to learn how the algorithm works?
- Do I want to deploy the model?
If your answer to both questions is yes, you should probably begin by learning Python. On the one hand, Python includes great libraries for manipulating matrices and coding algorithms. As a beginner, it might be easier to learn how to build a model from scratch and then switch to the functions from the machine learning libraries. On the other hand, if you already know the algorithms or want to go into data analysis right away, then both R and Python are fine to begin with. R has an advantage if you are going to focus on statistical methods.
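To make "building a model from scratch" concrete, here is a minimal sketch of a one-variable linear regression fitted by batch gradient descent in plain Python, with no libraries. The function name `fit_linear`, the toy data, the learning rate and the number of epochs are all illustrative assumptions, not something prescribed by this article.

```python
# Minimal sketch: fit y = w*x + b by batch gradient descent, no libraries.
# fit_linear, the data, lr and epochs are illustrative assumptions.

def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Return (w, b) that approximately minimize mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1
w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))  # should be close to slope 2 and intercept 1
```

Once the mechanics are clear, the same fit would normally be delegated to a library routine (for example a linear model in scikit-learn, or `lm()` in R) rather than hand-written loops.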
Secondly, if you want to do more than statistics, say deployment and reproducibility, Python is the better choice. R is more suitable for your work if you need to write a report or create a dashboard.
In a nutshell, the statistical gap between R and Python is narrowing. Most of the work can be done in either language. Choose the one that suits your needs, but also consider the tool your colleagues are using. It is better when all of you speak the same language. Once you know your first programming language, learning the second one is simpler.
Conclusion
In the end, the choice between R or Python depends on:
- The objectives of your mission: Statistical analysis or deployment
- The amount of time you can invest
- The tool most used in your company or industry
- Neural Networks and Deep Learning : Single-hidden-layer neural networks are implemented in package nnet (shipped with base R). Package RSNNS offers an interface to the Stuttgart Neural Network Simulator (SNNS). Packages implementing deep learning flavours of neural networks include deepnet (feed-forward neural network, restricted Boltzmann machine, deep belief network, stacked autoencoders), RcppDL (denoising autoencoder, stacked denoising autoencoder, restricted Boltzmann machine, deep belief network) and h2o (feed-forward neural network, deep autoencoders). An interface to TensorFlow is available in package tensorflow.
- Recursive Partitioning : Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book, are implemented in rpart (shipped with base R) and tree. Package rpart is recommended for computing CART-like trees. A rich toolbox of partitioning algorithms is available in Weka; package RWeka provides an interface to this implementation, including the J4.8 variant of C4.5 and M5. The Cubist package fits rule-based models (similar to trees) with linear regression models in the terminal leaves, instance-based corrections and boosting. The C50 package can fit C5.0 classification trees, rule-based models, and boosted versions of these.
Two recursive partitioning algorithms with unbiased variable selection and statistical stopping criterion are implemented in packages party and partykit. Function ctree() is based on non-parametric conditional inference procedures for testing independence between response and each input variable, whereas mob() can be used to partition parametric models. Extensible tools for visualizing binary trees and node distributions of the response are available in packages party and partykit as well.
Graphical tools for the visualization of trees are available in package maptree.
Trees for modelling longitudinal data by means of random effects are offered by package REEMtree. Partitioning of mixture models is performed by RPMM.
Computational infrastructure for representing trees and unified methods for prediction and visualization is implemented in partykit. This infrastructure is used by package evtree to implement evolutionary learning of globally optimal trees. Survival trees are available in various packages; LTRCtrees allows for left-truncation and interval-censoring in addition to right-censoring.
- Random Forests : The reference implementation of the random forest algorithm for regression and classification is available in package randomForest. Package ipred has bagging for regression, classification and survival analysis as well as bundling, a combination of multiple models via ensemble learning. In addition, a random forest variant for response variables measured at arbitrary scales based on conditional inference trees is implemented in package party. randomForestSRC implements a unified treatment of Breiman's random forests for survival, regression and classification problems. Quantile regression forests in quantregForest allow quantiles of a numeric response to be regressed on explanatory variables via a random forest approach. The varSelRF and Boruta packages focus on variable selection by means of random forest algorithms. In addition, packages ranger and Rborist offer R interfaces to fast C++ implementations of random forests. Reinforcement Learning Trees, featuring splits in variables which will be important down the tree, are implemented in package RLT. wsrf implements an alternative variable weighting method for variable subspace selection in place of the traditional random variable sampling. Package RGF is an interface to a Python implementation of a procedure called regularized greedy forests. Random forests for parametric models, including forests for the estimation of predictive distributions, are available in packages trtf (predictive transformation forests, possibly under censoring and truncation) and grf (an implementation of generalised random forests).
- Regularized and Shrinkage Methods : Regression models with some constraint on the parameter estimates can be fitted with the lasso2 and lars packages. Lasso with simultaneous updates for groups of parameters (groupwise lasso) is available in package grplasso; the grpreg package implements a number of other group penalization models, such as group MCP and group SCAD. The L1 regularization path for generalized linear models and Cox models can be obtained from functions available in package glmpath; the entire lasso or elastic-net regularization path (also in elasticnet) for linear regression, logistic and multinomial regression models can be obtained from package glmnet. The penalized package provides an alternative implementation of lasso (L1) and ridge (L2) penalized regression models (both GLM and Cox models). Package biglasso fits Gaussian and logistic linear models under L1 penalty when the data can't be stored in RAM. Package RXshrink can be used to identify and display TRACEs for a specified shrinkage path and to determine the appropriate extent of shrinkage. Semiparametric additive hazards models under lasso penalties are offered by package ahaz. A generalisation of the Lasso shrinkage technique for linear regression is called relaxed lasso and is available in package relaxo. Fisher's LDA projection with an optional LASSO penalty to produce sparse solutions is implemented in package penalizedLDA. The shrunken centroids classifier and utilities for gene expression analyses are implemented in package pamr. An implementation of multivariate adaptive regression splines is available in package earth. Various forms of penalized discriminant analysis are implemented in packages hda and sda. Package LiblineaR offers an interface to the LIBLINEAR library. The ncvreg package fits linear and logistic regression models under the SCAD and MCP regression penalties using a coordinate descent algorithm. The same penalties are also implemented in the picasso package.
An implementation of bundle methods for regularized risk minimization is available from package bmrm. The Lasso under non-Gaussian and heteroscedastic errors is estimated by hdm; inference on low-dimensional components of Lasso regression and of estimated treatment effects in a high-dimensional setting is also included. Package SIS implements sure independence screening in generalised linear and Cox models. Normal and binary logistic linear models under various
- Boosting and Gradient Descent : Various forms of gradient boosting are implemented in package gbm (tree-based functional gradient descent boosting). Package xgboost implements tree-based boosting using efficient trees as base learners for several built-in and also user-defined objective functions. The Hinge-loss is optimized by the boosting implementation in package bst. Package GAMBoost can be used to fit generalized additive models by a boosting algorithm. An extensible boosting framework for generalized linear, additive and nonparametric models is available in package mboost. Likelihood-based boosting for Cox models is implemented in CoxBoost and for mixed models in GMMBoost. GAMLSS models can be fitted using boosting by gamboostLSS. An implementation of various learning algorithms based on Gradient Descent for dealing with regression tasks is available in package gradDescent.
- Support Vector Machines and Kernel Methods : The function svm() from e1071 offers an interface to the LIBSVM library and package kernlab implements a flexible framework for kernel learning (including SVMs, RVMs and other kernel learning algorithms). An interface to the SVMlight implementation (only for one-against-all classification) is provided in package klaR. The relevant dimension in kernel feature spaces can be estimated using rdetools, which also offers procedures for model selection and prediction.
- Bayesian Methods : Bayesian Additive Regression Trees (BART), where the final model is defined in terms of the sum over many weak learners (not unlike ensemble methods), are implemented in packages BayesTree, BART, and bartMachine. Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes including Bayesian CART and treed linear models are made available by package tgp. Bayesian structure learning in undirected graphical models for multivariate continuous, discrete, and mixed data is implemented in package BDgraph; corresponding methods relying on spike-and-slab priors are available from package ssgraph. Naive Bayes classifiers are available in naivebayes.
- Optimization using Genetic Algorithms : Package rgenoud offers optimization routines based on genetic algorithms. The package Rmalschains implements memetic algorithms with local search chains, which are a special type of evolutionary algorithms, combining a steady state genetic algorithm with local search for real-valued parameter optimization.
- Association Rules : Package arules provides both data structures for efficient handling of sparse binary data as well as interfaces to implementations of Apriori and Eclat for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules. Package opusminer provides an interface to the OPUS Miner algorithm (implemented in C++) for finding the key associations in transaction data efficiently, in the form of self-sufficient itemsets, using either leverage or lift.
- Fuzzy Rule-based Systems : Package frbs implements a host of standard methods for learning fuzzy rule-based systems from data for regression and classification. Package RoughSets provides comprehensive implementations of the rough set theory (RST) and the fuzzy rough set theory (FRST) in a single package.
- Model selection and validation : Package e1071 has function tune() for hyperparameter tuning and function errorest() (ipred) can be used for error rate estimation. The cost parameter C for support vector machines can be chosen utilizing the functionality of package svmpath. Functions for ROC analysis and other visualisation techniques for comparing candidate classifiers are available from package ROCR. Packages hdi and stabs implement stability selection for a range of models; hdi also offers other inference procedures in high-dimensional models.
- Other procedures : Evidential classifiers quantify the uncertainty about the class of a test pattern using a Dempster-Shafer mass function in package evclass. The OneR (One Rule) package offers a classification algorithm with enhancements for sophisticated handling of missing values and numeric data together with extensive diagnostic functions.
- Meta packages : Package caret provides miscellaneous functions for building predictive models, including parameter tuning and variable importance measures. The package can be used with various parallel implementations (e.g. MPI, NWS etc). In a similar spirit, package mlr3 offers a high-level interface to various statistical and machine learning packages. Package SuperLearner implements a similar toolbox. The h2o package implements a general purpose machine learning platform that has scalable implementations of many popular algorithms such as random forest, GBM, GLM (with elastic net regularization), and deep learning (feedforward multilayer networks), among others.
- GUI : rattle is a graphical user interface for data mining in R.
- Visualisation (initially contributed by Brandon Greenwell) : The stats::termplot() function can be used to plot the terms in a model whose predict method supports type='terms'. The effects package provides graphical and tabular effect displays for models with a linear predictor (e.g., linear and generalized linear models). Friedman's partial dependence plots (PDPs), which are low-dimensional graphical renderings of the prediction function, are implemented in a few packages. gbm, randomForest and randomForestSRC provide their own functions for displaying PDPs, but are limited to the models fit with those packages (the function partialPlot from randomForest is more limited since it only allows for one predictor at a time). Packages pdp, plotmo, and ICEbox are more general and allow for the creation of PDPs for a wide variety of machine learning models (e.g., random forests, support vector machines, etc.); both pdp and plotmo support multivariate displays (plotmo is limited to two predictors while pdp uses trellis graphics to display PDPs involving three predictors). By default, plotmo fixes the background variables at their medians (or first level for factors), which is faster than constructing PDPs but incorporates less information. ICEbox focuses on constructing individual conditional expectation (ICE) curves, a refinement over Friedman's PDPs. ICE curves, as well as centered ICE curves, can also be constructed with the partial() function from the pdp package. ggRandomForests provides ggplot2-based tools for the graphical exploration of random forest models (e.g., variable importance plots and PDPs) from the randomForest and randomForestSRC packages.
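The partial dependence and ICE ideas described in the visualisation entry can be sketched in a few lines of plain Python. This is only an illustration of the underlying computation, not how pdp, plotmo or ICEbox are implemented; the names `ice_curves`, `partial_dependence` and `toy_model`, and the toy data, are hypothetical.

```python
# Sketch of ICE curves and Friedman's partial dependence for one feature.
# toy_model and the data are hypothetical stand-ins; any fitted model that
# maps a row of features to a prediction could be plugged in instead.

def ice_curves(predict, rows, feature, grid):
    """One curve per row: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value  # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def partial_dependence(predict, rows, feature, grid):
    """Average the ICE curves pointwise to obtain the partial dependence."""
    curves = ice_curves(predict, rows, feature, grid)
    return [sum(c[i] for c in curves) / len(curves) for i in range(len(grid))]

def toy_model(row):
    # Hypothetical model with an interaction between features 0 and 1.
    return 3.0 * row[0] + row[0] * row[1]

rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]
ice = ice_curves(toy_model, rows, feature=0, grid=grid)
pdp = partial_dependence(toy_model, rows, feature=0, grid=grid)
print(ice)
print(pdp)
```

Because the toy model contains an interaction, the individual ICE curves have different slopes while the averaged curve hides that variation, which is the motivation for looking at ICE curves in addition to the PDP.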