What is R?

Joachim Zuckarelli, April 24, 2018

Big data - the increasingly ubiquitous availability of large and constantly growing amounts of data - has been discussed as a technical, economic and social phenomenon for years. The project of extracting valuable information from big data, taken on by disciplines such as "data science" and "analytics", requires powerful statistical tools. The same applies to the methodically demanding analysis of comparatively small amounts of data, for example in the "classic" academic field.

A statistical tool that has become more and more important in recent years is the programming language R. Unlike Python, which is also widely used in data science, R is a language developed specifically for statistical applications. Its core strengths lie in the statistical analysis and visualization of data.

R, which was developed in 1992 by Ross Ihaka and Robert Gentleman in Auckland (and to whose first names the language probably owes its name), is distributed as open source software under the GNU General Public License by the Vienna-based R Foundation for Statistical Computing. While the core of R is developed further by the R (Development) Core Team (from which the foundation also emerged), the real strength of R lies in the availability of additional functionality in the form of so-called packages. Independent developers all over the world offer packages for a wide variety of purposes, from classic regression to machine learning. More than 12,000 of these packages, estimated to contain more than 220,000 functions [1], are available free of charge via the Comprehensive R Archive Network (CRAN) and several other hubs (including the extensive package collection of the Bioconductor project, aimed especially at bioinformatics [2]). There is hardly a statistical problem, however rare the niche method it may require, for which a suitable solution does not already exist in R - ready to download and use, and, thanks to the open source license, adaptable to one's own needs at any time.

So it is no wonder that R has now clearly outperformed commercial competitors such as SAS, SPSS, Stata and Mathematica in popularity among data science professionals [3]. Applicants with R skills are also in great demand on the job market, as shown, for example, by analyses of LinkedIn profiles and of job advertisements for data scientists on large online job boards [4].

So there are good reasons to take a closer look at R. This article provides a brief, application-oriented and introductory overview of the R programming language in its two main domains, the analysis and visualization of data.

R and its ecosystem

Unlike some of its commercial competitors (such as SPSS), R does not come with a polished graphical user interface that allows statistical functions to be called without any programming knowledge. After downloading and installing R (available for Windows, Linux and macOS from one of the CRAN mirror servers [5]), the rather unappealing RGui offers only a few very rudimentary functions for editing and executing R code and for installing packages. Convenient features such as syntax highlighting are nowhere to be found.

In view of the rather sparse development environment that ships with R, a number of other providers have closed the gap and developed more convenient R editors. Examples are the R Commander and RStudio [6].

The R Commander is particularly interesting because it is itself an R package and can therefore be obtained from CRAN, under the name Rcmdr, and started from the RGui. It allows many statistical operations to be invoked through menus; the menu actions are translated into R code and executed. This approach can be particularly interesting for beginners who want to see in a practical way how certain operations are implemented in R.

RStudio is a powerful IDE provided by the company of the same name free of charge (and, for commercial use, with collaboration tools, server integration and better support, for a fee). The flagship product from RStudio is not only a very popular development environment for R, but also a good example of how functioning business models can be built in an open source software environment. Companies like RStudio that have an interest in promoting the development, dissemination and use of R (especially in business environments) have joined forces in the R Consortium to jointly finance and advance projects that serve these goals. The members of the consortium include Microsoft, IBM and Oracle.

If you don't want to use a dedicated R development environment such as RStudio, you can of course edit R code quite comfortably in Notepad++, Sublime Text or Microsoft's free Visual Studio Code.

If you are looking for help and technical information when developing with R, you will often find what you need in the mandatory documentation of the R packages, which is often very detailed and always contains executable code examples. The so-called CRAN Task Views offer a good overview of the relevant R packages and their specifics for a certain area (for example finance, machine learning or time series) [7]. The open access journal The R Journal and the Journal of Statistical Software (not limited to R) report on new developments relating to R, including the publication of important packages [8]. The popular Internet communities, such as the blog aggregator R-bloggers or StackOverflow, where the share of R-related threads has increased significantly in recent years, are of course also good starting points [9]. There, R, although it is a special-purpose language, will probably be among the most frequently discussed programming languages this year [4].

The language R - all objects or what?

This section introduces the most important concepts of the R language. The focus is on topics that are particularly relevant for practical work with R; in return, a broader discussion of advanced topics, such as environments (R's clever namespace concept), is omitted at this point.

Execution Modes

R is an interpreted language by default, but a bytecode compiler also exists [10]. Unlike in most other programming languages (but similar to other statistical languages such as Stata), it can make sense in R to execute instructions not only in the form of programs, i.e. in script mode, but also in interactive mode, i.e. by entering individual instructions at the prompt, shown in R as >. In this way, for example, the data can be "interrogated" live, or one can experiment to find the optimal settings for a graphic. The most recently executed instructions can be viewed at any time using the function history().

While R code files are plain text files, R offers the possibility of saving R objects (for example data sets or the details of statistical estimation models) in the binary .RData format with the help of the function save(). With save.image() the entire R workspace can be backed up. This way you can resume work at a later point in time at the exact point where you left off.
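A minimal sketch of this save/load workflow (the object and file names are made up for illustration):

```r
# Create a couple of objects and save them to a binary .RData file
x <- c(1, 2, 3)
note <- "toy example"
backup <- file.path(tempdir(), "work.RData")  # illustrative file name
save(x, note, file = backup)

# Simulate a fresh session by removing the objects ...
rm(x, note)

# ... and restore them from disk
load(backup)
x  # the vector is back: 1 2 3
```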

A relatively new way of running R code is in the form of server-side apps, using the package shiny, which is made available free of charge by RStudio [11]. It allows web-based applications to be built from input and output elements that are completely controlled by R code. These apps can then be run either on one's own Shiny-capable Linux server (a corresponding server is available in a basic version as an open source solution) or via the RStudio platform shinyapps.io (free of charge for up to 5 apps with a total of no more than 25 hours of computing time per month). In this way, R programs that can be parameterized using standard input elements such as sliders and text boxes can be made accessible to users via the web, without the users having to install R themselves or understand R code.
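As a sketch (assuming the shiny package is installed; the control names and titles are invented for illustration), a minimal Shiny app might look like this:

```r
library(shiny)

# User interface: one input element (a slider) and one output element (a plot)
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Number of classes:", min = 5, max = 50, value = 10),
  plotOutput("hist")
)

# Server logic: plain R code that reacts to the input elements
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(500), breaks = input$bins,
         col = "deepskyblue1", border = "white")
  })
}

# shinyApp(ui, server)  # uncomment to start the app locally
```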

Object-based versus object-oriented

In R, everything revolves around objects. As John Chambers, developer of the R predecessor language S, put it with regard to R: "Everything that exists is an object. Everything that happens is a function call."

Objects are not just variables, but also functions, operators and the entire expressions that make up an R script. However, R is not a fully object-oriented language in the classical sense, like Java. You can work wonderfully with R without understanding what concepts like classes, inheritance, encapsulation and polymorphism mean.

R supports several approaches to object orientation of varying strictness, the consideration of which is beyond the scope of this article. For the connoisseurs among the readers, the informative paper by John Chambers on the object-oriented and functional programming paradigms in R, from which the memorable quote above is also taken, is recommended [12].

Understanding the syntax

R programs consist of expressions, for example assignments or function calls. An expression ends with the end of the line. Only if there is more than one expression in a line do the expressions have to be separated from each other by semicolons.

Comments are introduced with # and always extend to the end of the line; R does not have multi-line comments.

R is case-sensitive. Object names (for example of functions and variables) may contain not only alphanumeric characters but also periods and underscores. Object names with an underscore are rather rare, however; more often, the period is used to structure object identifiers.

Code blocks (e.g. in loops or functions) are enclosed in curly braces as in C. In general, developers with C experience find their way around R easily, since the syntax (for example for conditions) is partly structured analogously to C. The logical and comparison operators known from C, such as &&, || and !=, are also available.

For example, Google’s R Code Style Guide [13] provides guidelines for the understandable design of code.

Variables in R

Variables in R do not have to be declared; they are created dynamically at the moment of assignment. Variables are accessed using their name, which in R is referred to as a symbol. Values are assigned using the assignment operator <-. In the following example, the variable x is assigned the value 5; more precisely, an object with the value 5 is created that can be accessed via the symbol x:

x <- 5

The equals sign can also be used as an assignment operator. The arrow operator has the advantage, however, that it shows the direction of the assignment. Object and assigned value can therefore also be swapped: 5 -> x is likewise a valid assignment. The assignment operator is actually a function (and that in turn is an object, because everything is an object). Remember John Chambers, quoted at the beginning of this section: "Everything that happens is a function call". The function call `<-`(x, 5) has the same effect as the assignment above and is the operation that the R interpreter performs when our assignment is executed.
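This can be tried out directly at the prompt (the variable names are arbitrary):

```r
x <- 5          # an ordinary assignment
`<-`(y, x + 1)  # the same operation written as an explicit function call
y               # y now contains 6
```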

The most important elementary data types in R are:

  • integer (integers),
  • double (floating-point numbers),
  • logical (logical values, TRUE and FALSE) and
  • character (strings).

Values of the same type can be combined into so-called vectors with the help of the function c():

> v <- c(1, 2, 3, 5)
> v
[1] 1 2 3 5

When the variable name is entered at the prompt (>), R shows the content of the variable (more precisely: the content of the object that the entered symbol refers to). The elements of a vector can be accessed by indexing in square brackets, with indexing starting at 1:

> v[4]
[1] 5

The number in square brackets at the beginning of a line of R output is the index of the first element displayed in that line (useful information when, for example, long vectors are output).

More complex data structures that can also contain data of different types are lists (type list) and data frames (type data.frame). In the following example, two objects are created, a vector of strings and a vector with a single number as its only element, and then combined into a list:

> x1 <- c("Waldi", "Hasso")
> x2 <- c(42.5)
> my.list <- list(dogs = x1, shoulder.height = x2)
> my.list
$dogs
[1] "Waldi" "Hasso"

$shoulder.height
[1] 42.5

> my.list$dogs[2]
[1] "Hasso"

When calling the function list(), names for the elements can optionally be specified. The elements can then be addressed by these names using the notation list$elementname. Alternatively, an element can be addressed as list[[index]].

While lists are often used by statistical functions to return multiple values, data frames are the workhorse of statistics in R because they represent data tables. One way of creating a data frame is to combine several vectors, which (unlike the elements of a list) must necessarily be of equal length:

name <- c("Katharina", "Peter", "Sophie", "Anna", "Joachim")
gender <- c("w", "m", "w", "w", "m")
age <- c(18, 25, 28, 22, 37)
friends <- data.frame(name, gender, age)

To look at the content of the data frame, it is sufficient to enter its name at the prompt. Alternatively, the data can be displayed in a spreadsheet-like view in the R development environment with the function View().

The columns of a data frame can be accessed using the notation dataframe$column, for example friends$gender. If an individual data element of a data frame is to be addressed via indices, a notation of the form dataframe[row, column] is possible, for example friends[1,2], which returns the value of the second variable (= column) in the first data record (= row), in this example w, Katharina's gender. Indices can also be omitted, so that, for example, friends[, 3] addresses the third column of the data frame, i.e. the variable age, as a vector.

In addition to the elementary data types and the vectors, lists and data frames, there are a number of other object types in R. Examples are factors, i.e. categorical variables with a predefined set of possible values such as school grades or hair colors, or, for functions (which are also objects), the types special and closure, depending on whether the function is built into R (such as the function `<-` that implements the assignment operator) or not. We will take a closer look at custom functions below.

R is weakly typed: as seen in the examples above, the object type does not have to be specified when initializing variables (there is no declaration in which the type could be specified anyway). Instead, R determines the type itself, using a mechanism known as coercion. Coercion tries to ensure that data is of the type required for the particular operation being performed. For example, if numbers and text are combined in a vector, this vector is automatically given the type character, so that both kinds of data can be stored. Put simply, R chooses the lowest common (type) denominator.
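Coercion can be observed directly at the prompt, for example:

```r
v <- c(1, "a", TRUE)  # a number, a string and a logical value in one vector
typeof(v)             # "character" - the lowest common denominator
v                     # all elements were converted to strings: "1" "a" "TRUE"

w <- c(1L, 2.5)       # an integer combined with a floating-point number
typeof(w)             # "double"
```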

All data that R works with is held in memory. To manage memory requirements, R relies on automatic disposal of data that is no longer needed (garbage collection) and on demand-driven loading of data (lazy loading).

Control structures

Innovations in the area of control structures are notoriously rare in programming languages, and so R, too, has the classic if/else construct, whose syntax is analogous to C:

if (condition) {statement block} else {statement block}

The for loop in R differs syntactically from its namesake in C. In R it has the form

for (variable in list) {statement block}

A "typical" for loop would iterate over a vector with integer elements; but since list can be any list (and thus also a list of objects of quite different types), it is very easy in R to use for to achieve the same effect for which one would employ a foreach loop in other languages.
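A small sketch of such a foreach-style loop over a list of mixed types:

```r
items <- list(42, "text", TRUE)

# The loop variable takes on each list element in turn, whatever its type
for (item in items) {
  cat(class(item), "\n")
}
```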

In addition to the classic while loop of the form

while (condition) {statement block}

there is a repeat loop without any loop condition, which simply runs until it is exited with break.
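A minimal example of such a repeat loop:

```r
i <- 0
repeat {
  i <- i + 1
  if (i >= 3) break  # the exit condition is checked inside the loop body
}
i  # 3
```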

The curly braces can be omitted in all control structures if the body consists of only a single expression.

Functions

Like everything in R, functions are objects. They are generated by assigning the function head and body to a symbol, for example:

square <- function(x) { return(x^2) }

In keeping with R's weak typing, the arguments in the function header are specified without a data type.

Functions must always be called with parentheses, even if no arguments are passed; otherwise the source code of the function is displayed (a very useful feature in itself).

The arguments of a function can be given a default value, for example square <- function(x = 3), and can be addressed by their name: square(x = 4) is therefore a valid function call. Functions can also take an indefinite number of arguments, as in the following example, which calculates the sum of a series of previously squared arguments:

square.sum <- function(...) {
  x <- as.double(list(...))
  return(sum(x^2))
}

Here, a list is first made from the special object ..., which represents an undefined number of arguments; this list is then converted into a vector of floating-point numbers that can be used for the actual calculation. The function could now be called, for example, as square.sum(3, 4.05).

When an argument is passed to an R function, the interpreter first creates a new environment (namespace) for the function and then an object that is initialized with the value of the argument; argument passing in R is thus de facto "by value". Since functions are themselves objects, they can also be passed as arguments to other functions. One example of this is the function tapply(), which we will look at again below.
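As a small illustration (the helper function apply.twice is made up), a function object can be handed to another function like any other value:

```r
# A function that receives another function f as an argument and applies it twice
apply.twice <- function(f, x) {
  f(f(x))
}

apply.twice(sqrt, 16)  # sqrt(sqrt(16)) = 2
```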

Certain functions in R, such as print() or summary(), can seemingly handle arguments of very different types, including the complex objects returned by statistical model estimations. Such functions are called generic functions. Ultimately, they are just wrapper functions which, depending on the class of the argument passed to them, call the corresponding "specialized function" for that class. For example, the specialized function that displays the results of a linear regression is called print.lm (lm for linear model). It is always called by the generic function print when an lm object is passed to print as an argument. Here, one of R's approaches to object orientation shines through: methods do not belong to the object but are implemented via generic (wrapper) functions.
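This dispatch mechanism can be reproduced with a user-defined class; a sketch (the class and field names are invented for illustration):

```r
# An ordinary list that merely carries a class attribute
temperature <- structure(list(value = 21.5), class = "celsius")

# A specialized method for the generic function print, found via the class name
print.celsius <- function(x, ...) {
  cat("Temperature:", x$value, "degrees Celsius\n")
}

print(temperature)  # dispatches to print.celsius
```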

Working with packages

Packages, R's add-on components, which are primarily available via CRAN, are loaded with library(packagename). To do this, a package must first be installed with install.packages(packagename, dependencies = TRUE). The useful argument dependencies = TRUE ensures that all packages on which the one being installed depends (whose functions it relies on) are also installed, if this has not already been done. A list of the installed packages can be obtained with installed.packages().

Once a package has been loaded, the functions it provides can be called without any further action. With ?functionname, the help for a function installed as part of a package can also be called up at the prompt.

Statistics with R: First steps

Obviously, anyone who wants to use R for statistical applications first has to get the data into R. Out of the box, R offers a whole range of functions for reading data from text-based files, in particular the general function read.table(), for which all important parameters such as the column separator, the delimiter for character strings and the decimal separator can be specified. The functions read.csv() (comma as column separator, point as decimal separator), read.csv2() (semicolon as column separator, comma as decimal separator), read.delim() (tab as column separator, point as decimal separator) and read.delim2() (tab as column separator, comma as decimal separator) are ultimately special versions of read.table(), each with different default values for the central format arguments.

A number of import functions are also available for data from other statistics packages, e.g. read.spss() (for SPSS data), read.ssd() (for SAS data) and read.dta() (for Stata data); these functions are all contained in the package foreign. Excel data (both .xls and .xlsx) can be read, for example, with read.xls() from the gdata package.

By assigning

data <- read.table("C:/umfragedaten.csv", sep = ";", dec = ".")

one could now save the data table contained in umfragedaten.csv in a data frame named data (caution: R uses forward slashes, not backslashes, in file paths!). To read data from the web, by the way, only the file name has to be replaced by the URL.

But we want to make it easy for ourselves here and use the data set Anscombe, which is included in the R package car. Many more data sets to try out can also be found in the R package datasets.

After loading the package with library(car), we can use names(Anscombe) to look at the variables contained in the data set:

> names(Anscombe)
[1] "education" "income"    "young"     "urban"

The variables represent the per capita expenditure on education in the 51 US states (including Washington D.C.) (education), the per capita income (income), the number of residents under eighteen (young) and the number of citizens living in urban areas (urban), the latter two per 1000 inhabitants, all for the year 1970. The abbreviations of the states are stored in the data set as row names and can be displayed with row.names(Anscombe). They are, of course, also shown when the complete data set is viewed with View(Anscombe).

To get an initial overview of the data set, you can use head(Anscombe) and tail(Anscombe) to look at the first or last six rows of data.

A more systematic impression of the data can of course be obtained with the tools of descriptive statistics. Among others, R offers the functions mean() for the arithmetic mean, median() for the median (the 50% quantile), min() and max() for the smallest and largest value, and quantile() for arbitrary quantiles. With

> median(Anscombe$income)
[1] 3257

for example, one can determine the median of the states' per capita incomes (the number looks low, but of course there has been both substantial real growth and sizeable inflation since 1970). The function summary(), which takes a single variable (i.e. a vector) or an entire data set as an argument, provides a clear summary of the common key figures of descriptive statistics.

With the help of cor(), the correlation coefficient is determined as a measure of association, here using the example of the relationship between education expenditure and the proportion of young citizens in the population (the underlying functions for variance and covariance are called var() and cov() in R):

> cor(Anscombe$education, Anscombe$young)
[1] 0.3114855

A word about the handling of missing data: many functions in R return the special constant NA (for "not available") by default if the data to which they are applied contain missings, i.e. missing data points. For example, the value of the sum function sum() is NA if the vector whose elements it sums up contains a missing value. By setting the argument na.rm, which many functions support, to TRUE, the missings are simply excluded, ensuring that the functions return meaningful values even if the data is "full of holes". With the function is.na() one can check where the missings in a vector are.
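The behavior can be seen in a short session:

```r
x <- c(1, 2, NA, 4)

sum(x)                # NA - a single missing value "poisons" the result
sum(x, na.rm = TRUE)  # 7 - the missing value is excluded first
is.na(x)              # FALSE FALSE  TRUE FALSE - locates the missing value
```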

The statistical functions of R can be applied not only to entire vectors or data frames but, with the help of the function tapply(), also to grouped data. For example, one could determine the median of education spending across all US states depending on whether the proportion of young people in the population is higher or lower than the average across all states:

> tapply(Anscombe$education, INDEX = Anscombe$young > mean(Anscombe$young), FUN = median)
FALSE  TRUE 
  189   207 

The argument INDEX is the grouping criterion (here: an expression that returns a vector of logical values), FUN the function to be applied. This example shows nicely that in R, functions, being objects themselves, can be passed as arguments to another function.

Beyond descriptive statistics, an almost unlimited spectrum of statistical methods can now be used in R. At this point, just a simple example: to test the hypothesis that states with a higher proportion of younger citizens spend more per capita on education, one could estimate a linear regression with the function lm() (for linear model):

> m <- lm(education ~ income + young + urban, data = Anscombe)
> summary(m)

Call:
lm(formula = education ~ income + young + urban, data = Anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.240 -15.738  -1.156  15.883  51.380 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.868e+02  6.492e+01  -4.418 5.82e-05 ***
income       8.065e-02  9.299e-03   8.674 2.56e-11 ***
young        8.173e-01  1.598e-01   5.115 5.69e-06 ***
urban       -1.058e-01  3.428e-02  -3.086  0.00339 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.69 on 47 degrees of freedom
Multiple R-squared:  0.6896,    Adjusted R-squared:  0.6698 
F-statistic: 34.81 on 3 and 47 DF,  p-value: 5.337e-12

Like many statistical functions, lm() returns an object, itself ultimately a list of objects, that represents the model results.

> names(m)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

One could now, for example, use m$coefficients to access the vector of regression coefficients and continue working with it.

Graphics with R: Getting started

R has very strong graphics functionality for visualizing data. Out of the box, the standard graphics package graphics is installed, as well as the package lattice, which is particularly well suited for creating graphics with multiple panels.

With ggplot2 there is also a very popular graphics package that follows a completely different approach, described as a "grammar of graphics" (hence the gg in ggplot2). Graphics are understood, at an abstract level, as mappings of data to visual properties of the image, and are described logically on the basis of this idea. Anyone interested in this package, with which very attractive graphics can be created, is referred to the practice-oriented and richly illustrated R Graphics Cookbook by Winston Chang [14].

For a start, however, the means of the graphics package that is part of the standard R installation are completely sufficient. It provides a whole range of functions for generating a wide variety of chart types for both categorical and continuous data, including histograms (hist()), box plots (boxplot()), bar charts (barplot()), pie charts (pie()), scatter plots (plot()) and mosaic plots (mosaicplot()). A number of display functions also exist for three-dimensional data, including persp() for three-dimensional surfaces and heatmap() for heat maps.

Each graphic can be adapted to any user requirement, no matter how unusual, via a seemingly unlimited number of arguments. The parameters can either be passed directly to the respective graphics function (and then apply only to the current graphic), or they can be set with the function par() for all graphics on the current graphics device. By redirecting the output to other R graphics devices, graphics can also be conveniently exported to formats such as BMP, JPG, PNG or PDF.
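Exporting a graphic is a matter of opening a file-based graphics device, drawing, and closing the device again; a sketch (the file name is illustrative):

```r
# Redirect graphics output to a PDF file instead of the screen
out.file <- file.path(tempdir(), "histogram.pdf")
pdf(out.file, width = 7, height = 5)

hist(rnorm(100), col = "deepskyblue1", border = "white",
     main = "Example histogram")

dev.off()  # close the device; the file is now complete
```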

The following is an example of a very simple histogram showing the distribution of education spending (the outlier with the particularly high education spending is Alaska, by the way):

hist(Anscombe$education,
     breaks = seq(from = 87.5, to = 400, by = 25),
     main = "Education spending of the US states (1970)",
     xlab = "Per capita education spending [US dollars]",
     ylab = "Number of states",
     col = "deepskyblue1", border = "white")

The argument breaks instructs R to group the data for the histogram into classes 25 dollars wide. Alternatively, the class boundaries could have been passed as a vector of explicit values, or just a number of classes could have been specified, leaving R to determine their width itself.

The arguments col and border specify the fill and border colors of the bars. In addition to hexadecimal RGB color codes such as #FF0000 for red, R also knows a number of color constants [15]. Of course, R can also handle other color spaces such as HSL.

A similar-looking histogram could be generated with the package ggplot2 as follows:

ggplot(data = Anscombe, aes(x = education)) +
  geom_histogram(fill = "deepskyblue1", color = "white", binwidth = 25) +
  labs(title = "US State Education Spending (1970)",
       x = "Per capita education spending [US dollars]",
       y = "Number of states") +
  scale_x_continuous(breaks = seq(from = 0, to = 400, by = 50), limits = c(75, 400)) +
  scale_y_continuous(expand = c(0, 0), breaks = seq(from = 0, to = 13, by = 1),
                     minor_breaks = NULL, limits = c(0, 13)) +
  theme(axis.line.x = element_line(color = "black", size = 0.5),
        axis.line.y = element_line(color = "black", size = 0.5))

Here you can see very nicely how, following the grammar-of-graphics concept, a ggplot object is first created, to which the Anscombe data set is passed as data. With aes (for aesthetics) we tell ggplot that we want to map the x-coordinate of our representation to education spending. So here we are mapping numerical data to a property of the representation.

So far, the type of representation has not yet been determined. That only happens in the next step, in which we add a so-called geom, i.e. a kind of representation, in our example a histogram. Finally, in a similar way, we add labels and axis settings.

Conclusion

Due to its open source nature and literally thousands of add-on packages, R has long since become the Swiss army knife of statisticians and data scientists. This is reflected, among other things, in the high demand for R skills on the job market and the large number of R-related tutorials, blogs and forums on the Internet. Anyone interested in statistics should take a look at R at least once. This article has provided a first introduction to the language and its characteristics.

  1. R. Muenchen: The Popularity of Data Science Software, 2017
  2. Package collection of the Bioconductor project
  3. K. Rexer, P. Gearan and H. Allen: 2017 Data Science Survey, 2017
  4. D. Smith: New surveys show continued popularity of R, 2015
  5. CRAN mirror servers
  6. R Commander
    RStudio
  7. CRAN Task Views
  8. The R Journal
    J. Fox and A. Leanage: R and the Journal of Statistical Software, 2016
  9. StackOverflow
    R-bloggers
  10. T. Galili: Speed up your R code using a just-in-time (JIT) compiler, 2012
  11. Package shiny
  12. J. M. Chambers: Object-Oriented Programming, Functional Programming and R, Statistical Science, Vol. 29, No. 2, 167-180, 2014
  13. Google's R Code Style Guide
  14. W. Chang: R Graphics Cookbook, O'Reilly, 2013
  15. Set of color constants