Tidy features (like pipes) are detrimental to performance. The best things R has going for it are data.table, ggplot, stringr, RMarkdown, RStudio, and the massive, unmatched breadth and depth of special-purpose statistics libraries. Combined, this is a formidable and highly performant toolset for data analytics workflows, and I can say with some certainty that even though “base Python” might look prettier than “base R,” the combination of Python and NumPy is not necessarily more powerful, nor is its syntax more elegant. The data.table syntax is quite convenient and powerful, even if it does not produce the same “warm fuzzy” feeling that pipes might. NumPy syntax is just as clunky as anything in R, if not worse, largely because NumPy was not part of the base Python design (as opposed to languages like R and MATLAB, which were designed around data frames and matrices).
What is probably not a good idea (which the article unfortunately does) is to introduce people to R by talking about data.frame without mentioning data.table. Just as an example, the article mentions read.table, an old R function that is very slow on large files. The right answer is to use fread and data.table, and if you are new to R, get the hang of these early on so that you don’t waste a lot of time on older, essentially obsolete parts of the language.
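A minimal sketch of the difference (toy file written just for illustration):

library(data.table)

# read.table()/read.csv() parse single-threaded and can take minutes on multi-GB files;
# fread() is multi-threaded and auto-detects the separator and column types
f <- tempfile(fileext = ".csv")
write.csv(mtcars, f, row.names = FALSE)

dt <- fread(f)    # drop-in replacement for read.csv()/read.table() on real files
class(dt)         # "data.table" "data.frame" -- still usable wherever a data.frame is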
> Tidy features (like pipes) are detrimental to performance.
Detrimental to runtime performance; if you happen to be reading and processing tabular data from a csv (which is all I've ever used R for, I must admit), then you get real performance gains as a programmer. For one thing, it allows a functional style in which it is much harder to introduce bugs. If someone is trying to write performant code they should be using a language with actual data structures (and maybe one that is a bit easier to parallelize than R). The vast bulk of the work done in R is not time-sensitive, but it is very vulnerable to small bugs corrupting data values.
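For example, a typical pipeline reads as a single declarative chain, with each step returning a new data frame rather than mutating anything in place (toy data, invented column names):

library(dplyr)

sales <- data.frame(region = c("N", "N", "S"), amount = c(10, NA, 5))   # stand-in data

# Nothing is modified in place, so there are no intermediate variables to
# accidentally reuse or clobber
sales %>%
  filter(!is.na(amount)) %>%
  group_by(region) %>%
  summarise(total = sum(amount))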
Tidyverse, and really anything that Hadley Wickham is involved in, should be the starting point for everyone who learns R in 2018.
> languages like R and MATLAB that were designed for data frames and matrices
Personal bugbear; the vast majority of data I've used in R has been 2-dimensional, often read directly out of a relational database. It makes a lot of sense why the data structures are as they are (the language was designed a long time ago in a RAM-lite environment), but it is just so unpleasant to work with them. R would be vastly improved by a /single/ standard "2d data" class with some specific methods for "all the data is numeric so you can matrix multiply" and "attach metadata to a 2d structure".
There are 3 different data structures used in practice amongst the R libraries (matrix, list-of-lists, data.frame). Figuring out what a given function returns and how to access element [i,j] is just an exercise in frustration. I'm not saying a programmer can't do what I want, but I am saying that R promotes a complicated hop-step-jump approach to working with 2d data that isn't helpful to anyone - especially non-computer engineers.
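To make that concrete, roughly how [i,j] access differs across the three (toy structures for illustration):

m  <- matrix(1:6, nrow = 2)            # numeric matrix
df <- data.frame(a = 1:2, b = 3:4)     # data.frame: a list of column vectors
l  <- list(list(1, 2), list(3, 4))     # list-of-lists, as some packages return

m[1, 2]          # matrix: plain row/column indexing
df[1, 2]         # data.frame: looks the same, but returns a length-1 vector
df[[2]][1]       # ...or column first, then row
l[[1]][[2]]      # list-of-lists: double brackets all the way down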
I think what you're saying is mostly on point. I wanted to share a couple possible balms for your bugbears.
For attaching metadata to anything, why not use attributes()/attr() or the tidy equivs? Isn't that what they're for?
It might not make you feel much better, but data.frame is just a special list, c.f. is.list(data.frame()). So, if you don't want to use the convenience layers for data.frame you can just pretend it is a list and reduce the ways of accessing data structures by one.
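A quick sketch of both points (the metadata field is made up):

df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# attr() attaches arbitrary metadata to any object
attr(df, "source") <- "nightly extract from the orders DB"
attributes(df)            # names, class, row.names, plus the custom "source"

# ...and a data.frame really is just a list of equal-length column vectors
is.list(df)               # TRUE
df[["b"]]                 # plain list-style column access
lapply(df, class)         # list tools work on it directly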
You can paper over the distinction between data.frames and matrices if it comes up for you often enough. E.g.
`%matrix_mult%` <- function(x, y) {
  # Coerce data.frames to matrices, refusing anything non-numeric
  if ("data.frame" %in% class(x)) {
    x <- as.matrix(x)
    stopifnot(all(is.numeric(x)))
  }
  if ("data.frame" %in% class(y)) {
    y <- as.matrix(y)
    stopifnot(all(is.numeric(y)))
  }
  # Check that the dimensions conform, then do ordinary matrix multiplication
  stopifnot(dim(x)[2] == dim(y)[1])
  x %*% y
}
d1 %matrix_mult% d2
... but I'll grant that isn't the language default.
I wrote a function once for a friend that modified the enclosing environment, and changed + so that sometimes it added two numbers together and sometimes it added two numbers and an extra 1, just to be helpful. I can sort myself out, but thanks for the thoughts.
The issue is that I learn these things /after/ R does something absolutely off the wall with its type system. And a lot of my exposure comes from using other people's libraries.
For my own work I just use tidyverse for everything. It solves all my complaints, mainly by replacing apply() with mutate(), data.frame with tibble, and giving access to the relational join commands from dplyr. I'm cool with the fact that my complaints are ultimately petty.
> For attaching metadata to anything, why not use attributes()/attr() or the tidy equivs? Isn't that what they're for?
I've never met attr before, and so am unaware of any library that uses attr to expose data to me. The usual standard as far as I can tell is to return a list.
> It might not make you feel much better, but data.frame is just a special list, c.f. is.list(data.frame()). So, if you don't want to use the convenience layers for data.frame you can just pretend it is a list and reduce the ways of accessing data structures by one.
Well, I could. But data frames have the relational model embedded into them, so all the libraries that deal with relational data use data frames or some derivative. I need that model too, most of my data is relational.
The issue is that sometimes base R decides that since the data might not be relational any more it needs to change the data structure. This famously happens with apply() returning a pure list, or dat[x, y] sometimes being a data frame and sometimes a vector depending on the value of y. It has been a while since I've run into any of this, because as mentioned most of it was fixed up in the Tidyverse verbs and tibble (with things like its list-column feature).
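For anyone who hasn't hit it, the drop behaviour looks roughly like this:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

class(df[, c("x", "y")])        # "data.frame"
class(df[, "y"])                # a bare vector -- base R silently drops the structure
class(df[, "y", drop = FALSE])  # a data.frame again, but you have to ask for it

library(tibble)
tb <- as_tibble(df)
class(tb[, "y"])                # still a tibble; single-column [ never drops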
> `%matrix_mult%` <- function(x, y) { ... }
I have absolutely no idea what that does in all possible edge cases, and to be honest the problem it solves isn't one I confront often enough to look into it.
It just bugs me that I have to use as.matrix() to tell R that my 2d data is all made up of integers, when it already knows it is 2d data (because it is a data frame) and that it is made up of integers (because a data frame is a list of vectors, which can be checked to be integer vectors). I don't instinctively see why that can't be handled in the background by the data.frame code, which already has a concept of row and column number. Having a purpose-built data type only makes sense to me in the context that at one point it was used to gain memory efficiencies.
I mean, on the surface
data %>% select(-date) %>% foreign_function()
and
data %>% select(-date) %>% as.matrix %>% foreign_function()
look really similar, but changing data types halfway through is actually adding a lot of cognitive load to that one-liner, because now I have to start thinking about converting data structures in the middle of what was previously high-level data manipulation. And you get situations that really are just weird and frustrating to work through, e.g. [1].
scale() for example uses attributes to hold on to the parameters used for scaling. Most packages that use attributes provide accessor functions so that the useR doesn't need to concern themselves with how the metadata are stored. I'll grant that people do tend to use lists because the access semantics are easier.
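For instance:

x <- scale(c(10, 20, 30))
attributes(x)              # dim, plus "scaled:center" (20) and "scaled:scale" (10)
attr(x, "scaled:center")   # pull the centering parameter back out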
If you're in a situation where 80% of the time is spent in 20% of the code, you only have to use less expressive features in those hot-spots; you don't have to give up your pipes or whatever in places that don't contribute much to the run-time.
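In practice that can look like keeping the readable pipeline for the cheap prep and dropping to data.table only for the step that profiling showed was hot (toy data, invented column names):

library(dplyr); library(data.table)

raw <- data.frame(date = Sys.Date() - 0:2, region = "W", value = c(1, NA, 3))   # stand-in data

# Readable dplyr for the cheap steps...
clean <- raw %>% filter(!is.na(value)) %>% mutate(month = format(date, "%Y-%m"))

# ...data.table only for the group-by that dominates the run-time
as.data.table(clean)[, .(total = sum(value), n = .N), by = .(month, region)]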
> Tidy features (like pipes) are detrimental to performance.
But they are some absolutely amazing features to use. After helping my wife learn R, and learning about all the dplyr features, going back to other languages sucked. C#'s LINQ is about as close as I can get to dplyr-like features in a mainstream language.
Of course R's data tables and data frames are what enable dplyr to do its magic, but wow what magic it is.
I also base this on my own experience. I typically work with 2-3 million row datasets. I found that certain data operations were quite slow in plyr but a lot faster in data.table. It’s possible that if I had spent time reordering my plyr pipelines and filtering out unneeded columns or rows, it would have worked better. However, data.table doesn’t require such planning ahead and thinking about which columns/rows you need to send to the next operation in a pipeline, because multiple operations can be executed from a single data.table call, and the underlying C library is able to make optimized decisions (like dropping columns not requested in the query), similar to an in-memory SQL database. So between dealing with slow code while doing interactive analysis, and/or having to spend time hand-optimizing dplyr pipelines, I found data.table to be a significant improvement in productivity (other than the one-time effort of rewriting a few internal packages/scripts to use data.table instead of dplyr).
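Roughly the shape of the difference (invented column names):

library(dplyr); library(data.table)

df <- data.frame(customer = c("a", "a", "b"),
                 status   = c("active", "active", "closed"),
                 amount   = c(10, 20, 5))     # stand-in for a much wider table

# dplyr: each verb materialises an intermediate result, so on wide data it pays
# to select() only the needed columns before the heavy steps
df %>% filter(status == "active") %>% group_by(customer) %>% summarise(revenue = sum(amount))

# data.table: i, j and by are evaluated in one call, and columns not named in j
# are never touched, so there is nothing to prune by hand
as.data.table(df)[status == "active", .(revenue = sum(amount)), by = customer]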
Thanks for the reference. Why don't you keep your data in a DB? I load almost anything that isn't a small atomic data frame into a RDBMS.
BTW one thing that always made me avoid DT (I even preferred sqldf before dplyr was created) was its IMHO weird syntax. I always found the syntax of (d)plyr much more convenient. ATM it seems to me that dplyr has won the contest of alternative data management libraries. I cannot remember when I last read a blog post, article, or book that preferred DT over dplyr. I'm old enough to have learned that wrt libraries, it's wise to follow the crowd.
About that article: I assume that DT uses an index for that column while dplyr does a full search. If that's really the case, the result wouldn't be that much of a surprise.
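If that is the mechanism, it's at least easy to opt in explicitly (column names invented):

library(data.table)

dt <- data.table(customer_id = sample(1e6), region = sample(letters, 1e6, TRUE), amount = runif(1e6))

setkey(dt, customer_id)      # sort once and mark the key column
dt[.(42)]                    # keyed lookups then use binary search, not a full scan

setindex(dt, region)         # or add a secondary index without reordering the rows
dt[region == "q"]            # == / %in% subsets can use it automatically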
Answering questions in a rapid, interactive way (while using C underneath to be efficient enough to run on millions of rows):
# Given a dataset that looks like this…
> head(dt, 3)
mpg cyl disp hp drat wt qsec vs am gear carb name
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
# What's the mean hp and wt by number of carburettors?
> dt[, list(mean(hp), mean(wt)), by=carb]
carb V1 V2
1: 4 187.0 3.8974
2: 1 86.0 2.4900
3: 2 117.2 2.8628
4: 3 180.0 3.8600
5: 6 175.0 2.7700
6: 8 335.0 3.5700
# How many Mercs are there and what's their median hp?
> dt[grepl('Merc', name), list(.N, median(hp))]
N V2
1: 7 123
# Non-Mercs?
> dt[!grepl('Merc', name), list(.N, median(hp))]
N V2
1: 25 113
# N observations and avg hp and wt per {num. cylinders and num. carburettors}
> dcast(dt, cyl + carb ~ ., value.var=c("hp", "wt"), fun.aggregate=list(mean, length))
cyl carb hp_mean wt_mean hp_length wt_length
1: 4 1 77.4 2.151000 5 5
2: 4 2 87.0 2.398000 6 6
3: 6 1 107.5 3.337500 2 2
4: 6 4 116.5 3.093750 4 4
5: 6 6 175.0 2.770000 1 1
6: 8 2 162.5 3.560000 4 4
7: 8 3 180.0 3.860000 3 3
8: 8 4 234.0 4.433167 6 6
9: 8 8 335.0 3.570000 1 1
I used slightly verbose syntax so that it is (hopefully) clear even to non-R users.
You can see that the interactivity is great at helping you compose answers step-by-step, molding the data as you go, especially when you combine with tools like plot.ly to also visualize results.
What a lot of people don't get is that this kind of code is what R is optimized for, not general purpose programming (even though it can totally do it). While I don't use R myself, I did work on R tooling, and saw plenty of real world scripts - and most of them looked like what you posted, just with a lot more lines, and (if you're lucky) comments - but very little structure.
I still think R has an atrocious design as a programming language (although it also has its beautiful side - like when you discover that literally everything in the language is a function call, even all the control structures and function definitions!). It can be optimized for this sort of thing, while still having a more regular syntax and fewer gotchas. The problem is that in its niche, it's already "good enough", and it is entrenched through libraries and existing code - so any contender can't just be better, it has to be much better.
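The "everything is a call" bit is easy to demo:

`+`(1, 2)                    # 3 -- operators are just functions with odd names
`if`(TRUE, "yes", "no")      # "yes" -- so are control structures
`[`(letters, 3)              # "c" -- and even subsetting
sapply(c(1, 4, 9), `-`, 1)   # 0 3 8 -- so you can pass them around like anything else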
Completely agree. dplyr is nice enough but the verbose style gets old fast when you're trying to use it in an interactive fashion. imo data.table is the fastest way to explore data across any language, period.
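Same aggregation as upthread, side by side:

library(dplyr); library(data.table)

# dplyr spells everything out...
mtcars %>% group_by(carb) %>% summarise(hp = mean(hp), wt = mean(wt))

# ...data.table says the same thing inside the index
as.data.table(mtcars)[, .(hp = mean(hp), wt = mean(wt)), by = carb]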
I strongly agree, having worked quite a bit in several languages including Python/NumPy/Pandas, MATLAB, C, C++, C#, even Perl ... I am not sure about Julia, but last time I looked at it, the language designers seemed to be coming from a MATLAB type domain (number crunching) as opposed to an R type domain (data crunching), and so Julia seemed to have a solid matrix/vector type system and syntax, but was missing a data.table style type system / syntax.
Julia v0.7-alpha dropped and it has a new system for missing data handling. JuliaDB and DataFrames are two tabular data stores (the first of which is parallel and allows out-of-core for big data). This has changed pretty dramatically over the last year.
Plus I don't have to remember a lot of function names and what order to input vars to the functions. Just have to remember the data.table index syntax and I can do a lot of stuff. I'm sure I can do dplyr once I learn the functions, but the data.table syntax seems very simple and elegant to me.
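The whole grammar really is just DT[i, j, by]; using the mtcars-style table from upthread:

dt[cyl > 4,                    # i : which rows
   .(avg_hp = mean(hp)),       # j : what to compute
   by = gear]                  # by: per group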