Appendix C — A Running List of Annoying R Things

C.1 Consistently poor documentation

  • e.g., CRAN reference manuals come in PDF form.
  • e.g., CRAN reference manuals cover package functions/data in alphabetical order, with no guidance about which functions might be more or less important/useful/central to the package.
  • e.g., Documentation for the tidymodels package parsnip does not tell you which engines it supports (despite engines being the heart of what parsnip does). It does, however, tell you that you can call parsnip::show_engines() if you want that information. Of course, this requires you to install the package first.

C.2 Generic functions & dispatching

  • Whenever you call an R function, you never know with certainty what code you are about to invoke. Part of this ambiguity is intentional. R uses what is called “dispatching” to decide which of the many print() functions or which of the many describe() functions to actually execute, based on the class of the first argument passed to the function. One of the major benefits of dispatching is that package authors may tweak standard functions (e.g., print()) to do special things for the objects their package instantiates, and users never have to know that there are new functions available and never have to worry about which one(s) to use when.
  • But ignoring whose code you are invoking can also lead to problems. When I call the function filter(), what can I expect to happen? It depends entirely on which packages happen to be loaded at the time. Worse than this, it depends on the order in which those packages were loaded, because R defaults to selecting functions from more-recently loaded packages.
  • R knows that there are multiple functions with the name “filter” and even knows that there are multiple filter() functions that share a common first argument type. R could tell you when it detects such a scenario, warn you that it is making a choice, and let you decide whether R is making that choice the way you wish. But it doesn’t. It silently makes a selection and never lets you know that anything funny might be going on.
  • Packages such as conflicted seek to address this by allowing greater control over how conflicts are resolved (e.g., specifying that you want filter() to invoke stats::filter() rather than the more-recently loaded dplyr::filter()). However, this places all responsibility for handling all relevant conflicts on the user. Furthermore, how these conflicts are resolved is still implicit when the functions are actually called (my code still looks like filter(), and you need to go hunting to see if there’s a conflict and if/when/how I asked for the conflict to be resolved).
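
  The two behaviors above can be seen in a few lines of base R. This is a minimal sketch: the “receipt” class and its print method are invented for illustration, and the stats::filter() call simply shows that an explicit namespace prefix sidesteps the load-order lottery entirely.

```r
# S3 dispatch: print() is generic; R picks a method based on the class
# of the first argument.
print.receipt <- function(x, ...) {
  cat("Receipt total:", sum(x$amounts), "\n")
  invisible(x)
}

r <- structure(list(amounts = c(1.50, 2.25)), class = "receipt")
print(r)  # dispatches to print.receipt(), not the default method

# Masking: in a fresh session, filter() resolves to stats::filter();
# after library(dplyr), the very same call silently resolves to
# dplyr::filter(). An explicit namespace prefix removes the ambiguity:
f <- stats::filter(1:5, rep(1/2, 2))  # a moving average, not a row filter
```

  Note that the prefixed call behaves identically regardless of what is loaded or in what order, which is exactly the property the bare call lacks.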

C.3 Inconsistent acknowledgement of namespaces

  • One way to solve the ambiguity described in Section C.2 is to be explicit when calling functions (e.g., dplyr::filter() instead of filter()). However, the complaint is that this makes your code “pretty busy”. Of course, there is a healthy conversation to be had about clean-but-mysterious vs. busy-but-transparent.
  • But ignoring the fact that functions are “attached” to specific packages is mirrored elsewhere. For example, there is a tendency to ignore the fact that data sets are also “attached” to specific packages. When one loads a package into the R environment (e.g., library(fivethirtyeight)), a variety of data sets may become immediately available (e.g., drinks). I would encourage you to try to find evidence that the drinks data frame is in any way associated with the fivethirtyeight package. RStudio will tell you that your environment is empty, which is a bit odd given that you now have access to a data frame that you didn’t when you first started up RStudio. If you check immediately before and after loading fivethirtyeight, you might see that drinks is available after, but not before, doing so. You might also see that the documentation for drinks (e.g., ?drinks) tells you that it is part of the fivethirtyeight package. But this is all a bit mysterious. When one runs a series of library() statements and a variety of new data frames pop into existence, how does one determine where they all came from? You wade through lots of (poor) documentation. Or, in RStudio, you can step through the individual package environments, one by one.
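
  For what it’s worth, base R can answer the “where did this come from?” question, though nothing volunteers the answer. A hedged sketch: find() reports which attached package an object lives in. The fivethirtyeight lines are commented out because they assume that package is installed; the mtcars example works in any default session, since the datasets package is attached at startup.

```r
# Which attached package does an object belong to?
# library(fivethirtyeight)
# find("drinks")        # would report "package:fivethirtyeight"

find("mtcars")
#> [1] "package:datasets"

# The analogous question for a function, via its namespace:
environmentName(environment(stats::filter))
```

  So the provenance information exists; the complaint is that you must know to go digging for it.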

C.4 Everything is a package

  • When you update R (not RStudio), it will remove all the packages you have installed. This is undesirable.
  • If you poke around for a solution, you may come upon the installr package. The fact that this is a package and not a feature of R itself is odd (why not contribute this code/functionality to the R project itself?).
  • The R ecosystem is extremely fragmented and weirdly territorial. Do we have an R package that provides generalized linear modeling functionality? No. We have 13 of them! Or 37 of them? It’s not clear. How this sort of situation can be maintained as an equilibrium is unclear. Why is all this effort sunk into “one more GLM package” as opposed to adding features to existing packages? Are the maintainers not open to such additions? If so, why is everyone using their packages and not starting a more open, more contributor-friendly project?

C.5 How does R work? & “lazy evaluation”

  • R code such as mpg %>% filter(model == "mustang") is not valid because model is not the name of an R variable or anything else that R might be aware of. To see this, type model at the console and you will get an Error: object 'model' not found message. But this code becomes valid once you load the tidyverse package (or dplyr). How this happens is mysterious, but the short version is that R allows packages to decide and change how R itself works.
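
  The mechanism can be demystified a little with base R alone. Because R evaluates function arguments lazily, a function can capture an argument as an unevaluated expression and evaluate it later inside a data frame rather than in the calling environment. The sketch below is a toy (filter_rows() and the cars data frame are invented for illustration; this is not how dplyr is implemented, which uses a more elaborate version of the same idea).

```r
# Lazy evaluation at work: 'cond' is never evaluated in the caller's
# environment; substitute() captures it as an unevaluated expression,
# and eval() runs it with the data frame's columns in scope.
filter_rows <- function(df, cond) {
  keep <- eval(substitute(cond), envir = df)
  df[keep, , drop = FALSE]
}

cars <- data.frame(model = c("mustang", "civic"), mpg = c(18, 32))
filter_rows(cars, model == "mustang")  # 'model' resolves to cars$model
```

  Typing model == "mustang" at the top level would still fail; it is only inside filter_rows() that the expression is deferred and then evaluated against the data frame, which is why such code can be invalid in one context and valid in another.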