Skip to content

Drunk-under-the-lamppost testing

Edit: Glancing over this again, it struck me that the title may be interpreted as being mean. Sorry about that. It wasn’t my intent. I was trying to be constructive and I really like that analogy. The original post is mostly reasonable other than on this one point that I thought was important to call out.

I’m writing a response here to Abraham Mathews’s post, Best practices for code review, R edition, because my comment there didn’t show up and I think the topic’s important. Mathews’s post starts out on the right track, then veers away from best practices in the section “What code should be reviewed?” where he says,

…In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review files or lines of code that contain complex logic and may benefit from a glance from other team members.

Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. …

The standard for code review in industry and large open-source projects is to review every piece of code before it’s merged. The key to this strategy is ensuring that every line of code has been viewed by at the very least the author and one trusted team member. Sampling-based code review that’s biased to where the group thinks errors may be has the drunks-under-the-lamppost problem of not covering a large chunk of code. Software developers obsess on test coverage, but it’s very challenging and we haven’t been able to get there with Stan. If we were developing flight control or pacemaker or transactional banking software, the standards would be much higher.

Typically, APIs are designed top-down from a client’s perspective (the client being a human or another piece of code that calls the API), then coded bottom up. Each component is reviewed and unit tested before being merged. The key to this strategy is being able to develop high-level modules with the confidence that the low-level pieces work. It may sound like it’s going to take longer to unit test as you go, but the net is a huge time savings with the upside of having more reliable code.

It’s also critical to keep the three key components of software development in synch: documenting (i.e., design), testing, and coding. In larger projects, features of any size always start with a functional spec outlining how it works from the client point of view—that’s usually written like the eventual documentation will be written because that’s what says what code does. With just doc, the key here is to make sure the API that is being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

Consider the rgamma function in R. It takes three parameter arguments, shape, rate, and scale. Experienced statisticians might know that scale and rate parameters are conventionally inverses, yet this isn’t mentioned in the doc anywhere other than implicitly with the values of the default arguments. What happens if you supply both scale and rate? The doc doesn’t say, so I just tried it. It does not return an error, as one might expect from languages that try to keep their arguments coherent, but rather uses the rate and ignores the scale (order doesn’t matter). At the point someone proposed the rgamma function’s API, someone else should’ve piped up and said, “Whoa, hang on there a second, cowpoke; this function’s going to be a mess to test and document because of the redundant arguments.” With scale not getting a default and rate and shape being inverses, the tests need to cover behavior for all 8 possible input patterns. The doc should really say what happens when both scale and rate are specified. Instead, it just says “Invalid arguments will result in return value ‘NaN’, with a warning.” That implies that inconsistent rate and scale arguments (e.g., rate = 10, scale = 10) aren’t considered invalid arguments.

I should also say that my comments above are intended for API design, such as an R package one might want to distribute or a piece of infrastructure a lab or company wants to support. I wouldn’t recommend this style of functional design and doc and testing for exploratory research code, because it’s much harder to design up front and isn’t intended to be portable or distributed beyond a tiny group of collaborators. I’m not saying don’t test such code, I’m just saying the best practices there would be different than for designing APIs for public consumption. For example, no need to test Windows and Linux and Mac if you only ever target one platform, no reason to test all the boundary conditions if they’re never going to be used, and so on. It absolutely still helps to design top down and write modular reusable components bottom up. It’s just usually not apparent what these modules will be until after many iterations.

P.S. I highly recommend Hunt and Thomas’s book, The Pragmatic Programmer. It’s a breeze to read and helped me immensely when I was making the move from a speech recognition researcher writing experimental code to an industrial programmer. Alternatives I’ve read suffer from being too long and pedantic, too dogmatic, and/or too impractical.

P.P.S. I’ve been meaning to write a blog post on the differences in best practices in research versus publicly distributed code. I know they’re different, but haven’t been able to characterize what I’d recommend in the way of methodology for research code. Maybe that’s because I spend maybe one day/month on personal or group research code (for example, the code Andrew and I developed for an analysis of SARS-CoV-2 seroprevalence), and nineteen days a month working on Stan API code. I’d be curious as to what other people do to test and organize their research code.


  1. OneEyedMan says:

    Did you consider submitting this as a bug?

    I’m often astounded at how R functions in popular libraries manage to have thirty or more arguments and examples that coverjust 3 or so use cases. I wonder if attempts to consider a richer family of examples would help with the testing and bug finding.

  2. Although I agree that Julia is a REPL language I disagree that it’s APIs tend to have things like this redundant arguments example… In Julia typically if you want a different behavior for a function, you write a method that acts on different types… So you might ask it to sample from a gamma with rand(Gamma(shape,scale)), and there might be a second method to construct a Gamma type that’s Gamma(shape;rate=r) but there is no Gamma(shape,scale,rate) with ambiguous meaning for example.

    Julia is a breath of fresh air for an ex Common Lisp hacker drowning in the low quality semantics of R code … give it a try

    • I completely agree that R’s semantics is a disaster, particularly w.r.t. scope and assignment and multiple ways to handle objects. I also don’t like the syntax. Python is much better, but still too weakly typed for my taste. I’m like the chemistry teacher who’s always scolding their students about lack of units. Types in programming languages serve the same purpose—they help keep complicated things straight by providing a soundness check on formulas. Common Lisp is arguably even worse than R on the typing front.

      I’m far less familiar with Julia. I tried searching for [Julia regression] and it turns out there are hundreds of packages. The Julia doc seems as bad as R’s in terms of not talking about return types or values cleanly. For example, here’s StatsBase.jl.

      It also looks like the convention in Julia is to replace a pile of arguments with a configuration object. That just pushes the problem down to the object constructor or setters and getters. If there aren’t named arguments, as in C++ or Java, this idiom is forced on you. Julia does have keyword arguments, but they didn’t appear to be so popular in examples I found.

      • (mostly for Bob’s benefit, but made a little more pedantic than Bob would need for others)

        Common Lisp and Julia are both *dynamically typed* where the type of the object is very very well defined, but determined at runtime. C++ of course is static typed, where the type of an object is determined by the text you type… except of course for the fact that the template language does arbitrary turing complete calculations pre-compile time… The exception to this is in C++ you have “virtual” methods that specialize on the implicit first argument. so a.f(x) calls a different function depending on the runtime type of a.

        One of R’s big problems is that at its heart it has no *user defined* types. There’s no “struct” so to speak. So everything is secretly a list of vectors of lists of vectors of lists of floats and vectors of strings… In other words there’s no way to separate the meaning “a complex group of information that represents an MCMC computation” from the implementation of that thing as “a crap load of base types munged together”. Neither Common Lisp nor Julia suffer from that problem.

        Julia takes the CLOS multi-method/multiple dispatch and bolts it directly onto the language from the start. It then JIT compiles you a method specialized to the runtime types.

        For example:

        ## define a simple one line function

        julia> f(a,b) = a + a*b;

        julia> f(2,2)

        julia> f(2.0,2.0)

        julia> @code_native f(2,2)
        ; ┌ @ REPL[21]:1 within `f’
        ; │┌ @ REPL[21]:1 within `*’
        imulq %rdi, %rsi
        ; │└
        ; │┌ @ int.jl:53 within `+’
        leaq (%rsi,%rdi), %rax
        ; │└
        nopl (%rax)
        ; └

        julia> @code_native f(2.0,2.0)
        ; ┌ @ REPL[21]:1 within `f’
        ; │┌ @ REPL[21]:1 within `*’
        vmulsd %xmm1, %xmm0, %xmm1
        ; │└
        ; │┌ @ float.jl:401 within `+’
        vaddsd %xmm0, %xmm1, %xmm0
        ; │└
        nopl (%rax)
        ; └


        So not only does the type of a function’s return value depend on the types that you call it with… but it compiles different super fast machine code depending on the types you call it with!

        There’s a lot of discussion about “type stable code” in the Julia community. That is, code that always returns the same type regardless of what happens during its computation. Of course it has union types, so type stability doesn’t mean you always have to say return a float, it just means you have to always return one of a small set of types, such as for example Union{Float64,missing} which represents either a float or a missing value.

        And you don’t *have* to be type stable… it’s just that if you want fast code, type stability is a necessity.

        I know in Stan you guys have done a lot of work to get fast evaluation by sophisticated use of Templates… In Julia the compiler does that work for you for the most part.

        The WORST thing about R is the nonstandard evaluation. Like in ggplot2 specifications where you type aes(a,b) and it isn’t evaluated… it’s converted to some kind of expression and carried around deep into some of Hadley’s code and then later it knows that you want to use the data frame columns named a, and b. Here aes(a,b) captures the symbols you used in the arguments… but there is *no* indication textually that this is going on.

        I confess I really don’t understand R nonstandard evaluation… and I really don’t want to. I think it’s a terrible terrible idea, because it makes it virtually impossible to know what your code means and I HATE the Hadleyverse because it makes intensive use of this construct. Of course lots of people love the Hadleyverse because they can write simple looking code and get things done. But in fact, there is no real semantics to that code… it’s just “it does whatever Hadley decided it should do”

        Julia on the other hand has first-class macros… a macro is a Julia function that operates on parsed Julia expressions and can rewrite them doing arbitrary Turing complete calculations on the code. You call a macro using the @ sign… so @code_native was a macro that took the expression I gave it, and then compiled the expression to machine code and spit out a human readable version of the machine code.

        Julia also has first class symbols… so if you want to refer to a symbolic concept you just do that… prefixing a colon to the symbol f(:a,:b) means “call the function f with the symbol :a and symbol :b”

      • cs student says:

        the result of Julia’s type system is that you are allowed to write out all the types, and then it will indeed yell at you if you call a function with the wrong type. but doing this is discouraged, because someone may want to call your function with some type you hadn’t thought of yet.

        Of course you can imagine hierarchies of abstract types to get around this, and Julia does some of this, but it’s not always easy to coordinate which abstract types to use.

        So from the end user perspective, you’re supposed to use it like Python — heavily dynamic and “duck typing” maybe with some type hinting. But then the compiler infers types and machine code is generated with full type information.

        (I think an example is in Julia’s automatic differentiation libraries — some just define a new dual number type. Then you can call a huge amount of code, including by people who have never even heard of autodiff and didn’t plan for their code to work with it, and get your derivatives automatically compiled down to fast machine code. But some libraries put “Float64” in their function signatures for type safety, which made it impossible to use with autodiff for no good reason.)

  3. A.G.McDowell says:

    Code review in theory is a wonderful thing. Code review in practice can be a recipe for much heat and very little light. Some reviewers will obsess about minor points of style (number of characters in variable names) and some things which may seem positively destructive (removal of commented out code which the writer may wish to keep around ‘just in case’) while failing to spot what later appear to be obvious problems in the behaviour of the code.

    There are a few automated bug-checkers (not style-checkers) which have paid attention to keeping a low false positive rate, or can be configured that way, and they are genuinely useful. Findbugs is one example. This has been at least partially superseded by SpotBugs, which however has default settings that cause it to produce warnings that are not necessarily signs of outright bugs.

    • Code review, like any aspect of a project that involves people to compromise, can lead to problems. One of the best programmers I know quit a job because of the code review practices.

      When done well, code review is transformative. And I don’t say that lightly like some kind of NSF program officer. When done well, programmers welcome code review. But everyone has to be on board with it being a cooperative process, not a platform for their pet peeves. This takes some maturity on the part of developers. For Stan, it took a while for practices to settle down. Auto code formatting helped.

      It’s not a silver bullet for code problems, though. It’s very hard to assess correctness of code during code review. That’s why good review has to focus on test coverage and ask questions like whether the boundary conditions are well enough deefined in the doc and match in the tests. What review should be focusing on is whether the code is idiomatic, whether the right algorithm is being applied to the problem, and whether the testing and doc is thorough.

      For large projects, style consistency is important. Assuming you’ve chosen an autoformatter, this is more about writing idiomatic code in whatever language is being used and sticking to the project’s style guide on things like pointers vs. references and exceptions, not about low-level decisions like length of variable names.

      Length of variable names is a difficult balancing act. If they’re too long, the structure of the code is obfuscated. Being able to visually scan the structure of code is what all this formatting is about. If it’s not easy to see the structure of a function, that’s a sign it needs to be broken into a series of smaller function calls. If variable names are too short, you lose track of what they are. The second problem should be mitigated by keeping functions short, where variable names can be kept short, too.

      Please don’t leave commented-out code around “just in case”. That’s what version control is for. The main reason is that it makes the code very hard to read, especially if it’s commented out with block comments instead of line comments. When I started working on speech recognition at Bell Labs in the 1990s, my lab-mates handed me a 100 page long TCL/TK program, but reassured me only a few pages of it were active. Of course, they didn’t tell me which ones. You seriously can’t make this stuff up. They were electrical engineers who thought they were good software engineers. If you insist on having commented-out code in files, at least do it with line-level comments like # (R) or // (C++) so that the block that’s commented out is clear visually.

      • Phillip Middleton says:

        Late to the party again .. but agreed with all you mention. More specifically, keeping code format/style to language official guidelines (i.e. PEP-8, etc) certainly doesn’t hurt, within reason. And yeah, production-release code in my view shouldn’t have alot of commented code, if any.

        The real (and in my mind the most critical) issue is conducting good high to low level design of the library/package/API/etc with respect to its intended behavior. I was always told not to write a single line of code (excepting for some small and simple cases) until the algorithmic design related to the intended function/class/thing is well defined and understandable. In my experience this aids code review immensely. We generally avoid, to the extent possible, the usual mewing over coding styles (and there can be much mewing…. ).

        Typing standards for variables (weak, strong, otherwise completely uninterpretable) far supercedes ‘style’ish’ elements in code review. I’m thinking weedy things like char length of a variable here (though this may be a requirement – for example, if passed to some endpoint system, say, a particular RDBMS or messaging system that makes such limitations). For writing clarity, shorter is generally better, but clarity of the variable/function/class meaning in plain language (vs x , n , y , z , et al.) is part of balancing the ‘form/function’ scale – understandability of the code being the goal.

  4. Janne Sinkkonen says:

    On “I’d be curious as to what other people do to test and organize their research code”:

    Maybe I shouldn’t say anything about organizing the code, for my filenames are often 1.R, 2.R, etc., or their Python equivalents. But on testing: I test as I write, in REPL, basically every line. That is often sufficient for smallish research projects.

    A big question is how much abstractions to use in a research code: when should I write a class or a function, and how general those should be. Another one is, how to get from research code to production code without writing everything from scratch.

    • The man of million programming errors says:

      I also test my code in REPL as I write it.

      Then, when coming back to it in a different session, I realize I had unintentionally made it depend on some variable in the global namespace; maybe because it’s name was changed or forgetting to include it in the parameter list for the function.

    • What do you mean by testing a line of REPL as you write it? Lines of REPL code look like function calls or assignments. How do you test them? For example, where do I put tests in:

      mu = rnorm(1e6)
      sigma = exp(rnorm(1))
      y = rnorm(1e6, mu, sigma)

      What would the tests look like?

      What happens when you change one of the earlier lines?

      Usually, you want to write classes or functions when code would otherwise be too hard to read. Classes and functions also act like a better form of doc (code doesn’t lie), at least assuming they’re named coherently. What you don’t want to do in research code is grasp for abstractions for things you might need next year. Code what you need when you need it and only when you find yourself cutting-and-pasting the third time (this is just a rule of thumb) should you worry about encapsulatting the functionality into a function.

      Usually it’s easier to just start over for production codes. The concerns are so different that you really want it organized differently from the get go.

      • Bob, I think a lot of research code meets the template of the Jupyter Notebook sort of thing… Basically what it does is suck down a bunch of files from the internet/a directory, and fit several different models spitting out some graphs and soforth and then halt.

        It’s really a very fancy pocket calculator.

        In that context, the tests look like you execute the code up to point x… then you by hand use the REPL like a debugger, verifying that the state at point x is what you think it should be… then you continue executing along to the next point… manually verify… etc.

        Or you might put the tests right in the code.

        x = f(a,b,c)

        x < 0 && error("x is negative fix me")

        refactoring this kind of code into abstractions and soforth is often a waste of time… Yes I generate the same kind of graphs 3 or 5 times in this script by copying and pasting, but it turns out each time the graph has to be a little different, manually positioning certain labels for example by plopping magic numbers into the label code to get things to look just right… so writing one master function with a set of switches to determine what special version of the graph I want is no less work than just taking the graphing code template and modifying that. Of course, if I'm doing it 35 times it's probably a problem, but even up to say 7 or 10 times it's probably no better to write and debug a special function with different branches etc to get just the right aesthetics in each graph.

        What **is** absolutely useful is to create function abstractions for mathematical/computational functions in use in the code. Don't just use the expression a+b*c*sqrt(a/b)^3+1/(1+x^2) by copying and pasting it everywhere always referring to global variables! Don't write some raw for loop that searches through an array and copy and paste it all over the place modifying the conditions you're looking for each time.

        When the structure of your code is "do the following set of tasks exactly once in order and then end" then it makes sense to use this "script" or "notebook" or "batch" style of coding… and when doing the batch style coding in a REPL, the REPL is a lot like a debugger.

        I don't necessarily advocate doing things this way, but in research code, often you just don't know what you need to do next at any given point in time… It's like a prolog implementation, you're always searching and then backtracking..

        so if my code looks like




        when I hit the next line I don't know what I need to do. What comes next depends on the values computed so far, and my interpretation of the graphs I've output. I might well spend several minutes in the REPL manually graphing things, and eventually deciding that the best next step is:


        and when I see the output of quux… I might well decide that choices I made at line foo were wrong, and then I backtrack and change foo (maybe a choice of model for example) and then based on that I may need to change bar, and baz and choose quux2 instead.

        It's a very different kind of programming from one where there's some basic spec that the client has given you and you're supposed to implement that functionality.

        • you execute the code up to point x… then you by hand use the REPL like a debugger, verifying that the state at point x is what you think it should be

          Maybe this is just semantics, but that doesn’t meet the threshold I have for a test, which is being automated and reproducible. Unless it’s actually added to the code and updated along with everything else. And then it seems like a very awkward way to proceed for code. I understand the process and develop myself sometimes this way, but only for a couple lines at a time before moving into real code.

          It’s really a very fancy pocket calculator.

          That’s the problem in a nutshell. Even working in REPL environments, I work a little bit at a time as you describe, but then I always pull everything into a master script that’ll run everything. I usually don’t test much at all (at least in the sense that I’d consider a test).

          Yes I generate the same kind of graphs 3 or 5 times in this script by copying and pasting, but it turns out each time the graph has to be a little different, manually positioning certain labels for example by plopping magic numbers into the label code to get things to look just right… so writing one master function with a set of switches to determine what special version of the graph I want is no less work than just taking the graphing code template and modifying that.

          That’s a good point. This always drives me crazy when I run into it. Functional programming can help with some of this, but then the code’s more complex. But as you mention, when it’s simple shared functionality like a math function, it’s clearly better to pull it out into a named function. Where to make that call is a matter of judgement again, but I’m with you in usually just duplicated five calls to graphics. With ggplot, there’s often a lot of shared structure I can exploit, so I often build functions to build particular kinds of graphs, then the work’s largely in getting the data in the right format for the function (and perhaps adding bells and whistles, which is easy with ggplot’s imperative setter style).

          “do the following set of tasks exactly once in order and then end”

          My code almost never looks like that. I often have loops with nested functionality and edge conditions (which often matter even with one data set to make sure you don’t accidentally drop the first or last data point, for example).

          It’s a very different kind of programming from one where there’s some basic spec that the client has given you and you’re supposed to implement that functionality.

          I’ve never seen that kind of thing in action. When I worked in industry, the marketing department would come up with the functional spec for a bunch of features, then engineering would price them out in terms of time, person power and risk. Then marketing would choose from that menu and engineering would come up with technical specs. Then you started implementing. Those technical specs usually just outlined the client API and inter-module communication APIs so that multiple people could start tackling the problem together. You never get a full blueprint you can piece togther. But even those are always understood to be a work in progress pending actually trying to code the damn thing.

          I don’t necessarily advocate doing things this way, but in research code, often you just don’t know what you need to do next at any given point in time

          That’s been true of all code I’ve ever worked on. Often I know the end goal, but not the logical next step. In fact, that’s usually where beginners get stuck—trying to break an overwhelming job into manageable pieces.

          I might as well say what I wind up doing in practice: writing statistical tests, not unit tests. I do things like simulation-based calibration to make sure my model’s coded well and posterior predictive checks to make sure things fit. I don’t think I’ve ever written something that felt like a unit test in R. But then I don’t write R modules for distribution. There I would definitely be writing unit tests.

          • > I work a little bit at a time as you describe, but then I always pull everything into a master script that’ll run everything. I usually don’t test much at all (at least in the sense that I’d consider a test).

            Exactly, in the sense of unit tests or even statistical tests, they are often not in the code, or maybe exist primarily at the end. You build a script, you test little bits of it manually at a time in the REPL convincing yourself that when you do

            quux = foo |> bar |> baz

            that foo is transformed by bar and baz into an object quux that has the right variables and data types and isn’t missing any items or whatever… and then move on. (This is Julia “pipeline” syntax)

            But at the end of the script, if you’re being careful, it might make sense to do a whole batch of “quality control”… Test that all the previously produced quantities “make sense” in some sense, and throw errors if they don’t. WHy wouldn’t you do this inserted into the code in between steps? That can obfuscate the heck out of the logic of what you’re doing further up, and often in research, the logic of the research is more important than the logic of the code… In the sense “why are they even doing this? What question is it designed to answer?” Not “what does this loop do to the data?”.

            In fact, I think it’s much better to have plain text explanation than a bunch of tests. Basically in Literate style, you’re writing your spec as you go along.

            At the end of the script, either the stuff you produced worked, and there are some data sets and a set of graphs in various places, in which case it’s nice to have some proof that it passes the “statistical tests” you mention…. Or it didn’t work… in which case all of it will need to be rerun, and it’s fine to have all the results of the diagnostics all in one place at the end of the script where you can refer to them and go back and figure out what needs work…

            When doing this stuff, I often work a little at a time in a REPL, figuring out what comes next in the script… but I always work in a master script (maybe with sub-scripts where I’m storing some specialized functions, like plotter functions or data munger functions) and I always run the master script start to finish in a batch way before producing “Master output” for a report. Sometimes I even run it twice and diff them to see that it produces the same output.

            Usually with research code, it isn’t just that you don’t know what next step to take, you might not even know what the end goal is at any given point. A lot of lines of code in my scripts are along the lines of answering the question “Hey would it be useful to consider the following question and maybe head in such and such a direction? Here’s a plot”

            Insisting that people should write a specialized function to make each plot of that sort is a huge waste of time. 80% of the time that plot will come out and you’ll say “the answer is no… keep that in mind now that we move on in a different direction”

            Here’s an example of this kind of Notebook. I made it so my friends could go get the latest COVID graphs and to get more experience with Julia. This is the static nbviewer version. I ran this this morning, so it’s got graphs up to date as of this morning.


            Note that I wrote a graph maker function, and then reran it over each state… that’s a good example where you really don’t want to copy and paste.

            You can boot this up on binder as well and interact with it:

            The first time binder has to build the Docker image, so I ran it myself first to hopefully cause it to cache it… this Docker image takes like 10-15 minutes to build because this repo has in-progress data analysis tutorials using a wide variety of data munging tools, many of which aren’t used in this particular notebook. Still, it has to download and precompile all those things before it has a Docker image.

            This is not typical of actual interaction with Julia, though Julia is known for its “time to first plot” issue, because the JIT compiler has a tendency to chew hard on certain things when you use a new library:

            For example:

            julia> @time using Queryverse
            21.783020 seconds (55.90 M allocations: 2.665 GiB, 6.63% gc time)

            Queryverse is a “kitchen sink” library like tidyverse. When you load it Julia does a bunch of compilation to create the various methods it brings into your namespace. But once you pay that price… it’s got super-super fast CSV reading, and very nice Query pipeline tools, as well as an interface to VegaLite plotting. So

      • Anonymous says:

        > What do you mean by testing a line of REPL as you write it? Lines of REPL code look like function calls or assignments.

        I meant roughly what Daniel described above. I test individual assignments or steps of processing (steps of a pipeline, plots, files, etc.) by evaluating them in a somewhat representative environment, like with a subset of my data, the original data if short enough, sometimes with special cases like zero-length vectors or whatever they are. But the key points were to do that in REPL (Python, R), and in _small_ pieces, and test _while_ I write the code. That applies especially for algorithmic code, which would be very hard to debug and which comes with high likelihood of errors if I write entire functions.

        But I understood that this kind of development you don’t count as testing as it leaves no test functionality to the code (except a couple of asserts here and there.). Well that’s not testing then, just my way of writing research code…

        The thing with abstractions is that you never quite know how things would develop in future. It feel good to write clean code, including functions and classes and avoiding repetition. But as what happens with the algorithms is often unexpected, the abstractions tend to break, or the whole code abandoned, in which case spending time with cleanliness has been at least party wasted. Maybe that holds for all software development, my feeling just is that with research code things tend to be particularly unexpected on average, and the code often kind of disposable, so it makes less sense to invest in it.

  5. Daniel Marlay says:

    With regards to best practices for research versus publicly distributed code, I’ve found the following two papers to be quite relevant:

    – Best practices for scientific computing –
    – Good enough practices in scientific computing –

    What I particularly like about the second paper is the way it adapts the recommendations from the first paper to what is actually practical and achievable for a typical researcher. It recognises that most researchers are not software engineers, and until they start to release publicly distributed code with a significant user base they probably don’t need to be.

    I think it is also important to recognise that for research code, while software engineering practices help, the goals are slightly different. Correctness is important (we want reproducible research), but reusability, maintainability and extensibility are significantly less important. They do still have some importance, as the practices that contribute to these goals also aid in supporting correctness, and we know that often research code goes on to have a life and use much longer than originally planned. Nevertheless, refactoring is your friend and there is definitely such a thing as premature optimisation.

    P.S. Completely agree with the comment on “The Pragmatic Programmer”

    • Thanks—that’s exactly what I was looking for. I already object to the title, though. I think “best practices” depend on the goal. That’s different for API design in a public-facing package and in one-off research code. What I want is a set of best practices for research code that’ll help me feel more confident my code’s doing what I think it’s doing. That’s probalby what they meant by the second article.

      At a quick glance, I agree with and I’m already doing most everything the checklist in the second article says (though who knows why they felt compelled to talk about directory layout as if we’re all writing unix code). The one place I fall short is in submitting code and data sets for DOIs. That might make sense for code and data I want to support into the future.

  6. cs student says:

    I would be very interested in a post on best practices for testing research code, consider this comment a vote for that piece.

    It’s surprising how much of the testing advice from the software engineering world assumes that you are on a large team and have end users other than your future self.

    There are a lot of goals of software testing that often just aren’t applicable to research code. Like “make sure you’ve captured all the customer’s requirements”, “testing forces you to design so it’s easy to refactor”, “make sure future hires can figure out how the code runs”, “make sure we can do continuous integration so deployment is easy”, “figure out a good way to mock out our database and the API we use”, “let’s make sure we don’t have security flaws”.

    It seems to me that in a sense the main reason to test research code is to be sure that your scientific conclusions are correct for the experiments you ran. Very important, but still secondary, is whether your code is correct on other inputs for experiments you haven’t run. It seems like the other usual benefits of testing either don’t exist for research code or are just not very important. I’m talking code written for a single project — once you start maintaining a package over the long term, even for internal use, I think probably a lot of the benefits of testing come back. And perhaps these statements are too strong.

    There are also many things in the research context that are sort of neglected by most of the written advice I’ve seen on testing. Like how to carefully design tests to deal sensibly with floating point and numerical stuff. Or dealing with randomness in Monte Carlo type algorithms: of course fix the random seed, but in some cases irrelevant changes to the algorithm end up giving you slightly different results under the same seed. I’m sure knowledge of how to do these things right is out there but it’s not usually included in the advice on how to write good tests.

    I think there are some good ideas that can probably be imported directly. Quickcheck-style property testing seems like a big one — cheap, low effort, great fit for research code context, and with scientific and mathematical programs in particular there are often really obvious properties (invariant to permuted inputs, etc.) you know your functions should have.

    • There are a lot of goals of software testing that often just aren’t applicable to research code.

      Everything you listed can be relevant in the right situation for research code.

      1. Client(*) requirements. You’re the client now. You definitely want to be careful about capturing your requirements in research code. Starting by enumerating those requirements in a functional spec is often helpful, and I just mean a few bullet points, not a 20-page document. (*)”Customer” sounds commercial; anyone or anythnng that uses software is a client.
      2. Future hires. That’s you or your colleagues in the future going back to look at your code. This may just be someone reading your source who’s interested in your paper.
      3. Continuous integration. Think of this as just keeping the software in a working state. Every time you change something, you want to make sure you can still run everything and you still get the right results. Continuous integration just automates this process, but the automation isn’t the key part unless you have a bigger team. It’s testing every change and keeping the code in a runnable state.
      4. Figure out a good way to mock out our database. OK, that one’s usually not relevant for a single-person project, but that’s a very specific thing. You might be interested in mocking out some statistics before computing them seriously, though. For instance, I might implement with central intervals before getting around to adding shortest posterior intervals, but everything in the graphics or whatever can be mocked out with the easier implementation. Think of this as developing in stages. Sometimes it’s easier to do with stubs for the pieces. It makes it possible to test things independently, which can be huge.
      5. Security flaws. This is important but usually trivial for most things like R analysis code. But you do want to be careful that your code doesn’t do thing like rewrite system files, etc., or require root permission to run a simple analysis, etc.

      There are whole books written about numerical analysis, much of which centers around testing. That’s only a small set of the code that’s written in the world, so the general advice usually doesn’t go there. There’s much less written about testing probabilistic algorithms, but we’ve been hard at work on that methodology from simulation-based calibration to posterior predictive checks to cross-validation. If people have good references here, I’d love to hear about them. For example, you need to fix the seed for reproducibility, but the bigger issue of replicability would have you run with different seeds and make sure the result isn’t seed dependent. With MCMC, you can check end to end by comparing estimates using MCMC standard error estimates. This can be problematic because you need enough draws to accurately estimate MCMC standard error.

      I hadn’t seen Quickcheck before. That looks interesting. We’ve built similar tools ourselves to test automatic differentiation and other general properties of Stan programs.

      • cs student says:

        those are all very good points. i wouldn’t have made those connections at all, but they all make sense. perhaps many best practices in software engineering have a “lighter-weight” equivalent, with less infrastructure, that makes sense when writing research code.

        i wish more research code were well-tested, and i wish i were better able to test my own code. if there were some clear practical advice specific to the research code context, I think that would be a really valuable thing. just directly following advice intended for big companies doesn’t work though — i’ve tried, too much overhead — and it seems like it probably takes a fair amount of experience to know what practices to keep and what to discard, and how thorough to be.

        possibly an interesting subject for a blog post :)

  7. Hey Bob,

    Thanks for your commends. It’s been very insightful and education reading through the post and entire discussion.

    One thing I’d push back in relation to which code to review is that in commercial settings, I think it’s a different calculus that a research setting. Time doesn’t permit everything in an entire code base to be evaluated and so I don’t think most Data Science teams can review everything.

    With that said, the exception would be teams where the work is directly being productionalized and could have an immediate impact on product performance. For example, I’d hope that a team that does the product recommendations at a large ecommerce site is reviewing everything as that impacts the bottom line and the overall user experience. Of course, that stuff is generally not written in R. And that is another distinction that is important. If a team is productionalizing code in R as opposed to java/python/etc, will dictate the nature and breadth of the code review. I’m not suggesting that R isn’t used for critical stuff, but I’d image how teams in commercial settings approach code review should largely be dictated by the criticalness of the analysis.


  8. Carlos Ungil says:

    > With just doc, the key here is to make sure the API that is being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

    The users of the API may feel that being easy to use is more important than being easy to document, code and test.

    > The doc should really say what happens when both scale and rate are specified.

    The easiest solution for that undocumented behavior would be to document that the behavior is unspecified. Or undefined, which makes testing even easier. ;-)

    Giving enough rope to the users (even if they could hang themselves getting undocumented output from inconsistent input) may be better from an utilitarian point of view than keeping them on a short leash because it’s more convenient to the programmer.

    • I think the appearance of easier to use in ambiguous APIs is incorrect, it’s just easier to get bugs.

      When you write a line of code, what it does should be totally unambiguous and clear. This is the best way to get the right answer. I mostly hate debuggers, and I find bugs by reading the code carefully. There’s nothing worse than the second full moon type bugs that only strike under weird circumstances where you couldn’t know what happens by just looking at the code.

      I worked at a financial data analysis firm that had a huge number of scripts written in C-shell… the way the script worked was ENTIRELY determined by unknown global environment variables. There was absolutely no way to know what it would do when you ran it by reading the code or looking at the way it was called… horrors

      • For those who like horror stories… this was during the Y2K transition, and at the last minute they repurposed our research team to fix bugs in the data processing team’s code. Eventually I had to get them to clone me a new server that had identical layout to the production server (back in the days when this meant buying an expensive Sun microsystems box) and then I’d touch every file in the filesystem to set the modification date to yesterday… and run the script… and then do a find over the entire filesystem to find all the files that were modified by that script because … there was no real effective way to know what the script had done from reading it.

        Good times… err no.

      • Carlos Ungil says:

        > When you write a line of code, what it does should be totally unambiguous and clear.

        I don’t disagree with what your comments, but If I write a line of code that calls gamma with both a scale and a rate argument it’s my fault that it is unambiguously and clearly wrong. ;-)

        Would it be nice if the implementation noticed the issue? Yes, but it’s not even clear what the “right” behavior would be: error or warning, when both arguments are present or when they are inconsistent.

        Absent a better implementation, would it be nice to lose the ability to define gamma using either scale or rate? No, then I may need to implement it myself (and if I cannot be trusted to not provide inconsistent arguments I may not do a great work writing, testing and documenting the lost functionality).

        Would it be better to keep it but using two functions gammaWithScale and gammaWithRate instead of a single function gamma with keyword arguments? It may be more convenient if the coverage of tests is really an important objective but I don’t think the user gains much by being forced to use different functions.

        • I think in R the correct behavior would be to allow either scale or rate. If both are given:

          error(“you can not provide both rate and scale at the same time when calling [drpq]gamma functions.”)

          A nice thing about Julia is that you can have *one* constructor name, namely Gamma, but it has multiple *methods* so that:

          Gamma(shape; rate=myrate)

          are all potentially ways to construct a Gamma structure that represents a distribution.

          R doesn’t have this possibility of course.

          • Carlos Ungil says:

            > Gamma(shape,scale)
            > Gamma(shape; rate=myrate)

            This solution is worse than the one where both scale and rate are keyword parameters because it requires the user to know that “scale” is unnamed while “rate” is named. The user cannot do Gamma(myshape, scale=myscale). Nor Gamma(shape=myshape), for that matter.

            It there are three alternatives (say a circle is defined giving the radius, diameter or area) the trick of defining one as the second argument in a two-positional-args function and another as the keyword argument in a one-positional-arg-plus-one-keyword-arg function won’t be enough to solve the “problem”.

            • Well of course, in Julia you CAN offer Gamma(myshape; scale=myscale) as additional method. You can have many methods for every constructor. The point is you shouldn’t have any constructor where two *conflicting* pieces of information can be provided at the same time.

              • Carlos Ungil says:

                julia> function gamma(shape; scale=1)
                println(“gamma with shape=”, shape, ” and scale=”, scale)
                gamma (generic function with 1 method)

                julia> gamma(1, scale=2)
                gamma with shape=1 and scale=2

                julia> function gamma(shape; rate=1)
                println(“gamma with shape=”, shape, ” and rate=”, rate)
                gamma (generic function with 1 method)

                julia> gamma(1, rate=2)
                gamma with shape=1 and rate=2

                julia> gamma(1, scale=2)
                ERROR: MethodError: no method matching gamma(::Int64; scale=2)
                Closest candidates are:
                gamma(::Any; rate) at REPL[3]:2 got unsupported keyword argument “scale”
                [1] kwerr(::NamedTuple{(:scale,),Tuple{Int64}}, ::Function, ::Int64) at ./error.jl:157
                [2] top-level scope at REPL[5]:1

              • keyword arguments are not something I’ve worked with a lot yet in Julia. It turns out the keys are not part of the type signature of the method, so you have to have them all specified or use the splatting syntax in which case you get a named tuple…

                I think the technique closer to what I’m talking about would be something like


                now you can either pass a Shape struct and a Scale struct, or you can pass a Shape struct and a Rate struct… but you can’t pass both. (Note that this isn’t how Distributions.jl actually works, but in principle it’s possible to design this kind of API).

              • The actual Distributions.jl implementation uses numbers rather than structs, and only interprets the second number as a scale:

                Gamma() # Gamma distribution with unit shape and unit scale, i.e. Gamma(1, 1)
                Gamma(α) # Gamma distribution with shape α and unit scale, i.e. Gamma(α, 1)
                Gamma(α, θ) # Gamma distribution with shape α and scale θ

                params(d) # Get the parameters, i.e. (α, θ)
                shape(d) # Get the shape parameter, i.e. α
                scale(d) # Get the scale parameter, i.e. θ

              • Carlos Ungil says:

                You can also do that in R.

              • another option would be something like

                gamma(shape,par2; partype=:scale)

                now you could do



              • R has so called S3 classes, and S4 classes (and apparently something called R5)… the semantics and their interactions are a disaster. There is nothing keeping you from turning anything else into an object of a particular class in S3:


                “You can use this approach to turn any object into an object of class “foo”, whether it makes sense or not.”


                S4 is somewhat more strict, but also less widely used, and much more verbose…

                As far as I know, neither system works on *multiple dispatch* where the entire type signature of the function matters. it’s only the first argument that is dispatched on.

                Basically it’s ugly, hacky, has not well defined interactions between the S3 and S4 classes… you may have to define everything twice, once with S3 and once with S4 in order to get the appropriate functionality… I don’t know anyone who actually uses all this stuff in research level code (I’m sure some of the libraries do, like rstan or whatever).

                The whole thing is a mess in my opinion.

              • Carlos Ungil says:

                > As far as I know, neither system works on *multiple dispatch* where the entire type signature of the function matters. it’s only the first argument that is dispatched on.

                If you want your knowledge to go farther read a couple of paragraphs into your second link.

              • Christopher says:

                For what it’s worth, the ‘vctrs’ package (which is slowly being used to replace the foundations of the Tidyverse with something more consistent) does have a way for double (or multiple) dispatch on S3 classes; it’s not super pretty, but it seems to work.


              • Anoneuoid says:

                which is slowly being used to replace the foundations of the Tidyverse with something more consistent

                Tidyverse is such a dependency hell… Almost every update breaks something and generates more work. I never understood how it became so popular.

              • Cool that they’ve managed to figure out multiple dispatch in R, it still isn’t going to convince me to continue to use R. I’ve abandoned it in favor of Julia, because I think that experience is way way better. The semantics are well thought out, it’s compiled, it runs FAST which R will never do without putting performance sensitive code in C/Fortran/C++ libraries… and it has excellent macro capabilities.

              • Carlos Ungil says:

                The point of my clarifications about the things that “of course” R and Julia can or cannot do was not to keep you using R. But it’s not surprising that multiple dispatch is not a reason for you to stay. If you cared about it, you wouldn’t have just learnt that it’s part of base R. The design of S4 and its implementation in R has flaws but being recent is not one of them: it was included in R 1.4 in 2001.

  9. The single most frustrating thing about R is that it’s dog slow. Noone writes significant code directly in R, it’s always bolted on from C/C++. It has lots of good libraries mostly written in C but if your code needs to run fast it needs to be in a different language. Whenever you install an R package most of the install time is spent running gcc.

    Therefore the payback to learning S3 or S4 object systems is that you can try to write things and learn that you have to rewrite them. At least that’s my experience.

    In contrast to in R where if you work at it a bit you can get some obfuscated code with multiple dispatch on S4 classes and maybe S3 in the very latest stuff as mentioned above… Julia gives you multiple dispatch on every kind of argument, and this isn’t just some clever way to be expressive and what the cool kids are doing…it compiles you a specialized method that takes advantage of the type knowledge to be extra fast.

    Suppose for example that you are going to write an agent based simulation… you’ll have 1M objects moving around a 2D space. The objects need to have special behaviors based on their combined roles during interaction, and the environment. For example maybe it’s a COVID simulation like the Imperial college model, or an ecology model, or molecular Dynamics, or cellular interactions during embryo’ll call say


    on random pairs of nearby objects for each square. You interact a million times for each simulated day, and you want to run a simulation of 20 years of interactions. If you code this in R you are wasting your time… because it will run easily 100 times slower than Julia (or Rust or Scala or SBCL or God forbid C++) and you WILL wind up recoding it in some other performant language.

    Basically the payback to learning advanced R is that you get to write hinky code that’s so slow you’d never want to use it… I see it as all disadvantage.

    What R is good at is reading in data and making graphs. Putting stuff in and out of databases. And calling C code. Any other computation is painful.

    I started using R in 1997 or so with version 0.6 it was a breath of fresh air compared to being forced to use SAS and had much better statistical graphics than Python… since then it’s gotten way better in terms of graphics… but it’s always remained dog slow and both python and R are incapable of being fast in and of themselves. Any significant computation in either one is always in a C/C++ library.

    Julia has what it takes not only to give good clean Semantics with limited verbose puffery but to execute in a timely manner. And it’s not missing very basic stuff, like hash tables (Dicts in Julia, and some fairly complicated hacks implemented in a variety of ways in R

    R is the language where whatever you want someone implemented it in some other language and bolted it on. If it doesn’t have what you want… get out some other language and bolt it on.

    • This was supposed to be a reply to Carlos…

      Also, in my opinion this isn’t really controversial, R doesn’t try to be a language in which you’d write simulations directly for example. If you think of R as a language that tries to be really good at reading in data and making graphs and calling special purpose C/C++ code that executes quickly… it’s doing a good job at what it tries to do.

      It’s just that it turns out there exists a language now that is trying to be an interactive language that is good at high performance distributed numerical computing as well as data processing and making graphs, and it’s designed by computer scientists with a lot of perspective on computation. Whereas S was designed by statisticians at Bell Labs basically to do interactive graphics and to be extended by bolting on Fortran code… the first two books about S were:

      “In 1984 two books were published by the research team at Bell Laboratories: S: An Interactive Environment for Data Analysis and Graphics[3] (1984 Brown Book) and Extending the S System” (Wikipedia: )

      I’m sure “Extending the S System” was about how to write Fortran code to make S do stuff fast. ( Yep… sure enough).

Leave a Reply