REPLs vs Notebooks

December 30th, 2019


tl;dr

            state        history      language
Notebook    persistent   modifiable   imperative
REPL        ephemeral    immutable    declarative

There are three common ways of “running” code: REPLs, notebooks and execution.

Let’s start with execution. By this, I’m referring to running python main.py from your command line. It runs like you might expect, with only things explicitly printed showing up in stdout. Application code is almost always run in this mode, but it’s certainly much less friendly to developers. Peering inside state requires sprinkling your code with print or debug statements that you’ll need to remove later. With large code bases, reloading the application itself can take a few seconds.

REPLs (read-eval-print loops) are most commonly used for learning: my first experience with programming was summoning python from the command line and typing print 'Hello, world'. They are excellent for experimenting with simple concepts, and especially useful when you need to read and modify state that outlives your REPL session. For example, Chrome’s JavaScript console, psql and redis-cli are all REPLs. Set x = 1 and then run x * 2. Don’t like the result? Set x = 2 and then re-run x * 2.
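To make the loop itself concrete, here is a minimal sketch of a read-eval-print loop in Python. It is only an illustration, not how the python interpreter or any of the tools above are implemented. The function name repl and the choice to feed it a list of lines instead of reading from the terminal are assumptions made to keep the example self-contained.

```python
def repl(lines, env=None):
    """Read each line, evaluate it against shared state, collect what would be printed."""
    env = {} if env is None else env   # the session state; gone once the session ends
    printed = []
    for line in lines:                 # read
        try:
            result = eval(line, env)   # evaluate an expression such as `x * 2`
        except SyntaxError:
            exec(line, env)            # statements such as `x = 1` have no value
            result = None
        if result is not None:
            printed.append(repr(result))  # print
    return printed                     # loop ends when the input runs out


# The session from the paragraph above: set x, run x * 2, change x, re-run.
print(repl(["x = 1", "x * 2", "x = 2", "x * 2"]))  # ['2', '4']
```

Note that the history here is append-only: changing x means typing a new line, not editing an old one.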

Notebooks are REPLs as documents. Jupyter notebooks have code cells that are executed by the IPython interpreter. The odd thing about Jupyter notebooks is that cells are not run in any given order: you can easily end up with circular dependencies and non-reproducible results. ObservableHQ has a fascinating take on this. Each cell is like an async function that can be assigned a value. References are tracked, so that when any cell is updated, all the cells that depend on it are updated as well (this makes for downright magical notebooks when Generators and setTimeouts are used). There’s an invisible DAG behind all the cells. Runkit is an interesting in-between. It also uses the “cell” concept as the unit of execution but, unlike Observable, it runs top to bottom: re-running any intermediate cell re-runs all cells below it. Observable’s dependency tracking is one of its key brilliances: they leveraged the fact that JavaScript deals with async much better than Python to make each cell a “promise” that, when it resolves, resolves across the notebook. This relieves the problem of constantly shifting intermediary state, where x = 1 might not actually be true depending on what you ran last. x is x no matter where you are in the notebook.
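The idea of cells connected by an invisible DAG can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Observable’s implementation: the class ReactiveNotebook and its define method are hypothetical names, cells are plain synchronous functions, not promises, and every cell is recomputed on each change where a real system would recompute only the affected ones. It relies on graphlib from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ReactiveNotebook:
    """Toy notebook: each cell is a function of other named cells."""

    def __init__(self):
        self.cells = {}    # name -> (function, names of the cells it reads)
        self.values = {}   # name -> current value

    def define(self, name, fn, inputs=()):
        """Create or redefine a cell, then bring the whole notebook up to date."""
        self.cells[name] = (fn, tuple(inputs))
        self._recompute()

    def _recompute(self):
        # Recompute everything in dependency order (simple, not efficient).
        # TopologicalSorter raises CycleError on circular references.
        self.values = {}
        graph = {name: set(inputs) for name, (_, inputs) in self.cells.items()}
        for name in TopologicalSorter(graph).static_order():
            if name not in self.cells:
                continue  # referenced by another cell but not defined yet
            fn, inputs = self.cells[name]
            if all(i in self.values for i in inputs):
                self.values[name] = fn(*(self.values[i] for i in inputs))


nb = ReactiveNotebook()
nb.define("doubled", lambda x: x * 2, inputs=["x"])  # defined before x exists
nb.define("x", lambda: 1)
print(nb.values["doubled"])  # 2
nb.define("x", lambda: 2)    # editing x updates every cell that depends on it
print(nb.values["doubled"])  # 4
```

Because values are derived from the graph and not from the order in which cells happened to be run, the position of a cell in the document stops mattering.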

From the user’s perspective, what differentiates REPLs from notebooks is the ability to easily view and modify history local to your notebook. You run a series of commands transforming data, with each command producing intermediary state. If the final data is not what you expect, you review your commands and intermediary states, find the one that is not correct (sometimes by visualizing the intermediary state), fix it and re-run. When exploring and experimenting with ideas, not having to reset state or manually backtrack from your final state saves cognitive overhead. You technically could do this with databases, but you’d need to retain all past commands run against the database and (for performance reasons) the intermediary states to be able to rewind to a past state and play forward. This would be cost-prohibitive.
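The “retain the commands and the intermediary states, then rewind and play forward” idea looks roughly like this in Python. It is a sketch under assumptions: the History class and its run and replace methods are hypothetical names, and each command is assumed to be a pure function that returns new data without mutating its input.

```python
class History:
    """Keeps every command and the intermediary state it produced."""

    def __init__(self, initial):
        self.initial = initial
        self.steps = []    # the commands, in the order they were run
        self.states = []   # states[i] is the data after steps[i]

    def _previous(self):
        return self.states[-1] if self.states else self.initial

    def run(self, step):
        """Run a new command on top of the latest state."""
        self.steps.append(step)
        self.states.append(step(self._previous()))
        return self.states[-1]

    def replace(self, index, step):
        """Fix one past command, rewind to just before it and play forward."""
        self.steps[index] = step
        del self.states[index:]            # everything from here on is stale
        for later in self.steps[index:]:
            self.states.append(later(self._previous()))
        return self.states[-1]


h = History([3, 1, 2])
h.run(sorted)                  # states: [[1, 2, 3]]
h.run(lambda xs: xs[:2])       # states: [[1, 2, 3], [1, 2]]
print(h.run(sum))              # 3
# The first command was wrong: we wanted the two largest values.
print(h.replace(0, lambda xs: sorted(xs, reverse=True)))  # 5
```

For a few small lists this bookkeeping is cheap. For a production database, every entry in states would be a snapshot of the data, which is where the cost comes from.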

Notebooks, therefore, are mostly useful as data dumping grounds: extracting a narrow slice of truth into notebook state and preparing it for presentation. REPLs, on the other hand, are much more useful for working with the trunk, and any REPL state is ephemeral.

Snowflake’s acquisition of Numeracy has crystallized this for me. Historically, one of the most common requests from Numeracy customers was Python notebooks. We eschewed them, and with Snowflake’s acquisition they seem even less likely. SQL clients are effectively REPLs.[1] Introducing notebook-like state would force the product to represent the notebook state in addition to the query and database state. This is made doubly confusing by sharing and collaboration, where you must now resolve divergent state for arbitrary data objects.

  1. As a side note, one of the reasons that analysts turn to Python notebooks is to be able to use a Turing-complete, imperative language. SQL was designed for the insertion and extraction of relational data, not the transformation of ordered data.