My favorite functions for data manipulation

Dec 15, 2024

Software engineering is all about breaking complex problems into simpler parts. Great engineers have a real talent for this. Early in my career, I learned a strategy from one such engineer that I use almost every time I write code. It’s simple, low-level, and widely applicable. I’ll call it map, reduce, or filter. The basic idea is that much of what programs do boils down to three fundamental operations on collections of data - mapping each element to another element, reducing the collection to a single value (or to another container for the collection), and filtering elements out of the collection.

Most of the low-level work applications perform is simple data manipulation. You read data, transform it, and filter it, both when reading and when writing. Programming languages are extremely powerful and flexible, and any given program can be written in practically unlimited ways. That freedom doesn’t make the programmer’s job easy - you have to search the space of possible solutions for ones that meet your tradeoff goals.

A skilled programmer can rule out huge branches of this tree of possibilities very quickly, and focus on the ones that are more likely to lead to great solutions. Thinking about a problem as a combination of map, reduce, and filter operations on collections focuses you on tools that, in my experience, solve a surprisingly high percentage of problems.

Very often, you have a collection of elements, and you need to extract a value from each element, or more generally perform some transformation on each element. For example, let’s say you have a collection of user objects, and you need a list of their ids. At a high level, you are performing a map - you are mapping each value to another value.
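As a minimal sketch of the user-id example in TypeScript (the User shape and the sample data are assumptions made up for illustration):

```typescript
// Hypothetical User shape, assumed for illustration only.
interface User {
  id: number;
  name: string;
}

const users: User[] = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Grace" },
  { id: 3, name: "Linus" },
];

// map calls the callback once per element and returns a new array of the results.
const userIds: number[] = users.map((user) => user.id);

console.log(userIds); // the ids 1, 2, 3, in the original order
```

The result always has the same length as the input, which is a quick way to tell whether a step in your problem is really a map.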

Almost as often, you have a collection of elements that you need to repackage in another container, or extract a single value from. An example of repackaging is taking an array of user objects and putting them in a dictionary where the key is the user id and the value is the user object (in some languages this might be doable with map, but in JavaScript, for example, an object is a single value, not a collection). You might also need to extract a single value from a collection, like the newest user, or the sum of users’ account balances. These are reduce operations - you are reducing a collection to a single value.
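All three reduce examples above can be sketched in TypeScript. The User fields (balance, createdAt) and the sample values are assumptions for illustration, and createdAt is assumed to be an ISO 8601 date string so that plain string comparison orders the dates correctly:

```typescript
// Hypothetical User shape, assumed for illustration only.
interface User {
  id: number;
  name: string;
  balance: number;
  createdAt: string; // assumed ISO 8601 date, e.g. "2024-06-02"
}

const users: User[] = [
  { id: 1, name: "Ada", balance: 120, createdAt: "2023-01-10" },
  { id: 2, name: "Grace", balance: 80, createdAt: "2024-06-02" },
  { id: 3, name: "Linus", balance: 50, createdAt: "2022-11-30" },
];

// Repackage: array of users -> dictionary keyed by user id.
// Each step returns a fresh object instead of mutating the accumulator.
const usersById = users.reduce<Record<number, User>>(
  (byId, user) => ({ ...byId, [user.id]: user }),
  {}
);

// Extract a single value: the sum of all account balances.
const totalBalance = users.reduce((sum, user) => sum + user.balance, 0);

// Extract a single value: the newest user.
// Without an initial value, reduce starts from the first element
// and throws a TypeError on an empty array.
const newestUser = users.reduce((newest, user) =>
  user.createdAt > newest.createdAt ? user : newest
);
```

Copying the accumulator on every step keeps the sketch free of mutation at the cost of extra allocations; for large collections you might choose to mutate a local accumulator inside the callback instead.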

Also very frequently you will need to filter a collection. The filter operation returns a subset of the collection. For example, you may want to filter out inactive users in a particular view.
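A minimal TypeScript sketch of the inactive-users example, assuming a hypothetical active flag on each user:

```typescript
// Hypothetical User shape, assumed for illustration only.
interface User {
  id: number;
  name: string;
  active: boolean;
}

const users: User[] = [
  { id: 1, name: "Ada", active: true },
  { id: 2, name: "Grace", active: false },
  { id: 3, name: "Linus", active: true },
];

// filter keeps the elements for which the callback returns true,
// and returns them in a new array; `users` itself is left untouched.
const activeUsers = users.filter((user) => user.active);

console.log(activeUsers.map((user) => user.name)); // Ada and Linus
```

Note that the callback describes what to keep, so "filtering out inactive users" is written as "keep the active ones".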

These three operations obviously don’t solve every problem you’ll run into, but you might be surprised how many they do solve. I find these to be a great starting point for breaking down a problem. Start with the data, and think about the operations you need to perform on it in terms of map, reduce, and filter. You will probably get to a working solution pretty quickly, and from there you can optimize for performance, brevity, clarity of intent, etc. For example, if you need the newest user, a purpose-built lookup (a find on a list already sorted by creation date, or a max-by helper) might be more intention-revealing than reduce.
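To make the breakdown concrete, here is a sketch of one small problem, "what is the total balance held by active users?", decomposed into the three operations. The User shape and the numbers are assumptions for illustration:

```typescript
// Hypothetical User shape, assumed for illustration only.
interface User {
  id: number;
  active: boolean;
  balance: number;
}

const users: User[] = [
  { id: 1, active: true, balance: 120 },
  { id: 2, active: false, balance: 80 },
  { id: 3, active: true, balance: 50 },
];

const activeBalanceTotal = users
  .filter((user) => user.active) // keep only the active users
  .map((user) => user.balance) // user -> balance
  .reduce((sum, balance) => sum + balance, 0); // balances -> one number

console.log(activeBalanceTotal); // 170
```

This first version favors readability. If profiling later shows the intermediate arrays matter, the three steps could be folded into a single reduce.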

You could perform all of these operations with your favorite brand of iteration, like a for loop, but these functions have some non-obvious advantages over looping. Chief among them is that map, reduce, and filter promote immutability: given a callback without side effects, they behave as pure functions. They don’t mutate the collection you call them on - they return a new collection or value. There’s no need to declare reassignable variables just so that you can reassign them in a for loop. Immutability at the reference and value level cuts out two classes of cognitive load that usually just get in the way of understanding what the program is really doing. Along the same lines, another benefit of using map, reduce, and filter as functional primitives is that you clearly inform the reader of your intention - each of these functions does exactly one thing, whereas a for loop could be used for any iterative algorithm.
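A small side-by-side sketch of the same transformation written both ways. The scores data and the passing threshold are invented for illustration:

```typescript
// Hypothetical data, assumed for illustration only.
const scores: number[] = [72, 45, 90, 58];

// Loop version: an accumulator array is created up front and mutated
// on each pass, and the reader has to inspect the body to learn what
// kind of operation this loop performs.
const curvedLoop: number[] = [];
for (const score of scores) {
  if (score >= 50) {
    curvedLoop.push(score + 5);
  }
}

// Functional version: each call names its operation and returns a new
// array; nothing is reassigned and `scores` is never modified.
const curved = scores
  .filter((score) => score >= 50)
  .map((score) => score + 5);

console.log(curved); // 77, 95, 63
```

The guarantee only holds as long as the callbacks themselves are free of side effects; a callback that mutates its argument or outside state brings back the cognitive load the functions were meant to remove.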

Use of map and reduce in particular also promotes writing code that is composable. The function you pass to map or reduce is often reusable for mapping/reducing collections of different types of elements, or for individual elements. Code that was written to transform a single value can be easily adapted to handle a collection of values of the same type - just pass the same function to map.
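A sketch of that reuse, assuming a hypothetical fullName helper and made-up record shapes. The function is written once for a single value, then passed unchanged to map over two different kinds of collections:

```typescript
// Hypothetical shape shared by several kinds of records.
interface Named {
  firstName: string;
  lastName: string;
}

// Written once, for a single value.
const fullName = (person: Named): string =>
  `${person.firstName} ${person.lastName}`;

const ada = { id: 1, firstName: "Ada", lastName: "Lovelace" };
const users = [ada, { id: 2, firstName: "Grace", lastName: "Hopper" }];
const authors = [{ firstName: "Jane", lastName: "Doe", books: 2 }];

// Used on an individual element...
const oneName = fullName(ada);

// ...and on collections of different element types, with no changes.
const userNames = users.map(fullName);
const authorNames = authors.map(fullName);
```

Because fullName only depends on the two name fields, any element type that has them can be mapped with it.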

This is borrowing from and bastardizing the great canon of functional programming. My favorite exploration of the ideas of functional programming is from Tom Harding. Part of the magic of map is in its functional programming properties - arrays/collections are just one example of “containers” (to avoid using the infamous m word that rhymes with gonad), and map just means you are transforming the value a container holds without changing the type of the container. This can be useful for other constructs like Result, which can be a Success or a Failure, both of which implement map: Success applies the function passed to map to its value and returns a new Success holding the return value, while Failure just returns itself.
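Here is a minimal sketch of such a Result in TypeScript. It is not taken from any particular library; the class shapes, the error field, and the parsePositive helper are all assumptions for illustration:

```typescript
// Hypothetical minimal Result: a container that supports map.
interface Result<T> {
  map<U>(fn: (value: T) => U): Result<U>;
}

class Success<T> implements Result<T> {
  readonly value: T;
  constructor(value: T) {
    this.value = value;
  }
  // Transform the held value and wrap the return value in a new Success.
  map<U>(fn: (value: T) => U): Result<U> {
    return new Success(fn(this.value));
  }
}

class Failure<T> implements Result<T> {
  readonly error: string;
  constructor(error: string) {
    this.error = error;
  }
  // There is no value to transform, so the function is ignored and the
  // same Failure is returned. The cast only changes the compile-time type.
  map<U>(_fn: (value: T) => U): Result<U> {
    return this as unknown as Result<U>;
  }
}

// Hypothetical helper that produces either variant.
const parsePositive = (input: number): Result<number> =>
  input > 0 ? new Success(input) : new Failure<number>("not positive");

const double = (n: number): number => n * 2;

const doubled = parsePositive(21).map(double); // Success holding 42
const failed = parsePositive(-1);
const stillFailed = failed.map(double); // the very same Failure
```

The caller maps without checking which variant it has, just as mapping over an array works whether it holds zero elements or many.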

To summarize, a simple approach I take to breaking down problems involving data manipulation is to reach first for map, reduce, or filter as my functional primitives. This one cool trick reduces a surprising amount of cognitive load when searching for solutions to a problem, and provides building blocks for understanding much of what programs do.

Systemicity
