The Pillars of Creation. Only slightly more grand than modern software.

Working on Large Software Systems

The challenges, nuances, and opportunities that arise from incredible scale.

VidaVolta
8 min read · Aug 30, 2023


Engineering can be viewed as the dichotomy between creation and evolution.

The act of creation is fundamental to the human experience — it’s perhaps the most human thing there is. At the nexus of human creativity and technical prowess — new technology can pop into existence. This almost sacred process is sought after by many, but is inherently fleeting. The creations with merit continue to exist and grow and evolve as they are woven into humanity’s technological fabric.

To constantly be in a state of creation means to create nothing useful.

Evolution, while also fundamental, is notably less romantic. Yet the common experience is for an engineer to enter the lifecycle of a technology not at its genesis but sometime afterwards, during maturation, long after the initial bright light of creative expansion has dimmed. By this phase, the technology’s utility has caused it to expand in complexity and size, and its impact grows as a function of its ubiquity.

The place most find themselves in, therefore, is one of technological stewardship.

Maintenance consumes most of the effort and, if we are fortunate, meaningful evolution is still occurring. The challenges presented during the mature phase of a technology are of an entirely different sort than those at the beginning. During the initial phases, a barren, empty, green field is the engineer’s sandbox, and it is quickly transformed into an intermingled arrangement of creations that constitute the system.

Creation is a one-way door, and a useful creation tends to stick around. We can iterate on the existing, but we can never recreate it.

Iteration is fundamentally different from creation. In the former, there is a dual mandate — evolve without violating the existing; there is no such duality for creation. This reality is not a dire one — far from it. It’s in this dual mandate that we will find problems of worthy complexity — many of which require a strong creative instinct to overcome.

Transforming an existing system from one state to another, adding utility along the way, without impacting crucial invariants: this is the heart and soul of the largest technologies on earth. There is no canonical runbook for this category of problem, no single source of truth. The state-space is unbounded, the tradeoffs are fluid, and the problems are intractable.

Fortunately, we can navigate the complexity of large systems confidently, albeit imperfectly, with a set of principles and techniques paid for by the engineers whose shoulders we stand upon.

Let’s step down from the abstract pedestal for a moment and talk about some concrete ideas. We will only scrape the surface for now; future writings will peel back more layers.

First, let’s look to Charles Darwin’s theory of natural selection for some inspiration. To naturally select is to make an incremental change, then test it in the unforgiving arena of nature. Importantly, the change is small, so as not to disturb the essence of the organism too drastically. Just as importantly, the process is reversible: because the changes are small, selection quickly rejects the undesirable, often cruelly, by declining to promote the variation to an exponentially growing number of offspring.

In engineering terms, this translates to a rigorous continuous integration and continuous deployment process. A series of small changes, tested rigorously, allows change to arrive as a stream rather than a tsunami. Also borrowed from this natural phenomenon are fanout deployment architectures: geometric rollout is a common strategy that mimics the exponential promotion of favorable traits. The basic idea is to minimize the scope and scale of the initial changes, ideally catching and reversing problems while they impact only a small subset of the total system.
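To make that concrete, here is a minimal sketch of a geometric rollout loop. The `deploy` and `is_healthy` callbacks are hypothetical stand-ins for whatever deployment tooling and health checks a real system uses; this illustrates the wave pattern, not any particular production implementation.

```python
import time

def geometric_rollout(hosts, deploy, is_healthy, bake_seconds=300):
    """Roll a change out in exponentially growing waves (1, 2, 4, ... hosts),
    pausing after each wave so problems surface while the blast radius is small."""
    wave_size = 1
    deployed = []
    remaining = list(hosts)
    while remaining:
        wave, remaining = remaining[:wave_size], remaining[wave_size:]
        for host in wave:
            deploy(host)              # apply the change to this wave only
        deployed.extend(wave)
        time.sleep(bake_seconds)      # let the change bake under real traffic
        if not all(is_healthy(h) for h in deployed):
            return deployed           # unhealthy: caller reverts exactly these hosts
        wave_size *= 2                # promote the change geometrically
    return []                         # fully deployed, nothing to revert
```

A failure in the first wave touches one host; a failure in the last touches at most half the fleet, and every earlier wave has already survived real traffic.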

Secondly, and perhaps most importantly, we must acknowledge the limitations of our analytical abilities. Imagine what safeguards would not exist if, in our hubris, we concluded that our system analysis was guaranteed to catch all issues. If analysis granted omniscience, then all other mechanisms would be frivolous.

Of course, this is not the case. So we are left with the sobering reality that no matter how hard we try, an analytical fog of war persists. Clearing this fog has quickly diminishing returns, and it’s often more prudent to protect against the certain uncertainties than it is to invest in more analysis. We now arrive at the next feature of working with complex systems: production is the true testing ground.

Test-driven development purists may recoil at the idea of testing code in production, but it is an invariant: some testing always happens there, whether we acknowledge it or not. The problem arises when we fail to acknowledge this reality and allow that folly to color our decision making. We may analyze and create various types of tests, but nothing can fully capture the non-determinism of production.

Armed with knowledge of our own limitations, effort is instead spent on protective mechanisms to help detect, mitigate, and revert the problems that sneak through to production. This means observability, alarming, and mechanisms to help engineers quickly root cause and repair the system when things go awry. It also means designing systems that respond well to failure. Avoiding bimodality, ensuring static stability, and other design principles lead the way here.
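As one illustration of the alarming half of this, consider the sketch below. The `get_error_rate` metric reader and `trigger_rollback` callback are hypothetical; the shape of the mechanism is what matters: compare a live health signal against a pre-change baseline and act automatically when it drifts too far.

```python
def alarm_on_error_rate(get_error_rate, baseline_rate, trigger_rollback, tolerance=2.0):
    """Fire a mitigation when the live error rate drifts well past its baseline.

    get_error_rate:   callable returning the current fraction of failed requests
    baseline_rate:    error rate measured before the change shipped
    trigger_rollback: callable invoked with a human-readable reason
    """
    observed = get_error_rate()
    if observed > baseline_rate * tolerance:
        trigger_rollback(
            f"error rate {observed:.2%} exceeds {tolerance}x baseline ({baseline_rate:.2%})"
        )
```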

Sometimes, we can search for empirical answers to our analytical questions. In some cases, dry-run rollouts can empirically test changes in a safe way before they see the light of day. Imagine changing the data source for a customer interaction — it might be prudent to fetch from both sources and ensure they always match. Weeks or months with no differences is a strong empirical — not analytical — result that instills a sense of confidence even a shrewd skeptic can enjoy.
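A minimal sketch of that dual-read pattern, assuming hypothetical `old_source` and `new_source` objects that expose a `get` method: the trusted source always serves the customer, while the candidate is compared silently in the background.

```python
import logging

logger = logging.getLogger("dual_read")

def fetch_with_shadow_compare(key, old_source, new_source):
    """Serve from the trusted old source; shadow-read the new one and log
    any mismatch without ever affecting the customer-facing response."""
    old_value = old_source.get(key)
    try:
        new_value = new_source.get(key)
        if new_value != old_value:
            logger.warning("mismatch for %r: old=%r new=%r", key, old_value, new_value)
    except Exception:
        logger.exception("shadow read failed for %r", key)  # never fail the real read
    return old_value
```

Weeks of clean logs from a comparator like this are precisely the empirical evidence described above.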

Rollback mechanisms are the last resort, but they are not enough in themselves. Once a problem has been detected, engineers may initiate a rollback to revert the system state to the last known safe point. Many a rollback has been corrupted by a lack of backwards compatibility. Dataset migration is the primary example, since state is persisted independently of the actual software change. If you are transforming a dataset from A to B, it’s paramount that you’ve implemented the ability to transform from B back to A; otherwise, how is the change reversible? Even worse, if a problem leaves you in an unpredictable state C, what are the chances you can transform from C back to A without some clever ad-hoc engineering?
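To illustrate what reversibility demands in practice, here is a toy migration. The record shapes and field names are invented for the example; the point is that the inverse transform exists, and is verified against real data, before the forward migration ever ships.

```python
def migrate_record(record):
    """Forward transform A -> B: split a single 'name' field in two."""
    first, _, last = record["name"].partition(" ")
    rest = {k: v for k, v in record.items() if k != "name"}
    return {**rest, "first_name": first, "last_name": last}

def rollback_record(record):
    """Inverse transform B -> A, written before the migration ships."""
    rest = {k: v for k, v in record.items() if k not in ("first_name", "last_name")}
    return {**rest, "name": f'{record["first_name"]} {record["last_name"]}'.strip()}

def verify_reversible(records):
    """Check on real data that the round trip A -> B -> A is lossless."""
    return all(rollback_record(migrate_record(r)) == r for r in records)
```

Running `verify_reversible` over a snapshot of production data is the dry-run idea from the previous section applied to state rather than code.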

Just as we acknowledge our analytical limitations, we must also acknowledge the limitations of the techniques we use to reduce risk. Risk is ever-present. Rather than fearing the risk of system changes as if it were a unique existential threat, we should treat it like any other resource, balanced in the nuanced calculus that informs our decision making.

It is risky to change systems, yes, but it is also risky to not change them.

Exchanging the risk of a production outage for the certainty of stagnation and obsolescence is not a victory. The risk of acute pain has been traded for the slow but guaranteed agony of technological obsolescence, either at the hands of your competitors or of the landscape itself shifting beneath you.

Inaction is not an option, and progress mustn’t be slowed too drastically.

Incrementally reducing risk via process improvement is not unique to this industry. When a problem makes it through the various layers of protective mechanisms, it’s akin to a hazard passing through aligned holes in slices of Swiss cheese. We can add new, differentiated layers to close gaps, or improve existing layers, to incrementally reduce risk.

With this acknowledgement of risk, and of our limited comprehension, which shrinks as the system becomes more complex, an important question arises: does an engineer’s ability to make an impact naturally diminish?

The answer is nuanced; in short, it depends on how you measure impact. While working on an established complex system likely means the logical complexity of your contribution will be smaller than when creating a new system from scratch, the changes you make will reach further.

A 1% improvement on a system that is 100-fold larger than another is akin to a 100% improvement on the smaller one. If a system’s scale grows exponentially while the rate at which changes can be effected diminishes only linearly, then the marginal impact of a contributor still grows: the shrinking rate is overwhelmed by the expanding base.
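A toy calculation, with invented numbers, makes the first claim concrete:

```python
small_system_value = 1      # baseline utility of a small system (arbitrary units)
large_system_value = 100    # a system 100-fold larger

small_gain = 1.00 * small_system_value  # a 100% improvement on the small system
large_gain = 0.01 * large_system_value  # a 1% improvement on the large system

assert small_gain == large_gain == 1.0  # equal absolute impact
```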

Efforts to reduce non-differentiated costs are also a key driver of productivity for large systems. At one extreme of scale, a one-person startup with a single customer has an extremely high ratio of operational load to volume. Furthermore, the surface area of the system, that is, the entire scope of its internal and external business logic, is overwhelmingly large compared to the overall scale.

On the other end of the spectrum, a global system with dozens or hundreds of replicas managed by consistent operational tools has a much more favorable operational load ratio. A system at this scale can invest huge amounts of resources into improvements, and enjoy the fact that this system, now thousands or millions of times larger than our hypothetical one-person startup, can be managed operationally by a small handful of engineers.
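Again with invented numbers, the contrast in operational load per unit of volume is stark:

```python
# Hypothetical figures for the two extremes of scale.
startup_ops_hours_per_week, startup_customers = 20, 1
global_ops_hours_per_week, global_customers = 400, 1_000_000

print(startup_ops_hours_per_week / startup_customers)  # 20.0 hours per customer
print(global_ops_hours_per_week / global_customers)    # 0.0004 hours per customer
```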

In closing, large software systems come with a set of nuanced properties that are unexplored by most because of the relative novelty of this class of creation. Every layer of the technological stack has evolved rapidly over the last several decades, causing disruptions and a seemingly endless sequence of “eternal paradigms” that last about as long as the expiring assumptions they make.

It’s hard to predict exactly what form these systems will take in the years and decades to come. What’s more important than the final form is understanding the invariants. Put another way — don’t try to predict what’s going to change, predict what’s not going to change. Here’s an attempt:

  • Complexity of large systems will not decrease; non-determinism is here to stay.
  • Scale will continue to increase.
  • Availability will remain important.
  • We will keep needing to make changes to established systems.

Perhaps, in some future timeline where Linux is a relic and all computing is done on quantum platforms, most of the existing knowledge related to our current infrastructure will be obsolete. However, you can be sure that this new system will be complex, it will be non-deterministic (perhaps more so, thanks to quantum fickleness), it will have to meet the scaling demands of a future humanity that demands reliability, and it will surely need to be iterated on, just as every other creation before it.

One can imagine that in this future landscape, however dissimilar it is, the same patterns and principles discussed here will apply.

If you enjoyed this article, you will enjoy the other content from VidaVolta’s website and Medium page.

Originally published at https://www.vidavolta.io on August 30, 2023.
