There is a lot of buzz around the practice of Progressive Delivery lately. Rightfully so, as it’s a great addition to continuous delivery. By gradually exposing new versions to a subset of users, you’re further mitigating risks. As usual with new and shiny things, many of us are eager to try it out, learn about the tools and shift to this way of working. But as is also common, there are reasons to do it, reasons to not do it, and reasons you might not succeed at it. Let me save you some trouble by elaborating on a few of them.
Probably the most common failure in progressive delivery occurs when handling data. If the app is stateless, this is relatively easy to deal with. But if any stateful services are involved (and there often are), dealing with data becomes considerably harder: services will have to be both backwards and forwards compatible.
In general, there’s a simple rule: does your app scale horizontally, and can you currently do rolling updates with zero downtime? Then you’re probably ready to try progressive delivery of your app as well. Here are some practical examples of that rule:
Let’s say you’re deploying v2 of a microservice and routing 10% of your users’ traffic to this new version. Hopefully any data is stored externally to the service itself, e.g. in a database or event store, and not in ephemeral storage. All data has a schema, which should be versioned alongside your application. In this scenario, the data structure has changed as well; now you also have schema version 1 and schema version 2. Since the remaining 90% of requests still rely on schema v1, you will have to account for that. For example, by having v2 of the service read data from schema v1 while writing data in schema v2.
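To make the read-old/write-new idea concrete, here’s a minimal sketch. The record shape is entirely hypothetical: assume v1 stored a single name field, v2 splits it, and every record carries a version tag. None of these names come from a real system.

```python
# Hypothetical sketch: a v2 service that can read records written in either
# schema version, but always writes the new (v2) schema.
# Assumes each stored record carries a "schema_version" field.

def read_user(record: dict) -> dict:
    """Normalize a stored record to the in-memory v2 shape."""
    if record.get("schema_version", 1) == 1:
        # v1 stored a single "name" field; v2 splits it in two.
        first, _, last = record["name"].partition(" ")
        return {"schema_version": 2, "first_name": first, "last_name": last}
    return record

def write_user(first_name: str, last_name: str) -> dict:
    # New writes always use schema v2.
    return {"schema_version": 2, "first_name": first_name, "last_name": last_name}
```

The key property is that v2 never *requires* v2-shaped data on the read path, so it can run side by side with v1 against the same store.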
Now let’s say there’s an issue with this new version. You want to roll back and revert that 10% of traffic to v1. There have probably been some data mutations in the meantime, so any data written with schema v2 is lost unless you account for this as well. You may want to migrate data from the new schema back to the old one. Either that, or you decide not to roll back when data is affected: instead, you fix the issue and deploy a new version for the same 10% of users, at the risk of the remaining 90% of your users now being two versions behind.
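Continuing the same hypothetical record shape, the rollback migration could be as simple as a downgrade function run over everything the canary wrote. Note the lossy nature of this step: anything v1 cannot represent is gone, which is exactly why some teams skip rollbacks when data is involved.

```python
# Hypothetical sketch: migrating records written in schema v2 back to v1
# before rolling back, so data written by the canary version is not lost.

def downgrade_record(record: dict) -> dict:
    if record.get("schema_version", 1) == 2:
        # v1 only knows a single "name" field; merge the split fields back.
        # Any v2-only information would be lost here.
        return {
            "schema_version": 1,
            "name": f"{record['first_name']} {record['last_name']}".strip(),
        }
    return record
```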
If this service is stateful and keeps data in local ephemeral storage, such as in memory (and changing the service to store its data externally is not an option), the same caveats apply. Additionally, you should make sure an instance delegates its state to other instances of the service on receiving a shutdown signal, or flushes its state to temporary external storage to be read by a new instance later.
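The flush-on-shutdown variant can be sketched as a signal handler. This is illustrative only: a real service would flush to something like Redis or object storage rather than a local file, and would also need readiness handling on the way back up.

```python
# Hypothetical sketch: flushing in-memory state to external storage when
# the instance receives SIGTERM, so a replacement instance can pick it up.
import json
import signal
import sys

state = {"sessions": {}}  # ephemeral, in-memory state

def flush_state(path="/tmp/service-state.json"):
    # A local file stands in for real external storage (Redis, S3, ...).
    with open(path, "w") as fh:
        json.dump(state, fh)

def handle_sigterm(signum, frame):
    flush_state()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```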
And even if there is no stored data to speak of, and a service only sends and receives requests, there’s still a data schema. For example, a REST API will have to version its endpoints and keep the old ones until they’re completely phased out. Any other service using this API will have to be updated to use the new version simultaneously, or the 10% of traffic will only answer the question “does the new version break the old version?” rather than “does the new version do what it’s supposed to?”.
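Sketched out, serving both contract versions side by side might look like this. The routes and payloads are made up for illustration, and framework details are omitted; handlers are shown as plain functions.

```python
# Hypothetical sketch: both API versions served side by side while v1
# clients are phased out.

def get_user_v1(user_id: str) -> dict:
    # Old contract: a single "name" field.
    return {"name": "Ada Lovelace"}

def get_user_v2(user_id: str) -> dict:
    # New contract: split name fields.
    return {"first_name": "Ada", "last_name": "Lovelace"}

ROUTES = {
    "/v1/users/{id}": get_user_v1,  # kept until all clients have migrated
    "/v2/users/{id}": get_user_v2,
}
```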
You’ve solved backward compatibility and your new version is live for 10% of your end users. Everything seems to be working. But how can you be sure?
A common metric mentioned in all kinds of Progressive Delivery examples is the number of request errors. The idea is that HTTP status codes tell us whether or not something is working correctly. While that is a helpful metric, it’s far from ideal to use as your baseline. For example, it won’t tell you whether the error codes were implemented correctly. A service could be happily sending status 200 while everything is on fire, if error handling wasn’t properly implemented. More importantly, it won’t tell you what your users are experiencing.
It’s more useful to monitor business metrics, such as transactions completed, orders processed, or whichever service you provide to your end users. If for those 10% of users there’s an unusual drop in the number of transactions, chances are there’s a problem.
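As a sketch, a canary health check on a business metric could compare conversion rates between the stable and canary versions. The function, its parameters, and the 5% threshold are all illustrative assumptions, not a prescribed method.

```python
# Hypothetical sketch: judging canary health by a business metric
# (transactions completed per request) instead of HTTP status codes alone.

def canary_looks_healthy(stable_tx: int, stable_requests: int,
                         canary_tx: int, canary_requests: int,
                         max_relative_drop: float = 0.05) -> bool:
    stable_rate = stable_tx / stable_requests
    canary_rate = canary_tx / canary_requests
    # Flag the canary if its transaction rate dropped more than the
    # allowed relative threshold compared to the stable version.
    return canary_rate >= stable_rate * (1 - max_relative_drop)
```

A real analysis would also need a minimum sample size and some statistical care, since 10% of traffic can be noisy.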
It’s tempting to think of Progressive Delivery as a way to skip automated testing. A small part of your end user traffic will now act as your tests, right?
Yes and no.
If you’re monitoring the right things, as described above, those users will give you feedback on whether your changes themselves work and whether they broke anything else, provided they are actually using those features at that specific time (or, in the case of load testing, that there’s currently enough traffic). But what if you’re releasing multiple times a day? You don’t want to be reliant on whether your end users feel like helping you out. So in addition to monitoring, you still want to invest some time in automated tests that run in production.
This will also help in moving towards fully automated Progressive Delivery. When you’ve released the app to 10% of your users, monitoring shows no major errors, and your automated tests succeed, there’s no reason not to scale to 100% automatically.
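The promotion decision itself then reduces to a simple gate combining those signals. Rollout controllers such as Argo Rollouts or Flagger express this as declarative analysis steps; the function and traffic steps below are an illustrative hand-rolled stand-in, not their API.

```python
# Hypothetical sketch: deciding the next canary traffic weight from the
# signals discussed above (metrics health + automated test results).

def next_traffic_weight(current_weight: int,
                        metrics_healthy: bool,
                        tests_passed: bool) -> int:
    if not (metrics_healthy and tests_passed):
        return 0  # roll back: route all traffic to the stable version
    # Scale up gradually; the steps themselves are arbitrary.
    for step in (10, 25, 50, 100):
        if step > current_weight:
            return step
    return 100  # already fully promoted
```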