As devices move towards being always online, and software development moves away from a fixed release schedule towards continuous development, it is often useful to move end-user releases for those devices onto a similar track. Given good development infrastructure and suitable testing, it may be worthwhile combining these with continuous deployment to minimise time to market.
In order to ensure that a release is suitable for the market, it is important that the release passes through a number of phases, as listed below. While we have phrased this around embedded deployment, it is easy to draw parallels between these phases and those in most other software development scenarios.
Testing and development on development devices. Breakage here is acceptable, and even expected. Device users should be capable of performing a full rollback/reinstall themselves.
Testing on internal devices. Devices are used for normal operation by people within the company. Breakage here is acceptable, and can be quickly reverted on site.
Release to beta users. These users are outside the company, but have a very close relationship. Breakage here is unfortunate, but acceptable.
Initial public release for opt-in early adopters. Wide audience, breakage here not particularly acceptable, but acknowledged as a possibility.
Staged general release. Full audience. Breakage unacceptable.
As you get towards the later stages of deployment, it is often worth splitting the installation base into smaller groups. This is most commonly done either by geography, or via a trickle rollout process in which only 10-20% of users receive the update each day.
Releases should automatically move from one phase to the next once a set of criteria has been met. The criteria should cover a reasonable subset of the following:
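One way to implement a trickle rollout is to hash each device into a stable bucket and widen the set of eligible buckets each day. The following is a minimal sketch only: the function names, the use of SHA-256, and the 10% daily step are all assumptions chosen for illustration, not part of any particular update system.

```python
import hashlib


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in the range 0-99 for a given release.

    Including the release in the hash means a different set of devices
    goes first for each release (an illustrative design choice).
    """
    digest = hashlib.sha256(f"{release}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(device_id: str, release: str, days_since_start: int,
                daily_percent: int = 10) -> bool:
    """Return True once the device's bucket is inside the cumulative window.

    Day 0 admits roughly the first `daily_percent` of devices, day 1 the
    first 2 * `daily_percent`, and so on, capped at 100%.
    """
    threshold = min(100, (days_since_start + 1) * daily_percent)
    return rollout_bucket(device_id, release) < threshold
```

Because the bucket depends only on the device identifier and the release, a device that has become eligible stays eligible on later days, and the update server does not need to store per-device rollout state.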
Installation success - the update should have rolled out correctly to a sufficient percentage of the devices in that tier.
Time - allow the release some time to ‘soak’ on a tier, so that enough users have the chance to operate it in all of its different modes.
Error reports - ideally this is an automatic gate based on some backing system such as bug tracking. If there are no bugs tagged with the specific version number, then this criterion is considered to be satisfied.
Operational analytics - if using automated pushes, it is generally necessary to have some form of centralised data collection. Ensuring that all devices with the new release have statistics that are ‘similar’ to those of other releases is important. This should show if a specific feature has stopped functioning, or if it has suffered a significant slowdown.
No manual blocker is in place - this provides a single, centralised place in which to block a release for any reason.
Once all of the phase transition criteria have been met, the release gate can be opened to the next phase. Ideally, moving on to the next phase should be entirely automatic.
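The criteria above can be sketched as a single automated check. This is an illustration under stated assumptions rather than a definitive implementation: the class names, field names and threshold values are hypothetical, and in practice the inputs would come from your own update server, bug tracker and analytics store.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TierStatus:
    """Snapshot of one release on one tier (all fields are hypothetical)."""
    install_success_rate: float  # fraction of devices in the tier updated successfully
    soak_hours: float            # hours since the release reached this tier
    open_bugs: int               # bugs tagged with this version number
    metric_deviation: float      # largest relative difference from baseline analytics
    manual_block: bool = False   # set by a person to hold the release


@dataclass
class GateCriteria:
    """Illustrative thresholds; suitable values depend on the product."""
    min_install_success_rate: float = 0.95
    min_soak_hours: float = 48.0
    max_metric_deviation: float = 0.10


def failed_criteria(status: TierStatus,
                    criteria: Optional[GateCriteria] = None) -> List[str]:
    """Return the names of the criteria that are not yet satisfied."""
    criteria = criteria or GateCriteria()
    failures = []
    if status.install_success_rate < criteria.min_install_success_rate:
        failures.append("installation success")
    if status.soak_hours < criteria.min_soak_hours:
        failures.append("time")
    if status.open_bugs > 0:
        failures.append("error reports")
    if status.metric_deviation > criteria.max_metric_deviation:
        failures.append("operational analytics")
    if status.manual_block:
        failures.append("manual blocker")
    return failures


def may_promote(status: TierStatus,
                criteria: Optional[GateCriteria] = None) -> bool:
    """The gate opens only when every criterion is satisfied."""
    return not failed_criteria(status, criteria)
```

Returning the list of failed criteria, rather than a bare boolean, makes it easy to report why a release is still being held on a tier.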
Once a software system has moved to continuous delivery, it is also important to ensure that the mechanism for rolling back a release is equally fast. No system is perfect, so there is always a chance that a release will make its way through all of the delivery criteria and yet still have a flaw. When this happens there are two options - push a new release which resolves the issue, or roll back to the previous known good release. Which path to take depends on the severity of the issue and the speed at which the deployment can be done. However, it is vital to make sure that both options are available, and that they are both easy. Pushing a fixed release should generally follow the criteria listed above. Rolling back to a previous release should be able to bypass some of these criteria for two reasons:
As the rollback release was previously in production, it has already passed the release criteria.
Rolling back is generally an emergency response to an immediate issue, so the procedure should be as fast as possible.
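The difference between the two paths can be illustrated with a small sketch. The `ReleaseChannel` class and its methods are hypothetical names used only for this example, and the sketch makes one simplifying assumption: any release replaced through a normal promotion is treated as known good.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseChannel:
    """Tracks the current release and the known good releases before it."""
    current: Optional[str] = None
    known_good: List[str] = field(default_factory=list)

    def promote(self, version: str, gate_passed: bool) -> bool:
        """Make `version` current, but only if it passed the release gate."""
        if not gate_passed:
            return False
        if self.current is not None:
            # Assumption: the release being replaced was healthy.
            self.known_good.append(self.current)
        self.current = version
        return True

    def rollback(self) -> Optional[str]:
        """Revert to the most recent known good release.

        No gate check is made here: the target has already passed the
        release criteria, and speed matters in an emergency.
        """
        if not self.known_good:
            return None  # nothing to roll back to; current is unchanged
        self.current = self.known_good.pop()
        return self.current
```

A fixed release goes through `promote` and is therefore subject to the gate, while `rollback` skips it entirely. A real system would probably still keep a few checks on the rollback path, such as installation success.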
For further information on continuous delivery as compared to continuous deployment, there is an excellent blog post from PuppetLabs on the topic.