Gray release provides a fast, low-cost trial-and-error mechanism for changes, and it can be implemented at several levels. One typical mechanism is to provide a complete, independent gray environment in which changes are validated before they are officially released to production. Another is to change the production environment in batches, which confines trial and error to a small part of production by tightly controlling the pace of the change and the scope of its impact.
Gray Environment
The purpose of a gray environment is to isolate changes from production traffic, reduce risk, and form a closed testing loop within the environment. Validation in the gray environment should be completed before the production release. Basic test coverage should include all intranet traffic and 1% of online traffic.
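The coverage rule above (all intranet traffic plus 1% of online traffic) can be sketched as a routing decision. This is only an illustration: the function name route, the is_intranet flag, and the use of a hashed request ID to pick the 1% are assumptions, not something the text prescribes.

```python
import hashlib

GRAY_ONLINE_PERCENT = 1  # share of online traffic sent to gray, per the text


def route(request_id: str, is_intranet: bool) -> str:
    """Return "gray" or "production" for one request (hypothetical helper)."""
    if is_intranet:
        # All intranet traffic goes through the gray environment.
        return "gray"
    # Hash the request ID into 100 stable buckets so that the same ID
    # always lands in the same environment.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "gray" if bucket < GRAY_ONLINE_PERCENT else "production"
```

Hashing instead of random sampling keeps the decision repeatable, which makes a problem seen in the gray environment easier to reproduce.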
Gray Batching
Here are three common batching methods: intra-cluster batching, inter-cluster serialization, and inter-cluster scattering.
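A minimal sketch of how the three methods might produce batch plans. The text does not define the methods precisely, so the reading used here is an assumption: intra-cluster batching splits one cluster's hosts into batches, inter-cluster serialization releases one whole cluster per batch, and inter-cluster scattering gives every batch a slice of each cluster. All function names are hypothetical.

```python
from itertools import zip_longest


def chunk(items, n_batches):
    """Split items into up to n_batches ordered, nearly equal batches."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return [b for b in batches if b]  # drop empty batches


def intra_cluster_batches(hosts, n_batches):
    """Assumed reading: batch the hosts of a single cluster."""
    return chunk(hosts, n_batches)


def inter_cluster_serial(clusters):
    """Assumed reading: one whole cluster per batch, in sequence."""
    return [list(hosts) for hosts in clusters.values()]


def inter_cluster_scatter(clusters, n_batches):
    """Assumed reading: every batch takes a slice from each cluster."""
    per_cluster = [chunk(hosts, n_batches) for hosts in clusters.values()]
    return [
        [h for part in parts if part for h in part]
        for parts in zip_longest(*per_cluster)
    ]
```

For example, with clusters az1 = [a1, a2, a3, a4] and az2 = [b1, b2], scattering into two batches yields [a1, a2, b1] followed by [a3, a4, b2], while serialization yields all of az1 followed by all of az2.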
In the figure, a cluster is any logical group that can be subdivided further, including but not limited to units, regions, data centers, availability zones, VPCs, clusters, groups, and custom logical regions. For gray changes in the online production environment, the following requirements are recommended: batchable, controllable batch interval, observable/verifiable, and pausable/rollbackable.
Batchable: The gray mechanism must support at least one of the batching methods: intra-cluster batching, inter-cluster serialization, or inter-cluster scattering. Once the method is chosen, the release must be split into at least 2 batches. If gray release capability is not available, it is recommended to raise the level of approval required for the change.
Controllable interval: The time interval between batches can be controlled. It is generally recommended that a gray release for a high-risk change last at least one hour. For core systems in the production environment, the total observation time during a gray release should be no less than 30 to 60 minutes. After the first batch, it is recommended to observe for at least 20 minutes; the intervals between subsequent batches can be set as needed.
Observable: After each batch is released, it must be observed and verified to be problem-free before the next batch proceeds. Means of observation and verification include, but are not limited to: recording in the change system at least one indicator that reflects the health of the core business (such as a business monitoring item or a log file name); recording the person who confirmed the result; or using automated observation, such as file verification, to confirm that the release succeeded.
Rollbackable: During the gray release, it must be possible to roll back batch by batch or to roll back fully, and each rollback should have a corresponding change record and be traceable.
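The four requirements can be combined in a small release loop. The sketch below is illustrative only: the class GrayRelease and the deploy, rollback, and healthy callbacks are hypothetical placeholders for a real change system, and the two recommendations from the text (at least 2 batches, at least 20 minutes of observation after the first batch) are treated here as hard limits for simplicity.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

MIN_BATCHES = 2                # "at least 2 batches"
MIN_FIRST_OBSERVE_S = 20 * 60  # "at least 20 minutes" after the first batch


@dataclass
class GrayRelease:
    batches: List[List[str]]
    deploy: Callable[[str], None]    # hypothetical: apply the change to one host
    rollback: Callable[[str], None]  # hypothetical: revert one host
    healthy: Callable[[], bool]      # hypothetical: core business health check
    first_observe_s: float = MIN_FIRST_OBSERVE_S
    later_observe_s: float = 0.0     # later intervals are set as needed
    sleep: Callable[[float], None] = time.sleep  # injectable for testing
    records: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Release batch by batch; roll everything back on a failed check."""
        if len(self.batches) < MIN_BATCHES:
            raise ValueError("a gray release needs at least 2 batches")
        if self.first_observe_s < MIN_FIRST_OBSERVE_S:
            raise ValueError("observe the first batch for at least 20 minutes")
        released = []
        for i, batch in enumerate(self.batches, start=1):
            for host in batch:
                self.deploy(host)
            released.append((i, batch))
            self.records.append(f"batch {i} deployed: {batch}")
            # Controllable interval: wait before verifying this batch.
            self.sleep(self.first_observe_s if i == 1 else self.later_observe_s)
            # Observable: verify before moving on to the next batch.
            if not self.healthy():
                self.records.append(f"batch {i} failed verification")
                self._roll_back(released)
                return False
            self.records.append(f"batch {i} verified")
        return True

    def _roll_back(self, released) -> None:
        # Rollbackable: revert batch by batch, newest first, recording each step.
        for i, batch in reversed(released):
            for host in batch:
                self.rollback(host)
            self.records.append(f"batch {i} rolled back: {batch}")
```

Every deployment, verification, and rollback appends an entry to records, so the history of the release stays traceable. Passing a different sleep function lets the loop be exercised without actually waiting.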