Technology resilience

06 April, 2020

Most large organisations have some form of resilience measures in place. Disruptions could be technology related (power outage, website crash, cyberattack) or non-technology related (riots, floods, viruses).

Either way, technology operations staff are usually called on to support the business during these disruptions. The activities they undertake could be restoring a server, setting up an alternate working site, or simply setting up VPNs.

Here are some business continuity terms technology staff share with businesses.

Some resilience knowledge is specific to technology staff. This is because users often don't notice these measures until a disruption occurs.

Resilient IT architecture

IBM differentiates two terms that are often used interchangeably but mean different things. A fault tolerant environment has no service interruption but comes at a significantly higher cost, while a highly available environment has minimal service interruption.
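To get a feel for what "minimal service interruption" can mean in practice, availability is often expressed as a percentage of uptime per year. The sketch below converts such a percentage into hours of downtime. The function name and the percentages are illustrative assumptions, not figures from IBM.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a system is down at the given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


print(round(downtime_hours_per_year(99.9), 2))   # 8.76
print(round(downtime_hours_per_year(99.99), 2))  # 0.88
print(downtime_hours_per_year(100.0))            # 0.0
```

A fault tolerant environment aims for the last line, which is why it costs so much more.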

A high availability architecture looks like this.

Perfect illustration from Mitchell Anicas's article @Digital Ocean

Starting from the left, a user's laptop is making a request to the system. He has no idea what's going on in the background. He's happy as long as he can browse his social media seamlessly.

This architecture is an active-passive cluster. One node is always up, while the other lies dormant, ready to kick in at any time. Each node is sized to carry the full load on its own. In an active-active cluster, both nodes would be live and share the load, for example 50-50.
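The difference between the two cluster types can be shown with a little arithmetic. This is a minimal sketch, assuming a two-node cluster in which each node can carry the whole peak load alone. The function name and the load figures are made up for illustration.

```python
def node_utilisation(peak_load, node_capacity, mode):
    """Fraction of each node's capacity in use during normal operation."""
    if mode == "active-passive":
        # The primary carries everything; the standby sits idle.
        shares = [peak_load, 0.0]
    elif mode == "active-active":
        # Both nodes are live and split the load evenly.
        shares = [peak_load / 2, peak_load / 2]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [share / node_capacity for share in shares]


print(node_utilisation(1000, 1000, "active-passive"))  # [1.0, 0.0]
print(node_utilisation(1000, 1000, "active-active"))   # [0.5, 0.5]
```

In both cases, if one node dies the survivor ends up carrying the full load, which is why each node needs the capacity for it.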

The floating IP is an address that always points to the live device. If one device dies, it points to the other. But the load balancers already have one IP address each. How will it work with two? ARP! The Address Resolution Protocol maps an IP address to a device's hardware (MAC) address, so when the live device changes, the floating IP is typically re-mapped to the new live device's hardware address, like joining the ends of two straws to connect the flow.
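The floating IP idea can be simulated in a few lines. ARP resolves an IP address to a hardware (MAC) address, so a failover amounts to the new live device announcing that the floating IP now belongs to its MAC address. This is only a toy model: the addresses, node names, and functions below are hypothetical, and no real network traffic is sent.

```python
# Hypothetical addresses, for illustration only.
FLOATING_IP = "203.0.113.10"
NODES = {
    "lb-primary": "aa:aa:aa:aa:aa:01",
    "lb-secondary": "aa:aa:aa:aa:aa:02",
}

# What a neighbouring router believes: IP address -> MAC address.
arp_cache = {}


def announce(ip, node):
    """Simulate an ARP announcement: the node claims the IP for its MAC."""
    arp_cache[ip] = NODES[node]


def deliver(ip):
    """Return the name of the node that would receive traffic sent to ip."""
    mac = arp_cache[ip]
    return next(name for name, node_mac in NODES.items() if node_mac == mac)


announce(FLOATING_IP, "lb-primary")
print(deliver(FLOATING_IP))  # lb-primary

announce(FLOATING_IP, "lb-secondary")  # failover: the secondary takes over
print(deliver(FLOATING_IP))  # lb-secondary
```

The user keeps sending requests to the same floating IP throughout and never sees the switch.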

Servers aren't living things, so how would the secondary one know if the primary died? Between the two load balancers is the heartbeat, a regular signal that shows the other side is still alive. Without it, a technician would need to manually configure and boot up the secondary, which can take a few hours. Very bad for customer experience. With the heartbeat, the switch happens automatically, typically within seconds.
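A heartbeat check boils down to "promote the secondary if the primary has been silent for too long". Here is a minimal sketch of that logic. The class name and the 3-second timeout are assumptions, and timestamps are passed in by hand so the example runs without waiting.

```python
class HeartbeatMonitor:
    """Runs on the secondary and watches for heartbeats from the primary."""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s  # assumed value, tune per environment
        self.last_seen = None       # time of the most recent heartbeat
        self.active = "primary"

    def beat(self, now):
        """Record a heartbeat received from the primary at time `now`."""
        self.last_seen = now

    def check(self, now):
        """Promote the secondary if the primary has been silent too long."""
        silent_too_long = (
            self.last_seen is not None
            and now - self.last_seen > self.timeout_s
        )
        if self.active == "primary" and silent_too_long:
            self.active = "secondary"
        return self.active


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat(now=0.0)
monitor.beat(now=1.0)
print(monitor.check(now=2.0))  # primary (1 second since the last beat)
print(monitor.check(now=5.0))  # secondary (4 seconds of silence)
```

Real tools add safeguards this sketch leaves out, such as making sure both nodes never believe they are active at the same time.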

Now, there are two each of load balancers, app servers, and databases. The secondary load balancer and app server only work when they have to. But the databases are always replicating to ensure that there's a mirror copy at all times. Databases are often kept on Network Attached Storage (NAS) so that either app server can access them at any time.

Other terms

Backups - Copies of data, kept as separate versions outside the main system. Backups can be taken daily, weekly, monthly, or yearly. Data redundancy and data mirroring for high availability are not backups, because corrupted data is copied to the alternate storage in real time.
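The reason a mirror is not a backup is easy to show with a toy example. In this sketch every write goes to the mirror immediately, while a backup is a point-in-time copy kept to one side. The data and function names are hypothetical.

```python
import copy

primary = {}
mirror = {}    # real-time replica used for high availability
backups = []   # point-in-time copies kept outside the live system


def write(key, value):
    """Write to the primary; the mirror receives the same write at once."""
    primary[key] = value
    mirror[key] = value  # replicated immediately, whether good or bad


def take_backup():
    """Store an independent snapshot of the primary's current contents."""
    backups.append(copy.deepcopy(primary))


write("balance", 100)
take_backup()               # e.g. the nightly backup
write("balance", -999999)   # a bug or an attacker corrupts the record

print(mirror["balance"])       # -999999 (the corruption was mirrored)
print(backups[-1]["balance"])  # 100 (the backup still has the good value)
```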

Rollback - An operation performed on a system to return it to a previous state. This minimizes data loss in an outage or provides a quick fix for a buggy deployment. Rollback versioning is usually managed together with Change Management (an ITIL process).
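A rollback depends on knowing what the previous state was. Below is a minimal sketch, assuming deployments are tracked as a simple list of version labels. The class and the version names are made up for illustration, and real tooling records far more than a label.

```python
class Deployment:
    """Tracks which version is live and the versions that came before it."""

    def __init__(self, initial_version):
        self.history = [initial_version]

    @property
    def current(self):
        return self.history[-1]

    def deploy(self, version):
        self.history.append(version)

    def rollback(self):
        """Return to the previous version and report which one was removed."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        return self.history.pop()


app = Deployment("v1.0")
app.deploy("v1.1")        # turns out to be buggy
removed = app.rollback()
print(removed)            # v1.1
print(app.current)        # v1.0
```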

Some data centers run in different modes of availability, depending on how critical the application is.

Hot sites usually have a fully running redundant system with application and database available. Switching can be immediate.

Warm sites may have the application ready but the data not yet replicated across. Mostly for medium criticality applications.

Cold sites only have basic infrastructure in place with no configuration done yet. For required but non-essential applications.
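The three site types differ mainly in how much work is left to do before they can take over. The sketch below turns the descriptions above into a small lookup. The criticality labels, the step wording, and the pairing of hot sites with the most critical applications are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteProfile:
    name: str
    app_ready: bool        # application installed and configured
    data_replicated: bool  # data already copied across


# Assumed mapping from application criticality to site type.
SITES = {
    "high": SiteProfile("hot", app_ready=True, data_replicated=True),
    "medium": SiteProfile("warm", app_ready=True, data_replicated=False),
    "low": SiteProfile("cold", app_ready=False, data_replicated=False),
}


def recovery_steps(criticality):
    """List what still has to happen before the site can serve users."""
    site = SITES[criticality]
    steps = []
    if not site.app_ready:
        steps.append("configure infrastructure and install the application")
    if not site.data_replicated:
        steps.append("restore or replicate the data")
    steps.append("switch traffic to the site")
    return steps


for level in ("high", "medium", "low"):
    print(SITES[level].name, len(recovery_steps(level)))
# hot 1
# warm 2
# cold 3
```

The fewer steps left, the faster the switch, and the more the site costs to keep running.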
