Data powers business. Therefore, it is imperative to prepare your data-processing architecture for any disaster. In the case of data integration, this means keeping your data sources, data integration infrastructure, ETL mappings, and logs immune to any disaster.
Let’s look at some of the ways you can make your data and data integration infrastructure on Amazon Web Services (AWS) disaster-ready in a cost-effective manner while meeting your recovery time objective*.
A typical big data customer uses the following services on AWS:
Data Integration Platform
Informatica Big Data Management (BDM)**
Amazon EMR: Managed Hadoop service (another popular option: Cloudera on EC2)
S3: Object-based storage for storing files
RDS: Managed relational databases
DynamoDB: NoSQL database service
Redshift: Managed petabyte-scale data warehouse
There are two steps in any disaster recovery plan – Prepare and Recover. Let’s look at each for a holistic data integration architecture built on the above services.
In the Prepare step, create a backup strategy for each component of your architecture, along with the frequency at which backups should be taken. The services you choose should minimize redundancy, and in turn costs, while still allowing you to recover your architecture within the acceptable time.
Informatica BDM needs an EC2 instance to host the Informatica server, along with two database instances to host the Informatica Domain and the Model Repository Service (MRS).
To prepare for a disaster, install and configure the BDM server on an Amazon Elastic Block Store (EBS) volume. Once the setup is complete, take a snapshot of the EBS volume. The snapshot is stored in S3, a highly durable storage service from Amazon. Alternatively, you can schedule snapshots using CloudWatch Events (Scheduled Snapshots).
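As a sketch, a one-off snapshot can be taken with the AWS CLI. This assumes the CLI is configured with credentials, and the volume ID below is a placeholder for the EBS volume that holds your BDM installation:

```shell
# Snapshot the EBS volume hosting the Informatica BDM server.
# vol-0abc1234def567890 is a placeholder -- substitute your own volume ID.
aws ec2 create-snapshot \
  --volume-id vol-0abc1234def567890 \
  --description "BDM server backup $(date +%F)"
```

The resulting snapshot is stored durably in S3 and can later be used to recreate the volume in any Availability Zone.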
Use Amazon’s Relational Database Service (RDS) for the two database instances that BDM needs, and deploy both with Multi-AZ enabled. Amazon then keeps a synchronous standby replica of each database in another Availability Zone. Also take snapshots of the database instances so you can migrate them from one region to another quickly.
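A sketch of both steps with the AWS CLI; the instance and snapshot identifiers are placeholders for your domain/MRS databases:

```shell
# Enable Multi-AZ on an existing RDS instance (placeholder identifier).
aws rds modify-db-instance \
  --db-instance-identifier infa-domain-db \
  --multi-az \
  --apply-immediately

# Take a manual snapshot that can be copied to another region if needed.
aws rds create-db-snapshot \
  --db-instance-identifier infa-domain-db \
  --db-snapshot-identifier infa-domain-db-snap-$(date +%F)
```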
Use the Hadoop cluster only for large-scale distributed processing, not for persisting data on HDFS. Once processing is done, move the data to S3, RDS, Redshift, or any other service through Informatica workflows or custom scripts.
Also, use an external database (RDS) for the Hive metastore (External Hive Metastore). Then take snapshots and/or enable Multi-AZ deployment to prepare your Hive metadata for any disaster.
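The external metastore is wired in when the EMR cluster is launched, via a `hive-site` configuration classification. A sketch of writing that classification to a file (the RDS endpoint, database name, and credentials are all placeholders):

```shell
# Write the hive-site classification that EMR will consume at cluster
# creation; every value below is a placeholder for your own metastore RDS.
cat > hive-config.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore.example.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "hive_user",
      "javax.jdo.option.ConnectionPassword": "hive_password"
    }
  }
]
EOF
```

Pass this file to `aws emr create-cluster --configurations file://hive-config.json` so Hive metadata lives outside the cluster and survives its termination.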
Amazon itself takes care of preparing the data sources for disaster through automated backups and snapshots. Let’s look at the popular services:
S3 – Amazon takes care of durably storing your data on multiple devices and across multiple facilities in a region.
If you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. – Amazon Web Services
RDS – You can back up RDS in two ways: automated backups and DB snapshots. With automated backups, RDS automatically performs daily snapshots and captures transaction logs as the database is updated. Use DB snapshots to take point-in-time backups at a time and frequency of your choosing.
DynamoDB – Either maintain a backup in a secondary DynamoDB table in another region or take point-in-time backups at regular intervals.
Redshift – Take point-in-time snapshots of the Redshift data warehouse, just as you did for RDS and DynamoDB.
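The manual backups above can be sketched with the AWS CLI; every identifier below is a placeholder, and the commands assume the corresponding resources exist:

```shell
# RDS: manual point-in-time snapshot of a source database.
aws rds create-db-snapshot \
  --db-instance-identifier source-db \
  --db-snapshot-identifier source-db-snap-$(date +%F)

# DynamoDB: on-demand backup of a source table.
aws dynamodb create-backup \
  --table-name source-table \
  --backup-name source-table-backup-$(date +%F)

# Redshift: manual snapshot of the data warehouse cluster.
aws redshift create-cluster-snapshot \
  --cluster-identifier dw-cluster \
  --snapshot-identifier dw-cluster-snap-$(date +%F)
```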
If you ‘prepared well’, you can easily recover every component of your architecture from any disaster or system failure. Let’s look at each of the components individually:
Launch a new EC2 instance from a pre-configured AMI (Amazon Machine Image) or your own AMI.
Create a new EBS volume from the latest available snapshot as created in the preparation phase above. Attach this volume to the newly launched EC2 instance or the desired pre-existing instance. Additionally, restore your database instances from the latest RDS snapshots, as taken in preparation phase.
Update the gateway node to restore the Informatica domain to its last saved state.
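The steps above can be sketched with the AWS CLI; all AMI, snapshot, volume, and instance IDs are placeholders:

```shell
# Launch a replacement instance from a pre-built AMI (placeholder IDs).
aws ec2 run-instances \
  --image-id ami-0abc1234def567890 \
  --instance-type m5.xlarge \
  --count 1

# Recreate the BDM volume from the latest snapshot, then attach it.
aws ec2 create-volume \
  --snapshot-id snap-0abc1234def567890 \
  --availability-zone us-east-1a
aws ec2 attach-volume \
  --volume-id vol-0abc1234def567890 \
  --instance-id i-0abc1234def567890 \
  --device /dev/sdf

# Restore the domain database from its latest RDS snapshot.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier infa-domain-db-restored \
  --db-snapshot-identifier infa-domain-db-snap-2024-01-01
```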
Restore the Hive metastore database from the last available snapshot. Spin up a new EMR cluster and use the restored database as the external Hive metastore (External Hive Metastore).
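A sketch of the cluster launch, assuming a `hive-config.json` file containing a `hive-site` classification that points at the restored metastore endpoint; the cluster name, release label, and sizes are illustrative placeholders:

```shell
# Spin up a new EMR cluster that uses the restored RDS database as its
# external Hive metastore; hive-config.json is assumed to hold the
# hive-site classification with the restored endpoint and credentials.
aws emr create-cluster \
  --name "bdm-recovery-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://hive-config.json
```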
S3 – Your objects in S3 should survive the disaster: S3 is designed to sustain at least two simultaneous facility failures, with 99.999999999% durability.
RDS – Restore the database using the latest available snapshot.
DynamoDB – Restore the NoSQL database using the latest available snapshot.
Redshift – Restore the data warehouse using the latest available snapshot.
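The three restores can be sketched with the AWS CLI; all identifiers and the backup ARN are placeholders for the artifacts created during preparation:

```shell
# RDS: restore a new instance from the latest snapshot (placeholder IDs).
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier source-db-restored \
  --db-snapshot-identifier source-db-snap-2024-01-01

# DynamoDB: restore a new table from an on-demand backup ARN.
aws dynamodb restore-table-from-backup \
  --target-table-name source-table-restored \
  --backup-arn arn:aws:dynamodb:us-east-1:123456789012:table/source-table/backup/01234567890123-abcdefgh

# Redshift: restore a new cluster from the latest snapshot.
aws redshift restore-from-cluster-snapshot \
  --cluster-identifier dw-cluster-restored \
  --snapshot-identifier dw-cluster-snap-2024-01-01
```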
With this, the whole architecture is recovered to the last known state before the disaster.
Smart businesses ensure business continuity. The key component of this is a ‘Smart Disaster Recovery Plan’, with particular emphasis on ‘Prepare’. Smart preparation provides the following advantages:
- Business continuity by recovering from any region or system failure.
- Ability to replicate the whole setup within minutes in a new geographic region.
- Ability to debug issues by quickly replicating the production environment for QA/Dev.
* Recovery Time Objective (RTO): the acceptable time to recover from a disaster and bring the system back to an acceptable state – typically 8–12 hours for data integration jobs.
** Disclaimer: I am a Big Data and Cloud Specialist (Professional Services) at Informatica, on the Big Data team. The views and opinions expressed here are my own (based on my experience) and do not necessarily reflect the official policy or position of Informatica LLC.