Designing Datalakes in AWS
This article discusses what a datalake is and how it can be implemented in AWS.
Journey of Datalake
At a high level, you can divide an environment into two regions:
- Application (Presentation Layer)
- Data (Backend Layer)
Within the data region, the way data is stored has changed over time. The journey went as follows:
File System → Hierarchical Database → RDBMS → Datawarehouse
The datawarehouse then became hugely complicated and difficult to maintain, so the journey continued:
DW → Data Mart → Datalake
Need for Datalake
Forcing vast amounts of data (structured and unstructured) into a structured DW/DM resulted in data loss. The data layer should instead be data agnostic: it should never be restricted to storing only certain data formats.
To support large-scale, real-time analytics and to keep pace with the rate at which new data is ingested into the DW/DM from newer platforms such as IoT devices and mobile apps, the data layer also needs to be schema agnostic.
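To make the difference between a fixed schema and a schema-agnostic store concrete, here is a minimal Python sketch. It is only an illustration: the column list, the function names, and the sample IoT event are all made up, and nothing here is tied to a particular AWS service. It shows how loading into a fixed schema can silently drop a field, while storing the raw record keeps it available for later reads.

```python
import json

# Hypothetical fixed warehouse schema: only these columns are kept.
WAREHOUSE_COLUMNS = ("device_id", "temperature")


def load_into_warehouse(record: dict) -> dict:
    """Schema-on-write: fields outside the fixed schema are dropped."""
    return {col: record.get(col) for col in WAREHOUSE_COLUMNS}


def load_into_lake(record: dict) -> str:
    """Schema agnostic: store the raw record unchanged as a JSON string."""
    return json.dumps(record, sort_keys=True)


def read_from_lake(raw: str, fields: tuple) -> dict:
    """Schema-on-read: the consumer picks the fields at query time."""
    record = json.loads(raw)
    return {field: record.get(field) for field in fields}


# A made-up IoT event with a field the warehouse schema did not anticipate.
event = {"device_id": "sensor-1", "temperature": 21.5, "firmware": "2.0.1"}

warehouse_row = load_into_warehouse(event)  # "firmware" is lost here
lake_object = load_into_lake(event)         # full record is preserved
```

In this sketch the `firmware` field never reaches the warehouse row, but a later call such as `read_from_lake(lake_object, ("device_id", "firmware"))` can still retrieve it from the raw record.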
The architecture evolved in the order DW → DM → DL,
whereas the ideal data flow runs in the reverse order:
Data sources (DB/IoT/mobile apps) → DL → DM → DW
In one line, the need for a datalake can be defined as follows: