As you may be aware, the Hadoop ecosystem consists of many open source tools. A lot of research is going on in this area, and almost every day you will see a new version of an existing framework, or an entirely new framework, gaining popularity at the expense of the existing ones. Hence, if you are a Hadoop developer, you need to keep up constantly with the technological advancements happening around you.

As a first step towards understanding these frameworks, I sketched a diagram that summarizes some of the key open source frameworks and how they relate to each other in use. I will try to evolve this diagram as I learn more, and I will share the updates with you all as well.


Steps

1. Feeding RDBMS data to HDFS via Sqoop

2. Cleansing imported data via Pig

3. Loading HDFS data into Hive using Hive scripts. These can be run manually or scheduled through the Oozie workflow scheduler

4. Hive data warehouse schemas are stored separately, in an RDBMS schema of their own (the Hive metastore)

5. In Hadoop 1.x, Spark and Shark need to be installed separately to run real-time queries via Hive. In Hadoop 2.x, Spark and Shark can run on YARN alongside other workloads, although they remain separate projects rather than components bundled with YARN

6. Batch queries can be executed directly via Hive
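Steps 1, 2, 3 and 6 above can be sketched as a sequence of command lines. The following is only a minimal illustration in Python, not a definitive setup: the JDBC URL, user name, table name, HDFS paths, script names and the query are all hypothetical placeholders, and the code only builds and prints the commands rather than running them against a cluster. Authentication options for Sqoop are left out.

```python
import shlex

# NOTE: every host, path, table and script name in this file is a
# hypothetical placeholder chosen for illustration only.


def sqoop_import_cmd(jdbc_url, username, table, target_dir):
    """Step 1: copy one RDBMS table into an HDFS directory via Sqoop."""
    return ["sqoop", "import",
            "--connect", jdbc_url,
            "--username", username,
            "--table", table,
            "--target-dir", target_dir]


def pig_cleanse_cmd(script, input_dir, output_dir):
    """Step 2: run a Pig script, passing the paths as -param substitutions."""
    return ["pig",
            "-param", f"INPUT={input_dir}",
            "-param", f"OUTPUT={output_dir}",
            "-f", script]


def hive_script_cmd(script):
    """Step 3: run a Hive script file that loads the cleansed data."""
    return ["hive", "-f", script]


def hive_query_cmd(query):
    """Step 6: run a single batch query directly through Hive."""
    return ["hive", "-e", query]


def build_pipeline():
    """Return the commands in the order the steps above describe."""
    return [
        sqoop_import_cmd("jdbc:mysql://dbhost/sales", "etl_user",
                         "orders", "/data/raw/orders"),
        pig_cleanse_cmd("cleanse_orders.pig",
                        "/data/raw/orders", "/data/clean/orders"),
        hive_script_cmd("load_orders.hql"),
        hive_query_cmd("SELECT COUNT(*) FROM orders"),
    ]


if __name__ == "__main__":
    # Print each command as a shell-safe string instead of executing it.
    for cmd in build_pipeline():
        print(shlex.join(cmd))
```

In practice the same sequence would more likely be wired into an Oozie workflow, as mentioned in step 3, than driven from a script like this one.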
