We will learn how to load data from JSON, CSV, TSV, pipe-delimited, or any other kind of delimited file into a Spark DataFrame.
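In Spark itself this is done with `spark.read.option("sep", "|").option("header", "true").csv(path)` (`sep` and `header` are real DataFrameReader options). As a minimal standalone sketch of what that delimiter option controls, here is the same parsing idea in plain Python; the sample data and column names are made up:

```python
import csv
import io

def read_delimited(text, delimiter=",", header=True):
    """Parse delimited text into a list of dicts, mimicking at a tiny
    scale what spark.read.option("sep", delimiter).csv(path) does:
    the same code path handles CSV, TSV, or pipe-delimited data,
    only the delimiter changes."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    if header:
        cols = rows[0]
        return [dict(zip(cols, row)) for row in rows[1:]]
    return rows

# Hypothetical pipe-delimited input
pipe_data = "id|name\n1|alice\n2|bob"
print(read_delimited(pipe_data, delimiter="|"))
```

Swapping `delimiter="|"` for `"\t"` or `","` covers TSV and CSV with no other change, which is exactly the convenience the Spark reader gives you.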
HDFS is the file system used by Hadoop, modeled on the Google File System (GFS). It provides reliable, highly available storage for data processing. Let us take a look at HDFS and its architecture.
We need a Hadoop environment for practice, and setting one up on Linux is no fun. A better alternative is the Cloudera virtual machine. Let me show you how to set up and use the Cloudera QuickStart VM to get hands-on practice with Hadoop.
There are multiple use cases where we need to transpose/pivot a table, and Hive does not provide an easy built-in function to do so. Let me show you a workaround for pivoting a table in Hive.
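The usual Hive workaround is to GROUP BY the row key and emit one `MAX(CASE WHEN key = '<k>' THEN value END)` expression per output column. As a sketch of that long-to-wide transformation, here it is in plain Python; the row keys and attribute names are made up for illustration:

```python
from collections import defaultdict

def pivot(rows):
    """Pivot long-format (id, key, value) rows into wide format,
    mirroring the Hive trick of grouping by id and picking the
    value for each distinct key as its own column."""
    wide = defaultdict(dict)
    for id_, key, value in rows:
        wide[id_][key] = value
    return dict(wide)

# Hypothetical long-format table: one (id, attribute, value) row each
long_rows = [
    (1, "city", "Pune"), (1, "state", "MH"),
    (2, "city", "Delhi"), (2, "state", "DL"),
]
print(pivot(long_rows))
```

In Hive the set of output columns must be known up front (one CASE expression per column), whereas this sketch discovers them from the data; that is the main limitation the blog's workaround has to live with.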
There are many advanced aggregate functions in Hive. Let's take a look at collect_set and collect_list and how we can use them effectively.
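The key difference between the two Hive aggregates is that `collect_list` keeps duplicate values per group while `collect_set` removes them (and, being a set, makes no ordering guarantee). A small Python sketch of that behavior, with made-up user/item data:

```python
from collections import defaultdict

def collect(rows, dedupe=False):
    """Group values by key. With dedupe=False this mimics Hive's
    collect_list (duplicates kept); with dedupe=True it mimics
    collect_set (duplicates dropped)."""
    groups = defaultdict(list)
    for key, value in rows:
        if dedupe and value in groups[key]:
            continue
        groups[key].append(value)
    return dict(groups)

# Hypothetical orders: (user, item) pairs with a repeat purchase
orders = [("u1", "book"), ("u1", "pen"), ("u1", "book"), ("u2", "pen")]
print(collect(orders))               # collect_list-like result
print(collect(orders, dedupe=True))  # collect_set-like result
```

In Hive the equivalent queries would be `SELECT user, collect_list(item) FROM orders GROUP BY user` versus the same with `collect_set`.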
In this blog, we will take a look at another set of advanced aggregation functions in Hive.
In the second part of Sqoop import, we will learn additional parameters for using import more effectively.
In this blog, we will learn how to use the Sqoop import command and its different parameters to move data to HDFS.
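To give a feel for the parameters involved, here is a sketch that assembles a typical `sqoop import` invocation as an argument list. The flags shown (`--connect`, `--username`, `--table`, `--target-dir`, `-m` for parallel mappers) are standard Sqoop import options; the JDBC URL, database, and table names are made up:

```python
def build_sqoop_import(connect, table, target_dir,
                       username=None, num_mappers=None):
    """Assemble a sqoop import command line from its core
    parameters: the JDBC connection string, source table,
    HDFS target directory, and degree of parallelism."""
    cmd = ["sqoop", "import", "--connect", connect,
           "--table", table, "--target-dir", target_dir]
    if username:
        cmd += ["--username", username]
    if num_mappers:
        cmd += ["-m", str(num_mappers)]  # number of parallel map tasks
    return cmd

# Hypothetical import of a MySQL table into HDFS
print(" ".join(build_sqoop_import(
    "jdbc:mysql://dbhost/retail", "orders",
    "/user/hadoop/orders", username="etl", num_mappers=4)))
```

On a real cluster you would run the joined command in a shell (with a password option added); building it programmatically like this just makes each parameter's role explicit.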
One reason the Hadoop ecosystem became popular is its ability to process different forms of data. But not all data is present in HDFS, i.e., the Hadoop Distributed File System. We have been using relational databases to store and process structured data for a long time, which is why a lot of data still resides in RDBMS…