What is HDFS - Overview of Hadoop's distributed file system

Updated On February 12, 2021 | By Mahesh Mogal

Hadoop comes with its own distributed file system, HDFS. As use of the internet grows, we are producing data at astonishing rates, and with that growth we face problems storing as well as processing such huge datasets. We can scale up a single machine only to a finite extent (vertical scaling); beyond a certain point, relying on a single machine becomes costly and unreliable. That is when it becomes necessary to partition data across many machines. File systems that store data across many machines are known as distributed file systems.

HDFS stands for Hadoop Distributed File System. It is designed to store and process huge datasets in a reliable, fault-tolerant, and cost-effective manner, and it is what helps Hadoop achieve these qualities. In this article, we are going to take a high-level overview of HDFS and what makes it better than other distributed file systems.

The Design Goals of HDFS

The Hadoop distributed file system has significant advantages over other distributed file systems. The following design goals make it suitable for storing and processing massive datasets.

Commodity Hardware

The Hadoop distributed file system is designed to run on commonly available, commodity hardware. This makes HDFS much more cost-efficient and easy to scale out.

Write once read many times

The Hadoop distributed file system is designed for datasets that are written once and read many times. Once a file is written to HDFS, it is not expected to change. This assumption enables high-throughput data access and fits the MapReduce processing model well.
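To make the pattern concrete, here is a minimal sketch of write-once access using Hadoop's Java FileSystem API. The file path is a made-up placeholder and error handling is kept minimal for brevity:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/events/2021-02-12.log"); // hypothetical path
        // create() writes a brand-new file; HDFS expects the file to be
        // written once and then read many times afterwards.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("event-1\nevent-2\n");
        }
        // There is no API to overwrite bytes in the middle of this file later;
        // at most we can append to its end (see the limitations section below).
    }
}
```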

Recovering Hardware Failures

Hardware failure is the norm rather than the exception. Since HDFS runs on commodity hardware, the chances of individual components failing are high. HDFS is therefore designed to recover from such failures quickly and maintain high availability. It achieves this through data replication, which we will learn about in the next tutorials.
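As a rough illustration of how replication is controlled, here is a short sketch using the standard dfs.replication property and the FileSystem.setReplication call. The path and replication factors here are illustrative, not from this article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for new files (3 is HDFS's default);
        // each block is stored on this many different Datanodes.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be raised or lowered per file after writing,
        // e.g. to keep extra copies of an important file.
        fs.setReplication(new Path("/data/important.log"), (short) 5);
    }
}
```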

Large datasets

HDFS is meant for large datasets, in the range of gigabytes, terabytes, or even petabytes. It works best with a small number of large files.

Streaming Data Access

When we perform analysis on HDFS data, it typically involves a large proportion of the dataset, if not all of it. So for the Hadoop distributed file system, the time to read the whole dataset matters more than the latency of reading the first record. That is why HDFS focuses on high-throughput data access rather than low latency.
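The sketch below shows this access pattern: opening a file and streaming sequentially through all of it, which is the style of read HDFS is optimized for. The path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Stream through the whole dataset from start to finish.
        // Throughput over the full scan matters here, not the time
        // to fetch the first record.
        try (FSDataInputStream in = fs.open(new Path("/data/events.log"))) {
            IOUtils.copyBytes(in, System.out, conf);
        }
    }
}
```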

Limitations of HDFS

The Hadoop distributed file system works well as distributed storage for many large datasets. But we should know that HDFS has some limitations which make it a bad fit for certain applications.

Low latency data access

Applications that require low-latency data access, in the range of milliseconds, will not work well with HDFS. It is designed to provide high throughput at the expense of latency.

Lots of small files

The Namenode holds metadata about every file in the HDFS cluster, including where its blocks are located, and it keeps all of this metadata in memory. If there are too many files, the Namenode will not have enough memory to hold the metadata for each of them. We will learn about HDFS architecture in the next tutorial.
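As a back-of-the-envelope illustration (the figures are assumptions, not from this article), using the commonly cited rule of thumb of roughly 150 bytes of Namenode heap per metadata object:

```java
// Rough estimate of Namenode heap consumed by file metadata, assuming
// ~150 bytes per object (file entry or block) as a rule of thumb.
// The file count is made up for illustration.
public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 100_000_000L;   // 100 million small files
        long objectsPerFile = 2;     // 1 file entry + 1 block (a small file fits in one block)
        long bytesPerObject = 150;   // rough rule of thumb

        long heapBytes = files * objectsPerFile * bytesPerObject;
        System.out.printf("~%.1f GB of Namenode heap%n", heapBytes / 1e9);
        // Prints roughly 30.0 GB -- all of it in a single machine's memory,
        // which is why lots of small files is a problem for HDFS.
    }
}
```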

Arbitrary data modification

The Hadoop distributed file system does not support updating data once it is written. We can append data to the end of a file, but modifying arbitrary bytes within it is not possible in HDFS. Hence, applications that need to change data in place will not work well with HDFS.
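A small sketch of what is and is not possible on the write path, using the standard FileSystem.append call. The path is hypothetical, and note that some HDFS deployments disable append:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnlyExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log"); // hypothetical existing file

        // Appending to the end of an existing file is supported...
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("event-3\n");
        }
        // ...but there is no seek-and-overwrite on the write path: the output
        // stream cannot rewrite earlier bytes, so an in-place update means
        // rewriting the whole file.
    }
}
```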

Conclusion

We have taken an overview of HDFS and what makes it more efficient than other distributed file systems. We have also seen some of its limitations. We will start going into the details of HDFS in the next tutorials. You can read about HDFS architecture in this post.
