Hadoop comes with its own distributed filesystem, HDFS. As internet use grows, we are producing data at astonishing rates, and with that growth we face problems of both storing and processing such huge datasets. A single machine can be scaled up only to a finite extent (vertical scaling), and beyond some point relying on one device becomes costly and unreliable. This is when it becomes necessary to partition data across many machines. Such file storage systems are known as distributed filesystems.

HDFS stands for Hadoop Distributed File System. It is designed to store and process huge datasets in a reliable, fault-tolerant, and cost-effective manner, and it is what lets Hadoop achieve these properties. In this article, we are going to take a high-level overview of HDFS and what makes it better suited to this job than other distributed filesystems.

The Design Goals of HDFS

The Hadoop distributed filesystem has significant advantages over other distributed filesystems. The following are design goals that make it suitable for storing and processing massive datasets.

Commodity Hardware

Hadoop distributed filesystem is designed to run on commonly available hardware. This makes HDFS much more cost-efficient and easy to scale out.

Write once read many times

The Hadoop distributed filesystem is designed for datasets that are written once and read many times. After a file is written to HDFS, it should not be changed. This assumption enables high-throughput data access and also fits the MapReduce model well.

Recovering Hardware Failures

Hardware failure is the norm rather than the exception. Because HDFS runs on commodity hardware, the chances of hardware failure are high. But HDFS is designed so that it can recover from such failures quickly and maintain high availability. This is achieved by data replication: each block of data is stored on multiple machines. We are going to learn about it in the next tutorials.
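The idea behind replication can be sketched with a toy model. This is a simplified illustration in Python, not HDFS's actual rack-aware placement policy: each block gets copies on several distinct datanodes, so losing one machine never loses the only copy.

```python
import random

# Simplified model of HDFS-style replication (NOT the real rack-aware
# placement policy): each block is copied to 3 distinct datanodes.
REPLICATION = 3
datanodes = [f"dn{i}" for i in range(5)]   # hypothetical 5-node cluster
blocks = {f"blk_{i}": random.sample(datanodes, REPLICATION)
          for i in range(10)}

# Simulate the loss of one datanode.
failed = "dn0"
surviving = {blk: [dn for dn in nodes if dn != failed]
             for blk, nodes in blocks.items()}

# With replication 3 and a single failure, every block still has at
# least 2 live replicas, so no data is lost and the missing copies can
# be re-replicated onto healthy nodes.
assert all(len(nodes) >= REPLICATION - 1 for nodes in surviving.values())
```

Replicating every block three times costs 3x the raw storage, which is exactly why cheap commodity disks make the design economical.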

Large datasets

HDFS is used for large datasets, in the range of gigabytes, terabytes, or even petabytes. It works best with a small number of large files rather than many small ones.
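To get a feel for the scale, a quick back-of-the-envelope calculation shows how a large file is split into blocks (using 128 MB, a common default HDFS block size):

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB, a common default HDFS block size

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies (last block may be partial)."""
    return -(-file_size_bytes // BLOCK_SIZE)   # ceiling division

one_tb = 1024**4
print(num_blocks(one_tb))    # prints 8192: a single 1 TB file -> 8192 blocks
```

A single 1 TB file is only 8,192 blocks of metadata to track, which is why a small number of large files suits HDFS so well.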

Streaming Data Access

When we are performing analysis on HDFS data, it usually involves a large proportion of the dataset, if not all of it. So in the case of the Hadoop distributed filesystem, the time to read the whole dataset matters more than the latency of reading the first record. That is why HDFS focuses on high-throughput data access rather than low latency.
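The trade-off is easy to see with some rough arithmetic. The throughput and seek figures below are illustrative assumptions, not HDFS guarantees:

```python
# Illustrative numbers (assumptions for the sake of the estimate):
DATASET = 1024**4            # 1 TB dataset to scan
THROUGHPUT = 100 * 1024**2   # 100 MB/s sustained sequential read
SEEK_LATENCY = 0.010         # 10 ms to reach the first record

scan_time = DATASET / THROUGHPUT   # ~10486 s for a full sequential scan
print(f"full scan: {scan_time:.0f} s, first record: {SEEK_LATENCY*1000:.0f} ms")
# The 10 ms seek is noise next to a ~3-hour scan, so optimizing
# throughput (not first-byte latency) is the right call for analytics.
```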

Limitations of HDFS

The Hadoop distributed filesystem works well as the distributed filesystem for many large datasets. But we should know that HDFS has some limitations which make it a bad fit for certain applications.

Low latency data access

Applications that require low-latency data access, in the range of milliseconds, will not work well with HDFS. It is designed to provide high throughput at the expense of latency.

Lots of small files

The Namenode holds metadata about every file and block in the HDFS cluster, and it keeps this metadata in memory. If there are too many files, the Namenode will not have enough memory to store the metadata for each of them. We will learn about HDFS architecture in the next tutorial.
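A rough estimate makes the small-files problem concrete. A figure often quoted for HDFS is on the order of 150 bytes of Namenode heap per namespace object (file or block); treat that as an approximation, not an exact number:

```python
# Rule-of-thumb estimate (an approximation often quoted for HDFS,
# not an exact figure): ~150 bytes of Namenode heap per namespace
# object, i.e. per file and per block.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024**2   # 128 MB block size

def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimated Namenode memory for num_files files of file_size bytes."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same 1 GB of data, stored two ways:
big = namenode_bytes(1, 1024**3)        # one 1 GB file: 1 file + 8 blocks
small = namenode_bytes(1024, 1024**2)   # 1024 x 1 MB files: 1 block each
print(big, small)   # the small-file layout needs far more Namenode memory
```

The data volume is identical, but the many-small-files layout consumes a couple of hundred times more Namenode memory, which is why millions of tiny files can exhaust the Namenode long before the disks fill up.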

Arbitrary data modification

The Hadoop distributed filesystem does not support updating data once it is written. We can append data to the end of files, but modifying data at an arbitrary offset is not possible in HDFS. Hence applications which need existing data to be changed will not work well with HDFS.

Conclusion

We have taken an overview of HDFS and what makes it more efficient than other distributed filesystems. We have also seen some of its limitations. We will start going into the details of HDFS in the next tutorials. You can read about HDFS architecture in this post.

What is HDFS
Mahesh Mogal

Mahesh Mogal

I am passionate about Cloud, Data Analytics, Machine Learning, and Artificial Intelligence. I like to learn and try out new things. I have started blogging about my experience while learning these exciting technologies.

Table of Contents
    Add a header to begin generating the table of contents

    Stay updated with latest blogs

    Posts you may be interested in

    Manage S3 Bucket Polices
    S3

    Set, Get and Delete AWS S3 bucket policies

    In this blog, we are going to learn how to get, put and delete S3 bucket policies suing S3 Console as well as programmatically using AWS CLI & Python

    Manage S3 Bucket Polices
    Read More →
    iam policy vs s3 policy vs s3 acls
    S3

    IAM Policies VS S3 Policies VS S3 Bucket ACLs – What should be used?

    You can manage S3 permission using IAM policy or S3 Policy or S3 ACLs. We will understand the difference between them and use cases for each way.

    iam policy vs s3 policy vs s3 acls
    Read More →
    Create S3 bucket
    S3

    Create S3 bucket using AWS CLI and Python Boto3

    In this blog, we are going to learn how to create an S3 bucket using AWS CLI, Python Boto3 and S3 management console.

    Create S3 bucket
    Read More →

    Leave a Comment

    Your email address will not be published. Required fields are marked *

    Share via
    Copy link