We have created partitioned tables and inserted data into them. Now, what if we want to drop a partition or add a new partition to the table? Can we have partitions at different locations? Let us try to answer these questions in this blog post.
We can run the below query to add a partition to a table.
ALTER TABLE order_partition_extrenal ADD PARTITION (year='2014', month='02')
LOCATION '/apps/hive/warehouse/maheshmogal.db/order_partition/year=2014/month=02';
Hive keeps table metadata in its metastore. Here we are adding new partition information to the table's metadata: we are telling Hive that the data for this partition of the table lives at this location.
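To confirm that the metastore picked up the new partition and its custom location, we can inspect the table. A quick check, using the table and partition from the example above:

```sql
-- List all partitions registered in the metastore for this table
SHOW PARTITIONS order_partition_extrenal;

-- Show the storage location recorded for one specific partition
DESCRIBE FORMATTED order_partition_extrenal PARTITION (year='2014', month='02');
```

The `Location` field in the DESCRIBE FORMATTED output should match the path we supplied in the ADD PARTITION statement.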
Does this mean we can have our partitions at different locations?
Of course we can. Not just in different locations but also in different file systems.
Consider a use case: you have a huge amount of data, but you do not use the old data frequently (something like log data). In that case, you can set up a job that moves old data to S3 (Amazon's low-cost storage service; you can learn more about it here) and then point those old partitions to the S3 location. Your latest data stays in HDFS, your old partitions live in S3, and you can still query the Hive table seamlessly.
ALTER TABLE log_messages ADD PARTITION (year = 2019, month = 12)
LOCATION 's3n://bucket_name/logs/2019/12';
With the ALTER TABLE command, we can also update a partition's location. Note that this only changes the metadata in the metastore; any existing data files are not moved automatically.
ALTER TABLE log_messages PARTITION (year = 2019, month = 12)
SET LOCATION '/maheshmogal.db/order_new/year=2019/month=12';
We can also rename existing partitions using the below query.
ALTER TABLE order_partition_extrenal PARTITION (year=2013, month=7)
RENAME TO PARTITION (year=2019, month=07);
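To confirm the rename took effect, we can list the table's partitions again (partition specs here match the example above):

```sql
-- The old spec year=2013/month=7 should no longer appear;
-- it should now be listed as year=2019/month=07
SHOW PARTITIONS order_partition_extrenal;
```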
We can also drop partitions from Hive tables.
ALTER TABLE order_partition_extrenal DROP PARTITION (year=2019, month=7);
This will delete the partition from the table. But what about the data when you have an external Hive table? Hive does not drop that data; it only removes the partition details from the table metadata. If you also want to drop the data along with the partition for external tables, then you have to do it manually.
hadoop fs -rmr /maheshmogal.db/order_new/year=2019/month=7
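For managed tables the situation is different: dropping a partition moves its data to the user's trash directory by default. In recent Hive versions you can add the PURGE keyword to delete the files immediately instead of trashing them. A sketch, assuming a managed table named order_partition:

```sql
-- Skip the trash and delete the partition's files right away (managed tables only)
ALTER TABLE order_partition DROP PARTITION (year=2019, month=7) PURGE;
```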
This was a short article, but I hope you will find it useful. Partitioning is one of the most important topics in Hive, and one of the core strategies for improving query performance. In the last few articles, we have covered most of the details of partitioning in Hive. Next, we will start learning about bucketing, an equally important concept in Hive with its own unique features and use cases. Hope to see you there.