Data Types in Hive
Like relational database, hive support multiple primitive data types. With them, Hive also supports collection data types as well to make data reading fast from disk. Let us learn more about supported data types in Hive before diving deep deep.
Primitive Data types
You can refer table below to know which primitive data types are supported by hive. Like relational databases it is pretty self explanatory.
Data Type | Size for Data Type | Example |
---|---|---|
TINYINT | 1 Byte | 18 |
SMALLINT | 2 Bytes | 548 |
INT | 4 Bytes | 78495 |
BIGINT | 8 Bytes | 7895462 |
BOOLEAN | TRUE | FALSE | TRUE |
FLOAT | Single Precision | 3.14 |
DOUBLE | Double Precision | 3.14 |
STRING | "This is test string" | |
TIMESTAMP (hive version 0.8.0+) | 13254869(Unix Timestamp) or 2020-05-30 12:34:24 (JDBC complaint time) | |
BINARY (hive version 0.8.0+) | Array of Bytes | |
DECIMAL(give version 0.11.0+) | Precision of 38 digits | 1.25846 |
DATE(Hive version 0.12.0 +) | Supports only String values in Hive | "2020-05-30" |
INTERVAL (hive version 1.2.0+) | Supports only String with Hive | INTERVAL '1' DAY |
It is useful to remember that these data types have been implemented in Java. So they follow basic rules of Java typecasting. That means when you try to add INT and FLOAT, INT will be converted to FLOAT and then those two numbers will be added.
Collection Data Types
Along with above mentioned primitive data types, the Hive also supports collection data types like Array, Structs, and Maps. Collection data types are useful when we want to retrieve data for processing. If we save data in normalized fashion then we might need another disk scan to get related data. To optimize disk read time, Hive stores related information at once place so that we can access it faster (for example, storing the address in Struct, or students enrolled classes in an array, etc)
Below are three collection data types supported by Hive
- STRUCT
- for example, STRUCT {home_no: STRING, street_name: STRING, city: STRING}
- MAP
- Like in many programming languages, Map holds key-value pairs.
- for example, MAP(STRING, STRING)
- ARRAY
- ARRAY holds list of the same types of items.
- for example, [‘BANANA’, ‘APPLE’, ‘I AM NOT FRUIT BUT I AM STRING’]
There is also UNION type in Hive but support for it is incomplete. Some queries ( like where and group by ) will fail with UNION type.
Hive Facts
Creating a Sample Table
Now that we have learnt about data types in Hive, below you can find sample table created using few of above Data types.
1 2 3 4 5 6 7 8 9 10 11 |
hive (maheshmogal)> CREATE TABLE student ( > id INT, > name STRING, > birth_date DATE, > enrolled_classes ARRAY<STRING>, -- class names in string format > assignments MAP<STRING, DECIMAL>, --assignment name as keys with grades in decimal > address STRUCT<street: STRING, city: STRING, zip: STRING> -- address fields > ); OK Time taken: 2.38 seconds hive (maheshmogal)> |
File Encoding for Collection Data Types
Now that we know different data types supported in Hive, the next logical question is how Hive will know in which format that data is stored in file especially for collection data types. How array elements are separated or what separates the hive key and value pair? This is where we can use table properties in Hive.
We Have briefly touched on table properties in create table section. We will see now how to let hive know about the structure of our collection data types. Consider the query below.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
hive (maheshmogal)> CREATE TABLE student ( > id INT, > name STRING, > birth_date DATE, > enrolled_classes ARRAY<STRING>, > assignments MAP<STRING, DECIMAL>, > address STRUCT<street: STRING, city: STRING, zip: STRING> > ) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > COLLECTION ITEMS TERMINATED BY ',' > MAP KEYS TERMINATED BY ':' > LINES TERMINATED BY '\n'; |
Here we can see that how we can let hive know about our data type and how they are represented in a file. For example, Map key and value are separated by ‘:’ and each pair in the map is separated by ‘,’.
We can change these values depending on our data. So we can have a pipe (|) delimited data but collection items can be separated by comma(,).
Last line in query, LINES TERMINATED BY \n, Let hive know that each new line in file contains new row. Currently hive only supports new line for identifying a new row. So you can avoid mentioning it in query. ‘\n’ is default value for this filed.
ROW FORMAT DELIMITED should be first-line after table definition followed by filed separate definitions. Only LOCATION property can be defined before ROW FORMAT DELIMITED.
Hive Facts
Conclusion
We have learned the basics of Hive Data types. Using hive table properties we can let hive know how that data is stored in our files. We will use these properties going forward in the next blog. See you there.