Introduction
In Hive terminology, external tables are tables not managed with Hive. Their purpose is to facilitate importing of data from an external file into the metastore.
The external table data is stored externally, while Hive metastore only contains the metadata schema. Consequently, dropping of an external table does not affect the data.
In this tutorial, you will learn how to create, query, and drop an external table in Hive.
Prerequisites
- Ubuntu 18.04 LTS or later
- Access to command line with sudo privileges
- Apache Hadoop installed and running
- Apache Hive installed and running
Note: This tutorial uses Ubuntu 20.04. However, Hive works the same on all operating systems. This means the process of creating, querying and dropping external tables can be applied to Hive on Windows, Mac OS, other Linux distributions, etc.
Creating an External Table in Hive – Syntax Explained
When creating an external table in Hive, you need to provide the following information:
- Name of the table – The
create external table
command creates the table. If a table of the same name already exists in the system, this will cause an error. To avoid this, addif not exists
to the statement. Table names are case insensitive. - Column names and types – Just like table names, column names are case insensitive. Column types are values such as
int
,char
,string
, etc. - Row format – Rows use native or custom SerDe (Serializer/Deserializer) formats. Native SerDe will be used if the row format is not defined, or if it is specified as delimited.
- Field termination character – This is a
char
type character which separates table values in a row. - Storage format – You can specify storage formats such as textfile, sequencefile, jsonfile, etc.
- Location – This is the HDFS directory location of the file containing the table data.
The correct syntax for providing this information to Hive is:
create external table if not exists [external-table-name] (
[column1-name] [column1-type], [column2-name] [column2-type], …)
comment '[comment]'
row format [format-type]
fields terminated by '[termination-character]'
stored as [storage-type]
location '[location]';
Create a Hive External Table – Example
For the purpose of a practical example, this tutorial will show you how to import data from a CSV file into an external table.
Step 1: Prepare the Data File
1. Create a CSV file titled ‘countries.csv’:
sudo nano countries.csv
2. For each country in the list, write a row number, the country’s name, its capital city, and its population in millions:
1,USA,Washington,328
2,France,Paris,67
3,Spain,Madrid,47
4,Russia,Moscow,145
5,Indonesia,Jakarta,267
6,Nigeria,Abuja,196
3. Save the file and make a note of its location.
Step 2: Import the File to HDFS
1. Create an HDFS directory. You will use this directory as an HDFS location of the file you created.
hdfs dfs -mkdir [hdfs-directory-name]
2. Import the CSV file into HDFS:
hdfs dfs -put [original-file-location] [hdfs-directory-name]
3. Use the -ls
command to verify that the file is in the HDFS folder:
hdfs dfs -ls [hdfs-directory-name]
The output displays all the files currently in the directory.
Note: To learn more about HDFS, refer to What Is HDFS? Hadoop Distributed File System Guide.
Step 3: Create an External Table
1. After you import the data file to HDFS, initiate Hive and use the syntax explained above to create an external table.
2. To verify that the external table creation was successful, type:
select * from [external-table-name];
The output should list the data from the CSV file you imported into the table:
3. If you wish to create a managed table using the data from an external table, type:
create table if not exists [managed-table-name](
[column1-name] [column1-type], [column2-name] [var2-name], …)
comment '[comment]';
4. Next, import the data from the external table:
insert overwrite table [managed-table-name] select * from [external-table-name];
5. Verify that the data is successfully inserted into the managed table.
select * from [managed-table-name];
How to Query a Hive External Table
To display all the data stored in a table, you will use the select * from
command followed by the table name. Hive offers an expansive list of query commands to let you narrow down your searches and sort the data according to your preferences.
For example, you can use the where
command after select * from
to specify a condition:
select * from [table_name] where [condition];
Hive will output only the rows which satisfy the condition given in the query:
Instead of the asterisk character which stands for “all data”, you can use more specific determiners. Replacing the asterisk with a column name (such as CountryName, from the example above) will show you only the data from the chosen column.
Here are some other useful query functions and their syntax:
Function | Syntax |
---|---|
Query a table according to multiple conditions | select * from [table_name] where [condition1] and [condition2]; |
Order table data | select [column1_name], [column2_name] from [table_name] order by [column_name]; |
Order table data in descending order | select [column1_name], [column2_name] from [table_name] order by [column_name] desc; |
Show the row count | select count(*) from [table_name]; |
How to Drop a Hive External Table
1. Dropping an external table in Hive is performed using the same drop command used for managed tables:
drop table [table_name];
The output will confirm the success of the operation:
2. Querying the dropped table will return an error:
However, the data from the external table remains in the system and can be retrieved by creating another external table in the same location.
Conclusion
After reading this tutorial, you should have general understanding of the purpose of external tables in Hive, as well as the syntax for their creation, querying, and dropping.