Friday, November 15, 2024
Google search engine
HomeData Modelling & AIHow to Create an External Table in Hive

How to Create an External Table in Hive

Introduction

In Hive terminology, external tables are tables not managed with Hive. Their purpose is to facilitate importing of data from an external file into the metastore.

The external table data is stored externally, while Hive metastore only contains the metadata schema. Consequently, dropping of an external table does not affect the data.

In this tutorial, you will learn how to create, query, and drop an external table in Hive.

How to Create an External Table in HiveHow to Create an External Table in Hive

Prerequisites

  • Ubuntu 18.04 LTS or later
  • Access to command line with sudo privileges
  • Apache Hadoop installed and running
  • Apache Hive installed and running

Note: This tutorial uses Ubuntu 20.04. However, Hive works the same on all operating systems. This means the process of creating, querying and dropping external tables can be applied to Hive on Windows, Mac OS, other Linux distributions, etc.


Creating an External Table in Hive – Syntax Explained

When creating an external table in Hive, you need to provide the following information:

  • Name of the table – The create external table command creates the table. If a table of the same name already exists in the system, this will cause an error. To avoid this, add if not exists to the statement. Table names are case insensitive.
  • Column names and types – Just like table names, column names are case insensitive. Column types are values such as int, char, string, etc.
  • Row format – Rows use native or custom SerDe (Serializer/Deserializer) formats. Native SerDe will be used if the row format is not defined, or if it is specified as delimited.
  • Field termination character – This is a char type character which separates table values in a row.
  • Storage format – You can specify storage formats such as textfile, sequencefile, jsonfile, etc.
  • Location – This is the HDFS directory location of the file containing the table data.

The correct syntax for providing this information to Hive is:

create external table if not exists [external-table-name] (
[column1-name] [column1-type], [column2-name] [column2-type], …)
comment '[comment]'
row format [format-type]
fields terminated by '[termination-character]'
stored as [storage-type]
location '[location]';

Create a Hive External Table – Example

For the purpose of a practical example, this tutorial will show you how to import data from a CSV file into an external table.

Step 1: Prepare the Data File

1. Create a CSV file titled ‘countries.csv’:

sudo nano countries.csv

2. For each country in the list, write a row number, the country’s name, its capital city, and its population in millions:

1,USA,Washington,328
2,France,Paris,67
3,Spain,Madrid,47
4,Russia,Moscow,145
5,Indonesia,Jakarta,267
6,Nigeria,Abuja,196
Editing a CSV file in Nano.Editing a CSV file in Nano.

3. Save the file and make a note of its location.

Step 2: Import the File to HDFS

1. Create an HDFS directory. You will use this directory as an HDFS location of the file you created.

hdfs dfs -mkdir [hdfs-directory-name]

2. Import the CSV file into HDFS:

hdfs dfs -put [original-file-location] [hdfs-directory-name]
Using hdfs dfs put command to copy the file in the HDFS directoryUsing hdfs dfs put command to copy the file in the HDFS directory

3. Use the -ls command to verify that the file is in the HDFS folder:

hdfs dfs -ls [hdfs-directory-name]
Using hdfs dfs ls command to list the contents of an HDFS directoryUsing hdfs dfs ls command to list the contents of an HDFS directory

The output displays all the files currently in the directory.


Note: To learn more about HDFS, refer to What Is HDFS? Hadoop Distributed File System Guide.


Step 3: Create an External Table

1. After you import the data file to HDFS, initiate Hive and use the syntax explained above to create an external table.

Creating an external table in HiveCreating an external table in Hive

2. To verify that the external table creation was successful, type:

select * from [external-table-name];

The output should list the data from the CSV file you imported into the table:

Using the select command to see the contents of an external tableUsing the select command to see the contents of an external table

3. If you wish to create a managed table using the data from an external table, type:

create table if not exists [managed-table-name](
[column1-name] [column1-type], [column2-name] [var2-name], …)
comment '[comment]';
Creating a managed table in HiveCreating a managed table in Hive

4. Next, import the data from the external table:

insert overwrite table [managed-table-name] select * from [external-table-name];
Using the insert command to import data from an external table to a managed tableUsing the insert command to import data from an external table to a managed table

5. Verify that the data is successfully inserted into the managed table.

select * from [managed-table-name];
Using the select command to see if the content of the external table successfully transferred to the managed tableUsing the select command to see if the content of the external table successfully transferred to the managed table

How to Query a Hive External Table

To display all the data stored in a table, you will use the select * from command followed by the table name. Hive offers an expansive list of query commands to let you narrow down your searches and sort the data according to your preferences.

For example, you can use the where command after select * from to specify a condition:

select * from [table_name] where [condition];

Hive will output only the rows which satisfy the condition given in the query:

Using conditions to narrow down the output of the select command in HiveUsing conditions to narrow down the output of the select command in Hive

Instead of the asterisk character which stands for “all data”, you can use more specific determiners. Replacing the asterisk with a column name (such as CountryName, from the example above) will show you only the data from the chosen column.

Here are some other useful query functions and their syntax:

Function Syntax
Query a table according to multiple conditions select * from [table_name] where [condition1] and [condition2];
Order table data select [column1_name], [column2_name] from [table_name] order by [column_name];
Order table data in descending order select [column1_name], [column2_name] from [table_name] order by [column_name] desc;
Show the row count select count(*) from [table_name];

How to Drop a Hive External Table

1. Dropping an external table in Hive is performed using the same drop command used for managed tables:

drop table [table_name];

The output will confirm the success of the operation:

Dropping an external table in HiveDropping an external table in Hive

2. Querying the dropped table will return an error:

Querying a dropped external table returns an error.Querying a dropped external table returns an error.

However, the data from the external table remains in the system and can be retrieved by creating another external table in the same location.

Conclusion

After reading this tutorial, you should have general understanding of the purpose of external tables in Hive, as well as the syntax for their creation, querying, and dropping.

Was this article helpful?
YesNo

RELATED ARTICLES

Most Popular

Recent Comments