Thursday, December 26, 2024
Google search engine
HomeData Modelling & AIBig dataDemystifying Stages in Snowflake

Demystifying Stages in Snowflake

This article was published as a part of the Data Science Blogathon

Introduction

If you are just starting your journey with Snowflake, then I strongly recommend you to check my article that tells about how to get started with Snowflake. The term Stage has a specific purpose w.r.t Snowflake terminology. It is different compared to what we name as a staging area in the normal ETL terminology.

Stages in Snowflake are locations used to store data. If the data that needs to be loaded into Snowflake is stored in other cloud regions like AWS S3 or Azure or GCP then these are called External stages whereas if the data is stored inside Snowflake then these are called Internal stages.

Table of Contents

1.Internal Stages

First,  I will walk through in detail the Internal Stages.

Internal Stages are further divided as below

  1. User Stages
  2. Table Stages
  3. Internal Named stages

Before ingesting data into a table in Snowflake, data has to be loaded into a stage using the PUT command, and then it has to be loaded into the table using the COPY INTO command. Similarly, if we need to unload the data from a table, it has to be loaded into the stage using the GET command, and then it has to be exported using the COPY INTO command.

Note that there is a slight change in the syntax based on whether the local system is Unix or windows. I have used windows based system to store the local files in this article.

As of now, all the tasks are done via the SNOWSQL command-line interface. SNOWSQL is a client tool where you can install on your local machine and then connect to the snowflake. When you load the data from local to the stage, snowflake automatically compresses the file. If you check the file in the stage after loading you will see an extension of .gz all the time inside the stage.

1.a.User stages

User stages are tied to a specific user. Every user has a default stage created. We will not be able to either modify or remove these stages.

We can copy the files to these stages to load them further into the table. Once the load is completed we need to ensure to remove these files explicitly otherwise we need to pay for storage. Files in one user stage cannot be accessed by another user. So if you need to load multiple tables from a specific user then this is the best option. We need to refer to the user stages using ‘@~’

1.b.Table stages

Table stages are tied to a specific user. Whenever a table is created, then automatically table stage is created. Similar to the user stage will not be able to either modify or remove the table stage, however, we need to clean up space after the files are loaded. The table stage for a particular table cannot be accessed through another table. So if you need to load one table then you can choose the table stage. We need to refer to the user stages using ‘@%’

1.c.Internal Named Stages

These stages offer more flexibility compared to user or table stages. These are some of the snowflake objects. So all the operations that can be performed on objects can be performed on Internal named stages as well. We need to create these stages manually and we can also specify the file format options while creating the stage itself which is unlike the table or user stage. We need to refer to the user stages using ‘~’.

2.Use Case for Internal Stage

2.a.Problem Definition

Load source files from the local system into multiple tables in snowflake and then process the data. Processed data will be available in the target table. Unload the data from the target table into a file in the local system.

Note: Since the processing of data is out of scope for this article, I will skip this. I will populate the data in the target table manually. Let’s assume that aggregation of a particular employee salary.

2.b.Solution

Since we need to load multiple tables here, we can either go for the user stage or named stage. Processing the data is out of scope for this article, so I have not included that part. Assume that processed data is available in the target table. Since it is a single target table, we can use either table stage or named stage to unload the data from a table.

2.c.Steps to be Implemented

  1. Connect and log in to SNOWSQL.
  2. Load all the files from the local disk to the user stage using the PUT command.
  3. Load the files from the user stage to the source tables in snowflake using the COPY INTO table command.
  4. Unload the file from the target table into the table stage using COPY into the table stage.
  5. Unload the stage and copy it into the local system using GET.
  6. Clean all the files in the stages to avoid billing for storage.

 

2.d.The Architecture of the Implementation

 

The architecture | Stages in Snowflake

                                            Image source: Author

2.e.Naming Conventions

To understand easily I will use the naming conventions as below

  • Source Files in local disk – F_EMP_SAL1.txt,F_EMP_SAL2.txt
  • Source Tables :T_EMP_SAL1,T_EMP_SAL2
  • User stages : STG_EMP_USER
  • Target table: T_EMPTARGET
  • Table stages: STG_EMP_TABLE
  • Target file in local: F_EMP_TARGET

2.f.Pre-requisites for Implementation

  • SNOWSQL should be installed in the client machine.
  • Snowflake free trial account. If you do not have then check this article on how to create a free trail account on the snowflake.
  • Local txt Files with content is as below:
  •            F_EMP_SAL1.txt

1,aaron,3000

2,vidhya,4000

 

  •               F_EMP_SAL2.txt

1,Ben,7000

2,vidhya,2000

  • Source and target tables are created in Snowflake with the below script
 use role accountadmin;
use warehouse compute_wh;
use database DEMO_DB;
use schema PUBLIC;
create table T_EMP_SAL1(emp_id integer,emp_name varchar,empsal float);
create table T_EMP_SAL2(emp_id integer,emp_name varchar,empsal float);
create table T_EMP_TARGET(emp_id integer,emp_name varchar,empsal float);
insert into T_EMP_TARGET
values
(1,'Aaron',8000),
(1,'Vidhya',4000),
(1,'Ben',7000);

2.g.Implementation

We will see the code used in this section

 

Code

#Logging into SNOWSQL

snowsql -a <>.ap-south-1.aws -u <>

#1.Load all the files from the local disk to the user stage using PUT command

use role accountadmin;

use warehouse compute_wh;

use database DEMO_DB;

use schema PUBLIC;

put file://D:SnowflakeContentF_EMP_SAL1.txt @~;

put file://D:SnowflakeContentF_EMP_SAL2.txt @~;

select * from T_EMP_SAL1;#This returns empty table

select * from T_EMP_SAL2;#This returns empty table

#2.Load the files from the user stage to the source tables in snowflake using COPY INTO table command

copy into T_EMP_SAL1 from @~/F_EMP_SAL1.txt;

copy into T_EMP_SAL2 from @~/F_EMP_SAL2.txt;

select * from T_EMP_SAL1;

select * from T_EMP_SAL2;

#3.Copy the file from target table into the table stage

copy into @%T_EMP_TARGET from T_EMP_TARGET;

select * from T_EMP_TARGET;

#4.Unload into local system using GET.

get @%T_EMP_TARGET file://D:SnowflakeContent

#5.Clean all the files in the stages to avoid billing for storage.

list @~;

rm @~ pattern='.*txt.*';
rm @%T_EMP_TARGET;

Let’s see the output now for all the above commands

output of commands | Stages in Snowflake

Image Source – Author

output

 

Image Source – Author

output 3

Image Source – Author

2.h.Output

Let’s see if the file is exported.

file exporteddata in file

 

 

           Image source: Author                           

3.External Stage

If the files are located in an external cloud location, for example, if you need to load files from AWS S3 into snowflake then an external stage can be used.


Unlike Internal stages, loading and unloading the data can be directly done using COPY INTO. Get and Put commands are not supported in external stages. The external stage can be created via Web user interface or SNOWSQL as well. We will see how to create using the web user interface.

4.Use Case for External Stage

4.a.Problem Definition

Load files from AWS S3 into a snowflake table using an external stage.

4.b.Prerequisites

  1. To perform this demo, you need to have an AWS account.
  2. An access key and secret key to connect to AWS account.
  3. Create a table where data needs to be loaded in snowflake with the below script

4.c.Steps for Implementation

1.Create an AWS S3 bucket as shown below 

aws s3 bucket | Stages in Snowflake

Image source: Author

2.Upload files on S3

                                                 

upload file on s3 | Stages in Snowflake

                                                         Image source: Author

3. Create an external stage in snowflake using AWS keys

create external stage Image source: Author

4.d.Output

                                                output | Stages in Snowflake

 Image source: Author

5.Conclusion

In this article, we have seen what a stage is and how to use stages in Snowflake with a use case. Have you observed that in the first use case we have not used any command like create stage? This is because we have used table and user stage, you can also use the internal stage instead of this as the internal stage offers more flexibility compared to the table stage and user stage.


I hope you got a clear idea about the stages in Snowflake. If you have any questions or feedback then please comment below. 

Happy Learning!!

Here is my Linkedin profile in case if you want to connect with me.

Have something to say! Feel free to contact me here.

 

 

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

         
Deepti J

20 Jul 2021

RELATED ARTICLES

Most Popular

Recent Comments