Building a Simple ETL Pipeline with Python and Google Cloud Platform


Source: towardsdatascience.com

At the end of this article, you will be able to extract a file from an FTP server and load it into a data warehouse using Python in Google Cloud Functions.
In this article, we will do the following:
Set up a Cloud Function
Extract data
Transform data
Load data
Automate our pipeline
Firstly, what is ETL?
Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source, or in a different context than the source.
There are a lot of ETL tools out there, and they can be overwhelming, especially when you simply want to copy a file from point A to B. So today, I am going to show you how to extract a CSV file from an FTP server (Extract), modify it (Transform) and automatically load it into a BigQuery table (Load) using Python 3.7 and Google Cloud Functions.
Google Cloud Functions: Cloud Functions (CF) is Google Cloud’s serverless platform for executing scripts in response to specified events, such as an HTTP request or a database update. Alternatives to CF include AWS Lambda and Azure Functions.
Scenario
Let me give you some background info before we begin. A 3rd-party service provider had an FTP server that housed several CSV files, each containing the daily transaction data of a logistics company. We needed that data in our data warehouse so we could share it internally with stakeholders and monitor performance.
The infrastructure team at this logistics company exports one CSV file daily from their database, and uploads it to an FTP server.
Our job was to copy those files daily from the server, clean them and then load them into our data warehouse so we could connect them to other data sources and run analytics on them.
This is what our ETL pipeline diagram will look like in the end:

Created with Lucidchart

Setting up your Cloud Function
Open the Functions Overview page in the Cloud Console:
Go to the Cloud Functions Overview page
Make sure that the project for which you enabled Cloud Functions is selected.
Click Create function.
Name your function.
In the Trigger field, select HTTP Trigger.
In the Source code field, select Inline editor. In this exercise, you will use the code we are going to work on together so you can delete the default code in the editor.
Use the Runtime dropdown to select a runtime.
Make sure your runtime is set to “Python 3.7” and, under “Advanced options”, change the region to the one closest to you. As of the writing of this post, CF isn’t available in every Google data-centre region, so check here to see where Cloud Functions is enabled.
When you complete these steps, your display should look like this:

screenshot from GCP console
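If you prefer the command line, roughly the same function can be created with the gcloud CLI; the function name and region here are placeholders:

gcloud functions deploy ftp_function --runtime python37 --trigger-http --region europe-west1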
Our custom code
A Cloud Function has two files: a main.py and a requirements.txt file. The latter hosts all the dependencies our script needs to work, so click on the requirements.txt tab and make sure you have them included in the editor like so:
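Based on the dependencies summarized below, the contents would look something like this (version pins omitted):

google-cloud-bigquery
python-csv
requests
wget
pytest-shutil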

A quick summary of all dependencies:
google-cloud-bigquery: this library allows us to access and interact with BigQuery
python-csv: this library is used to manipulate CSV files with Python
requests: an HTTP library used to send HTTP requests, which we will need to access the FTP URL.
wget: used to download files from the internet
pytest-shutil: this is used for SSH access
Extract
Now in the main.py tab, you can start including the code below. Looking at lines 1 to 4, we create a function called “ftp_function”, which we will refer to later when accessing the Cloud Function with an HTTP request. We then log in to the FTP server with the necessary credentials and navigate to the directory on the server where the file is stored.
Please note, the FTP server I was working with had multiple CSVs representing transaction data for different days. So to get the most recent file in the directory, I used the code starting at line 7 and ending at line 9.
The rest of the code grabs the file and downloads it.
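To make those steps concrete, here is a minimal sketch of that Extract logic, assuming the standard-library ftplib for the connection; the host, credentials, and directory below are placeholders:

import ftplib
import os

def ftp_function(request):
    # Connect and log in to the FTP server (placeholder host and credentials)
    ftp = ftplib.FTP("ftp.example.com")
    ftp.login(user="username", passwd="password")

    # Navigate to the directory holding the daily CSV exports (placeholder path)
    ftp.cwd("/daily_exports")

    # List the files and pick the most recent one; sorting by name works
    # when the file names contain the export date
    files = ftp.nlst()
    latest_file = sorted(files)[-1]

    # Download the file to /tmp, the only writable directory in Cloud Functions
    local_path = os.path.join("/tmp", latest_file)
    with open(local_path, "wb") as f:
        ftp.retrbinary("RETR " + latest_file, f.write)
    ftp.quit()

    return "Downloaded " + latest_file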

Transform
To “transform” the data, we are going to make a simple change to the CSV file we just downloaded. We will simply change every occurrence of “FBA” to “AMAZON” in the CSV file. Here is the code below.
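A minimal sketch of that transformation, assuming the file was downloaded to /tmp and that the replacement applies to every field in every row:

import csv

def transform_csv(input_path, output_path):
    # Read the downloaded CSV, replace "FBA" with "AMAZON" in every field,
    # and write the result to a new file
    with open(input_path, "r", newline="") as infile, \
         open(output_path, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            writer.writerow([field.replace("FBA", "AMAZON") for field in row])

# Example usage inside the Cloud Function (paths are placeholders):
# transform_csv("/tmp/transactions.csv", "/tmp/transactions_clean.csv")

Going through the csv module keeps the replacement scoped to field values rather than rewriting raw bytes, which avoids corrupting quoted fields.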
