How to Visualize Data on Superset

Real time data ingestion from MySQL to Snowflake using AWS and Snowpipes

SourceFuse
7 min read · Jul 19, 2024

By Shadab Sutar, Architect, SourceFuse

Introduction

In today’s world, where information is freely available on the internet, users worldwide generate and upload terabytes of data every second. It is therefore crucial that large and mid-sized companies adopt robust data strategies that can comb through this vast quantity of data and make sense of it.

The ability to properly transform that data into intelligence allows decision-makers to rely on the insights and trends it reveals, and to take relevant action based on them.

Our Example Use Case

Consider a hypothetical organization, ABC, which is going through rapid growth, both financially and in headcount. The executive management now wants clear insight into cash flow and how it can be further optimized, to ensure funds are allocated to the departments and areas that need them the most.

The solution proposed to the management team involves a business intelligence (BI) dashboard that will allow them to easily filter and observe financial data, on the basis of which crucial decisions can be made.

To create a stable, reliable working dashboard, several techniques and mechanisms are involved, such as:

  • Building ETL pipelines
  • Preparing data warehouses
  • Real time data ingestion
  • Report generation
  • Alert mechanisms to inform every stakeholder
  • Anomaly detection in data

In this blog, we are going to focus upon one of the techniques mentioned above: Real time data ingestion.

Real time data ingestion is a technique in which blobs or batches of data are ingested into a destination database as soon as they become available in the source database. To achieve this, we are going to use the following technical components:

  • Source: MySQL database
  • Destination: Snowflake database
  • AWS: services which migrate the data from the source to the destination
  • Snowflake Snowpipes: tasks which help load the data into the relevant database tables

Process flow diagram:

The above diagram depicts a systematic, step-by-step flow of data from source to destination. What follows is a tutorial-style example detailing each step.

MySQL — The Source

In our use case, a MySQL database is already set up where all transactions are inserted. This is a crucial database, as it holds all the finance-related information. The data is received from multiple sources across the organization’s departments, and hence the database has to be available and working at all times.

However, building a BI dashboard directly on top of this database creates a heavy load, and management does not want to risk bringing down the database with additional query traffic. The solution is to replicate the data into a separate database, such as Snowflake.

Setting up AWS for Monitoring the Source Database

AWS DMS: The AWS Database Migration Service (DMS) is a managed service which connects to a source, reads data from it, and loads it into a target (S3, in our example). This is achieved by following these steps:

  1. Within the AWS console, go to DMS
  2. Click on create task
  3. Enter the relevant details and click create task
  4. Once the task is created, click on endpoints in the left navigation
  5. Create two endpoints, one for the source and one for the target
  6. In our case, the source endpoint will have the MySQL URL and credentials
  7. Our target endpoint will have the S3 location. When the target is S3, we also need to set up some extra parameters so that DMS generates the files to place in the S3 bucket
  8. Make sure to create the S3 bucket well in advance, before creating the DMS task
  9. Once the setup is done, start the DMS task on the replication instance and watch the data files getting placed in your S3 bucket

Setting up SNS topic and the IAM role

  • Create an IAM role with S3 read access and SNS full access permissions. This is required so that a notification can be issued to Snowflake whenever a file is placed inside the S3 bucket.
  • Next, navigate to SNS and create a new topic. Make a note of the topic ARN, as we will need it later in the Snowflake configuration. Also note that we will come back to this SNS topic and edit its access policy with some additional information. For now, save it and move on to the next step.

Setting up Snowflake for Reading Real Time Data From S3

  1. Create a storage integration:
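Below is a minimal sketch of what this statement can look like; the integration name (mysql_s3_int), the IAM role ARN, and the bucket path are illustrative placeholders to be replaced with your own values.

```sql
-- Sketch only: the integration name, role ARN, and bucket are placeholders.
CREATE STORAGE INTEGRATION mysql_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://abc-finance-dms-bucket/');
```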

2. Describe the created storage integration:
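A sketch, assuming the integration was named mysql_s3_int as above:

```sql
-- Lists the integration's properties, including the values AWS needs
-- for the trust relationship.
DESC STORAGE INTEGRATION mysql_s3_int;
```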

Executing the above command in Snowflake will return two very important properties:

  • IAM ARN (returned as STORAGE_AWS_IAM_USER_ARN)
  • External ID (returned as STORAGE_AWS_EXTERNAL_ID)

Next, go back to the AWS console, navigate to IAM, and select the IAM role you just created. Under the role, open the ‘Trust relationships’ tab and update the external ID and the ARN with the values provided by Snowflake.

3. Fetch the Snowflake SNS ARN value:
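A plausible sketch of this step uses Snowflake’s SYSTEM$GET_AWS_SNS_IAM_POLICY function; the topic ARN below is a placeholder for the ARN you noted when creating the SNS topic.

```sql
-- Returns the policy statement to merge into the SNS topic's access policy.
SELECT SYSTEM$GET_AWS_SNS_IAM_POLICY(
  'arn:aws:sns:us-east-1:123456789012:abc-finance-dms-topic'
);
```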

Now copy the result, go back to the SNS topic, click edit, and update the topic’s access policy with Snowflake’s result.

4. Create a Snowflake file format:
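A minimal sketch, assuming DMS is emitting CSV files and using a placeholder format name:

```sql
-- Adjust the type and options if DMS was configured for Parquet instead.
CREATE OR REPLACE FILE FORMAT finance_csv_format
  TYPE = 'CSV'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 0;
```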

The file format tells Snowflake which type of files to look for within the S3 bucket and how to parse them.

5. Create a Snowflake stage:
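A sketch of an external stage, assuming the placeholder bucket, integration, and file format names used above:

```sql
CREATE OR REPLACE STAGE finance_stage
  URL = 's3://abc-finance-dms-bucket/transactions/'
  STORAGE_INTEGRATION = mysql_s3_int
  FILE_FORMAT = finance_csv_format;
```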

A Snowflake stage is the location where newly received files are made available for processing by the configured Snowpipe; here it is an external stage that points at our S3 bucket.

6. Create a Snowpipe with auto_ingest property set as true:
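A minimal sketch, assuming a hypothetical target table named transactions and the placeholder stage, file format, and SNS topic from the previous steps:

```sql
CREATE OR REPLACE PIPE finance_pipe
  AUTO_INGEST = TRUE
  AWS_SNS_TOPIC = 'arn:aws:sns:us-east-1:123456789012:abc-finance-dms-topic'
AS
  COPY INTO transactions
  FROM @finance_stage
  FILE_FORMAT = (FORMAT_NAME = 'finance_csv_format');
```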

A Snowpipe is a task in Snowflake which, when auto_ingest is set to true, continuously monitors the stage for new files. As soon as a new file arrives, the Snowpipe executes and loads the data into the table configured in it.

Note: There are various other properties of Snowpipe which are out of scope for this blog post.
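To sanity-check the setup, you can query the pipe’s status and confirm that rows are arriving; the names below are the same placeholders used above.

```sql
-- Shows the pipe's execution state and any pending file notifications.
SELECT SYSTEM$PIPE_STATUS('finance_pipe');

-- Confirm that newly ingested rows are landing in the target table.
SELECT COUNT(*) FROM transactions;
```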

Visualization of the Data

There are many intelligent BI tools available on the market to visualize data as per your requirements. However, for the sake of this blog, I will refer to the open-source reporting tool Apache Superset.

Apache Superset is a tool with all the required features to filter, sort, and group data as desired. It has a comprehensive list of chart types available to visualize your data. To set up our example database as a source in Superset, follow these steps:

  1. Log in to your Superset instance and navigate to Datasets. Click on ‘create new dataset’, add your Snowflake source database, and choose the relevant schema and table.

2. Once your dataset is added to your Superset instance, the next step is to select a chart type with which you can visualize your dataset.


3. After you have selected your chart and configured it to suit your needs, publish it as a dashboard, and that’s it. Your raw data is automatically pulled into the Snowflake database, and using Superset it can be visualized by your users. It’s as simple as that!


Conclusion

In today’s world, where data is gold for organizations, it is challenging to maintain accurate, real-time data and the processes that help make sense of it. At SourceFuse, we leverage AI functionality that helps our customers with raw data analytics, building ETL processes, anomaly detection, and data visualization to support precise and valuable decisions.

Need your data to work for you the way you want it to? Discover more with SourceFuse. LET’S TALK

Look out for future blogs, where I’ll be going into more detail on Superset visualization.

Written by SourceFuse

Strategic digital transformation helping businesses evolve through cloud-native technologies
