Now that we understand why you should introduce automation into your data management strategy and how Data Vault 2.0 makes automation possible (see Part 1), I want to review the architecture I will be using for the Data Vault 2.0 automation demonstration. There are many architectural configurations that can be successfully implemented for your organization, but here are my five personal reasons for choosing this architecture for my demonstration.
Personal Experience and Success - I have successfully implemented this exact architectural configuration for multiple Data Vault implementations and have a comfort level with it.
Time to Market - This configuration is relatively quick to stand up, and you can be underway in a matter of days. I am a data specialist, not a hardware or software expert, but even I was able to get this all up and running relatively quickly. The sooner your business is leveraging the data to gain insight, the better.
Fits any size Organization - This configuration can be leveraged for anything from the smallest Data Vault to the world’s largest Data Vault implementation. It can scale to meet the needs and growth of an organization of any size.
Public Cloud Agnostic - Regardless of which public cloud provider your organization may be using, this configuration can work for you.
Budget Friendly - These options won’t break your budget. These technologies can be leveraged by an organization of any size and budget. There are no long-term contracts, and you pay only for what you need when you need it.
Database Platform - Snowflake
At this point I suspect most of you have heard of Snowflake, so I am not going to go into an in-depth review of it. There is a vast amount of information available on Snowflake, and if you would like to test drive the platform yourself, a 30-day free trial is available. What I would like to do is give you my reasons for choosing Snowflake for this demo and explain how it is a good fit for a Data Vault 2.0 implementation.
Snowflake is a SaaS offering, so it was very easy for me to get up and running. As soon as I had my account URL, I was able to quickly create the database(s) and compute I needed, with no need to provision any infrastructure or install any software.
Snowflake can handle all types of data. Just a few years back, if you needed to handle semi-structured data, you would not have been able to use a traditional database; you had to have separate solutions for structured and semi-structured data. With Snowflake you can manage structured, semi-structured and, as recently announced, unstructured data.
Snowflake can handle many aspects of the Data Vault architecture. It can host a Landing Area capable of handling all types of data, as well as the Staging, Raw Vault, Business Vault and Information Mart areas.
Snowflake also provides some core features I find very helpful for supporting an Agile development methodology with autonomous delivery teams. For example:
Zero Copy Clone can empower a development team to clone realistic data into an area that only that team is working in. This allows the team to profile, develop, modify, and test data without impacting other teams or processes, all without having to physically copy data and increase your data storage.
Virtual Warehouses allow you to create and manage the compute you need when you need it. As your Data Vault grows you can scale your compute horizontally and vertically as appropriate. Also, because compute is separated, you can create separate compute for each development team to develop and test without contention with other development teams.
Snowflake’s recognition and support of Data Vault 2.0. Snowflake is one of the only database platforms that has blogs, user groups and people who support and advise on how Data Vault 2.0 can be leveraged in Snowflake, which features to use and how best to use them. See Kent Graziano and Patrick Cuba, as well as their Data Vault Users Group.
Snowflake will not only be the database for this demonstration; I will also be using Snowflake procedures for the ELT. This keeps the data and the processing 100% within Snowflake, so no data leaves the platform.
Orchestration - Airflow
Orchestration isn’t the sexiest or most talked about component of the Data Vault architecture, but it is a critical one. Even though in theory you can load all your Hubs, Links and Satellites in parallel, you do want to manage the level of concurrency to control your costs and meet your SLAs most effectively. You may also have dependencies on source delivery schedules, and your PITs and Bridges depend on the Raw Vault being loaded before they can be refreshed. There are many variations of coordination and scheduling that need to be managed.
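To make those dependencies concrete, here is a minimal sketch in plain Python. It is not Airflow code and not the workflow used in the demonstration: the object names (`hub_customer`, `pit_customer` and so on), the `load` placeholder and the `MAX_PARALLEL` cap are all assumptions made up for illustration. It uses the standard library's `graphlib` to release an object only once everything it depends on has loaded, and a thread pool to cap how many loads run at once.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical objects: each key maps to the objects it must wait for.
# Hubs, Links and Satellites wait only on staging, so they can load in parallel.
# PITs and Bridges wait on the Raw Vault objects they are built from.
DEPENDENCIES = {
    "stg_customer": set(),
    "stg_order": set(),
    "hub_customer": {"stg_customer"},
    "hub_order": {"stg_order"},
    "sat_customer": {"stg_customer"},
    "link_customer_order": {"stg_order"},
    "pit_customer": {"hub_customer", "sat_customer"},
    "bridge_customer_order": {"hub_customer", "hub_order", "link_customer_order"},
}

MAX_PARALLEL = 2  # assumed concurrency cap, tuned to cost and SLA in practice


def load(name):
    """Placeholder for calling the real load procedure for one object."""
    return name


def run_all(dependencies, max_parallel=MAX_PARALLEL):
    """Load every object, respecting dependencies and the concurrency cap."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    finished = []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        running = set()
        while sorter.is_active():
            # Submit everything whose prerequisites have completed.
            for name in sorter.get_ready():
                running.add(pool.submit(load, name))
            # Block until at least one load finishes, then unlock its dependents.
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = future.result()
                finished.append(name)
                sorter.done(name)
    return finished


if __name__ == "__main__":
    print(run_all(DEPENDENCIES))
```

The completion order can vary from run to run, but a PIT or Bridge never starts before the Raw Vault objects it reads from have finished. An orchestrator such as Airflow gives you the same guarantees, plus scheduling, retries and monitoring.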
For this demonstration I will be using Airflow. I chose Airflow not only because it is very flexible and robust, but also because it is open source (part of the Apache Software Foundation), so I didn’t have to go out and spend a lot of money. Airflow workflows are defined by writing a Python script, and Airflow itself is also written in Python. Many organizations have people who are comfortable with Python, which makes it easy to add custom code and write additional workflows for other tasks.
Airflow has a user interface that allows you to view your workflows and monitor them, e.g. runtimes, logs, etc. When your jobs do fail, you can check the logs and rerun the entire load or only the tasks that depend on the failed task. For my demonstration Airflow will be running on a small VM, and this is pretty much the only infrastructure that had to be established for the demonstration. However, if your organization is uncomfortable managing Airflow code and infrastructure on its own, there are managed Airflow services you could investigate.
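The idea of rerunning only what a failure affected can be sketched in a few lines of Python. This is an illustration of the concept, not how Airflow implements it: the `DOWNSTREAM` mapping and the task names are hypothetical, and I am assuming that the failed task itself is rerun along with everything that depends on it.

```python
from collections import deque

# Hypothetical workflow: each task maps to the tasks that run directly after it.
DOWNSTREAM = {
    "stg_customer": ["hub_customer", "sat_customer"],
    "hub_customer": ["pit_customer"],
    "sat_customer": ["pit_customer"],
    "stg_order": ["hub_order"],
    "hub_order": [],
    "pit_customer": [],
}


def tasks_to_rerun(failed_task, downstream=DOWNSTREAM):
    """Return the failed task plus every task that depends on it, breadth-first."""
    to_rerun = [failed_task]
    seen = {failed_task}
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:  # a task reachable by two paths is listed once
                seen.add(child)
                to_rerun.append(child)
                queue.append(child)
    return to_rerun


if __name__ == "__main__":
    print(tasks_to_rerun("sat_customer"))  # ['sat_customer', 'pit_customer']
```

In this made-up workflow a failure in `sat_customer` leaves the order tasks and `hub_customer` untouched, which is the saving you get from rerunning part of a load rather than all of it.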
Data Vault 2.0 Automation - Vaultspeed
Last but not least in my demonstration architecture is the Data Vault automation tool. For this demonstration I will be using Vaultspeed. Vaultspeed is a controlled delivery solution and not a runtime application. What do I mean by that? Vaultspeed only leverages the metadata from your data source and guides you through the design process to generate the ETL/ELT and DDL code that will be deployed into your production runtime system. Vaultspeed is in no way in the critical path or used at run time, so there is no need to have Vaultspeed colocated with your data.
Vaultspeed is a SaaS solution currently available in the Azure cloud, but it does not require your organization to be using Azure. Once you have established an account with Vaultspeed, you are given a URL and login credentials, and all your development work is done through a browser. There is, however, a small agent that has to be installed somewhere within your organization’s network. This agent is what communicates with your database and/or code management repository. For security purposes, the Vaultspeed application in the cloud holds no critical credentials that would place your data at risk. Any time Vaultspeed needs to harvest metadata from your database or deploy code to your environment, it places that request in a queue and waits for the agent to “phone home” and see if there is anything that needs to be executed within your environment. For example, if you need metadata pulled into Vaultspeed, you place that request in Vaultspeed, and if the agent is pinging every 30 seconds, the request sits in the queue until the agent picks it up and harvests the metadata from the database. The agent then packages up that information and ships it back to Vaultspeed. Vaultspeed is NEVER connected to, and never has access to, your physical business data. Getting Vaultspeed up and running is quite quick, and this leads into my five reasons for using Vaultspeed to demonstrate Data Vault automation.
Vaultspeed requires very little effort to get up and running. As I stated earlier, I am a data guy and not an infrastructure person, so managed solutions suit me well. For the demonstration, the same VM that runs Airflow will also host the agent that communicates with the Snowflake database.
Data Vault 2.0 Certified. Vaultspeed is the only automation technology that is Vendor Tool Certified by the Data Vault Alliance, which gives you confidence that the code and DDL being generated are DV2.0 compliant.
No need to write any code! Vaultspeed does have Vaultspeed Studio for making custom enhancements to its templates, but that would be used only in the Business Vault space for implementing some soft rule scenarios. For the Raw Vault and Business Vault core objects, Vaultspeed will generate all the code and DDL needed for runtime.
Vaultspeed is Agile friendly and allows you to develop and deploy in releases. This includes initial releases and incremental releases for both DDL and ETL/ELT code. Vaultspeed tracks what structures are in production and, if those structures change, how to handle the change. For example, if an existing source table gets a new column, Vaultspeed will handle the delta changes to the DDL and ETL/ELT so that the new column does not impact any Hash Keys or Hash Diffs on objects currently deployed.
Generates Data Vault DDL and code that best leverage the technologies being used. In this case I will be using Snowflake for the structures and the ETL/ELT, so Vaultspeed will generate DDL and procedures that best leverage the Snowflake technology. Even though I hold my SnowPro and Data Vault certifications, I can trust that Vaultspeed is generating code that uses Snowflake effectively and keeps to the Data Vault standard.
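To illustrate why a new source column can be a risk to Hash Keys and Hash Diffs (the fourth reason above), here is a minimal Python sketch. It is not Vaultspeed's generated code, and I am not claiming this is how Vaultspeed handles the change. Everything in it is an assumption for illustration: the MD5 hash, the `||` delimiter, the trimming and upper-casing rules, the column names, and the convention of appending new columns at the end and stripping trailing delimiters so that rows loaded before the column existed keep the same Hash Diff.

```python
import hashlib

DELIMITER = "||"  # assumed delimiter between concatenated values


def _text(value):
    """Treat None as an empty string and trim everything else."""
    return "" if value is None else str(value).strip()


def hash_key(*business_keys):
    """Hash of the (upper-cased) business key or keys of a Hub or Link row."""
    joined = DELIMITER.join(_text(key).upper() for key in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def hash_diff(record, columns):
    """Hash of a Satellite's descriptive attributes, in a fixed column order."""
    joined = DELIMITER.join(_text(record.get(column)) for column in columns)
    # Strip trailing delimiters left by empty values at the end, so a newly
    # appended column that is still empty does not change the result.
    while joined.endswith(DELIMITER):
        joined = joined[: -len(DELIMITER)]
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    old_columns = ["name", "city"]
    new_columns = ["name", "city", "loyalty_tier"]  # new column appended last
    row = {"name": "Ada", "city": "Ghent"}

    # Same row, before and after the column was added: prints True
    print(hash_diff(row, old_columns) == hash_diff(row, new_columns))

    # Once the new column carries a value, the Hash Diff changes: prints False
    upgraded = {**row, "loyalty_tier": "gold"}
    print(hash_diff(row, new_columns) == hash_diff(upgraded, new_columns))
```

The Hash Key is untouched here because it is built only from the business key, which the new column is not part of. Without some rule like the trailing-delimiter one, every existing row would look changed on the first load after the release.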
As I stated earlier, there are a multitude of configurations and options that can be used to deliver your Data Vault initiative. But this configuration of tools and technologies has been used for relatively small organizations and for very large organizations. All these solutions can be scaled to meet the needs of your organization.
The diagram below shows in purple the area of Data Vault 2.0 that Vaultspeed