7 december 2017Blogg

What is Snowflake and how does it work?

When it comes to databases, I am Larry Ellison’s former best buddy. I remember commands for increasing tablespaces by heart, and all too many times, I have set the SGA and PGA values in the hopes of improving performance. I know that the value to be set first in the products from Redmond is ‘max server memory’, unless you want to try later and figure out where all the available RAM might have disappeared. Backup maintenance and the inner life of RMAN are not among my favourite topics, however. Snowflake is a Data Warehouse solution which has already mesmerised me by its simplicity. I will now tell you why.

Nevertheless, being well-versed in the above-mentioned functions, which are cryptic to some people, has not become any less important with the emergence of cloud service providers. The scalability of cloud and the reduced time required by maintenance thanks to elasticity are common marketing points. However, the truth is that whether you are using a data warehouse or an operational database, sooner or later you have to familiarise yourself with database parameter settings of your selected cloud service provider in order to achieve optimal performance, or you just need to add either CPU or disk capacity.

Increased demands on cloud and, consequently, cloud databases

Harri Kallio discussed how the maintenance expenses of traditional databases and the increased volume of data, have increased the demands placed on cloud – and, consequently, also on cloud databases. A separate database administrator should not be needed and the database should be increasingly automated. However, at the same time, opportunities should be offered to save history data flexibly and to respond to various capacity changes faster than before.

As far as the three existing large cloud operators are concerned, Amazon already complemented its service offering by releasing the Redshift Spectrum service. The service solves the problems caused by increased data volumes by providing an option to make data queries directly in Amazon’s S3 service.

Spectrum is an excellent addition to Redshift, but a completely new operator has taken the utilisation of the S3 service to a new level. Snowflake, a company founded by former employees of Oracle, Microsoft and Vectorwise, offers a scalable data warehouse as a service from cloud. Their selling point is the separation of database disk capacity from CPU performance, and they offer this at a cost-effective price. You only pay for the capacity and performance you use.

My first reaction was: “They must be lying”

However, now that I have used Snowflake’s service in a couple of projects, I can say that the company also walks the walk. I can really just create a database with the required users and schemas and then start populating it with data. The capacity will not run out. To tell the truth, they never even ask me how much capacity I need. And actually, I don’t even care, since capacity is inexpensive. All too inexpensive.

I get the performance I require by creating a virtual data warehouse of the size I want. In the name of simplicity, the warehouses come in the regular T-shirt sizes. Of course, I can change the size of my T-shirt to smaller or larger, if I want, on the fly and anytime.

Before making a query, the virtual data warehouse may also be in a sleep mode and only start when it is needed. When the query tasks end, the virtual data warehouse will go to sleep again if you have set it up to do so. The expenses are based on the on-time of the active virtual data warehouses and the disk capacity used.

How does Snowflake really function?

My answer is: “Elementary, my dear Watson.” Data is micro-partitioned, compressed and encrypted in the AWS S3 service and connected to through virtual data warehouses. Virtual data warehouses retrieve and review the data and return a result set to the user. The technology applied by the virtual data warehouses is covered by Snowflake’s IPR, but in practice, the warehouses are EC2 instances in Auto Scaling groups that have huge amounts of RAM enabling caching of large masses of data.

Figure. The Snowflake architecture

In practice, Snowflake applies the best practices of AWS and has built a very cost-effective and scalable service on top of them. The S3 service is inexpensive, stable and scalable for storing large volumes of data, and launching EC2 instances in the cloud on an as-needed basis makes a “pay-per-use” model possible.

At the same time, the shared data architecture helps meeting diverse needs by utilising virtual data warehouses. Does your Data Science pal need a dedicated, effective data warehouse for processing large data masses, but you are afraid that he’ll use up all capacity ETL executions would need? No worries. Snowflake enables the creation of separate virtual data warehouses with different size classes. All warehouses read the same data set without disturbing each other, while still in compliance with the ACID principle.

In addition, this also discreetly provides a solution for splitting the costs of using the data warehouse. Snowflake’s own UI offers an opportunity to itemise the credits used per virtual data warehouse. Thus, it is possible to split the final cost of credits directly without having to perform additional calculations.

The basic unit cost of credits is determined on the basis of the Snowflake version being used, ranging from the basic Standard version to the Enterprise Edition for Sensitive Data version.

The most significant differences between these include the 24/7 support on Premium and the EE versions’ AD support and data encryption option with either one’s own keys or keys maintained by Snowflake. In practice, however, the basic functionalities are not limited by the version; only the price per query will go up. In this way, costs can also be contained by obtaining a separate, more rudimentary version for development at a lower price per credit and the version requiring 24/7 support would be used for production.

Snowflake charms with simplicity

On top of all this, Snowflake meets the high usability criteria and offers an opportunity to make queries on old data in a simple way. The snapshot technology, familiar from the world of disk systems, makes it possible to create databases within seconds. You’ll get access to a test database almost instantly, without having to endure hours-long exercises to create a new environment. The billing unit is one second and the JSON support functions well. The list of Snowflake’s properties is way too long to include here, which is why I recommend you read this posting.

Is that stuff ready for production?

Yes. The eight-largest bank in the USA just switched to Snowflake and also decided to invest in the company after seeing the product’s potential. In addition, several suppliers, ranging from Microsoft’s PowerBI and Tableau, have incorporated Snowflake support in their products.

We at Solita also believe in this service and have therefore incorporated Snowflake support in our own Agile Data Engine  software. Agile Data Engine enables you to proceed from a development version to a full production model  faster than before, as the developer can simply create a database and start creating a data model including downloads without having to worry about disk capacity or performance. In practice, once Agile Data Engine has been implemented, the user can release their first stage tables and downloads from the tables within a few minutes, if they want.

I recommend that everyone checks out Snowflake, or at least pay attention the next time they hear the name.

Mika Heino works in the Data team at Solita. In the name of the DevOps culture, he is always eager to find technology solutions which make people’s lives easier and liberate them to expand their horizons. Since he is not a fan of long introductions, Mika uses the remaining space to challenge Ilkka Suomaa to blog about a week in the life of a Solita sales rep.