Snowpark in real life – what have we learned?

Filip Wästberg, Data Scientist, Solita

Published 08 May 2023

Reading time 5 min

It has been almost half a year since Snowflake made Python generally available in Snowflake through Snowpark. This has opened up opportunities for Data Scientists to run machine learning workloads on Snowflake's compute engine. Since the general release in November, we have helped various clients set up and evaluate Snowpark as a data science and machine learning platform. These are our first impressions of Snowpark.

Snowpark is great for machine learning in the database

Many data science projects start with a Data Scientist who develops a machine learning model or some analysis that they want to share with stakeholders. For many organisations that we work with, it is vital to get the output from a machine learning model into a database so that it can be used in, for example, a BI tool or an operational system. Snowpark solves this problem without the need for additional tools.
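To make this concrete, here is a minimal sketch of that batch-prediction pattern, assuming a model trained locally with scikit-learn. The connection parameters, table names, column names and the churn_model.joblib file are all placeholders for illustration, not from an actual client setup.

```python
# Minimal sketch: score a Snowflake table with a pre-trained model and
# write the predictions back. All names below are placeholders.
import joblib
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own account.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Load a model trained elsewhere (placeholder file).
model = joblib.load("churn_model.joblib")

# Pull the feature table into pandas, score it (assuming every column
# is a model feature), and write the result back to Snowflake so a BI
# tool or operational system can read it.
features = session.table("CUSTOMER_FEATURES").to_pandas()
features["CHURN_SCORE"] = model.predict_proba(features)[:, 1]
session.create_dataframe(features).write.save_as_table(
    "CHURN_PREDICTIONS", mode="overwrite"
)
```

For large tables you would push the model into Snowflake as a UDF instead of pulling data out, but a round trip like this is often enough to get the first model output into a table that stakeholders can actually use.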

Where does Snowpark fit in?

Many data science teams struggle with orchestrating machine learning pipelines. A lot of them get stuck or spend way too much time figuring out open source tools like Airflow or Docker to get things up and running.

Of course, there are a number of ways to solve this. At Solita, we generally emphasize setting up a proper MLOps framework on top of a data warehouse like Snowflake. This can be done with third-party data science platforms (like Posit Connect, Dataiku or Hex), cloud vendor solutions (Azure ML, SageMaker, Vertex AI) or built on open source components. Each has its pros and cons: third-party tools generally solve a lot more than just orchestration and scheduling (and are therefore more expensive), while cloud vendor solutions and open source components can be complicated to set up if you lack the competence or don't have the infra team on board. Here we should probably also mention one of Snowflake's main competitors, Databricks, but we'll save that article for another time.

The bottom line is that many organisations in the early stages of their data science journey struggle with operationalising machine learning. If you have data in Snowflake, Snowpark is a great way to get started before making investments into additional platforms. Snowpark isn't necessarily a replacement for MLOps and data science platforms; instead, they can – and probably should – be used in conjunction for large-scale data science projects.

There are some things that Snowpark cannot do, yet

1. Models as APIs

Besides running ML models to enrich data in a database (often referred to as batch predictions), one of the most common ways to put an ML model into production is to publish it as an API for predictions on demand (online predictions). This is not something that Snowpark supports, so if you have a use case that needs real-time inference from a model outside of the database, you probably need more than Snowpark (talk to us if you're not sure which route is best for you).

2. Interactive applications

Another popular way to share data science work is by building interactive applications using tools like Shiny (in R) and Streamlit (in Python). These applications differ from traditional BI tools in that they have a live Python or R runtime behind them. As it happens, Streamlit has been acquired by Snowflake, so it will be integrated into Snowflake in the future (check out this sneak peek), which will open up a lot of opportunities for sharing results and monitoring ML models.
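Until that integration arrives, a standalone Streamlit app can already read from Snowflake via Snowpark. Here is a minimal sketch of the kind of app described above, reusing the placeholder connection details and the hypothetical CHURN_PREDICTIONS table from the earlier example:

```python
# Minimal Streamlit sketch: an interactive page backed by a live Python
# runtime, reading a (hypothetical) predictions table from Snowflake.
import streamlit as st
from snowflake.snowpark import Session

# Placeholder connection details, as in the earlier sketch.
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}

@st.cache_resource  # create the session once per app process
def get_session() -> Session:
    return Session.builder.configs(connection_parameters).create()

st.title("Churn predictions")

predictions = get_session().table("CHURN_PREDICTIONS").to_pandas()

# The slider is what a BI tool typically cannot give you: the filter
# runs in Python, so any downstream logic can react to it.
threshold = st.slider("Minimum churn score", 0.0, 1.0, 0.5)
st.dataframe(predictions[predictions["CHURN_SCORE"] >= threshold])
```

Saved as app.py, this runs with `streamlit run app.py` on any machine that can reach your Snowflake account.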

Snowpark is still new and evolving

One thing we have noticed when working with Snowpark is that new features come out all the time. As a Snowflake partner and a tech consultancy, we try to keep up, but if we struggle, we can imagine how overwhelming this can be for others. The rapid pace means some tutorials become outdated, which can make it a bit hard to get started. Fortunately, the Snowflake team is super helpful in getting its customers going, so if something isn't working for you, don't hesitate to reach out to us or to Snowflake.

VSCode + Snowpark = <3

You can use Python worksheets in Snowflake to write Python, which means the code you write runs directly in Snowflake. However, you can't (yet) get any graphical output, such as plots, in Snowflake. This, and the general developer experience, means you will likely want a computer or server for Python development, from which code can later be moved to Snowflake. We have mainly been using VSCode for this, and it works great. Snowflake also has a VSCode extension.
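As an illustration of that workflow, here is a hedged sketch of a script you might run from VSCode: the file lives on your laptop, but the function it registers executes inside Snowflake. The table and column names are again placeholders.

```python
# Sketch of the develop-locally, run-in-Snowflake loop. The script runs
# in VSCode; the UDF it registers runs in the warehouse, next to the data.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import FloatType

# Placeholder connection details, as in the first sketch.
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Register a temporary Python UDF; Snowpark ships the function to
# Snowflake, where it runs on the warehouse.
@udf(name="double_amount", input_types=[FloatType()],
     return_type=FloatType(), replace=True)
def double_amount(x: float) -> float:
    return x * 2.0

# The select executes in Snowflake; only the small result comes back,
# so plots and other graphical output are rendered locally in VSCode.
result = (
    session.table("ORDERS")  # hypothetical table
    .select(double_amount(col("AMOUNT")))
    .to_pandas()
)
print(result.head())
```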

Python in the database is not a new thing

Snowflake is unique because it is very easy to manage, you only pay for what you use, and it handles scaling up and down automatically. Although Snowpark for Python is a great feature for Snowflake users, it's worth mentioning that Python in the database is not really new: Microsoft SQL Server has shipped in-database R since 2016 and Python since 2017, BigQuery ML has been around since 2018, and even Oracle has supported R and Python for some years now. We like Snowpark, but we are more excited about what comes next.

The best is yet to come

We are especially excited about the Streamlit integration into Snowflake. Being able to build and host interactive applications with a Python runtime in Snowflake will be a real differentiator. However, only time will tell whether Snowflake will be the best place to host these kinds of applications (some of the previously mentioned third-party tools also support this). Besides Streamlit, we hope to see a lot more work on integrating MLOps tools like MLflow into Snowflake; as of now, there is no straightforward way of monitoring models in Snowpark. It will also be interesting to see whether, and how, the open source community contributes to Snowpark to make it easier to get started.