We are proud to announce a new partnership with Databricks! Databricks offers a Unified Analytics Platform driven by the mission to unify Data Science, Data Engineering and Business, but what is it about and what have we been doing with it?
Cloud platforms have enabled the swift growth of cloud-native SaaS (Software-as-a-Service) offerings and pushed the market towards ever faster innovation cycles. Databricks is one of these services and one of the world’s fastest growing enterprise software companies (1).
Databricks was established by the original creators of the open source unified analytics engine Apache Spark, which is an in-memory cluster computing framework. Cluster computing distributes tasks to multiple computing nodes and in the end gathers the results for analysis purposes or other further use. The Databricks platform is built on top of Apache Spark. Databricks aims to optimize Spark’s performance and to create a co-working space for data scientists, data engineers and business analysts. Besides being a unified analytics engine, Spark powers four libraries: SQL and DataFrames (structured data module), MLlib (machine learning), GraphX (graph computations) and Spark Streaming (stream processing).
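The distribute-then-gather idea behind cluster computing can be sketched on a single machine. The snippet below is only an illustration in plain Python, with a thread pool standing in for the computing nodes; it is not Spark code, and the names `count_words`, `distributed_word_count` and the sample text are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Task run on one 'node': count the words in a chunk of text."""
    return len(chunk.split())


def distributed_word_count(chunks, workers=4):
    """Hand one task per chunk to the workers, then gather and combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    # Hypothetical input, split into chunks as a cluster would partition a dataset.
    sample_chunks = [
        "spark distributes tasks",
        "to multiple computing nodes",
        "and gathers the results",
    ]
    print(distributed_word_count(sample_chunks))
```

In a real cluster the chunks would be partitions of a dataset spread over several machines, and the framework would take care of scheduling and fault tolerance.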
Collaboration in Databricks
Looking at Databricks from a data team's perspective, the main benefits are the collaboration possibilities and the support for the Python, R, SQL and Scala programming languages. For an experienced data professional (i.e. someone fluent in many programming languages) this might not be a big thing. For a team, however, it can be a great relief and a huge enabler.
An organisation might have people working close to the business (e.g. business analysts or data analysts) who are familiar with querying in SQL but less familiar with working on the command line. The platform gathers the whole team into one space and lets its members share and use each other's results and data. Working simultaneously in the same workspace enables the team to log and track changes and to create an audit trail of the work done. It can also be the solution for gathering notebooks that run locally on laptops into one place. The platform is built to support different skill levels: users can work in a graphical user interface with notebooks, but it is also possible to work from the CLI (command line interface). For a smaller team, the possibility to skip Spark cluster configuration, library configuration and resource management comes in handy. Furthermore, this hosted service runs on Amazon Web Services (AWS) and Microsoft Azure.
At Solita we have been working with Databricks in a number of customer projects ranging from IoT streaming analytics to supply chain prediction. In one project we used Databricks to simplify and unify the data engineering pipeline and data science into one scalable place. The original pipeline had been developed over time in R and Python and used JSON API calls to fetch data. By gathering all of the code into the Unified Analytics Platform, we were able to run the different notebooks, API calls, database queries, data transformations, database transactions, predictive calculations and output formatting in one place and reduce the unnecessary complexity of the pipeline. We could also manage the daily orchestration of the jobs and process monitoring directly in the pipeline.
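As a rough illustration of what gathering the stages into one place can look like, here is a minimal Python sketch. It is not the project's actual code: the function names, the `demand` field and the naive mean forecast are all assumptions made for this example, and the API call is replaced by parsing a JSON string.

```python
import json


def fetch(raw_json):
    """Stand-in for a JSON API call: parse a JSON payload into records."""
    return json.loads(raw_json)


def transform(records):
    """Keep only the records that carry a numeric 'demand' value (hypothetical field)."""
    return [r for r in records if isinstance(r.get("demand"), (int, float))]


def predict(records):
    """Placeholder prediction: the mean of the observed demand, or None if there is no data."""
    if not records:
        return None
    return sum(r["demand"] for r in records) / len(records)


def format_output(forecast):
    """Wrap the result in the structure a downstream consumer might expect."""
    return {"forecast": forecast}


def run_pipeline(raw_json):
    """Run every stage in order, in one place."""
    return format_output(predict(transform(fetch(raw_json))))


if __name__ == "__main__":
    # Hypothetical payload; a real pipeline would fetch this from an API.
    payload = '[{"demand": 10}, {"demand": 20}, {"demand": null}]'
    print(run_pipeline(payload))
```

Keeping each stage as a small function makes it straightforward to move the stages into notebook cells or scheduled jobs later.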
Behind the scenes Databricks is also working with an open source machine learning lifecycle platform called MLflow, a next gen analytics engine Databricks Delta and a Unified Analytics Platform for Genomics, but more about these solutions later…
When to consider Databricks?
You want to…
- optimise your Spark jobs
- gather your analytics pipeline into one place
- leverage scalability of the cloud-native platform
Solita is a Professional partner of Databricks and offers consulting services in the Nordics, Estonia and Germany. Read more about our services in analytics and data science.
Want to hear more about Databricks? Contact Anssi Tikka ([email protected] or +358504433789) for sales or Veli-Matti Soukka for Databricks-specific questions.
(1) Reference