Can we use open source for data analysis in a regulated organisation?

Filip Wästberg Data Scientist, Solita

Published 31 Aug 2022


Many organisations we at Solita work with have closed and regulated environments for data analysis. Usually, these are government agencies, financial institutions or pharmaceutical companies, which, for obvious reasons, cannot allow data to be analysed in just any way. Their closed environments have long been dominated by licensed, closed-source software: the organisation pays for access to the tools and programming languages it uses for data analysis.

But most other industries have moved in a different direction. For a few years now, open-source software, i.e. software that has no licence cost and has been developed by a community rather than a company, has dominated the market. Universities, in particular, have shifted in recent years from a plethora of licensed software to open-source software. Most recent graduates in statistics, economics and engineering I meet want to work on data analysis using open-source programming languages like Python and R.

For analysts to install open-source software, they need administrative rights on their computer and access to the internet. For perfectly legitimate reasons, regulated organisations cannot always provide this. But it is possible to work with open source even in regulated organisations. We just need to introduce a little more verification and validation of the open-source code.

Package handling

Because the code is open, users can contribute ‘packages’, i.e. extensions of the programming language that others can use. In R and Python, there are thousands of packages available for free. These packages have revolutionised the way data analysts work. Installing a package is no more difficult than executing a line of code that downloads the package and installs it in your analysis environment.

However, if you work in a regulated organisation, it’s not a given that you can install software at will. In addition, IT may want control over the packages you use.

The solution for package management is to download packages to a server inside the closed environment, and only from addresses authorised by IT. This can be done with varying levels of security; the point is that IT owns and manages the packages used in the organisation. Analysts install from the package server and use R and Python just as they would on an ordinary computer.
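As a rough illustration of the idea, here is a minimal Python sketch that only builds a pip install command when the package source is on an allow-list. The host name packages.internal.example, the APPROVED_HOSTS set and the function build_install_command are hypothetical names made up for this example; `--index-url` is a standard pip option. It is one possible shape for the idea, not a description of any particular product.

```python
import sys
from urllib.parse import urlparse

# Hypothetical allow-list of package servers, maintained by IT.
APPROVED_HOSTS = {"packages.internal.example"}


def build_install_command(package, index_url):
    """Return a pip command that installs `package` from an approved server.

    Raises ValueError if the index URL points at a host IT has not approved.
    """
    host = urlparse(index_url).hostname
    if host not in APPROVED_HOSTS:
        raise ValueError(f"{host!r} is not an approved package source")
    # Run pip through the current interpreter so the package lands
    # in the same environment the analyst is working in.
    return [sys.executable, "-m", "pip", "install",
            "--index-url", index_url, package]


# Example: build (but do not run) the command for an internal server.
command = build_install_command(
    "pandas", "https://packages.internal.example/simple"
)
```

In practice IT would more likely set the internal index once in pip's configuration, so analysts simply run pip install as usual and never see the server address.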

Access to data

Another problem is access to data. In many cases, not all analysts should have access to all data. The solution to this problem, as is often the case with older licensed software, is to have a dedicated server that analysts work against, which in turn allows users to access data from central or distributed data sources. The server has R and Python installed, along with the development environments (IDEs) that analysts want to use. IT can then administer rights to the data via user accounts from that server, leaving the analysts to focus on analysing the data.
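To make the idea of rights administered via user accounts concrete, here is a small Python sketch of a group-based access check. Everything in it is an assumption for illustration: the Account class, the DATASET_ACL mapping, the dataset and group names, and the open_dataset function. Real servers typically delegate this to the organisation's directory service and the databases' own permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """A hypothetical analyst account on the analysis server."""
    username: str
    groups: frozenset


# Hypothetical mapping owned by IT: dataset name -> groups allowed to read it.
DATASET_ACL = {
    "claims_2022": {"fraud_team"},
    "public_stats": {"fraud_team", "all_analysts"},
}


def can_read(account, dataset):
    """True if the account shares at least one group with the dataset's ACL."""
    allowed = DATASET_ACL.get(dataset, set())
    return bool(allowed & account.groups)


def open_dataset(account, dataset):
    """Return a placeholder connection string, or refuse access."""
    if not can_read(account, dataset):
        raise PermissionError(f"{account.username} may not read {dataset}")
    return f"connection to {dataset} as {account.username}"


analyst = Account("anna", frozenset({"all_analysts"}))
```

With this setup, analyst can open public_stats but gets a PermissionError for claims_2022, and the analyst's own code never has to contain any access logic.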

Scheduling and production

Finally, analysts want to be able to easily schedule and share analyses made in R and Python. I have seen many home-made solutions for scheduling over the years that are not necessarily owned by IT. They’re often fragile and dependent on individuals. My preferred solution is again a server where you can schedule scripts, share analytics reports, dashboards, APIs, and other analytics products.
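The sketch below shows, in Python, the two things a scheduling server has to do at a minimum: work out when a job should run next and record whether it succeeded, so that the outcome does not depend on one person watching a terminal. The function names next_daily_run and run_job and the shape of the result record are hypothetical; a managed server would add ownership, retries and alerting on top.

```python
import subprocess
import sys
from datetime import datetime, timedelta


def next_daily_run(now, hour):
    """Return the next datetime at `hour`:00 that is strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_job(command):
    """Run one scheduled command and return a small, loggable result record."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "ok": completed.returncode == 0,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    }


# Example: a stand-in "report" script run with the current interpreter.
result = run_job([sys.executable, "-c", "print('report built')"])
```

Keeping the result as plain data makes it easy for the server to store a run history that IT, not an individual analyst, owns.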

In recent years, there have been great advances in how you can use open source for data analysis in regulated and closed organisations. If you as a regulated business – whether a government agency, bank or pharmaceutical company – want to compete for labour, you need an open-source strategy.

Is your organisation thinking about how to use open source in a closed and regulated IT environment? We’re happy to talk more about potential solutions that give you control over infrastructure while allowing analysts to focus on analysing data in their preferred tools.
