4 January 2018 | Blog

Tips and tricks on data science scripting

What are the requirements for analysis code that is ready to be shared and ready for production? What should you take into account when writing this kind of code? In this blog, I cover the four key points all data scientists should keep in mind when writing code.

I often review analysis scripts written in R or Python by a variety of data scientists. The analytical side of the work is usually done very well: the data has been processed by eliminating outliers and deleting or imputing missing values, the model and the variables used in it have been carefully selected, and the results have been tested to verify that the solution answers the original problem.

However, regrettably often the scripts have been written to solve one specific problem – that is, only for the author's own needs – and not to be shared with others or reused when solving similar problems in the future.

The end result is far from something that can be put into production automatically in a way that lets the author get a good night's sleep. Below are the four key points that I think every data scientist should keep in mind when writing code.

#1 Management of environments, CI/CD and version management

A solution should not be developed directly against production – which is regrettably often the case – but in a development environment reserved for that purpose. The solution should then be tested in a test environment that resembles production, and transferred to production only once its functionality has been verified. This avoids unexpected problems in production and allows further development without breaking an already functional solution.
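
As a minimal illustration, the same script can pick its settings based on an environment variable, so identical code runs in development, test and production. This is only a sketch; the variable name APP_ENV, the hosts and the setting keys are all invented examples:

```python
import os

# Illustrative sketch: one codebase, three environments. All names
# (APP_ENV, the hosts, the settings keys) are hypothetical examples.
ENVIRONMENTS = {
    "dev": {"db_host": "localhost", "log_level": "DEBUG"},
    "test": {"db_host": "test-db.internal", "log_level": "INFO"},
    "prod": {"db_host": "prod-db.internal", "log_level": "WARNING"},
}

def get_settings() -> dict:
    """Select settings for the environment named in APP_ENV (default: dev)."""
    env = os.environ.get("APP_ENV", "dev")
    return ENVIRONMENTS[env]
```

The deployment pipeline then only needs to set APP_ENV; the code itself never changes between environments.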

In my opinion, continuous integration/deployment tools should be part of a data scientist’s everyday work.

Code should be kept in version management instead of on the coder's laptop. The traditional "it works on my machine" thinking is regrettably typical in data science, because it makes quick experiments easy. It is detrimental to the result, however: without proper version management, the version history exists only inside the developer's head. This hinders sharing the solution with other developers, and a single sick leave can put development on hold for a long time.

#2 Logs, monitoring and alerts

Many solutions are implemented without any logs, monitoring or alerts. The purpose of logs is to highlight problem areas in the implementation. Another benefit of logs, interesting from the developer's viewpoint, is the opportunity to analyse how the solution is actually used. This allows further development to be driven analytically, in the proper spirit of data science, rather than by best guesses.
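
As a rough sketch of what this can look like in practice, Python's standard logging module already gets you quite far. The logger name and the messages below are invented for illustration:

```python
import logging

# Minimal logging setup using Python's standard library. The logger
# name "scoring" and the messages are illustrative only.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("scoring")

def score_batch(rows: list) -> None:
    logger.info("Scoring a batch of %d rows", len(rows))
    try:
        ...  # the actual model scoring would happen here
    except Exception:
        # logger.exception records the full traceback at ERROR level,
        # which monitoring can pick up and alert on.
        logger.exception("Scoring failed for a batch of %d rows", len(rows))
        raise
```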

When combined with logs, monitoring and automatic alerts allow automatic detection of problems and errors.

Too often, errors are detected by end users instead of through proactive alerts. If a solution is developed in a modern cloud service, using logs, monitoring and alerts is relatively easy; AWS CloudWatch, for example, is a good service for implementing these features.
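
For example, a custom metric pushed to CloudWatch can drive an alarm that notifies the team before any user notices a problem. Below is only a sketch using boto3; the namespace, metric name and region are assumptions:

```python
import boto3

# Hedged sketch: report failures as a custom CloudWatch metric so an
# alarm can notify the team. Namespace and metric name are invented.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

def report_failure(count: int = 1) -> None:
    cloudwatch.put_metric_data(
        Namespace="DataScience/ExamplePipeline",  # hypothetical namespace
        MetricData=[{
            "MetricName": "ScoringFailures",
            "Value": float(count),
            "Unit": "Count",
        }],
    )
```

A CloudWatch alarm on ScoringFailures can then send a notification, for example through SNS, whenever the metric rises above zero.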

#3 Password management and parametrization

Practices that are second nature to software developers are too often forgotten in AI scripts. In many cases, passwords, ten-line SQL expressions and database URLs sit in the same pile as the rest of the code. Separating them out is vital from the viewpoint of information security, readability and reuse of the code.

Tools meant for automating provisioning and configuration management, such as Ansible, or password management services offered by cloud services can be used for parametrization and managing passwords.
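
Even without dedicated tooling, the minimum is to keep secrets in environment variables and long SQL in separate files. A sketch, with invented variable names and an invented sql/ directory:

```python
import os
from pathlib import Path

# Minimal sketch: no passwords or long SQL inside the analysis code.
# The variable names and the sql/ directory are illustrative assumptions.
DB_URL = os.environ["ANALYTICS_DB_URL"]            # injected at deploy time
DB_PASSWORD = os.environ["ANALYTICS_DB_PASSWORD"]  # never hard-coded

def load_query(name: str) -> str:
    """Read a named SQL query from its own file instead of inlining it."""
    return Path("sql", f"{name}.sql").read_text()
```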

#4 Test. Test. Test.

You can never overemphasise the importance of testing. Machines don't make errors, but developers and users do. Data scientists usually invest in testing the data and the model, but testing the quality of the overall solution is almost always left incomplete, even though what matters most in the end is how well the solution works for its end users.

The more functional the overall solution, the easier it is to sell. This is where testing helps.

Qualitative testing should be automated from the very beginning. Furthermore, people writing code should always think about how its functionality can be tested. Testing is also part of high-quality programming practices.
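
As an illustration of how small such automated tests can be, here is a pytest sketch; preprocess() is a stand-in for whatever cleaning step a real pipeline contains, and only the checks themselves are the point:

```python
import pandas as pd
import pytest

# Stand-in for a real cleaning step; the tests below are the point.
def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    if raw.empty:
        raise ValueError("empty input")
    return raw.dropna()

def test_preprocess_removes_missing_values():
    raw = pd.DataFrame({"age": [25.0, None, 40.0]})
    assert preprocess(raw).isna().sum().sum() == 0

def test_preprocess_rejects_empty_input():
    with pytest.raises(ValueError):
        preprocess(pd.DataFrame())
```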

If the first three points above are taken into account when developing a solution, creating high-quality, comprehensive automated tests will be much easier. Read more in the blog post: What is elegant code actually?

Rauno Paukkeri works at Solita as a team manager and data scientist. He is a passionate student and proponent of algorithmics. Rauno is interested in numbers in all their forms, whether related to AI, code or data as a strategic opportunity for innovative business management. He also identifies as a critic, despite being a loyal team player. Rauno's hobbies include learning how to tile a bathroom and lay a parquet floor, and he also enjoys spending time with his family and playing the piano.