SOLID Principles for Data Science and Machine Learning – Improve your coding by sticking to 5 simple principles

Anel Music
13 min read · Sep 23, 2022

Alright, I get it: being a data scientist is challenging enough. You have to handle messy datasets and, after cleaning them for quite some time, make sense of the data by asking the right questions. You need to be well versed in the algorithms and statistical methods suited to modeling that data, you have to conduct experiments and, on top of that, you should know how to communicate with engineers, domain experts, managers and clients. Writing clean code is for Software Engineers, and “productionalizing” your Jupyter Notebooks should be taken care of by Machine Learning Engineers anyway, shouldn’t it?

Well, yes and no.

Yes, we can agree on the fact that Machine Learning Engineers should be responsible for bringing the model into production but no, this should not be your excuse to not improve your coding over time.

First of all, as a Data Scientist, you might aspire to become a Machine Learning Engineer one day because you are looking for new challenges and want to be closer to the model deployment. Thus, you should prepare for these specific challenges early on.

However, even if being a Machine Learning Engineer is not your ultimate goal, you should at least try to support your colleagues with good code. In the end, after conducting your experiments, it’s all about putting the model into production. If your code does not have to be substantially rewritten or refactored first, the model deployment time is shortened, and with it the feedback loop which means faster iteration, faster deployment, faster improvement, and faster customer satisfaction.

Now that we have established that it makes sense to improve your coding, what exactly is clean (or good) code?

In a nutshell:

Clean Code Requirements:
Clean Code is easy to read, easy to use, easy to extend and easy to test.

SOLID Principles

The SOLID principles are five software design principles that help you write better code that satisfies the clean code requirements mentioned above.

The bad news:

You have to understand the SOLID principles and force yourself to comply with them at first.

The good news:

Once you have internalized these principles, you will implement them without really thinking about it. Violating them will feel “unnatural” to you.

#1 Single Responsibility Principle:

Your class should have only one job

You have probably already seen it, the so-called God object. The God object is an instance of a class that can do virtually anything. In the data science context, this might be a class that reads in data, performs preprocessing, trains a model, evaluates the model, makes new predictions and possibly is even responsible for postprocessing. Concentrating the responsibility for many completely different tasks in one class has many disadvantages. A particularly apparent one is that the class becomes very large, and it becomes difficult to oversee the effects of changes. Fortunately, refactoring God classes is very easy.

Let’s assume that you have a classifier class as shown in [IMG1]. The classifier has two member variables for name and performance and two methods for predicting and updating a simple model performance dashboard. Identifying the responsibilities of classes might not always be obvious and leaves room for discussion, but here, I think it is safe to say that a classifier should not be responsible for updating the dashboard.

[IMG1] Single Responsibility Principle violated
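A minimal Python sketch of the class described above might look like this. The method bodies (a toy decision rule and a printed dashboard line) are placeholders chosen for illustration, not taken from the figure.

```python
from typing import Sequence


class Classifier:
    """One class, two jobs: predicting and reporting."""

    def __init__(self, name: str, performance: float) -> None:
        self.name = name
        self.performance = performance

    def predict(self, features: Sequence[float]) -> int:
        # Placeholder rule standing in for a real model (assumption).
        return int(sum(features) > 0)

    def update_dashboard(self) -> str:
        # Reporting logic inside the model class is the second responsibility.
        line = f"{self.name}: {self.performance:.2f}"
        print(line)
        return line
```

Any change to how the dashboard is rendered now forces a change to the classifier, even though prediction itself is unaffected.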

If we delegate the dashboard responsibility to a separate Dashboard class, we can resolve the violation of the Single Responsibility Principle fairly easily.

[IMG2] Single Responsibility refactored

As shown in [IMG2], we can introduce a new class Dashboard and use its update method to update the dashboard. This way, our classes with strictly separated responsibilities become much shorter, easier to explain and easier to understand, which is one of the reasons why micro-services have become such a popular architecture.
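As a rough sketch of this refactoring in Python, assuming a Dashboard that simply keeps one performance value per model name (the rows dictionary and the toy prediction rule are illustrative assumptions):

```python
from typing import Dict, Sequence


class Classifier:
    """Only responsible for making predictions."""

    def __init__(self, name: str, performance: float) -> None:
        self.name = name
        self.performance = performance

    def predict(self, features: Sequence[float]) -> int:
        # Placeholder rule standing in for a real model (assumption).
        return int(sum(features) > 0)


class Dashboard:
    """Only responsible for tracking and displaying model performance."""

    def __init__(self) -> None:
        self.rows: Dict[str, float] = {}

    def update(self, classifier: Classifier) -> None:
        # Read what is needed from the classifier; the classifier itself
        # knows nothing about dashboards.
        self.rows[classifier.name] = classifier.performance
        print(f"{classifier.name}: {classifier.performance:.2f}")


clf = Classifier("random_forest", 0.91)
dashboard = Dashboard()
dashboard.update(clf)
```

The dashboard can now change its output format, or be replaced entirely, without touching the Classifier class.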

#2 Open Closed Principle:

Your class should be open for extension but closed for modification

All code is open for extension in the sense that you can always add new features. The question is: how much of the existing code do you need to change to add your new feature? This might sound strange, but ideally you should be able to add new features without changing the existing code at all.

Let’s assume you are presented with the store_data method shown in [IMG3]. Depending on the storage_type this function either stores data in a SQL database or a CSV file.

[IMG3] Open Closed violated
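A hedged Python sketch of such a method is shown here. The returned strings stand in for real database or file writes, and the ValueError for unknown types is an assumption added to keep the example runnable.

```python
from typing import List


class DatasetManager:
    def __init__(self, data: List[dict]) -> None:
        self.data = data

    def store_data(self, storage_type: str) -> str:
        if storage_type == "sql":
            # Placeholder for the real "store to SQL database" code.
            return f"stored {len(self.data)} rows in SQL database"
        elif storage_type == "csv":
            # Placeholder for the real "store to CSV file" code.
            return f"stored {len(self.data)} rows in CSV file"
        # Every new storage backend needs another branch in this method.
        raise ValueError(f"unsupported storage type: {storage_type}")
```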

Now, let’s imagine that you want to add a new feature that would allow you to store the data to a MongoDB database. No big deal, right? You simply add another if-condition to the store_data method, check for (storage_type == “mongodb”) and if satisfied you execute some //store to mongodb code.

This would work perfectly fine but it would also violate the open-closed principle as adding your new feature (extension) would require the change of already existing code (modification).

That’s something you generally want to avoid due to the danger of introducing new bugs. You would also have to extend already existing unit tests which sometimes could be difficult especially if you don’t fully understand what the function you’ve extended does.

[IMG4] Open Closed refactored

In [IMG4], you can see one way to fix the Open Closed violation. First, you should notice that our DatasetManager no longer has a store_data method. Instead, it has a member storer of type DataStorer. You may ask — what is a DataStorer? Here, the DataStorer is just an Interface that defines what different types of DataStorers have in common. In our case, all DataStorers should have a store_data() method.

If we want to write our data into an SQL database, we can create a class SQLStorer that inherits from the DataStorer interface (you can also say it implements the DataStorer interface) and implements the store_data() method. If you want to add a new feature for writing the data into a CSV file, you don’t need to change existing code. You can create a new class CSVStorer that also inherits from DataStorer and defines the store_data() method. Similarly, adding a new MongoDB feature only requires you to extend the code with a new MongoDBStorer class, without modifying existing classes or functions. You’ve probably realized that the DataStorer interface provides a common “template” for all types of DataStorers (SQL, CSV, MongoDB, S3, BlobStorage, etc.), and because our DatasetManager depends on a generic storer object of type DataStorer, we can pass any DataStorer subclass object to it. This is extremely helpful. It means that if we change the way we store our data (e.g. from CSV to S3), we can simply pass an S3Storer object instead of a CSVStorer object to our DatasetManager constructor without breaking the client code.
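One way to sketch this design in Python uses an abstract base class as the interface. The save() method on DatasetManager and the returned strings are hypothetical stand-ins, since the text only states that the manager holds a storer member.

```python
from abc import ABC, abstractmethod
from typing import List


class DataStorer(ABC):
    """Interface: every storer must provide store_data()."""

    @abstractmethod
    def store_data(self, data: List[dict]) -> str:
        """Persist the data and return a short status message."""


class SQLStorer(DataStorer):
    def store_data(self, data: List[dict]) -> str:
        return f"stored {len(data)} rows in SQL database"  # placeholder


class CSVStorer(DataStorer):
    def store_data(self, data: List[dict]) -> str:
        return f"stored {len(data)} rows in CSV file"  # placeholder


class MongoDBStorer(DataStorer):
    # New feature: added without modifying any class above or below.
    def store_data(self, data: List[dict]) -> str:
        return f"stored {len(data)} documents in MongoDB"  # placeholder


class DatasetManager:
    def __init__(self, data: List[dict], storer: DataStorer) -> None:
        self.data = data
        self.storer = storer  # depends on the abstraction only

    def save(self) -> str:
        # Hypothetical helper that delegates to whichever storer was passed.
        return self.storer.store_data(self.data)


rows = [{"id": 1}, {"id": 2}]
print(DatasetManager(rows, CSVStorer()).save())
print(DatasetManager(rows, MongoDBStorer()).save())
```

Switching backends is a one-argument change in the client code, and DatasetManager never needs to be edited or retested for it.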

In general, your classes should always depend on abstractions (Interfaces) and not on implementations (concrete Classes).

#3 Liskov Substitution Principle:

You should be able to replace a parent class object by any child class object without altering the correctness of your code

If you are already puzzled by the short statement above, don’t worry, you are not the only one who feels this way when being handed a formal definition of the Liskov Substitution Principle. The good news is that, unlike all other principles, the Liskov Substitution Principle works with concrete checkpoints and essentially leaves no room for subjective interpretation. Therefore, two programmers will not get into a situation where one sees the principle as violated while the other does not (which, e.g., could be the case for Single Responsibility). This also means that static type checkers such as mypy can help you identify some of these violations (for example, a subclass method whose signature is incompatible with the parent’s), but to resolve them it’s still important to understand the idea behind the principle first.

I have to admit, it took me quite a bit of time to figure out how to best demonstrate this principle. In my opinion a simple before and after illustration would not be effective. That’s why I’ve decided to go into a little more detail here.

Let’s say we want to be able to pay for an order we made. A simple implementation is shown in [IMG5_1]. Here, we have an ApplePay class that is responsible for processing the payment. To do so, the ApplePay class has a pay() method that receives the order and a phone number. The phone number is verified using some sort of verification procedure within the pay() method, which then sets the order status to ‘paid’. The PaymentProcessor interface serves as a “template” and should be implemented by all kinds of different PaymentProcessors.

On the left side, you can see the fairly simple client code. First, we instantiate the Order class and add a keyboard to our order. Then, we instantiate the ApplePay class and call the pay() method using our order and phone number.

[IMG5_1] Classes without Liskov violation
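The setup described above could be sketched in Python roughly as follows. The add_item() method name, the "open" initial status and the toy phone check are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class Order:
    def __init__(self) -> None:
        self.items: List[str] = []
        self.status = "open"

    def add_item(self, name: str) -> None:
        self.items.append(name)


class PaymentProcessor(ABC):
    """Interface that every payment processor implements."""

    @abstractmethod
    def pay(self, order: Order, phone_nr: str) -> None:
        """Verify the phone number and mark the order as paid."""


class ApplePay(PaymentProcessor):
    def pay(self, order: Order, phone_nr: str) -> None:
        # Toy verification standing in for a real procedure (assumption).
        if not phone_nr.startswith("+"):
            raise ValueError("invalid phone number")
        order.status = "paid"


# Client code
order = Order()
order.add_item("keyboard")
payer = ApplePay()
payer.pay(order, "+491520000")
print(order.status)
```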

Let’s assume we want to add a new feature that allows payment via PayPal. We can implement our PaymentProcessor interface and create a new class called PayPalPay as shown in [IMG5_2]. For PayPalPay, we would implement some sort of phone number verification procedure in the pay() method and set the order.status to ‘paid’. The client code barely changes. So far nothing new and also no Liskov violation.

[IMG5_2] Classes without Liskov violation

Unfortunately, PayPal doesn’t work with phone number verification. Instead, it uses an email address to verify an account as illustrated in [IMG5_3]. Okay, let’s resolve the problem.

[IMG5_3] Classes without Liskov violation

A quick and “hacky” remedy is shown in [IMG5_4]. Instead of passing the phone number in the client code call payer.pay(order, ‘+491520000’), we could simply pass an email address, payer.pay(order, ‘abc@def.com’), and instead of implementing a phone number verification procedure, we could implement an email address verification procedure. We only have to remember that the parameter phone_nr does not hold a phone number but rather an email address. Sadly, we can’t change the parameter name from phone_nr to email_address because we’re adhering to the PaymentProcessor interface.

[IMG5_4] Liskov Substitution violated
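To make the problem concrete, here is a small Python sketch of the hack. The verification checks are toy assumptions, and checkout() is a hypothetical piece of client code written against the interface.

```python
from abc import ABC, abstractmethod


class Order:
    def __init__(self) -> None:
        self.status = "open"


class PaymentProcessor(ABC):
    @abstractmethod
    def pay(self, order: Order, phone_nr: str) -> None:
        """Verify the phone number and mark the order as paid."""


class ApplePay(PaymentProcessor):
    def pay(self, order: Order, phone_nr: str) -> None:
        if not phone_nr.startswith("+"):  # toy phone check (assumption)
            raise ValueError("invalid phone number")
        order.status = "paid"


class PayPalPay(PaymentProcessor):
    def pay(self, order: Order, phone_nr: str) -> None:
        # phone_nr is misused here: it must actually hold an email address.
        if "@" not in phone_nr:  # toy email check (assumption)
            raise ValueError("invalid email address")
        order.status = "paid"


def checkout(payer: PaymentProcessor, order: Order) -> None:
    # Client code that trusts the interface and passes a phone number.
    payer.pay(order, "+491520000")


checkout(ApplePay(), Order())  # works
try:
    checkout(PayPalPay(), Order())  # same call, different subclass
except ValueError as err:
    print(f"PayPalPay failed: {err}")
```

Note that both signatures use str, so a type checker would not flag this particular misuse. The subclasses are not interchangeable even though the code type-checks.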

This quick and “hacky” remedy would definitely work, produce the output we expect and it shouldn’t cause any problems:

  • if we remember that the phone_nr in the pay() method of the PayPalPay class is an email address and
  • if we misuse this parameter phone_nr for our purposes in the email verification procedure by treating it like an email address and
  • if we remember to pass an email address instead of a phone number to the pay() method in the client code when using the PayPalPay class and
  • if no one ever by accident passes a phone number to the pay() method of a PayPalPay object which would cause an error in the email verification procedure of the pay() method in the PayPalPay class

Way too many “ifs” — if you ask me.

As you can see, violating the Liskov Substitution Principle even for this simple example results in a variety of problems. These problems basically occurred because our child class objects could not be used interchangeably. To be more precise: We can’t exchange the payer objects in the two client code snippets above because the way the pay() method is called depends on which class (ApplePay or PayPalPay) we instantiate.

[IMG6] Liskov Substitution refactored

[IMG6] shows how to resolve the violation. To no longer misuse the phone_nr parameter as an email address, we can remove it from the pay() method in the PaymentProcessor interface. This way, irrespective of the class we instantiate, each client code call of the payer.pay(order) method will look exactly the same because no second parameter (phone_nr/email_address) is required anymore. Whether the verification uses an email address or a phone number is now encoded in the constructor. ApplePay now has a member phone_nr and PayPalPay has a member email_addr. The verification procedure inside the pay() method accesses the corresponding member (phone_nr/email_addr).

We’re no longer violating the Liskov Substitution Principle and therefore we could replace the payer object created using the ApplePay class with the payer object created using the PayPalPay class if needed. Now, we can call the pay(order) method on any payer object and we won’t make mistakes that might lead to errors and crashes. Granted, the Liskov Substitution Principle might be a bit confusing at first. Always remember that the objects you create should be replaceable if they have the same parent class.
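A possible Python sketch of this refactoring is given here, with the verification detail moved into each constructor. The verification checks are toy assumptions.

```python
from abc import ABC, abstractmethod


class Order:
    def __init__(self) -> None:
        self.status = "open"


class PaymentProcessor(ABC):
    @abstractmethod
    def pay(self, order: Order) -> None:
        """Verify the payer and mark the order as paid."""


class ApplePay(PaymentProcessor):
    def __init__(self, phone_nr: str) -> None:
        self.phone_nr = phone_nr

    def pay(self, order: Order) -> None:
        if not self.phone_nr.startswith("+"):  # toy phone check (assumption)
            raise ValueError("invalid phone number")
        order.status = "paid"


class PayPalPay(PaymentProcessor):
    def __init__(self, email_addr: str) -> None:
        self.email_addr = email_addr

    def pay(self, order: Order) -> None:
        if "@" not in self.email_addr:  # toy email check (assumption)
            raise ValueError("invalid email address")
        order.status = "paid"


# The same client call works for every subclass.
results = {}
for payer in (ApplePay("+491520000"), PayPalPay("abc@def.com")):
    order = Order()
    payer.pay(order)
    results[type(payer).__name__] = order.status
print(results)
```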

#4 Interface Segregation Principle:

It’s better to have multiple specific interfaces instead of one big general interface

At first glance, having one large interface that declares all the methods that subclasses might want to implement sounds practical, but if you think about it, it is actually the complete opposite. [IMG7] shows what happens if your interfaces are too general. The ImgSegmenter interface provides a common “template” for all ImgSegmenter subclasses. If we want to create a concrete class that inherits from ImgSegmenter, we must provide an implementation for all abstract methods (segment_semantics, segment_instances) declared in the ImgSegmenter interface, otherwise the compiler (or interpreter) will throw an error when we try to create an object.

DeepLab is a semantic segmentation algorithm, thus we can only provide the implementation for the segment_semantics() method. However, inheriting from an interface forces us to provide an implementation for the segment_instances() method, too. As a workaround, we can use a Python pass statement or (a bit better) raise an Exception to indicate that DeepLab can only be used for semantic segmentation. In contrast, our MaskRCNNSuper* algorithm can perform instance segmentation and semantic segmentation, so we don’t have to raise any Exceptions.

*MaskRCNN was naturally designed for instance segmentation. However, using a postprocessing method that combines instances of the same class into one segment, we could modify the MaskRCNN algorithm to work for semantic segmentation as well. For real use cases you wouldn’t do that, as there are more efficient algorithms available for semantic segmentation (DeepLab, YOLACT, YOLACT++, Poly-YOLO).

[IMG7] Interface Segregation violated
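In Python, the overly general interface and the resulting workaround might look like the following sketch. The returned strings are placeholders for real segmentation masks, and NotImplementedError is one possible choice of exception.

```python
from abc import ABC, abstractmethod


class ImgSegmenter(ABC):
    """Too general: forces every subclass to support both tasks."""

    @abstractmethod
    def segment_semantics(self, image: object) -> str:
        """Return a semantic segmentation of the image."""

    @abstractmethod
    def segment_instances(self, image: object) -> str:
        """Return an instance segmentation of the image."""


class DeepLab(ImgSegmenter):
    def segment_semantics(self, image: object) -> str:
        return "semantic mask from DeepLab"  # placeholder result

    def segment_instances(self, image: object) -> str:
        # Workaround: the interface demands a method DeepLab cannot support.
        raise NotImplementedError("DeepLab only does semantic segmentation")


class MaskRCNNSuper(ImgSegmenter):
    def segment_semantics(self, image: object) -> str:
        return "semantic mask from MaskRCNNSuper"  # placeholder result

    def segment_instances(self, image: object) -> str:
        return "instance masks from MaskRCNNSuper"  # placeholder result
```

Callers only find out at runtime that DeepLab().segment_instances(...) is unusable, which is exactly the kind of surprise a narrower interface avoids.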

As you can see, using a too general interface might sound convenient but comes at a price. Some subclasses might not be able to provide implementations for all methods. Because inheriting from an interface forces you to provide an implementation for every abstract method, you will need to come up with hacks, such as raising Exceptions. A better design choice is illustrated in [IMG8].

[IMG8] Interface Segregation refactored

After refactoring, we have two specific interfaces (ImgSegmenter and InstanceSegmenter). Both can be implemented fully by their respective subclasses (DeepLab and MaskRCNNSuper). There is no need to raise Exceptions. MaskRCNNSuper can be used for semantic segmentation and instance segmentation and therefore implements the InstanceSegmenter interface, which declares the segment_instances() method and inherits the segment_semantics() method from its parent interface ImgSegmenter. DeepLab, in contrast, works only for semantic segmentation and thus only implements the ImgSegmenter interface.

As both of the concrete classes (MaskRCNNSuper and DeepLab) have the same super class ImgSegmenter, we can (thanks to polymorphism) pass objects of both classes to the constructor of our Modelling class. I’d like to emphasize once again that your classes should always depend on abstractions (interfaces) and not on an implementation (a concrete class).
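A sketch of the segregated interfaces in Python follows. The run() method on Modelling and the returned strings are hypothetical, added only so the example can be executed.

```python
from abc import ABC, abstractmethod


class ImgSegmenter(ABC):
    @abstractmethod
    def segment_semantics(self, image: object) -> str:
        """Return a semantic segmentation of the image."""


class InstanceSegmenter(ImgSegmenter):
    # Extends the smaller interface with the instance-level method.
    @abstractmethod
    def segment_instances(self, image: object) -> str:
        """Return an instance segmentation of the image."""


class DeepLab(ImgSegmenter):
    def segment_semantics(self, image: object) -> str:
        return "semantic mask from DeepLab"  # placeholder result


class MaskRCNNSuper(InstanceSegmenter):
    def segment_semantics(self, image: object) -> str:
        return "semantic mask from MaskRCNNSuper"  # placeholder result

    def segment_instances(self, image: object) -> str:
        return "instance masks from MaskRCNNSuper"  # placeholder result


class Modelling:
    def __init__(self, segmenter: ImgSegmenter) -> None:
        self.segmenter = segmenter  # any ImgSegmenter will do

    def run(self, image: object) -> str:
        # Hypothetical helper: uses only what ImgSegmenter guarantees.
        return self.segmenter.segment_semantics(image)


for segmenter in (DeepLab(), MaskRCNNSuper()):
    print(Modelling(segmenter).run(image=None))
```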

#5 Dependency Inversion Principle:

Classes should depend on abstractions and not on concrete subclasses

By now, you’ve seen me stating a couple of times that classes should always depend on abstractions and never on a concrete implementation. Let’s try to understand what this really means. [IMG9] shows a Modelling class that is directly dependent on a DeepNN class because it has a member algorithm of type DeepNN. Inside its fit_data() method it calls the fit_deepNN() method of the DeepNN class.

[IMG9] Dependency Inversion violated 1

Let’s say the requirements have changed (e.g. less powerful hardware than expected is available) and we need a much faster model such as logistic regression. We can create a new LogReg class as shown in [IMG10]. If we now pass an object of type LogReg to the constructor of the Modelling class, the code will break: Modelling’s constructor expects an algorithm of type DeepNN, and its fit_data() method calls fit_deepNN(), which is only available in the DeepNN class and not in the LogReg class.

[IMG10] Dependency Inversion violated 2
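The breakage can be reproduced with a short Python sketch. The fit methods return placeholder strings instead of training anything, and the fit_LogReg() name follows the naming used in the text. Note that Python does not enforce the DeepNN type hint at runtime, so the failure surfaces when fit_data() calls the missing method; a static type checker would already complain at the constructor call.

```python
class DeepNN:
    def fit_deepNN(self, data: list) -> str:
        return f"deep net fitted on {len(data)} samples"  # placeholder


class LogReg:
    def fit_LogReg(self, data: list) -> str:
        return f"logistic regression fitted on {len(data)} samples"  # placeholder


class Modelling:
    def __init__(self, algorithm: DeepNN) -> None:
        self.algorithm = algorithm  # tied to one concrete class

    def fit_data(self, data: list) -> str:
        return self.algorithm.fit_deepNN(data)  # DeepNN-specific call


data = [1, 2, 3]
print(Modelling(DeepNN()).fit_data(data))

try:
    Modelling(LogReg()).fit_data(data)  # LogReg has no fit_deepNN()
except AttributeError as err:
    print(f"broken: {err}")
```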

To fix the problem, we can change the constructor of Modelling so that it now expects an algorithm of type LogReg, as shown in [IMG11] (which, by the way, violates the Open Closed Principle). In addition, we need to change the implementation of the fit_data() method to call algorithm.fit_LogReg() instead of algorithm.fit_deepNN().

[IMG11] Dependency Inversion violated 3

You might think that this small change is not a big deal, but try to think about what would happen if the requirements change once again. Maybe the dataset changes completely and the number of features increases whereas the number of observations decreases. In this scenario an SVM classifier might be better suited. You would again have to change the Modelling class constructor and its fit_data() method to call the fit_svm() method of a new SVM class.

The problem here is that our Modelling class depends on a concrete implementation (a subclass). This means every time you pass a different class type (DeepNN, LogReg, SVM) to the constructor, you would have to change the Modelling class. Wouldn’t it be great to be able to pass objects to the constructor of the Modelling class without changing the type it expects over and over again? Yes, it would, and we can achieve this quite easily by inverting its dependency.

[IMG12] Dependency Inversion refactored

The refactored solution in [IMG12] inverts the dependency by making Modelling dependent on an abstraction (an interface) instead of a concrete class (an implementation). This way the constructor expects an object of type Model instead of DeepNN or LogReg as before. DeepNN and LogReg are now concrete subclasses that implement the Model interface and create objects that can be passed to Modelling. As both concrete classes (DeepNN and LogReg) follow the interface Model, both must implement a fit() method. This also means that irrespective of the type of the object passed to the Modelling constructor, what’s inside the fit_data(data) method remains unchanged (algorithm.fit()).
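A minimal Python sketch of the inverted dependency follows. Passing data through to fit() and the returned placeholder strings are assumptions, and the SVM class is included only to show that a third model needs no change to Modelling.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Abstraction that Modelling depends on."""

    @abstractmethod
    def fit(self, data: list) -> str:
        """Fit the model and return a short status message."""


class DeepNN(Model):
    def fit(self, data: list) -> str:
        return f"deep net fitted on {len(data)} samples"  # placeholder


class LogReg(Model):
    def fit(self, data: list) -> str:
        return f"logistic regression fitted on {len(data)} samples"  # placeholder


class SVM(Model):
    # Added later; Modelling stays untouched.
    def fit(self, data: list) -> str:
        return f"SVM fitted on {len(data)} samples"  # placeholder


class Modelling:
    def __init__(self, algorithm: Model) -> None:
        self.algorithm = algorithm

    def fit_data(self, data: list) -> str:
        # Identical for every Model implementation.
        return self.algorithm.fit(data)


data = [1, 2, 3]
for algorithm in (DeepNN(), LogReg(), SVM()):
    print(Modelling(algorithm).fit_data(data))
```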

Some final words:

Admittedly, these principles may seem intimidating at first, but they quickly become second nature.
The SOLID principles make your code modular, more readable, easier to understand, and easier to test. Don’t hesitate to pull in other resources to help you better understand these principles.

I hope you’ve learned something. Thank you for taking the time to read my article.
