Deployment of Water Quality Prediction System on Streamlit Cloud
Introduction
Water is an essential resource necessary for the survival of all living organisms, and water quality is an important factor that determines its suitability for consumption. Municipal corporations are responsible for providing safe drinking water to their citizens. To ensure this safety, water quality parameters such as aluminium, arsenic, chromium, copper, mercury, ammonia, and bacteria are regularly monitored. However, manual monitoring of these parameters is tedious and time-consuming, so there is a need for automated approaches that can predict water quality accurately and efficiently. Machine learning is a promising approach that can be used to automate this process. With the help of machine learning algorithms, we will build a system that can automatically detect whether a new water sample is safe or unsafe for drinking.
Project Aim
The purpose of this project is to predict the quality of water using machine learning on a dataset collected from a particular city's municipal corporation. The project aims to develop a classification model that can label any new water sample as either safe or unsafe for drinking, based on parameters such as aluminium, ammonia, fluoride, chromium, copper, and bacteria, with high accuracy, and then deploy it on Streamlit Cloud.
Dataset
The first step is to collect water quality data from the municipal corporation of a particular city. The dataset contains water quality parameters such as aluminium, arsenic, barium, ammonia, fluoride, copper, and bacteria. The data is collected from different sources such as lakes, rivers, and reservoirs over a period of a few years, and it contains a total of around 8,000 samples.
Dataset Link: https://drive.google.com/file/d/1xJfXSY3Fvifb-gg10ipo2pPb-zp25vgl/view?usp=sharing
We have plotted a pie chart to check the percentage distribution of the target variable.
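A minimal sketch of such a plot, assuming the dataset is loaded with pandas; the file name and the target column name "is_safe" are assumptions, not confirmed by the source:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (the file name and the target column "is_safe" are assumptions)
df = pd.read_csv("waterQuality.csv")

# Pie chart of the percentage distribution of the target variable
df["is_safe"].value_counts().plot.pie(autopct="%1.1f%%")
plt.title("Distribution of the target variable")
plt.ylabel("")
plt.show()
```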
Data Preparation
The collected data needs to be cleaned and prepared before applying machine learning algorithms. The steps taken in this project for data preparation are removing missing values and irrelevant inputs, and handling categorical variables using the Label Encoding technique. The dataset is then divided into training and testing sets: the training set contains 80% of the data, whereas the testing set contains the remaining 20%.
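A sketch of these preparation steps with scikit-learn, continuing from the DataFrame above (the target column name remains an assumption):

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Remove rows with missing values
df = df.dropna()

# Handle categorical variables with Label Encoding
le = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    df[col] = le.fit_transform(df[col])

# Split features and target, then divide 80/20 into train and test sets
X = df.drop("is_safe", axis=1)
y = df["is_safe"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```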
Classification Model
Random Forest Classifier is a machine learning algorithm that belongs to the family of ensemble models. It is a decision tree-based algorithm that combines multiple decision trees into a single model. A forest of decision trees is created, where each tree is trained on a different subset of the training data. The term "random" in the name refers to the fact that each tree in the forest is also trained on a random subset of the features from the dataset. This helps to reduce the risk of overfitting, which occurs when a model is too complex and performs well on the training data but poorly on new data. By using a different subset of features for each tree, the random forest algorithm creates a more diverse set of models that generalize better to new data.
To make a prediction, the algorithm passes the input data through each decision tree in the forest, and the prediction from each tree is counted as a "vote". The final prediction is then determined by the majority vote from all the trees in the forest. Random Forest Classifier is a popular algorithm for classification tasks and has been used in a wide range of applications, such as lead prediction, fraud detection, and image classification. Some of its advantages include the ability to handle large datasets, robustness to outliers, and the ability to handle missing data.
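As an illustration, fitting a baseline random forest on the split from the previous section might look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Fit a baseline forest: each tree sees a bootstrap sample of the rows
# and a random subset of the features at each split
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# The prediction for each sample is the majority vote across all trees
y_pred = rf.predict(X_test)
```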
There are several hyperparameters of a random forest that can be tuned to improve the classifier's performance, such as the number of estimators, maximum depth, minimum samples per split, and minimum samples per leaf. Poorly chosen values can lead to underfitting or overfitting of the model, so it is very important to perform hyperparameter tuning to find their optimal values. To perform this tuning, we are going to use a method called Randomized Search CV.
Randomized Search CV is a technique used in machine learning to perform Hyperparameter Optimization, which involves selecting the best set of hyperparameters for a model. In Randomized Search CV, a specified number of random combinations of hyperparameters are selected from a pre-defined search space. The model is then trained and evaluated for each of these combinations, and the combination that results in the best performance metric is selected as the optimal set of hyperparameters.
Compared to Grid Search CV, which exhaustively searches through all possible combinations of hyperparameters, Randomized Search CV is more efficient because it searches through only a random subset of the combinations, reducing the computational cost. Randomized Search CV is particularly useful when the search space of hyperparameters is large and the computational cost of training the model is high. By using it, the search space can be explored more efficiently, allowing quicker identification of the best hyperparameters for the model.
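A sketch of this tuning with scikit-learn's RandomizedSearchCV; the value ranges, number of iterations, and scoring choice below are illustrative assumptions:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Search space for the hyperparameters discussed above (ranges are assumptions)
param_dist = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Evaluate 20 random combinations with 5-fold cross-validation, scored by F1
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
print(search.best_params_)
```

Increasing n_iter explores more of the search space at a proportionally higher training cost, which is exactly the trade-off that makes Randomized Search CV cheaper than an exhaustive grid search.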
Overfitting Avoiding Mechanism
To avoid overfitting of the model, we perform feature selection. Feature selection is one of the finest ways to avoid overfitting, and it is easy to implement and understand, whereas regularization is computationally more expensive. In this process, we select the most relevant features, the ones that contribute the most to water quality: we calculate the information gain of each feature and remove the features with the least information gain from the final dataset. Information gain is used to determine the usefulness of a feature in splitting a dataset into subsets that are more homogeneous with respect to the target variable. It measures the reduction in entropy, or disorder, achieved by splitting the data on a particular feature: the higher the information gain, the more valuable the feature is for making predictions about the target variable.
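One way to approximate this with scikit-learn is mutual_info_classif, its closest analogue of the information gain described above; the 0.01 selection threshold below is an assumption:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the target
info_gain = pd.Series(
    mutual_info_classif(X_train, y_train, random_state=42),
    index=X_train.columns,
)
print(info_gain.sort_values(ascending=False))

# Drop the least informative features; the 0.01 threshold is an assumption
selected = info_gain[info_gain > 0.01].index
X_train, X_test = X_train[selected], X_test[selected]

# Retrain the tuned model on the reduced feature set
best_rf.fit(X_train, y_train)
```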
Selection of Evaluation Metrics
In the case of an imbalanced dataset, accuracy is not the most suitable metric for evaluating the performance of a binary classification model, as it may give misleading results. This is because accuracy is biased towards the majority class and may appear high even if the model performs poorly on the minority class. Instead, the following evaluation metrics can be used:
- Precision: It measures the proportion of true positive (TP) predictions out of all positive predictions (TP + FP). Precision is a useful metric when the goal is to minimize false positives, i.e., the model should predict the positive class only when it is confident the prediction is a true positive.
- Recall: It measures the proportion of true positive predictions out of all actual positive instances (TP + FN). Recall is a useful metric when the goal is to minimize false negatives, i.e., the model should identify all positive instances in the dataset.
- F1-score: It is the harmonic mean of precision and recall. F1-score is a useful metric when we want to balance the importance of precision and recall, and we need a single metric to evaluate the performance of the model.
- ROC-AUC: It measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for different classification thresholds. ROC-AUC is a useful metric when we want to evaluate the model's ability to distinguish between the positive and negative classes across all possible thresholds.
In short, the choice of evaluation metric depends on the specific problem at hand and the goals of the model. It is recommended to use a combination of these metrics to gain a better understanding of the model's performance.
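Assuming the tuned model best_rf from earlier, all four metrics can be computed with scikit-learn like this:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_pred = best_rf.predict(X_test)
y_proba = best_rf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-score: ", f1_score(y_test, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_test, y_proba))
```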
Result Analysis
The proposed model has been evaluated using the performance metrics mentioned above: precision, recall, F1-score, and ROC-AUC. The model achieved an F1-score of 97.7% and a ROC-AUC score of 85%, which indicates that it distinguishes the two classes well and can be recommended for deployment in the real world.
Finally, we need to save the model in pickle format so that the app can load it later.
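A minimal sketch, assuming the tuned model is named best_rf and the output file model.pkl (both names are assumptions):

```python
import pickle

# Persist the trained model so the Streamlit app can load it later
with open("model.pkl", "wb") as f:
    pickle.dump(best_rf, f)
```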
Create a Streamlit App
To create the app, we first need to load the trained machine learning model and allow the user to interact with it.
Then, we define a function that takes the parameters of a new water sample as input and makes a prediction based on them.
Finally, we build the web page, which collects all the inputs and includes a prediction button. A minimal sketch of such an app is shown below.
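In this sketch, the model file name, the parameter list, and the label encoding (1 = safe) are assumptions; the input order must match the order of the features used in training:

```python
# app.py
import pickle

import numpy as np
import streamlit as st

# Load the trained model (the file name "model.pkl" is an assumption)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def predict_quality(features):
    """Return 'Safe' or 'Unsafe' for a list of water sample parameters."""
    # Assumes the positive class (1) means the sample is safe to drink
    prediction = model.predict(np.array(features).reshape(1, -1))[0]
    return "Safe" if prediction == 1 else "Unsafe"

st.title("Water Quality Prediction")

# One numeric input per parameter; the order must match the training features
params = ["aluminium", "ammonia", "fluoride", "chromium", "copper", "bacteria"]
features = [st.number_input(p, min_value=0.0) for p in params]

if st.button("Predict"):
    st.success(f"This water sample is: {predict_quality(features)}")
```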
Lastly, to run the app locally, run the following command from your terminal:
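Assuming the app file is named app.py (an assumption), the command would be:

```
streamlit run app.py
```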
This will fire up a browser window where you can see your current Streamlit app.
Deploying Streamlit App on Streamlit Cloud
Once the app is uploaded, you can configure the app settings on Streamlit Cloud. This includes specifying the required dependencies, setting up environment variables, and configuring the app’s resources. We’re now ready to power up our app and try out its classification capabilities. Here’s how we’ll do that:
Step 1: Create a requirements.txt file at the root of your folder with the libraries that we used.
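A minimal example of such a file (the exact library list and versions depend on your code; pin versions as needed):

```
streamlit
pandas
numpy
scikit-learn
matplotlib
```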
Step 2: Create a new GitHub repository and push your code over there.
Step 3: Create a Streamlit account and connect your GitHub profile to it.
Step 4: Now, on the Streamlit dashboard, click the "New app" button and link your Streamlit app to your GitHub repository.
Step 5: Click “Deploy” and you’re all done!
Now you can share your new app with the world!!
You can find the full code in the GitHub repository. Here is a demo app so you can see how it works in real time.
Happy Learning!!!