Sameer

Indian Railways - Data Visualization App

Mohammed Sameeruddin — Sat, 04 Jun 2022 09:25:04 GMT

Introduction

Indian Railways is a statutory body under the Ministry of Railways, Government of India, that operates India's entire national railway system. As of March 31, 2022, it administers the world's fourth-largest national railway system, with a total route length of 67,956 kilometers (42,226 mi). Every day, it runs 13,169 passenger trains on both long-distance and suburban routes, covering 7,325 stations across India.

Indian Railways has long been one of the go-to options for traveling from one place to another. This blog discusses how we can visualize the railway data, station by station, directly on a map.

Data Source

Indian Railways data was collected by Sanjay and Sajjad.

To know more about the difficulties they faced while collecting the data, mainly in maintaining its authenticity, you can read about it here.

Source: https://github.com/datameet/railways

There are three files -

stations - contains the data of all the stations of India along with the coordinate values and other details. The file is in GeoJSON format.

{
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
}

trains - contains the data of all the train paths (railway routes) along with the coordinate values and other details. The file is in GeoJSON format.

{
    "geometry": {
        "type": "LineString",
        "coordinates": [
            [72.89173899999999, 19.070320000000002],
            [78.2266994458, 26.0352337224],
            [78.18700399999999, 26.145594],
            [78.18229199999999, 26.216483]
        ]
    },
    "type": "Feature",
    "properties": {
        "third_ac": true,
        "arrival": "15:35:00",
        "from_station_code": "LTT",
        "name": "Mumbai LTT - Gwalior (Weekly) Special",
        "zone": "CR",
        "chair_car": true,
        "first_class": true,
        "duration_m": 45,
        "sleeper": true,
        "from_station_name": "LOKMANYA TILAK TERM",
        "number": "01101",
        "departure": "15:50:00",
        "return_train": "01102",
        "to_station_code": "GWL",
        "second_ac": true,
        "classes": "",
        "to_station_name": "GWALIOR JN",
        "duration_h": 23,
        "type": "Exp",
        "first_ac": true,
        "distance": 1216
    }
}

schedules - contains the data of all the train schedules along with other details. The file is in JSON format.

{
    "arrival": "None",
    "day": 1,
    "train_name": "Falaknuma Lingampalli MMTS",
    "station_name": "KACHEGUDA FALAKNUMA",
    "station_code": "FM",
    "id": 302214,
    "train_number": "47154",
    "departure": "07:55:00"
}

In addition to this, I downloaded the shapefile of India, which contains the boundaries of the states.

Implementation

`import` Packages

As always, we begin by importing the packages that are essential for this project.

Data Reading

JSON file reader - all the JSON file data including the GeoJSON format can be read using the below function.
SHP file reader - helps to read the shape file data.
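A minimal sketch of the JSON reader, using only the standard library. The function names read_json and feature_properties are hypothetical, and the sketch assumes that the GeoJSON files are FeatureCollection objects whose entries look like the station sample shown earlier. For the shapefile side, geopandas.read_file is the usual entry point and is not repeated here.

```python
import json
from pathlib import Path


def read_json(path):
    """Load a JSON or GeoJSON file and return the parsed Python object."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def feature_properties(geojson):
    """Return the 'properties' dict of every feature in a FeatureCollection."""
    return [feature["properties"] for feature in geojson.get("features", [])]


# Tiny in-memory sample shaped like one entry of stations.json
# (assumption: the real file wraps such entries in a FeatureCollection).
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [75.4516454, 27.2520587]},
            "properties": {"code": "BDHL", "name": "Badhal", "state": "Rajasthan"},
        }
    ],
}
station_rows = feature_properties(sample)
```

Flattening the properties this way gives one plain record per station, which is convenient to load into a table before any geometry is attached.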

Data Preparation

The files stations.json and trains.json need to be converted into shapefiles so that the data can be easily manipulated further using the geopandas package.

stations shape file

trains shape file

Indian States Boundaries

In the stations shapefile, there are many missing values (especially in the state column). To get the state name, we need to perform a spatial join, keeping the stations on the left and the Indian states' boundaries on the right.
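To make the idea of the spatial join concrete, here is a rough pure-Python sketch of what it does for point data: each station without a state is tested against every state outline, and the first outline that contains it supplies the name. In practice geopandas.sjoin does this work on real geometries; the rectangular outlines, station records, and function names below are all hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def fill_missing_state(stations, state_rings):
    """Left-join style fill: keep every station, add a state where one matches."""
    filled = []
    for station in stations:
        row = dict(station)
        if not row.get("state"):
            for state_name, ring in state_rings.items():
                if point_in_polygon(row["lon"], row["lat"], ring):
                    row["state"] = state_name
                    break
        filled.append(row)
    return filled


# Hypothetical, heavily simplified rectangular "state" outlines as (lon, lat).
state_rings = {
    "State A": [(70.0, 20.0), (75.0, 20.0), (75.0, 25.0), (70.0, 25.0)],
    "State B": [(75.0, 20.0), (80.0, 20.0), (80.0, 25.0), (75.0, 25.0)],
}
stations = [
    {"code": "AAA", "lon": 72.5, "lat": 22.0, "state": None},
    {"code": "BBB", "lon": 78.0, "lat": 21.0, "state": ""},
    {"code": "CCC", "lon": 90.0, "lat": 10.0, "state": None},
]
filled = fill_missing_state(stations, state_rings)
```

Keeping the stations on the left means that a station falling outside every outline (CCC here) is retained with its state still empty rather than being dropped.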

Similarly, in the trains shapefile, a few stations are mapped to multiple states. They are listed below.

MAULA ALI - Telangana, Punjab
NAGARGALI - Maharashtra, Karnataka
INDARGARH - Rajasthan, Madhya Pradesh
ALAI - Rajasthan, Uttarakhand
Jagdishpur - Bihar, Jharkhand
KOMATIPALLI - Telangana, Andhra Pradesh
NAGARI - Andhra Pradesh, Maharashtra
Kashinagar Halt - West Bengal, Odisha
SULGARE - Maharashtra, Chhattisgarh

Data Visualization

Now that the data is prepared, we can use Plotly to get an interactive map plot.

Stations

With the above function, we can plot the stations by providing the state_name param.

Andhra Pradesh stations

Trains

With the above function, we can plot the train paths by providing the state_name param.

Andhra Pradesh trains

Trains (from station)

With the above function, we can plot the train paths for a given from_ station.

Chennai Beach station

App Development

Using all the procedures above and other packages (like dash) that support Plotly figures, we can develop a full-fledged web application.

Since we have two aspects, stations and trains, we can create HTML tabs and display the corresponding map in each.

Stations

Input Layout

Apart from state_name, it also has additional inputs, such as an option to display the boundary of the state.

Output Layout

Trains

Input Layout

Output Layout

The below plot shows each stop from the from_ station to all the remaining stations.

The below plot does not show any stops but all the stations from the from_ station in the form of a network.

`from` station `to` station

If we select a specific to_ station, we get to see additional details of the train, such as -

Arrival - the time when the train arrives.
Departure - the time when the train departs.
1st Class - 1 if the train has a 1st class, else 0.
Time - the duration to reach the destination station.
Sleeper - 1 if the train has a sleeper class, else 0.
Distance - the total distance in kilometers.

The map for the same is below.

Well, that's it for this article. You can find my work in the below links.

You can also subscribe to my newsletter for such exclusive content. Thanks all.


DeepGlobe Road Extraction - Challenge

Mohammed Sameeruddin — Mon, 21 Feb 2022 12:48:27 GMT

Introduction

The Geoscience and Remote Sensing Society, one of the well-known communities for learning and contributing to geospatial science, sponsored the DeepGlobe machine vision challenge in 2018, which involves the in-depth analysis of satellite images of Earth.

The 3 challenges are -

Road Extraction
Building Detection
Land Cover Classification

As part of this, I picked the Road Extraction problem, as roads are crucial to many domains, be it transportation, traffic management, city planning, road monitoring, or GPS navigation.

The DeepGlobe challenges are purely research-based and focus on real-world problems.

DL Problem Formulation

Data Overview

Source https://competitions.codalab.org/competitions/18467
In total, we have 14796 images, split as follows:
- 6226 training satellite images
- 6226 training mask images
- 1243 validation satellite images
- 1101 test satellite images
The validation and test sets do not contain the respective mask images. These are what we need to predict.

Problem Type

If we observe the masks, there are 2 classes (road and background). Therefore, it is a binary image segmentation task.
It is a computer vision task, since we are dealing with satellite images.

Performance Metric

Since it is an image segmentation task, the metrics that we stick to are -

IoU Score - the area of the overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.
- 0 IoU means there is no overlap between the predicted segmentation and the ground truth.
- 1 IoU means the predicted segmentation and ground truth are exactly identical.
Accuracy - the ratio of correct predictions to the total number of predictions. The one caveat is that this metric is reliable only when the classes are roughly balanced, that is, when each class has a similar number of samples.
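The two metrics can be sketched on tiny masks, assuming binary masks stored as nested lists of 0 and 1 (a real pipeline would more likely use array operations). The function names and the sample masks are illustrative only. The second prediction, which marks everything as background, shows why accuracy alone can be misleading when road pixels are rare.

```python
def iou_score(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def pixel_accuracy(pred_mask, true_mask):
    """Share of pixels whose predicted label equals the ground-truth label."""
    correct = 0
    total = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            correct += 1 if p == t else 0
            total += 1
    return correct / total


truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 0, 0]]
pred = [[0, 1, 0],
        [0, 1, 1],
        [0, 0, 0]]
all_background = [[0, 0, 0] for _ in range(3)]

iou_pred = iou_score(pred, truth)            # 2 shared pixels / 4 in the union
acc_pred = pixel_accuracy(pred, truth)       # 7 of 9 pixels agree
iou_empty = iou_score(all_background, truth)
acc_empty = pixel_accuracy(all_background, truth)
```

The all-background prediction still gets 6 of 9 pixels right, yet its IoU is 0, which is why IoU is the more informative metric for this task.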

Data Download

The dataset is around 4 GB in size. We can download it using the Kaggle API or the CurlWget Chrome extension, both of which take less time than a regular download.
For convenience, I have already downloaded it to my system. The structure of the data after extracting the files is as follows.
metadata.csv file contains image paths for all training, validation, and test sets.

Exploratory Data Analysis

Originally, the size of each image is (1024, 1024). Training a model at this size would require considerably more memory and computation. We need to make sure the images are smaller and uniform in size.
It is also mentioned that some of the training masks are not binary images. A binary image is one whose pixels are either 0 or 255. To maintain uniformity, we need to binarize the training masks with the help of the OpenCV package.

Binarization

To know more about image binarization, you can feel free to check my blog.
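As a rough illustration of what binarization does, the sketch below thresholds a grayscale mask stored as nested lists. The threshold of 128 and the function name are assumptions, not values taken from the original notebook; with OpenCV, cv2.threshold with the cv2.THRESH_BINARY flag performs the same kind of operation on arrays.

```python
def binarize(gray_mask, threshold=128):
    """Map every pixel to 0 or 255 depending on an (assumed) threshold."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_mask
    ]


# Hypothetical mask with a few "almost black" and "almost white" pixels.
noisy_mask = [
    [0, 3, 250],
    [130, 127, 255],
]
binary_mask = binarize(noisy_mask)
```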

The outputs are as follows.

Image Resizing

The ideal shape of an image that we want is (256, 256). We need to resize all the images in training, validation, and test sets.

Some practitioners also consider the shape of the images to be (128, 128).

Training Images

Let's visualize a sample of training data that includes both satellite image and its respective mask image.

By just observing this sample data, we can tell that this task is limited to extracting the larger paths which are clearly visible.

Validation Images

It would have been perfect if there were validation mask images so that it could become easy to track the models' performance. Unfortunately, the validation data is not provided with mask images.

To work around this problem, we can hold out 1% of the training data as validation data and train the models on the rest.

This sample set contains a mix of larger paths and thinner paths.

Test Images

The test dataset does not contain the targets or mask images. This is something we need to predict. Let's visualize the same.

This sample set contains a mix of larger paths and thinner paths.

This ends our EDA. However, if you would like to know the complete details with all the code, you can check my notebook hosted on the Jovian platform.

Transfer Learning

Transfer learning is a technique in ML and DL wherein the knowledge a model gained while solving one problem is stored and reused for solving another, similar problem.

The main advantage of transfer learning is that we do not have to train the models from scratch and can instead start from pre-trained ones. TensorFlow by Google is a handy library for implementing deep learning models. It also includes various pre-trained models that can be adapted to many problem statements with slight tweaks to the model's architecture.

We are dividing this modeling (transfer learning) part into 2 aspects.

No Augmentation

ResNet34

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that for those images that contain larger paths, the model is able to extract the path but not as efficiently as it needs to be.

The below output is the test data prediction. The right image is the prediction mask. As usual, the model only identifies the larger paths but not the thinner paths. We can also see some incomplete paths that have been predicted.

VGG16

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that the performance is quite similar to the previous (ResNet34) model. Complex patterns, such as paths covered by bushes, are not identified properly.

MobileNet

The below output is the test data prediction. The right image is the prediction mask. The model's performance is not up to the mark when compared to the previous models.

Inception

For the complex patterns, the model doesn't perform well. But it does a pretty good job in identifying the thinner patterns which are clear.

The below output is the test data prediction. The right image is the prediction mask. For some patterns that ResNet34 or VGG16 cannot identify, it does a good job.

ResNet50

The below output is the validation data prediction. The rightmost image is the prediction mask. It was only able to extract the larger paths that looked very clear.

Test data prediction.

Augmentation

I have trained the same list of models by slightly augmenting the data. The results are not up to the mark.

Model Performance

Comparing the results in order to select the best model.
The results of the augmented data are not so good compared with the original data (no augmentation). Clearly, it can be noted by checking the performances of the various models.
In fact the masks obtained for the augmented data are not as clear as the masks obtained for the original (no augmentation) data.
Hence, it is not a valid decision to consider the models that have been trained on the augmented data.
The only option left now is to select the best model out of other models that have been trained on the original (no augmentation) data.
The best model(s) from all the above models are ResNet34, VGG16, and Inception.

Feel free to check my notebook for transfer learning hosted on the Jovian platform.

U-NET Model

Apart from implementing transfer learning models, I also wanted to cross-check the performance by implementing a famous deep learning algorithm called U-NET that is mostly used for image segmentation tasks.

The performance of the U-NET model is significantly better than the transfer learning. As usual, we shall divide this into 2 aspects.

No Augmentation

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that the model is able to extract most of the paths (including thinner paths).

Even the paths that have not been highlighted in the original mask, are still being identified with the help of this model.

Test data prediction.

Augmentation

In the case of data augmentation, the model still performs its best. Augmentation has been helpful to identify complex patterns. Results of the validation prediction can be seen below.

Test data prediction.

Model Performance

The above are the results obtained by the U-NET model developed from scratch.
The left image corresponds to the results of the model that is trained on the original (no augmentation) data.
The right image corresponds to the results of the model that is trained on the augmented data.
By observing the graphs, we can say that the left model shows slight overfitting compared with the right one, although the difference seems negligible.
Personally, I feel that the model trained on the augmented data does show some closer predictions. This is something we should observe in the process of error analysis.

Feel free to check my notebook for U-NET modeling hosted on the Jovian platform.

Error Analysis

Error analysis is very helpful when we have multiple models, in order to decide which model is the best. It is also the stage where we learn how to improve the model or the dataset.

In total, we have 6 models, each trained on both augmented and non-augmented data. The IoU score is the performance metric that we have considered. Based on my observation, the U-NET trained on non-augmented data happens to be the best model since it has the highest IoU score.

The error analysis is usually performed on a sample of the data. But if you have the time and sufficient system resources, you can also consider the whole dataset. For this analysis, I have considered only 100 random images.

Feel free to check my notebook for deep error analysis hosted on the Jovian platform.

Single Image Segmentation

As part of this, we need to extract the road path when passed a single satellite image as input to the model. This is important in real-life scenarios as we do not generally pass the entire test set to the model. Instead of getting the prediction mask, it is always better to highlight the road path on the satellite image itself. We can highlight the path by any color.

Once we have the model loaded, we can always pass a single image for getting the route. Upon getting the prediction mask, with the help of simple image processing techniques, we can still retain the background and only highlight the path.

Hence, the below function.
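The highlighting step can be sketched without any image library, assuming the satellite image is a nested list of (R, G, B) tuples and the predicted mask is a nested list of 0 or 255 values of the same size. The function name, the default color, and the blending weight are illustrative choices, not the ones used in the original notebook.

```python
def highlight_roads(image, mask, color=(255, 0, 0), alpha=0.6):
    """Blend `color` into `image` wherever `mask` is non-zero; keep other pixels."""
    highlighted = []
    for image_row, mask_row in zip(image, mask):
        out_row = []
        for (r, g, b), m in zip(image_row, mask_row):
            if m:
                # Weighted mix of the original pixel and the highlight color.
                pixel = tuple(
                    round((1 - alpha) * channel + alpha * tint)
                    for channel, tint in zip((r, g, b), color)
                )
            else:
                # Background pixels are copied unchanged.
                pixel = (r, g, b)
            out_row.append(pixel)
        highlighted.append(out_row)
    return highlighted


# Hypothetical 2x2 gray image with a single predicted road pixel.
image = [[(100, 100, 100), (100, 100, 100)],
         [(100, 100, 100), (100, 100, 100)]]
mask = [[0, 255],
        [0, 0]]
overlay = highlight_roads(image, mask, color=(255, 0, 0), alpha=0.5)
```

Blending instead of overwriting keeps some of the underlying texture visible, so the road can still be checked against the satellite image.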

No Augmentation

100034_sat

100393_sat

Augmentation

100034_sat

100393_sat

UI Part

Video Demonstration

https://youtu.be/488uCkulDxA

Jovian Project Link https://jovian.ai/msameeruddin/collections/deep-globe-road-extraction
LinkedIn https://www.linkedin.com/in/mohammed-sameeruddin-596a70133/


Well, that's it from my end. Make sure to subscribe to my newsletter to never miss an update from the exclusive content that I post.

San Francisco Crime Classification

Mohammed Sameeruddin — Sun, 23 Jan 2022 07:38:13 GMT

San Francisco - City

San Francisco is the cultural, commercial, and financial capital of the state of California in the United States. It is the 17th most populated city in the United States and a renowned tourist destination noted for its cool summers, the Golden Gate Bridge, and some of the world's best restaurants. Despite being a city known for its growth, San Francisco remains one of the most dangerous places to live due to an increase in criminal activities.

SF OpenData, the open data catalog of San Francisco, has provided 12 years of crime data dated from 1/1/2003 to 5/13/2015. These criminal records are mostly incidents derived from the SFPD Crime Incident Reporting System. The dataset is published on Kaggle.

The main aim of this case study is to predict the category of a crime incident that may occur in the future, thus helping the department reduce the time and effort of organizing the data.

ML Problem Formulation

Data Overview

Source https://www.kaggle.com/c/sf-crime/data
In total, we have 3 files
- train.csv.zip
- test.csv.zip
- sampleSubmission.csv.zip
The training data has 9 columns that include the target column as well.
The test data has 6 columns that exclude the target column (this is something we need to predict).

Problem Type

There are 39 different classes of crimes that need to be classified.
Hence this leads to a multi-class classification problem.

Performance Metric

Since it is a classification problem, the metrics that we stick to are -

Multi Log-Loss - the aim of any machine learning model is to reduce the loss. For this problem, a perfect classifier would be one that reduces the multiclass logarithmic loss to 0.

Confusion Matrix - is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class.
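For reference, multiclass log loss averages the negative log of the probability assigned to the true class. The sketch below is a plain-Python version under the assumption that labels are integer class indices and that probabilities are clipped with a small eps to avoid log(0); the function name and eps value are illustrative. It also shows the loss of a predictor that spreads probability uniformly over the 39 classes, which is a handy reference point.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip so that a probability of exactly 0 or 1 stays finite.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


n_classes = 39
labels = [0, 5, 12, 38]

# A predictor with no information: equal probability for every class.
uniform = [[1 / n_classes] * n_classes for _ in labels]
uniform_loss = multiclass_log_loss(labels, uniform)

# A perfect predictor: all probability mass on the true class.
perfect = [[1.0 if k == label else 0.0 for k in range(n_classes)] for label in labels]
perfect_loss = multiclass_log_loss(labels, perfect)
```

The uniform predictor scores ln(39), roughly 3.66, so a useful model should land clearly below that value.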

Objectives & Constraints

Predict the probability of each data point belonging to each of the 39 classes.
Constraints:
- Interpretability
- Class probabilities are needed
- Penalize the errors in class probabilities => Metric is Log-loss
- No latency constraints

Exploratory Data Analysis

The original training data that are given -

The original test data that are given -

Clearly, from the Dates column we can extract time-based features such as -

date
year
month
day
hour
minute
season
hour_type
time_period
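A minimal sketch of how most of these features could be derived with the standard library, assuming the Dates values look like "2015-05-13 23:53:00". The season mapping and the hour_type cut-offs are assumptions made for illustration (the original definitions are not shown here), and time_period is left out for the same reason.

```python
from datetime import datetime

# Assumed season mapping (Northern Hemisphere, by calendar month).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}


def hour_type(hour):
    """Bucket an hour (0-23) into a coarse label; the cut-offs are assumptions."""
    if hour < 5:
        return "Mid-Night"
    if hour < 12:
        return "Morning"
    if hour < 16:
        return "Noon"
    if hour < 20:
        return "Evening"
    return "Night"


def time_features(timestamp):
    """Derive simple calendar and clock features from one Dates value."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "date": dt.date().isoformat(),
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "season": SEASONS[dt.month],
        "hour_type": hour_type(dt.hour),
    }


features = time_features("2015-05-13 23:53:00")
```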

Also, from the Address column we can extract the address type such as -

Training data after extracting time-based and address-type features.

Test data after extracting time-based and address-type features.

Univariate Analysis

So, we got some important features. We can perform the univariate analysis considering each column and get to know how the data is distributed.

Target Distribution

The above is the class distribution of the column category.
Clearly, we can see that Larceny/Theft is the most frequent type of crime across all the years.
The last 5 to 6 crimes are negligible, meaning they occurred very rarely.
The data is not balanced. Therefore, while building the model it is better to do the stratification-based splitting.
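Stratified splitting keeps the class proportions the same in both parts of the split. The sketch below implements the idea by hand so that the mechanics are visible; the function name, the 20% test fraction, and the toy label counts are assumptions. With scikit-learn, passing the labels to the stratify argument of train_test_split achieves the same effect.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class, rare ones included.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, deliberately imbalanced labels (counts are made up).
labels = ["LARCENY/THEFT"] * 80 + ["ASSAULT"] * 15 + ["TREA"] * 5
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```

A plain random split could easily leave the rarest class out of the test part altogether, whereas here each class contributes its share.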

Address-type Distribution

From the above plot, we can tell that most of the crimes occurred on Streets and Crosses.
- St stands for the street.
- Cross signifies the junction point.

Police-department Distribution

Southern is the police district where most of the crimes got reported.

Year Distribution

We know that the data is recorded from 1/1/2003 to 5/13/2015.
- The data of the year 2015 is not recorded fully.
The year 2013 has the highest number of crimes. The other years fall in a similar range.

Month Distribution

In all of the months, we can observe that the occurrence of the crimes roughly ranges from 60k to 80k.

Weekday (Day) Distribution

Friday is the day when most of the crimes occurred.
Sunday is the day when fewer crimes (compared with other days) occurred.
- This is most likely because it is a holiday.
Overall, the distribution is almost uniform across the weekdays.

Hour Distribution

It is observed from the above that most of the crimes happen either in the evening or at midnight.

Time-period Distribution

We can observe that most of the crimes usually happen either during the evening or at night.

Hour-type Distribution

Morning, Mid-Night, and Night appear to be the hour types in which crimes are most likely to happen.
- In fact, they have a similar distribution.
Evening and Noon are the hour types in which business activity is usually continuous.

Occurrences Yearly

The above is an animation plot showing the occurrences that had happened yearly.
The above is the count of each occurrence type that occurred in the police department.
The most common crime that occurred is Larceny/Theft.
This is something we observed in the plot of overall occurrences in all the years.

Map Visualization Yearly

The above shows the exact location of the types of crimes that occurred per police department.
This is a year-wise plot and the police department that is chosen is Richmond.

Choropleth Visualization Monthly

The above is the choropleth map based on the count of the occurrences per police department.
Southern is the police department where most of the crimes got reported.
This is a month-wise plot for the year 2015.

Geo-Density Map

The GIF showcases the density map of the top 12 crimes that occurred per year.
The most important thing to observe here is that Larceny/Theft always ranks first.
- The second is Other Offenses.
The density is taken in descending order and hence retains the top crimes.

Multivariate Analysis

In order to do multivariate analysis, we need additional features derived from the columns that we have. We can follow the below steps to obtain them.

Categorical columns can be one-hot encoded.
- Procedure to follow up - here.
We can extract spatial distance features by considering all the police stations in San Francisco and computing the haversine distance from each station to every coordinate pair present in the data.
- Procedure to follow up - here.
We can perform some arithmetic operations between latitude and longitude. These are known as location-based features.
- Procedure to follow up - here.
The Address column can be converted to Bag of Words and TF-IDF representations.
- Procedure to follow up - here.
In doing so, we end up with 130 features, compared with around 7 earlier. This gives the models much more to learn from.
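The spatial distance features mentioned above rely on the haversine formula, which gives the great-circle distance between two latitude and longitude pairs. Here is a small sketch; the station names and coordinates are placeholders rather than real police station locations, and the mean Earth radius of 6371 km is the usual approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def station_distance_features(lat, lon, stations):
    """One distance feature per station for a single incident location."""
    return {
        f"dist_{name}": haversine_km(lat, lon, s_lat, s_lon)
        for name, (s_lat, s_lon) in stations.items()
    }


# Placeholder station coordinates (not real locations).
stations = {"station_a": (37.78, -122.41), "station_b": (37.73, -122.45)}
incident_features = station_distance_features(37.7749, -122.4194, stations)

# Sanity check: one degree of latitude is about 111.19 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Each incident gains one column per station, which lets the models pick up on how close a location is to the nearest police presence.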

We shall visualize a TSNE plot to see whether the classes can be separated fully or not. Since the data is large, I have considered only 10% of the whole data, obtained by a stratified split, because TSNE takes a long time to fit. This 10% amounts to 87804 data points.

TSNE - 10%

Currently, the dataset consists of 39 classes.
From the above plot, we can clearly see that the data is highly imbalanced.
The clusters are not separated properly.

TSNE - 10% (top 12 crimes)

The above is better compared to when we considered all the categories.
The clusters also formed in a better way.
The separation of the clusters is also better.

TSNE - 10% (top 5 crimes)

The above plot is far better compared to all the previous plots.
We can see the separation clearly.

Modeling & Training

Now that we got the features it is time for us to apply various classification models and check the performance of each. For this problem, we will go with the following models.

Logistic Regression
Decision Tree
Random Forest
XGBoost Classifier

We can also try

KNN

I did not implement it, as it takes a long time to train and my system's resources are limited.

Let's train some models...

Dummy Classifier

Also known as a random model. This model assigns a random class label to each data point. It is helpful for checking whether the other models do better than random guessing. The loss that we obtain from it can be set as a threshold, and the aim is to achieve a loss lower than that threshold.

Logistic Regression

After hyperparameter tuning, the best value of C turned out to be 30, with l2 as the regulariser. We proceed with log loss since accuracy won't help as the data is imbalanced. Fortunately, the loss is lower than the loss of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

Decision Tree

After hyperparameter tuning, the best values turned out to be 50 for max_depth and 500 for min_samples_split. We proceed with log loss since accuracy won't help as the data is imbalanced. Fortunately, the loss is lower than the loss of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

Random Forest

After hyperparameter tuning, the best values turned out to be 8 for max_depth and 100 for n_estimators. We proceed with log loss since accuracy won't help as the data is imbalanced. Fortunately, the loss is lower than the loss of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

XGBoost Classifier

After hyperparameter tuning, the best values turned out to be 8 for max_depth and 150 for n_estimators. We proceed with log loss since accuracy won't help as the data is imbalanced. Fortunately, the loss is lower than the loss of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

We have to calibrate the predicted probabilities, as the probabilities play a vital role in this classification problem.

The results of StackingClassifier are not up to the mark.

Models' Summary

From the above, we can observe that all models (except DummyClassifier) performed well.
- The loss of each model is less than the DummyClassifier.
- The best model compared to all other models is XGBoost. Its loss is the lowest.

The full code for modeling can be found here.

Single Point Prediction

As part of this, we need to predict the category of the crime when the original test data (single point) is passed as input to the model. This is important in real-life scenarios as we do not generally pass the features and preprocessed test data to the model.

The code for the single-point prediction can be found here.

The video demonstration.

https://youtu.be/S07Zl7W0_fA

Jovian Project Link -

https://jovian.ai/msameeruddin/collections/san-francisco-crime-classification

LinkedIn Profile -

https://www.linkedin.com/in/mohammed-sameeruddin-596a70133/


Future Work

We can attempt to apply deep learning-based models and compare their performance with the models already trained here. To my knowledge, most of the Kagglers who attempted this problem implemented their solutions using deep learning models. We can definitely give it a try.

Earthquakes Data Visualization

Mohammed Sameeruddin — Wed, 29 Sep 2021 08:58:24 GMT

Introduction

An earthquake is the result of a sudden release of stored energy in the Earth's crust that creates seismic waves. According to USGS, an earthquake is what happens when two blocks of the earth suddenly slip past one another. The surface where the two blocks slip is known as the fault or fault plane. USGS tracks the data of the hazards and updates it every minute. The data is stored in CSV format.

In this blog, we will learn how to visualize the seismic data on the basis of the world or specific area. We will also learn to animate the visualization process based on the time factor.

Credits of Cover Image - Photo by Cédric Dhaenens on Unsplash

Data Source

USGS is the main source of data. It is well known for various disaster management systems including Earthquakes, Floods, Volcanoes, etc. The seismic data is recorded in four categories.

Hourly - updated every minute
Daily - updated every minute
Weekly - updated every minute
Monthly - updated every minute

For the project, we will only consider Daily, Weekly, and Monthly data.

Source - USGS Spreadsheet Format. In this link, you can find additional details like column names, and column descriptions.

Using this live-streaming data we can develop a live-updating dashboard (maybe I will cover that next time).

Implementation

If you would like to get the video explanation, you can watch the below video with some additional details.

https://youtu.be/0TMpqXxqnC0

`import` Packages

As always, we begin by importing the packages that are essential for this project.

Data Manipulation

The data has around 22 columns, many of which are mostly NaN. The importance of some columns is apparent only to a seismic researcher. For the visualization, we will only consider 5 columns.

time - useful for animating the visualization process.
- Monthly - if monthly data is fetched, we shall extract the dates from the time column.
- Weekly - if weekly data is fetched, we shall extract the weekdays from the time column.
- Daily - if daily data is fetched, we shall extract the hours (numbers) from the time column.
latitude - useful for plotting the earthquake location on the map.
longitude - useful for plotting the earthquake location on the map.
mag - useful for scaling the marker size based on the magnitude of the earthquake.
place - useful for visualizing the data either worldwide or for a specific area.
- The place column consists of both the sub area and the main area. This has to be split in order to visualize the data pertaining to a specific area (main area).
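The two column manipulations described above can be sketched as follows. The sketch assumes that time values are ISO-8601 strings such as "2021-09-29T08:58:24.000Z" and that place values look like "sub area, main area"; the function names and the sample place string are illustrative.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def split_place(place):
    """Split 'sub area, main area' on the last comma; no comma means no sub area."""
    if "," in place:
        sub_area, main_area = place.rsplit(",", 1)
        return sub_area.strip(), main_area.strip()
    return "", place.strip()


def time_unit(timestamp, manner="daily"):
    """Return the animation key: hour (daily), weekday (weekly) or date (monthly)."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    if manner == "daily":
        return dt.hour
    if manner == "weekly":
        return WEEKDAYS[dt.weekday()]
    if manner == "monthly":
        return dt.date().isoformat()
    raise ValueError("manner must be 'daily', 'weekly' or 'monthly'")


sub_area, main_area = split_place("10 km NE of Sample Town, Alaska")
hour_key = time_unit("2021-09-29T08:58:24.000Z", "daily")
weekday_key = time_unit("2021-09-29T08:58:24.000Z", "weekly")
date_key = time_unit("2021-09-29T08:58:24.000Z", "monthly")
```

Splitting on the last comma keeps sub areas that themselves contain commas intact, at the cost of treating comma-free values as a main area only.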

Data Fetching & Preprocessing

The below is the function to obtain the preprocessed data after performing the above operations. By default, it fetches the data worldwide.

The function takes 3 parameters -

manner - the value should be one of daily, weekly, or monthly. By default, the value is daily.
bbox - a bounding box parameter by which the data is filtered to the area that is passed. By default, the value is Worldwide.
mag_thresh - a magnitude threshold that takes integer or float values. Based on this, the data is filtered. By default, it is 2.
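To illustrate how the three parameters might interact, here is a hedged sketch that works on already-parsed rows instead of downloading anything. The feed URLs follow the USGS CSV summary feed naming; the row keys (mag, main_area), the choice to keep magnitudes equal to the threshold, and the function names are assumptions for illustration only.

```python
# USGS CSV summary feeds, keyed by the `manner` parameter.
FEED_URLS = {
    "daily": "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv",
    "weekly": "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv",
    "monthly": "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv",
}


def feed_url(manner="daily"):
    """Pick the feed that matches the requested manner."""
    if manner not in FEED_URLS:
        raise ValueError("manner must be 'daily', 'weekly' or 'monthly'")
    return FEED_URLS[manner]


def filter_quakes(rows, bbox="Worldwide", mag_thresh=2):
    """Keep rows at or above the magnitude threshold and inside the chosen area."""
    kept = []
    for row in rows:
        # Rows without a magnitude cannot be sized on the map, so skip them.
        if row["mag"] is None or row["mag"] < mag_thresh:
            continue
        if bbox != "Worldwide" and row["main_area"] != bbox:
            continue
        kept.append(row)
    return kept


# Hypothetical, already-parsed rows.
rows = [
    {"mag": 4.6, "main_area": "Japan"},
    {"mag": 1.2, "main_area": "Alaska"},
    {"mag": 2.0, "main_area": "Alaska"},
    {"mag": None, "main_area": "Chile"},
]
worldwide = filter_quakes(rows)
alaska_only = filter_quakes(rows, bbox="Alaska")
strong_only = filter_quakes(rows, mag_thresh=4.5)
```

Separating the feed choice from the filtering keeps the filter easy to test offline, since it never touches the network.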

Data Visualization

Once the data is fetched, we visualize it to observe patterns. The data is plotted on a map (stamen-terrain style), and we use Plotly to ease the visualization process.

Hence the function below.

Worldwide

Let's look at the animations for the Worldwide data.

Daily Basis

The animation is created on the basis of hours (numbers). Feel free to zoom in and out of the map.

Weekly Basis

The animation is created on the basis of the weekdays. Feel free to zoom in and out of the map.

Monthly Basis

The animation is created on the basis of the dates. Feel free to zoom in and out of the map.

Area Specific

Let's look at the animations for the Area based data.

Daily Basis

The animation is created on the basis of hours (restricted to the bbox value that is passed). Feel free to zoom in and out of the map.

Weekly Basis

The animation is created on the basis of the weekdays (restricted to the bbox value that is passed). Feel free to zoom in and out of the map.

Monthly Basis

The animation is created on the basis of the dates (restricted to the bbox value that is passed). Feel free to zoom in and out of the map.

Well, that's it for this article. You can find my work in the below links.

You can also subscribe to my newsletter for such exclusive content. Thanks all.

End

Exploring Foursquare API for Venues

Mohammed Sameeruddin — Tue, 21 Sep 2021 11:14:15 GMT

Introduction

In this article, we will try to explore an API provided by Foursquare (a company that is well known for its location-based services). With this API, we can power location-based experiences in our app or website.

Given a certain location, we can find different types of venues, from coffee shops to travel & transport. To use the API, we first need to create a developer account. Once the account is created, Foursquare issues two credentials:

Client ID
Client Secret

Both are required to fetch data for any category of venues around the location that is provided. Besides these, we also need API tokens from openrouteservice.org and mapbox.com. You can refer to my previous blog to learn how to get them.

With this blog, I would like to share my working POC which I implemented as part of the weekend project.

Credits of Cover Image - Photo by Alessandra Onisor on Unsplash

Implementation

If you prefer a video explanation, the video below covers the same material with some additional details.

https://youtu.be/HjN4j5qd8ig

`import` Packages

As always we begin with importing the packages that are essential for this POC.

API Reading

OpenRouteService Token

Mapbox Token

Foursquare Venues

Foursquare provides several venue categories. Every venue category has a unique ID (termed categoryId). The full list of categories is published in the Foursquare documentation. The image below shows an excerpt.

From the image, we can note that we have a categoryName and its categoryId.

We can refer to the site every time and fetch the details for a venue of our interest.
Or, we can save all the names and ids and create a dropdown that gives freedom for a user to select anything from the list (dropdown).

(As part of this project, I have scraped all the names and ids as per the category and saved them in a JSON file. However, if you are interested to learn how I scraped, you can check out my notebook. You can also download the data from here.)

In total, we have 10 main categories. In each category, we have sub-categories.

Foursquare Credentials

I have created my developer account and saved the secrets in my local file.

The keys of the credentials can be seen below.

Extract Main & Sub Categories

Since the data is loaded in a variable called foursquare_venues, we just need to parse the data accordingly to extract the main categories and the respective sub-categories.

Hence, the below helper functions.
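As a rough sketch of what such helpers might look like: the nested layout of foursquare_venues below, the placeholder ids, and the function names are all hypothetical, since the scraped JSON may be organized differently. The label/value dictionaries follow the option format that Dash dropdowns accept.

```python
# Hypothetical structure for the scraped JSON; ids are placeholders.
foursquare_venues = {
    "Food": {
        "id": "main-id-1",
        "sub_categories": {"Bakery": "sub-id-1", "Cafe": "sub-id-2"},
    },
    "Travel & Transport": {
        "id": "main-id-2",
        "sub_categories": {"Airport": "sub-id-3"},
    },
}

def get_main_categories(venues):
    """Return the main category names in alphabetical order."""
    return sorted(venues)

def get_sub_category_options(venues, main_category):
    """Return dropdown options (label/value pairs) for one main category."""
    subs = venues.get(main_category, {}).get("sub_categories", {})
    return [{"label": name, "value": cat_id}
            for name, cat_id in sorted(subs.items())]

print(get_main_categories(foursquare_venues))
print(get_sub_category_options(foursquare_venues, "Food"))
```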

Location Helper (Geocoder)

The main aim of the Foursquare API is to fetch venue data for a location. Without the location (coordinates), we can't fetch the data. The coordinates can come from two sources:

Device location (which comes by default)
User's interest (where the app user can change the location area)

Hence, the below helper functions.

App Initialization

JupyterDash is a companion to the Dash package that makes it possible to build and run Dash applications inside the Jupyter Notebook environment. Let's initialize the app.

The default external_stylesheets value is used for the app design. It is a classic design that works for any web application.

App Layout

The layout design for the JupyterDash application is similar to the Dash application. The variable (object) app has an attribute called layout which is used to save the whole design.

Layout Design

Layout Output

The user has to select the dropdown values to visualize the locations of the venues for the chosen place name and categories. At this point, the backend (callbacks) is not wired in yet.

App Callbacks

The callbacks (backend) for the JupyterDash application are written the same way as for a Dash application. Each callback is a decorator applied to an update function, so the function never needs to be called explicitly; Dash invokes it whenever one of its inputs changes.

In the above code, we have two callbacks.

The first one indicates the modification of the sub-category dropdown. Notice how the sub-category dropdown changes when the main category is updated.

The second one is for displaying the map.

(The above are just the images but not the working app.)

App Server

To view the application, the server of the app needs to be activated. Feel free to select different values from the dropdown to observe the change in the map result.

(Whenever you want to change the location, just mention a different place name and press enter.)

Well, that's it for this article. You can find my notebook with the full code here. You can also subscribe to my newsletter for such exclusive content. Thanks all.

End

South Africa - Crime Data Visualization

Mohammed Sameeruddin — Fri, 03 Sep 2021 09:03:09 GMT

Introduction

SAPS (South African Police Service) has provided crime data via Kaggle which contains the history of crime statistics from 2004 to 2015 per province and station in South Africa. The dataset contains a set of shapefiles and CSV files.

Data Source https://www.kaggle.com/slwessels/crime-statistics-for-south-africa

Let's see if we can generate an insightful analysis and report using this data by leveraging Python and Geemap for this project.

Credits of Cover Image - Photo by Simone Busatto on Unsplash

Implementation

We will analyze the data fully using Python. To do so, we begin with importing necessary packages.

If you are interested in watching the video explanation, then the below is for you.

https://youtu.be/2YFW3vlzvTs

`import` Packages

We are going to use packages like geopandas, geemap, etc. to read data from shapefiles easily. We will need to import these along with some additional packages.

Earth Engine Initialization

The next step is to initialize the Earth Engine module. We do this in order to visualize the spatial data with geemap.

(You will need to authenticate the earth engine package after signing up on the Google Earth Engine website).

Geocoder

In order to load the map centered on South Africa or on one of its provinces, we need a location. For that purpose, we use a geocoder to avoid hard-coded coordinate values.

(To know more about OpenRouteService, you can refer to this blog).

ORS Token

Location Fetcher

South Africa location

Data Loading / Reading

The dataset contains two CSV files and one shapefile. We can load CSV files using pandas and shapefile using geopandas or geemap.

The sizes of the data can be known using the shape attribute.

String data manipulation can be done with the apply() method.

In order to analyze the data for each province and district, we also need some additional data (shapefiles), which we can easily obtain from diva-gis.org. Because we need South African states and districts, we can simply search for those to get the shapefile data.

Exploratory Data Analysis

Since South Africa contains 9 provinces, we can graph a bar chart to visualize the population data of each province.

We can also directly visualize the population data on the map in the form of a choropleth map. We can leverage geemap to get an interactive map plot. For that, we first need to merge sa_states with popn_province_df.

Choropleth Map of Population

In the above map, the darker region indicates the higher population and vice-versa.

The list of the total number of crime categories.

If we want to know the number of cases reported at each station of a particular province for a selected crime category and year, we can write a function that takes these inputs and returns the result accordingly.
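A minimal sketch of such a function is shown here. The column names (Province, Station, Category, one column per year), the sample rows, and the function name cases_per_station are assumptions for illustration; the real CSV may be laid out differently.

```python
import pandas as pd

# Hypothetical rows and column names standing in for the crime CSV.
crime_df = pd.DataFrame({
    "Province": ["Western Cape", "Western Cape", "Eastern Cape"],
    "Station": ["Station A", "Station B", "Station C"],
    "Category": ["Murder", "Murder", "Murder"],
    "2006": [120, 85, 60],
    "2010": [100, 90, 75],
})

def cases_per_station(df, province, category, year):
    """Return the cases per station for one province, category, and year."""
    mask = (df["Province"] == province) & (df["Category"] == category)
    result = df.loc[mask, ["Station", year]].rename(columns={year: "Cases"})
    return result.sort_values("Cases", ascending=False).reset_index(drop=True)

wc_murders = cases_per_station(crime_df, "Western Cape", "Murder", "2006")
print(wc_murders)
```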

Murders in the year 2006 at Western Cape

Murders in the year 2010 at Eastern Cape

Mapping Stations per district at a selected province

We can also plot the stations where the crimes are reported, district by district, for a selected province. The following function helps us achieve that.

Murders reported in the year 2006 at the stations of Western Cape

Murders reported in the year 2010 at the stations of Eastern Cape

Choropleth Map of Districts

As a matter of fact, we can also plot a choropleth map with respect to total crimes reported in a selected province in each district irrespective of the station that the crime has been reported. This is done by filtering the data based on the year that is passed.

We will incorporate some amazing functionalities of geopandas such as spatial join.

(More details about spatial joins will be explained in the coming posts).

Murders in the year 2010 at Eastern Cape

Murders in the year 2008 at Gauteng

If we explore the first five rows of the crime data, we can see that it contains years as columns.

We shall sum all the years' data column-wise to end up having total crimes being reported per station. Using this, we can plot a choropleth map (of a different kind) based on province and districts. Again, we will incorporate spatial join techniques for district-wise visualization.
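The row-wise summation can be sketched with pandas as below. The sample rows and column names are hypothetical, and year columns are detected here simply by being all digits, which is an assumption about how they are named.

```python
import pandas as pd

# Hypothetical rows: one column per year, one row per station and category.
crime_df = pd.DataFrame({
    "Province": ["Western Cape", "Western Cape", "Eastern Cape"],
    "Station": ["Station A", "Station B", "Station C"],
    "Category": ["Murder", "Murder", "Murder"],
    "2004": [10, 5, 7],
    "2005": [12, 6, 8],
})

# Assumed rule: any column whose name is all digits is a year column.
year_cols = [col for col in crime_df.columns if col.isdigit()]

# Sum across the year columns to get total cases per station.
crime_df["Total"] = crime_df[year_cols].sum(axis=1)

# Aggregate the station totals per province.
per_province = crime_df.groupby("Province")["Total"].sum()
print(per_province)
```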

Total Murders at Western Cape

We can change the parameters in every function to get the result according to the province, year, and crime category that are passed.

You can find my notebook with the full code here.

Well, that's it for this article, you can subscribe to my newsletter for such exclusive content. Thanks all.

Exploring Open Route Service API

Mohammed Sameeruddin — Thu, 02 Sep 2021 06:26:48 GMT

Introduction

I have been working with geospatial data for quite a while, and I have always wanted to create something similar to Google Maps (at least a few of the services it provides, such as geocoding and route directions). Google Maps does provide an API for various services, but the API is not free: to use it in your applications, you have to provide credit card details, among other things. Earlier, the Google API was free to use, but the plan has since changed. However, you can switch to Mapbox services to get similar functionality for free.

In this blog, I would like to share some interesting services that OpenRouteService provides. So sit tight and read this post carefully.

Credits of Cover Image - Photo by Glenn Carstens-Peters on Unsplash

OpenRouteService Token

Steps

First, create an account on openrouteservice.org. You can sign up using your GitHub account.
Navigate to the Dashboard section and generate a free token.
(Optional) - To test the service, you can go to the examples section and try the different services. You can also get the code snippet in your favorite programming language.
Copy the API token and save it in a txt file (do not share your API token with anyone).

Mapbox Token

In order to visualize the results, we also need a Mapbox token, which we will integrate with Plotly to generate an interactive plot.

Steps

Create an account by visiting mapbox.com.
Navigate to your account (account.mapbox.com) and create your free API token.
You can also use the default token that Mapbox creates for every user.
Copy the API token and save it in a txt file (do not share your API token with anyone).

`import` Packages

Plotly Visualization

Next, we need to visualize the results from the APIs on a map, since we are fetching geospatial data. We will write a class to do so.

ScatterMapbox

Create a class called MapPlotter to visualize the geospatial data.
It contains 3 private (common) methods that are reused in different scenarios.
It contains 4 public (plotting) methods that visualize the data.

Helper Functions

We will also need helper functions to extract latitudes and longitudes separately from coordinates (list).
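A helper of this kind might look like the sketch below. It assumes each coordinate pair is ordered [longitude, latitude], as in GeoJSON; the function name and the sample route are illustrative only.

```python
def split_coordinates(coordinates):
    """Split [[lon, lat], ...] pairs into separate (lats, lons) lists.

    Assumes GeoJSON ordering, i.e. longitude first, latitude second.
    """
    lons = [point[0] for point in coordinates]
    lats = [point[1] for point in coordinates]
    return lats, lons

# Illustrative route with two points.
route = [[72.891739, 19.07032], [78.187004, 26.145594]]
lats, lons = split_coordinates(route)
print(lats, lons)
```

Keeping the two lists separate matches what map plotting functions typically expect, since they take latitudes and longitudes as separate arguments.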

OpenRouteServices `class`

The service already provides all the functionality in the form of code snippets. I just consolidated everything into one class to make it easier for me to use.

Create a class called OpenRouteService to fetch or extract data using the API.
It inherits from the MapPlotter class for visualization.
It contains 4 public methods for fetching the data in different situations.

So, everything is set now. We need to create an object and access different methods to explore the Open Route Services.

Object Creation

Geocoding Service

We can get the coordinates for any place name. A sample output can be seen below. Feel free to zoom in and zoom out.

Directions Service

We can get the directions between any two places (from an origin to a destination). A sample output can be seen below. Feel free to zoom in and zoom out.

Isochrones Service

We can get the isochrones for a place, that is, the areas that can be reached from it within a given travel time or distance. A sample output can be seen below. Feel free to zoom in and zoom out.

POIs Service

We can get the POIs (points of interest) around a place. A sample output can be seen below. Feel free to zoom in and zoom out.

Well, this was just a simple POC (Proof of Concept) that I built to explore the ORS API. You can also make use of this API in your applications.

Full Notebook https://jovian.ai/msameeruddin/gdip-open-route

That's it for this article, you can subscribe to my newsletter for such exclusive content. Thanks all.

End

#9 - Separating Numerical, Categorical, and Textual Data

Mohammed Sameeruddin — Wed, 21 Jul 2021 05:52:23 GMT

Introduction

Tabular data is stored in the form of a matrix (rows and columns), and the columns are usually not all of the same data type. Some columns may hold integers (numerical), some real values (float), and some categories (string). The main agenda of this article is to explore some lesser-known methods that work well for separating the columns based on their type.

Unfortunately, if a dataset has both categorical and textual columns, Pandas stores both with the same dtype (object), so they look alike. But with a simple hack, we can tell them apart.

If we have categorical data like ['short', 'tall', 'average'], we can still extract information from it even though each value is of string type. We can plot a pie chart or bar chart to visualize it.
But if we have textual data like ['This is the first document.', 'This document is the second document.'], we cannot do much with it directly: although each value is a string, it is actually a sentence.

With Pandas, we will see how to solve this separation task programmatically.

Credits of Cover Image - Photo by Will Francis on Unsplash

`import` Packages

import numpy as np
import pandas as pd
import random

Scenario 1

For this example, we will create our own (dummy) data set.

real_values = np.random.rand(10, 3)
num_cate = np.array([random.choice([1, 2, 3]) for i in range(10)])
str_cate = np.array([random.choice(['short', 'tall', 'average']) for i in range(10)])
# -----
df = pd.DataFrame(data=real_values, columns=['col_1', 'col_2', 'col_3'])
df['col_4'] = num_cate
df['col_5'] = str_cate

The first five rows of df look like -

col_1	col_2	col_3	col_4	col_5
0.902552	0.212317	0.208628	2	short
0.308503	0.959903	0.631784	3	tall
0.459789	0.199742	0.176000	2	tall
0.760787	0.167179	0.137287	2	average
0.199737	0.120213	0.145068	2	average

We can check the type of each column by -

df.dtypes
'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
'''

We can see that col_5 is of type object which simply represents that it is different from int and float. To take it to the next level, we can have a function that can give the columns that are numerical and string separately.

def split_categories_numericals(dframe):
    cols = list(dframe.columns)
    num_cols = list(dframe._get_numeric_data().columns)
    cate_cols = list(set(cols) - set(num_cols))
    return cate_cols, num_cols

cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
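The function above relies on _get_numeric_data(), which is a private pandas method. As an alternative sketch, the same split can be done with the public select_dtypes() method. The name split_by_dtype and the small demo frame are made up for this example. One difference worth knowing: select_dtypes(include="number") does not treat boolean columns as numeric.

```python
import pandas as pd

def split_by_dtype(dframe):
    """Split column names into non-numeric and numeric, keeping column order."""
    num_cols = dframe.select_dtypes(include="number").columns.tolist()
    cate_cols = [col for col in dframe.columns if col not in num_cols]
    return cate_cols, num_cols

# Small demo frame with the same kinds of columns as in Scenario 1.
demo = pd.DataFrame({
    "col_1": [0.1, 0.2],
    "col_4": [1, 2],
    "col_5": ["short", "tall"],
})

cate_cols, num_cols = split_by_dtype(demo)
print(cate_cols, num_cols)
```

Using a list comprehension instead of a set difference also keeps the columns in their original order.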

Scenario 2

The function split_categories_numericals() works fine for the above data set. Let's add one more column that has textual sentences as values.

sents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Hello, how are you doing?',
    'Did you go to the party yesterday?'
]
text_data = [random.choice(sents) for i in range(10)]
df['col_6'] = text_data

Now, the first five rows of the df look like -

col_1	col_2	col_3	col_4	col_5	col_6
0.902552	0.212317	0.208628	2	short	This document is the second document.
0.308503	0.959903	0.631784	3	tall	Is this the first document?
0.459789	0.199742	0.176000	2	tall	Hello, how are you doing?
0.760787	0.167179	0.137287	2	average	Did you go to the party yesterday?
0.199737	0.120213	0.145068	2	average	Hello, how are you doing?

If we check the dtypes of df, we get -

df.dtypes
'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
col_6     object
'''

We can see that col_5 and col_6 have the same type, i.e., object, because both hold string values. For col_5, we can plot a pie chart, compute likelihoods, etc. For col_6, since it holds textual data, we first need to convert it into numerical vectors and then analyze it.

Now, the ultimate question is how can we know if a data set has textual data (sentences) as values. If we try to apply split_categories_numericals(), we get -

cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5', 'col_6']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']

It considers both col_5 and col_6 as categorical.

Is there any programming hack to automatically differentiate categorical data and textual data?

The answer is Yes. We can very well achieve that with the following function.

def split_text_categories_numericals(dframe):
    cols = list(dframe.columns)
    num_cols = list(dframe._get_numeric_data().columns)
    cate_cols = list(set(cols) - set(num_cols))
    text_cols = []
    category_cols = []
    for ccol in cate_cols:
        each_col_list = dframe[ccol].str.split(' ').to_list()
        col_val_len_arr = np.array(list(map(len, each_col_list)))
        if np.any(col_val_len_arr > 1):
            text_cols.append(ccol)
        else:
            category_cols.append(ccol)
    return num_cols, category_cols, text_cols

num_cols, cate_cols, text_cols = split_text_categories_numericals(dframe=df)
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
print(cate_cols) # ['col_5']
print(text_cols) # ['col_6']

We have separated the columns with a simple hack now.

Use Cases

If a data set has columns of different types and you want to do text processing, the above function gives you exactly the columns that hold textual data (sentences).
Once the textual columns are extracted, you can process them separately and get the most out of them.

Challenges

The function split_text_categories_numericals() only works on preprocessed and cleaned data.
It may fail or misclassify columns if it is applied directly to the raw data (for example, when values are missing or a category value contains a stray space).
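One way to soften these limitations is sketched below: drop missing values, strip surrounding whitespace, and count words per value. The helper name is_text_column and the min_words threshold are assumptions made for this example, and the heuristic is still imperfect, since a multi-word category such as 'New York' would be flagged as text.

```python
import pandas as pd

def is_text_column(series, min_words=2):
    """Heuristic: a column is textual if any value has at least min_words words."""
    values = series.dropna().astype(str).str.strip()
    if values.empty:
        return False
    word_counts = values.str.split().str.len()  # split on any whitespace
    return bool((word_counts >= min_words).any())

# Raw-looking data: a missing value and a trailing space in a category.
categories = pd.Series(["short", None, "tall "])
sentences = pd.Series(["Hello, how are you doing?", None])

print(is_text_column(categories), is_text_column(sentences))
```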

End

Logistic Regression Algorithm in Practice

Mohammed Sameeruddin — Tue, 13 Jul 2021 07:42:50 GMT

Introduction

Logistic regression is a statistical model that is used extensively in binary classification tasks. It is called logistic because it uses the logistic function, also known as the sigmoid function, to map a linear score to a probability between 0 and 1. Because the sigmoid squashes large scores, points that lie far from the decision boundary have a limited influence on the model. The parameters are fitted by minimizing the cross-entropy (log loss).

Credits of Cover Image - Photo by Luis Soto on Unsplash

Mathematical Model

Given the data points, we need to classify them into two classes (binary classification). We shall find a separating hyperplane to divide the classes.

$$f(w, b) = w^Tx+ b \rightarrow (1)$$

The notation of sigmoid function can be stated as -

$$S(x) = \frac{1}{1 + e^{-x}}$$

The plot of the sigmoid function is -

Credits - The above plot is taken from Google.

If we observe the bounds of the curve, the upper bound is 1 and the lower bound is 0. The sigmoid function only returns the value in the range of 0 and 1 which is sufficient for doing binary classification.
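The bounds can be checked numerically with a short NumPy sketch (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
vals = sigmoid(xs)
print(vals)  # values approach 0 on the left and 1 on the right
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and an input of 0 gives exactly 0.5.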

If we use a sigmoid function by substituting the above approximation (1), we get -

$$\implies \frac{1}{1 + e^{-(w^Tx + b)}} \rightarrow (2)$$

Equation (2) is the actual model which is used in predicting the class label.

Given x, w, and b, we pass them to the model, which in turn returns a probability value.

If the probability value is less than 0.5, the value is predicted as 0.
If the probability value is greater than or equal to 0.5, the value is predicted as 1.
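The decision rule can be sketched in a few lines of NumPy. The function names and the values of X, w, and b are made up for illustration:

```python
import numpy as np

def predict_proba(X, w, b):
    """Equation (2): sigmoid of the linear score w^T x + b."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def predict_label(X, w, b, threshold=0.5):
    """Predict 1 when the probability is at least the threshold, else 0."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Illustrative data points and parameters.
X = np.array([[2.0, 1.0], [-1.0, -3.0], [0.0, 0.0]])
w = np.array([1.0, 0.5])
b = 0.0

labels = predict_label(X, w, b)
print(labels)
```

The third point has a linear score of exactly 0, so its probability is 0.5 and it falls into class 1.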

Everything is now defined except the two parameters w and b, which are still unknown. The error of the model depends on these two values. Here, w holds the coefficients and b is the intercept. We cannot simply assign random values to w and b. Instead, they are chosen by minimizing a loss function with the stochastic gradient descent process.

Stochastic Gradient Descent (SGD)

The goal now is to find the best values for w and b. Here w is a vector and b is a scalar. We use the cross-entropy (log loss) function as the objective to minimize.

It is stated as -

$$L = \frac{1}{n} \sum_{i=1}^n \bigg[-y_i \log\big[\sigma(w^Tx_i + b)\big] - (1 - y_i) \log\big[1 - \sigma(w^Tx_i + b)\big]\bigg] \rightarrow (3)$$

Note - Sigmoid function is often denoted as sigma.
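Equation (3) translates almost directly into NumPy. In this sketch, the clipping constant eps is an addition to avoid log(0), and the tiny dataset is made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, y, w, b, eps=1e-12):
    """Mean cross-entropy loss of equation (3)."""
    p = sigmoid(X @ w + b)
    p = np.clip(p, eps, 1 - eps)  # guard against log(0); not part of (3)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

# Made-up data: 4 points, 2 features.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7], [2.0, 2.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])

# With w = 0 and b = 0 every probability is 0.5, so the loss equals log(2).
initial_loss = cross_entropy(X, y, w=np.zeros(2), b=0.0)
print(initial_loss)
```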

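SGD is an iterative process in which we first assign initial values (possibly 0) to both w and b. Strictly speaking, the updates below average the gradient over all n training points, which makes this batch gradient descent; SGD proper would use a single point or a small mini-batch per update. At every iteration,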

For w, we differentiate (3) with respect to w and obtain dw.
For b, we differentiate (3) with respect to b and obtain db.

Mathematically, it can be represented as -

$$dw = \frac{1}{n} \sum_{i=1}^n x_i \big[\sigma(w^Tx_i + b) - y_i\big]$$

and

$$db = \frac{1}{n} \sum_{i=1}^n \big[\sigma(w^Tx_i + b) - y_i\big]$$

Note - I have actually differentiated the above on paper and verified it.
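The same verification can also be done numerically. The sketch below compares the analytic expressions for dw and db against central finite differences of the loss (3) on random data. The step size h, the random data, and the parameter values are arbitrary choices for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    p = sigmoid(X @ w + b)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def analytic_grads(X, y, w, b):
    """dw and db exactly as in the formulas above."""
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), err.mean()

def numeric_grads(X, y, w, b, h=1e-6):
    """Central finite differences of the loss with respect to w and b."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        dw[j] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * h)
    db = (loss(X, y, w, b + h) - loss(X, y, w, b - h)) / (2 * h)
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = np.array([0.3, -0.2]), 0.1

dw_a, db_a = analytic_grads(X, y, w, b)
dw_n, db_n = numeric_grads(X, y, w, b)
print(np.abs(dw_a - dw_n).max(), abs(db_a - db_n))
```

If the derivation is right, both printed differences are tiny compared with the gradients themselves.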

At every iteration, we update w and b by stepping against the gradients dw and db, and we repeat until the loss stops decreasing (or a fixed number of iterations is reached). Here α is the learning rate (called lambda_ in the code below). The update rule is -

$$w = w - \alpha * dw$$

and

$$b = b - \alpha * db$$
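As a quick illustration of the update rule, the sketch below runs it on random data and records the loss at every iteration. The data, the value of alpha, and the number of iterations are arbitrary choices for this demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
alpha = 0.1  # learning rate
losses = []

for _ in range(200):
    p = sigmoid(X @ w + b)
    losses.append(float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))))
    err = p - y
    dw = X.T @ err / len(y)   # gradient with respect to w
    db = err.mean()           # gradient with respect to b
    w = w - alpha * dw
    b = b - alpha * db

print(losses[0], losses[-1])
```

The first recorded loss is log(2), since w and b start at zero, and the loss shrinks as the updates proceed.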

Now that we have understood the complete process, let's implement the same from scratch.

Logistic Regression - Code

We will start by importing the necessary libraries as always at the start.

Libraries

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification

Data Creation

For now, we will depend on a toy dataset that we can easily create with the sklearn module.

X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=2,
    random_state=60
)
d = {'col{}'.format(i + 1) : X[:,i] for i in range(len(X[0]))}
df = pd.DataFrame(d)
df['label'] = y

Data Splitter

We need to split the data into two - the training set and the testing set. We do this by a random splitter function.

def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        dframe = dframe.sample(frac=1)
    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]
    return train_df, test_df

Train Test Split

train_df, test_df = splitter(dframe=df)

Construction

class LogisticRegression():

The name of the classifier is LogisticRegression and it is a class where we define other methods.

`__init__()` Method

    def __init__(self, train_df, test_df, label, lambda_=0.0001, n_iters=1000):
        self.lambda_ = lambda_
        self.n_iters = n_iters
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        self.n_ = len(self.X_train)
        self.w, self.b = self.find_best_params()

The above method is a constructor that takes five parameters -

train_df refers to the subset of the data that is used to train the classifier.
test_df refers to the subset of the data that is used to test the classifier.
label refers to the name of the column that holds the class label.
lambda_ refers to the learning rate (α in the update equations) that scales each parameter update during SGD.
n_iters refers to the total number of SGD iterations.

`split_features_targets()` Method

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

The above method is used to separate features and targets from the data. It takes two parameters -

df refers to the dataset (training or testing) from which the features and targets are separated.
label refers to the name of the column in df that holds the class label.

`sigmoid()` Method

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

The above method is the sigmoid function. It takes one parameter and returns a value in the range of 0 to 1.

z refers to a single value or an array of values.

`diff_params_wb()` Method

    def diff_params_wb(self, w, b):
        lm = np.dot(self.X_train, w) + b
        z = self.sigmoid(z=lm)
        w_ = (1 / self.n_) * np.dot((z - self.y_train), self.X_train)
        b_ = (1 / self.n_) * np.sum((z - self.y_train))
        return w_, b_

The above method computes the gradients of the loss with respect to the parameters (dw and db). It takes two parameters -

w refers to the current weight vector in the process of SGD.
b refers to the current intercept value in the process of SGD.

`find_best_params()` Method

    def find_best_params(self):
        ow = np.zeros_like(a=self.X_train[0])
        ob = 0
        for i in range(self.n_iters):
            w_, b_ = self.diff_params_wb(w=ow, b=ob)
            ow = ow - (self.lambda_ * w_)
            ob = ob - (self.lambda_ * b_)
        return ow, ob

The above method returns the fitted values of the parameters w and b, that is, the values reached after n_iters updates. It takes no parameters besides self. Starting from zeros, it follows the SGD update rule to refine w and b iteratively.

`draw_line()` Method

    def draw_line(self, ax):
        if (len(self.w) == 2):
            a, b = self.w
            c = self.b
            y_min = np.min(self.X_train)
            y_max = np.max(self.X_train)
            p1 = [((-b*y_min - c)/a), y_min]
            p2 = [((-b*y_max - c)/a), y_max]
            points = np.array([p1, p2])
            ax.plot(points[:, 0], points[:, 1], color='#BA4A00')
        return None

The above method is used to draw the separating hyperplane, which is a line when there are exactly two features. It takes one parameter -

ax refers to the axis on which the hyperplane is plotted.

`predict()` Method

    def predict(self, with_plot=False):
        y_test_preds = self.sigmoid(z=(np.dot(self.X_test, self.w) + self.b))
        y_test_preds_c = np.where((y_test_preds >= 0.5), 1, 0)
        y_train_preds = self.sigmoid(z=(np.dot(self.X_train, self.w) + self.b))
        y_train_preds_c = np.where((y_train_preds >= 0.5), 1, 0)
        if with_plot:
            fig = plt.figure(figsize=(10, 4))
            ax1 = fig.add_subplot(1, 2, 1)
            ax1.title.set_text('Training')
            ax1.scatter(self.X_train[:, 0], self.X_train[:, 1], c=y_train_preds_c, label='points')
            self.draw_line(ax=ax1)
            ax1.legend()
            ax2 = fig.add_subplot(1, 2, 2)
            ax2.title.set_text("Testing")
            ax2.scatter(self.X_test[:, 0], self.X_test[:, 1], c=y_test_preds_c, label='points')
            self.draw_line(ax=ax2)
            ax2.legend()
            plt.show()
        return y_test_preds_c

The above method is used to predict the class labels for the test data. It takes one parameter (which is optional) -

with_plot refers to a boolean value that decides whether to plot the separating line along with the data points.

By default, this parameter is False, which is why it is optional.

`score()` Method

    def score(self, preds):
        preds = np.array(preds)
        if (len(self.y_test) == len(preds)):
            non_z = np.count_nonzero(a=np.where((preds == self.y_test), 1, 0))
            return non_z / len(preds)
        return "Lengths do not match"

The above method computes the accuracy score, which indicates whether the model is performing well or not. It is the number of correctly classified data points divided by the total number of data points. Generally, a model whose accuracy is greater than 0.80 (80%) is considered to be a good model. It takes one parameter -

preds refers to an array of predictions (class labels).
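The same fraction can be written compactly with NumPy. This standalone sketch uses a hypothetical function name, accuracy, and raises an error on mismatched lengths instead of returning a message string, which is a design choice of this example rather than of the class above.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("Lengths do not match")
    return float(np.mean(y_true == y_pred))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 predictions are correct
```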

Full Code

class LogisticRegression():

    def __init__(self, train_df, test_df, label, lambda_=0.0001, n_iters=1000):
        self.lambda_ = lambda_
        self.n_iters = n_iters
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        self.n_ = len(self.X_train)
        self.w, self.b = self.find_best_params()

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def diff_params_wb(self, w, b):
        lm = np.dot(self.X_train, w) + b
        z = self.sigmoid(z=lm)
        w_ = (1 / self.n_) * np.dot((z - self.y_train), self.X_train)
        b_ = (1 / self.n_) * np.sum((z - self.y_train))
        return w_, b_

    def find_best_params(self):
        ow = np.zeros_like(a=self.X_train[0])
        ob = 0
        for i in range(self.n_iters):
            w_, b_ = self.diff_params_wb(w=ow, b=ob)
            ow = ow - (self.lambda_ * w_)
            ob = ob - (self.lambda_ * b_)
        return ow, ob

    def draw_line(self, ax):
        if (len(self.w) == 2):
            a, b = self.w
            c = self.b
            y_min = np.min(self.X_train)
            y_max = np.max(self.X_train)
            p1 = [((-b*y_min - c)/a), y_min]
            p2 = [((-b*y_max - c)/a), y_max]
            points = np.array([p1, p2])
            ax.plot(points[:, 0], points[:, 1], color='#BA4A00')
        return None

    def predict(self, with_plot=False):
        y_test_preds = self.sigmoid(z=(np.dot(self.X_test, self.w) + self.b))
        y_test_preds_c = np.where((y_test_preds >= 0.5), 1, 0)
        y_train_preds = self.sigmoid(z=(np.dot(self.X_train, self.w) + self.b))
        y_train_preds_c = np.where((y_train_preds >= 0.5), 1, 0)
        if with_plot:
            fig = plt.figure(figsize=(10, 4))
            ax1 = fig.add_subplot(1, 2, 1)
            ax1.title.set_text('Training')
            ax1.scatter(self.X_train[:, 0], self.X_train[:, 1], c=y_train_preds_c, label='points')
            self.draw_line(ax=ax1)
            ax1.legend()
            ax2 = fig.add_subplot(1, 2, 2)
            ax2.title.set_text("Testing")
            ax2.scatter(self.X_test[:, 0], self.X_test[:, 1], c=y_test_preds_c, label='points')
            self.draw_line(ax=ax2)
            ax2.legend()
            plt.show()
        return y_test_preds_c

    def score(self, preds):
        preds = np.array(preds)
        if (len(self.y_test) == len(preds)):
            non_z = np.count_nonzero(a=np.where((preds == self.y_test), 1, 0))
            return non_z / len(preds)
        return "Lengths do not match"

Logistic Regression - Testing

We have already created a toy data set. We just need to test the model on that data.

Note - The data that we created is random. Results may differ for each execution.

Object Creation

lr = LogisticRegression(
    train_df=train_df,
    test_df=test_df,
    label='label',
    lambda_=0.001,
    n_iters=100
)

Weights

print(lr.w)
# [-3.77376685e-02 -5.56964363e-06]

Intercept

print(lr.b)
# 0.0017126093242175264

Prediction

preds = lr.predict(with_plot=True)

Training & Testing Plot

Performance Score

print(lr.score(preds=preds))
# 0.84

On this run the accuracy is 0.84 (84%), which is a decent score, so the model performs well on this toy data set.

Challenges

Well, this entire code has been developed from scratch, so it will likely not be as efficient as the library methods. But it was good to understand the mathematics behind the work.

My code is slow.
In this article, we considered linearly separable data. In the case of non-linearly separable data, we need to do feature transformation.
No regularization is implemented.
Outliers do not have much impact because of the sigmoid function that is used, but we can still remove them to avoid problems.
In real life, the data will not be like a toy data set. It will be completely different and more challenging.

References

YouTube video https://bit.ly/2UDmtZM
Wikipedia article https://en.wikipedia.org/wiki/Logistic_regression

End

Linear Regression Algorithm in Practice

Mohammed Sameeruddin — Tue, 22 Jun 2021 12:53:26 GMT

Introduction

Regression analysis is the process of predicting a response variable from one or more explanatory variables. The response variable is also called the dependent variable, and an explanatory variable is also called an independent variable. When a problem has multiple explanatory variables and one response variable, the process is known as Multiple Linear Regression. In contrast, when it has only one explanatory variable and one response variable, it is known as Simple Linear Regression.

Note - For linear regression to work well, there needs to be a reasonably strong linear relationship between the explanatory variable and the response variable.

Unlike a classification task, where we predict the class label, here we predict a real (continuous) value for the new unseen data.

Credits of Cover Image - Photo by Stary Smok on Unsplash

Mathematical Model

Let's denote the response variable as y and the explanatory variable as X. Here, X can hold either a single feature or multiple features.

In the case of X having one feature, the model would be -

$$\hat{y} = w^Tx + b \rightarrow (1)$$

In the case of X having multiple features, the model would be -

$$X = [x_1, x_2, x_3, \dots, x_n]$$

$$\hat{y} = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$$

$$\hat{y} = \sum_{i=1}^n w_ix_i + b$$

$$\hat{y} = W^TX + b \rightarrow (2)$$

The mathematical model (1) is used for simple linear regression tasks, whereas (2) is used for multiple linear regression tasks. In both cases, ŷ is the predicted value of the target feature.

The actual response variable is denoted as y. A machine learning model rarely predicts it exactly, so every prediction carries some error. The main focus, therefore, is to keep that error as small as possible.

$$y = \hat{y} + E \rightarrow (3)$$

Substituting (1) in (3) for simple linear regression, we get -

$$y = w^Tx + b + E \rightarrow (4)$$

Substituting (2) in (3) for multiple linear regression, we get -

$$y = W^TX + b + E \rightarrow (5)$$

The error E is the difference between the actual target feature and the predicted target feature.

$$E = (y - \hat{y})$$
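To make the error term concrete, here is a minimal sketch with made-up numbers (the arrays `y` and `y_hat` are assumptions for illustration, not values from the article's data). It computes the per-sample error E and then averages the squared errors into a single number.

```python
import numpy as np

# Hypothetical actual and predicted values, made up for illustration.
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 8.0])

# Per-sample error E = y - y_hat.
E = y - y_hat

# Mean of the squared errors: one number summarizing how far off we are.
mse = np.mean(E ** 2)

print(E)
print(mse)
```

Squaring keeps positive and negative errors from cancelling each other out when they are averaged.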

Stochastic Gradient Descent (SGD)

Let's consider the model equation (1) for understanding optimization.

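SGD is an iterative process where we initially assign starting values (random, or simply 0) to both w and b. What we minimize is the mean squared error between the actual and the predicted values -

$$L(w, b) = \frac{1}{N} \sum_{i=1}^N \big[y_i - (w^Tx_i + b) \big]^2$$

Strictly speaking, the code in this article computes the gradient over the whole training set at every iteration, which is batch gradient descent; SGD proper estimates the gradient from a single sample or a small batch. The update rule is the same in both cases.

At every iteration,

For w, we differentiate L with respect to w and obtain dw.
For b, we differentiate L with respect to b and obtain db.

Mathematically, it can be represented as -

$$dw = \frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^N x_i \big[(w^Tx_i + b) - y_i \big]$$

and

$$db = \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^N \big[(w^Tx_i + b) - y_i \big]$$

We update w and b at every iteration by stepping against dw and db, scaled by a learning rate α, until the values stop changing noticeably (or a fixed number of iterations is reached). The updating process can be understood in the following way.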

$$w = w - \alpha * dw$$

and

$$b = b - \alpha * db$$
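The update rule can be tried on a tiny example before building the full class. This is only a sketch under stated assumptions: the function name `gradient_descent`, the data, and the values of `alpha` and `n_iters` are chosen here for illustration. The data follows y = 2x + 1 exactly, so the loop should recover w close to 2 and b close to 1.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=2000):
    """Fit y = w*x + b for a single feature by repeated gradient steps."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = (w * x + b) - y            # prediction minus actual value
        dw = (2 / n) * np.sum(x * error)   # gradient with respect to w
        db = (2 / n) * np.sum(error)       # gradient with respect to b
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Noise-free toy data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = gradient_descent(x, y)
print(w, b)
```

If `alpha` is set too high the updates overshoot and diverge; if it is too low, many more iterations are needed.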

The entire process is the same for obtaining the optimal values of W and b for the model equation (2).

Now that we have understood the complete process, let's implement the same from scratch.

Linear Regression - Code

We will start by importing the necessary libraries as always at the start.

Libraries

import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import numpy as np
import imageio
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression

Data Creation

For now, we will depend on a toy dataset that we can easily create with scikit-learn.

X, y = make_regression(n_samples=200, n_features=1, noise=5, random_state=15)
df = pd.DataFrame(dict(col1=X[:, 0], col2=y))

Data Splitter

We need to split the data into two - the training set and the testing set. We do this by a random splitter function.

def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        dframe = dframe.sample(frac=1)
    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]
    return train_df, test_df

Train Test Split

train_df, test_df = splitter(dframe=df)
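Because the shuffle is unseeded, every run produces a different split. If repeatable results are wanted, one option is to pass a seed to the shuffle. The sketch below is a variant written for illustration; the name `seeded_splitter`, its `seed` parameter, and `demo_df` are assumptions and not part of the original code.

```python
import pandas as pd

def seeded_splitter(dframe, percentage=0.8, seed=15):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    # random_state makes DataFrame.sample return the same order every time.
    shuffled = dframe.sample(frac=1, random_state=seed)
    thresh = round(len(shuffled) * percentage)
    return shuffled.iloc[:thresh], shuffled.iloc[thresh:]

# Small made-up frame just to show the behaviour.
demo_df = pd.DataFrame({'col1': range(10), 'col2': range(10, 20)})

train_a, test_a = seeded_splitter(demo_df)
train_b, test_b = seeded_splitter(demo_df)

print(len(train_a), len(test_a))
print(train_a.equals(train_b))
```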

Construction

class LinearRegression():

The name of the regression is LinearRegression and it is a class where we define other methods.

`__init__()` Method

    def __init__(self, train_df, test_df, label, lambda_=0.001, n_iters=1000):
        self.n_iters = n_iters
        self.lambda_ = lambda_
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        self.n_train = len(self.X_train)
        self.n_test = len(self.X_test)
        self.wb_params = {'w' : [], 'b' : []}
        self.w, self.b = self.find_best_params()

The above method is a constructor that takes five parameters -

train_df refers to the subset of the data that is used to train the regressor.
test_df refers to the subset of the data that is used to test the regressor.
label refers to the name of the column that holds the target values.
lambda_ refers to the learning rate (α in the update equations) that scales each parameter update during the process of SGD.
n_iters refers to the total number of iterations of the process of SGD.

`split_features_targets()` Method

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

The above method is used to separate features and targets from the data. It takes two parameters -

df refers to the dataset (training or testing) that needs to be split.
label refers to the name of the column in df that holds the target values.

`diff_params_wb()` Method

    def diff_params_wb(self, w, b):
        lm = np.dot(self.X_train, w) + b
        w_ = (2 / self.n_train) * np.dot(self.X_train.T, (lm - self.y_train))
        b_ = (2 / self.n_train) * np.sum((lm - self.y_train))
        return w_, b_

The above method computes the gradients of the mean squared error with respect to the parameters. It takes two parameters -

w refers to the current weight vector at this step of SGD.
b refers to the current intercept value at this step of SGD.
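A common way to gain confidence in hand-derived gradients is to compare them with finite differences. The sketch below does this for the mean squared error; the helper names `mse` and `analytic_grads`, the random data, and the step size `eps` are all assumptions made for this illustration.

```python
import numpy as np

def mse(X, y, w, b):
    """Mean squared error of the linear model X @ w + b."""
    return np.mean((X @ w + b - y) ** 2)

def analytic_grads(X, y, w, b):
    """Closed-form gradients of the MSE, same formulas as in the article."""
    err = X @ w + b - y
    n = len(X)
    return (2 / n) * (X.T @ err), (2 / n) * np.sum(err)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
w = np.array([0.5, -0.3])
b = 0.1
eps = 1e-6

dw, db = analytic_grads(X, y, w, b)

# Central differences: nudge one parameter at a time and watch the loss.
num_dw = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    num_dw[j] = (mse(X, y, w + step, b) - mse(X, y, w - step, b)) / (2 * eps)
num_db = (mse(X, y, w, b + eps) - mse(X, y, w, b - eps)) / (2 * eps)

print(dw, num_dw)
print(db, num_db)
```

The two pairs of numbers should agree to several decimal places; a large gap usually points to a sign or scaling mistake in the derivation.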

`find_best_params()` Method

    def find_best_params(self):
        ow = np.zeros_like(a=self.X_train[0])
        ob = 0
        for i in range(self.n_iters):
            w_, b_ = self.diff_params_wb(w=ow, b=ob)
            ow = ow - (self.lambda_ * w_)
            ob = ob - (self.lambda_ * b_)
            self.wb_params['w'].append(ow)
            self.wb_params['b'].append(ob)
        return ow, ob

`predict()` Method

    def predict(self, with_plot=False, save_process=False):
        y_test_preds = np.dot(self.X_test, self.w) + self.b
        y_train_preds = np.dot(self.X_train, self.w) + self.b
        if with_plot:
            fig = plt.figure(figsize=(10, 4))
            ax1 = fig.add_subplot(1, 2, 1)
            ax1.title.set_text('Training')
            ax1.scatter(self.X_train, self.y_train, label='points')
            ax1.plot(self.X_train, y_train_preds, color='red', label='best fit')
            ax1.legend()
            ax2 = fig.add_subplot(1, 2, 2)
            ax2.title.set_text("Testing")
            ax2.scatter(self.X_test, self.y_test, label='points')
            ax2.plot(self.X_test, y_test_preds, color='red', label='best fit')
            ax2.legend()
            plt.show()
        if save_process:
            self.save_process_togif(test_x=self.X_test, test_y=self.y_test)
        return y_test_preds

The above method is used to predict the real (continuous) value for the new unseen data. It takes two optional parameters -

with_plot refers to a boolean value that decides whether to plot the best fit line along with the data points.
save_process refers to a boolean value that decides whether to save the process of SGD in GIF format.

Both parameters default to False and are therefore optional.

`save_process_togif()` Method

    def save_process_togif(self, test_x, test_y):
        wp = self.wb_params['w']
        bp = self.wb_params['b']
        frame_files = []
        # Draw one frame for every 50th entry of the parameter history.
        for c, i in enumerate(range(0, len(wp), 50), start=1):
            test_p = np.dot(test_x, wp[i]) + bp[i]
            fig = plt.figure(figsize=(10, 10))
            plt.title("Testing")
            plt.scatter(test_x, test_y, label='points')
            plt.plot(test_x, test_p, color='red', label='best fit')
            plt.legend()
            # Zero-padded names keep the frames in iteration order.
            frame_name = '{:02d}-lr-plot.png'.format(c)
            plt.savefig(frame_name)
            plt.close(fig)
            frame_files.append(frame_name)
        # Read back only the frames written above, not every PNG in the folder.
        png_gif = [imageio.imread(f) for f in frame_files]
        kargs = {'duration': 1}
        gif_name = 'process-lin-reg.gif'
        imageio.mimsave(gif_name, png_gif, **kargs)
        # os.path.join uses the right path separator on any operating system.
        print('Process saved in  ', os.path.join(os.getcwd(), gif_name))
        return None

The above method is used to save the process of SGD in the form of GIF. It takes two parameters -

test_x refers to the features whose target needs to be predicted.
test_y refers to the actual target values.

`score()` Method

    def score(self, preds):
        preds = np.array(preds)
        if (len(self.y_test) == len(preds)):
            y_act_mean = np.mean(self.y_test)
            sst = np.sum((self.y_test - y_act_mean) ** 2)
            ssr = np.sum((self.y_test - preds) ** 2)
            return (1 - (ssr / sst))
        return "Lengths do not match"

The above method is used to calculate the R^2. It takes one parameter -

preds refers to the predicted values that are used to calculate R^2.
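To get a feel for what R^2 = 1 - SSR/SST reports, it helps to look at two extreme cases. The standalone function `r_squared` below is a sketch written for this illustration (it mirrors the formula used in score() but is not the class method itself), and the numbers are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSR / SST, with SST measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    ssr = np.sum((y_true - y_pred) ** 2)
    return 1 - ssr / sst

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting every value exactly leaves no residual error.
perfect = r_squared(y_true, y_true)

# Always predicting the mean explains none of the variance.
baseline = r_squared(y_true, np.full(4, y_true.mean()))

print(perfect, baseline)
```

A score near 1 means the line explains most of the variation in the target, a score near 0 means it does no better than predicting the mean, and the score can even go negative for a model worse than that.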

Full Code

class LinearRegression():    def __init__(self, train_df, test_df, label, lambda_=0.001, n_iters=1000):        self.n_iters = n_iters        self.lambda_ = lambda_        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)        self.X_train = self.X_train.values        self.y_train = self.y_train.values        self.X_test = self.X_test.values        self.y_test = self.y_test.values        self.n_train = len(self.X_train)        self.n_test = len(self.X_test)        self.wb_params = {'w' : [], 'b' : []}        self.w, self.b = self.find_best_params()    def split_features_targets(self, df, label):        X = df.drop(columns=[label], axis=1)        y = df[label]        return X, y    def diff_params_wb(self, w, b):        lm = np.dot(self.X_train, w) + b        w_ = (2 / self.n_train) * np.dot(self.X_train.T, (lm - self.y_train))        b_ = (2 / self.n_train) * np.sum((lm - self.y_train))        return w_, b_    def find_best_params(self):        ow = np.zeros_like(a=self.X_train[0])        ob = 0        for i in range(self.n_iters):            w_, b_ = self.diff_params_wb(w=ow, b=ob)            ow = ow - (self.lambda_ * w_)            ob = ob - (self.lambda_ * b_)            self.wb_params['w'].append(ow)            self.wb_params['b'].append(ob)        return ow, ob    def predict(self, with_plot=False, save_process=False):        y_test_preds = np.dot(self.X_test, self.w) + self.b        y_train_preds = np.dot(self.X_train, self.w) + self.b        if with_plot:            fig = plt.figure(figsize=(10, 4))            ax1 = fig.add_subplot(1, 2, 1)            ax1.title.set_text('Training')            ax1.scatter(self.X_train, self.y_train, label='points')            ax1.plot(self.X_train, y_train_preds, color='red', label='best fit')            ax1.legend()            ax2 = fig.add_subplot(1, 2, 2)            ax2.title.set_text("Testing")          
  ax2.scatter(self.X_test, self.y_test, label='points')            ax2.plot(self.X_test, y_test_preds, color='red', label='best fit')            ax2.legend()            plt.show()        if save_process:            self.save_process_togif(test_x=self.X_test, test_y=self.y_test)        return y_test_preds    def save_process_togif(self, test_x, test_y):        wp = self.wb_params['w']        bp = self.wb_params['b']        c = 0        for i in range(0, len(wp), 50):            c += 1            d = '0' + str(c) if (len(str(c)) == 1) else str(c)            test_p = np.dot(test_x, wp[i]) + bp[i]            fig = plt.figure(figsize=(10, 10))            plt.title("Testing")            plt.scatter(test_x, test_y, label='points')            plt.plot(test_x, test_p, color='red', label='best fit')            plt.legend()            plt.savefig('{}-lr-plot.png'.format(d))            plt.close(fig)        path = os.getcwd()        files_list = os.listdir(path=path)        png_list = [i for i in files_list if (i[0] != '.') and (i.split('.')[1] == 'png')]        png_list.sort()        png_gif = [imageio.imread(i) for i in png_list]        kargs = {'duration': 1}        gif_name = 'process-lin-reg.gif'        imageio.mimsave(gif_name, png_gif, **kargs)        print('Process saved in  ', path + '\\' + gif_name)        return None    def score(self, preds):        preds = np.array(preds)        if (len(self.y_test) == len(preds)):            y_act_mean = np.mean(self.y_test)            sst = np.sum((self.y_test - y_act_mean) ** 2)            ssr = np.sum((self.y_test - preds) ** 2)            return (1 - (ssr / sst))        return "Lengths do not match"

Linear Regression - Testing

We have already created a toy data set. We just need to test the model on that data.

Note: The data that we created is random. Results may differ for each execution.

Object Creation

lr = LinearRegression(
    train_df=train_df,
    test_df=test_df,
    label='col2',
    lambda_=0.01,
    n_iters=1000
)

Weights

print(lr.w)
# [5.90699167]

Intercept

print(lr.b)
# -0.5282539569012746

Prediction

preds = lr.predict(with_plot=True, save_process=True)

Training & Testing Plot

The regression line fits the data points reasonably well.

SGD Process Demonstration

The test values are used to visualize the process of SGD. As the values of w and b change at every iteration, the best fit line changes accordingly. At some point the line stops changing, which basically means that w and b have settled at the values that minimize the error.

Performance Score

print(lr.score(preds=preds))
# 0.6830871781610702

The R^2 score is about 0.68 for the data whose shape is (200, 2). Note that R^2 measures the share of variance explained by the model rather than accuracy. With a larger data set, the score may well be different.

Challenges

My code is slow.
No regularization is implemented, although there are several types of it.
- Linear regression with L1 regularization is known as Lasso regression.
- Linear regression with L2 regularization is known as Ridge regression.
- Linear regression with both L1 and L2 regularizations is known as ElasticNet regression.
If outliers are present in the data, it may impact the model significantly.
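As a rough idea of what adding regularization would involve, the sketch below adds an L2 penalty on the weights to the same gradient descent loop. Everything here is an assumption for illustration: the function `fit`, the penalty strength `l2`, and the synthetic data are not part of the article's class, and the intercept is left unpenalized by choice.

```python
import numpy as np

def fit(X, y, alpha=0.05, n_iters=5000, l2=0.0):
    """Gradient descent on MSE + l2 * ||w||^2 (intercept not penalized)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        err = X @ w + b - y
        # The penalty l2 * ||w||^2 contributes 2 * l2 * w to the gradient.
        dw = (2 / n) * (X.T @ err) + 2 * l2 * w
        db = (2 / n) * np.sum(err)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

# Synthetic data with a true slope of 3 and a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

w_plain, _ = fit(X, y)
w_l2, _ = fit(X, y, l2=1.0)

print(w_plain, w_l2)
```

The penalized weight comes out smaller in magnitude than the plain one, which is the shrinkage effect that regularization is meant to provide.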

References

YouTube video https://bit.ly/2TRmnxd
Wikipedia article https://en.wikipedia.org/wiki/Linear_regression

End

K-Nearest-Neighbors Classification Algorithm in Practice

Mohammed Sameeruddin — Tue, 08 Jun 2021 14:33:16 GMT

Introduction

KNN is a supervised classification algorithm that predicts which category a query point belongs to, given a set of data points and their corresponding categories (class labels). From the perspective of classification tasks, KNN serves the same purpose as Naive Bayes but differs in technique and implementation. In Naive Bayes, we compute the likelihood by probabilistic methods, whereas in KNN, we compute distances between points and let the nearest ones vote.

One great advantage of KNN is that, although it is mostly used for classification tasks, it can be extended to regression tasks as well. In this article, we will try to understand and implement KNN from scratch.

Note: If you want to know how Naive Bayes works, then I would strongly recommend checking out my blog post - Naive Bayes in Practice.

Credits of Cover Image - Photo by Tom Rumble on Unsplash

General Intuition

Let's imagine we have 2 sets of points separated as per the category. Now, given a new point, how can you predict or guess which category it belongs to?

The task of putting the query point in the right class label can be tricky. We cannot just randomly assign the class label.

Mathematically, we have to compute the distance between the query point and each data point. We then choose a number of neighbors, k, and select the k points with the smallest distances. Let's say k is 5.

What we can see is that the query point is nearer to the blue category with 3 points and the orange category with 2 points. Now, when we apply the majority voting technique, the class label of the query point will be blue since 3 is greater than 2.

The same concept is used in developing the algorithm. Now that we have understood the theory, let's implement this totally from scratch.
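Before the full class, the intuition above can be sketched in a few lines. The points, labels, and the function name `knn_predict` are made up for this illustration; the data is arranged so that the 5 nearest neighbors of the query point are 3 blue and 2 orange, as in the description.

```python
from collections import Counter

import numpy as np

def knn_predict(points, labels, query, k=5):
    """Return the majority label among the k points nearest to query."""
    # Euclidean distance from the query to every point.
    dists = np.linalg.norm(points - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # blue cluster
    [6.0, 6.0], [7.0, 6.5], [6.5, 7.0],   # orange cluster
])
labels = np.array(['blue', 'blue', 'blue', 'orange', 'orange', 'orange'])
query = np.array([2.0, 2.0])

pred = knn_predict(points, labels, query, k=5)
print(pred)
```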

Distance Measurements - Theory

Well, to compute the distance between two points, we have various methods. Here, we will just select a few distance measurement metrics.

Types of distance metrics

Euclidean Distance
Manhattan Distance
Minkowski Distance
Hamming Distance
Cosine Distance

1. Euclidean Distance

The length of a line segment between any two points in the coordinate system is Euclidean distance. This metric computes the shortest distance between two points. This is also referred to as the L2 norm.

Mathematical formula

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between p and q can be represented as -

$$pq = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

The same pattern is followed in the case of higher-dimensional vectors.

2. Manhattan Distance

Manhattan distance follows the concept of Taxicab Geometry. The usual Euclidean distance metric is replaced in such a way that the distance between points is actually the sum of absolute differences between the points. This is also referred to as the L1 norm.

Mathematical formula

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between p and q can be represented as -

$$pq = |x_1 - x_2| + |y_1 - y_2|$$

The same pattern is followed in the case of higher-dimensional vectors.

3. Minkowski Distance

Minkowski distance is the generalized form of both Euclidean and Manhattan distance metrics. This is also called the Lp norm. If the value of p is 1 then it is known as Manhattan, if p is 2 then Euclidean, and so on.

Mathematical formula

Let

$$r = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between r and q can be represented as -

$$rq = \sqrt[p]{|x_1 - x_2|^p + |y_1 - y_2|^p}$$

The same pattern is followed in the case of higher-dimensional vectors.
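A quick numeric check shows how the Minkowski distance reduces to the Manhattan and Euclidean distances. The function name `minkowski` and the two points are assumptions chosen for this sketch (a 3-4-5 right triangle makes the results easy to verify by hand).

```python
import numpy as np

def minkowski(a, b, p):
    """Lp distance: p-th root of the summed absolute differences to the power p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

r, q = [0, 0], [3, 4]

d1 = minkowski(r, q, 1)   # Manhattan: |3| + |4|
d2 = minkowski(r, q, 2)   # Euclidean: sqrt(9 + 16)
d3 = minkowski(r, q, 3)   # a higher order gives a smaller value here

print(d1, d2, d3)
```

The absolute value matters for odd p, since a negative difference raised to an odd power would otherwise stay negative.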

4. Hamming Distance

The Hamming distance metric is recommended when we are dealing with binary vectors; on non-binary (real-valued) vectors it is usually a poor choice. It is the number of positions at which the corresponding bits of two vectors differ, provided both vectors have the same length.

Mathematical formula

Let

$$p = [0, 1, 1, 0, 1, 0, 0]$$

and

$$q = [1, 0, 1, 0, 1, 0, 1]$$

the distance between p and q can be represented as -

the number of positions at which the two vectors differ. Here p and q differ at positions 1, 2, and 7, so the distance is 3.
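For the two example vectors, the count can be checked with a one-line NumPy comparison. This is a small sketch; the variable name `hamming` is chosen here for illustration.

```python
import numpy as np

p = np.array([0, 1, 1, 0, 1, 0, 0])
q = np.array([1, 0, 1, 0, 1, 0, 1])

# p != q marks each position where the bits differ; summing counts them.
hamming = int(np.sum(p != q))

print(hamming)
```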

5. Cosine Distance

In order to find the cosine distance, we first have to find the cosine similarity. Cosine similarity is the cosine of the angle between two non-zero vectors, computed from their inner product and their lengths.

Mathematical formula

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the cosine similarity between p and q can be represented as -

$$\text{cosine-sim}(p, q) = \frac{p \cdot q}{||p||_2 \ ||q||_2} = \frac{p \cdot q}{\sqrt{p \cdot p} \ \sqrt{q \cdot q}}$$

and cosine distance between p and q can be represented as -

$$\text{cosine-dist}(p, q) = 1 - \text{cosine-sim}(p, q)$$

The same pattern is followed in the case of higher-dimensional vectors.
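Three simple cases illustrate the range of the cosine distance. The function name `cosine_distance` and the example vectors are assumptions made for this sketch.

```python
import numpy as np

def cosine_distance(p, q):
    """1 - cosine similarity of two non-zero vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return 1 - sim

same_dir = cosine_distance([1, 2], [2, 4])     # same direction
orthogonal = cosine_distance([1, 0], [0, 1])   # at right angles
opposite = cosine_distance([1, 0], [-1, 0])    # opposite directions

print(same_dir, orthogonal, opposite)
```

Vectors pointing the same way give a distance of about 0, perpendicular vectors give 1, and opposite vectors give 2. Only the direction matters, not the length of the vectors.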

Distance Measurements - Code

Coding the above distance metrics becomes easy with NumPy, a package for numerical computation and linear algebra. Once it is installed, we can use it by importing the package.

Full code

import numpy as np

class DistanceMeasures():
    def __init__(self, point1, point2):
        """
        :param array point1: Point
        :param array point2: Point
        """
        self.point1 = np.array(point1)
        self.point2 = np.array(point2)

    def euclidean_measure(self):
        # Every measure returns None when the two points differ in length.
        flag = (len(self.point1) == len(self.point2))
        if flag:
            dist = np.linalg.norm(self.point1 - self.point2)
            return round(dist, 4)
        return None

    def manhattan_measure(self):
        flag = (len(self.point1) == len(self.point2))
        if flag:
            dist = np.abs(self.point1 - self.point2).sum()
            return round(dist, 4)
        return None

    def minkowski_measure(self, p):
        flag = (len(self.point1) == len(self.point2))
        # Reject mismatched lengths OR a non-positive order p.
        if (not flag) or (p <= 0):
            return None
        if (p == 1):
            return self.manhattan_measure()
        elif (p == 2):
            return self.euclidean_measure()
        dist = np.sum(np.abs(self.point1 - self.point2) ** p) ** (1 / p)
        return round(dist, 4)

    def hamming_measure(self):
        flag = (len(self.point1) == len(self.point2))
        if flag:
            return np.sum(self.point1 != self.point2)
        return None

    def cosine_measure(self):
        flag = (len(self.point1) == len(self.point2))
        if flag:
            nume = np.dot(a=self.point1, b=self.point2)
            denome = np.sqrt(np.dot(a=self.point1, b=self.point1)) * np.sqrt(np.dot(a=self.point2, b=self.point2))
            return 1 - (nume / denome)
        return None

Custom KNN - Code

We have already understood the intuition of how this algorithm works. Now, we just need to apply that concept in the code. Let's do that.

Note: This article doesn't cover the performance metrics or hyperparameter tuning of the model.

Libraries

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification

Construction

class KNN():

The name of the classifier is KNN and it is a class where we define other methods.

`__init__()` Method

    def __init__(self, n_neighbors, train_df, test_df, label, metric='euclidean'):
        """
        :param int n_neighbors: Number of nearest neighbors
        :param str metric: Distance measurement metric.
            Accepts `euclidean`, `manhattan`, `minkowski`, `hamming`, and `cosine` methods
        """
        self.n_neighbors = n_neighbors
        self.available_metrics = ['euclidean', 'manhattan', 'minkowski', 'hamming', 'cosine']
        self.metric = metric if metric in self.available_metrics else 'euclidean'
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values

The above method is a constructor that takes five parameters -

n_neighbors refers to the number of top nearest neighbors that are used to predict the class label.
train_df refers to the subset of the data that is used to train the classifier.
test_df refers to the subset of the data that is used to test the classifier.
label refers to the name of the column that holds the class labels.
metric refers to the type of distance metric that is used to compute the distance between points or vectors.

`split_features_targets()` Method

    def split_features_targets(self, df, label):
        """
        :param DataFrame df: The main dataset
        :param str label: The column name which signifies class labels
        :return: X, y
        """
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

The above method is used to separate features and targets from the data. It takes two parameters -

df refers to the dataset (training or testing) that needs to be split.
label refers to the name of the column in df that holds the class labels.

`most_freq()` Method

    def most_freq(self, l):
        """
        :param list l: List of values (can have duplicates)
        :return: Most occurred element from l
        """
        return max(set(l), key=l.count)

The above method is used to get the most frequent element in the given list. Basically, this method replicates the majority voting technique that predicts the class label. It takes one parameter -

l refers to the list of elements (class labels).
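An alternative way to write the same majority vote is with `collections.Counter` from the standard library, which counts all labels in a single pass instead of calling `l.count` once per distinct label. The standalone function below is a sketch for comparison, not the method used in the class.

```python
from collections import Counter

def most_freq(labels):
    """Return the label that occurs most often in labels."""
    # most_common(1) returns a list holding one (label, count) pair.
    return Counter(labels).most_common(1)[0][0]

winner = most_freq(['blue', 'orange', 'blue', 'blue', 'orange'])
print(winner)
```

When two labels are tied, `Counter` returns the one that was encountered first, so choosing an odd number of neighbors for a two-class problem avoids relying on that tie-break.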

`find_nearest_neighbors()` Method

    def find_nearest_neighbors(self, X_new):
        """
        :param array X_new: Array/List of each feature
        :return: distance_labels - dictionary of {training row index: class label}
        """
        distances = []
        for X_row in self.X_train:
            dm = DistanceMeasures(point1=X_row, point2=X_new)
            if (self.metric == 'euclidean'):
                distances.append(dm.euclidean_measure())
            elif (self.metric == 'manhattan'):
                distances.append(dm.manhattan_measure())
            elif (self.metric == 'minkowski'):
                distances.append(dm.minkowski_measure(p=3))
            elif (self.metric == 'hamming'):
                distances.append(dm.hamming_measure())
            else:
                distances.append(dm.cosine_measure())
        # Rank the training rows by distance and keep the closest n_neighbors.
        ranked = sorted(enumerate(distances), key=lambda x: x[1])[:self.n_neighbors]
        # Key by row index rather than by distance, so that neighbors sharing
        # the same distance do not overwrite each other in the dictionary.
        distance_labels = {idx: self.y_train[idx] for (idx, _) in ranked}
        return distance_labels

The above method is used to find the top nearest neighbors for the query point with respect to all the data points present in the training data. It takes one parameter -

X_new refers to the query point for which the class label is predicted.
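The neighbor search can also be written without an explicit Python loop by letting NumPy compute all distances at once. The sketch below covers only the Euclidean case; the function name `nearest_neighbor_labels` and the small data set are assumptions for illustration. The example deliberately places three training points at the same distance from the query to show that all of them are kept.

```python
import numpy as np

def nearest_neighbor_labels(X_train, y_train, x_new, k):
    """Return labels and distances of the k training rows closest to x_new."""
    # Broadcasting subtracts x_new from every row of X_train in one step.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # A stable sort keeps tied rows in their original order.
    order = np.argsort(dists, kind='stable')[:k]
    return y_train[order], dists[order]

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [3.0, 3.0]])
y_train = np.array([0, 1, 1, 0])
x_new = np.array([0.0, 0.0])

labels, dists = nearest_neighbor_labels(X_train, y_train, x_new, k=3)
print(labels, dists)
```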

`predict()` Method

    def predict(self):
        if (len(self.X_test) == 1):
            # A single test row: pass the 1-D row itself, not the 2-D array.
            odl = self.find_nearest_neighbors(X_new=self.X_test[0])
            return self.most_freq(l=list(odl.values()))
        preds = []
        for test in self.X_test:
            odl = self.find_nearest_neighbors(X_new=test)
            preds.append(self.most_freq(l=list(odl.values())))
        return np.array(preds)

The above method is used to predict the class labels for the X_test data points. It takes no parameters but uses the methods like most_freq() and find_nearest_neighbors() in order to complete the task.

`score()` Method

    def score(self, preds, with_plot=False):
        """
        :param array preds: Predictions
        :param boolean with_plot: True/False
        :return: accuracy level
        """
        if (len(self.y_test) == len(preds)):
            if with_plot:
                self.plot_it(preds=preds)
            return sum([1 if (i == j) else 0 for (i, j) in zip(self.y_test, preds)]) / len(preds)
        return "Lengths do not match"

The above method is used to compute the accuracy score, which indicates whether a model is performing well or not. It is the ratio of the number of correctly classified data points to the total number of data points. As a rough rule of thumb, a model whose accuracy is greater than 0.80 (80%) is often considered a good model, although the right bar depends on the problem. It takes two parameters -

preds refers to an array of predictions (class labels).
with_plot refers to a boolean value from which data visualization is decided.
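The accuracy computation itself is small enough to check by hand. In the sketch below the arrays `y_test` and `preds` are made up for illustration: 4 of the 5 predictions match, so the accuracy is 4/5.

```python
import numpy as np

# Hypothetical true labels and predictions, made up for illustration.
y_test = np.array([1, 0, 1, 1, 0])
preds = np.array([1, 0, 0, 1, 0])

# Comparing element-wise gives booleans; their mean is the fraction correct.
accuracy = np.mean(y_test == preds)

print(accuracy)
```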

`plot_it()` Method

    def plot_it(self, preds):
        """
        :param array preds: Predictions
        :return: None
        """
        fig = make_subplots(rows=1, cols=2)
        x_ = list(self.X_train[:, 0])
        y_ = list(self.X_train[:, 1])
        c_ = list(self.y_train)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Training'),
            row=1, col=1
        )
        x_ = list(self.X_test[:, 0])
        y_ = list(self.X_test[:, 1])
        c_ = list(preds)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Testing'),
            row=1, col=2
        )
        title = 'Training {} Testing'.format(' '*68)
        fig.update_layout(
            title=title, height=300, width=900,
            margin=dict(l=0, b=0, t=40, r=0), showlegend=False
        )
        fig.show()
        return None

The above method is used to visualize the data. Both training and testing data are visualized separately. It takes one parameter -

preds refers to an array of predictions (class labels).

Full code

class KNN():
    def __init__(self, n_neighbors, train_df, test_df, label, metric='euclidean'):
        """
        :param int n_neighbors: Number of nearest neighbors
        :param str metric: Distance measurement metric.
        Accepts `euclidean`, `manhattan`, `minkowski`, `hamming`, and `cosine` methods
        """
        self.n_neighbors = n_neighbors
        self.available_metrics = ['euclidean', 'manhattan', 'minkowski', 'hamming', 'cosine']
        self.metric = metric if metric in self.available_metrics else 'euclidean'

        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values

    def split_features_targets(self, df, label):
        """
        :param DataFrame df: The main dataset
        :param str label: The column name which signifies class labels
        :return: X, y
        """
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def most_freq(self, l):
        """
        :param list l: List of values (can have duplicates)
        :return: Most occurred element from l
        """
        return max(set(l), key=l.count)

    def find_nearest_neighbors(self, X_new):
        """
        :param array X_new: Array/List of each feature
        :return: distance_labels - dictionary
        """
        distances = []
        for X_row in self.X_train:
            dm = DistanceMeasures(point1=X_row, point2=X_new)
            if (self.metric == 'euclidean'):
                distances.append(dm.euclidean_measure())
            elif (self.metric == 'manhattan'):
                distances.append(dm.manhattan_measure())
            elif (self.metric == 'minkowski'):
                distances.append(dm.minkowski_measure(p=3))
            elif (self.metric == 'hamming'):
                distances.append(dm.hamming_measure())
            else:
                distances.append(dm.cosine_measure())

        distance_labels = {d : l for (d, l) in zip(distances, self.y_train)}
        distance_labels = dict(
            sorted(distance_labels.items(), key=lambda x:x[0])[:self.n_neighbors]
        )
        return distance_labels

    def predict(self):
        if (len(self.X_test) == 1):
            odl = self.find_nearest_neighbors(X_new=self.X_test)
            return self.most_freq(l=list(odl.values()))

        preds = []
        for test in self.X_test:
            odl = self.find_nearest_neighbors(X_new=test)
            preds.append(self.most_freq(l=list(odl.values())))
        return np.array(preds)

    def score(self, preds, with_plot=False):
        """
        :param array preds: Predictions
        :param boolean with_plot: True/False
        :return: accuracy level
        """
        if (len(self.y_test) == len(preds)):
            if with_plot:
                self.plot_it(preds=preds)
            return sum([1 if (i == j) else 0 for (i, j) in zip(self.y_test, preds)]) / len(preds)
        return "Lengths do not match"

    def plot_it(self, preds):
        """
        :param array preds: Predictions
        :return: None
        """
        fig = make_subplots(rows=1, cols=2)

        x_ = list(self.X_train[:, 0])
        y_ = list(self.X_train[:, 1])
        c_ = list(self.y_train)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Training'),
            row=1, col=1
        )

        x_ = list(self.X_test[:, 0])
        y_ = list(self.X_test[:, 1])
        c_ = list(preds)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Testing'),
            row=1, col=2
        )

        title = 'Training {} Testing'.format(' '*68)
        fig.update_layout(
            title=title, height=300, width=900,
            margin=dict(l=0, b=0, t=40, r=0), showlegend=False
        )
        fig.show()
        return None

Custom KNN - Testing

Before testing the model, we need to first create a toy dataset and then split it into training and testing subsets.

Data Creation

X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=3,
    random_state=60
)
df = pd.DataFrame(dict(col1=X[:,0], col2=X[:,1], label=y))

The first five rows of df look like -

col1	col2	label
0.774027	-1.654300	0
0.487983	-0.202421	0
0.762589	-0.440150	0
-1.920213	0.264316	1
0.773207	-0.014049	2

Data splitter

def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        # shuffle all the rows before splitting
        dframe = dframe.sample(frac=1)

    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]
    return train_df, test_df

The above function is used to divide the data into two parts.

train_df, test_df = splitter(dframe=df)

Object Creation

neigh = KNN(
    train_df=train_df,
    test_df=test_df,
    label='label',
    n_neighbors=5,
    metric='cosine'
)

Prediction

preds = neigh.predict()

Accuracy Score

acc = neigh.score(preds=preds, with_plot=True)
print(acc)

The accuracy happens to be >= 95%, which is a decent result on this toy dataset, and hence the model is good.

Challenges

It is better to use cross-validation techniques to find the optimal value of k to obtain better results. We did not discuss that in this article.
If the data is huge, KNN becomes a tough job for the computer in terms of both time and space, because every prediction compares the query point with all the training points. Of course, we can make use of methods like KD-Tree and LSH (Locality Sensitive Hashing) to speed up the neighbor search.
Oftentimes, the majority voting technique may not work effectively. To avoid this problem, we can use weighted KNN, where closer neighbors get a larger say in the vote.
It is usually better to use the package methods rather than writing our own algorithms. The algorithms that we write may not be optimized enough for real workloads.
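
The weighted voting idea mentioned above can be sketched as follows. This is only a minimal illustration and not part of the KNN class above; the function name weighted_vote, the eps parameter, and the inverse-distance weighting are assumptions made for this example.

```python
def weighted_vote(distances, labels, eps=1e-9):
    """Return the label with the largest sum of inverse-distance weights.

    Hypothetical helper for illustration; eps avoids division by zero.
    """
    weights = {}
    for d, l in zip(distances, labels):
        # closer neighbors (smaller distance) contribute a larger weight
        weights[l] = weights.get(l, 0.0) + 1.0 / (d + eps)
    return max(weights, key=weights.get)


# Two very close neighbors of class 0 against three distant neighbors of class 1.
distances = [0.1, 0.2, 2.0, 2.5, 3.0]
labels = [0, 0, 1, 1, 1]
print(weighted_vote(distances, labels))
```

With plain majority voting the three distant neighbors would win; with inverse-distance weights the two close neighbors decide the prediction.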

References

KNN - Wikipedia article https://bit.ly/3w4cUAW
TDS blog article https://bit.ly/3x5f3ML
Distance metrics https://bit.ly/3ctuBlL

End

Naive Bayes Classification Algorithm in Practice

Mohammed Sameeruddin — Sat, 29 May 2021 10:05:59 GMT

Introduction

Classification is the task of grouping things together on the basis of the similarity they share with each other. It helps organize things and thus makes the study easier and more systematic. In statistics, classification refers to the problem of identifying which of a set of categories an observation or data value belongs to.

Examples -

Identifying whether an email is a spam or not spam.
Given certain health parameters, classifying whether a patient is likely to get a heart attack or not.

For humans, the classification task can be very easy, assuming that the person has proper domain-specific knowledge; given certain features, he/she can do it with little effort. But, it can be tricky for a machine to classify - unless it is provided with proper training from the data and an algorithm (classifier) that is used for learning. Thanks to Machine Learning and AI, we are now able to train machines with lots of data and predict the results on unseen data.

Credits of Cover Image - Photo by Annie Spratt on Unsplash

Machine Learning

Machine Learning is a method to make machines learn from previous experience (data), with which predictions are made on future unseen data. These predictions may not be perfectly accurate; they are educated guesses that can be considered while making future decisions.

Since we are talking about a machine to do the classification task, luckily, over the years researchers have built various classification algorithms that can be used for our own datasets (assuming the dataset is processed enough). Some of the types of classification algorithms are -

Logistic Regression
Naive Bayes Classifier
K-Nearest Neighbor (KNN)
Decision Trees
Neural Networks

Note - In this article, we will focus only on the Naive Bayes Classifier algorithm.

Naive Bayes - Theory

Naive Bayes is a simple and robust classifier that belongs to the family of probabilistic classifiers. It follows the idea of the Bayes Theorem, assuming that every feature is independent of every other feature given the class label. Given categorical features (not real-valued data) along with categorical class labels, Naive Bayes computes the likelihood of each category of every feature with respect to each category of the class label. It then chooses the category of the class label for which the product of the prior and the likelihoods is maximum.

Let's assume that we are given class labels such as -

$$C = [C_1, C_2, C_3, \dots, C_K]$$

and features such as -

$$X = [x_1, x_2, x_3, \dots, x_n]$$

with the help of the Bayes Theorem, we can have -

$$P(C_k | X) = \frac{P(C_k) P(X | C_k)}{P(X)}$$

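Now, the denominator P(X) is the same for every class, so it can be dropped. By applying the chain rule together with the naive assumption that the features are conditionally independent given the class, we will get -

$$P(C_k | x_1, x_2, x_3, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i | C_k)$$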

Clearly, the above is a probability model which can be used to construct a classifier such as -

$$\hat{y} = \underset{k \in \{1, \dots, K\}}{\text{argmax}} \ P(C_k) \prod_{i=1}^{n} P(x_i | C_k)$$

That's it. Based on this idea, the classifier predicts the category. Let's try to code this from scratch.
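
Before the full implementation, the decision rule above can be tried on a tiny hand-made example. This is a minimal sketch; the class names, priors, and likelihood values below are made-up numbers for illustration and do not come from any dataset.

```python
import numpy as np

# Hypothetical priors P(C_k) for two classes
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical likelihoods P(x_i | C_k) of the two observed feature values
likelihoods = {
    "spam": [0.8, 0.3],
    "ham": [0.1, 0.7],
}

# score(C_k) = P(C_k) * product of P(x_i | C_k)
scores = {c: priors[c] * np.prod(likelihoods[c]) for c in priors}

# the predicted class is the one with the largest score
y_hat = max(scores, key=scores.get)
print(scores, y_hat)
```

Here the score of spam is 0.4 * 0.8 * 0.3 and the score of ham is 0.6 * 0.1 * 0.7, so the classifier picks spam even though ham has the larger prior.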

Naive Bayes - Code

To practice Naive Bayes, we need processed categorical data with features and class labels to compute the likelihoods. The whole process is explained below.

Data Source - https://bit.ly/3oZ9s83

Libraries

import pandas as pd
import numpy as np

Data Reading

data_source = 'http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
cdf = pd.read_csv(
    filepath_or_buffer=data_source,
    names=['buying','maint','doors','persons','lug_boot','safety','class'],
    sep=','
)

The data is stored in the variable called cdf.

Construction

class CategoricalNB():

The name of the classifier is CategoricalNB() and it is a class where we define other methods.

    def __init__(self, train_df, test_df, label):
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_test_vals = self.X_test.values
        self.y_test_vals = self.y_test.values
        self.X_likelihood, self.y_likelihood = self.compute_likelihood()

The above method is a constructor that takes three parameters such as -

train_df refers to the subset of the data that is used to train the classifier.
test_df refers to the subset of the data that is used to test the classifier.
label refers to the name of the column that holds the class labels.

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

The above method is used to separate features and targets from the data. It takes two parameters such as -

df refers to the dataset (training or testing subset) that is to be separated into features and targets.
label refers to the name of the column in df that holds the class labels.

    def compute_likelihood(self):
        X_likelihood = {}

        # count of each class label in the training targets
        yc_df = self.y_train.value_counts().to_frame()
        yc_df.reset_index(inplace=True)
        yc_df.columns = ['class', 'count']
        y_vc = {i : j for (i, j) in zip(yc_df['class'], yc_df['count'])}
        y_vc_k = list(y_vc.keys())

        for col in self.X_train:
            each_col_dict = {}
            x_col_vals = self.X_train[col].value_counts().to_frame().index.to_list()
            fydf = pd.DataFrame(data={col : self.X_train[col], 'y' : self.y_train})

            for ex in x_col_vals:
                each_x_dict = {}
                x_ex_df = fydf[fydf[col] == ex]

                for ey in y_vc_k:
                    x_y_df = x_ex_df[x_ex_df['y'] == ey]
                    # P(feature value | class) = count(value and class) / count(class)
                    each_x_dict[ey] = len(x_y_df) / y_vc[ey]
                each_col_dict[ex] = each_x_dict
            X_likelihood[col] = each_col_dict

        # P(class) = count(class) / total number of training rows
        y_likelihood = {i : j / sum(list(y_vc.values())) for (i, j) in y_vc.items()}
        return X_likelihood, y_likelihood

The above method is used to compute the likelihood of each category of every feature with respect to each category of the class label. These values are what the classifier later uses to predict the category of a new data point. Here, likelihood is simply the conditional probability of a feature value given a class, estimated from the counts; the method also returns the prior probability of each class.
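
For a single feature, the same kind of likelihood table can be cross-checked with pd.crosstab(). This is a minimal sketch on a made-up six-row frame (the values are hypothetical and are not the car dataset); normalize="columns" divides each count by the total of its class column.

```python
import pandas as pd

# Hypothetical toy data: one feature and the class label
toy = pd.DataFrame({
    "safety": ["low", "high", "high", "med", "low", "high"],
    "class": ["unacc", "acc", "acc", "unacc", "unacc", "unacc"],
})

# rows: feature values, columns: classes, cells: P(feature value | class)
likelihood = pd.crosstab(toy["safety"], toy["class"], normalize="columns")
print(likelihood)
```

Each column of the result sums to 1. Note that the cell for safety = low and class = acc is 0, which is exactly the situation that smoothing is meant to handle.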

    def predictor(self, X_new):
        # pair each feature name with the value of the query point
        cols = list(self.X_likelihood.keys())
        col_new = {i : j for (i, j) in zip(cols, X_new)}

        lprobs = {}
        for l, v in self.y_likelihood.items():
            # likelihood of every feature value given the class l
            cate_v = [self.X_likelihood[cn][cl][l] for (cn, cl) in col_new.items()]
            # prior * product of likelihoods
            lprobs[l] = round((np.prod(cate_v) * v), 4)

        prob_ks = list(lprobs.keys())
        prob_vs = list(lprobs.values())
        return prob_ks[np.argmax(prob_vs)]

The above method is used to get the specific category of a class label whose probability for the parameter X_new is maximum. Here, the parameter X_new is basically the query point for which we want the prediction.

    def predict(self):
        if len(self.X_test_vals) == 1:
            return self.predictor(X_new=self.X_test_vals[0])

        preds = [self.predictor(X_new=i) for i in self.X_test_vals]
        return preds

The above method is used to predict the categories of a class label for the completely unseen dataset. This method works for both a single query point and multiple query points.

    def accuracy_score(self, preds):
        actual_vals = np.array(self.y_test_vals)
        preds = np.array(preds)
        # number of positions where the prediction equals the actual label
        corrects = np.count_nonzero(np.where((actual_vals == preds), 1, 0))
        return corrects / len(actual_vals)

The above method is used to compute the accuracy (a metric) to determine how the model (algorithm) is performing. It is the ratio of the number of correctly classified data points to the total number of data points. As a rough rule of thumb, a model whose accuracy is greater than 0.80 (80%) is often considered good, although the right threshold depends on the problem and on how balanced the classes are.

Full Code

class CategoricalNB():
    def __init__(self, train_df, test_df, label):
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_test_vals = self.X_test.values
        self.y_test_vals = self.y_test.values
        self.X_likelihood, self.y_likelihood = self.compute_likelihood()

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def compute_likelihood(self):
        X_likelihood = {}

        yc_df = self.y_train.value_counts().to_frame()
        yc_df.reset_index(inplace=True)
        yc_df.columns = ['class', 'count']
        y_vc = {i : j for (i, j) in zip(yc_df['class'], yc_df['count'])}
        y_vc_k = list(y_vc.keys())

        for col in self.X_train:
            each_col_dict = {}
            x_col_vals = self.X_train[col].value_counts().to_frame().index.to_list()
            fydf = pd.DataFrame(data={col : self.X_train[col], 'y' : self.y_train})

            for ex in x_col_vals:
                each_x_dict = {}
                x_ex_df = fydf[fydf[col] == ex]

                for ey in y_vc_k:
                    x_y_df = x_ex_df[x_ex_df['y'] == ey]
                    each_x_dict[ey] = len(x_y_df) / y_vc[ey]
                each_col_dict[ex] = each_x_dict
            X_likelihood[col] = each_col_dict

        y_likelihood = {i : j / sum(list(y_vc.values())) for (i, j) in y_vc.items()}
        return X_likelihood, y_likelihood

    def predictor(self, X_new):
        cols = list(self.X_likelihood.keys())
        col_new = {i : j for (i, j) in zip(cols, X_new)}

        lprobs = {}
        for l, v in self.y_likelihood.items():
            cate_v = [self.X_likelihood[cn][cl][l] for (cn, cl) in col_new.items()]
            lprobs[l] = round((np.prod(cate_v) * v), 4)

        prob_ks = list(lprobs.keys())
        prob_vs = list(lprobs.values())
        return prob_ks[np.argmax(prob_vs)]

    def predict(self):
        if len(self.X_test_vals) == 1:
            return self.predictor(X_new=self.X_test_vals[0])

        preds = [self.predictor(X_new=i) for i in self.X_test_vals]
        return preds

    def accuracy_score(self, preds):
        actual_vals = np.array(self.y_test_vals)
        preds = np.array(preds)
        corrects = np.count_nonzero(np.where((actual_vals == preds), 1, 0))
        return corrects / len(actual_vals)

Naive Bayes - Testing

Before testing the model, we need to first split the original data into two parts. One part is used for training and another part is used for testing. In machine learning, we generally take 80% of the data for training and the remaining 20% data for testing.

Data Splitter

def splitter(dframe, percentage=0.8, random_state=True):
    if random_state:
        # shuffle all the rows before splitting
        dframe = dframe.sample(frac=1)

    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]
    return train_df, test_df

The above function is used to divide the data into two parts.

train_df, test_df = splitter(dframe=cdf)

We have successfully divided the data into two parts.

Object Creation

nb = CategoricalNB(train_df=train_df, test_df=test_df, label='class')

Prediction

preds = nb.predict()

Accuracy Score

acc = nb.accuracy_score(preds=preds)
print(acc)

The accuracy happens to be >= 82%, which is a decent percentage for such a simple classifier, and hence the model is good.

Challenges

There can be situations where a category never occurs together with a class in the training data, so its likelihood becomes 0 and wipes out the whole product. Laplace Smoothing avoids this problem. I have not implemented it for now.
If we have many features in the data, the product of many probabilities can become extremely small. This can be avoided by summing logarithmic probabilities instead.
The above implementation only works for categorical data. If we have real-valued data, it is better to proceed with the Gaussian Naive Bayes procedure.
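
The first two points above can be sketched in a few lines. This is only an illustration and is not wired into the CategoricalNB class; the function names smoothed_likelihood and log_score, the alpha parameter, and the counts used below are assumptions made for this example.

```python
import numpy as np


def smoothed_likelihood(count_xy, count_y, n_categories, alpha=1.0):
    """P(x | y) with additive (Laplace) smoothing.

    count_xy: rows with this feature value and this class
    count_y: rows with this class
    n_categories: number of distinct values the feature can take
    """
    return (count_xy + alpha) / (count_y + alpha * n_categories)


def log_score(prior, likelihoods):
    """log P(y) + sum of log P(x_i | y), instead of a raw product."""
    return np.log(prior) + np.sum(np.log(likelihoods))


# A value never seen with this class no longer gets probability 0.
p_unseen = smoothed_likelihood(count_xy=0, count_y=4, n_categories=3)
p_seen = smoothed_likelihood(count_xy=2, count_y=4, n_categories=3)
score = log_score(prior=0.5, likelihoods=[p_unseen, p_seen])
print(p_unseen, p_seen, score)
```

Since the logarithm is monotonic, picking the class with the largest log score gives the same prediction as picking the largest product, while staying numerically stable for many features.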

References

Categorical Naive Bayes - https://bit.ly/3uwOrTh
Machine Learning, Shatterline blog - https://bit.ly/3i1gqrv

Well, that's it for now. If you have liked my article you can buy some coffee and support me here. That would motivate me to write and learn more about what I know.

#8 - apply() and map() for Data Manipulation

Mohammed Sameeruddin — Fri, 21 May 2021 07:02:57 GMT

What is Data Manipulation?

Data manipulation is one of the techniques in the life cycle of data analysis. It refers to the process of organizing the data by adjusting certain rows or columns so that it becomes easier to read and understand. In databases, SQL provides a dedicated set of statements called the Data Manipulation Language (DML) for these kinds of adjustments (inserting, deleting, modifying, etc).

Credits of Cover Image - Photo by Tower Electric Bikes on Unsplash

Note - This article mainly focuses on certain methods like apply() and map() - helpful to manipulate the data. These methods are pandas methods.

General Purpose

The ultimate goal of any program written in any language is to complete the task as fast as possible while consuming as little memory as possible. Using these built-in methods can be beneficial as they are written with such constraints in mind. The same result can be achieved by writing our own custom function, but this is usually not as efficient as the built-in method, and moreover, it can sometimes turn out to be very tedious.

apply() - this method is used to manipulate the entire data frame (one column or one row at a time) or one single column (series) when passed a function.
map() - this method is used to manipulate only one single column (series) at a time, value by value, when passed a function.
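
The difference between the two methods can be seen on a small frame. This is a minimal sketch; df_demo and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# apply() on a data frame hands each column (a Series) to the function
col_sums = df_demo.apply(lambda col: col.sum())

# map() on a single column hands each value to the function
doubled_a = df_demo["a"].map(lambda v: v * 2)

print(col_sums)
print(doubled_a)
```

In the first call the lambda sees a whole column, so it can aggregate; in the second call it only ever sees one value at a time.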

Problem Statement - 1

Given a numerical data frame, compute the square of each number from each column and obtain a new data frame. Further explanation can be seen in the below figure.

The data which is on the left side is the original data and the right side is the manipulated data which is basically the square of the number.

Custom Function

The custom function (user-defined function) that we can think of solving this problem can be like -

def square_data(dframe):
    cframe = pd.DataFrame()
    for col in dframe:
        square_col = []
        for each_val in dframe[col]:
            square_col.append(each_val ** 2)
        cframe[col] = square_col
    return cframe

The function uses two nested for loops, which can make the program slow when the data is very large. This is definitely not the right thing to practice in real data analysis tasks.

apply() Method

Let's say, the original data (on the left) is stored in a variable called df. By using apply(), this task can be computed easily with just one line.

adf = df.apply(lambda x: x**2)

That's it! The above returns the manipulated data - stored in the variable adf. The only caveat in using this method is that all the values should be numeric in nature (in the sense, there should not be any text data, which would raise an error). NaN values do not raise an error here, but they simply stay NaN in the result.

map() Method

Just like apply(), map() can also be used, but map() only manipulates a single column at a time. It cannot be used to manipulate the entire data at once.

mdf = pd.DataFrame()
for col in df:
    mdf[col] = df[col].map(lambda x: x**2)

The for loop is used only once, to iterate through the columns but not through each value in a column. This is far better than the first method but not as good as the apply() method.

apply() can perform the same task that map() can. Both methods have their advantages and disadvantages. A quick check of the data (types and missing values) is a must before using these methods in practice.

Problem Statement - 2

Given a numerical data frame, replace the values that are less than 25 with 1000. Further explanation can be seen in the below figure.

The data which is on the left side is the original data and the right side is the manipulated data which is basically the replacement of those numbers with 1000 that are less than 25.

Custom Function

The custom function (user-defined function) that we can think of solving this problem can be like -

def replacer(dframe, cval=25, with_=1000):
    rframe = pd.DataFrame()
    for col in dframe:
        rvals = []
        for each_val in dframe[col]:
            # replace only the values that are strictly less than cval
            r = with_ if each_val < cval else each_val
            rvals.append(r)
        rframe[col] = rvals
    return rframe

As usual, this uses 2 loops that can be hectic if the data is very large.

apply() & map() Method

This case cannot be handled by one line of code, instead, a loop is used that goes through each column.

rdf = pd.DataFrame()
for col in df:
    # keep values that are 25 or more, replace the rest (less than 25) with 1000
    rdf[col] = df[col].apply(lambda x: x if x >= 25 else 1000)

rdf = pd.DataFrame()
for col in df:
    rdf[col] = df[col].map(lambda x: x if x >= 25 else 1000)

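Here, apply() is used on a single column and behaves just like map(); it is not utilized as directly as it had been in problem statement 1. This is because the condition has to be checked for each individual value, whereas apply() on a data frame passes a whole column to the function.

To compute this in one line of code, the method applymap() can be used. It applies the function to every single value of the data frame, much like calling map() on each column. (In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which does the same thing.)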

applymap() Method

rdf = df.applymap(lambda x: x if x >= 25 else 1000)

That's it! The above returns the manipulated data - stored in the variable rdf. The only caveat in using this method is that all the values should be numeric in nature (in the sense, there should not be any text data, which would raise an error in the comparison). Also note that a NaN value fails the comparison, so it gets replaced with 1000 as well.
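
Both problem statements can also be approached without any per-value lambda, using vectorized pandas operations. The following is a minimal sketch, assuming a small made-up frame df_demo (not the data from the figures); it relies on the ** operator and on DataFrame.mask(), which are alternatives to the methods discussed in this article, not the approach used above.

```python
import pandas as pd

# Hypothetical toy frame
df_demo = pd.DataFrame({"a": [10, 25, 40], "b": [30, 5, 26]})

# Problem statement 1: square every value in one vectorized operation
sq_demo = df_demo ** 2

# Problem statement 2: wherever the condition is True, put 1000 instead
rdf_demo = df_demo.mask(df_demo < 25, 1000)

print(sq_demo)
print(rdf_demo)
```

Note that the value 25 itself is kept, because the condition is strictly "less than 25". Unlike the lambda versions, mask() leaves NaN values untouched, since a comparison with NaN is False.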

Conclusion

The article encompasses the core details on how to manipulate the data effectively by certain methods.
Methods included here are apply(), map(), applymap() which are pandas data manipulation methods.
The use case of each method is explained clearly with problem statements.

Well, that's all for now. This article is included in the series Exploratory Data Analysis, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

#7 - Handle Missing Values in Data

Mohammed Sameeruddin — Wed, 12 May 2021 11:31:32 GMT

What is a missing value?

A missing value simply refers to a situation where there is no data stored for a variable in an observation. This usually happens when there is no response recorded for a particular question during the process of data collection. Data Analysts have to be very careful in dealing with missing values. The process of filling in the missing data is called imputation. On the whole, imputation is helpful to some extent, but an imputed value is only an estimate and never fully replaces the actual value.

Credits of Cover Image - Photo by Alex Jones on Unsplash

Techniques of Imputation

1) Filling by a Central Tendency
- Mean replacement
- Median replacement
- Mode replacement
2) Categorical Imputation
- For each category, compute central tendency and fill the missing value for that category itself.
3) Model Imputation
- Building a machine learning model to fill up the missing data by the predicted values.

Note - In this article, we will focus only on techniques 1 and 2. We will not cover technique 3 for now.

For implementing the above methods we will need to first create a toy dataset in order to programmatically understand each technique.

Necessary `import`s

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random

from sklearn.datasets import make_blobs
from scipy import stats
from matplotlib import pyplot as plt

Dataset Creation

In our data, we shall make sure that we have at least one feature which is categorical. The method make_blobs() can be really handy in creating toy datasets in our own experiments.

With this method, we will create a dataset that will have -

100 samples
2 features
3 centers

X, y = make_blobs(n_samples=100, n_features=2, centers=3)
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
print(df.head())

The gist of the data looks like this -

Index	x	y	label
0	-1.960223	-2.240400	2
1	-3.489893	-3.149356	2
2	-8.173217	6.186763	1
3	-1.216968	-3.318524	2
4	-7.661975	9.092514	1

Similarly, the original plot of the data looks like this -

Missing values at random locations

In pandas, a missing value is represented as NaN which means Not a Number. We will create missing values in the y column for the above toy dataset.

df.loc[df['y'].sample(frac=0.1).index, 'y'] = np.nan

The above code sets 10% of the values in the y column to NaN, at randomly chosen rows of the existing data frame.

df.isnull().sum()
# ------
'''
x         0
y        10
label     0
dtype: int64
'''

Computing Central Tendency

Since we have to impute the missing values with central tendencies like mean, median, and mode, we will create a helper function that takes three params -

data - The dataset for which the central tendencies should be computed.
col - The column for which the central tendencies should be computed.
strategy - The central tendency to use: mean, median, or mode.

def fill_values(data, col, strategy):
    if (strategy == 'mean'):
        fval = data[col].mean()
    elif (strategy == 'median'):
        fval = data[col].median()
    else:
        # Series.mode() ignores NaN and returns the modes in sorted order; take the first
        fval = data[col].mode().iloc[0]
    return fval

The above function returns a value based on the strategy that is used.

1) Filling by a Central Tendency

In order to fill the NaN values with new values in the existing column of the data, we need to have a function that does so. The function takes three params -

data - The dataset that contains the column to be filled.
col - The column whose NaN values should be filled.
strategy - The central tendency to use: mean, median, or mode.

def simple_imputer(data, col, strategy):
    fval = fill_values(data=data, col=col, strategy=strategy)
    # every NaN in the column receives the same fill value
    non_nans = data[col].fillna(fval).to_list()
    return non_nans

The above function returns a list with filled values based on the strategy that is used.

Note - The above technique is very simple and would not produce good results when compared with other techniques.

If we try to visualize the data after filling the NaN's, we would see some sort of dispersion in the data.

We can see a horizontal pattern formed by points. These points are actually the NaN values that got filled by the respective central tendency; since they all receive the same y value, they line up horizontally. Moreover, none of them belongs to any of the clusters.
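
The reason for the horizontal pattern is easy to verify on a tiny series. This is a minimal sketch with made-up numbers, not the toy dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

# mean() skips NaN, so the fill value is (1 + 3 + 8) / 3
filled = s.fillna(s.mean())
print(filled.tolist())
```

Both missing entries receive exactly the same value, which is why the filled points fall on one horizontal line when plotted.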

2) Categorical Imputation

In our original dataset, besides the x and y columns, we also have one column called label that has 3 unique values, namely [0, 1, 2]. A simple hack that we can incorporate here is to separate the data by clusters and then compute the central tendency of each cluster to fill up its NaN values. By doing so, we can make sure that the new values belong to their respective clusters.

As usual, we will have a function to separate data cluster-wise. The function takes four params -

data - The dataset for which the central tendencies should be computed.
col - The column for which the central tendencies should be computed.
label - The column with which the data is separated into clusters.
strategy - The central tendency to use: mean, median, or mode.

def label_imputer(data, col, label, strategy):
    col_label_df = data[[col, label]]
    counts_df = col_label_df[label].value_counts().to_frame()
    classes = counts_df.index.to_list()

    fval_class = []
    for each_class in classes:
        # separating data into clusters
        mini_frame = col_label_df[col_label_df[label] == each_class]
        # filling the NaN values of this cluster with its own central tendency
        non_nans_frame = pd.Series(data=simple_imputer(mini_frame, col, strategy), index=mini_frame.index)
        fval_class.append(non_nans_frame)

    # restoring the original row order
    final_vals = pd.concat(fval_class).sort_index().to_list()
    return final_vals

The above function is used to compute the central tendency value based upon the cluster that it belongs to and finally returns a list of values.

If we try to visualize the data after filling the NaN's, we would not see any sort of dispersion in the data but rather a uniformity and resemblance with the original plot (without NaN values).
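
The cluster-wise idea can also be sketched with groupby() and transform(). This is an alternative formulation under assumptions, not the function defined above; the frame toy and the column name y_filled are made up for illustration, and only the mean strategy is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two clusters and one missing value in each
toy = pd.DataFrame({
    "y": [1.0, np.nan, 3.0, 10.0, np.nan, 14.0],
    "label": [0, 0, 0, 1, 1, 1],
})

# within each label group, fill NaN with the mean of that group only
toy["y_filled"] = toy.groupby("label")["y"].transform(lambda s: s.fillna(s.mean()))
print(toy)
```

The missing value of cluster 0 is filled with the mean of cluster 0 and the missing value of cluster 1 with the mean of cluster 1, so the filled points stay inside their own clusters.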

Full Code

class Imputer():
    def __init__(self, strategy):
        self.available_strats = ['mean', 'median', 'mode']
        # fall back to 'median' when an unknown strategy is passed
        self.strategy = 'median' if strategy not in self.available_strats else strategy

    def fill_values(self, data, col):
        if (self.strategy == 'mean'):
            fval = data[col].mean()
        elif (self.strategy == 'median'):
            fval = data[col].median()
        else:
            # Series.mode() ignores NaN and returns the modes in sorted order; take the first
            fval = data[col].mode().iloc[0]
        return fval

    def simple_imputer(self, data, col):
        fval = self.fill_values(data=data, col=col)
        non_nans = data[col].fillna(fval).to_list()
        return non_nans

    def label_imputer(self, data, col, label):
        col_label_df = data[[col, label]]
        counts_df = col_label_df[label].value_counts().to_frame()
        classes = counts_df.index.to_list()

        fval_class = []
        for each_class in classes:
            mini_frame = col_label_df[col_label_df[label] == each_class]
            non_nans_frame = pd.Series(data=self.simple_imputer(data=mini_frame, col=col), index=mini_frame.index)
            fval_class.append(non_nans_frame)

        final_vals = pd.concat(fval_class).sort_index().to_list()
        return final_vals

Conclusion

Throughout this article, we saw different techniques to handle the missing data. But, this was implemented on a toy dataset, not on a real dataset.
Practicing the above on real data may or may not be suitable. Moreover, real data does not always have a categorical feature with which to separate the data into clusters and fill the values cluster by cluster.
It might be beneficial to practice the third technique, which predicts the missing values with a mathematical model. Its quality depends entirely on the training data and the model itself.
It is all up to the practitioner to choose any method and come up with a solution for the given problem.
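
To give a rough feel for the third technique, here is a minimal sketch that is not covered in the article itself. It assumes, purely for illustration, that y depends linearly on x, and it uses np.polyfit() as a stand-in for a proper machine learning model; the frame toy is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame where y is missing in two rows
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 3.0, np.nan, 7.0, np.nan],
})

# fit a straight line on the rows where y is known
known = toy["y"].notna()
slope, intercept = np.polyfit(
    toy.loc[known, "x"].to_numpy(), toy.loc[known, "y"].to_numpy(), deg=1
)

# predict y for the rows where it is missing
toy.loc[~known, "y"] = slope * toy.loc[~known, "x"] + intercept
print(toy)
```

Unlike the central tendency techniques, each missing value here gets its own prediction based on the other feature of the same row.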

Crimes Against Women - Geo Data Analysis

Mohammed Sameeruddin — Sat, 01 May 2021 11:36:12 GMT

Introduction

The main agenda of this article is to analyze crime data by following all the steps required for a complete data analysis. The steps include Data Preparation, Data Cleaning, Data Wrangling, Feature Selection, Data Visualization & Comparison.

The data is about the crimes committed against women in India. The data was recorded from 2001 to 2014. It includes crimes like -

Rape
Kidnapping and Abduction
Dowry Deaths
Assault on Women with intent to outrage her modesty
Insult to the modesty of Women
Cruelty by Husband or his Relatives
Importation of Girls

Data Source - Crimes Against Women, kaggle.com

Throughout this article, we will try to see whether there is any reduction in crimes over the years. We will visualize each column of data (mentioned above) state-wise & year-wise, and thus explore it in a much better way. If you want to directly check out my work at Kaggle, feel free to visit this notebook.

Credits of Cover Image - Photo by Stefano Pollio on Unsplash

Libraries

As always, begin with importing all the necessary packages.

import warnings
warnings.filterwarnings('ignore')

import requests
import pandas as pd
import numpy as np
import plotly.graph_objects as go

Data Reading

I have saved the CSV data in my local system with the file name caw_2001-2014.csv.

df = pd.read_csv('caw_2001-2014.csv', index_col=0)

Data Preparation

We will rename all the columns ensuring that the column name is short and precise for easy access to the column.

df.columns = ['state_unit', 'district', 'year', 'rape', 'kidnap_abduction', 'dowry_deaths',
              'women_assault', 'women_insult', 'husband_relative_cruelty', 'girl_importation']
df.index = list(range(df.shape[0]))

There is a lot of string data where all the values are in uppercase. To keep the values consistent, we will convert all the string values to title case.

for col in df.columns:
    df[col] = df[col].apply(lambda x : x.title() if isinstance(x, str) else x)

If we observe the column state_unit, there are a few values that need replacement. Some values are duplicates, in the sense that they refer to the same state or unit under different names. We shall replace all those values so that each state or unit has a single unique name. This step is really helpful for visualizing the data geographically.

replacements = {
    'A & N Islands' : 'Andaman and Nicobar',
    'A&N Islands' : 'Andaman and Nicobar',
    'Daman & Diu' : 'Daman and Diu',
    'Delhi Ut' : 'Delhi',
    'D & N Haveli' : 'Dadra and Nagar Haveli',
    'D&N Haveli' : 'Dadra and Nagar Haveli',
    'Odisha' : 'Orissa',
    'Jammu & Kashmir' : 'Jammu and Kashmir'
}
for (o, r) in replacements.items():
    # assign the result back; an in-place replace on a column selection
    # may not update df in recent pandas versions
    df['state_unit'] = df['state_unit'].replace(to_replace=o, value=r)

Data Exploration & Categorization

Since the data was collected from 2001 to 2014, we will split it year-wise and save the pieces in a dictionary. This is simply an efficient way of organizing the data.

def split_data(dframe):
    min_year = dframe['year'].min()
    max_year = dframe['year'].max()
    data_year_wise = {
        year : dframe[dframe['year'] == year] for year in range(min_year, max_year + 1)
    }
    return data_year_wise
# --------------
data_splits = split_data(dframe=df)

To know the dimension of the data (year-wise), we can iterate through each key and print the shape of each value.

for (y, d) in data_splits.items():
    print(y, '\t', d.shape)
# --------------
'''
2001      (716, 10)
2002      (719, 10)
2003      (728, 10)
2004      (729, 10)
...
2014      (837, 10)
'''

In the column district, each state_unit has a special row, whose district value contains the word Total, that holds the total count for each crime column.

def categorize_crimes(data_source, state_unit=None):
    crime_list = list(data_source[2001].columns[3:])
    all_crimes_year_wise = {}
    for (y, d) in data_source.items():
        y_df = d[d['district'].str.contains('Total')]
        if state_unit:
            y_df = y_df[y_df['state_unit'] == state_unit.title()]
        crime_dict = {col : y_df[col].sum() for col in crime_list}
        # all_crimes_year_wise[y] = dict(sorted(crime_dict.items(), key=lambda x:x[1], reverse=True))
        all_crimes_year_wise[y] = crime_dict
    return all_crimes_year_wise

The above function categorize_crimes() takes two parameters -

data_source refers to the entire data which is split based on year.
state_unit refers to the name of the state or unit. If the value is None, it considers total crime counts year-wise on the basis of the whole country. Otherwise, it considers total crime counts year-wise on the basis of a particular state.

Now that we have the data organized and categorized, we can visualize it to identify the patterns.

Data Visualization & Analysis

Visualizing the count of crimes considering the whole country by including all the crimes.

For this, we will create a function that takes the total number of crimes either state-wise or country-wise. We just need to provide the data_source; state_unit is optional.
Once it obtains the total count of each column, it sums up all the columns to get the overall total of crimes that happened every year.
If state_unit is specified, it does the same process but limited to that particular state.

def plot_overall_crimes_by_year(data_source, state_unit=None, kind='bar'):
    crimes_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    year_sum_crimes = {y : sum(list(cr.values())) for (y, cr) in crimes_data.items()}
    y_keys = list(year_sum_crimes.keys())
    y_vals = list(year_sum_crimes.values())
    t = 'Total Crimes - {}'
    title = t.format(state_unit.title()) if state_unit else t.format('India')
    if kind == 'bar':
        trace = go.Bar(x=y_keys, y=y_vals)
    else:
        trace = go.Pie(labels=y_keys, values=y_vals)
    layout = go.Layout(
        height=400,
        width=600,
        title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None

1 Function Call - Visualization on the basis of the whole country.

plot_overall_crimes_by_year(data_source=data_splits)

2 Function Call - Visualization on the basis of a particular state.

plot_overall_crimes_by_year(data_source=data_splits, state_unit='Andhra Pradesh')

Visualizing the count of crimes considering the whole country by including all the crimes and a specific year in which they took place.

For this, we will create a function that takes the total number of crimes either state-wise or country-wise by year. We just need to provide the data_source and year; state_unit is optional.
If state_unit is specified, it does the same process but limited to that particular state.

def plot_crimes_by_year(data_source, year, state_unit=None, kind='bar'):
    crimes_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    year_all_crimes = crimes_data[year]
    y_keys = list(year_all_crimes.keys())
    y_vals = list(year_all_crimes.values())
    t = '{} - Total Crimes - {}'
    title = t.format(year, state_unit.title()) if state_unit else t.format(year, 'India')
    if kind == 'bar':
        trace = go.Bar(x=y_keys, y=y_vals)
    else:
        trace = go.Pie(labels=y_keys, values=y_vals)
    layout = go.Layout(
        height=400,
        width=600,
        title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None

1 Function Call - Visualization on the basis of the whole country in a specific period of time (year).

plot_crimes_by_year(data_source=data_splits, year=2001)

2 Function Call - Visualization on the basis of a particular state in a specific period of time (year).

plot_crimes_by_year(data_source=data_splits, year=2001, state_unit='Andhra Pradesh')

Comparing the occurrences visually, both country-wise and for a particular state, across two different periods of time (years).

For this, we will create a function that takes the total number of crimes either state-wise or country-wise for two different years. We just need to provide the data_source, ideal_year, and cwith_year; state_unit is optional.
If state_unit is specified, it does the same process but limited to that particular state.

def plot_overall_difference(data_source, ideal_year, cwith_year, state_unit=None):
    crime_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    ideal_year_crimes = crime_data[ideal_year]
    cwith_year_crimes = crime_data[cwith_year]
    t = '{} vs {} - diff - {}'
    title = t.format(ideal_year, cwith_year, state_unit.title()) if state_unit else t.format(ideal_year, cwith_year, 'India')
    trace1 = go.Bar(
        x=list(ideal_year_crimes.keys()),
        y=list(ideal_year_crimes.values()),
        name=ideal_year
    )
    trace2 = go.Bar(
        x=list(cwith_year_crimes.keys()),
        y=list(cwith_year_crimes.values()),
        name=cwith_year
    )
    layout = go.Layout(
        height=400,
        width=600,
        title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace1, trace2], layout=layout)
    fig.show()
    return None

1 Function Call - Visualization on the basis of the whole country in two different periods of time (years).

plot_overall_difference(data_source=data_splits, ideal_year=2001, cwith_year=2014)

2 Function Call - Visualization on the basis of a particular state in two different periods of time (years).

plot_overall_difference(data_source=data_splits, ideal_year=2001, cwith_year=2012, state_unit='Andhra Pradesh')

Plotting a particular crime both country-wise and state-wise to identify increasing or decreasing patterns year-wise.

For this, we will create a function that takes any single crime either state-wise or country-wise considering all the years. We just need to provide the data_source and crime; state_unit is optional.
If state_unit is specified, it does the same process but limited to that particular state.

def plot_crime_overall_diff(data_source, crime, state_unit=None, kind='bar'):
    crime_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    years_x = list(crime_data.keys())
    crime_y = [cr[crime] for (y, cr) in crime_data.items()]
    t = '{} - {} | {}  {}'
    min_y = years_x[0]; max_y = years_x[-1]
    title = t.format(min_y, max_y, state_unit.title(), crime) if state_unit else t.format(min_y, max_y, 'India', crime)
    if kind == 'bar':
        trace = go.Bar(x=years_x, y=crime_y)
    else:
        trace = go.Pie(labels=years_x, values=crime_y)
    layout = go.Layout(
        height=400,
        width=600,
        title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None

1 Function Call - Visualization of any single crime on the basis of the whole country.

plot_crime_overall_diff(data_source=data_splits, crime='rape')

2 Function Call - Visualization of any single crime on the basis of a particular state.

plot_crime_overall_diff(data_source=data_splits, crime='rape', state_unit='Andhra Pradesh')

In the above methods, we considered only the rape column. If we want to identify patterns for any other column, we can do so by changing the parameter value.

Plotting the data state-wise and year-wise with a specific crime (column) as a target.

For this, we will create a function that takes any single crime state-wise or country-wise at a particular period of time. We just need to provide the data_source, crime, and year; state_unit is optional.
If state_unit is specified, it does the same process but limited to that particular state.

def plot_column(data_source, year, crime, state_unit=None, kind='bar'):
    states_x, crime_y = obtain_features(data_source=data_source, year=year, crime=crime, state_unit=state_unit)
    t = '{} | {}  {}'
    title = t.format(year, crime, state_unit.title()) if state_unit else t.format(year, crime, 'India')
    if kind == 'bar':
        trace = go.Bar(x=states_x, y=crime_y)
    else:
        trace = go.Pie(labels=states_x, values=crime_y)
    layout = go.Layout(
        height=400,
        width=600,
        title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None
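The function plot_column() relies on a helper named obtain_features() whose definition does not appear in this article. The following is only a hypothetical sketch of what such a helper might look like, assuming it returns the labels and the counts for one crime in one year: state names taken from the Total rows when no state is given, district names otherwise. The tiny frame at the end uses made-up names and numbers purely to show the shape of the result.

```python
import pandas as pd


def obtain_features(data_source, year, crime, state_unit=None):
    """Hypothetical helper: return (labels, counts) for one crime in one year."""
    year_df = data_source[year]
    # Rows whose district value marks a state-level total (assumed naming).
    is_total = year_df['district'].str.contains('Total', na=False)
    if state_unit:
        # One state: label each district and leave the total row out.
        in_state = year_df['state_unit'] == state_unit.title()
        rows = year_df[in_state & ~is_total]
        labels = rows['district']
    else:
        # Whole country: label each state by its total row.
        rows = year_df[is_total]
        labels = rows['state_unit']
    return labels.tolist(), rows[crime].tolist()


# Made-up data, only for illustration.
demo = {2001: pd.DataFrame({
    'state_unit': ['State A', 'State A', 'State A'],
    'district': ['District 1', 'District 2', 'Total'],
    'year': [2001, 2001, 2001],
    'rape': [5, 7, 12],
})}
country_view = obtain_features(demo, year=2001, crime='rape')
state_view = obtain_features(demo, year=2001, crime='rape', state_unit='state a')
print(country_view)
print(state_view)
```

If the real helper treats the Total rows differently, only the two branches inside the function would need to change.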

1 Function Call - Visualization of any single crime on the basis of the whole country in a particular period of time.

plot_column(data_source=data_splits, year=2001, crime='kidnap_abduction')

2 Function Call - Visualization of any single crime on the basis of a particular state in a particular period of time.

plot_column(data_source=data_splits, year=2001, crime='kidnap_abduction', state_unit='Andhra Pradesh')

Geographical Plot - State-wise Crime Activity

In order to plot the data geographically, we need to have a map layout, and accordingly, the data can be visualized.

Downloading GeoJSON Data - State-wise

def get_india_map(year, state_unit=None):
    if year < 2014:
        if not state_unit:
            return 'https://raw.githubusercontent.com/geohacker/india/master/state/india_state.geojson'
        return 'https://raw.githubusercontent.com/geohacker/india/master/district/india_district.geojson'
    return 'https://raw.githubusercontent.com/geohacker/india/master/state/india_telengana.geojson'

Getting the Districts GeoJSON

def get_districts_json(year, state_unit=None):
    geo_link = get_india_map(year=year, state_unit=state_unit)
    if not state_unit:
        return geo_link
    req_data = requests.get(url=geo_link)
    req_json = req_data.json()['features']
    state_districts = []
    for feature in req_json:
        if feature['properties']['NAME_1'] == state_unit:
            state_districts.append(feature)
    return {
        "type": "FeatureCollection",
        "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
        "features" : state_districts
    }

Visualizing the Choropleth Map

def plot_state_wise(data_source, year, crime, state_unit=None):
    state_unit = None
    state_x, crime_y = obtain_features(data_source=data_source, year=year, crime=crime, state_unit=state_unit)
    df_cols = ['name', 'crime_count']
    state_crime_df = pd.DataFrame(data=zip(state_x, crime_y), columns=df_cols)
    trace = go.Choropleth(
        geojson=get_districts_json(year=year, state_unit=state_unit),
        featureidkey='properties.NAME_1',
        locations=state_crime_df['name'],
        z=state_crime_df['crime_count'],
        colorscale='Reds',
        marker_line_color='black',
        colorbar=dict(
            title={'text': "Crime Range"},
        )
    )
    layout = go.Layout(
        title="{}  Crime Activity - {}".format(year, crime),
        geo=dict(
            visible=False,
            lonaxis={'range': [65, 100]},
            lataxis={'range': [5, 40]}
        ),
        margin=dict(l=0, b=0, t=30, r=0),
        height=600,
        width=600
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None

1 Function Call - Visualizing dowry_deaths data state-wise based on a particular year.

plot_state_wise(data_source=data_splits, year=2001, crime='dowry_deaths')

plot_state_wise(data_source=data_splits, year=2014, crime='dowry_deaths')

In Uttar Pradesh, we can observe that the count of deaths due to dowry crime was high in both 2001 and 2014.

Observation

It is observed that as the years progress there is a drastic increase in the occurrence of crimes. It is quite painful to observe this. We have all been told that the development of the country (from every angle) needs time and patience, so as the years pass there should be a decrease in occurrences. But it is the opposite.
In this data, crimes against women keep increasing no matter what measures are taken.
Moreover, this dataset only includes the activity that took place till 2014; we do not know how many more have happened since then.
The same pattern is observed when we consider any single state and run our analysis.

Convert a Regular Matrix into Sparse Matrix in Python

Mohammed Sameeruddin — Wed, 21 Apr 2021 06:56:16 GMT

Introduction

A matrix is a data structure similar to an array where values are stored in rows and columns. Here, the values are all of the same type. When dealing with matrices (linear algebra) in Machine Learning and NLP, we often hear about two types of matrices -

Dense Matrix - The matrix where most of the elements are non-zero. In this matrix, there are very few zero elements.
Sparse Matrix - In contrast, the matrix where most of the elements are zero and very few elements are non-zero.

Note - There is no fixed criterion for how many zero values a matrix must have before it is worth converting it to a sparse representation.

Credits of Cover Image - Photo by Alexander Schimmeck on Unsplash

Algorithm Explanation

Imagine you have a large matrix with N rows and M columns in which most of the values are zeros. You are asked to consider only non-zero elements since zero elements do not add much value. Definitely, you have time and space constraints since you are dealing with a very large matrix.

The simple method here is to neglect all 0's and store the non-zero elements in the form of [ row_index, col_index, non-zero_value ].

Approach:

Iterate through each row of an input matrix.
Iterate through each value of each row.
If (value == 0) skip.
If (value != 0) obtain the row number and column number along with the value from the input matrix.
Save the result as [ row, column, value ] for every row and column.
Stop.
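The steps above can be sketched with two plain loops. This is only an illustrative version, assuming the matrix is given as a list of lists; the name to_triplets and the small example matrix are made up for the sketch.

```python
def to_triplets(matrix):
    """Collect [row, column, value] for every non-zero entry of a 2D list."""
    triplets = []
    for row_index, row in enumerate(matrix):       # iterate through each row
        for col_index, value in enumerate(row):    # iterate through each value
            if value == 0:                         # skip the zero elements
                continue
            # keep the position along with the non-zero value
            triplets.append([row_index, col_index, value])
    return triplets


example = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
]
print(to_triplets(example))  # [[0, 2, 3], [1, 0, 4]]
```

A matrix that contains only zeros simply produces an empty list.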

Visual Explanation

Let's say we are given a matrix in which most of the elements are 0.

The following GIF explains how to obtain the sparse matrix (GIF by Author).

Time to Code

In this section, we will try to code this in two different ways. One, with the help of the scipy module and another, implementing our own sparse matrix.

Let's get started ...

Necessary Imports

import numpy as np
from scipy.sparse import csr_matrix

Random Matrix Creation

Just for the demonstration, we will create a small random matrix whose entries are 0, 1, or 2, so that a fair share of them are zeros.

>>> mat = np.random.randint(low=0, high=3, size=(5, 5))
>>> print(mat)
[[2 2 0 2 0]
 [2 1 0 0 2]
 [2 1 0 1 0]
 [0 1 2 0 2]
 [0 1 2 2 1]]

SCIPY Implementation

With the help of the method csr_matrix() we can easily obtain the sparse matrix.

>>> smat = csr_matrix(mat)
>>> print(smat)
(0, 0)    2
(0, 1)    2
(0, 3)    2
(1, 0)    2
(1, 1)    1
(1, 4)    2
(2, 0)    2
(2, 1)    1
(2, 3)    1
(3, 1)    1
(3, 2)    2
(3, 4)    2
(4, 1)    1
(4, 2)    2
(4, 3)    2
(4, 4)    1

We can think of a dictionary in Python, which stores key and value pairs. The above output is something like a dictionary where the keys are the index locations (row, column) and the values are the actual non-zero elements.

Custom Implementation

In order to implement it from scratch, we can follow the Algorithmic approach that I have explained previously.

class SparseMatrix():
    def __init__(self, arr):
        self.arr = arr
    def retain_sparsity(self, to_dict=False):
        sparse_mat = [
            [rindx, cindx, val]
            for (rindx, row) in enumerate(self.arr)
            for (cindx, val) in enumerate(row)
            if (val != 0)
        ]
        if to_dict:
            sparse_mat = {(r, c) : v for (r, c, v) in sparse_mat}
        return sparse_mat

The above class SparseMatrix() has a method retain_sparsity() whose argument to_dict defaults to False. The argument to_dict decides whether the output is returned as a dictionary or as a list.

Object Creation

sparse = SparseMatrix(arr=mat)

Output - List

>>> smat_c = sparse.retain_sparsity()
>>> print(smat_c)
[[0, 0, 2], [0, 1, 2], [0, 3, 2], [1, 0, 2], [1, 1, 1], [1, 4, 2], [2, 0, 2], [2, 1, 1], [2, 3, 1], [3, 1, 1], [3, 2, 2], [3, 4, 2], [4, 1, 1], [4, 2, 2], [4, 3, 2], [4, 4, 1]]

Output - dictionary

>>> smat_d = sparse.retain_sparsity(to_dict=True)
>>> print(smat_d)
{(0, 0): 2, (0, 1): 2, (0, 3): 2, (1, 0): 2, (1, 1): 1, (1, 4): 2, (2, 0): 2, (2, 1): 1, (2, 3): 1, (3, 1): 1, (3, 2): 2, (3, 4): 2, (4, 1): 1, (4, 2): 2, (4, 3): 2, (4, 4): 1}

Conclusion

When we have space constraints while working with large matrices, it is often preferred to convert the matrix into a sparse representation, which takes far less space than the original matrix when most of the elements are zero.
In fact, we can check the space (in bytes) occupied by the original matrix mat and the sparse matrix.

>>> from sys import getsizeof
>>> # checking the space of original matrix
>>> getsizeof(mat)
156
>>> # checking the space of scipy sparse matrix
>>> getsizeof(smat)
24
>>> # checking the custom implementation sparse matrix
>>> getsizeof(smat_c)
92

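What we can observe is that the scipy object reports less space than our custom list. Note, however, that getsizeof() measures only the outer Python object and not the arrays or nested lists it refers to, so these numbers are a rough indication rather than an exact comparison. scipy is an optimized, well-developed library mainly used for scientific computations.
It is generally better to use library methods than our own code to achieve faster results with a smaller memory footprint.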
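A rough way to reason about when the [ row, column, value ] form pays off is to count how many numbers each representation has to keep. The sketch below counts stored values rather than bytes; the function name stored_values and the second, larger matrix are hypothetical and only serve as an illustration.

```python
def stored_values(n_rows, n_cols, n_nonzero):
    """Count the numbers kept by each representation (not bytes)."""
    dense = n_rows * n_cols      # every cell is stored
    triplets = 3 * n_nonzero     # row, column and value per non-zero cell
    return dense, triplets


# The 5 x 5 demo matrix above has 16 non-zero entries.
print(stored_values(5, 5, 16))          # (25, 48)
# A hypothetical 1000 x 1000 matrix with only 5000 non-zero entries.
print(stored_values(1000, 1000, 5000))  # (1000000, 15000)
```

For the small demo matrix the triplet form actually keeps more numbers than the dense form. The saving only appears when the non-zero entries are fewer than about a third of all cells, which is the case for genuinely sparse matrices.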

Matrix Multiplication - Normal Function to an Optimised Code

Mohammed Sameeruddin — Thu, 15 Apr 2021 06:05:08 GMT

Introduction

If you have read my previous articles on matrix operations, by now you would already know what a matrix is. Yes, a matrix is a 2D representation of an array with M rows and N columns. The shape of the matrix is generally referred to as its dimension. Thus the shape of any typical matrix is written as (M x N).

Credits of Cover Image - Photo by Carl Nenzen Loven on Unsplash

Row Matrix - Collection of elements or objects of the same type stored in 1 row and N columns.
Column Matrix - Collection of elements or objects of the same type stored in N rows and 1 column.

Note - Matrices of shapes (1 x N) and (N x 1) are generally called row vector and column vector respectively.

Algorithm Explanation

For instance, let's assume we have two matrices A and B. The general rule before multiplying is that the number of COLUMNS of A should be exactly equal to the number of ROWS of B. If this rule is satisfied then -

We shall compute the dot product for each row of A with respect to each column of B. This process continues until there are no elements left to compute.

Each row of A is considered to be a row vector.
Each column of B is considered to be a column vector.
Dot Product - It is an algebraic operation that is computed on two equal-sized vectors and results in a single number. It is also called a scalar product. Mathematically, we represent it in the form of -

$$r.c = \sum_{i=1}^{n} r_i c_i \rightarrow \text{for} \ i = \text{1 to n}$$

The resultant matrix (after the operation) has as many rows as A and as many columns as B, i.e. its size is (ROWS of A x COLUMNS of B).
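As a small worked example of the rule and the dot product described above, consider a (2 x 3) matrix A and a (3 x 2) matrix B. The helper names dot and result_shape are illustrative only and are not the functions developed later in this article.

```python
def dot(row, col):
    """Sum of element-wise products of two equal-length vectors."""
    if len(row) != len(col):
        raise ValueError("vectors must have the same length")
    return sum(r * c for r, c in zip(row, col))


def result_shape(a, b):
    """Shape of A x B, or an error if columns of A != rows of B."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    if cols_a != rows_b:
        raise ValueError("columns of A must equal rows of B")
    return rows_a, cols_b


A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]       # 3 x 2

print(result_shape(A, B))          # (2, 2)
first_col_of_B = [row[0] for row in B]
print(dot(A[0], first_col_of_B))   # 1*7 + 2*9 + 3*11 = 58
```

The value 58 is the top-left element of the product; the remaining elements are obtained the same way from the other row and column pairs.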

Note - The computation can be risky or slow when we are dealing with large matrices. This can be easily handled by NumPy.

GIF Explanation

GIF by Author

Easy to read function

If we try to break down the whole algorithm, the very first thing that we have to do is to transpose one of the matrices and compute the dot or scalar product for each row of the matrix to each column of the other matrix.

Matrix Transpose

def transpose(m):
    trans_mat = [[row[i] for row in m] for i in range(len(m[0]))]
    return trans_mat

Dot Product

def scalar_product(r, c):
    ps = [i * j for (i, j) in zip(r, c)]
    return sum(ps)

Matrix Multiplication

def mats_product(m1, m2):
    m2_t = transpose(m=m2)
    mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
    return mats_p

To wrap all the functions together, we can do the following -

Wrap Up

def easy_product(m1, m2):
    def transpose(m):
        trans_mat = [[row[i] for row in m] for i in range(len(m[0]))]
        return trans_mat
    def scalar_product(r, c):
        ps = [i * j for (i, j) in zip(r, c)]
        return sum(ps)
    def mats_product(m1, m2):
        m2_t = transpose(m=m2)
        mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
        return mats_p
    return mats_product(m1, m2)

More simplified code - (Optimized)

The above easy_product() can still be optimized by using the built-in functions of Python. The biggest improvement is possible in the method transpose().

Matrix Transpose

transpose = lambda m : list(map(list, zip(*m)))
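To see why this one-liner works, it helps to look at what zip(*m) produces on its own. The small matrix m below is just an illustration.

```python
m = [[1, 2, 3],
     [4, 5, 6]]

# zip(*m) is the same as zip([1, 2, 3], [4, 5, 6]): it pairs up the
# first items, then the second items, and so on, i.e. the columns of m.
columns = list(zip(*m))
print(columns)       # [(1, 4), (2, 5), (3, 6)]

# map(list, ...) only turns each tuple back into a list.
transposed = list(map(list, zip(*m)))
print(transposed)    # [[1, 4], [2, 5], [3, 6]]
```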

Dot Product

The above scalar_product() can also be reduced to a single line -

scalar_product = lambda r, c: sum([i * j for (i, j) in zip(r, c)])

Matrix Multiplication

def mats_product(m1, m2):
    m2_t = transpose(m=m2)
    mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
    return mats_p

To wrap all the functions together, we can do the following -

Wrap Up

def optimized_product(m1, m2):
    transpose = lambda m : list(map(list, zip(*m)))
    scalar_product = lambda r, c: sum([i * j for (i, j) in zip(r, c)])
    def mats_product(m1, m2):
        m2_t = transpose(m=m2)
        mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
        return mats_p
    return mats_product(m1, m2)

Awesome! Both the functions are ready to be tested. In order to test them, we need to have matrices defined. We will write a function that creates random matrices, which will also be helpful later to compare the speed of both functions.

Random matrices creation

import random
def create_matrix(rcount, ccount):
    random.seed(10)
    m = [[random.randint(10, 80) for i in range(ccount)] for j in range(rcount)]
    return m

Testing Phase

>>> nr = 2
>>> nc = 3
>>>
>>> m1 = create_matrix(nr, nc)
>>> m2 = create_matrix(nc, nr)
>>> m1
[[14, 64, 71], [11, 36, 69]]
>>> m2
[[14, 64], [71, 11], [36, 69]]

Normal Function

>>> mm = easy_product(m1, m2)
>>> print(mm)
[[7296, 6499], [5194, 5861]]

Improved Code

>>> mm = optimized_product(m1, m2)
>>> print(mm)
[[7296, 6499], [5194, 5861]]

Speed and Computation - Comparison

Now, both the functions seem to work well enough. But it is also important to check the performance of each algorithm in terms of computation time.

For this, we will run both the functions in a loop for a defined set of matrix shapes and store the required amount of time that each takes. We shall also plot the same to represent it visually.

Performance Check

import time
from matplotlib import pyplot as plt
def check_speed():
    shapes = [(3, 2), (5, 4), (10, 9), (20, 5), (30, 8), (40, 2), (50, 10), (2, 10), (5, 80), (100, 34), (200, 100), (300, 200)]
    x = [str(i) for i in shapes]
    y1 = []; y2 = [];
    for sp in shapes:
        m1 = create_matrix(sp[0], sp[1])
        m2 = create_matrix(sp[1], sp[0])
        start_e = time.time()
        res_easy = easy_product(m1, m2)
        end_e = time.time()
        easy_elapse = end_e - start_e
        start_o = time.time()
        res_opt = optimized_product(m1, m2)
        end_o = time.time()
        opt_elapse = end_o - start_o
        y1.append(easy_elapse)
        y2.append(opt_elapse)
    plt.figure(figsize=(15, 6))
    plt.plot(x, y1, 'o-', label='Normal Function')
    plt.plot(x, y2, 'o-', label='Optimized Code')
    plt.legend()
    plt.show()
    return None

Performance Graph

Both the algorithms seem to perform similarly except for a few cases. But there is a catch: we wrote these functions ourselves in plain Python. If we introduce NumPy for doing the same, the performance would be much better. The below graph includes the performance of NumPy as well.

Yes! NumPy is much faster. NumPy takes very little time to compute the same operation for all the sizes tested.
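The NumPy code behind the graph is not shown here, so the following is only a minimal sketch of how the same product could be computed with NumPy, assuming the two small test matrices from the Testing Phase.

```python
import numpy as np

m1 = [[14, 64, 71], [11, 36, 69]]
m2 = [[14, 64], [71, 11], [36, 69]]

# The @ operator performs matrix multiplication on NumPy arrays.
product = np.array(m1) @ np.array(m2)
print(product.tolist())   # [[7296, 6499], [5194, 5861]]
```

The result matches the output of easy_product() and optimized_product() above.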

Things I Learnt

Breaking down a given problem and approaching each one to ultimately solve the original problem.
Comparing algorithms and tracking the computation.

#6 - Detect Outliers Like a Detective

Mohammed Sameeruddin — Mon, 12 Apr 2021 07:11:15 GMT

Outliers - In a Nutshell

In order to understand what an outlier is, we can think of an example that explains the concept in a nutshell. Imagine it is summer and you are chilling with your friends. Everyone in your group is wearing normal clothes (light clothes that people usually wear in summer) except for one person who stands out like a weirdo. That single person is wearing a sweater. Clearly, he/she is different from the rest and certainly doesn't fit in your group.

The explanation can be clearly understood by seeing the cover image of this article.

Credits of Cover Image - Photo by Will Myers on Unsplash

Question in your Mind

Wait a second. Let's say I have numerical data, how can I be a detective to catch hold of an outlier in my data? Obviously, the data will not be like the cover image. (dang it)
In that case what should I do? (double dang it)

Answer to Your Question

Yes, what you are saying is absolutely logical. In order to detect the outliers in numerical data (or any data for that matter), we can use various statistical methods.

Definition of Outlier - Statistical term

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate an experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

There are 3 easy ways to detect the outliers within the data.

Visual Exploration
- Scatter Plot
- Box Plot
Z-score Method
IQR - (Inter Quartile Range)

Note - The readers of this article are assumed to know basic statistical measures and how to calculate them. If yes, you can proceed further. Otherwise, you may want to refer to this article.

Outliers can occur due to a wrong entry or a typo in the original dataset. Now that we know the different ways of detecting them, we shall understand each with an example.

Dummy Data

Let's have data values say -

$$x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]$$

and

$$y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]$$

Necessary Imports

1. Visual Exploration

By representing the data in the form of a graph, one can easily spot the variability or the difference in data patterns that may point to an outlier. However, visual inspection is less precise than the other two methods.

Scatter Plot

A scatter plot visualizes the data points along the coordinate axes based on the scale range mentioned.

Code

Output

Box Plot

A box plot visualizes the statistical measures of the data. The outliers are automatically shown on either side of the min and max lines. This is more robust than the scatter plot in detecting the outliers. The explanation of the box plot is below.

Credits - The above image is taken from the Internet.

Code - x

Output - x

The point (80) outside the max line is an outlier.

Code - y

Output - y

The point (85) outside the max line is an outlier.

Dang you (back)

2. Z-score Method

The Z-score method is used to standardize the data, i.e. to rescale each value by the mean and the standard deviation. The resulting values are called standard scores (z-scores). One important thing about them is that their mean is always 0 and their standard deviation is always 1.

Formula

Let

$$X = [x_1, x_2, x_3, \dots, x_n]$$

and

$$\mu_x \rightarrow \text{Mean}$$

$$\sigma_x \rightarrow \text{Standard Deviation}$$

then

$$z_i = \frac{x_i - \mu_x}{\sigma_x}$$

When we compute the above for every data point, it generates the sequence of standardized values.

If any value is greater than +3, it is an outlier.
Similarly, if any value is less than -3, it is an outlier.

Code
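A minimal sketch of both steps in plain Python, assuming the population standard deviation (dividing by n) and the threshold of 3 mentioned above. The names z_scores and z_outliers are illustrative and not necessarily the ones used in the original code.

```python
from statistics import mean, pstdev


def z_scores(data):
    """Standardize each value: (x - mean) / standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)   # population standard deviation (divides by n)
    return [(value - mu) / sigma for value in data]


def z_outliers(data, threshold=3):
    """Return the original values whose |z| exceeds the threshold."""
    scores = z_scores(data)
    indices = [i for i, z in enumerate(scores) if abs(z) > threshold]
    return [data[i] for i in indices]


x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]
print(z_outliers(x))   # [80]
print(z_outliers(y))   # [85]
```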

The above code returns a sequence of normalized values.

The above code returns the outliers from the original data by getting the indices.

Testing

The above result clearly states that 80 and 85 are the outliers of x and y respectively.

Double dang you (back)

3. IQR - (Inter Quartile Range)

The IQR is easy to understand by just looking at the box plot chart. The formula for IQR is

$$IQR = Q_3 - Q_1$$

where

$$Q_1 \rightarrow \text{First Quartile}$$

$$Q_3 \rightarrow \text{Third Quartile}$$

and

$$Q_2 \rightarrow \text{Median}$$

Code

Once we compute IQR, we need to find the bounds, i.e., upper bound and lower bound. We use a constant (1.5) to get the bounds: the lower bound is Q1 - (1.5 x IQR) and the upper bound is Q3 + (1.5 x IQR).

The data point which is less than the lower bound is an outlier.
The data point which is greater than the upper bound is an outlier.

Code
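A minimal sketch of the bounds and the outlier check in plain Python. It assumes the quartiles are taken from statistics.quantiles() with its default method; other quartile conventions can give slightly different bounds. The names iqr_bounds and iqr_outliers are illustrative.

```python
from statistics import quantiles


def iqr_bounds(data, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(data, n=4)   # default 'exclusive' method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(data, k=1.5):
    """Return the values that fall outside the bounds."""
    lower, upper = iqr_bounds(data, k)
    return [value for value in data if value < lower or value > upper]


x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]
print(iqr_bounds(x))     # (-1.0, 39.0)
print(iqr_outliers(x))   # [80]
print(iqr_outliers(y))   # [85]
```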

Testing

The above result clearly states that 80 and 85 are the outliers of x and y respectively.

Triple dang you (back)

Hence the Questions are answered. You have now successfully become the detective who can detect the outliers in the data.

#5 - Statistical Measures for Exploratory Data Analysis

Mohammed Sameeruddin — Wed, 07 Apr 2021 10:11:51 GMT

Introduction

Statistics is a branch of mathematics (applied mathematics) concerned mainly with the analysis of the data. It includes - collection, analysis, interpretation, and presentation of huge numerical data. There are a variety of numerical measures such as mean, median, mode, percentiles, variance, and standard deviation to summarize the data.

We shall understand the meaning of each measure briefly and programmatically implement the same using Python.

Credits of Cover Image - Photo by Crissy Jarvis on Unsplash

Mean Computation

Mean is generally referred to as the average value of the given set of values. It is the central value or central tendency of a finite set of numbers (data). The mean is often confused with the median (which is the middle value), but it is usually more convenient to take the mean as the measure of central tendency.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and the mean of this data is given as

$$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$

Code

Limitation

The mean is affected when outliers are introduced into the data.

Median Computation

Median is generally referred to as the mid data point from the data provided. In other words, it is the value that separates the data into two - lower half and higher half. The only constraint to compute the median is that the data needs to be in sorted order.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) which is sorted and the median of the data is given as (considering the index starting from 0)-

If n is odd, then

$$Med(X) = X\bigg[\bigg\lfloor\frac{n}{2}\bigg\rfloor\bigg]$$

If n is even, then

$$Med(X) = \frac{(X[\frac{n}{2}-1] + X[\frac{n}{2}])}{2}$$

Code

Pro

The median is far less affected than the mean when outliers are introduced into the data.
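To see how differently the two measures react to an outlier, here is a small sketch using Python's built-in `statistics` module. The numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

values = [2, 4, 6, 8, 10]        # hypothetical data
with_outlier = values + [1000]   # same data plus one extreme value

mean_before, mean_after = mean(values), mean(with_outlier)
median_before, median_after = median(values), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean jumps by more than a hundred, while the median only moves from one middle value to the next.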

Mode Computation

Mode is generally referred to as the number that appears most frequently in the given dataset. There is a formula for the modal value, but it mainly applies to data represented in class intervals. Otherwise, we can easily compute the mode by counting the occurrences of each data point.

Code

Standard Deviation Computation

Standard Deviation is generally referred to as the overall dispersion of the data points with respect to the data mean. Here, dispersion is simply a distance. Distances are always positive, but the raw difference between a data point and the mean can be negative. To avoid that, each difference is squared before averaging and the square root is taken at the end, which can be observed in the formula.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and the standard deviation of the data is given as

$$\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}}$$

Code

Limitation

Standard deviation is affected when outliers are introduced into the data.

Variance Computation

Variance is generally referred to as the square of the standard deviation. In the special case where the data follows a standard normal distribution, the standard deviation and the variance are both equal to 1.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and variance of the data is given as

$$Var(X) = \sigma^2$$
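The standard deviation and variance formulas above divide by n (the population form). A minimal sketch with hypothetical numbers, cross-checked against `statistics.pstdev` and `statistics.pvariance` from the standard library:

```python
import math
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5

mu = sum(data) / len(data)
variance = sum((x - mu) ** 2 for x in data) / len(data)   # divide by n
sigma = math.sqrt(variance)

print(variance, sigma)
print(pvariance(data), pstdev(data))   # population versions from the stdlib
```

Note that `statistics.stdev` and `statistics.variance` (without the `p`) divide by n - 1 instead, so they give slightly larger values on the same data.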

Code

Limitation

Variance is affected when outliers are introduced into the data.

Percentile Computation

A percentile is generally referred to as the value below which a given percentage of the data falls in its frequency distribution. The median is equal to the 50th percentile of the given data distribution.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and location percentile can be computed by

$$l_p = \bigg[(n - 1) \frac{p}{100}\bigg] + 1$$

Separate the integer part and the floating part of the location percentile value. Using zero-based indexing, take the previous data value (index: integer part - 1) and the current data value (index: integer part), and compute the percentile value by

$$p_v = X[prev] + [\text{floating part of }l_p * (X[curr] - X[prev])]$$
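As a worked example of the two formulas above, take a small sorted dataset and p = 40 (the numbers are hypothetical and chosen only for illustration):

$$X = [10, 20, 30, 40, 50], \quad n = 5, \quad p = 40$$

$$l_p = (5 - 1) \cdot \frac{40}{100} + 1 = 2.6$$

The integer part is 2 and the floating part is 0.6, so with zero-based indexing prev = 1 and curr = 2, which gives

$$p_v = X[1] + 0.6 \cdot (X[2] - X[1]) = 20 + 0.6 \cdot (30 - 20) = 26$$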

Code

Limitation

The data needs to be in sorted order (ascending) to be able to compute the percentile.

Median Absolute Deviation

Median Absolute Deviation is generally referred to as the median value of absolute dispersion from each data point to the median of the data itself. Standard deviation is computed with respect to the mean value whereas median absolute deviation (MAD) is computed with respect to the median value.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and MAD of the data is given as

$$MAD(X) = Med(|x_i - Med(X)|)$$
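The formula above can be sketched in a few lines. The helper name `mad` and the sample data are assumptions made for this illustration. Note that the `compute_mad` method in the Full Code section additionally divides the result by a constant `c = 0.6745`, which this sketch leaves out so that it matches the formula exactly.

```python
from statistics import median


def mad(data):
    """Median of the absolute deviations from the median (unscaled)."""
    med = median(data)
    return median(abs(x - med) for x in data)


sample = [1, 2, 3, 4, 100]   # hypothetical data with one outlier
print(mad(sample))
```

The outlier 100 barely matters here, because both the centre and the spread are taken as medians.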

Code

Covariance Computation

Covariance is generally referred to as a measure of the relationship between two random variables X and Y. This measure evaluates to what extent the variables change together. In other words, it is essentially a measure of the joint variability of two variables.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

and

$$Y = [y_1, y_2, y_3, y_4, \dots, y_n]$$

be two finite sets of numbers (data) and the Covariance of X and Y is given as

$$\text{Cov(X, Y)} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)$$

Code

Types

Positive Covariance - Indicates that two variables tend to move in the same direction.
Negative Covariance - Indicates that two variables tend to move in the inverse direction.

Properties

Cov(X, Y) = Cov(Y, X)
Cov(X, X) = Var(X)
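Both properties can be checked numerically. The sketch below uses hypothetical data and a small helper named `covariance` (an assumption for this example) that follows the formula above, dividing by n.

```python
from statistics import pvariance


def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


X = [1, 2, 3, 4, 5]      # hypothetical data
Y = [2, 4, 6, 8, 10]

print(covariance(X, Y), covariance(Y, X))   # symmetric
print(covariance(X, X), pvariance(X))       # covariance with itself is the variance
```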

Correlation Computation

Correlation is generally referred to as a measure of the strength of the relationship between two finite sets (data). Correlation is the scaled measure of covariance. The correlation value is always in the range of -1 to +1.

Formula

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

and

$$Y = [y_1, y_2, y_3, y_4, \dots, y_n]$$

be two finite sets of numbers (data) and the Correlation of X and Y is given as

$$\text{Corr(X, Y)} = \frac{\text{Cov(X, Y)}}{\sigma_X \sigma_Y}$$

Code

Types

Positive Correlation - Indicates that two variables tend to move in the same direction. The value lies between 0 and +1, and +1 means a perfect positive linear relationship.
Zero Correlation - Indicates that there is no linear relationship between two variables. The value is 0.
Negative Correlation - Indicates that two variables tend to move in the inverse direction. The value lies between -1 and 0, and -1 means a perfect negative linear relationship.
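The three cases can be illustrated with a short sketch. The helper names and all four lists are hypothetical. The correlation is computed as the covariance divided by the product of the two standard deviations, as in the formula above.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def correlation(xs, ys):
    std_x = math.sqrt(covariance(xs, xs))   # Cov(X, X) = Var(X)
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


X = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves exactly with X
falling = [10, 8, 6, 4, 2]   # moves exactly against X
noisy = [3, 9, 4, 8, 6]      # only loosely related to X

for name, Y in [("rising", rising), ("falling", falling), ("noisy", noisy)]:
    print(name, round(correlation(X, Y), 3))
```

Only the perfectly linear pairs reach the extremes; the loosely related pair lands somewhere in between.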

Full Code

import math
from collections import Counter

class StatsBasics():
    def compute_mean(self, data):
        mean_val = sum(data)/len(data)
        return mean_val

    def compute_median(self, data):
        data = sorted(data)
        n = len(data)
        mid_idx = n // 2
        if (n % 2 != 0):
            return data[mid_idx]
        return (data[mid_idx - 1] + data[mid_idx]) / 2

    def compute_mode(self, data):
        datac = Counter(data)
        max_freq = max(list(datac.values()))
        if (max_freq == 1):
            return "Mode doesn't exist"
        modals = [i for (i, j) in datac.items() if (j == max_freq)]
        return min(modals)

    def compute_stddev(self, data):
        mean_val = self.compute_mean(data=data)
        dispersions = [(i - mean_val)**2 for i in data]
        dispersion_mean = self.compute_mean(data=dispersions)
        return math.sqrt(dispersion_mean)

    def compute_variance(self, data):
        stddev = self.compute_stddev(data=data)
        return stddev**2

    def compute_percentile(self, p, data):
        data = sorted(data)
        if (p == 100):
            return data[-1]
        l_p = (len(data) - 1) * (p / 100) + 1
        int_l_p = int(l_p)
        fl_l_p = l_p - int_l_p
        val1 = data[int_l_p - 1]
        val2 = data[int_l_p]
        pval = val1 + (fl_l_p * (val2 - val1))
        return round(pval, 2)

    def compute_mad(self, data, c=0.6745):
        median_val = self.compute_median(data=data)
        abs_std = [abs(i - median_val) for i in data]
        mad = self.compute_median(data=abs_std) / c
        return round(mad, 2)

    def compute_covariance(self, X, Y):
        if (len(X) != len(Y)):
            return None
        mean_x = self.compute_mean(data=X)
        mean_y = self.compute_mean(data=Y)
        covals = [(x - mean_x)*(y - mean_y) for (x, y) in zip(X, Y)]
        covar_val = self.compute_mean(data=covals)
        return covar_val

    def compute_correlation(self, X, Y):
        covar_val = self.compute_covariance(X=X, Y=Y)
        std_X = self.compute_stddev(data=X)
        std_Y = self.compute_stddev(data=Y)
        corr_val = covar_val / (std_X * std_Y)
        return corr_val

#4 - Identify Patterns in the Data

Mohammed Sameeruddin — Thu, 25 Mar 2021 08:06:24 GMT

Introduction

Identifying patterns helps us understand more about the data, that is, we gain insights by observing trends and patterns. This also helps in finding the relationship between two sets of data.

From the business point of view, observing trends really helps in tracking the overall sales and returns. Given data stored year-wise, we can easily plot and identify a trend pattern - an upward trend, a constant trend, or a downward trend. And thus, decisions are made based on facts.

Google Trends is one of the best applications available on the web to see the trend graph on the search term.

Credits of Cover Image - Photo by Clark Van Der Beken on Unsplash

Different Ways of Identifying Patterns

Depending on the data we can gain insights either by just looking at the tabular data (assuming data is in tabular format) or plotting it.

From a personal point, it is always good to visualize the data in order to properly identify the patterns. This really makes sense to me.

Let's try to identify patterns in one dataset.

Note - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

Hands-on Practice

For the hands-on practice, we will use a dataset in the domain of health care on heart attack possibility.

Let's get started ...

We are using Python and specifically Pandas library - developed with a clear intention to approach the key aspects of data analytics and data science problems.

Installation

pip install pandas --user
pip install numpy --user
pip install seaborn --user
pip install matplotlib --user

The above installs the current version of packages in the system. Let's import the same.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

Load / Read

In order to load or read the data, we use the read_csv() method where we explicitly tell Pandas that we want to read a CSV file.

df = pd.read_csv('heart.csv')

Since df is a Pandas DataFrame object, we can call the other methods that it makes available to understand the data.

From the last tutorial, we know that cp (chest pain), thalach (heartbeat rate), chol (cholesterol) (not so important), and slope happen to be the most important features.

Divide the Data by `target`

Display the list of columns in the dataset.

>>> df.columns
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'], dtype='object')

Filter the dataset considering the target variable.

df_1 = df[df['target'] == 1] # data of people who are risky
df_0 = df[df['target'] == 0] # data of people who are safe

The cholesterol is measured in milligrams per decilitre (mg/dl) and the heartbeat rate is measured in beats per minute. Let's plot the [chol, thalach] features with respect to age for both df_1 and df_0.

Age group - Risky

First group the data (df_1) by age considering the columns chol and thalach.

ag_df_1 = df_1.groupby(by=['age'])[['chol', 'thalach']].sum()

Plot a bar chart for the grouped data (above).

ag_df_1.plot(kind='bar', figsize=(15, 6), title='Risky')
plt.show()

The plot looks like this -

Age group - Safe

First group the data (df_0) by age considering the columns chol and thalach.

ag_df_0 = df_0.groupby(by=['age'])[['chol', 'thalach']].sum()

Plot a bar chart for the grouped data (above).

ag_df_0.plot(kind='bar', figsize=(15, 6), title='Safe')
plt.show()

The plot looks like this -

Conclusion

People who are in the risky group have higher heartbeat rates compared with those who are in the safe group.
Also, the cholesterol measurements are higher in the risky group.
From the graphs, one can reasonably assume that maintaining an optimal heartbeat rate is a goal for heart patients in order to be healthy.
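The grouping step behind both charts can be tried on a tiny frame. This is only a sketch: the frame `toy` and its values are made up. It also shows `mean()` next to the `sum()` used in this post, purely for comparison, because a per-age sum grows with the number of people of that age while a mean does not.

```python
import pandas as pd

# Hypothetical miniature of the heart data: two people aged 40, three aged 55.
toy = pd.DataFrame({
    "age": [40, 40, 55, 55, 55],
    "chol": [200, 220, 240, 260, 250],
    "thalach": [170, 160, 150, 140, 145],
})

sums = toy.groupby(by=["age"])[["chol", "thalach"]].sum()
means = toy.groupby(by=["age"])[["chol", "thalach"]].mean()

print(sums)
print(means)
```

When reading the bar charts, it helps to keep in mind how many rows fall into each age group.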

Let's plot the [chol , thalach] features with respect to cp (chest pain) for both df_1 and df_0.

CP group - Risky

First group the data (df_1) by cp considering the columns chol and thalach.

cp_df_1 = df_1.groupby(by=['cp'])[['chol', 'thalach']].sum()

Plot a pie chart for the grouped data (above).

cp_df_1.plot(kind='pie', figsize=(15, 6), subplots=True, title='Risky')
plt.show()

The plot looks like this -

CP group - Safe

First group the data (df_0) by cp considering the columns chol and thalach.

cp_df_0 = df_0.groupby(by=['cp'])[['chol', 'thalach']].sum()

Plot a pie chart for the grouped data (above).

cp_df_0.plot(kind='pie', figsize=(15, 6), subplots=True, title='Safe')
plt.show()

The plot looks like this -

Conclusion

From the above two graphs, we can observe that people in the safe category have lesser degrees of chest pain than those in the risky category.
It is also very significant that cp and thalach are both equally important with respect to the age of the patient.

Scatter Plot of `age` and `cp`

Scatter Plot helps us identify an upward trend or downward trend for two variables that are taken. It is just another form of correlation represented graphically. Based on two datasets df_1 and df_0, let's compare cp with respect to age.

plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['cp'], label='Risky')
plt.scatter(df_0['age'], df_0['cp'], label='Safe')
plt.xlabel('age')
plt.ylabel('chest pain')
plt.legend()
plt.show()

Conclusion

From the plot, we can see that most of the people in the safe category have a lesser degree of chest pain than those in the risky category.
In the safe category, the chest pain degree is often low even at higher ages, although there are of course others who have a high degree.

Scatter Plot of `age` and `thalach`

Now, let's compare thalach with respect to age for both df_1 and df_0.

plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['thalach'], label='Risky')
plt.scatter(df_0['age'], df_0['thalach'], label='Safe')
plt.xlabel('age')
plt.ylabel('thalach')
plt.legend()
plt.show()

Conclusion

If we just observe the safe (orange) category, the heartbeat level is roughly constant with respect to age.
The same is not true for the risky (blue) category.
We can also assume that people aged between 29 and 60 have higher heartbeat rates, which, together with higher degrees of chest pain, can lead to an attack.

#3 - Feature Selection for EDA

Mohammed Sameeruddin — Tue, 23 Mar 2021 11:58:01 GMT

Introduction

Identifying or selecting the important features is one of the crucial processes in data science and data analytics. After all, the selected features determine the quality of the data. It is exactly like the phrase Garbage in, Garbage out - meaning, the result of the whole analytical process depends on whichever features we consider. It is also said that data analysts spend at least 60 to 70 percent of their time preparing the data.

Credits of Cover Image - Photo by Brian Lundquist on Unsplash

What is Data Preprocessing?

Data Preprocessing is an important phase of the whole analytical process, where relevant fields are extracted or queried from the gathered data. In this step, data analysts try their best to retain or increase the quality of the data, which makes the further steps easier. During data gathering, we almost always end up collecting messy data, and therefore it has to be preprocessed right after it is collected. It includes steps like -

Data Cleaning - In this process, unnecessary data values which do not add importance are removed. Mostly, it includes missing values that should be cleaned or filled.
Data Editing - In this process, the original data is edited or changed in order to maintain consistency. For example, if the dataset has an age column, then it is important to have all the age values in numeric form, and there is a chance that some values arrive in non-numeric form. In that case, it is always encouraged to edit the data.
Data Wrangling - In this process, the raw data is transformed or manipulated in such a format that it is easy to use and readily available for data analytics. Oftentimes, it is also called data manipulation.

Note - The above steps are explained in a nutshell. Moreover, the overall complexity (to maintain data quality) depends on the dataset that is collected.

Importance of Correlation

Correlation is a statistical measurement that tells how strongly one feature is related to the target variable. It returns a coefficient (not a percentage) describing the relationship, and its value lies between -1 and +1.

-1 describes that the feature and the target variables are negatively correlated, meaning - one increases when the other decreases and vice-versa.
0 describes that there is no correlation.
+1 describes that the feature and the target variables are positively correlated, meaning - one increases when the other increases and vice-versa.

Correlation is really helpful in knowing those features that actually affect the target variable.

Let's try to identify which features are important in one dataset.

Note - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

Hands-on Practice

For the hands-on practice, we will use a dataset in the domain of health care on heart attack possibility.

Let's get started ...

We are using Python and specifically Pandas library - developed with a clear intention to approach the key aspects of data analytics and data science problems.

Installation

pip install pandas --user
pip install numpy --user
pip install seaborn --user
pip install matplotlib --user

The above installs the current version of packages in the system. Let's import the same.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

Load / Read

In order to load or read the data, we use the read_csv() method where we explicitly tell Pandas that we want to read a CSV file.

df = pd.read_csv('heart.csv')

Since df is a Pandas DataFrame object, we can call the other methods that it makes available to understand the data.

Display the 1st Five Rows

df.head()

head() displays the first five rows by default, so we don't need to mention how many rows we want. To display a different number of rows, pass that number as an argument, as in head(n).

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1

We have a column called target that has two unique values.

0 indicates fewer chances of getting a heart attack
1 indicates more chances of getting a heart attack

Count the data by `target` column

To count the frequency of the data we will use the value_counts() method.

ha_df = df['target'].value_counts().to_frame()

The above methods convert the count (data) to a data frame.

    target
----------
1   165
0   138
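The same counting step can be tried on a made-up series. The name `toy_target` and its values are assumptions for this sketch. One caveat: newer pandas releases (2.0 and later) label the resulting column `count` rather than `target`, so the header of the frame may look different from the output shown in this post.

```python
import pandas as pd

# Hypothetical target column: three risky (1) and two safe (0) rows.
toy_target = pd.Series([1, 0, 1, 1, 0], name="target")

counts = toy_target.value_counts()   # frequencies, most common value first
counts_df = counts.to_frame()        # one-column data frame of the counts

print(counts_df)
```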

Visually, we can represent the above count (data) as a pie chart.

ha_df.plot(kind='pie', figsize=(10, 6), subplots=True)
plt.show()

Count the data by `sex` ratio

From the original data frame df, we have a column called sex which is again numerical where -

1 indicates male
0 indicates female

Let's visualize the pie chart of the same.

sdf = df['sex'].value_counts().to_frame()
sdf.plot(kind='pie', figsize=(10, 6), subplots=True)
plt.show()

Identify Important Features with `Correlation` Plot

With the help of Pandas, we can easily find correlations considering the entire data frame.

cor_df = df.corr()

This will return the correlations to each column with all the other columns. We can directly visualize the correlation matrix using seaborn plots.

plt.figure(figsize=(10, 6))
sns.heatmap(
    data=cor_df,
    vmin=-1,
    vmax=1,
    center=0,
    cmap='seismic',
    annot=True
)
plt.show()

If we carefully observe the correlation matrix, there are 3 features in total that are affecting the target variable. They are -

cp - chest pain
thalach - heartbeat rate
slope - possibly

The variable chol (cholesterol) has almost no relationship with the target. In fact, from the correlation plot, we see that its correlation coefficient is -0.085, whereas cp and thalach have around 0.43 and 0.42 respectively.
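Instead of reading the heatmap by eye, the column of correlations with the target can also be sorted. This is a hedged sketch on a made-up frame: `toy` and its column names are assumptions, while `corr()`, `drop()`, `abs()` and `sort_values()` are regular pandas methods.

```python
import pandas as pd

# Hypothetical data: one feature rising with the target, one falling, one unrelated.
toy = pd.DataFrame({
    "feature_up": [1, 2, 3, 4, 5, 6],
    "feature_down": [6, 5, 4, 3, 2, 1],
    "feature_noise": [3, 1, 2, 3, 1, 2],
    "target": [0, 0, 0, 1, 1, 1],
})

cor_toy = toy.corr()

# Keep only the correlations with the target and rank them by strength.
ranking = cor_toy["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

Taking the absolute value matters here, because a strongly negative coefficient is just as informative as a strongly positive one.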

#2 - Determine objectives and understand the data better

Mohammed Sameeruddin — Mon, 22 Mar 2021 07:55:10 GMT

What is an Objective?

In any business problem, data analysts are often asked to find the patterns and trends in the data in order to identify the root cause of problems (why something is happening). Throughout, they try to maintain the quality of the data so that the true objectives can be found precisely. These objectives (targets) are not formulated from personal opinions or prejudices.

Credits of Cover Image - Photo by Silvan Arnet on Unsplash

To first understand the data, one has to be a keen observer. By just observing the data, one can know what kind of facts are given and which factors can be considered the most important features. This stage is called Data Inspection.

Data Inspection

It is in this stage where one acts as Sherlock Holmes to understand the data better.

As soon as the data is loaded, to make sure one is following the right track one has to be curious to ask the following questions -

What is the problem statement?
What is the structure of the data?
What are the variables present?
- What is the type of each variable?
What is the total size of the data?
Are there any missing or NaN values within the data?

Let's try to understand one dataset from the above questions along with the objectives.

Note - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

Hands-on Practice

For the hands-on practice, we will use one dataset in the domain of health care on heart attack possibility.

Let's get started ...

We are using Python and specifically Pandas library that was developed with a clear intention to approach the key aspects of data analytics and data science problems.

Installation

pip install pandas --user
pip install numpy --user

The above installs the current version of packages in the system. Let's import the same.

import pandas as pd
import numpy as np

1) What is the problem statement?

The title of the dataset itself gives a brief idea of the problem statement that we need to work on, i.e., predicting the possibility of a heart attack. But, for this article, our goal is to understand the data by answering the above questions.

Load / Read

In order to load or read the data, we use the read_csv() method where we explicitly tell Pandas that we want to read a CSV file.

df = pd.read_csv('heart.csv')

2) What is the structure of the data?

The dataset that we have just loaded is a CSV file, where all the values are separated by a comma (,). By default, the data is read as a DataFrame, which is basically tabular data.

df.info()

Since we have created an object called df, we can call the different methods available on it. One such method is info(), which gives the information related to the data frame.

RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB

No doubt that we have got the entire information, structure, columns by just one command. We also got the types of each column which indirectly answers the 3rd question.

3) What are the variables present?

Variables are also called features which are the columns of the data. Each column or variable has a specific type of data that can be int, float, object etc.

df.columns

The above gives the list of columns or variables, along with the dtype of the column labels. The labels themselves are stored as object (strings), which says nothing about the type of data each column holds.

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

3.1) What is the type of each variable?

df.dtypes

The above lists out the type of data that each variable or column holds.

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

4) What is the total size of the data?

The size of the data can be determined by -

df.shape

This gives the total number of rows and columns in the form of a tuple.

(303, 14)

The dataset has in total 303 rows and 14 columns. Now comes the last and final question.

5) Are there any missing or `NaN` values within the data?

Missing values in the data indicate that the value for that column or variable at a particular index is Null or None. In order to find the same, we have a method called isnull() which can be understood in this way -

is the data frame null?

df.isnull()

This, in return, creates a new data frame of boolean values representing

True, wherever the value is Null
False, wherever the value is NOT-Null

     age    sex    cp     trestbps  chol   fbs    restecg  thalach  exang  oldpeak  slope  ca     thal   target
0    False  False  False  False     False  False  False    False    False  False    False  False  False  False
1    False  False  False  False     False  False  False    False    False  False    False  False  False  False
2    False  False  False  False     False  False  False    False    False  False    False  False  False  False
...  ...    ...    ...    ...       ...    ...    ...      ...      ...    ...      ...    ...    ...    ...
301  False  False  False  False     False  False  False    False    False  False    False  False  False  False
302  False  False  False  False     False  False  False    False    False  False    False  False  False  False

However, looking at the above result, it can be tedious to exactly pinpoint the True value signifying the Null in that location. We can achieve it with the help of NumPy's np.where() method.

def get_nan_location(dframe):
    nan_locs = {}
    for col in dframe.columns:
        nl = list(np.where(dframe[col].isnull())[0])
        nan_locs[col] = nl
    return nan_locs

The above takes one parameter i.e., dframe (the actual data frame), and returns a dictionary where each key is the column and each value is the list of indices where there is NaN pertaining to that respective column.

get_nan_location(dframe=df)
#--------------------------
{'age': [], 'sex': [], 'cp': [], 'trestbps': [], 'chol': [], 'fbs': [], 'restecg': [], 'thalach': [], 'exang': [], 'oldpeak': [], 'slope': [], 'ca': [], 'thal': [], 'target': []}

We can see all the columns have an empty list which simply indicates that there are no NaN values.
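For a quicker overview, the boolean frame returned by isnull() can also be summarised directly. The sketch below uses a small made-up frame (`toy` is an assumption, since the heart dataset itself has no missing values) together with the regular pandas methods `sum()` and `any()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

nan_counts = toy.isnull().sum()                           # missing values per column
rows_with_nan = toy[toy.isnull().any(axis=1)].index.tolist()   # row labels with any NaN

print(nan_counts)
print(rows_with_nan)
```

A column whose count is 0 corresponds to an empty list in the dictionary approach above.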

Once this step or stage is over, we should be formulating other questions that mainly depend on the data we are working with.

#1 - Getting started with Data Analytics

Mohammed Sameeruddin — Sun, 21 Mar 2021 10:12:22 GMT

What is Data?

Data is the knowledge gained from a factual basis. It can be related to an object or a person. An explanation obtained from processing the data is called information. Thus, data and information are two different things.

Data - Facts and figures
Information - Processed data which is understood better

Credits of Cover Image - Photo by Hunter Harritt on Unsplash

Data Analytics - Meaning

From the above two, we can get clarity on the term "Data Analysis". Data Analysis is a field that combines Statistics, Mathematics, and Computer Science to process raw data and produce insightful or valuable information. You might have this question - Statistics and Mathematics are fine but why Computer Science? The knowledge of programming helps in different ways in analyzing data. Some of these are -

Process Automation
Handling Large Datasets
Querying Databases
Creating Models
Data Visualization
Dashboard Development

Tools / Languages

Of course, we cannot just analyze the given data with a piece of paper and pencil. We need to find one such platform to do all three - Stats, Math, and Programming.

Tools involved in Data Analysis -

Python
R
Julia
Matlab

Note - There are so many languages or tools available. But here, I talk about Python. If you want to know the list then do refer to this article.

Packages

To get started in the field of data, learning Python would benefit in many ways. Python has a wide variety of packages that have been developed over the years. From data collection to data modeling, Python has everything set for you.

List of Packages or libraries:

NumPy
Pandas
Statsmodels
Matplotlib
OpenCV
Scikit-Learn
Pytorch
Tensorflow
Plotly
Py-Spark, etc

These packages have been extensively used for data-related problems. There is no requirement to learn all the packages as long as one is curious enough to understand the problem and implement the method. But the deeper one goes, the more a deeper knowledge of these packages becomes a must.

End

How to Deploy a Python Image Processing App

Mohammed Sameeruddin — Fri, 12 Mar 2021 11:03:02 GMT

In this article, I will share my experience (the errors and issues) while deploying the app which I developed.

For a while now, I have been working on my basic image processing app developed in Python, using frameworks and libraries like -

Dash
Plotly
Plotly express
NumPy &
OpenCV
and some other dependencies

Let me explain how the journey of developing this app began.

Credits of Cover Image - Photo by Kelly Sikkema on Unsplash

Idea

At first, I didn't have any idea or plan to develop an app (that too for image processing). It started when one of my colleagues asked in our common group -

How do I re-mirror the image in the zoom app? By default, it takes the virtual background image that is already mirrored.

At that point, I really wanted to understand how this mirroring feature works in any image application. I already knew the basics of image processing, like - how to read the image, how to convert the image into grayscale - stuff like that. I thought, why don't I try this in Python? In fact, mirroring the image is simply reversing each row of the image matrix, i.e. reversing the order of its columns (images are considered as matrices with rows and columns).

Implementation

The moment I knew what to do, I searched online. I found resources showing that the mirroring operation can easily be implemented using the PIL (Pillow) library. I remembered I was once told that the NumPy library is mainly used for scientific and complex mathematical calculations. I made up my mind to use only the NumPy library, both for the mathematical understanding and for the implementation.

Firstly, I implemented it using a regular for loop. Although it was working, I observed that it was taking too much time to compute, so I chose the np.fliplr() (flip-left-right) method instead. I plotted the result after the operation and the image was completely mirrored. I used the same technique for flipping the image upside down, with the np.flipud() (flip-up-down) method.
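On a tiny matrix the two NumPy calls are easy to follow. This is a minimal sketch and not the code of the app: the array `image` is a made-up 2 x 3 stand-in for an image, and the list comprehension only hints at what a loop-based version might look like.

```python
import numpy as np

# Hypothetical 2 x 3 "image": each number stands for one pixel.
image = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

mirrored = np.fliplr(image)   # reverse the column order (left <-> right)
flipped = np.flipud(image)    # reverse the row order (top <-> bottom)

# Loop-style equivalent of the mirror: reverse every row one by one.
loop_mirrored = np.array([row[::-1] for row in image])

print(mirrored)
print(flipped)
```

Both functions work the same way on a colour image, where the array simply has a third axis for the channels.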

Likewise, step by step I learned other operations and executed them. Image Flipping and Mirroring with NumPy and OpenCV was the first blog I wrote in the category of image processing.

Blogs that I wrote (order wise)

Image Operations
Image Transformation (Morphological)
- Image Erosion
- Image Dilation
- Other transformations which use the concept of erosion and dilation.

Note - I have re-ordered the blogs in my series, in a way that a beginner can also understand from the initial step.

App Development

After having developed some image operations, I wanted to compile everything together as an app with a proper UI. I have extensively used dash & plotly for designing the UI part along with the user-upload component.

Note - I won't be sharing the whole code but for sure the code snippets.

Package `import`s

import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_daq as daq
import plotly.express as px
import cv2
import numpy as np
import json
import base64
from matplotlib import pyplot as plt
from dash.dependencies import (Input, Output, State)
from image_ops_scratch import (ImageOperations, read_image_string)
from image_morphs_scratch import MorphologicalTransformations

image_ops_scratch - the file where all the image operations are defined.
image_morphs_scratch - the file where all the morphological image transformations are defined.

Start the App

external_stylesheets = [
    'https://codepen.io/chriddyp/pen/bWLwgP.css'
]

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
app.config['suppress_callback_exceptions'] = True
app.title = 'Image Processing App'
server = app.server

Front End

For developing the UI, we shall use the sub-packages of dash namely -

dash_daq
dash_core_components
dash_html_components
dash.dependencies

Please refer to the official documentation to understand the basics of dash callbacks and other important UI components.

File Upload Component (UI)

This component is used to upload a local file from the system to the application. When the user clicks on it, a file dialog opens to select a file. The selected image is read as a base64-encoded string, which has to be converted into an image array because the operations work only on n-dimensional arrays.

html.Div([
    dcc.Upload(
        id='upload-image',
        children=html.Div([
            'Drag and Drop or ',
            html.A('Select Files')
        ]),
        style={
            'width': '100%',
            'height': '70px',
            'lineHeight': '60px',
            'borderWidth': '1px',
            'borderStyle': 'dashed',
            'borderRadius': '5px',
            'textAlign': 'center',
            'margin': '10px',
            'backgroundColor': '#F0F1F1'
        },
        multiple=True
    ),
], style={'paddingTop' : 50})

Convert the String Image to Matrix

def read_image_string(contents):
    encoded_data = contents[0].split(',')[1]
    nparr = np.frombuffer(base64.b64decode(encoded_data), np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img
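The first two steps of this conversion can be tried without OpenCV. The sketch below is only an illustration: the four bytes are a made-up stand-in for a real image file, and the variable names are hypothetical. It shows how the uploaded string splits into a header and a base64 payload, and how the payload becomes a byte array that could then be handed to cv2.imdecode.

```python
import base64
import numpy as np

# Hypothetical payload: four raw bytes standing in for an encoded image file.
raw_bytes = bytes([137, 80, 78, 71])
data_url = "data:image/png;base64," + base64.b64encode(raw_bytes).decode("ascii")

# With multiple=True, the upload component hands over a list of such strings.
contents = [data_url]

# Split off the "data:<mime>;base64" header and decode the payload.
header, encoded_data = contents[0].split(",", 1)
buffer = np.frombuffer(base64.b64decode(encoded_data), dtype=np.uint8)

print(header)
print(buffer.tolist())
```

Because the upload component is created with multiple=True, the callback receives a list, which is why the function above reads contents[0].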

Toggle Switch (Grayscale mode - on & off)

When the toggle switch is turned on, the image shall be converted directly to grayscale. By default, the image is read in RGB mode.

html.Div([
    daq.ToggleSwitch(
        id='image-mode',
        size=60,
        label='Gray Scale',
        labelPosition='top',
        color='#717171',
        value=False,
    )
], style={'paddingTop' : 30, 'paddingBottom' : 10})

Radio Buttons - Operation / Transformation

To represent each operation we will use radio buttons. When the user selects any particular operation, the output of the image changes and the same is displayed in the app.

image_ops = ['None', 'Equalize', 'Flip', 'Mirror', 'Binarize', 'Invert', 'Solarize']

html.Div([
    dcc.RadioItems(
        id='in-operation',
        options=[{'label' : op, 'value' : op.lower()} for op in image_ops],
        value='none'
    ),
], className='select-operation')

########################

image_morphs = ['None', 'Erode', 'Dilate', 'Open', 'Close', 'Gradient', 'Boundary Extraction']

html.Div([
    html.P('Morph level - '),
    dcc.Input(id='morph-level', type='number', placeholder='Enter Morph Level - ', value=3),
    html.Div([
        dcc.RadioItems(
            id='in-transformation',
            options=[{'label' : tr, 'value' : tr.lower()} for tr in image_morphs],
            value='none'
        ),
    ], style={'paddingTop' : 20})
], className='select-operation')

Tabs to Separate Operations & Transformations

The app should have two tabs so as to separate image operations and image transformations. Also, in the transformations tab, the image needs to be converted into binary as they only work on binary images. Similarly, in the operations tab, we shall place the toggle switch.

tab_style = {
    'borderBottom': '1px solid #d6d6d6',
    'padding': '10px',
    'fontWeight': 'bold'
}

tab_selected_style = {
    'borderTop': '5px solid #d6d6d6',
    'borderBottom': '3px solid #d6d6d6',
    'backgroundColor': '#7E8483',
    'color': 'white',
    'padding': '6px'
}

html.Div([
    dcc.Tabs(
        id='image-processors-tabs',
        value='operators',
        children=[
            dcc.Tab(
                label='Operations',
                value='operators',
                style=tab_style,
                selected_style=tab_selected_style,
                children=[
                    # toggle switch
                    # operations - radio buttons
                ]
            ),
            dcc.Tab(
                label='Transformations',
                value='transformers',
                style=tab_style,
                selected_style=tab_selected_style,
                children=[
                    # transformations - radio buttons
                ]
            )
        ]
    )
], className='tab-div')

Output Image - Component

The output div should be the same for both operations and transformations. It has to be reserved in advance with a unique id, so that it can be updated based on the tab that is selected.

html.Div(
    id='result-in-out-image',
    className='flex-item-right'
)

Back End

In dash, the output is displayed or updated through callbacks. A callback can take any number of Inputs and States, each identified by a component id and a property. In this app, each callback updates a single Output, and a given output (id and property pair) is assigned to only one callback.

The callback chart for this app is as follows -

Convert the String Image to Matrix

def parse_contents(contents, filename, date):
    image_mat = read_image_string(contents=contents)
    return image_mat

Here -

contents are the data in the file that is selected
filename points to the name of the file that is selected
date points to the time when the file was last modified

Setting the Output Layout Based on the Tab Selected

We have one output div that is reserved. Based on the tab selected, the app should change its nature. Hence, the following code.

@app.callback(
    Output('result-in-out-image', 'children'),
    [Input('image-processors-tabs', 'value')]
)
def set_output_layout(which_tab):
    if which_tab == 'operators':
        in_out_image_div = html.Div([
            html.Div(
                children= [
                    html.H5('Image Used - Output'),
                    html.Div(id='output-image-op'),
                ],
                style={'textAlign' : 'center', 'paddingTop' : 50}
            )
        ])
    elif which_tab == 'transformers':
        in_out_image_div = html.Div([
            html.Div(
                children= [
                    html.H5('Image Used - Output'),
                    html.Div(id='output-image-morph'),
                ],
                style={'textAlign' : 'center', 'paddingTop' : 50}
            )
        ])
    return in_out_image_div

Image Operations - Back End

In this, we have -

1 Output
- output-image-op pointing to the change in children (HTML collection)
3 Inputs
- upload-image pointing to the contents of the file
- image-mode pointing to the mode (RGB or grayscale by toggle switch)
- in-operation pointing to the input operation that is selected
2 States
- upload-image pointing to the state of the file name
- upload-image pointing to the state of the date (when the file is last modified)

@app.callback(
    Output('output-image-op', 'children'),
    [
        Input('upload-image', 'contents'),
        Input('image-mode', 'value'),
        Input('in-operation', 'value'),
        # -------
        State('upload-image', 'filename'),
        State('upload-image', 'last_modified'),
    ]
)
def get_operated_image(contents, image_mode, operation, filenames, dates):
    if contents is not None:
        imsrc = parse_contents(contents, filenames, dates)
        imo = ImageOperations(image_file_src=imsrc)

        if (operation == 'equalize'):
            out_img = imo.equalize_this(gray_scale=True) if image_mode else imo.equalize_this()
        elif (operation == 'flip'):
            out_img = imo.flip_this(gray_scale=True) if image_mode else imo.flip_this()
        elif (operation == 'mirror'):
            out_img = imo.mirror_this(gray_scale=True) if image_mode else imo.mirror_this()
        elif (operation == 'binarize'):
            out_img = imo.binarize_this(gray_scale=True) if image_mode else imo.binarize_this()
        elif (operation == 'invert'):
            out_img = imo.invert_this(gray_scale=True) if image_mode else imo.invert_this()
        elif (operation == 'solarize'):
            out_img = imo.solarize_this(gray_scale=True) if image_mode else imo.solarize_this()
        else:
            out_img = imo.read_this(gray_scale=True) if image_mode else imo.read_this()

        out_image_fig = px.imshow(out_img, color_continuous_scale='gray') if image_mode else px.imshow(out_img)
        out_image_fig.update_layout(
            coloraxis_showscale=False,
            width=600, height=400,
            margin=dict(l=0, r=0, b=0, t=0)
        )
        out_image_fig.update_xaxes(showticklabels=False)
        out_image_fig.update_yaxes(showticklabels=False)

        output_result = html.Div([
            dcc.Graph(id='out-op-img', figure=out_image_fig)
        ], style={'paddingTop' : 50})

        return output_result

Image Transformation - Back End

In this, we have -

1 Output
- output-image-morph pointing to the change in children (HTML collection)
3 Inputs
- upload-image pointing to the contents of the file
- morph-level pointing to the input value with which the level of the structuring element is decided
- in-transformation pointing to the input transformation that is selected
2 States
- upload-image pointing to the state of the file name
- upload-image pointing to the state of the date (when the file is last modified)

@app.callback(
    Output('output-image-morph', 'children'),
    [
        Input('upload-image', 'contents'),
        Input('morph-level', 'value'),
        Input('in-transformation', 'value'),
        # -------
        State('upload-image', 'filename'),
        State('upload-image', 'last_modified'),
    ]
)
def get_transformed_image(contents, level, transformation, filenames, dates):
    if contents is not None:
        imsrc = parse_contents(contents, filenames, dates)
        # fall back to the default level before the object is created
        level = 3 if level is None else level
        morph = MorphologicalTransformations(image_file_src=imsrc, level=level)
        image_src = morph.read_this()

        if (transformation == 'erode'):
            out_img = morph.erode_image(image_src=image_src)
        elif (transformation == 'dilate'):
            out_img = morph.dilate_image(image_src=image_src)
        elif (transformation == 'open'):
            out_img = morph.open_image(image_src=image_src)
        elif (transformation == 'close'):
            out_img = morph.close_image(image_src=image_src)
        elif (transformation == 'gradient'):
            out_img = morph.morph_gradient(image_src=image_src)
        elif (transformation == 'boundary extraction'):
            out_img = morph.extract_boundary(image_src=image_src)
        else:
            out_img = image_src

        out_image_fig = px.imshow(out_img, color_continuous_scale='gray')
        out_image_fig.update_layout(
            coloraxis_showscale=False,
            width=600, height=400,
            margin=dict(l=0, r=0, b=0, t=0)
        )
        out_image_fig.update_xaxes(showticklabels=False)
        out_image_fig.update_yaxes(showticklabels=False)

        output_result = html.Div([
            dcc.Graph(id='out-morph-img', figure=out_image_fig)
        ], style={'paddingTop' : 50})

        return output_result

Note - The above code snippets belong to one file i.e., app.py. The actual image processing code is not disclosed. One can look it up in my GitHub repository.

Deployment

The fun part was the deployment. Although I have experience in deploying Python web apps on Heroku, I faced slight issues (which helped me learn) while deploying this app. I organized all the code and checked whether any changes were still needed.

Folder Structure

- image-app/
    - assets/
        - custom_style.css
    - images/
        - lena_original.png
        - pinktree.jpg
        - scenary.jpg
    - .gitignore
    - app.py
    - image_morphs_scratch.py
    - image_ops_scratch.py
    - Procfile
    - requirements.txt

I specified all the requirements in a separate requirements.txt file. The errors occurred while installing OpenCV on Heroku. I had to specify opencv-contrib-python-headless instead of opencv-contrib-python for the package to install properly.

requirements.txt

dash
plotly
dash-core-components
dash-html-components
dash-daq
numpy
matplotlib
pandas
opencv-contrib-python-headless
gunicorn

Procfile

web: gunicorn app:server

Everything is set now. We can follow these steps and deploy the app so that it is accessible to everyone.

That's It, Folks

By developing this app, I got to learn some image processing concepts along with the mathematics behind it. I feel happy to finally be able to deploy this app on Heroku. I know the app is sort of basic and will be adding other interesting features to it in the future. I would like to see your feedback or suggestions in the comments.

GitHub link https://github.com/msameeruddin/image-app
App link https://process-image-app.herokuapp.com/

If you have liked this project, do drop a star, and buy coffee for me from here. We will meet in the next blog.

End

Transposing a Matrix — Normal Function to an Optimised Code

Mohammed Sameeruddin — Thu, 04 Mar 2021 03:39:52 GMT

In this article, we will learn 3 ways of transposing a matrix from an easy-to-read function to an optimized code without needing any for loops.

Credits of Cover Image - Photo by Vlado Paunovic on Unsplash

Introduction

A matrix is a 2D representation of an array with M rows and N columns. An array is a collection of identical elements or objects stored in 1 row and N columns. There are so many mathematical operations and properties that can be implemented in a matrix. One such operation is the transpose operation. Transposing a matrix is easy, just converting rows into columns and vice-versa.

Let's try to code this in 3 different ways and compare the performance.

1) Easy and Readable Function

In every matrix, the length of rows should be equal and the same follows for the columns.

We require two for loops:

In the first loop, we consider the size of any row. This will decide the column size for our transposed matrix.
In the second loop, we consider each row in the original matrix and extract every item order-wise.

Code

def transpose_matrix_1(matrix):
    # take the first row length
    row_len = len(matrix[0])
    trans_mat = []
    for i in range(row_len):
        trans_row = []
        for row in matrix:
            # extract every item from each row order-wise
            trans_row.append(row[i])
        trans_mat.append(trans_row)
    return trans_mat

Output

>>> matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
>>> t_matrix = transpose_matrix_1(matrix=matrix)
>>> print(t_matrix)
[[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]

2) Implementing List Comprehension

We can improve the above function by using a list comprehension. A list comprehension is easy to write and, performance-wise, it is usually faster than the equivalent for loop.

Code

def transpose_matrix_2(matrix):
    row_len = len(matrix[0])
    trans_mat = [[row[i] for row in matrix] for i in range(row_len)]
    return trans_mat

Output

>>> matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
>>> t_matrix = transpose_matrix_2(matrix=matrix)
>>> print(t_matrix)
[[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]

3) Lambda Function with map() and zip()

A lambda function is a small anonymous function that takes any number of arguments and returns the value of a single expression.

map() applies a function to every item of a sequence, which we could otherwise do with a for loop. Here, we don't need one.
zip() groups the items at the same position from every row. It is basically used to club things together.
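To see what the two built-ins contribute individually, here is a minimal sketch on a small made-up 2 x 3 matrix. The intermediate variable names are only for illustration.

```python
matrix = [[1, 2, 3], [4, 5, 6]]

# *matrix unpacks the rows, so this is the same as zip([1, 2, 3], [4, 5, 6]).
columns = list(zip(*matrix))
print(columns)  # [(1, 4), (2, 5), (3, 6)]

# zip() yields tuples; map(list, ...) turns each tuple back into a list.
transposed = list(map(list, columns))
print(transposed)  # [[1, 4], [2, 5], [3, 6]]
```

The tuples produced by zip() are already the columns of the matrix, so the only remaining work is the conversion back to lists.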

Code

# zip all the items sequentially and convert it as a list
transpose_matrix_3 = lambda matrix: list(map(list, zip(*matrix)))

Output

>>> matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
>>> t_matrix = transpose_matrix_3(matrix=matrix)
>>> print(t_matrix)
[[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]

Which one is Faster?

We shall compare each function's performance based on how much time it takes to complete the task. For this, we will rely on the timeit module.

Note: We will mostly use the square matrix of different sizes. However, we can tweak the dimension values.
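The timeit module can also be called as a plain function, which works in any Python script. The sketch below is a simplified illustration and not the benchmark used for the graph: the function names transpose_loop and transpose_zip, the 50 x 50 size, and the number of runs are all assumptions chosen to keep it short.

```python
import random
from timeit import timeit

def transpose_loop(matrix):
    # Nested for loops, as in the first approach.
    result = []
    for i in range(len(matrix[0])):
        column = []
        for row in matrix:
            column.append(row[i])
        result.append(column)
    return result

def transpose_zip(matrix):
    # Built-ins only, as in the third approach.
    return list(map(list, zip(*matrix)))

matrix = [[random.randint(1, 20) for _ in range(50)] for _ in range(50)]

# timeit() returns the total time in seconds for `number` calls,
# so dividing by `number` gives the average time per call.
runs = 200
t_loop = timeit(lambda: transpose_loop(matrix), number=runs) / runs
t_zip = timeit(lambda: transpose_zip(matrix), number=runs) / runs

print("loop: {:.1f} us per call".format(t_loop * 1e6))
print("zip : {:.1f} us per call".format(t_zip * 1e6))
```

The absolute numbers depend on the machine, so only the relative difference between the two functions is meaningful.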

Code

import random
from timeit import timeit
import matplotlib.pyplot as plt
from matplotlib import style
style.use('seaborn')

def time_convertor(val, time_type):
    # convert the reported time to seconds
    if time_type == 'µs':
        val = val / 1000000
    elif time_type == 'ms':
        val = val / 1000
    else:
        val = val
    return val

def compare_performace():
    r = [3, 5, 10, 50, 100, 500, 1000]
    X = []; Y1 = []; Y2 = []; Y3 = []

    for i in range(len(r)):
        matrix = [[random.randint(1, 20) for j in range(r[i])] for k in range(r[i])]
        x_val = "{} x {}".format(r[i], r[i])
        X.append(x_val)

        # %timeit is an IPython magic, so this runs only in IPython / Jupyter
        y_val_1 = %timeit -o transpose_matrix_1(matrix=matrix)
        y_val_2 = %timeit -o transpose_matrix_2(matrix=matrix)
        y_val_3 = %timeit -o transpose_matrix_3(matrix=matrix)
        print('---------')

        # split the result string into value, unit and the rest
        y_val_1 = str(y_val_1).split(' ', 2)
        y_val_2 = str(y_val_2).split(' ', 2)
        y_val_3 = str(y_val_3).split(' ', 2)

        y_val_1 = time_convertor(val=float(y_val_1[0]), time_type=y_val_1[1])
        y_val_2 = time_convertor(val=float(y_val_2[0]), time_type=y_val_2[1])
        y_val_3 = time_convertor(val=float(y_val_3[0]), time_type=y_val_3[1])

        Y1.append(y_val_1)
        Y2.append(y_val_2)
        Y3.append(y_val_3)

    plt.figure(figsize=(15, 8))
    plt.plot(X, Y1, '-o', label='Normal function')
    plt.plot(X, Y2, '-o', label='List comprehension')
    plt.plot(X, Y3, '-o', label='Lambda function')
    plt.xlabel('Matrix dimension')
    plt.ylabel('Function performance')
    plt.legend()
    plt.show()
    return True

2.97 µs ± 81.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.6 µs ± 85.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.49 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
---------
4.65 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.35 µs ± 84.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.85 µs ± 90.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
---------
13.9 µs ± 582 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.82 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.13 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
---------
267 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
144 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.7 µs ± 863 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
---------
1.02 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
527 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
113 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
---------
33.1 ms ± 815 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.9 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.58 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
---------
120 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
73.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
20 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Performance Graph

The time complexity of all three functions is O(M*N), since every item has to be visited once. In these measurements, however, the list comprehension is consistently faster than the regular for loop.

The lambda function, in turn, is considerably faster than the list comprehension. All I had to do was use the built-in functions map() and zip(), which are generally faster than equivalent hand-written Python loops.

How to find an Intersection between two matrices easily using NumPy

Mohammed Sameeruddin — Fri, 19 Feb 2021 05:14:26 GMT

Introduction

In mathematics, the intersection of two objects is the collection of elements they have in common. Intuitively, an element is in the intersection of several objects if it belongs to all of them.

Credits of Cover Image - Photo by Benjamin Elliott on Unsplash

Geometrically speaking, if we have two distinct lines (assuming these lines are two objects), the intersection of these two lines would be the point where both the lines meet. Well, in the case of parallel lines, the intersection doesn't exist. Geographically, the common junction between two or more roads can be taken as the area or region of intersection.

In Set theory, the intersection of two objects such as A and B is defined as the set of elements contained in both A and B. Symbolically, we represent the intersection as -

$$A \cap B$$

We also read the above symbol as AND. Programmatically, it is very easy to find the intersection of two sets.

Using NumPy, we can find the intersection of two matrices in two different ways. In fact, there are built-in methods that make this easy.

Let's first define the two matrices of the same size.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> a = np.random.randint(low=0, high=2, size=(5, 5))
>>> b = np.random.randint(low=0, high=2, size=(5, 5))
>>>
>>> print(a)
[[1 0 1 1 0]
 [1 0 0 1 0]
 [1 0 1 1 0]
 [1 0 1 0 1]
 [0 1 0 0 0]]
>>>
>>> print(b)
[[0 0 1 1 0]
 [0 0 0 0 0]
 [1 0 0 1 1]
 [0 0 1 1 0]
 [0 1 0 0 0]]

Using the operation `&`

In Python, & is the AND operator, which plays the role of intersection here. When we apply it to a and b, NumPy performs the calculation element by element, so each position of the result is 1 only where both a and b are 1.

>>> c = a & b
>>> print(c)
[[0 0 1 1 0]
 [0 0 0 0 0]
 [1 0 0 1 0]
 [0 0 1 0 0]
 [0 1 0 0 0]]

The result c is the intersection of the matrices a and b.
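As a quick check of what & does on arrays, here is a minimal sketch on two small made-up binary matrices (not the random ones above). It confirms that & is the same as np.bitwise_and and that, for 0/1 values, it coincides with the elementwise product.

```python
import numpy as np

a = np.array([[1, 0, 1], [1, 1, 0]])
b = np.array([[0, 0, 1], [1, 0, 0]])

c = a & b

# The operator form and the function form give the same result.
assert np.array_equal(c, np.bitwise_and(a, b))
# For 0/1 matrices, AND coincides with the elementwise product.
assert np.array_equal(c, a * b)

print(c)
```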

Using the method `np.where()`

np.where() is a method that acts like a ternary operator on NumPy arrays. In this form, it takes 3 parameters -

condition - the condition, which is either true or false at each position of an array (boolean values).
x - the value used at the positions where the condition is true.
y - the value used at the positions where the condition is false.

Let's have a function to get the intersection of the matrices.
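The ternary behaviour is easiest to see on a one-dimensional array. The sketch below uses made-up values and compares np.where() with the equivalent conditional expression in plain Python.

```python
import numpy as np

values = np.array([3, -1, 4, -5])

# Keep the positive values and use 0 everywhere else.
clipped = np.where(values > 0, values, 0)
print(clipped)

# The same logic written as a ternary expression per item.
same = [v if v > 0 else 0 for v in values.tolist()]
print(same)
```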

def intersect_matrices(mat1, mat2):
    if not (mat1.shape == mat2.shape):
        return False
    mat_intersect = np.where((mat1 == mat2), mat1, 0)
    return mat_intersect

If we call the above function with our matrices a and b, we get -

>>> c = intersect_matrices(mat1=a, mat2=b)
>>> print(c)
[[0 0 1 1 0]
 [0 0 0 0 0]
 [1 0 0 1 0]
 [0 0 1 0 0]
 [0 1 0 0 0]]

If we visualize the above results, we can get -

In the plot, white patches represent the true values and black patches represent the false values. In subplot c in particular, white patches represent the intersection of the matrices a and b.

Note: The above implementation is for binary matrices. When the matrices hold values other than 0 and 1, it is better to go with the second procedure, where we keep the values that match and replace the values with 0 at the positions where the condition is false.

For values other than 0 and 1, the plot looks something like the one below.
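The difference between the two procedures shows up as soon as the values go beyond 0 and 1. In the sketch below, which uses small made-up matrices, & works on the bits of each number, while the np.where() comparison keeps only the values that are equal at the same position.

```python
import numpy as np

a = np.array([[6, 2], [5, 9]])
b = np.array([[3, 2], [5, 1]])

# Bitwise AND: 6 & 3 is 2 (binary 110 & 011 = 010), 9 & 1 is 1.
bitwise = a & b
# Equality-based intersection: keep a value only where both matrices agree.
matched = np.where(a == b, a, 0)

print(bitwise)
print(matched)
```

Only the second result corresponds to the set-style intersection, because 6 and 3 share a bit but are not the same value.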

End

Complete Understanding of Morphological Transformations in Image Processing

Mohammed Sameeruddin — Tue, 16 Feb 2021 14:31:11 GMT

Introduction

In this article, we will explore other transformations where image erosion and image dilation stand as a base. In the previous articles on morphological transformations, we learned the two important transformations namely erosion and dilation.

The transformations that are built on top of these two are -

Opening
Closing
Morphological Gradient
Top hat
Black hat
Boundary Extraction
Hit Miss Transformation

We have seen a step-by-step implementation of erosion and dilation explaining the convolution method with simple matrix operations. In all of these transformations, we rely on the binary input image and a structuring element or kernel. The structuring element needs to be a square matrix which is again a binary matrix.

Note: If you are not familiar with erosion and dilation, I would recommend you refer to my previous articles.

Image Erosion https://bit.ly/3gcojco
Image Dilation https://bit.ly/2TlOncd

For now, we will consider -

A - input image matrix
B - kernel matrix

Credits of Cover Image - Photo by Suzanne D. Williams on Unsplash

Opening Transformation / Image Opening

We know erosion and dilation are opposite to each other. Opening is just another name for erosion followed by dilation. Mathematically, we can represent it as -

$$(A \circ B) \rightarrow (1)$$

If we further break it down, we can represent it as -

$$(A \circ B) = (A \ominus B) \oplus B \rightarrow (2)$$

Eq (2) expresses Eq (1) in terms of erosion and dilation.

This transformation is helpful in removing the noise from the image.

Closing Transformation / Image Closing

The Closing transformation is the opposite of the Opening transformation. Closing is just another name for dilation followed by erosion. Mathematically, we can represent it as -

$$(A \bullet B) \rightarrow (3)$$

If we further break it down, we can represent it as -

$$(A \bullet B) = (A \oplus B) \ominus B \rightarrow (4)$$

Eq (4) expresses Eq (3) in terms of dilation and erosion.

This transformation is helpful in closing the holes in the foreground object of the image.
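Both effects can be checked on tiny matrices. The sketch below is a simplified illustration and not the class used later in this article: the helper names erode, dilate, opening and closing are hypothetical, the images use 0/1 values instead of 0/255, and the kernel is assumed to be a 3 x 3 block of ones with zero padding at the borders.

```python
import numpy as np

def erode(image, k=3):
    # A pixel survives only if its whole k x k neighbourhood is 1.
    pad = k // 2
    padded = np.pad(image, pad, mode="constant")
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def dilate(image, k=3):
    # A pixel is set if any pixel in its k x k neighbourhood is 1.
    pad = k // 2
    padded = np.pad(image, pad, mode="constant")
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def opening(image, k=3):
    return dilate(erode(image, k), k)  # erosion followed by dilation

def closing(image, k=3):
    return erode(dilate(image, k), k)  # dilation followed by erosion

# Opening: a 3x3 square plus one isolated noise pixel.
clean = np.zeros((7, 7), dtype=np.uint8)
clean[2:5, 2:5] = 1
noisy = clean.copy()
noisy[0, 6] = 1
assert np.array_equal(opening(noisy), clean)  # the noise pixel is gone

# Closing: a 5x5 square with a one-pixel hole in the middle.
filled = np.zeros((9, 9), dtype=np.uint8)
filled[2:7, 2:7] = 1
holed = filled.copy()
holed[4, 4] = 0
assert np.array_equal(closing(holed), filled)  # the hole is closed
```

In both cases the main object keeps its size, because the second step undoes the shrinking or growing caused by the first one.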

Morphological Gradient

The morphological gradient can be easily obtained once we have the eroded image and dilated image. It is the difference between dilated image and an eroded image. Mathematically, we can represent it as -

$$(A \oplus B) - (A \ominus B)$$

The resultant of this transformation appears to be an outline of the foreground object.

Top Hat Transformation

Top Hat transformation is the difference between the input image and the opening of the image. Mathematically, we can represent it as -

$$A - (A \circ B) \rightarrow (5)$$

If we further break it down, we can represent it as -

$$A - [\ (A \ominus B) \oplus B \ ] \rightarrow (6)$$

Eq (6) expresses Eq (5) in terms of erosion and dilation.

Black Hat Transformation

Black Hat transformation is the difference between the closing of the input image and the input image. Mathematically, we can represent it as -

$$(A \bullet B) - A \rightarrow (7)$$

If we further break it down, we can represent it as -

$$[\ (A \oplus B) \ominus B \ ] - A \rightarrow (8)$$

Eq (8) expresses Eq (7) in terms of dilation and erosion.

Boundary Extraction

Boundary extraction is one of the applications of morphological transformations. In simple words, it is the difference between the input image and the eroded image. Mathematically, we can represent it as -

$$A - (A \ominus B)$$

The boundary of the foreground object is shown in white and the rest in black. If we do the reverse, i.e., take the difference between the input image and the dilated image, the boundary appears in black and the rest in white.
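The two difference-based transformations can be verified by counting pixels on a small example. This is a minimal sketch under stated assumptions: the helpers erode and dilate are hypothetical, the image uses 0/1 values with a signed integer type so that the subtraction cannot wrap around, and the kernel is a 3 x 3 block of ones.

```python
import numpy as np

def erode(image, k=3):
    # A pixel survives only if its whole k x k neighbourhood is 1.
    pad = k // 2
    padded = np.pad(image, pad, mode="constant")
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def dilate(image, k=3):
    # A pixel is set if any pixel in its k x k neighbourhood is 1.
    pad = k // 2
    padded = np.pad(image, pad, mode="constant")
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

# A 5x5 solid square inside a 9x9 image.
square = np.zeros((9, 9), dtype=int)
square[2:7, 2:7] = 1

# Boundary extraction: A - (A eroded by B) leaves a one-pixel-wide ring.
boundary = square - erode(square)
# Morphological gradient: (A dilated by B) - (A eroded by B), a thicker ring.
gradient = dilate(square) - erode(square)

print(boundary.sum())  # 25 - 9 pixels
print(gradient.sum())  # 49 - 9 pixels
```

The erosion shrinks the square to 3 x 3 and the dilation grows it to 7 x 7, which is where the two pixel counts come from.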

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV

`import` the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

For the entire implementation, I will be using a fingerprint image. You can use the same image for practice.

Code Morphological Transformations

class MorphologicalTransformations(object):
    def __init__(self, image_file, level):
        self.level = 3 if level < 3 else level
        self.image_file = image_file
        self.MAX_PIXEL = 255
        self.MIN_PIXEL = 0
        self.MID_PIXEL = self.MAX_PIXEL // 2
        # build the kernel from self.level so that it matches the submatrix size
        self.kernel = np.full(shape=(self.level, self.level), fill_value=255)

    def read_this(self):
        # read the image in grayscale mode
        image_src = cv2.imread(self.image_file, 0)
        return image_src

    def convert_binary(self, image_src, thresh_val):
        color_1 = self.MAX_PIXEL
        color_2 = self.MIN_PIXEL
        initial_conv = np.where((image_src <= thresh_val), image_src, color_1)
        final_conv = np.where((initial_conv > thresh_val), initial_conv, color_2)
        return final_conv

    def binarize_this(self):
        image_src = self.read_this()
        image_b = self.convert_binary(image_src=image_src, thresh_val=self.MID_PIXEL)
        return image_b

    def get_flat_submatrices(self, image_src, h_reduce, w_reduce):
        # collect every (level x level) window of the padded image
        image_shape = image_src.shape
        flat_submats = np.array([
            image_src[i:(i + self.level), j:(j + self.level)]
            for i in range(image_shape[0] - h_reduce) for j in range(image_shape[1] - w_reduce)
        ])
        return flat_submats

    def erode_image(self, image_src, with_plot=False):
        orig_shape = image_src.shape
        pad_width = self.level - 2

        image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
        pimg_shape = image_pad.shape
        h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])

        flat_submats = self.get_flat_submatrices(
            image_src=image_pad, h_reduce=h_reduce, w_reduce=w_reduce
        )

        # a pixel stays white only if the whole window matches the kernel
        image_eroded = np.array([255 if (i == self.kernel).all() else 0 for i in flat_submats])
        image_eroded = image_eroded.reshape(orig_shape)

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_eroded, head_text='Eroded - {}'.format(self.level))
            return None
        return image_eroded

    def dilate_image(self, image_src, with_plot=False):
        orig_shape = image_src.shape
        pad_width = self.level - 2

        image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
        pimg_shape = image_pad.shape
        h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])

        flat_submats = self.get_flat_submatrices(
            image_src=image_pad, h_reduce=h_reduce, w_reduce=w_reduce
        )

        # a pixel turns white if any part of the window matches the kernel
        image_dilated = np.array([255 if (i == self.kernel).any() else 0 for i in flat_submats])
        image_dilated = image_dilated.reshape(orig_shape)

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_dilated, head_text='Dilated - {}'.format(self.level))
            return None
        return image_dilated

    def open_image(self, image_src, with_plot=False):
        image_eroded = self.erode_image(image_src=image_src)
        image_opening = self.dilate_image(image_src=image_eroded)

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_opening, head_text='Opening - {}'.format(self.level))
            return None
        return image_opening

    def close_image(self, image_src, with_plot=False):
        image_dilated = self.dilate_image(image_src=image_src)
        image_closing = self.erode_image(image_src=image_dilated)

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_closing, head_text='Closing - {}'.format(self.level))
            return None
        return image_closing

    def morph_gradient(self, image_src, with_plot=False):
        image_dilated = self.dilate_image(image_src=image_src)
        image_eroded = self.erode_image(image_src=image_src)
        image_grad = image_dilated - image_eroded

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_grad, head_text='Gradient Morph - {}'.format(self.level))
            return None
        return image_grad

    def extract_boundary(self, image_src, with_plot=False):
        image_eroded = self.erode_image(image_src=image_src)
        ext_bound = image_src - image_eroded

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=ext_bound, head_text='Boundary - {}'.format(self.level))
            return None
        return ext_bound

    def get_tophat(self, image_src, with_plot=False):
        image_opening = self.open_image(image_src=image_src)
        image_tophat = image_src - image_opening

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_tophat, head_text='Top Hat - {}'.format(self.level))
            return None
        return image_tophat

    def get_blackhat(self, image_src, with_plot=False):
        image_closing = self.close_image(image_src=image_src)
        image_blackhat = image_closing - image_src

        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_blackhat, head_text='Black Hat - {}'.format(self.level))
            return None
        return image_blackhat

    def plot_it(self, orig_matrix, trans_matrix, head_text):
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        cmap_val = 'gray'

        ax1.axis("off")
        ax1.title.set_text('Original')

        ax2.axis("off")
        ax2.title.set_text(head_text)

        ax1.imshow(orig_matrix, cmap=cmap_val)
        ax2.imshow(trans_matrix, cmap=cmap_val)
        plt.show()
        return True

Object Creation

morph = MorphologicalTransformations(
    image_file='Finger-arch.jpg',
    level=3
)
image_src = morph.binarize_this()

The code above creates an object called morph and uses it to convert the input image into a binary image.

Erosion

morph.erode_image(image_src=image_src, with_plot=True)

Dilation

morph.dilate_image(image_src=image_src, with_plot=True)

Opening

morph.open_image(image_src=image_src, with_plot=True)

Closing

morph.close_image(image_src=image_src, with_plot=True)

Morphological Gradient

morph.morph_gradient(image_src=image_src, with_plot=True)

Top Hat

morph.get_tophat(image_src=image_src, with_plot=True)

Black Hat

morph.get_blackhat(image_src=image_src, with_plot=True)

Boundary Extraction

morph.extract_boundary(image_src=image_src, with_plot=True)

These are the results of all the transformations. We obtained these with the help of erosion and dilation.
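As a rough, self-contained sketch of how the compound transformations reduce to erosion and dilation, the snippet below applies them to a tiny binary matrix. The helpers `erode` and `dilate` are hypothetical stand-ins, not the class methods above, and they assume a 3 x 3 kernel of 255s with a zero padding of `level // 2` (which equals `level - 2` for a 3 x 3 kernel).

```python
import numpy as np

def _windows(image, level=3):
    # zero-pad so that every pixel gets a (level x level) neighbourhood
    pad = level // 2
    padded = np.pad(image, pad_width=pad, mode='constant')
    h, w = image.shape
    return [padded[i:i + level, j:j + level] for i in range(h) for j in range(w)]

def erode(image, level=3):
    # a pixel survives only if its whole neighbourhood is foreground (255)
    out = [255 if (win == 255).all() else 0 for win in _windows(image, level)]
    return np.array(out).reshape(image.shape)

def dilate(image, level=3):
    # a pixel becomes foreground if any neighbour is foreground (255)
    out = [255 if (win == 255).any() else 0 for win in _windows(image, level)]
    return np.array(out).reshape(image.shape)

# toy image: a 3 x 3 white square plus one isolated white pixel (noise)
image = np.zeros((7, 7), dtype=int)
image[1:4, 1:4] = 255
image[5, 5] = 255

opening = dilate(erode(image))           # erosion followed by dilation
closing = erode(dilate(image))           # dilation followed by erosion
gradient = dilate(image) - erode(image)  # morphological gradient
boundary = image - erode(image)          # boundary extraction
top_hat = image - opening                # what opening removed

print(opening)
print(top_hat)
```

In this toy case the opening removes the isolated pixel and keeps the square, so the top hat contains only that isolated pixel.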

Note: All of the code is written from scratch using the NumPy library.

Conclusion

In this article, we have covered most of the common morphological transformations. There are also some advanced transformations, such as the Hit-or-Miss transformation, which I didn't cover here but definitely will next time.

When speed and compatibility are constraints, I would recommend using the official library methods. This exercise was only meant to understand the inner workings and mathematics behind each algorithm.

End

Image Dilation Explained in Depth using NumPy

Mohammed Sameeruddin — Wed, 10 Feb 2021 06:11:19 GMT

Introduction

In this article, we will explore the mathematics behind the image dilation operation. Like Image Erosion, Image Dilation is another important morphological operation, used to increase or expand the shapes contained in the input image. Think of this as diluting the image. Diluting anything requires water; here, we need a structuring element or kernel.

Credits of Cover Image - Photo by Jason Leung on Unsplash

Note: We are not expanding or increasing the image size. We are growing the white (foreground) regions while the size of the image remains the same.

Mathematically, we can represent this operation in the following way -

$$A \oplus B$$

where -

A - the input image, binarized
B - the structuring element or kernel

The result of the above formula is the dilated image. The technique that we apply here is a sliding-window operation, much like a 2D convolution of the input image with the kernel. The kernel is basically a square matrix.
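For reference, a common set-theoretic way to write binary dilation, as found in standard image-processing textbooks, is given below. The symbols $\hat{B}$ and $z$ are not used elsewhere in this article and are introduced only for this formula.

$$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\}$$

Here $\hat{B}$ is the reflection of the kernel and $(\hat{B})_z$ is that kernel translated to position $z$. In words, an output pixel is set whenever the kernel, placed at that pixel, overlaps at least one foreground pixel of A.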

Concept of Dilation

A typical binary image consists of only 1s (255) and 0s. The kernel, which is again in binary form, may or may not be a subset of the input image. To think of this mathematically in terms of matrices, we can have:

A as the matrix of the input image and B as the matrix of the structuring element. The following steps are applied for the convolution -

Know the size of B to pad A with 0s. The padding width is (kernel_size - 2).
Position the center element of B on every element of the original input image (matrix) iteratively.
Extract the submatrix that is exactly equal to the size of B.
Check if any (at least one) element of the submatrix is equal to the element of B at the same index location.
- If yes, set the element of A to 1 or 255.
- Otherwise, set it to 0.
Continue this process for all the elements of A.

Note: We shall start from the first element of A till the last element of A. The GIF can be seen below to understand the convolution technique more clearly.

Imagine the matrix A is a binary image -

similarly, the matrix B (kernel) -

Since the kernel size is (3 x 3), we need to pad A by width (3-2) which is 1.

To extract the submatrices, we position the center element of B on every element of A. Breaking it down this way gives us one giant matrix in which every element is a (3 x 3) submatrix of A.

From these submatrices, we can easily map the kernel B and obtain a new value for every element of A.
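Since the matrices in this walkthrough are shown as images, here is a small numeric sketch of the same steps. The 5 x 5 matrix `A` below is a made-up example (not the one from the figures), and `B` is assumed to be a 3 x 3 kernel of 255s.

```python
import numpy as np

# hypothetical binary image: a single white pixel in the middle
A = np.zeros((5, 5), dtype=int)
A[2, 2] = 255

B = np.full((3, 3), 255)   # structuring element (kernel)
k = B.shape[0]

# step 1: pad A with zeros; the width is (kernel_size - 2) = 1
A_pad = np.pad(A, pad_width=k - 2, mode='constant')

# steps 2-3: one (3 x 3) submatrix per element of A
submats = [A_pad[i:i + k, j:j + k] for i in range(A.shape[0]) for j in range(A.shape[1])]

# step 4: 255 if any element matches the kernel at the same index, else 0
dilated = np.array([255 if (s == B).any() else 0 for s in submats]).reshape(A.shape)
print(dilated)
```

The single white pixel grows into a 3 x 3 white block, while the matrix keeps its 5 x 5 shape.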

The resultant image considering the input image A -

Notice the difference: we have expanded the white pixels of the original image with the level of 5, and the black pixels got reduced.

Let's code this totally from scratch using NumPy.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV - only used for reading the image (in this article).

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

Since we do the morphological transformations on binary images, we shall make sure whatever image we read is binarized. Therefore, we have the following functions.

def read_this(image_file):
    image_src = cv2.imread(image_file, 0)
    return image_src

def convert_binary(image_src, thresh_val):
    color_1 = 255
    color_2 = 0
    initial_conv = np.where((image_src <= thresh_val), image_src, color_1)
    final_conv = np.where((initial_conv > thresh_val), initial_conv, color_2)
    return final_conv

def binarize_this(image_file, thresh_val=127):
    image_src = read_this(image_file=image_file)
    image_b = convert_binary(image_src=image_src, thresh_val=thresh_val)
    return image_b
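The two np.where() calls in convert_binary() can also be folded into one. The sketch below is an alternative rather than the article's function: `binarize_array` is a hypothetical name, and it works on an array that has already been read instead of a file.

```python
import numpy as np

def binarize_array(image_src, thresh_val=127):
    # pixels above the threshold become 255, everything else becomes 0
    return np.where(image_src > thresh_val, 255, 0).astype(np.uint8)

sample = np.array([[0, 100, 127], [128, 200, 255]], dtype=np.uint8)
print(binarize_array(sample))
```

With the default threshold of 127, the first row maps to 0 and the second row maps to 255.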

In the dilation function, the main parameters that are passed are:

image_file - the input image on which the dilation operation has to be performed.
dilation_level - the number of levels by which we have to dilate the image. By default, the value of this will be 3.
with_plot - simply to visualize the result, showing the comparison between the original image and the dilated image.

In the function

We obtain the kernel matrix based on the dilation_level.
We pad the input image matrix by (dilation_level - 2), where dilation_level is also the kernel size.
We obtain the submatrices and replace the new values accordingly as shown in the GIF.
Finally, we reshape the newly obtained array into the original size of the input image and plot the same.

def dilate_this(image_file, dilation_level=3, with_plot=False):
    # setting the dilation_level
    dilation_level = 3 if dilation_level < 3 else dilation_level
    # obtain the kernel by the shape of (dilation_level, dilation_level)
    structuring_kernel = np.full(shape=(dilation_level, dilation_level), fill_value=255)
    image_src = binarize_this(image_file=image_file)
    orig_shape = image_src.shape
    pad_width = dilation_level - 2
    # pad the image with pad_width
    image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
    pimg_shape = image_pad.shape
    h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])
    # obtain the submatrices according to the size of the kernel
    flat_submatrices = np.array([
        image_pad[i:(i + dilation_level), j:(j + dilation_level)]
        for i in range(pimg_shape[0] - h_reduce) for j in range(pimg_shape[1] - w_reduce)
    ])
    # replace the values either 255 or 0 by dilation condition
    image_dilate = np.array([255 if (i == structuring_kernel).any() else 0 for i in flat_submatrices])
    # obtain new matrix whose shape is equal to the original image size
    image_dilate = image_dilate.reshape(orig_shape)
    # plotting
    if with_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Dilated - {}".format(dilation_level))
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_dilate, cmap=cmap_val)
        plt.show()
        return True
    return image_dilate

Now that the dilation function is ready, all that is left is testing. We will use a different image for testing.

For dilation level 3 -

dilate_this(image_file='wish.jpg', dilation_level=3, with_plot=True)

For dilation level 5 -

dilate_this(image_file='wish.jpg', dilation_level=5, with_plot=True)

In all the above results, we can notice the increase in the white pixels. This is example code developed totally from scratch. We can also rely on the cv2.dilate() method, which is much faster.
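One reason the from-scratch version is slow is the Python-level loop over the submatrices. As a hedged alternative (not the article's code), NumPy 1.20 and later provide `numpy.lib.stride_tricks.sliding_window_view`, which builds all the windows at once. In this sketch `dilate_fast` is a hypothetical name, the kernel size is assumed to be odd, and the padding is `level // 2` so that the window stays centred on each pixel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dilate_fast(image_bin, level=3):
    # image_bin: 2D array of 0s and 255s; level: odd kernel size (assumption)
    pad = level // 2
    padded = np.pad(image_bin, pad_width=pad, mode='constant')
    # windows has shape (H, W, level, level): one window per original pixel
    windows = sliding_window_view(padded, (level, level))
    # a pixel is set if any value in its window is foreground
    hit = (windows == 255).any(axis=(2, 3))
    return np.where(hit, 255, 0)

demo = np.zeros((5, 5), dtype=int)
demo[2, 2] = 255
print(dilate_fast(demo, level=3))
```

The windows are views into the padded array, so no submatrix is copied, and the comparison runs in compiled NumPy code instead of a list comprehension.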

References

Morphological Transformations official documentation.

Image Erosion Explained in Depth using NumPy

Mohammed Sameeruddin — Sat, 06 Feb 2021 12:16:38 GMT

Introduction

Erosion operation is one of the important morphological operations (morphological transformations) that follows a technique of mathematical morphology for the analysis and processing of geometrical structures.

To get a general idea of what erosion does to images, we can think of it as an operation that tries to reduce the shape contained in the input image. It is just like the erosion of soil, except that this operation erodes the boundaries of the foreground object.

Credits of Cover Image - Photo by Laura Colquitt on Unsplash

To represent this operation mathematically -

$$A \ominus B$$

where -

A - the input image
B - the structuring element or kernel

The result of the above formula is the eroded image. The structuring element is basically a kernel that is slid over the image matrix, much like a 2D convolution.
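For reference, a common set-theoretic way to write binary erosion, as found in standard image-processing textbooks, is given below. The symbol $z$ is not used elsewhere in this article and is introduced only for this formula.

$$A \ominus B = \{\, z \mid (B)_z \subseteq A \,\}$$

Here $(B)_z$ is the kernel translated to position $z$. In words, an output pixel is kept only when the kernel, placed at that pixel, fits entirely inside the foreground of A.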

Note - This blog post covers the erosion process done on binary images. Also, it is often preferred to use binary images for morphological transformation.

Concept of Erosion

As discussed, we only use binary images, whose pixels are either 0 or 1 (0 or 255 to be more precise). The structuring element or kernel, which may or may not be a subset of the image matrix, is also a binary representation and is mostly a square matrix.

Let us consider A as the image matrix and B as the kernel. We have conditions as follows:

We have to position the center element of B on each element of the image A, iteratively.
We consider the submatrix of A that has the size of B and check if the submatrix is exactly equivalent to B.
If yes, set the pixel value to 1 (or 255), otherwise to 0.

We need to do this for all the elements of A with B.

Imagine the image matrix A as -

and structuring element or kernel B as -

The binary image of the matrix A would be something like below.

For easy calculation, we shall pad the image by a pad_width equal to (kernel size - 2) with which the submatrix can be selected easily. The GIF can be seen below to visually know the inner working of the convolution.
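The conditions above can be sketched numerically on a small matrix. The 5 x 5 matrix `A` below is a made-up example (not the one from the figures), and `B` is assumed to be a 3 x 3 kernel of 255s.

```python
import numpy as np

# hypothetical binary image: a 3 x 3 white block inside a 5 x 5 frame
A = np.zeros((5, 5), dtype=int)
A[1:4, 1:4] = 255

B = np.full((3, 3), 255)   # structuring element (kernel)
k = B.shape[0]

# pad by (kernel size - 2) so that every element of A has a full submatrix
A_pad = np.pad(A, pad_width=k - 2, mode='constant')
submats = [A_pad[i:i + k, j:j + k] for i in range(A.shape[0]) for j in range(A.shape[1])]

# keep a pixel only if its whole submatrix equals the kernel
eroded = np.array([255 if (s == B).all() else 0 for s in submats]).reshape(A.shape)
print(eroded)
```

Only the centre of the white block has a neighbourhood that matches the kernel completely, so the block shrinks to a single white pixel.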

Now that we know what to do, let's code the same using the library as well as from scratch.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV - only used for reading the image (in this article).

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

Since we do the morphological transformations on binary images, we shall make sure whatever image we read is binarized. Therefore, we have the following function.

def read_this(image_file):
    image_src = cv2.imread(image_file, 0)
    return image_src

def convert_binary(image_src, thresh_val):
    color_1 = 255
    color_2 = 0
    initial_conv = np.where((image_src <= thresh_val), image_src, color_1)
    final_conv = np.where((initial_conv > thresh_val), initial_conv, color_2)
    return final_conv

def binarize_this(image_file, thresh_val=127):
    image_src = read_this(image_file=image_file)
    image_b = convert_binary(image_src=image_src, thresh_val=thresh_val)
    return image_b

Note - By default we are reading the image in grayscale mode.

Code Implementation with Library

For this, we will be using a different image, and for the implementation, we will use the method erode() available in the module cv2. The parameters are as follows:

image_file - the image to which we want to apply the transformation.
level - the erosion level, which decides the size of the structuring element or kernel.
with_plot - to obtain the results of both the original image and the transformed image.

def erode_lib(image_file, level=3, with_plot=True):
    level = 3 if level < 3 else level
    image_src = binarize_this(image_file=image_file)
    # library method
    image_eroded = cv2.erode(src=image_src, kernel=np.ones((level, level)), iterations=1)
    if with_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Eroded - {}".format(level))
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_eroded, cmap=cmap_val)
        plt.show()
        return True
    return image_eroded

Let's test the above function -

erode_lib(image_file='wish.jpg', level=3, with_plot=True)

Clearly, we can see that some of the white pixels got reduced, showing the pixel erosion.

Code Implementation from Scratch

As explained earlier, we need to carefully choose the pad_width depending upon the erosion_level. We normally take (kernel size - 2) or (erosion_level - 2), and here the kernel is always a square matrix.

After this, we shall also take the submatrices to position the center element of the kernel with each element of the image matrix iteratively. We make sure that the submatrix size is equal to kernel size. Hence we first pad the matrix with pad_width.
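The snippet below is a small check, not part of the article's code, of which rows of the original image a window covers for a given padding. `window_rows` is a hypothetical helper. For a 3 x 3 kernel, (level - 2) and (level // 2) give the same padding. For larger kernels, (level - 2) still yields valid submatrices, but the window is no longer centred on the pixel, so `level // 2` may be worth considering if exact centring matters.

```python
def window_rows(level, pad_width, i=0):
    # original-image row indices covered by the window taken at row i
    # (negative indices fall inside the zero padding)
    return [r - pad_width for r in range(i, i + level)]

# level 3: both conventions pad by 1, and the window is centred on row i
print(window_rows(level=3, pad_width=3 - 2))

# level 5 with (level - 2): the window covers rows i-3 .. i+1
print(window_rows(level=5, pad_width=5 - 2))

# level 5 with (level // 2): the window covers rows i-2 .. i+2
print(window_rows(level=5, pad_width=5 // 2))
```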

Let's code the erosion function from scratch.

def erode_this(image_file, erosion_level=3, with_plot=False):
    erosion_level = 3 if erosion_level < 3 else erosion_level
    structuring_kernel = np.full(shape=(erosion_level, erosion_level), fill_value=255)
    image_src = binarize_this(image_file=image_file)
    orig_shape = image_src.shape
    pad_width = erosion_level - 2
    # pad the matrix with `pad_width`
    image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
    pimg_shape = image_pad.shape
    h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])
    # sub matrices of kernel size
    flat_submatrices = np.array([
        image_pad[i:(i + erosion_level), j:(j + erosion_level)]
        for i in range(pimg_shape[0] - h_reduce) for j in range(pimg_shape[1] - w_reduce)
    ])
    # condition to replace the values - if the kernel equal to submatrix then 255 else 0
    image_erode = np.array([255 if (i == structuring_kernel).all() else 0 for i in flat_submatrices])
    image_erode = image_erode.reshape(orig_shape)
    if with_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Eroded - {}".format(erosion_level))
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_erode, cmap=cmap_val)
        plt.show()
        return True
    return image_erode

Let's test the above function -

erode_this(image_file='wish.jpg', erosion_level=3, with_plot=True)

Clearly, we can see the reduction in the white pixels, showcasing the pixel erosion. We implemented this with a 2D convolution technique totally from scratch. In terms of speed, it is a bit slower than the library method.

References

Morphological Transformations official documentation.

Image Shifting - Implementing from Scratch

Mohammed Sameeruddin — Fri, 22 Jan 2021 11:32:03 GMT

Introduction

Image shifting is simply shifting each pixel of the image to a new position. This is a method of pixel shift used in digital cameras to produce super-resolution images. We can think of a pixel as a point in the coordinate axis to be shifted in any direction. When we implement this on all the pixels of the image then we can say the image is shifted.

In this blog article, we will try to shift the image as we shift the point in the coordinate axis completely using NumPy operations. The image is always considered as a 2D plane, hence we shall also consider a 2D coordinate axis having X as the horizontal axis and Y as the vertical axis.

The coordinate axis is divided into 4 quadrants namely -

Q1 - the quadrant where both X and Y are positive.
Q2 - the quadrant where X is negative and Y is positive.
Q3 - the quadrant where both X and Y are negative.
Q4 - the quadrant where X is positive and Y is negative.

We assume that our original image is at the origin, i.e., (0, 0).

Now, let's say we want to shift the image to the coordinates (3, 4). Basically, the origin of the image has to be shifted from (0, 0) to (3, 4).

Likewise, based on the coordinate points, we need to shift the image. Let's try to understand and implement from scratch using the module NumPy starting from a 2D matrix because images are just large matrices.

Credits of Cover Image - Photo by James Lewis on Unsplash

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV - only used for reading the image (in this article).

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

2D Matrix

We will be creating a 5 X 5 matrix having random numbers.

>>> import random
>>>
>>> mat = [[random.randint(5, 100) for i in range(5)] for j in range(5)]
>>> mat = np.matrix(mat)
>>> print(mat)
[[ 46  13  68  54  12]
 [  7  68  32  46  26]
 [ 46  43  58  27 100]
 [ 64  59  76 100  41]
 [ 35  62  56  44   7]]
>>>

For instance, let us assume that we are shifting the image into Q1. The image has to move to the right along the X axis and upwards along the Y axis. The size of the image increases, because we have to pad the image's left side by the x coordinate depth and its bottom side by the y coordinate depth.

In the case of shifting the image in other quadrants, the same process has to be replicated.

Let's create a padding function that can pad the image on all sides depending on the depth that is passed.

def pad_vector(vector, how, depth, constant_value=0):
    vect_shape = vector.shape[:2]
    if (how == 'upper') or (how == 'top'):
        pp = np.full(shape=(depth, vect_shape[1]), fill_value=constant_value)
        pv = np.vstack(tup=(pp, vector))
    elif (how == 'lower') or (how == 'bottom'):
        pp = np.full(shape=(depth, vect_shape[1]), fill_value=constant_value)
        pv = np.vstack(tup=(vector, pp))
    elif (how == 'left'):
        pp = np.full(shape=(vect_shape[0], depth), fill_value=constant_value)
        pv = np.hstack(tup=(pp, vector))
    elif (how == 'right'):
        pp = np.full(shape=(vect_shape[0], depth), fill_value=constant_value)
        pv = np.hstack(tup=(vector, pp))
    else:
        return vector
    return pv

The above function is used to pad the image. The arguments used are as follows:

vector - the matrix to which the padding is applied.
how - the side on which the padding is added; the padded sides together decide the quadrant into which the image is shifted. It takes four values:
- lower or bottom
- upper or top
- right
- left
depth - the depth of the padding.
constant_value - the fill value of the padding; the default value 0 signifies black.

Note - For the padding-right and padding-left, we use the method hstack(). In the same way, for the padding-top and padding-bottom, we use the method vstack(). These two are the NumPy methods.

hstack() - horizontal stack
vstack() - vertical stack

Based on the depth, the padding matrix whose values are zeros is created. Similarly, based on the direction, this padding matrix is stacked accordingly with the original matrix.
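As a minimal sketch of the stacking idea on its own, the snippet below builds two zero blocks and attaches them to a tiny made-up matrix. The names `left_pad` and `top_pad` are illustrative only.

```python
import numpy as np

mat = np.array([[1, 2], [3, 4]])

# a block of zeros with the same number of rows as mat, 3 columns wide
left_pad = np.zeros((mat.shape[0], 3), dtype=mat.dtype)
# a block of zeros with the same number of columns as mat, 1 row high
top_pad = np.zeros((1, mat.shape[1]), dtype=mat.dtype)

print(np.hstack((left_pad, mat)))   # padding on the left
print(np.vstack((top_pad, mat)))    # padding on the top
```

hstack() needs matching row counts and vstack() needs matching column counts, which is why the padding block takes one of its dimensions from the original matrix.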

Let's test the above function.

>>> pmat = pad_vector(vector=mat, how='left', depth=3)
>>> print(pmat)
[[  0   0   0  46  13  68  54  12]
 [  0   0   0   7  68  32  46  26]
 [  0   0   0  46  43  58  27 100]
 [  0   0   0  64  59  76 100  41]
 [  0   0   0  35  62  56  44   7]]
>>>

We can clearly see that the function padded the left side of the matrix with a depth of 3. If we were to plot the same (convert the padded matrix into an image), we would have something like the below.

>>> plt.axis("off")
>>> plt.imshow(pmat, cmap='gray')
>>> plt.show()

Whereas the original image is -

With this, we can conclude the image is shifted to the right side towards the X axis with the x coordinate 3. The same technique is applied to the real image. Let's try that.

We shall have a function to read the image both in grayscale and RGB format.

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

Let's make another function called shifter(), which handles the shift along the Y axis irrespective of the quadrant.

def shifter(vect, y, y_):
    if (y > 0):
        image_trans = pad_vector(vector=vect, how='lower', depth=y_)
    elif (y < 0):
        image_trans = pad_vector(vector=vect, how='upper', depth=y_)
    else:
        image_trans = vect
    return image_trans

Now that we have the shifter() function, we will need to use this in another function that can shift anywhere in the coordinate axis. Here, we consider X and Y axes.

def shift_image(image_src, at):
    x, y = at
    x_, y_ = abs(x), abs(y)
    if (x > 0):
        left_pad = pad_vector(vector=image_src, how='left', depth=x_)
        image_trans = shifter(vect=left_pad, y=y, y_=y_)
    elif (x < 0):
        right_pad = pad_vector(vector=image_src, how='right', depth=x_)
        image_trans = shifter(vect=right_pad, y=y, y_=y_)
    else:
        image_trans = shifter(vect=image_src, y=y, y_=y_)
    return image_trans

When x and y coordinates are greater than 0, pad the image on the left side and the bottom side.

When x is greater than 0 and y is less than 0, pad the image on the left side and the top side.

When x is less than 0 and y is greater than 0, pad the image on the right side and the bottom side.

When x and y coordinates are less than 0, pad the image on the right side and the top side.

When x and y coordinates are exactly equal to 0, do not disturb the image.
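The rules above map directly onto the pad widths accepted by `np.pad()`. The sketch below is an alternative formulation, not the article's implementation, and `shift_with_pad` is a hypothetical name. It also accepts a color image without splitting the channels, by giving the channel axis a pad width of zero.

```python
import numpy as np

def shift_with_pad(image, at, constant_value=0):
    x, y = at
    left, right = (x, 0) if x > 0 else (0, -x)    # x > 0 pads the left side
    bottom, top = (y, 0) if y > 0 else (0, -y)    # y > 0 pads the bottom side
    # rows come first in NumPy: (top, bottom), then (left, right)
    pad_spec = [(top, bottom), (left, right)]
    # leave any extra axes (e.g. RGB channels) untouched
    pad_spec += [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_spec, mode='constant', constant_values=constant_value)

mat = np.arange(1, 7).reshape(2, 3)
print(shift_with_pad(mat, at=(2, 1)))
```

Here `at=(2, 1)` adds two columns of zeros on the left and one row of zeros at the bottom, which is the Q1 case described above.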

There is one more problem to solve before we can translate or shift the image. An image can be of two types - grayscale and colored. For grayscale, there won't be any problem. But for a colored image, we need to separate the RGB channels, apply the shift function to each, and then finally combine the channels. Hence the below function is helpful for shifting both RGB and grayscale images.

def translate_this(image_file, at, with_plot=False, gray_scale=False):
    if len(at) != 2: return False
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_image, g_image, b_image = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        r_trans = shift_image(image_src=r_image, at=at)
        g_trans = shift_image(image_src=g_image, at=at)
        b_trans = shift_image(image_src=b_image, at=at)
        image_trans = np.dstack(tup=(r_trans, g_trans, b_trans))
    else:
        image_trans = shift_image(image_src=image_src, at=at)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Translated")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_trans, cmap=cmap_val)
        return True
    return image_trans

Demonstration

Testing for the regular RGB image.

translate_this(
    image_file='lena_original.png',
    at=(60, 60),
    with_plot=True
)

Clearly, the origin of the image is shifted to (60, 60), i.e., into the first quadrant (Q1).

Now let's test for the regular grayscale image.

translate_this(
    image_file='lena_original.png',
    at=(-60, -60),
    with_plot=True,
    gray_scale=True
)

Clearly, the image is shifted to the position (-60, -60) i.e., in the third quadrant (Q3).

Well, that's it for this article. From this, we tried to understand how the image shifting process is done.

Solarizing the Image with NumPy

Mohammed Sameeruddin — Sat, 09 Jan 2021 09:36:48 GMT

Introduction

Solarization is a process used in Photography to describe the effect of tone reversal observed in cases of extreme overexposure of the photographic film in the camera. It is a darkroom technique to convert any photograph to look graphic or surrealistic. It often occurs when a very intense light source appears in the picture and is grossly overexposed. It is a technique where the image is partially or completely reversed (inverted) by an extreme overexposure. Original Source - here. With this blog, let's understand the same in a more Pythonic way.

Credits of Cover Image - Photo by engin akyurt on Unsplash

When we are solarizing an image, we mainly consider a threshold pixel value in order to reverse (invert) the image partially or completely. To know more about inverting an image, you can refer to my article here.

We will rely on the PIL library to solarize the image. We will also code the same without using the PIL module and by just using NumPy for the matrix operations.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV
PIL

Import the Packages

import numpy as np
import cv2
import json
from PIL import Image
from PIL import ImageOps as ipo
from matplotlib import pyplot as plt

Code Implementation with Library

We use the method solarize() available in the module ImageOps - part of the main module PIL. The method takes the following arguments -

image - the image that needs to be solarized.
threshold - the threshold value used to either partially or completely reverse (invert) the image.

Now that we have all the requirements, let's write the function that solarizes the image.

def solarize_lib(image_file, thresh_val, gray_scale=False):
    image_src = Image.open(image_file)
    image_src = image_src.convert('L') if gray_scale else image_src
    image_sol = ipo.solarize(image_src, threshold=thresh_val)
    cmap_val = None if not gray_scale else 'gray'
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
    ax1.axis("off")
    ax1.title.set_text('Original')
    ax2.axis("off")
    ax2.title.set_text("Solarized")
    ax1.imshow(image_src, cmap=cmap_val)
    ax2.imshow(image_sol, cmap=cmap_val)
    return True

The above function solarizes images that are read in either colored mode or grayscale mode. Let's test the same.

solarize_lib(image_file='scenary.jpg', thresh_val=130)

For RGB image

For grayscale image

Above are the results of the images that are solarized.

Code Implementation from Scratch

For coding from scratch, we are relying on the NumPy module since matrix operations can be achieved with so much ease. We will need to validate for both colored images as well as grayscale images.

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

As above, we will have a thresh_val argument that decides whether the reversing (inverting) is done partially or completely.

def solarize_this(image_file, thresh_val, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_image, g_image, b_image = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        ## inverting the colored image (partially)
        r_sol = np.where((r_image < thresh_val), r_image, ~r_image)
        g_sol = np.where((g_image < thresh_val), g_image, ~g_image)
        b_sol = np.where((b_image < thresh_val), b_image, ~b_image)
        image_sol = np.dstack(tup=(r_sol, g_sol, b_sol))
    else:
        ## inverting the grayscale image (partially)
        image_sol = np.where((image_src < thresh_val), image_src, ~image_src)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Solarized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_sol, cmap=cmap_val)
        return True
    return image_sol

We are using the where() method of NumPy to keep the pixels that are less than thresh_val as they are and to invert the rest (those greater than or equal to thresh_val). Let's test the function.
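To see the thresholding on actual numbers, here is a minimal sketch on a handful of made-up uint8 pixel values, assuming a threshold of 130.

```python
import numpy as np

pixels = np.array([0, 50, 129, 130, 200, 255], dtype=np.uint8)
thresh_val = 130

# keep values below the threshold, invert the rest
solarized = np.where(pixels < thresh_val, pixels, ~pixels)
print(solarized)

# for uint8 data, the bitwise NOT is the same as 255 - value
print(np.array_equal(~pixels, 255 - pixels))
```

The values 0, 50 and 129 stay unchanged, while 130, 200 and 255 become 125, 55 and 0. Note that the `~` shortcut relies on the array being of type uint8.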

For RGB image

solarize_this(
    image_file='scenary.jpg',
    thresh_val=130,
    with_plot=True
)

For grayscale image

solarize_this(
    image_file='lena_original.png',
    thresh_val=120,
    with_plot=True,
    gray_scale=True
)

PIL is a great library and so is NumPy. PIL is easy to use and the results can be obtained with less code. It is great for researchers and beginners. What I have gained is the experience of implementing the same from scratch.

Concatenate Images with NumPy

Mohammed Sameeruddin — Fri, 08 Jan 2021 14:34:27 GMT

Introduction

Concatenation is the process of joining two things end to end. In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, concatenating the two words snow and ball results in snowball. It need not be from the end.

Credits of Cover Image - Photo by Bia Andrade on Unsplash

Concatenation of images is simply concatenating multiple matrices. But in Python, we need to understand that concatenation of lists is different from the concatenation of NumPy arrays.

Concatenating lists

For concatenating two or multiple lists, we can use the + operation that will combine two lists or multiple lists into one big list. Below is an example that can be understood easily.

>>> l1 = [1, 2, 3, 4]
>>> l2 = [4, 5, 6, 7]
>>>
>>> # concatenation
>>> l3 = l1 + l2
>>> l3
[1, 2, 3, 4, 4, 5, 6, 7]
>>>
>>> # or, in place: extend() modifies l1 and returns None
>>> l1.extend(l2)
>>> l1
[1, 2, 3, 4, 4, 5, 6, 7]

Concatenating NumPy arrays

The scenario is completely different for NumPy arrays if we perform the same + operation as we did for lists (above). NumPy performs element-wise addition (using broadcasting where needed), provided the shapes of the arrays are compatible. Below is an example that can be understood easily.

For 1D arrays

>>> import numpy as np
>>>
>>> l1 = np.array([1, 2, 3, 4])
>>> l2 = np.array([4, 5, 6, 7])
>>>
>>> # using `+` operation
>>> l3 = l1 + l2
>>> print(l3)
[ 5  7  9 11]
>>>

For 2D arrays

>>> import numpy as np
>>>
>>> l1 = np.array([[1, 2], [3, 4]])
>>> l2 = np.array([[4, 5], [6, 7]])
>>>
>>> # using `+` operation
>>> l3 = l1 + l2
>>> print(l3)
[[ 5  7]
 [ 9 11]]
>>>

In both cases, the + operation does not concatenate the arrays. Instead, we can use the concatenate() method of NumPy to combine them. The only criterion here is that all the arrays should have the same number of dimensions, with matching sizes along every axis except the one being joined.

With the help of this method, we can either concatenate horizontally or vertically. Below is an example that can be understood easily.

>>> import numpy as np
>>>
>>> l1 = np.array([[1, 2], [3, 4]])
>>> l2 = np.array([[4, 5], [6, 7]])
>>>
>>> # concatenation
>>> # axis = 0  concatenates vertically
>>> l3 = np.concatenate((l1, l2), axis=0)
>>> print(l3)
[[1 2]
 [3 4]
 [4 5]
 [6 7]]
>>>
>>> # axis = 1  concatenates horizontally
>>> l4 = np.concatenate((l1, l2), axis=1)
>>> print(l4)
[[1 2 4 5]
 [3 4 6 7]]
>>>

From the above example, we are clear that concatenation can be easily done and the same thing is performed on the images to combine two or multiple (different) images.

We are basically replicating the methods hconcat() (horizontal concatenation) and vconcat() (vertical concatenation) of the module cv2 in NumPy.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV - only used for reading the image (in this article).

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

Image Re-sizing

While concatenating two different images, there can be a difference in the dimensions of the image matrices. We need to make sure that the dimensions are the same. There is a need to resize the image matrix (refer to this article for understanding image resizing). And, hence the below function.

def resize_image(image_matrix, nh, nw):
    image_size = image_matrix.shape
    oh = image_size[0]
    ow = image_size[1]
    re_image_matrix = np.array([
        np.array([image_matrix[(oh*h // nh)][(ow*w // nw)] for w in range(nw)])
        for h in range(nh)
    ])
    return re_image_matrix

For this, we will need to take two different images.

Image 1

Image 2

Note - The dimensions of the above images are different.

Image Matrices

# grayscale mode is false by default
image1 = read_this(image_file='lena_original.png')
image2 = read_this(image_file='pinktree.jpg')

Code Implementation with Library

The function takes three arguments in total -

image_set - a list of image matrices.
how - the direction of concatenation, vertical (0 or 'vertical') or horizontal (1 or 'horizontal').
with_plot - whether to plot the result; defaults to False.

def concat_lib(image_set, how, with_plot=False):
    # dimension of each matrix in image_set
    shape_vals = [imat.shape for imat in image_set]
    # length of dimension of each matrix in image_set
    shape_lens = [len(ishp) for ishp in shape_vals]
    # if all the images in image_set are read in same mode
    channel_flag = True if len(set(shape_lens)) == 1 else False
    if channel_flag:
        ideal_shape = max(shape_vals)
        images_resized = [
            # function call to resize the image
            resize_image(image_matrix=imat, nh=ideal_shape[0], nw=ideal_shape[1])
            if imat.shape != ideal_shape else imat for imat in image_set
        ]
    else:
        return False
    # cv2 library code to concatenate the image matrices
    # we use methods like
    #     - vconcat()  vertical concat
    #     - hconcat()  horizontal concat
    if (how == 0) or (how == 'vertical'):
        concats = cv2.vconcat(images_resized)
    elif (how == 1) or (how == 'horizontal'):
        concats = cv2.hconcat(images_resized)
    else:
        concats = cv2.hconcat(images_resized)
    if with_plot:
        cmap_val = None if len(concats.shape) == 3 else 'gray'
        plt.figure(figsize=(10, 6))
        plt.axis("off")
        plt.imshow(concats, cmap=cmap_val)
        return True
    return concats

When the image matrices have different dimensions, every image is resized to one common shape. That shape is chosen with max(shape_vals); since Python compares tuples element by element, this picks the shape with the greatest height, with ties broken by width. We then call hconcat() or vconcat() depending on the value of how that is passed.
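How max() behaves on a list of shapes can be checked without any image at all. The shapes below are made up for illustration, and largest_per_axis is a hypothetical alternative that is not used in the function above:

```python
# Hypothetical shapes of three color images: (height, width, channels)
shape_vals = [(512, 512, 3), (400, 900, 3), (512, 300, 3)]

# Tuples are compared element by element, so height decides first,
# then width breaks the tie between the two 512-row images.
ideal_shape = max(shape_vals)
print(ideal_shape)  # (512, 512, 3)

# A per-axis maximum would give a different target shape.
largest_per_axis = tuple(max(axis_sizes) for axis_sizes in zip(*shape_vals))
print(largest_per_axis)  # (512, 900, 3)
```

With the tuple comparison, the widest image (900 columns) is shrunk to 512 columns rather than the others being stretched.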

Let's test the above function -

concat_lib(
    image_set=[image1, image2, image1, image1, image2],
    how='horizontal',
    with_plot=True
)

image2 is automatically resized to the size of image1, and the images are then joined into one giant matrix.

Code Implementation from Scratch

The function takes three arguments in total -

image_set - a list of image matrices.
how - the direction of concatenation, vertical (0 or 'vertical') or horizontal (1 or 'horizontal').
with_plot - whether to plot the result; defaults to False.

def concat_images(image_set, how, with_plot=False):
    # dimension of each matrix in image_set
    shape_vals = [imat.shape for imat in image_set]
    # length of dimension of each matrix in image_set
    shape_lens = [len(ishp) for ishp in shape_vals]
    # if all the images in image_set are read in same mode
    channel_flag = True if len(set(shape_lens)) == 1 else False
    if channel_flag:
        ideal_shape = max(shape_vals)
        images_resized = [
            # function call to resize the image
            resize_image(image_matrix=imat, nh=ideal_shape[0], nw=ideal_shape[1])
            if imat.shape != ideal_shape else imat for imat in image_set
        ]
    else:
        return False
    images_resized = tuple(images_resized)
    if (how == 'vertical') or (how == 0):
        axis_val = 0
    elif (how == 'horizontal') or (how == 1):
        axis_val = 1
    else:
        axis_val = 1
    # numpy code to concatenate the image matrices
    # concatenation is done based on axis value
    concats = np.concatenate(images_resized, axis=axis_val)
    if with_plot:
        cmap_val = None if len(concats.shape) == 3 else 'gray'
        plt.figure(figsize=(10, 6))
        plt.axis("off")
        plt.imshow(concats, cmap=cmap_val)
        return True
    return concats

When the image matrices have different dimensions, every image is resized to one common shape, chosen with max(shape_vals) (the shape with the greatest height, with ties broken by width). We then call concatenate() with the axis that corresponds to the value of how that is passed.

Let's test the above function -

concat_images(
    image_set=[image1, image2, image1, image1, image2],
    how='vertical',
    with_plot=True
)

image2 is automatically resized to the size of image1, and the images are then joined into one giant matrix.

We can also play with the above functions to explore more.

gimg1 = concat_images(image_set=[image1, image2], how='vertical')
gimg2 = concat_images(image_set=[image1, image2], how='horizontal')
concat_images(
    image_set=[gimg1, gimg2],
    how=1,
    with_plot=True
)

Here I take leave. If you have liked it consider visiting this page to read more on Image Processing. And make sure to buy coffee for me from here.

Re-sizing the Image using Interpolation

Mohammed Sameeruddin — Tue, 05 Jan 2021 08:38:30 GMT

Introduction

When we have an image, it is always taken as a 2D matrix. The size of the image is nothing but the dimension of the matrix.

Credits of Cover Image - Photo by Kai Krog Halse on Unsplash

In Python, once we have read the image, its size can be easily found with the .shape attribute. In order to find the shape, we should first read the image and obtain the matrix.

Let's first implement for a 2D matrix or array and then replicate the same for images.

Finding Size of a Matrix

First, we are creating a matrix using the module NumPy. The code for the same can be seen below.

>>> import random
>>> import numpy as np
>>>
>>> mat = np.array([
...     [random.randint(10, 100) for i in range(5)] for j in range(5)
... ])
>>>
>>> print(mat)
[[ 30  91  12  44  52]
 [ 37  72  19 100  77]
 [ 94  77  60  48  64]
 [ 65  26  59  52  40]
 [ 37  58  13  74  36]]
>>>

We will use the .shape attribute to find out the dimension of the matrix. Clearly, the matrix has 5 rows and 5 columns.

>>> dimension = mat.shape
>>>
>>> print(dimension)
(5, 5)
>>>

If we plot the same, we will get the matrix image as below.

>>> import matplotlib.pyplot as plt
>>>
>>> plt.figure(figsize=(12, 6))
<Figure size 1200x600 with 0 Axes>
>>> mat_plot = plt.imshow(mat)
>>> plt.colorbar(mat_plot)
<matplotlib.colorbar.Colorbar object at 0x0AF04418>
>>> plt.show()
>>>

Size Manipulation of the Matrix

We can change the size by using an interpolation technique. In mathematics, interpolation is a type of estimation to construct new data points within the range of a discrete set of known data points. The function below uses the simplest variant, nearest-neighbor interpolation, in which every element of the new matrix is copied from the closest element of the original matrix.

Let's create a function that will resize the given matrix and return a new matrix with the new shape.

def resize_image(image_matrix, nh, nw):
    if len(image_matrix.shape) == 3:
        oh, ow, _ = image_matrix.shape
    else:
        oh, ow = image_matrix.shape
    re_image_matrix = np.array([
        np.array([image_matrix[(oh*h // nh)][(ow*w // nw)] for w in range(nw)])
        for h in range(nh)
    ])
    return re_image_matrix

Credits of the above code - NumPy scaling the Image.

Function Breakdown

We are passing three arguments -

image_matrix - Basically any matrix whose dimension we want to change.
nh - New height (goes row-wise).
nw - New width (goes column-wise).

In the function, we are taking old height (oh) and old width (ow) according to the length of the image matrix's shape. The shape of the image varies for the colored image and grayscale image.

Also, we have 2 for loops for the 2 levels (row-wise and column-wise), and we perform an integer division for each iteration over the range of the new height and the new width. This decides the index of the original matrix from which each element is extracted.
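The index arithmetic can be inspected on its own, without any matrix. A small sketch, where source_indices is a hypothetical helper that only reproduces the expression old_size * i // new_size used inside resize_image():

```python
def source_indices(old_size, new_size):
    # For every position in the new axis, compute which original index it copies.
    return [old_size * i // new_size for i in range(new_size)]

# Growing 5 -> 8: some original indices are repeated.
print(source_indices(5, 8))  # [0, 0, 1, 1, 2, 3, 3, 4]

# Shrinking 5 -> 3: some original indices are skipped.
print(source_indices(5, 3))  # [0, 1, 3]
```

The same list is produced for rows and for columns, which is why whole rows and columns appear duplicated when a matrix is enlarged.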

Let's increase the matrix size where the new matrix should have 8 rows and 8 columns.

>>> re_mat = resize_image(image_matrix=mat, nh=8, nw=8)
>>> print(re_mat)
[[ 30  30  91  91  12  44  44  52]
 [ 30  30  91  91  12  44  44  52]
 [ 37  37  72  72  19 100 100  77]
 [ 37  37  72  72  19 100 100  77]
 [ 94  94  77  77  60  48  48  64]
 [ 65  65  26  26  59  52  52  40]
 [ 65  65  26  26  59  52  52  40]
 [ 37  37  58  58  13  74  74  36]]
>>>

We can clearly see that the new matrix re_mat is bigger than the original matrix mat. The dimension of re_mat is 8x8. If we were to visualize the same, there won't be any particular difference between the original matrix plot and the new matrix plot.

>>> fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
>>>
>>> ax1.title.set_text('Original')
>>> ax2.title.set_text("Re-sized")
>>>
>>> ax1.imshow(mat)
>>> ax2.imshow(re_mat)
>>> plt.show()
>>>

We can notice the difference in the axis scaling. The second image is more scaled than the first image. The resizing is handled with care to retain the pixel values. The same thing is applied to the image for increasing and decreasing the size respectively.
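The same index mapping can also be written without Python loops. This is only a sketch of an alternative, not the function used in this article; resize_nearest is a hypothetical name, and it relies on NumPy's integer-array indexing:

```python
import numpy as np

def resize_nearest(image_matrix, nh, nw):
    # Works for grayscale (h, w) and color (h, w, channels) matrices alike.
    oh, ow = image_matrix.shape[:2]
    rows = np.arange(nh) * oh // nh   # source row for every new row
    cols = np.arange(nw) * ow // nw   # source column for every new column
    # rows[:, None] pairs every source row with every source column.
    return image_matrix[rows[:, None], cols]

mat = np.arange(25).reshape(5, 5)
bigger = resize_nearest(mat, 8, 8)
smaller = resize_nearest(mat, 3, 3)
print(bigger.shape)  # (8, 8)
print(smaller)
# [[ 0  1  3]
#  [ 5  6  8]
#  [15 16 18]]
```

For large images, this avoids building the result one pixel at a time.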

Note - We can use OpenCV's resizing techniques. I wanted to learn how we can do the same from scratch.

Image Processing - Separating RGB pixels

Mohammed Sameeruddin — Fri, 01 Jan 2021 06:59:08 GMT

Introduction

Images are essentially matrices in which data is stored row-wise and column-wise. Since an image is represented as a matrix, we can do all sorts of matrix operations on it.

Credits of Cover Image - Photo by Erwan Hesry on Unsplash

We will use the NumPy module extensively for this project and also for the following projects related to Image Processing.

Matrix Creation

>>> import numpy as np
>>>
>>> mat = np.matrix(
...     [[1, 2, 3, 4, 5],
...      [3, 4, 5, 6, 1]]
... )
>>> mat
matrix([[1, 2, 3, 4, 5],
        [3, 4, 5, 6, 1]])
>>>
>>> mat.shape
(2, 5)
>>>

Transpose Operation

>>> mat.T
matrix([[1, 3],
        [2, 4],
        [3, 5],
        [4, 6],
        [5, 1]])
>>>

Scalar Multiplication

>>> 3 * mat
matrix([[ 3,  6,  9, 12, 15],
        [ 9, 12, 15, 18,  3]])
>>>

Matrix to Image

For converting the matrix into an image, we can use the Matplotlib module. The method imshow() is helpful to do the conversion.

>>> from matplotlib import pyplot as plt
>>>
>>> plt.figure(figsize=(10, 3))
>>> image_mat = plt.imshow(mat, cmap='Reds')
>>> plt.colorbar(image_mat)
>>> plt.show()
>>>

Large matrix to Image

In the above example, we did the conversion for a small matrix. Now we shall try to do the conversion for a large matrix. We will use the Random module to generate random numbers for the matrix.

There should be 30 rows and 50 columns.
Each row of the matrix should have 50 numbers in the range of 1 and 200.

We shall use the method randint(a, b) to create a random large matrix.

>>> import random
>>>
>>> big_mat = [
...     [random.randint(1, 200) for i in range(50)]
...     for j in range(30)
... ]
>>>
>>> big_mat = np.matrix(big_mat)
>>>
>>> big_mat
matrix([[ 16, 184, 130, ...,  35, 126, 104],
        [198, 153, 172, ..., 109,  60, 174],
        [ 90, 138, 108, ..., 192, 132, 103],
        ...,
        [172,  83, 158, ..., 154,   4, 109],
        [ 23,   3, 198, ..., 147,  12,  88],
        [ 62, 106,  96, ..., 191,  83, 193]])
>>>
>>> big_mat.shape
(30, 50)
>>>

Let's convert the above large matrix into an image.

>>> plt.figure(figsize=(10, 5))
>>> image_mat = plt.imshow(big_mat, cmap=None)
>>> plt.colorbar(image_mat)
>>> plt.show()

Image to Matrix

To convert an image into a matrix, we need to use the package called opencv-python. With this package, we can read an image into a matrix and perform the matrix operations using the NumPy module.

Installation

pip install opencv-python --user

A typical colored image is composed of pixels that are represented as RGB pixels. A pixel is simply a number in the range of 0 to 255 for each of R, G, and B.

R - Red - 0 to 255
G - Green - 0 to 255
B - Blue - 0 to 255

We will use the cv2 (OpenCV) module to read the image in the form of a matrix. The method imread() is used to read the image either in grayscale mode or in color mode.

`import` the packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

The image that we will be reading is the original Lenna image. The function below reads the image as a matrix.

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

By default, imread() reads the image in BGR (Blue, Green, and Red) format. We shall convert into RGB and hence the above function. Now that we have read the image, we can separate the pixels into 3 distinct matrices separating R, G, and B.
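The channel order and the separation can be tried on a tiny hand-made array instead of a real file. This sketch does not call cv2; the 1x2 array bgr is an assumed stand-in for what imread() would return, and reversing the last axis is used here as a NumPy-only equivalent of the BGR to RGB conversion:

```python
import numpy as np

# A 1x2 "image" in BGR order: a pure blue pixel followed by a pure red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the channel axis turns (B, G, R) into (R, G, B).
rgb = bgr[:, :, ::-1]

# Each channel is a 2D slice of the 3D matrix.
r_, g_, b_ = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
print(r_.tolist())  # [[0, 255]]
print(g_.tolist())  # [[0, 0]]
print(b_.tolist())  # [[255, 0]]
```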

Let's plot the same as we have done for the above examples.

def separate_rgb(image_file, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_, g_, b_ = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        return [r_, g_, b_]
    return [image_src]

The above function separates the pixels into 3 distinct matrices.

def plot_rgb_seperated(image_file, with_plot=True, gray_scale=False):
    # the original is always read in color so it can be shown next to the separated matrices
    image_src = read_this(image_file=image_file, gray_scale=False)
    separates = separate_rgb(image_file=image_file, gray_scale=gray_scale)
    pixel_matrices = [image_src]
    pixel_matrices.extend(separates)
    cmap_vals = [None, 'Reds', 'Greens', 'Blues'] if not gray_scale else [None, 'gray']
    titles = ['Original', 'Red', 'Green', 'Blue'] if not gray_scale else ['Original', 'Grayscale']
    n_cols = 4 if not gray_scale else 2
    fig_size = (15, 10) if not gray_scale else (10, 20)
    if with_plot:
        fig, axes = plt.subplots(nrows=1, ncols=n_cols, figsize=fig_size)
        for i, ax in zip(range(n_cols), axes):
            ax.axis("off")
            ax.set_title(titles[i])
            ax.imshow(pixel_matrices[i], cmap=cmap_vals[i])
        return True
    return False

The above function plots the matrices along with the original matrix. Let's test the function.

For color image:

plot_rgb_seperated(image_file='lena_original.png')

For grayscale image:

plot_rgb_seperated(image_file='lena_original.png', gray_scale=True)

END

That is it for this article. We have learned how to convert the matrix into an image and vice-versa.

Inner Working of cv2.rectangle() using NumPy

Mohammed Sameeruddin — Thu, 31 Dec 2020 06:44:12 GMT

Introduction

In this article, we will try to understand the inner working of cv2.rectangle() using the NumPy module in Python.

Credits of Cover Image - Photo by Tim Mossholder on Unsplash

cv2.rectangle() is OpenCV's method for drawing a rectangle on an image. We have options to decide the thickness and the color of the rectangle. Since we convert every image to RGB when reading it, we need to make sure that the color is passed as an RGB code (R, G, B). With this blog article, we will try to focus on understanding the inner working of this method and implement the same from scratch using the NumPy module.

A rectangle is simply a shape that we would like to draw on an image in the specified position. We can specify the position by passing two points, pt1 and pt2. The image matrix is considered as a 2D plane for us to decide the position of the rectangle. Let's understand the library method first.

The arguments of the above-mentioned method are:

img - Image on which the rectangle is drawn.
pt1 - Point 1 to decide the position of the rectangle.
pt2 - Point 2 also to decide the position of the rectangle.
color - Color of the rectangle, which is passed as an RGB code.
thickness - The line thickness of the rectangle.

Other arguments and their importance can be learned by checking help(cv2.rectangle). First, we will implement the library method and then we will implement the code from scratch.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV - It is only used for reading the image (in this article).

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

Code Implementation with Library

We will use the cv2.rectangle() method. But for better visualization, we can write the following function taking these arguments:

start_pos - The starting position of the rectangle to draw.
length - Length of the rectangle (absolute value is considered).
width - Width of the rectangle (absolute value is considered).

From the above three arguments, we can obtain the points pt1 and pt2.

thickness - Thickness of the rectangle.
color_name - Instead of passing the RGB code, we can extract the (R, G, B) values from the color_names_data.json file where I have stored all the possible color names with their (R, G, B) values respectively. The sample of the color data can be viewed below.

{
  "air force blue": {
    "r": 93,
    "g": 138,
    "b": 168,
    "hex": "#5d8aa8"
  },
  "alizarin crimson": {
    "r": 227,
    "g": 38,
    "b": 54,
    "hex": "#e32636"
  },
  "almond": {
    "r": 239,
    "g": 222,
    "b": 205,
    "hex": "#efdecd"
  },
  ...
}

The function works for both colored images and grayscale images. But for grayscale images, a default color Black is considered.
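The color lookup can be tried in isolation. A minimal sketch, assuming an inline two-entry sample in place of color_names_data.json; lookup_rgb is a hypothetical helper and not part of this article's functions:

```python
import json

# Inline stand-in for the contents of color_names_data.json (two entries only).
sample_json = '''
{
  "air force blue": {"r": 93, "g": 138, "b": 168, "hex": "#5d8aa8"},
  "almond": {"r": 239, "g": 222, "b": 205, "hex": "#efdecd"}
}
'''
color_db = json.loads(sample_json)

def lookup_rgb(color_name, color_db, default=(0, 0, 0)):
    # Normalize the name the same way the article's functions do.
    key = str(color_name).strip().lower()
    if key not in color_db:
        return default  # unknown names fall back to black
    return tuple(color_db[key][channel] for channel in "rgb")

print(lookup_rgb("  Almond ", color_db))      # (239, 222, 205)
print(lookup_rgb("no such color", color_db))  # (0, 0, 0)
```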

def rectangle_lib(image_file, start_pos, length, width, thickness=3, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    image_main = read_this(image_file=image_file, gray_scale=gray_scale)
    image_shape = image_src.shape
    length = abs(length)
    width = abs(width)
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row
    end_row = length + start_row
    end_row = end_row if end_row <= image_shape[0] else image_shape[0]
    end_column = width + start_column
    end_column = end_column if end_column <= image_shape[1] else image_shape[1]
    if gray_scale:
        color_name = (0, 0, 0)
    else:
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        color_name = str(color_name).strip().lower()
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            color_name = (0, 0, 0)
        else:
            color_name = tuple([color_db[color_name][i] for i in 'rgb'])
    # library method
    image_rect = cv2.rectangle(image_src, (start_column, start_row), (end_column, end_row), color_name, thickness)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Rectangle")
        ax1.imshow(image_main, cmap=cmap_val)
        ax2.imshow(image_rect, cmap=cmap_val)
        return True
    return image_rect

The above function is more convenient than specifying the points on our own. We can simply mention the three important arguments, and the points are determined from them. Let's test the function.

For color image:

rectangle_lib(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    color_name="yellow"
)

For grayscale image:

rectangle_lib(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    gray_scale=True
)

For thickness (-1), there is a special case in which the whole rectangle area is colored based on the color specified.

The above cases are implemented using the library method. It is obvious that the library method works for sure. It is time to do something from scratch on our own.

Code Implementation from Scratch

We will use the same arguments that we have used in the above function rectangle_lib(). But the technique which we follow is different. Here, the argument thickness plays an important role alongside start_pos, length, and width.

The logic behind the implementation:

We have to grab the sub-image based on start_pos, length, and width. Basically, we have to crop the image. This cropped image will be the inner portion of the rectangle (Please read this article to know more info).
We need to pad the image matrix with the color value (obtained from the color name). If the image is grayscaled then we will pad with the black color value i.e, 0. Otherwise, we will pad 3 times after separating the R pixels, G pixels, and B pixels and finally merging them to form a single image. Basically, we are bordering the cropped image (Please read this article to know more info).
Finally, we need to replace this bordered, cropped image in the original image and display the image.
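The three steps above can be sketched on a small grayscale matrix before applying them to a real image. This is only an illustration under assumed values (an 8x8 matrix filled with 9, a border of thickness 1, and color value 0):

```python
import numpy as np

# Hypothetical 8x8 grayscale "image" where every pixel has the value 9.
image = np.full((8, 8), 9, dtype=np.uint8)
start_row, end_row = 3, 6
start_col, end_col = 3, 6
thickness = 1

# Step 1: crop the inner portion of the rectangle.
inner = image[start_row:end_row, start_col:end_col]

# Step 2: border the cropped part with the color value (0 = black).
bordered = np.pad(inner, pad_width=thickness, mode='constant', constant_values=0)

# Step 3: put the bordered crop back, shifted outwards by the thickness.
image[start_row - thickness:end_row + thickness,
      start_col - thickness:end_col + thickness] = bordered
print(image)
```

Printing the matrix shows a one-pixel ring of zeros around the untouched 3x3 block.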

We will use the same color data from the file color_names_data.json for obtaining the color values. Let's code the logic.

def draw_rectangle(image_file, start_pos, length, width, thickness=3, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    image_main = read_this(image_file=image_file, gray_scale=gray_scale)
    image_shape = image_src.shape
    length = abs(length)
    width = abs(width)
    thickness = -1 if thickness < 0 else thickness
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row
    end_row = length + start_row
    end_row = end_row if end_row <= image_shape[0] else image_shape[0]
    end_column = width + start_column
    end_column = end_column if end_column <= image_shape[1] else image_shape[1]
    # outer bounds of the rectangle, including the border
    start_row_grab = start_row - thickness
    end_row_grab = end_row + thickness
    start_column_grab = start_row_grab
    end_column_grab = end_column + thickness
    gso_image = image_src[start_row_grab:end_row_grab, start_column_grab:end_column_grab]
    gsi_image = image_src[start_row:end_row, start_column:end_column]
    if gray_scale:
        color_name = 0
        if thickness != -1:
            inner_image_rect = np.pad(array=gsi_image, pad_width=thickness, mode='constant', constant_values=color_name)
        else:
            max_height, max_width = gso_image.shape
            inner_image_rect = np.zeros(shape=(max_height, max_width))
    else:
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        color_name = str(color_name).strip().lower()
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            r_cons, g_cons, b_cons = (0, 0, 0)
        else:
            r_cons, g_cons, b_cons = [color_db[color_name][i] for i in 'rgb']
        r_inner_image, g_inner_image, b_inner_image = gsi_image[:, :, 0], gsi_image[:, :, 1], gsi_image[:, :, 2]
        if thickness != -1:
            r_inner_rect = np.pad(array=r_inner_image, pad_width=thickness, mode='constant', constant_values=r_cons)
            g_inner_rect = np.pad(array=g_inner_image, pad_width=thickness, mode='constant', constant_values=g_cons)
            b_inner_rect = np.pad(array=b_inner_image, pad_width=thickness, mode='constant', constant_values=b_cons)
            inner_image_rect = np.dstack(tup=(r_inner_rect, g_inner_rect, b_inner_rect))
        else:
            max_height, max_width, _ = gso_image.shape
            r_out_rect = np.full(shape=(max_height, max_width), fill_value=r_cons)
            g_out_rect = np.full(shape=(max_height, max_width), fill_value=g_cons)
            b_out_rect = np.full(shape=(max_height, max_width), fill_value=b_cons)
            inner_image_rect = np.dstack(tup=(r_out_rect, g_out_rect, b_out_rect))
    # replace the region in the original image with the bordered (or filled) one
    image_src[start_row_grab:end_row_grab, start_column_grab:end_column_grab] = inner_image_rect
    image_rect = image_src
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Rectangle")
        ax1.imshow(image_main, cmap=cmap_val)
        ax2.imshow(image_rect, cmap=cmap_val)
        plt.show()
        return True
    return image_rect

Here, we are not dealing with the points to decide the position of the rectangle. We are using the techniques of cropping, bordering, and replacing to achieve the main motive. Let's test the above function.

For color image:

draw_rectangle(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    color_name='red'
)

For grayscale image:

draw_rectangle(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    gray_scale=True
)

For thickness (-1), there is a special case in which the whole rectangle area is colored based on the color specified.

This Is It!

We finally got what we wanted. We tried to implement the code from scratch with the help of NumPy matrix operations. We used NumPy methods like:

pad() - Pads the matrix with constant values.
zeros() - We used this for the special case when the thickness is -1. This is only used when the image is read in grayscale mode.
full() - We used this to obtain a matrix of identical values. This is also used when the thickness is -1, in the case where the image is read in colored mode.

Personally, I learned so much from implementing the method. I hope you find this insightful. You should definitely check out my other articles on the same subject in my profile.

If you liked it, you can buy coffee for me from here.

Crop the Image Intuitively — NumPy

Mohammed Sameeruddin — Sun, 27 Dec 2020 04:28:33 GMT

Introduction

In this blog article, we will learn how to crop an image in Python using NumPy as an ideal library. When we talk about images, they are just matrices in 2D space. Of course, it depends on the image: for an RGB image the shape of the matrix is (height, width, 3), whereas for a grayscale image it is just (height, width). But ultimately, images are just large matrices where each value is a pixel positioned row-wise and column-wise accordingly.

Credits of Cover Image - Photo by Ulrike Langner on Unsplash

Cropping the image is just obtaining the sub-matrix of the image matrix. The size of the sub-matrix (cropped image) can be of our choice, and mainly it is the height and width. One more thing is needed for the image to be cropped, i.e., the starting position. The sub-matrix is taken from that position, and depending upon the height and width we can easily crop the image.

The three important things are:

starting_position
length (height)
width

Based on these three things, we can construct our complete cropping function.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV - It is only used for reading the image.

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

Cropping the Image

We need to pass the above-mentioned 3 things as arguments to our function. But before doing that, let's try to crop (slice) a matrix with NumPy.

>>> import numpy as np
>>> m = np.array([
...     [1, 2, 3, 4, 5, 6, 7],
...     [5, 3, 4, 2, 1, 7, 6],
...     [6, 4, 3, 5, 1, 2, 7],
...     [5, 6, 3, 1, 4, 2, 7],
...     [1, 2, 3, 4, 5, 6, 7]
... ])
>>>
>>> print(m)
[[1 2 3 4 5 6 7]
 [5 3 4 2 1 7 6]
 [6 4 3 5 1 2 7]
 [5 6 3 1 4 2 7]
 [1 2 3 4 5 6 7]]
>>>
>>> crop_m = m[1:4, 2:7]
>>> print(crop_m)
[[4 2 1 7 6]
 [3 5 1 2 7]
 [3 1 4 2 7]]
>>>

The above code is an example of how we can crop an image matrix. Notice crop_m is the cropped matrix (sub-matrix) that is sliced from the original matrix m. The sub-matrix crop_m takes its values from [1:4, 2:7], i.e., rows 1 to 3 and columns 2 to 6 (indices start at 0 and the end of each slice is excluded). We should do something similar for the image to obtain the cropped image. Let's write the cropping image function.

def crop_this(image_file, start_pos, length, width, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    image_shape = image_src.shape
    length = abs(length)
    width = abs(width)
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row
    end_row = length + start_row
    end_row = end_row if end_row <= image_shape[0] else image_shape[0]
    end_column = width + start_column
    end_column = end_column if end_column <= image_shape[1] else image_shape[1]
    print("start row \t- ", start_row)
    print("end row \t- ", end_row)
    print("start column \t- ", start_column)
    print("end column \t- ", end_column)
    image_cropped = image_src[start_row:end_row, start_column:end_column]
    cmap_val = None if not gray_scale else 'gray'
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Cropped")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_cropped, cmap=cmap_val)
        return True
    return image_cropped

Let's understand what this function actually does.

At the first step, we read the image either in grayscale or RGB and obtain the image matrix.
We obtain the height and width of the image which is further used in the validation of the code.
We make sure that the length and width are positive integers. Hence absolute values are considered.
We calculate the four important values which are useful for slicing the matrix start_row, end_row, start_column, end_column. We obtain that using the three arguments that are passed start_pos, length, width.
We obtain the cropped image by slicing the matrix.
We plot both the original and cropped images for the visualization.
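The bounds calculation from the steps above can be checked without reading any image. In this sketch, crop_bounds is a hypothetical helper that mirrors the arithmetic of the cropping function, and the 512x512 image shape is an assumption made for illustration:

```python
def crop_bounds(image_shape, start_pos, length, width):
    # Negative sizes are treated as their absolute values.
    length, width = abs(length), abs(width)
    # A negative start is clamped to 0; the same start is used for rows and columns.
    start_row = max(start_pos, 0)
    start_column = start_row
    # The end indices are clamped to the image height and width.
    end_row = min(start_row + length, image_shape[0])
    end_column = min(start_column + width, image_shape[1])
    return start_row, end_row, start_column, end_column

print(crop_bounds((512, 512, 3), 199, 100, 200))  # (199, 299, 199, 399)
print(crop_bounds((512, 512, 3), 450, 100, 200))  # (450, 512, 450, 512)
```

The second call shows the clamping: a crop that would run past the edge is cut off at the image boundary.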

Let's test the crop_this() function.

For RGB Image

crop_this(
    image_file='lena_original.png',
    start_pos=199,
    length=100,
    width=200,
    with_plot=True
)

start row -  199
end row -  299
start column -  199
end column -  399

For Grayscale Image

crop_this(
    image_file='lena_original.png',
    start_pos=199,
    length=100,
    width=200,
    with_plot=True,
    gray_scale=True
)

start row -  199
end row -  299
start column -  199
end column -  399

This is it!!! We are finally able to crop the image by just knowing the starting position and the length and width of the cropped image. Isn't it great? We can also add a lot of customization options like adding a border around the image and other things. To know how to add a border to the image, you can refer to my article.

Other similar articles can be found in my profile. Have a great time reading and implementing the same.

If you liked it, you can buy coffee for me from here.

Inverting an Image using NumPy’s Broadcasting method

Mohammed Sameeruddin — Thu, 17 Dec 2020 06:32:36 GMT

Introduction

In this article, we will learn how to invert an image using NumPy. To get the gist of this, let's take two values, 0 and 1. Here 0 represents Black and 1 represents White. When we apply inversion to these values, we get:

0 inverted becomes 1
1 inverted becomes 0

The above only works when we have two values, 0 for low and 1 for high. If we relate this to a binary image, whose pixel values are just 1s and 0s, inversion swaps the two: White becomes Black and Black becomes White.

Credits of Cover Image - Photo by Jr Korpa on Unsplash

Broadcasting

With plain Python lists, if we want to add a number to every value, we have to iterate through each element and add the number. In NumPy, we need not iterate through each element. Instead, we can treat the array as a single element and add the number; NumPy automatically adds that number to all the elements of the array. This technique is called broadcasting.

The broadcasting technique is applicable to both matrices and arrays. It is very fast when compared to normal loops.

>>> import numpy as np
>>> M = np.array([1, 2, 3, 4, 5, 6])
>>> M = 3 + M
>>> M
array([4, 5, 6, 7, 8, 9])
>>>
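As a minimal sketch of the difference described above, the snippet below adds the same number with an explicit loop over a list and with NumPy broadcasting, and then repeats the broadcast on a small 2D matrix. The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6]

# Plain Python list: every element is visited explicitly.
added_with_loop = [3 + v for v in values]

# NumPy array: the scalar 3 is broadcast across all elements.
added_with_numpy = 3 + np.array(values)

# The same scalar broadcast works on every row and column of a matrix.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
matrix_plus_ten = matrix + 10

print(added_with_loop)
print(added_with_numpy)
print(matrix_plus_ten)
```

Both approaches give the same numbers; the NumPy version simply avoids the Python-level loop.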

Let's see the demonstration for a random matrix.

White Black

A simple demonstration of the White Black matrix and the image can be seen below.

1 is visualized as White
0 is visualized as Black

>>> import numpy as np
>>> image_b = np.array([
...     [1,0,1],
...     [1,1,0],
...     [0,1,1]])
>>> image_b
array([[1, 0, 1],
       [1, 1, 0],
       [0, 1, 1]])
>>>

If we visualize the above matrix, we can see something like the below.

Black White

A simple demonstration of the Black White matrix and the image can be seen below.

1 is changed to 0 (Black)
0 is changed to 1 (White)

>>> # Broadcasting
>>> image_i = 1 - image_b
>>> # image_i = ~ image_b
>>> image_i
array([[0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])
>>>

If we visualize the above matrix, we can see something like the below.

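We can also invert a matrix with the ~ operator, but the result depends on the dtype. On a boolean matrix, ~ gives the same result as the subtraction above. On an 8-bit image matrix (dtype uint8), ~ is equivalent to 255 - value. On a plain integer matrix of 0s and 1s like image_b, however, ~ flips every bit and yields -1 and -2 instead of 1 and 0.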
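One caveat is worth checking before relying on ~ for a 0/1 matrix. The sketch below, which assumes the default signed integer dtype that np.array picks for small Python ints, compares the subtraction with ~ on the integer matrix and on a boolean copy of it.

```python
import numpy as np

image_b = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 1]])

# Subtraction inverts a 0/1 matrix as expected.
inverted_sub = 1 - image_b

# On a signed integer matrix, ~ flips every bit: 0 -> -1 and 1 -> -2.
inverted_not_int = ~image_b

# On a boolean matrix, ~ is a logical NOT and matches the subtraction.
inverted_not_bool = (~image_b.astype(bool)).astype(int)

print(inverted_sub)
print(inverted_not_int)
print(inverted_not_bool)
```

So ~ is a safe shortcut for boolean matrices (and for uint8 images with values 0 to 255), while 1 - matrix is the dependable choice for an integer matrix of 0s and 1s.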

If you want to know specifically how to convert an image into binary, you can refer to my article where I explain the procedure for both colored and grayscale images.

The above implementation worked since the images are already binarized. What if we wanted to apply this to a colored, non-binarized image? The scenario will be different. Let's find out what we can do for those.

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

Code Implementation with Library

For inverting an image using the cv2 library, we can use the method bitwise_not() which is available in the library. We can just pass the image matrix as an argument in the method.

def invert_lib(image_file, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    cmap_val = None if not gray_scale else 'gray'
    image_i = cv2.bitwise_not(image_src)
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Inverted")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_i, cmap=cmap_val)
        return True
    return image_i

The above function returns the inverted image of the original image that is passed. The same can be seen when plotted.

Let's test the above function.

invert_lib(image_file='lena_original.png', with_plot=True)

invert_lib(image_file='lena_original.png', with_plot=True, gray_scale=True)

We have obtained the inverted images for both colored and grayscale images.

Code Implementation from Scratch

In order to invert the image, we apply the broadcasting technique using NumPy and subtract each pixel value from 255, because 255 is the highest pixel (color) value in an 8-bit image. Alternatively, we can apply the ~ (bitwise NOT) operation to the image, which gives the same result for uint8 matrices.
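The equivalence of the two approaches can be checked on a small synthetic uint8 matrix. This is only a sketch with made-up pixel values, not one of the images used in this article.

```python
import numpy as np

# A tiny stand-in for an 8-bit image channel.
pixels = np.array([[0, 64, 128],
                   [200, 254, 255]], dtype=np.uint8)

# Broadcasting: 255 is subtracted element-wise from every pixel.
inverted_sub = 255 - pixels

# Bitwise NOT flips all 8 bits, which for uint8 equals 255 - value.
inverted_not = ~pixels

print(inverted_sub)
print(np.array_equal(inverted_sub, inverted_not))
```

The dtype matters here: the two results agree because the matrix is uint8, which is what cv2.imread() normally returns for a regular image.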

def invert_this(image_file, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    # image_i = ~ image_src
    image_i = 255 - image_src
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Inverted")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_i, cmap=cmap_val)
        return True
    return image_i

Let's test the above function.

invert_this(image_file='lena_original.png', with_plot=True)

invert_this(image_file='lena_original.png', with_plot=True, gray_scale=True)

This is it! We have successfully written the code for converting the image into an inverted image. The results are similar to that of the library code.

Let's see what this results in when applied to the binarized image. I have put the result without the code.

The inverted image is exactly opposite to the binarized image. Hence this concludes the agenda of this article.

You should definitely check out my other articles on the same subject in my profile.


Adding a Border to the Image using NumPy

Mohammed Sameeruddin — Sat, 12 Dec 2020 07:47:35 GMT

Introduction

In this article, we will learn how to add/draw a border to the image. We use OpenCV to read the image. The rest is coded from scratch with NumPy, which makes the matrix operations easy.

There are two aspects in which we can start to think:

If the image is read in grayscale, we simply keep the default color as black. This is because the image matrix is two-dimensional (its shape has length 2), so each pixel is a single value and a three-valued color cannot be mapped onto it.
If the image is read in RGB, we can pick the color for the border. This is because the image matrix is three-dimensional (its shape has length 3), so each pixel holds three values and a three-valued color maps onto it directly.

Credits of Cover Image - Photo by Kanan Khasmammadov on Unsplash

Before proceeding further, we need to make sure we have enough colors to choose from (based on the user's choice). I have extracted the possible color values from this website.

The code of the same can be seen below. The result is stored in a JSON file named color_names_data.json.

import requests
import json
from bs4 import BeautifulSoup

def extract_table(url):
    res = requests.get(url=url)
    con = res.text
    soup = BeautifulSoup(con, features='lxml')
    con_table = soup.find('table', attrs={'class' : 'color-list'})
    headings = [th.get_text().lower() for th in con_table.find("tr").find_all("th")]
    table_rows = [headings]
    for row in con_table.find_all("tr")[1:]:
        each_row = [td.get_text().lower() for td in row.find_all("td")]
        table_rows.append(each_row)
    return table_rows

col_url = "https://www.colorhexa.com/color-names"
color_rows_ = extract_table(url=col_url)
color_dict = {}
for co in color_rows_[1:]:
    color_dict[co[0]] = {
        'r' : int(co[2]),
        'g' : int(co[3]),
        'b' : int(co[4]),
        'hex' : co[1]
    }
with open(file='color_names_data.json', mode='w') as col_json:
    json.dump(obj=color_dict, fp=col_json, indent=2)

We need the R, G, and B values so that we can map them onto the image after separating its pixels into channels. We follow a split-and-merge approach using NumPy.

The structure of the color data can be seen below.

{
  "air force blue": {
    "r": 93,
    "g": 138,
    "b": 168,
    "hex": "#5d8aa8"
  },
  "alizarin crimson": {
    "r": 227,
    "g": 38,
    "b": 54,
    "hex": "#e32636"
  },
  "almond": {
    "r": 239,
    "g": 222,
    "b": 205,
    "hex": "#efdecd"
  },
  ...
}
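A color name can be resolved against this structure with a plain dictionary lookup. The sketch below uses a small inline dictionary as a stand-in for the contents of color_names_data.json, and lookup_rgb is a hypothetical helper name introduced only for illustration; unknown names fall back to black.

```python
# Inline stand-in for the data stored in color_names_data.json.
color_db = {
    "air force blue": {"r": 93, "g": 138, "b": 168, "hex": "#5d8aa8"},
    "almond": {"r": 239, "g": 222, "b": 205, "hex": "#efdecd"},
}

def lookup_rgb(color_name, color_db, default=(0, 0, 0)):
    """Return [r, g, b] for a color name, or the default when it is unknown."""
    # Normalize the name the same way the keys were stored (lowercase, trimmed).
    key = str(color_name).strip().lower()
    entry = color_db.get(key)
    if entry is None:
        return list(default)
    return [entry[channel] for channel in "rgb"]

print(lookup_rgb("  Almond ", color_db))
print(lookup_rgb("no such color", color_db))
```

Normalizing the name before the lookup means inputs like "Almond" or " almond " resolve to the same entry.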

Time to Code

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV

Import the Packages

import numpy as np
import cv2
import json
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

Code Implementation with Library

For adding/drawing borders around the image, the important arguments can be named as below:

image_file - Image file location or image name if the image is stored in the same directory.
bt - Border thickness.
color_name - By default it takes 0 (black color). Otherwise, any color name can be taken.

We use the method copyMakeBorder() available in the library OpenCV that is used to create a new bordered image with a specified thickness. In the code, we make sure to convert the color name into values from the color data we collected earlier.

The below function works for both RGB image and grayscale image as expected.

def add_border(image_file, bt=5, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if gray_scale:
        color_name = 0
        value = [color_name for i in range(3)]
    else:
        color_name = str(color_name).strip().lower()
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            value = [0, 0, 0]
        else:
            value = [color_db[color_name][i] for i in 'rgb']
    image_b = cv2.copyMakeBorder(image_src, bt, bt, bt, bt, cv2.BORDER_CONSTANT, value=value)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Bordered")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b

Let's test the above function.

add_border(image_file='lena_original.png', with_plot=True, color_name='green')

add_border(image_file='lena_original.png', with_plot=True, gray_scale=True, color_name='pink')

We can clearly see that the borders have been added/drawn. For the grayscale image, although we asked for pink, a black border is drawn.

Code Implementation from Scratch

A border is basically a constant pixel value of one color around the entire image. Its thickness matters, since a border that is too thin is hard to see. For a given thickness, we append that many rows and columns of the constant value on every side of the image.

In order to do so, we can use the pad() method available in NumPy. This method appends a constant value around the array, as many layers deep as the pad_width argument specifies.

Example

>>> import numpy as np
>>> m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> m = np.array(m)
>>> m
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>>
>>> pm = np.pad(array=m, pad_width=1, mode='constant', constant_values=12)
>>> pm
array([[12, 12, 12, 12, 12],
       [12,  1,  2,  3, 12],
       [12,  4,  5,  6, 12],
       [12,  7,  8,  9, 12],
       [12, 12, 12, 12, 12]])
>>>
>>> pmm = np.pad(array=m, pad_width=2, mode='constant', constant_values=24)
>>> pmm
array([[24, 24, 24, 24, 24, 24, 24],
       [24, 24, 24, 24, 24, 24, 24],
       [24, 24,  1,  2,  3, 24, 24],
       [24, 24,  4,  5,  6, 24, 24],
       [24, 24,  7,  8,  9, 24, 24],
       [24, 24, 24, 24, 24, 24, 24],
       [24, 24, 24, 24, 24, 24, 24]])
>>>

More examples can be found in the documentation.

We just need to change the constant_values argument by taking the actual color value.

For a grayscale image, we simply pad the image matrix with black color i.e., 0.
For the RGB image, we grab the color value from the data collected, split the image into 3 matrices, and pad each matrix with its corresponding color value.

The below function would explain the flow more clearly.

def draw_border(image_file, bt=5, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if gray_scale:
        color_name = 0
        image_b = np.pad(array=image_src, pad_width=bt, mode='constant', constant_values=color_name)
        cmap_val = 'gray'
    else:
        color_name = str(color_name).strip().lower()
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            r_cons, g_cons, b_cons = [0, 0, 0]
        else:
            r_cons, g_cons, b_cons = [color_db[color_name][i] for i in 'rgb']
        r_, g_, b_ = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        rb = np.pad(array=r_, pad_width=bt, mode='constant', constant_values=r_cons)
        gb = np.pad(array=g_, pad_width=bt, mode='constant', constant_values=g_cons)
        bb = np.pad(array=b_, pad_width=bt, mode='constant', constant_values=b_cons)
        image_b = np.dstack(tup=(rb, gb, bb))
        cmap_val = None
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Bordered")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b

Let's test the above function.
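The split, pad, and stack flow can also be checked without an image file. The sketch below runs it on a tiny synthetic RGB matrix; pad_rgb is a hypothetical helper written for this illustration and the pixel values are made up.

```python
import numpy as np

def pad_rgb(image, thickness, color):
    """Pad an (H, W, 3) image with a constant RGB color (illustrative helper)."""
    channels = [
        # Each channel is a 2D matrix padded with its own constant value.
        np.pad(image[:, :, c], pad_width=thickness,
               mode='constant', constant_values=color[c])
        for c in range(3)
    ]
    # Stack the three padded matrices back along the depth axis.
    return np.dstack(channels)

# A 2 x 3 image where every pixel is (100, 100, 100).
tiny = np.full((2, 3, 3), 100, dtype=np.uint8)
bordered = pad_rgb(tiny, thickness=1, color=(0, 255, 255))

print(bordered.shape)
print(bordered[0, 0], bordered[1, 1])
```

A thickness of 1 grows each side by one pixel, so the 2 x 3 image becomes 4 x 5, with the border pixels holding the chosen color and the interior left unchanged.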

draw_border(image_file='lena_original.png', with_plot=True, color_name='cyan')

draw_border(image_file='lena_original.png', bt=10, with_plot=True, gray_scale=True)

Yay! We did it. We coded the entire thing, including the color choice, completely from scratch, except for the part where we read the image file. We relied mostly on NumPy as it is very fast at computing matrix operations (we could have done it with for loops if we wanted our code to execute very slowly).

Personally, this was a great learning experience for me. I am starting to appreciate how difficult and fun it must be for the people who actually work on open source libraries.

You should definitely check out my other articles on the same subject in my profile.


Binarization of Image using NumPy

Mohammed Sameeruddin — Fri, 04 Dec 2020 03:43:53 GMT

Introduction

In this article, we will learn how to binarize an image using NumPy, and of course, we will use OpenCV for reading the image both in grayscale and RGB.

To understand what binary means: binary is something that is made of two things. In computer terminology, binary is just 0 and 1. Relating this to images, a binary image is a black and white image where

0 signifies Black
1 signifies White.

In the initial stages of learning image processing, we often think of a grayscale image as a binary image, although it is not. As we pick up the subject, we realize how wrong we were. So, moving ahead, we will learn how to binarize the image both with the library and without it (NumPy is used for the matrix operations to avoid the slowness of regular for loops). Besides this, we will also use Matplotlib to plot the results.

Credits of Cover Image - Photo by Angel Santos on Unsplash

RGB and Grayscale Overview

Binarization works really well for grayscale images. The problem with color (RGB) images is that each pixel is a vector of 3 values: one for Red, one for Green, and one for Blue.

A typical grayscale image matrix looks like -

array([[162, 162, 162, ..., 170, 155, 128],
       [162, 162, 162, ..., 170, 155, 128],
       [162, 162, 162, ..., 170, 155, 128],
       ...,
       [ 43,  43,  50, ..., 104, 100,  98],
       [ 44,  44,  55, ..., 104, 105, 108],
       [ 44,  44,  55, ..., 104, 105, 108]], dtype=uint8)

A typical RGB image matrix looks like -

array([[[226, 137, 125], ..., [200,  99,  90]],
       [[226, 137, 125], ..., [200,  99,  90]],
       [[226, 137, 125], ..., [200,  99,  90]],
       ...,
       [[ 84,  18,  60], ..., [177,  62,  79]],
       [[ 82,  22,  57], ..., [185,  74,  81]],
       [[ 82,  22,  57], ..., [185,  74,  81]]], dtype=uint8)

If we separate the R, G, and B pixels from the above matrix, we get -

R matrix

array([[226, 226, 223, ..., 230, 221, 200],
       [226, 226, 223, ..., 230, 221, 200],
       [226, 226, 223, ..., 230, 221, 200],
       ...,
       [ 84,  84,  92, ..., 173, 172, 177],
       [ 82,  82,  96, ..., 179, 181, 185],
       [ 82,  82,  96, ..., 179, 181, 185]], dtype=uint8)

G matrix

array([[137, 137, 137, ..., 148, 130,  99],
       [137, 137, 137, ..., 148, 130,  99],
       [137, 137, 137, ..., 148, 130,  99],
       ...,
       [ 18,  18,  27, ...,  73,  68,  62],
       [ 22,  22,  32, ...,  70,  71,  74],
       [ 22,  22,  32, ...,  70,  71,  74]], dtype=uint8)

B matrix

array([[125, 125, 133, ..., 122, 110,  90],
       [125, 125, 133, ..., 122, 110,  90],
       [125, 125, 133, ..., 122, 110,  90],
       ...,
       [ 60,  60,  58, ...,  84,  76,  79],
       [ 57,  57,  62, ...,  79,  81,  81],
       [ 57,  57,  62, ...,  79,  81,  81]], dtype=uint8)

Whatever operation we compute on the grayscale image, we need to compute on the RGB image three times, once each for the separated R, G, and B matrices, and finally merge the results into a proper RGB image.
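The split, process, and merge pattern can be sketched on a tiny synthetic RGB matrix. The halve function below is a made-up placeholder for whatever per-channel operation is needed; it is not part of the binarization itself.

```python
import numpy as np

# A 2 x 2 RGB image with made-up pixel values.
rgb = np.array([[[226, 137, 125], [200, 99, 90]],
                [[84, 18, 60], [177, 62, 79]]], dtype=np.uint8)

# Split: slice out one 2D matrix per channel.
r, g, b = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]

def halve(channel):
    """Placeholder single-channel operation (integer halving)."""
    return channel // 2

# Process each channel separately, then merge along the depth axis.
merged = np.dstack((halve(r), halve(g), halve(b)))

print(r)
print(merged.shape)
```

The merged result has the same shape as the original image, which is what lets it be plotted as a regular RGB image again.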

Time to Code

The packages that we mainly use are -

NumPy
Matplotlib
OpenCV

Import the Packages

import cv2
import numpy as np
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in grayscale or RGB and returns the image matrix.

Code Implementation with Library

For converting the image into a binary image, we can simply make use of the threshold() method available in the cv2 library. This method thresholds the image irrespective of whether it is grayscale or RGB; for an RGB image, each channel is thresholded independently. It takes 4 arguments.

src - It is basically the image matrix.
thresh - It is the threshold value based on which pixels are given a new value. With the THRESH_BINARY type used here, pixels greater than this value are revalued to maxval (255). Otherwise, the pixels are revalued to 0.
maxval - It is the value assigned to the pixels that exceed the threshold. We use 255, the maximum value an 8-bit image can contain.
type - It is the thresholding type, which determines how the operation is computed. There are several types available.

After this, we will plot the results to see the variation and hence the below function.

def binarize_lib(image_file, thresh_val=127, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    th, image_b = cv2.threshold(src=image_src, thresh=thresh_val, maxval=255, type=cv2.THRESH_BINARY)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Binarized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b

Let's test the above function -

binarize_lib(image_file='lena_original.png', with_plot=True)

binarize_lib(image_file='lena_original.png', with_plot=True, gray_scale=True)

Now that we have seen the results for both the original and binary images, it is obvious that the library code works for both. It's time to get our hands dirty and code the same from scratch.

Code Implementation from Scratch

First, we will write a function that revalues the pixels greater than the specified threshold to 255 and all remaining pixels to 0.

The function looks like below -

def convert_binary(image_matrix, thresh_val):
    white = 255
    black = 0
    initial_conv = np.where((image_matrix <= thresh_val), image_matrix, white)
    final_conv = np.where((initial_conv > thresh_val), initial_conv, black)
    return final_conv
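As a side note, the two np.where() calls can be collapsed into a single comparison. The sketch below is an alternative formulation under that assumption; convert_binary_single is a hypothetical name, and the sample matrix is made up.

```python
import numpy as np

def convert_binary_single(image_matrix, thresh_val):
    """Pixels above the threshold become 255, all others become 0."""
    return np.where(image_matrix > thresh_val, 255, 0).astype(np.uint8)

sample = np.array([[0, 127, 128],
                   [200, 255, 50]], dtype=np.uint8)
binary = convert_binary_single(sample, thresh_val=127)

print(binary)
```

A pixel exactly equal to the threshold (127 here) maps to 0, since only values strictly greater than the threshold are turned white.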

We will call the above function three times by separating R, G, and B values and finally merge the same to obtain the binarized image. Once doing it, we can plot the results just like how we did it before.

def binarize_this(image_file, thresh_val=127, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        cmap_val = None
        r_img, g_img, b_img = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        r_b = convert_binary(image_matrix=r_img, thresh_val=thresh_val)
        g_b = convert_binary(image_matrix=g_img, thresh_val=thresh_val)
        b_b = convert_binary(image_matrix=b_img, thresh_val=thresh_val)
        image_b = np.dstack(tup=(r_b, g_b, b_b))
    else:
        cmap_val = 'gray'
        image_b = convert_binary(image_matrix=image_src, thresh_val=thresh_val)
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Binarized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b

We have made our binarizing code by just using NumPy. Let's test the same -

binarize_this(image_file='lena_original.png', with_plot=True)

binarize_this(image_file='lena_original.png', with_plot=True, gray_scale=True)

This is it. Whatever we wanted to accomplish, we have accomplished it. The results are quite similar to the one we got by using the library code.

Hence this concludes the aim of this article.

Image Equalization (Contrast Enhancing) in Python

Mohammed Sameeruddin — Thu, 26 Nov 2020 14:44:11 GMT

Introduction

I have been practicing image processing for quite a while - manipulating the images (image matrices to be precise). In doing so, I got to explore equalizing methods for images, which enhance the contrast so that the manipulated image looks better than the original image. This technique is termed Histogram Equalization.

Credits of Cover Image - Photo by Andreas Gücklhorn on Unsplash

Often, a captured image does not match the natural view. To get closer to the natural view, post-processing is needed. Histogram Equalization (Normalization) is one such technique, which enhances the contrast by tweaking the pixel values of the image.

An example can be seen below - original image and equalized image.

If we were to plot the image histograms, it would look something like below -

Credits - The above images have been taken from the Internet for showing the examples.

Importance of Histogram Equalization

This method works well for both bright and dark images. It is especially important in medical science for analyzing X-ray images.
It is also very useful in viewing scientific images like thermal images and satellite images.

Implementation

In this article, I will implement this method both by using the OpenCV library and from scratch with just NumPy and Matplotlib. Although I would like to do it without NumPy, it would take much longer to compute.

Note - For coding from scratch, I will use OpenCV to read the image and nothing else.

I have taken the Lena image for testing the functions. I have saved the same in my working directory.

Import the Requirements

The packages that we mainly use are:

NumPy
Matplotlib
OpenCV It is only used for reading the image (in this article).

import cv2
import numpy as np
from matplotlib import pyplot as plt

Read the Image

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src

The above function reads the image either in gray_scale or RGB and returns the image matrix.

Code Implementation with Library

For equalizing, we can simply use the equalizeHist() method available in the library cv2. We have two aspects here -

When the image is read in RGB.
- Separate the pixels based on the color combination. We can use the split() method available in the library cv2.
- Apply the equalization method for each matrix.
- Merge the equalized image matrices altogether with the method merge() available in the library cv2.
When the image is read in gray_scale.
- Just apply the equalization method for the image matrix.
Plot the original image and equalized image.

def equalize_this(image_file, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_image, g_image, b_image = cv2.split(image_src)
        r_image_eq = cv2.equalizeHist(r_image)
        g_image_eq = cv2.equalizeHist(g_image)
        b_image_eq = cv2.equalizeHist(b_image)
        image_eq = cv2.merge((r_image_eq, g_image_eq, b_image_eq))
        cmap_val = None
    else:
        image_eq = cv2.equalizeHist(image_src)
        cmap_val = 'gray'
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Equalized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_eq, cmap=cmap_val)
        return True
    return image_eq

Let's test the above function -

equalize_this(image_file='lena_original.png', with_plot=True)

equalize_this(image_file='lena_original.png', with_plot=True, gray_scale=True)

The above plots are clear and we can say that the equalized images look better than the original images. This was implemented using the cv2 library.

Code Implementation from Scratch

For this, I am using NumPy for all the matrix operations. Again we can do it with for loops, but it will take more time to compute. Even here we have two aspects as before -

When the image is read in RGB.
- Separate the pixels based on the color combination. We can slice it down using NumPy operations.
- Apply the equalization method for each matrix.
- Merge the equalized image matrices altogether with the method dstack(tup=()) available in the library NumPy.
When the image is read in gray_scale.
- Just apply the equalization method for the image matrix.
Plot the original image and equalized image.

Let's write our own function to compute the image equalization. Image pixel values are normally in the range of 0 to 255, so in total there are 256 possible pixel values (hence bins=256).

def enhance_contrast(image_matrix, bins=256):
    image_flattened = image_matrix.flatten()
    image_hist = np.zeros(bins)
    # frequency count of each pixel value (iterate over the flattened pixels,
    # not over the rows of the 2D matrix)
    for pix in image_flattened:
        image_hist[pix] += 1
    # cumulative sum
    cum_sum = np.cumsum(image_hist)
    norm = (cum_sum - cum_sum.min()) * 255
    # normalization of the pixel values
    n_ = cum_sum.max() - cum_sum.min()
    uniform_norm = norm / n_
    uniform_norm = uniform_norm.astype('int')
    # flat histogram
    image_eq = uniform_norm[image_flattened]
    # reshaping the flattened matrix to its original shape
    image_eq = image_eq.reshape(image_matrix.shape)
    return image_eq
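To see the arithmetic on numbers small enough to follow by hand, the sketch below applies the same cumulative-sum normalization to a made-up 2 x 3 low-contrast matrix. It swaps the counting loop for np.bincount(), which is an alternative way of building the histogram, and equalize_with_bincount is a hypothetical name used only here.

```python
import numpy as np

def equalize_with_bincount(image_matrix, bins=256):
    """Histogram equalization of one channel via a cumulative-sum lookup table."""
    flat = image_matrix.flatten()
    # Frequency count of each pixel value without a Python loop.
    hist = np.bincount(flat, minlength=bins)
    cum_sum = np.cumsum(hist)
    span = cum_sum.max() - cum_sum.min()
    # Stretch the cumulative distribution to the 0..255 range.
    lookup = ((cum_sum - cum_sum.min()) * 255 / span).astype(np.uint8)
    # Map every pixel through the lookup table and restore the shape.
    return lookup[flat].reshape(image_matrix.shape)

low_contrast = np.array([[100, 100, 101],
                         [101, 102, 103]], dtype=np.uint8)
equalized = equalize_with_bincount(low_contrast)

print(equalized)
```

The input values span only 100 to 103, while the output spreads them over 85 to 255, which is the contrast stretch that equalization is after. Note that this sketch does not guard against a constant image, where the span would be zero.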

Credits - The above code is an inspiration from the article written by Tory Walker.

The above function returns an equalized image matrix when passed the original image matrix as an argument.

Let's write another function that computes the equalization for both the RGB image and the gray_scale image taking the above function in use.

def equalize_this(image_file, with_plot=False, gray_scale=False, bins=256):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_image = image_src[:, :, 0]
        g_image = image_src[:, :, 1]
        b_image = image_src[:, :, 2]
        # pass bins through so the argument is actually used
        r_image_eq = enhance_contrast(image_matrix=r_image, bins=bins)
        g_image_eq = enhance_contrast(image_matrix=g_image, bins=bins)
        b_image_eq = enhance_contrast(image_matrix=b_image, bins=bins)
        image_eq = np.dstack(tup=(r_image_eq, g_image_eq, b_image_eq))
        cmap_val = None
    else:
        image_eq = enhance_contrast(image_matrix=image_src, bins=bins)
        cmap_val = 'gray'
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Equalized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_eq, cmap=cmap_val)
        return True
    return image_eq

Let's test the above function -

equalize_this(image_file='lena_original.png', with_plot=True)

equalize_this(image_file='lena_original.png', with_plot=True, gray_scale=True)

The above plots are clear and we can say that the equalized images look better than the original images. This was implemented from scratch using the NumPy library.

Comparison

Let's compare the equalized image obtained from the cv2 library and the equalized image obtained from the code written from scratch.

We can notice there is a slight difference between the library image and scratch image. But both seem to be clear when compared with the original image. Here I complete my article with my own takeaway.

Takeaway

Personally, I learned a lot by exploring and implementing different methods for enhancing image contrast, especially by trying to implement the code from scratch while referring to and learning from other work.
It is always good to use the library methods, as they tend to be more optimized and well tested.
Image processing is a crucial subject to learn, and it deserves to be practiced with curiosity and one's own exploration.

Do give a read to my other articles and let me know your thoughts -

Image Flipping and Mirroring with NumPy and OpenCV

Mohammed Sameeruddin — Wed, 18 Nov 2020 12:21:54 GMT

Introduction

In this blog article, I will explain how we can mirror or flip an image in Python. All that is required is an understanding of the basic mathematics behind various matrix operations and manipulations.

Credits of Cover Image - Photo by Michael Busch on Unsplash

Package requirements

NumPy - For matrix operations and manipulating the same.
OpenCV - For reading the image and converting it into a 2D array (matrix).
Matplotlib - For plotting the matrix as an image.

For this mini-project, I am using the famous Lena image, which is widely used for testing computer vision models. Make sure to download this image and save it in the current working directory.

import cv2
import numpy as np
from matplotlib import pyplot as plt

Let's get started

First, we read the image file using the imread() method from the module cv2. To do that, we simply need to import the package and call the method. By doing this, we get the image in the form of a matrix.

By default, the imread() method reads the image in BGR (Blue, Green, Red) format. To convert the read image into the regular format, i.e., RGB (Red, Green, Blue), we use the cvtColor() method from the same module cv2.

def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_rgb

The above function returns an image matrix from the image file that is passed. It consists of a regular if and else condition that lets us get the image matrix either in gray_scale or RGB format.

Mirroring the Image

To mirror the image, we need to reverse each row of the matrix from left to right. For example, let's consider a matrix A.

>>> A = [
...     [4, 1, 1],
...     [2, 8, 0],
...     [3, 8, 1]
... ]

If we want to mirror this matrix (row-wise), then it would be -

>>> import numpy as np
>>> mirror_ = np.fliplr(A)
>>> mirror_
array([[1, 1, 4],
       [0, 8, 2],
       [1, 8, 3]])

We can also do this without using the NumPy module, by looping over the matrix and reversing each row. This would take a while on image matrices, as they are very large, and we don't want our code to perform very slowly.
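As a small sketch of the alternatives mentioned above, the snippet below mirrors the same matrix A with a plain Python loop and with NumPy slicing, and checks both against np.fliplr(). The variable names are illustrative.

```python
import numpy as np

A = np.array([[4, 1, 1],
              [2, 8, 0],
              [3, 8, 1]])

# Pure Python: reverse every row one at a time.
mirror_loop = [list(reversed(row)) for row in A.tolist()]

# NumPy slicing: keep all rows, step backwards through the columns.
mirror_slice = A[:, ::-1]

print(mirror_loop)
print(np.array_equal(mirror_slice, np.fliplr(A)))
```

The slicing form gives the same matrix as np.fliplr() without a Python-level loop, which is why it scales to large image matrices.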

def mirror_this(image_file, gray_scale=False, with_plot=False):
    image_rgb = read_this(image_file=image_file, gray_scale=gray_scale)
    image_mirror = np.fliplr(image_rgb)
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Mirrored")
        if not gray_scale:
            ax1.imshow(image_rgb)
            ax2.imshow(image_mirror)
        else:
            ax1.imshow(image_rgb, cmap='gray')
            ax2.imshow(image_mirror, cmap='gray')
        return True
    return image_mirror

The above function returns an image matrix that is reversed or flipped row-wise from left to right.

Let's plot the same -

mirror_this(image_file="lena_original.png", with_plot=True)

mirror_this(image_file="lena_original.png", gray_scale=True, with_plot=True)

Flipping the Image

To flip the image, we need to reverse the matrix column-wise from top to bottom (that is, reverse the order of its rows). For example, let's consider a matrix B.

>>> B = [
...     [4, 1, 1],
...     [2, 8, 0],
...     [3, 8, 1]
... ]

If we want to flip this matrix (column-wise), then it would be -

>>> import numpy as np
>>> flip_ = np.flipud(B)
>>> flip_
array([[3, 8, 1],
       [2, 8, 0],
       [4, 1, 1]])

We use NumPy to flip the matrix so that the code stays fast.

def flip_this(image_file, gray_scale=False, with_plot=False):
    image_rgb = read_this(image_file=image_file, gray_scale=gray_scale)
    image_flip = np.flipud(image_rgb)
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Flipped")
        if not gray_scale:
            ax1.imshow(image_rgb)
            ax2.imshow(image_flip)
        else:
            ax1.imshow(image_rgb, cmap='gray')
            ax2.imshow(image_flip, cmap='gray')
        return True
    return image_flip

The above function returns an image matrix that is reversed or flipped column-wise from top to bottom.
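Both operations can also be written as plain NumPy slices, which may help to see what fliplr() and flipud() do to the matrix. The sketch below reuses the small matrix B from above. The slices return views on the original data rather than copies, which is one reason the NumPy route is fast.

```python
import numpy as np

B = np.array([[4, 1, 1], [2, 8, 0], [3, 8, 1]])

flipped = B[::-1]       # reverse the order of the rows (up/down)
mirrored = B[:, ::-1]   # reverse the order of the columns (left/right)

print(np.array_equal(flipped, np.flipud(B)))   # True
print(np.array_equal(mirrored, np.fliplr(B)))  # True

# The slice is a view on B: no element is copied.
print(np.shares_memory(flipped, B))            # True
```

The same slices work on a colour image of shape (height, width, 3), because only the first two axes are touched.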

Let's plot the same -

flip_this(image_file='lena_original.png', with_plot=True)

flip_this(image_file='lena_original.png', gray_scale=True, with_plot=True)

The Complete Code

class ImageOpsFromScratch(object):
    def __init__(self, image_file):
        self.image_file = image_file

    def read_this(self, gray_scale=False):
        image_src = cv2.imread(self.image_file)
        if gray_scale:
            image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
        else:
            image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
        return image_rgb

    def mirror_this(self, with_plot=True, gray_scale=False):
        image_rgb = self.read_this(gray_scale=gray_scale)
        image_mirror = np.fliplr(image_rgb)
        if with_plot:
            self.plot_it(orig_matrix=image_rgb, trans_matrix=image_mirror, head_text='Mirrored', gray_scale=gray_scale)
            return None
        return image_mirror

    def flip_this(self, with_plot=True, gray_scale=False):
        image_rgb = self.read_this(gray_scale=gray_scale)
        image_flip = np.flipud(image_rgb)
        if with_plot:
            self.plot_it(orig_matrix=image_rgb, trans_matrix=image_flip, head_text='Flipped', gray_scale=gray_scale)
            return None
        return image_flip

    def plot_it(self, orig_matrix, trans_matrix, head_text, gray_scale=False):
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text(head_text)
        if not gray_scale:
            ax1.imshow(orig_matrix)
            ax2.imshow(trans_matrix)
        else:
            ax1.imshow(orig_matrix, cmap='gray')
            ax2.imshow(trans_matrix, cmap='gray')
        return True

Basic Image Operations Package

imo = ImageOpsFromScratch(image_file='lena_original.png')
### Mirroring ###
imo.mirror_this()
imo.mirror_this(gray_scale=True)
### Flipping ###
imo.flip_this()
imo.flip_this(gray_scale=True)

The same image results as above will be displayed. Now that everything is in place, we can create other image operations like equalize(), solarize(), etc. I have not covered them in this article, but I will in the next one.

What have I gained from this?

By implementing this, I have learned and understood the logic behind various image operations and how to put that logic into code.

We can also do the same using the well-known image library Pillow. But, as developers, we need to know how things work from start to end. This is something everyone should be curious about.

Please do read my other article related to image processing and computer vision, where I explain image convolution from scratch.

dora_explorer - Python Package to Explore the Place Details

Mohammed Sameeruddin — Wed, 21 Oct 2020 11:07:40 GMT

Introduction

Recently I took up a small project of developing a Python package and deploying it on PyPI. The package is based on the famous problem in Graph Theory - Travelling Salesman Problem.

A simple way to solve this is to take random nodes with random distance measures between them and apply Dynamic Programming, which breaks the bigger problem into smaller sub-problems. The optimal solution for the bigger problem is obtained by first finding the optimal solutions for the smaller sub-problems. Recursion is then applied to solve each sub-problem and thus the original problem.
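As a rough illustration of that idea, the sketch below solves a four-node instance with the classic Held-Karp recursion and memoization. The distance matrix `DIST` is a made-up textbook-style example and `shortest_tour_length` is a hypothetical function; neither is taken from the dora-explorer source.

```python
from functools import lru_cache

# Made-up symmetric distance matrix for four nodes (illustrative values only).
DIST = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def shortest_tour_length(dist, source=0):
    """Length of the shortest tour that starts at source, visits every node once and returns."""
    n = len(dist)
    everything = (1 << n) - 1  # bitmask with one bit set per node

    @lru_cache(maxsize=None)
    def best(visited, last):
        # Sub-problem: cheapest way to finish the tour when the nodes in
        # `visited` are already covered and we are standing on `last`.
        if visited == everything:
            return dist[last][source]
        return min(
            dist[last][nxt] + best(visited | (1 << nxt), nxt)
            for nxt in range(n)
            if not visited & (1 << nxt)
        )

    return best(1 << source, source)

print(shortest_tour_length(DIST))  # 80
```

Each (visited, last) pair is solved only once and then reused, which is what separates this from trying every permutation of the nodes.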

Note

To understand how to implement this programmatically, it is recommended that you refer to my Medium blog where I explain the procedure in detail.
In this package, I have implemented the same idea but taken it further by using real place coordinates and appropriate distance measures to find the shortest path.

Credits of Cover Image - Photo by Nicolas Van Leekwijck on Unsplash

Inception

The name of the package is dora-explorer. If you watched cartoons as a child, you probably already know who Dora is. Dora the Explorer is a cartoon series in which Dora explores different places. I have combined the idea of the cartoon with the travelling salesman problem so that a programmer or user can benefit from it.

Instead of just solving the problem with dynamic programming, I wanted it to work for real place names. In the background, the package fetches the coordinate values, computes the distances, and then finds the optimal shortest path. It also has methods to visualize the shortest path, including the place coordinates, on a map. Rather than just joining point A to point B with a line, it can fetch the actual route required to reach point B from point A. These features are what make the package unique.

Installation

The current version of the package is 0.2.2 and the link can be found below.

https://pypi.org/project/dora-explorer/

It can be easily installed via pip.

pip install dora-explorer --user

Requirements

plotly
mpu
unidecode
geonamescache
requests

The pip command automatically installs the requirements.

`import` the package

from dora_explorer.distance_locator import DistanceLocator
from dora_explorer.travel_places import GeoTraveller
from dora_explorer.tiny_explore import DoraTheExplorer

Creating objects.

dis_loc = DistanceLocator()
geo_travel = GeoTraveller()

Defining from_place and to_place.

from_place = 'singapore'
to_place = 'new zealand'

Distance checking

The method get_distance() is used to get the distance between two places. By default, the distance is returned in kilometers. The other allowed units are -

Miles
Meters
Feet and
Yards

## kilometers
from_to_dis_km = dis_loc.get_distance(from_=from_place, to_=to_place)
print("The distance in kms from {} to {} - {} kms".format(from_place, to_place, from_to_dis_km))

## miles
from_to_dis_miles = dis_loc.get_distance(from_=from_place, to_=to_place, in_miles=True)
print("The distance in miles from {} to {} - {} miles".format(from_place, to_place, from_to_dis_miles))

## meters
from_to_dis_meters = dis_loc.get_distance(from_=from_place, to_=to_place, in_meters=True)
print("The distance in meters from {} to {} - {} meters".format(from_place, to_place, from_to_dis_meters))

## feet
from_to_dis_feet = dis_loc.get_distance(from_=from_place, to_=to_place, in_feet=True)
print("The distance in feet from {} to {} - {} feet".format(from_place, to_place, from_to_dis_feet))

## yards
from_to_dis_yards = dis_loc.get_distance(from_=from_place, to_=to_place, in_yards=True)
print("The distance in yards from {} to {} - {} yards".format(from_place, to_place, from_to_dis_yards))

Output

'The distance in kms from singapore to new zealand - 8358.12 kms'
'The distance in miles from singapore to new zealand - 5194.6 miles'
'The distance in meters from singapore to new zealand - 8358116.45 meters'
'The distance in feet from singapore to new zealand - 27422980.08 feet'
'The distance in yards from singapore to new zealand - 9143779.4 yards'
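The internals of get_distance() are not shown in this article, so the following is only a sketch of how a straight-line distance between two coordinates and its unit conversions could be computed. It assumes the haversine formula with a mean Earth radius of 6371 km. The coordinates for Delhi and Bengaluru are approximate, and `haversine_km` and `convert_km` are hypothetical helpers, not part of dora-explorer.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

# Conversion factors from kilometres, based on the standard unit definitions.
KM_TO = {
    "km": 1.0,
    "miles": 1 / 1.609344,
    "meters": 1000.0,
    "feet": 1000 / 0.3048,
    "yards": 1000 / 0.9144,
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def convert_km(distance_km, unit="km"):
    """Convert a distance in kilometres to the requested unit, rounded to two decimals."""
    return round(distance_km * KM_TO[unit], 2)

delhi = (28.6139, 77.2090)       # approximate coordinates
bengaluru = (12.9716, 77.5946)   # approximate coordinates

km = haversine_km(*delhi, *bengaluru)
for unit in KM_TO:
    print(unit, convert_km(km, unit))
```

The numbers from this sketch may not match the package's figures to the last digit, since its exact formula and conversion factors are not shown here.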

We can also visualize the distance plot from from_place to to_place. To do this, we require a map, so it is preferable to have a Mapbox API token, which can be obtained from the official website.

I have saved my API token in a mapbox_api.txt file.

`read()` the MapBox API

with open(file='mapbox_api.txt', mode='r') as api_file:
    map_api = api_file.read()
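A slightly more defensive way to read the token is sketched below. `load_token` is a hypothetical helper, not part of the package. It strips the trailing newline that most editors add to the file and returns None when the file is missing or empty, so the result can be passed on as geo_token either way.

```python
from pathlib import Path

def load_token(path="mapbox_api.txt"):
    """Return the token stored in `path`, or None if the file is missing or empty."""
    token_file = Path(path)
    if not token_file.is_file():
        return None
    # strip() drops surrounding whitespace such as a trailing newline.
    token = token_file.read_text(encoding="utf-8").strip()
    return token or None

map_api = load_token()
print("token found" if map_api else "no token, falling back to a plain line plot")
```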

To get the distance plot, we have to use the method get_distance_plot(), and the argument geo_token should take a string value (the API token).

There are two main conditions to get the appropriate distance plot.

Case - 1: If geo_token=map_api then

If from_place and to_place belong to the same country, then with_map=True and with_directions=True.

## belong to the same country - India
from_place = 'delhi'
to_place = 'bengaluru'
dis_loc.get_distance_plot(
    from_=from_place,
    to_=to_place,
    with_map=True,
    with_directions=True,
    geo_token=map_api
)

If from_place and to_place belong to different countries, then with_map=True and with_directions=False.

from_place = 'singapore'
to_place = 'new zealand'
dis_loc.get_distance_plot(
    from_=from_place,
    to_=to_place,
    with_map=True,
    with_directions=False,
    geo_token=map_api
)

Case - 2: If geo_token=None or geo_token="" then with_map=False and with_directions=False.

The distance plot will simply be a line plot between the two coordinates (latitudes and longitudes).

from_place = 'delhi'
to_place = 'bengaluru'
dis_loc.get_distance_plot(from_=from_place, to_=to_place)
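The two cases above boil down to a small decision rule, sketched here as a hypothetical helper `plot_flags`. It is not part of the package, and the caller is assumed to already know whether both places are in the same country.

```python
def plot_flags(geo_token, same_country):
    """Return (with_map, with_directions) following the two cases described above."""
    if not geo_token:
        # Case 2: geo_token is None or "" -> plain line plot, no map, no directions.
        return False, False
    # Case 1: a token is available, so a map can be drawn;
    # directions are only requested when both places are in the same country.
    return True, same_country

print(plot_flags("my-token", same_country=True))   # (True, True)
print(plot_flags("my-token", same_country=False))  # (True, False)
print(plot_flags(None, same_country=True))         # (False, False)
```

The returned pair could then be passed as with_map and with_directions to get_distance_plot().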

Shortest Path

In order to get the shortest path, a list of places should be given as input. There are two possible conditions to get the result.

Case - 1: If the list of places belongs to the same country, considering geo_token=map_api.

Let's take places belonging to Spain.

place_list = ["madrid", "barcelona", "alicante", "palma"]
explore = DoraTheExplorer(place_list=place_list)
explore.get_path(
    source_city='palma',
    with_map=True,
    with_directions=True,
    geo_token=map_api
)

Output

'plot is saved successfully ...'
'palma >> barcelona >> madrid >> alicante >> palma'

Note - If geo_token=None and all the places belong to the same country, then a line plot is plotted after joining the coordinates (latitudes and longitudes).

Minimum distance to cover all the places

## kilometers
min_distance_km = explore.get_min_dis(source_city='palma')
print("The minimun distance to cover {} in kms - {}".format(place_list, min_distance_km))

## miles
min_distance_miles = explore.get_min_dis(source_city='palma', in_miles=True)
print("The minimum distance to cover {} in miles - {}".format(place_list, min_distance_miles))

## meters
min_distance_meters = explore.get_min_dis(source_city='palma', in_meters=True)
print("The minimum distance to cover {} in meters - {}".format(place_list, min_distance_meters))

## feet
min_distance_feet = explore.get_min_dis(source_city='palma', in_feet=True)
print("The minimum distance to cover {} in feet - {}".format(place_list, min_distance_feet))

## yards
min_distance_yards = explore.get_min_dis(source_city='palma', in_yards=True)
print("The minimum distance to cover {} in yards - {}".format(place_list, min_distance_yards))

Output

"The minimun distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in kms - 1372.78"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in miles - 853.19"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in meters - 1372780.0"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in feet - 4504091.18"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in yards - 1501821.32"

Case - 2: If the list of places belongs to different countries, considering geo_token=map_api or geo_token=None.

Let's take the places belonging to different countries.

place_list = ['london', 'madrid', 'bengaluru', 'delhi']
explore = DoraTheExplorer(place_list=place_list)
explore.get_path(
    source_city='london',
    with_map=True,
    with_directions=True,
    geo_token=map_api
)

Note - Finding the shortest path or visualizing it can be difficult. The route directions API does not work if places belong to different countries.

Output

"Cannot find the shortest path, as the cities ['london', 'madrid', 'bengaluru', 'delhi'] do not belong to the same country."

Minimum distance to cover all the places

min_distance = explore.get_min_dis(source_city='london')
print(min_distance)

Output

"Cannot find the minimum distance to cover the cities ['london', 'madrid', 'bengaluru', 'delhi'] as they do not belong to the same country."

That's it. A simple package to explore the details of places is finally deployed and documented.

Additional Details

The package is still under development. Quite a few features still need to be added, and some bugs need to be resolved.
If you have liked the idea and the package, you can star it on my GitHub page.
If you would like to contribute, you are welcome to open a PR.
If you would like to report an issue, I would be happy to check and resolve the same.

End
