Indian Railways is a statutory body that operates India's national railway system and is administered by the Ministry of Railways, Government of India. As of March 31, 2022, it runs the world's fourth-largest national railway system, with a total route length of 67,956 kilometers (42,226 mi). On a daily basis, it operates 13,169 passenger trains on both long-distance and suburban routes, covering almost 7,325 stations across India.

Indian railways have long been one of the go-to options to travel from one place to another. This blog is mainly intended to discuss how we can visualize the railways' station-level data directly on the map.

Indian Railways data was collected by Sanjay and Sajjad.

To know more about the difficulties they faced while collecting the data - mainly maintaining its authenticity - you can read about it here.

**Source**: https://github.com/datameet/railways

There are three files -

**stations** - contains the data of all the stations of India along with the coordinate values and other details. The file is in `GeoJSON` format.

`{ "geometry": { "type": "Point", "coordinates": [75.4516454, 27.2520587] }, "type": "Feature", "properties": { "state": "Rajasthan", "code": "BDHL", "name": "Badhal", "zone": "NWR", "address": "Kishangarh Renwal, Rajasthan" }}`

**trains** - contains the data of all the train paths (railway routes) along with the coordinate values and other details. The file is in `GeoJSON` format.

`{ "geometry": { "type": "LineString", "coordinates": [ [72.89173899999999, 19.070320000000002], [78.2266994458, 26.0352337224], [78.18700399999999, 26.145594], [78.18229199999999, 26.216483] ] }, "type": "Feature", "properties": { "third_ac": true, "arrival": "15:35:00", "from_station_code": "LTT", "name": "Mumbai LTT - Gwalior (Weekly) Special", "zone": "CR", "chair_car": true, "first_class": true, "duration_m": 45, "sleeper": true, "from_station_name": "LOKMANYA TILAK TERM", "number": "01101", "departure": "15:50:00", "return_train": "01102", "to_station_code": "GWL", "second_ac": true, "classes": "", "to_station_name": "GWALIOR JN", "duration_h": 23, "type": "Exp", "first_ac": true, "distance": 1216 }}`

**schedules** - contains the data of all the train schedules along with other details. The file is in `JSON` format.

`{ "arrival": "None", "day": 1, "train_name": "Falaknuma Lingampalli MMTS", "station_name": "KACHEGUDA FALAKNUMA", "station_code": "FM", "id": 302214, "train_number": "47154", "departure": "07:55:00"}`

In addition to this, I downloaded the shapefile of India that contains the boundaries of the states.

**Import Packages**

As always, we begin by importing the packages that are essential for this project.

**JSON file reader** - all the `JSON` file data, including the `GeoJSON` format, can be read using the below function.

**SHP file reader** - helps to read the shapefile data.

The files `stations.json` and `trains.json` need to be converted into shapefiles so that the data can be easily manipulated further using the `geopandas` package.
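Before handing the data to `geopandas`, each GeoJSON feature can be flattened into plain columns with the standard `json` module. A minimal sketch using the sample station feature shown above (field names come from the dataset itself):

```python
import json

# A single station feature in the format shown above
feature_str = '''{ "geometry": { "type": "Point", "coordinates": [75.4516454, 27.2520587] },
  "type": "Feature",
  "properties": { "state": "Rajasthan", "code": "BDHL", "name": "Badhal",
                  "zone": "NWR", "address": "Kishangarh Renwal, Rajasthan" }}'''

feature = json.loads(feature_str)

def station_record(feature):
    """Flatten one GeoJSON Point feature into a plain dict of columns."""
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order is [lon, lat]
    rec = dict(feature["properties"])              # state, code, name, zone, address
    rec["lon"], rec["lat"] = lon, lat
    return rec

rec = station_record(feature)
print(rec["name"], rec["state"], rec["lon"], rec["lat"])
# Badhal Rajasthan 75.4516454 27.2520587
```

A list of such records converts directly into a `geopandas.GeoDataFrame` with `gpd.points_from_xy`.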

**stations** shape file

**trains** shape file

**Indian States Boundaries**

In the `stations` shapefile, there are many missing values (especially in the `state` column). To get the state name, we need to perform a spatial join, keeping `stations` on the left and the Indian states' boundaries on the right.
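In current geopandas the join is roughly `gpd.sjoin(stations, states, how="left", predicate="within")`. Conceptually it is a point-in-polygon test per station; a self-contained sketch of that test with a hypothetical rectangular "state" boundary (the box coordinates are illustrative, not the real Rajasthan border):

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon (list of (x, y) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge cross the horizontal ray going right from the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular boundary roughly around Rajasthan
rajasthan_box = [(69.5, 23.0), (78.3, 23.0), (78.3, 30.2), (69.5, 30.2)]
print(point_in_polygon(75.45, 27.25, rajasthan_box))  # True: Badhal station falls inside
print(point_in_polygon(80.0, 27.25, rajasthan_box))   # False: outside the box
```

geopandas does the same test against the real state polygons and copies the matching state's attributes onto each station row.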

Similarly, in the `trains` shapefile, there are a few stations that are listed under multiple states. The following is the list for the same.

- **MAULA ALI** - Telangana, Punjab
- **NAGARGALI** - Maharashtra, Karnataka
- **INDARGARH** - Rajasthan, Madhya Pradesh
- **ALAI** - Rajasthan, Uttarakhand
- **Jagdishpur** - Bihar, Jharkhand
- **KOMATIPALLI** - Telangana, Andhra Pradesh
- **NAGARI** - Andhra Pradesh, Maharashtra
- **Kashinagar Halt** - West Bengal, Odisha
- **SULGARE** - Maharashtra, Chhattisgarh

Now that the data is prepared, we can use Plotly to get an interactive map plot.

With the above function, we can plot the stations by providing the `state_name` param.

**Andhra Pradesh** stations

With the above function, we can plot the train paths by providing the `state_name` param.

**Andhra Pradesh** trains

With the above function, we can plot the train paths by providing the `from_` station.

**Chennai Beach** station

Using all the procedures above and other packages (like `dash`) that support Plotly figures, we can develop a full-fledged web application.

Since we have two aspects, `stations` and `trains`, we can create `HTML` tabs and display the corresponding map accordingly.

Apart from `state_name`, it also has additional inputs, such as whether to display the boundary of the state.

The below plot shows each stop from the `from_` station to all the remaining stations.

The below plot does not show any stops, but rather all the stations reachable from the `from_` station in the form of a network.

**From station to station**

If we select a specific `to_` station, we get to see additional details of the train like -

- **Arrival** - the time when the train arrives.
- **Departure** - the time when the train departs.
- **1st Class** - `1` if the train has 1st class, else `0`.
- **Time** - duration to reach the destination station.
- **Sleeper** - `1` if the train has a sleeper class, else `0`.
- **Distance** - the total distance in kilometers.

The map for the same is below.

Well, that's it for this article. You can find my work in the below links.

You can also subscribe to my newsletter for such exclusive content. Thanks all.

**End**

The Geoscience and Remote Sensing Society - one of the well-known communities for learning and contributing to geospatial science - sponsored the DeepGlobe machine vision challenge in 2018, which involves deep analysis of satellite images of Earth.

The 3 challenges are -

- Road Extraction
- Building Detection
- Land Cover Classification

As part of this, I picked up the problem of **Road Extraction**, as roads have always been crucial in various aspects, be it transportation, traffic management, city planning, road monitoring, GPS navigation, etc.

The DeepGlobe challenges are purely research-based and focus on real problems.

**Source**: https://competitions.codalab.org/competitions/18467

In total, we have `14796` images, out of which -

- `6226` training satellite images
- `6226` training mask images
- `1243` validation satellite images
- `1101` test satellite images

Validation and test data sets do not contain respective mask images. This is something we need to predict.

If we observe the masks, there are `2` classes. Therefore, it is a binary image segmentation task - a computer vision task wherein we are dealing with satellite images.

Since it is an image segmentation task, the metrics that we stick to are -

**IoU Score** - the area of overlap between the predicted segmentation and the ground truth divided by the area of their union. An **IoU** of `0` means there is no overlap between the predicted segmentation and the ground truth; an **IoU** of `1` means the predicted segmentation and ground truth are exactly identical.
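For two binary masks, IoU reduces to counting pixels; a small pure-Python sketch:

```python
def iou_score(pred, truth):
    """IoU for two binary masks given as equally-sized 2D lists of 0/1."""
    intersection = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                intersection += 1   # road in both masks
            if p or t:
                union += 1          # road in at least one mask
    return intersection / union if union else 1.0  # two empty masks: perfect match

pred  = [[1, 1, 0],
         [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(iou_score(pred, truth))  # 2 / 4 = 0.5
print(iou_score(pred, pred))   # identical masks -> 1.0
```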

**Accuracy** - the ratio of correct predictions to the total number of input samples. The one caveat here is that the classes need to be balanced for this metric to be meaningful.

The dataset's memory consumption is around `4GB`. We can download it easily using the **Kaggle API** or the **CURL WGET** Chrome extension, which takes less time than a regular download. For convenience, I have already downloaded it to my system. The structure of the data after extracting the files is as follows.

The `metadata.csv` file contains image paths for all training, validation, and test sets.

Originally, the size of each image is `(1024, 1024)`. If we train the model with images of this size, it would end up having more trainable parameters. We need to make sure the images are small and uniform in size.

It is also mentioned that some of the training masks are not binary images. A binary image is one whose pixels are either `0` or `255`. To maintain uniformity, we need to binarize the training masks with the help of the `OpenCV` package.
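The binarization step is a single threshold: every pixel above a cutoff becomes 255, everything else 0 - roughly what `cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)` returns as its second value. A pure-Python sketch of the same operation:

```python
def binarize(mask, cutoff=127):
    """Map every grayscale pixel to 0 or 255, row by row."""
    return [[255 if px > cutoff else 0 for px in row] for row in mask]

# A mask with in-between gray values, as found in some training masks
mask = [[0, 30, 200],
        [128, 255, 90]]
print(binarize(mask))  # [[0, 0, 255], [255, 255, 0]]
```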

To know more about image binarization, you can feel free to check my blog.

The outputs are as follows.

The ideal shape of an image that we want is `(256, 256)`. We need to resize all the images in the training, validation, and test sets.

Some practitioners also consider the shape of the images to be `(128, 128)`.
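In practice the resize is one call, `cv2.resize(img, (256, 256))` (OpenCV defaults to bilinear interpolation). To see what a resize actually does, here is a pure-Python nearest-neighbour sketch on a tiny 4x4 "image":

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list, e.g. 1024x1024 -> 256x256."""
    in_h, in_w = len(img), len(img[0])
    # For each output pixel, sample the nearest source pixel
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
small = resize_nearest(img, 2, 2)
print(small)  # [[1, 3], [9, 11]]
```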

Let's visualize a sample of training data that includes both satellite image and its respective mask image.

By just observing this sample data, we can tell that this task is limited to extracting the larger paths which are clearly visible.

It would have been perfect if there were validation mask images so that it could become easy to track the models' performance. Unfortunately, we are not provided the validation data with the mask images.

To mitigate this problem, we can take `1%` of the training data as validation and train the models on the rest.

This sample set contains a mix of larger paths and thinner paths.

The test dataset does not contain the targets or mask images. This is something we need to predict. Let's visualize the same.

This sample set contains a mix of larger paths and thinner paths.

This ends our EDA. However, if you would like to know the complete details with all the code, you can check my notebook hosted on the Jovian platform.

Transfer learning is a technique in ML and DL wherein the knowledge a model gained from solving one problem is stored and reused to solve another, quite similar problem.

The main advantage of transfer learning is that we do not have to train models from scratch and can use pre-trained ones instead. TensorFlow by Google is a handy library for implementing deep learning models. It also includes various pre-trained models that can be adapted to many problem statements with slight tweaks to the model's architecture.

We are dividing this modeling (transfer learning) part into `2` aspects.

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that for those images that contain larger paths, the model is able to extract the path but not as efficiently as it needs to be.

The below output is the test data prediction. The right image is the prediction mask. As usual, the model only identifies the larger paths but not the thinner paths. We can also see some incomplete paths that have been predicted.

The below output is the validation data prediction. The rightmost image is the prediction mask, and we can observe that the performance is quite similar to the previous (`ResNet34`) model. Complex path patterns covered by bushes are not identified properly.

The below output is the test data prediction. The right image is the prediction mask. As usual, the model only identifies the larger paths but not the thinner paths. We can also see some incomplete paths that have been predicted.

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that for those images that contain larger paths, the model is able to extract the path but not as efficiently as it needs to be.

The below output is the test data prediction. The right image is the prediction mask. The model's performance is not up to the mark when compared to the above previous models.

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that for those images that contain larger paths, the model is able to extract the path but not as efficiently as it needs to be.

For the complex patterns, the model doesn't perform well. But it does a pretty good job in identifying the thinner patterns which are clear.

The below output is the test data prediction. The right image is the prediction mask. For some patterns that cannot be identified by `ResNet34` or `VGG16`, it does a good job.

The below output is the validation data prediction. The rightmost image is the prediction mask. It was only able to extract the larger paths that looked very clear.

Test data prediction.

I have trained the same list of models by slightly augmenting the data. The results are not up to the mark.

Comparing the results in order to select the best model.

The results on the augmented data are not as good as those on the original (no augmentation) data, which is clear from the performances of the various models.

In fact, the masks obtained for the augmented data are not as clean as those obtained for the original (no augmentation) data.

Hence, it is not worthwhile to select the models that have been trained on the augmented data.

The only option left now is to select the best model out of other models that have been trained on the original (no augmentation) data.

The best models from all of the above are `ResNet34`, `VGG16`, and `Inception`.

Feel free to check my notebook for transfer learning hosted on the Jovian platform.

Apart from implementing transfer learning models, I also wanted to cross-check the performance by implementing a famous deep learning algorithm called U-NET that is mostly used for image segmentation tasks.

The performance of the U-NET model is significantly better than that of the transfer learning models. As usual, we shall divide this into `2` aspects.

The below output is the validation data prediction. The rightmost image is the prediction mask and we can observe that the model is able to extract most of the paths (including thinner paths).

Even the paths that have not been highlighted in the original mask are still being identified with the help of this model.

Test data prediction.

In the case of data augmentation, the model still performs its best. Augmentation has been helpful to identify complex patterns. Results of the validation prediction can be seen below.

Test data prediction.

The above are the results obtained by the U-NET model developed from scratch.

The left image corresponds to the results of the model that is trained on the original (no augmentation) data.

The right image corresponds to the results of the model that is trained on the augmented data.

Clearly, by observing the graphs, we can say that the left one (model) shows slight overfitting compared with the right one (model), although the difference seems negligible.

Personally, I feel that the model trained on the augmented data does show some closer predictions. This is something we should observe in the process of error analysis.

Feel free to check my notebook for U-NET modeling hosted on the Jovian platform.

Error analysis is very helpful (when we have multiple models) for deciding which model is the best and which is not. This is the stage where we learn how to improve the model or the dataset.

In total, we have `6` models that have been trained on both augmented and non-augmented data. The **IoU** score is the performance metric we consider. Based on my observations, the U-NET trained on non-augmented data is the best model since it has the highest **IoU** score.

Error analysis should be performed on sample data, but if you have the time and an efficient system, you can also consider the whole dataset. For this analysis, I have considered only `100` random images.

Feel free to check my notebook for deep error analysis hosted on the Jovian platform.

As part of this, we need to extract the road path when a single satellite image is passed as input to the model. This is important in real-life scenarios, as we do not generally pass the entire test set to the model. Instead of getting just the prediction mask, it is always better to highlight the road path on the satellite image itself. We can highlight the path in any color.

Once we have the model loaded, we can always pass a single image for getting the route. Upon getting the prediction mask, with the help of simple image processing techniques, we can still retain the background and only highlight the path.

Hence, the below function.
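A hedged, pure-Python sketch of that function (the real implementation would operate on NumPy arrays; the names and the green highlight color here are illustrative): wherever the predicted mask says "road", the satellite pixel is recolored, and the background is kept as-is.

```python
def highlight_roads(image, mask, color=(0, 255, 0)):
    """Overlay the predicted road mask on the satellite image.

    image: HxW list of (r, g, b) tuples; mask: HxW list of 0/255 values.
    Returns a new image with road pixels replaced by `color`.
    """
    return [[color if m == 255 else px for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

image = [[(10, 10, 10), (20, 20, 20)],
         [(30, 30, 30), (40, 40, 40)]]
mask  = [[255, 0],
         [0, 255]]
out = highlight_roads(image, mask)
print(out[0][0], out[0][1])  # (0, 255, 0) (20, 20, 20)
```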

Jovian Project Link https://jovian.ai/msameeruddin/collections/deep-globe-road-extraction

LinkedIn https://www.linkedin.com/in/mohammed-sameeruddin-596a70133/

- https://www.appliedaicourse.com/
- https://youtu.be/GAYJ81M58y8
- https://www.cs.toronto.edu/~urtasun/publications/mattyus_etal_iccv17.pdf
- https://arxiv.org/pdf/1908.08223.pdf
- https://www.kaggle.com/vanvalkenberg/road-maps-from-aerial-images
- https://www.kaggle.com/balraj98/road-extraction-from-satellite-images-deeplabv3

Well, that's it from my end. Make sure to subscribe to my newsletter to never miss an update from the exclusive content that I post.

San Francisco is the cultural, commercial, and financial capital of the state of California in the United States. It is the 17th most populated city in the United States and a renowned tourist destination noted for its cool summers, the Golden Gate Bridge, and some of the world's best restaurants. Despite being a city known for its growth, San Francisco remains one of the most dangerous places to live due to an increase in criminal activities.

**SF OpenData** - an open data catalog of San Francisco has provided 12 years of crime data dated from **1/1/2003** to **5/13/2015**. These criminal records are mostly the incidents derived from **SFPD Crime Incident Reporting System**. They have published this dataset on Kaggle.

The main agenda of this whole case study is to predict the category of crime for incidents that may occur in the future, helping the department reduce the time and effort of organizing the data.

**Source**: https://www.kaggle.com/c/sf-crime/data

In total, we have 3 files -

- *train.csv.zip*
- *test.csv.zip*
- *sampleSubmission.csv.zip*

The training data has 9 columns that include the target column as well.

- The test data has 6 columns that exclude the target column (this is something we need to predict).

- There are `39` different classes of crimes that need to be classified.
- Hence, this leads to a multi-class classification problem.

Since it is a classification problem, the metrics that we stick to are -

**Multi Log-Loss** - the ultimate aim of any machine learning model is to reduce the loss. For this problem, a perfect classifier would be one that reduces the multiclass logarithmic loss to `0`.

**Confusion Matrix** - a summary of prediction results on a classification problem. The number of correct and incorrect predictions is summarized with count values, broken down by each class.

- Predict the probability of each data point belonging to each of the `39` classes.
- Constraints:
  - Interpretability
  - Class probabilities are needed
  - Penalize the errors in class probabilities => Metric is Log-loss
  - No latency constraints
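The metric itself is short to compute; a minimal pure-Python sketch of multiclass log-loss (mean negative log probability assigned to the true class, with clipping to avoid `log(0)`):

```python
import math

def multi_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log probability of the true class.

    y_true: list of class indices; y_prob: list of per-class probability vectors.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Perfectly confident and correct -> loss ~ 0
print(round(multi_log_loss([0, 1], [[1.0, 0.0], [0.0, 1.0]]), 6))
# Uncertain predictions are penalized
print(round(multi_log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]]), 4))  # ln(2) = 0.6931
```

This matches what `sklearn.metrics.log_loss` computes for hard labels.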

The original training data that are given -

The original test data that are given -

Clearly, from the `Dates` column we can extract time-based features such as -

- *date*
- *year*
- *month*
- *day*
- *hour*
- *minute*
- *season*
- *hour_type*
- *time_period*
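With pandas, most of these are one-liners (`df['Dates'].dt.year`, `.dt.month`, ...). The same extraction with only the standard library, including derived `season` and `time_period` buckets (the bucket boundaries below are illustrative, not the ones used in the case study):

```python
from datetime import datetime

def time_features(ts):
    """Extract time-based features from a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    season = {12: "Winter", 1: "Winter", 2: "Winter",
              3: "Spring", 4: "Spring", 5: "Spring",
              6: "Summer", 7: "Summer", 8: "Summer"}.get(dt.month, "Fall")
    # Illustrative 6-hour buckets for time_period
    period = ["Mid-Night", "Morning", "Noon", "Evening"][dt.hour // 6]
    return {"date": dt.date().isoformat(), "year": dt.year, "month": dt.month,
            "day": dt.day, "hour": dt.hour, "minute": dt.minute,
            "season": season, "time_period": period}

print(time_features("2015-05-13 23:53:00"))
```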

Also, from the `Address` column, we can extract the address type.

Training data after extracting time-based and address-type features.

Test data after extracting time-based and address-type features.

So, we got some important features. We can perform the univariate analysis considering each column and get to know how the data is distributed.

- The above is the class distribution of the column `category`.
- Clearly, we can see that `Larceny/Theft` is the most common type of crime across all the years.
- The last `5` to `6` crimes are negligible, meaning they occurred very rarely.
- The data is not balanced. Therefore, while building the model it is better to do stratification-based splitting.

- From the above plot, we can tell that most of the crimes occurred on `Streets` and `Crosses`.
- `St` stands for street; `Cross` signifies a junction point.

- `Southern` is the bay area where most of the crimes got reported.

- We know that the data is recorded from **1/1/2003** to **5/13/2015**.
  - The data for the year **2015** is not fully recorded.
- The year **2013** has the most crimes. The other years have a similar occurrence range.

- In all of the months, we can observe that the occurrence of crimes roughly ranges from `60k` to `80k`.

- `Friday` is the day when most of the crimes occurred.
- `Sunday` is the day when fewer crimes (compared with other days) occurred.
  - This is most likely because it is a holiday.

- This distribution is almost similar.

- It is observed from the above that most of the crimes happen either in the evening or at midnight.

- We can observe that most of the crimes usually happen either during the evening or at night.

- `Morning`, `Mid-Night`, and `Night` are considered to be the time periods suitable for crimes to happen.
  - In fact, they have a similar distribution.
- `Evening` and `Noon` are the time periods when business is usually ongoing.

- The above is an animation plot showing the occurrences that had happened yearly.
- The above is the count of each occurrence type that occurred in the police department.
- The most common crime that occurred is `Larceny/Theft`.
- This is something we observed in the plot of overall occurrences in all the years.

- The above shows the exact location of the types of crimes that occurred per police department.
- This is a year-wise plot, and the police department chosen is `Richmond`.

- The above is the choropleth map based on the count of the occurrences per police department.
- Southern is the police department where most of the crimes got reported.
- This is a month-wise plot for the year `2015`.

- The GIF showcases the density map of the top `12` crimes that occurred per year.
- The most important thing to observe is that `Larceny/Theft` is always first.
  - The second is `Other Offenses`.
- The density is taken in descending order and hence retains the top crimes.

In order to do multivariate analysis, we must require other features based on the columns that we have. We can follow the below steps to obtain them.

- Categorical columns can be encoded with `One-Hot-Encoding`.
  - Procedure to follow up - here.
- We can extract spatial distance features by considering all the police stations in San Francisco and computing the **haversine distance** from each station to every coordinate value present in the data.
  - Procedure to follow up - here.
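The haversine distance itself is a short formula; a self-contained sketch computing the distance from one station to a nearby point (the coordinates below are illustrative, not the exact station locations):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative: two points a few blocks apart in San Francisco
d = haversine_km(37.7725, -122.3894, 37.7749, -122.4194)
print(round(d, 2))  # a couple of kilometres
```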

- We can perform some arithmetic operations between latitude and longitude. These are known as location-based features.
- Procedure to follow up - here.

- The `Address` column can be converted to `Bag of Words` and `TF-IDF` representations.
  - Procedure to follow up - here.
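With scikit-learn this is `CountVectorizer` / `TfidfVectorizer` applied to the address strings; here is a pure-Python sketch of the Bag-of-Words half on two toy addresses (the address values are illustrative):

```python
def bag_of_words(docs):
    """Bag-of-Words: shared vocabulary plus one count vector per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for word in doc.lower().split():
            vec[index[word]] += 1   # count occurrences of each token
        vectors.append(vec)
    return vocab, vectors

addresses = ["OAK ST / LAGUNA ST", "800 Block of BRYANT ST"]
vocab, vectors = bag_of_words(addresses)
print(vocab)
print(vectors)
```

TF-IDF then re-weights these counts so that tokens appearing in every address (like `st`) contribute less.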

- In doing so, we end up with `130` features, which is pretty good to learn from the data. Earlier we had around `7` features.

We shall visualize a **TSNE** plot to see whether we can separate the classes fully or not. Since the data is large, I have considered only `10%` of the whole data via a stratified split; `10%` of the whole data is almost `87804` data points.

- Currently, the dataset consists of `39` classes.
- From the above plot, we can clearly see that the data is highly imbalanced.
- The clusters are not separated properly.

- The above is better compared to when we considered all the categories.
- The clusters also formed in a better way.
- The separation of the clusters is also better.

- The above plot is far better compared to all the previous plots.
- We can see the separation clearly.

Now that we got the features it is time for us to apply various classification models and check the performance of each. For this problem, we will go with the following models.

- *Logistic Regression*
- *Decision Tree*
- *Random Forest*
- *XGBoost Classifier*

We could also try *KNN*, but I did not implement it as it takes more time to train, and due to my system's limited capacity I opted to skip it.

Let's train some models...

The dummy classifier, also known as a random model, assigns a random class label to each data point. This is helpful to check whether the other, more serious models do better than chance. The loss that we obtain from it can be set as a threshold, and we aim to achieve a loss lower than that threshold.
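For a 39-class problem, the threshold of a uniform model is easy to derive: it assigns probability 1/39 to every class, so its expected log-loss is ln(39). Any real model must beat this number.

```python
import math

n_classes = 39
# A uniform model predicts p = 1/39 for the true class of every point,
# so its expected multiclass log-loss is -ln(1/39) = ln(39).
baseline_loss = math.log(n_classes)
print(round(baseline_loss, 4))  # 3.6636
```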

After hyperparameter tuning, the best value of `C` turned out to be `30`, with `l2` as the regularizer. We proceed with `log loss` since `accuracy` won't help as the data is imbalanced. Fortunately, the loss is lower than that of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

After hyperparameter tuning, the best value of `max_depth` turned out to be `50` and `min_samples_split` to be `500`. We proceed with `log loss` since `accuracy` won't help as the data is imbalanced. Fortunately, the loss is lower than that of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

After hyperparameter tuning, the best value of `max_depth` turned out to be `8` and `n_estimators` to be `100`. We proceed with `log loss` since `accuracy` won't help as the data is imbalanced. Fortunately, the loss is lower than that of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

After hyperparameter tuning, the best value of `max_depth` turned out to be `8` and `n_estimators` to be `150`. We proceed with `log loss` since `accuracy` won't help as the data is imbalanced. Fortunately, the loss is lower than that of the dummy classifier.

When the model is tested on the test data, the overall Kaggle score is -

We have to calibrate the probabilities (predictions), as probability plays a vital role in this classification problem.

The results of `StackingClassifier` are not up to the mark.

- From the above, we can observe that all models (except `DummyClassifier`) performed well.
  - The loss of each model is less than that of the `DummyClassifier`.
- The best model compared to all other models is `XGBoost`. Its loss is very low.

The full code for modeling can be found here.

As part of this, we need to predict the category of the crime when a single original test data point is passed as input to the model. This is important in real-life scenarios, as we do not generally pass featurized and preprocessed test data to the model.

The code for the single-point prediction can be found here.

The video demonstration.

Jovian Project Link -

LinkedIn Profile -

- https://www.appliedaicourse.com/
- https://towardsdatascience.com/why-training-set-should-always-be-smaller-than-test-set-61f087ed203c
- https://towardsdatascience.com/leveraging-geolocation-data-for-machine-learning-essential-techniques-192ce3a969bc
- https://www.slideshare.net/RohitDandona/san-francisco-crime-prediction-report
- https://www.researchgate.net/publication/305288147_San_Francisco_Crime_Classification
- https://cseweb.ucsd.edu/classes/wi15/cse255-a/reports/fa15/012.pdf

We can attempt to apply deep learning-based models and compare their performance with the models we have already trained. To my knowledge, most of the Kagglers who attempted this problem implemented their solutions using deep learning models. We can definitely give it a try.

An earthquake is the result of a sudden release of stored energy in the Earth's crust that creates seismic waves. According to USGS, an earthquake is what happens when two blocks of the earth suddenly slip past one another. The surface where the two blocks slip is known as the **fault** or **fault plane**. USGS tracks hazard data and updates it every minute. The data is stored in `CSV` format.

In this blog, we will learn how to visualize the seismic data either worldwide or for a specific area. We will also learn to animate the visualization based on the `time` factor.

**Credits of Cover Image** - Photo by Cédric Dhaenens on Unsplash

USGS is the main source of data. It is well known for various disaster management systems including **Earthquakes**, **Floods**, **Volcanoes**, etc. The seismic data is recorded in four categories.

- Hourly - updated every minute
- Daily - updated every minute
- Weekly - updated every minute
- Monthly - updated every minute

For the project, we will only consider the Daily, Weekly, and Monthly data.

**Source** - USGS Spreadsheet Format. In this link, you can find additional details like column names, and column descriptions.

Using this live-streaming data we can develop a live-updating dashboard (maybe I will cover that next time).

If you would like to get the video explanation, you can watch the below video with some additional details.

**Import Packages**

As always, we begin by importing the packages that are essential for this project.

In the data, there are almost `22` columns, most of which are `NaN`. There are a few columns whose importance only a seismic researcher knows. For the visualization, we will only consider `5` columns.

- **time** - useful for animating the visualization process.
  - *Monthly* - if monthly data is fetched, we shall extract the **dates** from the `time` column.
  - *Weekly* - if weekly data is fetched, we shall extract the **weekdays** from the `time` column.
  - *Daily* - if daily data is fetched, we shall extract the **hours** (numbers) from the `time` column.
- **latitude** - useful for plotting the earthquake location on the map.
- **longitude** - useful for plotting the earthquake location on the map.
- **mag** - useful to scale the marker size based on the magnitude of the earthquake.
- **place** - useful to visualize the data either worldwide or for a specific area.
  - The `place` column consists of both the `sub area` and the `main area`. It has to be split in order to visualize the data pertaining to a specific area (main area).
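Splitting the `place` column is a one-line string operation once you notice that the main area follows the last comma (e.g. a value like `"10km ESE of Anza, CA"`); a sketch:

```python
def split_place(place):
    """Split a USGS `place` value into (sub_area, main_area) on the last comma."""
    if ", " in place:
        sub, main = place.rsplit(", ", 1)
    else:                       # some records carry only a region name
        sub, main = "", place
    return sub, main

print(split_place("10km ESE of Anza, CA"))       # ('10km ESE of Anza', 'CA')
print(split_place("South of the Fiji Islands"))  # ('', 'South of the Fiji Islands')
```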

The below is the function to obtain the preprocessed data after performing the above operations. By default, it fetches the data worldwide.

The function takes `3` parameters -

- **manner** - the value should be one of `daily`, `weekly`, or `monthly`. By default, the value is `daily`.
- **bbox** - a bounding box parameter with which the data is filtered by the value that is passed. By default, the value is `Worldwide`.
- **mag_thresh** - a magnitude threshold that takes integer or float values; the data is filtered based on it. By default, it is `2`.
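The `manner` parameter maps onto the USGS summary CSV feeds; a sketch of the helper that builds the feed URL (the endpoint pattern below matches the public USGS feed layout at the time of writing, but verify against the source link above):

```python
def feed_url(manner="daily"):
    """Return the USGS all-earthquakes summary CSV URL for the given window."""
    window = {"daily": "all_day", "weekly": "all_week", "monthly": "all_month"}
    if manner not in window:
        raise ValueError("manner must be 'daily', 'weekly', or 'monthly'")
    base = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary"
    return f"{base}/{window[manner]}.csv"

print(feed_url())            # .../all_day.csv
print(feed_url("monthly"))   # .../all_month.csv
# The CSV can then be loaded directly, e.g. pd.read_csv(feed_url("weekly"))
```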

Once the data is fetched, we shall visualize to observe data patterns. The data shall be plotted on the map (**stamen-terrain**). We will be using Plotly to ease the visualization process.

Hence the below function.

Let's look at the animations for the `Worldwide` data.

The animation is created on the basis of **hours** (numbers). Feel free to zoom in and zoom out the map.

The animation is created on the basis of the **weekdays**. Feel free to zoom in and zoom out the map.

The animation is created on the basis of the **dates**. Feel free to zoom in and zoom out the map.

Let's look at the animations for the `Area`-based data.

The animation is created on the basis of **hours** (only pertaining to the `bbox` value that is passed). Feel free to zoom in and zoom out the map.

The animation is created on the basis of the **weekdays** (only pertaining to the `bbox` value that is passed). Feel free to zoom in and zoom out the map.

The animation is created on the basis of the **dates** (only pertaining to the `bbox` value that is passed). Feel free to zoom in and zoom out the map.

Well, that's it for this article. You can find my work in the below links.

You can also subscribe to my newsletter for such exclusive content. Thanks all.

**End**

In this article, we will try to explore an API provided by Foursquare (a company that is well known for its location-based services). With this API, we can power location-based experiences in our app or website.

Given a certain location, we can find different types of venues, ranging from coffee shops to travel & transport. To leverage the API, we first need to create a developer account. On creating one, Foursquare provides two types of ID:

- Client ID
- Client Secret

These two are required to fetch data for any category of venues pertaining to the provided location. Besides this, we also require the openrouteservice.org API and the mapbox.com API. You can refer to my previous blog to learn how to obtain them.

With this blog, I would like to share my working POC which I implemented as part of the weekend project.

**Credits of Cover Image** - Photo by Alessandra Onisor on Unsplash

If you would like to get the video explanation, you can watch the below video with some additional details.

`import` Packages

As always, we begin by importing the packages that are essential for this POC.

**OpenRouteService Token**

**Mapbox Token**

**Foursquare Venues**

Foursquare provides several venue categories. Every venue category has a unique ID (specifically termed `categoryId`). To know the different venue categories that the company provides, you can check this for the information. The below image explains it.

From the image, we can note that we have a `categoryName` and its `categoryId`.

- We can refer to the site every time and fetch the details for a venue of our interest.
- Or, we can save all the names and ids and create a dropdown that gives freedom for a user to select anything from the list (dropdown).

(As part of this project, I have scraped all the names and IDs per category and saved them in a JSON file. If you are interested in learning how I scraped them, you can check out my notebook. You can also download the data from here.)

In total, we have `10` main categories. Each main category has sub-categories.

**Foursquare Credentials**

I have created my developer account and saved the secrets in my local file.

The `keys` of the credentials can be seen below.

Since the data is loaded in a variable called `foursquare_venues`, we just need to parse it to extract the main categories and their respective sub-categories.

Hence, the below helper functions.
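As a sketch of what such helpers might look like - the nested structure below (`name`, `id`, `categories` keys) is an assumption modeled on the Foursquare categories response, and the sample IDs are placeholders:

```python
# Hypothetical helpers: the nested list-of-dicts layout is an assumption.
def get_main_categories(foursquare_venues):
    """Return the names of the top-level categories."""
    return [c["name"] for c in foursquare_venues]

def get_sub_categories(foursquare_venues, main_name):
    """Return (name, id) pairs of the sub-categories under one main category."""
    for c in foursquare_venues:
        if c["name"] == main_name:
            return [(s["name"], s["id"]) for s in c.get("categories", [])]
    return []

foursquare_venues = [
    {"name": "Food", "id": "4d4b7105d754a06374d81259",
     "categories": [{"name": "Bakery", "id": "4bf58dd8d48988d16a941735"}]},
    {"name": "Travel & Transport", "id": "4d4b7105d754a06379d81259",
     "categories": []},
]
print(get_main_categories(foursquare_venues))  # ['Food', 'Travel & Transport']
```

The second helper is what feeds the dependent (sub-category) dropdown described later.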

The main aim of the Foursquare API is to fetch venue data pertaining to a location. Without the location (coordinates), we can't fetch the data. To get the location coordinates, we consider two aspects:

- Device location (which comes by default)
- User's interest (where the app user can change the location area)

Hence, the below helper functions.
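Once the coordinates are obtained (from either of the two aspects above), the venues request itself can be composed. The sketch below only builds the request URL; the endpoint and parameter names follow the classic Foursquare v2 API and should be treated as assumptions to verify against the current docs, and the coordinates and IDs are placeholders:

```python
from urllib.parse import urlencode

# Hypothetical sketch: compose (not execute) a venues-search request URL.
def build_venues_url(client_id, client_secret, lat, lon, category_id,
                     radius=1000, limit=50, version="20200316"):
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lon}",          # latitude,longitude
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
        "v": version,                  # API version date
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

url = build_venues_url("CLIENT_ID", "CLIENT_SECRET", 17.385, 78.4867,
                       "4bf58dd8d48988d16a941735")
print(url)
```

Separating URL construction from the actual HTTP call keeps the credentials handling testable.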

`JupyterDash` (from the `jupyter-dash` package) extends `Dash` for creating applications inside the Jupyter Notebook environment. Let's just initialize the app.

The default `external_stylesheets` is used for the app design. It is a classic design that can be used for any web application.

The layout design for a `JupyterDash` application is the same as for a `Dash` application. The `app` object has an attribute called `layout`, which is used to hold the whole design.

**Layout Design**

**Layout Output**

The user has to select the dropdown values to visualize the locations of the venues pertaining to the place name and categories. The backend is not yet there.

The callbacks (backend) for a `JupyterDash` application are similar to those of a `Dash` application. A callback acts as a decorator on an updating function, so the function need not be called explicitly to render the output/result.

In the above code, we have two callbacks.

- The first one indicates the modification of the sub-category dropdown. Notice how the sub-category dropdown changes when the main category is updated.

- The second one is for displaying the map.

(The above are just the images but not the working app.)

To view the application, the server of the app needs to be activated. Feel free to select different values from the dropdown to observe the change in the map result.

(Whenever you want to change the location, just mention a different place name and press **enter**.)

Well, that's it for this article. You can find my notebook with the full code here. You can also subscribe to my newsletter for such exclusive content. Thanks all.

**End**

SAPS (South African Police Service) has provided crime data via Kaggle which contains the history of crime statistics from 2004 to 2015 per province and station in South Africa. The dataset contains a set of shapefiles and CSV files.

**Data Source** https://www.kaggle.com/slwessels/crime-statistics-for-south-africa

Let's see if we can generate an insightful analysis and report using this data by leveraging Python and Geemap for this project.

**Credits of Cover Image** - Photo by Simone Busatto on Unsplash

We will analyze the data fully using Python. To do so, we begin with importing necessary packages.

If you are interested in watching the video explanation, then the below is for you.

`import` Packages

We are going to use certain packages like `geopandas`, `geemap`, etc., to read data from shapefiles easily. We will import these along with some additional packages.

The next step is to initialize the Earth Engine module. We do this in order to visualize the spatial data integrating with `geemap`.

(You will need to authenticate the earth engine package after signing up on the Google Earth Engine website).

In order to load the map centered on **South Africa** or one of its provinces, we need a location. For that purpose, we use a geocoder to avoid hard-coded coordinate values.

(To know more about **OpenRouteService**, you can refer to this blog).

**ORS Token**

**Location Fetcher**

**South Africa location**

The dataset contains two `CSV` files and one `shapefile`. We can load the `CSV` files using `pandas` and the `shapefile` using `geopandas` or `geemap`.

The sizes of the data can be known using the `shape` attribute.

String data manipulation can be done with the `apply()` method.
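For illustration, here is a small (made-up) example of cleaning a string column with `apply()`; the column name and values are assumptions, not the actual dataset:

```python
import pandas as pd

# Hypothetical example: normalize station names (strip whitespace, title-case).
df = pd.DataFrame({"Station": ["cape town central ", "DURBAN CENTRAL", " jhb central"]})
df["Station"] = df["Station"].apply(lambda s: s.strip().title())
print(df["Station"].tolist())  # ['Cape Town Central', 'Durban Central', 'Jhb Central']
```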

In order to analyze the data per each province and district, we also need some additional data (shapefiles) which we can easily obtain from diva-gis.org. Because we need South African states and districts, we can simply search for that to get the shapefile data.

Since South Africa contains 9 provinces, we can graph a bar chart to visualize the population data of each province.

We can also directly visualize the population data on the map in the form of a choropleth map. We can leverage `geemap` to get an interactive map plot. For that, we first need to merge `sa_states` with `popn_province_df`.

**Choropleth Map of Population**

In the above map, darker regions indicate higher population and vice versa.

Below is the list of all the crime categories.

If we want to know the number of cases reported at each station for a selected crime category, year, and province, we can write a function that takes these inputs and returns the result accordingly.
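A hedged sketch of such a function - the column names (`Province`, `Station`, `Category`, one column per year) and the sample values are assumptions for illustration, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical filtering step: cases per station for one province/category/year.
def cases_per_station(crime_df, province, category, year):
    sub = crime_df[(crime_df["Province"] == province) &
                   (crime_df["Category"] == category)]
    return sub.groupby("Station")[year].sum().sort_values(ascending=False)

crime_df = pd.DataFrame({
    "Province": ["Western Cape", "Western Cape", "Eastern Cape"],
    "Station":  ["Cape Town Central", "Mitchells Plain", "Mthatha"],
    "Category": ["Murder", "Murder", "Murder"],
    "2006":     [90, 120, 70],
})
print(cases_per_station(crime_df, "Western Cape", "Murder", "2006"))
```

Sorting in descending order makes the worst-affected stations easy to read off.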

**Murders in the year 2006 at Western Cape**

**Murders in the year 2010 at Eastern Cape**

**Mapping Stations per district at a selected province**

We can also plot the stations where the crimes are reported. We can show that per each district of a selected province. The following function helps us achieve that.

**Murders reported in the year 2006 at the stations of Western Cape**

**Murders reported in the year 2010 at the stations of Eastern Cape**

**Choropleth Map of Districts**

As a matter of fact, we can also plot a choropleth map of the total crimes reported per district of a selected province, irrespective of the station at which the crime was reported. This is done by filtering the data based on the year that is passed.

We will incorporate some useful functionalities of `geopandas`, such as spatial joins.

(More details about spatial joins will be explained in the coming posts).

**Murders in the year 2010 at Eastern Cape**

**Murders in the year 2008 at Gauteng**

If we explore the first five rows of the crime data, we can see that it contains years as columns.

We shall sum all the year columns to end up with the total crimes reported per station. Using this, we can plot a choropleth map (of a different kind) based on provinces and districts. Again, we will use spatial join techniques for district-wise visualization.
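The column-wise sum can be sketched as below; the assumption (based on the dataset description) is that the year columns are named by year, e.g. `'2004'` to `'2015'`, and the sample frame is made up:

```python
import pandas as pd

# Hypothetical sketch: total crimes per station = sum over the year columns.
crime_df = pd.DataFrame({
    "Station": ["A", "B"],
    "2004": [10, 5],
    "2005": [12, 7],
})
year_cols = [c for c in crime_df.columns if c.isdigit()]  # pick the year columns
crime_df["total"] = crime_df[year_cols].sum(axis=1)       # axis=1 sums across columns
print(crime_df["total"].tolist())  # [22, 12]
```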

**Total Murders at Western Cape**

We can change the parameters in every function to get the result according to the province, year, and crime category that are passed.

You can find my notebook with the full code here.

Well, that's it for this article, you can subscribe to my newsletter for such exclusive content. Thanks all.

I have been working with geospatial data for quite a while. I always wanted to create something similar to Google Maps (at least a few of the services it provides, like geocoding and route directions). Google Maps does provide an API for various services, but the API is not free. If you want to leverage the API for your applications, you have to provide credit card details and other information. Earlier, the Google API was free to use, but the plan has since changed. However, you can switch to Mapbox services to leverage the same functionalities completely for free.

In this blog, I would like to share some interesting services that OpenRouteService provides. So sit tight and read this post carefully.

**Credits of Cover Image** - Photo by Glenn Carstens-Peters on Unsplash

You need to first create an account at openrouteservice.org. You can sign up using your GitHub account.

Navigate to the Dashboard section and generate a free token.

- **(Optional)** To test the service, you can go to the examples section and test the different services. You can also get the code snippet in your favorite programming language.
- Copy the API token and save it in a `txt` file (do not show your API token to anyone).

In order to visualize the results, we need to also create a Mapbox token which we will integrate with Plotly and generate an interactive plot.

Create an account by visiting mapbox.com.

- Navigate to **account** (account.mapbox.com) and create your free API token. You can also use the default token which Mapbox creates for every user.
- Copy the API token and save it in a `txt` file (do not show your API token to anyone).

`import` Packages

After this, we need to visualize the results from the APIs on the map; after all, we are fetching geospatial data. We will create a `class` to do so.

- Create a class called `MapPlotter` to visualize the geospatial data.
- It contains 3 private methods (common helpers) that are reused in different scenarios.
- It contains 4 public methods (plotting methods) to visualize the data.

We will also need helper functions to extract latitudes and longitudes separately from coordinates (list).
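A minimal sketch of such helpers, assuming GeoJSON-style `[longitude, latitude]` pairs (the sample coordinates are made up):

```python
# GeoJSON-style coordinates are [longitude, latitude] pairs,
# so we unzip them into two separate lists for plotting.
def get_longitudes(coords):
    return [c[0] for c in coords]

def get_latitudes(coords):
    return [c[1] for c in coords]

route = [[8.681495, 49.41461], [8.686507, 49.41943], [8.687872, 49.420318]]
print(get_latitudes(route))  # [49.41461, 49.41943, 49.420318]
```

Note the order: GeoJSON puts longitude first, while most plotting APIs expect latitude and longitude as separate arguments.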

`class` OpenRouteService

The services already provide all the functionalities in the form of code snippets. I just consolidated everything into one `class` to make it easy to use.

- Create a class called `OpenRouteService` to fetch or extract data using the API.
- By default, it inherits the `MapPlotter` class for visualization.
- It contains only 4 public methods for fetching the data in different situations.

So, everything is set now. We need to create an object and access different methods to explore the Open Route Services.

We can get the coordinates for any place name. A sample output can be seen below. Feel free to zoom in and zoom out.

We can get the directions for any two places (`from` to `destination`). A sample output can be seen below. Feel free to zoom in and zoom out.

We can get the isochrones for any two places (`from` to `destination`). A sample output can be seen below. Feel free to zoom in and zoom out.

We can get the POIs (points of interest) for any two places (`from` to `destination`). A sample output can be seen below. Feel free to zoom in and zoom out.

Well, this was just a simple POC (Proof of Concept) in which I explored the ORS API. You can also make use of this API in your applications.

**Full Notebook**https://jovian.ai/msameeruddin/gdip-open-route

That's it for this article, you can subscribe to my newsletter for such exclusive content. Thanks all.

**End**

Every tabular dataset stores data in the form of a matrix (rows and columns). It is common that not every column is of the same data type. Some column values may be integers (numerical), some real-valued (float), and some categorical (string). The main agenda of this article is to explore some lesser-known methods that can effectively separate the columns based on their type.

Unfortunately, if the dataset contains both categorical and textual data, `Pandas` will consider both as categorical. But with a simple hack, we can differentiate them easily.

- If we have categorical data like `['short', 'tall', 'average']`, we can still extract information even though each value is of string type. We can plot a pie chart or bar chart to visualize it.
- But if we have textual data like `['This is the first document.', 'This document is the second document.']`, we cannot do much with it directly; although the type of each value is a string, it is actually a sentence.

With `Pandas`, we will see how to solve this separation task programmatically.

**Credits of Cover Image** - Photo by Will Francis on Unsplash

`import` Packages

```
import numpy as np
import pandas as pd
import random
```

For this example, we will create our own (dummy) data set.

```
real_values = np.random.rand(10, 3)
num_cate = np.array([random.choice([1, 2, 3]) for i in range(10)])
str_cate = np.array([random.choice(['short', 'tall', 'average']) for i in range(10)])
# -----
df = pd.DataFrame(data=real_values, columns=['col_1', 'col_2', 'col_3'])
df['col_4'] = num_cate
df['col_5'] = str_cate
```

The first five rows of `df` look like -

| col_1 | col_2 | col_3 | col_4 | col_5 |
| --- | --- | --- | --- | --- |
| 0.902552 | 0.212317 | 0.208628 | 2 | short |
| 0.308503 | 0.959903 | 0.631784 | 3 | tall |
| 0.459789 | 0.199742 | 0.176000 | 2 | tall |
| 0.760787 | 0.167179 | 0.137287 | 2 | average |
| 0.199737 | 0.120213 | 0.145068 | 2 | average |

We can check the type of each column by -

```
df.dtypes
'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
'''
```

We can see that `col_5` is of type `object`, which simply represents that it is different from `int` and `float`. To take it a step further, we can write a function that returns the numerical and string columns separately.

```
def split_categories_numericals(dframe):
    cols = list(dframe.columns)
    num_cols = list(dframe._get_numeric_data().columns)
    cate_cols = list(set(cols) - set(num_cols))
    return cate_cols, num_cols
```

```
cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols)  # ['col_5']
print(num_cols)   # ['col_1', 'col_2', 'col_3', 'col_4']
```

The function `split_categories_numericals()` works fine for the above dataset. Let's add one more column that has textual sentences as values.

```
sents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Hello, how are you doing?',
    'Did you go to the party yesterday?'
]
text_data = [random.choice(sents) for i in range(10)]
df['col_6'] = text_data
```

Now, the first five rows of `df` look like -

| col_1 | col_2 | col_3 | col_4 | col_5 | col_6 |
| --- | --- | --- | --- | --- | --- |
| 0.902552 | 0.212317 | 0.208628 | 2 | short | This document is the second document. |
| 0.308503 | 0.959903 | 0.631784 | 3 | tall | Is this the first document? |
| 0.459789 | 0.199742 | 0.176000 | 2 | tall | Hello, how are you doing? |
| 0.760787 | 0.167179 | 0.137287 | 2 | average | Did you go to the party yesterday? |
| 0.199737 | 0.120213 | 0.145068 | 2 | average | Hello, how are you doing? |

If we check the `dtypes` of `df`, we get -

```
df.dtypes
'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
col_6     object
'''
```

We can see that `col_5` and `col_6` have the same type, i.e., `object`, because both have string values. But for `col_5` we can plot a pie chart, compute likelihoods, etc., whereas for `col_6`, since it holds textual data, we need to convert it into (numerical) vectors before analyzing it.

Now, the ultimate question is how we can know whether a dataset has textual data (sentences) as values. If we apply `split_categories_numericals()`, we get -

```
cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols)  # ['col_5', 'col_6']
print(num_cols)   # ['col_1', 'col_2', 'col_3', 'col_4']
```

It considers both `col_5` and `col_6` as categorical.

Is there any programming hack to automatically differentiate categorical data and textual data?

The answer is **Yes**. We can very well achieve that with the following function.

```
def split_text_categories_numericals(dframe):
    cols = list(dframe.columns)
    num_cols = list(dframe._get_numeric_data().columns)
    cate_cols = list(set(cols) - set(num_cols))
    text_cols = []
    category_cols = []
    for ccol in cate_cols:
        each_col_list = dframe[ccol].str.split(' ').to_list()
        col_val_len_arr = np.array(list(map(len, each_col_list)))
        if np.any(col_val_len_arr > 1):
            text_cols.append(ccol)
        else:
            category_cols.append(ccol)
    return num_cols, category_cols, text_cols
```

```
num_cols, cate_cols, text_cols = split_text_categories_numericals(dframe=df)
print(num_cols)   # ['col_1', 'col_2', 'col_3', 'col_4']
print(cate_cols)  # ['col_5']
print(text_cols)  # ['col_6']
```

We have separated the columns with a simple hack now.

If the dataset has columns of different types and you want to do text processing, the above function gives you exactly the columns that contain textual data (sentences).

By extracting textual columns, you can easily process it and get the most out of it.

- The function `split_text_categories_numericals()` only works on preprocessed and cleaned data.
- It might not work reliably if applied directly to the raw data.

**End**

Logistic regression is a statistical model which is extensively used in binary classification tasks. It is named **logistic** because it uses the logistic function for classification. The logistic loss is also known as **cross-entropy**. Besides this, we use a special function known as the **sigmoid function**, which also reduces the impact of outliers on the model.

**Credits of Cover Image** - Photo by Luis Soto on Unsplash

Given the data points, we need to classify them into two classes (binary classification). We shall find a separating hyperplane to divide the classes.

$$f(w, b) = w^Tx+ b \rightarrow (1)$$

The notation of sigmoid function can be stated as -

$$S(x) = \frac{1}{1 + e^{-x}}$$

The plot of the sigmoid function is -

**Credits** - The above plot is taken from Google.

If we observe the bounds of the curve, the upper bound is `1` and the lower bound is `0`. The sigmoid function only returns values in the range of `0` to `1`, which is sufficient for binary classification.

If we use a sigmoid function by substituting the above approximation (1), we get -

$$\implies \frac{1}{1 + e^{-(w^Tx + b)}} \rightarrow (2)$$

Equation (2) is the actual model which is used in predicting the class label.

Given `x`, `w`, and `b`, we pass them into the model, which in turn returns a probability value.

- If the probability value is less than `0.5`, the label is predicted as `0`.
- If the probability value is greater than or equal to `0.5`, the label is predicted as `1`.
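The decision rule above translates directly into code; in this small sketch the weights and points are made up for illustration:

```python
import numpy as np

# Equation (2) plus the 0.5 threshold rule.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, w, b):
    probs = sigmoid(np.dot(X, w) + b)     # probability of class 1
    return np.where(probs >= 0.5, 1, 0)   # threshold at 0.5

X = np.array([[2.0, 1.0], [-1.5, -2.0]])  # two illustrative points
w = np.array([1.0, 1.0])                  # illustrative weights
b = 0.0
print(predict(X, w, b))  # [1 0]
```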

Everything is properly defined except for two things which are still unknown: the parameters `w` and `b`. The error in the model depends on these two parameter values. Here, `w` is the coefficient and `b` is the intercept. We cannot simply assign random values to `w` and `b`; instead, they are chosen with the help of the stochastic gradient descent (SGD) process.

The ultimate goal now is to find the best values for `w` and `b`. Here `w` is a vector and `b` is a scalar. We shall use the logistic (cross-entropy) loss function to find these values.

It is stated as -

$$L = \frac{1}{n} \sum_{i=1}^n \bigg[-y_i \log\big[\sigma(w^Tx_i + b)\big] - (1 - y_i) \log\big[1 - \sigma(w^Tx_i + b)\big]\bigg] \rightarrow (3)$$

**Note** - Sigmoid function is often denoted as sigma.
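Equation (3) can be checked numerically; the small arrays below are made-up values for the check:

```python
import numpy as np

# Equation (3) in code: the mean binary cross-entropy (logistic) loss.
def log_loss(X, y, w, b):
    p = 1 / (1 + np.exp(-(np.dot(X, w) + b)))          # sigma(w^T x_i + b)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])
w, b = np.zeros(2), 0.0
print(log_loss(X, y, w, b))  # log(2), since every probability is exactly 0.5
```

With zero parameters every prediction is 0.5, so the loss equals log(2), a useful sanity check before optimizing.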

SGD is an iterative process where we initially assign values (possibly 0) for both `w` and `b`. At every iteration,

- For `w`, we differentiate (3) with respect to `w` and obtain `dw`.
- For `b`, we differentiate (3) with respect to `b` and obtain `db`.

Mathematically, it can be represented as -

$$dw = \frac{1}{n} \sum_{i=1}^n x_i \big[\sigma(w^Tx_i + b) - y_i\big]$$

and

$$db = \frac{1}{n} \sum_{i=1}^n \big[\sigma(w^Tx_i + b) - y_i\big]$$

**Note** - I have actually differentiated the above on paper and verified it.

We update the actual `w` and `b` using `dw` and `db` respectively at every iteration until the loss is minimized. The update process (with learning rate alpha) can be understood in the following way.

$$w = w - \alpha * dw$$

and

$$b = b - \alpha * db$$
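One such update step, written directly from the `dw` and `db` formulas above (the two sample points are made up to check the direction of the update):

```python
import numpy as np

# One gradient-descent update for w and b (alpha is the learning rate).
def sgd_step(X, y, w, b, alpha=0.1):
    n = len(X)
    z = 1 / (1 + np.exp(-(np.dot(X, w) + b)))   # sigma(w^T x_i + b)
    dw = (1 / n) * np.dot(z - y, X)
    db = (1 / n) * np.sum(z - y)
    return w - alpha * dw, b - alpha * db

X = np.array([[2.0], [-2.0]])   # one positive-side, one negative-side point
y = np.array([1, 0])
w, b = np.array([0.0]), 0.0
w, b = sgd_step(X, y, w, b)
print(w)  # the weight moves in the positive direction, as expected
```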

Now that we have understood the complete process, let's implement the same from scratch.

We will start by importing the necessary libraries as always at the start.

```
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
```

For now, we will rely on a toy dataset that we can easily create with the `sklearn` module.

```
X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=2,
    random_state=60
)
d = {'col{}'.format(i + 1): X[:, i] for i in range(len(X[0]))}
df = pd.DataFrame(d)
df['label'] = y
```

We need to split the data into two sets - the `training` set and the `testing` set. We do this with a random splitter function.

```
def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        dframe = dframe.sample(frac=1)
    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]
    return train_df, test_df
```

**Train Test Split**

```
train_df, test_df = splitter(dframe=df)
```

```
class LogisticRegression():
```

The name of the classifier is `LogisticRegression`, and it is a class in which we define the other methods.

`__init__()` Method

```
    def __init__(self, train_df, test_df, label, lambda_=0.0001, n_iters=1000):
        self.lambda_ = lambda_
        self.n_iters = n_iters
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        self.n_ = len(self.X_train)
        self.w, self.b = self.find_best_params()
```

The above method is a constructor that takes five parameters -

- `train_df` refers to the subset of the data used to train the classifier.
- `test_df` refers to the subset of the data used to test the classifier.
- `label` refers to the column name of the class label.
- `lambda_` refers to the learning-rate constant used to update the parameters during SGD.
- `n_iters` refers to the total number of iterations of the SGD process.

`split_features_targets()` Method

```
    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y
```

The above method separates features and targets from the data. It takes two parameters -

- `df` refers to the dataset that is passed for classification.
- `label` refers to the column name of the class label.

`sigmoid()` Method

```
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
```

The above method is the sigmoid function, which takes one parameter and returns a value in the range of `0` to `1`.

- `z` refers to a data value.

`diff_params_wb()` Method

```
    def diff_params_wb(self, w, b):
        lm = np.dot(self.X_train, w) + b
        z = self.sigmoid(z=lm)
        w_ = (1 / self.n_) * np.dot((z - self.y_train), self.X_train)
        b_ = (1 / self.n_) * np.sum((z - self.y_train))
        return w_, b_
```

The above method computes the gradients of the parameters. It takes two functional parameters -

- `w` refers to the current weight vector used in the SGD process.
- `b` refers to the current intercept value used in the SGD process.

`find_best_params()` Method

```
    def find_best_params(self):
        ow = np.zeros_like(a=self.X_train[0])
        ob = 0
        for i in range(self.n_iters):
            w_, b_ = self.diff_params_wb(w=ow, b=ob)
            ow = ow - (self.lambda_ * w_)
            ob = ob - (self.lambda_ * b_)
        return ow, ob
```

The above method obtains the best (minimal) values of the parameters `w` and `b`. It takes no functional parameters. It follows the SGD process to iteratively update the values of `w` and `b`.

`draw_line()` Method

```
    def draw_line(self, ax):
        if (len(self.w) == 2):
            a, b = self.w
            c = self.b
            y_min = np.min(self.X_train)
            y_max = np.max(self.X_train)
            p1 = [((-b*y_min - c)/a), y_min]
            p2 = [((-b*y_max - c)/a), y_max]
            points = np.array([p1, p2])
            ax.plot(points[:, 0], points[:, 1], color='#BA4A00')
        return None
```

The above method draws the separating hyperplane. It takes one parameter -

- `ax` refers to the axis on which the hyperplane is plotted.

`predict()` Method

```
    def predict(self, with_plot=False):
        y_test_preds = self.sigmoid(z=(np.dot(self.X_test, self.w) + self.b))
        y_test_preds_c = np.where((y_test_preds >= 0.5), 1, 0)
        y_train_preds = self.sigmoid(z=(np.dot(self.X_train, self.w) + self.b))
        y_train_preds_c = np.where((y_train_preds >= 0.5), 1, 0)
        if with_plot:
            fig = plt.figure(figsize=(10, 4))
            ax1 = fig.add_subplot(1, 2, 1)
            ax1.title.set_text('Training')
            ax1.scatter(self.X_train[:, 0], self.X_train[:, 1], c=y_train_preds_c, label='points')
            self.draw_line(ax=ax1)
            ax1.legend()
            ax2 = fig.add_subplot(1, 2, 2)
            ax2.title.set_text("Testing")
            ax2.scatter(self.X_test[:, 0], self.X_test[:, 1], c=y_test_preds_c, label='points')
            self.draw_line(ax=ax2)
            ax2.legend()
            plt.show()
        return y_test_preds_c
```

The above method predicts the class labels for the new unseen data. It takes one (optional) parameter -

- `with_plot` refers to a boolean deciding whether to plot the separating line along with the data points. By default it is `False`, and hence optional.

`score()` Method

```
    def score(self, preds):
        preds = np.array(preds)
        if (len(self.y_test) == len(preds)):
            non_z = np.count_nonzero(a=np.where((preds == self.y_test), 1, 0))
            return non_z / len(preds)
        return "Lengths do not match"
```

The above method computes the accuracy score, which determines whether the model is performing well. It is the fraction of correctly classified points out of all test points. Generally, a model whose accuracy is greater than 0.80 (80%) is considered good. It takes one parameter -

- `preds` refers to an array of predictions (class labels).

**Full Code**

```
class LogisticRegression():
    def __init__(self, train_df, test_df, label, lambda_=0.0001, n_iters=1000):
        self.lambda_ = lambda_
        self.n_iters = n_iters
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        self.n_ = len(self.X_train)
        self.w, self.b = self.find_best_params()

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def diff_params_wb(self, w, b):
        lm = np.dot(self.X_train, w) + b
        z = self.sigmoid(z=lm)
        w_ = (1 / self.n_) * np.dot((z - self.y_train), self.X_train)
        b_ = (1 / self.n_) * np.sum((z - self.y_train))
        return w_, b_

    def find_best_params(self):
        ow = np.zeros_like(a=self.X_train[0])
        ob = 0
        for i in range(self.n_iters):
            w_, b_ = self.diff_params_wb(w=ow, b=ob)
            ow = ow - (self.lambda_ * w_)
            ob = ob - (self.lambda_ * b_)
        return ow, ob

    def draw_line(self, ax):
        if (len(self.w) == 2):
            a, b = self.w
            c = self.b
            y_min = np.min(self.X_train)
            y_max = np.max(self.X_train)
            p1 = [((-b*y_min - c)/a), y_min]
            p2 = [((-b*y_max - c)/a), y_max]
            points = np.array([p1, p2])
            ax.plot(points[:, 0], points[:, 1], color='#BA4A00')
        return None

    def predict(self, with_plot=False):
        y_test_preds = self.sigmoid(z=(np.dot(self.X_test, self.w) + self.b))
        y_test_preds_c = np.where((y_test_preds >= 0.5), 1, 0)
        y_train_preds = self.sigmoid(z=(np.dot(self.X_train, self.w) + self.b))
        y_train_preds_c = np.where((y_train_preds >= 0.5), 1, 0)
        if with_plot:
            fig = plt.figure(figsize=(10, 4))
            ax1 = fig.add_subplot(1, 2, 1)
            ax1.title.set_text('Training')
            ax1.scatter(self.X_train[:, 0], self.X_train[:, 1], c=y_train_preds_c, label='points')
            self.draw_line(ax=ax1)
            ax1.legend()
            ax2 = fig.add_subplot(1, 2, 2)
            ax2.title.set_text("Testing")
            ax2.scatter(self.X_test[:, 0], self.X_test[:, 1], c=y_test_preds_c, label='points')
            self.draw_line(ax=ax2)
            ax2.legend()
            plt.show()
        return y_test_preds_c

    def score(self, preds):
        preds = np.array(preds)
        if (len(self.y_test) == len(preds)):
            non_z = np.count_nonzero(a=np.where((preds == self.y_test), 1, 0))
            return non_z / len(preds)
        return "Lengths do not match"
```

We have already created a toy data set. We just need to test the model on that data.

**Note** - The data that we created is random. Results may differ for each execution.

```
lr = LogisticRegression(
    train_df=train_df,
    test_df=test_df,
    label='label',
    lambda_=0.001,
    n_iters=100
)
```

```
print(lr.w)
# [-3.77376685e-02 -5.56964363e-06]
```

```
print(lr.b)
# 0.0017126093242175264
```

```
preds = lr.predict(with_plot=True)
```

**Training & Testing Plot**

```
print(lr.score(preds=preds))
# 0.84
```

The accuracy happens to be `84%`, which is a decent score, and hence the model is good.

Well, this entire code has been developed from scratch, and in terms of performance it may not be as efficient as the library implementations. But it is a good way to understand the mathematics behind the model.

- My code is slow.
- In this article, we considered linearly separable data. In the case of non-linearly separable data, we need to do feature transformation.
- No regularization is implemented.
- Outliers do not have much impact because of the sigmoid function, but we can still remove them to avoid problems.
- In real life, the data will not be like a toy dataset; it will be completely different and challenging.

YouTube video https://bit.ly/2UDmtZM

Wikipedia article https://en.wikipedia.org/wiki/Logistic_regression

**End**

Regression analysis is a process of predicting a response variable given an explanatory variable. The response variable is also called a dependent variable, and an explanatory variable is known as an independent variable. Given a problem statement, when there are multiple explanatory variables and one response variable, the process is known as **Multiple Linear Regression**. On the contrary, if a problem statement contains only one explanatory variable and one response variable, it is known as **Simple Linear Regression**.

**Note** - For implementing regression analysis, there needs to be a strong relationship between the explanatory variable and the response variable.

Unlike predicting the class label (in classification task), here we predict the real value (continuous value) for the new unseen data.

**Credits of Cover Image** - Photo by Stary Smok on Unsplash

Let's denote the response variable as `y` and the explanatory variable as `X`. Here, `X` can take either a single feature or multiple features.

In the case of `X` having one feature, the model would be -

$$\hat{y} = w^Tx + b \rightarrow (1)$$

In the case of `X` having multiple features, the model would be -

$$X = [x_1, x_2, x_3, \dots, x_n]$$

$$\hat{y} = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$$

$$\hat{y} = \sum_{i=1}^n w_ix_i + b$$

$$\hat{y} = W^TX + b \rightarrow (2)$$

The mathematical model (1) is used for simple linear regression tasks, whereas (2) is used for multiple linear regression tasks. In both cases, the predicted response `ŷ` is the target feature.
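The model equations above can be checked numerically. Below is a minimal sketch with a hypothetical weight vector and a single three-feature sample (all values are made up for illustration):

```python
import numpy as np

# Hypothetical weights and input, just to illustrate y_hat = W^T X + b
W = np.array([2.0, -1.0, 0.5])   # one weight per feature
b = 3.0                          # intercept
X = np.array([1.0, 4.0, 2.0])    # a single sample with three features

y_hat = np.dot(W, X) + b         # 2*1 + (-1)*4 + 0.5*2 + 3 = 2.0
```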

The actual response variable is denoted as `y`, and we all know that a machine learning model cannot predict perfectly. If it cannot predict accurately, it surely contains an error. Thus, the main focus now shifts to keeping the error as small as possible.

$$y = \hat{y} + E \rightarrow (3)$$

Substituting (1) in (3) for simple linear regression, we get -

$$y = w^Tx + b + E \rightarrow (4)$$

Substituting (2) in (3) for multiple linear regression, we get -

$$y = W^TX + b + E \rightarrow (5)$$

The error `E` is the difference between the actual target feature and the predicted target feature.

$$E = (y - \hat{y})$$

Everything is properly written except two things which are still unknown: the parameters `w` and `b`. The error in the model depends on these two parameter values. Here, `w` is the coefficient and `b` is the intercept. We cannot simply assign random values for `w` and `b`. Instead, they are to be chosen wisely with the help of the stochastic gradient descent (SGD) process.

Let's consider the model equation (1) for understanding optimization.

SGD is an iterative process where we initially assign random values (possibly 0) to both `w` and `b`. At every iteration,

- For `w`, we differentiate the squared-error loss with respect to `w` and obtain `dw`.
- For `b`, we differentiate the squared-error loss with respect to `b` and obtain `db`.

Mathematically, it can be represented as -

$$dw \implies \frac{\partial}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} x_i \big[(w^Tx_i + b) - y_i \big]$$

and

$$db \implies \frac{\partial}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \big[(w^Tx_i + b) - y_i \big]$$

We update `w` and `b` using `dw` and `db` respectively at every iteration, until the values converge (i.e., further updates no longer reduce the error). The updating process can be understood in the following way.

$$w = w - \alpha * dw$$

and

$$b = b - \alpha * db$$
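Putting the gradients and the update rules together, a single SGD step on toy 1-D data might look like this sketch (the data and learning rate are made up for illustration; `lr` plays the role of `α`):

```python
import numpy as np

# One gradient-descent step on toy 1-D data
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])            # true relationship: y = 2x
w, b, lr = 0.0, 0.0, 0.1
N = len(X)

preds = w * X + b
dw = (2 / N) * np.sum(X * (preds - y))   # matches the dw formula above
db = (2 / N) * np.sum(preds - y)         # matches the db formula above
w, b = w - lr * dw, b - lr * db
```

Repeating this step `n_iters` times is exactly what the implementation below does.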

The entire process is the same for obtaining the minimal values of `W` and `b` for the model equation (2).

Now that we have understood the complete process, let's implement the same from scratch.

We will start by importing the necessary libraries as always at the start.

```python
import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import numpy as np
import imageio
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression
```

For now, we will depend on a toy dataset that we can easily create with the module `sklearn`.

```python
X, y = make_regression(n_samples=200, n_features=1, noise=5, random_state=15)
df = pd.DataFrame(dict(col1=X[:, 0], col2=y))
```

We need to split the data into two - the `training` set and the `testing` set. We do this by a random splitter function.

```python
def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        dframe = dframe.sample(frac=1)

    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]

    return train_df, test_df
```

**Train Test Split**

```python
train_df, test_df = splitter(dframe=df)
```

```python
class LinearRegression():
```

The name of the class is `LinearRegression`, inside which we define the other methods.

**`__init__()` Method**

```python
def __init__(self, train_df, test_df, label, lambda_=0.001, n_iters=1000):
    self.n_iters = n_iters
    self.lambda_ = lambda_
    self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
    self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
    self.X_train = self.X_train.values
    self.y_train = self.y_train.values
    self.X_test = self.X_test.values
    self.y_test = self.y_test.values
    self.n_train = len(self.X_train)
    self.n_test = len(self.X_test)
    self.wb_params = {'w': [], 'b': []}
    self.w, self.b = self.find_best_params()
```

The above method is a constructor that takes five parameters -

- `train_df` refers to the subset of the data that is used to train the regressor.
- `test_df` refers to the subset of the data that is used to test the regressor.
- `label` refers to the column name of the target feature.
- `lambda_` refers to a constant (the learning rate) that is used to update the parameters during the process of SGD.
- `n_iters` refers to a constant that decides the total number of iterations of the process of SGD.

**`split_features_targets()` Method**

```python
def split_features_targets(self, df, label):
    X = df.drop(columns=[label], axis=1)
    y = df[label]
    return X, y
```

The above method is used to separate features and targets from the data. It takes two parameters -

- `df` refers to the dataset that is passed.
- `label` refers to the column of `df` that holds the target values.

**`diff_params_wb()` Method**

```python
def diff_params_wb(self, w, b):
    lm = np.dot(self.X_train, w) + b
    w_ = (2 / self.n_train) * np.dot(self.X_train.T, (lm - self.y_train))
    b_ = (2 / self.n_train) * np.sum((lm - self.y_train))
    return w_, b_
```

The above method is used to compute the gradients with respect to the parameters. It takes two parameters -

- `w` refers to the current weight vector used in the process of SGD.
- `b` refers to the current intercept value used in the process of SGD.

**`find_best_params()` Method**

```python
def find_best_params(self):
    ow = np.zeros_like(a=self.X_train[0])
    ob = 0
    for i in range(self.n_iters):
        w_, b_ = self.diff_params_wb(w=ow, b=ob)
        ow = ow - (self.lambda_ * w_)
        ob = ob - (self.lambda_ * b_)
        self.wb_params['w'].append(ow)
        self.wb_params['b'].append(ob)
    return ow, ob
```

The above method is used to get the best (loss-minimizing) values of the parameters `w` and `b`. It takes no functional parameters. This method follows the process of SGD to update the values of `w` and `b` iteratively.

**`predict()` Method**

```python
def predict(self, with_plot=False, save_process=False):
    y_test_preds = np.dot(self.X_test, self.w) + self.b
    y_train_preds = np.dot(self.X_train, self.w) + self.b

    if with_plot:
        fig = plt.figure(figsize=(10, 4))

        ax1 = fig.add_subplot(1, 2, 1)
        ax1.title.set_text('Training')
        ax1.scatter(self.X_train, self.y_train, label='points')
        ax1.plot(self.X_train, y_train_preds, color='red', label='best fit')
        ax1.legend()

        ax2 = fig.add_subplot(1, 2, 2)
        ax2.title.set_text("Testing")
        ax2.scatter(self.X_test, self.y_test, label='points')
        ax2.plot(self.X_test, y_test_preds, color='red', label='best fit')
        ax2.legend()

        plt.show()

    if save_process:
        self.save_process_togif(test_x=self.X_test, test_y=self.y_test)

    return y_test_preds
```

The above method is used to predict the real (continuous) value for the new unseen data. It takes two optional parameters -

- `with_plot` refers to a boolean value that decides whether to plot the best fit line along with the data points.
- `save_process` refers to a boolean value that decides whether to save the process of SGD in `GIF` format.

By default, both parameters take the `False` value and are therefore optional.

**`save_process_togif()` Method**

```python
def save_process_togif(self, test_x, test_y):
    wp = self.wb_params['w']
    bp = self.wb_params['b']

    c = 0
    for i in range(0, len(wp), 50):
        c += 1
        d = '0' + str(c) if (len(str(c)) == 1) else str(c)
        test_p = np.dot(test_x, wp[i]) + bp[i]

        fig = plt.figure(figsize=(10, 10))
        plt.title("Testing")
        plt.scatter(test_x, test_y, label='points')
        plt.plot(test_x, test_p, color='red', label='best fit')
        plt.legend()
        plt.savefig('{}-lr-plot.png'.format(d))
        plt.close(fig)

    path = os.getcwd()
    files_list = os.listdir(path=path)
    # endswith() avoids an IndexError on names that contain no dot
    png_list = [i for i in files_list if (i[0] != '.') and i.endswith('.png')]
    png_list.sort()

    png_gif = [imageio.imread(i) for i in png_list]
    kargs = {'duration': 1}
    gif_name = 'process-lin-reg.gif'
    imageio.mimsave(gif_name, png_gif, **kargs)

    print('Process saved in ', path + '\\' + gif_name)
    return None
```

The above method is used to save the process of SGD in the form of a `GIF`. It takes two parameters -

- `test_x` refers to the features whose target needs to be predicted.
- `test_y` refers to the actual target values.

**`score()` Method**

```python
def score(self, preds):
    preds = np.array(preds)
    if (len(self.y_test) == len(preds)):
        y_act_mean = np.mean(self.y_test)
        sst = np.sum((self.y_test - y_act_mean) ** 2)
        ssr = np.sum((self.y_test - preds) ** 2)
        return (1 - (ssr / sst))
    return "Lengths do not match"
```

The above method is used to calculate the **R²** score. It takes one parameter -

- `preds` refers to the predicted values that are used to calculate R².

**Full Code**

```python
class LinearRegression():
    def __init__(self, train_df, test_df, label, lambda_=0.001, n_iters=1000):
        self.n_iters = n_iters
        self.lambda_ = lambda_
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        self.n_train = len(self.X_train)
        self.n_test = len(self.X_test)
        self.wb_params = {'w': [], 'b': []}
        self.w, self.b = self.find_best_params()

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def diff_params_wb(self, w, b):
        lm = np.dot(self.X_train, w) + b
        w_ = (2 / self.n_train) * np.dot(self.X_train.T, (lm - self.y_train))
        b_ = (2 / self.n_train) * np.sum((lm - self.y_train))
        return w_, b_

    def find_best_params(self):
        ow = np.zeros_like(a=self.X_train[0])
        ob = 0
        for i in range(self.n_iters):
            w_, b_ = self.diff_params_wb(w=ow, b=ob)
            ow = ow - (self.lambda_ * w_)
            ob = ob - (self.lambda_ * b_)
            self.wb_params['w'].append(ow)
            self.wb_params['b'].append(ob)
        return ow, ob

    def predict(self, with_plot=False, save_process=False):
        y_test_preds = np.dot(self.X_test, self.w) + self.b
        y_train_preds = np.dot(self.X_train, self.w) + self.b

        if with_plot:
            fig = plt.figure(figsize=(10, 4))

            ax1 = fig.add_subplot(1, 2, 1)
            ax1.title.set_text('Training')
            ax1.scatter(self.X_train, self.y_train, label='points')
            ax1.plot(self.X_train, y_train_preds, color='red', label='best fit')
            ax1.legend()

            ax2 = fig.add_subplot(1, 2, 2)
            ax2.title.set_text("Testing")
            ax2.scatter(self.X_test, self.y_test, label='points')
            ax2.plot(self.X_test, y_test_preds, color='red', label='best fit')
            ax2.legend()

            plt.show()

        if save_process:
            self.save_process_togif(test_x=self.X_test, test_y=self.y_test)

        return y_test_preds

    def save_process_togif(self, test_x, test_y):
        wp = self.wb_params['w']
        bp = self.wb_params['b']

        c = 0
        for i in range(0, len(wp), 50):
            c += 1
            d = '0' + str(c) if (len(str(c)) == 1) else str(c)
            test_p = np.dot(test_x, wp[i]) + bp[i]

            fig = plt.figure(figsize=(10, 10))
            plt.title("Testing")
            plt.scatter(test_x, test_y, label='points')
            plt.plot(test_x, test_p, color='red', label='best fit')
            plt.legend()
            plt.savefig('{}-lr-plot.png'.format(d))
            plt.close(fig)

        path = os.getcwd()
        files_list = os.listdir(path=path)
        # endswith() avoids an IndexError on names that contain no dot
        png_list = [i for i in files_list if (i[0] != '.') and i.endswith('.png')]
        png_list.sort()

        png_gif = [imageio.imread(i) for i in png_list]
        kargs = {'duration': 1}
        gif_name = 'process-lin-reg.gif'
        imageio.mimsave(gif_name, png_gif, **kargs)

        print('Process saved in ', path + '\\' + gif_name)
        return None

    def score(self, preds):
        preds = np.array(preds)
        if (len(self.y_test) == len(preds)):
            y_act_mean = np.mean(self.y_test)
            sst = np.sum((self.y_test - y_act_mean) ** 2)
            ssr = np.sum((self.y_test - preds) ** 2)
            return (1 - (ssr / sst))
        return "Lengths do not match"
```

We have already created a toy data set. We just need to test the model on that data.

**Note**: The data that we created is random. Results may differ for each execution.

```python
lr = LinearRegression(
    train_df=train_df,
    test_df=test_df,
    label='col2',
    lambda_=0.01,
    n_iters=1000
)

print(lr.w)   # [5.90699167]
print(lr.b)   # -0.5282539569012746

preds = lr.predict(with_plot=True, save_process=True)
```

**Training & Testing Plot**

The regression line sufficiently passes through the data points.

**SGD Process Demonstration**

The test values are used to demonstrate the process of SGD. As the values of `w` and `b` change at every iteration, the best fit line changes accordingly; at some point, the line stops changing, which basically means that we have obtained the minimizing values for `w` and `b`.

```python
print(lr.score(preds=preds))   # 0.6830871781610702
```

The R² score is about `0.68` for the data whose shape is `(200, 2)`. If we had created a larger dataset, the score could have changed.
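As a sanity check, the R² value computed by the `score()` method can be compared against `sklearn.metrics.r2_score` on small hypothetical values (the numbers below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual/predicted values, only to show what score() computes
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
ssr = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
r2 = 1 - ssr / sst
```

Both computations agree, because `score()` implements exactly `1 - SSR/SST`.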

Well, this entire code has been developed from scratch, and performance-wise it may not be as efficient as the library methods. But it was good to understand the mathematics behind the work.

- The code is slow compared to optimized library implementations.
- No regularization is implemented, although there are regularized variants of linear regression:
  - Linear regression with **L2** regularization is known as **Ridge regression**.
  - Linear regression with **L1** regularization is known as **Lasso regression**.
  - Linear regression with both **L1** and **L2** regularizations is known as **ElasticNet regression**.
- If outliers are present in the data, they may impact the model significantly.
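For illustration, here is a hedged sketch of how an L2 (ridge) penalty would modify the `dw` gradient from the SGD section. `alpha_reg` is a hypothetical regularization strength, not part of the article's code:

```python
import numpy as np

# Toy 1-D data; alpha_reg is a made-up regularization strength
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b, alpha_reg = 0.5, 0.0, 0.1
N = len(X)

preds = w * X + b
dw_plain = (2 / N) * np.sum(X * (preds - y))   # the dw formula from earlier
dw_ridge = dw_plain + 2 * alpha_reg * w        # the extra L2 term shrinks w toward zero
```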

- YouTube video https://bit.ly/2TRmnxd
- Wikipedia article https://en.wikipedia.org/wiki/Linear_regression

**End**

KNN algorithm is a **supervised** classification algorithm that is mainly used to predict which category a query point belongs to, given a bunch of data values with respect to its corresponding categories (class labels). Talking from the perspective of the classification tasks, KNN is very much similar to Naive Bayes but slightly different in terms of technicality and implementation. In Naive Bayes, we compute the likelihood by probabilistic methods, whereas in KNN, we compute distance measurements along with other extensions added.

One great advantage of KNN is that, though it is used for classification tasks, we can, however, extend it for regression tasks as well. In this article, we will try to understand and implement KNN from scratch.

**Note**: If you want to know how Naive Bayes works, then I would strongly recommend checking out my blog post - **Naive Bayes in Practice**.

**Credits of Cover Image** - Photo by Tom Rumble on Unsplash

Let's imagine we have 2 sets of points separated as per the category. Now, given a new point, how can you predict or guess which category it belongs to?

The task of putting the query point in the right class label can be tricky. We cannot just randomly assign the class label.

Mathematically, we have to compute the distance between the query point and each data point. We then choose a value `k` and select the `k` points with the smallest distances. Let's say `k` is 5.

What we can see is that the query point is nearer to the **blue** category with 3 points and the **orange** category with 2 points. Now, when we apply the majority voting technique, the class label of the query point will be **blue** since 3 is greater than 2.

The same concept is used in developing the algorithm. Now that we have understood the theory, let's implement this totally from scratch.
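The intuition above can be sketched in a few lines of NumPy; the points, labels, and `k` below are made up for illustration:

```python
import numpy as np
from collections import Counter

# Toy sketch: 5 nearest neighbours, majority vote decides the label
query = np.array([0.0, 0.0])
points = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.0],   # blue cluster
                   [1.0, 1.0], [1.2, 0.9]])              # orange cluster
labels = ['blue', 'blue', 'blue', 'orange', 'orange']

dists = np.linalg.norm(points - query, axis=1)   # Euclidean distances
k = 5
nearest = np.argsort(dists)[:k]                  # indices of the k closest points
vote = Counter(labels[i] for i in nearest).most_common(1)[0][0]   # 'blue' (3 vs 2)
```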

Well, to compute the distance between two points, we have various methods. Here, we will just select a few distance measurement metrics.

**Types of distance metrics**

- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
- Hamming Distance
- Cosine Distance

The length of a line segment between any two points in the coordinate system is the Euclidean distance. This metric computes the shortest distance between two points. It is also referred to as the `L2` norm.

**Mathematical formula**

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between `p` and `q` can be represented as -

$$pq = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

The same pattern is followed in the case of higher-dimensional vectors.
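A quick check of the formula with NumPy (the points are arbitrary; `np.linalg.norm` computes the L2 norm of the difference):

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
dist = np.linalg.norm(p - q)   # sqrt(3^2 + 4^2) = 5.0
```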

Manhattan distance follows the concept of **Taxicab Geometry**. The usual Euclidean distance metric is replaced in such a way that the distance between points is the sum of absolute differences between their coordinates. It is also referred to as the `L1` norm.

**Mathematical formula**

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between `p` and `q` can be represented as -

$$pq = |x_1 - x_2| + |y_1 - y_2|$$

The same pattern is followed in the case of higher-dimensional vectors.
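A quick check of the formula with NumPy (the points are arbitrary):

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
dist = np.abs(p - q).sum()   # |1-4| + |2-6| = 7.0
```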

Minkowski distance is the generalized form of both the Euclidean and Manhattan distance metrics. It is also called the `Lp` norm. If the value of `p` is 1, it is known as the Manhattan distance; if `p` is 2, the Euclidean distance; and so on.

**Mathematical formula**

Let

$$r = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between `r` and `q` can be represented as -

$$rq = \sqrt[p]{|x_1 - x_2|^p + |y_1 - y_2|^p}$$

The same pattern is followed in the case of higher-dimensional vectors.
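A sketch with a hypothetical helper `minkowski()` (the name and points are made up) confirming that `p = 1` recovers Manhattan and `p = 2` recovers Euclidean:

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

def minkowski(a, b, power):
    # Generalized Lp distance between two vectors
    return np.sum(np.abs(a - b) ** power) ** (1 / power)
```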

The Hamming distance metric is recommended when we are dealing with binary vectors; if used on non-binary vectors, it produces poor results. It is the number of positions at which the corresponding bits of two vectors differ, provided the two vectors have the same length.

**Mathematical formula**

Let

$$p = [0, 1, 1, 0, 1, 0, 0]$$

and

$$q = [1, 0, 1, 0, 1, 0, 1]$$

the distance between `p` and `q` can be represented as the sum of non-identical bits between the two vectors.
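A quick NumPy check using the two example vectors above (they differ at positions 0, 1, and 6):

```python
import numpy as np

p = np.array([0, 1, 1, 0, 1, 0, 0])
q = np.array([1, 0, 1, 0, 1, 0, 1])
dist = np.sum(p != q)   # counts positions where the bits differ
```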

In order to find the angular distance, we first have to compute the cosine similarity. The similarity between two non-zero vectors, measured via their inner product, is called cosine similarity.

**Mathematical formula**

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the cosine similarity between `p` and `q` can be represented as -

$$\text{cosine-sim}(p, q) = \frac{p.q}{||p||_2 \ ||q||_2} = \frac{p.q}{\sqrt{p.p} \ \sqrt{q.q}}$$

and the cosine distance between `p` and `q` can be represented as -

$$\text{cosine-dist}(p, q) = 1 - \text{cosine-sim}(p, q)$$

The same pattern is followed in the case of higher-dimensional vectors.
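A quick NumPy check of the formula (the vectors are arbitrary):

```python
import numpy as np

p = np.array([1.0, 0.0])
q = np.array([1.0, 1.0])
cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))   # 1/sqrt(2)
cos_dist = 1 - cos_sim
```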

Coding the above distance metrics becomes very easy with the help of NumPy, an amazing package for numerical computation and linear algebra. Once we install it, we can use it by importing the package.

**Full code**

```python
import numpy as np

class DistanceMeasures():
    def __init__(self, point1, point2):
        """
        :param array point1: Point
        :param array point2: Point
        """
        self.point1 = np.array(point1)
        self.point2 = np.array(point2)

    def euclidean_measure(self):
        flag = True if (len(self.point1) == len(self.point2)) else False
        if flag:
            dist = np.linalg.norm(self.point1 - self.point2)
            return round(dist, 4)
        return None

    def manhattan_measure(self):
        flag = True if (len(self.point1) == len(self.point2)) else False
        if flag:
            dist = np.abs(self.point1 - self.point2).sum()
            return round(dist, 4)
        return None

    def minkowski_measure(self, p):
        flag = True if (len(self.point1) == len(self.point2)) else False
        # bail out if the lengths differ OR p is invalid
        if (not flag) or (p <= 0):
            return None
        if (p == 1):
            return self.manhattan_measure()
        elif (p == 2):
            return self.euclidean_measure()
        dist = np.sum(np.abs(self.point1 - self.point2) ** p) ** (1 / p)
        return round(dist, 4)

    def hamming_measure(self):
        flag = True if (len(self.point1) == len(self.point2)) else False
        if flag:
            return np.sum(self.point1 != self.point2)
        return None

    def cosine_measure(self):
        flag = True if (len(self.point1) == len(self.point2)) else False
        if flag:
            nume = np.dot(a=self.point1, b=self.point2)
            denome = np.sqrt(np.dot(a=self.point1, b=self.point1)) * np.sqrt(np.dot(a=self.point2, b=self.point2))
            return 1 - (nume / denome)
        return None
```

We have already understood the intuition of how this algorithm works. Now, we just need to apply that concept in the code. Let's do that.

**Note**: This article doesn't cover the performance metrics or hyperparameter tuning of the model.

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification
```

```python
class KNN():
```

The name of the classifier class is `KNN`, inside which we define the other methods.

**`__init__()` Method**

```python
def __init__(self, n_neighbors, train_df, test_df, label, metric='euclidean'):
    """
    :param int n_neighbors: Number of nearest neighbors
    :param str metric: Distance measurement metric. Accepts `euclidean`,
                       `manhattan`, `minkowski`, `hamming`, and `cosine` methods
    """
    self.n_neighbors = n_neighbors
    self.available_metrics = ['euclidean', 'manhattan', 'minkowski', 'hamming', 'cosine']
    self.metric = metric if metric in self.available_metrics else 'euclidean'
    self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
    self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
    self.X_train = self.X_train.values
    self.y_train = self.y_train.values
    self.X_test = self.X_test.values
    self.y_test = self.y_test.values
```

The above method is a constructor that takes five parameters -

- `n_neighbors` refers to the number of top nearest neighbors that are used to predict the class label.
- `train_df` refers to the subset of the data that is used to train the classifier.
- `test_df` refers to the subset of the data that is used to test the classifier.
- `label` refers to the column name of the class label.
- `metric` refers to the type of distance metric that is used to compute the distance between points or vectors.

**`split_features_targets()` Method**

```python
def split_features_targets(self, df, label):
    """
    :param DataFrame df: The main dataset
    :param str label: The column name which signifies class labels
    :return: X, y
    """
    X = df.drop(columns=[label], axis=1)
    y = df[label]
    return X, y
```

The above method is used to separate features and targets from the data. It takes two parameters -

- `df` refers to the entire dataset that is passed for classifying.
- `label` refers to the column of `df` which holds the class labels.

**`most_freq()` Method**

```python
def most_freq(self, l):
    """
    :param list l: List of values (can have duplicates)
    :return: Most occurred element from l
    """
    return max(set(l), key=l.count)
```

The above method is used to get the most frequent element in the given list. Basically, this method replicates the majority voting technique that predicts the class label. It takes one parameter -

- `l` refers to the list of elements (class labels).

**`find_nearest_neighbors()` Method**

```python
def find_nearest_neighbors(self, X_new):
    """
    :param array X_new: Array/List of each feature
    :return: distance_labels - dictionary
    """
    distances = []
    for X_row in self.X_train:
        dm = DistanceMeasures(point1=X_row, point2=X_new)
        if (self.metric == 'euclidean'):
            distances.append(dm.euclidean_measure())
        elif (self.metric == 'manhattan'):
            distances.append(dm.manhattan_measure())
        elif (self.metric == 'minkowski'):
            distances.append(dm.minkowski_measure(p=3))
        elif (self.metric == 'hamming'):
            distances.append(dm.hamming_measure())
        else:
            distances.append(dm.cosine_measure())

    distance_labels = {d: l for (d, l) in zip(distances, self.y_train)}
    distance_labels = dict(
        sorted(distance_labels.items(), key=lambda x: x[0])[:self.n_neighbors]
    )
    return distance_labels
```

The above method is used to find the top nearest neighbors of the query point with respect to all the data points present in the training data. It takes one parameter -

- `X_new` refers to the query point for which the class label is predicted.

**`predict()` Method**

```python
def predict(self):
    if (len(self.X_test) == 1):
        odl = self.find_nearest_neighbors(X_new=self.X_test)
        return self.most_freq(l=list(odl.values()))

    preds = []
    for test in self.X_test:
        odl = self.find_nearest_neighbors(X_new=test)
        preds.append(self.most_freq(l=list(odl.values())))
    return np.array(preds)
```

The above method is used to predict the class labels for the `X_test` data points. It takes no parameters but uses the methods `most_freq()` and `find_nearest_neighbors()` to complete the task.

**`score()` Method**

```python
def score(self, preds, with_plot=False):
    """
    :param array preds: Predictions
    :param boolean with_plot: True/False
    :return: accuracy level
    """
    if (len(self.y_test) == len(preds)):
        if with_plot:
            self.plot_it(preds=preds)
        return sum([1 if (i == j) else 0 for (i, j) in zip(self.y_test, preds)]) / len(preds)
    return "Lengths do not match"
```

The above method is used to compute the accuracy score that determines whether a model is performing well or not. It is the fraction of **correctly classified** data points out of all the data points. Generally, a model whose accuracy is greater than `0.80` (i.e., `80%`) is considered to be a good model. It takes two parameters -

- `preds` refers to an array of predictions (class labels).
- `with_plot` refers to a boolean value that decides whether the data is visualized.

**`plot_it()` Method**

```python
def plot_it(self, preds):
    """
    :param array preds: Predictions
    :return: None
    """
    fig = make_subplots(rows=1, cols=2)

    x_ = list(self.X_train[:, 0])
    y_ = list(self.X_train[:, 1])
    c_ = list(self.y_train)
    fig.add_trace(
        go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Training'),
        row=1, col=1
    )

    x_ = list(self.X_test[:, 0])
    y_ = list(self.X_test[:, 1])
    c_ = list(preds)
    fig.add_trace(
        go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Testing'),
        row=1, col=2
    )

    title = 'Training {} Testing'.format(' ' * 68)
    fig.update_layout(
        title=title, height=300, width=900,
        margin=dict(l=0, b=0, t=40, r=0), showlegend=False
    )
    fig.show()
    return None
```

The above method is used to visualize the data. Both training and testing data are visualized separately. It takes one parameter -

- `preds` refers to an array of predictions (class labels).

**Full code**

```python
class KNN():
    def __init__(self, n_neighbors, train_df, test_df, label, metric='euclidean'):
        """
        :param int n_neighbors: Number of nearest neighbors
        :param str metric: Distance measurement metric. Accepts `euclidean`,
                           `manhattan`, `minkowski`, `hamming`, and `cosine` methods
        """
        self.n_neighbors = n_neighbors
        self.available_metrics = ['euclidean', 'manhattan', 'minkowski', 'hamming', 'cosine']
        self.metric = metric if metric in self.available_metrics else 'euclidean'
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values

    def split_features_targets(self, df, label):
        """
        :param DataFrame df: The main dataset
        :param str label: The column name which signifies class labels
        :return: X, y
        """
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def most_freq(self, l):
        """
        :param list l: List of values (can have duplicates)
        :return: Most occurred element from l
        """
        return max(set(l), key=l.count)

    def find_nearest_neighbors(self, X_new):
        """
        :param array X_new: Array/List of each feature
        :return: distance_labels - dictionary
        """
        distances = []
        for X_row in self.X_train:
            dm = DistanceMeasures(point1=X_row, point2=X_new)
            if (self.metric == 'euclidean'):
                distances.append(dm.euclidean_measure())
            elif (self.metric == 'manhattan'):
                distances.append(dm.manhattan_measure())
            elif (self.metric == 'minkowski'):
                distances.append(dm.minkowski_measure(p=3))
            elif (self.metric == 'hamming'):
                distances.append(dm.hamming_measure())
            else:
                distances.append(dm.cosine_measure())

        distance_labels = {d: l for (d, l) in zip(distances, self.y_train)}
        distance_labels = dict(
            sorted(distance_labels.items(), key=lambda x: x[0])[:self.n_neighbors]
        )
        return distance_labels

    def predict(self):
        if (len(self.X_test) == 1):
            odl = self.find_nearest_neighbors(X_new=self.X_test)
            return self.most_freq(l=list(odl.values()))

        preds = []
        for test in self.X_test:
            odl = self.find_nearest_neighbors(X_new=test)
            preds.append(self.most_freq(l=list(odl.values())))
        return np.array(preds)

    def score(self, preds, with_plot=False):
        """
        :param array preds: Predictions
        :param boolean with_plot: True/False
        :return: accuracy level
        """
        if (len(self.y_test) == len(preds)):
            if with_plot:
                self.plot_it(preds=preds)
            return sum([1 if (i == j) else 0 for (i, j) in zip(self.y_test, preds)]) / len(preds)
        return "Lengths do not match"

    def plot_it(self, preds):
        """
        :param array preds: Predictions
        :return: None
        """
        fig = make_subplots(rows=1, cols=2)

        x_ = list(self.X_train[:, 0])
        y_ = list(self.X_train[:, 1])
        c_ = list(self.y_train)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Training'),
            row=1, col=1
        )

        x_ = list(self.X_test[:, 0])
        y_ = list(self.X_test[:, 1])
        c_ = list(preds)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Testing'),
            row=1, col=2
        )

        title = 'Training {} Testing'.format(' ' * 68)
        fig.update_layout(
            title=title, height=300, width=900,
            margin=dict(l=0, b=0, t=40, r=0), showlegend=False
        )
        fig.show()
        return None
```

Before testing the model, we need to first create a toy dataset and then split it into `training` and `testing` subsets.

```python
X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=3,
    random_state=60
)
df = pd.DataFrame(dict(col1=X[:, 0], col2=X[:, 1], label=y))
```

The first five rows of `df` look like -

| col1 | col2 | label |
| --- | --- | --- |
| 0.774027 | -1.654300 | 0 |
| 0.487983 | -0.202421 | 0 |
| 0.762589 | -0.440150 | 0 |
| -1.920213 | 0.264316 | 1 |
| 0.773207 | -0.014049 | 2 |

```python
def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        dframe = dframe.sample(frac=1)

    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]

    return train_df, test_df
```

The above function is used to divide the data into two parts.

```python
train_df, test_df = splitter(dframe=df)
```

```python
neigh = KNN(
    train_df=train_df,
    test_df=test_df,
    label='label',
    n_neighbors=5,
    metric='cosine'
)
```

```python
preds = neigh.predict()

acc = neigh.score(preds=preds, with_plot=True)
print(acc)
```

The accuracy happens to be `>= 95%`, which is a decent percentage, and hence the model is good.

- It is better to use cross-validation techniques to find the optimal value of `k` to obtain better results. We did not discuss that in this article.
- If the data is huge, it can be a very tough job for the computer, as there will be time and space constraints: KNN happens to be slow compared to other models. Of course, we can make use of methods like `KD-Tree` and `LSH` (Locality Sensitive Hashing) to speed up the model.
- Oftentimes, the majority voting technique may not work effectively. To avoid this problem, we can use `weighted KNN`, which works really well.
- It is always good to use package methods rather than writing our own algorithms. The algorithms that we write may not be optimized enough.

- KNN - Wikipedia article https://bit.ly/3w4cUAW
- TDS blog article https://bit.ly/3x5f3ML
- Distance metrics https://bit.ly/3ctuBlL

**End**

Classification is a task of grouping things together on the basis of the similarity they share with each other. It helps organize things and thus makes the study easier and more systematic. In statistics, classification refers to the problem of identifying to which set of **categories** an observation or data value belongs.

**Examples -**

- Identifying whether an email is **spam** or **not spam**.
- Based on certain health parameters, classifying whether a patient **gets a heart attack** or **not**.

For humans, the classification task can be very easy - assuming he/she has proper domain-specific knowledge, given certain features, he/she can achieve it with ease. But it can be tricky for a machine to classify, unless it is provided with proper training from the data and an algorithm (classifier) that is used for learning. Thanks to Machine Learning and AI, we are now able to train machines with lots of data and predict the results on unseen data.

**Credits of Cover Image** - Photo by Annie Spratt on Unsplash

Machine Learning is a method to make machines learn from the previous experience (data) with which predictions are made on the future unseen data. These predictions may not be perfect and accurate but simply standard guesses that can be considered to make future decisions.

Since we are talking about a machine to do the classification task, luckily, over the years researchers have built various classification algorithms that can be used for our own datasets (assuming the dataset is processed enough). Some of the types of classification algorithms are -

- Logistic Regression
- Naive Bayes Classifier
- K-Nearest Neighbor (KNN)
- Decision Trees
- Neural Networks

**Note** - In this article, we will focus only on the **Naive Bayes Classifier** - algorithm.

A simple and robust classifier that belongs to the family of **probabilistic classifiers**. It follows the idea of the **Bayes Theorem** assuming that every feature is independent of every other feature. Given the categorical features (not real-valued data) along with categorical class labels, Naive Bayes computes **likelihood** for each category from every feature with respect to each category of class labels. Thus, it will choose a specific category of a class label whose likelihood is maximum.

Let's assume that we are given class labels such as -

$$C_k = [c_1, c_2, c_3, \dots, c_K]$$

and features such as -

$$X = [x_1, x_2, x_3, \dots, x_n]$$

with the help of the Bayes Theorem, we can have -

$$P(C_k | X) = \frac{P(C_k) P(X | C_k)}{P(X)}$$

Now, by applying the chain rule and the naive conditional-independence assumption, we will get -

$$P(C_k | x_1, x_2, x_3, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i | C_k)$$

Clearly, the above is a probability model which can be used to construct a classifier such as -

$$\hat{y} = \underset{k}{\operatorname{argmax}} \ \bigg[ P(C_k) \prod_{i=1}^{n} P(x_i | C_k) \bigg]$$
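As a tiny worked instance of this rule (the numbers are invented purely for illustration), suppose there are two classes with priors $P(C_1) = 0.6$ and $P(C_2) = 0.4$, and a single feature $x_1$ with likelihoods $P(x_1 = a | C_1) = 0.2$ and $P(x_1 = a | C_2) = 0.5$. Then -

$$P(C_1) P(x_1 = a | C_1) = 0.6 \times 0.2 = 0.12, \qquad P(C_2) P(x_1 = a | C_2) = 0.4 \times 0.5 = 0.20$$

so the classifier predicts $\hat{y} = C_2$, even though $C_1$ has the larger prior.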

That's it. Based on this idea, the classifier predicts the category. Let's try to code this from scratch.

For practicing Naive Bayes, we need processed categorical data with features and class labels to compute the likelihoods. The whole process is explained below.

Data Source - https://bit.ly/3oZ9s83

**Libraries**

```python
import pandas as pd
import numpy as np
```

**Data Reading**

```python
data_source = 'http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
cdf = pd.read_csv(
    filepath_or_buffer=data_source,
    names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'],
    sep=','
)
```

The data is stored in the variable called `cdf`.

**Construction**

`class CategoricalNB():`

The name of the classifier is `CategoricalNB()` and it is a `class` in which we define the other methods.

```python
    def __init__(self, train_df, test_df, label):
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_test_vals = self.X_test.values
        self.y_test_vals = self.y_test.values
        self.X_likelihood, self.y_likelihood = self.compute_likelihood()
```

The above method is a constructor that takes three parameters -

- `train_df` refers to the subset of the data that is used to train the classifier.
- `test_df` refers to the subset of the data that is used to test the classifier.
- `label` refers to the column name of the class label.

```python
    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y
```

The above method is used to separate features and targets from the data. It takes two parameters -

- `df` refers to the entire dataset that is passed for classifying.
- `label` refers to the column name of the class label in `df`.

```python
    def compute_likelihood(self):
        X_likelihood = {}
        yc_df = self.y_train.value_counts().to_frame()
        yc_df.reset_index(inplace=True)
        yc_df.columns = ['class', 'count']
        y_vc = {i : j for (i, j) in zip(yc_df['class'], yc_df['count'])}
        y_vc_k = list(y_vc.keys())
        for col in self.X_train:
            each_col_dict = {}
            x_col_vals = self.X_train[col].value_counts().to_frame().index.to_list()
            fydf = pd.DataFrame(data={col : self.X_train[col], 'y' : self.y_train})
            for ex in x_col_vals:
                each_x_dict = {}
                x_ex_df = fydf[fydf[col] == ex]
                for ey in y_vc_k:
                    x_y_df = x_ex_df[x_ex_df['y'] == ey]
                    each_x_dict[ey] = len(x_y_df) / y_vc[ey]
                each_col_dict[ex] = each_x_dict
            X_likelihood[col] = each_col_dict
        y_likelihood = {i : j / sum(list(y_vc.values())) for (i, j) in y_vc.items()}
        return X_likelihood, y_likelihood
```

The above method is used to compute the likelihood of each category from every feature with respect to each category of the class label. This is very useful as it is helpful to predict the category of the new data point. Here, likelihood is simply the probability - computed as per the category.
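To make the structure concrete, here is a small self-contained sketch (the four toy rows are invented; the real `safety` column of the car data has more categories) of the nested `{column: {value: {class: P(value | class)}}}` table that `compute_likelihood()` builds:

```python
import pandas as pd

# Toy training data: one feature and a class label.
X_train = pd.DataFrame({'safety': ['low', 'low', 'high', 'high']})
y_train = pd.Series(['unacc', 'acc', 'acc', 'acc'])

likelihood = {}
for col in X_train:
    table = {}
    for val in X_train[col].unique():
        # P(value | class) = count(value and class) / count(class)
        table[val] = {
            cls: float(((X_train[col] == val) & (y_train == cls)).sum() / (y_train == cls).sum())
            for cls in y_train.unique()
        }
    likelihood[col] = table

print(likelihood['safety']['low']['acc'])  # 1 of the 3 'acc' rows has safety == 'low'
```

Each leaf of the dictionary is exactly one conditional probability, which is what `predictor()` later multiplies together.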

```python
    def predictor(self, X_new):
        cols = list(self.X_likelihood.keys())
        col_new = {i : j for (i, j) in zip(cols, X_new)}
        lprobs = {}
        for l, v in self.y_likelihood.items():
            cate_v = [self.X_likelihood[cn][cl][l] for (cn, cl) in col_new.items()]
            lprobs[l] = round((np.prod(cate_v) * v), 4)
        prob_ks = list(lprobs.keys())
        prob_vs = list(lprobs.values())
        return prob_ks[np.argmax(prob_vs)]
```

The above method is used to get the specific category of the class label whose probability for the parameter `X_new` is maximum. Here, the parameter `X_new` is basically the query point for which we want the prediction.

```python
    def predict(self):
        if len(self.X_test_vals) == 1:
            return self.predictor(X_new=self.X_test_vals[0])
        preds = [self.predictor(X_new=i) for i in self.X_test_vals]
        return preds
```

The above method is used to predict the categories of the class label for the completely unseen dataset. This method works for both a single query point and multiple query points.

```python
    def accuracy_score(self, preds):
        actual_vals = np.array(self.y_test_vals)
        preds = np.array(preds)
        corrects = np.count_nonzero(np.where((actual_vals == preds), 1, 0))
        return corrects / len(actual_vals)
```

The above method computes the accuracy (a metric) to determine how the model (algorithm) is performing. It is the fraction of **correctly classified** data points out of the total number of data points. Generally, a model whose accuracy is greater than `0.80` (or `80%`) is considered to be a good model.

**Full Code**

```python
class CategoricalNB():
    def __init__(self, train_df, test_df, label):
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_test_vals = self.X_test.values
        self.y_test_vals = self.y_test.values
        self.X_likelihood, self.y_likelihood = self.compute_likelihood()

    def split_features_targets(self, df, label):
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def compute_likelihood(self):
        X_likelihood = {}
        yc_df = self.y_train.value_counts().to_frame()
        yc_df.reset_index(inplace=True)
        yc_df.columns = ['class', 'count']
        y_vc = {i : j for (i, j) in zip(yc_df['class'], yc_df['count'])}
        y_vc_k = list(y_vc.keys())
        for col in self.X_train:
            each_col_dict = {}
            x_col_vals = self.X_train[col].value_counts().to_frame().index.to_list()
            fydf = pd.DataFrame(data={col : self.X_train[col], 'y' : self.y_train})
            for ex in x_col_vals:
                each_x_dict = {}
                x_ex_df = fydf[fydf[col] == ex]
                for ey in y_vc_k:
                    x_y_df = x_ex_df[x_ex_df['y'] == ey]
                    each_x_dict[ey] = len(x_y_df) / y_vc[ey]
                each_col_dict[ex] = each_x_dict
            X_likelihood[col] = each_col_dict
        y_likelihood = {i : j / sum(list(y_vc.values())) for (i, j) in y_vc.items()}
        return X_likelihood, y_likelihood

    def predictor(self, X_new):
        cols = list(self.X_likelihood.keys())
        col_new = {i : j for (i, j) in zip(cols, X_new)}
        lprobs = {}
        for l, v in self.y_likelihood.items():
            cate_v = [self.X_likelihood[cn][cl][l] for (cn, cl) in col_new.items()]
            lprobs[l] = round((np.prod(cate_v) * v), 4)
        prob_ks = list(lprobs.keys())
        prob_vs = list(lprobs.values())
        return prob_ks[np.argmax(prob_vs)]

    def predict(self):
        if len(self.X_test_vals) == 1:
            return self.predictor(X_new=self.X_test_vals[0])
        preds = [self.predictor(X_new=i) for i in self.X_test_vals]
        return preds

    def accuracy_score(self, preds):
        actual_vals = np.array(self.y_test_vals)
        preds = np.array(preds)
        corrects = np.count_nonzero(np.where((actual_vals == preds), 1, 0))
        return corrects / len(actual_vals)
```

Before testing the model, we need to first split the original data into two parts. One part is used for training and the other part is used for testing. In machine learning, we generally take `80%` of the data for training and the remaining `20%` for testing.

**Data Splitter**

```python
def splitter(dframe, percentage=0.8, random_state=True):
    if random_state:
        dframe = dframe.sample(frac=1)
    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]
    return train_df, test_df
```

The above function is used to divide the data into two parts.

`train_df, test_df = splitter(dframe=cdf)`

We have successfully divided the data into two parts.

**Object Creation**

`nb = CategoricalNB(train_df=train_df, test_df=test_df, label='class')`

**Prediction**

`preds = nb.predict()`

**Accuracy Score**

```python
acc = nb.accuracy_score(preds=preds)
print(acc)
```

The accuracy happens to be `>= 82%`, which is a decent percentage, and hence the model is good.

- There can be situations where we have to apply **Laplace Smoothing** to avoid the problem of a probability becoming `0`. I have not implemented it here.
- If we have many features in the data, we might end up getting a very small probability. This can be avoided by taking logarithmic probabilities.
- The above implementation only works for categorical data. If we have real-valued data, it is better to proceed with the **Gaussian Naive Bayes** procedure.
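To illustrate the first two points, here is one possible way to combine them - a sketch, not part of `CategoricalNB`; the function name, the `alpha` default, and the example counts are all assumptions made for this snippet:

```python
import numpy as np

def smoothed_log_likelihood(count_xy, count_y, n_categories, alpha=1.0):
    # Laplace (add-alpha) smoothing keeps an unseen category from producing
    # a zero likelihood; the log keeps long products from underflowing.
    return np.log((count_xy + alpha) / (count_y + alpha * n_categories))

# A feature value never seen with this class (count_xy = 0) still gets a
# small, non-zero probability instead of killing the whole product.
print(smoothed_log_likelihood(count_xy=0, count_y=50, n_categories=4))
```

With log-probabilities, the product in the argmax rule becomes a sum of these terms, which is numerically much safer when there are many features.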

- Categorical Naive Bayes - https://bit.ly/3uwOrTh
- Machine Learning, Shatterline blog - https://bit.ly/3i1gqrv

Well, that's it for now. If you have liked my article you can **buy some coffee** and support me here. That would motivate me to write and learn more about what I know.

Data manipulation is one of the techniques in the life cycle of data analysis. It refers to the process of organizing the data by adjusting certain rows or columns, with which it becomes easier to read and understand. In SQL, there is even a dedicated sub-language called **Data Manipulation Language (DML)** for all kinds of adjustments (inserting, deleting, modifying, etc).

**Credits of Cover Image** - Photo by Tower Electric Bikes on Unsplash

**Note** - This article mainly focuses on certain methods like `apply()` and `map()` that are helpful to manipulate the data. These are `pandas` methods.

The ultimate goal of any program written in any language is to complete the task as fast as possible by consuming less memory and time. Using these built-in methods can be beneficial as they are optimized keeping all the constraints. The same implementation can be achieved by writing our own custom function, but this cannot be as efficient as using the built-in method, and moreover, this can sometimes turn out to be very tedious.

- `apply()` - this method is used to manipulate the entire data frame or one single column (series) when passed a function.
- `map()` - this method is used to manipulate only one single column (series) at a time when passed a function.

Given a numerical data frame, compute the square of each number from each column and obtain a new data frame. Further explanation can be seen in the below figure.

The data which is on the left side is the original data and the right side is the manipulated data which is basically the square of the number.
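The figure's data frame itself is not reproduced here, so as a stand-in, a small numeric `df` (column names and values chosen arbitrarily for illustration) that works for both problem statements in this article can be built like this:

```python
import pandas as pd

# Arbitrary numeric values; any all-numeric frame with no NaNs works here.
df = pd.DataFrame({
    'a': [12, 45, 7, 88],
    'b': [30, 5, 61, 24],
})
print(df ** 2)  # element-wise squares, same shape as df
```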

**Custom Function**

The custom function (user-defined function) that we can think of solving this problem can be like -

```python
def square_data(dframe):
    cframe = pd.DataFrame()
    for col in dframe:
        square_col = []
        for each_val in dframe[col]:
            square_col.append(each_val ** 2)
        cframe[col] = square_col
    return cframe
```

The function uses two `for` loops, which can make the program execute slowly when the data is very large. This is definitely not the right thing to practice in real data analysis tasks.

**apply() Method**

Let's say the original data (on the left) is stored in a variable called `df`. By using `apply()`, this task can be computed easily in just one line.

`adf = df.apply(lambda x: x**2)`

That's it! The above returns the manipulated data, stored in the variable `adf`. The only caveat in using this method is that it is important to make sure the data contains no `NaN` values and all the values are numeric in nature (in the sense, there should not be any text data).

**map() Method**

Like `apply()`, the method `map()` can also be used, but `map()` only manipulates a single column at a time. It cannot be used to manipulate the entire data at once.

```python
mdf = pd.DataFrame()
for col in df:
    mdf[col] = df[col].map(lambda x: x**2)
```

The `for` loop is used only to iterate through the columns, not through each value in a column. This is far better than the first method but not as good as the `apply()` method.

`apply()` can perform the same task that `map()` can. Both methods have their advantages and disadvantages. Quick data processing is a must before using these methods in practice.

Given a numerical data frame, replace the values that are less than `25` with `1000`. Further explanation can be seen in the below figure.

The data which is on the left side is the original data and the right side is the manipulated data, where the numbers less than `25` are replaced with `1000`.

**Custom Function**

The custom function (user-defined function) that we can think of solving this problem can be like -

```python
def replacer(dframe, cval=25, with_=1000):
    rframe = pd.DataFrame()
    for col in dframe:
        rvals = []
        for each_val in dframe[col]:
            r = with_ if each_val < cval else each_val
            rvals.append(r)
        rframe[col] = rvals
    return rframe
```

As usual, this uses `2` loops, which can be hectic if the data is very large.

**apply() & map() Method**

This case cannot be handled by one line of code, instead, a loop is used that goes through each column.

```python
rdf = pd.DataFrame()
for col in df:
    rdf[col] = df[col].apply(lambda x: x if x > 25 else 1000)
```

or

```python
rdf = pd.DataFrame()
for col in df:
    rdf[col] = df[col].map(lambda x: x if x > 25 else 1000)
```

Here, `apply()` behaves similarly to `map()`. However, `apply()` is not utilized as concisely as in problem statement 1, because an extra conditional statement is introduced and the task cannot be achieved in one line of code over the whole frame.

To compute this in one line of code, the method `applymap()` can be used. It is the combination of both `apply()` and `map()`.

**applymap() Method**

`rdf = df.applymap(lambda x: x if x > 25 else 1000)`

That's it! The above returns the manipulated data, stored in the variable `rdf`. The only caveat in using this method is that it is important to make sure the data contains no `NaN` values and all the values are numeric in nature (in the sense, there should not be any text data).

The article encompasses the core details on how to manipulate the data effectively by certain methods.

- Methods included here are `apply()`, `map()`, and `applymap()`, which are `pandas` data manipulation methods.
- The use case of each method is explained clearly with problem statements.

**Buy Me Coffee**

If you have liked my article you can buy some coffee and support me here. That would motivate me to write and learn more about what I know.

Well, that's all for now. This article is included in the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

A missing value simply refers to a situation where there is no data stored for the variable in an observation. This usually happens when there is no response detected for a particular question during the process of data collection. Data analysts have to be very careful in dealing with missing values. The process of handling missing data is called **imputation**. On the whole, imputation helps to some extent, but the imputed value can never fully substitute for the actual value that is missing.

**Credits of Cover Image** - Photo by Alex Jones on Unsplash

- Filling by a **Central Tendency**
  - Mean replacement
  - Median replacement
  - Mode replacement
- **Categorical Imputation** - For each category, compute the central tendency and fill the missing values for that category itself.
- **Model Imputation** - Building a machine learning model to fill up the missing data with the predicted values.

**Note** - In this article, we will focus only on techniques `1` and `2`. We will not focus on `3` for now.

For implementing the above methods we will need to first create a **toy dataset** in order to programmatically understand each technique.

**Imports**

```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random

from sklearn.datasets import make_blobs
from scipy import stats
from matplotlib import pyplot as plt
```

In our data, we shall make sure that we have at least one feature which is categorical. The method `make_blobs()`

can be really handy in creating toy datasets in our own experiments.

With this method, we will create a dataset that will have -

- 100 samples
- 2 features
- 3 centers

```python
X, y = make_blobs(n_samples=100, n_features=2, centers=3)
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
print(df.head())
```

The gist of the data looks like this -

| Index | x | y | label |
| --- | --- | --- | --- |
| 0 | -1.960223 | -2.240400 | 2 |
| 1 | -3.489893 | -3.149356 | 2 |
| 2 | -8.173217 | 6.186763 | 1 |
| 3 | -1.216968 | -3.318524 | 2 |
| 4 | -7.661975 | 9.092514 | 1 |

Similarly, the **original plot** of the data looks like this -

In Python, a missing value is represented as `NaN`, which means **Not a Number**. We will create missing values in the `y` column of the above toy dataset.

`df['y'].loc[df['y'].sample(frac=0.1).index] = np.nan`

The above code simply means that we are creating `10%` missing values at completely random locations in the existing data frame.

```python
df.isnull().sum()
# ------
'''
x         0
y        10
label     0
dtype: int64
'''
```

Since we have to impute the missing values with central tendencies like `mean`, `median`, and `mode`, we will create a **helper function** that takes three params -

- `data` - the dataset for which the central tendencies should be computed.
- `col` - the column for which the central tendencies should be computed.
- `strategy` - the mean, median, or mode strategy.

```python
def fill_values(data, col, strategy):
    if (strategy == 'mean'):
        fval = data[col].mean()
    elif (strategy == 'median'):
        fval = data[col].median()
    else:
        fval = stats.mode(data[col])[0][0]
    return fval
```

The above function returns a value based on the `strategy` that is used.

In order to fill the `NaN` values with new values in the existing column of the data, we need a function that does so. The function takes three params -

- `data` - the dataset for which the central tendencies should be computed.
- `col` - the column for which the central tendencies should be computed.
- `strategy` - the mean, median, or mode strategy.

```python
def simple_imputer(data, col, strategy):
    fval = fill_values(data=data, col=col, strategy=strategy)
    non_nans = data[col].fillna(fval).to_list()
    return non_nans
```

The above function returns a list with the filled values based on the `strategy` that is used.

**Note** - The above technique is very simple and would not produce good results when compared with other techniques.

If we try to visualize the data after filling the `NaN`'s, we would see some dispersion in the data. We can see a horizontal pattern formed by points; these points are the `NaN` values that got filled by the respective central tendency, and none of them belong to any of the clusters.

In our original dataset, besides the `x` and `y` columns we also have one column called `label` that has `3` unique values - **[0, 1, 2]**. A simple hack that we can incorporate here is to separate the data by clusters and then compute the central tendency to fill up the `NaN` values. By doing so, we can make sure that the new values belong to their respective clusters.

As usual, we will have a function to separate the data cluster-wise. The function takes four params -

- `data` - the dataset for which the central tendencies should be computed.
- `col` - the column for which the central tendencies should be computed.
- `label` - the column with which the data is separated into clusters.
- `strategy` - the mean, median, or mode strategy.

```python
def label_imputer(data, col, label, strategy):
    col_label_df = data[[col, label]]
    counts_df = col_label_df[label].value_counts().to_frame()
    classes = counts_df.index.to_list()
    fval_class = []
    for each_class in classes:
        # separating data into clusters
        mini_frame = col_label_df[col_label_df[label] == each_class]
        # computing central tendency
        non_nans_frame = pd.Series(
            data=simple_imputer(mini_frame, col, strategy),
            index=mini_frame.index
        )
        fval_class.append(non_nans_frame)
    final_vals = pd.concat(fval_class).sort_index().to_list()
    return final_vals
```

The above function is used to compute the central tendency value based upon the cluster that it belongs to and finally returns a list of values.
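As a self-contained usage sketch (with the two helpers restated in a trimmed, mean-only form and a hand-made four-row frame), the cluster-aware fill behaves like this - each `NaN` receives its own cluster's mean rather than the global mean:

```python
import numpy as np
import pandas as pd

# Trimmed, mean-only versions of the article's helpers, restated so the
# snippet runs on its own.
def simple_imputer(data, col, strategy='mean'):
    return data[col].fillna(data[col].mean()).to_list()

def label_imputer(data, col, label, strategy='mean'):
    filled = []
    for each_class in data[label].unique():
        mini = data[data[label] == each_class]
        filled.append(pd.Series(simple_imputer(mini, col, strategy), index=mini.index))
    return pd.concat(filled).sort_index().to_list()

# Cluster 0 sits around 1.0, cluster 1 around 10.0; each NaN is filled
# with its own cluster's mean instead of the global mean (~5.5).
df = pd.DataFrame({
    'y': [1.0, np.nan, 10.0, np.nan],
    'label': [0, 0, 1, 1],
})
print(label_imputer(df, col='y', label='label'))  # [1.0, 1.0, 10.0, 10.0]
```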

If we try to visualize the data after filling the `NaN`'s, we would not see any dispersion in the data, but rather a uniformity and a resemblance with the original plot (without `NaN` values).

```python
class Imputer():
    def __init__(self, strategy):
        self.available_strats = ['mean', 'median', 'mode']
        self.strategy = 'median' if strategy not in self.available_strats else strategy

    def fill_values(self, data, col):
        if (self.strategy == 'mean'):
            fval = data[col].mean()
        elif (self.strategy == 'median'):
            fval = data[col].median()
        else:
            fval = stats.mode(data[col])[0][0]
        return fval

    def simple_imputer(self, data, col):
        fval = self.fill_values(data=data, col=col)
        non_nans = data[col].fillna(fval).to_list()
        return non_nans

    def label_imputer(self, data, col, label):
        col_label_df = data[[col, label]]
        counts_df = col_label_df[label].value_counts().to_frame()
        classes = counts_df.index.to_list()
        fval_class = []
        for each_class in classes:
            mini_frame = col_label_df[col_label_df[label] == each_class]
            non_nans_frame = pd.Series(
                data=self.simple_imputer(data=mini_frame, col=col),
                index=mini_frame.index
            )
            fval_class.append(non_nans_frame)
        final_vals = pd.concat(fval_class).sort_index().to_list()
        return final_vals
```

Throughout this article, we saw different techniques to handle the missing data. But, this was implemented on a toy dataset, not on a real dataset.

- Practicing the above on real data may or may not be suitable. Moreover, it is not always possible to have a **categorical** feature with which to separate the data into clusters and fill the values based on the clusters themselves.
- It might be beneficial to practice the third technique, which predicts the missing values with a mathematical model; this is an unbiased technique that depends entirely on the **training data** and the **model** itself.
- It is all up to the practitioner to choose a method and come up with a solution for the given problem.

**Buy Me Coffee**

If you have liked my article you can buy some coffee and support me here. That would motivate me to write and learn more about what I know.

Well, that's all for now. This article is included in the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

The main agenda of this article is to analyze crime data by following all the steps required for a complete data analysis. The steps include **Data Preparation**, **Data Cleaning**, **Data Wrangling**, **Feature Selection**, and **Data Visualization & Comparison**.

The data is about the crimes committed against women in India. The data is being recorded from 2001 to 2014. It includes crimes like -

- Rape
- Kidnapping and Abduction
- Dowry Deaths
- Assault on Women with intent to outrage her modesty
- Insult to the modesty of Women
- Cruelty by Husband or his Relatives
- Importation of Girls

**Data Source** Crimes Against Women - kaggle.com

Throughout this article, we will try to see if at all there is a reduction in crimes as the year count increases. We will visualize each column of data (mentioned above) state-wise & year-wise, and thus explore in a much better way. If you want to directly check out my work at Kaggle, feel free to visit this notebook.

**Credits of Cover Image** - Photo by Stefano Pollio on Unsplash

As always, begin with importing all the necessary packages.

```python
import warnings
warnings.filterwarnings('ignore')

import requests
import pandas as pd
import numpy as np
import plotly.graph_objects as go
```

I have saved the `CSV` data in my local system with the file name `caw_2001-2014.csv`.

`df = pd.read_csv('caw_2001-2014.csv', index_col=0)`

- We will rename all the columns ensuring that the column name is short and precise for easy access to the column.

```python
df.columns = [
    'state_unit', 'district', 'year', 'rape', 'kidnap_abduction',
    'dowry_deaths', 'women_assault', 'women_insult',
    'husband_relative_cruelty', 'girl_importation'
]
df.index = list(range(df.shape[0]))
```

- There is a lot of `string` data where all the values are uppercased. To maintain uniformity and flexibility, we will convert all the `string` values to title-case.

```python
for col in df.columns:
    df[col] = df[col].apply(lambda x : x.title() if isinstance(x, str) else x)
```

- If we observe the column `state_unit`, there are a few values that need replacement. Some values are repetitive, in the sense that they signify the same meaning but have a different name. We shall replace all those values to have a single unique name. This process is really helpful in order to visualize the data geographically.

```python
replacements = {
    'A & N Islands' : 'Andaman and Nicobar',
    'A&N Islands' : 'Andaman and Nicobar',
    'Daman & Diu' : 'Daman and Diu',
    'Delhi Ut' : 'Delhi',
    'D & N Haveli' : 'Dadra and Nagar Haveli',
    'D&N Haveli' : 'Dadra and Nagar Haveli',
    'Odisha' : 'Orissa',
    'Jammu & Kashmir' : 'Jammu and Kashmir'
}

for (o, r) in replacements.items():
    df['state_unit'].replace(to_replace=o, value=r, inplace=True)
```

- Since the data is collected from `2001` till `2014`, we will need to split the data year-wise and save it in a dictionary. This is simply an efficient way of organizing the data.

```python
def split_data(dframe):
    min_year = dframe['year'].min()
    max_year = dframe['year'].max()
    data_year_wise = {
        year : dframe[dframe['year'] == year]
        for year in range(min_year, max_year + 1)
    }
    return data_year_wise

# --------------
data_splits = split_data(dframe=df)
```

- To know the dimensions of the data (year-wise), we can iterate through each `key` and print the shape of each `value`.

```python
for (y, d) in data_splits.items():
    print(y, '\t', d.shape)

# --------------
'''
2001     (716, 10)
2002     (719, 10)
2003     (728, 10)
2004     (729, 10)
...
2014     (837, 10)
'''
```

- In the column `district`, there is one unique row that has the total number of crime counts pertaining to each column. This again differs based on the name of the `state_unit`.

```python
def categorize_crimes(data_source, state_unit=None):
    crime_list = list(data_source[2001].columns[3:])
    all_crimes_year_wise = {}
    for (y, d) in data_source.items():
        y_df = d[d['district'].str.contains('Total')]
        if state_unit:
            y_df = y_df[y_df['state_unit'] == state_unit.title()]
        crime_dict = {col : y_df[col].sum() for col in crime_list}
        # all_crimes_year_wise[y] = dict(sorted(crime_dict.items(), key=lambda x:x[1], reverse=True))
        all_crimes_year_wise[y] = crime_dict
    return all_crimes_year_wise
```

The above function `categorize_crimes()` takes two parameters -

- `data_source` refers to the entire data which is split based on year.
- `state_unit` refers to the name of the state or unit. If the value is `None`, it considers total crime counts year-wise on the basis of the whole country. Otherwise, it considers total crime counts year-wise on the basis of a particular state.
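Here is a self-contained sketch of the function on a hand-made two-year stand-in for `data_splits` (one state, one crime column, with the `'Total'` district-row convention the real data uses; all counts are invented):

```python
import pandas as pd

# Minimal stand-in for data_splits: the 'Total' row already aggregates
# the districts, which is what the function filters on.
data_source = {
    2001: pd.DataFrame({
        'state_unit': ['Delhi', 'Delhi'],
        'district': ['North', 'Total'],
        'year': [2001, 2001],
        'rape': [5, 12],
    }),
    2002: pd.DataFrame({
        'state_unit': ['Delhi', 'Delhi'],
        'district': ['North', 'Total'],
        'year': [2002, 2002],
        'rape': [6, 14],
    }),
}

def categorize_crimes(data_source, state_unit=None):
    # Same logic as the article's function, trimmed to this toy schema;
    # int() keeps the printed dictionary free of numpy scalar wrappers.
    crime_list = list(data_source[2001].columns[3:])
    all_crimes_year_wise = {}
    for (y, d) in data_source.items():
        y_df = d[d['district'].str.contains('Total')]
        if state_unit:
            y_df = y_df[y_df['state_unit'] == state_unit.title()]
        all_crimes_year_wise[y] = {col: int(y_df[col].sum()) for col in crime_list}
    return all_crimes_year_wise

print(categorize_crimes(data_source))  # {2001: {'rape': 12}, 2002: {'rape': 14}}
```

Only the `'Total'` rows survive the filter, so each year maps to a dictionary of per-crime totals - exactly the shape the plotting functions below consume.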

Now that we have the data organized and categorized, we can visualize it to identify the patterns.

Visualizing the count of crimes considering the whole country by including all the crimes.

- For this, we will create a function that takes the total number of crimes either state-wise or country-wise. We just need to provide the `data_source`; `state_unit` is optional.
- Once it obtains the total count of each column, it again sums up all the columns to get the whole total of crimes that happened every year.
- If `state_unit` is specified, it does the same process but only limited to that particular state.

```python
def plot_overall_crimes_by_year(data_source, state_unit=None, kind='bar'):
    crimes_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    year_sum_crimes = {y : sum(list(cr.values())) for (y, cr) in crimes_data.items()}
    y_keys = list(year_sum_crimes.keys())
    y_vals = list(year_sum_crimes.values())
    t = 'Total Crimes - {}'
    title = t.format(state_unit.title()) if state_unit else t.format('India')
    if kind == 'bar':
        trace = go.Bar(x=y_keys, y=y_vals)
    else:
        trace = go.Pie(labels=y_keys, values=y_vals)
    layout = go.Layout(
        height=400, width=600, title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None
```

**Function Call 1** - Visualization on the basis of the whole country.

`plot_overall_crimes_by_year(data_source=data_splits)`

**Function Call 2** - Visualization on the basis of a particular state.

`plot_overall_crimes_by_year(data_source=data_splits, state_unit='Andhra Pradesh')`

Visualizing the count of crimes considering the whole country by including all the crimes and a specific year in which they took place.

- For this, we will create a function that takes the total number of crimes either state-wise or country-wise by year. We just need to provide the `data_source` and `year`; `state_unit` is optional.
- If `state_unit` is specified, it does the same process but only limited to that particular state.

```python
def plot_crimes_by_year(data_source, year, state_unit=None, kind='bar'):
    crimes_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    year_all_crimes = crimes_data[year]
    y_keys = list(year_all_crimes.keys())
    y_vals = list(year_all_crimes.values())
    t = '{} - Total Crimes - {}'
    title = t.format(year, state_unit.title()) if state_unit else t.format(year, 'India')
    if kind == 'bar':
        trace = go.Bar(x=y_keys, y=y_vals)
    else:
        trace = go.Pie(labels=y_keys, values=y_vals)
    layout = go.Layout(
        height=400, width=600, title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None
```

**1 Function Call** - Visualization on the basis of the whole country in a specific period of time (year).

`plot_crimes_by_year(data_source=data_splits, year=2001)`

**2 Function Call** - Visualization on the basis of a particular state in a specific period of time (year).

`plot_crimes_by_year(data_source=data_splits, year=2001, state_unit='Andhra Pradesh')`

Comparing the number of occurrences visually, both country-wise and for a particular state, across two different years.

For this, we will create a function that plots the total number of crimes, either state-wise or country-wise, for two different years. We just need to provide the `data_source`, `ideal_year`, and `cwith_year`; the `state_unit` is optional. If `state_unit` is specified, it does the same process but limited to that particular state.

```python
def plot_overall_difference(data_source, ideal_year, cwith_year, state_unit=None):
    crime_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    ideal_year_crimes = crime_data[ideal_year]
    cwith_year_crimes = crime_data[cwith_year]
    t = '{} vs {} - diff - {}'
    title = t.format(ideal_year, cwith_year, state_unit.title()) if state_unit else t.format(ideal_year, cwith_year, 'India')
    trace1 = go.Bar(
        x=list(ideal_year_crimes.keys()),
        y=list(ideal_year_crimes.values()),
        name=ideal_year
    )
    trace2 = go.Bar(
        x=list(cwith_year_crimes.keys()),
        y=list(cwith_year_crimes.values()),
        name=cwith_year
    )
    layout = go.Layout(
        height=400, width=600, title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace1, trace2], layout=layout)
    fig.show()
    return None
```

**1 Function Call** - Visualization on the basis of the whole country in two different periods of time (years).

`plot_overall_difference(data_source=data_splits, ideal_year=2001, cwith_year=2014)`

**2 Function Call** - Visualization on the basis of a particular state in two different periods of time (years).

`plot_overall_difference(data_source=data_splits, ideal_year=2001, cwith_year=2012, state_unit='Andhra Pradesh')`

Plotting a particular crime both country-wise and state-wise to identify increasing or decreasing patterns across the years.

For this, we will create a function that plots any single crime, either state-wise or country-wise, considering all the years. We just need to provide the `data_source` and `crime`; the `state_unit` is optional. If `state_unit` is specified, it does the same process but limited to that particular state.

```python
def plot_crime_overall_diff(data_source, crime, state_unit=None, kind='bar'):
    crime_data = categorize_crimes(data_source=data_source, state_unit=state_unit)
    years_x = list(crime_data.keys())
    crime_y = [cr[crime] for (y, cr) in crime_data.items()]
    t = '{} - {} | {} {}'
    min_y = years_x[0]
    max_y = years_x[-1]
    title = t.format(min_y, max_y, state_unit.title(), crime) if state_unit else t.format(min_y, max_y, 'India', crime)
    if kind == 'bar':
        trace = go.Bar(x=years_x, y=crime_y)
    else:
        trace = go.Pie(labels=years_x, values=crime_y)
    layout = go.Layout(
        height=400, width=600, title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None
```

**1 Function Call** Visualization of any single crime on the basis of the whole country.

`plot_crime_overall_diff(data_source=data_splits, crime='rape')`

**2 Function Call** Visualization of any single crime on the basis of a particular state.

`plot_crime_overall_diff(data_source=data_splits, crime='rape', state_unit='Andhra Pradesh')`

In the above methods, we considered only the `rape` column. If we want to identify patterns for any other column, we can do so by changing the parameter value.

Plotting the data state-wise and yearly-wise with a specific crime (column) as a target.

For this, we will create a function that plots any single crime, state-wise or country-wise, for a particular period of time. We just need to provide the `data_source`, `crime`, and `year`; the `state_unit` is optional. If `state_unit` is specified, it does the same process but limited to that particular state.

```python
def plot_column(data_source, year, crime, state_unit=None, kind='bar'):
    states_x, crime_y = obtain_features(data_source=data_source, year=year,
                                        crime=crime, state_unit=state_unit)
    t = '{} | {} {}'
    title = t.format(year, crime, state_unit.title()) if state_unit else t.format(year, crime, 'India')
    if kind == 'bar':
        trace = go.Bar(x=states_x, y=crime_y)
    else:
        trace = go.Pie(labels=states_x, values=crime_y)
    layout = go.Layout(
        height=400, width=600, title=title,
        margin=dict(l=0, r=0, b=0, t=40)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None
```

**1 Function Call** Visualization of any single crime on the basis of the whole country in a particular period of time.

`plot_column(data_source=data_splits, year=2001, crime='kidnap_abduction')`

**2 Function Call** Visualization of any single crime on the basis of a particular state in a particular period of time.

`plot_column(data_source=data_splits, year=2001, crime='kidnap_abduction', state_unit='Andhra Pradesh')`

In order to plot the data geographically, we need to have a map layout, and accordingly, the data can be visualized.

**Downloading GeoJSON Data** - State-wise

```python
def get_india_map(year, state_unit=None):
    if year < 2014:
        if not state_unit:
            return 'https://raw.githubusercontent.com/geohacker/india/master/state/india_state.geojson'
        return 'https://raw.githubusercontent.com/geohacker/india/master/district/india_district.geojson'
    return 'https://raw.githubusercontent.com/geohacker/india/master/state/india_telengana.geojson'
```

**Getting the Districts GeoJSON**

```python
def get_districts_json(year, state_unit=None):
    geo_link = get_india_map(year=year, state_unit=state_unit)
    if not state_unit:
        return geo_link
    req_data = requests.get(url=geo_link)
    req_json = req_data.json()['features']
    state_districts = []
    for feature in req_json:
        if feature['properties']['NAME_1'] == state_unit:
            state_districts.append(feature)
    return {
        "type": "FeatureCollection",
        "crs": {
            "type": "name",
            "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}
        },
        "features": state_districts
    }
```

**Visualizing the Choropleth Map**

```python
def plot_state_wise(data_source, year, crime, state_unit=None):
    state_unit = None  # district-level plotting is disabled here; the map is always state-wise
    state_x, crime_y = obtain_features(data_source=data_source, year=year,
                                       crime=crime, state_unit=state_unit)
    df_cols = ['name', 'crime_count']
    state_crime_df = pd.DataFrame(data=zip(state_x, crime_y), columns=df_cols)
    trace = go.Choropleth(
        geojson=get_districts_json(year=year, state_unit=state_unit),
        featureidkey='properties.NAME_1',
        locations=state_crime_df['name'],
        z=state_crime_df['crime_count'],
        colorscale='Reds',
        marker_line_color='black',
        colorbar=dict(
            title={'text': "Crime Range"},
        )
    )
    layout = go.Layout(
        title="{} Crime Activity - {}".format(year, crime),
        geo=dict(
            visible=False,
            lonaxis={'range': [65, 100]},
            lataxis={'range': [5, 40]}
        ),
        margin=dict(l=0, b=0, t=30, r=0),
        height=600, width=600
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()
    return None
```

**1 Function Call** - Visualizing `dowry_deaths` data state-wise for a particular year.

`plot_state_wise(data_source=data_splits, year=2001, crime='dowry_deaths')`

`plot_state_wise(data_source=data_splits, year=2014, crime='dowry_deaths')`

In Uttar Pradesh, we can observe that the count of deaths due to dowry crime was high in both 2001 and 2014.

It is observed that as the years go by, there is a drastic increase in the occurrence of crimes. It is quite painful to observe this; we have all been told that the development of the country (from every angle) needs time and patience. As the years pass, there should be a decrease in occurrences, but it is the opposite.

Crimes against women are increasing no matter what measures are taken.

Moreover, this dataset only includes the activity that had taken place till 2014; we do not know how many more crimes have happened to this date.

The same pattern is observed when we consider any single state and run our analysis.

Well, that's all for now. This article is included in the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

A matrix is a type of data structure, similar to an array, where values are stored in rows and columns. Here, the values are all of a single type. When dealing with matrices (linear algebra) in Machine Learning and NLP, we often hear about two types of matrices -

- **Dense Matrix** - A matrix where most of the elements are non-zero. In this matrix, there are very few zero elements.
- **Sparse Matrix** - In contrast, a matrix where most of the elements are zero and very few elements are non-zero.

**Note** - There are no strict criteria for how many zero values a matrix must have before it is worth converting it to a sparse representation.

**Credits of Cover Image** - Photo by Alexander Schimmeck on Unsplash

Imagine you have a large matrix with `N` rows and `M` columns in which most of the values are zeros. You are asked to consider only the non-zero elements, since zero elements do not add much value. You definitely have time and space constraints since you are dealing with a very large matrix.

The simple method here is to neglect all `0`'s and store the non-zero elements in the form of **[**`row_index`, `col_index`, `non-zero_value`**]**.

**Approach:**

- Iterate through each `row` of the input matrix.
- Iterate through each `value` of each `row`.
- If **(value == 0)**, skip.
- If **(value != 0)**, obtain the `row` number and `column` number along with the `value` from the input matrix.
- Save the result as **[**`row`, `column`, `value`**]** for every such `row` and `column`.
- Stop.

Let's say we are given a matrix in which most of the elements are `0`.

The following `GIF` explains how to obtain the sparse matrix (**GIF by Author**).

In this section, we will try to code this in two different ways - one with the help of the `scipy` module, and another implementing our own sparse matrix.

*Let's get started ...*

**Necessary Imports**

```python
import numpy as np
from scipy.sparse import csr_matrix
```

**Random Matrix Creation**

Just for the demonstration, we will create a random matrix that contains a good number of `0` elements.

```python
>>> mat = np.random.randint(low=0, high=3, size=(5, 5))
>>> print(mat)
[[2 2 0 2 0]
 [2 1 0 0 2]
 [2 1 0 1 0]
 [0 1 2 0 2]
 [0 1 2 2 1]]
```

With the help of the method `csr_matrix()`, we can easily obtain the sparse matrix.

```python
>>> smat = csr_matrix(mat)
>>> print(smat)
  (0, 0)    2
  (0, 1)    2
  (0, 3)    2
  (1, 0)    2
  (1, 1)    1
  (1, 4)    2
  (2, 0)    2
  (2, 1)    1
  (2, 3)    1
  (3, 1)    1
  (3, 2)    2
  (3, 4)    2
  (4, 1)    1
  (4, 2)    2
  (4, 3)    2
  (4, 4)    1
```

We can think of a `dictionary` in Python, which is `key`-`value` paired. The above output is something like a `dictionary` where the keys are the index locations (row, column) and the values are the actual non-zero elements.

In order to implement it from scratch, we can follow the Algorithmic approach that I have explained previously.

```python
class SparseMatrix():
    def __init__(self, arr):
        self.arr = arr

    def retain_sparsity(self, to_dict=False):
        sparse_mat = [
            [rindx, cindx, val]
            for (rindx, row) in enumerate(self.arr)
            for (cindx, val) in enumerate(row)
            if (val != 0)
        ]
        if to_dict:
            sparse_mat = {(r, c): v for (r, c, v) in sparse_mat}
        return sparse_mat
```

The above class `SparseMatrix()` has a method `retain_sparsity`, wherein the default argument `to_dict` is `False`. The argument `to_dict` controls whether to return the output in the form of a `dictionary` or not.

**Object Creation**

`sparse = SparseMatrix(arr=mat)`

**Output - List**

```python
>>> smat_c = sparse.retain_sparsity()
>>> print(smat_c)
[[0, 0, 2], [0, 1, 2], [0, 3, 2], [1, 0, 2], [1, 1, 1], [1, 4, 2], [2, 0, 2], [2, 1, 1], [2, 3, 1], [3, 1, 1], [3, 2, 2], [3, 4, 2], [4, 1, 1], [4, 2, 2], [4, 3, 2], [4, 4, 1]]
```

**Output - dictionary**

```python
>>> smat_d = sparse.retain_sparsity(to_dict=True)
>>> print(smat_d)
{(0, 0): 2, (0, 1): 2, (0, 3): 2, (1, 0): 2, (1, 1): 1, (1, 4): 2, (2, 0): 2, (2, 1): 1, (2, 3): 1, (3, 1): 1, (3, 2): 2, (3, 4): 2, (4, 1): 1, (4, 2): 2, (4, 3): 2, (4, 4): 1}
```

When we have space constraints while working with large matrices, it is often preferred to convert the matrix into a sparse representation, which really takes less space compared to the original matrix.

In fact, we can check the space (in bytes) occupied by the original matrix `mat` and the sparse matrices.

```python
>>> from sys import getsizeof
>>> # checking the space of the original matrix
>>> getsizeof(mat)
156
>>> # checking the space of the scipy sparse matrix
>>> getsizeof(smat)
24
>>> # checking the space of the custom sparse matrix
>>> getsizeof(smat_c)
92
```

What we can observe is that the `scipy` method consumes less space than our custom method. That is because `scipy` is a well-optimized, well-developed library mainly used for various scientific computations. It is always better to use library methods than our own code to achieve faster results with fewer space constraints.

**Buy Me Coffee**

If you have liked my article you can buy some coffee and support me here. That would motivate me to write and learn more about what I know.

If you have read my previous articles on matrix operations, by now you would already know what a matrix is. Yes, a matrix is a `2D` representation of an array with `M` rows and `N` columns. The shape of a matrix is generally referred to as its dimension. Thus, any typical matrix is represented or assumed to have (`M` x `N`) dimensions.

**Credits of Cover Image** - Photo by Carl Nenzen Loven on Unsplash

- **Row Matrix** - A collection of identical elements or objects stored in `1` row and `N` columns.
- **Column Matrix** - A collection of identical elements or objects stored in `N` rows and `1` column.

**Note** - Matrices of shapes (`1` x `N`) and (`N` x `1`) are generally called a row vector and a column vector respectively.

For instance, let's assume we have two matrices `A` and `B`. The general rule before multiplying is that the number of **COLUMNS** of `A` should be exactly equal to the number of **ROWS** of `B`. If this rule is satisfied, then -

We shall compute the **dot product** of each row of `A` with each column of `B`. This process continues until there are no elements left to compute.

- Each row of `A` is considered to be a row vector.
- Each column of `B` is considered to be a column vector.
- **Dot Product** - An algebraic operation computed on two equal-sized vectors which results in a single number. It is also called a scalar product. Mathematically, we represent it in the form of -

$$r.c = \sum_{i=1}^{n} r_i c_i \rightarrow \text{for} \ i = \text{1 to n}$$

The size of the resultant matrix (after the operation) would be the number of **ROWS** of `A` by the number of **COLUMNS** of `B`.

**Note** - The computation can be risky or slow when we are dealing with large matrices. This can be easily handled by `NumPy`.

**GIF by Author**

If we try to break down the whole algorithm, the very first thing that we have to do is to transpose one of the matrices and compute the dot or scalar product for each row of the matrix to each column of the other matrix.

**Matrix Transpose**

```python
def transpose(m):
    trans_mat = [[row[i] for row in m] for i in range(len(m[0]))]
    return trans_mat
```

**Dot Product**

```python
def scalar_product(r, c):
    ps = [i * j for (i, j) in zip(r, c)]
    return sum(ps)
```

**Matrix Multiplication**

```python
def mats_product(m1, m2):
    m2_t = transpose(m=m2)
    mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
    return mats_p
```

To wrap all the functions together, we can do the following -

**Wrap Up**

```python
def easy_product(m1, m2):
    def transpose(m):
        trans_mat = [[row[i] for row in m] for i in range(len(m[0]))]
        return trans_mat

    def scalar_product(r, c):
        ps = [i * j for (i, j) in zip(r, c)]
        return sum(ps)

    def mats_product(m1, m2):
        m2_t = transpose(m=m2)
        mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
        return mats_p

    return mats_product(m1, m2)
```

The above `easy_product()` can still be optimized by using the built-in methods of Python. The method `transpose()` in particular can be improved.

**Matrix Transpose**

```python
transpose = lambda m: list(map(list, zip(*m)))
```

**Dot Product**

The above `scalar_product()` can be reduced and maintained like -

```python
scalar_product = lambda r, c: sum([i * j for (i, j) in zip(r, c)])
```

**Matrix Multiplication**

```python
def mats_product(m1, m2):
    m2_t = transpose(m=m2)
    mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
    return mats_p
```

To wrap all the functions together, we can do the following -

**Wrap Up**

```python
def optimized_product(m1, m2):
    transpose = lambda m: list(map(list, zip(*m)))
    scalar_product = lambda r, c: sum([i * j for (i, j) in zip(r, c)])

    def mats_product(m1, m2):
        m2_t = transpose(m=m2)
        mats_p = [[scalar_product(r=r, c=c) for c in m2_t] for r in m1]
        return mats_p

    return mats_product(m1, m2)
```

Awesome! Both functions are ready to be tested. In order to test them, we need to have matrices defined. We will create a function for random matrices, which will also be helpful to check the speed of both functions.

**Random matrices creation**

```python
import random

def create_matrix(rcount, ccount):
    random.seed(10)
    m = [[random.randint(10, 80) for i in range(ccount)] for j in range(rcount)]
    return m
```

```python
>>> nr = 2
>>> nc = 3
>>>
>>> m1 = create_matrix(nr, nc)
>>> m2 = create_matrix(nc, nr)
>>> m1
[[14, 64, 71], [11, 36, 69]]
>>> m2
[[14, 64], [71, 11], [36, 69]]
```

**Normal Function**

```python
>>> mm = easy_product(m1, m2)
>>> print(mm)
[[7296, 6499], [5194, 5861]]
```

**Improved Code**

```python
>>> mm = optimized_product(m1, m2)
>>> print(mm)
[[7296, 6499], [5194, 5861]]
```

Now, both functions seem to work well enough. But it is also important to check each algorithm's performance in terms of computation speed.

For this, we will run both the functions in a loop for a defined set of matrix shapes and store the required amount of time that each takes. We shall also plot the same to represent it visually.

**Performance Check**

```python
import time
from matplotlib import pyplot as plt

def check_speed():
    shapes = [(3, 2), (5, 4), (10, 9), (20, 5), (30, 8), (40, 2), (50, 10),
              (2, 10), (5, 80), (100, 34), (200, 100), (300, 200)]
    x = [str(i) for i in shapes]
    y1 = []
    y2 = []
    for sp in shapes:
        m1 = create_matrix(sp[0], sp[1])
        m2 = create_matrix(sp[1], sp[0])
        start_e = time.time()
        res_easy = easy_product(m1, m2)
        end_e = time.time()
        easy_elapse = end_e - start_e
        start_o = time.time()
        res_opt = optimized_product(m1, m2)
        end_o = time.time()
        opt_elapse = end_o - start_o
        y1.append(easy_elapse)
        y2.append(opt_elapse)
    plt.figure(figsize=(15, 6))
    plt.plot(x, y1, 'o-', label='Normal Function')
    plt.plot(x, y2, 'o-', label='Optimized Code')
    plt.legend()
    plt.show()
    return None
```

**Performance Graph**

Both algorithms seem to perform similarly except for a few cases. But there is a catch - we wrote these functions from scratch ourselves. If we introduce `NumPy` to do the same, the performance would be much better. The below graph includes the performance of `NumPy` as well.

Yes! NumPy is much faster. NumPy consumes very little time to compute the same operation no matter what the sizes are.

- Breaking down a given problem into sub-problems and approaching each one to ultimately solve the original problem.
- Comparing algorithms and tracking their computation time.


In order to understand what an outlier is, we can think of a great example that explains the concept in a nutshell. Imagine the summer season is going on, and you are chilling with your friends. Everyone in your group is wearing normal clothes (light clothes that people usually wear in the summer season) except for one who is like a weirdo. That single person is wearing a sweater. Now clearly, that person is totally different, not like the rest, and certainly doesn't fit in your group.

The explanation can be clearly understood by seeing the cover image of this article.

**Credits of Cover Image** - Photo by Will Myers on Unsplash

Wait a second. Let's say I have numerical data - how can I be a detective and catch hold of an outlier in my data? Obviously, the data will not look like the cover image. (dang it)

In that case what should I do? (double dang it)

Yes, what you are saying is absolutely logical. In order to detect an outlier in numerical data (or any data, for that matter), we can use various statistical methods.

**Definition of Outlier** - Statistical term

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate the experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

There are 3 easy ways to detect the outliers within the data.

- Visual Exploration
  - Scatter Plot
  - Box Plot
- Z-score Method
- IQR (Inter-Quartile Range)

**Note** - The readers of this article are assumed to know basic statistical measures and how to calculate them. If yes, you can proceed further. Otherwise, you may want to refer to this article.

Outliers can occur due to wrong data entry or typos in the original dataset. Now that we know the different ways of detecting them, we shall understand each with an example.

Let's have data values say -

$$x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]$$

and

$$y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]$$

**Necessary Imports**
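The original imports snippet did not survive this export; a minimal set of imports (plus the `x` and `y` lists from the formulas above, restated here) that the examples in this section might assume is:

```python
import numpy as np
from matplotlib import pyplot as plt

# the sample data defined above; 80 and 85 are the suspected outliers
x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]
```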

By representing the data in the form of a graph, one can easily identify the variability or the difference in data patterns that may indicate an outlier. However, the visual representation is less precise than the other two methods.

**Scatter Plot**

Scatter Plot basically plots or visualizes the data along the coordinate axes based on the scale range mentioned.

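The embedded scatter-plot snippet is not preserved; a small sketch of what it could look like with `matplotlib` (the `x` and `y` lists are the sample data defined above):

```python
from matplotlib import pyplot as plt

x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]

# plot each value against its index; the outliers sit far away from the main cluster
plt.scatter(range(len(x)), x, label='x')
plt.scatter(range(len(y)), y, label='y')
plt.legend()
plt.show()
```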

**Box Plot**

A box plot basically visualizes the statistical measures of the data. Outliers are automatically shown beyond the `min` and `max` lines. This is more robust than the scatter plot in detecting outliers. The explanation of the box plot is below.

**Credits** - The above image is taken from the Internet.

For `x`, the point (`80`) lying outside the `max` line is an outlier.

For `y`, the point (`85`) lying outside the `max` line is an outlier.
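The embedded box-plot snippets are not preserved either; a sketch that draws both box plots side by side with `matplotlib` might be:

```python
from matplotlib import pyplot as plt

x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]

# points beyond the whiskers (the min/max lines) are drawn as individual fliers
plt.boxplot([x, y])
plt.xticks([1, 2], ['x', 'y'])
plt.show()
```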

Dang you (back)

The Z-score method is used to normalize (standard normalize) the data. The normalized data is also said to follow the standard normal distribution. One important property is that the mean is always `0` and the standard deviation is always `1`.

**Formula**

Let

$$X = [x_1, x_2, x_3, \dots, x_n]$$

and

$$\mu_x \rightarrow \text{Mean}$$

$$\sigma_x \rightarrow \text{Standard Deviation}$$

then

$$z_x = \frac{(x_i - \mu_x)}{\sigma_x}$$

When we compute the above, it generates a sequence of normalized (standardized) values.

- If any value is greater than `+3`, it is an outlier.
- Similarly, if any value is less than `-3`, it is an outlier.

**Code**
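The embedded snippet is not preserved here; a plain-Python sketch of the z-score computation and outlier extraction (restating the sample `x` and `y` lists; the original may have used NumPy instead) could look like:

```python
def z_score_outliers(data, threshold=3):
    # standardize each point: z = (x - mean) / std
    n = len(data)
    mu = sum(data) / n
    sigma = (sum((i - mu) ** 2 for i in data) / n) ** 0.5
    z_scores = [(i - mu) / sigma for i in data]
    # any point whose |z| exceeds the threshold is flagged via its index
    outlier_idx = [idx for (idx, z) in enumerate(z_scores) if abs(z) > threshold]
    return [data[idx] for idx in outlier_idx]

x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]

print(z_score_outliers(x))  # [80]
print(z_score_outliers(y))  # [85]
```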

The above code first computes the sequence of normalized values and then returns the outliers from the original data by getting their indices.

**Testing**

The above result clearly states that `80` and `85` are the outliers of `x` and `y` respectively.

Double dang you (back)

IQR is almost self-explanatory by just looking at the box plot chart. The formula for IQR is

$$IQR = Q_3 - Q_1$$

where

$$Q_1 \rightarrow \text{First Quartile}$$

$$Q_3 \rightarrow \text{Third Quartile}$$

and

$$Q_2 \rightarrow \text{Median}$$


Once we compute the IQR, we need to find the bounds, i.e., the upper bound and the lower bound. We use a constant (`1.5`) to get the bounds.

- The data point which is less than the lower bound is an outlier.
- The data point which is greater than the upper bound is an outlier.

**Code**
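The embedded snippets are not preserved; a sketch that computes the IQR, the bounds, and the outliers in one go (using `numpy.percentile` for the quartiles, which the original may not have used) might be:

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    # first and third quartiles via linear interpolation
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    # bounds computed with the conventional constant k = 1.5
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [i for i in data if (i < lower) or (i > upper)]

x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]

print(iqr_outliers(x))  # [80]
print(iqr_outliers(y))  # [85]
```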

**Testing**

The above result clearly states that `80` and `85` are the outliers of `x` and `y` respectively.

Triple dang you (back)

Hence, the questions are answered. You have now successfully become the detective who can detect the outliers in the data.


Statistics is a branch of mathematics (applied mathematics) concerned mainly with the analysis of data. It includes the collection, analysis, interpretation, and presentation of huge numerical data. There are a variety of numerical measures such as `mean`, `median`, `mode`, `percentiles`, `variance`, and `standard deviation` to summarize the data.

We shall understand the meaning of each measure briefly and programmatically implement the same using Python.

**Credits of Cover Image** - Photo by Crissy Jarvis on Unsplash

Mean is generally referred to as the average value of a given set of values. It is the central value or central tendency of a finite set of numbers (data). Mean is often misinterpreted as the median (which is the middle value), but it is more convenient to take the mean as the measure of central tendency.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and the mean of this data is given as

$$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$

**Code**
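The per-section snippet did not survive the export; a minimal sketch consistent with the formula above (the wrap-up class at the end of the article collects the same logic):

```python
def compute_mean(data):
    # arithmetic mean: sum of the values divided by their count
    return sum(data) / len(data)

print(compute_mean([10, 20, 30, 40]))  # 25.0
```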

**Limitation**

- The mean gets affected when outliers are introduced into the data.

Median is generally referred to as the middle data point of the given data. In other words, it is the value that separates the data into a lower half and a higher half. The only constraint to compute the median is that the data needs to be in sorted order.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data), sorted in order, and the median of the data is given as (considering the index starting from `0`) -

- If `n` is odd, then

$$Med(X) = X\bigg[\frac{n}{2}\bigg]$$

- If `n` is even, then

$$Med(X) = \frac{(X[\frac{n}{2}-1] + X[\frac{n}{2}])}{2}$$

**Code**
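A sketch consistent with the odd/even formula above, since the embedded snippet is missing (the wrap-up class at the end has the same logic):

```python
def compute_median(data):
    data = sorted(data)          # the median requires sorted data
    n = len(data)
    mid = n // 2
    if n % 2 != 0:               # odd count: take the middle value
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # even count: average the two middle values

print(compute_median([7, 1, 3]))     # 3
print(compute_median([7, 1, 3, 5]))  # 4.0
```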

**Pro**

- The median does not get affected when outliers are introduced into the data.

Mode is generally referred to as the number that appears most frequently in the given dataset. There is a formula to get the modal value, which is more applicable to data represented in class intervals. Other than that, we can easily compute the mode by counting the occurrences of each data point.

**Code**
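A sketch of the counting approach described above, since the embedded snippet is missing (returning `None` when no value repeats, and the smallest value on ties, are assumptions of this sketch):

```python
from collections import Counter

def compute_mode(data):
    counts = Counter(data)
    max_freq = max(counts.values())
    if max_freq == 1:   # every value occurs exactly once: no mode
        return None
    # if several values share the highest frequency, return the smallest
    return min(v for (v, f) in counts.items() if f == max_freq)

print(compute_mode([4, 1, 4, 2, 4]))  # 4
```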

Standard deviation is generally referred to as the overall dispersion of the data points with respect to the mean of the data. Here, dispersion is simply a distance that is measured. Distances are always positive - we never find a distance measured in negative. But the raw deviation of some points can be negative; in order to avoid that, we apply a mathematical hack that can be observed in the formula (squaring each deviation).

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and the standard deviation of the data is given as

$$\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}}$$

**Code**
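A sketch consistent with the (population) formula above, since the embedded snippet is missing:

```python
import math

def compute_stddev(data):
    mu = sum(data) / len(data)
    # mean of the squared dispersions, then the square root
    variance = sum((i - mu) ** 2 for i in data) / len(data)
    return math.sqrt(variance)

print(compute_stddev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```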

**Limitation**

- Standard deviation gets affected when outliers are introduced into the data.

Variance is generally referred to as the square of the standard deviation. If the data follows the standard normal distribution, then the standard deviation and the variance are both equal to `1`.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and variance of the data is given as

$$Var(X) = \sigma^2$$

**Code**
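A sketch consistent with the formula above, since the embedded snippet is missing - the variance is just the mean squared dispersion (the square of the standard deviation):

```python
def compute_variance(data):
    mu = sum(data) / len(data)
    # mean of the squared dispersions
    return sum((i - mu) ** 2 for i in data) / len(data)

print(compute_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```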

**Limitation**

- Variance gets affected when outliers are introduced into the data.

A percentile is generally referred to as the score value below which a given percentage of scores falls in the frequency distribution. The median is also equal to the `50`th percentile of the given data distribution.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and location percentile can be computed by

$$l_p = \bigg[(n - 1) \frac{p}{100}\bigg] + 1$$

Separate the integer part and the fractional part of the location percentile value, take the previous data value (at index `integer part - 1`) and the current data value (at index `integer part`) with the help of indexing, and compute the percentile value by

$$p_v = X[prev] + [\text{floating part of }l_p * (X[curr] - X[prev])]$$

**Code**
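A sketch following the two-step location-percentile formula above, since the embedded snippet is missing:

```python
def compute_percentile(p, data):
    data = sorted(data)
    if p == 100:
        return data[-1]
    # location of the percentile, 1-indexed: l_p = (n - 1) * p / 100 + 1
    l_p = (len(data) - 1) * (p / 100) + 1
    int_part = int(l_p)
    frac_part = l_p - int_part
    prev_val = data[int_part - 1]
    curr_val = data[int_part]
    # interpolate between the neighbouring data values
    return prev_val + frac_part * (curr_val - prev_val)

print(compute_percentile(50, [1, 2, 3, 4, 5]))  # 3.0
```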

**Limitation**

- The data needs to be in sorted (ascending) order to be able to compute the percentile efficiently.

Median Absolute Deviation (MAD) is generally referred to as the median of the absolute dispersions from each data point to the median of the data itself. Standard deviation is computed with respect to the mean value, whereas the median absolute deviation is computed with respect to the median value.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

be the finite set of numbers (data) and MAD of the data is given as

$$MAD(X) = Med(|x_i - Med(X)|)$$

**Code**
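A sketch following the plain MAD formula above, since the embedded snippet is missing. Note that the wrap-up class at the end additionally divides by a consistency constant (`0.6745`), which this sketch deliberately does not:

```python
def compute_median(data):
    data = sorted(data)
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 != 0 else (data[mid - 1] + data[mid]) / 2

def compute_mad(data):
    # median of the absolute deviations from the median
    med = compute_median(data)
    return compute_median([abs(i - med) for i in data])

print(compute_mad([1, 2, 3, 4, 5]))  # 1
```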

Covariance is generally referred to as a measure of the relationship between two random variables `X` and `Y`. This measure evaluates to what extent the variables change together. In other words, it is essentially a measure of the variance between two variables.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

and

$$Y = [y_1, y_2, y_3, y_4, \dots, y_n]$$

be two finite sets of numbers (data), and the covariance of `X` and `Y` is given as

$$\text{Cov(X, Y)} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)$$

**Code**
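A sketch consistent with the covariance formula above, since the embedded snippet is missing:

```python
def compute_covariance(X, Y):
    if len(X) != len(Y):
        return None
    mean_x = sum(X) / len(X)
    mean_y = sum(Y) / len(Y)
    # mean of the products of paired deviations
    return sum((x - mean_x) * (y - mean_y) for (x, y) in zip(X, Y)) / len(X)

# Cov(X, X) equals Var(X), one of the properties listed below
print(compute_covariance([1, 2, 3], [1, 2, 3]))
```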

**Types**

- `Positive Covariance` - Indicates that the two variables tend to move in the same direction.
- `Negative Covariance` - Indicates that the two variables tend to move in inverse directions.

**Properties**

- Cov(X, Y) = Cov(Y, X)
- Cov(X, X) = Var(X)

Correlation is generally referred to as a measure of the strength of the relationship between two finite sets (data). Correlation is the scaled measure of covariance. The correlation value always lies in the range `-1` to `+1`.

**Formula**

Let

$$X = [x_1, x_2, x_3, x_4, \dots, x_n]$$

and

$$Y = [y_1, y_2, y_3, y_4, \dots, y_n]$$

be two finite sets of numbers (data), and the correlation of `X` and `Y` is given as

$$\text{Corr(X, Y)} = \frac{\text{Cov(X, Y)}}{\sigma_X \sigma_Y}$$

**Code**
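A sketch consistent with the correlation formula above, since the embedded snippet is missing (it recomputes the covariance and standard deviations inline to stay self-contained):

```python
import math

def compute_correlation(X, Y):
    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(Y) / n
    cov = sum((x - mean_x) * (y - mean_y) for (x, y) in zip(X, Y)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in X) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / n)
    # correlation is the covariance scaled by both standard deviations
    return cov / (std_x * std_y)

print(compute_correlation([1, 2, 3], [2, 4, 6]))  # ≈ 1.0 for a perfectly linear relation
```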

**Types**

- `Positive Correlation` - Indicates that the two variables have a strong relationship and tend to move in the same direction. The value is close to `+1`.
- `Zero Correlation` - Indicates that there is no relationship between the two variables. The value is close to `0`.
- `Negative Correlation` - Indicates that the two variables have a strong relationship and tend to move in inverse directions. The value is close to `-1`.

```python
import math
from collections import Counter

class StatsBasics():
    def compute_mean(self, data):
        mean_val = sum(data) / len(data)
        return mean_val

    def compute_median(self, data):
        data = sorted(data)
        n = len(data)
        mid_idx = n // 2
        if (n % 2 != 0):
            return data[mid_idx]
        return (data[mid_idx - 1] + data[mid_idx]) / 2

    def compute_mode(self, data):
        datac = Counter(data)
        max_freq = max(list(datac.values()))
        if (max_freq == 1):
            return "Mode doesn't exist"
        modals = [i for (i, j) in datac.items() if (j == max_freq)]
        return min(modals)

    def compute_stddev(self, data):
        mean_val = self.compute_mean(data=data)
        dispersions = [(i - mean_val)**2 for i in data]
        dispersion_mean = self.compute_mean(data=dispersions)
        return math.sqrt(dispersion_mean)

    def compute_variance(self, data):
        stddev = self.compute_stddev(data=data)
        return stddev**2

    def compute_percentile(self, p, data):
        data = sorted(data)
        if (p == 100):
            return data[-1]
        l_p = (len(data) - 1) * (p / 100) + 1
        int_l_p = int(l_p)
        fl_l_p = l_p - int_l_p
        val1 = data[int_l_p - 1]
        val2 = data[int_l_p]
        pval = val1 + (fl_l_p * (val2 - val1))
        return round(pval, 2)

    def compute_mad(self, data, c=0.6745):
        median_val = self.compute_median(data=data)
        abs_std = [abs(i - median_val) for i in data]
        mad = self.compute_median(data=abs_std) / c
        return round(mad, 2)

    def compute_covariance(self, X, Y):
        if (len(X) != len(Y)):
            return None
        mean_x = self.compute_mean(data=X)
        mean_y = self.compute_mean(data=Y)
        covals = [(x - mean_x) * (y - mean_y) for (x, y) in zip(X, Y)]
        covar_val = self.compute_mean(data=covals)
        return covar_val

    def compute_correlation(self, X, Y):
        covar_val = self.compute_covariance(X=X, Y=Y)
        std_X = self.compute_stddev(data=X)
        std_Y = self.compute_stddev(data=Y)
        corr_val = covar_val / (std_X * std_Y)
        return corr_val
```

This article is part of the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

Identifying patterns helps us understand more about the data - gaining insights by observing trends. It also helps in finding the relationship between two sets of values.

From a business point of view, observing trends really helps in tracking overall sales and returns. Provided the data is stored year-wise, we can easily plot it and identify a trend pattern - upward, constant, or downward. And thus, decisions are made based on facts.

**Google Trends** is one of the best applications available on the web to see the trend graph for a `search term`.

**Credits of Cover Image** - Photo by Clark Van Der Beken on Unsplash

Depending on the data, we can gain insights either by just looking at the tabular data (assuming it is in tabular format) or by plotting it.

Personally, I find it always better to visualize the data in order to properly identify the patterns.

Let's try to identify patterns in one dataset.

**Note** - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

For the hands-on practice, we will use a dataset in the domain of health care on heart attack possibility.

*Let's get started ...*

We are using Python, and specifically the Pandas library, which was developed with a clear intention to address the key aspects of data analytics and data science problems.

```shell
pip install pandas --user
pip install numpy --user
pip install seaborn --user
pip install matplotlib --user
```

The above installs the current version of packages in the system. Let's import the same.

```python
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
```

In order to load or read the data, we use the `read_csv()` method, where we explicitly tell Pandas that we want to read a `CSV` file.

```python
df = pd.read_csv('heart.csv')
```

Since `df` is a Pandas object, we can call the other methods that are made available to understand the data.

From the last tutorial, we know that `cp` (chest pain), `thalach` (heartbeat rate), `chol` (cholesterol) (not so important), and `slope` happen to be the most important features, along with the `target` variable.

Display the list of columns in the dataset.

```python
>>> df.columns
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
```

Filter the dataset considering the `target` variable.

```python
df_1 = df[df['target'] == 1]  # data of people who are risky
df_0 = df[df['target'] == 0]  # data of people who are safe
```
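The boolean-mask split itself can be seen on a tiny made-up frame (the `toy` frame here is purely illustrative, not the heart dataset):

```python
import pandas as pd

# Made-up rows, just to show the mechanics of the mask
toy = pd.DataFrame({'age': [40, 52, 61, 45], 'target': [1, 0, 1, 1]})

toy_1 = toy[toy['target'] == 1]  # rows flagged as risky
toy_0 = toy[toy['target'] == 0]  # rows flagged as safe
```

The two masks partition the rows: every row lands in exactly one of the two frames.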

The cholesterol is measured in milligrams (mg) and the heartbeat rate is measured per minute. Let's plot the [`chol`, `thalach`] features with respect to `age` for both `df_1` and `df_0`.

**Age group - Risky**

- First group the data (`df_1`) by `age`, considering the columns `chol` and `thalach`.

```python
ag_df_1 = df_1.groupby(by=['age'])[['chol', 'thalach']].sum()
```
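What the grouping produces can be checked on a made-up frame - one row per age, with `chol` and `thalach` summed within each age:

```python
import pandas as pd

# Made-up values, purely to show what groupby(...).sum() returns
toy = pd.DataFrame({
    'age':     [40, 40, 52],
    'chol':    [200, 210, 180],
    'thalach': [150, 140, 120],
})
grouped = toy.groupby(by=['age'])[['chol', 'thalach']].sum()
# two rows (age 40 and 52); the two age-40 rows are summed together
```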

- Plot a `bar` chart for the grouped data (above).

```python
ag_df_1.plot(kind='bar', figsize=(15, 6), title='Risky')
plt.show()
```

The plot looks like this -

**Age group - Safe**

- First group the data (`df_0`) by `age`, considering the columns `chol` and `thalach`.

```python
ag_df_0 = df_0.groupby(by=['age'])[['chol', 'thalach']].sum()
```

- Plot a `bar` chart for the grouped data (above).

```python
ag_df_0.plot(kind='bar', figsize=(15, 6), title='Safe')
plt.show()
```

The plot looks like this -

**Conclusion**

People in the risky group have high heartbeat rates compared with those in the safe group.

Also, the cholesterol measurements are higher in the risky group.

One can easily infer that an optimal heartbeat rate is a goal for heart patients in order to stay healthy - this is clearly visible from the graphs.

Let's plot the [`chol`, `thalach`] features with respect to `cp` (chest pain) for both `df_1` and `df_0`.

**CP group - Risky**

- First group the data (`df_1`) by `cp`, considering the columns `chol` and `thalach`.

```python
cp_df_1 = df_1.groupby(by=['cp'])[['chol', 'thalach']].sum()
```

- Plot a `pie` chart for the grouped data (above).

```python
cp_df_1.plot(kind='pie', figsize=(15, 6), subplots=True, title='Risky')
plt.show()
```

The plot looks like this -

**CP group - Safe**

- First group the data (`df_0`) by `cp`, considering the columns `chol` and `thalach`.

```python
cp_df_0 = df_0.groupby(by=['cp'])[['chol', 'thalach']].sum()
```

- Plot a `pie` chart for the grouped data (above).

```python
cp_df_0.plot(kind='pie', figsize=(15, 6), subplots=True, title='Safe')
plt.show()
```

The plot looks like this -

**Conclusion**

From the above two graphs, we can observe that people in the safe category have lesser degrees of chest pain than the risky ones.

It is also significant that `cp` and `thalach` are both equally important with respect to the `age` of the patient.

**`age` and `cp`**

A scatter plot helps us identify an upward or downward trend between two variables. It is just another form of correlation, represented graphically. Based on the two datasets `df_1` and `df_0`, let's compare `cp` with respect to `age`.

```python
plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['cp'], label='Risky')
plt.scatter(df_0['age'], df_0['cp'], label='Safe')
plt.xlabel('age')
plt.ylabel('chest pain')
plt.legend()
plt.show()
```

**Conclusion**

From the plot, we can see that most of the safe-category people have a lesser degree of chest pain than the risky people.

In the safe category, even at higher ages the degree of chest pain is mostly low, though of course there are a few with a high degree.

**`age` and `thalach`**

Now, let's compare `thalach` with respect to `age` for both `df_1` and `df_0`.

```python
plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['thalach'], label='Risky')
plt.scatter(df_0['age'], df_0['thalach'], label='Safe')
plt.xlabel('age')
plt.ylabel('thalach')
plt.legend()
plt.show()
```

**Conclusion**

If we just observe the safe (orange) category, the heartbeat level is fairly constant with respect to age.

The same is not true of the risky (blue) category.

We can also infer that people in the 29 to 60 age group have higher heartbeat rates, which - combined with higher degrees of chest pain - can lead to an attack.

This article is part of the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

Identifying or selecting the important features is one of the crucial processes in data science and data analytics. After all, the features that are selected decide the quality of the data. It is exactly like the phrase **Garbage in, Garbage out** - meaning, whatever features we consider, the result of the whole analytical process depends on them. It is also said that data analysts spend at least `60` to `70` percent of their time preparing the data.

**Credits of Cover Image** - Photo by Brian Lundquist on Unsplash

Data Preprocessing is an important phase of the whole analytical process, where relevant fields are extracted or queried from the gathered data. It is in this step that data analysts try their best to retain or increase the quality of the data, which makes the subsequent steps easier. During the process of data gathering, it is quite common to end up collecting **messy data**, and therefore it has to be preprocessed right after it is collected. It includes steps like -

- **Data Cleaning** - In this process, unnecessary data values which do not add importance are removed. Mostly, it includes missing values that should be cleaned or filled.
- **Data Editing** - In this process, the original data is edited or changed in order to maintain uniformity. For example, if the dataset has an `age` column, it is important to have all the age values in numeric form, and there can be chances of getting non-numeric data. In that case, it is always encouraged to edit the data.
- **Data Wrangling** - In this process, the raw data is transformed or manipulated into a format that is easy to use and readily available for data analytics. Oftentimes, it is also called data manipulation.

**Note** - The above steps are explained in a nutshell. Moreover, the overall complexity (to maintain data quality) depends on the dataset that is collected.

Correlation is a statistical measurement that tells how one feature affects the target variable. It returns a value describing the strength of the relationship. The correlation value lies between `-1` and `+1`.

- `-1` describes that the feature and the target variable are negatively correlated, meaning one increases when the other decreases and vice-versa.
- `0` describes that there is no correlation.
- `+1` describes that the feature and the target variable are positively correlated, meaning one increases when the other increases and vice-versa.
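The three cases can be demonstrated directly with NumPy; the arrays below are made up purely for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

pos = np.corrcoef(x, 2 * x + 1)[0, 1]                     # moves together     -> +1
zero = np.corrcoef(x, np.array([1, -1, 1, -1, 1]))[0, 1]  # no linear relation ->  0
neg = np.corrcoef(x, -3 * x + 10)[0, 1]                   # moves inversely    -> -1
```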

Correlation is really helpful in knowing those features that actually affect the target variable.

Let's try to identify which features are important in one dataset.

**Note** - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

For the hands-on practice, we will use a dataset in the domain of health care on heart attack possibility.

*Let's get started ...*

We are using Python, and specifically the Pandas library, which was developed with a clear intention to address the key aspects of data analytics and data science problems.

```shell
pip install pandas --user
pip install numpy --user
pip install seaborn --user
pip install matplotlib --user
```

The above installs the current version of packages in the system. Let's import the same.

```python
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
```

In order to load or read the data, we use the `read_csv()` method, where we explicitly tell Pandas that we want to read a `CSV` file.

```python
df = pd.read_csv('heart.csv')
```

Since `df` is a Pandas object, we can call the other methods that are made available to understand the data.

```python
df.head()
```

`head()` is used to display the first five rows when we don't explicitly mention how many rows we want. However, if you want to display a different number of rows, you can pass a number like `head(<any_number>)`.

```
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
```

We have a column called `target` that has two unique values.

- `0` indicates fewer chances of getting a heart attack
- `1` indicates more chances of getting a heart attack

**`target` column**

To count the frequency of the data, we will use the `value_counts()` method.

```python
ha_df = df['target'].value_counts().to_frame()
```

The above converts the counts into a data frame.

```
   target
----------
1     165
0     138
```
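The same counting pattern, sketched on a made-up series of 0/1 labels:

```python
import pandas as pd

# Made-up labels, just to show value_counts() and to_frame()
s = pd.Series([1, 1, 0, 1, 0])
vc = s.value_counts()   # frequency of each unique value
vc_df = vc.to_frame()   # same counts, as a DataFrame (handy for plotting)
```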

Visually, we can represent the above counts as a `pie` chart.

```python
ha_df.plot(kind='pie', figsize=(10, 6), subplots=True)
plt.show()
```

**`sex` ratio**

From the original data frame `df`, we have a column called `sex`, which is again numerical, where -

- `1` indicates male
- `0` indicates female

Let's visualize the `pie` chart of the same.

```python
sdf = df['sex'].value_counts().to_frame()
sdf.plot(kind='pie', figsize=(10, 6), subplots=True)
plt.show()
```

**Correlation Plot**

With the help of Pandas, we can easily find correlations considering the entire data frame.

```python
cor_df = df.corr()
```

This will return the correlation of each column with every other column. We can directly visualize the correlation matrix using `seaborn` plots.

```python
plt.figure(figsize=(10, 6))
sns.heatmap(
    data=cor_df,
    vmin=-1, vmax=1,
    center=0,
    cmap='seismic',
    annot=True
)
plt.show()
```

If we carefully observe the correlation matrix, there are in total 3 features that are affecting the `target` variable. They are -

- `cp` - chest pain
- `thalach` - heartbeat rate
- `slope` - maybe

The variable `chol` (cholesterol) has no real relationship. In fact, from the correlation plot, we see that the relationship ratio is `-0.085`. Whereas, `cp` and `thalach` have around `0.43` and `0.42` respectively.
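To rank every feature against the target in one line, here is a sketch on a made-up frame (the identical one-liner works on the heart data frame):

```python
import pandas as pd

# Made-up values; cp is constructed to track target more closely than chol
toy = pd.DataFrame({
    'cp':     [0, 1, 2, 3, 2],
    'chol':   [210, 180, 250, 190, 230],
    'target': [0, 0, 1, 1, 1],
})

# Correlation of every feature with target, strongest first
target_corr = toy.corr()['target'].drop('target').sort_values(ascending=False)
```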

This article is part of the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

In any business problem, data analysts are often asked to find out the patterns and trends from the data in order to identify the strong root cause of problems (why something is happening). Throughout the time, they try to maintain the quality of the data to be precise in finding the true objectives. These objectives (targets) are not formulated from personal opinions or prejudices.

**Credits of Cover Image** - Photo by Silvan Arnet on Unsplash

To first understand the data, one has to be a great observer. By just observing the data, one can know what kind of facts are given and which factors can be considered the most important features. This stage is called **Data Inspection**.

It is in this stage where one acts as **Sherlock Holmes** to understand the data better.

As soon as the data is loaded, to make sure one is following the right track one has to be curious to ask the following questions -

- What is the problem statement?
- What is the structure of the data?
- What are the variables present?
- What is the type of each variable?

- What is the total size of the data?
- Are there any missing or `NaN` values within the data?

Let's try to understand one dataset from the above questions along with the objectives.

**Note** - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

For the hands-on practice, we will use one dataset in the domain of health care on heart attack possibility.

*Let's get started ...*

We are using Python, and specifically the Pandas library, which was developed with a clear intention to address the key aspects of data analytics and data science problems.

```shell
pip install pandas --user
pip install numpy --user
```

The above installs the current version of packages in the system. Let's import the same.

```python
import pandas as pd
import numpy as np
```

The title of the dataset itself gives a brief idea of the problem statement that we need to work on, i.e., predicting the possibility of a heart attack. But, for this article, our goal is to understand the data by answering the above questions.

In order to load or read the data, we use the `read_csv()` method, where we explicitly tell Pandas that we want to read a `CSV` file.

```python
df = pd.read_csv('heart.csv')
```

The dataset that we have just loaded is a `CSV` file, where all the values are separated by the character `,`. By default, the data is read as a `DataFrame`, which is basically tabular data.

```python
df.info()
```

Since we have made an object called `df`, we can call the different methods available on it. One such method is `info()`, which gives the information related to the data frame.

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
```

No doubt that we have got the entire information, structure, columns by just one command. We also got the types of each column which indirectly answers the 3rd question.

Variables are also called features; they are the columns of the data. Each column or variable holds a specific type of data, which can be `int`, `float`, `object`, etc.

```python
df.columns
```

The above gives the list of columns or variables. The column names themselves are read as `object` type.

```
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
```

```python
df.dtypes
```

The above lists out the type of data that each variable or column holds.

```
age         int64
sex         int64
cp          int64
trestbps    int64
chol        int64
fbs         int64
restecg     int64
thalach     int64
exang       int64
oldpeak     float64
slope       int64
ca          int64
thal        int64
target      int64
dtype: object
```

The size of the data can be determined by -

```python
df.shape
```

This gives the total number of rows and columns in the form of a `tuple`.

```
(303, 14)
```

The dataset has in total `303` rows and `14` columns. Now comes the last and final question.

**Are there any missing or `NaN` values within the data?**

Missing values in the data indicate that the value for that column or variable at a particular index is `Null` or `None`. In order to find the same, we have a method called `isnull()`, which can be understood in this way -

is the data frame null?

```python
df.isnull()
```

This, in return, creates a new data frame of `boolean` values representing

- `True`, wherever the value is `Null`
- `False`, wherever the value is `NOT-Null`

```
       age    sex     cp  trestbps   chol    fbs  restecg  thalach  exang  oldpeak  slope     ca   thal  target
0    False  False  False     False  False  False    False    False  False    False  False  False  False   False
1    False  False  False     False  False  False    False    False  False    False  False  False  False   False
2    False  False  False     False  False  False    False    False  False    False  False  False  False   False
..     ...    ...    ...       ...    ...    ...      ...      ...    ...      ...    ...    ...    ...     ...
301  False  False  False     False  False  False    False    False  False    False  False  False  False   False
302  False  False  False     False  False  False    False    False  False    False  False  False  False   False
```

However, looking at the above result, it can be tedious to exactly pinpoint the `True` values signifying `Null` at particular locations. We can achieve it with the help of NumPy's `np.where()` method.

```python
def get_nan_location(dframe):
    nan_locs = {}
    for col in dframe.columns:
        nl = list(np.where(dframe[col].isnull())[0])
        nan_locs[col] = nl
    return nan_locs
```

The above takes one parameter, i.e., `dframe` (the actual data frame), and returns a dictionary where each key is a column and each value is the list of indices where there is a `NaN` in that respective column.

```python
get_nan_location(dframe=df)
#--------------------------
{'age': [], 'sex': [], 'cp': [], 'trestbps': [], 'chol': [],
 'fbs': [], 'restecg': [], 'thalach': [], 'exang': [],
 'oldpeak': [], 'slope': [], 'ca': [], 'thal': [], 'target': []}
```

We can see that all the columns have an empty list, which simply indicates that there are no `NaN` values.

Once this step or stage is over, we should be formulating other questions that mainly depend on the data we are working with.

This article is part of the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

Data is the knowledge gained from a factual basis. It can be related to an object or a person. An explanation obtained from processing the data is called information. Thus, data and information are two different things.

- **Data** - Facts and Figures
- **Information** - Processed data which is understood better

**Credits of Cover Image** - Photo by Hunter Harritt on Unsplash

From the above two, we can get clarity on the term **"Data Analysis"**. Data Analysis is a field where Statistics, Mathematics, and Computer Science combine to process raw data and produce insightful or valuable information. You might have this question - Statistics and Mathematics are fine, but why Computer Science? The knowledge of programming helps in different ways in analyzing data. Some of which are -

- Process Automation
- Handling Large Datasets
- Querying Databases
- Creating Models
- Data Visualization
- Dashboard Development

Of course, we cannot just analyze the given data with a piece of paper and pencil. We need to find one such platform to do all three - Stats, Math, and Programming.

Tools involved in Data Analysis -

- Python
- R
- Julia
- Matlab

**Note** - There are so many languages or tools available. But here, I talk about Python. If you want to know the list then do refer to this article.

To get started in the field of data, learning Python would benefit in many ways. Python has a wide variety of packages that have been developed over the years. From data collection to data modeling, Python has everything set for you.

List of Packages or libraries:

- NumPy
- Pandas
- Statsmodels
- Matplotlib
- OpenCV
- Scikit-Learn
- Pytorch
- Tensorflow
- Plotly
- Py-Spark, etc

These packages have been extensively used for data-related problems. There is no requirement to learn all the packages as long as one is curious enough to understand the problem and implement the method. But the deeper one goes, the deeper the knowledge of these packages must be.

This article is part of the series **Exploratory Data Analysis**, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.

For a while now, I have been working on my basic image processing app, developed in Python using frameworks and libraries like -

- Dash
- Plotly
- Plotly express
- NumPy
- OpenCV
- and some other dependencies

Let me explain how the journey of developing this app began.

**Credits of Cover Image** - Photo by Kelly Sikkema on Unsplash

First, I didn't have any idea or plan to develop an app (that too for image processing). It is when one of my colleagues asked in our common group -

How do I re-mirror the image in the zoom app? By default, it takes the virtual background image that is already mirrored.

At that point, I really wanted to understand how this mirroring feature works in any image application. I already knew the basics of image processing - how to read an image, how to convert an image into grayscale, and so on. I thought, why don't I try this in Python? In fact, mirroring an image is simply reversing each row of the image matrix (images are treated as matrices with rows and columns).

The moment I knew what to do, I searched online. I found resources showing that the `mirroring` operation can easily be implemented using the `PIL` (Pillow) library. I remembered being told that the NumPy library is mainly used for scientific and complex mathematical calculations, so I decided I would use only NumPy, both for the mathematical understanding and for the implementation.

Firstly, I implemented it using a regular `for` loop. I observed that it was taking too much time to compute. Although it worked, I chose the `np.fliplr()` (`flip-left-right`) method instead. I plotted the result after applying the operation, and the image was completely mirrored. I used the same technique for flipping the image up to down, with the `np.flipud()` (`flip-up-down`) method.
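The equivalence between row reversal and NumPy's flip helpers can be seen on a tiny made-up matrix:

```python
import numpy as np

# A made-up 2x3 "image" matrix
img = np.array([[1, 2, 3],
                [4, 5, 6]])

mirrored = np.fliplr(img)  # reverse each row (left-right mirror)
flipped = np.flipud(img)   # reverse the row order (up-down flip)

# fliplr is the vectorized equivalent of the slicing img[:, ::-1]
same_as_slicing = np.array_equal(mirrored, img[:, ::-1])
```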

Likewise, step by step I learned other operations and executed them. Image Flipping and Mirroring with NumPy and OpenCV was the first blog I wrote in the category of image processing.

Image Operations

Image Transformation (Morphological)

- Image Erosion
- Image Dilation
- Other transformations which use the concept of erosion and dilation.

**Note** - I have re-ordered the blogs in my series, in a way that a beginner can also understand from the initial step.

After having developed some image operations, I wanted to compile everything together as an app with a proper UI. I have extensively used `dash` & `plotly` for designing the UI part along with the user-upload component.

**Note** - I won't be sharing the whole code but for sure the code snippets.

**`import`s**

```python
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_daq as daq
import plotly.express as px
import cv2
import numpy as np
import json
import base64
from matplotlib import pyplot as plt
from dash.dependencies import (Input, Output, State)
from image_ops_scratch import (ImageOperations, read_image_string)
from image_morphs_scratch import MorphologicalTransformations
```

- `image_ops_scratch` - the file where all the image operations are defined.
- `image_morphs_scratch` - the file where all the morphological image transformations are defined.

```python
external_stylesheets = [
    'https://codepen.io/chriddyp/pen/bWLwgP.css'
]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
app.config['suppress_callback_exceptions'] = True
app.title = 'Image Processing App'
server = app.server
```

For developing the UI, we shall use the sub-packages of `dash`

namely -

- dash_daq
- dash_core_components
- dash_html_components
- dash.dependencies

Please refer to the official documentation to understand the basics of dash callbacks and other important UI components.

This component is used to upload a local file from the system to the application. When the user clicks on it, a file dialog pops up to select an appropriate file. The selected image file is first read as a `base64`-encoded string, which has to be converted into an image array for the operations to work effectively (the operations only work on `nd` arrays).

```python
html.Div([
    dcc.Upload(
        id='upload-image',
        children=html.Div([
            'Drag and Drop or ',
            html.A('Select Files')
        ]),
        style={
            'width': '100%',
            'height': '70px',
            'lineHeight': '60px',
            'borderWidth': '1px',
            'borderStyle': 'dashed',
            'borderRadius': '5px',
            'textAlign': 'center',
            'margin': '10px',
            'backgroundColor': '#F0F1F1'
        },
        multiple=True
    ),
], style={'paddingTop': 50})
```

```python
def read_image_string(contents):
    encoded_data = contents[0].split(',')[1]
    nparr = np.frombuffer(base64.b64decode(encoded_data), np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img
```
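The split-and-decode step of `read_image_string()` can be sanity-checked without Dash or OpenCV; the payload bytes here are made up:

```python
import base64
import numpy as np

# Fabricate an upload-style data URI from 12 known bytes
raw = bytes(range(12))
contents = 'data:image/png;base64,' + base64.b64encode(raw).decode()

# Mirrors the split-and-decode step inside read_image_string()
encoded_data = contents.split(',')[1]
decoded = np.frombuffer(base64.b64decode(encoded_data), np.uint8)
```

The round trip recovers exactly the original bytes, which is what `cv2.imdecode` then turns into an image array.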

When the toggle switch is turned on, the image shall be converted directly to grayscale. By default, the image is read in RGB mode.

```python
html.Div([
    daq.ToggleSwitch(
        id='image-mode',
        size=60,
        label='Gray Scale',
        labelPosition='top',
        color='#717171',
        value=False,
    )
], style={'paddingTop': 30, 'paddingBottom': 10})
```

To represent each operation we will use radio buttons. When the user selects any particular operation, the output of the image changes and the same is displayed in the app.

```python
image_ops = ['None', 'Equalize', 'Flip', 'Mirror', 'Binarize', 'Invert', 'Solarize']

html.Div([
    dcc.RadioItems(
        id='in-operation',
        options=[{'label': op, 'value': op.lower()} for op in image_ops],
        value='none'
    ),
], className='select-operation')

########################

image_morphs = ['None', 'Erode', 'Dilate', 'Open', 'Close', 'Gradient', 'Boundary Extraction']

html.Div([
    html.P('Morph level - '),
    dcc.Input(id='morph-level', type='number', placeholder='Enter Morph Level - ', value=3),
    html.Div([
        dcc.RadioItems(
            id='in-transformation',
            options=[{'label': tr, 'value': tr.lower()} for tr in image_morphs],
            value='none'
        ),
    ], style={'paddingTop': 20})
], className='select-operation')
```

The app should have two tabs so as to separate image operations and image transformations. Also, in the transformations tab, the image needs to be converted into binary as they only work on binary images. Similarly, in the operations tab, we shall place the toggle switch.

```python
tab_style = {
    'borderBottom': '1px solid #d6d6d6',
    'padding': '10px',
    'fontWeight': 'bold'
}
tab_selected_style = {
    'borderTop': '5px solid #d6d6d6',
    'borderBottom': '3px solid #d6d6d6',
    'backgroundColor': '#7E8483',
    'color': 'white',
    'padding': '6px'
}

html.Div([
    dcc.Tabs(
        id='image-processors-tabs',
        value='operators',
        children=[
            dcc.Tab(
                label='Operations',
                value='operators',
                style=tab_style,
                selected_style=tab_selected_style,
                children=[
                    # toggle switch
                    # operations - radio buttons
                ]
            ),
            dcc.Tab(
                label='Transformations',
                value='transformers',
                style=tab_style,
                selected_style=tab_selected_style,
                children=[
                    # transformations - radio buttons
                ]
            )
        ]
    )
], className='tab-div')
```

The output `div` should be the same for both operations and transformations. It has to be reserved in advance with a unique `id`, with which it can be updated based on the `tab` that is selected.

```python
html.Div(
    id='result-in-out-image',
    className='flex-item-right'
)
```

In dash, we mainly have `callbacks`, based on which the output is displayed or updated. Each callback output, identified by a unique `id`, can take any number of `Inputs` or `States`. But each `callback` takes only one output, and a given output `id` can be used exactly once.

The callback chart for this app is as follows -

```python
def parse_contents(contents, filename, date):
    image_mat = read_image_string(contents=contents)
    return image_mat
```

Here -

- contents is the data in the file that is selected
- filename points to the name of the file that is selected
- date points to when the file was last modified

We have one output `div` that is reserved. Based on the `tab` selected, the app should change its nature. Hence, the following code.

```python
@app.callback(
    Output('result-in-out-image', 'children'),
    [Input('image-processors-tabs', 'value')]
)
def set_output_layout(which_tab):
    if which_tab == 'operators':
        in_out_image_div = html.Div([
            html.Div(
                children=[
                    html.H5('Image Used - Output'),
                    html.Div(id='output-image-op'),
                ],
                style={'textAlign': 'center', 'paddingTop': 50}
            )
        ])
    elif which_tab == 'transformers':
        in_out_image_div = html.Div([
            html.Div(
                children=[
                    html.H5('Image Used - Output'),
                    html.Div(id='output-image-morph'),
                ],
                style={'textAlign': 'center', 'paddingTop': 50}
            )
        ])
    return in_out_image_div
```

In this, we have -

- 1 Output
  - `output-image-op` pointing to the change in children (HTML collection)
- 3 Inputs
  - `upload-image` pointing to the contents of the file
  - `image-mode` pointing to the mode (RGB or grayscale by toggle switch)
  - `in-operation` pointing to the input operation that is selected
- 2 States
  - `upload-image` pointing to the state of the file name
  - `upload-image` pointing to the state of the date (when the file was last modified)

```python
@app.callback(
    Output('output-image-op', 'children'),
    [
        Input('upload-image', 'contents'),
        Input('image-mode', 'value'),
        Input('in-operation', 'value'),
        # -------
        State('upload-image', 'filename'),
        State('upload-image', 'last_modified'),
    ]
)
def get_operated_image(contents, image_mode, operation, filenames, dates):
    if contents is not None:
        imsrc = parse_contents(contents, filenames, dates)
        imo = ImageOperations(image_file_src=imsrc)
        if (operation == 'equalize'):
            out_img = imo.equalize_this(gray_scale=True) if image_mode else imo.equalize_this()
        elif (operation == 'flip'):
            out_img = imo.flip_this(gray_scale=True) if image_mode else imo.flip_this()
        elif (operation == 'mirror'):
            out_img = imo.mirror_this(gray_scale=True) if image_mode else imo.mirror_this()
        elif (operation == 'binarize'):
            out_img = imo.binarize_this(gray_scale=True) if image_mode else imo.binarize_this()
        elif (operation == 'invert'):
            out_img = imo.invert_this(gray_scale=True) if image_mode else imo.invert_this()
        elif (operation == 'solarize'):
            out_img = imo.solarize_this(gray_scale=True) if image_mode else imo.solarize_this()
        else:
            out_img = imo.read_this(gray_scale=True) if image_mode else imo.read_this()

        out_image_fig = px.imshow(out_img, color_continuous_scale='gray') if image_mode else px.imshow(out_img)
        out_image_fig.update_layout(
            coloraxis_showscale=False,
            width=600, height=400,
            margin=dict(l=0, r=0, b=0, t=0)
        )
        out_image_fig.update_xaxes(showticklabels=False)
        out_image_fig.update_yaxes(showticklabels=False)

        output_result = html.Div([
            dcc.Graph(id='out-op-img', figure=out_image_fig)
        ], style={'paddingTop': 50})
        return output_result
```

In this, we have -

- 1 Output
    - `output-image-morph`: pointing to the change in children (HTML collection)
- 3 Inputs
    - `upload-image`: pointing to the contents of the file
    - `morph-level`: pointing to the input value with which the level of the structuring element is decided
    - `in-transformation`: pointing to the input transformation that is selected
- 2 States
    - `upload-image`: pointing to the state of the file name
    - `upload-image`: pointing to the state of the date (when the file was last modified)

```python
@app.callback(
    Output('output-image-morph', 'children'),
    [
        Input('upload-image', 'contents'),
        Input('morph-level', 'value'),
        Input('in-transformation', 'value'),
        # -------
        State('upload-image', 'filename'),
        State('upload-image', 'last_modified'),
    ])
def get_transformed_image(contents, level, transformation, filenames, dates):
    if contents is not None:
        imsrc = parse_contents(contents, filenames, dates)
        # default the level before constructing the object
        level = 3 if level is None else level
        morph = MorphologicalTransformations(image_file_src=imsrc, level=level)
        image_src = morph.read_this()
        if (transformation == 'erode'):
            out_img = morph.erode_image(image_src=image_src)
        elif (transformation == 'dilate'):
            out_img = morph.dilate_image(image_src=image_src)
        elif (transformation == 'open'):
            out_img = morph.open_image(image_src=image_src)
        elif (transformation == 'close'):
            out_img = morph.close_image(image_src=image_src)
        elif (transformation == 'gradient'):
            out_img = morph.morph_gradient(image_src=image_src)
        elif (transformation == 'boundary extraction'):
            out_img = morph.extract_boundary(image_src=image_src)
        else:
            out_img = image_src
        out_image_fig = px.imshow(out_img, color_continuous_scale='gray')
        out_image_fig.update_layout(
            coloraxis_showscale=False,
            width=600, height=400,
            margin=dict(l=0, r=0, b=0, t=0)
        )
        out_image_fig.update_xaxes(showticklabels=False)
        out_image_fig.update_yaxes(showticklabels=False)
        output_result = html.Div([
            dcc.Graph(id='out-morph-img', figure=out_image_fig)
        ], style={'paddingTop': 50})
        return output_result
```

**Note** - The above code snippets belong to one file, i.e., `app.py`. The actual image processing code is not disclosed here; one can look it up in my GitHub repository.

The fun part was the deployment. Although I have experience in deploying Python web apps on Heroku, I faced slight issues (which helped me learn) while deploying this app. I organized all the code and checked whether any changes needed to be made.

```
- image-app/
    - assets/
        - custom_style.css
    - images/
        - lena_original.png
        - pinktree.jpg
        - scenary.jpg
    - .gitignore
    - app.py
    - image_morphs_scratch.py
    - image_ops_scratch.py
    - Procfile
    - requirements.txt
```

I specified all the requirements in a separate `requirements.txt` file. The errors occurred during the installation of `OpenCV` on Heroku: I was supposed to specify `opencv-contrib-python-headless` instead of `opencv-contrib-python` to get the package installed properly.

```
dash
plotly
dash-core-components
dash-html-components
dash-daq
numpy
matplotlib
pandas
opencv-contrib-python-headless
gunicorn
```

`web: gunicorn app:server`

Everything is set now. We can easily follow these steps and deploy the app so that it is accessible to everyone.

By developing this app, I got to learn some image processing concepts along with the mathematics behind them. I feel happy to finally be able to deploy this app on Heroku. I know the app is sort of basic, and I will be adding other interesting features to it in the future. I would like to see your feedback or suggestions in the comments.

- GitHub link https://github.com/msameeruddin/image-app
- App link https://process-image-app.herokuapp.com/

If you have liked this project, do drop a star, and buy coffee for me from here. We will meet in the next blog.

**Credits of Cover Image** - Photo by Vlado Paunovic on Unsplash

A matrix is a 2D representation of an array, with M rows and N columns. An array is a collection of elements of the same type stored in 1 row and N columns. There are many mathematical operations and properties that can be applied to a matrix. One such operation is the transpose operation. Transposing a matrix is easy: just convert rows into columns and vice-versa.

Let's try to code this in 3 different ways and compare the performance.

In every matrix, the length of rows should be equal and the same follows for the columns.

We require two `for` loops:

In the first loop, we consider the size of any row. This will decide the column size for our transposed matrix.

In the second loop, we consider each row in the original matrix and extract every item order-wise.

```python
def transpose_matrix_1(matrix):
    # take the first row length
    row_len = len(matrix[0])
    trans_mat = []
    for i in range(row_len):
        trans_row = []
        for row in matrix:
            # extract every item from each row order-wise
            trans_row.append(row[i])
        trans_mat.append(trans_row)
    return trans_mat
```

```python
>>> matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
>>> t_matrix = transpose_matrix_1(matrix=matrix)
>>> print(t_matrix)
[[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```

We can improve the above function by implementing list comprehension. List comprehension is easy to implement and, performance-wise, it is way faster than normal `for` loops.

```python
def transpose_matrix_2(matrix):
    row_len = len(matrix[0])
    trans_mat = [[row[i] for row in matrix] for i in range(row_len)]
    return trans_mat
```

```python
>>> matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
>>> t_matrix = transpose_matrix_2(matrix=matrix)
>>> print(t_matrix)
[[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```

A lambda function is a small anonymous function that takes n arguments and returns only one output.

`map()` is used to map a particular function over a sequence of objects, which we could otherwise achieve using a `for` loop. But here, we don't require any. `zip()` is used to zip every item from every row sequentially. It is basically used to club things together.

```python
# zip all the items sequentially and convert it as a list
transpose_matrix_3 = lambda matrix: list(map(list, zip(*matrix)))
```

```python
>>> matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
>>> t_matrix = transpose_matrix_3(matrix=matrix)
>>> print(t_matrix)
[[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```

We shall compare each function's performance based on how much time it takes to complete the task. For this, we will rely on the `timeit` module.

**Note**: We will mostly use the square matrix of different sizes. However, we can tweak the dimension values.

```python
import random
from timeit import timeit
import matplotlib.pyplot as plt
from matplotlib import style

style.use('seaborn')

def time_convertor(val, time_type):
    # convert the %timeit reading into seconds
    if time_type == 'µs':
        val = val / 1000000
    elif time_type == 'ms':
        val = val / 1000
    else:
        val = val
    return val

def compare_performance():
    r = [3, 5, 10, 50, 100, 500, 1000]
    X = []; Y1 = []; Y2 = []; Y3 = []
    for i in range(len(r)):
        matrix = [[random.randint(1, 20) for j in range(r[i])] for k in range(r[i])]
        x_val = "{} x {}".format(r[i], r[i])
        X.append(x_val)
        y_val_1 = %timeit -o transpose_matrix_1(matrix=matrix)
        y_val_2 = %timeit -o transpose_matrix_2(matrix=matrix)
        y_val_3 = %timeit -o transpose_matrix_3(matrix=matrix)
        print('---------')
        y_val_1 = str(y_val_1).split(' ', 2)
        y_val_2 = str(y_val_2).split(' ', 2)
        y_val_3 = str(y_val_3).split(' ', 2)
        y_val_1 = time_convertor(val=float(y_val_1[0]), time_type=y_val_1[1])
        y_val_2 = time_convertor(val=float(y_val_2[0]), time_type=y_val_2[1])
        y_val_3 = time_convertor(val=float(y_val_3[0]), time_type=y_val_3[1])
        Y1.append(y_val_1)
        Y2.append(y_val_2)
        Y3.append(y_val_3)
    plt.figure(figsize=(15, 8))
    plt.plot(X, Y1, '-o', label='Normal function')
    plt.plot(X, Y2, '-o', label='List comprehension')
    plt.plot(X, Y3, '-o', label='Lambda function')
    plt.xlabel('Matrix dimension')
    plt.ylabel('Function performance')
    plt.legend()
    plt.show()
    return True
```

```
2.97 µs ± 81.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.6 µs ± 85.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.49 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
---------
4.65 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.35 µs ± 84.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.85 µs ± 90.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
---------
13.9 µs ± 582 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.82 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.13 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
---------
267 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
144 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.7 µs ± 863 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
---------
1.02 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
527 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
113 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
---------
33.1 ms ± 815 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.9 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.58 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
---------
120 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
73.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
20 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

The time complexity is the same for our first and second functions: `O(M*N)`. But when we compare them, the list comprehension is always faster than a regular `for` loop.

In the other case, where we used the lambda function, it is quite a bit faster than the list comprehension. All I had to do was simply use the built-in functions, which are generally faster than user-defined functions.
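Outside a notebook, the same comparison can be sketched with the plain `timeit` module instead of the `%timeit` magic (a minimal sketch; the matrix and iteration count here are illustrative):

```python
import timeit

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

def transpose_loop(m):
    # list-comprehension version
    return [[row[i] for row in m] for i in range(len(m[0]))]

def transpose_zip(m):
    # built-in zip/map version
    return list(map(list, zip(*m)))

# both produce the same transpose
assert transpose_loop(matrix) == transpose_zip(matrix)

t_loop = timeit.timeit(lambda: transpose_loop(matrix), number=100000)
t_zip = timeit.timeit(lambda: transpose_zip(matrix), number=100000)
print(t_loop, t_zip)
```

On larger matrices, the `zip`-based version tends to pull further ahead, matching the trend in the measurements above.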

The word intersection, in mathematics, refers to the common (smaller) object shared between two different objects. Intuitively, the intersection of objects is whatever belongs to all of them.

**Credits of Cover Image** - Photo by Benjamin Elliott on Unsplash

Geometrically speaking, if we have two distinct lines (assuming these lines are two objects), the intersection of these two lines would be the point where both lines meet. Well, in the case of parallel lines, the intersection doesn't exist. Geographically, the common junction between two or more roads can be taken as the area or region of intersection.

In Set theory, the intersection of two objects such as `A` and `B` is defined as the set of elements contained in both `A` and `B`. Symbolically, we represent the intersection as -

We also read the above symbol as `AND`. Programmatically, it becomes very easy to code or to find the intersection between two sets.
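For plain Python sets, for instance, the intersection is a one-liner (the values below are just illustrative):

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

# both spellings compute the intersection
print(A & B)               # {3, 4}
print(A.intersection(B))   # {3, 4}
```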

Using NumPy, we can find the intersection of two matrices in two different ways. In fact, there are in-built methods with which we can easily find it.

Let's first define the two matrices of the same size.

```python
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> 
>>> a = np.random.randint(low=0, high=2, size=(5, 5))
>>> b = np.random.randint(low=0, high=2, size=(5, 5))
>>> 
>>> print(a)
[[1 0 1 1 0]
 [1 0 0 1 0]
 [1 0 1 1 0]
 [1 0 1 0 1]
 [0 1 0 0 0]]
>>> 
>>> print(b)
[[0 0 1 1 0]
 [0 0 0 0 0]
 [1 0 0 1 1]
 [0 0 1 1 0]
 [0 1 0 0 0]]
```

`&`

In mathematics, intersection (`&`) is often called `and`. When we apply the same operator on `a` and `b`, the broadcasting technique automatically does the element-wise calculation.

```python
>>> c = a & b
>>> print(c)
[[0 0 1 1 0]
 [0 0 0 0 0]
 [1 0 0 1 0]
 [0 0 1 0 0]
 [0 1 0 0 0]]
```

The result `c` is the intersection of the matrices `a` and `b`.

`np.where()`

`np.where()` is a special method that acts like a ternary operator on NumPy arrays. By default, it takes 3 parameters -

- `condition`: the condition, which is either true or false (a boolean value) at certain positions of an array.
- `x`: the value that gets placed at the positions of the array where the condition is true.
- `y`: the value that gets placed at the positions of the array where the condition is false.
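As a quick illustration of these three parameters (the array and threshold below are arbitrary):

```python
import numpy as np

arr = np.array([1, 5, 2, 8])
# keep values greater than 3, replace the rest with 0
out = np.where(arr > 3, arr, 0)
print(out)  # [0 5 0 8]
```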

Let's write a function to get the intersection of the matrices.

```python
def intersect_matrices(mat1, mat2):
    if not (mat1.shape == mat2.shape):
        return False
    mat_intersect = np.where((mat1 == mat2), mat1, 0)
    return mat_intersect
```

If we call the above function with our matrices `a` and `b`, we get -

```python
>>> c = intersect_matrices(mat1=a, mat2=b)
>>> print(c)
[[0 0 1 1 0]
 [0 0 0 0 0]
 [1 0 0 1 0]
 [0 0 1 0 0]
 [0 1 0 0 0]]
```

If we visualize the above results, we can get -

In the plot, white patches represent the true values and black patches represent the false values. Particularly, in subplot `c`, white patches represent the intersection values of sets `a` and `b`.

**Note**: The above implementation is done for matrices in binary format. When we have values other than `0` and `1`, it is better to go with the second procedure, where we replace the values with `0` at the positions where the condition is false.

For values other than `0` and `1`, we have a plot something like the one below.
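To sketch that second procedure on non-binary matrices (the values below are made up for illustration), the same `np.where` condition carries over unchanged:

```python
import numpy as np

def intersect_matrices(mat1, mat2):
    # keep a value only where both matrices agree, else 0
    if mat1.shape != mat2.shape:
        return False
    return np.where(mat1 == mat2, mat1, 0)

a = np.array([[4, 7], [2, 9]])
b = np.array([[4, 1], [2, 9]])
c = intersect_matrices(mat1=a, mat2=b)
print(c)  # [[4 0]
          #  [2 9]]
```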

**End**

In this article, we will explore the other transformations for which image erosion and image dilation serve as the base. In the previous articles on morphological transformations, we learned two important transformations, namely **erosion** and **dilation**.

The transformations that are built on top of these two are -

- Opening
- Closing
- Morphological Gradient
- Top hat
- Black hat
- Boundary Extraction
- Hit Miss Transformation

We have seen a step-by-step implementation of erosion and dilation explaining the convolution method with simple matrix operations. In all of these transformations, we rely on the binary input image and a structuring element or kernel. The structuring element needs to be a square matrix which is again a binary matrix.

**Note**: If you are not familiar with **erosion** and **dilation**, I would recommend referring to my previous articles.

- Image Erosion https://bit.ly/3gcojco
- Image Dilation https://bit.ly/2TlOncd

For now, we will consider -

- **A**: the input image matrix
- **B**: the kernel matrix
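A structuring element like **B** can be created with NumPy's `np.full` (the 3 x 3 shape and 255 fill value here are illustrative, matching the binary convention used throughout):

```python
import numpy as np

# a 3 x 3 structuring element filled with the white pixel value
kernel = np.full(shape=(3, 3), fill_value=255)
print(kernel.shape)  # (3, 3)
```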

**Credits of Cover Image** - Photo by Suzanne D. Williams on Unsplash

We know erosion and dilation are quite opposite to each other. Opening is just another name for erosion followed by dilation. Mathematically, we can represent it as -

$$(A \circ B) \rightarrow (1)$$

If we further break it down, we can represent it as -

$$(A \circ B) = (A \ominus B) \oplus B \rightarrow (2)$$

**Eq (1)** is represented in terms of erosion and dilation, the same can be seen in **Eq (2)**.

This transformation is helpful in removing the noise from the image.

The Closing transformation is quite the opposite of the Opening transformation. Closing is just another name for dilation followed by erosion. Mathematically, we can represent it as -

$$(A \bullet B) \rightarrow (3)$$

If we further break it down, we can represent it as -

$$(A \bullet B) = (A \oplus B) \ominus B \rightarrow (4)$$

**Eq (3)** is represented in terms of dilation and erosion, the same can be seen in **Eq (4)**.

This transformation is helpful in closing the holes in the foreground object of the image.

The morphological gradient can be easily obtained once we have the eroded and dilated images. It is the difference between the dilated image and the eroded image. Mathematically, we can represent it as -

$$(A \oplus B) - (A \ominus B)$$

The resultant of this transformation appears to be an outline of the foreground object.
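The three transformations above can be sketched on a toy binary matrix using a minimal 3 x 3 erosion and dilation (a 0/1 matrix here instead of 0/255, purely for readability; the square is illustrative):

```python
import numpy as np

def erode(a):
    # 3 x 3 all-ones structuring element on a 0/1 matrix
    p = np.pad(a, 1)
    h, w = a.shape
    return np.array([[int(p[i:i + 3, j:j + 3].all()) for j in range(w)] for i in range(h)])

def dilate(a):
    p = np.pad(a, 1)
    h, w = a.shape
    return np.array([[int(p[i:i + 3, j:j + 3].any()) for j in range(w)] for i in range(h)])

a = np.zeros((7, 7), dtype=int)
a[1:6, 1:6] = 1                      # a 5 x 5 white square

opening = dilate(erode(a))           # erosion followed by dilation
closing = erode(dilate(a))           # dilation followed by erosion
gradient = dilate(a) - erode(a)      # outline of the square
```

For this clean, noise-free square, opening recovers the original shape exactly, which is precisely why it removes isolated noise pixels without distorting large objects.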

Top Hat transformation is the difference between the input image and the opening of the image. Mathematically, we can represent it as -

$$A - (A \circ B) \rightarrow (5)$$

If we further break it down, we can represent it as -

$$A - [\ (A \ominus B) \oplus B \ ] \rightarrow (6)$$

**Eq (5)** is represented in terms of erosion and dilation, the same can be seen in **Eq (6)**.

Black Hat transformation is the difference between the closing of the input image and the input image. Mathematically, we can represent it as -

$$(A \bullet B) - A \rightarrow (7)$$

If we further break it down, we can represent it as -

$$[\ (A \oplus B) \ominus B \ ] - A \rightarrow (8)$$

**Eq (7)** is represented in terms of dilation and erosion, the same can be seen in **Eq (8)**.

Boundary extraction is one of the applications of morphological transformations. In simple words, it is the difference between the input image and the eroded image. Mathematically, we can represent it as -

$$A - (A \ominus B)$$

The boundary of the foreground object is represented in white, and the rest is in black. If we do the reverse process, i.e., take the difference between the input image and the dilated image, the boundary will be in black and the rest will be in white.
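A minimal sketch of boundary extraction on a toy matrix (0/1 values and a 3 x 3 kernel, chosen for illustration):

```python
import numpy as np

def erode(a):
    # 3 x 3 all-ones structuring element on a 0/1 matrix
    p = np.pad(a, 1)
    h, w = a.shape
    return np.array([[int(p[i:i + 3, j:j + 3].all()) for j in range(w)] for i in range(h)])

a = np.zeros((7, 7), dtype=int)
a[1:6, 1:6] = 1              # a 5 x 5 white square

boundary = a - erode(a)      # keeps only the outer ring of the square
```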

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV

`import` the packages -

```python
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

For the entire implementation, I will be using the fingerprint image; you can use the same for practice.

```python
class MorphologicalTransformations(object):
    def __init__(self, image_file, level):
        self.level = 3 if level < 3 else level
        self.image_file = image_file
        self.MAX_PIXEL = 255
        self.MIN_PIXEL = 0
        self.MID_PIXEL = self.MAX_PIXEL // 2
        self.kernel = np.full(shape=(level, level), fill_value=255)

    def read_this(self):
        image_src = cv2.imread(self.image_file, 0)
        return image_src

    def convert_binary(self, image_src, thresh_val):
        color_1 = self.MAX_PIXEL
        color_2 = self.MIN_PIXEL
        initial_conv = np.where((image_src <= thresh_val), image_src, color_1)
        final_conv = np.where((initial_conv > thresh_val), initial_conv, color_2)
        return final_conv

    def binarize_this(self):
        image_src = self.read_this()
        image_b = self.convert_binary(image_src=image_src, thresh_val=self.MID_PIXEL)
        return image_b

    def get_flat_submatrices(self, image_src, h_reduce, w_reduce):
        image_shape = image_src.shape
        flat_submats = np.array([
            image_src[i:(i + self.level), j:(j + self.level)]
            for i in range(image_shape[0] - h_reduce) for j in range(image_shape[1] - w_reduce)
        ])
        return flat_submats

    def erode_image(self, image_src, with_plot=False):
        orig_shape = image_src.shape
        pad_width = self.level - 2
        image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
        pimg_shape = image_pad.shape
        h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])
        flat_submats = self.get_flat_submatrices(
            image_src=image_pad, h_reduce=h_reduce, w_reduce=w_reduce
        )
        image_eroded = np.array([255 if (i == self.kernel).all() else 0 for i in flat_submats])
        image_eroded = image_eroded.reshape(orig_shape)
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_eroded, head_text='Eroded - {}'.format(self.level))
            return None
        return image_eroded

    def dilate_image(self, image_src, with_plot=False):
        orig_shape = image_src.shape
        pad_width = self.level - 2
        image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
        pimg_shape = image_pad.shape
        h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])
        flat_submats = self.get_flat_submatrices(
            image_src=image_pad, h_reduce=h_reduce, w_reduce=w_reduce
        )
        image_dilated = np.array([255 if (i == self.kernel).any() else 0 for i in flat_submats])
        image_dilated = image_dilated.reshape(orig_shape)
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_dilated, head_text='Dilated - {}'.format(self.level))
            return None
        return image_dilated

    def open_image(self, image_src, with_plot=False):
        image_eroded = self.erode_image(image_src=image_src)
        image_opening = self.dilate_image(image_src=image_eroded)
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_opening, head_text='Opening - {}'.format(self.level))
            return None
        return image_opening

    def close_image(self, image_src, with_plot=False):
        image_dilated = self.dilate_image(image_src=image_src)
        image_closing = self.erode_image(image_src=image_dilated)
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_closing, head_text='Closing - {}'.format(self.level))
            return None
        return image_closing

    def morph_gradient(self, image_src, with_plot=False):
        image_dilated = self.dilate_image(image_src=image_src)
        image_eroded = self.erode_image(image_src=image_src)
        image_grad = image_dilated - image_eroded
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_grad, head_text='Gradient Morph - {}'.format(self.level))
            return None
        return image_grad

    def extract_boundary(self, image_src, with_plot=False):
        image_eroded = self.erode_image(image_src=image_src)
        ext_bound = image_src - image_eroded
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=ext_bound, head_text='Boundary - {}'.format(self.level))
            return None
        return ext_bound

    def get_tophat(self, image_src, with_plot=False):
        image_opening = self.open_image(image_src=image_src)
        image_tophat = image_src - image_opening
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_tophat, head_text='Top Hat - {}'.format(self.level))
            return None
        return image_tophat

    def get_blackhat(self, image_src, with_plot=False):
        image_closing = self.close_image(image_src=image_src)
        image_blackhat = image_closing - image_src
        if with_plot:
            self.plot_it(orig_matrix=image_src, trans_matrix=image_blackhat, head_text='Black Hat - {}'.format(self.level))
            return None
        return image_blackhat

    def plot_it(self, orig_matrix, trans_matrix, head_text):
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        cmap_val = 'gray'
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text(head_text)
        ax1.imshow(orig_matrix, cmap=cmap_val)
        ax2.imshow(trans_matrix, cmap=cmap_val)
        plt.show()
        return True
```

```python
morph = MorphologicalTransformations(
    image_file='Finger-arch.jpg',
    level=3
)
image_src = morph.binarize_this()
```

In the above code, we have made an object called `morph`, and using this object we are converting the input image into a binary image.

`morph.erode_image(image_src=image_src, with_plot=True)`

`morph.dilate_image(image_src=image_src, with_plot=True)`

`morph.open_image(image_src=image_src, with_plot=True)`

`morph.close_image(image_src=image_src, with_plot=True)`

`morph.morph_gradient(image_src=image_src, with_plot=True)`

`morph.get_tophat(image_src=image_src, with_plot=True)`

`morph.get_blackhat(image_src=image_src, with_plot=True)`

`morph.extract_boundary(image_src=image_src, with_plot=True)`

These are the results of all the transformations. We obtained these with the help of erosion and dilation.

**Note**: All of the code is written from scratch using the NumPy library.

In this article, we have covered almost all the types of morphological transformations. However, there are some advanced transformations, like the **Hit Miss Transformation**, which I didn't cover here but definitely will next time.

When we have speed and compatibility constraints, I would recommend using the official library methods. This was just an exercise to understand the inner workings and the mathematics behind each algorithm.

**End**

In this article, we will explore the mathematics behind the image dilation operation. Like image erosion, image dilation is another important morphological operation, used to increase or expand the shapes contained in the input image. Think of this as diluting the image: diluting anything requires water, and here we need a structuring element or kernel instead.

**Credits of Cover Image** - Photo by Jason Leung on Unsplash

**Note**: We are not expanding or increasing the image size. We are increasing the pixel strength and the size remains the same.

Mathematically, we can represent this operation in the following way -

$$A \oplus B$$

where -

- `A`: the input image, which is binarized
- `B`: the structuring element or kernel

The resultant of the above formula is the dilated image. The technique that we apply here is the 2D convolution for the input image with respect to the kernel. The kernel is basically a **square matrix**.

A typical binary image consists of only **1**s (255) and **0**s. The kernel, which may or may not be a subset of the input image, is again in binary form. To think of this mathematically in terms of matrices, we have **A**, the matrix of the input image, and **B**, the matrix of the structuring element. The following conditions are to be applied for the convolution -

- Know the size of **B** to pad **A** with 0's. The padding width is `(kernel_size - 2)`.
- Position the center element of **B** at every element of the original input image (matrix) iteratively.
- Extract the submatrix that is exactly equal to the size of **B**.
- Check if any (at least one) element from the submatrix is equal to the element in **B**, considering the index location.
    - If yes, replace the element of **A** with 1 or 255.
    - 0, otherwise.
- Continue this process for all the elements of **A**.

**Note**: We shall start from the first element of **A** till the last element of **A**. The GIF can be seen below to understand the convolution technique more clearly.

Imagine the matrix **A** is a binary image -

similarly, the matrix **B** (kernel) -

Since the kernel size is `(3 x 3)`, we need to pad **A** by width (3 - 2), which is 1.
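The zero-padding step can be seen with `np.pad` directly (a tiny 2 x 2 matrix for illustration):

```python
import numpy as np

A = np.array([[1, 0],
              [0, 1]])
# pad a zero border of width 1 around the matrix
padded = np.pad(A, pad_width=1, mode='constant')
print(padded.shape)  # (4, 4)
```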

To extract the submatrices, we can position the center element of **B** at every element of **A**; breaking it down, we can obtain a giant matrix where every element is a `(3 x 3)` submatrix of **A**.

From these submatrices, we can easily map the kernel **B** and obtain a new value for every element of **A**.

The resultant image considering the input image **A** -

Notice the difference, we have expanded the pixels of the original image with the level of 5. The black pixels did get reduced.

Let's code this totally from scratch using NumPy.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV only used for reading the image (in this article).

```python
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

Since we do the morphological transformations on binary images, we shall make sure whatever image we read is binarized. Therefore, we have the following functions.

```python
def read_this(image_file):
    image_src = cv2.imread(image_file, 0)
    return image_src

def convert_binary(image_src, thresh_val):
    color_1 = 255
    color_2 = 0
    initial_conv = np.where((image_src <= thresh_val), image_src, color_1)
    final_conv = np.where((initial_conv > thresh_val), initial_conv, color_2)
    return final_conv

def binarize_this(image_file, thresh_val=127):
    image_src = read_this(image_file=image_file)
    image_b = convert_binary(image_src=image_src, thresh_val=thresh_val)
    return image_b
```
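To trace the two `np.where` passes inside `convert_binary` on a tiny array (the values and the threshold of 127 are chosen purely for illustration):

```python
import numpy as np

img = np.array([[10, 200],
                [127, 128]])
# pass 1: anything above the threshold becomes 255
step1 = np.where(img <= 127, img, 255)
# pass 2: anything at or below the threshold becomes 0
binary = np.where(step1 > 127, step1, 0)
print(binary)  # [[  0 255]
               #  [  0 255]]
```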

In the dilation function, the main parameters that are passed are:

- **image_file**: the input image on which the dilation operation has to be performed.
- **dilation_level**: to how many levels we have to dilate the image. By default, the value is 3.
- **with_plot**: simply to visualize the result, showing the comparison between the original image and the dilated image.

In the function

- We obtain the kernel matrix based on the `dilation_level`.
- We pad the input image matrix by `(dilation_level - 2)`.
- We obtain the submatrices and replace the new values accordingly, as shown in the GIF.
- Finally, we reshape the newly obtained array into the original size of the input image and plot the same.

```python
def dilate_this(image_file, dilation_level=3, with_plot=False):
    # setting the dilation_level
    dilation_level = 3 if dilation_level < 3 else dilation_level
    # obtain the kernel by the shape of (dilation_level, dilation_level)
    structuring_kernel = np.full(shape=(dilation_level, dilation_level), fill_value=255)
    image_src = binarize_this(image_file=image_file)
    orig_shape = image_src.shape
    pad_width = dilation_level - 2
    # pad the image with pad_width
    image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
    pimg_shape = image_pad.shape
    h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])
    # obtain the submatrices according to the size of the kernel
    flat_submatrices = np.array([
        image_pad[i:(i + dilation_level), j:(j + dilation_level)]
        for i in range(pimg_shape[0] - h_reduce) for j in range(pimg_shape[1] - w_reduce)
    ])
    # replace the values either 255 or 0 by dilation condition
    image_dilate = np.array([255 if (i == structuring_kernel).any() else 0 for i in flat_submatrices])
    # obtain new matrix whose shape is equal to the original image size
    image_dilate = image_dilate.reshape(orig_shape)
    # plotting
    if with_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Dilated - {}".format(dilation_level))
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_dilate, cmap=cmap_val)
        plt.show()
        return True
    return image_dilate
```

Now that the dilation function is ready, all that is left is testing. We will use a different image for testing.

For dilation level 3 -

`dilate_this(image_file='wish.jpg', dilation_level=3, with_plot=True)`

For dilation level 5 -

`dilate_this(image_file='wish.jpg', dilation_level=5, with_plot=True)`

In all the above results, we can notice the increase in the pixels. And this is the example code developed totally from scratch. We can also rely on the `cv2.dilate()` method, which is much faster.

- Morphological Transformations official documentation.

Erosion operation is one of the important morphological operations (morphological transformations) that follows a technique of `mathematical morphology` for the analysis and processing of geometrical structures.

To get a general idea of what erosion has to do with images, we can think of this as an operation in which it tries to reduce the shape that is contained in the input image. It is just like the erosion of soil but just that this operation erodes the boundaries of the foreground object.

**Credits of Cover Image** - Photo by Laura Colquitt on Unsplash

To represent this operation mathematically -

$$A \ominus B$$

where -

- `A`: the input image
- `B`: the structuring element or kernel

The resultant of the above formula gives the eroded image. The structuring element is basically a kernel with which the image matrix is operated on as a `2D` convolution.

**Note** - This blog post covers the erosion process done on binary images. Also, it is often preferred to use binary images for morphological transformation.

As discussed, we only use binary images, which consist of pixels that are either `0` or `1` (`0` or `255` to be more precise). The structuring element or kernel, which may or may not be a subset of the image matrix, is also a binary representation and is mostly a `square matrix`.

Let us consider `A` as the image matrix and `B` as the kernel. We have conditions as follows:

- We have to position the center element of `B` at each element taken iteratively in the image `A`.
- We consider the submatrix of `A` of the size of `B` and check if the submatrix is exactly equivalent to `B`.
- If yes, replace the pixel value with `1` or `255`, otherwise `0`.

We need to do this for all the elements of `A` with `B`.

Imagine the image matrix `A` as -

and the structuring element or kernel `B` as -

The binary image of the matrix `A` would be something like below.

For easy calculation, we shall pad the image by a `pad_width` equal to `(kernel size - 2)`, with which the submatrix can be selected easily. The `GIF` below shows the inner working of the convolution visually.
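The padding and submatrix check described above can be sketched on a tiny 0/255 matrix before we get to the full function (the values are illustrative):

```python
import numpy as np

A = np.array([
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
])
B = np.full((3, 3), 255)      # structuring element

# pad_width = kernel size - 2 = 1
pad = np.pad(A, pad_width=1, mode='constant')
h, w = A.shape
eroded = np.array([
    [255 if (pad[i:i + 3, j:j + 3] == B).all() else 0 for j in range(w)]
    for i in range(h)
])
# only the centre pixel of the 3 x 3 white block survives
```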

Now that we know what to do, let's code the same using the library as well as from scratch.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV - only used for reading the image (in this article).

`import` the packages -

```python
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

Since we do the morphological transformations on binary images, we shall make sure whatever image we read is binarized. Therefore, we have the following function.

```python
def read_this(image_file):
    image_src = cv2.imread(image_file, 0)
    return image_src

def convert_binary(image_src, thresh_val):
    color_1 = 255
    color_2 = 0
    initial_conv = np.where((image_src <= thresh_val), image_src, color_1)
    final_conv = np.where((initial_conv > thresh_val), initial_conv, color_2)
    return final_conv

def binarize_this(image_file, thresh_val=127):
    image_src = read_this(image_file=image_file)
    image_b = convert_binary(image_src=image_src, thresh_val=thresh_val)
    return image_b
```

**Note** - By default we are reading the image in grayscale mode.
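Tracing the two `np.where` calls of `convert_binary()` on a handful of made-up pixel values shows how the binarization works:

```python
import numpy as np

pixels = np.array([10, 120, 127, 128, 200], dtype=np.uint8)
thresh_val = 127

# first pass: anything above the threshold becomes 255
step1 = np.where(pixels <= thresh_val, pixels, 255)
# second pass: anything at or below the threshold becomes 0
binary = np.where(step1 > thresh_val, step1, 0)
print(binary)   # [  0   0   0 255 255]
```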

For this, we will be using a different image, and for the implementation, we will use the method `erode()` available in the module `cv2`. The parameters are as follows:

- **image_file** - the image to which we want to apply the transformation.
- **level** - the erosion level, which decides the size of the `structuring element` or `kernel`.
- **with_plot** - whether to show the results of both the original image and the transformed image.

```python
def erode_lib(image_file, level=3, with_plot=True):
    level = 3 if level < 3 else level
    image_src = binarize_this(image_file=image_file)

    # library method
    image_eroded = cv2.erode(src=image_src, kernel=np.ones((level, level)), iterations=1)

    if with_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Eroded - {}".format(level))
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_eroded, cmap=cmap_val)
        plt.show()
        return True
    return image_eroded
```

Let's test the above function -

`erode_lib(image_file='wish.jpg', level=3, with_plot=True)`

Clearly, we can see that some of the pixels were removed, showing the pixel erosion.

As explained earlier, we need to carefully choose the `pad_width` depending upon the `erosion_level`. We normally take `(kernel size - 2)` or `(erosion_level - 2)`, and here the `kernel` is always a square matrix.

After this, we take the submatrices to position the center element of the `kernel` over each element of the image matrix iteratively, making sure that each submatrix size equals the kernel size. Hence we first pad the matrix with `pad_width`.

Let's code the erosion function from scratch.

```python
def erode_this(image_file, erosion_level=3, with_plot=False):
    erosion_level = 3 if erosion_level < 3 else erosion_level
    structuring_kernel = np.full(shape=(erosion_level, erosion_level), fill_value=255)
    image_src = binarize_this(image_file=image_file)

    orig_shape = image_src.shape
    pad_width = erosion_level - 2

    # pad the matrix with `pad_width`
    image_pad = np.pad(array=image_src, pad_width=pad_width, mode='constant')
    pimg_shape = image_pad.shape
    h_reduce, w_reduce = (pimg_shape[0] - orig_shape[0]), (pimg_shape[1] - orig_shape[1])

    # sub matrices of kernel size
    flat_submatrices = np.array([
        image_pad[i:(i + erosion_level), j:(j + erosion_level)]
        for i in range(pimg_shape[0] - h_reduce) for j in range(pimg_shape[1] - w_reduce)
    ])

    # condition to replace the values - if the kernel equals the submatrix then 255 else 0
    image_erode = np.array([255 if (i == structuring_kernel).all() else 0 for i in flat_submatrices])
    image_erode = image_erode.reshape(orig_shape)

    if with_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Eroded - {}".format(erosion_level))
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_erode, cmap=cmap_val)
        plt.show()
        return True
    return image_erode
```

Let's test the above function -

`erode_this(image_file='wish.jpg', erosion_level=3, with_plot=True)`

Clearly, we can see the reduction in the pixel values showcasing the pixel erosion. We implemented this with a `2D` convolution technique entirely from scratch. In terms of speed, it is a bit slower than the library method.

- Morphological Transformations official documentation.

Image shifting is simply shifting each pixel of the image to a new position. This is a method of pixel shift used in digital cameras to produce super-resolution images. We can think of a pixel as a point in the coordinate axis to be shifted in any direction. When we implement this on all the pixels of the image then we can say the image is shifted.

In this blog article, we will try to shift the image the way we shift a point in the coordinate axis, completely using NumPy operations. The image is always considered a `2D` plane, hence we shall also consider a `2D` coordinate axis having `X` as the horizontal axis and `Y` as the vertical axis.

The coordinate axis is divided into `4` quadrants, namely -

- **Q1** - quadrant where both `X` and `Y` are positive.
- **Q2** - quadrant where `X` is negative and `Y` is positive.
- **Q3** - quadrant where both `X` and `Y` are negative.
- **Q4** - quadrant where `X` is positive and `Y` is negative.

We assume that our original image is at the origin, i.e., (0, 0).

Now, let's say we want to shift the image at coordinates (3, 4). Basically, the origin of the image has to be shifted from (0, 0) to (3, 4).

Likewise, based on the coordinate points, we need to shift the image. Let's try to understand and implement this from scratch using the NumPy module, starting from a `2D` matrix, because images are just large matrices.

**Credits of Cover Image** - Photo by James Lewis on Unsplash

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV (only used for reading the image in this article)

**Import the Packages**

```python
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

**2D Matrix**

We will create a `5 x 5` matrix having random numbers.

```python
>>> import random
>>> 
>>> mat = [[random.randint(5, 100) for i in range(5)] for j in range(5)]
>>> mat = np.matrix(mat)
>>> print(mat)
[[ 46  13  68  54  12]
 [  7  68  32  46  26]
 [ 46  43  58  27 100]
 [ 64  59  76 100  41]
 [ 35  62  56  44   7]]
```

For instance, let us assume that we are shifting the image into `Q1`. The image has to move right along the `X` axis and up along the `Y` axis. The size of the image increases, as we have to pad the image's left side according to the `x` coordinate depth and its bottom side according to the `y` coordinate depth.

In the case of shifting the image in other quadrants, the same process has to be replicated.

Let's create a padding function that can pad the image on all sides depending on the depth that is passed.

```python
def pad_vector(vector, how, depth, constant_value=0):
    vect_shape = vector.shape[:2]

    if (how == 'upper') or (how == 'top'):
        pp = np.full(shape=(depth, vect_shape[1]), fill_value=constant_value)
        pv = np.vstack(tup=(pp, vector))
    elif (how == 'lower') or (how == 'bottom'):
        pp = np.full(shape=(depth, vect_shape[1]), fill_value=constant_value)
        pv = np.vstack(tup=(vector, pp))
    elif (how == 'left'):
        pp = np.full(shape=(vect_shape[0], depth), fill_value=constant_value)
        pv = np.hstack(tup=(pp, vector))
    elif (how == 'right'):
        pp = np.full(shape=(vect_shape[0], depth), fill_value=constant_value)
        pv = np.hstack(tup=(vector, pp))
    else:
        return vector

    return pv
```

The above function is used to pad the image. The arguments used are as follows:

- **vector** - the matrix in which the padding is done.
- **how** - one of four values that decide the side on which the image is padded:
  - lower or bottom
  - upper or top
  - right
  - left
- **depth** - the depth of the padding.
- **constant_value** - the fill value; `0` (`black` color) is the default.

**Note** - For padding right and padding left, we use the NumPy method `hstack()` (horizontal stack). In the same way, for padding top and padding bottom, we use `vstack()` (vertical stack).

Based on the depth, the padding matrix whose values are zeros is created. Similarly, based on the direction, this padding matrix is stacked accordingly with the original matrix.
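The stacking behaviour described above can be checked quickly with toy matrices:

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4]])
pad = np.zeros((2, 2), dtype=int)

# hstack() glues column-wise, vstack() glues row-wise
print(np.hstack((pad, mat)))
# [[0 0 1 2]
#  [0 0 3 4]]
print(np.vstack((mat, pad)))
# [[1 2]
#  [3 4]
#  [0 0]
#  [0 0]]
```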

Let's test the above function.

```python
>>> pmat = pad_vector(vector=mat, how='left', depth=3)
>>> print(pmat)
[[  0   0   0  46  13  68  54  12]
 [  0   0   0   7  68  32  46  26]
 [  0   0   0  46  43  58  27 100]
 [  0   0   0  64  59  76 100  41]
 [  0   0   0  35  62  56  44   7]]
```

We can clearly see that the function padded the left side of the matrix with a depth of `3`. If we were to plot the same (convert the padded matrix into an image), we would get something like the below.

```python
>>> plt.axis("off")
>>> plt.imshow(pmat, cmap='gray')
>>> plt.show()
```

Whereas the original image is -

With this, we can conclude that the image is shifted right along the `X` axis by the `x` coordinate 3. The same technique applies to a real image. Let's try that.

We shall have a function to read the image both in grayscale and RGB format.

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

Let's make another function called `shifter()` which actually shifts the image along the `Y` axis, irrespective of the quadrant.

```python
def shifter(vect, y, y_):
    if (y > 0):
        image_trans = pad_vector(vector=vect, how='lower', depth=y_)
    elif (y < 0):
        image_trans = pad_vector(vector=vect, how='upper', depth=y_)
    else:
        image_trans = vect
    return image_trans
```

Now that we have the `shifter()` function, we will use it in another function that can shift anywhere in the coordinate plane. Here, we consider both the `X` and `Y` axes.

```python
def shift_image(image_src, at):
    x, y = at
    x_, y_ = abs(x), abs(y)

    if (x > 0):
        left_pad = pad_vector(vector=image_src, how='left', depth=x_)
        image_trans = shifter(vect=left_pad, y=y, y_=y_)
    elif (x < 0):
        right_pad = pad_vector(vector=image_src, how='right', depth=x_)
        image_trans = shifter(vect=right_pad, y=y, y_=y_)
    else:
        image_trans = shifter(vect=image_src, y=y, y_=y_)

    return image_trans
```

- When `x` and `y` are both greater than 0, pad the image on the left side and the bottom side.
- When `x` is greater than 0 and `y` is less than 0, pad the image on the left side and the top side.
- When `x` is less than 0 and `y` is greater than 0, pad the image on the right side and the bottom side.
- When `x` and `y` are both less than 0, pad the image on the right side and the top side.
- When `x` and `y` are both exactly 0, do not disturb the image.
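Putting these rules together for a single case: shifting a tiny matrix to (1, 2) pads one column on the left and two rows at the bottom. A standalone sketch using `np.pad` (not the article's `pad_vector()`):

```python
import numpy as np

mat = np.array([[5, 6],
                [7, 8]])

# shift to (1, 2): x > 0 -> 1 column on the left, y > 0 -> 2 rows at the bottom
x, y = 1, 2
shifted = np.pad(mat, pad_width=((0, abs(y)), (abs(x), 0)), mode='constant')
print(shifted)
# [[0 5 6]
#  [0 7 8]
#  [0 0 0]
#  [0 0 0]]
```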

There is one more thing to handle before we can translate or shift the image. An image can be of two types - grayscale and colored. For grayscale, there won't be any problem. But for a colored image, we need to separate the RGB pixels, apply the shift function to each channel, and then finally combine the channels. Hence the below function works for both RGB and grayscale images.

```python
def translate_this(image_file, at, with_plot=False, gray_scale=False):
    if len(at) != 2:
        return False
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)

    if not gray_scale:
        r_image, g_image, b_image = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        r_trans = shift_image(image_src=r_image, at=at)
        g_trans = shift_image(image_src=g_image, at=at)
        b_trans = shift_image(image_src=b_image, at=at)
        image_trans = np.dstack(tup=(r_trans, g_trans, b_trans))
    else:
        image_trans = shift_image(image_src=image_src, at=at)

    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Translated")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_trans, cmap=cmap_val)
        return True
    return image_trans
```

Testing for the regular RGB image.

```python
translate_this(
    image_file='lena_original.png',
    at=(60, 60),
    with_plot=True
)
```

Clearly, the image is shifted to the position (60, 60), i.e., into the first quadrant (Q1).

Now let's test for the regular grayscale image.

```python
translate_this(
    image_file='lena_original.png',
    at=(-60, -60),
    with_plot=True,
    gray_scale=True
)
```

Clearly, the image is shifted to the position (-60, -60) i.e., in the third quadrant (Q3).

Well, that's it for this article. From this, we tried to understand how the image shifting process is done.

Solarization is a process in photography describing the effect of tone reversal observed when photographic film is extremely overexposed in the camera. It is a darkroom technique used to give any photograph a graphic or surrealistic look, and it often occurs when a very intense, grossly overexposed light source appears in the picture. In short, it is a technique where the image is partially or completely reversed (inverted) by extreme overexposure. **Original Source** - here. With this blog, let's understand the same in a more Pythonic way.

**Credits of Cover Image** - Photo by engin akyurt on Unsplash

When we are solarizing an image, we mainly consider a threshold pixel value in order to reverse (invert) the image partially or completely. To know more about inverting an image, you can refer to my article here.
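On `uint8` pixels, inversion is the bitwise NOT `~p`, which equals `255 - p`; solarization applies it only above a threshold. A toy run with made-up pixel values:

```python
import numpy as np

pixels = np.array([10, 100, 130, 250], dtype=np.uint8)
thresh_val = 130

# keep pixels below the threshold, bitwise-invert the rest (~p == 255 - p on uint8)
solarized = np.where(pixels < thresh_val, pixels, ~pixels)
print(solarized)   # [ 10 100 125   5]
```

Bright pixels (130 and 250) flip to dark values, while the darker ones pass through untouched - exactly the partial reversal described above.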

We will rely on the `PIL` library to solarize the image. We will also code the same without the `PIL` module, using just NumPy for the matrix operations.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV
- PIL

**Import the Packages**

```python
import numpy as np
import cv2
import json
from PIL import Image
from PIL import ImageOps as ipo
from matplotlib import pyplot as plt
```

We use the method `solarize()` available in the module `ImageOps` - part of the main package `PIL`. The method takes the following arguments -

- `image` - the image file that needs to be solarized.
- `threshold` - the threshold value used to either partially or completely reverse (invert) the image.

Now that we have all the requirements, let's write the function that solarizes the image.

```python
def solarize_lib(image_file, thresh_val, gray_scale=False):
    image_src = Image.open(image_file)
    image_src = image_src.convert('L') if gray_scale else image_src
    image_sol = ipo.solarize(image_src, threshold=thresh_val)

    cmap_val = None if not gray_scale else 'gray'
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
    ax1.axis("off")
    ax1.title.set_text('Original')
    ax2.axis("off")
    ax2.title.set_text("Solarized")
    ax1.imshow(image_src, cmap=cmap_val)
    ax2.imshow(image_sol, cmap=cmap_val)
    return True
```

The above function solarizes the images that are read both in colored mode or grayscale mode. Let's test the same.

`solarize_lib(image_file='scenary.jpg', thresh_val=130)`

**For RGB image**

**For grayscale image**

Above are the results of the images that are solarized.

For coding from scratch, we are relying on the NumPy module since matrix operations can be achieved with so much ease. We will need to validate for both colored images as well as grayscale images.

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in grayscale or RGB and returns the image matrix.

As above, we will have a `thresh_val` argument based on which the partial or complete reversal (inversion) is done.

```python
def solarize_this(image_file, thresh_val, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)

    if not gray_scale:
        r_image, g_image, b_image = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]

        ## inverting the colored image (partially)
        r_sol = np.where((r_image < thresh_val), r_image, ~r_image)
        g_sol = np.where((g_image < thresh_val), g_image, ~g_image)
        b_sol = np.where((b_image < thresh_val), b_image, ~b_image)
        image_sol = np.dstack(tup=(r_sol, g_sol, b_sol))
    else:
        ## inverting the grayscale image (partially)
        image_sol = np.where((image_src < thresh_val), image_src, ~image_src)

    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Solarized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_sol, cmap=cmap_val)
        return True
    return image_sol
```

We are using NumPy's `where()` method to keep the pixels that are below `thresh_val` and invert those that are not. Let's test the function.

**For RGB image**

```python
solarize_this(
    image_file='scenary.jpg',
    thresh_val=130,
    with_plot=True
)
```

**For grayscale image**

```python
solarize_this(
    image_file='lena_original.png',
    thresh_val=120,
    with_plot=True,
    gray_scale=True
)
```

`PIL` is a great library and so is NumPy. `PIL` is easy to use and gives results with less code, which makes it great for researchers and beginners. Implementing the same from scratch, though, is where I gained the most.

Here I take leave. If you have liked it consider visiting this page to read more on Image Processing. And make sure to buy coffee for me from here.

Concatenation is the process of combining two things end to end. In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. To illustrate, concatenating the two words `snow` and `ball` results in `snowball`. As we will see with images, though, it need not always be from the end.

**Credits of Cover Image** - Photo by Bia Andrade on Unsplash

Concatenation of images is simply concatenating multiple matrices. But in Python, we need to understand that concatenation of lists is different from the concatenation of NumPy arrays.

For concatenating two or more lists, we can use the `+` operation, which combines them into one big list. Below is an example that can be understood easily.

```python
>>> l1 = [1, 2, 3, 4]
>>> l2 = [4, 5, 6, 7]
>>> 
>>> # concatenation
>>> l3 = l1 + l2
>>> l3
[1, 2, 3, 4, 4, 5, 6, 7]
>>> 
>>> # or, in place (note that extend() returns None, it mutates l1)
>>> l1.extend(l2)
>>> l1
[1, 2, 3, 4, 4, 5, 6, 7]
```

The scenario is completely different for NumPy arrays if we perform the same `+` operation as we did for lists above. NumPy performs element-wise addition (via its broadcasting rules), provided the shapes of the arrays are compatible. Below is an example that can be understood easily.

**For 1D arrays**

```python
>>> import numpy as np
>>> 
>>> l1 = np.array([1, 2, 3, 4])
>>> l2 = np.array([4, 5, 6, 7])
>>> 
>>> # using `+` operation
>>> l3 = l1 + l2
>>> print(l3)
[ 5  7  9 11]
```

**For 2D arrays**

```python
>>> import numpy as np
>>> 
>>> l1 = np.array([[1, 2], [3, 4]])
>>> l2 = np.array([[4, 5], [6, 7]])
>>> 
>>> # using `+` operation
>>> l3 = l1 + l2
>>> print(l3)
[[ 5  7]
 [ 9 11]]
```

In both cases, the `+` operation adds element-wise instead of concatenating. To combine the arrays, we can use the `concatenate()` method of the NumPy module. The only criterion is that all the arrays must have the same number of dimensions.

With the help of this method, we can concatenate either `horizontally` or `vertically`. Below is an example that can be understood easily.

```python
>>> import numpy as np
>>> 
>>> l1 = np.array([[1, 2], [3, 4]])
>>> l2 = np.array([[4, 5], [6, 7]])
>>> 
>>> # concatenation
>>> # axis = 0 concatenates vertically
>>> l3 = np.concatenate((l1, l2), axis=0)
>>> print(l3)
[[1 2]
 [3 4]
 [4 5]
 [6 7]]
>>> 
>>> # axis = 1 concatenates horizontally
>>> l4 = np.concatenate((l1, l2), axis=1)
>>> print(l4)
[[1 2 4 5]
 [3 4 6 7]]
```

From the above example, we are clear that concatenation can be easily done and the same thing is performed on the images to combine two or multiple (different) images.
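One caveat before moving on to images: `concatenate()` also needs the sizes along the non-concatenation axis to agree, which is exactly why differently sized images must be resized first. A quick toy check:

```python
import numpy as np

a = np.ones((2, 3), dtype=int)
b = np.ones((2, 5), dtype=int)

# row counts match -> horizontal concatenation works
print(np.concatenate((a, b), axis=1).shape)   # (2, 8)

# column counts differ -> vertical concatenation fails
try:
    np.concatenate((a, b), axis=0)
except ValueError:
    print("shapes do not match along the non-concatenation axis")
```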

We are basically replicating, in NumPy, the methods `hconcat()` (horizontal concatenation) and `vconcat()` (vertical concatenation) of the module `cv2`.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV (only used for reading the image in this article)

**Import the Packages**

```python
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in grayscale or RGB and returns the image matrix.

While concatenating two different images, there can be a difference in the dimensions of the image matrices. We need to make sure that the dimensions are the same. There is a need to resize the image matrix (refer to this article for understanding image resizing). And, hence the below function.

```python
def resize_image(image_matrix, nh, nw):
    image_size = image_matrix.shape
    oh = image_size[0]
    ow = image_size[1]

    re_image_matrix = np.array([
        np.array([image_matrix[(oh*h // nh)][(ow*w // nw)] for w in range(nw)])
        for h in range(nh)
    ])

    return re_image_matrix
```

For this, we will need to take two different images.

**Image 1**

**Image 2**

**Note** - The dimensions of the above images are different.

```python
# grayscale mode is false by default
image1 = read_this(image_file='lena_original.png')
image2 = read_this(image_file='pinktree.jpg')
```

The function takes three arguments in total -

- `image_set` - a list of image matrices.
- `how` - the basis of concatenation, `vertical` or `horizontal`.
- `with_plot` - a default arg deciding whether to plot the results or not.

```python
def concat_lib(image_set, how, with_plot=False):
    # dimension of each matrix in image_set
    shape_vals = [imat.shape for imat in image_set]

    # length of dimension of each matrix in image_set
    shape_lens = [len(ishp) for ishp in shape_vals]

    # if all the images in image_set are read in same mode
    channel_flag = True if len(set(shape_lens)) == 1 else False

    if channel_flag:
        ideal_shape = max(shape_vals)
        images_resized = [
            # function call to resize the image
            resize_image(image_matrix=imat, nh=ideal_shape[0], nw=ideal_shape[1])
            if imat.shape != ideal_shape else imat for imat in image_set
        ]
    else:
        return False

    # cv2 library code to concatenate the image matrices
    # we use methods like
    # - vconcat() vertical concat
    # - hconcat() horizontal concat
    if (how == 0) or (how == 'vertical'):
        concats = cv2.vconcat(images_resized)
    elif (how == 1) or (how == 'horizontal'):
        concats = cv2.hconcat(images_resized)
    else:
        concats = cv2.hconcat(images_resized)

    if with_plot:
        cmap_val = None if len(concats.shape) == 3 else 'gray'
        plt.figure(figsize=(10, 6))
        plt.axis("off")
        plt.imshow(concats, cmap=cmap_val)
        return True
    return concats
```

Image matrices with different dimensions are handled by resizing each one to the ideal dimension, i.e., the maximum matrix dimension in the set. We use the methods `hconcat()` and `vconcat()` based on the value of `how` that is passed.

Let's test the above function -

```python
concat_lib(
    image_set=[image1, image2, image1, image1, image2],
    how='horizontal',
    with_plot=True
)
```

`image2` is automatically resized to the size of `image1`, and the whole set is combined into one giant matrix.

The function takes three arguments in total -

- `image_set` - a list of image matrices.
- `how` - the basis of concatenation, `vertical` or `horizontal`.
- `with_plot` - a default arg deciding whether to plot the results or not.

```python
def concat_images(image_set, how, with_plot=False):
    # dimension of each matrix in image_set
    shape_vals = [imat.shape for imat in image_set]

    # length of dimension of each matrix in image_set
    shape_lens = [len(ishp) for ishp in shape_vals]

    # if all the images in image_set are read in same mode
    channel_flag = True if len(set(shape_lens)) == 1 else False

    if channel_flag:
        ideal_shape = max(shape_vals)
        images_resized = [
            # function call to resize the image
            resize_image(image_matrix=imat, nh=ideal_shape[0], nw=ideal_shape[1])
            if imat.shape != ideal_shape else imat for imat in image_set
        ]
    else:
        return False

    images_resized = tuple(images_resized)

    if (how == 'vertical') or (how == 0):
        axis_val = 0
    elif (how == 'horizontal') or (how == 1):
        axis_val = 1
    else:
        axis_val = 1

    # numpy code to concatenate the image matrices
    # concatenation is done based on axis value
    concats = np.concatenate(images_resized, axis=axis_val)

    if with_plot:
        cmap_val = None if len(concats.shape) == 3 else 'gray'
        plt.figure(figsize=(10, 6))
        plt.axis("off")
        plt.imshow(concats, cmap=cmap_val)
        return True
    return concats
```

Image matrices with different dimensions are handled by resizing each one to the ideal dimension, i.e., the maximum matrix dimension in the set. We use the method `concatenate()` based on the value of `how` that is passed.

Let's test the above function -

```python
concat_images(
    image_set=[image1, image2, image1, image1, image2],
    how='vertical',
    with_plot=True
)
```

`image2` is automatically resized to the size of `image1`, and the whole set is combined into one giant matrix.

We can also play with the above functions to explore more.

```python
gimg1 = concat_images(image_set=[image1, image2], how='vertical')
gimg2 = concat_images(image_set=[image1, image2], how='horizontal')

concat_images(
    image_set=[gimg1, gimg2],
    how=1,
    with_plot=True
)
```

Here I take leave. If you have liked it consider visiting this page to read more on Image Processing. And make sure to buy coffee for me from here.

When we have an image, it is always taken as a `2D` matrix, and the size of the image is nothing but the dimension of that matrix.

**Credits of Cover Image** - Photo by Kai Krog Halse on Unsplash

In Python, when we read an image, its size can easily be found with the `.shape` attribute. In order to find the shape, we should first read the image and obtain the matrix.

Let's first implement this for a `2D` matrix or array and then replicate the same for images.

First, we create a matrix using the `NumPy` module. The code for the same can be seen below.

```python
>>> import random
>>> import numpy as np
>>> 
>>> mat = np.array([
...     [random.randint(10, 100) for i in range(5)] for j in range(5)
... ])
>>> 
>>> print(mat)
[[ 30  91  12  44  52]
 [ 37  72  19 100  77]
 [ 94  77  60  48  64]
 [ 65  26  59  52  40]
 [ 37  58  13  74  36]]
```

We will use the `.shape` attribute to find out the dimension of the matrix. Clearly, the matrix has `5` rows and `5` columns.

```python
>>> dimension = mat.shape
>>> 
>>> print(dimension)
(5, 5)
```

If we plot the same, we will get the matrix image as below.

```python
>>> import matplotlib.pyplot as plt
>>> 
>>> plt.figure(figsize=(12, 6))
<Figure size 1200x600 with 0 Axes>
>>> mat_plot = plt.imshow(mat)
>>> plt.colorbar(mat_plot)
<matplotlib.colorbar.Colorbar object at 0x0AF04418>
>>> plt.show()
```

We can increase the size by using the interpolation technique. In mathematics, interpolation is a type of estimation to construct new data points within the range of a discrete set of known data points.

Let's create a function that will resize the given matrix and returns a new matrix with the new shape.

```python
def resize_image(image_matrix, nh, nw):
    if len(image_matrix.shape) == 3:
        oh, ow, _ = image_matrix.shape
    else:
        oh, ow = image_matrix.shape

    re_image_matrix = np.array([
        np.array([image_matrix[(oh*h // nh)][(ow*w // nw)] for w in range(nw)])
        for h in range(nh)
    ])

    return re_image_matrix
```

**Credits of the above code** - NumPy scaling the Image.

We are passing three arguments -

- `image_matrix` - basically any matrix whose dimension we want to change.
- `nh` - new height (goes row-wise).
- `nw` - new width (goes column-wise).

In the function, we take the old height (`oh`) and old width (`ow`) according to the length of the image matrix's shape, since the shape differs between colored and grayscale images.

Also, we have two `for` loops for the two levels (row-wise and column-wise), performing an integer division for each iterate in the range of the new height and new width. This decides the index at which the element is extracted from the original matrix.
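For example, stretching an old size of 5 to a new size of 8 (as in the example that follows), the integer division picks these source indices:

```python
# source row chosen for each of the 8 new rows (oh = 5, nh = 8)
oh, nh = 5, 8
print([oh * h // nh for h in range(nh)])   # [0, 0, 1, 1, 2, 3, 3, 4]
```

Some source rows get repeated, which is how the nearest-neighbour enlargement works.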

Let's increase the matrix size so that the new matrix has `8` rows and `8` columns.

```python
>>> re_mat = resize_image(image_matrix=mat, nh=8, nw=8)
>>> print(re_mat)
[[ 30  30  91  91  12  44  44  52]
 [ 30  30  91  91  12  44  44  52]
 [ 37  37  72  72  19 100 100  77]
 [ 37  37  72  72  19 100 100  77]
 [ 94  94  77  77  60  48  48  64]
 [ 65  65  26  26  59  52  52  40]
 [ 65  65  26  26  59  52  52  40]
 [ 37  37  58  58  13  74  74  36]]
```

We can clearly see that the new matrix `re_mat` is bigger than the original matrix `mat`; its dimension is `8 x 8`. If we were to visualize both, there won't be any particular difference between the original matrix plot and the new matrix plot.

```python
>>> fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
>>> 
>>> ax1.title.set_text('Original')
>>> ax2.title.set_text("Re-sized")
>>> 
>>> ax1.imshow(mat)
>>> ax2.imshow(re_mat)
>>> plt.show()
```

We can notice the difference in the axis scaling. The second image is more scaled than the first image. The resizing is handled with care to retain the pixel values. The same thing is applied to the image for increasing and decreasing the size respectively.

**Note** - We could use `OpenCV`'s resizing techniques; here, I wanted to learn how to do the same from scratch.

When we talk about images, they are more like matrices, where data is stored row-wise and column-wise. Since an image is a replication of a matrix, we can do all sorts of matrix operations on it.

**Credits of Cover Image** - Photo by Erwan Hesry on Unsplash

We will use the NumPy module extensively for this project and also for the following projects related to Image Processing.

```python
>>> import numpy as np
>>> 
>>> mat = np.matrix(
...     [[1, 2, 3, 4, 5],
...      [3, 4, 5, 6, 1]]
... )
>>> mat
matrix([[1, 2, 3, 4, 5],
        [3, 4, 5, 6, 1]])
>>> 
>>> mat.shape
(2, 5)
```

```python
>>> mat.T
matrix([[1, 3],
        [2, 4],
        [3, 5],
        [4, 6],
        [5, 1]])
```

```python
>>> 3 * mat
matrix([[ 3,  6,  9, 12, 15],
        [ 9, 12, 15, 18,  3]])
```

For converting a matrix into an image, we can use the Matplotlib module. The method `imshow()` is helpful for this conversion.

```python
>>> from matplotlib import pyplot as plt
>>> 
>>> plt.figure(figsize=(10, 3))
>>> image_mat = plt.imshow(mat, cmap='Reds')
>>> plt.colorbar(image_mat)
>>> plt.show()
```

In the above example, we did the conversion for a small matrix. Now we shall try to do the conversion for a large matrix. We will use the Random module to generate random numbers for the matrix.

- There should be 30 rows and 50 columns.
- Each row of the matrix should have 50 numbers in the range of 1 and 200.

We shall use the method `randint(a, b)` to create a random large matrix.

```python
>>> import random
>>> 
>>> big_mat = [
...     [random.randint(1, 200) for i in range(50)]
...     for j in range(30)
... ]
>>> 
>>> big_mat = np.matrix(big_mat)
>>> 
>>> big_mat
matrix([[ 16, 184, 130, ...,  35, 126, 104],
        [198, 153, 172, ..., 109,  60, 174],
        [ 90, 138, 108, ..., 192, 132, 103],
        ...,
        [172,  83, 158, ..., 154,   4, 109],
        [ 23,   3, 198, ..., 147,  12,  88],
        [ 62, 106,  96, ..., 191,  83, 193]])
>>> 
>>> big_mat.shape
(30, 50)
```

Let's convert the above large matrix into an image.

```python
>>> plt.figure(figsize=(10, 5))
>>> image_mat = plt.imshow(big_mat, cmap=None)
>>> plt.colorbar(image_mat)
>>> plt.show()
```

To convert an image into a matrix, we need the package called `opencv-python`. With this package, we can read an image into a matrix and perform matrix operations on it using the NumPy module.

**Installation**

`pip install opencv-python --user`

A typical colored image is comprised of pixels represented as `RGB` triplets. Each of `R`, `G`, and `B` is simply a number in the range `0` to `255`.

- R - Red - 0 to 255
- G - Green - 0 to 255
- B - Blue - 0 to 255

We will use the `cv2` (OpenCV) module to read the image in the form of a matrix. The method `imread()` is used to read the image in either `grayscale` mode or `RGB` mode.

**Import the Packages**

```python
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

The image that we will be reading is the original Lenna image. Below is a function to read the image as a matrix.

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

By default, `imread()` reads the image in `BGR` (Blue, Green, Red) format. We convert it to `RGB`, hence the above function. Now that we have read the image, we can separate the pixels into 3 distinct matrices for `R`, `G`, and `B`.

Let's plot the same as we have done for the above examples.

```python
def separate_rgb(image_file, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_, g_, b_ = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        return [r_, g_, b_]
    return [image_src]
```

The above function separates the pixels into 3 distinct matrices.

```
def plot_rgb_seperated(image_file, with_plot=True, gray_scale=False):
    # note: pass gray_scale through and respect the with_plot argument
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    separates = separate_rgb(image_file=image_file, gray_scale=gray_scale)
    pixel_matrices = [image_src]
    pixel_matrices.extend(separates)
    cmap_vals = [None, 'Reds', 'Greens', 'Blues'] if not gray_scale else [None, 'gray']
    titles = ['Original', 'Red', 'Green', 'Blue'] if not gray_scale else ['Original', 'Grayscale']
    n_cols = 4 if not gray_scale else 2
    fig_size = (15, 10) if not gray_scale else (10, 20)
    if with_plot:
        fig, axes = plt.subplots(nrows=1, ncols=n_cols, figsize=fig_size)
        for i, ax in zip(range(n_cols), axes):
            ax.axis("off")
            ax.set_title(titles[i])
            ax.imshow(pixel_matrices[i], cmap=cmap_vals[i])
        return True
    return False
```

The above function plots the matrices along with the original matrix. Let's test the function.

**For color image:**

`plot_rgb_seperated(image_file='lena_original.png')`

**For grayscale image:**

`plot_rgb_seperated(image_file='lena_original.png', gray_scale=True)`

That is it for this article. We have learned how to convert the matrix into an image and vice-versa.

In this article, we will try to understand the inner working of `cv2.rectangle()` using the NumPy module in Python.

**Credits of Cover Image** - Photo by Tim Mossholder on Unsplash

`cv2.rectangle()` is OpenCV's method for drawing a rectangle on an image. We have options to decide the thickness and the color of the rectangle, but we need to make sure that the color is passed as an `RGB` code (R, G, B). In this blog article, we will focus on understanding the inner working of this method and implement the same from scratch using the NumPy module.

A rectangle is simply a shape that we would like to draw on an image at a specified position. We can set the position by passing two points, `pt1` and `pt2`. The image matrix is considered as a 2D plane for us to decide the position of the rectangle. Let's understand the library method first.

The arguments of the above-mentioned method are:

- `img` - Image on which the rectangle is drawn.
- `pt1` - Point 1 to decide the position of the rectangle.
- `pt2` - Point 2, also to decide the position of the rectangle.
- `color` - Color of the rectangle, passed as an RGB code.
- `thickness` - The line thickness of the rectangle.

Other arguments and their importance can be learned by checking `help(cv2.rectangle)`. First, we will implement the library method and then we will implement the code from scratch.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV (only used for reading the image in this article)

```
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

```
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in grayscale or RGB and returns the image matrix.

We will use the `cv2.rectangle()` method. But for better usability, we can write the following function, which takes these arguments:

- `start_pos` - The starting position of the rectangle to draw.
- `length` - Length of the rectangle (absolute value is considered).
- `width` - Width of the rectangle (absolute value is considered).

From the above three arguments, we can obtain the `pt1` and `pt2` points.
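To make that concrete, here is a small sketch (the helper name `to_points` and its exact clipping are my own, mirroring the logic used in the function below) that derives the two corner points:

```
def to_points(start_pos, length, width, image_shape):
    """Derive pt1 (top-left) and pt2 (bottom-right) from start_pos, length, width.

    A sketch of the conversion; clips the rectangle to the image bounds.
    """
    length, width = abs(length), abs(width)
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row  # the same offset is used on both axes
    end_row = min(start_row + length, image_shape[0])
    end_column = min(start_column + width, image_shape[1])
    # OpenCV points are (x, y) = (column, row)
    pt1 = (start_column, start_row)
    pt2 = (end_column, end_row)
    return pt1, pt2

print(to_points(start_pos=199, length=200, width=200, image_shape=(512, 512)))
# ((199, 199), (399, 399))
```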

- `thickness` - Thickness of the rectangle.
- `color_name` - Instead of passing the `RGB` code, we can extract the (R, G, B) values from the `color_names_data.json` file, where I have stored all the possible color names with their (R, G, B) values. A sample of the color data can be viewed below.

`{ "air force blue": { "r": 93, "g": 138, "b": 168, "hex": "#5d8aa8" }, "alizarin crimson": { "r": 227, "g": 38, "b": 54, "hex": "#e32636" }, "almond": { "r": 239, "g": 222, "b": 205, "hex": "#efdecd" }, ...}`

The function works for both colored images and grayscale images. But for grayscale images, a default color, `Black`, is considered.

```
def rectangle_lib(image_file, start_pos, length, width, thickness=3, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    image_main = read_this(image_file=image_file, gray_scale=gray_scale)
    image_shape = image_src.shape
    length = abs(length)
    width = abs(width)
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row
    end_row = length + start_row
    end_row = end_row if end_row <= image_shape[0] else image_shape[0]
    end_column = width + start_column
    end_column = end_column if end_column <= image_shape[1] else image_shape[1]
    if gray_scale:
        color_name = (0, 0, 0)
    else:
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        color_name = str(color_name).strip().lower()
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            color_name = (0, 0, 0)
        else:
            color_name = tuple([color_db[color_name][i] for i in 'rgb'])
    # library method
    image_rect = cv2.rectangle(image_src, (start_column, start_row), (end_column, end_row), color_name, thickness)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Rectangle")
        ax1.imshow(image_main, cmap=cmap_val)
        ax2.imshow(image_rect, cmap=cmap_val)
        return True
    return image_rect
```

The above function is easier to use than specifying the points on our own. We can simply mention the three important arguments and, based on them, the points are determined. Let's test the function.

**For color image:**

```
rectangle_lib(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    color_name="yellow"
)
```

**For grayscale image:**

```
rectangle_lib(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    gray_scale=True
)
```

For thickness `-1`, there is a special case in which the whole rectangle area is filled with the specified color.
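In NumPy terms, filling the whole rectangle area (the thickness `-1` case) is equivalent to assigning the color to the entire sub-region. A minimal sketch on a made-up grayscale matrix:

```
import numpy as np

# A made-up 6x6 white grayscale "image"
img = np.full(shape=(6, 6), fill_value=255, dtype=np.uint8)

# Filling the rectangle (what thickness=-1 does) is just slice assignment:
img[2:5, 1:4] = 0  # rows 2..4 and columns 1..3 painted black

print(img)
```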

The above cases are implemented using the library method. It is obvious that the library method works for sure. It is time to do something from scratch on our own.

We will use the same arguments that we used in the above function `rectangle_lib()`. But the technique we follow is different. Here, the argument `thickness` plays an important role alongside `start_pos`, `length`, and `width`.

The logic behind the implementation:

1. We have to grab the sub-image based on `start_pos`, `length`, and `width`. Basically, we have to crop the image. This cropped image will be the inner portion of the rectangle (please read this article to know more info).
2. We need to `pad` the image matrix with the color value (obtained from the color name). If the image is grayscale, then we pad with the black color value, i.e., `0`. Otherwise, we `pad` 3 times after separating the `R` pixels, `G` pixels, and `B` pixels, and finally merge them to form a single image. Basically, we are bordering the cropped image (please read this article to know more info).
3. Finally, we replace this bordered, cropped image in the original image and display the result.
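The crop, pad, and replace steps can be sketched on a small grayscale matrix (a toy example with arbitrary values, not the full function):

```
import numpy as np

# Toy grayscale image with arbitrary values
image = np.arange(100, dtype=np.uint8).reshape(10, 10)
start, length, width, thickness, color = 3, 4, 4, 1, 0

# Step 1: crop the inner portion of the rectangle
inner = image[start:start + length, start:start + width]

# Step 2: border the crop by padding it with the color value
bordered = np.pad(array=inner, pad_width=thickness, mode='constant', constant_values=color)

# Step 3: replace the bordered crop back into the original image
image[start - thickness:start + length + thickness,
      start - thickness:start + width + thickness] = bordered

print(bordered.shape)  # (6, 6): the 4x4 crop plus a 1-pixel border
```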

We will use the same color data from the file `color_names_data.json` for obtaining the color values. Let's code the logic.

```
def draw_rectangle(image_file, start_pos, length, width, thickness=3, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    image_main = read_this(image_file=image_file, gray_scale=gray_scale)
    image_shape = image_src.shape
    length = abs(length)
    width = abs(width)
    thickness = -1 if thickness < 0 else thickness
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row
    end_row = length + start_row
    end_row = end_row if end_row <= image_shape[0] else image_shape[0]
    end_column = width + start_column
    end_column = end_column if end_column <= image_shape[1] else image_shape[1]
    start_row_grab = start_row - thickness
    end_row_grab = end_row + thickness
    start_column_grab = start_row_grab
    end_column_grab = end_column + thickness
    gso_image = image_src[start_row_grab:end_row_grab, start_column_grab:end_column_grab]
    gsi_image = image_src[start_row:end_row, start_column:end_column]
    if gray_scale:
        color_name = 0
        if thickness != -1:
            inner_image_rect = np.pad(array=gsi_image, pad_width=thickness, mode='constant', constant_values=color_name)
        else:
            max_height, max_width = gso_image.shape
            inner_image_rect = np.zeros(shape=(max_height, max_width))
    else:
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        color_name = str(color_name).strip().lower()
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            r_cons, g_cons, b_cons = (0, 0, 0)
        else:
            r_cons, g_cons, b_cons = [color_db[color_name][i] for i in 'rgb']
        r_inner_image, g_inner_image, b_inner_image = gsi_image[:, :, 0], gsi_image[:, :, 1], gsi_image[:, :, 2]
        if thickness != -1:
            r_inner_rect = np.pad(array=r_inner_image, pad_width=thickness, mode='constant', constant_values=r_cons)
            g_inner_rect = np.pad(array=g_inner_image, pad_width=thickness, mode='constant', constant_values=g_cons)
            b_inner_rect = np.pad(array=b_inner_image, pad_width=thickness, mode='constant', constant_values=b_cons)
            inner_image_rect = np.dstack(tup=(r_inner_rect, g_inner_rect, b_inner_rect))
        else:
            max_height, max_width, _ = gso_image.shape
            r_out_rect = np.full(shape=(max_height, max_width), fill_value=r_cons)
            g_out_rect = np.full(shape=(max_height, max_width), fill_value=g_cons)
            b_out_rect = np.full(shape=(max_height, max_width), fill_value=b_cons)
            inner_image_rect = np.dstack(tup=(r_out_rect, g_out_rect, b_out_rect))
    image_src[start_row_grab:end_row_grab, start_column_grab:end_column_grab] = inner_image_rect
    image_rect = image_src
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Rectangle")
        ax1.imshow(image_main, cmap=cmap_val)
        ax2.imshow(image_rect, cmap=cmap_val)
        plt.show()
        return True
    return image_rect
```

Here, we are not dealing with points to decide the position of the rectangle. We are using the techniques of `cropping`, `bordering`, and `replacing` to achieve the same result. Let's test the above function.

**For color image:**

```
draw_rectangle(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    color_name='red'
)
```

**For grayscale image:**

```
draw_rectangle(
    image_file='lena_original.png',
    start_pos=199,
    length=200,
    width=200,
    thickness=3,
    with_plot=True,
    gray_scale=True
)
```

For thickness `-1`, there is a special case in which the whole rectangle area is filled with the specified color.

We finally got what we wanted. We tried to implement the code from scratch with the help of NumPy matrix operations. We used NumPy methods like:

- `pad()` - Pads the matrix with constant values.
- `zeros()` - Used for the special case when the `thickness` is -1; this applies only when the image is read in grayscale mode.
- `full()` - Used to obtain a matrix of identical values; this applies when the `thickness` is -1 and the image is read in colored mode.

Personally, I learned so much from implementing the method. I hope you find this insightful. You should definitely check out my other articles on the same subject in my profile.

If you liked it, you can buy coffee for me from here.

In this blog article, we will learn how to crop an image in Python using NumPy as an ideal library. When we talk about images, they are just matrices in 2D space. And of course, it depends on the image: if it is an `RGB` image, the shape of the image matrix would be `(height, width, 3)`, whereas a grayscale one would just be `(height, width)`. But ultimately, images are just large matrices where each value is a pixel positioned `row-wise` and `column-wise` accordingly.

**Credits of Cover Image** - Photo by Ulrike Langner on Unsplash

Cropping the image is just obtaining a sub-matrix of the image matrix. The size of the sub-matrix (cropped image) can be of our choice, mainly its height and width. One thing is essential for the image to be cropped, i.e., the `starting position`. The `starting position` helps in obtaining the sub-matrix from that position, and depending upon the height and width we can easily crop the image.

The three important things are:

- starting_position
- length (height)
- width

Based on these three things, we can construct a complete cropping function.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV (only used for reading the image)

```
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

```
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in grayscale or RGB and returns the image matrix.

We need to pass the above-mentioned 3 things as arguments to our function. But before doing so, let's try to crop (slice) a matrix with NumPy.

```
>>> import numpy as np
>>> m = np.array([
...     [1, 2, 3, 4, 5, 6, 7],
...     [5, 3, 4, 2, 1, 7, 6],
...     [6, 4, 3, 5, 1, 2, 7],
...     [5, 6, 3, 1, 4, 2, 7],
...     [1, 2, 3, 4, 5, 6, 7]
... ])
>>>
>>> print(m)
[[1 2 3 4 5 6 7]
 [5 3 4 2 1 7 6]
 [6 4 3 5 1 2 7]
 [5 6 3 1 4 2 7]
 [1 2 3 4 5 6 7]]
>>>
>>> crop_m = m[1:4, 2:7]
>>> print(crop_m)
[[4 2 1 7 6]
 [3 5 1 2 7]
 [3 1 4 2 7]]
>>>
```

The above code is an example of how we can crop an image matrix. Notice that `crop_m` is the cropped matrix (sub-matrix) sliced from the original matrix `m`. The sub-matrix `crop_m` takes the values `[1:4, 2:7]`, i.e., rows `1` through `3` and columns `2` through `6` (the end index of a slice is excluded). We do something similar for the image to obtain the cropped image. Let's write the image-cropping function.

```
def crop_this(image_file, start_pos, length, width, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    image_shape = image_src.shape
    length = abs(length)
    width = abs(width)
    start_row = start_pos if start_pos >= 0 else 0
    start_column = start_row
    end_row = length + start_row
    end_row = end_row if end_row <= image_shape[0] else image_shape[0]
    end_column = width + start_column
    end_column = end_column if end_column <= image_shape[1] else image_shape[1]
    print("start row \t- ", start_row)
    print("end row \t- ", end_row)
    print("start column \t- ", start_column)
    print("end column \t- ", end_column)
    image_cropped = image_src[start_row:end_row, start_column:end_column]
    cmap_val = None if not gray_scale else 'gray'
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Cropped")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_cropped, cmap=cmap_val)
        return True
    return image_cropped
```

Let's understand what this function actually does.

- At the first step, we read the image either in grayscale or RGB and obtain the image matrix.
- We obtain the height and width of the image which is further used in the validation of the code.
- We make sure that the length and width are positive integers. Hence absolute values are considered.
- We calculate the four important values used for slicing the matrix: `start_row`, `end_row`, `start_column`, `end_column`. We obtain these using the three arguments that are passed: `start_pos`, `length`, `width`.
- We obtain the cropped image by slicing the matrix.
- We plot both the original and cropped images for the visualization.

Let's test the above function.

```
crop_this(
    image_file='lena_original.png',
    start_pos=199,
    length=100,
    width=200,
    with_plot=True
)
```

```
start row    -  199
end row      -  299
start column -  199
end column   -  399
```

```
crop_this(
    image_file='lena_original.png',
    start_pos=199,
    length=100,
    width=200,
    with_plot=True,
    gray_scale=True
)
```

```
start row    -  199
end row      -  299
start column -  199
end column   -  399
```

This is it!!! We are finally able to crop the image by just knowing the starting position and the length and width of the cropped image. Isn't it great? We can also add a lot of customization options, like adding a border around the image. To know how to add a border to the image, you can refer to my article.

Other similar articles can be found in my profile. Have a great time reading and implementing the same.

If you liked it, you can buy coffee for me from here.

In this article, we will learn how to invert an image using NumPy. To get the gist of this, let's take two values, `0` and `1`. Here `0` represents Black and `1` represents White. When we apply inversion to these values, we get:

- 0 inverted becomes 1
- 1 inverted becomes 0

The above only works when we have two values: `0` for low and `1` for high. If we relate the same to a binary image, whose pixel values are just `1`s and `0`s, the inversion reverses them. To put it in words, we go from **White** and **Black** to **Black** and **White**.

**Credits of Cover Image** - Photo by Jr Korpa on Unsplash

With a plain list, if we want to add a number to its values, we iterate through each element and add the number. In NumPy, we need not iterate at all: we can treat the whole array as a single element and add the number, and NumPy automatically adds it to every element of the array. This technique is called broadcasting.

The broadcasting technique is applicable to both matrices and arrays. It is very fast when compared to normal loops.

```
>>> import numpy as np
>>> M = np.array([1, 2, 3, 4, 5, 6])
>>> M = 3 + M
>>> M
array([4, 5, 6, 7, 8, 9])
>>>
```

Let's see the demonstration for a random matrix.

A simple demonstration of the **White-Black** matrix and its image can be seen below.

- 1 is visualized as **White**
- 0 is visualized as **Black**

```
>>> import numpy as np
>>> image_b = np.array([
...     [1, 0, 1],
...     [1, 1, 0],
...     [0, 1, 1]])
>>> image_b
array([[1, 0, 1],
       [1, 1, 0],
       [0, 1, 1]])
>>>
```

If we visualize the above matrix, we can see something like the below.

A simple demonstration of the **Black-White** matrix and its image can be seen below.

- 1 is changed to 0 (**Black**)
- 0 is changed to 1 (**White**)

```
>>> # Broadcasting
>>> image_i = 1 - image_b
>>> # image_i = ~ image_b
>>> image_i
array([[0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])
>>>
```

If we visualize the above matrix, we can see something like the below.

To invert a matrix, we can also use the bitwise NOT operator **~**. For `uint8` image matrices it gives `255 - x`, and for boolean arrays it flips the values; note that for plain signed-integer `0`/`1` arrays, `~` yields `-1` and `-2` instead, so `1 - image_b` is the safer choice here.
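A caveat worth demonstrating (my note, based on how NumPy's bitwise NOT behaves per dtype): `~` only acts as an image inversion on `uint8` and boolean arrays, not on plain signed integers:

```
import numpy as np

image_u8 = np.array([0, 255, 100], dtype=np.uint8)
print(~image_u8)    # [255   0 155] -> identical to 255 - x for uint8

image_bool = np.array([True, False, True])
print(~image_bool)  # [False  True False] -> a proper binary flip

image_int = np.array([1, 0, 1])  # plain signed integers
print(~image_int)   # [-2 -1 -2] -> NOT an image inversion
```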

If you want to know specifically how to convert an image into binary, you can refer to my article where I explain the procedure for both colored and grayscale images.

The above implementation worked since the images were already binarized. What if we wanted to apply this to a colored, non-binarized image? The scenario will be different. Let's find out what we can do for those.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV

```
import numpy as np
import cv2
import json
from matplotlib import pyplot as plt
```

```
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in grayscale or RGB and returns the image matrix.

For inverting an image using the `cv2` library, we can use the method `bitwise_not()`, which is available in the library. We just pass the image matrix as an argument to the method.

```
def invert_lib(image_file, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    cmap_val = None if not gray_scale else 'gray'
    image_i = cv2.bitwise_not(image_src)
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Inverted")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_i, cmap=cmap_val)
        return True
    return image_i
```

The above function returns the inverted image of the original image that is passed. The same can be seen when plotted.

Let's test the above function.

`invert_lib(image_file='lena_original.png', with_plot=True)`

`invert_lib(image_file='lena_original.png', with_plot=True, gray_scale=True)`

We have obtained the inverted images for both colored and grayscale images.

In order to invert the image, we apply the broadcasting technique using NumPy: we subtract each pixel value from `255`, because the highest pixel (color) value in an image is `255`. Alternatively, we can apply the **~** (bitwise NOT) operation to the image.

```
def invert_this(image_file, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    # image_i = ~ image_src
    image_i = 255 - image_src
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Inverted")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_i, cmap=cmap_val)
        return True
    return image_i
```

Let's test the above function.

`invert_this(image_file='lena_original.png', with_plot=True)`

`invert_this(image_file='lena_original.png', with_plot=True, gray_scale=True)`

This is it! We have successfully written the code for converting the image into an inverted image. The results are similar to that of the library code.

Let's see what it results in when applied to the binarized image. I have put the result without the code.

The inverted image is exactly opposite to the binarized image. Hence this concludes the agenda of this article.

You should definitely check out my other articles on the same subject in my profile.

If you liked it, you can buy coffee for me from here.

In this article, we will learn how to add/draw a border to an image. We use OpenCV to read the image. The rest is handled by NumPy to code from scratch; we rely on it for the matrix operations that can be achieved with so much ease.

There are two aspects in which we can start to think:

- If the image is read in grayscale, we can simply keep the default color as black. This is because the length of the shape of the image matrix would be 2. Therefore we cannot add a color border whose color value would be of size 3 and thus it cannot be mapped easily.
- If the image is read in RGB, we can have a choice to pick the color for the border. This is because the length of the shape of the image matrix would be 3. Hence we can add a color border whose color value would be of size 3 which can be mapped easily.
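The shape difference behind these two aspects is easy to verify with synthetic matrices (not actual image files):

```
import numpy as np

gray_image = np.zeros(shape=(4, 4), dtype=np.uint8)    # grayscale: 2D matrix
rgb_image = np.zeros(shape=(4, 4, 3), dtype=np.uint8)  # RGB: 3D matrix

print(len(gray_image.shape))  # 2 -> no room for a 3-value color border
print(len(rgb_image.shape))   # 3 -> each pixel holds (R, G, B)
```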

**Credits of Cover Image** - Photo by Kanan Khasmammadov on Unsplash

Before proceeding further, we need to make sure we have enough colors (based on the user's choice). I have extracted the possible color values from this website.

The code for the same can be seen below. The result is stored in a JSON file named `color_names_data.json`.

```
import requests
import json
from bs4 import BeautifulSoup

def extract_table(url):
    res = requests.get(url=url)
    con = res.text
    soup = BeautifulSoup(con, features='lxml')
    con_table = soup.find('table', attrs={'class' : 'color-list'})
    headings = [th.get_text().lower() for th in con_table.find("tr").find_all("th")]
    table_rows = [headings]
    for row in con_table.find_all("tr")[1:]:
        each_row = [td.get_text().lower() for td in row.find_all("td")]
        table_rows.append(each_row)
    return table_rows

col_url = "https://www.colorhexa.com/color-names"
color_rows_ = extract_table(url=col_url)

color_dict = {}
for co in color_rows_[1:]:
    color_dict[co[0]] = {
        'r' : int(co[2]),
        'g' : int(co[3]),
        'b' : int(co[4]),
        'hex' : co[1]
    }

with open(file='color_names_data.json', mode='w') as col_json:
    json.dump(obj=color_dict, fp=col_json, indent=2)
```

It is necessary to grab the R, G, and B values in order to do the mapping after the separation of pixels. We follow `split` and `merge` methods using NumPy.

The structure of the color data can be seen below.

`{ "air force blue": { "r": 93, "g": 138, "b": 168, "hex": "#5d8aa8" }, "alizarin crimson": { "r": 227, "g": 38, "b": 54, "hex": "#e32636" }, "almond": { "r": 239, "g": 222, "b": 205, "hex": "#efdecd" }, ...}`

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV

`import numpy as npimport cv2import jsonfrom matplotlib import pyplot as plt`

`def read_this(image_file, gray_scale=False): image_src = cv2.imread(image_file) if gray_scale: image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY) else: image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB) return image_src`

The above function reads the image either in grayscale or RGB and returns the image matrix.

For adding/drawing borders around the image, the important arguments are:

- `image_file` - Image file location, or the image name if the image is stored in the same directory.
- `bt` - Border thickness.
- `color_name` - By default, it takes 0 (black color). Otherwise, any color name can be passed.

We use the method `copyMakeBorder()` available in the OpenCV library, which creates a new bordered image with a specified thickness. In the code, we make sure to convert the color name into values from the color data we collected earlier.

The below function works for both RGB image and grayscale image as expected.

```
def add_border(image_file, bt=5, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if gray_scale:
        color_name = 0
        value = [color_name for i in range(3)]
    else:
        color_name = str(color_name).strip().lower()
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            value = [0, 0, 0]
        else:
            value = [color_db[color_name][i] for i in 'rgb']
    image_b = cv2.copyMakeBorder(image_src, bt, bt, bt, bt, cv2.BORDER_CONSTANT, value=value)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Bordered")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b
```

Let's test the above function.

`add_border(image_file='lena_original.png', with_plot=True, color_name='green')`

`add_border(image_file='lena_original.png', with_plot=True, gray_scale=True, color_name='pink')`

We can clearly see the borders have been added/drawn. For the grayscale image, though we mentioned pink, a black border is drawn.

When we talk about the border, it is basically a constant pixel value of one color around the entire image. The thickness of the border matters for it to be visible. Considering the thickness, we should append a constant value around the image that matches the thickness level.

In order to do so, we can use the `pad()` method available in NumPy. This method appends a constant value around the matrix, with a width that matches the `pad_width` argument.

**Example**

```
>>> import numpy as np
>>> m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> m = np.array(m)
>>> m
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>>
>>> pm = np.pad(array=m, pad_width=1, mode='constant', constant_values=12)
>>> pm
array([[12, 12, 12, 12, 12],
       [12,  1,  2,  3, 12],
       [12,  4,  5,  6, 12],
       [12,  7,  8,  9, 12],
       [12, 12, 12, 12, 12]])
>>>
>>> pmm = np.pad(array=m, pad_width=2, mode='constant', constant_values=24)
>>> pmm
array([[24, 24, 24, 24, 24, 24, 24],
       [24, 24, 24, 24, 24, 24, 24],
       [24, 24,  1,  2,  3, 24, 24],
       [24, 24,  4,  5,  6, 24, 24],
       [24, 24,  7,  8,  9, 24, 24],
       [24, 24, 24, 24, 24, 24, 24],
       [24, 24, 24, 24, 24, 24, 24]])
>>>
```

More examples can be found in the documentation.

We just need to change the `constant_values` argument by passing the actual color value.

- For a grayscale image, we simply pad the image matrix with black color i.e., 0.
- For the RGB image, we grab the color value from the data collected, split the image into 3 matrices, and pad each matrix with its color value.

The below function would explain the flow more clearly.

```
def draw_border(image_file, bt=5, with_plot=False, gray_scale=False, color_name=0):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if gray_scale:
        color_name = 0
        image_b = np.pad(array=image_src, pad_width=bt, mode='constant', constant_values=color_name)
        cmap_val = 'gray'
    else:
        color_name = str(color_name).strip().lower()
        with open(file='color_names_data.json', mode='r') as col_json:
            color_db = json.load(fp=col_json)
        colors_list = list(color_db.keys())
        if color_name not in colors_list:
            r_cons, g_cons, b_cons = [0, 0, 0]
        else:
            r_cons, g_cons, b_cons = [color_db[color_name][i] for i in 'rgb']
        r_, g_, b_ = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        rb = np.pad(array=r_, pad_width=bt, mode='constant', constant_values=r_cons)
        gb = np.pad(array=g_, pad_width=bt, mode='constant', constant_values=g_cons)
        bb = np.pad(array=b_, pad_width=bt, mode='constant', constant_values=b_cons)
        image_b = np.dstack(tup=(rb, gb, bb))
        cmap_val = None
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Bordered")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b
```

Let's test the above function.

`draw_border(image_file='lena_original.png', with_plot=True, color_name='cyan')`

`draw_border(image_file='lena_original.png', bt=10, with_plot=True, gray_scale=True)`

Yay! We did it. We coded the entire thing, including the color choice, completely from scratch, except for the part where we read the image file. We relied mostly on NumPy as it is very fast at computing matrix operations (we could have used `for` loops if we wanted our code to execute very slowly).

Personally, this was a great learning experience for me. It makes me think about how difficult and fun it must be for the people who actually work on open-source libraries.

You should definitely check out my other articles on the same subject in my profile.

If you liked it, you can buy coffee for me from here.

In this article, we will learn how to binarize an image using NumPy, and of course, we will use OpenCV for reading the image both in grayscale and RGB.

To understand what binary means: binary is something that is made of two things. In computer terminology, binary is just **0** and **1**, where **0** signifies **Black** and **1** signifies **White**.

At the initial stages of learning image processing, we often think of a grayscale image as a **binary image**, although it is not. But slowly, as we pick up the subject, we realize how wrong we were. So, moving ahead, we will learn how to binarize an image both with and without the library (NumPy is used for the matrix operations just to avoid the slowness of regular **for loops**). Besides this, we will also use Matplotlib to plot the results.

**Credits of Cover Image** - Photo by Angel Santos on Unsplash

The binary operation works really well for the grayscale images. The problem with the color (RGB) images is that each pixel is a vector representing 3 unique values one for Red, one for Green, and one for Blue.

A typical grayscale image's matrix would look like:

```
array([[162, 162, 162, ..., 170, 155, 128],
       [162, 162, 162, ..., 170, 155, 128],
       [162, 162, 162, ..., 170, 155, 128],
       ...,
       [ 43,  43,  50, ..., 104, 100,  98],
       [ 44,  44,  55, ..., 104, 105, 108],
       [ 44,  44,  55, ..., 104, 105, 108]], dtype=uint8)
```

A typical RGB image matrix would look like this -

```
array([[[226, 137, 125], ..., [200,  99,  90]],
       [[226, 137, 125], ..., [200,  99,  90]],
       [[226, 137, 125], ..., [200,  99,  90]],
       ...,
       [[ 84,  18,  60], ..., [177,  62,  79]],
       [[ 82,  22,  57], ..., [185,  74,  81]],
       [[ 82,  22,  57], ..., [185,  74,  81]]], dtype=uint8)
```

If we were to separate the R, G, and B pixels from the above matrix, we would get -

```
array([[226, 226, 223, ..., 230, 221, 200],
       [226, 226, 223, ..., 230, 221, 200],
       [226, 226, 223, ..., 230, 221, 200],
       ...,
       [ 84,  84,  92, ..., 173, 172, 177],
       [ 82,  82,  96, ..., 179, 181, 185],
       [ 82,  82,  96, ..., 179, 181, 185]], dtype=uint8)
```

```
array([[137, 137, 137, ..., 148, 130,  99],
       [137, 137, 137, ..., 148, 130,  99],
       [137, 137, 137, ..., 148, 130,  99],
       ...,
       [ 18,  18,  27, ...,  73,  68,  62],
       [ 22,  22,  32, ...,  70,  71,  74],
       [ 22,  22,  32, ...,  70,  71,  74]], dtype=uint8)
```

```
array([[125, 125, 133, ..., 122, 110,  90],
       [125, 125, 133, ..., 122, 110,  90],
       [125, 125, 133, ..., 122, 110,  90],
       ...,
       [ 60,  60,  58, ...,  84,  76,  79],
       [ 57,  57,  62, ...,  79,  81,  81],
       [ 57,  57,  62, ...,  79,  81,  81]], dtype=uint8)
```

Whatever operation we compute on the grayscale image, we will need to compute the same on the RGB image, but 3 times: once each for the separated R, G, and B pixels, finally merging them back into a proper RGB image.
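The split-process-merge pattern can be sketched like this, on a hypothetical 2x2 RGB matrix (here the "processing" step is the identity, just to show that slicing the channels out and stacking them with `np.dstack` round-trips the image):

```python
import numpy as np

# Hypothetical 2x2 RGB image: each pixel is an (R, G, B) triple
rgb = np.array([[[226, 137, 125], [200,  99,  90]],
                [[ 84,  18,  60], [177,  62,  79]]], dtype=np.uint8)

# Separate the three channels; a real pipeline would transform each here
r, g, b = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]

# Stack the (possibly transformed) channels back into an H x W x 3 image
merged = np.dstack((r, g, b))
print(np.array_equal(merged, rgb))  # → True
```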

The packages that we mainly use are -

- NumPy
- Matplotlib
- OpenCV

```python
import cv2
import numpy as np
from matplotlib import pyplot as plt
```

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in grayscale or RGB and returns the image matrix.

For converting the image into a binary image, we can simply make use of the **threshold()** method available in the `cv2` library. It takes the following arguments -

- **src** - the image matrix.
- **thresh** - the threshold value based on which pixels are given a new value. If a pixel is less than or equal to this value, it is revalued to *0*; otherwise, it is revalued to *255*.
- **maxval** - the maximum pixel value that a typical image can contain (255).
- **type** - the thresholding type, based on which the operation is computed. There are several types with which the operation can be performed.

After this, we will plot the results to see the variation and hence the below function.

```python
def binarize_lib(image_file, thresh_val=127, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    th, image_b = cv2.threshold(src=image_src, thresh=thresh_val, maxval=255, type=cv2.THRESH_BINARY)
    if with_plot:
        cmap_val = None if not gray_scale else 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Binarized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b
```

Let's test the above function -

`binarize_lib(image_file='lena_original.png', with_plot=True)`

`binarize_lib(image_file='lena_original.png', with_plot=True, gray_scale=True)`

Now that we have seen the results for both the original and binary images, it is obvious that the library code works for both. It's time to get our hands dirty and code the same from scratch.

First, we will write a function that revalues the pixel values greater than the specified threshold to **255** and the rest to **0**.

By doing it, we will see something like below -

```python
def convert_binary(image_matrix, thresh_val):
    white = 255
    black = 0

    initial_conv = np.where((image_matrix <= thresh_val), image_matrix, white)
    final_conv = np.where((initial_conv > thresh_val), initial_conv, black)
    return final_conv
```
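As a quick sanity check, the function above (repeated here so the snippet is self-contained) can be run on a tiny hypothetical 2x2 patch:

```python
import numpy as np

def convert_binary(image_matrix, thresh_val):
    # Two-pass np.where: first send pixels above the threshold to white,
    # then send the remaining (<= threshold) pixels to black
    white, black = 255, 0
    initial_conv = np.where(image_matrix <= thresh_val, image_matrix, white)
    return np.where(initial_conv > thresh_val, initial_conv, black)

patch = np.array([[10, 200],
                  [127, 128]], dtype=np.uint8)
print(convert_binary(patch, thresh_val=127))  # rows become [[0, 255], [0, 255]]
```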

We will call the above function three times, once each for the separated R, G, and B values, and finally merge the results to obtain the binarized image. Once done, we can plot the results just like we did before.

```python
def binarize_this(image_file, thresh_val=127, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        cmap_val = None
        r_img, g_img, b_img = image_src[:, :, 0], image_src[:, :, 1], image_src[:, :, 2]
        r_b = convert_binary(image_matrix=r_img, thresh_val=thresh_val)
        g_b = convert_binary(image_matrix=g_img, thresh_val=thresh_val)
        b_b = convert_binary(image_matrix=b_img, thresh_val=thresh_val)
        image_b = np.dstack(tup=(r_b, g_b, b_b))
    else:
        cmap_val = 'gray'
        image_b = convert_binary(image_matrix=image_src, thresh_val=thresh_val)
    if with_plot:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2.axis("off")
        ax2.title.set_text("Binarized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_b, cmap=cmap_val)
        return True
    return image_b
```

We have made our binarizing code using just NumPy. Let's test the same -

`binarize_this(image_file='lena_original.png', with_plot=True)`

`binarize_this(image_file='lena_original.png', with_plot=True, gray_scale=True)`

This is it. Whatever we wanted to accomplish, we have accomplished. The results are quite similar to the ones we got using the library code.

Hence this concludes the aim of this article.

Do give a read to my other articles.

I have been practicing image processing for quite a while - manipulating images (image matrices, to be precise). In doing so, I got to explore equalizing methods for images, which enhance the contrast to such an extent that the manipulated image looks better than the original. This technique is termed **Histogram Equalization**.

**Credits of Cover Image** - Photo by Andreas Gücklhorn on Unsplash

Oftentimes, the captured image will not be the same as the natural view. In order to meet the level of the natural view, post-processing should be done. Histogram Equalization (Normalization) is one such technique for enhancing the contrast by tweaking the pixel values of the image.

An example can be seen below - original image and equalized image.

If we were to plot the image histograms, it would look something like below -

**Credits** - The above images have been taken from the Internet for showing the examples.

This method works well for both bright and dark images. It is especially important in the field of medical science, for example in analyzing X-ray images.

It is also very useful in viewing scientific images like thermal images and satellite images.

In this article, I will implement this method both by using the `OpenCV` library and from scratch with just `NumPy` and `Matplotlib`. Although I would like to do it without `NumPy` as well, that would take much more time to compute.

**Note** - For coding from scratch, I will use `OpenCV` to read the image and nothing else.

I have taken the Lena image for testing the functions and have saved it in my working directory.

The packages that we mainly use are:

- NumPy
- Matplotlib
- OpenCV (only used for reading the image in this article)

```python
import cv2
import numpy as np
from matplotlib import pyplot as plt
```

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_src = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_src
```

The above function reads the image either in `gray_scale` or `RGB` and returns the image matrix.

For equalizing, we can simply use the `equalizeHist()` method available in the `cv2` library. We have two aspects here -

- When the image is read in `RGB`:
  - Separate the pixels based on the color combination. We can use the `split()` method available in the `cv2` library.
  - Apply the equalization method for each matrix.
  - Merge the equalized image matrices together with the `merge()` method available in the `cv2` library.
- When the image is read in `gray_scale`:
  - Just apply the equalization method to the image matrix.

Finally, plot the original image and the equalized image.

```python
def equalize_this(image_file, with_plot=False, gray_scale=False):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_image, g_image, b_image = cv2.split(image_src)
        r_image_eq = cv2.equalizeHist(r_image)
        g_image_eq = cv2.equalizeHist(g_image)
        b_image_eq = cv2.equalizeHist(b_image)
        image_eq = cv2.merge((r_image_eq, g_image_eq, b_image_eq))
        cmap_val = None
    else:
        image_eq = cv2.equalizeHist(image_src)
        cmap_val = 'gray'
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Equalized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_eq, cmap=cmap_val)
        return True
    return image_eq
```

Let's test the above function -

`equalize_this(image_file='lena_original.png', with_plot=True)`

`equalize_this(image_file='lena_original.png', with_plot=True, gray_scale=True)`

The above plots are clear, and we can say that the equalized images look better than the originals. This was implemented using the `cv2` library.

For this, I am using `NumPy` for all the matrix operations. Again, we could do it with `for` loops, but that would take more time to compute. Even here we have the same two aspects as before -

- When the image is read in `RGB`:
  - Separate the pixels based on the color combination. We can slice the channels out using `NumPy` indexing.
  - Apply the equalization method for each matrix.
  - Merge the equalized image matrices together with the `dstack(tup=())` method available in the `NumPy` library.
- When the image is read in `gray_scale`:
  - Just apply the equalization method to the image matrix.

Finally, plot the original image and the equalized image.

Let's write our own function to compute the image equalization. Image pixel values normally lie in the range 0 to 255, so in total we have 256 intensity levels (bins).

```python
def enhance_contrast(image_matrix, bins=256):
    image_flattened = image_matrix.flatten()
    image_hist = np.zeros(bins)

    # frequency count of each pixel
    for pix in image_flattened:
        image_hist[pix] += 1

    # cumulative sum
    cum_sum = np.cumsum(image_hist)
    norm = (cum_sum - cum_sum.min()) * 255
    # normalization of the pixel values
    n_ = cum_sum.max() - cum_sum.min()
    uniform_norm = norm / n_
    uniform_norm = uniform_norm.astype('int')

    # flat histogram
    image_eq = uniform_norm[image_flattened]
    # reshaping the flattened matrix to its original shape
    image_eq = np.reshape(a=image_eq, newshape=image_matrix.shape)
    return image_eq
```

**Credits** - The above code is inspired by the article written by *Tory Walker*.

The above function returns an equalized image matrix when passed the original image matrix as an argument.
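To see what this cumulative-histogram mapping does, here is the same computation worked on a hypothetical 2x2 image with only 4 intensity levels (`bins=4`), using `np.bincount` for brevity:

```python
import numpy as np

# Hypothetical 2x2 image with pixel values in 0..3 (bins=4 for brevity)
img = np.array([[0, 1],
                [1, 3]])

hist = np.bincount(img.flatten(), minlength=4)  # histogram: [1, 2, 0, 1]
cdf = np.cumsum(hist)                           # cumulative sum: [1, 3, 3, 4]

# Normalize the CDF to the 0..255 range to get a lookup table,
# then map every pixel through it
lut = ((cdf - cdf.min()) * 255 / (cdf.max() - cdf.min())).astype(int)
print(lut)       # [0, 170, 170, 255]
print(lut[img])  # the equalized image: [[0, 170], [170, 255]]
```

The lookup-table step `lut[img]` is exactly the `uniform_norm[image_flattened]` line in `enhance_contrast`, just without the flatten/reshape bookkeeping.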

Let's write another function that computes the equalization for both the `RGB` image and the `gray_scale` image, using the above function.

```python
def equalize_this(image_file, with_plot=False, gray_scale=False, bins=256):
    image_src = read_this(image_file=image_file, gray_scale=gray_scale)
    if not gray_scale:
        r_image = image_src[:, :, 0]
        g_image = image_src[:, :, 1]
        b_image = image_src[:, :, 2]
        r_image_eq = enhance_contrast(image_matrix=r_image)
        g_image_eq = enhance_contrast(image_matrix=g_image)
        b_image_eq = enhance_contrast(image_matrix=b_image)
        image_eq = np.dstack(tup=(r_image_eq, g_image_eq, b_image_eq))
        cmap_val = None
    else:
        image_eq = enhance_contrast(image_matrix=image_src)
        cmap_val = 'gray'
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Equalized")
        ax1.imshow(image_src, cmap=cmap_val)
        ax2.imshow(image_eq, cmap=cmap_val)
        return True
    return image_eq
```

Let's test the above function -

`equalize_this(image_file='lena_original.png', with_plot=True)`

`equalize_this(image_file='lena_original.png', with_plot=True, gray_scale=True)`

The above plots are clear, and we can say that the equalized images look better than the originals. This was implemented from scratch using the `NumPy` library.

Let's compare the equalized image obtained from the `cv2` library with the equalized image obtained from the code written from scratch.

We can notice there is a slight difference between the `library image` and the `scratch image`. But both look clear when compared with the `original image`. Here I conclude my article with my own takeaways.

Personally, I learned a lot by exploring and implementing the different methods used to increase image intensity, especially by trying to implement the code from scratch through both reference and learning.

It is always good to use the library methods, as they tend to be more optimized and work reliably.

Image processing is a very crucial subject to learn, and it really deserves to be practiced with curiosity and one's own exploration.

Do give a read to my other articles and let me know your thoughts -

In this blog article, I will explain how we can mirror or flip an image in Python. All that is required is an understanding of the basic mathematics behind various matrix operations and manipulations.

**Credits of Cover Image** - Photo by Michael Busch on Unsplash

- **NumPy** - for matrix operations and manipulation.
- **OpenCV** - for reading the image and converting it into a 2D array (matrix).
- **Matplotlib** - for plotting the matrix as an image.

For this mini-project, I am using the famous Lena image, which is widely used for testing computer vision models. Make sure to download this image and save it in the current working directory.

```python
import cv2
import numpy as np
from matplotlib import pyplot as plt
```

First, we read the image file using the `imread()` method from the `cv2` module. By doing this, we get the image in the form of a matrix.

By default, the `imread()` method reads the image in `BGR` (`Blue`, `Green`, `Red`) format. To convert the read image into the regular format, i.e., `RGB` (`Red`, `Green`, `Blue`), we use the `cvtColor()` method from the same `cv2` module.

```python
def read_this(image_file, gray_scale=False):
    image_src = cv2.imread(image_file)
    if gray_scale:
        image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
    else:
        image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
    return image_rgb
```

The above function returns an image matrix from the image file that is passed. It consists of a regular `if`/`else` condition to get the image matrix either in `gray_scale` or `RGB` format.

To mirror the image, we basically need to reverse the matrix row-wise from left to right. For example, let's consider a matrix `A`.

```python
>>> A = [
...     [4, 1, 1],
...     [2, 8, 0],
...     [3, 8, 1]
... ]
```

If we want to mirror this matrix (row-wise), then it would be -

```python
>>> import numpy as np
>>> mirror_ = np.fliplr(A)
>>> mirror_
array([[1, 1, 4],
       [0, 8, 2],
       [1, 8, 3]])
```

We can also do this without using the `NumPy` module, in which case we would use loops and reverse each row. This would take a while if performed on image matrices, as they are very large, and we don't want our code to perform very slowly.

```python
def mirror_this(image_file, gray_scale=False, with_plot=False):
    image_rgb = read_this(image_file=image_file, gray_scale=gray_scale)
    image_mirror = np.fliplr(image_rgb)
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Mirrored")
        if not gray_scale:
            ax1.imshow(image_rgb)
            ax2.imshow(image_mirror)
        else:
            ax1.imshow(image_rgb, cmap='gray')
            ax2.imshow(image_mirror, cmap='gray')
        return True
    return image_mirror
```

The above function returns an image matrix that is reversed or flipped row-wise from left to right.

Let's plot the same -

`mirror_this(image_file="lena_original.png", with_plot=True)`

`mirror_this(image_file="lena_original.png", gray_scale=True, with_plot=True)`

To flip the image, we basically need to reverse the matrix column-wise from top to bottom. For example, let's consider a matrix `B`.

```python
>>> B = [
...     [4, 1, 1],
...     [2, 8, 0],
...     [3, 8, 1]
... ]
```

If we want to flip this matrix (column-wise), then it would be -

```python
>>> import numpy as np
>>> flip_ = np.flipud(B)
>>> flip_
array([[3, 8, 1],
       [2, 8, 0],
       [4, 1, 1]])
```

We use `NumPy` for flipping the matrix in order to keep the code fast.

```python
def flip_this(image_file, gray_scale=False, with_plot=False):
    image_rgb = read_this(image_file=image_file, gray_scale=gray_scale)
    image_flip = np.flipud(image_rgb)
    if with_plot:
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text("Flipped")
        if not gray_scale:
            ax1.imshow(image_rgb)
            ax2.imshow(image_flip)
        else:
            ax1.imshow(image_rgb, cmap='gray')
            ax2.imshow(image_flip, cmap='gray')
        return True
    return image_flip
```

The above function returns an image matrix that is reversed or flipped column-wise from up to down.

Let's plot the same -

`flip_this(image_file='lena_original.png', with_plot=True)`

`flip_this(image_file='lena_original.png', gray_scale=True, with_plot=True)`

```python
class ImageOpsFromScratch(object):
    def __init__(self, image_file):
        self.image_file = image_file

    def read_this(self, gray_scale=False):
        image_src = cv2.imread(self.image_file)
        if gray_scale:
            image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2GRAY)
        else:
            image_rgb = cv2.cvtColor(image_src, cv2.COLOR_BGR2RGB)
        return image_rgb

    def mirror_this(self, with_plot=True, gray_scale=False):
        image_rgb = self.read_this(gray_scale=gray_scale)
        image_mirror = np.fliplr(image_rgb)
        if with_plot:
            self.plot_it(orig_matrix=image_rgb, trans_matrix=image_mirror, head_text='Mirrored', gray_scale=gray_scale)
            return None
        return image_mirror

    def flip_this(self, with_plot=True, gray_scale=False):
        image_rgb = self.read_this(gray_scale=gray_scale)
        image_flip = np.flipud(image_rgb)
        if with_plot:
            self.plot_it(orig_matrix=image_rgb, trans_matrix=image_flip, head_text='Flipped', gray_scale=gray_scale)
            return None
        return image_flip

    def plot_it(self, orig_matrix, trans_matrix, head_text, gray_scale=False):
        fig = plt.figure(figsize=(10, 20))
        ax1 = fig.add_subplot(2, 2, 1)
        ax1.axis("off")
        ax1.title.set_text('Original')
        ax2 = fig.add_subplot(2, 2, 2)
        ax2.axis("off")
        ax2.title.set_text(head_text)
        if not gray_scale:
            ax1.imshow(orig_matrix)
            ax2.imshow(trans_matrix)
        else:
            ax1.imshow(orig_matrix, cmap='gray')
            ax2.imshow(trans_matrix, cmap='gray')
        return True
```

```python
imo = ImageOpsFromScratch(image_file='lena_original.png')

### Mirroring ###
imo.mirror_this()
imo.mirror_this(gray_scale=True)

### Flipping ###
imo.flip_this()
imo.flip_this(gray_scale=True)
```

The above image results will be displayed. Now that everything is sorted, we can create other image operations like `equalize()`, `solarize()`, etc. I have not covered those in this article, but definitely will in the next one.

- By implementing this, I have learned and understood the logic behind various image operations and how to put that logic in the form of code.

We can also do the same using the well-known image library `Pillow`. But, as developers, we need to know how things work from start to end. This is something everyone should be curious about.

Please do read my other article related to image processing and computer vision, where I explain *image convolution from scratch*.

Recently, I took up a small project of developing a Python package and deploying it on PyPI. The package is based on a famous problem in Graph Theory - the **Travelling Salesman Problem**.

The simple procedure to solve this is to take random nodes with random distance measures between them and apply the concept of **Dynamic Programming**, which breaks the bigger problem into smaller sub-problems and obtains the optimal solution for the bigger problem by first finding the optimal solutions for the smaller sub-problems. Recursive methods are then applied to solve each sub-problem and thus solve the original problem.
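The sub-problem idea above is the classic Held-Karp dynamic-programming formulation of TSP. A minimal sketch (not the package's actual implementation, and using a made-up 4-node distance matrix) looks like this:

```python
from itertools import combinations

def held_karp(dist):
    """Cost of the shortest closed tour visiting all nodes, starting/ending at node 0."""
    n = len(dist)
    # dp[(S, j)] = min cost of a path from node 0 that visits every node in
    # set S exactly once and ends at node j (smaller sub-problems)
    dp = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for S in combinations(range(1, n), size):
            S = frozenset(S)
            for j in S:
                # extend an optimal smaller path ending at some k to end at j
                dp[(S, j)] = min(dp[(S - {j}, k)] + dist[k][j] for k in S - {j})
    full = frozenset(range(1, n))
    # close the tour by returning to node 0
    return min(dp[(full, j)] + dist[j][0] for j in range(1, n))

# Hypothetical symmetric distance matrix for 4 nodes
dist = [[0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
print(held_karp(dist))  # → 80 (tour 0 -> 1 -> 3 -> 2 -> 0)
```

The package replaces this toy matrix with real geodesic distances fetched from place coordinates, but the dynamic-programming core is the same idea.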

**Note**

To understand how to implement this programmatically, it is recommended that you refer to my medium blog where I explain the procedure in detail.

I have implemented the same but taken it to an advanced level by considering authentic place coordinates and finding the shortest path using appropriate distance measures.

**Credits of Cover Image** - Photo by Nicolas Van Leekwijck on Unsplash

The name of the package is `dora-explorer`. If you watched cartoons as a child, you probably already know who **Dora** is. **Dora The Explorer** is one such cartoon program, in which Dora explores different places. I have simply combined the concept of the cartoon with the travelling salesman problem in a way that programmers and users can benefit from.

Instead of just solving the problem with dynamic programming, I wanted to make it work for real place names. In the background, it fetches the coordinate values, computes the distances, and thus finds the optimal shortest path. The package has methods to visualize the shortest path, including the place coordinates, on a map. Not just joining lines from point `A` to point `B`, it can get the exact route required to reach point `B` from point `A`. All these features make the package unique and new.

The current version of the package is `0.2.2`, and the link can be found below.

It can be easily installed via `pip` -

```shell
pip install dora-explorer --user
```

**Requirements**

```
plotly
mpu
unidecode
geonamescache
requests
```

The `pip` command automatically installs the requirements.

`import` the package -

```python
from dora_explorer.distance_locator import DistanceLocator
from dora_explorer.travel_places import GeoTraveller
from dora_explorer.tiny_explore import DoraTheExplorer
```

Create the objects -

```python
dis_loc = DistanceLocator()
geo_travel = GeoTraveller()
```

Define `from_place` and `to_place` -

```python
from_place = 'singapore'
to_place = 'new zealand'
```

The method `get_distance()` is used to get the distance between two places. The default distance is in `km`. Allowed options are -

- Miles
- Meters
- Feet and
- Yards

```python
## kilometers
from_to_dis_km = dis_loc.get_distance(from_=from_place, to_=to_place)
print("The distance in kms from {} to {} - {} kms".format(from_place, to_place, from_to_dis_km))

## miles
from_to_dis_miles = dis_loc.get_distance(from_=from_place, to_=to_place, in_miles=True)
print("The distance in miles from {} to {} - {} miles".format(from_place, to_place, from_to_dis_miles))

## meters
from_to_dis_meters = dis_loc.get_distance(from_=from_place, to_=to_place, in_meters=True)
print("The distance in meters from {} to {} - {} meters".format(from_place, to_place, from_to_dis_meters))

## feet
from_to_dis_feet = dis_loc.get_distance(from_=from_place, to_=to_place, in_feet=True)
print("The distance in feet from {} to {} - {} feet".format(from_place, to_place, from_to_dis_feet))

## yards
from_to_dis_yards = dis_loc.get_distance(from_=from_place, to_=to_place, in_yards=True)
print("The distance in yards from {} to {} - {} yards".format(from_place, to_place, from_to_dis_yards))
```

**Output**

```
'The distance in kms from singapore to new zealand - 8358.12 kms'
'The distance in miles from singapore to new zealand - 5194.6 miles'
'The distance in meters from singapore to new zealand - 8358116.45 meters'
'The distance in feet from singapore to new zealand - 27422980.08 feet'
'The distance in yards from singapore to new zealand - 9143779.4 yards'
```

We can also visualize the distance plot from `from_place` to `to_place`. To do this, we require a map; hence, it is preferred to have a MapBox API token, which can be obtained from the official website.

I have saved my API token in a `mapbox_api.txt` file.

`read()` the MapBox API token -

```python
with open(file='mapbox_api.txt', mode='r') as api_file:
    map_api = api_file.read()
```

To get the distance plot, we have to use the method `get_distance_plot()`, and the argument `geo_token` should take a string value (the API token).

There are two main conditions to get the appropriate distance plot.

**Case - 1**: If `geo_token=map_api`, then -

- If `from_place` and `to_place` belong to the **same country**, then `with_map=True` and `with_directions=True`.

```python
## belong to the same country - India
from_place = 'delhi'
to_place = 'bengaluru'

dis_loc.get_distance_plot(
    from_=from_place,
    to_=to_place,
    with_map=True,
    with_directions=True,
    geo_token=map_api
)
```

- If `from_place` and `to_place` belong to **different countries**, then `with_map=True` and `with_directions=False`.

```python
from_place = 'singapore'
to_place = 'new zealand'

dis_loc.get_distance_plot(
    from_=from_place,
    to_=to_place,
    with_map=True,
    with_directions=False,
    geo_token=map_api
)
```

**Case - 2**: If `geo_token=None` or `geo_token=""`, then `with_map=False` and `with_directions=False`.

- The distance plot will be simply a line plot from two coordinates (latitudes and longitudes).

```python
from_place = 'delhi'
to_place = 'bengaluru'

dis_loc.get_distance_plot(from_=from_place, to_=to_place)
```

In order to get the shortest path, a `list` of places should be given as input. There are two possible conditions for the result.

**Case - 1**: The `list` of places belongs to the **same country**, considering `geo_token=map_api`.

- Let's take places belonging to Spain.

```python
place_list = ["madrid", "barcelona", "alicante", "palma"]

explore = DoraTheExplorer(place_list=place_list)
explore.get_path(
    source_city='palma',
    with_map=True,
    with_directions=True,
    geo_token=map_api
)
```

**Output**

```
'plot is saved successfully ...'
'palma >> barcelona >> madrid >> alicante >> palma'
```

**Note** - If `geo_token=None` and all the places belong to the **same country**, then a line plot is plotted after joining the coordinates (latitudes and longitudes).

**Minimum distance to cover all the places**

```python
## kilometers
min_distance_km = explore.get_min_dis(source_city='palma')
print("The minimum distance to cover {} in kms - {}".format(place_list, min_distance_km))

## miles
min_distance_miles = explore.get_min_dis(source_city='palma', in_miles=True)
print("The minimum distance to cover {} in miles - {}".format(place_list, min_distance_miles))

## meters
min_distance_meters = explore.get_min_dis(source_city='palma', in_meters=True)
print("The minimum distance to cover {} in meters - {}".format(place_list, min_distance_meters))

## feet
min_distance_feet = explore.get_min_dis(source_city='palma', in_feet=True)
print("The minimum distance to cover {} in feet - {}".format(place_list, min_distance_feet))

## yards
min_distance_yards = explore.get_min_dis(source_city='palma', in_yards=True)
print("The minimum distance to cover {} in yards - {}".format(place_list, min_distance_yards))
```

**Output**

```
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in kms - 1372.78"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in miles - 853.19"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in meters - 1372780.0"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in feet - 4504091.18"
"The minimum distance to cover ['madrid', 'barcelona', 'alicante', 'palma'] in yards - 1501821.32"
```

**Case - 2**: The `list` of places belongs to **different countries**, considering `geo_token=map_api` or `geo_token=None`.

- Let's take the places belonging to different countries.

```python
place_list = ['london', 'madrid', 'bengaluru', 'delhi']

explore = DoraTheExplorer(place_list=place_list)
explore.get_path(
    source_city='london',
    with_map=True,
    with_directions=True,
    geo_token=map_api
)
```

**Note** - Finding the shortest path or visualizing it can be difficult here: the route-directions API does not work if the places belong to different countries.

**Output**

`"Cannot find the shortest path, as the cities ['london', 'madrid', 'bengaluru', 'delhi'] do not belong to the same country."`

**Minimum distance to cover all the places**

```python
min_distance = explore.get_min_dis(source_city='london')
print(min_distance)
```

**Output**

`"Cannot find the minimum distance to cover the cities ['london', 'madrid', 'bengaluru', 'delhi'] as they do not belong to the same country."`

That's it - a simple package to explore the details of places, finally deployed and documented.

- The package is still under development. Quite a few features need to be added and a few bugs resolved.
- If you have liked the idea and the package, you can star it on my GitHub page.
- If you would like to contribute, you are welcome to open a PR.
- If you would like to report an issue, I would be happy to check and resolve the same.

**End**