Introduction
Matrix is a type of data structure similar to an array where values are stored in rows and columns. Here, the values are of a unique type. When dealing with matrices (linear algebra) in Machine Learning and NLP, we often hear about two types of matrices as -
Dense Matrix - The matrix where most of the elements are non-zero. In this matrix, there are very few zero elements.
Sparse Matrix - In contrast, the matrix where most of the elements zero and very few elements are non-zero.
Note - There are no criteria as such how many zero values in a matrix determine that there is a need to sparse the matrix.
Credits of Cover Image - Photo by Alexander Schimmeck on Unsplash
Algorithm Explanation
Imagine you have a large matrix with N
rows and M
columns in which most of the values are zeros. You are asked to consider only non-zero elements since zero elements do not add much value. Definitely, you have time and space constraints since you are dealing with a very large matrix.
The simple method here is to neglect all 0
's and store the non-zero elements in the form of [ row_index
, col_index
, non-zero_value
].
Approach:
- Iterate through each
row
of an input matrix. - Iterate through each
value
of eachrow
. - If (value == 0) → skip.
- If (value != 0) → obtain the
row
number andcolumn
number along with thevalue
from the input matrix. - Save the result as [
row
,column
,value
] for everyrow
andcolumn
. - Stop.
Visual Explanation
Let's say we are given a matrix that has most of the elements to be 0
.
The following GIF
explains how to obtain the sparse matrix (GIF by Author).
Time to Code
In this section, we will try to code this in two different ways. One, with the help of the scipy
module and another, implementing our own sparse matrix.
Let's get started ...
Necessary Imports
import numpy as np
from scipy.sparse import csr_matrix
Random Matrix Creation
Just for the demonstration, we will make sure that the matrix contains 0
elements the most.
>>> mat = np.random.randint(low=0, high=3, size=(5, 5))
>>> print(mat)
[[2 2 0 2 0]
[2 1 0 0 2]
[2 1 0 1 0]
[0 1 2 0 2]
[0 1 2 2 1]]
SCIPY Implementation
With the help of the method csr_matrix()
we can easily obtain the sparse matrix.
>>> smat = csr_matrix(mat)
>>> print(smat)
(0, 0) 2
(0, 1) 2
(0, 3) 2
(1, 0) 2
(1, 1) 1
(1, 4) 2
(2, 0) 2
(2, 1) 1
(2, 3) 1
(3, 1) 1
(3, 2) 2
(3, 4) 2
(4, 1) 1
(4, 2) 2
(4, 3) 2
(4, 4) 1
We can think of a dictionary
in python which is a key
and value
paired. The above output is something like a dictionary
where keys are the index location (row, column) and values are the actual non-zero elements.
Custom Implementation
In order to implement it from scratch, we can follow the Algorithmic approach that I have explained previously.
class SparseMatrix():
def __init__(self, arr):
self.arr = arr
def retain_sparsity(self, to_dict=False):
sparse_mat = [
[rindx, cindx, val]
for (rindx, row) in enumerate(self.arr)
for (cindx, val) in enumerate(row)
if (val != 0)
]
if to_dict:
sparse_mat = {(r, c) : v for (r, c, v) in sparse_mat}
return sparse_mat
The above class SparseMatrix()
has a method retain_sparsity
wherein the default argument to_dict
is False
. The argument to_dict
is used whether to get the output in the form of dictionary
or not.
Object Creation
sparse = SparseMatrix(arr=mat)
Output - List
>>> smat_c = sparse.retain_sparsity()
>>> print(smat_c)
[[0, 0, 2],
[0, 1, 2],
[0, 3, 2],
[1, 0, 2],
[1, 1, 1],
[1, 4, 2],
[2, 0, 2],
[2, 1, 1],
[2, 3, 1],
[3, 1, 1],
[3, 2, 2],
[3, 4, 2],
[4, 1, 1],
[4, 2, 2],
[4, 3, 2],
[4, 4, 1]]
Output - dictionary
>>> smat_d = sparse.retain_sparsity(to_dict=True)
>>> print(smat_d)
{(0, 0): 2,
(0, 1): 2,
(0, 3): 2,
(1, 0): 2,
(1, 1): 1,
(1, 4): 2,
(2, 0): 2,
(2, 1): 1,
(2, 3): 1,
(3, 1): 1,
(3, 2): 2,
(3, 4): 2,
(4, 1): 1,
(4, 2): 2,
(4, 3): 2,
(4, 4): 1}
Conclusion
When we have space constraints while working with large matrices, it is often preferred to convert the matrix into sparse representation and this really takes less space comparatively the original matrix.
In fact, we can check the space (in bytes) occupied by the original matrix
mat
and thesparse
matrix.
>>> from sys import getsizeof
>>> # checking the space of original matrix
>>> getsizeof(mat)
156
>>> # checking the space of scipy sparse matrix
>>> getsizeof(smat)
24
>>> # checking the custom implementation sparse matrix
>>> getsizeof(smat_c)
92
What we can observe is, the
scipy
method consumes less space than our custom method. It is becausescipy
is an optimized well-developed library mainly used for various scientific computations.It is always better to use library methods than our own code to achieve faster results with fewer space constraints.
Buy Me Coffee
If you have liked my article you can buy some coffee and support me here. That would motivate me to write and learn more about what I know.