Group Code: J062

   Tejas N Vijapur
   Neel Jadhao
   Harshit Mittal

Efficient Sparse Matrices with Scipy

Introduction

Here's the thing: Real-world data is often riddled with missing or irrelevant values. Take recommendation systems, for example—thousands of users interact with millions of products, but each user engages with only a tiny fraction of them. This results in a sparse matrix/dataset, where most entries are zeros (indicating no interaction). Sparse datasets are everywhere! Whether it’s text data (where documents contain only a few words from a massive vocabulary), user-item recommendation systems, or even bioinformatics (where only specific genes are active in gene expression data), dealing with sparsity is more common than you might expect.

As you might expect, there's plenty of room to improve efficiency when working with sparse datasets. SciPy's sparse module (scipy.sparse) provides specialized data structures and algorithms for handling sparse matrices: matrices that contain mostly zero values.

Practical Applications

Now, before we get into the technical stuff, let's look at some areas where sparse matrices show up frequently.

Recommendation Systems:

The reason is almost obvious once you think about it: among the thousands of available movies, most users will rate only a few.
Let’s consider an example:
Imagine we have a User-Movie Rating Matrix, where rows represent users, columns represent movies, and the values indicate ratings given by users.

Recommendation Systems example

Here, you can see that users rate only a few movies. This creates a sparse matrix, where most of the values are zeros, indicating no interaction between users and certain movies.

Bioinformatics:

In bioinformatics, dealing with sparse data is common. For instance, when studying gene expression, not all genes are active in every tissue or condition. This results in a dataset where most values are zeros, indicating no expression.

BioInformatics example

Take a look at the example above: each row represents a gene, and each column represents a condition. Since only a few genes are expressed in each condition, the matrix is mostly filled with zeros, making it a prime example of sparsity in biological data.

Installation and Setup

pip install scipy

Then you can import what you need from the sparse module:

from scipy.sparse import ...

Storing Sparse Matrices

Imagine you have a huge dataset with mostly zero values. If you store it as a regular dense array in Python, memory is allocated for all the zeroes as well, and that's a huge waste.

What's worse, many operations like matrix multiplication churn through all those zeroes too, when we could simply ignore them.
That brings us to the core of the problem: how do we store only the non-zero values and skip the zeroes?
SciPy provides several formats, chosen based on the kind of operations you want to perform. For computation, CSR and CSC are usually better than COO, since COO is mostly used to build sparse matrices, and BSR only pays off for a narrow class of block-structured datasets. SciPy also offers additional formats designed specifically for constructing sparse matrices.
The formats in which SciPy can store sparse matrices are:

1. Compressed Sparse Row (CSR) Format

CSR is very efficient for row-based operations, such as row slicing and matrix multiplication.

Storage Structure:

Delving deeper, suppose we have:


Array = [0, 0, 3]
        [4, 0, 0]
        [0, 5, 0]
    

Then SciPy stores the following three arrays
(though oversimplified, it helps to think of it this way):


Values = [3, 4, 5]
Column Indices = [2, 0, 1]
Row Pointers = [0, 1, 2, 3]
    

The program then slices out each row using consecutive row pointers:


Values[0:1] = [3]    # First Row Element
Values[1:2] = [4]    # Second Row Element
Values[2:3] = [5]    # Third Row Element
    

Since we already know the column indices, we reconstruct the matrix efficiently.
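We can verify this picture with SciPy itself: a csr_matrix exposes these three arrays as the attributes data, indices, and indptr (the "row pointers" above). A minimal sketch using the matrix from the example:

```python
import numpy as np
from scipy.sparse import csr_matrix

# The 3x3 matrix from the example above
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])
csr = csr_matrix(dense)

print(csr.data)     # [3 4 5]    -- the non-zero values
print(csr.indices)  # [2 0 1]    -- column index of each value
print(csr.indptr)   # [0 1 2 3]  -- row i lives in data[indptr[i]:indptr[i+1]]
```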

2. Compressed Sparse Column (CSC)

Similar to CSR but stores data column-wise.


Values:           [4, 5, 3]
Row Indices:      [1, 2, 0]
Column Pointers:  [0, 1, 2, 3]
    
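The same inspection works for CSC, except that indices now holds row indices and indptr points into columns; a minimal sketch:

```python
import numpy as np
from scipy.sparse import csc_matrix

# The same 3x3 matrix, now stored column-wise
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])
csc = csc_matrix(dense)

print(csc.data)     # [4 5 3]    -- values read column by column
print(csc.indices)  # [1 2 0]    -- row index of each value
print(csc.indptr)   # [0 1 2 3]  -- column pointers
```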

3. Coordinate List (COO)

Stores matrix elements as a list of (row, column, value) tuples.


(row, col): value
(0, 2): 3
(1, 0): 4
(2, 1): 5
    
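In SciPy, a coo_matrix keeps this same information in three parallel arrays (row, col, data) rather than literal tuples; a quick sketch:

```python
import numpy as np
from scipy.sparse import coo_matrix

# The same 3x3 matrix in coordinate form
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])
coo = coo_matrix(dense)

print(coo.row)   # [0 1 2]  -- row of each non-zero
print(coo.col)   # [2 0 1]  -- column of each non-zero
print(coo.data)  # [3 4 5]  -- the values themselves
```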

4. Block Sparse Row (BSR)

BSR stores non-zero values in blocks rather than individual elements. It is useful for structured matrices, such as those in physics and engineering simulations.


Matrix = [1, 2, 0, 0]
         [3, 4, 0, 0]
         [0, 0, 5, 6]
         [0, 0, 7, 8]

Blocks:
    [
        [[1, 2], [3, 4]],  # Block (0,0)
        [[5, 6], [7, 8]]   # Block (1,1)
    ]

Block Indices:  [0, 1]      # Analogous to the row indices and column pointers above
Block Pointers: [0, 1, 2]
    
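SciPy's bsr_matrix takes an explicit blocksize; for the 4x4 matrix above, a sketch of how the blocks come out:

```python
import numpy as np
from scipy.sparse import bsr_matrix

dense = np.array([[1, 2, 0, 0],
                  [3, 4, 0, 0],
                  [0, 0, 5, 6],
                  [0, 0, 7, 8]])
bsr = bsr_matrix(dense, blocksize=(2, 2))

print(bsr.data.shape)  # (2, 2, 2) -- two dense 2x2 blocks
print(bsr.indices)     # [0 1]     -- block column of each stored block
print(bsr.indptr)      # [0 1 2]   -- block row pointers
```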

Formats for Constructing Sparse Matrices

Besides COO, which is itself commonly used for construction, SciPy provides two formats designed for building sparse matrices incrementally:

1. Dictionary of Keys (DOK)

Stores non-zero elements as a dictionary.


{
    (0, 2): 3,
    (1, 0): 4,
    (2, 1): 5
}
    
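With SciPy's dok_matrix, this dictionary view maps directly onto item assignment, which makes incremental construction cheap. A minimal sketch (the 3x3 shape and integer dtype are just for this example):

```python
from scipy.sparse import dok_matrix

dok = dok_matrix((3, 3), dtype=int)
dok[0, 2] = 3   # O(1) insertion per element
dok[1, 0] = 4
dok[2, 1] = 5

print(dict(dok.items()))  # {(0, 2): 3, (1, 0): 4, (2, 1): 5}
csr = dok.tocsr()         # convert once construction is finished
```

Converting to CSR (or CSC) at the end gives you fast arithmetic on the result.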

2. List of Lists (LIL)

Stores each row as a list of column-value pairs.


[
    [ (2, 3) ],    # Row 0
    [ (0, 4) ],    # Row 1
    [ (1, 5) ]     # Row 2
]
    
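SciPy's lil_matrix exposes this layout through its rows and data attributes (per-row lists of column indices and values); a minimal sketch:

```python
from scipy.sparse import lil_matrix

lil = lil_matrix((3, 3), dtype=int)
lil[0, 2] = 3
lil[1, 0] = 4
lil[2, 1] = 5

print(lil.rows)  # column indices per row: [2], [0], [1]
print(lil.data)  # values per row:         [3], [4], [5]
```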

Code Example in Python

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix, csc_matrix

# Non-zero values and their positions for a 5x5 matrix
data = np.array([3, 4, 5])
rows = np.array([0, 1, 3])
cols = np.array([0, 2, 4])

# Create a COO sparse matrix
coo = coo_matrix((data, (rows, cols)), shape=(5, 5))
print("COO format:\n", coo)

# Convert to CSR format
csr = coo.tocsr()
print("\nCSR format:\n", csr)

# Convert to CSC format
csc = coo.tocsc()
print("\nCSC format:\n", csc)

The Output of the code above is:

COO format:
   (0, 0)	3
  (1, 2)	4
  (3, 4)	5

CSR format:
   (0, 0)	3
  (1, 2)	4
  (3, 4)	5

CSC format:
   (0, 0)	3
  (1, 2)	4
  (3, 4)	5

So you can see that we only store the non-zero values, which increases storage efficiency and decreases computation time.
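To put a rough number on the storage savings, here is a small sketch comparing a dense 1000x1000 array holding about 100 non-zero entries against its CSR form (the index-array sizes depend on dtype and platform, so treat the sparse figure as approximate):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Scatter ~100 ones into an otherwise all-zero 1000x1000 matrix
rng = np.random.default_rng(0)
dense = np.zeros((1000, 1000))
dense[rng.integers(0, 1000, 100), rng.integers(0, 1000, 100)] = 1.0

sparse = csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"Dense:  {dense_bytes} bytes")   # 8000000 bytes (1000 * 1000 * 8)
print(f"Sparse: {sparse_bytes} bytes")  # a few kilobytes
```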

Operations on Sparse Matrices

from scipy.sparse import csr_matrix

# Create two sparse matrices
sparse_matrix_1 = csr_matrix([[0, 0, 3], [1, 0, 0], [0, 2, 0]])
sparse_matrix_2 = csr_matrix([[0, 1, 0], [2, 0, 0], [0, 0, 4]])

# Perform matrix multiplication
result = sparse_matrix_1.dot(sparse_matrix_2)
print("Result of multiplication:\n", result)
    

The Output of the code above is:

Result of multiplication:
   (0, 2)	12
  (1, 1)	1
  (2, 0)	4

Here, instead of wasting resources on the zeroes, we work only with the meaningful values.

Conclusion

Sparse datasets are not only common but crucial in many fields, like recommendation systems and bioinformatics. Understanding how to efficiently store, manipulate, and build machine learning models on these datasets can make a huge difference in both performance and accuracy.
And remember: less is more when it comes to sparsity!
