Interactive online version: Open In Colab

Spectral Clustering

In this application, we’ll show how CoLA can be used to perform Spectral Clustering. This application allows us to showcase our Sparse operator.

In terms of the data, we will use the arXiv paper citation network of High Energy Physics. This is a directed graph showing the papers cited for each paper in the dataset. As in this case we will consider two papers to be related if at least one cited the other, then it will suffice to use an undirected graph.

The following command would download the data in case it is not on the data folder under the user’s HOME directory.

[1]:
![ ! -f "$HOME/data/cit-HepPh.txt" ] && wget -P ~/data https://www.andpotap.com/static/cit-HepPh.txt
[2]:
import os
import warnings

warnings.filterwarnings('ignore')
input_path = os.path.join(os.environ['HOME'], "data/cit-HepPh.txt")

To pre-process the data we will use the following code:

[3]:
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np


def load_graph_data(filepath, dtype, xnp, num_edges=-1):
    df = pd.read_csv(filepath, skiprows=4, delimiter="\t", header=None, names=["to", "from"])
    df = df[:num_edges]
    df2 = pd.read_csv(filepath, skiprows=4, delimiter="\t", header=None, names=["from", "to"])
    df2 = df2[:num_edges]
    df_undir = pd.concat((df, df2), axis=0)
    df_undir = df_undir.drop_duplicates()
    id_map = map_nodes_to_id(df_undir["from"].unique())
    N = len(id_map)
    print(f"Found {N:,d} nodes")
    for col in ["from", "to"]:
        df_undir[col] = df_undir[col].map(id_map)
    data = np.ones(shape=len(df_undir))
    row, col = np.array(df_undir["to"]), np.array(df_undir["from"])
    sparse_matrix = csr_matrix((data, (row, col)), shape=(N, N))
    out = transform_to_csr(sparse_matrix, xnp, dtype)
    data, col_ind, rowptr, shape = out
    return data, col_ind, rowptr, shape


def transform_to_csr(sparse_matrix, xnp, dtype):
    data = xnp.array(sparse_matrix.data, dtype=dtype, device=None)
    indices = xnp.array(sparse_matrix.indices, dtype=xnp.int64, device=None)
    indptr = xnp.array(sparse_matrix.indptr, dtype=xnp.int64, device=None)
    return data, indices, indptr, sparse_matrix.shape


def map_nodes_to_id(nodes):
    out = {}
    for idx in range(len(nodes)):
        out[int(nodes[idx])] = idx
    return out

The function load_graph_data creates the column indices and row pointers needed for the sparse CSR format. Now, we can load the data and create our Sparse adjacency matrix as follows

[4]:
import torch
import cola
from cola.backends import torch_fns as xnp

num_edges = -1
dtype = torch.float64
data, col_ind, rowptr, shape = load_graph_data(input_path, dtype, xnp, num_edges)
Ad = cola.ops.Sparse(data, col_ind, rowptr, shape)
Found 34,545 nodes

Given the adjacency matrix, we can now create the normalized Laplacian defined as \(L=I - D^{-1/2} A D^{-1/2}\), where \(D\) is the diagonal matrix that contains the degree of each node, \(A\) is the adjacency matrix and \(I\) is the identity. We can create the Laplacian operator really easily in CoLA as

[5]:
Deg = cola.ops.Diagonal(Ad @ torch.ones((Ad.shape[0], ), dtype=dtype))
Id = cola.ops.I_like(Deg)
Lap = Id - cola.inv(cola.sqrt(Deg)) @ Ad @ cola.inv(cola.sqrt(Deg))
Lap = cola.SelfAdjoint(Lap)
print(f"Size of the Laplacian: {Lap.shape}")
Size of the Laplacian: (34545, 34545)

Where I added the SelfAdjoint annotation at the end to ensure that CoLA dispatches algorithms for this type of symmetric operator. Spectral Clustering requires that we compute the eigenvectors of the smallest eigenvalues and use those eigenvectors as an embedding of our data. Once we do this, we can then use k-means to cluster points nearby as those points are related to cliques in the original graph.

[6]:
from sklearn.cluster import KMeans

eigvals, eigvecs = cola.eig(Lap, k=20, which="SM", alg=cola.Lanczos(max_iters=300))

WARNING:root:Non keyed randn used. To be deprecated soon.
[11]:
eigvals
[11]:
tensor([-3.1070e-15, -1.0551e-15,  1.0093e-15,  1.5716e-02,  2.3005e-02,
         3.2201e-02,  4.2626e-02,  4.2914e-02,  4.5332e-02,  4.5939e-02,
         5.5096e-02,  5.7384e-02,  6.2086e-02,  6.5325e-02,  6.7329e-02,
         6.9325e-02,  7.0176e-02,  7.0703e-02,  7.4464e-02,  7.5107e-02],
       dtype=torch.float64)

From the eigenvalues we can determine that there are 3 connected components of the citation graph. Let’s identify these components

[12]:
n_clusters = 3
x_emb = eigvecs.to_dense()[:,:n_clusters]
kmeans = KMeans(n_clusters=n_clusters).fit(x_emb)

And now let’s visualize these three components in the basis of these eigenvectors

[17]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(*x_emb.T, c=kmeans.labels_, cmap="tab10")
[17]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7f0e5dda6ec0>
../_images/notebooks_04_Spectral_Clustering_16_1.png