
GPU Support and Changing Operator Device

In this notebook we’ll see how to leverage GPU acceleration for our numerical linear algebra routines. We’ll start by showing how to place a LinearOperator on either a GPU or a CPU, and then we’ll apply what we just learned to a large-scale linear regression problem.

[1]:
import cola
import torch

N, D, B = 1_000, 40, 200
x = torch.randn(N, D)
ones = torch.ones((N, B))
A = cola.ops.Dense(x) @ cola.ops.Dense(x.T)  # lazy 1,000 x 1,000 operator for x @ x.T

print(A.device)

%timeit A @ ones
cpu
218 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

As we can see from the printed output, the operator \(A\) runs its computations on the CPU. To have \(A\) use a GPU instead, we call the to() method (similar to PyTorch) and move \(A\) to the device of our choice. Under the hood, CoLA finds all the parameters that define \(A\) and moves those arrays onto the target device.
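Conceptually, the mechanism looks something like the toy sketch below. This is a hypothetical illustration (the ToyDense class is ours, not part of CoLA): an operator knows which tensors define it and returns a copy with those tensors moved.

class ToyDense:
    """Toy stand-in for a dense LinearOperator: one defining tensor."""
    def __init__(self, M):
        self.M = M

    @property
    def device(self):
        return self.M.device

    def to(self, device):
        # Gather every array that defines the operator (here just M)
        # and rebuild the operator from the moved copies.
        return ToyDense(self.M.to(device))

    def __matmul__(self, v):
        return self.M @ v

With that picture in mind, the real call is a one-liner: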

[2]:
device_cpu = torch.device("cpu")
device_gpu = torch.device("cuda:0") if torch.cuda.is_available() else device_cpu  # defaults to CPU in case there is no GPU

A = A.to(device_gpu)
ones = ones.to(device_gpu)

print(A.device)

%timeit A @ ones
cuda:0
59.1 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Looking at the previous prints, we can see that \(A\) was successfully placed on the GPU and that the matrix-vector multiply (MVM) times dropped dramatically!
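One caveat when benchmarking: CUDA kernels launch asynchronously, so a careful wall-clock measurement should synchronize the device before reading the clock. A minimal sketch (the timed_mvm helper is ours, for illustration only):

import time

def timed_mvm(A, v, reps=100):
    if v.is_cuda:
        torch.cuda.synchronize()  # flush pending kernels before starting the clock
    start = time.perf_counter()
    for _ in range(reps):
        out = A @ v
    if v.is_cuda:
        torch.cuda.synchronize()  # wait for the queued MVMs to actually finish
    return (time.perf_counter() - start) / reps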

Let’s now see the impact of hardware acceleration on a large-scale linear regression problem. We are going to use the Song UCI dataset. Here is some code to download and preprocess the data:

[3]:
![ ! -f "$HOME/data/song.mat" ] && wget -P ~/data https://www.andpotap.com/static/song.mat
[4]:
import os
from math import floor
import numpy as np
from scipy.io import loadmat

def load_uci_data(dataset, data_dir="./", train_p=0.75, test_p=0.15):
    """Load a UCI .mat dataset; returns (train, test, valid) x/y splits in that order."""
    file_path = os.path.join(data_dir, dataset + '.mat')
    data = np.array(loadmat(file_path)['data'])
    X = data[:, :-1]
    y = data[:, -1]
    X = X - X.min(0)  # shift each feature to start at 0
    X = 2.0 * (X / X.max(0)) - 1.0  # rescale each feature to [-1, 1]
    y -= y.mean()
    y /= y.std()

    train_n = int(floor(train_p * X.shape[0]))
    valid_n = int(floor((1. - train_p - test_p) * X.shape[0]))

    split = split_dataset(X, y, train_n, valid_n)
    train_x, train_y, valid_x, valid_y, test_x, test_y = split

    return train_x, train_y, test_x, test_y, valid_x, valid_y


def split_dataset(x, y, train_n, valid_n):
    train_x = x[:train_n, :]
    train_y = y[:train_n]

    valid_x = x[train_n:train_n + valid_n, :]
    valid_y = y[train_n:train_n + valid_n]

    test_x = x[train_n + valid_n:, :]
    test_y = y[train_n + valid_n:]
    return train_x, train_y, valid_x, valid_y, test_x, test_y


def get_test_rmse(coeffs, test_x, test_y):
    err = (test_y - test_x @ coeffs)**2.
    return (err.mean())**0.5
[5]:
out = load_uci_data(dataset="song", data_dir=os.path.join(os.environ['HOME'], "data"))
train_x, train_y, test_x, test_y, *_ = out  # out is ordered (train, test, valid)
print(f"Dataset N={train_x.shape[0]:,d} | D={train_x.shape[1]:,d}")
Dataset N=386,508 | D=90

This UCI dataset is fairly large: the training split alone has 386,508 observations with 90 features. We will now run a linear regression on it, starting on the CPU.

[6]:
train_x = torch.tensor(train_x)
ones = torch.ones((train_x.shape[0], 1))
train_x = torch.cat((ones, train_x), dim=1)  # prepend a bias column
train_y = torch.tensor(train_y)

test_x = torch.tensor(test_x)
ones = torch.ones((test_x.shape[0], 1))
test_x = torch.cat((ones, test_x), dim=1)  # prepend a bias column
test_y = torch.tensor(test_y)

rhs = train_x.T @ train_y  # X^T y
XTX = cola.ops.Dense(train_x.T) @ cola.ops.Dense(train_x)  # lazy X^T X
mu = 1e-2
XTX += mu * cola.ops.I_like(XTX)  # add ridge regularization mu * I
XTX = cola.PSD(XTX)  # annotate as PSD so solvers like CG apply

print(XTX.shape)
print(XTX.device)
print(rhs.device)
(91, 91)
cpu
cpu

We lazily defined \(X^T X\) and added some regularization via \(\mu\); everything so far is placed on the CPU. We can therefore get the coefficients of the regression and evaluate the model by running the cell below.
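For reference, the system we are solving is the regularized normal equations of ridge regression, matching the code above:

\[(X^T X + \mu I)\,\beta = X^T y,\]

where XTX is the operator on the left and rhs \(= X^T y\) is the right-hand side.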

[7]:
coeffs = cola.solve(XTX, rhs, method="iterative")
test_rmse = get_test_rmse(coeffs, test_x, test_y)
print(f"Test RMSE: {test_rmse:1.3e}")
Test RMSE: 8.628e-01

We selected method="iterative", which uses CG (conjugate gradients) as the solver; running the same algorithm on both devices makes the timing comparison apples-to-apples.
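For intuition, here is a minimal textbook CG iteration written in plain PyTorch. This is an illustrative sketch of the algorithm, not CoLA’s actual implementation; it applies because \(X^T X + \mu I\) is symmetric positive definite.

def cg_solve(matvec, b, tol=1e-6, max_iters=1000):
    """Textbook conjugate gradients for A x = b with A symmetric PSD."""
    x = torch.zeros_like(b)
    r = b - matvec(x)  # residual
    p = r.clone()      # search direction
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

Calling cg_solve(lambda v: XTX @ v, rhs) would run the iteration on whichever device XTX and rhs live on, which is exactly why the same code path benefits from the GPU.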

[8]:
%timeit cola.solve(XTX, rhs, method="iterative")
2.6 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To move to the GPU, we again use the to() method, placing both the operator \(X^T X\) and the right-hand side on the GPU.

[9]:
XTX = XTX.to(device_gpu)
rhs = rhs.to(device_gpu)

coeffs = cola.solve(XTX, rhs, method="iterative")
test_rmse = get_test_rmse(coeffs, test_x.to(device_gpu), test_y.to(device_gpu))
print(f"Test RMSE: {test_rmse:1.3e}")
Test RMSE: 8.628e-01
[10]:
%timeit cola.solve(XTX, rhs, method="iterative")
645 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We thus see the significant gains from running on a GPU: the solve time drops from roughly 2.6 s to 645 ms, about a 4x speedup.

In summary, we saw that using GPU acceleration is as simple as calling the to() method on any LinearOperator we want!
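As a closing recap, the whole recipe fits in a few lines (a generic sketch, where A and b stand for any CoLA operator and a compatible right-hand side):

# Build on CPU, move the operator and its operands together, then solve.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
A, b = A.to(device), b.to(device)
x = cola.solve(A, b, method="iterative")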