GPU Support and Changing Operator Device
In this notebook we’ll see how to leverage GPU acceleration for our numerical linear algebra routines. We’ll start by showing how to place a LinearOperator
in either a GPU or a CPU and then we’ll apply what we just learned to a large-scale linear regression problem.
[1]:
import cola
import torch
N, D, B = 1_000, 40, 200
x = torch.randn(N, D)
ones = torch.ones((N, B))
A = cola.ops.Dense(x) @ cola.ops.Dense(x.T)
print(A.device)
%timeit A @ ones
cpu
218 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
As we can see from the prints, the operator \(A\) is using CPU for its computations. To now have \(A\) use a GPU, we use the to()
method (similar to PyTorch
) and place \(A\) to the device
of our choice. Under the hood, we find all the parameters that define \(A\) and place those arrays into the device
.
[2]:
device_cpu = torch.device("cpu")
device_gpu = torch.device("cuda:0") if torch.cuda.is_available() else device_cpu # defaults to CPU in case there is no GPU
A = A.to(device_gpu)
ones = ones.to(device_gpu)
print(A.device)
%timeit A @ ones
cuda:0
59.1 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Looking at the previous prints, we can see that \(A\) was successfully placed on a GPU and that the MVM times reduced dramatically!
Lets now see the impact of hardware acceleration on a large-scale linear regression problem. We are going to use the Song UCI dataset. Here is some code to download and to preprocess the data
[3]:
![ ! -f "$HOME/data/song.mat" ] && wget -P ~/data https://www.andpotap.com/static/song.mat
[4]:
import os
from math import floor
import numpy as np
from scipy.io import loadmat
def load_uci_data(dataset, data_dir="./", train_p=0.75, test_p=0.15):
file_path = os.path.join(data_dir, dataset + '.mat')
data = np.array(loadmat(file_path)['data'])
X = data[:, :-1]
y = data[:, -1]
X = X - X.min(0)[0]
X = 2.0 * (X / X.max(0)[0]) - 1.0
y -= y.mean()
y /= y.std()
train_n = int(floor(train_p * X.shape[0]))
valid_n = int(floor((1. - train_p - test_p) * X.shape[0]))
split = split_dataset(X, y, train_n, valid_n)
train_x, train_y, valid_x, valid_y, test_x, test_y = split
return train_x, train_y, test_x, test_y, valid_x, valid_y
def split_dataset(x, y, train_n, valid_n):
train_x = x[:train_n, :]
train_y = y[:train_n]
valid_x = x[train_n:train_n + valid_n, :]
valid_y = y[train_n:train_n + valid_n]
test_x = x[train_n + valid_n:, :]
test_y = y[train_n + valid_n:]
return train_x, train_y, valid_x, valid_y, test_x, test_y
def get_test_rmse(coeffs, test_x, test_y):
err = (test_y - test_x @ coeffs)**2.
return (err.mean())**0.5
[5]:
out = load_uci_data(dataset="song", data_dir=os.path.join(os.environ['HOME'], "data"))
train_x, train_y, *_, test_x, test_y = out
print(f"Dataset N={train_x.shape[0]:,d} | D={train_x.shape[1]:,d}")
Dataset N=386,508 | D=90
This UCI dataset is fairly large, consisting of 386.5K observations and 90 features. We will now run a linear regression on this dataset using CPU.
[6]:
train_x = torch.tensor(train_x)
ones = torch.ones((train_x.shape[0], 1))
train_x = torch.cat((ones, train_x), dim=1)
train_y = torch.tensor(train_y)
test_x = torch.tensor(test_x)
ones = torch.ones((test_x.shape[0], 1))
test_x = torch.cat((ones, test_x), dim=1)
test_y = torch.tensor(test_y)
rhs = train_x.T @ train_y
XTX = cola.ops.Dense(train_x.T) @ cola.ops.Dense(train_x)
mu = 1e-2
XTX += mu * cola.ops.I_like(XTX)
XTX = cola.PSD(XTX)
print(XTX.shape)
print(XTX.device)
print(rhs.device)
(91, 91)
cpu
cpu
We lazily defined \(X^T X\) and added some regularization via \(\mu\). Moreover, everything is placed on the CPU. We can therefore get the coefficients of the regression and evaluate the model by running
[7]:
coeffs = cola.solve(XTX, rhs, method="iterative")
test_rmse = get_test_rmse(coeffs, test_x, test_y)
print(f"Test RMSE: {test_rmse:1.3e}")
Test RMSE: 8.628e-01
We selected method=iterative
which uses CG as solver and therefore ensures that we are running the same algorithm but on different hardware.
[8]:
%timeit cola.solve(XTX, rhs, method="iterative")
2.6 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To move to a GPU, we again use the to()
method and pass both the operator \(X^T X\) to the GPU as well as the RHS.
[9]:
XTX = XTX.to(device_gpu)
rhs = rhs.to(device_gpu)
coeffs = cola.solve(XTX, rhs, method="iterative")
test_rmse = get_test_rmse(coeffs, test_x.to(device_gpu), test_y.to(device_gpu))
print(f"Test RMSE: {test_rmse:1.3e}")
Test RMSE: 8.628e-01
[10]:
%timeit cola.solve(XTX, rhs, method="iterative")
645 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We thus see the significant gains that we get from running on a GPU.
In summary, we saw that using GPU acceleration is as simple as using the to()
method on any LinearOperator
that we want!