{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7b527b7f-477e-436a-99c1-2cb5264a3e9e",
   "metadata": {},
   "source": [
    "# GPU Support and Changing Operator Device\n",
    "\n",
    "In this notebook we'll see how to leverage GPU acceleration for our numerical linear algebra routines. We'll start by showing how to place a `LinearOperator` in either a GPU or a CPU and then we'll apply what we just learned to a large-scale linear regression problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c89db44c-ab01-4cb9-adc5-0f4489abf19d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cpu\n",
      "218 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "import cola\n",
    "import torch\n",
    "\n",
    "N, D, B = 1_000, 40, 200\n",
    "x = torch.randn(N, D)\n",
    "ones = torch.ones((N, B))\n",
    "A = cola.ops.Dense(x) @ cola.ops.Dense(x.T)\n",
    "\n",
    "print(A.device)\n",
    "\n",
    "%timeit A @ ones"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea8617ed-6612-4fa3-b6db-07c885bae8f2",
   "metadata": {},
   "source": [
    "As we can see from the prints, the operator $A$ is using CPU for its computations. To now have $A$ use a GPU, we use the `to()` method (similar to `PyTorch`) and place $A$ to the `device` of our choice. Under the hood, we find all the parameters that define $A$ and place those arrays into the `device`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "0ceb4425-7051-4106-b8bc-e14c441adf6e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cuda:0\n",
      "59.1 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "device_cpu = torch.device(\"cpu\")\n",
    "device_gpu = torch.device(\"cuda:0\") if torch.cuda.is_available() else device_cpu  # defaults to CPU in case there is no GPU\n",
    "\n",
    "A = A.to(device_gpu)\n",
    "ones = ones.to(device_gpu)\n",
    "\n",
    "print(A.device)\n",
    "\n",
    "%timeit A @ ones"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d131a133-7f66-44fe-b21c-eae8ca0b283f",
   "metadata": {},
   "source": [
    "Looking at the previous prints, we can see that $A$ was successfully placed on a GPU and that the MVM times reduced dramatically!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e952e49-3db9-45e8-bb58-c38f2f0f6a07",
   "metadata": {},
   "source": [
    "Lets now see the impact of hardware acceleration on a large-scale linear regression problem. We are going to use the Song UCI dataset. Here is some code to download and to preprocess the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "be6b0b99-570b-40c3-a577-c55a71a72fdb",
   "metadata": {},
   "outputs": [],
   "source": [
    "![ ! -f \"$HOME/data/song.mat\" ] && wget -P ~/data https://www.andpotap.com/static/song.mat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8671e6aa-c98f-4083-9dd1-c486ee0ee29c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from math import floor\n",
    "import numpy as np\n",
    "from scipy.io import loadmat\n",
    "\n",
    "def load_uci_data(dataset, data_dir=\"./\", train_p=0.75, test_p=0.15):\n",
    "    file_path = os.path.join(data_dir, dataset + '.mat')\n",
    "    data = np.array(loadmat(file_path)['data'])\n",
    "    X = data[:, :-1]\n",
    "    y = data[:, -1]\n",
    "    X = X - X.min(0)[0]\n",
    "    X = 2.0 * (X / X.max(0)[0]) - 1.0\n",
    "    y -= y.mean()\n",
    "    y /= y.std()\n",
    "\n",
    "    train_n = int(floor(train_p * X.shape[0]))\n",
    "    valid_n = int(floor((1. - train_p - test_p) * X.shape[0]))\n",
    "\n",
    "    split = split_dataset(X, y, train_n, valid_n)\n",
    "    train_x, train_y, valid_x, valid_y, test_x, test_y = split\n",
    "\n",
    "    return train_x, train_y, test_x, test_y, valid_x, valid_y\n",
    "\n",
    "\n",
    "def split_dataset(x, y, train_n, valid_n):\n",
    "    train_x = x[:train_n, :]\n",
    "    train_y = y[:train_n]\n",
    "\n",
    "    valid_x = x[train_n:train_n + valid_n, :]\n",
    "    valid_y = y[train_n:train_n + valid_n]\n",
    "\n",
    "    test_x = x[train_n + valid_n:, :]\n",
    "    test_y = y[train_n + valid_n:]\n",
    "    return train_x, train_y, valid_x, valid_y, test_x, test_y\n",
    "\n",
    "\n",
    "def get_test_rmse(coeffs, test_x, test_y):\n",
    "    err = (test_y - test_x @ coeffs)**2.\n",
    "    return (err.mean())**0.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fd565821-f1e5-4305-a9f3-fa05d1e54120",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset N=386,508 | D=90\n"
     ]
    }
   ],
   "source": [
    "out = load_uci_data(dataset=\"song\", data_dir=os.path.join(os.environ['HOME'], \"data\"))\n",
    "train_x, train_y, *_, test_x, test_y = out\n",
    "print(f\"Dataset N={train_x.shape[0]:,d} | D={train_x.shape[1]:,d}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2fe527b-f088-4cae-8dde-99c2d73b5290",
   "metadata": {},
   "source": [
    "This UCI dataset is fairly large, consisting of 386.5K observations and 90 features. We will now run a linear regression on this dataset using CPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e69ea2eb-7f7b-446c-9c03-76de7d614aa0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(91, 91)\n",
      "cpu\n",
      "cpu\n"
     ]
    }
   ],
   "source": [
    "train_x = torch.tensor(train_x)\n",
    "ones = torch.ones((train_x.shape[0], 1))\n",
    "train_x = torch.cat((ones, train_x), dim=1)\n",
    "train_y = torch.tensor(train_y)\n",
    "\n",
    "test_x = torch.tensor(test_x)\n",
    "ones = torch.ones((test_x.shape[0], 1))\n",
    "test_x = torch.cat((ones, test_x), dim=1)\n",
    "test_y = torch.tensor(test_y)\n",
    "\n",
    "rhs = train_x.T @ train_y\n",
    "XTX = cola.ops.Dense(train_x.T) @ cola.ops.Dense(train_x)\n",
    "mu = 1e-2\n",
    "XTX += mu * cola.ops.I_like(XTX)\n",
    "XTX = cola.PSD(XTX)\n",
    "\n",
    "print(XTX.shape)\n",
    "print(XTX.device)\n",
    "print(rhs.device)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d4e29e9-293d-448e-b8f6-ad21328a82cb",
   "metadata": {},
   "source": [
    "We lazily defined $X^T X$ and added some regularization via $\\mu$. Moreover, everything is placed on the CPU. We can therefore get the coefficients of the regression and evaluate the model by running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1d0c9198-0df1-4ade-bc10-6c2271c228f7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test RMSE: 8.628e-01\n"
     ]
    }
   ],
   "source": [
    "coeffs = cola.solve(XTX, rhs, method=\"iterative\")\n",
    "test_rmse = get_test_rmse(coeffs, test_x, test_y)\n",
    "print(f\"Test RMSE: {test_rmse:1.3e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "437eae88-9600-418a-88db-94e4cbcd70bc",
   "metadata": {},
   "source": [
    "We selected `method=iterative` which uses CG as solver and therefore ensures that we are running the same algorithm but on different hardware."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6b79538f-1fca-4d81-b2b7-cac857cb1091",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.6 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit cola.solve(XTX, rhs, method=\"iterative\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e9d1b71-b482-45b8-b4e5-c35d7614bf45",
   "metadata": {},
   "source": [
    "To move to a GPU, we again use the `to()` method and pass both the operator $X^T X$ to the GPU as well as the RHS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "aa4c895e-3074-4168-bcaa-8d47d85fe981",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test RMSE: 8.628e-01\n"
     ]
    }
   ],
   "source": [
    "XTX = XTX.to(device_gpu)\n",
    "rhs = rhs.to(device_gpu)\n",
    "\n",
    "coeffs = cola.solve(XTX, rhs, method=\"iterative\")\n",
    "test_rmse = get_test_rmse(coeffs, test_x.to(device_gpu), test_y.to(device_gpu))\n",
    "print(f\"Test RMSE: {test_rmse:1.3e}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "22aa1ebd-cc58-4f79-8e4a-ce8a00252f81",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "645 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit cola.solve(XTX, rhs, method=\"iterative\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21469dd6-3962-429a-a796-f9f6ea2ba044",
   "metadata": {},
   "source": [
    "We thus see the significant gains that we get from running on a GPU.\n",
    "\n",
    "In summary, we saw that using GPU acceleration is as simple as using the `to()` method on any `LinearOperator` that we want!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}