dlsyscourse · j1mk1m · Sep 3, 2025
diff --git a/hw0.ipynb b/hw0.ipynb
@@ -328,7 +328,7 @@
    "source": [
     "## Question 3: Softmax loss\n",
     "\n",
-    "Implement the softmax (a.k.a. cross-entropy) loss as defined in `softmax_loss()` function in `src/simple_ml.py`.  Recall (hopefully this is review, but we'll also cover it in lecture on 9/1), that for a multi-class output that can take on values $y \\in \\{1,\\ldots,k\\}$, the softmax loss takes as input a vector of logits $z \\in \\mathbb{R}^k$, the true class $y \\in \\{1,\\ldots,k\\}$ returns a loss defined by\n",
+    "Implement the softmax (a.k.a. cross-entropy) loss as defined in the `softmax_loss()` function in `src/simple_ml.py`.  Recall (hopefully this is review, but we'll also cover it in Lecture 2) that for a multi-class output that can take on values $y \\in \\{1,\\ldots,k\\}$, the softmax loss takes as input a vector of logits $z \\in \\mathbb{R}^k$ and the true class $y \\in \\{1,\\ldots,k\\}$, and returns a loss defined by\n",
     "\\begin{equation}\n",
     "\\ell_{\\mathrm{softmax}}(z, y) = \\log\\sum_{i=1}^k \\exp z_i - z_y.\n",
     "\\end{equation}\n",
@@ -369,14 +369,13 @@
    "source": [
     "## Question 4: Stochastic gradient descent for softmax regression\n",
     "\n",
-    "In this question you will implement stochastic gradient descent (SGD) for (linear) softmax regression.  In other words, as discussed in lecture on 9/1, we will consider a hypothesis function that makes $n$-dimensional inputs to $k$-dimensional logits via the function\n",
+    "In this question you will implement stochastic gradient descent (SGD) for (linear) softmax regression.  In other words, as discussed in Lecture 2, we will consider a hypothesis function that maps $n$-dimensional inputs to $k$-dimensional logits via the function\n",
     "\\begin{equation}\n",
     "h(x) = \\Theta^T x\n",
     "\\end{equation}\n",
     "where $x \\in \\mathbb{R}^n$ is the input, and $\\Theta \\in \\mathbb{R}^{n \\times k}$ are the model parameters.  Given a dataset $\\{(x^{(i)} \\in \\mathbb{R}^n, y^{(i)} \\in \\{1,\\ldots,k\\})\\}$, for $i=1,\\ldots,m$, the optimization problem associated with softmax regression is thus given by\n",
     "\\begin{equation}\n",
-    "\\DeclareMathOperator*{\\minimize}{minimize}\n",
-    "\\minimize_{\\Theta} \\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(\\Theta^T x^{(i)}, y^{(i)}).\n",
+    "\\operatorname*{minimize}_{\\Theta} \\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(\\Theta^T x^{(i)}, y^{(i)}).\n",
     "\\end{equation}\n",
     "\n",
     "Recall from class that the gradient of the linear softmax objective is given by\n",
@@ -385,8 +384,7 @@
     "\\end{equation}\n",
     "where\n",
     "\\begin{equation}\n",
-    "\\DeclareMathOperator*{\\normalize}{normalize}\n",
-    "z = \\frac{\\exp(\\Theta^T x)}{1^T \\exp(\\Theta^T x)} \\equiv \\normalize(\\exp(\\Theta^T x))\n",
+    "z = \\frac{\\exp(\\Theta^T x)}{1^T \\exp(\\Theta^T x)} \\equiv \\operatorname*{normalize}(\\exp(\\Theta^T x))\n",
     "\\end{equation}\n",
     "(i.e., $z$ is just the normalized softmax probabilities), and where $e_y$ denotes the $y$th unit basis, i.e., a vector of all zeros with a one in the $y$th position.\n",
     "\n",
@@ -396,7 +394,7 @@
     "\\end{equation}\n",
     "where\n",
     "\\begin{equation}\n",
-    "Z = \\normalize(\\exp(X \\Theta)) \\quad \\mbox{(normalization applied row-wise)}\n",
+    "Z = \\operatorname*{normalize}(\\exp(X \\Theta)) \\quad \\text{(normalization applied row-wise)}\n",
     "\\end{equation}\n",
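-    "denotes the matrix of logits, and $I_y \\in \\mathbb{R}^{m \\times k}$ represents a concatenation of one-hot bases for the labels in $y$.\n",
+    "denotes the matrix of softmax probabilities, and $I_y \\in \\mathbb{R}^{m \\times k}$ represents a concatenation of one-hot bases for the labels in $y$.\n",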
     "\n",
@@ -487,18 +485,18 @@
     "\\end{equation}\n",
     "where $W_1 \\in \\mathbb{R}^{n \\times d}$ and $W_2 \\in \\mathbb{R}^{d \\times k}$ represent the weights of the network (which has a $d$-dimensional hidden unit), and where $z \\in \\mathbb{R}^k$ represents the logits output by the network.  We again use the softmax / cross-entropy loss, meaning that we want to solve the optimization problem\n",
     "\\begin{equation}\n",
-    "\\minimize_{W_1, W_2} \\;\\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(W_2^T \\mathrm{ReLU}(W_1^T x^{(i)}), y^{(i)}).\n",
+    "\\operatorname*{minimize}_{W_1, W_2} \\;\\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(W_2^T \\mathrm{ReLU}(W_1^T x^{(i)}), y^{(i)}).\n",
     "\\end{equation}\n",
     "Or alternatively, overloading the notation to describe the batch form with matrix $X \\in \\mathbb{R}^{m \\times n}$, this can also be written \n",
     "\\begin{equation}\n",
-    "\\minimize_{W_1, W_2} \\;\\; \\ell_{\\mathrm{softmax}}(\\mathrm{ReLU}(X W_1) W_2, y).\n",
+    "\\operatorname*{minimize}_{W_1, W_2} \\;\\; \\ell_{\\mathrm{softmax}}(\\mathrm{ReLU}(X W_1) W_2, y).\n",
     "\\end{equation}\n",
     "\n",
-    "Using the chain rule, we can derive the backpropagation updates for this network (we'll briefly cover these in class, on 9/8, but also provide the final form here for ease of implementation).  Specifically, let\n",
+    "Using the chain rule, we can derive the backpropagation updates for this network (we'll briefly cover these in class in Lecture 3, but also provide the final form here for ease of implementation).  Specifically, let\n",
     "\\begin{equation}\n",
     "\\begin{split}\n",
     "Z_1 \\in \\mathbb{R}^{m \\times d} & = \\mathrm{ReLU}(X W_1) \\\\\n",
-    "G_2 \\in \\mathbb{R}^{m \\times k} & = \\normalize(\\exp(Z_1 W_2)) - I_y \\\\\n",
+    "G_2 \\in \\mathbb{R}^{m \\times k} & = \\operatorname*{normalize}(\\exp(Z_1 W_2)) - I_y \\\\\n",
     "G_1 \\in \\mathbb{R}^{m \\times d} & = \\mathrm{1}\\{Z_1 > 0\\} \\circ (G_2 W_2^T)\n",
     "\\end{split}\n",
     "\\end{equation}\n",
@@ -510,7 +508,7 @@
     "\\end{split}\n",
     "\\end{equation}\n",
     "\n",
-    "**Note:** If the details of these precise equations seem a bit cryptic to you (prior to the 9/8 lecture), don't worry too much.  These _are_ just the standard backpropagation equations for a two-layer ReLU network: the $Z_1$ term just computes the \"forward\" pass while the $G_2$ and $G_1$ terms denote the backward pass.  But the precise form of the updates can vary depending upon the notation you've used for neural networks, the precise ways you formulate the losses, if you've derived these previously in matrix form, etc.  If the notation seems like it might be familiar from when you've seen deep networks in the past, and makes more sense after the 9/8 lecture, that is more than sufficient in terms of background (after all, the whole _point_ of deep learning systems, to some extent, is that we don't need to bother with these manual calculations).  But if these entire concepts are _completely_ foreign to you, then it may be better to take a separate course on ML and neural networks prior to this course, or at least be aware that there will be substantial catch-up work to do for the course.\n",
+    "**Note:** If the details of these precise equations seem a bit cryptic to you (prior to Lecture 3), don't worry too much.  These _are_ just the standard backpropagation equations for a two-layer ReLU network: the $Z_1$ term just computes the \"forward\" pass while the $G_2$ and $G_1$ terms denote the backward pass.  But the precise form of the updates can vary depending upon the notation you've used for neural networks, the precise ways you formulate the losses, if you've derived these previously in matrix form, etc.  If the notation seems like it might be familiar from when you've seen deep networks in the past, and makes more sense after Lecture 3, that is more than sufficient in terms of background (after all, the whole _point_ of deep learning systems, to some extent, is that we don't need to bother with these manual calculations).  But if these entire concepts are _completely_ foreign to you, then it may be better to take a separate course on ML and neural networks prior to this course, or at least be aware that there will be substantial catch-up work to do for the course.\n",
     "\n",
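-    "Using these gradients, now write the `nn_epoch()` function in the `src/simple_ml.py` file.  As with the previous question, your solution should modify the `W1` and `W2` arrays in place.  After implementing the function, run the following test.  Be sure to use matrix operations as indicated by the expresssions above to implement the function: this will be _much_ faster, and more efficient, than attempting to use loops (and it requires far less code)."
+    "Using these gradients, now write the `nn_epoch()` function in the `src/simple_ml.py` file.  As with the previous question, your solution should modify the `W1` and `W2` arrays in place.  After implementing the function, run the following test.  Be sure to use matrix operations as indicated by the expressions above to implement the function: this will be _much_ faster, and more efficient, than attempting to use loops (and it requires far less code)."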
    ]
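
As a quick sanity check on the notation touched by this diff, here is a minimal NumPy sketch of the row-wise `normalize`, the one-hot matrix $I_y$, and the softmax loss. The helper names `normalize_rows` and `batch_softmax_loss` are hypothetical and are not taken from `src/simple_ml.py`, and the labels are assumed to be 0-based (in $\{0,\ldots,k-1\}$) rather than the $\{1,\ldots,k\}$ used in the math.

```python
import numpy as np


def normalize_rows(A):
    # Row-wise "normalize": divide every row by its own sum.
    return A / A.sum(axis=1, keepdims=True)


def batch_softmax_loss(logits, y):
    # Mean over rows of log(sum_i exp z_i) - z_y.
    # Assumption: y holds 0-based integer labels.
    log_sum_exp = np.log(np.exp(logits).sum(axis=1))
    z_y = logits[np.arange(logits.shape[0]), y]
    return np.mean(log_sum_exp - z_y)


rng = np.random.default_rng(0)
m, n, k = 5, 4, 3                      # examples, input dim, classes
X = rng.normal(size=(m, n))
Theta = rng.normal(size=(n, k))
y = rng.integers(0, k, size=m)         # 0-based labels (assumption)

Z = normalize_rows(np.exp(X @ Theta))  # softmax probabilities, shape (m, k)
I_y = np.eye(k)[y]                     # one-hot rows for the labels, shape (m, k)

loss = batch_softmax_loss(X @ Theta, y)
# The same loss, written as the mean negative log-probability of the true class.
loss_from_probs = -np.mean(np.log(Z[np.arange(m), y]))
```

Each row of `Z` sums to one, and the two loss values agree up to floating-point error, which is why the loss goes by both "softmax" and "cross-entropy". For large logits a real implementation would usually subtract the row maximum before exponentiating; that step is left out here to match the equations as written.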