22 changes: 10 additions & 12 deletions hw0.ipynb
@@ -328,7 +328,7 @@
"source": [
"## Question 3: Softmax loss\n",
"\n",
"Implement the softmax (a.k.a. cross-entropy) loss as defined in `softmax_loss()` function in `src/simple_ml.py`. Recall (hopefully this is review, but we'll also cover it in lecture on 9/1), that for a multi-class output that can take on values $y \\in \\{1,\\ldots,k\\}$, the softmax loss takes as input a vector of logits $z \\in \\mathbb{R}^k$, the true class $y \\in \\{1,\\ldots,k\\}$ returns a loss defined by\n",
"Implement the softmax (a.k.a. cross-entropy) loss as defined in `softmax_loss()` function in `src/simple_ml.py`. Recall (hopefully this is review, but we'll also cover it in Lecture 2), that for a multi-class output that can take on values $y \\in \\{1,\\ldots,k\\}$, the softmax loss takes as input a vector of logits $z \\in \\mathbb{R}^k$, the true class $y \\in \\{1,\\ldots,k\\}$ returns a loss defined by\n",
"\\begin{equation}\n",
"\\ell_{\\mathrm{softmax}}(z, y) = \\log\\sum_{i=1}^k \\exp z_i - z_y.\n",
"\\end{equation}\n",
@@ -369,14 +369,13 @@
"source": [
"## Question 4: Stochastic gradient descent for softmax regression\n",
"\n",
"In this question you will implement stochastic gradient descent (SGD) for (linear) softmax regression. In other words, as discussed in lecture on 9/1, we will consider a hypothesis function that makes $n$-dimensional inputs to $k$-dimensional logits via the function\n",
"In this question you will implement stochastic gradient descent (SGD) for (linear) softmax regression. In other words, as discussed in Lecture 2, we will consider a hypothesis function that makes $n$-dimensional inputs to $k$-dimensional logits via the function\n",
"\\begin{equation}\n",
"h(x) = \\Theta^T x\n",
"\\end{equation}\n",
"where $x \\in \\mathbb{R}^n$ is the input, and $\\Theta \\in \\mathbb{R}^{n \\times k}$ are the model parameters. Given a dataset $\\{(x^{(i)} \\in \\mathbb{R}^n, y^{(i)} \\in \\{1,\\ldots,k\\})\\}$, for $i=1,\\ldots,m$, the optimization problem associated with softmax regression is thus given by\n",
"\\begin{equation}\n",
"\\DeclareMathOperator*{\\minimize}{minimize}\n",
"\\minimize_{\\Theta} \\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(\\Theta^T x^{(i)}, y^{(i)}).\n",
"\\operatorname*{minimize}_{\\Theta} \\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(\\Theta^T x^{(i)}, y^{(i)}).\n",
"\\end{equation}\n",
"\n",
"Recall from class that the gradient of the linear softmax objective is given by\n",
@@ -385,8 +384,7 @@
"\\end{equation}\n",
"where\n",
"\\begin{equation}\n",
"\\DeclareMathOperator*{\\normalize}{normalize}\n",
"z = \\frac{\\exp(\\Theta^T x)}{1^T \\exp(\\Theta^T x)} \\equiv \\normalize(\\exp(\\Theta^T x))\n",
"z = \\frac{\\exp(\\Theta^T x)}{1^T \\exp(\\Theta^T x)} \\equiv \\operatorname*{normalize}(\\exp(\\Theta^T x))\n",
"\\end{equation}\n",
"(i.e., $z$ is just the normalized softmax probabilities), and where $e_y$ denotes the $y$th unit basis, i.e., a vector of all zeros with a one in the $y$th position.\n",
"\n",
@@ -396,7 +394,7 @@
"\\end{equation}\n",
"where\n",
"\\begin{equation}\n",
"Z = \\normalize(\\exp(X \\Theta)) \\quad \\mbox{(normalization applied row-wise)}\n",
"Z = \\operatorname*{normalize}(\\exp(X \\Theta)) \\quad \\text{(normalization applied row-wise)}\n",
"\\end{equation}\n",
"denotes the matrix of logits, and $I_y \\in \\mathbb{R}^{m \\times k}$ represents a concatenation of one-hot bases for the labels in $y$.\n",
"\n",
@@ -487,18 +485,18 @@
"\\end{equation}\n",
"where $W_1 \\in \\mathbb{R}^{n \\times d}$ and $W_2 \\in \\mathbb{R}^{d \\times k}$ represent the weights of the network (which has a $d$-dimensional hidden unit), and where $z \\in \\mathbb{R}^k$ represents the logits output by the network. We again use the softmax / cross-entropy loss, meaning that we want to solve the optimization problem\n",
"\\begin{equation}\n",
"\\minimize_{W_1, W_2} \\;\\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(W_2^T \\mathrm{ReLU}(W_1^T x^{(i)}), y^{(i)}).\n",
"\\operatorname*{minimize}_{W_1, W_2} \\;\\; \\frac{1}{m} \\sum_{i=1}^m \\ell_{\\mathrm{softmax}}(W_2^T \\mathrm{ReLU}(W_1^T x^{(i)}), y^{(i)}).\n",
"\\end{equation}\n",
"Or alternatively, overloading the notation to describe the batch form with matrix $X \\in \\mathbb{R}^{m \\times n}$, this can also be written \n",
"\\begin{equation}\n",
"\\minimize_{W_1, W_2} \\;\\; \\ell_{\\mathrm{softmax}}(\\mathrm{ReLU}(X W_1) W_2, y).\n",
"\\operatorname*{minimize}_{W_1, W_2} \\;\\; \\ell_{\\mathrm{softmax}}(\\mathrm{ReLU}(X W_1) W_2, y).\n",
"\\end{equation}\n",
"\n",
"Using the chain rule, we can derive the backpropagation updates for this network (we'll briefly cover these in class, on 9/8, but also provide the final form here for ease of implementation). Specifically, let\n",
"Using the chain rule, we can derive the backpropagation updates for this network (we'll briefly cover these in class, Lecture 3, but also provide the final form here for ease of implementation). Specifically, let\n",
"\\begin{equation}\n",
"\\begin{split}\n",
"Z_1 \\in \\mathbb{R}^{m \\times d} & = \\mathrm{ReLU}(X W_1) \\\\\n",
"G_2 \\in \\mathbb{R}^{m \\times k} & = \\normalize(\\exp(Z_1 W_2)) - I_y \\\\\n",
"G_2 \\in \\mathbb{R}^{m \\times k} & = \\operatorname*{normalize}(\\exp(Z_1 W_2)) - I_y \\\\\n",
"G_1 \\in \\mathbb{R}^{m \\times d} & = \\mathrm{1}\\{Z_1 > 0\\} \\circ (G_2 W_2^T)\n",
"\\end{split}\n",
"\\end{equation}\n",
@@ -510,7 +508,7 @@
"\\end{split}\n",
"\\end{equation}\n",
"\n",
"**Note:** If the details of these precise equations seem a bit cryptic to you (prior to the 9/8 lecture), don't worry too much. These _are_ just the standard backpropagation equations for a two-layer ReLU network: the $Z_1$ term just computes the \"forward\" pass while the $G_2$ and $G_1$ terms denote the backward pass. But the precise form of the updates can vary depending upon the notation you've used for neural networks, the precise ways you formulate the losses, if you've derived these previously in matrix form, etc. If the notation seems like it might be familiar from when you've seen deep networks in the past, and makes more sense after the 9/8 lecture, that is more than sufficient in terms of background (after all, the whole _point_ of deep learning systems, to some extent, is that we don't need to bother with these manual calculations). But if these entire concepts are _completely_ foreign to you, then it may be better to take a separate course on ML and neural networks prior to this course, or at least be aware that there will be substantial catch-up work to do for the course.\n",
"**Note:** If the details of these precise equations seem a bit cryptic to you (prior to Lecture 3), don't worry too much. These _are_ just the standard backpropagation equations for a two-layer ReLU network: the $Z_1$ term just computes the \"forward\" pass while the $G_2$ and $G_1$ terms denote the backward pass. But the precise form of the updates can vary depending upon the notation you've used for neural networks, the precise ways you formulate the losses, if you've derived these previously in matrix form, etc. If the notation seems like it might be familiar from when you've seen deep networks in the past, and makes more sense after Lecture 3, that is more than sufficient in terms of background (after all, the whole _point_ of deep learning systems, to some extent, is that we don't need to bother with these manual calculations). But if these entire concepts are _completely_ foreign to you, then it may be better to take a separate course on ML and neural networks prior to this course, or at least be aware that there will be substantial catch-up work to do for the course.\n",
"\n",
"Using these gradients, now write the `nn_epoch()` function in the `src/simple_ml.py` file. As with the previous question, your solution should modify the `W1` and `W2` arrays in place. After implementing the function, run the following test. Be sure to use matrix operations as indicated by the expresssions above to implement the function: this will be _much_ faster, and more efficient, than attempting to use loops (and it requires far less code)."
]
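The forward and backward quantities $Z_1$, $G_2$ and $G_1$ translate almost line for line into matrix operations. The sketch below shows one minibatch step under a few assumptions: the weight gradients, which fall outside the lines shown in this diff, are taken to be the standard $\frac{1}{m} X^T G_1$ and $\frac{1}{m} Z_1^T G_2$; labels are 0-indexed; and the name `nn_step_sketch` and its `lr` parameter are hypothetical rather than the signature of `nn_epoch()`.

```python
import numpy as np

def nn_step_sketch(X, y, W1, W2, lr=0.1):
    """One gradient step for a two-layer ReLU network, updating W1 and W2 in place.

    X: (m, n); y: (m,) labels in 0..k-1 (assumed 0-indexed); W1: (n, d); W2: (d, k).
    """
    m = X.shape[0]
    # Forward pass: Z1 = ReLU(X W1).
    Z1 = np.maximum(X @ W1, 0.0)
    logits = Z1 @ W2
    # G2 = normalize(exp(Z1 W2)) - I_y, with a max shift for numerical stability.
    expd = np.exp(logits - logits.max(axis=1, keepdims=True))
    G2 = expd / expd.sum(axis=1, keepdims=True)
    G2[np.arange(m), y] -= 1.0
    # G1 = 1{Z1 > 0} * (G2 W2^T), computed before W2 is modified.
    G1 = (Z1 > 0) * (G2 @ W2.T)
    # Assumed standard gradients (1/m) X^T G1 and (1/m) Z1^T G2, applied in place.
    W1 -= lr * (X.T @ G1) / m
    W2 -= lr * (Z1.T @ G2) / m
```

Computing `G1` before touching `W2` matters here: both gradients should be evaluated at the same parameter values, so the in-place updates come last.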