If the cartpole is already all the way at the right, we can't really select that action. So would it make sense to disallow that from either the random case (by sampling again) or the network case (by choosing the next highest Q value that the network predicts)?