Cookieblues | Machine learning, notes 3b: Generative classifiers | 2021-04-01 | https://cookieblues.github.io//guides/2021/04/01/bsmalea-notes-3b<p>As mentioned in <a href="https://cookieblues.github.io//guides/2021/03/30/bsmalea-notes-3a/">notes 3a</a>, generative classifiers model the <strong>joint probability distribution</strong> of the input and target variables $\text{Pr}(\mathbf{x}, t)$. This means we end up with a distribution that can generate (hence the name) new input variables with their respective targets, i.e., we can sample new data points from the joint probability distribution, and we will see how to do that in this post.</p> <p>The models we will look at in this post are called <strong>Gaussian Discriminant Analysis (GDA)</strong> models. Now is when the nomenclature starts getting tricky! Note that a Gaussian <em>Discriminant</em> Analysis model is a <em>generative</em> model! It is <em>not</em> a discriminative model despite its name.</p> <h2 id="quadratic-discriminant-analysis-qda">Quadratic Discriminant Analysis (QDA)</h2> <h3 id="setup-and-objective">Setup and objective</h3> <p>Given a training dataset of $N$ input variables $\mathbf{x} \in \mathbb{R}^D$ with corresponding target variables $t \in \mathcal{C}_c$ where $c \in \{1, \dots, K\}$, GDA models assume that the <strong>class-conditional densities</strong> are normally distributed</p> $\text{Pr}(\mathbf{x} \mid t = c, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c) = \mathcal{N} \left( \mathbf{x} \mid \boldsymbol{\mu}_c, \mathbf{\Sigma}_c \right),$ <p>where $\boldsymbol{\mu}_c$ is the <strong>class-specific mean vector</strong> and $\mathbf{\Sigma}_c$ is the <strong>class-specific covariance matrix</strong>. 
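As a quick sanity check on what the class-conditional density is, the sketch below evaluates $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_c, \mathbf{\Sigma}_c)$ directly with NumPy; the function name and the parameter values are purely illustrative, not part of the derivation.

```python
import numpy as np

def gaussian_density(x, mu, cov):
    # Evaluate the multivariate normal density N(x | mu, cov) for a D-dimensional input.
    D = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

# Illustrative class parameters for a 2D problem.
mu_c = np.array([0.0, 0.0])
cov_c = np.eye(2)
density = gaussian_density(np.array([0.0, 0.0]), mu_c, cov_c)  # 1 / (2*pi) at the mean
```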
Using Bayes’ theorem, we can now calculate the class posterior</p> $\overbrace{\text{Pr}(t=c | \mathbf{x}, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c)}^{\text{class posterior}} = \frac{ \overbrace{\text{Pr}(\mathbf{x} \mid t = c, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c)}^{\text{class-conditional density}} \, \overbrace{\text{Pr}(t=c)}^{\text{class prior}} }{ \sum_{k=1}^K \text{Pr}(\mathbf{x} \mid t = k, \boldsymbol{\mu}_k, \mathbf{\Sigma}_k) \, \text{Pr}(t=k) }.$ <p>We will then classify $\mathbf{x}$ into class</p> $\hat{h} (\mathbf{x}) = \underset{c}{\text{argmax }} \text{Pr}(t=c | \mathbf{x}, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c).$ <h3 id="derivation-and-training">Derivation and training</h3> <p>For each input variable $\mathbf{x}_{n}$, we define $t_{nk} = 1$ if $\mathbf{x}_n$ belongs to class $\mathcal{C}_k$, otherwise $t_{nk}=0$, i.e., we end up having $K$ binary indicator variables for each input variable. Furthermore, let $\textbf{\textsf{t}} = \left( t_1, \dots, t_N \right)^\intercal$ denote all our target variables, and $\pi_c = \text{Pr}(t=c)$ the prior for class $c$. Assuming the data points are drawn independently, the likelihood function is given by</p> \begin{aligned} \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \mathbf{\Sigma}_1, \dots, \mathbf{\Sigma}_K \right) &amp;= \prod_{n=1}^N \prod_{k=1}^K \text{Pr} \left( t_n=k \right)^{t_{nk}} \text{Pr}\left( \mathbf{x}_n \mid t_n = k, \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right)^{t_{nk}} \\ &amp;= \prod_{n=1}^N \prod_{k=1}^K \pi_k^{t_{nk}} \mathcal{N} \left( \mathbf{x}_n \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right)^{t_{nk}}. \end{aligned} <p>To simplify notation, let $\boldsymbol{\theta}$ denote all the class priors, class-specific mean vectors, and covariance matrices $\left\{ \pi_1, \dots, \pi_K, \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \mathbf{\Sigma}_1, \dots, \mathbf{\Sigma}_K \right\}$. 
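To make the likelihood above concrete, here is a minimal numerical sketch that evaluates its logarithm for hypothetical toy parameters (all names and values are illustrative; the one-hot rows of T pick out each point's class term, exactly as the indicator exponents $t_{nk}$ do):

```python
import numpy as np

def log_likelihood(X, T, pis, mus, covs):
    # Log of the double product: sum_n sum_k t_nk * (ln pi_k + ln N(x_n | mu_k, Sigma_k)).
    total = 0.0
    for x, t_row in zip(X, T):
        for k, t_nk in enumerate(t_row):
            if t_nk == 1:
                D = x.shape[0]
                diff = x - mus[k]
                log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(np.linalg.det(covs[k])))
                total += np.log(pis[k]) + log_norm - 0.5 * diff @ np.linalg.inv(covs[k]) @ diff
    return total

# Illustrative toy values: two points, two classes, one-hot targets.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
T = np.array([[1, 0], [0, 1]])
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), np.eye(2)]
ll = log_likelihood(X, T, pis, mus, covs)
```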
As we know, <strong>maximizing the likelihood is equivalent to maximizing the log-likelihood</strong>. The log-likelihood is</p> \begin{aligned} \ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) &amp;= \ln \prod_{n=1}^N \prod_{k=1}^K \pi_k^{t_{nk}} \mathcal{N} \left( \mathbf{x}_n \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right)^{t_{nk}} \\ &amp;= \sum_{n=1}^N \ln \prod_{k=1}^K \pi_k^{t_{nk}} \mathcal{N} \left( \mathbf{x}_n \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right)^{t_{nk}} \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K \ln \left( \pi_k^{t_{nk}} \mathcal{N} \left( \mathbf{x}_n \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right)^{t_{nk}} \right) \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k + \ln \mathcal{N} \left( \mathbf{x}_n \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right) \right) \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K \left( t_{nk} \ln \pi_k + t_{nk} \ln \mathcal{N} \left( \mathbf{x}_n \mid \boldsymbol{\mu}_k, \mathbf{\Sigma}_k \right) \right). \quad \quad (1) \end{aligned} <p>Expanding $(1)$ will greatly help us in the upcoming derivations:</p> \begin{aligned} \ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k + \ln \left( \frac{1}{\sqrt{ (2\pi)^D \det{\mathbf{\Sigma}_k} }} \exp{\left( -\frac{1}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}_k^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right)} \right) \right) \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k + \ln \frac{1}{\sqrt{ (2\pi)^D \det{\mathbf{\Sigma}_k} }} -\frac{1}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}_k^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right) \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k -\frac{1}{2} \ln \left( (2\pi)^D \det{\mathbf{\Sigma}_k} \right) -\frac{1}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}_k^{-1} \left( \mathbf{x}_n - 
\boldsymbol{\mu}_k \right) \right) \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k -\frac{1}{2} \left( D \ln 2\pi + \ln \det{\mathbf{\Sigma}_k} \right) -\frac{1}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}_k^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right) \\ &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k -\frac{D}{2} \ln 2\pi +\frac{1}{2} \ln \det{\mathbf{\Sigma}_k^{-1}} -\frac{1}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}_k^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right). \quad \quad (2) \end{aligned} <p>We have to find the maximum likelihood solution for $\pi_c$, $\boldsymbol{\mu}_c$, and $\mathbf{\Sigma}_c$. Starting with $\pi_c$, we have to take the derivative of $(2)$, set it equal to 0, and solve for $\pi_c$; however, we have to maintain the constraint $\sum_{k=1}^K \pi_k = 1$. This is done by using a Lagrange multiplier $\lambda$, and instead maximizing</p> $\ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right). \quad \quad (3)$ <p>Using the result from $(2)$, we then take the derivative of $(3)$ with respect to $\pi_c$, set it equal to 0, and solve for $\pi_c$</p> \begin{aligned} \frac{\partial}{\partial \pi_c} \left( \ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right) \right) &amp;= 0 \\ \sum_{n=1}^N \frac{\partial}{\partial \pi_c} \left( t_{nc} \ln \pi_c \right) + \lambda \frac{\partial}{\partial \pi_c} \left( \pi_c - 1 \right) &amp;= 0 \\ \sum_{n=1}^N \frac{t_{nc}}{\pi_c} + \lambda &amp;= 0 \\ \pi_c \lambda &amp;= - N_c, \quad \quad (4) \end{aligned} <p>where $N_c$ is the number of data points in class $\mathcal{C}_c$, and since we know that $\sum_{k=1}^K \pi_k = 1$, we can find $\lambda$</p> \begin{aligned} \sum_{k=1}^K \pi_k \lambda &amp;= - \sum_{k=1}^K N_k \\ \lambda &amp;= -N. 
\end{aligned} <p>Substituting $\lambda = -N$ back into $(4)$ gives us</p> \begin{aligned} -\pi_c N &amp;= -N_c \\ \pi_c &amp;= \frac{N_c}{N}. \quad \quad (5) \end{aligned} <p>$(5)$ tells us that the class prior is simply the proportion of data points that belong to the class, which intuitively makes sense as well. Now we turn to maximizing the log-likelihood with respect to $\boldsymbol{\mu}_c$. Again, using the result from $(2)$ makes it easy for us to take the derivative with respect to $\boldsymbol{\mu}_c$, set it equal to 0, and solve for $\boldsymbol{\mu}_c$</p> \begin{aligned} \frac{\partial}{\partial \boldsymbol{\mu}_c} \left( \ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) \right) &amp;= 0 \\ \sum_{n=1}^N \frac{\partial}{\partial \boldsymbol{\mu}_c} -\frac{t_{nc}}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) &amp;= 0. \end{aligned} <p>To evaluate this derivative, we use the following <a href="https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-vector_identities">matrix calculus identity</a>:</p> <blockquote> <p><em>If $\mathbf{A}$ is not a function of $\mathbf{u}$, and $\mathbf{A}$ is symmetric, then $\frac{\partial \mathbf{u}^\intercal \mathbf{A} \mathbf{u}}{\partial \mathbf{u}} = 2 \mathbf{A} \mathbf{u}$.</em></p> </blockquote> <p>Since covariance matrices are always symmetric, and the inverse of a symmetric matrix is also symmetric, we can use this identity to get</p> \begin{aligned} \sum_{n=1}^N \frac{\partial}{\partial \boldsymbol{\mu}_c} -\frac{t_{nc}}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) &amp;= 0 \\ \sum_{n=1}^N t_{nc} \mathbf{\Sigma}_c^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) &amp;= 0 \\ \sum_{n=1}^N t_{nc} \mathbf{\Sigma}_c^{-1} \mathbf{x}_n &amp;= \sum_{n=1}^N t_{nc} \mathbf{\Sigma}_c^{-1} \boldsymbol{\mu}_c \\ \sum_{n=1}^N 
t_{nc} \mathbf{x}_n &amp;= N_c \boldsymbol{\mu}_c \\ \frac{1}{N_c} \sum_{n=1}^N t_{nc} \mathbf{x}_n &amp;= \boldsymbol{\mu}_c. \quad \quad (6) \end{aligned} <p>Let’s take a moment to understand what $(6)$ is saying. $t_{nc}$ is only equal to 1 for the data points that belong to class $\mathcal{C}_c$, which means that the sum on the left-hand side of $(6)$ only includes input variables $\mathbf{x}$ that belong to class $\mathcal{C}_c$. Afterwards, we divide that sum of vectors by the number of data points in the class, $N_c$, which is the same as taking the average of the vectors. This means that the class-specific mean vector $\boldsymbol{\mu}_c$ is the average of the input variables $\mathbf{x}_n$ that belong to the class, i.e., <strong>the class-specific mean vector is just the mean of the vectors of the class</strong>. Once again, this makes intuitive sense as well.</p> <p>Lastly, we have to maximize the log-likelihood with respect to the class-specific covariance matrix $\mathbf{\Sigma}_c$. Again, we take the derivative using the result from $(2)$, set it equal to 0, and solve (strictly speaking, the steps below differentiate with respect to the precision matrix $\mathbf{\Sigma}_c^{-1}$, which yields the same stationary points)</p> \begin{aligned} \frac{\partial}{\partial \mathbf{\Sigma}_c} \left( \ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) \right) &amp;= 0 \\ \sum_{n=1}^N \frac{\partial}{\partial \mathbf{\Sigma}_c} \left( \frac{t_{nc}}{2} \ln \det{\mathbf{\Sigma}_c^{-1}} -\frac{t_{nc}}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \right) &amp;= 0 \\ \sum_{n=1}^N \frac{t_{nc}}{2} \frac{\partial}{\partial \mathbf{\Sigma}_c} \left( \ln \det{\mathbf{\Sigma}_c^{-1}} -\left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \right) &amp;= 0. \end{aligned} <p>This derivative requires a bit more work. 
Firstly, we can use the following <a href="https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-vector_identities">identity</a>:</p> <blockquote> <p><em>If $a$ is not a function of $\mathbf{A}$, and $\mathbf{A}$ is symmetric, then $\frac{\partial \ln \det a\mathbf{A}}{\partial \mathbf{A}} = \mathbf{A}^{-1}$.</em></p> </blockquote> <p>This takes care of the first part. Secondly, we use a <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)#Trace_of_a_product">property of the trace of a product</a>:</p> <blockquote> <p><em>If $\mathbf{u}$ is a column vector, then $\mathbf{u}^\intercal \mathbf{A} \mathbf{u} = \text{tr} \left( \mathbf{u} \mathbf{u}^\intercal \mathbf{A} \right)$.</em></p> </blockquote> <p>Finally, we use another <a href="https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-vector_identities">matrix calculus identity</a>:</p> <blockquote> <p><em>If $\mathbf{B}$ is not a function of $\mathbf{A}$, then $\frac{\partial \text{tr}\left( \mathbf{B} \mathbf{A} \right)}{\partial \mathbf{A}} = \mathbf{B}^\intercal$.</em></p> </blockquote> <p>This now gives us</p> \begin{aligned} \sum_{n=1}^N \frac{t_{nc}}{2} \frac{\partial}{\partial \mathbf{\Sigma}_c} \left( \ln \det{\mathbf{\Sigma}_c^{-1}} -\left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \right) &amp;= 0 \\ \sum_{n=1}^N t_{nc} \left( \mathbf{\Sigma}_c -\frac{\partial}{\partial \mathbf{\Sigma}_c} \left( \text{tr} \left( \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \right) \right) \right) &amp;= 0 \\ \sum_{n=1}^N t_{nc} \left( \mathbf{\Sigma}_c -\left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \right) &amp;= 0 \\ \sum_{n=1}^N t_{nc} \mathbf{\Sigma}_c &amp;= \sum_{n=1}^N t_{nc} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \\ \mathbf{\Sigma}_c &amp;= 
\frac{1}{N_c} \sum_{n=1}^N t_{nc} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal. \quad \quad (7) \end{aligned} <p>Just like the class-specific mean vector is the mean of the vectors of the class, <strong>the class-specific covariance matrix is just the covariance of the vectors of the class</strong>, and we end up with our maximum likelihood solutions $(5)$, $(6)$, and $(7)$. Thus, we can classify using the following</p> $\hat{h} (\mathbf{x}) = \underset{c}{\text{argmax }} \ln \pi_c +\frac{1}{2} \ln \det{\mathbf{\Sigma}_c^{-1}} -\frac{1}{2} \left( \mathbf{x} - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}_c^{-1} \left( \mathbf{x} - \boldsymbol{\mu}_c \right). \quad \quad (8)$ <h3 id="python-implementation">Python implementation</h3> <p>Let’s start with some data - you can see it in the plot underneath. You can download the data <a href="https://github.com/cookieblues/cookieblues.github.io/raw/master/extra/bsmalea-notes-3b/data.csv">here</a>.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-3b/data.svg" /></p> <p>The code underneath is a simple implementation of QDA that we just went over.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> <span class="k">class</span> <span class="nc">QDA</span><span class="p">:</span> <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">priors</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span> <span class="o">=</span> <span class="nb">dict</span><span
class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">covs</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">:</span> <span class="n">X_c</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">t</span> <span class="o">==</span> <span class="n">c</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">priors</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">X_c</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X_c</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">covs</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span 
class="n">cov</span><span class="p">(</span><span class="n">X_c</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span> <span class="n">preds</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">:</span> <span class="n">posts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">:</span> <span class="n">prior</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">priors</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="n">inv_cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">covs</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="n">inv_cov_det</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">det</span><span class="p">(</span><span class="n">inv_cov</span><span class="p">)</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">x</span><span class="o">-</span><span class="bp">self</span><span
class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="n">likelihood</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">inv_cov_det</span><span class="p">)</span> <span class="o">-</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">diff</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">inv_cov</span> <span class="o">@</span> <span class="n">diff</span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">+</span> <span class="n">likelihood</span> <span class="n">posts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">post</span><span class="p">)</span> <span class="n">pred</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">posts</span><span class="p">)]</span> <span class="n">preds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span> <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span></code></pre></figure> <p>We can now make predictions with the following code.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"../data.csv"</span><span class="p">,</span> <span 
class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">,</span> <span class="n">skiprows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="n">t</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span> <span class="n">qda</span> <span class="o">=</span> <span class="n">QDA</span><span class="p">()</span> <span class="n">qda</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">qda</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span></code></pre></figure> <p>This gives us the Gaussian distributions along with predictions that are shown below.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-3b/qda/preds.svg" /></p> <p>To make it easier to illustrate how QDA works and how well it works, we can chart the original classes of the data points over the decision boundaries. 
This is shown underneath.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-3b/qda/decision_boundary.svg" /></p> <h2 id="linear-discriminant-analysis-lda">Linear Discriminant Analysis (LDA)</h2> <h3 id="setup-and-objective-1">Setup and objective</h3> <p>The only difference between linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) is that <strong>LDA does not have class-specific covariance matrices, but one shared covariance matrix among the classes</strong>. So, given a training dataset of $N$ input variables $\mathbf{x} \in \mathbb{R}^D$ with corresponding target variables $t \in \mathcal{C}_c$ where $c \in \{1, \dots, K\}$, LDA assumes that the <strong>class-conditional densities</strong> are normally distributed</p> $\text{Pr}(\mathbf{x} \mid t = c, \boldsymbol{\mu}_c, \mathbf{\Sigma}) = \mathcal{N} \left( \mathbf{x} \mid \boldsymbol{\mu}_c, \mathbf{\Sigma} \right),$ <p>where $\boldsymbol{\mu}_c$ is the <strong>class-specific mean vector</strong> and $\mathbf{\Sigma}$ is the <strong>shared covariance matrix</strong>. Using Bayes’ theorem, we can now calculate the class posterior</p> $\overbrace{\text{Pr}(t=c | \mathbf{x}, \boldsymbol{\mu}_c, \mathbf{\Sigma})}^{\text{class posterior}} = \frac{ \overbrace{\text{Pr}(\mathbf{x} \mid t = c, \boldsymbol{\mu}_c, \mathbf{\Sigma})}^{\text{class-conditional density}} \, \overbrace{\text{Pr}(t=c)}^{\text{class prior}} }{ \sum_{k=1}^K \text{Pr}(\mathbf{x} \mid t = k, \boldsymbol{\mu}_k, \mathbf{\Sigma}) \, \text{Pr}(t=k) }.$ <p>We will then classify $\mathbf{x}$ into class</p> $\hat{h} (\mathbf{x}) = \underset{c}{\text{argmax }} \text{Pr}(t=c | \mathbf{x}, \boldsymbol{\mu}_c, \mathbf{\Sigma}).$ <h3 id="derivation-and-training-1">Derivation and training</h3> <p>Just like in QDA, we need the log-likelihood. 
Using $(2)$, we find that</p> $\ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) = \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \ln \pi_k -\frac{D}{2} \ln 2\pi +\frac{1}{2} \ln \det{\mathbf{\Sigma}^{-1}} -\frac{1}{2} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right). \quad \quad (9)$ <p>Looking at $(9)$, we can see that there is no difference for the class-specific priors $(5)$ and means $(6)$ between QDA and LDA. However, the shared covariance matrix is obviously different - taking the derivative of $(9)$ with respect to the shared covariance matrix $\mathbf{\Sigma}$ and setting it equal to 0 gives us</p> \begin{aligned} \frac{\partial}{\partial \mathbf{\Sigma}} \ln \text{Pr} \left( \textbf{\textsf{t}} | \boldsymbol{\theta} \right) &amp;= 0 \\ \sum_{n=1}^N \sum_{k=1}^K \frac{t_{nk}}{2} \frac{\partial}{\partial \mathbf{\Sigma}} \left( \ln \det{\mathbf{\Sigma}^{-1}} -\left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right) &amp;= 0. 
\end{aligned} <p>Using the same matrix calculus properties as earlier, we can evaluate the derivative</p> \begin{aligned} \sum_{n=1}^N \sum_{k=1}^K \frac{t_{nk}}{2} \frac{\partial}{\partial \mathbf{\Sigma}} \left( \ln \det{\mathbf{\Sigma}^{-1}} -\left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}^{-1} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \right) &amp;= 0 \\ \sum_{n=1}^N \sum_{k=1}^K \frac{t_{nk}}{2} \left( \mathbf{\Sigma} -\frac{\partial}{\partial \mathbf{\Sigma}} \left( \text{tr} \left( \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \mathbf{\Sigma}^{-1} \right) \right) \right) &amp;= 0 \\ \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \mathbf{\Sigma} -\left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \right) &amp;= 0 \\ \sum_{n=1}^N \sum_{k=1}^K t_{nk} \mathbf{\Sigma} &amp;= \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal \\ \mathbf{\Sigma} &amp;= \frac{1}{N} \sum_{n=1}^N \sum_{k=1}^K t_{nk} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right)^\intercal. \quad \quad (10) \end{aligned} <p>We find that <strong>the shared covariance matrix is the average of the class-specific covariance matrices, weighted by the class proportions</strong>, also known as the pooled within-class covariance. 
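A minimal NumPy sketch of $(10)$ on hypothetical toy data; note that this pooled estimate generally differs from the covariance of the pooled data whenever the class means differ, because the latter is inflated by the spread between the means (all names and values here are illustrative):

```python
import numpy as np

# Illustrative toy data: two well-separated classes in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
t = np.array([0] * 50 + [1] * 50)

# Equation (10): sum the within-class scatter over all classes, divide by N.
N, D = X.shape
pooled = np.zeros((D, D))
for c in np.unique(t):
    X_c = X[t == c]
    diff = X_c - X_c.mean(axis=0)   # x_n - mu_c for every point in class c
    pooled += diff.T @ diff         # accumulates the sum of outer products
pooled /= N

# The covariance of the pooled data uses the global mean instead,
# so it also picks up the between-class separation.
total = np.cov(X, rowvar=False, bias=True)
```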
Thus, we can classify using the following</p> $\hat{h} (\mathbf{x}) = \underset{c}{\text{argmax }} \ln \pi_c +\frac{1}{2} \ln \det{\mathbf{\Sigma}^{-1}} -\frac{1}{2} \left( \mathbf{x} - \boldsymbol{\mu}_c \right)^\intercal \mathbf{\Sigma}^{-1} \left( \mathbf{x} - \boldsymbol{\mu}_c \right).$ <h3 id="python-implementation-1">Python implementation</h3> <p>The code underneath is a simple implementation of LDA that we just went over.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> <span class="k">class</span> <span class="nc">LDA</span><span class="p">:</span> <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">priors</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">:</span> <span class="n">X_c</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">t</span> <span class="o">==</span> <span class="n">c</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">priors</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">X_c</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X_c</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">X_c</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">cov</span> <span class="o">+=</span> <span class="n">diff</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">diff</span> <span class="bp">self</span><span class="p">.</span><span class="n">cov</span> <span class="o">/=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span> <span class="n">preds</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">:</span> <span class="n">posts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">:</span> <span class="n">prior</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span
class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">priors</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="n">inv_cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cov</span><span class="p">)</span> <span class="n">inv_cov_det</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">det</span><span class="p">(</span><span class="n">inv_cov</span><span class="p">)</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">x</span><span class="o">-</span><span class="bp">self</span><span class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="n">likelihood</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">inv_cov_det</span><span class="p">)</span> <span class="o">-</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">diff</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">inv_cov</span> <span class="o">@</span> <span class="n">diff</span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">+</span> <span class="n">likelihood</span> <span class="n">posts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">post</span><span class="p">)</span> <span class="n">pred</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span 
class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">posts</span><span class="p">)]</span> <span class="n">preds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span> <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span></code></pre></figure> <p>We can now make predictions with the following code.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"../data.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">,</span> <span class="n">skiprows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="n">t</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span> <span class="n">lda</span> <span class="o">=</span> <span class="n">LDA</span><span class="p">()</span> <span class="n">lda</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">lda</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span></code></pre></figure> 
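<p>Before looking at the chart, recall what makes LDA (and the other GDA models) <em>generative</em>: once the class priors, class means, and the shared covariance matrix have been estimated, we can sample new data points from the joint distribution $\text{Pr}(\mathbf{x}, t) = \text{Pr}(\mathbf{x} \mid t) \, \text{Pr}(t)$ by first drawing a class from the prior and then drawing an input from that class's Gaussian. The snippet below is a minimal sketch of this two-step sampling; the prior, mean, and covariance values in it are made-up placeholders, not parameters fitted to the dataset above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted LDA parameters for two classes in 2D:
# class priors pi_c, class-specific means mu_c, and one shared covariance matrix.
priors = {0: 0.4, 1: 0.6}
means = {0: np.array([-1.0, 0.0]), 1: np.array([2.0, 1.0])}
cov = np.array([[1.0, 0.3], [0.3, 0.5]])

def sample(n):
    """Draw n points from the joint distribution Pr(x, t) = Pr(x | t) Pr(t)."""
    classes = np.array(sorted(priors))
    # Step 1: sample the class t from the prior Pr(t) ...
    ts = rng.choice(classes, size=n, p=[priors[c] for c in classes])
    # Step 2: ... then sample x from the class-conditional density N(mu_t, Sigma).
    xs = np.array([rng.multivariate_normal(means[t], cov) for t in ts])
    return xs, ts

X_new, t_new = sample(5)
```

<p>For QDA or naive Bayes the procedure is identical; only step 2 changes, drawing from the class-specific (or diagonal) covariance matrix instead of the shared one.</p>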
<p>Underneath is a chart with the data points (color coded to match their respective classes), the class distributions that our LDA model finds, and the decision boundaries generated by the respective class distributions. As we can see, LDA has a more restrictive decision boundary, because it requires the class distributions to have the same covariance matrix.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-3b/lda/preds.svg" /></p> <h2 id="gaussian-naive-bayes">(Gaussian) Naive Bayes</h2> <h3 id="setup-and-objective-2">Setup and objective</h3> <p>We’ve looked at quadratic discriminant analysis (QDA), which assumes class-specific covariance matrices, and linear discriminant analysis (LDA), which assumes a shared covariance matrix among the classes, and now we’ll look at (Gaussian) Naive Bayes, which differs slightly from both. Naive Bayes assumes that the features are conditionally independent given the class. This means that <strong>we are still assuming class-specific covariance matrices (as in QDA), but the covariance matrices are diagonal matrices</strong>, because the off-diagonal entries are exactly the covariances between pairs of features, which are zero under this independence assumption.</p> <p>So, given a training dataset of $N$ input variables $\mathbf{x} \in \mathbb{R}^D$ with corresponding target variables $t \in \mathcal{C}_c$ where $c \in \{1, \dots, K\}$, (Gaussian) Naive Bayes assumes that the <strong>class-conditional densities</strong> are normally distributed</p> $\text{Pr}(\mathbf{x} \mid t = c, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c) = \mathcal{N} \left( \mathbf{x} \mid \boldsymbol{\mu}_c, \mathbf{\Sigma}_c \right),$ <p>where $\boldsymbol{\mu}_c$ is the <strong>class-specific mean vector</strong> and $\mathbf{\Sigma}_c$ is the <strong>class-specific diagonal covariance matrix</strong>. 
Using Bayes’ theorem, we can now calculate the class posterior</p> $\overbrace{\text{Pr}(t=c | \mathbf{x}, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c)}^{\text{class posterior}} = \frac{ \overbrace{\text{Pr}(\mathbf{x} \mid t = c, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c)}^{\text{class-conditional density}} \, \overbrace{\text{Pr}(t=c)}^{\text{class prior}} }{ \sum_{k=1}^K \text{Pr}(\mathbf{x} \mid t = k, \boldsymbol{\mu}_k, \mathbf{\Sigma}_k) \, \text{Pr}(t=k) }.$ <p>We will then classify $\mathbf{x}$ into class</p> $\hat{h} (\mathbf{x}) = \underset{c}{\text{argmax }} \text{Pr}(t=c | \mathbf{x}, \boldsymbol{\mu}_c, \mathbf{\Sigma}_c).$ <h3 id="derivation-and-training-2">Derivation and training</h3> <p>The derivation follows that of the class-specific priors, means, and covariance matrices from QDA. The only difference is that we have to set everything but the diagonal to 0 in the class-specific covariance matrices. We therefore get the following</p> \begin{aligned} \pi_c &amp;= \frac{N_c}{N} \\ \boldsymbol{\mu}_c &amp;= \frac{1}{N_c} \sum_{n=1}^N t_{nc} \mathbf{x}_n \\ \mathbf{\Sigma}_c &amp;= \mathrm{diag} \left( \frac{1}{N_c} \sum_{n=1}^N t_{nc} \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right) \left( \mathbf{x}_n - \boldsymbol{\mu}_c \right)^\intercal \right) \end{aligned} <p>where diag means that we set every value not on the diagonal equal to 0.</p> <h3 id="python-implementation-2">Python implementation</h3> <p>The code underneath is a simple implementation of (Gaussian) Naive Bayes that we just went over.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">GaussianNB</span><span class="p">:</span> <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span> <span class="bp">self</span><span
class="p">.</span><span class="n">priors</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">covs</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">:</span> <span class="n">X_c</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">t</span> <span class="o">==</span> <span class="n">c</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">priors</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">X_c</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X_c</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span 
class="mi">0</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">covs</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">X_c</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="bp">False</span><span class="p">)))</span> <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span> <span class="n">preds</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">:</span> <span class="n">posts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">:</span> <span class="n">prior</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">priors</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="n">inv_cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">covs</span><span 
class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="n">inv_cov_det</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">det</span><span class="p">(</span><span class="n">inv_cov</span><span class="p">)</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">x</span><span class="o">-</span><span class="bp">self</span><span class="p">.</span><span class="n">means</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="n">likelihood</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">inv_cov_det</span><span class="p">)</span> <span class="o">-</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">diff</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">inv_cov</span> <span class="o">@</span> <span class="n">diff</span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">+</span> <span class="n">likelihood</span> <span class="n">posts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">post</span><span class="p">)</span> <span class="n">pred</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">classes</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">posts</span><span class="p">)]</span> <span class="n">preds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span> <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span 
class="n">array</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span></code></pre></figure> <p>We can now make predictions with the following code.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"../data.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">,</span> <span class="n">skiprows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="n">t</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span> <span class="n">nb</span> <span class="o">=</span> <span class="n">GaussianNB</span><span class="p">()</span> <span class="n">nb</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">nb</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span></code></pre></figure> <p>Underneath is a chart with the data points (color coded to match their respective classes), the class distributions that our (Gaussian) Naive Bayes model finds, and the decision boundaries generated by the respective class distributions. 
Note that while the decision boundary is not linear as in the case of LDA, the class distributions are axis-aligned Gaussian distributions (their elliptical contours have axes parallel to the coordinate axes), since the covariance matrices are diagonal matrices.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-3b/nb/preds.svg" /></p>As mentioned in notes 3a, generative classifiers model the joint probability distribution of the input and target variables $\text{Pr}(\mathbf{x}, t)$. This means, we would end up with a distribution that could generate (hence the name) new input variables with their respective targets, i.e., we can sample new data points with the joint probability distribution, and we will see how to do that in this post.Machine learning, notes 3a: Classification2021-03-30T00:00:00+00:002021-03-30T00:00:00+00:00https://cookieblues.github.io//guides/2021/03/30/bsmalea-notes-3a<p>As mentioned in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a">notes 1a</a>, in classification the possible values for the target variables are discrete, and we call these possible values “classes”. In <a href="https://cookieblues.github.io//guides/2021/03/22/bsmalea-notes-2">notes 2</a> we went through regression, which in short refers to constructing a function $h( \mathbf{x} )$ from a dataset $\mathbf{X} = \left( (\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N) \right)$ that yields prediction values $t$ for new values of $\mathbf{x}$. 
The objective in classification is the same, except the values of $t$ are discrete.</p> <p>We are going to cover 3 different approaches or types of classifiers:</p> <ul> <li><strong>Generative classifiers</strong> that model the joint probability distribution of the input and target variables $\text{Pr}(\mathbf{x}, t)$.</li> <li><strong>Discriminative classifiers</strong> that model the conditional probability distribution of the target given an input variable $\text{Pr}(t | \mathbf{x})$.</li> <li><strong>Distribution-free classifiers</strong> that do not use a probability model but directly assign inputs to target values.</li> </ul> <p>A quick disclaimer for this topic: <strong>the terminology will be very confusing</strong>, but we’ll deal with that when we cross those bridges.</p> <h2 id="generative-vs-discriminative">Generative vs discriminative</h2> <p>Here’s the list of classifiers that we will go over: for <strong>generative classifiers</strong> it’s <strong>quadratic discriminant analysis (QDA)</strong>, <strong>linear discriminant analysis (LDA)</strong>, and (Gaussian) <strong>naive Bayes</strong>, which are all special cases of the same model; for <strong>discriminative classifiers</strong> it’s <strong>logistic regression</strong>; and for <strong>distribution-free classifiers</strong> we will take a look at the <strong>perceptron</strong> as well as the <strong>support vector machine (SVM)</strong>.</p> <p>So, they all do the same thing (classification). Which one is the best? Which one should you use? Well, let’s recall <a href="https://cookieblues.github.io//guides/2021/03/11/bsmalea-notes-1b/">the “no free lunch” theorem</a>, which broadly states that there isn’t one model that is always better than another. It always depends on your data. That being said, there are some things we can generally say about generative and discriminative classifiers. 
Ng and Jordan (2001) found that, when repeatedly applying naive Bayes and logistic regression to binary classification tasks, <strong>naive Bayes (generative) performed better with less data, but logistic regression tended to perform better in general</strong><span class="sidenote-number"></span><span class="sidenote">Andrew Y. Ng and Michael I. Jordan, “On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes,” 2001.</span>. However, Ulusoy and Bishop (2006) note that <strong>this is only the case when the data follow the assumptions of the generative model</strong><span class="sidenote-number"></span><span class="sidenote">Ilkay Ulusoy and Christopher Bishop, “Comparison of Generative and Discriminative Techniques for Object Detection and Classification,” 2006.</span>, which means that logistic regression (discriminative) is generally better than naive Bayes (generative).</p> <p><strong>The general consensus is that discriminative models outperform generative models in most cases</strong>. The reason is that generative models in some sense have a more difficult job, as they try to model the joint distribution instead of just the posterior. They also often make unrealistic assumptions about the data. Still, it cannot be stressed enough that this is not always the case, and <strong>you should not disregard generative models</strong>. As an example, generative adversarial networks (GANs) are generative models that have proved extremely useful in a variety of tasks. There are a few other reasons why you shouldn’t disregard generative models, e.g. they tend to be easier to fit. 
Regardless, we’re not here to figure out which model to use, but to learn about both.</p> <h2 id="important-tools">Important tools</h2> <h3 id="multivariate-gaussian-distribution">Multivariate Gaussian distribution</h3> <p>In the following posts, we are going to rely heavily on the multivariate Gaussian (normal) distribution, and it’s very important that you grasp it. The multivariate Gaussian distribution is denoted $\mathcal{N} (\boldsymbol{\mu}, \mathbf{\Sigma})$, where $\boldsymbol{\mu}$ is the mean vector and $\mathbf{\Sigma}$ is the covariance matrix. The probability density function in $D$ dimensions is defined as</p> $\mathcal{N} (\mathbf{x} | \boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D \det{\mathbf{\Sigma}}}} \exp{\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\intercal \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)}.$ <p>The covariance matrix determines the shape of the Gaussian distribution and is an important concept for the classifiers we are going to look at. The image underneath illustrates different types of covariance matrices.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-3a/gaussians.svg" /></p> <h3 id="bayes-theorem">Bayes’ theorem</h3> <p>Another important tool we are going to use is Bayes’ theorem. If you haven’t read the <a href="https://cookieblues.github.io//guides/2021/03/15/bsmalea-notes-1c/">post on frequentism and Bayesianism</a>, then here’s a quick recap on Bayes’ theorem. 
Given 2 events $\mathcal{A}$ and $\mathcal{B}$, we can expand their joint probability with conditional probabilities</p> $\text{Pr} (\mathcal{A}, \mathcal{B}) = \text{Pr} (\mathcal{A} | \mathcal{B}) \text{Pr} (\mathcal{B}) = \text{Pr} (\mathcal{B} | \mathcal{A}) \text{Pr} (\mathcal{A}).$ <p>Using the equation on the right, we can rewrite it and get Bayes’ theorem</p> \begin{aligned} \text{Pr} (\mathcal{A} | \mathcal{B}) \text{Pr} (\mathcal{B}) &amp;= \text{Pr} (\mathcal{B} | \mathcal{A}) \text{Pr} (\mathcal{A}) \\ \text{Pr} (\mathcal{A} | \mathcal{B}) &amp;= \frac{\text{Pr} (\mathcal{B} | \mathcal{A}) \text{Pr} (\mathcal{A})}{\text{Pr} (\mathcal{B})}. \end{aligned} <p>In terms of a hypothesis and data, we often use the words posterior, likelihood, prior, and evidence to refer to the parts of Bayes’ theorem</p> $\overbrace{\text{Pr}(\mathcal{\text{hypothesis}} | \mathcal{\text{data}})}^{\text{posterior}} = \frac{ \overbrace{\text{Pr}(\mathcal{\text{data}} | \mathcal{\text{hypothesis}})}^{\text{likelihood}} \, \overbrace{\text{Pr}(\mathcal{\text{hypothesis}})}^{\text{prior}} }{ \underbrace{\text{Pr}(\mathcal{\text{data}})}_{\text{evidence}} }.$ <p>We often write this as</p> $\text{posterior} \propto \text{likelihood} \times \text{prior},$ <p>where $\propto$ means “proportional to”.</p>As mentioned in notes 1a, in classification the possible values for the target variables are discrete, and we call these possible values “classes”. In notes 2 we went through regression, which in short refers to constructing a function $h( \mathbf{x} )$ from a dataset $\mathbf{X} = \left( (\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N) \right)$ that yields prediction values $t$ for new values of $\mathbf{x}$. 
The objective in classification is the same, except the values of $t$ are discrete.Machine learning, notes 2: Regression2021-03-22T00:00:00+00:002021-03-22T00:00:00+00:00https://cookieblues.github.io//guides/2021/03/22/bsmalea-notes-2<p>Regression analysis refers to a set of techniques for estimating relationships among variables. This post introduces <strong>linear regression</strong> augmented by <strong>basis functions</strong> to enable non-linear adaptation, which lies at the heart of supervised learning, as will be apparent when we turn to classification. Thus, a thorough understanding of this model will be hugely beneficial. We’ll go through 2 derivations of the optimal parameters, namely the method of <strong>ordinary least squares (OLS)</strong>, which we briefly looked at in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a>, and <strong>maximum likelihood estimation (MLE)</strong>. We’ll also dabble with some Python throughout the post.</p> <h3 id="setup-and-objective">Setup and objective</h3> <p>Given a training dataset of $N$ input variables $\mathbf{x} \in \mathbb{R}^D$ with corresponding target variables $t \in \mathbb{R}$, the objective of regression is to construct a function $h(\mathbf{x})$ that yields prediction values of $t$ for new values of $\mathbf{x}$.</p> <p>The simplest linear model for regression is just known as <em>linear regression</em>, where the predictions are generated by</p> $h\left(\mathbf{x},\mathbf{w} \right) = w_0 + w_1 x_1 + \dots + w_D x_D = w_0 + \sum_{i=1}^D w_i x_i. \quad \quad (1)$ <p>The first term $w_0$ is commonly called the <strong>intercept</strong> or <strong>bias</strong> parameter, and allows $h$ to adapt to a fixed offset in the data<span class="sidenote-number"></span><span class="sidenote">We’ll show exactly what this means later in this post.</span>. 
If we introduce a $1$ as the first element of each input variable $\mathbf{x}$, we can rewrite $(1)$ with vector notation, i.e. if we define $\mathbf{x} = \left( 1, x_1, \dots , x_D \right)^\intercal$, we can rewrite $(1)$ as</p> $h(\mathbf{x},\mathbf{w}) = \mathbf{w}^\intercal \mathbf{x},$ <p>where $\mathbf{w} = \left(w_0, \dots, w_D \right)^\intercal$.</p> <p>In <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a> we went over polynomial regression with a $1$-dimensional input variable $x \in \mathbb{R}$. Now we’re just doing linear regression (not polynomial), but we’re allowing our input variable to be $D$-dimensional $\mathbf{x} \in \mathbb{R}^D$, hence it becomes a vector instead of a scalar. For the sake of visualization, however, let’s stick to the same example dataset:</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.8</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.6</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.4</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> <span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span 
class="p">([</span><span class="o">-</span><span class="mf">4.9</span><span class="p">,</span> <span class="o">-</span><span class="mf">3.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.8</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.6</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.3</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">2.1</span><span class="p">,</span> <span class="mf">2.9</span><span class="p">,</span> <span class="mf">5.6</span><span class="p">])</span> <span class="n">N</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure> <p>Since our input variables are $1$-dimensional, we’ll have 2 parameters: $w_1$ and $w_0$, but to make sure we also find the bias parameter, we have to introduce a column of $1$s in <code class="language-plaintext highlighter-rouge">x</code>, like we defined $\mathbf{x} = \left( 1, x_1, \dots , x_D \right)^\intercal$. We can do this with the following piece of code</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">column_stack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">N</span><span class="p">),</span> <span class="n">x</span><span class="p">])</span></code></pre></figure> <h3 id="derivation-and-training">Derivation and training</h3> <p>So, how do we train the model? We’ll look at 2 different approaches of deriving the method of training this model. 
Recall that <strong>training</strong> (or <strong>learning</strong>) refers to <strong>the process of estimating the parameters</strong> of our model, so when we ask how to train the model, it’s the same as asking how to estimate the values of $\mathbf{w}$.</p> <h4 id="ordinary-least-squares">Ordinary least squares</h4> <p>Like we did in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a>, we define an <strong>objective function</strong> that measures the performance of our model in terms of an error, and then minimize this error with respect to our parameters. This means we find the parameters that result in the least error. We’ll use the same objective function as in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a>, the sum of squared errors (SSE), defined as</p> \begin{aligned} E(\mathbf{w}) &amp;= \sum_{n=1}^N \left( t_n - h \left( \mathbf{x}_n, \mathbf{w} \right) \right)^2 \\ &amp;= \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2, \quad \quad (2) \end{aligned} <p>and we want to find values for $\mathbf{w}$ that minimize $E$</p> <p><span class="marginnote"> To recap: we want to find the parameter values $\hat{\mathbf{w}}$ that minimize our objective function, which is defined as the sum of the squared differences between $h(\mathbf{x}_i,\mathbf{w})$ and $t_i$.</span></p> \begin{aligned} \hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} E(\mathbf{w}). \end{aligned} <p>Note that $(2)$ is a quadratic function of the parameters $\mathbf{w}$. Its partial derivatives with respect to $\mathbf{w}$ will therefore be linear in $\mathbf{w}$, which means there is a unique minimum. This is a property of <strong>convex</strong> functions.</p> <p>If we evaluate the SSE for values of $w_0$ and $w_1$ in a grid, then we can illustrate that our objective function has a unique minimum with a contour plot. 
This is shown below.</p> <p><span class="marginnote">The cross is located at the minimum of our objective function - this coordinate corresponds to the values of $w_0$ (x-axis) and $w_1$ (y-axis) that produce the smallest SSE for our dataset. The contour lines show some boundaries for the SSE; we can see that the optimal values of $w_0$ and $w_1$ lie within a boundary of value $19$, meaning that the minimum value of SSE is less than $19$.</span></p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-2/weights.svg" /></p> <p>So, how do we find the minimum of $E$? Recall from <a href="https://cookieblues.github.io//bslialo-notes-9b">the notes about extrema</a> that we find the minimum of a function by taking the derivative, setting the derivative equal to 0, and solving for the function variable. In our case, we have to take all the partial derivatives of $E$ with respect to $w_0, \dots, w_{D}$ and set them equal to 0. Remember that the partial derivatives of $E$ together give us the gradient $\nabla E$. To ease notation, let all our input variables be denoted by $\mathbf{X}$ with $N$ rows (one for each input variable) and $D+1$ columns (one for each feature plus one for the bias) defined as</p> $\mathbf{X} = \begin{pmatrix} 1 &amp; X_{11} &amp; \cdots &amp; X_{1D} \\ 1 &amp; X_{21} &amp; \cdots &amp; X_{2D} \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ 1 &amp; X_{N1} &amp; \cdots &amp; X_{ND} \end{pmatrix},$ <p>and let $\textbf{\textsf{t}} = \left( t_1, \dots, t_N \right)^\intercal$ denote all our target variables. 
We can now rewrite $(2)$ as</p> \begin{aligned} E(\mathbf{w}) &amp;= \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2 \\ &amp;= (t_1 - \mathbf{w}^\intercal \mathbf{x}_1) (t_1 - \mathbf{w}^\intercal \mathbf{x}_1) + \cdots + (t_N - \mathbf{w}^\intercal \mathbf{x}_N) (t_N - \mathbf{w}^\intercal \mathbf{x}_N) \\ &amp;= \left( \textbf{\textsf{t}} - \mathbf{X}\mathbf{w} \right)^\intercal \left( \textbf{\textsf{t}} - \mathbf{X}\mathbf{w} \right) \\ &amp;= \left( \textbf{\textsf{t}}^\intercal - \left( \mathbf{X}\mathbf{w} \right)^\intercal \right) \left( \textbf{\textsf{t}} - \mathbf{X}\mathbf{w} \right) \\ &amp;= \textbf{\textsf{t}}^\intercal \textbf{\textsf{t}} - \textbf{\textsf{t}}^\intercal \mathbf{X}\mathbf{w} - \left( \mathbf{X}\mathbf{w} \right)^\intercal \textbf{\textsf{t}} + \left( \mathbf{X}\mathbf{w} \right)^\intercal \mathbf{X}\mathbf{w} \\ &amp;= \left( \mathbf{X}\mathbf{w} \right)^\intercal \mathbf{X}\mathbf{w} - 2 \left( \mathbf{X}\mathbf{w} \right)^\intercal \textbf{\textsf{t}} + \textbf{\textsf{t}}^\intercal \textbf{\textsf{t}}. 
\end{aligned} <p>If we now take the derivative with respect to $\mathbf{w}$, we get</p> \begin{aligned} \nabla E(\mathbf{w}) &amp;= \frac{\partial}{\partial \mathbf{w}} \left( \left( \mathbf{X}\mathbf{w} \right)^\intercal \mathbf{X}\mathbf{w} - 2 \left( \mathbf{X}\mathbf{w} \right)^\intercal \textbf{\textsf{t}} + \textbf{\textsf{t}}^\intercal \textbf{\textsf{t}} \right) \\ &amp;= 2 \mathbf{X}^\intercal \mathbf{X}\mathbf{w} - 2 \mathbf{X}^\intercal \textbf{\textsf{t}}, \end{aligned} <p>and setting this result equal to 0 lets us solve for $\mathbf{w}$</p> \begin{aligned} 2 \mathbf{X}^\intercal \mathbf{X}\mathbf{w} - 2 \mathbf{X}^\intercal \textbf{\textsf{t}} &amp;= 0 \\ \mathbf{X}^\intercal \mathbf{X} \mathbf{w} &amp;= \mathbf{X}^\intercal \textbf{\textsf{t}}\\ \mathbf{w} &amp;= \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}, \end{aligned} <p>which are our estimated values for the parameters</p> <p><span class="marginnote">To recap once again: $\left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}$ are the values of $\mathbf{w}$ that minimize our SSE objective function defined in $(2)$.</span></p> $\hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} E(\mathbf{w}) = \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}.$ <h4 id="maximum-likelihood">Maximum likelihood</h4> <p>Choosing the SSE as the objective function might seem a bit arbitrary though - for example, why not just go with the sum of the errors? Why do we have to square them? To show why this is a good choice, and why the solution makes sense, we are going to derive the same solution from a probabilistic perspective using <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation (MLE)</a>. 
To do this, we assume that the target variable $t$ is given by our function $h(\mathbf{x}, \mathbf{w})$ with a bit of noise added:</p> $t = h \left( \mathbf{x},\mathbf{w} \right) + \epsilon,$ <p>where $\epsilon \sim \mathcal{N} \left( 0,\alpha \right)$, i.e. $\epsilon$ is a Gaussian random variable with mean 0 and variance $\alpha$. This lets us say that given an input variable $\mathbf{x}$, the corresponding target value $t$ is normally distributed with mean $h(\mathbf{x}, \mathbf{w})$ and variance $\alpha$, i.e.</p> $\text{Pr}(t | \mathbf{x}, \mathbf{w}, \alpha) = \mathcal{N} \left( t | h(\mathbf{x},\mathbf{w}), \alpha \right). \quad \quad (3)$ <p>Let’s take a moment to understand exactly what we’re doing. The image below illustrates what $(3)$ tells us. We are estimating parameters $\mathbf{w}$ such that our target variables $t$ are normally distributed around the output values of $h$.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-2/distribution.svg" /></p> <p>We can now use the entire dataset, $\mathbf{X}$ and $\textbf{\textsf{t}}$, to write up the likelihood function by making the assumption that our data points are drawn independently from $(3)$. The likelihood function then becomes the product of $(3)$ for all our input and target variable pairs, and is a function of $\mathbf{w}$ and $\alpha$:</p> $\text{Pr}(\textbf{\textsf{t}} | \mathbf{X}, \mathbf{w}, \alpha) = \prod_{i=1}^N \mathcal{N} \left( t_i | h(\mathbf{x}_i,\mathbf{w}), \alpha \right). \quad \quad (4)$ <p>Now we want to maximize the likelihood, which means we want to determine the values of our parameters $\mathbf{w}$ and $\alpha$ that maximize $(4)$. 
This might seem dauntingly difficult, but we can make it simpler with a handy trick.<span class="marginnote">Taking the log of the likelihood not only simplifies the math, but it helps computationally as well, since the product of many probabilities usually causes <a href="https://en.wikipedia.org/wiki/Arithmetic_underflow">underflow</a>, whereas the sum of logs doesn’t.</span> Since the logarithm is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood. Taking the log of the likelihood gives us</p> \begin{aligned} \ln \text{Pr}(\textbf{\textsf{t}} | \mathbf{X}, \mathbf{w}, \alpha) &amp;= \ln \left( \prod_{i=1}^N \mathcal{N} \left( t_i | h(\mathbf{x}_i,\mathbf{w}), \alpha \right) \right) \\ &amp;= \sum_{i=1}^N \ln \left( \frac{1}{\sqrt{2 \pi \alpha}} \exp \left( -\frac{(t_i - \mathbf{w}^\intercal \mathbf{x}_i)^2}{2 \alpha} \right) \right) \\ &amp;= N \ln \frac{1}{\sqrt{2 \pi \alpha}} - \sum_{i=1}^N \frac{(t_i - \mathbf{w}^\intercal \mathbf{x}_i)^2}{2 \alpha} \\ &amp;= - \frac{N}{2} \ln 2 \pi \alpha - \frac{1}{2 \alpha} \underbrace{\sum_{i=1}^N (t_i - \mathbf{w}^\intercal \mathbf{x}_i)^2}_{\text{SSE}}. \quad \quad (5) \end{aligned} <p>Now it becomes evident why the SSE objective function is a good choice; the last term of $(5)$ is the only part dependent on $\mathbf{w}$ and is the same as SSE. Since the first term does not depend on $\mathbf{w}$, we can omit it, and since the maximum of the likelihood function with respect to $\mathbf{w}$ does not change by scaling with the positive constant $\frac{1}{2 \alpha}$, we see that <strong>maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the SSE objective function</strong>. 
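This equivalence is easy to verify numerically. The sketch below uses made-up one-dimensional data with a known noise variance, evaluates both the SSE and the log-likelihood over a grid of candidate slopes, and confirms both are optimized at the same point:

```python
import numpy as np

# Made-up 1D data: t = 1.5 x + Gaussian noise with variance alpha.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
alpha = 0.3 ** 2
t = 1.5 * x + rng.normal(scale=0.3, size=30)

ws = np.linspace(0.0, 3.0, 301)  # candidate slopes (no bias term, for simplicity)
sse = np.array([np.sum((t - w * x) ** 2) for w in ws])
# the log-likelihood (5): a constant minus SSE scaled by 1/(2 alpha)
loglik = -len(x) / 2 * np.log(2 * np.pi * alpha) - sse / (2 * alpha)

# the maximizer of the log-likelihood is exactly the minimizer of the SSE
best_w = ws[np.argmax(loglik)]
```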
The maximum likelihood solution for $\mathbf{w}$ is therefore the same as in our previous derivation</p> $\mathbf{w}_{\text{ML}} = \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}.$ <p>We can use the result of the maximum likelihood solution for $\mathbf{w}$ to find the value of the noise variance $\alpha$. If we insert the maximum likelihood solution for $\mathbf{w}$ in the log-likelihood, take the derivative, and set it equal to 0, then we can solve for $\alpha$</p> <p><span class="marginnote">Note that we are in fact jointly maximizing the likelihood with respect to both $\mathbf{w}$ and $\alpha$, but because the maximization with respect to $\mathbf{w}$ is independent of $\alpha$, we start by finding the maximum likelihood solution for $\mathbf{w}$, and then use that result to find $\alpha$.</span></p> \begin{aligned} \frac{\partial}{\partial \alpha} \ln \text{Pr}(\textbf{\textsf{t}} | \mathbf{X}, \mathbf{w}_{\text{ML}}, \alpha) &amp;= 0 \\ \frac{\partial}{\partial \alpha} \left( - \frac{N}{2} \ln 2 \pi \alpha - \frac{1}{2 \alpha} \sum_{i=1}^N (t_i - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2 \right) &amp;= 0 \\ -\frac{N}{2\alpha} + \frac{1}{2 \alpha^2} \sum_{i=1}^N (t_i - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2 &amp;= 0 \\ \frac{1}{\alpha} \sum_{i=1}^N (t_i - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2 &amp;= N \\ \alpha_{\text{ML}} &amp;= \frac{1}{N} \sum_{i=1}^N (t_i - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2. \end{aligned} <h4 id="python-implementation">Python implementation</h4> <p>In <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a> we implemented the OLS solution, but since we have a probabilistic model now, we make predictions that are probability distributions over $t$ instead of just point estimates. 
This is done by substituting the maximum likelihood solutions for $\mathbf{w}$ and $\alpha$ into $(3)$</p> $\text{Pr}(t | \mathbf{x}, \mathbf{w}_{\text{ML}}, \alpha_{\text{ML}}) = \mathcal{N} \left( t | h(\mathbf{x},\mathbf{w}_{\text{ML}}), \alpha_{\text{ML}} \right).$ <p>We can find $\mathbf{w}$ and $\alpha$ with the following code snippet</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">X</span><span class="p">)</span> <span class="o">@</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">t</span> <span class="n">alpha</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">((</span><span class="n">t</span> <span class="o">-</span> <span class="n">X</span> <span class="o">@</span> <span class="n">w</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">)</span></code></pre></figure> <p><span class="marginnote">Plot of the line $w_0+w_1 x$ with our estimated values for $\mathbf{w}$ along with the uncertainty $\alpha$.</span></p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-2/prob_linreg.svg" /></p> <h3 id="model-selection">Model selection</h3> <p>You might be wondering what linear regression is so good for considering the image above, since it’s not doing well, but now we’re going to shine light on that by looking at ways to improve the simple linear regression model.</p> <h4 id="basis-functions">Basis functions</h4> <p>We 
call a model linear if it’s linear in the parameters, <em>not</em> in the input variables. However, $(1)$ is linear in both the parameters <em>and</em> the input variables, which prevents it from adapting to nonlinear relationships. We can augment the model by replacing the input variables with nonlinear <strong>basis functions</strong> of the input variables</p> \begin{aligned} h(\mathbf{x},\mathbf{w}) &amp;= w_0 \phi_0(\mathbf{x}) + \cdots + w_{M-1} \phi_{M-1}(\mathbf{x}) \\ &amp;= \sum_{m=0}^{M-1} w_m \phi_m(\mathbf{x}) \\ &amp;= \mathbf{w}^\intercal \bm{\phi} (\mathbf{x}), \end{aligned} <p><span class="marginnote">Note that we had $D+1$ parameters in the simple linear regression model, but by augmenting it with basis functions, we now have $M$ parameters, which can be larger than $D$ if need be.</span> where we define $\bm{\phi}(\mathbf{x}) = \left( \phi_0 (\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}) \right)^\intercal$ and $\phi_0 ( {\mathbf{x}} ) = 1$ to keep the intercept $w_0$. By using nonlinear basis functions it is possible for $h$ to adapt to nonlinear relationships of $\mathbf{x}$, which we will see shortly - we call these models <strong>linear basis function models</strong>.</p> <p>We already looked at one example of basis functions in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a>, where we augmented the simple linear regression model with basis functions of powers of $x$, i.e. $\phi_i (x) = x^i$. 
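As a small sketch of that polynomial case (the helper name is mine, not from the original notes), the design matrix of powers of $x$ can be built directly:

```python
import numpy as np

# Sketch: design matrix for the polynomial basis phi_i(x) = x^i.
# np.vander with increasing=True puts the constant column phi_0 = 1 first.
def poly_design_matrix(x, M):
    return np.vander(x, M, increasing=True)

x = np.array([0.0, 0.5, 1.0])
Phi = poly_design_matrix(x, M=3)
# rows hold [1, x, x^2] evaluated at each input
```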
Another common basis function is the Gaussian</p> $\phi_i (\mathbf{x}) = \exp \left( - \gamma_i \| \bm{\mu}_i -\mathbf{x} \|_2^2 \right).$ <p>Following the same derivation as before, we find the maximum likelihood solutions to be</p> $\mathbf{w}_{\text{ML}} = \left( \mathbf{\Phi}^\intercal \mathbf{\Phi} \right)^{-1} \mathbf{\Phi}^\intercal \textbf{\textsf{t}} \quad \text{and} \quad \alpha_{\text{ML}} = \frac{1}{N} \sum_{i=1}^N (t_i - \mathbf{w}_{\text{ML}}^\intercal \bm{\phi}(\mathbf{x}_i))^2,$ <p>where</p> $\mathbf{\Phi} = \begin{pmatrix} \phi_0 (\mathbf{x}_1) &amp; \phi_1 (\mathbf{x}_1) &amp; \cdots &amp; \phi_{M-1} (\mathbf{x}_1) \\ \phi_0 (\mathbf{x}_2) &amp; \phi_1 (\mathbf{x}_2) &amp; \cdots &amp; \phi_{M-1} (\mathbf{x}_2) \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots\\ \phi_0 (\mathbf{x}_N) &amp; \phi_1 (\mathbf{x}_N) &amp; \cdots &amp; \phi_{M-1} (\mathbf{x}_N) \end{pmatrix}.$ <p><span class="marginnote">Illustration of the effect of using $M-1$ Gaussian basis functions plus the intercept.</span></p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-2/prob_linreg_basis.svg" /></p> <p>The Gaussian basis function for the plot above was implemented as below, where $\mu_i=\frac{i}{M}$ and $\gamma_i = 1$ for $i = 1, \dots, M-1$.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">gaussian_basis</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span> <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">gamma</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">mu</span><span class="o">-</span><span class="n">x</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span></code></pre></figure> <!-- MENTION POLYNOMIAL REGRESSION --> <h4 id="regularization">Regularization</h4> <p>We briefly ran into the concept of regularization in the <a href="https://cookieblues.github.io//guides/2021/03/15/bsmalea-notes-1c">previous notes</a>, which we described as a technique of preventing overfitting. If we look back at the objective function we defined earlier, augmented with basis functions, we can introduce a regularization term</p> $E(\mathbf{w}) = \sum_{i=1}^N (t_i - \mathbf{w}^\intercal \bm{\phi}(\mathbf{x}_i))^2 + \underbrace{\lambda \sum_{j=0}^{M-1} | w_j |^q}_{\text{regularization}},$ <p>where $q &gt; 0$ denotes the type of regularization, and $\lambda$ controls the extent of regularization, i.e., how much we care about the error from the data relative to the regularization. The most common values of $q$ are $1$ and $2$, which are called $L_1$ regularization and $L_2$ regularization, respectively. We call it <strong>lasso regression</strong> when we use $L_1$ regularization, and <strong>ridge regression</strong> when we use $L_2$ regularization.</p> <p>The objective function of ridge regression</p> \begin{aligned} E(\mathbf{w}) &amp;= \sum_{i=1}^N (t_i - \mathbf{w}^\intercal \bm{\phi}(\mathbf{x}_i))^2 + \lambda \sum_{j=0}^{M-1} | w_j |^2\\ &amp;= \sum_{i=1}^N (t_i - \mathbf{w}^\intercal \bm{\phi}(\mathbf{x}_i))^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \end{aligned} <p>is especially convenient as it is a quadratic function of $\mathbf{w}$ and therefore has a unique global minimum. 
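Because the objective is quadratic, its gradient is linear in $\mathbf{w}$ and vanishes at a single point. A hedged numerical sketch on made-up random data (any full-rank $\mathbf{\Phi}$ works for this check):

```python
import numpy as np

# Made-up data purely to check the stationarity condition.
rng = np.random.default_rng(3)
Phi = rng.normal(size=(40, 5))
t = rng.normal(size=40)
lam = 1.0

# Stationary point of sum (t_i - w^T phi(x_i))^2 + lam * w^T w:
# solve (lam I + Phi^T Phi) w = Phi^T t.
w_ridge = np.linalg.solve(lam * np.eye(5) + Phi.T @ Phi, Phi.T @ t)
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # lam = 0 recovers OLS

# Gradient of the ridge objective; it is (numerically) zero at w_ridge.
grad = 2 * (Phi.T @ Phi @ w_ridge - Phi.T @ t) + 2 * lam * w_ridge
```

Note that the penalty shrinks the weights: the norm of `w_ridge` is smaller than that of the unregularized solution.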
Its solution is</p> $\hat{\mathbf{w}} = \left( \lambda \mathbf{I} + \mathbf{\Phi}^\intercal \mathbf{\Phi} \right)^{-1} \mathbf{\Phi}^\intercal \textbf{\textsf{t}},$ <p>where $\alpha$ stays the same as without regularization, since the regularization term has no influence on it.</p> <p>When we introduce regularization, the process of model selection goes from finding the appropriate number of basis functions to finding the appropriate value for the regularization parameter $\lambda$.</p> <p><span class="marginnote">Illustration of changing the value of the regularization parameter $\lambda$, while keeping the number of basis functions $M=8$ constant. Even though we overfitted earlier when $M=8$, our effective complexity is now controlled by the regularization instead, and the model will not overfit if $\lambda$ is large enough. Note also that as the regularization parameter is increased, the uncertainty increases as well.</span></p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-2/prob_linreg_basis_regularization.svg" /></p> <h3 id="summary">Summary</h3> <ul> <li>We can find the parameters for a <strong>linear regression</strong> through <strong>ordinary least squares</strong> or <strong>maximum likelihood estimation</strong>.</li> <li>Usually in <strong>linear regression</strong> we have a scalar parameter that is not multiplied by the input called the <strong>intercept</strong> or <strong>bias</strong> denoted $w_0$.</li> <li>The <strong>process of estimating the values of the parameters</strong> is called the <strong>training or learning</strong> process.</li> <li>Since the logarithm is a <strong>monotonically increasing</strong> function, <strong>maximizing the likelihood function is the same as maximizing the log-likelihood function</strong>.</li> <li>What makes a model <strong>linear</strong> is that it’s <strong>linear in the parameters</strong> not the inputs.</li> <li>We can augment linear regression with 
<strong>basis functions</strong> yielding <strong>linear basis function models</strong>.</li> <li><strong>Polynomial regression</strong> is a linear basis function model.</li> <li><strong>Regularization</strong> is a technique of <strong>preventing overfitting</strong>.</li> <li>There are <strong>different kinds of regularization</strong> in linear regression such as $L_1$ and $L_2$ regularization.</li> </ul>Regression analysis refers to a set of techniques for estimating relationships among variables. This post introduces linear regression augmented by basis functions to enable non-linear adaptation, which lies at the heart of supervised learning, as will be apparent when we turn to classification. Thus, a thorough understanding of this model will be hugely beneficial. We’ll go through two derivations of the optimal parameters, namely the method of ordinary least squares (OLS), which we briefly looked at in notes 1a, and maximum likelihood estimation (MLE). We’ll also dabble with some Python throughout the post.Machine learning, notes 1c: Frequentism and Bayesianism2021-03-15T00:00:00+00:002021-03-15T00:00:00+00:00https://cookieblues.github.io//guides/2021/03/15/bsmalea-notes-1c<p>As mentioned in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a>, machine learning is mainly concerned with prediction, and as you can imagine, prediction is very much concerned with probability. In this post we are going to look at the two main <a href="https://en.wikipedia.org/wiki/Probability_interpretations" target="_blank">interpretations of probability</a>: frequentism and Bayesianism.</p> <p>While the adjective “Bayesian” was first used around the 1950s by R. A. Fisher<span class="sidenote-number"></span><span class="sidenote">S. E. Fienberg, “When did Bayesian inference become “Bayesian”?,” 2006.</span>, the concept was properly formalized long before by P. S. 
Laplace in the 18th century, but known as “inverse probability”<span class="sidenote-number"></span><span class="sidenote">S. M. Stigler, “The History of Statistics: The Measurement of Uncertainty Before 1900,” ch. 3, 1986.</span>. So while the gist of the Bayesian approach has been known for a while, it didn’t gain much popularity until recently (the last few decades), perhaps mostly due to computational complexity.</p> <p>The philosophical difference between the frequentist and Bayesian interpretation of probability is their definitions of probability: <strong>the frequentist (or classical) definition of probability is based on frequencies of events</strong>, whereas <strong>the Bayesian definition of probability is based on our knowledge of events</strong>. In the context of machine learning, we can interpret this difference as: what the data says versus what we know from the data.</p> <p>To understand what this means, I like to use <a href="https://stats.stackexchange.com/a/56" target="_blank">this analogy</a>. Imagine you’ve lost your phone somewhere in your home. You use your friend’s phone to call your phone - as it’s calling, your phone starts ringing (it’s not on vibrate). How do you decide where to look for your phone in your home? <strong>The frequentist would use their ears to identify the most likely area from which the sound is coming</strong>. However, <strong>the Bayesian would also use their ears, but in addition they would recall in which areas of their home they’ve previously lost their phone and take it into account</strong> when inferring where to look for the phone. Both the frequentist and the Bayesian use their ears when inferring where to look for the phone, but the Bayesian also incorporates <strong>prior knowledge</strong> about the lost phone into their inference.</p> <p>It’s important to note that there’s nothing stopping the frequentist from also incorporating the prior knowledge in some way. It’s usually more difficult though. 
The frequentist is really at a loss, however, if the event hasn’t happened before and there’s no way to repeat it numerous times. A classic example is predicting if the Arctic ice pack will have melted by some year, which will happen either once or never. Even though it’s not possible to repeat the event numerous times, we do have prior knowledge about the ice pack, and it would be unscientific not to include it.</p> <h3 id="bayes-theorem">Bayes’ theorem</h3> <p>Hopefully, these last paragraphs haven’t confused you more than they’ve enlightened, because now we turn to formalizing the Bayesian approach - and to do this, we need to talk about <strong>Bayes’ theorem</strong>. Let’s say we have two sets of outcomes $\mathcal{A}$ and $\mathcal{B}$ (also called events). We denote the probabilities of each event $\text{Pr}(\mathcal{A})$ and $\text{Pr}(\mathcal{B})$ respectively. The probability of both events is denoted with the joint probability $\text{Pr}(\mathcal{A},\mathcal{B})$, and we can expand this with conditional probabilities</p> $\text{Pr}(\mathcal{A},\mathcal{B}) = \text{Pr}(\mathcal{A}|\mathcal{B}) \text{Pr}(\mathcal{B}), \quad \quad (1)$ <p>i.e., the conditional probability of $\mathcal{A}$ given $\mathcal{B}$ and the probability of $\mathcal{B}$ give us the joint probability of $\mathcal{A}$ and $\mathcal{B}$. It follows that</p> $\text{Pr}(\mathcal{A},\mathcal{B}) = \text{Pr}(\mathcal{B}|\mathcal{A}) \text{Pr}(\mathcal{A}) \quad \quad (2)$ <p>as well. Since the left-hand sides of $(1)$ and $(2)$ are the same, we can see that the right-hand sides are equal</p> \begin{aligned} \text{Pr}(\mathcal{A}|\mathcal{B}) \text{Pr}(\mathcal{B}) &amp;= \text{Pr}(\mathcal{B}|\mathcal{A}) \text{Pr}(\mathcal{A}) \\ \text{Pr}(\mathcal{A} | \mathcal{B}) &amp;= \frac{\text{Pr}(\mathcal{B} | \mathcal{A}) \text{Pr}(\mathcal{A})}{\text{Pr}(\mathcal{B})}, \end{aligned} <p>which is Bayes’ theorem. 
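As a tiny numerical illustration of the theorem (all the numbers below are made up purely for illustration), here it is applied to two competing hypotheses:

```python
# Made-up prior beliefs and likelihoods for two hypotheses.
priors = {"H1": 0.3, "H2": 0.7}
likelihoods = {"H1": 0.8, "H2": 0.1}  # Pr(data | hypothesis)

# The evidence Pr(data) is the normalizing constant in the denominator.
evidence = sum(priors[h] * likelihoods[h] for h in priors)
posterior = {h: priors[h] * likelihoods[h] / evidence for h in priors}
# despite its lower prior, H1 explains the data better and ends up more probable
```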
This should seem familiar to you - if not, I’d recommend reading up on some basics of probability theory before moving on.</p> <p>We’re calculating the conditional probability of $\mathcal{A}$ given $\mathcal{B}$ from the conditional probability of $\mathcal{B}$ given $\mathcal{A}$ and the respective probabilities of $\mathcal{A}$ and $\mathcal{B}$. However, it might not be clear-cut, why this is so important in machine learning, so let’s write Bayes’ theorem in a more ‘data sciencey’ way:</p> $\overbrace{\text{Pr}(\mathcal{\text{hypothesis}} | \mathcal{\text{data}})}^{\text{posterior}} = \frac{ \overbrace{\text{Pr}(\mathcal{\text{data}} | \mathcal{\text{hypothesis}})}^{\text{likelihood}} \, \overbrace{\text{Pr}(\mathcal{\text{hypothesis}})}^{\text{prior}} }{ \underbrace{\text{Pr}(\mathcal{\text{data}})}_{\text{evidence}} }.$ <p>Usually, we’re not just dealing with probabilities but probability distributions, and the evidence (the denominator above) ensures that the posterior distribution on the left-hand side is a valid probability density and is called the <a href="https://en.wikipedia.org/wiki/Normalizing_constant">normalizing constant</a><span class="marginnote">A normalizing constant just ensures that any probability function is a probability density function with total probability of 1.</span>. Since it’s just a normalizing constant though, we often state the theorem in words as</p> $\text{posterior} \propto \text{likelihood} \times \text{prior},$ <p>where $\propto$ means “proportional to”. <!-- Note that if we assume what's called a *flat* prior, i.e., a prior that is ambivalent towards the hypothesis, then the posterior is proportional to the likelihood, and we end up with the frequentist approach; maximum likelihood. 
--></p> <h3 id="example-coin-flipping">Example: coin flipping</h3> <p>We’ll start with <a href="https://www.behind-the-enemy-lines.com/2008/01/are-you-bayesian-or-frequentist-or.html" target="_blank">a simple example</a> that I think nicely illustrates the difference between the frequentist and Bayesian approach. Consider the following problem:</p> <p><em>A coin flips heads up with probability $\theta$ and tails with probability $1-\theta$ ($\theta$ is unknown). You flip the coin 11 times, and it ends up heads 8 times. Now, would you bet for or against the event that the next two tosses turn up heads?</em></p> <p>For our sake, let’s define some variables. Let $X$ be a random variable representing the coin, where $X=1$ is heads and $X=0$ is tails such that $\text{Pr}(X=1) = \theta$ and $\text{Pr}(X=0) = 1-\theta$. Furthermore, let $\mathcal{D}$ denote our observed data (8 heads, 3 tails). Now, we want to estimate the value of the parameter $\theta$, so that we can calculate the probability of seeing 2 heads in a row. If the probability is less than 0.5, we will bet against seeing 2 heads in a row, but if it’s above 0.5, then we bet for. So let’s look at how the frequentist and Bayesian would estimate $\theta$!</p> <h4 id="frequentist-approach">Frequentist approach</h4> <p><strong>As the frequentist, we want to maximize the likelihood</strong>, which is to ask the question: what value of $\theta$ will maximize the probability that we got $\mathcal{D}$ given $\theta$, or more formally, we want to find</p> $\hat{\theta}_{\text{MLE}} = \underset{\theta}{\arg\max} \text{Pr}(\mathcal{D} | \theta).$ <p>This is called <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation (MLE)</a>. The experiment of flipping the coin 11 times follows a binomial distribution with $n=11$ trials, $k=8$ successes, and $\theta$ the probability of success. 
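Before deriving the estimate analytically, a brute-force sketch: evaluate the binomial likelihood of the observed data on a fine grid of $\theta$ values and pick the maximizer.

```python
import numpy as np
from math import comb

n, k = 11, 8  # 11 flips, 8 heads, as in the problem
thetas = np.linspace(0.0, 1.0, 100001)
likelihood = comb(n, k) * thetas ** k * (1 - thetas) ** (n - k)
theta_mle = thetas[np.argmax(likelihood)]  # close to 8/11
```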
Using the likelihood of a binomial distribution, we can find the value of $\theta$ that maximizes the probability of the data. We therefore want to find the value of $\theta$ that maximizes</p> $\text{Pr}(\mathcal{D} | \theta) = \mathcal{L}(\theta | \mathcal{D}) = \begin{pmatrix} 11\\8 \end{pmatrix} \theta^{8} (1-\theta)^{11-8}. \quad \quad (3)$ <p>Note that $(3)$ expresses the <em>likelihood</em> of $\theta$ given $\mathcal{D}$, which is not the same as saying the probability of $\theta$ given $\mathcal{D}$. The image underneath shows our likelihood function $\text{Pr}(\mathcal{D} | \theta)$ (as a function of $\theta$) and the maximum likelihood estimate $\hat{\theta}_{\mathrm{MLE}}$.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1c/frequentist_likelihood.svg" /></p> <p>Unsurprisingly, the value of $\theta$ that maximizes the likelihood is $\frac{k}{n}$, i.e., the proportion of successes in the trials. <!-- We've derived this result in a <a href="https://cookieblues.github.io//pages/bslialo-notes-9b">different post</a>. This is also called the [maximum likelihood estimate](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation){:target="_blank"} for $\theta$. --> The maximum likelihood estimate $\hat{\theta}_{\text{MLE}}$ is therefore $\frac{k}{n} = \frac{8}{11} \approx 0.73$. Assuming the coin flips are independent, we can calculate the probability of seeing 2 heads in a row:</p> $\text{Pr}(X=1) \times \text{Pr}(X=1) = \hat{\theta}_{\text{MLE}}^2 = \left( \frac{8}{11} \right)^2 \approx 0.53.$ <p>Since the probability of seeing 2 heads in a row is larger than 0.5, we would bet for!</p> <h4 id="bayesian-approach">Bayesian approach</h4> <p><strong>As the Bayesian, we want to maximize the posterior</strong>, so we ask the question: what value of $\theta$ will maximize the probability of $\theta$ given $\mathcal{D}$? 
Formally, we get</p> $\hat{\theta}_{\text{MAP}} = \underset{\theta}{\arg\max} \text{Pr}(\theta | \mathcal{D}),$ <p>which is called <a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">maximum a posteriori (MAP) estimation</a>. To answer the question, we use Bayes’ theorem</p> \begin{aligned} \hat{\theta}_{\text{MAP}} &amp;= \underset{\theta}{\arg\max} \overbrace{\text{Pr}(\theta | \mathcal{D})}^{\text{posterior}} \\ &amp;= \underset{\theta}{\arg\max} \frac{ \overbrace{\text{Pr}(\mathcal{D} | \theta)}^{\text{likelihood}} \, \overbrace{\text{Pr}(\theta)}^{\text{prior}} }{ \underbrace{\text{Pr}(\mathcal{D})}_{\text{evidence}} }. \end{aligned} <p>Since the evidence $\text{Pr}(\mathcal{D})$ is a normalizing constant not dependent on $\theta$, we can ignore it. This now gives us</p> $\hat{\theta}_{\text{MAP}} = \underset{\theta}{\arg\max} \text{Pr}(\mathcal{D}|\theta) \, \text{Pr}(\theta).$ <p>During the frequentist approach, we already found the likelihood $(3)$</p> $\text{Pr}(\mathcal{D}|\theta) = \begin{pmatrix} 11\\8 \end{pmatrix} \theta^{8} (1-\theta)^{3},$ <p>where we can drop the binomial coefficient, since it’s not dependent on $\theta$. The only thing left is the prior distribution $\text{Pr}(\theta)$. This distribution describes our initial (prior) knowledge of $\theta$. A convenient distribution to choose is the <a href="https://en.wikipedia.org/wiki/Beta_distribution">Beta distribution</a>, because it’s defined on the interval [0, 1], and $\theta$ is a probability, which has to be between 0 and 1. <span class="marginnote">Additionally, the Beta distribution is the <a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> for the binomial distribution, which broadly means that if the posterior and prior distributions are in the same family, then the prior is the conjugate prior for the likelihood. 
This is often a desired property.</span> This gives us</p> $\text{Pr}(\theta) = \frac{\Gamma (\alpha+\beta)}{\Gamma (\alpha) \Gamma (\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1},$ <p>where $\Gamma$ is the <a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a><span class="marginnote">The Gamma function satisfies $\Gamma (n) = (n-1)!$ for any positive integer $n$.</span>. Since the fraction is not dependent on $\theta$, we can ignore it, which gives us</p> \begin{aligned} \text{Pr}(\theta | \mathcal{D}) &amp;\propto \theta^{8} (1-\theta)^{3} \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ &amp;\propto \theta^{\alpha+7} (1-\theta)^{\beta+2}. \quad \quad (4) \end{aligned} <p>Note that we end up with another beta distribution (without the normalizing constant).</p> <p>It is now our job to set the prior distribution in such a way that we incorporate what we know about $\theta$ <em>prior</em> to seeing the data. Now, we know that coins are usually pretty fair, and if we choose $\alpha = \beta = 2$, we get a beta distribution that favors $\theta=0.5$ more than $\theta = 0$ or $\theta =1$. The illustration below shows this prior $\mathrm{Beta}(2, 2)$, the normalized likelihood, and the resulting posterior distribution.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1c/prior_a_b_2.svg" /></p> <p>We can see that the posterior distribution ends up being dragged a little more towards the prior distribution, which makes the MAP estimate a little different from the MLE estimate. 
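A numerical sketch of the MAP estimate: maximize the unnormalized posterior $\theta^{\alpha+7} (1-\theta)^{\beta+2}$ from $(4)$, with the $\mathrm{Beta}(2, 2)$ prior chosen above, over a grid of $\theta$ values.

```python
import numpy as np

alpha, beta = 2, 2  # Beta(2, 2) prior favouring a fair coin
thetas = np.linspace(1e-6, 1 - 1e-6, 100001)
# log of the unnormalized posterior theta^(alpha+7) * (1-theta)^(beta+2)
log_post = (alpha + 7) * np.log(thetas) + (beta + 2) * np.log(1 - thetas)
theta_map = thetas[np.argmax(log_post)]
```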
In fact, we get</p> $\hat{\theta}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2} = \frac{2+8-1}{2+2+11-2} = \frac{9}{13} \approx 0.69,$ <p>which is a little lower than the MLE estimate - and if we now use the MAP estimate to calculate the probability of seeing 2 heads in a row, we find that we will <strong>bet against</strong> it</p> $\text{Pr}(X=1) \times \text{Pr}(X=1) = \hat{\theta}_{\text{MAP}}^2 = \left( \frac{9}{13} \right)^2 \approx 0.48.$ <p>Furthermore, if we were to choose $\alpha=\beta=1$, we get the special case where the beta distribution is a uniform distribution. In this case, our MAP and MLE estimates are the same, and we make the same bet. The image underneath shows the prior, likelihood, and posterior for different values of $\alpha$ and $\beta$, i.e., for different prior distributions.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1c/different_priors.svg" /></p> <h4 id="fully-bayesian-approach">Fully Bayesian approach</h4> <p>While we did include a prior distribution in the previous approach, we’re still collapsing the distribution into a point estimate and using that estimate to calculate the probability of 2 heads in a row. However, in a truly Bayesian approach, we wouldn’t do this, as we don’t just have a single estimate of $\theta$ but a whole distribution (the posterior). Let $\mathcal{H}$ denote the event of seeing 2 heads in a row - then we ask: what is the probability of seeing 2 heads given the data, i.e., $\text{Pr}(\mathcal{H} | \mathcal{D})$? To answer this question, we first need to find the normalizing constant for the posterior distribution in $(4)$. Since it’s a beta distribution, we can look at $(4)$ and see that it must be $\frac{\Gamma(\alpha+\beta+11)}{\Gamma(\alpha+8)\Gamma(\beta+3)}$. 
Like earlier, we’ll also assume that the coin tosses are independent, which means that the probability of seeing 2 heads in a row (given $\theta$ and the data) is just equal to the probability of seeing heads squared, i.e., $\mathrm{Pr} (\mathcal{H} | \theta, \mathcal{D}) = \theta^2$.</p> <p>We can now answer this question by ‘integrating out’ $\theta$ as</p> \begin{aligned} \mathrm{Pr}(\mathcal{H} | \mathcal{D}) &amp;= \int_{\theta} \mathrm{Pr} (\mathcal{H}, \theta | \mathcal{D}) \, \mathrm{d}\theta \\ &amp;= \int_{\theta} \mathrm{Pr} (\mathcal{H} | \theta, \mathcal{D}) \, \overbrace{\mathrm{Pr} (\theta | \mathcal{D})}^{\mathrm{posterior}} \, \mathrm{d}\theta \\ &amp;= \int_0^1 \theta^2 \, \overbrace{\frac{\Gamma(\alpha+\beta+11)}{\Gamma(\alpha+8)\Gamma(\beta+3)}}^{\text{normalizing constant}} \, \theta^{\alpha+7} (1-\theta)^{\beta+2} \, \mathrm{d}\theta \\ &amp;= \frac{\Gamma(\alpha+\beta+11)}{\Gamma(\alpha+8)\Gamma(\beta+3)} \int_0^1 \theta^{\alpha+9} (1-\theta)^{\beta+2} \, \mathrm{d}\theta \\ &amp;= \frac{\Gamma(\alpha+\beta+11)}{\Gamma(\alpha+8)\Gamma(\beta+3)} \frac{\Gamma(\alpha+10)\Gamma(\beta+3)}{\Gamma(\alpha+\beta+13)} \\ &amp;= \frac{(\alpha+8)(\alpha+9)}{(\alpha+\beta+11)(\alpha+\beta+12)}. \end{aligned} <p>In this case, if we choose a uniform prior, i.e., $\alpha=\beta=1$, we actually get $\frac{45}{91} \approx 0.49$, so we would bet against. The reason for this is more complicated and has to do with the uniform prior not being completely agnostic<span class="marginnote">A better choice of prior would be <a href="https://en.wikipedia.org/wiki/Beta_distribution#Haldane's_prior_probability_(Beta(0,0))">Haldane’s prior</a>, which is the $\mathrm{Beta}(0, 0)$ distribution.</span>. Furthermore, we’ve also made the implicit decision not to update our posterior distribution between the 2 tosses we’re predicting. 
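</p>

<p>These closed-form results are easy to sanity-check numerically. Below is a minimal sketch (the helper functions are our own, written for this post, not from any library) that evaluates the MLE and MAP estimates and the fully Bayesian probability of 2 heads in a row, using the counts from our example: $8$ heads and $3$ tails.</p>

```python
def theta_mle(heads, tails):
    # maximum likelihood estimate of the heads probability
    return heads / (heads + tails)

def theta_map(heads, tails, alpha, beta):
    # mode of the Beta(alpha + heads, beta + tails) posterior
    return (alpha + heads - 1) / (alpha + beta + heads + tails - 2)

def prob_two_heads(heads, tails, alpha, beta):
    # fully Bayesian predictive probability of 2 heads in a row, i.e. the
    # closed form (alpha+h)(alpha+h+1) / ((alpha+beta+n)(alpha+beta+n+1))
    n = heads + tails
    a = alpha + heads
    return a * (a + 1) / ((alpha + beta + n) * (alpha + beta + n + 1))

print(theta_mle(8, 3))             # 8/11 ~ 0.727
print(theta_map(8, 3, 2, 2))       # 9/13 ~ 0.692
print(prob_two_heads(8, 3, 1, 1))  # 45/91 ~ 0.495
```

<p>With $\alpha=\beta=2$ the MAP-based bet probability is $\left(\frac{9}{13}\right)^2 \approx 0.48$, and with the uniform prior the fully Bayesian answer is $\frac{45}{91} \approx 0.49$ - in both cases we bet against seeing 2 heads in a row.</p>

<p>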
You can imagine that we would gain knowledge about the fairness of the coin (i.e., about $\theta$) after tossing the coin the first time, which we could use to update our posterior distribution. However, to simplify the calculations we haven’t done that.</p> <h3 id="example-polynomial-regression-tba">Example: polynomial regression (TBA)</h3> <!-- Let's continue with the example from <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a> and look at it from a probabilistic perspective. We'll start with the frequentist perspective and then gradually move on to the fully Bayesian perspective. To quickly refresh our memory: we had a dataset $\mathcal{D} = \{ (x_1, t_1), \dots, (x_N, t_N) \}$ of $N$ input and target variable pairs. Our objective was to fit a polynomial to the data. We can think about this from a probabilistic perspective by introducing an error term $$h(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M + \epsilon = \sum_{m=0}^M w_m x^m + \epsilon, \quad \quad (5)$$ where $\epsilon \sim \mathcal{N}\left(\mu, \alpha^{-1} \right)$, and usually we assume the Gaussian has zero mean. 
What $(5)$ means is that we assume the polynomial we create $h$ We can simplify this a bit by defining $\mathbf{w} = \left( w_0, w_1, w_2, \dots, w_M \right)^\intercal$ and $\mathbf{x}_i = \left( 1, x_i, x_i^2, \dots, x_i^M \right)^\intercal$ such that $$h(x, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x} + \epsilon.$$ The probability distribution of $t$ given the input variable $\mathbf{x}$, the parameters $\mathbf{w}$, and the precision (inverse variance) $\alpha$ is thus $$\mathrm{Pr}(t | x, \mathbf{w}, \alpha) = \mathcal{N} \left(t | h(x, \mathbf{w}), \alpha^{-1} \right) = \mathcal{N} \left(t | \mathbf{w}^\intercal \mathbf{x}, \alpha^{-1} \right).$$ For the sake of simplicity, let $\textbf{\textsf{x}} = \left\\{ x_1, \dots, x_N \right\\}$ and $\textbf{\textsf{t}} = \left\\{ t_1, \dots, t_N \right\\}$ denote all our input and target variables respectively. #### Frequentist approach Assuming the points are independent and identically distributed, we can write up the likelihood function, which is a function of $\mathbf{w}$ and $\alpha$, as $$\mathrm{Pr}( \textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha) = \prod_{n=1}^N \mathrm{Pr}( t_n | x_n, \mathbf{w}, \alpha) = \prod_{n=1}^N \mathcal{N} \left( t_n | \mathbf{w}^\intercal \mathbf{x}_n, \alpha^{-1} \right).$$ We want to find the values of $\mathbf{w}$ and $\alpha$ that **maximizes the likelihood**, and as is common in [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation){:target="_blank"} mainly due to computational reasons, we find the maximum of the log-likelihood instead. 
The log-likelihood is \begin{aligned} \ln \left( \mathrm{Pr}( \textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha) \right) &= \ln \left( \prod_{n=1}^N \mathcal{N} \left( t_n | \mathbf{w}^\intercal \mathbf{x}_n, \alpha^{-1} \right) \right) \\ &= \sum_{n=1}^N \ln \left( \frac{1}{\sqrt{2\pi\alpha^{-1}}} \exp \left( -\frac{\left(t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2}{2\alpha^{-1}} \right) \right) \\ &= \sum_{n=1}^N \left( \ln \frac{1}{\sqrt{2\pi\alpha^{-1}}} - \frac{\left(t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2}{2\alpha^{-1}} \right) \\ &= - N \ln \sqrt{2\pi\alpha^{-1}} - \frac{\alpha}{2} \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2. \quad \quad (6) \end{aligned} Note that the sum in $(6)$ is equivalent to the objective function we used in <a href="https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a/">notes 1a</a>, the sum of squared errors (SSE) function, since the left term in $(6)$ is constant with respect to $\mathbf{w}$ - so the solution to $\mathbf{w}$ is the same as in the last post. If we maximize $(6)$ with respect to $\alpha$, we get $$\alpha^{-1}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^N \left( t_n - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_n \right)^2.$$ Now that we have determined the maximum likelihood solution for $\mathbf{w}$ and $\alpha$, we can make predictions for new values of $x$ expressed in terms of the **predictive distribution** $$\mathrm{Pr}(t|x, \mathbf{w}_\text{ML}, \alpha_\text{ML}) = \mathcal{N} \left( t| h(x, \mathbf{w}_\text{ML}), \alpha^{-1}_\text{ML} \right).$$ #### Bayesian approach If we introduce a **prior distribution** over our parameters (the polynomial coefficients) $\mathbf{w}$, we can go from the frequentist perspective to the Bayesian. Remember that the prior is a way for us to incorporate our knowledge about the parameters. Note that it is in this 'subjective' choice, frequentists object to the Bayesian approach. 
Let's assume a Gaussian prior distribution $$\mathrm{Pr}(\mathbf{w} | \beta) = \mathcal{N} \left( \mathbf{w} | \mathbf{0}, \beta^{-1} \mathbf{I} \right),$$ where $\beta$ is the precision (inverse variance) of the distribution, $\mathbf{0}$ is the $M+1 \times M+1$ [zero matrix](https://en.wikipedia.org/wiki/Zero_matrix){:target="_blank"}, and $\mathbf{I}$ is the $M+1 \times M+1$ [identity matrix](https://en.wikipedia.org/wiki/Identity_matrix){:target="_blank"}. We can rewrite it by using the [density of the multivariate Gaussian](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Properties){:target="_blank"} \begin{aligned} \mathcal{N} \left( \mathbf{w} | \mathbf{0}, \beta^{-1} \mathbf{I} \right) &= \frac{1}{\sqrt{(2\pi)^{M+1} |\beta^{-1}\mathbf{I}| }} \exp \left( -\frac{(\mathbf{w}-\mathbf{0})^\intercal (\beta^{-1} \mathbf{I})^{-1} (\mathbf{w}-\mathbf{0})}{2} \right) \\ &= \frac{1}{\left( 2\pi\beta^{-1} \right)^{\frac{M+1}{2}}} \exp \left( -\frac{\mathbf{w}^\intercal (\beta \mathbf{I}) \mathbf{w}}{2} \right) \\ &= \left( \frac{\beta}{2\pi} \right)^{\frac{M+1}{2}} \exp \left( -\frac{\beta}{2} \mathbf{w}^\intercal \mathbf{w} \right). \quad \quad (7) \end{aligned} Using Bayes' theorem as described above, the posterior distribution of our parameters is proportional to the product of the likelihood function and prior distribution $$\overbrace{\mathrm{Pr}(\mathbf{w} | \textbf{\textsf{x}}, \textbf{\textsf{t}}, \alpha, \beta)}^{\text{posterior}} \propto \overbrace{\mathrm{Pr}(\textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha)}^{\text{likelihood}} \, \overbrace{\mathrm{Pr}(\mathbf{w}|\beta)}^{\text{prior}}. \quad \quad (8)$$ By maximizing the posterior distribution we can find the most probable values of $\mathbf{w}$ given the data. This is called the **maximum a posteriori** (MAP) estimate. 
Note that we're trying to find a point (the maximum) on the posterior distribution, so we don't have to normalize it by dividing by the evidence as mentioned earlier. By taking the logarithm of $(8)$ and substituting $(6)$ and $(7)$ in, we see that maximizing the posterior is given by maximizing \begin{aligned} \ln \left( \mathrm{Pr}(\mathbf{w} | \textbf{\textsf{x}}, \textbf{\textsf{t}}, \alpha, \beta) \right) &\propto \ln \left( \mathrm{Pr}(\textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha) \mathrm{Pr}(\mathbf{w}|\beta) \right) \\ &= \ln \mathrm{Pr}(\textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha) + \ln \mathrm{Pr}(\mathbf{w}|\beta) \\ &\propto -\frac{\alpha}{2} \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2 \underbrace{-\frac{\beta}{2} \mathbf{w}^\intercal \mathbf{w}}_{\text{regularization}}, \quad \quad (9) \end{aligned} where we have dropped constant terms, since they don't impact the maximum. $(9)$ is almost the same as the sum of squared errors function, but we have the extra term $\frac{\beta}{2} \mathbf{w}^\intercal \mathbf{w}$, which is a **regularization** term. We will discuss regularization further in the next post, <a href="https://cookieblues.github.io//pages/bsmalea-notes-2">notes 2</a>but for now it suffices to say that **regularization is a technique to prevent overfitting**. 
Taking the derivative of $(9)$, setting it equal to 0, and solving for $\mathbf{w}$ gives us \begin{aligned} \frac{\partial}{\partial \mathbf{w}} \left( -\frac{\alpha}{2} \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2 -\frac{\beta}{2} \mathbf{w}^\intercal \mathbf{w} \right) &= 0 \\ \left( \frac{\partial}{\partial \mathbf{w}} \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right) \right) \left(-\alpha \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right) \right) - \beta \mathbf{w} &= 0 \\ &= \\ \end{aligned} #### Fully Bayesian approach While we've included a prior distribution, we're still calculating what is called a point estimate, i.e. we're finding the maximum of the posterior distribution, but to complete a fully Bayesian approach we would need to compute the **posterior predictive distribution**. Using Bayes' theorem, the likelihood, and prior that we've defined so far, we can write up the posterior distribution $$\overbrace{\mathrm{Pr} \left( \mathbf{w} | \textbf{\textsf{x}}, \textbf{\textsf{t}}, \alpha, \beta \right)}^{\mathrm{posterior}} \propto \overbrace{\mathrm{Pr}( \textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha)}^{\mathrm{likelihood}} \, \overbrace{\mathrm{Pr}(\mathbf{w} | \beta)}^{\mathrm{prior}}.$$ We will omit the proof for now, since it's a little longer, but since both our likelihood and our prior are normal distributions, the posterior is also a normal distribution, and the posterior d \begin{aligned} \mathrm{Pr} \left( \mathbf{w} | \textbf{\textsf{x}}, \textbf{\textsf{t}}, \alpha, \beta \right) &\propto \mathrm{Pr}( \textbf{\textsf{t}} | \textbf{\textsf{x}}, \mathbf{w}, \alpha) \mathrm{Pr}(\mathbf{w} | \beta) \\ &= \mathcal{N} \left( \mathbf{w} | \mathbf{w}_N, \mathbf{S}_N \right), \end{aligned} where \begin{aligned} \mathbf{w}_N &= \alpha \mathbf{S}_N \mathbf{X}^\intercal \textbf{\textsf{t}} \\ \mathbf{S}_N^{-1} &= \beta \mathbf{I} + \alpha \mathbf{X}^\intercal \mathbf{X} \end{aligned} --> 
<!-- https://m-clark.github.io/bayesian-basics/intro.html https://www.behind-the-enemy-lines.com/2008/01/are-you-bayesian-or-frequentist-or.html https://github.com/jsantarc/Bayesian-regression-with-Infinitely-Broad-Prior-Gaussian-Parameter-Distribution- http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/ https://www.ics.uci.edu/~smyth/courses/cs274/readings/bayesian_regression_overview.pdf -->

<p>As mentioned in notes 1a, machine learning is mainly concerned with prediction, and as you can imagine, prediction is very much concerned with probability. In this post we are going to look at the two main interpretations of probability: frequentism and Bayesianism.</p>

Machine learning, notes 1b: Model selection and validation, the “no free lunch” theorem, and the curse of dimensionality (2021-03-11, https://cookieblues.github.io//guides/2021/03/11/bsmalea-notes-1b)

<p>Now we know a bit about machine learning: it involves models. Machine learning attempts to model data in order to make predictions about the data. In the previous post we dove into the inner functions of a model, and that is very much what machine learning is about. Yet, it’s only half of it, really. The other half has to do with the concept of prediction, and how we make sure our models predict well. This course doesn’t dabble that deep into this other half - but there are a few important topics in this regard that you should be aware of, as they pop up in machine learning all the time.</p> <h3 id="model-selection-and-validation">Model selection and validation</h3> <p>In the previous post, we went over polynomial regression. In the Python implementation, we went with a 4th order polynomial. 
However, perhaps $M=4$ isn’t the ‘best’ choice for the order of the polynomial - but what is the ‘best’ choice, and how do we find it?</p> <p>Firstly, the order of our polynomial is set before the training process begins, and we call these special parameters in our model <strong>“hyperparameters”</strong>. Secondly, the process of figuring out the values of these hyperparameters is called <strong>hyperparameter optimization</strong> and is a part of <strong>model selection</strong>. Thirdly, as mentioned in the previous post, machine learning is mostly concerned with prediction, which means that we define the ‘best’ model as the one that <strong>generalizes</strong> the best on future data, i.e. which model would perform the best on data it wasn’t trained on?</p> <p><span class="marginnote">Note that the evaluation metric can be different from the objective function. In fact, it often is.</span> To figure this out, we usually want to come up with some kind of <strong>evaluation metric</strong>. Then we divide our training dataset into 3 parts: a <strong>training</strong>, a <strong>validation</strong> (sometimes called <strong>development</strong>), and a <strong>test</strong> dataset.<span class="marginnote">Commonly the new training dataset is $80\%$ of the original, and the validation and test datasets are $10\%$ each.</span> Then we train our model on the training dataset, perform model selection on the validation dataset, and do a final evaluation of the model on the test dataset. This way, we can determine the model with the lowest <strong>generalization error</strong>. The generalization error refers to the performance of the model on <strong>unseen data</strong>, i.e. data that the model hasn’t been trained on.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1b/model_selection_poly_reg.svg" /></p> <p>Let’s go back to the polynomial regression example. 
In the image above I’ve plotted the data points from the previous post together with the true function and $4$ different estimated polynomials of $4$ different orders: $2$, $4$, $6$, and $8$. As we increase the order of the polynomial, we increase what we call the <strong>complexity</strong> of our model, which roughly can be seen as correlated with the number of parameters in the model. So the more parameters our model has, roughly the more complex it is. As the order of the polynomial (the complexity of the model) increases, it begins approximating the data points better, until it perfectly goes through all the data points. Yet, if we perfectly match the data points in our training dataset, our model probably won’t generalize very well, because the data isn’t perfect; there’s always a bit of noise, which can also be seen above. When we end up fitting our model perfectly to our training dataset, which makes the model generalize poorly, we say that we are <strong>overfitting</strong>. In the image above, when the order is set to $8$ ($M=8$), we’re definitely overfitting. Conversely, when $M=2$, we could argue that we are <strong>underfitting</strong>, which means that the complexity of our model isn’t high enough to ‘capture the richness of variation’ in our data. In other words, it doesn’t pick up on the patterns in the data.</p> <h4 id="cross-validation">Cross-validation</h4> <p>If you have limited data, you might feel disadvantaged with the $80$-$10$-$10$ technique, because the size of the validation and test set is small, thereby not being a proper representation of your entire training set. In the extreme case of only $10$ data points, this would result in a validation and test set of size $1$, which is not exactly a great sample size! Instead, different techniques called <strong>cross-validation</strong> can be applied. 
The most common is the <strong>k-fold cross-validation</strong> technique, where you divide your dataset $\mathcal{D} = \{ \left( \mathbf{x}_1, t_1 \right), \dots, \left(\mathbf{x}_N, t_N \right) \}$ into $k$ distinct subsets. By choosing $1$ of the $k$ subsets to be the validation set, and the remaining $k-1$ subsets to be the training set, we can repeat this process $k$ times by choosing a different subset to be the validation set every time. This makes it possible to repeat the training-validation process $k$ times, eventually going through the entire original training set as both training and validation set.</p> <!-- INSERT ANIMATION FOR CROSS-VALIDATION --> <h4 id="python-implementation">Python implementation</h4> <p>Following the example from the last post we can try and figure out the best order $M$ for our polynomial. We start by defining our evaluation metric; we will use the popular <strong>mean squared-error (MSE)</strong>, which is very closely related to the sum of squared errors (SSE) function that we looked at briefly in the last post. The MSE is defined as the mean of the squared differences between our predictions and the true values, formally</p> $\text{MSE} = \frac{1}{N} \sum_{n=1}^N \left( t_n - h(x_n,\mathbf{w}) \right)^2,$ <p><span class="marginnote">The only difference between the mean squared-error and the sum of squared errors is that we’re now dividing by the number of data points.</span> where $N$ is the number of data points, $h$ is our polynomial, $\mathbf{w}=\left(w_0, \dots, w_M \right)^\intercal$ are the coefficients of our polynomial (the model parameters), and $(x_n, t_n)$ is an input-target variable pair. Below is a simple Python implementation of MSE that takes NumPy arrays as input. 
Make sure that the <code class="language-plaintext highlighter-rouge">true</code> and <code class="language-plaintext highlighter-rouge">pred</code> arrays are the same length - this could be done with an assertion if needed.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">mse</span><span class="p">(</span><span class="n">true</span><span class="p">,</span> <span class="n">pred</span><span class="p">):</span> <span class="k">return</span> <span class="nb">sum</span><span class="p">((</span><span class="n">true</span><span class="o">-</span><span class="n">pred</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">true</span><span class="p">)</span></code></pre></figure> <p>And below is an implementation of k-fold cross-validation. It yields the indices of the train and validation set for each fold. 
We only have to make sure that <code class="language-plaintext highlighter-rouge">n_splits</code> is not larger than <code class="language-plaintext highlighter-rouge">n_points</code>.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">kfold</span><span class="p">(</span><span class="n">n_points</span><span class="p">,</span> <span class="n">n_splits</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span> <span class="n">split_sizes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">full</span><span class="p">(</span><span class="n">n_splits</span><span class="p">,</span> <span class="n">n_points</span> <span class="o">//</span> <span class="n">n_splits</span><span class="p">)</span> <span class="n">leftover</span> <span class="o">=</span> <span class="n">n_points</span> <span class="o">%</span> <span class="n">n_splits</span> <span class="n">split_sizes</span><span class="p">[:</span><span class="n">leftover</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n_points</span><span class="p">)</span> <span class="n">current</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">split_size</span> <span class="ow">in</span> <span class="n">split_sizes</span><span class="p">:</span> <span class="n">val_idx</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[</span><span class="n">current</span><span class="p">:</span><span class="n">current</span><span class="o">+</span><span class="n">split_size</span><span class="p">]</span> <span class="n">train_idx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span 
class="n">delete</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">val_idx</span><span class="p">)</span> <span class="k">yield</span> <span class="n">train_idx</span><span class="p">,</span> <span class="n">val_idx</span> <span class="n">current</span> <span class="o">+=</span> <span class="n">split_size</span></code></pre></figure> <h3 id="the-no-free-lunch-theorem">The “no free lunch” theorem</h3> <p>While machine learning tries to come up with the best models, in actuality we empirically choose the best model when confronted with a task<span class="sidenote-number"></span><span class="sidenote">S. Raschka, “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning,” 2018.</span>. This is what model selection and validation, which we just learned about, does for us. A known theorem in machine learning (and optimization) is <strong>the “no free lunch” theorem</strong><span class="sidenote-number"></span><span class="sidenote">D. H. Wolpert, “The Lack of A Priori Distinctions Between Learning Algorithms,” 1996.</span>, which broadly says that there’s no universally best model. That is, you cannot say that one model is better than another model in all cases, e.g. mixture models are better than neural networks or vice-versa. This is why it’s important to learn about a plethora of models, so when you’re confronted with a task, you know not to just try one model and be content if it’s doing alright or matches your expectation; there could be another model significantly outperforming what you’re seeing.</p> <h3 id="the-curse-of-dimensionality">The curse of dimensionality</h3> <p>As mentioned in the beginning, this post is mainly about the issue of determining which models generalize the best. 
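</p>

<p>To make the model selection procedure concrete, here is a sketch that combines the <code>mse</code> and <code>kfold</code> helpers from above (repeated so the snippet is self-contained) to choose the polynomial order $M$ with 5-fold cross-validation. The noisy sinusoid is made-up data standing in for the dataset from the previous posts, and NumPy’s <code>polyfit</code>/<code>polyval</code> stand in for our own fitting code.</p>

```python
import numpy as np

def mse(true, pred):
    # mean squared-error, as defined above
    return np.mean((true - pred) ** 2)

def kfold(n_points, n_splits=2):
    # repeated from above so the snippet runs on its own
    split_sizes = np.full(n_splits, n_points // n_splits)
    split_sizes[:n_points % n_splits] += 1
    idx = np.arange(n_points)
    current = 0
    for split_size in split_sizes:
        val_idx = idx[current:current + split_size]
        yield np.delete(idx, val_idx), val_idx
        current += split_size

# made-up 1D regression data: a noisy sinusoid
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=30))
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# average validation MSE for each candidate order M
cv_scores = {}
for M in range(1, 9):
    fold_errors = [
        mse(t[val], np.polyval(np.polyfit(x[train], t[train], deg=M), x[val]))
        for train, val in kfold(len(x), n_splits=5)
    ]
    cv_scores[M] = float(np.mean(fold_errors))

best_M = min(cv_scores, key=cv_scores.get)
```

<p>On data like this, the cross-validated error typically dips at a moderate order and rises again for large $M$, mirroring the under- and overfitting behaviour discussed earlier.</p>

<p>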
The “no free lunch” theorem tells us that we can never say that one model is the best, and model selection and validation gives us a framework to actually determine the best model for a specific task; the <strong>curse of dimensionality</strong> is a common enemy in this determination, one that is even inherent in our training data! The term was first coined by Richard E. Bellman in 1957<span class="sidenote-number"></span><span class="sidenote">R. E. Bellman, “Dynamic Programming,” 1957.</span> to refer to the intractability of certain algorithms in high dimensionality. To facilitate the understanding of the curse of dimensionality, we’ll go through an example of classification. As mentioned in the previous post, classification is a supervised learning task, where we have to organize our data points into discrete groups that we call classes.</p> <p>So far we’ve been looking at polynomial regression with only a $1$-dimensional input variable $x$. However, in most practical cases, we’ll have to deal with data of high dimensionality, e.g. if humans were our observations, they could have multiple values describing them: height, weight, age, etc. 
In our example, we’ll have $10$ data points $N=10$, $2$ classes $C=2$, and each data point is $3$-dimensional $D=3$.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span> <span class="p">[</span><span class="mf">0.33</span><span class="p">,</span> <span class="mf">0.88</span><span class="p">,</span> <span class="mf">0.11</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.74</span><span class="p">,</span> <span class="mf">0.54</span><span class="p">,</span> <span class="mf">0.62</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.79</span><span class="p">,</span> <span class="mf">0.07</span><span class="p">,</span> <span class="mf">0.31</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.83</span><span class="p">,</span> <span class="mf">0.24</span><span class="p">,</span> <span class="mf">0.47</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.42</span><span class="p">,</span> <span class="mf">0.47</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.82</span><span class="p">,</span> <span class="mf">0.70</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.51</span><span class="p">,</span> <span class="mf">0.76</span><span class="p">,</span> <span class="mf">0.51</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.71</span><span class="p">,</span> <span class="mf">0.92</span><span class="p">,</span> <span class="mf">0.59</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.78</span><span 
class="p">,</span> <span class="mf">0.19</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.53</span><span class="p">,</span> <span class="mf">0.53</span><span class="p">]</span> <span class="p">])</span> <span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span></code></pre></figure> <p>The code snippet above shows our training dataset; <code class="language-plaintext highlighter-rouge">X</code> is our $10$ input variables of dimensionality $3$ (we have $3$ features), and <code class="language-plaintext highlighter-rouge">t</code> is our target variables, which in this case correspond to $2$ classes. We can also see that the first $5$ data points belong to class $0$ and the last $5$ to class $1$; we have an equal distribution between our classes. If we plot the points using only the first feature (the first column in <code class="language-plaintext highlighter-rouge">X</code>), we get the plot underneath. A naive approach to classify the points would be to split the line into $5$ segments (each of length $0.2$), and then decide to classify all the points in each segment into class $0$ or $1$. In the image underneath I’ve coloured the segments according to the classification I would make. 
With this naive approach we get $3$ mistakes.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1b/one_dim_cod.svg" /></p> <p>But let’s see if we can do better! Using the first $2$ features gives us a grid (of $0.2$ by $0.2$ tiles) that we can now use for our naive classification model. As shown underneath, we can now classify the points such that we only make $1$ mistake.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1b/two_dim_cod.svg" /></p> <p>If we use all $3$ features, we can classify all the points perfectly, which is illustrated underneath. This is because we now have $0.2$ by $0.2$ by $0.2$ cubes. From this it might seem like using all $3$ features is better than just using $1$ or $2$, since we’re able to better classify our data points - but this is where the counterintuitive concept of the <strong>curse of dimensionality</strong> comes in, and I tell you that it’s <em>not</em> better to use all the features.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1b/three_dim_cod.gif" /></p> <p>The issue relates to the proportion of our data points compared to our classification sections; with $1$ feature we had $10$ points and $5$ sections, i.e. $\frac{10}{5}=2$ points per section, with $2$ features we had $\frac{10}{5 \times 5}=0.4$ points per section, and with $3$ features we had $\frac{10}{5 \times 5 \times 5}=0.08$ points per section. As we add more features, the available data points in our <strong>feature space</strong> become exponentially sparser, which makes it easier to separate the data points. Yet, it’s not because of any pattern in the data; in actuality, it’s just the nature of higher-dimensional spaces. 
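</p>

<p>We can demonstrate this directly in code. The sketch below uses made-up data: it fits a least-squares hyperplane to $10$ uniformly random points carrying the same $5$/$5$ class labels as our example, and measures training accuracy as the number of features grows. The labels are independent of the features, so there is genuinely no pattern to find.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10
t = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # same labels as our example

accuracies = {}
for d in (1, 2, 3, 9):
    X = rng.uniform(size=(n, d))                       # random features, no pattern
    A = np.hstack([X, np.ones((n, 1))])                # append a bias column
    w, *_ = np.linalg.lstsq(A, 2 * t - 1, rcond=None)  # least-squares hyperplane
    accuracies[d] = float(np.mean((A @ w > 0) == (t == 1)))
print(accuracies)
```

<p>With $9$ features the hyperplane separates all $10$ points perfectly - not because of any structure in the labels, but simply because $N+1$ points can always be separated in $N$ dimensions.</p>

<p>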
In fact, the data points I listed were randomly generated from a uniform distribution, so the ‘pattern’ we’re fitting to isn’t actually there at all - it’s a result of the increased dimensionality, which makes the available data points become sparser. Because of this inherent sparsity we end up overfitting when we add more features to our data. This means we need more data to avoid sparsity, and that’s the curse of dimensionality: as the number of features increases, our data become sparser, which results in overfitting, and we therefore need more data to avoid it.</p> <p>The illustration underneath shows exactly what we just went over. The 100 points are randomly sampled from multivariate normal distributions of increasingly higher dimension and randomly assigned to a class. A hyperplane is found that tries to separate these points. As the dimensionality of the points increases, it becomes easier and easier to separate them. In fact, it’s always possible to perfectly separate $N+1$ points in $N$ dimensions.</p> <p style="text-align: center"><img src="https://cookieblues.github.io//extra/bsmalea-notes-1b/cod_final.svg" /></p> <h4 id="blessing-of-non-uniformity">Blessing of non-uniformity</h4> <p>So how do we avoid getting cursed? Luckily, the <strong>blessing of non-uniformity</strong><span class="sidenote-number"></span><span class="sidenote">P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78-87, 2012.</span> comes to our rescue! In most practical (real-world) scenarios our data are not spread out uniformly, but are instead concentrated in some places, which can nullify the curse of dimensionality a little bit. But what if it really is the curse of dimensionality when we’re overfitting? 
There’s not a definitive answer, as it really depends on the dataset, but there is a related <a href="https://en.wikipedia.org/wiki/One_in_ten_rule" target="_blank">one in ten rule</a> of thumb: for every model parameter (roughly every feature) we want at least $10$ data points. Some better options fall under the topic of <strong>dimensionality reduction</strong>, which we will look at later on in the course.</p> <h3 id="summary">Summary</h3> <ul> <li><strong>Hyperparameters</strong> are the parameters in a model that <strong>are determined before training</strong> the model.</li> <li>Model selection refers to the process of <strong>choosing the model that best generalizes</strong>.</li> <li><strong>Training and validation sets</strong> are used to <strong>simulate unseen data</strong>.</li> <li><strong>Overfitting</strong> happens when our model <strong>performs well on our training dataset but generalizes poorly</strong>.</li> <li><strong>Underfitting</strong> happens when our model <strong>performs poorly on both our training dataset and unseen data</strong>.</li> <li>We can see if our model generalizes well with cross-validation techniques.</li> <li>The <strong>mean squared error</strong> or <strong>MSE</strong> is a common evaluation metric.</li> <li>The <strong>“no free lunch”</strong> theorem tells us that there is <strong>no best model</strong>.</li> <li>The <strong>more features</strong>, the <strong>higher the risk of overfitting</strong> - that’s the curse of dimensionality in a nutshell.</li> <li>The <strong>blessing of non-uniformity counteracts the curse of dimensionality</strong> in most practical scenarios.</li> <li><strong>Dimensionality reduction</strong> can be a tool to remedy the curse of dimensionality.</li> </ul>{"name"=>nil, "email"=>nil, "twitter"=>nil}Now we know a bit about machine learning: it involves models. Machine learning attempts to model data in order to make predictions about the data.
In the previous post we dove into the inner workings of a model, and that is very much what machine learning is about. Yet, it’s only half of it, really. The other half has to do with the concept of prediction, and how we make sure our models predict well. This course doesn’t delve that deep into this other half - but there are a few important topics in this regard that you should be aware of, as they pop up in machine learning all the time.Machine learning, notes 1a: What is machine learning?2021-03-08T00:00:00+00:002021-03-08T00:00:00+00:00https://cookieblues.github.io//guides/2021/03/08/bsmalea-notes-1a<p>It seems most people derive their definition of machine learning from a quote from Arthur Lee Samuel in 1959: “Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.” The interpretation to take away from this is that “machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.”</p> <p>While Arthur Lee Samuel first coined the term “machine learning” in 1959<span class="sidenote-number"></span><span class="sidenote">R. Kohavi and F. Provost, “Glossary of terms,” Machine Learning, vol. 30, no. 2–3, pp. 271–274, 1998.</span>, and machine learning <a href="https://en.wikipedia.org/wiki/AI_winter">‘took off’ after the 1970s</a>, the underlying theory applied in machine learning has existed long before. For example, the method of least-squares was first published by Adrien-Marie Legendre in 1805<span class="sidenote-number"></span><span class="sidenote">A. M. Legendre, “Nouvelles méthodes pour la détermination des orbites des comètes,” 1805.</span>, and Bayes’ theorem, which is the cornerstone of Bayesian statistics that has taken off in the 21st century, was first underpinned by Thomas Bayes in 1763<span class="sidenote-number"></span><span class="sidenote">T. Bayes and R.
Price, “An essay towards solving a problem in the doctrine of chances,” in a letter to J. Canton, 1763.</span>.</p> <p>Machine learning draws a lot of its methods from statistics, but there is a distinctive difference between the two areas: <strong>statistics is mainly concerned with estimation</strong>, whereas <strong>machine learning is mainly concerned with prediction</strong>. This distinction makes for great differences, as we will see soon enough.</p> <h3 id="categories-of-machine-learning">Categories of machine learning</h3> <p>There are many different machine learning methods that solve different tasks, and putting them all in rigid categories can be quite a task on its own. My posts will cover 2 fundamental ones: <strong>supervised</strong> learning and <strong>unsupervised</strong> learning, which can further be divided into smaller categories as shown in the image underneath.</p> <p><img src="https://cookieblues.github.io//extra/bsmalea-notes-1a/categories_of_ml_2.svg" /></p> <p>It’s important to note that these categories are not strict, e.g. dimensionality reduction isn’t always unsupervised, and you can use density estimation for clustering and classification.</p> <h4 id="supervised-learning">Supervised learning</h4> <p>Supervised learning refers to a subset of machine learning tasks where we’re given a dataset $\mathcal{D} = \left\{ (\mathbf{x}_1,\mathbf{t}_1), \dots, (\mathbf{x}_N,\mathbf{t}_N) \right\}$ of $N$ input-output pairs, and our goal is to come up with a function $h$ from the inputs $\mathbf{x}$ to the outputs $\mathbf{t}$. In more layman’s terms: we are given a dataset with predetermined labels that we want to predict - hence the learning is <em>supervised</em>, i.e. we are handed some data that tells us what we want to predict. Each input-output pair refers to an observation, where we want to predict the output from the input.
Each input variable $\mathbf{x}$ is a $D_1$-dimensional vector (or a scalar), representing the observation with numerical values. The different dimensions of the input variable are commonly called <strong>features</strong> or <strong>attributes</strong>. Likewise, each output or <strong>target</strong> variable $\mathbf{t}$ is a $D_2$-dimensional vector (but most often just a scalar).</p> <p>In <strong>classification</strong> the possible values for the target variables form a finite number of discrete categories $t \in \{ C_1, \dots, C_k \}$ commonly called <strong>classes</strong>. An example of this could be trying to classify olive oil into geographical regions (our classes) based on various aspects (our features) of the olive oils<span class="sidenote-number"></span><span class="sidenote">J. Gromadzka and W. Wardencki, “Trends in Edible Vegetable Oils Analysis. Part B. Application of Different Analytical Techniques,” 2011.</span>. The features could be concentrations of acids in the olive oils, and the classes could be northern and southern France. Another classic example is recognizing handwritten digits<span class="sidenote-number"></span><span class="sidenote">Y. LeCun et al., “Gradient-based learning applied to document recognition,” 1998.</span>. Given an image of $28 \times 28$ pixels, we can represent each image as a $784$-dimensional vector, which will be our input variable, and our target variables will be scalars from $0$ to $9$, each representing a distinct digit.</p> <p>You might’ve heard of <strong>regression</strong> before. Like classification, we are given a target variable, but in regression it is continuous instead of discrete, i.e. $t \in \mathbb{R}$. An example of regression that I’m fairly interested in is forecasting election results from polling. In this case, your features would obviously be the polls, but they could also include other data like days until the election or perhaps the parties’ media attention.
The target variables are naturally the share of the votes for each party. Another example of regression could be predicting how much a house will be sold for. In this case, the features could be any measurements about the house, the location, and what other similar houses have been sold for recently - the target variable is the selling price of the house.</p> <h4 id="unsupervised-learning">Unsupervised learning</h4> <p>Another subset of machine learning tasks falls under unsupervised learning, where we’re only given a dataset $\mathcal{D} = \left\{ \mathbf{x}_1, \dots, \mathbf{x}_N \right\}$ of $N$ input variables. In contrast to supervised learning, we’re not told what we want to predict, i.e., we’re not given any target variables. The goal of unsupervised learning is then to find patterns in the data.</p> <p>The image of categories above divides unsupervised learning into 3 subtasks, the first one being <strong>clustering</strong>, which, as the name suggests, refers to the task of discovering ‘clusters’ in the data. We can define a cluster to be <strong>a group of observations that are more similar to each other than to observations in other clusters</strong>. Let’s say we had to come up with clusters for a basketball, a carrot, and an apple. Firstly, we could create clusters based on shapes, in which case the basketball and the apple are both round, but the carrot isn’t. Secondly, we could also cluster by use, in which case the carrot and apple are foods, but the basketball isn’t. Finally, we might cluster by colour, in which case the basketball and the carrot are both orange, but the apple isn’t. All three examples are valid clusters, but they’re clustering different things.</p> <p>Then we have <strong>density estimation</strong>, which is the task of fitting probability density functions to the data. It’s important to note that density estimation is often done in conjunction with other tasks like classification, e.g.
based on the given classes of our observations, we can use density estimation to find the distributions of each class and thereby (based on the class distributions) classify new observations. An example of density estimation could be finding extreme outliers in data, i.e., finding data that are highly unlikely to be generated from the density function you fit to the data.</p> <p>Finally, <strong>dimensionality reduction</strong>, as the name suggests, reduces the number of features of the data that we’re dealing with. Just like density estimation, this is often done in conjunction with other tasks. Let’s say we were going to do a classification task, and our input variables have $50$ features - if we could do the same task equally well after reducing the number of features to $5$, we could save a lot of time on computation. Having a high number of dimensions in our input variables can also cause unwanted behaviour in our model, known as the curse of dimensionality, but that’s a tale for another time.</p> <h3 id="example-polynomial-regression">Example: polynomial regression</h3> <p>Let’s go through an example of machine learning. This is also to get familiar with the machine learning terminology. We’re going to implement a model called <em>polynomial regression</em>, where we try and fit a polynomial to our data. Given a training dataset of $N$ input variables $x \in \mathbb{R}$ (notice the input variables are one-dimensional) with corresponding target variables $t \in \mathbb{R}$, our objective is to fit a polynomial that yields values $\hat{t}$ of target variables for new values $\hat{x}$ of the input variable. We’ll do this by estimating the coefficients of the polynomial</p> $h(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^M w_m x^m, \quad \quad (1)$ <p>which we refer to as the <strong>parameters</strong> or <strong>weights</strong> of our model.
$M$ is the order of our polynomial, and $\mathbf{w} = \left( w_0, w_1, \dots, w_M \right)^\intercal$ denotes all our parameters, i.e. we have $M+1$ parameters for our $M$th order polynomial.</p> <p><span class="marginnote">In the next post we’ll discuss exactly what we mean by ‘best’ values.</span> Now, the objective is to estimate the ‘best’ values for our parameters. To do this, we define what is called an <strong>objective function</strong> (also sometimes called <strong>error</strong> or <strong>loss</strong> function). We construct our objective function such that it outputs a value that tells us how our model is performing. For this task, we define the objective function as the sum of the squared differences between the predictions of our polynomial and the corresponding target variables, i.e.</p> $E(\mathbf{w}) = \sum_{n=1}^N \left( t_n - h(x_n, \mathbf{w}) \right)^2, \quad \quad (2)$ <p>and if we substitute $h(x_n, \mathbf{w})$ with the right-hand side of $(1)$, we get</p> $E(\mathbf{w}) = \sum_{n=1}^N \left( t_n - \sum_{m=0}^M w_m x_n^m \right)^2.$ <p>Let’s take a minute to understand what $(2)$ is saying. The term in the parentheses on the right-hand side is commonly called the $n$th residual and is denoted $r_n = t_n - h(x_n, \mathbf{w})$. It’s the difference between the output of our polynomial for input variable $x_n$ and the corresponding target variable $t_n$.
The difference can be both negative and positive depending on whether the output of our polynomial is lower or higher than the target.<span class="marginnote">Note that since we’re squaring all the differences, the value of the objective function $E$ cannot be lower than 0 - and if it’s exactly 0, then our model is making no mistakes, i.e., it is predicting the exact value of the target every time.</span> We therefore square these differences and add them all up in order to get a value that tells us how our polynomial is performing.</p> <p>This objective function is called the <a href="https://en.wikipedia.org/wiki/Residual_sum_of_squares">residual sum of squares or sum of the squared residuals</a> and is often used as a way to measure the performance of regression models in machine learning. The image below shows the differences between the polynomial that we’re estimating and the data we’re given. These differences are the errors (or residuals) that the objective function squares and sums.</p> <p><img src="https://cookieblues.github.io//extra/bsmalea-notes-1a/residuals.svg" /></p> <p>So far, so good! Since the objective function tells us how well we’re doing, and the lower it is, the better we’re doing, we will try and find the minimum of the objective function. That is, we want to find the values for our parameters $\mathbf{w}$ that give us the lowest value for $E$. The process of determining the values for our parameters is called the <strong>training</strong> or <strong>learning</strong> process.</p> <p>To find the minimum of a function, we take the derivative, set it equal to 0, and solve for our parameters $\mathbf{w}$. Since we have a lot of parameters, we’ll take the partial derivative of $E$ with respect to the $i$th parameter $w_i$, set it equal to 0, and solve for it.
This will give us a linear system of $M+1$ equations with $M+1$ unknowns (our parameters $\mathbf{w}$). We’ll go over the derivation of the solution to this problem in the next post, but for now we’ll just take it as given. The solution to the system of equations is</p> $\hat{\mathbf{w}} = \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}, \quad \quad (3)$ <p>where $\textbf{\textsf{t}}$ denotes all our target variables as a column vector $\textbf{\textsf{t}} = \left( t_1, t_2, \dots, t_N \right)^\intercal$, and $\mathbf{X}$ is called the <strong>design matrix</strong> and is defined as</p> $\mathbf{X} = \begin{pmatrix} 1 &amp; x_1 &amp; x_1^2 &amp; \cdots &amp; x_1^M \\ 1 &amp; x_2 &amp; x_2^2 &amp; \cdots &amp; x_2^M \\ \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ 1 &amp; x_N &amp; x_N^2 &amp; \cdots &amp; x_N^M \\ \end{pmatrix}. \quad \quad (4)$ <p>To sum up: we’re given $N$ pairs of input and target variables $\left\{ (x_1, t_1), \dots, (x_N, t_N) \right\}$, and we want to fit a polynomial to the data of the form $(1)$ such that the value of our polynomial $h(x_i, \mathbf{w})$ is as close to $t_i$ as possible. We do this by finding values for the parameters $\mathbf{w}$ that minimize the objective function defined in $(2)$, and the solution to this is given in $(3)$.</p> <h4 id="python-implementation-of-polynomial-regression">Python implementation of polynomial regression</h4> <p>Let’s try and implement our model!
We’ll start with the dataset shown underneath, where <code class="language-plaintext highlighter-rouge">x</code> is our input variables and <code class="language-plaintext highlighter-rouge">t</code> is our target variables.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.8</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.6</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.4</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> <span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">4.9</span><span class="p">,</span> <span class="o">-</span><span class="mf">3.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.8</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.6</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.3</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">2.1</span><span 
class="p">,</span> <span class="mf">2.9</span><span class="p">,</span> <span class="mf">5.6</span><span class="p">])</span></code></pre></figure> <p>To begin with we can define the order of our polynomial, find the number of data points, and then set up our design matrix.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">M</span> <span class="o">=</span> <span class="mi">4</span> <span class="n">N</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="n">M</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span></code></pre></figure> <p>If we look at the definition of the design matrix in $(4)$, we can fill out the columns of our design matrix with the following for-loop.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">M</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">m</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="n">m</span></code></pre></figure> <p>Now we can find the parameters with the solution in $(3)$.</p> <p><span class="marginnote">The <code class="language-plaintext highlighter-rouge">@</code> performs matrix multiplication.</span></p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span 
class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">X</span><span class="p">)</span> <span class="o">@</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">t</span></code></pre></figure> <p>Using NumPy’s <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.poly1d.html"><code class="language-plaintext highlighter-rouge">poly1d</code> function</a> we can generate outputs for our polynomial.</p> <p><span class="marginnote">We flip the weights to accommodate the input of the <code class="language-plaintext highlighter-rouge">poly1d</code> function.</span></p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">h</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">poly1d</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span> <span class="n">x_</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">t_</span> <span class="o">=</span> <span class="n">h</span><span class="p">(</span><span class="n">x_</span><span class="p">)</span></code></pre></figure> <p>Now we can plot our estimated polynomial with our data points.
I’ve also plotted the true function that the points were generated from.</p> <p><img src="https://cookieblues.github.io//extra/bsmalea-notes-1a/poly_reg.svg" /></p> <h3 id="summary">Summary</h3> <ul> <li>Machine learning studies <strong>how to make computers learn on their own</strong> with the goal of <strong>predicting the future</strong>.</li> <li><strong>Supervised learning</strong> refers to machine learning tasks where we are given <strong>labeled data</strong>, and we want to predict those labels.</li> <li><strong>Unsupervised learning</strong>, as it suggests, refers to tasks where we are <em>not</em> provided with labels for our data.</li> <li><strong>Features</strong> refer to the <strong>attributes</strong> (usually columns) of our data e.g. height, weight, shoe size, etc., if our observations are humans.</li> <li><strong>Classification and regression are supervised</strong> tasks, <strong>clustering, density estimation, and dimensionality reduction are unsupervised</strong> tasks.</li> <li><strong>Parameters</strong> refer to the values <strong>we want to estimate</strong> in a machine learning model.</li> <li>The <strong>process of estimating the values of the parameters</strong> is called the <strong>training or learning</strong> process.</li> <li>An <strong>objective function</strong> is a <strong>measure of the performance</strong> of our model.</li> </ul>{"name"=>nil, "email"=>nil, "twitter"=>nil}It seems most people derive their definition of machine learning from a quote from Arthur Lee Samuel in 1959: “Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.” The interpretation to take away from this is that “machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.”The bare minimum guide to
Matplotlib2021-02-15T00:00:00+00:002021-02-15T00:00:00+00:00https://cookieblues.github.io//guides/2021/02/15/bare-minimum-matplotlib<p>If you want to work with arrays in Python, you use NumPy. If you want to work with tabular data, you use Pandas. The quintessential Python library for data visualization is Matplotlib. It’s easy to use, flexible, and a lot of other visualization libraries build on the shoulders of Matplotlib. This means that learning Matplotlib will make it easier to understand and work with some of the fancier visualization libraries.</p> <h3 id="getting-started">Getting started</h3> <p>You’ll need to install the Matplotlib library. Assuming you have some terminal at your disposal and you have <a href="https://en.wikipedia.org/wiki/Pip_(package_manager)" target="_blank">pip</a> installed, you can install Matplotlib with the following command: <code class="language-plaintext highlighter-rouge">pip install matplotlib</code>. You can read more about the installation in Matplotlib’s <a href="https://matplotlib.org/users/installing.html" target="_blank">installation guide</a>.</p> <h3 id="two-approaches">Two approaches</h3> <p>We’ll begin by making a simple <a href="https://en.wikipedia.org/wiki/Scatter_plot" target="_blank">scatter plot</a> in two different ways: the ‘naive’ way and the object-oriented way. Both approaches have their pros and cons. Generally, we can say that the object-oriented approach is best when you need multiple plots next to each other.<span class="marginnote">I almost always use the object-oriented approach though, even when I don’t need to make multiple plots.</span></p> <h4 id="naive">‘Naive’</h4> <p>To start with, we have to import matplotlib.
The <code class="language-plaintext highlighter-rouge">plt</code> framework is what we’ll use for Python plotting.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span></code></pre></figure> <p>We also import numpy, so we can easily generate points to plot! Let’s pick some points on the <a href="https://en.wikipedia.org/wiki/Sine" target="_blank">sine function</a>. We choose some x-values and then calculate the y-values with <code class="language-plaintext highlighter-rouge">np.sin</code>.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure> <p>Now that we’ve generated our points, we can make our scatter plot! 
We use the <code class="language-plaintext highlighter-rouge">scatter</code> function from the <code class="language-plaintext highlighter-rouge">plt</code> framework to make the plot, and we use <code class="language-plaintext highlighter-rouge">show</code> to visualize our plot.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p>By running these $6$ lines, a window with the following plot should appear.</p> <p><img src="/extra/matplotlib-bare-minimum/scatter_plot.svg" /></p> <p>If we don’t want a scatter plot but a line plot, we can switch out <code class="language-plaintext highlighter-rouge">scatter</code> for <code class="language-plaintext highlighter-rouge">plot</code>.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p>This gives us the following plot.</p> <p><img src="/extra/matplotlib-bare-minimum/jagged_line_plot.svg" /></p> <p>However, this line is very jagged. 
We can make it more smooth by generating more points.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/smooth_line_plot.svg" /></p> <h4 id="object-oriented">Object-oriented</h4> <p>Now that we know how to make and visualize a plot, let’s look at the object-oriented way of producing the same plot. However, why would we want to know this? 
Simply because the object-oriented way is more powerful and allows for more complicated plots, as will be evident when we want to make multiple plots.</p> <p>If we want to replicate the previous plot, we start by making a <code class="language-plaintext highlighter-rouge">Figure</code> object and an <code class="language-plaintext highlighter-rouge">Axes</code> object.<span class="marginnote">We assume we have generated our data.</span></p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span></code></pre></figure> <p>We can think of the <code class="language-plaintext highlighter-rouge">Figure</code> object as the frame we want to put plots into, and of the <code class="language-plaintext highlighter-rouge">Axes</code> object as an actual plot in our frame. We then add the line plot to the <code class="language-plaintext highlighter-rouge">Axes</code> object and use <code class="language-plaintext highlighter-rouge">show</code> again to visualize the plot.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p>This generates the same plot as before.</p> <h3 id="line-plots">Line plots</h3> <p>Here are examples of <a href="https://matplotlib.org/3.1.0/gallery/color/named_colors.html" target="_blank">colours</a> that we can use.
We can specify colours in many different ways; hex code, RGB, plain old names.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"magenta"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span 
class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="p">(</span><span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">,</span> <span class="mf">0.12</span><span class="p">))</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"#228B22"</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/colours.svg" /></p> <p>There are also many predefined <a href="https://matplotlib.org/3.1.0/gallery/lines_bars_and_markers/linestyles.html" target="_blank">linestyles</a> that we can use. 
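Besides the named styles demonstrated below, Matplotlib also accepts custom dash patterns given as a tuple `(offset, (on, off, ...))` measured in points. A minimal sketch (the `Agg` backend is set here only so the snippet runs headless; it is not part of the original example):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a scripted run
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-6, 6, num=100)
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot()
# 5-point dashes separated by 2-point gaps, starting at offset 0
ax.plot(x, norm.pdf(x, loc=0, scale=1), linestyle=(0, (5, 2)))
# 1-point dots separated by 3-point gaps
ax.plot(x, norm.pdf(x, loc=2, scale=1), linestyle=(0, (1, 3)))
plt.show()
```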
Note that without defining colours, Matplotlib will automatically choose some distinct default colors for our lines.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=-</span><span class="mi">3</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"solid"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span 
class="n">loc</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"dotted"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"dashed"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"dashdot"</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/linestyles.svg" /></p> <p>We can also adjust the width of our lines!</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span 
class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">7</span><span class="p">):</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"black"</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="n">i</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/linewidths.svg" /></p> <h3 id="scatter-plots">Scatter plots</h3> <p>For scatter plots, we can 
change the <a href="https://matplotlib.org/3.3.3/api/markers_api.html" target="_blank">markers</a> and their size. Here’s an example</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">x</span> <span class="n">y2</span> <span class="o">=</span> <span class="o">-</span><span class="n">y1</span> <span class="n">y3</span> <span class="o">=</span> <span class="n">y1</span><span class="o">**</span><span class="mi">2</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y1</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">"v"</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span 
class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y2</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">"X"</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y3</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">"s"</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/markers.svg" /></p> <p>We can also combine line and scatter plots using the <a href="https://matplotlib.org/3.3.4/api/_as_gen/matplotlib.pyplot.plot.html" target="_blank"><code class="language-plaintext highlighter-rouge">ax.plot</code></a> function by changing the <code class="language-plaintext highlighter-rouge">fmt</code> parameter. The <code class="language-plaintext highlighter-rouge">fmt</code> parameter consists of a part for marker, line, and color: <code class="language-plaintext highlighter-rouge">fmt = [marker][line][color]</code>. 
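As a minimal sketch of how the three parts combine (this assumes the standard single-letter colour codes, e.g. <code class="language-plaintext highlighter-rouge">b</code> for blue and <code class="language-plaintext highlighter-rouge">r</code> for red; any part of the format string may be omitted):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for a scripted run
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, num=20)
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot()
ax.plot(x, x, "o:b")   # circle markers, dotted line, blue
ax.plot(x, -x, "^-r")  # triangle markers, solid line, red
plt.show()
```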
If <code class="language-plaintext highlighter-rouge">fmt = "s--m"</code>, then we have square markers, a dashed line, and they’ll be coloured magenta.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">3</span> <span class="o">-</span> <span class="n">x</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="s">'H-g'</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/linescatter.svg" /></p> <h3 id="histograms">Histograms</h3> <p>We can make histograms easily using the <code class="language-plaintext highlighter-rouge">ax.hist</code> function.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> 
<span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">10000</span><span class="p">)</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/hist.svg" /></p> <p>We can change a lot of things in the histogram to make it nicer - we can even add multiple!</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">10000</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">10000</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span 
class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'turquoise'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'magenta'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/hists.svg" /></p> <h3 id="legends">Legends</h3> <p>Naturally, we’ll want to add a legend to our plot. 
This is simply done with the <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html" target="_blank"><code class="language-plaintext highlighter-rouge">ax.legend</code></a> function.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">x</span> <span class="n">y2</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'turquoise'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'First'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span 
class="n">color</span><span class="o">=</span><span class="s">'magenta'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Second'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/legend.svg" /></p> <p>Matplotlib will automatically try and find the best position for the legend on your plot, but we can change it by providing an argument for the <code class="language-plaintext highlighter-rouge">loc</code> parameter. Also, a common preference is to not have a frame around the legend, and we can disable it by setting the <code class="language-plaintext highlighter-rouge">frameon</code> parameter to <code class="language-plaintext highlighter-rouge">False</code>. Additionally, Matplotlib lists the elements of the legend in one column, but we can provide the number of columns to use in the <code class="language-plaintext highlighter-rouge">ncol</code> parameter.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">x</span> <span class="n">y2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">np</span><span class="p">.</span><span 
class="n">cos</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">y3</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'turquoise'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'First'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'magenta'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Second'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y3</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'forestgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Third'</span><span class="p">)</span> <span class="n">ax</span><span 
class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'lower center'</span><span class="p">,</span> <span class="n">frameon</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">ncol</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure> <p><img src="/extra/matplotlib-bare-minimum/more_legend.svg" /></p> <h3 id="final-tips">Final tips</h3> <p>There are so many quirks and different things you can do with Matplotlib, and unfortunately I cannot provide them all here. However, a few guidelines to get you started:</p> <ol> <li>You save figures with the <code class="language-plaintext highlighter-rouge">plt.savefig()</code> function.</li> <li>There are a bunch of libraries that build on the shoulders of Matplotlib that could be beneficial to the specific plot you’re trying to create, e.g. <a href="https://seaborn.pydata.org/">Seaborn</a>, <a href="https://docs.bokeh.org/en/latest/">Bokeh</a>, <a href="https://plotly.com/">Plotly</a>, and many more.</li> <li>Look at the <a href="https://matplotlib.org/stable/gallery/index.html">gallery</a>. Please, please, look at the <a href="https://matplotlib.org/stable/gallery/index.html">gallery</a>! Don’t waste 3 hours working on a plot if someone has already made it.</li> </ol>If you want to work with arrays in Python, you use NumPy. If you want to work with tabular data, you use Pandas. The quintessential Python library for data visualization is Matplotlib. It’s easy to use, flexible, and a lot of other visualization libraries build on the shoulders of Matplotlib.
This means that learning Matplotlib will make it easier to understand and work with some of the fancier visualization libraries.Pearson Correlation Coefficient and Cosine Similarity in Word Embeddings2021-01-11T00:00:00+00:00https://cookieblues.github.io//machine%20learning/natural%20language%20processing/2021/01/11/pcc-and-cosine-similarity<p>A friend of mine recently asked me about word embeddings and similarity. I remember learning that the typical way of calculating the similarity between a pair of word embeddings is to take the cosine of the angle between their vectors. This measure of similarity makes sense due to the way that these word embeddings are commonly constructed, where each dimension is supposed to represent some sort of semantic meaning<span class="marginnote">These word embedding techniques have obvious flaws, such as words that are spelled the same way but have different meanings (called <a href="https://en.wikipedia.org/wiki/Homograph">homographs</a>), or sarcasm, which oftentimes is saying one thing but meaning the opposite.</span>. Yet, my friend asked if you could calculate the correlation between word embeddings as an alternative to cosine similarity, and it turns out that it’s almost the exact same thing.</p> <p><a href="https://www.aclweb.org/anthology/N19-1100/">Zhelezniak et al. (2019)</a> explain this well. Given a vocabulary of $N$ words $\mathcal{V} = \{ w_1, \dots, w_N \}$ with a corresponding word embedding matrix $\mathbf{W} \in \mathbb{R}^{N \times D}$, each row in $\mathbf{W}$ corresponds to a word. Considering a pair of these, we can calculate their <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#For_a_sample">Pearson correlation coefficient (PCC)</a>.
Let $(\mathbf{x}, \mathbf{y}) = \{ (x_1, y_1), \dots, (x_D, y_D) \}$ denote this pair, and we can compute the PCC as</p> $r_{xy} = \frac{ \sum_{i=1}^D (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{\sum_{i=1}^D (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^D (y_i - \bar{y})^2} }, \quad \quad (1)$ <p>where $\bar{x} = \frac{1}{D} \sum_{i=1}^D x_i$ is the sample mean; and analogously for $\bar{y}$.</p> <p>The <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> between vectors $\mathbf{x}, \mathbf{y}$ is</p> \begin{aligned} \cos \theta &amp;= \frac{\mathbf{x} \cdot \mathbf{y}}{\| \mathbf{x} \| \| \mathbf{y} \|} \\ &amp;= \frac{\sum_{i=1}^D x_i y_i}{\sqrt{\sum_{i=1}^D x_i^2} \sqrt{\sum_{i=1}^D y_i^2}} \quad \quad (2), \end{aligned} <p>where we see that equations $(1)$ and $(2)$ are the same if the sample means are 0. The question then becomes: is the mean of word vectors (across the $D$ dimensions) 0?</p> <p><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> is a popular algorithm for constructing word embeddings, and their pre-trained word embeddings are also commonly used. Let’s download the pre-trained word embeddings and see if the mean of their vectors equals 0.</p> <p><span class="marginnote">The GloVe embeddings take up a little more than 800 MB.
Depending on your connection, this might take a few minutes to download.</span></p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 </pre></td><td class="code"><pre><span class="kn">from</span> <span class="nn">urllib.request</span> <span class="kn">import</span> <span class="n">urlretrieve</span> <span class="kn">from</span> <span class="nn">zipfile</span> <span class="kn">import</span> <span class="n">ZipFile</span> <span class="n">GLOVE_URL</span> <span class="o">=</span> <span class="s">'http://nlp.stanford.edu/data/glove.6B.zip'</span> <span class="n">GLOVE_FILENAME</span> <span class="o">=</span> <span class="s">'raw_data.zip'</span> <span class="n">urlretrieve</span><span class="p">(</span><span class="n">GLOVE_URL</span><span class="p">,</span> <span class="n">GLOVE_FILENAME</span><span class="p">)</span> <span class="k">with</span> <span class="n">ZipFile</span><span class="p">(</span><span class="n">GLOVE_FILENAME</span><span class="p">)</span> <span class="k">as</span> <span class="n">zipfile</span><span class="p">:</span> <span class="n">zipfile</span><span class="p">.</span><span class="n">extractall</span><span class="p">(</span><span class="s">'data'</span><span class="p">)</span> </pre></td></tr></tbody></table></code></pre></figure> <p>This piece of code will give us a folder called <code class="language-plaintext highlighter-rouge">data</code> with 4 different GloVe embedding files of varying vector dimensionalities (50, 100, 200, and 300). I will use the file with dimensionality 300. 
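Before inspecting GloVe itself, the claim that the PCC in equation (1) is just the cosine similarity of mean-centered vectors can be sanity-checked numerically with random vectors (this sketch only assumes NumPy and SciPy, not the downloaded embeddings):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=300)  # stand-ins for two 300-dimensional word vectors
y = rng.normal(size=300)

# Pearson correlation coefficient, equation (1)
r, _ = pearsonr(x, y)

# cosine similarity of the mean-centered vectors, equation (2) after centering
xc, yc = x - x.mean(), y - y.mean()
cos = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# the two agree up to floating-point error
assert np.isclose(r, cos)
```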
We can now load the words and their corresponding vectors and calculate the mean of each vector.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 </pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> <span class="n">DIM</span> <span class="o">=</span> <span class="mi">300</span> <span class="n">words</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span> <span class="n">word_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">400_000</span><span class="p">,</span> <span class="mi">300</span><span class="p">))</span> <span class="c1"># vocab is 400k </span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">'data/glove.6B.</span><span class="si">{</span><span class="n">DIM</span><span class="si">}</span><span class="s">d.txt'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">pbar</span> <span class="o">=</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">total</span><span class="o">=</span><span class="mi">400_000</span><span class="p">)</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">line</span> <span 
class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">()</span> <span class="n">words</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="n">word_matrix</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span> <span class="n">pbar</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="n">pbar</span><span class="p">.</span><span class="n">close</span><span class="p">()</span> <span class="n">means</span> <span class="o">=</span> <span class="n">word_matrix</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </pre></td></tr></tbody></table></code></pre></figure> <p>Plotting these means in a histogram will give us insight into the distribution.</p> <p><span class="marginnote">Distribution of means of GloVe word vectors from <code class="language-plaintext highlighter-rouge">glove.6B.300d.txt</code>.</span></p> <figure> <img src="/extra/pcc-and-cosine-similarity/means_hist.svg" /> </figure> <p>As we can see, the means fall closely to 0, which means the PCC and the cosine similarity will be 
roughly the same when used to calculate similarity between pairs of word embeddings.</p>

A friend of mine recently asked me about word embeddings and similarity. I remember learning that the typical way of calculating the similarity between a pair of word embeddings is to take the cosine of the angle between their vectors. This measure of similarity makes sense due to the way that these word embeddings are commonly constructed, where each dimension is supposed to represent some sort of semantic meaning. (These word embedding techniques have obvious flaws, such as words that are spelled the same way but have different meanings, called homographs, or sarcasm, which oftentimes is saying one thing but meaning the opposite.) Yet, my friend asked if you could calculate the correlation between word embeddings as an alternative to cosine similarity, and it turns out that it’s almost the exact same thing.

How to do data science without machine learning (2020-01-31) https://cookieblues.github.io//updates/2020/01/31/how-to-do-data-science-without-machine-learning

<p>I recently participated in a challenge orchestrated by <a href="https://omdena.com/">Omdena</a>, which (if you don’t already know) is an international hub for AI enthusiasts that want to leverage AI solutions to solve some of humanity’s problems. The challenge was posed by a Nigerian NGO called RA365, and if you want to read more about it, Omdena asked me to write a small article for their blog regarding some work I did in the challenge.
I didn’t really do much AI, so I decided to write about just that!</p> <p>You can find the article by clicking <a href="https://medium.com/omdena/ai-in-nigeria-doing-data-science-for-good-without-machine-learning-6f7b1856d813">here</a>.</p>

The bare minimum guide to LaTeX (2019-04-04) https://cookieblues.github.io//guides/2019/04/04/bare-minimum-latex

<p>A few people have been asking for a short guide to LaTeX, but since I don’t have that much time on my hands these days, I thought I’d start with a very short guide that I can expand if needed.</p> <h3 id="why">Why?</h3> <p>You can think of LaTeX as a programming language for documents. It’s well-suited for large projects and complicated document structures, as it keeps track of a bunch of things by itself (we won’t get into those things in this post though). The other huge plus for LaTeX is the ability to write complicated mathematical equations simply and beautifully.
<!--more--></p> <h3 id="getting-started-with-overleaf">Getting started with Overleaf</h3> <p>While it’s possible to install LaTeX and a LaTeX editor on your computer, I think using Overleaf (version 2) is the best choice for most people; you won’t have to find a proper LaTeX installation, worry about a million LaTeX files, deal with packages, figure out how to collaborate with other people, or worry about having access to your documents on different computers.</p> <p>Firstly, go to <a href="https://www.overleaf.com?r=094710d5&amp;rm=d&amp;rs=b">this link</a><span class="sidenote-number"></span><span class="sidenote">This is a referral link that gives me a few bonuses on Overleaf - it won’t have any effect on you, but if you prefer not to use this, then go to Google and search for Overleaf v2.</span> and sign up for an Overleaf (version 2) account. Secondly, go back to Overleaf and log in. Thirdly, you should see a green button in the top left that says “New Project”. Click the button and choose “Blank Project”. It will ask you to name your project, which can be whatever you want (it can be changed later if needed), and then press “Create”.</p> <h3 id="the-structure-of-a-latex-document">The structure of a LaTeX document</h3> <p>You should see a document on the right side of your screen, a bit of code in the middle, and an overview of your project on the left only containing <code class="language-plaintext highlighter-rouge">main.tex</code>. If you don’t see the document on the right, then there should be a tiny button in the top right (below “Chat”) that looks like $2$ arrows pointing at each other. Hover over the button and it should say “Split screen”. If you click the button, your document should appear.</p> <p>Let’s look at the code now.
I’ve posted below what your code should look like (except for a few things).</p> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="k">\documentclass</span><span class="p">{</span>article<span class="p">}</span> <span class="k">\usepackage</span><span class="na">[utf8]</span><span class="p">{</span>inputenc<span class="p">}</span> <span class="k">\title</span><span class="p">{</span>The bare minimum guide to LaTeX<span class="p">}</span> <span class="k">\author</span><span class="p">{</span>Cookieblues<span class="p">}</span> <span class="k">\date</span><span class="p">{</span>April 2019<span class="p">}</span> <span class="nt">\begin{document}</span> <span class="k">\maketitle</span> <span class="k">\section</span><span class="p">{</span>Introduction<span class="p">}</span> <span class="nt">\end{document}</span></code></pre></figure> <p>At first, it can look a bit daunting, but let’s try and decompose the different parts. The first line specifies what kind of document you want to write. In this case it defaults to an <code class="language-plaintext highlighter-rouge">article</code>, which is probably what most documents should be. The second line imports the package <code class="language-plaintext highlighter-rouge">inputenc</code>. The next three lines of code specify some metadata about the document: the title, author, and date. All of this is called the <strong>preamble</strong>, because it comes before the actual document. This is where you can make changes to the structure of your document, e.g. the size of the margins, the size of the font, define specific functions, change the layout, etc. However, we won’t dive into all these things in this guide.
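Just to give a flavor of what such preamble changes can look like, here is a small sketch; the specific packages and values below are illustrative picks on my part, and none of them are needed for the rest of this guide:

```latex
\documentclass[12pt]{article}       % a class option: larger base font size
\usepackage[utf8]{inputenc}
\usepackage[margin=2.5cm]{geometry} % change the page margins
\usepackage{amsmath}                % extra mathematical environments
```

The options in square brackets work the same way as in `\usepackage[utf8]{inputenc}`: they modify how the document class or package behaves.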
We’ll only change the date from <code class="language-plaintext highlighter-rouge">\date{April 2019}</code> to <code class="language-plaintext highlighter-rouge">\date{\today}</code>.</p> <p>Next up is the actual document, which is signified by the command <code class="language-plaintext highlighter-rouge">\begin{document}</code>, followed by <code class="language-plaintext highlighter-rouge">\end{document}</code> when the document ends. This is where you’ll write your actual paper. To understand how to do this, we’ll have to learn a bit of LaTeX.</p> <h3 id="writing-latex">Writing LaTeX</h3> <p>As can be seen from the code above, most commands in LaTeX follow one of two standards: either <code class="language-plaintext highlighter-rouge">\begin{command}</code> followed by <code class="language-plaintext highlighter-rouge">\end{command}</code> or just <code class="language-plaintext highlighter-rouge">\command</code>. Usually, a backslash <code class="language-plaintext highlighter-rouge">\</code> signifies a command, and the brackets <code class="language-plaintext highlighter-rouge">{</code> <code class="language-plaintext highlighter-rouge">}</code> are where you put the arguments for the command. An example is in the preamble of our document, where we use the command <code class="language-plaintext highlighter-rouge">\author{}</code> and feed it the argument <code class="language-plaintext highlighter-rouge">Cookieblues</code>. Another is the <code class="language-plaintext highlighter-rouge">\maketitle</code> command in our document. It doesn’t take any arguments, but it writes the title, author, and date that we defined in our preamble. We also use the command <code class="language-plaintext highlighter-rouge">\section{}</code>, which naturally creates a section in our document with the argument as the header.
If we want to make a subsection in that section, we just use <code class="language-plaintext highlighter-rouge">\subsection{}</code>.</p> <p>You might’ve noticed there’s a number next to the section. This is used for your table of contents. Let’s add a table of contents right after our title and before our introduction. We can do this with the command <code class="language-plaintext highlighter-rouge">\tableofcontents</code>. If we don’t want to have a section or subsection appear in the table of contents (and thereby not have a number next to it), we can put an asterisk at the end of the section command like <code class="language-plaintext highlighter-rouge">\subsection*{No number}</code>. Let’s add this subsection to our document as well. Our code should now look like the one below.</p> <p><span class="marginnote">You might have noticed the small comment after the subsection. If you want to write comments in your LaTeX code, you can use <code class="language-plaintext highlighter-rouge">%</code>.</span></p> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="k">\documentclass</span><span class="p">{</span>article<span class="p">}</span> <span class="k">\usepackage</span><span class="na">[utf8]</span><span class="p">{</span>inputenc<span class="p">}</span> <span class="k">\title</span><span class="p">{</span>The bare minimum guide to LaTeX<span class="p">}</span> <span class="k">\author</span><span class="p">{</span>Cookieblues<span class="p">}</span> <span class="k">\date</span><span class="p">{</span><span class="k">\today</span><span class="p">}</span> <span class="nt">\begin{document}</span> <span class="k">\maketitle</span> <span class="k">\tableofcontents</span> <span class="k">\section</span><span class="p">{</span>Introduction<span class="p">}</span> <span class="k">\subsection*</span><span class="p">{</span>No number<span class="p">}</span> <span class="c">% this is a comment that won’t appear in the document</span> <span class="nt">\end{document}</span></code></pre></figure> <h3 id="text">Text</h3> <p>If
you want to write something in your document, you just type it out. No need for special commands; you just write what you want, where you want it. I’ll write “<code class="language-plaintext highlighter-rouge">This is the introduction</code>” under the introduction section, and I’ll write “<code class="language-plaintext highlighter-rouge">This is the subsection under the introduction</code>” under the subsection. The commands for <strong>bold</strong> and <em>italic</em> text are <code class="language-plaintext highlighter-rouge">\textbf{bold}</code> and <code class="language-plaintext highlighter-rouge">\textit{italic}</code> respectively.</p> <h3 id="equations">Equations</h3> <p>This is where LaTeX shines! It’s also why it’s used for the vast majority of papers in the natural sciences. If you want to write an equation, you can use the <code class="language-plaintext highlighter-rouge">\begin{equation}</code> command, write your equation, and end it with <code class="language-plaintext highlighter-rouge">\end{equation}</code>.
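For instance, placed in the body of our document, an equation environment sits alongside ordinary text like this (the equation itself is just a placeholder example):

```latex
\section{Introduction}
This is the introduction. For example,
\begin{equation}
    a^2 + b^2 = c^2
\end{equation}
is the Pythagorean theorem.
```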
I use LaTeX equations on this website as well, and here are a couple of examples of regular functions:</p> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> x+y <span class="nt">\end{equation}</span></code></pre></figure> $x+y$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\sqrt</span><span class="p">{</span>2<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\sqrt{2}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\sin</span>(x) <span class="nt">\end{equation}</span></code></pre></figure> $\sin(x)$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\tan</span>(x) <span class="nt">\end{equation}</span></code></pre></figure> $\tan(x)$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\sqrt</span><span class="p">{</span><span class="k">\exp</span>(<span class="k">\cos</span>(2x))<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\sqrt{\exp(\cos(2x))}$ <h4 id="super--and-subscripts">Super- and subscripts</h4> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> x<span class="p">^</span>2 <span class="nt">\end{equation}</span></code></pre></figure> $x^2$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> a<span class="p">_{</span>12<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $a_{12}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> e<span 
class="p">^{</span>a+b<span class="p">}</span> = e<span class="p">^</span>a e<span class="p">^</span>b <span class="nt">\end{equation}</span></code></pre></figure> $e^{a+b} = e^a e^b$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\sin</span><span class="p">^</span>2(x) + <span class="k">\cos</span><span class="p">^</span>2(x) = 1 <span class="nt">\end{equation}</span></code></pre></figure> $\sin^2(x) + \cos^2(x) = 1$ <h4 id="fractions">Fractions</h4> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\frac</span><span class="p">{</span>1<span class="p">}{</span>2<span class="p">}</span> = 1/2 <span class="nt">\end{equation}</span></code></pre></figure> $\frac{1}{2} = 1/2$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\frac</span><span class="p">{</span>1<span class="p">}{</span><span class="k">\sin</span>(x)<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\frac{1}{\sin(x)}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\frac</span><span class="p">{</span>x+<span class="k">\frac</span><span class="p">{</span>1<span class="p">}{</span>x<span class="p">}}{</span>x<span class="p">^</span>2-1<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\frac{x+\frac{1}{x}}{x^2-1}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> (f/g)' = <span class="k">\frac</span><span class="p">{</span>f'g-fg'<span class="p">}{</span>g<span class="p">^</span>2<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $(f/g)' = \frac{f'g-fg'}{g^2}$ <h4 
id="integrals-sums-and-other-operators">Integrals, sums, and other operators</h4> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\int</span> x<span class="p">^</span>3 dx + <span class="k">\sum</span><span class="p">_{</span>n=1<span class="p">}^{</span>N<span class="p">}</span> n <span class="nt">\end{equation}</span></code></pre></figure> $\int x^3 dx + \sum_{n=1}^{N} n$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\int</span><span class="p">_</span>0<span class="p">^</span>1 <span class="k">\frac</span><span class="p">{</span>1<span class="p">}{</span>x<span class="p">}</span> dx + <span class="k">\sum</span><span class="p">_{</span>a,b<span class="p">}</span> a-b <span class="nt">\end{equation}</span></code></pre></figure> $\int_0^1 \frac{1}{x} dx + \sum_{a,b} a-b$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\int</span><span class="p">_</span>0<span class="p">^</span><span class="k">\pi</span> <span class="k">\sin</span> (x) dx + <span class="k">\sum</span><span class="p">_{</span>n=1<span class="p">}^</span>N <span class="k">\frac</span><span class="p">{</span>1<span class="p">}{</span>n<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\int_0^\pi \sin (x) dx + \sum_{n=1}^N \frac{1}{n}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\int</span> <span class="k">\frac</span><span class="p">{</span>x<span class="p">^{</span><span class="k">\sin</span> (x)<span class="p">}}{</span><span class="k">\sqrt</span><span class="p">{</span><span class="k">\cos</span> (x)<span class="p">}}</span> dx + <span class="k">\sum</span><span class="p">_{</span>n=1<span class="p">}^</span>N <span 
class="k">\frac</span><span class="p">{</span>n<span class="p">}{</span>n+1<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\int \frac{x^{\sin (x)}}{\sqrt{\cos (x)}} dx + \sum_{n=1}^N \frac{n}{n+1}$ <h4 id="greek-letters">Greek letters</h4> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> 2<span class="k">\pi</span> <span class="nt">\end{equation}</span></code></pre></figure> $2\pi$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> |x-a| &lt; <span class="k">\delta</span> <span class="nt">\end{equation}</span></code></pre></figure> $|x-a| &lt; \delta$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\varphi</span> = <span class="k">\frac</span><span class="p">{</span>1+<span class="k">\sqrt</span><span class="p">{</span>5<span class="p">}}{</span>2<span class="p">}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\varphi = \frac{1+\sqrt{5}}{2}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\int</span><span class="p">_</span>0<span class="p">^</span><span class="k">\pi</span> <span class="k">\sin</span> (x) dx <span class="nt">\end{equation}</span></code></pre></figure> $\int_0^\pi \sin (x) dx$ <h4 id="brackets">Brackets</h4> <p>Square brackets are just plainly written <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code>, curly brackets require a backslash beforehand <code class="language-plaintext highlighter-rouge">\{</code> <code class="language-plaintext highlighter-rouge">\}</code> as they are otherwise used for wrapping arguments in LaTeX. 
If the expression is large, you can write <code class="language-plaintext highlighter-rouge">\left</code> and <code class="language-plaintext highlighter-rouge">\right</code> before the parentheses to make LaTeX adjust their size for your expression. Here are some examples:</p> <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\{</span>1,2,3,4<span class="k">\}</span> <span class="nt">\end{equation}</span></code></pre></figure> $\{1,2,3,4\}$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> y - <span class="k">\left</span>( <span class="k">\frac</span><span class="p">{</span>1<span class="p">}{</span>x<span class="p">}</span> <span class="k">\right</span>)<span class="p">^</span>2 = 0 <span class="nt">\end{equation}</span></code></pre></figure> $y - \left( \frac{1}{x} \right)^2 = 0$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\left</span>( <span class="k">\int</span><span class="p">_</span>0<span class="p">^</span>t <span class="k">\log</span> (y) dy <span class="k">\right</span>)<span class="p">^</span>t <span class="nt">\end{equation}</span></code></pre></figure> $\left( \int_0^t \log (y) dy \right)^t$ <figure class="highlight"><pre><code class="language-latex" data-lang="latex"><span class="nt">\begin{equation}</span> <span class="k">\int</span><span class="p">_</span>a<span class="p">^</span>b <span class="k">\frac</span><span class="p">{</span>x<span class="p">}{</span>b-a<span class="p">}</span> dx = <span class="k">\left</span><span class="na">[ \frac{x}{b-a} \right]</span><span class="p">^</span>b<span class="p">_</span>a <span class="nt">\end{equation}</span></code></pre></figure> $\int_a^b \frac{x}{b-a} dx = \left[ \frac{x}{b-a} \right]^b_a$ <h3 id="final-tips">Final tips</h3> <p>You might’ve noticed that
your document didn’t change even though you changed your LaTeX code. This is because LaTeX is just that: code. There’s a big green button that says “Recompile” in the top left of your document. If you click it, the new code you have written will be compiled, and you will see it in your document (or you get an error, if there’s a mistake in your code). To the right of the button is a small downward-pointing arrow - if you click that, you can turn “Auto compile” on or off. This will get Overleaf to compile your code as you’re writing.</p> <p>When you get started writing mathematics in LaTeX, oftentimes you’ll find yourself in need of symbols that you don’t know the name of, and definitely not the LaTeX code for. In these cases, you could either Google your way to the answer, or use the brilliant site <a href="http://detexify.kirelabs.org/classify.html">Detexify</a>. You simply use your mouse to draw the symbol that you want, and Detexify tries to identify the most probable symbols that you’re looking for - and not only does it detect the symbol, but it also provides you with the package it’s from in case it requires one.
<span class="marginnote">I swear, I use Detexify so often - it’s worth a bookmark!</span></p> <h3 id="exercises">Exercises</h3> <p>To practice writing equations, try and see if you can write the equations underneath.</p> $a^z + b^z = c^z$ $\sqrt{\frac{n}{n-1}}$ $e^{ix} = r ( \cos \theta + i \sin \theta )$ $\int \frac{\sin (x)}{x^2+1} dx$ $\int_a^b x^{10} \sum_{i=1}^n i^2 dx$ <p>These next ones require something I haven’t shown in this post, so a little bit of Googling or outside-the-box thinking is required.</p> $1+2+3+\cdots+n = \frac{n(n+1)}{2}$ $\lim_{x \to x_0} \frac{f(x)-f(x_0)}{x-x_0} = c$ $\frac{1}{\zeta (s)} = \prod_{p \text{ prime}} \left( 1 - \frac{1}{p^s} \right)$ $\det \begin{pmatrix} a &amp; b \\ c &amp; d \end{pmatrix} = ad-bc$ $\max \left\{ 0, \left| \frac{1}{a}-b \right| \right\}$