{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Word embedding is a mapping of a word to a d-dimensional vector space.\n", "This real valued vector representation captures semantic and syntactic features.\n", "Polyglot offers a simple interface to load several formats of word embeddings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.mapping import Embedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Formats\n", "\n", "The Embedding class can read word embeddings from different sources:\n", "\n", "- Gensim word2vec objects: (`from_gensim` method)\n", "- Word2vec binary/text models: (`from_word2vec` method)\n", "- polyglot pickle files: (`load` method)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "embeddings = Embedding.load(\"/home/rmyeid/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "##Nearest Neighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common way to investigate the space capture by the embeddings is to query for the nearest neightbors of any word." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'blue',\n", " u'white',\n", " u'red',\n", " u'yellow',\n", " u'black',\n", " u'grey',\n", " u'purple',\n", " u'pink',\n", " u'light',\n", " u'gray']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neighbors = embeddings.nearest_neighbors(\"green\")\n", "neighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to calculate the distance between a word and the nieghbors, we can call the `distances` method" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1.34894466, 1.37864077, 1.39504588, 1.39524949, 1.43183875,\n", " 1.68007386, 1.75897062, 1.88401115, 1.89186132, 1.902614 ], dtype=float32)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embeddings.distances(\"green\", neighbors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The word embeddings are not unit vectors, actually the more frequent the word is the larger the norm of its own vector." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZYAAAESCAYAAADe2fNYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xm8HFWZ//HPAwECybAPCUs0CLgxkAAWIoMmVBQUHWRG\nRtGfKOi4oKgIFgrjAoqi1jAjuAA6DkEdRQcHhBEETUHEQaUQbgQjSgI6LAYFlMimYM7vjzqdU7dz\nl773dld13f6+X69+3dq6+vQDuc+ts5pzDhERkW7ZqO4CiIjI9KLEIiIiXaXEIiIiXaXEIiIiXaXE\nIiIiXaXEIiIiXVVLYjGzjc3sZjO7fJTz55jZ7Wa2wsz2qbp8IiIyeXU9sbwLWAlsMIjGzA4DdnfO\n7QG8GTi34rKJiMgUVJ5YzGwX4DDg3wEb4ZLDgQsBnHM/BrY2sznVlVBERKaijieWfwMSYN0o53cG\n7irt3w3s0utCiYhId1SaWMzsZcBvnXM3M/LTyvpL2/Y174yISEPMqPjzDgQO9+0oM4EtzexLzrnX\nla65B5hX2t/FHxvm8MMPd48//jhz584FYNasWey+++4sXLgQgKGhIYCB2G9t90t56txvj0nd5alz\nf9WqVRx55JF9U5469y+++OKB/v1w1VVXATB37lxmzZrFueeeO9Yf9lPnnKvlBSwCLh/h+GHAFX77\nAOBHI73/6KOPdnWVvd9ewGl1l6FfXoqFYqFYjP2q4ndn1U8s7RyAmb0FwDl3vnPuCjM7zMxWAY8A\nx470xjVr1lRXyv43v+4C9JH5dRegj8yvuwB9ZH7dBRgktSUW59xyYLnfPr/t3PG1FEpERKassSPv\nDz300LqL0E+W1l2APrK07gL0kaV1F6CPLK27AP1iwYIFPf+MxiaWViOVgHPu2rrL0C8Ui0CxCBSL\noIrfnY1NLOVeQIPOzBbXXYZ+oVgEikWgWFSrsYlFRET6U2MTy8KFC4nSbFbd5egHeswPFItAsQgU\ni2o1NrF429RdABERGa6xicW3sWiqF1R/XKZYBIpFoFhUq7GJxduk7gKIiMhwjU0svsucEguqPy5T\nLALFIlAsqtXYxOIpsYiI9JnGJhbfxqLEguqPyxSLQLEIFItqNTaxeEosIiJ9prGJRW0sgeqPA8Ui\nUCwCxaJajU0snhKLiEifaWxiURtLoPrjQLEIFItAsahWYxOLp8QiItJnGptY1MYSqP44UCwCxSJQ\nLKrV2MTiKbGIiPSZxiYWtbEEqj8OFItAsQgUi2o1NrF4SiwiIn2msYlFbSyB6o8DxSJQLALFolqN\nTSyeEouISJ9pbGJRG0ug+uNAsQgUi0CxqFZjE4s3o+4CiIjIcI1NLGpjCVR/HCgWgWIRKBbVamxi\n8ZRYRET6TGMTi9pYAtUfB4pFoFgEikW1GptYPCUWEZE+09jE4ttYDqy7HP1A9ceBYhEoFoFiUa3G\nJhYvrrsAIiIyXGMTi29j+Xzd5egHqj8OFItAsQgUi2o1NrF4GsciItJnGptYfBvLZnWXox+o/jhQ\nLALFIlAsqtXYxOIpsYiI9JnGJhbfxqLEguqPyxSLQLEIFItqNTaxeDPrLoCIiAxXeWIxs5lm9mMz\nGzKzlWZ25gjXLDazh8zsZv96f/s1amMJVH8cKBaBYhEoFtWqvFeVc+5xMzvYOfeomc0AfmBmBznn\nftB26XLn3OHj3E6JRUSkz9RSFeace9RvbgpsDDw4wmU21j3UxhKo/jhQLALFIlAsqlVLYjGzjcxs\nCLgPuMY5t7LtEgccaGYrzOwKM3v2KLda2NOCiojIhNX1xLLOObcQ2AV4wQh/TdwEzHPOLQA+DVza\nfo9Vq1Zxx9c/wcabzvyomZ1mZieU7+PbaQZi3zl3bT+Vp879Vl16v5Snzn1K+qE8de63x6Tu8lS5\n77eX+tdpvranp8w51/MPGbMAZh8AHnPO/csY19wJ7OecW19ltmzZMve+mwwgypP4xt6XVESk+ZYt\nW+aWLFkyZlPDVNXRK2x7M9vab28OvAi4ue2aOWZmfnt/igQ4rB2mlHV37nmh+1z7X2aDTLEIFItA\nsahWHXNt7QhcaGYbUSS2LzvnlpnZWwCcc+cDRwLHmdmTwKPAUWPc7wTgWz0us4iIdKj2qrDJKlWF\nkSdxTx/rRESmi2lZFdYDn6i7ACIiEjQ2sZTaWAZ+LIvqjwPFIlAsAsWiWo1NLCUn1F0AEREJGptY\n/FxhguZBKlMsAsUiUCyq1djEIiIi/amxiaU8ejRKs4Feolj1x4FiESgWgWJRrcYmljZal0VEpE80\nNrG0tbEMdM8w1R8HikWgWASKRbUam1jabF53AUREpNDYxNI2Q+eJdZWjH6j+OFAsAsUiUCyq1djE\n0kZPLCIifaKxiaWtjeWtdZWjH6j+OFAsAsUiUCyq1djEIiIi/amxiaV9FbQozS6oqSi1U/1xoFgE\nikWgWFSrsYnFO7C0fUxdhRARkaCxiWXhwoXkSfzDusvRD1R/HCgWgWIRKBbVamxiERGR/tTYxFJq\nY/mg/7mupqLUTvXHgWIRKBaBYlGtxiaWku/5n9Phu4iINF5jfxmXxrHc0dqI0qyn6zj3K9UfB4pF\noFgEikW1GptYSn5X2j6jtlKIiAjQ4MTSamPJk7jctnJglGab1lOi+qj+OFAsAsUiUCyq1djEMorF\nwPV1F0JEZJA1NrGMseb9flWWox+o/jhQLALFIlAsqtXYxNLmhvLOoC9VLCJSp8Ymlra5wvZvO/2q\nCotSO9UfB4pFoFgEikW1GptYxqHeYSIiNWlsYmlrY9my7fQnKixK7VR/HCgWgWIRKBbVamxiKcuT\n+I9th55XS0FERKS5iaV9PRag3GD/mgqLUjvVHweKRaBYBIpFtRqbWNrlSfwX4Ca/OyNKs72iNNum\nzjKJiAyixnbLHWUcy76l7Z8CfwJmVlKgGqn+OFAsAsUiUCyqNW2eWLyr2/Y3q6UUIiIDrLGJZYQ2\nFoAvtR+I0mzaj8RX/XGgWASKRaBYVKuxiWUUXxvh2I1Rmk237yki0rcq/4VrZjPN7MdmNmRmK83s\nzFGuO8fMbjezFWa2T/v5kdpY/EzH/znC7ab1uBbVHweKRaBYBIpFtSpPLM65x4GDnXMLgb2Bg83s\noPI1ZnYYsLtzbg/gzcC5nd4/T+LXAtu3HX7P1EotIiKdqqWKyDn3qN/cFNgYeLDtksOBC/21Pwa2\nNrM55QtGaWMBIE/iB7pW2AZQ/XGgWASKRaBYVKuWxGJmG5nZEHAfcI1zbmXbJTsDd5X27wZ2meDH\nnD6FIoqIyCTV9cSyzleF7QK8YJS/JtrXr3flnTHWY2n53qQL2DCqPw4Ui0CxCBSLao2bWMxsazP7\noJldYmbfLb3ax4xMmHPuIeDbwHPaTt0DzCvt7+KPrXfxxRdjZkvN7DT/OqGcoG48ecmM+/MrL2rt\nz5r3jCPK581ssfa1r33tT/d9v73Uv04bqxmhW8w5N/YFZt+lSECXAI+XTjnn3Bcn/IFm2wNPOuf+\nYGabA1cBpzvnlpWuOQw43jl3mJkdAHzKOXdA+T5nnXWWO+mkk9qfaoaJ0mw3YFXp0EvzJL5iomXu\nd2a2WH+RFRSLQLEIFItg2bJlbsmSJWP+7pyqTqZ02R/YwTn3py595o7AhWa2EUXC+rJzbpmZvQXA\nOXe+c+4KMzvMzFYBjwDHTvKz2mc9/naUZnsArwTOzpP4kUneV0RERtFJYrkeeCawohsf6Jy7heFz\nerWOn9+2f/xY9+mgjQU2TCwAt7duQZFgGk9/iQWKRaBYBIpFtTpJLMcAV5rZDyl6cbUeoZxz7sO9\nKliXPD7GuX+srBQiIgOkk15hH6Po/jsH2APY3b/26GG5xtVJA1SexGM3IE0T5Ua7QadYBIpFoFhU\nq5MnllcCz3DO3dvrwvTIacCpFIMxh4nSbNs8idsHZ4qIyBR08sRyJ/BErwsyUR22sZAn8el5Em8G\nfGCE03O7WqiaqP44UCwCxSJQLKrVyRPLl4BvmdmnKdpY1nPOZT0pVW/81QjHjgFOrrgcIiLTWidP\nLMdTdBH+GPDFtldtJjHI57wRjiUAUZo5/9p8ygWrgeqPA8UiUCwCxaJanTyx7Oac+0vPS9J7fxjp\nYJRm25R2VwBPr6Y4IiLT05hPLGY2A3jYzPpuid9O21hK1o5yvNx4f9nkSlMv1R8HikWgWASKRbXG\nTCzOuScpBhS2r2/SOHkS/wX4NPB1YLQqr+2qK5GIyPTUSRvLV4DLzewYM1tiZnHr1evCjWUyE6nl\nSfzOPImPypN4tIGTx7Q2ojR7epRm80a5rq+o/jhQLALFIlAsqtVJG8vb/M8PjXBu1y6WpS9EaeYo\nnlx+4Q/1dLI2EZHpZtwnFufcfP/atf1VRQFHM4k2lon4amvDz5Dc11R/HCgWgWIRKBbV6uSJBTPb\nA3gNsBPFuigXOed+2cuCVSAGRhuHc2hpexV6ahER6VgnC339HfAT4BkUPaieCdxoZi/vcdnGNNXF\navIkvoZiSYBx+TEus6f0gT2k+uNAsQgUi0CxqFYnTyxnAi93zl3TOuD/I30G+FaPylWJPIlzwKI0\nM2DdOJf/ET25iIiMq5NeYTsD17Ud+1+K5YJr0802Fj8L8me7dsOKqf44UCwCxSJQLKrVSWJZAbyn\ntWNmBpwI9H7h5ArlSXw88C91l0NEpOk6SSzHAf9kZr8xsxuAe4E3E7oh12KqbSyj+OhYJ6M0m9mL\nD50q1R8HikWgWASKRbU66W78c+BZFOuynOV/Pss5t7LHZatcnsQjzidW8liUZh+spDAiIg3VUXdj\n59wTbNjOUqsejmPJKLoiA+wGrG47f3qUZqf77R3zJF7Tq4J0SvXHgWIRKBaBYlGtcROLn4DyGGAh\nUO5y65xzr+tRuep0CLBNnsT3+33zo/FH8hvUU0xEZJhO2lguBN5FMTvw6rZXbXrUxkKexH8pJZWW\n5452fZRmm0dptmVPCtMh1R8HikWgWASKRbU6qQp7MbCrc+73vS5Mv8qT+IYozd4MfH6E048CRGm2\nf57EeZRme1NUkV1VaSFFRPqEOTdaLY+/wGwFcKhzrva2hLJly5a5JUuWVF4NFaXZfcAOo5x+JnCb\n3/574Ca/f0OexIt7XjgRkXFU8buzk6qwLwGXmtlrylPm1z1tfl3yJJ4DbDLK6dtK25cAv6ZY+2VR\nlGadxFpEpPE6+WX3DmAOxRiPJq953zV5Ej85ibft3H4gSrPXR2l2lZ9SZtJUfxwoFoFiESgW1Rq3\njcU5N7+CcjTR3wC3TuD63YC7AKI0eyPwNOBUf+5c4K1dLZ2ISE06GsfSj3q8Hksnbhv/kmGuidJs\nFkU12r+3nXvpVAqiPvqBYhEoFoFiUS3V+09SnsR/Ke3u2OHbHgFGGt1f64SeIiLd1NjEUmcbS8lu\nwF51j75X/XGgWASKRaBYVKuxVWH9IE/iO0q7WwNvomgvebh0/BSKNW1ERAbCuONY+lVd41g6UZoC\n5ljgYopFwsbzRuAyigb9M/IkfrBHxRORAdYv41iGMbN9zeyrZvZRM9vCzPYws3/uReEabCNgRp7E\nS/MkfnjcqwtfBH4HvBt4YKpdkEVE6jKZqrC/A94CzKP4JXgOsJhx1jLptqGhIZYsWVLlR3bMr0jZ\n3ri/VZ7Ev4jS7D8onmTGsy5Ks3+gGEM0G3gJ8MY8iX/VfqGZLVavl4JiESgWgWJRrckklp9SrMdy\nA7DSzA6naF+QUfjG/TV++w3AG6I0+xywCHj2GG/977b9O9FsyiLS5ybTK+wW4G9bO865y4CPda1E\nHeqDcSxTkifx2/Ik3hPYdSLvG6mKTH+JBYpFoFgEikW1JpRY/Hr3BwHXl4875y6ZwD3mmdk1ZvYz\nM7vVzN45wjWLzewhM7vZv94/kXI2yUhVW+NYN8b6MCIitRszsZjZduV9V1gK7OpnPZ6MJ4B3O+f2\nBA4A3m5mzxrhuuXOuX3864z2k30yjqVbXj7RN0RpNqe1rT76gWIRKBaBYlGt8Z5YjhzpoHPuIuCG\nyXygc26Nc27Ibz8M/BzYaYRLB6YtIU/iyyim2T8eeB6wroO3rVDPMRHpR+M13h9rZg8D1zjn7m07\n95OpfriZzQf2AX7cdsoBB/qnonuA9zjnVpYvaHobS7s8iS8t7W4cpdmlhCeZ4ygGXpbNoagWm+2c\nuzZKs6cB3weuypP4jb0vcX9SXXqgWASKRbXGe2J5CPhH4FYz+6WZnWdmrzKzORRVWpNmZrMpBg++\nyz+5lN0EzHPOLQA+DVza/v4BUJ7t+LIxrns4SrNrKJaK3pmix9lNPS2ZiMgYxkssH3DOHQFsD7wa\nWAW8DvgFcNZkP9TMNgG+CXzFObdB0nDO/dE596jfvhLYxMy2LV9z9tlnY2ZLzew0/zqhXI/qOwA0\ndv/Gk5c88/78O98AzsmT+N47LjrzTWtXh3altauHaO2vXT20uLwP7LPb0R9abmaLozRbGqXZN+v+\nPlXtt471S3lq3j+hz8pT5/60+v0wkX2/vdS/TquifXpSU7qY2QzgE865kybxXgMuBB5wzr17lGvm\nAL91zjkz2x/4Rvu6MGeddZY76aSTBq6NIUqzRcC15WNrVw+x5W7jVg3Oy5P47l6Vq1+YaSBci2IR\nKBZBFVO6THquMDNb2GqEn+D7DqJoC/gpRVsKFPNjPQXAOXe+mb2dol3hSeBR4ETn3I/K9+nnucJ6\nbYrdjWfmSfynrhVGRBqlrxNL3QY5scCUk0sCpBTT/t/pp6ARkQHQl5NQ9otpNo5lMvakeNLbt9S2\nsrjD96b+52r8gEv/anwniXI986BTLALFolpaj6Wh8iReCawEMFty8HM+ueyGPIkfjdJsKrd9eZRm\nz8uT+IddKaSIDKTGJpbpNo5lKtoaJbegaJc6AfjUJG53fZRmbwHOp5ix+gx/nzl5Ev/9FIvac2qg\nDRSLQLGoVmOrwmRkeRI/liexUYz/mazz/c9/Bh6jWCbhiNY0MlGaWZRmc6dWUhGZrhqbWNTGEoxU\nf5wn8TqKJ9KNS4en+pi3Jkqz7wF3AL+J0uz2Kd6v61SXHigWgWJRrcZWhcn48iRuLTZW7gFiUZod\nBXwvT+L7AaI0u4JiIbFOlFdX2z1Ks7l5Eq8p9VLbKk/itVMquIg0mroby3pRmg0BCyb4tkXAfcBt\nfn91nsS7d7VgItI16m4sVfvnSbznQKBcL7lbl8oiIg3V2Kqwfl7zvmrdmq4iT+JvR2l2HPA4cEHp\nuI0xIPPM9gNRmi0AbgWOoJhotOWmPIn3K103n2J6n2PzJL5jquUHTd1RplgEikW19MQiw+RJfF6e\nxEtHOD6RR+chiul4Lm47vm+UZn9b2r8TeAHFQM1RRWl2vR/AqZ5oIg3Q2MSicSxBj/4Sm03xi/95\nXb7vD8a7IEqzRVGa7eq3DyqV4dbx3qu/SgPFIlAsqqXGe+lYlGb7ExZl+wLwpi7efgvgExRT1cT+\n2CLgw/5nyz8CT+RJ/K0ufrbIwFDj/Rg0jiWosI9+Oejl5aT/CZhqT7BHgXcQkgrAcoYnFYD/Ai6N\n0uyIkW6i8QqBYhEoFtVqbGKR6uVJ/OfS7unAdhTrvHwxT+LVFP8/PXeEt54PPL/LxbnEt7s8GKXZ\ntuNfLiJVaWyvMLWxBBXXH8+gGAT5oN9v/cRPv38DwwdkrjfFCTJHsw2QAQujNDviOZ9cNrP0eUcA\npwFRnsRTWkq7idSuECgW1WpsYpF6+NH8D4574cg+SNFm0nIzsM+UCwULyt2hozSLgBNL5/8cpdlG\nrXVnojTbC9gpT+KruvDZItKmsVVhamMJmlJ/nCfxR3y35Y2ApwHPoeh5BnDFBLs0j8ivTXPiCKfW\nrztDsXrpd6I0e2mUZru0Xxil2V9FaTZrqmWpW1P+v6iCYlEtPbFI5fyTw50AUZrtBuyYJ/G9/pxF\nafYU4Ndj3GIm8ARwFPCfUyjK//gyvB34Zp7E90VpNhtY649vRPFvxOVJ/OQUPkdkoKi7sfSlKM2M\noq3G5UnsojS7BfgbGD5Yc4pLNI/nGGCp354J7Jon8W3+c18EXA0clifxlT0sg0hXac37MSixCIyb\nWL4O/B54a4+LsSXwMPAx4Lt5Evekl4JIN2gcyxjUxhIMeP3xLODlrZ3VX/nIG4B3ApvnSXxUnsTH\nUcxZ1kv3UgzufB+wrMef1bEB//9iGMWiWmpjkUbLk/hR4DJ8F2c7ecniPLnm2rZrvtU6P8ITzg4U\nAzw/NoVizAaS1k7bZ6wE9qNoz1kCHJMn8YWlazcB1pXWzhFpvMY+sWgcS6A++kEHsfh7imWbDwVe\nmifx7/IkPhP4TY+K9GyK5Z1bU3EvjdLsBIAozd4N/Bl40vdY2+DfY5RmG0Vp9rMozc4e4dyHojQ7\nw29v2t6TTf9fBIpFtdTGIgJEabYzcLff3QL4JbALsBnwr8DbR3jbbcAzu1yUz1I8PR0PfA64AtjL\nn9s6T+KHojTbmmKutiP98UMoOhIAbJEn8WPtN43S7DzgqRSdDZr5j166oorfnY2tCtN6LIHWmggm\nG4s8ie9h+IwB80rbx/tXezXXQoq1a7rp7YQkdkrbuTXA5hQdEsquLm2/H/jnKM02/umZrzlk71O+\n+lGKqro9/PnnA9/vbpH7n/6NVKuxVWEiNWmN1v9CnsR/8l2ftymdv4zil/tIPj/Fz57ZQffqU6M0\nWww8Of+VJ19BMbPBHqXzy0d7Y5RmC6M0O3CKZRRRVZhIL0RpZn78jcH6QaGjdY++l+GzRffaO/Ik\n/owvzy4UE4peRHjyOTBP4h+OdYMozZ5GsUDboXkSXz3WtdJfVBUm0lCtRDJGe8a/Ae/22/MpGvGr\n8ukozT7ry3aXP/aG0vlLojSbA/yJourtAOAc4LV5Ev8iSrNNCat+XkXbpKNRmh1FseT0M/Ik/lXv\nvob0q8ZWhWkcS6A++kEDYjGDYozNicBWwAw/8/IbKNakebhbH+TnTRvNujGq1eb4n5sB64DrKeZ1\nu82/50/li6M0mx2l2T9EabaVf0L7GrApYR442q63yczFFqXZsVGaTaqzRAP+v5hW9MQiUiE/XuUv\nfntt6fgFwAVRmm1O0RX6uxT/PjcBHvDVatswfGbp7YAH/PbbKHqR1WE1xXgggI+UT5SS18spvtPj\nFInnVVGa7Vp+ovFPQouB69p7tkVp9hNgX7+7QTWOf+/GI/WIk+qpjUWkQaI0+xnF2Jj2OdNmUoyX\nafkf4KUUMxPsRfEUsjNwbmWF7cz+wAKK6Xcuo0gsVwIfoljb5wN5Ep/R9nT1sjyJv93aidJsY6A1\nSeiRFNPqrGWC/DiilwI350l893jXN5XmChuDEovIcKVfvluN9IvVV1OdTbEEdJPMZsMqwp8Ce1N0\nOFhBaeYDb6NOx+v4armEYhDrQf7wZm0rpk5ZlGYzgLnAPXWOJdJcYWNQG0ug+uNgkGORJ7H511rY\nMBZ5Ers8id8JvAy4pHRq/w4/4mn+dX8XijsRI7U77e1/HsKGSQXgV1Ga3e7bfo7ZeOassQa9fYTi\nCemg0rHrJ1fUMf2BorPEuijNXtmD+/cNtbGIDBhfjfTtKM32BHbOkzgfZdno8pPC4jyJW2vozKNI\nRj9k9N5sEZB3teAT8xT/85sAux/zkdbS2J+h6FRwlj//bYrqr3b7wfqnmS0oJhj9F4pBqnPyJF4z\nkcJEabYdRbVky9eBb0zkHk2iqjARIUqzF1O0v+xC8Qfn4jyJRx1MWXrfIRSLre1D6Lr81jyJzy9d\n08xfMmP7L4qnvrcD2wO7AltT9OwrWwScQdHR4tnlE602Mr9s9rpeF7hFbSxjUGIR6S9Rmh0B7J0n\n8Yfbjm9B0ZPrSeD/UczDdl3pktP8q+UU4IvAJykWW5uuhiimBQL4ZZ7Ez6jiQ6dlYjGzecCXKLon\nOuDzzrlzRrjuHOAlFH8BHOOcu7l8/qyzznInnXSSEguaB6lMsQj6ORZRmj0XeAVFr68/RWl2KvBR\nIMqT+EZ/zQ7AfaW3vRR4AfBev386RdtIy3bAiyhmERhm7eohttyt72dE3xN4l3/tmCfxnT4GDwEX\nAz+g6Nn3DiDNk/jkkW7iO2lsnifxo76n2+Z5Ej/SOj9dE8tcYK5zbsjMZgM/AY5wzv28dM1hwPHO\nucPM7LnA2c65A8r3UWIJ+vkXSNUUi2A6xCJKs38lzFCwFUWX6tcBV+dJfFeUZm+jmBF67zyJb/Hv\nuZ8iyazXkMQyGa8DrqWYmfttFE+CFxDG/JR9LU/i10zLxLJBAcwuBT7tnFtWOnYecI1z7ut+/zZg\nkXNu/V8vqgoTGQx+YOjMPIk7WjPHV70dStFI/2CexP/nj28P/M5fNkSx7MFRbW//NMO7Y58LHDf5\n0m/gQuD1XbzfhORJbNM+sZjZfIrZVvd0zj1cOn45cKZz7nq//z3gvc65n7SuUWIRkckoN5b7wZVb\nU/SAe5TQlXp/4AY/48Ec4LfAM4CfU1TPzdngxp3ZjLYpcSo28+P7usen7SSUvhrsYuBd5aRSvqRt\nf1gGPPvss3nhC1+4FPiVP/QHYKj16N/qwz8I++XxCv1Qnjr322NSd3lq3l/onPtUH5Wnzv0T8L8f\n8iRe13b+ATPbq+36zYFFJO7aPInvK11vUZrZ2tVD64D11Wutedn8/tybP/TyvXZ73enfLZ9/4CdX\nv/X+/Dt/jtLst2tXD+0wxvu7ur929RD331is9rDZNnMfH9po756vZVXLE4uZbUIx5cSVrf/x286f\nB1zrnLvI729QFaY2lmA61KV3i2IRKBZBt2Phlw14KsWgygUUHRGAYd2IjwA+AbwiT+JbS++NKKar\nuTZP4oOjNPs1YdwNwFrgU8AHKX5Pvqzt4xdTtKtMysf3dUy7qjAzM4p6xgecc+8e5Zpy4/0BwKfa\nG+9VFSYi/aI0Vudf8yQ+aYLvLc91tsFUMlGaHU3RkxYoEtcoY4MOBq4Z7/OqSCx1VIX9LfBa4Kdm\n1upCfCo+YzvnznfOXWFmh5nZKuAR4Ngayiki0qkFFMs+nz/ehe38jNej/qLPk/jLwJejNJuZJ3Fr\nKeytKOabj91iAAAK5ElEQVR9WwrcmifxAwBRmr2CYraEq0q3mAW8mWINoPOBt0y0jBNVe6+wyVJV\nWKAqj0CxCBSLYNBj4ce2PANY9fF93RPT8YlFREQq5GdTvg1g2bJl41w9dY2d3Xjhwmk52GlSBvkv\nsXaKRaBYBIpFtRqbWEREpD81NrFoPZZgkNcgaadYBIpFoFhUq7GJRURE+lNjE4vaWALVHweKRaBY\nBIpFtRqbWEREpD81NrGojSVQ/XGgWASKRaBYVKuxiUVERPpTYxOL2lgC1R8HikWgWASKRbUam1hE\nRKQ/NTaxqI0lUP1xoFgEikWgWFSrsYlFRET6U2MTi9pYAtUfB4pFoFgEikW1GptYRESkPzU2saiN\nJVD9caBYBIpFoFhUq7GJRURE+lNjE4vaWALVHweKRaBYBIpFtRqbWEREpD81NrGojSVQ/XGgWASK\nRaBYVKuxiUVERPpTYxOL2lgC1R8HikWgWASKRbUam1hERKQ/NTaxqI0lUP1xoFgEikWgWFSrsYlF\nRET6U2MTi9pYAtUfB4pFoFgEikW1GptYRESkPzU2saiNJVD9caBYBIpFoFhUq7GJRURE+lNjE4va\nWALVHweKRaBYBIpFtRqbWEREpD81NrGojSVQ/XGgWASKRaBYVKuxiUVERPpTYxOL2lgC1R8HikWg\nWASKRbUam1hERKQ/VZ5YzOw/zOw+M7tllPOLzewhM7vZv94/0nVqYwlUfxwoFoFiESgW1arjieUC\n4MXjXLPcObePf50x0gWrVq3qfsmaS/WCgWIRKBaBYuFV8Ud55YnFOXcd8PtxLrPx7vPII490p0DT\nw9Z1F6CPKBaBYhEoFt6KFSt6/hn92MbigAPNbIWZXWFmz667QCIi0rkZdRdgBDcB85xzj5rZS4BL\ngae3X7RmzZrKC9bH5tddgD4yv+4C9JH5dRegj8yvuwCDxJxz1X+o2XzgcufcXh1ceyewn3PuwfLx\n4447zpWrwxYsWDCwXZCHhoYG9ru3UywCxSIY5FgMDQ0Nq/6aNWsW55577rjNDVPRd4nFzOYAv3XO\nOTPbH/iGc25+tSUUEZHJqrwqzMy+BiwCtjezu4APAZsAOOfOB44EjjOzJ4FHgaOqLqOIiExeLU8s\nIiIyffVjr7BxmdmLzew2M7vdzN5bd3m6wczmmdk1ZvYzM7vVzN7pj29rZt81s1+a2dVmtnXpPaf4\nGNxmZoeUju9nZrf4c2eXjm9mZl/3x39kZk+t9ltOjJlt7AfJXu73BzIWZra1mV1sZj83s5Vm9twB\njsUp/t/ILWb2VV/2gYjFSIPLq/ruZvZ6/xm/NLPXjVtY51yjXsDGwCqKXh6bAEPAs+ouVxe+11xg\nod+eDfwCeBbwSeBkf/y9wMf99rP9d9/Ex2IV4Qn0BmB/v30F8GK//Tbgc377VcBFdX/vcWJyIvCf\nwGV+fyBjAVwIvMFvzwC2GsRY+O9zB7CZ3/868PpBiQXwfGAf4JbSsZ5/d2BbYDXFWKCtW9tjlrXu\nYE0iuM8DvlPafx/wvrrL1YPveSnwQuA2YI4/Nhe4zW+fAry3dP13gAOAHYGfl44fBZxXuua5fnsG\n8Lu6v+cY338X4HvAwRQdPRjEWFAkkTtGOD6IsdiW4g+ubXw5LwdeNEixoEgS5cTS8+8OvBo4t/Se\n84CjxipnE6vCdgbuKu3f7Y9NG77X3D7Ajyn+p7nPn7oPmOO3d6L47i2tOLQfv4cQn/Wxc849CTxk\nZtt2/xt0xb8BCbCudGwQY7Er8Dszu8DMbjKzL5jZLAYwFq4YcnAW8H/AvcAfnHPfZQBjUdLr777d\nGPcaVRMTy7TubWBms4FvAu9yzv2xfM4Vfy5M6+8PYGYvo+hyfjOjTO8zKLGg+MtxX4oqin2BRyie\n0tcblFiY2W7ACRR/te8EzDaz15avGZRYjKSfvnsTE8s9wLzS/jyGZ9PGMrNNKJLKl51zl/rD95nZ\nXH9+R+C3/nh7HHahiMM9frv9eOs9T/H3mgFs5doGnvaJA4HDrRgc+zUgNrMvM5ixuBu42zmX+/2L\nKRLNmgGMxXOA651zD/i/qP+bomp8EGPR0ut/Ew+McK9xf+c2MbHcCOxhZvPNbFOKRqbLai7TlJmZ\nAV8EVjrnPlU6dRlFAyX+56Wl40eZ2aZmtiuwB3CDc24NsNb3HDLgaOBbI9zrSGBZz77QFDjnTnXO\nzXPO7UpRB5w5545mMGOxBrjLzFrTGr0Q+BlF+8JAxYKiPeEAM9vcf4cXAisZzFi0VPFv4mrgECt6\nJ25D0a511ZilqrsxapINWC+haMRbBZxSd3m69J0OomhPGAJu9q8XUzRYfg/4pf8PvHXpPaf6GNwG\nHFo6vh9wiz93Tun4ZsA3gNuBHwHz6/7eHcRlEaFX2EDGAlgA5MAKir/StxrgWJxMkVhvoegtt8mg\nxILi6f1e4M8UbSHHVvXd/Wfd7l+vH6+sGiApIiJd1cSqMBER6WNKLCIi0lVKLCIi0lVKLCIi0lVK\nLCIi0lVKLCIi0lVKLCJTZGan+ZkBqv7c+Wa2zsz071j6iv6HlGnHr0NxRdux20c59soufOSog8HM\nbLH/5f9HM1vr17N4cxc+U6RvKbHIdLQcONBPWdGaQ2kGsLD1170/thvw/Ync2Mw2nkR57nHO/ZVz\nbkvgXcDnzGzPSdxHpBGUWGQ6upFiqo+Ffv/5wDUU016Uj612zq0xs53M7DIze8A/xfxT60a+muti\nM/uymT0EvN7MdjWz5f4J5Gpg+04L5py7EniAYhE3zGwbM/sfM/utmT1oZpeb2fopyc3sWjP7sJn9\nwH/eVX4q8w2Y2SvM7E4ze3an5RHpBSUWmXacc3+mWMtmkT/0AuA64Ad+u3Vsud++iGKNjx0pJt/7\nmJkdXLrl4cB/Oee2Ar7qXzmwHfARion7xp0bycw2MrPDKeb6url1mGLy0af412PAZ9re+mrgGGAH\nYFPgPRve2o4FPg4scc6tHK8sIr2kxCLT1XJCEjmIosrrutKx5wPLzWwexTT973XO/dk5twL4d6C8\nrvf1zrnWDNo7UEzf/gHn3BPOuesoZtcdcd0Ybycz+z3wKHAJcLRzbjUUi1c55y5xzj3unHsY+Bgh\nIUKRsC5wzq1yzj1OMUngwrb7v5si2Sxyzt0xfmhEekuJRaar7wMH+Wm+/9r/Iv8hRdvLNsCe/pqd\ngAedc4+U3vt/DF8hr7z2xE7A751zj5WO/XqcstzrnNsG2BI4Gzi11NazhZmdb2a/8lVty4GtWu1D\n3prS9mPA7Lb7nwR81jl37zjlEKmEEotMVz+iqHJ6E/C/AM65tRTTjr+Z4pf9r/3+tn7lzpanMDyZ\nlKu5fgNsY2ZblI49lQ6qwnwV3Xt9uY72h08Cng7s76vaFlE8/Yz1BNTuEOD9ZvYPE3iPSM8osci0\n5J8obgROZHjPrx/4Y8v9dXcB1wNnmtlmZrY38AbgK6Pc99f+vqeb2SZmdhDwsgmU6wmKddtP9odm\nUzyFtNZW/9AIbxsvyfyMYu2ez5rZ33VaFpFeUWKR6Ww58NcUyaTlOopeXOVk82qKddTvpVhI64PO\nucyfG2kd8dcAzwUeBD5IseDUWNrf/x/ADr4h/1PA5sD9FAnuyhGud23b7fs4535KkeC+YGaHjlMe\nkZ7SQl8iItJVemIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGu\nUmIREZGu+v8K14k92rmkbAAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "norms = np.linalg.norm(embeddings.vectors, axis=1)\n", "window = 300\n", "smooth_line = np.convolve(norms, np.ones(window)/float(window), mode='valid')\n", "plt.plot(smooth_line)\n", "plt.xlabel(\"Word Rank\"); _ = plt.ylabel(\"$L_2$ norm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This could be problematic for some applications and training algorithms.\n", "We can normalize them by $L_2$ norms to get unit vectors to reduce effects of word frequency, as the following" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "embeddings = embeddings.normalize_words()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "white 0.4261\n", "blue 0.4451\n", "black 0.4591\n", "red 0.4786\n", "yellow 0.4947\n", "grey 0.6072\n", "purple 0.6392\n", "light 0.6483\n", "pink 0.6574\n", "colour 0.6824\n" ] } ], "source": [ "neighbors = embeddings.nearest_neighbors(\"green\")\n", "for w,d in zip(neighbors, embeddings.distances(\"green\", neighbors)):\n", " print(\"{:<8}{:.4f}\".format(w,d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vocabulary Expansion" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.mapping import CaseExpander, DigitExpander" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not all the words are available in the dictionary defined by the word embeddings.\n", "Sometimes it would be useful to map new words to similar ones that we have embeddings for.\n", "\n", "### Case Expansion\n", "For example, the word `GREEN` is not available in the embeddings," ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"GREEN\" in embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we would like to return the vector that represents the word `Green`, to do that we apply a case expansion:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "embeddings.apply_expansion(CaseExpander)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"GREEN\" in embeddings" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'White',\n", " u'Black',\n", " u'Brown',\n", " u'Blue',\n", " u'Diamond',\n", " u'Wood',\n", " u'Young',\n", " u'Hudson',\n", " u'Cook',\n", " u'Gold']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embeddings.nearest_neighbors(\"GREEN\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Digit Expansion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reduce the size of the vocabulary while training the embeddings by grouping special classes of words.\n", "Once common case of such grouping is digits.\n", "Every digit in the training corpus get replaced by the symbol `#`.\n", "For example, a number like `123.54` becomes `###.##`.\n", "Therefore, querying the embedding for a new number like `434` will result in a failure" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"434\" in embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fix that, we apply another type of vocabulary expansion `DigitExpander`.\n", "It will map any number to a sequence of `#`s." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "embeddings.apply_expansion(DigitExpander)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"434\" in embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the neighbors of the new number `434` will be other numbers:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'##',\n", " u'#',\n", " u'3',\n", " u'#####',\n", " u'#,###',\n", " u'##,###',\n", " u'##EN##',\n", " u'####',\n", " u'###EN###',\n", " u'n']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embeddings.nearest_neighbors(\"434\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demo" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Demo is available [here](https://bit.ly/embeddings)." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Citation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This work is a direct implementation of the research being described in the [Polyglot: Distributed Word Representations for Multilingual NLP](http://www.aclweb.org/anthology/W13-3520) paper.\n", "The author of this library strongly encourage you to cite the following paper if you are using this software." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "@InProceedings{polyglot:2013:ACL-CoNLL,\n", " author = {Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven},\n", " title = {Polyglot: Distributed Word Representations for Multilingual NLP},\n", " booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},\n", " month = {August},\n", " year = {2013},\n", " address = {Sofia, Bulgaria},\n", " publisher = {Association for Computational Linguistics},\n", " pages = {183--192}, \n", " url = {http://www.aclweb.org/anthology/W13-3520}\n", "}\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }