{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Serial Version and Spark Tuning (word-level correction)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# word_level_documentation.ipynb\n", "\n", "######################\n", "#\n", "# Submission by Kendrick Lo (Harvard ID: 70984997) for\n", "# CS 205 - Computing Foundations for Computational Science (Prof. R. Jones)\n", "# Advisor: A. Peleg\n", "# \n", "# This is part of a joint project with Gioia Dominedo that includes a separate\n", "# component for context checking. This notebook outlines algorithms for\n", "# word-level correction, and includes a serial Python algorithm based on a third\n", "# party algorithm (namely SymSpell, see below), as well as a Spark/Python\n", "# algorithm. A number of optimizations/compromises were attempted with varying\n", "# levels of success -- these attempts have been documented in this notebook.\n", "#\n", "######################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# contents\n", "\n", "* [Performance Summary](#Performance-Summary)
\n", "\n", "* [1a. word/document - original serial implementation](#1a.-serial-code-performance)\n", "* [1b. single word - optimized serial version (SymSpell python port)](#1b.-optimized-serial-version)\n", "* [2. single word - SPARK implementation (slow)](#2.-Original-SPARK-version:-SLOW)\n", "* [3. single word - SPARK implementation (faster)](#3.-Optimized-SPARK-version:-FASTER)\n", "* [4. single word - SPARK implementation (also fast)](#4.-Optimized-SPARK-version:-ALSO-FAST)\n", "* [5. document - SPARK implementation](#5.-SPARK-version-document-check)
\n",
"\n",
"* [AWS Experiments](#AWS-experiments)\n",
"* [Appendix: code snippets from optimization experiments](#Appendix)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performance Summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1a. serial code performance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 26.2 s, sys: 556 ms, total: 26.8 s\n",
"Wall time: 26.8 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 60.6 ms, sys: 11.5 ms, total: 72.1 ms\n",
"Wall time: 64 ms\n",
"Out[8]:\n",
"[('there', (2972, 0)),\n",
" ('these', (1231, 1)),\n",
" ('where', (977, 1)),\n",
" ('here', (691, 1)),\n",
" ('three', (584, 1)),\n",
" ('thee', (26, 1)),\n",
" ('chere', (9, 1)),\n",
" ('theme', (8, 1)),\n",
" ('the', (80030, 2)), ...\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"get_suggestions(\"zzffttt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"number of possible corrections: 0\n",
" edit distance for deletions: 3\n",
"CPU times: user 191 µs, sys: 81 µs, total: 272 µs\n",
"Wall time: 208 µs\n",
"Out[4]:\n",
"[]\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"best_word(\"there\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 55.3 ms, sys: 2.97 ms, total: 58.2 ms\n",
"Wall time: 57.2 ms\n",
"Out[4]:\n",
"('there', (2972, 0))\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Finding misspelled words in your document...\n",
"In line 3, taiths: suggested correction is < faith >\n",
"In line 11, the word < oonipiittee > was not found (no suggested correction)\n",
"In line 13, tj: suggested correction is < to >\n",
"In line 13, mnnff: suggested correction is < snuff >\n",
"In line 13, gjpt: suggested correction is < get >\n",
"In line 15, bh: suggested correction is < by >\n",
"In line 15, snc: suggested correction is < sac >\n",
"In line 15, uth: suggested correction is < th >\n",
"In line 15, unuer: suggested correction is < under >\n",
"In line 20, mthiitt: suggested correction is < thirty >\n",
"In line 21, cas: suggested correction is < was >\n",
"In line 22, pythian: suggested correction is < scythian >\n",
"In line 26, brainin: suggested correction is < brain >\n",
"In line 27, jfl: suggested correction is < of >\n",
"In line 28, ji: suggested correction is < i >\n",
"In line 28, stice: suggested correction is < stick >\n",
"In line 28, blaci: suggested correction is < black >\n",
"In line 28, eug: suggested correction is < dug >\n",
"In line 28, debbs: suggested correction is < debts >\n",
"In line 29, nericans: suggested correction is < americans >\n",
"In line 30, ainin: suggested correction is < again >\n",
"In line 30, ergs: suggested correction is < eggs >\n",
"In line 31, trumped: suggested correction is < trumpet >\n",
"In line 32, erican: suggested correction is < american >\n",
"In line 33, unorthodox: suggested correction is < orthodox >\n",
"In line 33, nenance: suggested correction is < penance >\n",
"In line 33, thg: suggested correction is < the >\n",
"In line 34, sln: suggested correction is < son >\n",
"In line 34, rgs: suggested correction is < rags >\n",
"In line 38, williaij: suggested correction is < william >\n",
"In line 38, eu: suggested correction is < e >\n",
"In line 40, fcsf: suggested correction is < ff >\n",
"In line 40, ber: suggested correction is < be >\n",
"In line 42, unorthodoxy: suggested correction is < orthodox >\n",
"In line 42, thpt: suggested correction is < that >\n",
"In line 42, the word < senbrnrgs > was not found (no suggested correction)\n",
"In line 44, fascism: suggested correction is < fascia >\n",
"In line 62, loo: suggested correction is < look >\n",
"In line 65, ththn: suggested correction is < then >\n",
"In line 65, scbell: suggested correction is < bell >\n",
"In line 65, ife: suggested correction is < if >\n",
"In line 65, yktcn: suggested correction is < skin >\n",
"In line 65, thl: suggested correction is < the >\n",
"In line 66, thi: suggested correction is < the >\n",
"In line 68, saij: suggested correction is < said >\n",
"In line 69, defendants: suggested correction is < defendant >\n",
"In line 69, cornr: suggested correction is < corner >\n",
"In line 69, nists: suggested correction is < fists >\n",
"In line 72, ro: suggested correction is < to >\n",
"In line 74, ath: suggested correction is < at >\n",
"In line 75, tti: suggested correction is < ti >\n",
"In line 75, rg: suggested correction is < re >\n",
"In line 75, acrific: suggested correction is < pacific >\n",
"In line 77, korea: suggested correction is < more >\n",
"In line 78, ro: suggested correction is < to >\n",
"In line 78, doatli: suggested correction is < death >\n",
"In line 81, ith: suggested correction is < it >\n",
"In line 81, ry: suggested correction is < by >\n",
"In line 81, kl: suggested correction is < ll >\n",
"In line 81, ech: suggested correction is < each >\n",
"In line 82, rb: suggested correction is < re >\n",
"In line 82, the word < ghmhvestigat > was not found (no suggested correction)\n",
"In line 82, nb: suggested correction is < no >\n",
"In line 82, rg: suggested correction is < re >\n",
"In line 83, rosenbt: suggested correction is < rodent >\n",
"In line 83, rgs: suggested correction is < rags >\n",
"In line 84, coriritted: suggested correction is < committed >\n",
"In line 86, fighti: suggested correction is < fight >\n",
"In line 88, bths: suggested correction is < baths >\n",
"In line 88, tchf: suggested correction is < the >\n",
"In line 91, ro: suggested correction is < to >\n",
"In line 91, ijb: suggested correction is < in >\n",
"In line 92, telegrnm: suggested correction is < telegram >\n",
"In line 92, jillia: suggested correction is < william >\n",
"In line 92, patt: suggested correction is < part >\n",
"In line 92, rson: suggested correction is < son >\n",
"In line 93, ecretdry: suggested correction is < secretary >\n",
"In line 95, purview: suggested correction is < purves >\n",
"In line 95, rder: suggested correction is < order >\n",
"In line 99, gor: suggested correction is < for >\n",
"In line 99, dthethg: suggested correction is < teeth >\n",
"In line 99, ared: suggested correction is < are >\n",
"In line 99, ro: suggested correction is < to >\n",
"In line 99, enb: suggested correction is < end >\n",
"In line 99, rg: suggested correction is < re >\n",
"In line 100, sacc: suggested correction is < sac >\n",
"In line 100, vthnz: suggested correction is < the >\n",
"In line 100, dri: suggested correction is < dry >\n",
"In line 100, yfu: suggested correction is < you >\n",
"In line 101, ile: suggested correction is < ill >\n",
"In line 101, rosi: suggested correction is < rose >\n",
"In line 101, rg: suggested correction is < re >\n",
"In line 102, fnir: suggested correction is < fair >\n",
"In line 102, jhy: suggested correction is < why >\n",
"In line 102, azi: suggested correction is < ami >\n",
"In line 103, fascist: suggested correction is < fascia >\n",
"In line 104, nb: suggested correction is < no >\n",
"-----\n",
"total words checked: 700\n",
"total unknown words: 3\n",
"total potential errors found: 94\n",
"CPU times: user 14.5 s, sys: 147 ms, total: 14.6 s\n",
"Wall time: 15 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1b. optimized serial version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", "\n", "corresponds to `serial_single.py`\n", "
\n",
"Please wait...\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
" \n",
"Word correction\n",
"---------------\n",
"Enter your input (or enter to exit): hello\n",
"('hello', (1, 0))\n",
" \n",
"Enter your input (or enter to exit): there\n",
"('there', (2972, 0))\n",
" \n",
"Enter your input (or enter to exit): thinkl\n",
"('think', (557, 1))\n",
" \n",
"Enter your input (or enter to exit): prest\n",
"('rest', (209, 1))\n",
" \n",
"Enter your input (or enter to exit): \n",
"goodbye\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Original SPARK version: SLOW"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " WARNING: This version is extremely slow for even moderately sized files.\n", "
\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 113 ms, sys: 29.4 ms, total: 142 ms\n",
"Wall time: 5min 26s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 79.9 ms, sys: 20.2 ms, total: 100 ms\n",
"Wall time: 1min 33s\n",
"Out[6]:\n",
"[(u'there', (2972, 0)),\n",
" (u'these', (1231, 1)),\n",
" (u'where', (977, 1)),\n",
" (u'here', (691, 1)),\n",
" (u'three', (584, 1)),\n",
" (u'thee', (26, 1)),\n",
" (u'chere', (9, 1)),\n",
" (u'theme', (8, 1)), ...\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"get_suggestions(\"zzffttt\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 0\n",
" edit distance for deletions: 3\n",
"CPU times: user 76.2 ms, sys: 18.9 ms, total: 95.1 ms\n",
"Wall time: 2min 7s\n",
"Out[5]:\n",
"[]\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"best_word(\"there\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 95.8 ms, sys: 22.7 ms, total: 119 ms\n",
"Wall time: 1min 33s\n",
"Out[4]:\n",
"(u'there', (2972, 0))\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Optimized SPARK version: FASTER"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 52 ms, sys: 13 ms, total: 65 ms\n",
"Wall time: 54.1 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 56.6 ms, sys: 14.8 ms, total: 71.4 ms\n",
"Wall time: 53.7 s\n",
"Out[12]:\n",
"[(u'there', (2972, 0)),\n",
" (u'these', (1231, 1)),\n",
" (u'where', (977, 1)),\n",
" (u'here', (691, 1)),\n",
" (u'three', (584, 1)),\n",
" (u'thee', (26, 1)),\n",
" (u'chere', (9, 1)),\n",
" (u'theme', (8, 1)),\n",
" (u'the', (80030, 2)), ...\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"get_suggestions(\"zzffttt\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 0\n",
" edit distance for deletions: 3\n",
"CPU times: user 57.7 ms, sys: 15.6 ms, total: 73.4 ms\n",
"Wall time: 57.3 s\n",
"Out[4]:\n",
"[]\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"best_word(\"there\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 80.6 ms, sys: 19.3 ms, total: 99.8 ms\n",
"Wall time: 59.7 s\n",
"Out[4]:\n",
"(u'there', (2972, 0))\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Optimized SPARK version: ALSO FAST"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " The difference between this version and #3 is that it does not keep the corpus in RDD form; the corpus is broadcast to the workers by the driver and used as a dictionary upon which lookups are performed. When scaling, the dictionary itself, while large, can be smaller than the documents being checked and should fit in memory (since the serial version assumes that the dictionary can be stored in memory anyway).\n", "
\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 10.9 s, sys: 1.01 s, total: 11.9 s\n",
"Wall time: 42.1 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 25.3 s, sys: 1.65 s, total: 26.9 s\n",
"Wall time: 1min 3s\n",
"Out[6]:\n",
"[('there', (2972, 0)),\n",
" ('these', (1231, 1)),\n",
" ('where', (977, 1)),\n",
" ('here', (691, 1)),\n",
" ('three', (584, 1)),\n",
" ('thee', (26, 1)),\n",
" ('chere', (9, 1)),\n",
" ('theme', (8, 1)),\n",
" ('the', (80030, 2)), ...\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"get_suggestions(\"zzffttt\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 0\n",
" edit distance for deletions: 3\n",
"CPU times: user 15.9 s, sys: 1.3 s, total: 17.2 s\n",
"Wall time: 18.5 s\n",
"Out[6]:\n",
"[]\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"best_word(\"there\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 23.3 s, sys: 1.3 s, total: 24.6 s\n",
"Wall time: 1min 2s\n",
"Out[4]:\n",
"('there', (2972, 0))\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. SPARK version document check"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" This version parallelizes the word check for all the words in a document, using word-level correction. Since SPARK does not permit RDD manipulation from within an RDD transformation (i.e. no parallelism within a parallel task), we converted the `get_suggestions` function that acts on an individual word to a serial method. This allows us to then parallelize across multiple words in a document. This is a reasonable trade off when the number of words in a document is much larger compared to the number of suggestions that will likely be found for any given word).
\n", " IMPORTANT NOTE REGARDING PERFORMANCE RESULTS: The (modified) `no_RDD_get_suggestions` function still returns an entire list of all possible suggestions to the calling function (e.g. for context checking), even if only the top match is used or required. Future improvements may be made to `no_RDD_get_suggestions` to terminate early once a \"top\" match (e.g. minimum edit distance) is found; a speedup in that function will in turn lead to a performance improvement of the document checking function as well.\n", "
\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 10.9 s, sys: 976 ms, total: 11.8 s\n",
"Wall time: 41.2 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 604\n",
" edit distance for deletions: 3\n",
"CPU times: user 56.3 ms, sys: 4.17 ms, total: 60.5 ms\n",
"Wall time: 58.2 ms\n",
"Out[3]:\n",
"[('there', (2972, 0)),\n",
" ('these', (1231, 1)),\n",
" ('where', (977, 1)),\n",
" ('here', (691, 1)),\n",
" ('three', (584, 1)),\n",
" ('thee', (26, 1)),\n",
" ('chere', (9, 1)),\n",
" ('theme', (8, 1)),\n",
" ('the', (80030, 2)), ...\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"no_RDD_get_suggestions(\"zzffttt\", d, lwl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"looking up suggestions based on input word...\n",
"number of possible corrections: 0\n",
" edit distance for deletions: 3\n",
"CPU times: user 419 µs, sys: 108 µs, total: 527 µs\n",
"Wall time: 435 µs\n",
"Out[3]:\n",
"[]\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Finding misspelled words in your document...\n",
" Unknown words (line number, word in text):\n",
"[(11, 'oonipiittee'), (42, 'senbrnrgs'), (82, 'ghmhvestigat')]\n",
" Words with suggested corrections (line number, word in text, top match):\n",
"[(3, 'taiths --> faith'), (13, 'gjpt --> get'), (13, 'tj --> to'), (13, 'mnnff --> snuff'), (15, 'bh --> by'), (15, 'uth --> th'), (15, 'unuer --> under'), (15, 'snc --> sac'), (20, 'mthiitt --> thirty'), (21, 'cas --> was'), (22, 'pythian --> scythian'), (26, 'brainin --> brain'), (27, 'jfl --> of'), (28, 'eug --> dug'), (28, 'stice --> stick'), (28, 'blaci --> black'), (28, 'ji --> i'), (28, 'debbs --> debts'), (29, 'nericans --> americans'), (30, 'ergs --> eggs'), (30, 'ainin --> again'), (31, 'trumped --> trumpet'), (32, 'erican --> american'), (33, 'thg --> the'), (33, 'nenance --> penance'), (33, 'unorthodox --> orthodox'), (34, 'rgs --> rags'), (34, 'sln --> son'), (38, 'eu --> e'), (38, 'williaij --> william'), (40, 'fcsf --> ff'), (40, 'ber --> be'), (42, 'thpt --> that'), (42, 'unorthodoxy --> orthodox'), (44, 'fascism --> fascia'), (62, 'loo --> look'), (65, 'ththn --> then'), (65, 'thl --> the'), (65, 'yktcn --> skin'), (65, 'scbell --> bell'), (65, 'ife --> if'), (66, 'thi --> the'), (68, 'saij --> said'), (69, 'cornr --> corner'), (69, 'defendants --> defendant'), (69, 'nists --> lists'), (72, 'ro --> to'), (74, 'ath --> at'), (75, 'rg --> re'), (75, 'acrific --> pacific'), (75, 'tti --> tit'), (77, 'korea --> more'), (78, 'doatli --> death'), (78, 'ro --> to'), (81, 'ry --> by'), (81, 'ith --> it'), (81, 'kl --> ll'), (81, 'ech --> each'), (82, 'rg --> re'), (82, 'rb --> re'), (82, 'nb --> no'), (83, 'rosenbt --> rodent'), (83, 'rgs --> rags'), (84, 'coriritted --> committed'), (86, 'fighti --> fight'), (88, 'bths --> baths'), (88, 'tchf --> the'), (91, 'ro --> to'), (91, 'ijb --> in'), (92, 'telegrnm --> telegram'), (92, 'rson --> son'), (92, 'jillia --> william'), (92, 'patt --> part'), (93, 'ecretdry --> secretary'), (95, 'purview --> purves'), (95, 'rder --> order'), (99, 'gor --> for'), (99, 'rg --> re'), (99, 'enb --> end'), (99, 'dthethg --> teeth'), (99, 'ro --> to'), (99, 'ared --> are'), (100, 'dri --> dry'), (100, 'yfu --> you'), (100, 
'vthnz --> the'), (100, 'sacc --> sac'), (101, 'rosi --> rose'), (101, 'rg --> re'), (101, 'ile --> ill'), (102, 'jhy --> why'), (102, 'fnir --> fair'), (102, 'azi --> ai'), (103, 'fascist --> fascia'), (104, 'nb --> no')]\n",
"-----\n",
"total words checked: 700\n",
"total unknown words: 3\n",
"total potential errors found: 94\n",
"CPU times: user 9.9 s, sys: 619 ms, total: 10.5 s\n",
"Wall time: 1min 3s\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%time\n",
"correct_document(\"testdata/OCRsample.txt\", d, lwl, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Finding misspelled words in your document...\n",
"-----\n",
"total words checked: 700\n",
"total unknown words: 3\n",
"total potential errors found: 94\n",
"CPU times: user 9.33 s, sys: 505 ms, total: 9.83 s\n",
"Wall time: 56.3 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# additional tests (local machine)\n",
"\n",
"Finding misspelled words in your document...\n",
"-----\n",
"total words checked: 12029\n",
"total unknown words: 19\n",
"total potential errors found: 719\n",
"-----\n",
"266.66 seconds to run\n",
"-----\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Finding misspelled words in your document...\n",
"-----\n",
"total words checked: 131340\n",
"total unknown words: 325\n",
"total potential errors found: 8460\n",
"-----\n",
"2897.28 seconds to run\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AWS Experiments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 2 executors, 4 cores, 16 partitions\n", "# \"there\"\n", "\n", "serial: 62 ms\n", "spark 1: 96.31 s\n", "spark 2: 22.62 s\n", "spark 3: 41.35 s\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Serial version:\n", "\n", "total words checked: 700\n", "total unknown words: 3\n", "total potential errors found: 94\n", "-----\n", "19.01 seconds to run\n", "-----\n", "\n", "total words checked: 12029\n", "total unknown words: 19\n", "total potential errors found: 719\n", "-----\n", "384.75 seconds to run\n", "-----\n", "\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "4305.46 seconds to run\n", "-----\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Please wait...\n", "Creating dictionary...\n", "total words processed: 1105285\n", "total unique words in corpus: 29157\n", "total items in dictionary (corpus words and deletions): 2151998\n", " edit distance for deletions: 3\n", " length of longest word in corpus: 18\n", "-----\n", "35.37 seconds to run\n", "-----\n", " \n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 700\n", "total unknown words: 3\n", "total potential errors found: 94\n", "-----\n", "41.82 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "total words checked: 12029\n", "total unknown words: 19\n", "total potential errors found: 719\n", "-----\n", "125.45 seconds to run\n", "-----\n", "\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "995.66 seconds to run\n", "-----\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 executors, 4 cores, 16 partitions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Please wait...\n", "Creating dictionary...\n", "total words processed: 1105285\n", "total unique words in corpus: 29157\n", "total items in dictionary (corpus words and deletions): 2151998\n", " edit distance for deletions: 3\n", " length of longest word in corpus: 18\n", "-----\n", "34.84 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 700\n", "total unknown words: 3\n", "total potential errors found: 94\n", "-----\n", "66.17 seconds to run\n", "\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 12029\n", "total unknown words: 19\n", "total potential errors found: 719\n", "-----\n", "150.06 seconds to run\n", "-----\n", "\n", "*** when I left memory at 5G\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "2357.24 seconds to run\n", "-----\n", "\n", "*** boosted memory to 8G\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "972.82 seconds to run\n", "-----\n", "\n", "*** changed reduceByKeyLocally to reduceByKey + collectAsMap in createDocument\n", "Document correction... 
Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "992.03 seconds to run\n", "-----\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 executors, 4 cores, 32 partitions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Please wait...\n", "Creating dictionary...\n", "total words processed: 1105285\n", "total unique words in corpus: 29157\n", "total items in dictionary (corpus words and deletions): 2151998\n", " edit distance for deletions: 3\n", " length of longest word in corpus: 18\n", "-----\n", "66.97 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 700\n", "total unknown words: 3\n", "total potential errors found: 94\n", "-----\n", "148.86 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 12029\n", "total unknown words: 19\n", "total potential errors found: 719\n", "-----\n", "240.04 seconds to run\n", "-----\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 executors, 4 cores, 64 partitions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Please wait...\n", "Creating dictionary...\n", "total words processed: 1105285\n", "total unique words in corpus: 29157\n", "total items in dictionary (corpus words and deletions): 2151998\n", " edit distance for deletions: 3\n", " length of longest word in corpus: 18\n", "-----\n", "71.95 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 700\n", "total unknown words: 3\n", "total potential errors found: 94\n", "-----\n", "163.45 seconds to run\n", "\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 12029\n", "total unknown words: 19\n", "total potential errors found: 719\n", "-----\n", "246.38 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "1191.01 seconds to run\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Future investigations may entail exploring and tuning these parameters, and experimenting with others, to optimize performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Please wait...\n", "Creating dictionary...\n", "total words processed: 1105285\n", "total unique words in corpus: 29157\n", "total items in dictionary (corpus words and deletions): 2151998\n", " edit distance for deletions: 3\n", " length of longest word in corpus: 18\n", "-----\n", "69.25 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 700\n", "total unknown words: 3\n", "total potential errors found: 94\n", "-----\n", "198.92 seconds to run\n", "-----\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 12029\n", "total unknown words: 19\n", "total potential errors found: 719\n", "-----\n", "286.15 seconds to run\n", "\n", "Document correction... Please wait...\n", "-------------------------------------\n", "finding corrections for document\n", "Finding misspelled words in your document...\n", "-----\n", "total words checked: 131340\n", "total unknown words: 325\n", "total potential errors found: 8460\n", "-----\n", "1283.60 seconds to run\n", "-----\n", "
\n", " Note: we also examined the Spark web UI at localhost:4040 to identify the time-consuming stages.\n", "
\n", " Replace:\n", "
\n",
"import pyspark\n",
"conf = (pyspark.SparkConf()\n",
" .setMaster('local')\n",
" .setAppName('pyspark')\n",
" .set(\"spark.executor.memory\", \"2g\"))\n",
"sc = pyspark.SparkContext(conf=conf)\n",
"
\n", " with:\n", "
\n",
" from pyspark import SparkContext\n",
" sc = SparkContext()\n",
"
\n",
"\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 113 ms, sys: 24.6 ms, total: 138 ms\n",
"Wall time: 2min 48s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replace:\n", "
\n",
"make_all_lower = sc.textFile(fname).map(lambda line: line.lower())\n",
"
\n", " with:\n", "
\n",
"make_all_lower = sc.textFile(fname, minPartitions = n_partitions).map(lambda line: line.lower())\n",
"
\n",
"\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 133 ms, sys: 28.9 ms, total: 162 ms\n",
"Wall time: 2min 55s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replace:\n", "
\n",
"unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()\n",
"
\n", " with:\n", "
\n",
"unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b).repartitionAndSortWithinPartitions(numPartitions =n_partitions).cache()\n",
"
\n",
"\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 150 ms, sys: 31.8 ms, total: 182 ms\n",
"Wall time: 3min 5s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replace:\n", "
\n",
"make_all_lower = sc.textFile(fname).map(lambda line: line.lower())\n",
"
\n", " with:\n", "
\n",
"make_all_lower = sc.textFile(fname, use_unicode=False).map(lambda line: line.lower())\n",
"
\n",
"\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 106 ms, sys: 24.3 ms, total: 130 ms\n",
"Wall time: 2min 46s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Replaced `get_deletes_list` with `get_n_deletes_list` function, and modified `parallel_create_dictionary` accordingly. See Spark v. 2.0 above.\n", "
\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 52 ms, sys: 13 ms, total: 65 ms\n",
"Wall time: 54.1 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Modified `parallel_create_dictionary` accordingly. See Spark v. 3.0 above.\n", "
\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 10.9 s, sys: 1.01 s, total: 11.9 s\n",
"Wall time: 42.1 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replace:\n", "
\n",
"new_dict = combine.reduceByKeyLocally(lambda a, b: (a[0]+b[0], a[1]+b[1]))\n",
"
\n", " with:\n", "
\n",
"new_dict = combine.reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1])).collectAsMap()\n",
"
\n",
" \n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 7.16 s, sys: 921 ms, total: 8.08 s\n",
"Wall time: 1min 7s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replace:\n", "
\n",
"unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b, numPartitions=n_partitions).cache()\n",
"
\n", " with:\n", "
\n",
"unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b).cache()\n",
"
\n",
" \n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 9.05 s, sys: 824 ms, total: 9.87 s\n",
"Wall time: 43.4 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replace:\n", "
\n",
"n_partitions = 6 # number of partitions to be used\n",
"
\n", " with:\n", "
\n",
"n_partitions = 8 # number of partitions to be used\n",
"
\n",
"\n",
"Creating dictionary...\n",
"total words processed: 1105285\n",
"total unique words in corpus: 29157\n",
"total items in dictionary (corpus words and deletions): 2151998\n",
" edit distance for deletions: 3\n",
" length of longest word in corpus: 18\n",
"CPU times: user 11.6 s, sys: 1.08 s, total: 12.7 s\n",
"Wall time: 45.5 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### implement document checking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Replaced `get_suggestions` with `no_RDD_get_suggestions` function, and modified `correct_document` accordingly. \n", "
\n",
"Finding misspelled words in your document...\n",
"-----\n",
"total words checked: 700\n",
"total unknown words: 3\n",
"total potential errors found: 94\n",
"CPU times: user 1min 6s, sys: 1.36 s, total: 1min 8s\n",
"Wall time: 2min 1s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Added:\n", "
\n",
" # broadcast lookup dictionary to workers\n",
" bd = sc.broadcast(d)\n",
"
\n", " Modified `get_corrections` to:\n", "
\n",
" get_corrections = all_words.map(lambda (w, index): (w, (get_suggestions(w, bd.value, lwl, True), index)), preservesPartitioning=True).cache()\n",
"
\n",
"\n",
"Finding misspelled words in your document...\n",
"-----\n",
"total words checked: 700\n",
"total unknown words: 3\n",
"total potential errors found: 94\n",
"CPU times: user 10.6 s, sys: 620 ms, total: 11.2 s\n",
"Wall time: 57.3 s\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These changes have been implemented. See Spark v. 4.0 above.\n", " Add:\n", "
\n",
" gc = sc.accumulator(0)\n",
" get_corrections.foreach(lambda x: gc.add(1))\n",
" uc = sc.accumulator(0)\n",
" unknown_words.foreach(lambda x: uc.add(1))\n",
" ew = sc.accumulator(0)\n",
" error_words.foreach(lambda x: ew.add(1))\n",
"
\n", " Modified print statements to read:\n", "
\n",
" print \"total words checked: %i\" % gc.value\n",
" print \"total unknown words: %i\" % uc.value\n",
" print \"total potential errors found: %i\" % ew.value\n",
"
\n",
" \n",
"Finding misspelled words in your document...\n",
"-----\n",
"total words checked: 700\n",
"total unknown words: 3\n",
"total potential errors found: 94\n",
"CPU times: user 9.33 s, sys: 505 ms, total: 9.83 s\n",
"Wall time: 56.3 s \n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", " Added:\n", "
\n",
" def f(x):\n",
" val, dictionary, longestword, silent = x\n",
" return get_suggestions(val, dictionary, longestword, silent)\n",
"
\n", " and to `create_dictionary`:\n", "
\n",
" # parallelization via Pool\n",
" jobs = [(w, bd.value, lwl, False) for w in doc_words]\n",
" suggestion_lists = p.map(f, jobs)\n",
"
\n",
"