Morphological Analysis
======================
Polyglot offers trained `morfessor
models `__ to generate
morphemes from words. The goal of the Morpho project is to develop
unsupervised, data-driven methods that discover the regularities behind
word formation in natural languages. In particular, the Morpho project
focuses on the discovery of morphemes, which are the primitive units
of syntax: the smallest individually meaningful elements in the
utterances of a language. Morphemes are important in the automatic
generation and recognition of a language, especially in languages in
which words may have many different inflected forms.
Languages Coverage
------------------
Using polyglot vocabulary dictionaries, we trained morfessor models on
the 50,000 most frequent words of each language.
.. code:: python

    from polyglot.downloader import downloader
    print(downloader.supported_languages_table("morph2"))

.. parsed-literal::
1. Piedmontese language 2. Lombard language 3. Gan Chinese
4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz
7. Pashto, Pushto 8. Kurdish 9. Portuguese
10. Kannada 11. Korean 12. Khmer
13. Kazakh 14. Ilokano 15. Polish
16. Panjabi, Punjabi 17. Georgian 18. Chuvash
19. Alemannic 20. Czech 21. Welsh
22. Chechen 23. Catalan; Valencian 24. Northern Sami
25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese
28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian
31. Swedish 32. Swahili 33. Sundanese
34. Serbian 35. Albanian 36. Japanese
37. Western Frisian 38. French 39. Finnish
40. Upper Sorbian 41. Faroese 42. Persian
43. Sinhala, Sinhalese 44. Italian 45. Amharic
46. Aragonese 47. Volapük 48. Icelandic
49. Sakha 50. Afrikaans 51. Indonesian
52. Interlingua 53. Azerbaijani 54. Ido
55. Arabic 56. Assamese 57. Yoruba
58. Yiddish 59. Waray-Waray 60. Croatian
61. Hungarian 62. Haitian; Haitian Creole 63. Quechua
64. Armenian 65. Hebrew (modern) 66. Silesian
67. Hindi 68. Divehi; Dhivehi; Mald... 69. German
70. Danish 71. Occitan 72. Tagalog
73. Turkmen 74. Thai 75. Tajik
76. Greek, Modern 77. Telugu 78. Tamil
79. Oriya 80. Ossetian, Ossetic 81. Tatar
82. Turkish 83. Kapampangan 84. Venetian
85. Manx 86. Gujarati 87. Galician
88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali
91. Cebuano 92. Zazaki 93. Walloon
94. Dutch 95. Norwegian 96. Norwegian Nynorsk
97. West Flemish 98. Chinese 99. Bosnian
100. Breton 101. Belarusian 102. Bulgarian
103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib...
106. Bengali 107. Burmese 108. Romansh
109. Marathi (Marāṭhī) 110. Malay 111. Maltese
112. Russian 113. Macedonian 114. Malayalam
115. Mongolian 116. Malagasy 117. Vietnamese
118. Spanish; Castilian 119. Estonian 120. Basque
121. Bishnupriya Manipuri 122. Asturian 123. English
124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan...
130. Latvian 131. Urdu 132. Lithuanian
133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ...
Download Necessary Models
^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash

    polyglot download morph2.en morph2.ar

.. parsed-literal::
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.ar is already up-to-date!
Example
-------
Word Segmentation
~~~~~~~~~~~~~~~~~
.. code:: python

    from polyglot.text import Text, Word

.. code:: python

    words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
    for w in words:
        w = Word(w, language="en")
        print("{:<20}{}".format(w, w.morphemes))

.. parsed-literal::

    preprocessing       ['pre', 'process', 'ing']
    processor           ['process', 'or']
    invaluable          ['in', 'valuable']
    thankful            ['thank', 'ful']
    crossed             ['cross', 'ed']

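For intuition about what a segmentation model does with a word, the following is a toy sketch only. It is not polyglot's API and not Morfessor's implementation: the ``MORPH_COSTS`` table, the ``UNKNOWN_CHAR_COST`` penalty, and the ``segment`` function are all hypothetical names with made-up values. Given a cost for each known morph, dynamic programming picks the split of the word with the lowest total cost.

```python
import math

# Hypothetical morph costs (lower means more likely); not from any trained model.
MORPH_COSTS = {
    "pre": 2.0, "process": 3.0, "ing": 1.5, "or": 2.0,
    "in": 2.0, "valuable": 4.0, "thank": 3.0, "ful": 2.0,
    "cross": 3.0, "ed": 1.5,
}
UNKNOWN_CHAR_COST = 10.0  # assumed penalty for a character no morph covers


def segment(word, costs=MORPH_COSTS):
    """Return the lowest-cost split of `word` into pieces."""
    n = len(word)
    best = [math.inf] * (n + 1)  # best[i]: cost of the best split of word[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in costs:
                cost = costs[piece]
            elif end - start == 1:
                cost = UNKNOWN_CHAR_COST  # fall back to a single character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]


for w in ["preprocessing", "processor", "crossed"]:
    print("{:<20}{}".format(w, segment(w)))
```

With these made-up costs the sketch splits the three words the same way as the output above. A trained model learns its morph inventory and costs from data instead of having them written by hand.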
Sentence Segmentation
~~~~~~~~~~~~~~~~~~~~~
If the text is not tokenized properly, morphological analysis can
offer a smart way of splitting the text into its original units.
Here is an example:
.. code:: python

    blob = "Wewillmeettoday."
    text = Text(blob)
    text.language = "en"

.. code:: python

    text.morphemes

.. parsed-literal::
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
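Note that the result is a list of morphemes, not words, so ``today`` comes back as ``to`` and ``day``. If word-level units are needed, one possible post-processing step is to rejoin adjacent morphemes that form a known word. The sketch below is an assumption about how that could be done, not part of polyglot: ``merge_morphemes`` and the small ``vocabulary`` set are hypothetical.

```python
def merge_morphemes(morphemes, vocabulary):
    """Greedily join runs of adjacent morphemes whose concatenation is a known word."""
    words, i = [], 0
    while i < len(morphemes):
        # Try the longest run first so that "to" + "day" becomes "today";
        # a single morpheme is always accepted as a fallback.
        for j in range(len(morphemes), i, -1):
            candidate = "".join(morphemes[i:j])
            if j - i == 1 or candidate in vocabulary:
                words.append(candidate)
                i = j
                break
    return words


morphemes = ["We", "will", "meet", "to", "day", "."]  # morphemes from the example
vocabulary = {"We", "will", "meet", "today"}           # hypothetical word list
print(merge_morphemes(morphemes, vocabulary))
```

The greedy longest-match rule is a simple choice for illustration. It can merge too eagerly when the vocabulary contains long compounds.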
Command Line Interface
~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash

    polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30

.. parsed-literal::
which which
India In_dia
beat beat
Bermuda Ber_mud_a
in in
Port Port
of of
Spain Spa_in
in in
2007 2007
, ,
which which
was wa_s
equalled equal_led
five five
days day_s
ago ago
by by
South South
Africa Africa
in in
their t_heir
victory victor_y
over over
West West
Indies In_dies
in in
Sydney Syd_ney
. .
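To use this output in a script, each line can be read back into a token and its morphemes. The sketch below assumes the layout seen in the sample output: two whitespace-separated columns, with morphemes in the second column joined by underscores. ``parse_morph_output`` is a hypothetical helper, not part of polyglot.

```python
def parse_morph_output(lines):
    """Turn `token  mor_phe_mes` lines into (token, [morphemes]) pairs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        token, segmented = fields
        pairs.append((token, segmented.split("_")))
    return pairs


sample = """\
Bermuda   Ber_mud_a
equalled  equal_led
2007      2007
"""
for token, morphemes in parse_morph_output(sample.splitlines()):
    print(token, morphemes)
```

A token that itself contains an underscore would be split incorrectly under this assumption, so such input would need separate handling.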
Demo
----
This demo does not reflect the models supplied by polyglot; however, we
think it is indicative of what you should expect from morfessor:
`Demo `__
Citation
~~~~~~~~
This is an interface to the implementation described in the
`Morfessor2.0: Python Implementation and Extensions for Morfessor
Baseline `__
technical report.
::

    @InProceedings{morfessor2,
      title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
      author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
      year = {2013},
      publisher = {Department of Signal Processing and Acoustics, Aalto University},
      booktitle = {Aalto University publication series}
    }

References
----------
- `Morpho project `__
- `Background information on morpheme
discovery `__.