Language Detection
==================

Polyglot depends on the pycld2 library, which in turn depends on the
cld2 library, to detect the language(s) used in plain text.

.. code:: python

    from polyglot.detect import Detector

Example
-------

.. code:: python

    arabic_text = u"""
    أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
    الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
    انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
    والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
    """

.. code:: python

    detector = Detector(arabic_text)
    print(detector.language)

.. parsed-literal::

    name: Arabic      code: ar       confidence:  99.0 read bytes:   907

Mixed Text
----------

.. code:: python

    mixed_text = u"""
    China (simplified Chinese: 中国; traditional Chinese: 中國),
    officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
    """

If the text contains snippets from different languages, the detector is
able to find the most probable languages used in the text. For each
language, we can query the model's confidence level:

.. code:: python

    for language in Detector(mixed_text).languages:
      print(language)

.. parsed-literal::

    name: English     code: en       confidence:  87.0 read bytes:  1154
    name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
    name: un          code: un       confidence:   0.0 read bytes:     0

To take a closer look, we can inspect the text line by line. Notice that
the detection confidence is lower for the first line, which mixes English
and Chinese:

.. code:: python

    for line in mixed_text.strip().splitlines():
      print(line + u"\n")
      for language in Detector(line).languages:
        print(language)
      print("\n")

.. parsed-literal::

    China (simplified Chinese: 中国; traditional Chinese: 中國),

    name: English     code: en       confidence:  71.0 read bytes:   887
    name: Chinese     code: zh_Hant  confidence:  11.0 read bytes:  1755
    name: un          code: un       confidence:   0.0 read bytes:     0


    officially the People's Republic of China (PRC), is a sovereign state located in East Asia.

    name: English     code: en       confidence:  98.0 read bytes:  1291
    name: un          code: un       confidence:   0.0 read bytes:     0
    name: un          code: un       confidence:   0.0 read bytes:     0

Best Effort Strategy
--------------------

Sometimes there is not enough text to make a decision, for example when
detecting the language of a single word. The detector then switches to a
best effort strategy: a warning is logged and the attribute ``reliable``
is set to ``False``.

.. code:: python

    detector = Detector("pizza")
    print(detector)

.. parsed-literal::

    WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.

.. parsed-literal::

    Prediction is reliable: False
    Language 1: name: English     code: en       confidence:  85.0 read bytes:  1194
    Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
    Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

If the detection is not reliable even with the best effort strategy, an
``UnknownLanguage`` exception is raised.

.. code:: python

    print(Detector("4"))

::

    ---------------------------------------------------------------------------
    UnknownLanguage                           Traceback (most recent call last)
    in ()
    ----> 1 print(Detector("4"))

    /usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in __init__(self, text, quiet)
         63     self.quiet = quiet
         64     """If true, exceptions will be silenced."""
    ---> 65     self.detect(text)
         66
         67   @staticmethod

    /usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in detect(self, text)
         89
         90     if not reliable and not self.quiet:
    ---> 91       raise UnknownLanguage("Try passing a longer snippet of text")
         92     else:
         93       logger.warning("Detector is not able to detect the language reliably.")

    UnknownLanguage: Try passing a longer snippet of text

Such an exception may not be desirable, especially for trivial cases like
characters that could belong to many languages. In this case, we can
silence the exception by setting ``quiet`` to ``True``:

.. code:: python

    print(Detector("4", quiet=True))

.. parsed-literal::

    WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.

.. parsed-literal::

    Prediction is reliable: False
    Language 1: name: un          code: un       confidence:   0.0 read bytes:     0
    Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
    Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

Command Line
------------

.. code:: python

    !polyglot detect --help

.. parsed-literal::

    usage: polyglot detect [-h] [--input [INPUT [INPUT ...]]]

    optional arguments:
      -h, --help            show this help message and exit
      --input [INPUT [INPUT ...]]

The subcommand ``detect`` tries to identify the language code for each
line in a text file. This can be convenient if each line represents a
document or a sentence, for example the output of a tokenizer.

.. code:: python

    !polyglot detect --input testdata/cricket.txt

.. parsed-literal::

    English             Australia posted a World Cup record total of 417-6 as they beat Afghanistan by 275 runs.
    English             David Warner hit 178 off 133 balls, Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth.
    English             Afghanistan were then dismissed for 142, with Mitchell Johnson and Mitchell Starc taking six wickets between them.
    English             Australia's score surpassed the 413-5 India made against Bermuda in 2007.
    English             It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages, following South Africa's 408-5 and 411-4 against West Indies and Ireland respectively.
    English             The winning margin beats the 257-run amount by which India beat Bermuda in Port of Spain in 2007, which was equalled five days ago by South Africa in their victory over West Indies in Sydney.

Supported Languages
-------------------

cld2 can detect up to 165 languages.

.. code:: python

    from polyglot.utils import pretty_list
    print(pretty_list(Detector.supported_languages()))

.. parsed-literal::

      1. Abkhazian                  2. Afar                       3. Afrikaans
      4. Akan                       5. Albanian                   6. Amharic
      7. Arabic                     8. Armenian                   9. Assamese
     10. Aymara                    11. Azerbaijani               12. Bashkir
     13. Basque                    14. Belarusian                15. Bengali
     16. Bihari                    17. Bislama                   18. Bosnian
     19. Breton                    20. Bulgarian                 21. Burmese
     22. Catalan                   23. Cebuano                   24. Cherokee
     25. Nyanja                    26. Corsican                  27. Croatian
     28. Croatian                  29. Czech                     30. Chinese
     31. Chinese                   32. Chinese                   33. Chinese
     34. Chineset                  35. Chineset                  36. Chineset
     37. Chineset                  38. Chineset                  39. Chineset
     40. Danish                    41. Dhivehi                   42. Dutch
     43. Dzongkha                  44. English                   45. Esperanto
     46. Estonian                  47. Ewe                       48. Faroese
     49. Fijian                    50. Finnish                   51. French
     52. Frisian                   53. Ga                        54. Galician
     55. Ganda                     56. Georgian                  57. German
     58. Greek                     59. Greenlandic               60. Guarani
     61. Gujarati                  62. Haitian_creole            63. Hausa
     64. Hawaiian                  65. Hebrew                    66. Hebrew
     67. Hindi                     68. Hmong                     69. Hungarian
     70. Icelandic                 71. Igbo                      72. Indonesian
     73. Interlingua               74. Interlingue               75. Inuktitut
     76. Inupiak                   77. Irish                     78. Italian
     79. Ignore                    80. Javanese                  81. Javanese
     82. Japanese                  83. Kannada                   84. Kashmiri
     85. Kazakh                    86. Khasi                     87. Khmer
     88. Kinyarwanda               89. Krio                      90. Kurdish
     91. Kyrgyz                    92. Korean                    93. Laothian
     94. Latin                     95. Latvian                   96. Limbu
     97. Limbu                     98. Limbu                     99. Lingala
    100. Lithuanian               101. Lozi                     102. Luba_lulua
    103. Luo_kenya_and_tanzania   104. Luxembourgish            105. Macedonian
    106. Malagasy                 107. Malay                    108. Malayalam
    109. Maltese                  110. Manx                     111. Maori
    112. Marathi                  113. Mauritian_creole         114. Romanian
    115. Mongolian                116. Montenegrin              117. Montenegrin
    118. Montenegrin              119. Montenegrin              120. Nauru
    121. Ndebele                  122. Nepali                   123. Newari
    124. Norwegian                125. Norwegian                126. Norwegian_n
    127. Nyanja                   128. Occitan                  129. Oriya
    130. Oromo                    131. Ossetian                 132. Pampanga
    133. Pashto                   134. Pedi                     135. Persian
    136. Polish                   137. Portuguese               138. Punjabi
    139. Quechua                  140. Rajasthani               141. Rhaeto_romance
    142. Romanian                 143. Rundi                    144. Russian
    145. Samoan                   146. Sango                    147. Sanskrit
    148. Scots                    149. Scots_gaelic             150. Serbian
    151. Serbian                  152. Seselwa                  153. Seselwa
    154. Sesotho                  155. Shona                    156. Sindhi
    157. Sinhalese                158. Siswant                  159. Slovak
    160. Slovenian                161. Somali                   162. Spanish
    163. Sundanese                164. Swahili                  165. Swedish
    166. Syriac                   167. Tagalog                  168. Tajik
    169. Tamil                    170. Tatar                    171. Telugu
    172. Thai                     173. Tibetan                  174. Tigrinya
    175. Tonga                    176. Tsonga                   177. Tswana
    178. Tumbuka                  179. Turkish                  180. Turkmen
    181. Twi                      182. Uighur                   183. Ukrainian
    184. Urdu                     185. Uzbek                    186. Venda
    187. Vietnamese               188. Volapuk                  189. Waray_philippines
    190. Welsh                    191. Wolof                    192. Xhosa
    193. Yiddish                  194. Yoruba                   195. Zhuang
    196. Zulu
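
The per-line detection performed by the ``detect`` subcommand, combined with
the ``quiet`` flag from the best effort strategy, can be approximated in
Python. The following is a minimal sketch, not polyglot's own implementation.
``detect_lines``, ``FakeLanguage``, ``FakeDetector`` and ``fake_detector`` are
hypothetical names introduced for illustration, and the stand-in detector
exists only so the example runs without polyglot installed.

```python
from collections import namedtuple


def detect_lines(text, make_detector):
    """Return (language code, reliable flag, line) for each non-empty line.

    make_detector is any callable that takes a string and returns an object
    exposing `.language.code` and `.reliable`. This mirrors the attributes
    polyglot's Detector is assumed to provide.
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of asking for a detection
        detector = make_detector(line)
        results.append((detector.language.code, detector.reliable, line))
    return results


# Hypothetical stand-ins so the sketch runs without polyglot installed.
FakeLanguage = namedtuple("FakeLanguage", "code")
FakeDetector = namedtuple("FakeDetector", "language reliable")


def fake_detector(line):
    # Toy rule for illustration only: ASCII-only lines count as English,
    # anything else is reported as unknown ("un") and unreliable.
    is_ascii = all(ord(ch) < 128 for ch in line)
    return FakeDetector(FakeLanguage("en" if is_ascii else "un"), is_ascii)


sample = u"Australia posted a record total.\n\n中国\n"
rows = detect_lines(sample, fake_detector)
for code, reliable, line in rows:
    print(code, reliable, line)
```

With polyglot installed, passing ``lambda line: Detector(line, quiet=True)``
as ``make_detector`` should behave the same way, assuming the detector exposes
``language.code`` and ``reliable`` as described in the sections above. Lines
with ``reliable`` set to ``False`` can then be filtered out or handled
separately rather than raising ``UnknownLanguage``.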