Named Entity Extraction ======================= Named entity extraction task aims to extract phrases from plain text that correpond to entities. Polyglot recognizes 3 categories of entities: - Locations (Tag: ``I-LOC``): cities, countries, regions, continents, neighborhoods, administrative divisions ... - Organizations (Tag: ``I-ORG``): sports teams, newspapers, banks, universities, schools, non-profits, companies, ... - Persons (Tag: ``I-PER``): politicians, scientists, artists, atheletes ... Languages Coverage ------------------ The models were trained on datasets extracted automatically from Wikipedia. Polyglot currently supports 40 major languages. .. code:: python from polyglot.downloader import downloader print(downloader.supported_languages_table("ner2", 3)) .. parsed-literal:: 1. Polish 2. Turkish 3. Russian 4. Indonesian 5. Czech 6. Arabic 7. Korean 8. Catalan; Valencian 9. Italian 10. Thai 11. Romanian, Moldavian, ... 12. Tagalog 13. Danish 14. Finnish 15. German 16. Persian 17. Dutch 18. Chinese 19. French 20. Portuguese 21. Slovak 22. Hebrew (modern) 23. Malay 24. Slovene 25. Bulgarian 26. Hindi 27. Japanese 28. Hungarian 29. Croatian 30. Ukrainian 31. Serbian 32. Lithuanian 33. Norwegian 34. Latvian 35. Swedish 36. English 37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese 40. Estonian Download Necessary Models ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python %%bash polyglot download embeddings2.en ner2.en .. parsed-literal:: [polyglot_data] Downloading package embeddings2.en to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package embeddings2.en is already up-to-date! [polyglot_data] Downloading package ner2.en to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package ner2.en is already up-to-date! Example ------- Entities inside a text object or a sentence are represented as chunks. Each chunk identifies the start and the end indices of the word subsequence within the text. .. code:: python from polyglot.text import Text .. code:: python blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world".""" text = Text(blob) We can query all entities mentioned in a text. .. code:: python text.entities .. parsed-literal:: [I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])] Or, we can query entites per sentence .. code:: python for sent in text.sentences: print(sent, "\n") for entity in sent.entities: print(entity.tag, entity) .. parsed-literal:: The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world". I-ORG [u'Israeli'] I-PER [u'Benjamin', u'Netanyahu'] I-LOC [u'Iran'] By doing more careful inspection of the second entity ``Benjamin Netanyahu``, we can locate the position of the entity within the sentence. .. code:: python benjamin = sent.entities[1] sent.words[benjamin.start: benjamin.end] .. parsed-literal:: WordList([u'Benjamin', u'Netanyahu']) Command Line Interface ~~~~~~~~~~~~~~~~~~~~~~ .. code:: python !polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20 .. parsed-literal:: , O which O was O equalled O five O days O ago O by O South I-LOC Africa I-LOC in O their O victory O over O West I-ORG Indies I-ORG in O Sydney I-LOC . O Demo ---- .. raw:: html Citation ~~~~~~~~ This work is a direct implementation of the research being described in the `Polyglot-NER: Multilingual Named Entity Recognition `__ paper. The author of this library strongly encourage you to cite the following paper if you are using this software. :: @article{polyglotner, author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven}, title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition}, journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}}, month = {April}, year = {2015}, publisher = {SIAM} } References ---------- - `Polyglot-NER project page. `__ - `Wikipedia on NER `__.