Suggestion: Finding missing words in dictionaries via web scraping and natural language processing

Andrej Warkentin an.warkentin at web.de
Thu Aug 17 10:08:31 UTC 2017


Hello,

in a talk at the PyData Berlin meetup I saw this project:
https://github.com/lusy/hora-de-decir-bye-bye , where Spanish articles
are scraped and searched for English words. To identify English words,
the author used the OpenOffice dictionaries and compared the scraped
words against them. She mentioned the problem that not all words are in
the dictionaries.

So I thought the same approach could be used to find (or at least help
find) most of the words missing from the dictionaries for all languages.
One could scrape, e.g., all Wikipedia articles in a certain language and
create a candidate list of missing words. It could also be used to find
domain-specific words by scraping, e.g., scientific articles, articles
from certain types of websites, and so on.
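
A minimal sketch of the candidate-list idea in Python, under a few
assumptions: the text has already been scraped into a plain string, the
dictionary is available as the lines of a Hunspell-style .dic file, and
the helper names (load_dic_words, missing_word_candidates) and the
min_count threshold are made up for illustration. It does not apply the
affix rules from the .aff file, so inflected forms of known stems would
still show up as candidates and would need a real spell checker or manual
review to filter out.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def load_dic_words(lines):
    """Read stems from the lines of a Hunspell-style .dic file.

    Simplification: affix flags after "/" are dropped and the .aff rules
    are not applied, so inflected forms may be reported as missing.
    """
    words = set()
    for i, line in enumerate(lines):
        entry = line.strip()
        if not entry or (i == 0 and entry.isdigit()):
            continue  # skip blank lines and the leading entry count
        # Keep only the stem: cut off flags and tab-separated extra fields.
        words.add(entry.split("/")[0].split("\t")[0].lower())
    return words


def missing_word_candidates(text, known_words, min_count=2):
    """Return (word, count) pairs for words absent from known_words.

    min_count is a hypothetical threshold meant to drop one-off typos.
    """
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    return [(w, n) for w, n in counts.most_common()
            if n >= min_count and w not in known_words]


# Tiny illustrative run with made-up data instead of a real dictionary.
known = load_dic_words(["4", "hello/S", "world", "dictionary/S", "the"])
sample = "Hello world, hello scraper. The scraper reads the dictionary."
print(missing_word_candidates(sample, known))
```

Requiring a word to appear several times before it is listed is just one
possible way to keep typos out of the candidate list; the result would
still be a list for human review, not something to merge automatically.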

My question is whether this would be helpful at all, or whether missing
words in dictionaries are no longer a problem. Also, I unfortunately
don't have much spare time at the moment to work on this, so if anyone
wants to pick it up, feel free to do so. I will let you know once I have
implemented something myself.

I'm looking forward to your feedback.

Cheers,

Andrej



More information about the LibreOffice mailing list