Suggestion: Finding missing words in dictionaries via web scraping and natural language processing

Michael Stahl mstahl at redhat.com
Thu Aug 17 11:20:15 UTC 2017


On 17.08.2017 12:08, Andrej Warkentin wrote:
> Hello,
> 
> In a talk at the PyData Berlin meetup I saw this project: 
> https://github.com/lusy/hora-de-decir-bye-bye , where Spanish articles 
> are scraped and searched for English words. To identify English words, 
> the author used the OpenOffice dictionaries and compared the scraped 
> words against them. She mentioned the problem that not all words were 
> in the dictionaries.
> 
> So I thought this could be used to find (or at least help find) most 
> of the missing words in dictionaries for any language. One could 
> scrape, e.g., all Wikipedia articles in a given language and build a 
> candidate list of missing words. It could also be used to find 
> domain-specific words by scraping scientific articles, articles from 
> certain types of websites, and so on.
> 
> My question is whether this would be helpful at all, or whether 
> missing words in dictionaries are no longer a problem. Also, I 
> unfortunately don't have much spare time at the moment to work on 
> this, so if anyone wants to pick it up, feel free to do so. I will let 
> you know once I have implemented something myself.
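The comparison described in the quoted message can be sketched in a few
lines of Python. Note this is only an illustration: `known` here is a toy
stand-in for a real dictionary word list (in practice one would load the
entries from a Hunspell .dic file), and the sample text stands in for
scraped article text.

```python
import re
from collections import Counter

def candidate_missing_words(text, known_words, min_count=2):
    """Return words occurring at least min_count times in `text`
    that are absent from `known_words` (a set of dictionary entries)."""
    # Crude tokenization: runs of Latin letters, lowercased.
    tokens = re.findall(r"[a-zA-Z\u00C0-\u00FF]+", text.lower())
    counts = Counter(tokens)
    return {w: n for w, n in counts.items()
            if n >= min_count and w not in known_words}

# Toy stand-ins; real input would be a Hunspell word list and
# scraped Wikipedia article text.
known = {"the", "cat", "sat", "on", "mat"}
text = "The cat sat on the blorp. The blorp sat on the mat. blorp!"
print(candidate_missing_words(text, known))  # {'blorp': 3}
```

The frequency threshold is there so that one-off typos in the scraped
corpus do not flood the candidate list; a human would still review the
output before anything is added to a dictionary.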

by "missing words in dictionaries", do you mean that if "teh" were used
as an archaic spelling of "tea" in a work of Shakespeare (a completely
made-up and hypothetical example), we should add "teh" to the
dictionary and no longer flag it as a misspelled word?



More information about the LibreOffice mailing list