Suggestion: Finding missing words in dictionaries via web scraping and natural language processing
Michael Stahl
mstahl at redhat.com
Thu Aug 17 11:20:15 UTC 2017
On 17.08.2017 12:08, Andrej Warkentin wrote:
> Hello,
>
> in a talk at the PyData Berlin meetup I saw this project:
> https://github.com/lusy/hora-de-decir-bye-bye , where Spanish articles
> are scraped and searched for English words. In order to identify English
> words she used the dictionaries from OpenOffice and compared scraped
> words to the dictionaries. She mentioned the problem that not all words
> were in the dictionaries.
>
> So I thought this could be used to find (or at least help find) most
> missing words in dictionaries for all languages. One could scrape e.g.
> all Wikipedia articles of a certain language and create a candidate list
> of missing words. Or it could also be used to find domain-specific words
> by scraping e.g. scientific articles, articles from certain types of
> websites and so on.
>
> My question is whether this would be helpful at all, or whether missing
> words in dictionaries are no longer a problem. Also, I unfortunately
> don't have much spare time at the moment to work on this, so if anyone
> wants to pick this up, feel free to do so. I will let you know when I
> have implemented something myself.
By "missing words in dictionaries", do you mean that if "teh" was used
as an archaic spelling of "tea" in a work of Shakespeare (a completely
made-up, hypothetical example), we should add "teh" to the
dictionary and no longer flag it as a misspelled word?
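
For illustration, the candidate-list idea from the quoted proposal could be
sketched roughly as below. This is only a minimal sketch under assumptions
that are not part of the thread: the dictionary is taken to be a plain list
of words (real spell-checking dictionaries also carry affix rules, which are
ignored here), the scraped text is already available as a string, and the
names candidate_missing_words and min_count are made up for the example.
The frequency threshold filters out one-off typos, but a frequent
misspelling like the "teh" case would still surface, so the result is a
list for human review and not something to merge automatically.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an apostrophe or hyphen.
# [^\W\d_] matches any Unicode letter, so non-English text works too.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def candidate_missing_words(text, known_words, min_count=2):
    """Return (word, count) pairs for words in text that are not in known_words.

    Hypothetical helper: known_words is assumed to be a plain iterable of
    dictionary words. Comparison is case-insensitive, and only words seen
    at least min_count times are reported, most frequent first.
    """
    known = {w.lower() for w in known_words}
    counts = Counter(m.group(0).lower() for m in WORD_RE.finditer(text))
    return [
        (word, n)
        for word, n in counts.most_common()
        if word not in known and n >= min_count
    ]


if __name__ == "__main__":
    sample = "The teh cat sat. Teh cat sat on teh mat."
    dictionary = ["the", "cat", "sat", "on", "mat"]
    print(candidate_missing_words(sample, dictionary))
```

Raising min_count trades recall for less noise; it does not distinguish a
common misspelling from a genuinely missing word.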