Suggestion: Finding missing words in dictionaries via web scraping and natural language processing

Thu Aug 17 20:39:24 UTC 2017

On 08/17/2017 10:08 AM, Andrej Warkentin wrote:

> So I thought this could be used to find (or at least help finding) most missing words in dictionaries for all languages.

Back when the OOo dictionary for Afrikaans was created, a program was
run through the dictionary corpus, excluding words that were also found
in the English dictionary.

That is why three of the most common words in Afrikaans were missing
from that dictionary for at least its first five revisions.

Die man.
Two words which, read as a phrase, have completely different meanings in
Afrikaans and in English.  «I'll grant that "man" is bad Afrikaans, but
it is the only example I can think of offhand that isn't also off-colour
in one or both languages.»
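
A rough sketch of how that kind of filtering goes wrong. The word lists
are toy data made up for illustration; they are not the OOo corpus, and
this is not the program that was actually used:

```python
from collections import Counter

# Toy data, invented for this example only.
afrikaans_corpus = "die man is in die huis en die kind is by die skool".split()
english_words = {"die", "man", "is", "in", "by", "the", "house"}

counts = Counter(afrikaans_corpus)

# Naive filter: drop every corpus word that is also in the English list.
kept = {w for w in counts if w not in english_words}

# Words lost to the filter, most frequent first.
lost = [w for w, _ in counts.most_common() if w in english_words]

print("kept:", sorted(kept))
print("lost:", lost)
```

In this toy corpus the most frequent word, "die", is the first one the
filter throws away.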

> My question is if this would be something helpful at all or if missing words in dictionaries is not a problem anymore.

Once a dictionary has reached a certain size, it starts to include
rarely used words whose spelling is also a common misspelling of another
word. Earlier in this thread somebody mentioned "teh" as one example.
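
To make the trade-off concrete, here is a small sketch. The function
name and both word sets are hypothetical, and real spell checkers do far
more than a set lookup:

```python
def misspelled(words, dictionary):
    """Return the words that the dictionary does not recognise."""
    return [w for w in words if w.lower() not in dictionary]

sentence = "I saw teh cat".split()

small_dictionary = {"i", "saw", "the", "cat"}
large_dictionary = small_dictionary | {"teh"}  # rare entry added

flagged_small = misspelled(sentence, small_dictionary)  # typo is flagged
flagged_large = misspelled(sentence, large_dictionary)  # typo slips through

print(flagged_small, flagged_large)
```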

For dictionaries that are under initial construction, this type of tool
can be extremely useful.

> don't have much spare time at the moment to work on this so if anyone

My impression is that a Python library module with this functionality
already exists.  What would need to be done is to hook that library up
either to a bot that scrapes Wikipedia, etc., or to an extension that
reads ODF documents.
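
The core of such a tool could look roughly like the sketch below. It
uses only the Python standard library, not the module I have in mind,
and the function name, the min_count parameter and the sample text are
all assumptions for illustration. Fetching the text (from a scraper or
an ODF file) is left out:

```python
import re
from collections import Counter

def candidate_missing_words(text, known_words, min_count=2):
    """Return (word, count) pairs for words that occur at least
    min_count times in text but are absent from known_words,
    most frequent first."""
    # Runs of Unicode letters only; hyphens and apostrophes split words.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return [(w, n) for w, n in counts.most_common()
            if n >= min_count and w not in known_words]

# Hypothetical sample text and dictionary.
sample = "Die hond blaf. Die kat slaap. Die hond slaap."
known = {"die", "kat"}
candidates = candidate_missing_words(sample, known)
print(candidates)
```

The min_count threshold is one simple way to keep one-off typos out of
the candidate list; a human would still need to review what comes out.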

Your question might be more appropriate for the upstream project that
provides the dictionaries used by LibreOffice. It might also be
appropriate for the LanguageTool (grammar checking) list.

jonathon