A spell checker for german courses

Background

This summer semester I selected the practical lab: data analysis and visualization. Probably because of mistake of the teacher assistant, the table of courses’ names provided contains garbled texts. They should have save it in utf8 so as to capture those German special characters.

Idea

This is a simple spell checker on German words with special character.

It is specially implemented for replacing course’ names having ‘?’ with the right character.

I simply followed the Idea of Novig and modified his code.

Code

import re
from collections import Counter
import pickle

WORDS = pickle.load( open( "germanwords.p", "rb" ) )

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'öäüÖÄÜßáóúéÁÓÚÉ'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    return set(replaces)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def correctionWord(word): 
    "Most probable spelling correction for word."
    return max(candidates(word.lower()), key=P)

def correctionStr(string):
    "Correction for a string with multiple words"
    trimstr = re.sub("\s*\(.*\)\s*", "", string)
    strarray = re.findall(r'[A-Z\?]?[a-z\?]+', trimstr)
    for i, substring in enumerate(strarray):
        if "?" in substring:
            strarray[i] = correctionWord(substring)
    resultstr = " ".join(x for x in strarray)        
    return resultstr

Usage

Simply run with: myCorrectStr = correctionStr("Einf?hrung f?r settings?")

Then myCorrectStr should be "einführung für settings?"

Note

Remember to put the germanwords.p file in the right directory.

I constructed this file by counting on a text file containing 5 German books and 5 semesters of LMU’s course schedules.

Novig’s blog

coursetable.p

Written on August 17, 2017