4. Core Engine

[package com.xmlmind.spellcheck.engine]

The core engine has no notion of a graphical user interface. This section is meant for developers who want to modify the GUI components or create entirely different applications, for example to do batch processing or to modify the servlets of the Client/Server edition.

4.1. Services provided by SpellChecker

The services fall into three categories:

  • Checking work per se: scanning a sequence of characters, checking individual words, getting suggestions for an erroneous word.

  • Manipulating languages and dictionaries, in particular the personal dictionary (learning words and suggestions).

  • Setting and getting miscellaneous control options.

About multi-thread safety

The SpellChecker class cannot be shared between different threads since it contains the state of a spelling session.

Therefore, in a multi-threaded server for example, one instance of SpellChecker must be created for each spelling session. Since SpellChecker is not a heavyweight object (it does not contain the compiled dictionaries), this raises no performance issue.

In contrast, DictionaryManager (used by SpellChecker) is thread-safe, and its purpose is precisely to share dictionaries. See the section about dictionaries for more details.
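
The sharing rule above can be sketched as follows. This is only an illustration of the pattern, not the library's code: SharedDictionaries and SessionChecker are hypothetical stand-ins for DictionaryManager and SpellChecker, and the word-counting logic is invented for the example. The point is that the lightweight, stateful object is created once per session, while the heavy, read-mostly object is shared by all threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for DictionaryManager: thread-safe, shared by all sessions.
final class SharedDictionaries {
    private final Map<String, Set<String>> wordsByLanguage = new ConcurrentHashMap<>();

    void register(String languageCode, Set<String> words) {
        // Store an immutable copy so that readers never see a partial update.
        wordsByLanguage.put(languageCode, Collections.unmodifiableSet(new HashSet<>(words)));
    }

    boolean contains(String languageCode, String word) {
        Set<String> words = wordsByLanguage.get(languageCode);
        return words != null && words.contains(word);
    }
}

// Hypothetical stand-in for SpellChecker: it holds session state,
// so each instance must stay confined to one session.
final class SessionChecker {
    private final SharedDictionaries dictionaries;
    private final String languageCode;
    private int unknownCount; // session state: not safe to share between threads

    SessionChecker(SharedDictionaries dictionaries, String languageCode) {
        this.dictionaries = dictionaries;
        this.languageCode = languageCode;
    }

    int countUnknownWords(String text) {
        unknownCount = 0;
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && !dictionaries.contains(languageCode, word)) {
                unknownCount++;
            }
        }
        return unknownCount;
    }
}

final class PerSessionCheckerDemo {
    // Checks each text in its own task, creating one checker per task.
    static int[] checkAll(SharedDictionaries shared, String[] texts) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String text : texts) {
                Callable<Integer> session =
                    () -> new SessionChecker(shared, "en").countUnknownWords(text);
                pending.add(pool.submit(session));
            }
            int[] counts = new int[texts.length];
            for (int i = 0; i < counts.length; i++) {
                counts[i] = pending.get(i).get();
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("spelling session failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real classes, the equivalent would be to create a new SpellChecker for each session and let all of them use the same DictionaryManager.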

4.2. Step-by-step

  1. The first step is to create an instance of SpellChecker.

    The constructor is called directly. The simplest constructor takes one argument, a "dictpath": the name of the directory where dictionaries are installed.

    For a discussion of the dictionary storage conventions, see the "Dictionary Management" section below.

    String dictPath = ... // application-dependent
    SpellChecker checker = new SpellChecker( dictPath );
  2. Then it is possible to set a number of options: see the 'Options' section for more details.

    It is important to set the current language: setSelectedLanguage(languageCode) performs this task. Available languages can be obtained by listLanguages().

    It is also useful to set the load and save path for the personal dictionaries (there is a distinct personal dictionary for each language). This path is typically application- and system-dependent.

    String path = homeDir + File.separatorChar 
                          + "myapp_spell" + File.separatorChar + "%L%" ;
    checker.setPersonalDictionaryPath( path );

    The path is actually a pattern that must contain a marker for the language name. This marker is "%L%", as can be seen in the example above.
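
To make the role of the marker concrete, here is a minimal sketch of how such a pattern could be expanded into one path per language. The expansion is performed internally by SpellChecker; the class PersonalDictPathDemo and its expand method are hypothetical, and substituting the bare language code for the marker is an assumption made for illustration.

```java
import java.io.File;

final class PersonalDictPathDemo {
    static final String LANGUAGE_MARKER = "%L%";

    // Hypothetical helper: replaces the language marker with a language code.
    static String expand(String pattern, String languageCode) {
        if (!pattern.contains(LANGUAGE_MARKER)) {
            throw new IllegalArgumentException(
                "pattern must contain " + LANGUAGE_MARKER + ": " + pattern);
        }
        return pattern.replace(LANGUAGE_MARKER, languageCode);
    }

    public static void main(String[] args) {
        String pattern = System.getProperty("user.home") + File.separatorChar
                       + "myapp_spell" + File.separatorChar + LANGUAGE_MARKER;
        // One distinct personal dictionary location per language:
        System.out.println(expand(pattern, "en"));
        System.out.println(expand(pattern, "fr"));
    }
}
```

A pattern without the marker would make every language map to the same file, which is why the sketch rejects it.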

  3. Then we come to the spell-checking main loop. The model we use is very simple and attempts to make as few assumptions as possible about the client application.

    The actual implementation of your text is abstracted by an interface called CharSequence (the same as in the JRE 1.4).

    The Spell Checker accepts a piece of text (that is, a CharSequence) through SpellChecker.setInput(...), then checkNext() is invoked.

    CharSequence myInput = ... // get from your application
    checker.setInput( myInput );
    int err = checker.checkNext();

    If checkNext() returns ERR_NONE, the piece of text set as input is correct, and the application has to proceed to the next piece of text or finish.

  4. Processing the errors returned by checkNext():

    ERR_NONE

    No error has been detected. The application should proceed to the next piece of text, or finish.

    ERR_UNKNOWN_WORD

    The word is not contained in the dictionaries and is not a compound of existing words. Typically, you will invoke getSuggestions() to obtain relevant suggestions for correcting the word.

    ERR_WRONG_CAP

    The word is known, but is improperly capitalized. For example it is a proper name starting with a lowercase letter, or an acronym expected to be in all caps (for example "Xml" instead of "XML"). It can also be a plain word after an end-of-sentence punctuation mark. It is also possible here to invoke getSuggestions() which should return the properly capitalized word first among other suggestions.

    This error can be inhibited by setting the IgnoreCase option to true (see Table 1).

    ERR_PUNCTUATION

    A dubious sequence of punctuation marks was found: either whitespace before a mark such as a dot, comma, colon, semicolon, question mark or exclamation mark, or two consecutive marks (except dots) such as ".,". Here also, getSuggestions() will propose replacements.

    This error can be inhibited by setting the CheckPunctuation option to false.

    ERR_DUPLICATE

    Two identical consecutive words. (Note: in some languages this can be correct, as in English "had had" or French "nous nous", but this case is not yet supported.) getWord() and getPosition() return the second word, but getSuggestions() does not return meaningful results. The application should either ignore the error or delete the second word.

    This error can be inhibited by setting the IgnoreDuplicates option to true (see Table 1).

    ERR_REPLACE

    If the personal dictionary has been enriched with replacements to perform automatically (using learnAutoReplacement(), which corresponds to a command like "Replace Always"), the checkNext() method signals that it has encountered such a replacement. The action to take here is to invoke getReplacement(), passing the word obtained by getWord(), then to proceed in the check loop. For example:

    String word = checker.getWord();
    myTextSource.replace( checker.getPosition(), word.length(),
                          checker.getReplacement(word) );

    This mechanism can be inhibited by setting the AutoReplace option to false.

  5. Notes about setInput() and the checking loop:

    The character sequence set with setInput() is assumed to stay unmodified by the application until checkNext() reaches its end. Depending on the implementation, this can often be unrealistic if a replacement is performed. Therefore:

    • The simplest way is to always call the setInput() method before checkNext(), with a fragment reflecting the updated state of the text source.

    • Alternatively, the input text can be left untouched by the modifications, but the application then has to translate the positions returned by getPosition(), since they are relative to the original text fragment.
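
The second approach can be sketched as follows. This is a minimal illustration, assuming that errors are reported from left to right within a fragment; PositionTranslator and PositionTranslationDemo are hypothetical helpers, not part of the library. The translator keeps a running length difference and adds it to each position reported against the original fragment.

```java
// Hypothetical helper: maps positions in the original fragment
// to positions in the text as edited so far.
final class PositionTranslator {
    private int delta; // cumulative length change caused by earlier replacements

    int toCurrent(int originalPosition) {
        return originalPosition + delta;
    }

    void recordReplacement(int oldLength, int newLength) {
        delta += newLength - oldLength;
    }
}

final class PositionTranslationDemo {
    // Applies replacements whose positions refer to the ORIGINAL text,
    // given in increasing position order, to a mutable copy of that text.
    static String applyAll(String original, int[] positions,
                           String[] wrongWords, String[] corrections) {
        StringBuilder current = new StringBuilder(original);
        PositionTranslator translator = new PositionTranslator();
        for (int i = 0; i < positions.length; i++) {
            int start = translator.toCurrent(positions[i]);
            current.replace(start, start + wrongWords[i].length(), corrections[i]);
            translator.recordReplacement(wrongWords[i].length(), corrections[i].length());
        }
        return current.toString();
    }
}
```

For instance, in "a verry bigg dog" the second error starts at position 8 of the original fragment, but once "verry" has been shortened to "very" it starts at position 7 of the edited text.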

  6. A skeleton of the search loop:

    Note: The text source (here mySource) typically implements the TextSource interface defined in package com.xmlmind.spellcheck.ui.

       void doSearch() {
           for(;;) 
           {
                ... // prepare 
                // acquire next fragment from application:
                CharSequence input = mySource.getText(checker.getCharChecker());
                if (input == null) {
                    ... // no more input
                    return;
                }
                checker.setInput(input);
                int err = checker.checkNext();
                if (err == SpellChecker.ERR_NONE) {
                    // end reached: update position in source
                    ...
                    continue;
                }
                String failingWord = checker.getWord();
                int replacePos = checker.getPosition();
                int replaceSize = failingWord.length();
                // application dependent:
                mySource.highlight(replacePos, replaceSize);
    
                switch(err) {
                case SpellChecker.ERR_DUPLICATE:
                    showStatus("duplicate word: " + failingWord);
                    break;
                case SpellChecker.ERR_REPLACE:
                    mySource.replace(replacePos, replaceSize,
                                     checker.getReplacement(failingWord));
                    continue;
                case SpellChecker.ERR_WRONG_CAP:
                    showSuggestions("word should be capitalized");
                    break;
                case SpellChecker.ERR_PUNCTUATION:
                    showSuggestions("punctuation problem");
                    break;
                case SpellChecker.ERR_UNKNOWN_WORD:
                    showSuggestions("unrecognized word");
                    break;
                }
                // get and process user commands:
                break;
           }
       }
  7. Displaying suggestions:

    The Suggestions interface returned by SpellChecker.getSuggestions() provides methods to retrieve suggestions either individually, with getSuggestion(int index), or as an array, with String[] toArray().

    Suggestions are ordered by decreasing pertinence and their number is given by Suggestions.getCount().

    The maximum number of returned suggestions can be set by setSuggestionLimit.

    Example:

    String[] suggestions = checker.getSuggestions().toArray();
    JList displayList = new JList(suggestions);
    if (suggestions.length > 0)
        displayList.setSelectedIndex(0);
  8. Smarter suggestions:

    There is a mechanism to teach the SpellChecker to make better suggestions. When a word entered by a user to correct an erroneous word is not among the first suggestions found, it is possible to invoke learnSuggestion() with the wrong word and its correction as arguments: the next time this word is encountered, the learned suggestion will be put atop the suggestion list.

    void doReplace() {
        String correction = ...; // get correction from user
        // if not among the first 3, learn it:
        if (!suggestions.contain( correction, 3 ))
            checker.learnSuggestion( failingWord, correction, 
                                     SpellChecker.TEMPORARY_DICT );
        ...
    }

    In this example, the learned suggestion is put into the temporary dictionary, therefore lost at the end of the session. It is also possible to put it in the persistent personal dictionary (SpellChecker.PERSONAL_DICT).
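
The decision made in doReplace() can be illustrated on plain arrays, such as the one returned by Suggestions.toArray(). The sketch below is a simplified model rather than the library's implementation: SuggestionRankDemo and both of its methods are hypothetical. The first method tests whether a correction is already among the first n suggestions, and the second models the documented effect of learning, namely that the learned word appears at the top of the list the next time.

```java
import java.util.ArrayList;
import java.util.List;

final class SuggestionRankDemo {
    // True if 'correction' is one of the first n suggestions.
    static boolean isAmongFirst(String[] suggestions, String correction, int n) {
        int limit = Math.min(n, suggestions.length);
        for (int i = 0; i < limit; i++) {
            if (suggestions[i].equals(correction)) {
                return true;
            }
        }
        return false;
    }

    // Models a learned suggestion: it is placed first, the others keep their order.
    static String[] withLearnedFirst(String[] suggestions, String learned) {
        List<String> result = new ArrayList<>();
        result.add(learned);
        for (String suggestion : suggestions) {
            if (!suggestion.equals(learned)) {
                result.add(suggestion);
            }
        }
        return result.toArray(new String[0]);
    }
}
```

Learning only when the correction is not already near the top keeps the personal dictionary from filling up with associations the engine would have found by itself.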

4.3. Options

Options are accessed through get/set methods, following the JavaBeans conventions.

For example, the IgnoreCase option is handled with boolean getIgnoreCase() and void setIgnoreCase(boolean).

SpellChecker also has two methods (loadOptions and saveOptions) that set all options from, or retrieve them into, a java.util.Properties object.
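
Since the exact signatures of loadOptions and saveOptions and the property keys they use are not given in this section, the following sketch only shows the java.util.Properties side of such an exchange: writing options to text and reading them back with typed accessors. The class OptionsPropertiesDemo is hypothetical, and the keys, named after options of Table 1, are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

final class OptionsPropertiesDemo {
    // Hypothetical keys, named after the options of Table 1.
    static Properties defaults() {
        Properties options = new Properties();
        options.setProperty("IgnoreCase", "false");
        options.setProperty("CheckPunctuation", "false");
        options.setProperty("SuggestionLimit", "15");
        return options;
    }

    // Serializes the options in the standard .properties text format.
    static String serialize(Properties options) {
        StringWriter out = new StringWriter();
        try {
            options.store(out, "spell checker options");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    static Properties parse(String text) {
        Properties options = new Properties();
        try {
            options.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return options;
    }

    static boolean getBoolean(Properties options, String key, boolean fallback) {
        String value = options.getProperty(key);
        return value == null ? fallback : Boolean.parseBoolean(value.trim());
    }

    static int getInt(Properties options, String key, int fallback) {
        String value = options.getProperty(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }
}
```

In an application, the same Properties object could be persisted to a file in the user's settings directory on exit and reloaded at startup.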

 

Table 1. Options

Option | Description | Type | Default value
-------|-------------|------|--------------
IgnoreCase | If set, ignore capitalization errors. | boolean | false
IgnoreMixedCase | If set, do not check words containing case mixing (e.g. "SpellChecker"). | boolean | false
IgnoreDigits | If set, do not check words containing digits (e.g. "b2b"). | boolean | true
IgnoreURL | If set, ignore words looking like URLs or file names (e.g. "www.xxx.com" or "c:\boot.ini"). | boolean | true
IgnoreDuplicates | If set, do not signal two successive identical words as an error. | boolean | false
CheckPunctuation | If set, punctuation checking is enabled: misplaced white space and wrong sequences, like a dot following a comma, are detected. | boolean | false
AllowCompound | If set, all words formed by concatenating two legal words with a hyphen are accepted. If the language allows it, two words concatenated without a hyphen are also accepted. | boolean | true
AllowPrefixes | If set, a word formed by concatenating a registered prefix and a legal word is accepted. For example, if "mini-" is a registered prefix, "mini-computer" is accepted. | boolean | true
AllowFileExt | If set, accept any word ending with a registered file extension (e.g. "myfile.txt", "index.html", etc.). | boolean | true
AutoReplace | Enables the "Replace Always" feature. If set, the checkNext method of SpellChecker can return ERR_REPLACE, then getReplacement() can be used to retrieve the replacement value. | boolean | true
SuggestionForce | Intensity of suggestion search: ranges from 0 to FORCE_MAX. | int | FORCE_DEFAULT
SuggestionLimit | Maximum number of suggestions returned (does not influence the duration of a suggestion search). | int | 15

4.4. Manipulating Dictionaries

There are numerous methods for managing dictionaries.

To know more about dictionary structure, read the Dictionary Builder documentation.

The most commonly used methods are the following:

  • setPersonalDictionaryPath: defines a pattern for file storage location for personal dictionaries.

  • listLanguages: returns a list of items describing the detected languages and dictionaries.

  • setSelectedLanguage: selects a language, loads default dictionary if necessary.

  • selectDictionary: loads a dictionary (if necessary) and implicitly selects the dictionary's language. This method works like setSelectedLanguage, except that other dictionaries already loaded in the same language are removed.

  • getSelectedLanguage, getSelectedLanguageInfo: information about the currently selected language.

  • savePersonalDictionaries: forces a save of all personal dictionaries (for example on exit).

  • getDictionaryManager, setDictionaryManager: for more advanced control.

  • setDictionaryPath: defines a non-standard directory where dictionary archives (.dar) can be found.

Other methods:

  • clearLanguageDictionaries: resets a language.

  • listEditableDictionaries: returns a list of editable dictionaries for the current language.

  • manageEditableDictionary: to select, add, load, or remove an editable dictionary.

  • getEditableWords: returns an array of word descriptors from the current editable dictionary.

  • changeWord: to edit the contents of editable dictionaries.