6. Format of the hints file

When it encounters an unrecognized word, the spell checker engine alters it in different ways and looks for the altered forms in its dictionaries. The standard alterations are: inserting a character (to correct an omission), deleting a character (to correct a superfluous keystroke), swapping two adjacent characters, and replacing one character with another (especially characters that are neighbors on a keyboard).
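The four standard alterations can be sketched in a few lines of Python. This is only an illustration of the general technique, not the XMLmind Spell Checker implementation; the function names, the lowercase ASCII alphabet and the use of a plain set as the dictionary are all assumptions.

```python
import string


def single_edit_candidates(word, alphabet=string.ascii_lowercase):
    """Return every string that is one standard alteration away from `word`."""
    # All ways to cut the word in two: ("", "word"), ("w", "ord"), ...
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Insert a character (corrects an omission).
    inserts = {a + c + b for a, b in splits for c in alphabet}
    # Delete a character (corrects a superfluous keystroke).
    deletes = {a + b[1:] for a, b in splits if b}
    # Swap two adjacent characters.
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replace one character with another.
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    return (inserts | deletes | swaps | replaces) - {word}


def suggest(word, dictionary):
    """Keep only the altered forms that are present in the dictionary (a set)."""
    return sorted(single_edit_candidates(word) & dictionary)
```

For instance, suggest("teh", {"the"}) finds "the" through the swap of two adjacent characters.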

The spell checker engine can also use knowledge specific to the language being checked. Some spell checker engines convert the word to a phonetic form and look up that form. This is powerful for seriously bad misspellings. The drawback is that writing phonetic conversion rules is tedious and delicate.

The method used by XMLmind Spell Checker is both simple and powerful: it consists of specifying groups of character sequences that are easily mistaken for one another. For example, if the sequences "ph" and "f" are declared as easily mistaken for one another, the engine will quickly find the correct spelling for "elefant", because it will try replacing "f" with "ph".

Most often, such hints reflect phonetic similarity, but they can also deal with more specific cases (for example in French, people often write "ceuil" instead of "cueil" in words like "recueil", "accueil" etc.). Some spell checkers handle such frequent mistakes with special catalogs, but the method implemented in XMLmind Spell Checker is more general and powerful.

In short, a hints file defines two types of information: character declarations, and hints about likely mistakes (the %mistake and %kbline directives).

6.1. Character declarations

XMLmind Spell Checker requires a declaration for characters used in word lists. This helps to detect malformed words.

By default, the ASCII uppercase and lowercase letters, digits, hyphen, dot and apostrophe are declared as acceptable "word characters".

To declare supplementary characters, use the %chars directive. It takes a single argument: the string of characters to declare, with no spaces inside. For example:

%chars àâéèëêîôùû

The %chars directive declares that the characters may appear anywhere in a word.

Two other directives, %noninitial and %nonfinal, allow this to be refined. They specify that a character may not appear at the first or at the last position in a word. For example:

%noninitial '
%nonfinal '

means that the apostrophe may appear only inside a word, not at the beginning (%noninitial) or at the end (%nonfinal).

By default the hyphen, dot and apostrophe are non-initial and non-final.

These directives are rarely used beyond the example above (namely in French and Italian).
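The character declarations described in this section can be modeled as three sets of characters. The Python sketch below is a hypothetical illustration, not the actual engine: the CharTable class and its method names are invented, and it assumes that %noninitial and %nonfinal only restrict positions without declaring new characters.

```python
import string


class CharTable:
    """Hypothetical model of the character declarations of a hints file."""

    def __init__(self):
        # Defaults stated in this section: ASCII letters, digits,
        # hyphen, dot and apostrophe are word characters...
        self.word_chars = set(string.ascii_letters + string.digits + "-.'")
        # ...and hyphen, dot and apostrophe are non-initial and non-final.
        self.non_initial = set("-.'")
        self.non_final = set("-.'")

    def apply(self, line):
        """Apply one %chars, %noninitial or %nonfinal directive."""
        directive, _, argument = line.strip().partition(" ")
        argument = argument.strip()
        if directive == "%chars":
            self.word_chars.update(argument)
        elif directive == "%noninitial":
            self.non_initial.update(argument)
        elif directive == "%nonfinal":
            self.non_final.update(argument)

    def is_well_formed(self, word):
        """Check that a word uses only declared characters in allowed positions."""
        return (
            bool(word)
            and all(c in self.word_chars for c in word)
            and word[0] not in self.non_initial
            and word[-1] not in self.non_final
        )
```

With the defaults alone, a word such as "café" is rejected as malformed; it is accepted once the line %chars àâéèëêîôùû has been applied.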

6.2. The %mistake directive

The syntax is very simple:

%mistake[modifier] seq1 seq2 ... seqN

This means that each time one of these sequences is found in an unknown word, the spell-checking engine will attempt to replace it with each of the other sequences of the same rule and look up the newly formed hypothesis in the dictionary.

To make this concrete, consider the rule %mistake f ff ph and assume that the word "elefant" is encountered. The engine will try replacing "f" with "ff" and with "ph", generating "eleffant" and "elephant" and looking them up in the dictionary; it should then offer the latter as a suggestion.
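The substitution step for a single rule can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the engine's actual code: the function name mistake_hypotheses is hypothetical, a rule is represented as a plain list of sequences, and only one occurrence is replaced per hypothesis.

```python
def mistake_hypotheses(word, group):
    """Replace each occurrence of a sequence of the rule by every other
    sequence of the same rule, one occurrence at a time."""
    hypotheses = set()
    for seq in group:
        start = word.find(seq)
        while start != -1:
            for other in group:
                if other != seq:
                    hypotheses.add(word[:start] + other + word[start + len(seq):])
            # Look for the next (possibly overlapping) occurrence.
            start = word.find(seq, start + 1)
    hypotheses.discard(word)
    return hypotheses
```

Called with "elefant" and the group ["f", "ff", "ph"], it produces exactly the two hypotheses "eleffant" and "elephant"; each would then be looked up in the dictionary.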

The modifier indicates how likely the substitutions are. It is either '-' (less likely) or '+' (more likely), and several modifiers can be combined. For example, in French we could have the following directives:

%mistake+  a â à
%mistake++ i î
%mistake-  i y

This means that getting a grave or circumflex accent wrong is quite likely, while confusing an 'i' with a 'y' is less likely.

Note: the %mistake-- likelihood is the default for any pair of letters. So it is generally useless to specify more than one '-'.
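A %mistake line can be split into its modifier and its sequences with a short parser. The Python sketch below is illustrative only: the function name parse_mistake is hypothetical, and representing the likelihood as an integer level (each '+' counting +1 and each '-' counting -1) is an assumption made for the example, not a documented property of the engine.

```python
import re

# "%mistake", optional run of + or - modifiers, then the sequences.
_MISTAKE = re.compile(r"^%mistake([+-]*)\s+(.+)$")


def parse_mistake(line):
    """Return (level, sequences) for a %mistake line, or None otherwise.

    The numeric level is an assumption of this sketch: 0 for a bare
    %mistake, +1 per '+', -1 per '-'.
    """
    match = _MISTAKE.match(line.strip())
    if match is None:
        return None
    modifiers, rest = match.groups()
    level = modifiers.count("+") - modifiers.count("-")
    return level, rest.split()
```

Under this convention, the three French directives above parse to levels 1, 2 and -1 respectively.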

Use this directive in moderation, as it can slow down the engine. In particular, directives with many sequences lead to higher combinatorial complexity.

Special cases: characters ^ and $

These characters have a special meaning. When used in a sequence, they make the sequence match only at the beginning of a word (^) or only at the end of a word ($). For example:

%mistake  ^kn ^n
%mistake  $ gh$ w$

The first rule states that at the beginning of a word, "kn" can be mistaken for (sounds like) an "n". The second rule means that at the end of a word, "gh" or "w" can be forgotten or erroneously added ("$" alone means "nothing" or "silent").

Note

It makes no sense to mix, in the same rule, sequences with and without a "$" (or, likewise, a "^"). However, a sequence may have both, in which case it matches a whole word. This should be used in moderation.
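One simple way to picture the ^ and $ anchors is to wrap the word in the same markers before substituting, so that an anchored sequence can only match at the corresponding end. The Python sketch below relies on that assumption; the function name anchored_hypotheses is hypothetical and this is not necessarily how the engine proceeds.

```python
def anchored_hypotheses(word, group):
    """Apply one rule whose sequences may contain ^ (start) or $ (end)."""
    # "^" and "$" never occur inside a word, so "^kn" can only match at
    # the start of the marked form and "gh$" only at its end.
    marked = "^" + word + "$"
    hypotheses = set()
    for seq in group:
        start = marked.find(seq)
        while start != -1:
            for other in group:
                if other != seq:
                    candidate = marked[:start] + other + marked[start + len(seq):]
                    # Remove the markers again to get a plain word.
                    hypotheses.add(candidate.strip("^$"))
            start = marked.find(seq, start + 1)
    hypotheses.discard(word)
    return hypotheses
```

With the rule ^kn ^n, the unknown word "nife" yields the single hypothesis "knife". With the rule $ gh$ w$, the word "thou" yields "though" and "thouw", since "$" alone stands for a silent ending.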

6.3. The %kbline directive

This is in effect a shortcut that replaces many %mistake directives: the argument is a string of horizontally adjacent characters on a keyboard. The directive specifies that each character is "close to" its one or two neighbors.

For example, in the first line below, "q" is close to "w", "w" to "e", "e" to "r", etc.

# English keyboard:

%kbline qwertyuiop
%kbline asdfghjkl
%kbline  zxcvbnm

The likelihood defined is roughly equivalent to that of %mistake-. Modifiers can also be applied to %kbline; thus %kbline+ is roughly equivalent to %mistake.
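The way a %kbline argument expands into pairs of neighboring keys can be sketched as follows. This Python fragment is only an illustration; the function name kbline_neighbors and the dictionary-of-sets representation are assumptions.

```python
from collections import defaultdict


def kbline_neighbors(lines):
    """Map each character to its one or two horizontal neighbors.

    `lines` holds the arguments of successive %kbline directives.
    """
    neighbors = defaultdict(set)
    for line in lines:
        keys = line.strip()
        # Pair every key with the key immediately to its right.
        for left, right in zip(keys, keys[1:]):
            neighbors[left].add(right)
            neighbors[right].add(left)
    return dict(neighbors)
```

For the English keyboard above, "w" maps to the two neighbors "q" and "e", while "q", at the end of its row, has the single neighbor "w". Each such pair plays the role of one two-sequence %mistake- rule.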

6.4. Miscellaneous

Another directive controls compound words: %compoundmin length

This directive means that compound words (without hyphens) are automatically allowed, provided that the length of each component is at least the length specified in the directive. This is meant for German and Nordic languages.

For example in German, the directive %compoundmin 3 means that words like "aus" and "gehen" can be automatically combined into "ausgehen", but that "in" and "gehen" will not produce "ingehen" (because the length of "in" is less than 3).
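The compound rule can be sketched as a recursive split of the word into dictionary entries of sufficient length. This Python fragment is a hypothetical illustration: the function name is invented, the dictionary is a plain set, and allowing more than two components is an assumption that the text above does not confirm.

```python
def is_allowed_compound(word, dictionary, compound_min):
    """True if `word` splits into two or more dictionary words, each of
    length at least `compound_min`."""

    def split_from(start, parts):
        if start == len(word):
            # The whole word is consumed: accept only real compounds.
            return parts >= 2
        # Try every component that is long enough.
        for end in range(start + compound_min, len(word) + 1):
            if word[start:end] in dictionary and split_from(end, parts + 1):
                return True
        return False

    return split_from(0, 0)
```

With the dictionary {"aus", "gehen", "in"} and a minimum length of 3, "ausgehen" is accepted and "ingehen" is rejected; lowering the minimum to 2 would accept "ingehen" as well.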