Qizx supports contextual full-text search through extension functions.
Future versions will support the XQuery Full-Text specifications, at least partially.
Contextual (or context-sensitive) full-text search means that text patterns can be searched in the context of specific XML elements. For example, it is possible to express queries like:
find the SECTION elements whose TITLE child element contains the word "hazard":
collection(...)//SECTION [ ft:contains(" hazard ", TITLE) ]
find the TABLE elements where the phrase "first try" occurs in a cell of the first column:
collection(...)//TABLE [ ft:contains(" 'first try' ", ROW/CELL[1]) ]
When used to search XML Libraries, full-text search functions are index-based and therefore very efficient.
Do not confuse ft:contains with the standard function contains (or fn:contains): the latter is not index-based and is therefore generally much slower. It searches for a plain substring, while ft:contains evaluates a full-text expression (whose syntax is described below).
These functions can also be used to search parsed documents or even constructed fragments, but in this case execution is significantly slower (as an indication, the complete works of Shakespeare, about 8 MB of XML, can be scanned in about one second on a 3 GHz processor).
Full-text functions return a boolean value (is there a match for the full-text expression?).
They are typically used inside an XPath predicate.
This is the most general form of full-text query.
There are also more specialized functions which make it easier to write specific types of full-text queries (single phrase, all words of a sequence...). Please see the next section.
ft:contains ($query as xs:string [, $context-nodes as node()* ]) as xs:boolean
x:fulltext ($query as xs:string [, $context-nodes as node()* ]) as xs:boolean
Note: x:words and x:fulltext are aliases of the function ft:contains.
This function implements context-sensitive full-text search: it can search boolean combinations of words, word patterns and phrases, in the context of specific elements. It is typically used inside a predicate.
For example the following expression returns SPEECH elements which contain both words "romeo" and "juliet":
//SPEECH [ ft:contains(" romeo AND juliet ") ]
Returned value: The function returns true if the string-value of at least one node of the context-nodes parameter matches the full-text query. Matching is therefore not affected by element substructure (mixed content). For example, the phrase 'to be or not to be' would be found in <line>To be <b>or not to be</b> ...</line>.
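The effect of matching against the string-value can be illustrated outside Qizx. The following Python sketch is only a model of the behaviour described above, not Qizx code: the helper names string_value and words are hypothetical, and the simple regular expression stands in for the Word Sieve.

```python
import re
import xml.etree.ElementTree as ET


def string_value(element):
    """Concatenate all descendant text nodes, like the XPath string-value."""
    return "".join(element.itertext())


def words(text):
    """Lowercased word list; a crude stand-in for the Word Sieve (assumption)."""
    return re.findall(r"\w+", text.lower())


# The <b> element disappears once the string-value is taken,
# so the phrase is contiguous in the flattened text.
line = ET.fromstring("<line>To be <b>or not to be</b> ...</line>")
flat = string_value(line)
print(flat)
print(words(flat))
```

Because the inline markup is flattened before words are extracted, the six words of the phrase appear consecutively even though they span two elements.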
Parameter $query: a string containing a full-text pattern (see the syntax below).
Parameter $context-nodes: An optional node or sequence of nodes which specify or restrict the area where the full-text expression is searched for.
When the context-nodes argument is not specified, the current context node '.' is used implicitly, as in the example above (so the call must be inside a predicate). When the context-nodes parameter is present, it can be relative to the current context node: for example, this expression finds SPEECH elements which contain a LINE element which in turn contains both words "romeo" and "juliet":
//SPEECH [ ft:contains(" romeo AND juliet ", LINE) ]
Simple word: a word without the wildcard characters '%' and '_'.
By default, case and accents are ignored (i.e. "café" is equivalent to "CAFE").
What counts as a "word" is defined by the Word Sieve (word parser) in use for the queried XML Library: see Chapter 6, Configuring the indexing process for a definition of a Word Sieve and how to specify it.
Characters that cannot belong to a word and are not special characters like '&', '|', '-', '%' and '_' (as defined below) are simply ignored.
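As an illustration of what ignoring case and accents means for simple words, here is a minimal Python sketch. It is an assumption about the general technique (Unicode decomposition followed by removal of combining marks), not a description of how Qizx normalizes words internally; the function name normalize is hypothetical.

```python
import unicodedata


def normalize(word):
    """Strip accents and fold case so that 'café' and 'CAFE' compare equal."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


print(normalize("café") == normalize("CAFE"))
```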
Word pattern: a word that also contains SQL-style wildcard characters.
The underscore '_' matches exactly one character.
The percent sign '%' matches any sequence of characters, including an empty one (it matches as many characters as possible).
For example "intern%" would match intern, internal, internals etc.
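The wildcard rules can be modeled by translating a pattern into a regular expression. This Python sketch only illustrates the SQL-style semantics described above; the function names pattern_to_regex and word_matches are hypothetical, and the index-based matching performed by Qizx is not shown.

```python
import re


def pattern_to_regex(pattern):
    """Translate a SQL-style word pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # any sequence of characters, possibly empty
        elif ch == "_":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def word_matches(pattern, word):
    """True if the whole word matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(word) is not None


for candidate in ("intern", "internal", "internals", "inter"):
    print(candidate, word_matches("intern%", candidate))
```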
OR: syntax is term1 OR term2. The sign '|' can be used instead of OR.
OR has precedence over AND (see below).
AND: syntax is term1 AND term2. The sign '&', or even a simple juxtaposition, can be used instead of AND.
Thus "romeo AND juliet", "romeo & Juliet", "Roméo Juliet" are equivalent.
NOT: syntax is the sign '-' or the keyword NOT placed before a term.
For example "Romeo -Juliet" is equivalent to "Romeo AND NOT Juliet".
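The way the boolean operators combine, with OR binding more tightly than AND, can be sketched with a small evaluator. This Python model rests on several assumptions that the text above does not settle: keywords are recognized in upper case only, negation applies to the single term that follows it, and words are compared after lowercasing (accents are not stripped here). Phrases and wildcards are left out, and the function name query_matches is hypothetical.

```python
import re


def query_matches(query, text):
    """Evaluate a query of simple words combined with AND, OR and NOT."""
    present = set(re.findall(r"\w+", text.lower()))
    clauses = []      # AND-ed list; each clause is an OR-ed list of (negated, word)
    negate = False
    after_or = False
    for token in re.findall(r"[&|]|-?\w+", query):
        if token in ("AND", "&"):
            continue                  # juxtaposition already means AND
        if token in ("OR", "|"):
            after_or = True
            continue
        if token == "NOT":
            negate = True
            continue
        if token.startswith("-"):
            negate, token = True, token[1:]
        term = (negate, token.lower())
        if after_or and clauses:
            clauses[-1].append(term)  # OR extends the current clause
        else:
            clauses.append([term])    # otherwise start a new AND-ed clause
        negate = after_or = False
    return all(
        any((word in present) != negated for negated, word in clause)
        for clause in clauses
    )


print(query_matches("Romeo -Juliet", "Romeo alone"))
print(query_matches("romeo AND juliet OR tybalt", "Tybalt alone"))
```

In this model "romeo AND juliet OR tybalt" is read as romeo AND (juliet OR tybalt), so a text that mentions only Tybalt does not match.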
Phrase: an ordered sequence of terms (simple words or patterns), surrounded by single or double quotes. By default, terms must appear exactly in the order specified.
It is possible to specify a tolerance or distance, which is the maximum number of words interspersed among the terms of the phrase query. The notation is phrase~N, where N is an optional count of words (4 if not specified). The two following examples match the phrase "to be or not to be, that is the question":
//SPEECH [ ft:contains(" 'to be that question'~ ", LINE) ]
//SPEECH [ ft:contains(" 'to be or question'~6 ", LINE) ]
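To make the tolerance concrete, the following Python sketch counts the words interspersed between the first and last term of a candidate match and compares that count with the tolerance. It is a simplified model under stated assumptions: terms are simple words only, the text is split with a plain regular expression rather than a Word Sieve, and the function names tokenize and phrase_matches are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased word list; a crude stand-in for the Word Sieve (assumption)."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(phrase, text, tolerance=4):
    """True if the terms occur in order with at most `tolerance` extra words."""
    terms = tokenize(phrase)
    words = tokenize(text)
    if not terms:
        return False
    for start, word in enumerate(words):
        if word != terms[0]:
            continue
        pos = start
        found = True
        # Greedily take the earliest later occurrence of each following term.
        for term in terms[1:]:
            try:
                pos = words.index(term, pos + 1)
            except ValueError:
                found = False
                break
        # Extra words = span length minus the number of phrase terms.
        if found and (pos - start + 1) - len(terms) <= tolerance:
            return True
    return False


text = "to be or not to be, that is the question"
print(phrase_matches("to be that question", text))      # default tolerance of 4
print(phrase_matches("to be or question", text, 6))
```

In the first query the match starts at the second "to be", leaving only "is the" interspersed; in the second, the six words "not to be that is the" separate "or" from "question", which is why a tolerance of 6 is needed.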
Note that the above syntax has some limitations: OR cannot combine AND clauses or phrases. This limitation can be worked around with a boolean combination of calls to ft:contains, for example:
doc("r_and_j.xml")//LINE [ ft:contains("name AND rose")
    or ft:contains(" 'smell as sweet' ") ]
would yield the two lines (Romeo and Juliet, act II scene 2):
<LINE>What's in a name? that which we call a rose</LINE>
<LINE>By any other name would smell as sweet;</LINE>
These functions are convenient for specialized text search.
ft:phrase ($words as xs:string+ [, $spacing as xs:integer ] [, $context-nodes as node()* ]) as xs:boolean
A variant of ft:contains specialized in phrase search, which allows words to be specified as a sequence of strings.
Parameter $words: a string or a sequence of strings containing words to search for. A string can contain several words; it is parsed using the Word Sieve defined in the Indexing Specifications of the XML Library.
Parameter $spacing: an optional integer which is the maximum number of words which can be interspersed in an occurrence of the phrase to search.
Parameter $context-nodes: An optional node or sequence of nodes which specify or restrict the area where the full-text expression is searched for.
For example:
ft:phrase( ("to", "be", "or", "not"), 5 )
as well as:
ft:phrase( ("to be", "or not"), 5 )
are equivalent to:
ft:contains(" 'to be or not'~5 ")
function ft:all-words ($words as xs:string+ [, $context-nodes as node()* ]) as xs:boolean
A variant of ft:contains which matches when all of the given words are found, and which allows words to be specified as a sequence of strings.
Parameter $words: a string or a sequence of strings containing words to search for. A string can contain several words; it is parsed using the Word Sieve defined in the Indexing Specifications of the XML Library.
Parameter $context-nodes: An optional node or sequence of nodes which specify or restrict the area where the full-text expression is searched for.
For example:
ft:all-words( ("romeo", "juliet"), LINE )
is equivalent to
ft:contains("romeo AND juliet", LINE)
function ft:any-word ($words as xs:string+ [, $context-nodes as node()* ]) as xs:boolean
A variant of ft:contains which matches when at least one of the given words is found, and which allows words to be specified as a sequence of strings.
Parameter $words: a string or a sequence of strings containing words to search for. A string can contain several words; it is parsed using the Word Sieve defined in the Indexing Specifications of the XML Library.
Parameter $context-nodes: An optional node or sequence of nodes which specify or restrict the area where the full-text expression is searched for.
For example:
ft:any-word( ("romeo", "juliet"), LINE )
is equivalent to
ft:contains("romeo OR juliet", LINE)
function ft:highlighter ($query as xs:string, $fragment as element() [, $parts as node()*, $options as element(option) ]) as element()
This function is a companion of the full-text search functions, which can be used to "highlight" matched terms. This is typically useful to present the results of a full-text search.
More precisely it returns a copy of a document fragment where matched terms are surrounded by generated elements.
By default a generated element has the name 'span' and an attribute 'class' with a value equal to the prefix 'hi' followed by the rank of the term in the query.
Applied to a LINE in the example LINE[ft:contains("name OR rose")], this would produce something like:
<LINE>What's in a <span class='hi0'>name</span>? that which we call a <span class='hi1'>rose</span></LINE>
Parameter $query: a string containing a full-text pattern (see the syntax above).
Parameter $fragment: The root of the XML fragment to process.
Parameter $parts: An optional list of sub-elements which must be specifically highlighted. If it is empty, the whole root fragment is highlighted; otherwise only the specified parts are highlighted.
Parameter $options: The options can be used to redefine the output. For example:
<options element='frag' attribute='style' prefix='st'/>
would surround terms with <frag style='st0'></frag> instead of <span class='hi0'></span>.
Recognized options:
element: name of the element used to surround a matched term. Default is 'span'.
attribute: name of the attribute defining the class of the highlighted element. Default is 'class'.
prefix: prefix of the generated class. Default is 'hi'.
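The output convention, including the effect of the three options, can be mimicked on plain text. The Python sketch below is only a model of the generated markup: it works on a string rather than on an XML fragment, handles simple words only, and the function name highlight and its keyword arguments are hypothetical stand-ins for the options listed above.

```python
import re


def highlight(text, terms, element="span", attribute="class", prefix="hi"):
    """Wrap each matched term in an element whose class encodes the term rank."""
    ranks = {term.lower(): rank for rank, term in enumerate(terms)}

    def wrap(match):
        word = match.group(0)
        rank = ranks.get(word.lower())
        if rank is None:
            return word               # not a query term: leave untouched
        return f"<{element} {attribute}='{prefix}{rank}'>{word}</{element}>"

    return re.sub(r"\w+", wrap, text)


line = "What's in a name? that which we call a rose"
print(highlight(line, ["name", "rose"]))
print(highlight(line, ["name", "rose"],
                element="frag", attribute="style", prefix="st"))
```

With the default settings the first call reproduces the markup of the LINE example above; the second call shows the same text with the redefined element, attribute and prefix.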