<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<article>
  <title>XQuery Full-Text for the impatient</title>

  <articleinfo>
    <author>
      <firstname>Xavier</firstname>
      <surname>Franc</surname>
    </author>

    <pubdate>October 18, 2025</pubdate>

    <copyright>
      <year>2008</year>
      <holder>Xavier Franc, Axyana Software</holder>
    </copyright>

    <legalnotice>
      <para>This article is published under the <ulink
      url="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons
      "Attribution-Share Alike"</ulink> license.</para>
    </legalnotice>
  </articleinfo>

  <para><ulink url="http://www.w3.org/TR/xpath-full-text-10/">XQuery and XPath
  Full Text 1.0</ulink> (abbreviated <acronym>XQFT</acronym> hereafter) is a
  new Recommendation (standard) from the W3C that extends the XQuery language
  with comprehensive full-text search capabilities.</para>

  <para>After a short presentation of the main concepts, this document
  introduces the main features through concrete examples (problem and
  solution). A more detailed explanation of syntax and semantics would be
  lengthy and is beyond the scope of this document. We will be quite satisfied
  if this document helps you grasp the essential ideas and write your
  first XQuery Full-Text queries.</para>

  <section>
    <title>How XQuery Full-Text is different from usual full-text</title>

    <itemizedlist>
      <listitem>
        <para>Basically, a standard full-text engine searches a collection of
        <glossterm>pages</glossterm> or <glossterm>documents</glossterm>
        matching a particular query (that is, containing certain words or
        combinations of words). The engine returns the matching
        <emphasis>pages</emphasis> (or a reference to these pages).</para>
      </listitem>

      <listitem>
        <para>Standard full-text engines mostly ignore the
        <emphasis>structure</emphasis> of documents, whatever their format
        (XML, HTML, PDF, RTF, etc.). Roughly
        speaking, they treat each document as a simple sequence of
        words.</para>
      </listitem>

      <listitem>
        <para>Yet some full-text engines are able to divide document contents
        into areas or <emphasis>fields</emphasis>, and use these fields in
        queries. For example in Google&trade;, there are fields like "page
        title", "page body", "page links", so it is possible to
        search only the pages whose title contains the phrase "yes we can".
        The <ulink url="http://lucene.apache.org/">Lucene</ulink> open-source
        engine allows you to define fields at will.</para>

        <para>So fields are a step towards structuring document contents. As
        we will see, XQuery Full-Text goes much farther in this
        direction.</para>
      </listitem>

      <listitem>
        <para>XQuery deals with XML documents. An XML document is structured
        in <emphasis>elements</emphasis>, which can contain other elements and
        text (that is, words). XQuery is able to return not only
        <emphasis>documents</emphasis> but precise <emphasis>nodes</emphasis>
        (matching a query) inside documents.</para>

        <para>So naturally XQuery <emphasis>Full-Text</emphasis> is able to
        use XML elements to restrict or refine queries (much like fields in
        usual full-text engines). This is sometimes called contextual
        full-text search. For example, in a DocBook document one can search
        for a <literal>section</literal> whose <literal>title</literal>
        contains the word "enhancement": this is much more selective than
        searching for a document that contains that word. Here is how it
        looks:</para>

        <programlisting>  //section[ title ftcontains "enhancement" ]</programlisting>

        <para>The great advantage of XQuery Full-Text is that <emphasis>any
        element</emphasis> can be used as a context to refine queries, not
        only manually designed fields. And of course context elements can be
        selected with all the power of XQuery expressions.</para>
      </listitem>
    </itemizedlist>
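
    <para>As a sketch of this last point, the context can itself be selected
    by another full-text condition. The query below assumes a DocBook
    <literal>book</literal> divided into <literal>chapter</literal> elements
    (an assumption made for illustration only; it is not the sample data used
    in this tutorial). It looks for sections about enhancements, but only
    inside chapters whose title mentions "release":</para>

    <programlisting>//chapter[ title ftcontains "release" ]//section[ title ftcontains "enhancement" ]</programlisting>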
  </section>

  <section>
    <title>Data used in this tutorial</title>

    <para>We will use documents with a relatively simple structure:
    Shakespeare's plays put in XML format by Jon Bosak. This is a collection
    of 37 documents which can be found at <ulink
    url="http://xml.coverpages.org/bosakShakespeare200.html">http://xml.coverpages.org/bosakShakespeare200.html</ulink>.
    The top element is PLAY, containing ACT, SCENE, SPEECH and a few others.
    An element SPEECH is used for each utterance by a character, containing a
    SPEAKER element for the name of the character, and as many LINE elements
    as there are lines of text.</para>

    <para>Excerpt:</para>

    <programlisting>&lt;PLAY&gt;
&lt;TITLE&gt;A Midsummer Night's Dream&lt;/TITLE&gt;
...
&lt;ACT&gt;&lt;TITLE&gt;ACT II&lt;/TITLE&gt;
 &lt;SCENE&gt;&lt;TITLE&gt;SCENE I.  A wood near Athens.&lt;/TITLE&gt;
  &lt;STAGEDIR&gt;Enter, from opposite sides, a Fairy, and PUCK&lt;/STAGEDIR&gt;
  &lt;SPEECH&gt;
    &lt;SPEAKER&gt;PUCK&lt;/SPEAKER&gt;
   &lt;LINE&gt;How now, spirit! whither wander you?&lt;/LINE&gt;
  &lt;/SPEECH&gt;

  &lt;SPEECH&gt;
    &lt;SPEAKER&gt;Fairy&lt;/SPEAKER&gt;
   &lt;LINE&gt;Over hill, over dale,&lt;/LINE&gt;
   &lt;LINE&gt;Thorough bush, thorough brier,&lt;/LINE&gt;
   &lt;LINE&gt;Over park, over pale,&lt;/LINE&gt;
   &lt;LINE&gt;Thorough flood, thorough fire,&lt;/LINE&gt;
   &lt;LINE&gt;I do wander everywhere,&lt;/LINE&gt;
   &lt;LINE&gt;Swifter than the moon's sphere;&lt;/LINE&gt;
...</programlisting>

    <note>
      <para>The way these sample documents are actually accessed is beyond the
      scope of this document and may depend on the particular XQuery
      implementation being used.</para>
    </note>
  </section>

  <section>
    <title>Basic syntax</title>

    <para>The fundamental full-text operator is denoted by the keyword
    <command>ftcontains</command>:</para>

    <programlisting>   <emphasis>domain</emphasis> ftcontains <emphasis>full-text-query</emphasis></programlisting>

    <para>On its right side <command>ftcontains</command> requires a full-text
    query ("full-text selection" in the specifications).</para>

    <para>On the left side, an expression specifies a <glossterm>search
    domain</glossterm>. It should yield a node or, more generally, a sequence
    of nodes.</para>

    <para>The <command>ftcontains</command> operator returns a boolean value:
    true if the full-text query is matched by at least one node in the
    left-side expression (search domain), false otherwise.</para>
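
    <para>Because the result is a plain boolean,
    <command>ftcontains</command> should be usable wherever XQuery accepts a
    boolean expression, not only inside a predicate. A minimal sketch, using
    the Shakespeare collection of this tutorial (the two message strings are
    of course arbitrary):</para>

    <programlisting>if (//PLAY/TITLE ftcontains "henry")
then "at least one play title mentions Henry"
else "no play title mentions Henry"</programlisting>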

    <para>In most cases, <command>ftcontains</command> is used as a
    <glossterm>predicate</glossterm> (between square brackets) following a
    <glossterm>path expression</glossterm>. Example (where the path expression
    is <literal>//PLAY</literal>):</para>

    <programlisting>//PLAY[ . ftcontains "juliet" ]</programlisting>

    <para>This query means: find elements PLAY which themselves (the dot is a
    shorthand for 'self') contain the word "Juliet".</para>

    <para>The search domain is frequently the dot expression '.' (meaning
    self), but it can be more specific. For example:</para>

    <programlisting>//PLAY[ TITLE ftcontains "Henry" ]</programlisting>

    <para>This query means: find elements PLAY whose child element TITLE
    contains the word "henry" (not case sensitive).</para>
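
    <para>The search domain may be any path expression relative to the
    context node. As an illustration (a sketch based only on the element
    names described earlier), the first query below selects the speeches
    whose SPEAKER element contains "puck"; the second one uses a descendant
    path to select the titles of the plays in which such a speaker
    appears:</para>

    <programlisting>//SPEECH[ SPEAKER ftcontains "puck" ]</programlisting>

    <programlisting>//PLAY[ .//SPEAKER ftcontains "puck" ]/TITLE</programlisting>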
  </section>

  <section>
    <title>Simple queries</title>

    <variablelist>
      <varlistentry>
        <term><emphasis role="bold">Find plays which contain the phrase "to be
        or not to be":</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution in XQFT:</emphasis></para>

          <programlisting>//PLAY[ . ftcontains "to be or not to be" ]</programlisting>

          <para>or equivalently:</para>

          <programlisting>//PLAY[ . ftcontains "to be or not to be" phrase ]</programlisting>

          <programlisting>//PLAY[ . ftcontains { "to be", "or", "not to be" } phrase ]</programlisting>

          <para><emphasis role="bold">Results:</emphasis></para>

          <programlisting>&lt;PLAY&gt;&lt;TITLE&gt;The Tragedy of Hamlet, Prince of Denmark&lt;/TITLE&gt;
...
&lt;/PLAY&gt;</programlisting>

          <para><emphasis role="bold">Notes:</emphasis></para>

          <itemizedlist>
            <listitem>
              <para>A sequence of words (like "to be or not to be") without
              other specification is a phrase. It is matched by the same
              sequence of words, in order and without interspersed
              words.</para>
            </listitem>

            <listitem>
              <para>A phrase, like any other full-text query, can span XML
              elements. For example:</para>

              <programlisting> //SPEECH[ . ftcontains  "discontent made glorious" ]</programlisting>

              <para>This query would return the following result:</para>

              <programlisting>&lt;SPEECH&gt;&lt;SPEAKER&gt;GLOUCESTER&lt;/SPEAKER&gt;
&lt;LINE&gt;Now is the winter of our <emphasis role="bold">discontent</emphasis>&lt;/LINE&gt;
&lt;LINE&gt;<emphasis role="bold">Made glorious</emphasis> summer by this sun of York;&lt;/LINE&gt;
...</programlisting>
            </listitem>
          </itemizedlist>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain both
        words "romeo" and "Juliet":</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "romeo juliet" all words ]</programlisting>

          <para>or equivalently:</para>

          <programlisting>//LINE[ . ftcontains { "romeo", "juliet" } all words ]</programlisting>

          <programlisting>//LINE[ . ftcontains "romeo" ftand "juliet" ]</programlisting>

          <para>Notice the keyword <literal>ftand</literal>, which is used
          instead of <literal>and</literal> to avoid syntactic ambiguity with
          the plain XQuery operator.</para>

          <para><emphasis role="bold">Results:</emphasis></para>

          <programlisting>&lt;LINE&gt;Is father, mother, Tybalt, Romeo, Juliet,&lt;/LINE&gt;</programlisting>

          <programlisting>&lt;LINE&gt;And Romeo dead; and Juliet, dead before,&lt;/LINE&gt;</programlisting>

          <programlisting>&lt;LINE&gt;Romeo, there dead, was husband to that Juliet;&lt;/LINE&gt;</programlisting>

          <programlisting>&lt;LINE&gt;Than this of Juliet and her Romeo.&lt;/LINE&gt;
</programlisting>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain both
        words "romeo" and "Juliet" in this order:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "romeo juliet" all words ordered ]</programlisting>

          <para><emphasis role="bold">Results:</emphasis></para>

          <programlisting>&lt;LINE&gt;Is father, mother, Tybalt, Romeo, Juliet,&lt;/LINE&gt;</programlisting>

          <programlisting>&lt;LINE&gt;And Romeo dead; and Juliet, dead before,&lt;/LINE&gt;</programlisting>

          <programlisting>&lt;LINE&gt;Romeo, there dead, was husband to that Juliet;&lt;/LINE&gt;</programlisting>

          <para>Notice that the fourth result of the previous query does not
          match this query because the words "romeo" and "Juliet" are not in
          the required order.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain word
        "romeo" or word "Juliet" or both:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "romeo juliet" any word ]</programlisting>

          <para>or equivalently:</para>

          <programlisting>//LINE[ . ftcontains { "romeo", "juliet" } any word ]</programlisting>

          <programlisting>//LINE[ . ftcontains "romeo" ftor "juliet" ]</programlisting>

          <para>Notice the keyword <literal>ftor</literal>, which is used
          instead of <literal>or</literal> to avoid syntactic ambiguity with
          the plain XQuery operator.</para>

          <para><emphasis role="bold">Results: 165
          occurrences.</emphasis></para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain word
        "romeo" but not word "Juliet":</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "romeo" ftand ftnot "juliet" ]</programlisting>

          <para><emphasis role="bold">Results: 116
          occurrences.</emphasis></para>
        </listitem>
      </varlistentry>
    </variablelist>
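
    <para>The operators <literal>ftand</literal>, <literal>ftor</literal> and
    <literal>ftnot</literal> can be combined, with parentheses for grouping.
    As we read the grammar of the specification, <literal>ftnot</literal>
    binds most tightly, then <literal>ftand</literal>, then
    <literal>ftor</literal>, so the parentheses below are required. This
    sketch (the choice of "tybalt" is arbitrary) looks for lines that mention
    Romeo or Juliet but not Tybalt:</para>

    <programlisting>//LINE[ . ftcontains ("romeo" ftor "juliet") ftand ftnot "tybalt" ]</programlisting>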
  </section>

  <section>
    <title>Scoring</title>

    <para>Getting scores from an XQuery full-text query is done through an
    extension of the FLWOR (aka Flower) loop:</para>

    <programlisting> for $hit <emphasis role="bold">score</emphasis> $score in //SPEECH[ . ftcontains "king" ]
    order by $score descending
 return $hit</programlisting>

    <para>The keyword <emphasis role="bold">score</emphasis> introduces a
    variable that receives the score value. This value is guaranteed to be
    between 0 and 1, and of course a higher value means a more relevant
    hit.</para>

    <para>The loop above is a typical way of obtaining the results of the
    query sorted by decreasing score.</para>
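
    <para>A common variation is to keep only the best hits and to expose the
    score in the output. The sketch below assumes nothing beyond standard
    XQuery (<literal>subsequence</literal> and element constructors); the
    <literal>hit</literal> element and its <literal>score</literal> attribute
    are names invented for this illustration, and the limit of 10 is
    arbitrary:</para>

    <programlisting>let $ranked :=
  for $hit score $score in //SPEECH[ . ftcontains "king" ]
  order by $score descending
  return &lt;hit score="{ $score }"&gt;{ $hit/SPEAKER }&lt;/hit&gt;
return subsequence($ranked, 1, 10)</programlisting>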

    <para>How are scores computed? The answer from the W3C standard is: this
    is "<glossterm>implementation-dependent</glossterm>". It is very likely,
    however, that an actual implementation will take into account the
    frequencies of query terms in the queried collection.</para>

    <para>If several <emphasis role="bold">ftcontains</emphasis> appear in the
    expression after the <emphasis role="bold">in</emphasis> keyword, which
    one is used to compute the scores? Again the standard says
    "<glossterm>implementation-dependent</glossterm>". It might be the average,
    the maximum or the first of the score values for each <emphasis
    role="bold">ftcontains</emphasis>.</para>
  </section>

  <section>
    <title>Case sensitivity and other matching options</title>

    <para>By default, the letter case is not taken into account for full-text
    search. But XQFT provides several matching options: case sensitivity,
    diacritics sensitivity (accents), stemming and wildcards:</para>

    <variablelist>
      <varlistentry>
        <term><emphasis role="bold">Find plays which contain the phrase "the
        King" with this case:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//PLAY[ . ftcontains "the King" case sensitive ]</programlisting>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain the word
        "Orléans" with its accent:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "Orléans" diacritics sensitive]</programlisting>

          <para>Note: this query returns no result as "Orléans" is written
          without e acute in Shakespeare's plays.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain words
        matching the pattern ".+let":</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains ".+let" with wildcards ]</programlisting>

          <para><emphasis role="bold">Results: 193, matching Hamlet, Capulet,
          goblet, doublet etc.</emphasis></para>

          <para>The pattern is a regular expression with limited syntax.
          Essentially, the dot matches any character and can be followed by
          occurrence indicators '?', '*', and '+' respectively meaning
          "optional", "any number" and "at least one".</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Find LINE elements which contain word
        "hammer" or related words by stemming:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "hammer" with stemming language "en" ]</programlisting>

          <para>This query would find lines with words "hammer", "hammers",
          "hammered", etc., depending on the capabilities of the stemmer in
          use.</para>

          <para>Reminder: "<glossterm>stemming</glossterm>" means reducing a
          word to a "stem" or root, and replacing it with an OR of all the
          words that have the same stem. This is a language-dependent
          capability, which is why the keyword <literal>language</literal> has
          to be used jointly.</para>
        </listitem>
      </varlistentry>
    </variablelist>
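
    <para>Several matching options can follow the same query. As a sketch
    (not a tested result), the query below combines a wildcard pattern with
    case sensitivity: it should match words starting with a lower-case "ham",
    such as "hammer", while skipping "Hamlet" because of its capital
    letter:</para>

    <programlisting>//LINE[ . ftcontains "ham.*" with wildcards case sensitive ]</programlisting>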
  </section>

  <section>
    <title>More advanced features</title>

    <para>XQFT has a few more advanced features that we will just mention
    here:</para>

    <variablelist>
      <varlistentry>
        <term><emphasis role="bold">Occurrence counting: find SPEECH elements
        where the word "love" appears 7 times or more:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//SPEECH[ . ftcontains "love" occurs at least 7 times ]</programlisting>

          <para>8 results.</para>

          <para>There are other occurrence count options: "at most 7 times",
          "exactly 7 times", "from 6 to 8 times".</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Proximity: find LINE elements which
        contain word1 "near" word2:</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "love tender" all words distance at most 8 words ]</programlisting>

          <para>This means that words "love" and "tender" may appear in any
          order, and the number of words from the first to the last (included)
          must not exceed 8 words.</para>

          <para>Proximity can also be expressed with the "window"
          option:</para>

          <programlisting>//LINE[ . ftcontains "love tender" all words window 8 words ]</programlisting>

          <para>Like "times", the "distance" option has four variants: "at
          most", shown above, and the three below:</para>

          <programlisting>//LINE[ . ftcontains "love tender" all words distance at least 8 words ]</programlisting>

          <programlisting>//LINE[ . ftcontains "love tender" all words distance exactly 8 words ]</programlisting>

          <programlisting>//LINE[ . ftcontains "love tender" all words distance from 7 to 9 words ]</programlisting>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Anchoring: find LINE elements
        <firstterm>starting</firstterm> with phrase "to be":</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "to be" at start ]</programlisting>

          <para>This means that the first two words inside the matched LINE
          elements are "to" and "be".</para>

          <para>It is also possible to specify "at end", or "entire content"
          which means both "at start" and "at end".</para>

          <programlisting>//LINE[ . ftcontains "to be" at end ]</programlisting>

          <programlisting>//LINE[ . ftcontains "to be" entire content ]</programlisting>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><emphasis role="bold">Mild-not: find LINE elements containing
        the word "king" but not inside the phrase "king of
        France":</emphasis></term>

        <listitem>
          <para><emphasis role="bold">Solution:</emphasis></para>

          <programlisting>//LINE[ . ftcontains "king" not in "king of France"]</programlisting>

          <para>Note: this query does not reject every LINE that contains the
          phrase "king of France". If the word "king" also appears outside of
          this phrase, the LINE matches, for example this one:</para>

          <programlisting>    &lt;LINE&gt;No <emphasis role="bold">king</emphasis> of England, if not king of France.&lt;/LINE&gt;</programlisting>
        </listitem>
      </varlistentry>
    </variablelist>
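
    <para>These options can be combined with one another and with the simple
    queries seen earlier. Two sketches, with arbitrary numbers: the first
    uses the range form of occurrence counting, the second requires both
    words to appear in the given order within a span of 5 words:</para>

    <programlisting>//SPEECH[ . ftcontains "love" occurs from 3 to 5 times ]</programlisting>

    <programlisting>//LINE[ . ftcontains "love tender" all words ordered window 5 words ]</programlisting>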

    <bridgehead>Other features</bridgehead>

    <para>There are still other features (thesaurus, stop-words, scope,
    ignored content) that the keen student will find in the W3C
    specifications.</para>
  </section>

  <section>
    <title>Conclusion</title>

    <para>The intent of this tutorial is to provide an idea of the power of
    XQuery Full-Text:</para>

    <itemizedlist>
      <listitem>
        <para>XQFT offers a rich set of features; it is probably more
        comprehensive than the query language of most existing full-text
        systems.</para>
      </listitem>

      <listitem>
        <para>It is well integrated, fully composable with the plain XQuery
        language.</para>
      </listitem>

      <listitem>
        <para>It allows querying nodes, not only complete documents, and
        allows using any element as a search context.</para>
      </listitem>

      <listitem>
        <para>On the minus side, its syntax is a bit verbose and redundant,
        not very elegant. A number of features are "implementation-defined",
        which is not optimal for interoperability.</para>
      </listitem>
    </itemizedlist>
  </section>
</article>
