Autocomplete Data Structures


When impelementing autocomplete functionality on your website, for fast and efficient lookup of user-entered text, you have a few options. If the amount of data to be used in your autocomplete implementation is not too large, you can send it all at once to the front-end and cache it in the browser. But many times, this may not be possible (e.g. if you have thousands or even millions of rows of text that the user can search on, such as sites like Google or Amazon).

In cases like these, you will need to pull the data from the server. You can pull the data from the database each time, but performance would not be very good. A better solution would be to cache the data on the server in some kind of data structure that gives optimal lookup performance. So as the user is typing, the round-trip to the server via an AJAX call would bring back the data quickly and efficiently.
One such possible structure is the Trie or Prefix Tree. This structure is described in Wikipedia as:
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
A more specific implementation of the Trie structure is the Radix Tree, Patricia trie/tree, or crit bit tree. The Radix Tree is a data structure that provides for fast searching of prefixes. A nice implementation of this structure was created by Tahseen Ur Rehman and can be found here: http://code.google.com/p/radixtree/.

In cases like these, you will need to pull the data from the server. You can pull the data from the database each time, but performance for this would not be good. A better solution would be to cache the data on the server in some kind of data structure that gives optimal lookup performance. So as the user is typing, the round-trip to the server via an AJAX call would bring back the data quickly and efficiently.

One such possible structure is the Trie or Prefix Tree. This structure is described in Wikipedia as:

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

A more specific implementation of the Trie structure is the Radix Tree, Patricia trie/tree, or crit bit tree. The Radix Tree is a data structure that provides for fast and efficient searching of prefixes. A good implementation of this structure was created by Tahseen Ur Rehman and can be found here: http://code.google.com/p/radixtree/. The nice thing about this implementation is that you can store all of your search text as keys and any other type of object as the values associated with those keys.

For example, suppose you have an online store and want to allow the user to search the entire list of products that you sell. You can cache the list of product names along with some detailed product information into your Radix Tree on the server side. You might also want to load the product info on application startup and periodically refresh the content.

Below is an implementation of a Factory class that is used to retrieve our Radix tree. It is synchronized to handle refreshing the tree by a separate thread running in Spring.


import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import ds.tree.RadixTree;
import ds.tree.RadixTreeImpl;

/**
 * Factory class for creating the tree data structure 
 * used for autocomplete lookup of products.
 *
 */ 

 public class ProductTreeFactory {
    private static Log log = LogFactory.getLog(ProductTreeFactory.class);

    private RadixTree<Product> productTree = new RadixTreeImpl<Product>();
    private ProductLookupDao productLookupDao;

   /**
    * Sets the ProductLookupDao.
    *
    * @param productLookupDao The DAO
    */
   public void setProductLookupDao(ProductLookupDao productLookupDao) {
      this.productLookupDao = productLookupDao;
   }

   /**
    * Returns the tree object containing products.
    *
    * @return The product tree
    */
   public synchronized RadixTree<Product> getProductTree() {
      return productTree;
   }

   /**
    * Loads the tree with product data from the database.
    */
   public synchronized void loadTree() {
      log.info("Loading product data into tree");

      productTree = new RadixTreeImpl<Product>();
      List<Product> products = productLookupDao.findAllProducts();

      for (Product product : products) {
         productTree.insert(product.getProductName(), product);
      }
   }
}

The Spring config below uses the quartz scheduler to refresh the tree once an hour.

    <!-- Scheduled Jobs -->
    <bean id="reloadTreeJob"
                    class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
               <property name="targetObject" ref="productTreeFactory" />
               <property name="targetMethod" value="loadTree" />
    </bean>

    <bean id="reloadTreeTrigger" class="org.springframework.scheduling.quartz.SimpleTriggerBean">
       <!-- see the example of method invoking job above -->
       <property name="jobDetail" ref="reloadTreeJob" />
        <!-- repeat every hour -->
        <property name="repeatInterval" value="3600000" />
    </bean>

    <!-- Quartz Scheduler -->
    <bean class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
           <property name="triggers">
                   <list>
                           <ref bean="reloadTreeTrigger" />
                   </list>
           </property>
    </bean>

    <!-- Utility Classes -->
    <bean id="productTreeFactory" class="myapp.ProductTreeFactory" init-method="loadTree">
            <property name="productLookupDao" ref="productLookupDao"/>
    </bean>

 

Some other options that I found for caching data on the server are using a simple TreeSet or an embedded database such as HSQLDB or the H2 database. However, these options are not as fast or efficient as the Trie structure.

Update: I ran some timings on different structures for autocomplete and found a surprising result.

  1. #1 by Coolepe on May 20, 2009 - 4:30 pm

    I am using an ordered ArrayList to cache my autocompleter data (+30 000 records).
    I perform a binary search on this list (Collections.binarySearch(list,prefix)) to find the position of the prefix.
    If the prefix is not in the list (most of the time)
    -> position= -1 – position;
    The next X items that start with the prefix are sent to the autocompleter.
    This is really fast. I am going to try the Radix tree, and compare the performance.

    • #2 by stevedaskam on May 21, 2009 - 2:44 am

      That’s awesome. Let me know what the results of your tests are.

      Thanks,
      Steve

      • #3 by Coolepe on May 22, 2009 - 2:55 pm

        Hi Steve,
        Love your articles!
        For what it’s worth:
        -Dataset of 22000 items (actually product names)
        -I performed 50 tests with random prefixes
        -Binary search algorithm on average: 82307 nanoseconds
        -With the Radix tree implementation (http://code.google.com/p/radixtree/) on average: 368855 nanoseconds
        So the Radix tree algorithm is 4,5 times slower in my humble test cases.

    • #4 by Saket Agrawal on December 13, 2010 - 5:47 am

      I am assuming Collections.binarySearch(list,prefix) will use a comparator to check if a particular list starts with given prefix. In this case multiple item of list will be consider to be equal to given prefix. Collections.binarySearch doesn’t guarantee which one will be found. That means index returned may be of last matching element. How did you solve this problem.

      • #5 by Saket Agrawal on December 14, 2010 - 12:01 am

        I guess i didn’t followed

        -> position= -1 – position;

        completely, which I do now :-).

        Ignore my question. Nice post !!!!

  2. #6 by stevedaskam on May 22, 2009 - 11:28 pm

    Coolepe,

    Thanks for the results!! That’s very interesting. I will have to take the binary search algorithm for a test-drive.

  3. #7 by jash on February 9, 2011 - 9:56 am

    Ur post was very helpful to me . Thanks

  4. #8 by nicolaidiethelm on May 22, 2012 - 4:15 am

    Thank you for your interesting post about autocomplete data structures!
    If the set of autocomplete suggestions is rank-ordered, one should consider using a SuggestTree. For any given prefix, it provides fast access to the top k suggestions that start with that prefix.
    http://suggesttree.sourceforge.net/

Leave a reply to Saket Agrawal Cancel reply