Indexing External Data in Umbraco

Umbraco ships with Examine for building fast, searchable indexes of your site content. Examine is built on Lucene.Net, the .NET port of the Apache Lucene indexing and search engine, and provides full-text search options that you can add to any Umbraco website.

One little-used feature of Examine is the ability to extend it to index and search data that is stored outside of the Umbraco CMS. This can be very useful if you want to include content from existing external sources such as a separate SQL database or a web service.

This is a quick example of how to index data from a custom database table using Examine. First, set up a simple library method to fetch data from your data source. I won't go into too much detail, but assuming you have a database with a table of 'Things', you could use Entity Framework and write a simple repository like so:

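using System.Collections.Generic;
using System.Linq;

public class ThingRepository
{
    private readonly ThingDbContext _dbContext;

    public ThingRepository()
    {
        this._dbContext = new ThingDbContext();
    }

    // Run the query and return the rows as a plain list.
    public List<Thing> GetAllTheThings()
    {
        return this._dbContext.Things.ToList();
    }
}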

Next, implement the Examine ISimpleDataService interface to create a custom data indexer. This allows you to build a custom Examine index from any source you like. All you have to do is write code to retrieve the data and then convert it into Examine SimpleDataSet objects.

using System.Collections.Generic;
using Examine;
using Examine.LuceneEngine;

public class CustomDataService : ISimpleDataService
{
    public IEnumerable<SimpleDataSet> GetAllData(string indexType)
    {
        // Collect one SimpleDataSet per row to be indexed.
        var data = new List<SimpleDataSet>();

        // Fetch the rows from the external data source.
        List<Thing> things = new ThingRepository().GetAllTheThings();

        foreach (Thing thing in things)
        {
            data.Add(new SimpleDataSet()
            {
                // NodeId must be unique within the index.
                NodeDefinition = new IndexedNode()
                {
                    NodeId = thing.Id,
                    Type = "content"
                },
                // Field names must match those declared in the index set.
                RowData = new Dictionary<string, string>()
                {
                    {"title", thing.Title},
                    {"contents", thing.Contents}
                }
            });
        }

        return data;
    }
}

The entries need to have unique IDs across the index. This is easily done when you are talking to a database, since you will probably already have some form of unique key on your data.

For other sources, if a suitable unique identifier is not already available, you might need to be more creative. You also have scope here to combine multiple data sources if you wish, or to transform the data in some way prior to indexing.
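When several sources feed one index, one way to keep IDs unique is to give each source its own numeric range. The following is a minimal sketch, assuming integer keys and fewer than 100 million rows per source. IndexIdBuilder, SourceBlockSize, MaxSourceNumber, and the idea of numbering the sources are hypothetical names and choices made for illustration, not part of Examine.

```csharp
using System;

public static class IndexIdBuilder
{
    // Hypothetical choice: each source owns a block of 100 million IDs.
    public const int SourceBlockSize = 100000000;

    // Highest source number whose whole block still fits in a 32-bit int.
    public const int MaxSourceNumber = 20;

    // Combines a source number and that source's own key into one index-wide ID.
    public static int Build(int sourceNumber, int localId)
    {
        if (sourceNumber < 0 || sourceNumber > MaxSourceNumber)
            throw new ArgumentOutOfRangeException("sourceNumber");
        if (localId < 0 || localId >= SourceBlockSize)
            throw new ArgumentOutOfRangeException("localId");

        return sourceNumber * SourceBlockSize + localId;
    }
}
```

With this scheme, row 42 from source 0 keeps the ID 42, while row 42 from source 1 becomes 100000042, so the two can sit in the same index without colliding.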

With that in place, tell Examine about the new index and its fields in ExamineIndex.config:

<IndexSet SetName="CustomIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/CustomIndex">
    <IndexUserFields>
        <add Name="title" />
        <add Name="contents" />
    </IndexUserFields>
</IndexSet>

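Then register the custom data service as an index provider in ExamineSettings.config, inside the providers list of the ExamineIndexProviders section. The SimpleDataIndexer also expects an indexTypes attribute; each type listed there is passed to GetAllData as the indexType argument:

<add name="CustomIndexer" type="Examine.LuceneEngine.Providers.SimpleDataIndexer, Examine"
    dataService="myNamespace.CustomDataService, myNamespace"
    indexTypes="content"
    indexSet="CustomIndexSet" runAsync="true" />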

One drawback of this setup is that the index is not rebuilt automatically. A rebuild has to be triggered manually from the Umbraco admin, or by some other mechanism that you implement yourself.

A solution that works quite well is to create a static page or custom MVC route that performs the index rebuild, and have the task scheduler in Umbraco hit this URL at a predetermined interval.

If this is developed as a static aspx page, make sure you add it to the list of reserved URLs (the umbracoReservedUrls app setting) in web.config. You will also want to secure this page in some way to prevent someone forcing an index rebuild whenever they feel like it. Since the Umbraco scheduled task works off a URL only, you can put a GUID in the query string and have your page code check it, so that only the scheduled task can actually trigger the index rebuild.

In umbracoSettings.config:

<scheduledTasks>
    <!-- add tasks that should be called with an interval (seconds) -->
    <task log="true" alias="examineReindex" interval="900" url="http://localhost/examine.aspx?key=SOME_GUID"/>
</scheduledTasks>
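The key check described above can be sketched as a small helper that the rebuild page or controller action calls. This is a minimal sketch rather than a definitive implementation: ReindexGuard, TryRebuild, and the rebuild delegate are hypothetical names, and the expected key is passed in so that it can come from configuration instead of being hard-coded.

```csharp
using System;

public static class ReindexGuard
{
    // Runs the rebuild only when the key from the query string matches the
    // expected one. Returns true if the rebuild was triggered.
    public static bool TryRebuild(string suppliedKey, string expectedKey, Action rebuild)
    {
        if (rebuild == null) throw new ArgumentNullException("rebuild");

        // Reject missing keys outright, so an unset configuration value
        // can never match an empty query string.
        if (string.IsNullOrEmpty(suppliedKey) || string.IsNullOrEmpty(expectedKey))
            return false;

        if (!string.Equals(suppliedKey, expectedKey, StringComparison.Ordinal))
            return false;

        rebuild();
        return true;
    }
}
```

In the page or action, suppliedKey would be Request.QueryString["key"], and rebuild would typically wrap ExamineManager.Instance.IndexProviderCollection["CustomIndexer"].RebuildIndex(), using the indexer name registered in ExamineSettings.config. Returning a 404 or 403 when TryRebuild returns false avoids advertising what the URL does.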

Now the data can be retrieved from the index as normal and searched within your Umbraco site as you would any other indexed content. For more information on actually using Examine for search, check out the documentation.