Caching is a time-tested technique for improving performance in information retrieval systems. In EII systems, caching delivers a dual performance punch: it improves query execution times whenever a cache hit occurs, and it reduces the load on the operational data sources on the back end.
EII platforms provide access to distributed data by defining views over remote data sources. These views integrate data from multiple sources and make it available to applications through standard interfaces. EII implementations offer a range of caching features.
A basic level of view caching is analogous to the materialized view facility provided by some relational databases. A view in an EII system is very similar to a view in a relational database: it is metadata that defines how the data of interest can be accessed. Being just metadata, a view occupies little storage space, and it is evaluated only when queries written over it are executed. One way to improve the performance of queries on such views is to evaluate the views in advance and store the results in local tables. Queries over the views can then be evaluated against these local “materialized” tables. This type of caching lends itself to situations where:
- View evaluations are computationally expensive
- Result data sets are finite
- Data sources do not support set/batch operations (some web services for instance)
- Load on backend data sources is only allowed on a schedule
This kind of view caching is governed by a caching policy that specifies when the cache is refreshed, how the cached results are stored, which indexes are built on the cached data, the maximum size of the cache, and any synchronization mechanisms used for cache coherence. The policy has to be designed not only to provide fast, timely responses from the EII platform, but also to reduce the load on backend data sources. In essence, this is a lightweight ETL operation.
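To make the policy decisions above concrete, here is a minimal sketch in Python of a fully materialized view cache. It is an illustration under stated assumptions, not how any particular EII product implements caching: the class name `MaterializedViewCache`, the `evaluate_view` callable standing in for the expensive federated view evaluation, the three-column schema, and the interval-based refresh are all hypothetical. SQLite serves as the local store only because it ships with Python.

```python
import sqlite3
import time
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical view schema: (id, region, amount).
Row = Tuple[int, str, float]


class MaterializedViewCache:
    """Evaluates a view in advance and answers queries from a local table."""

    def __init__(
        self,
        evaluate_view: Callable[[], Iterable[Row]],  # assumed: expensive remote evaluation
        refresh_interval_s: float,                   # policy: when to refresh
        max_rows: int,                               # policy: maximum cache size
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._evaluate_view = evaluate_view
        self._refresh_interval_s = refresh_interval_s
        self._max_rows = max_rows
        self._clock = clock
        self._last_refresh: Optional[float] = None
        # Policy: how to store the cached results (a local relational table).
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE view_cache (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
        )
        # Policy: which indexes to build on the cached data.
        self._db.execute("CREATE INDEX idx_region ON view_cache (region)")

    def refresh(self) -> None:
        """Re-evaluate the view against the sources and replace the local copy."""
        rows = list(self._evaluate_view())
        if len(rows) > self._max_rows:
            raise ValueError("view result exceeds the configured cache size")
        with self._db:  # delete and reload commit together as one transaction
            self._db.execute("DELETE FROM view_cache")
            self._db.executemany("INSERT INTO view_cache VALUES (?, ?, ?)", rows)
        self._last_refresh = self._clock()

    def _refresh_if_due(self) -> None:
        now = self._clock()
        if (
            self._last_refresh is None
            or now - self._last_refresh >= self._refresh_interval_s
        ):
            self.refresh()

    def query_by_region(self, region: str) -> List[Row]:
        """Answer a query over the view from the local table only."""
        self._refresh_if_due()
        cursor = self._db.execute(
            "SELECT id, region, amount FROM view_cache WHERE region = ? ORDER BY id",
            (region,),
        )
        return cursor.fetchall()
```

Between refreshes, every query is answered locally, so the backend sources see load only when a refresh is due. A production policy would more likely refresh on a fixed schedule or in response to change notifications from the sources rather than lazily on the first query after the interval expires.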
Other approaches to view caching build out the cache incrementally. The query engine determines whether a given query results in a cache hit or a miss by means of additional data that describes the content of the cache at a given time. On a hit, the query is serviced by the cache. On a miss, the query is evaluated by distributing it across the remote data sources in question, and the results are used both to answer the query and to augment the cache. In this manner, over time, the cache becomes sufficiently populated to raise the hit/miss ratio to an acceptable level. This kind of view caching is suitable for scenarios where:
- The entire data set being represented by the view is extremely large
- Query patterns repeat, so that caching a relatively small portion of the data set speeds up a large share of queries
- It is acceptable to have a warm up period while populating a cache
- It is acceptable for some queries to take longer (on a cache miss)
- Some unpredictability of load on source systems can be tolerated
Caching policy decisions here include the cache invalidation policy, the predicates on which the cache is segmented, and the means of indexing the cache.
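The incremental approach can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not a description of any vendor's query engine: the class `IncrementalViewCache`, the `fetch_segment` callable standing in for a distributed query against the remote sources, segmentation on a single `region` predicate, and time-to-live invalidation are all hypothetical choices. The dictionary of loaded segments plays the role of the additional data that describes the cache content.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical view schema: (id, region, amount).
Row = Tuple[int, str, float]


class IncrementalViewCache:
    """Populates the cache one segment at a time, as queries miss."""

    def __init__(
        self,
        fetch_segment: Callable[[str], List[Row]],  # assumed: query sent to remote sources
        ttl_s: float,                               # invalidation policy: time to live
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_segment = fetch_segment
        self._ttl_s = ttl_s
        self._clock = clock
        # Cache descriptor: segment predicate value -> (load time, rows).
        self._segments: Dict[str, Tuple[float, List[Row]]] = {}
        self.hits = 0
        self.misses = 0

    def query(self, region: str) -> List[Row]:
        now = self._clock()
        entry = self._segments.get(region)
        if entry is not None and now - entry[0] < self._ttl_s:
            # Hit: the segment is cached and still valid.
            self.hits += 1
            return entry[1]
        # Miss: evaluate against the sources, answer the query,
        # and augment the cache with the result.
        self.misses += 1
        rows = self._fetch_segment(region)
        self._segments[region] = (now, rows)
        return rows

    def hit_ratio(self) -> float:
        """Fraction of queries served from the cache so far."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The first query for each segment always misses, which is the warm-up period noted above. Because each miss goes to the source systems, their load depends on the query mix rather than on a schedule.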
Most EII vendors either deliver a built-in database or can leverage an external database for caching views. Ipedo XIP, for instance, ships with both a full-featured relational database and a full-featured native XML database as part of the EII platform. This is important because caching a large amount of data serves little purpose if the data cannot then be served quickly enough from the cache.