Mar 04 2011

Indexed aggregates

Category: Uncategorizedzvolkov @ 3:40 pm

Everybody is familiar with indexes, used by RDBMSes to speed up filters and joins. But what about aggregates? Normally, aggregating values requires a full scan though the data. Let’s take, for example, AVERAGE. In order to calculate average price for the entire table, we must do a complete index scan. At least, it’s a single value, so we may easily cache it if needed to increase the performance. Now, imagine we need to aggregate a subset of data from the very first date in the table to an arbitrary user-selected date. This time we must do a partial index scan, and we must cache a value per each date…

Now, most of the time we’re simply adding the same values again and again, only the stop date differs. Wouldn’t it be nice if there was a way to reuse the aggregated data, so that subsequent queries could somehow benefit from the previous queries?

It turns out there is such a way. I learned it when reading the CouchDB guide. CouchDB employs a cool pattern that can be used in a multitude of similar scenarios. For those familiar with CouchDB, I’m talking about the persisted B-tree index of map/reduce results. For others, here is the idea:

The idea is to precalculate the aggregated data and store it at different levels of the B-tree index. The index can then be used to efficiently query the aggregate without having to reaggregate all the data all the time. Then, if any leaf-level value changes, only the ascending path through the tree has to get recalculated.

In our average price example, the index could store the SUM and the COUNT of items at day, month, and year levels. Then, if anybody wants to query average price year-to-date all you had to do is sum up all the SUMs and COUNTs for all the full months since year start, plus all the days available for the last month, then divide total SUM by total COUNT. If a past price has to change, the change has to propagate through the index, but only corresponding day’s and month’s and year’s values have to be updated, and even then the values for other days and other months within the year can be reused for the calculation!

Not only simple calculations can be indexed this way. More complex, lossy calculations like Standard Deviation can be composed from partitioned data too. Here is an example of how Standard Deviation can be calculated on chunks of data, and then combined to form the final value:

--geting row data
;with q1 as (select TradeDate, TransactionValue from Transactions where AccountID = 64337 and TransactionValue is not null),
--splitting the date into Year / Month / Date
q2 as (select day(TradeDate) as d, month(TradeDate) as m, year(TradeDate) as y, TransactionValue from q1),
--calculating sum, sum of squares, and count for each DAY
q3 as (select d, m, y, sum(TransactionValue) as T, sum(TransactionValue*TransactionValue) as S, count(*) as C from q2 group by d, m, y),
--combining days into months
q4 as (select m, y, sum(T) as T, sum(S) as S, sum(C) as C from q3 group by m, y),
--combining months into years
q5 as (select y, sum(T) as T, sum(S) as S, sum(C) as C from q4 group by y),
--combining years into single value
q6 as (select sum(T) as T, sum(S) as S, sum(C) as C from q5),
--calculating the mean
q7 as (select T, S, C, T/C as M from q6),
--calculating the variance
q8 as (select (S - T*M)/(C-1) as Variance from q7)
--calculating the std deviation
select SQRT(Variance) from q8