Open Source aggregation engine with slice & dice web interface
Alexandru Toth
alxtoth@users.sourceforge.net
Overview
Aggregation engine with scalable SQL backend database
Slice & Dice web interface with hierarchy menus
Understands basic MDX syntax
Can be embedded as a frame into web interfaces
Open Source, GPL v2 license
Architecture
Front-end: Apache / Lighttpd /squid
SSL, proxy, digest authentication
Web servers are good in protecting against buffer overflows and other things you
don't want to know about.
Middle tier:
OLAP engine written in Python
presentation in CherryPy (Python)
Can be embedded as a frame into web interfaces
Back-end:
mySQL, PostgreSQL, SQLite (queries are simple, porting to other databases shouldn't be big issue.
Just make sure your favourite database supports partitioning)
memcached for storing cached cells, dimension tables, metadata
Authentication and Authorization
Secure authentication is HARD. Delegate and configure it in the webserver instead of coding it in Python
CherryPy receives authenticated user from webserver
Access restrictions on each dimension can be configured for users
Ex: User "Alex" only sees year "2002" and region "Europe" (and hierarchy members bellow)
Authentication is optional. Without it, all users are "anonymous", and what they see is configurable
Why Python and memcached?
Python is very high-level language: Cubulus v0.45 (slice&dice, MDX parser, database failover) is less than 3000 lines
When fact table has millions of rows, runtime of Python becomes neglectible compared to database runtime
Memcached runtime is neglectible to Python runtime
Why not use a Python hashtable instead of memcached? That would require locking and coding expiration of old items. Memcached already does these things.
If it's memory consumption gets too big, just restart memcached. (Python will reconnect and may repeat some queries).
Ideal in shared hosting environment, where CherryPy runs in session-less mode, with constant memory footprint
Database Scalability & Failover
Scale up: with partitioned fact table in MySQL 5.1 on a multi-CPU / multi-core computer,
queries over partitions are executed in parallel
Scale-out & failover: queries can be distributed to multiple database servers.
If all servers are available, queries are executed in parallel.
If one database is unavailable, queries are re-executed in the available db servers
Maximum degree of parallelism: each cell can be queried in a different CPU
(Theoretical speculation; bring me hardware to test it)
Caching
If one cell was already computed, don't query for it again
Database might not understand what to cache.
SQL analogy:
first issue:
SELECT sum(sales) GROUP BY month WHERE continent="Europe"
then issue:
SELECT sum(sales) GROUP BY continent WHERE month="January"
... missing and continent != "Europe"
For OLAP engine it is easy to store a cell for "Europe"+"January", and generate no SQL for cached cells
Predictable performance
If all cells have similar SQL statements, one can measure response time in one CPU and add hardware until response time is acceptable
Assumption: SUM is so slow that it is more likely users run of patience before computer runs of memory. Consequence: data fits in memory.
You WILL want more CPUs before you run out of memory
Unpredictable data
In practice data will be "L-shaped", "sparse", "non-uniform"
Analytic queries are unpredictable
Don't bother computing summary tables and indexes nobody will use; it's enough to cache recent queries
A report with 10 rows and 10 columns starts looking "embarassingly parallel"
Hardware keeps getting cheaper
MDX language
Cubulus is now able to parse a small subset of MDX. This allows for easier integration on HTML level. For example MDX statement:
Select {[time].[all time].children} on rows, {[region].[all region].children} on columns from alxtoth
At the moment there is no support for nested dimensions, formulas, crossjoin etc.
HTML integration
MDX statement in previous slide can be displayed in a frame with URL:
Instead of transforming hierarchical queries into star- or snowflake-schema with lots of joins, transform the dimensions such that hierarchy indexes are sorted. This way all joins become range queries
"Hierarchical range-clustering of keys" comes from the article: V. Markl, F.Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering
No index, no joins
The "curse of dimensionality" means in practice that one has to create LOTS of indexes in order to handle the unpredictable analityc queries.
There can be only one clusterred index. If selectivity is high, the other indexes produce only cache misses (and random seeks)
Use partitioning instead indexing. Remember, partitioned queries run in parallel
Full table scan is a sensible option. If so, do it with one pass over the table
C language case study is presented in Kenneth A. Ross, "Conjunctive Selection Conditions in Main Memory"
Traditional star query
select prod.p_cat , sum(figure)
from fact, prod, time, customer
where
fact.c_id = customer.c_id and customer.c_type = 'big' and
fact.t_id = time.t_id and time.t_year = '2007' and
fact.p_id = prod.p_id and
fact.m_id = 15 and
fact.s_id = 2
group by prod.p_cat;
Traditional star query
p_cat
sum(figure)
books
7754933.3210
car
7757807.1538
cosmetic
7797828.1414
electro
7749172.6525
fashion
7800735.9866
food
7799165.1570
sport
7734439.1736
Traditional star query plan
d
select_type
table
type
possible_keys
key
key_len
ref
rows
Extra
1
SIMPLE
time
ALL
PRIMARY
NULL
NULL
NULL
14
Using where; Using temporary; Using filesort
1
SIMPLE
fact
ref
PRIMARY
PRIMARY
4
star2.time.t_id
25401
Using where
1
SIMPLE
prod
eq_ref
PRIMARY
PRIMARY
4
star2.fact.p_id
1
1
SIMPLE
customer
eq_ref
PRIMARY
PRIMARY
4
star2.fact.c_id
1
Using where
Hierarchical range-clustering"
Column c_0_1 represents "books" etc; SUMs are the same:-)
select
sum(figure *(9 <= dim_1 and dim_1 <= 14)) as c_0_1,
sum(figure *(15 <= dim_1 and dim_1 <= 20)) as c_0_2,
sum(figure *(21 <= dim_1 and dim_1 <= 26)) as c_0_3,
sum(figure *(27 <= dim_1 and dim_1 <= 32)) as c_0_4,
sum(figure *(33 <= dim_1 and dim_1 <= 38)) as c_0_5,
sum(figure *(39 <= dim_1 and dim_1 <= 44)) as c_0_6,
sum(figure *(45 <= dim_1 and dim_1 <= 50)) as c_0_7
from fact_cub where
6 <= dim_0 and dim_0 <= 17 and
7 <= dim_2 and dim_2 <= 12 and
dim_3=3 and
dim_4 =3;
Hierarchical range-clustering"-style
c_0_1
c_0_2
c_0_3
c_0_4
7754933.3210
7757807.1538
7797828.1414
7749172.6525
c_0_5
c_0_6
c_0_7
7800735.9866
7799165.1570
7734439.1736
"Hierarchical range-clustering"-style
id
select_type
table
type
possible_keys
key
key_len
ref
rows
Extra
1
SIMPLE
fact_cub
ALL
NULL
NULL
NULL
NULL
2540160
Using where
Context
Fact size = 2 500 000 rows, same data in both tables
Table "fact" has a composite Primary Key of all dimensions, no partitions, no additional indexes. Dimension tables have all PK of the id fields. There is no database tuning for "star query"
Table "fact_cub" has no Primary Key, no indexes, no partitions. (Even though Cubulus can make good use for partitions; recommended partitioning would be on dim_0, dim_4, dim_5, or range partitioning for time+ key partitioning for scenario+ key partitioning for measure)
In Cubulus distribution there is random database generator will populate database with several millions rows, and create suitable partitions. Also there is an example how to import your star schema, which you can optimize and compare
Conclusion
The star-query could be optimized (also avoid the string comparison)
But in general the runtime of a star-query degrades with the number of of joins. Also needs effort to tune indexes, materialized views etc.
Proposed queries only need partitioning. Partitioning criteria is simple: by time and by non-aggregatable measures ex: Scenario, Currency.
Future work
Port to different database engines (currently only MySQL)
Use different aggregations (currently only SUM)
Produce XML instead of HTML will allow using Cubulus as backend MDX server
Charting (ex: when output will be XML, this can be done with SVG and XLST processing in browser)
Support for LDAP/Kerberos/NTLM
References
OLAP aggregation engine uses "hierarchical range-clustering of keys" by "V. Markl, F.Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering. Proc. of the Intl. Database Engineering and Applications Symposium, pp. 165-177, 1999", link
SQL statements inspired from Kenneth A. Ross, "Conjunctive Selection Conditions in Main Memory", link