Cubulus OLAP
Open Source aggregation engine with slice & dice web interface
Alexandru Toth
alxtoth@users.sourceforge.net
Overview
- Aggregation engine with scalable SQL backend database
- Slice & Dice web interface with hierarchy menus
- Understands basic MDX syntax
- Can be embedded as a frame into web interfaces
- Open Source, GPL v2 license
Architecture
- Front-end: Apache / Lighttpd /squid
- SSL, proxy, digest authentication
- Web servers are good in protecting against buffer overflows and other things you
don't want to know about.
- Middle tier:
- OLAP engine written in Python
- presentation in CherryPy (Python)
- Can be embedded as a frame into web interfaces
- Back-end:
- mySQL, PostgreSQL, SQLite (queries are simple, porting to other databases shouldn't be big issue.
Just make sure your favourite database supports partitioning)
- memcached for storing cached cells, dimension tables, metadata
Authentication and Authorization
- Secure authentication is HARD. Delegate and configure it in the webserver instead of coding it in Python
- CherryPy receives authenticated user from webserver
- Access restrictions on each dimension can be configured for users
- Ex: User "Alex" only sees year "2002" and region "Europe" (and hierarchy members bellow)
- Authentication is optional. Without it, all users are "anonymous", and what they see is configurable
Why Python and memcached?
- Python is very high-level language: Cubulus v0.45 (slice&dice, MDX parser, database failover) is less than 3000 lines
- When fact table has millions of rows, runtime of Python becomes neglectible compared to database runtime
- Memcached runtime is neglectible to Python runtime
- Why not use a Python hashtable instead of memcached? That would require locking and coding expiration of old items. Memcached already does these things.
- If it's memory consumption gets too big, just restart memcached. (Python will reconnect and may repeat some queries).
Ideal in shared hosting environment, where CherryPy runs in session-less mode, with constant memory footprint
Database Scalability & Failover
- Scale up: with partitioned fact table in MySQL 5.1 on a multi-CPU / multi-core computer,
queries over partitions are executed in parallel
- Scale-out & failover: queries can be distributed to multiple database servers.
If all servers are available, queries are executed in parallel.
If one database is unavailable, queries are re-executed in the available db servers
- Maximum degree of parallelism: each cell can be queried in a different CPU
- (Theoretical speculation; bring me hardware to test it)
Caching
- If one cell was already computed, don't query for it again
- Database might not understand what to cache.
- SQL analogy:
- first issue:
- SELECT sum(sales) GROUP BY month WHERE continent="Europe"
- then issue:
- SELECT sum(sales) GROUP BY continent WHERE month="January"
- ... missing and continent != "Europe"
- For OLAP engine it is easy to store a cell for "Europe"+"January", and generate no SQL for cached cells
Predictable performance
- If all cells have similar SQL statements, one can measure response time in one CPU and add hardware until response time is acceptable
- Assumption: SUM is so slow that it is more likely users run of patience before computer runs of memory. Consequence: data fits in memory.
- You WILL want more CPUs before you run out of memory
Unpredictable data
- In practice data will be "L-shaped", "sparse", "non-uniform"
- Analytic queries are unpredictable
- Don't bother computing summary tables and indexes nobody will use; it's enough to cache recent queries
- A report with 10 rows and 10 columns starts looking "embarassingly parallel"
- Hardware keeps getting cheaper
MDX language
- Cubulus is now able to parse a small subset of MDX. This allows for easier integration on HTML level. For example MDX statement:
- Select {[time].[all time].children} on rows, {[region].[all region].children} on columns from alxtoth
- At the moment there is no support for nested dimensions, formulas, crossjoin etc.
Theoretical background
- Instead of transforming hierarchical queries into star- or snowflake-schema with lots of joins, transform the dimensions such that hierarchy indexes are sorted. This way all joins become range queries
- "Hierarchical range-clustering of keys" comes from the article: V. Markl, F.Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering
No index, no joins
- The "curse of dimensionality" means in practice that one has to create LOTS of indexes in order to handle the unpredictable analityc queries.
- There can be only one clusterred index. If selectivity is high, the other indexes produce only cache misses (and random seeks)
- Use partitioning instead indexing. Remember, partitioned queries run in parallel
- Full table scan is a sensible option. If so, do it with one pass over the table
- C language case study is presented in Kenneth A. Ross, "Conjunctive Selection Conditions in Main Memory"
Traditional star query
- select prod.p_cat , sum(figure)
from fact, prod, time, customer
where
fact.c_id = customer.c_id and customer.c_type = 'big' and
fact.t_id = time.t_id and time.t_year = '2007' and
fact.p_id = prod.p_id and
fact.m_id = 15 and
fact.s_id = 2
group by prod.p_cat;
Traditional star query
-
| p_cat | sum(figure) |
| books | 7754933.3210 |
| car | 7757807.1538 |
| cosmetic | 7797828.1414 |
| electro | 7749172.6525 |
| fashion | 7800735.9866 |
| food | 7799165.1570 |
| sport | 7734439.1736 |
Traditional star query plan
-
| d | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | time | ALL | PRIMARY | NULL | NULL | NULL | 14 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | fact | ref | PRIMARY | PRIMARY | 4 | star2.time.t_id | 25401 | Using where |
| 1 | SIMPLE | prod | eq_ref | PRIMARY | PRIMARY | 4 | star2.fact.p_id | 1 | |
| 1 | SIMPLE | customer | eq_ref | PRIMARY | PRIMARY | 4 | star2.fact.c_id | 1 | Using where |
Hierarchical range-clustering"
- Column c_0_1 represents "books" etc; SUMs are the same:-)
- select
sum(figure *(9 <= dim_1 and dim_1 <= 14)) as c_0_1,
sum(figure *(15 <= dim_1 and dim_1 <= 20)) as c_0_2,
sum(figure *(21 <= dim_1 and dim_1 <= 26)) as c_0_3,
sum(figure *(27 <= dim_1 and dim_1 <= 32)) as c_0_4,
sum(figure *(33 <= dim_1 and dim_1 <= 38)) as c_0_5,
sum(figure *(39 <= dim_1 and dim_1 <= 44)) as c_0_6,
sum(figure *(45 <= dim_1 and dim_1 <= 50)) as c_0_7
from fact_cub where
6 <= dim_0 and dim_0 <= 17 and
7 <= dim_2 and dim_2 <= 12 and
dim_3=3 and
dim_4 =3;
Hierarchical range-clustering"-style
-
| c_0_1 | c_0_2 | c_0_3 | c_0_4 |
| 7754933.3210 | 7757807.1538 | 7797828.1414 | 7749172.6525 |
| c_0_5 | c_0_6 | c_0_7 |
| 7800735.9866 | 7799165.1570 | 7734439.1736 |
"Hierarchical range-clustering"-style
-
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | fact_cub | ALL | NULL | NULL | NULL | NULL | 2540160 | Using where |
Context
- Fact size = 2 500 000 rows, same data in both tables
- Table "fact" has a composite Primary Key of all dimensions, no partitions, no additional indexes. Dimension tables have all PK of the id fields. There is no database tuning for "star query"
- Table "fact_cub" has no Primary Key, no indexes, no partitions. (Even though Cubulus can make good use for partitions; recommended partitioning would be on dim_0, dim_4, dim_5, or range partitioning for time+ key partitioning for scenario+ key partitioning for measure)
- In Cubulus distribution there is random database generator will populate database with several millions rows, and create suitable partitions. Also there is an example how to import your star schema, which you can optimize and compare
Conclusion
- The star-query could be optimized (also avoid the string comparison)
- But in general the runtime of a star-query degrades with the number of of joins. Also needs effort to tune indexes, materialized views etc.
- Proposed queries only need partitioning. Partitioning criteria is simple: by time and by non-aggregatable measures ex: Scenario, Currency.
Future work
- Port to different database engines (currently only MySQL)
- Use different aggregations (currently only SUM)
- Produce XML instead of HTML will allow using Cubulus as backend MDX server
- Expand MDX subset: dimension nesting, calculated formulas etc
- Charting (ex: when output will be XML, this can be done with SVG and XLST processing in browser)
- Support for LDAP/Kerberos/NTLM
References
- OLAP aggregation engine uses "hierarchical range-clustering of keys" by "V. Markl, F.Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering. Proc. of the Intl. Database Engineering and Applications Symposium, pp. 165-177, 1999", link
- SQL statements inspired from Kenneth A. Ross, "Conjunctive Selection Conditions in Main Memory", link
Credits in no particular order
Credits in no particular order