Parallel algorithms for Palo Cube Rules

In the previous weeks several people asked me, why Jedox so far is the only BI company that invests in the GPU technology. GPUs make sense when the speed of execution matters. And speed does matter for Palo users, especially when it comes down to financial planning and simulation.

Whenever planning data or planning assumptions are changed at the base level, all aggregations have to be recalculated as quickly as possible to get new consolidated results for a new planning scenario. To deliver this speed, already back in 2005 the Palo developers decided to use an in-memory technology for Palo, which by itself delivers more speed than a disk-based or relational approach.

Choosing in-memory was a wise decision and a lucky one as well, because GPU acceleration actually is only effective in an in-memory architecture (also including the graphic memory of a GPU). GPUs are not helping much on a hard disk or inside a relational database.

Recently I had an interesting conversation with Dr. Tobias Lauer from the Institute of Computer Science at the University of Freiburg. Tobias is one of the research genius behind Palo GPU and he explained how Palo benefits from the parallel algorithms that run in today’s GPUs. This is what I understood from him:

A parallel algorithm utilizes hardware architectures with multiple processing units (processors or processor cores) by executing simultaneously (= in parallel) individual steps of a program that would otherwise be computed sequentially. Depending on the number of available processors, one can distinguish multi-core moderate parallelism (e.g., 2-16 cores) and massive parallelism (hundreds or more processors).

The latter category includes modern GPUs, each consisting of several hundred processing units. Since all the individual processors of a GPU usually execute the same code at the same time, this architecture is suitable for data-parallel (the same operation on many different data) rather than task-parallel applications (different things to be executed simultaneously).

A very simple example from the business intelligence context would be the function

turnover(P) = quantity(P) x price (P)

for a product P. Instead of storing all three figures in the OLAP database, it is sufficient (and for reasons of memory requirement and data consistency even desirable) to save only the quantity and price for a product P and to calculate the turnover dynamically (by an Cube Rule) from those.

For the calculation of the total turnover of a whole group W of products, the equation turnover(W) = quantity(W) x price (W) will lead to a wrong result if quantity(W) is the cumulative total number of all goods and price(W) is the aggregated price. Hence, the individual turnover for each product in the group W must be calculated first, before they can finally be summed up (or, using Palo terminology: we have to use an N-rule). Sequential programs need to run each of the calculations after one another, roughly like this:

1. For each product P in W do (sequentially):
a. Find the quantity and price of P
b. Multiply the two values
c. Add that product to the result
2. Return result

Our new approach is to do these individual calculations in parallel, i.e. to calculate simultaneously. Graphics processors (GPUs) are ideal architectures for this: the same operation (here: multiplication) is executed on many different data sets (here: quantities and prices of all products). A bit over-simplified, our algorithm performs the following steps:

1. Find quantities and prices for all the products P in W simultaneously.
2. Match these records so that quantity and price of the same product are placed next to each other (very quick through parallel sort)
3. Multiply all related pairs (quantity, price) simultaneously and store the results in an array.
4. Add up the array to get the overall result (very quickly by parallel reduce)
5. Return result

Unlike the above sequential algorithm, our parallel approach can perform two steps – finding data and multiplication – for all data sets almost simultaneously. The sorting and the final summation are accomplished by standard algorithms of parallel computing which are also very fast.

In initial tests we have seen very promising results, where our parallel approach has achieved significant speedups compared to the sequential algorithm currently used in Palo.

2 Responses to “Parallel algorithms for Palo Cube Rules”


  • Does it work on a virtual Linux system via VMWare in a server environment?
    Does is work on an office PC with a low-end graphics card with passive cooling?
    Does it work on a terminal PC via web-interface, either virtual desktop or browser-based?

  • Tom,

    Here are the answers:

    “Does it work on a virtual Linux system via VMWare in a server environment?”

    To achieve their high perfomance, our parallel algorithms designed for GPUs have to run on actual GPU hardware. Since usually a VM does not have (exclusive) control over the hardware it runs on, it cannot guarantee that the GPU is available for a specific programm.
    So, no, even if the software may “work” on a virtual machine, you won’t get the GPU speedup.

    “Does is work on an office PC with a low-end graphics card with passive cooling?”

    In principle, yes. However, there are some minimum hardware requirements, depending in part on your OLAP model:
    First of all, at present we only support Nvidia GPUs that are CUDA-capable (which sort of means that non-graphics algorithms run on them). They also need to have “compute capability 1.3″, which means they can handle double precision floating point arithmetics.
    Finally, the total graphics memory on the card(s) needs to be large enough for the data to be processed. We do hold (part of) the cube data in graphics memory. Hence, the larger your cubes, the bigger your required GPU memory.
    What does this mean in practice? — We are currently doing a lot of our own development and tests on cards of the GTX-2xx series, which come with up to 2 GB graphics memory and start as low as EUR 200,-. This may not be low-end, but it’s what most people nowadays have in their home computers. (You may have to beg to get it in your office desktop, though. So it’s a valid question.)

    “Does it work on a terminal PC via web-interface, either virtual
    desktop or browser-based?”

    Yes, if you were asking about client machines. Since our current parallel algorithms are only part of the SERVER software, it does not matter what front-end you use to access the server, be it web-based, via Excel spreadsheet, or otherwise.

Leave a Reply

Spam protection by WP Captcha-Free