Dan's research blog

2014-09-03 daily

Today’s highlight was another hack session with Sophie Clayton in the Armbrust Lab. The same query to compute the cytometric diversity (“richness”) that we ran a month ago on a smaller dataset now ran out of memory after 12 minutes; after an hour of futzing I got it to finish successfully in 2.5 minutes. In the second half of the post, I will dive into this particular query more deeply.

We met with the designer to discuss possible new logos and branding for the UW eScience Institute and for the WRF Data Science Studio we will be opening (and moving into) in November.

Also, Asterios invited me to be a co-chair for next year’s ~~Data Analytics in the Cloud~~ Data Analytics at Scale (DanaC) workshop at SIGMOD 2015, which would be the fourth incarnation of the DanaC workshop. We are working on the proposal, and I hope it is accepted!

Here is the core of the first version of Sophie’s richness query (#32538 on Myria):

def makebins(x): x//(pow(10,3.5)/16);

AllDataBinned = select Cruise, Day, File_Id,
                            makebins(fsc_small) as fsc_bin,
                            makebins(chl_small) as chl_bin,
                            makebins(pe) as pe_bin,
                            count(fsc_small) as num_particles
                from AllData
                where pop!="beads";

Richness = select Cruise, Day, File_Id,
                            count(num_particles) as richness
                            from AllDataBinned;

Here is the core of the rewrite (#32532 on Myria):

AllDataBinned = select Cruise, Day, File_Id,
                            makebins(fsc_small) AS fsc_bin,
                            makebins(chl_small) AS chl_bin,
                            makebins(pe) AS pe_bin
                            -- do not compute the count
                from AllData
                where pop!="beads";

DistinctBins = distinct(AllDataBinned);

Richness = select Cruise, Day, File_Id,
                            count(*) AS richness
                            from DistinctBins;

Did you see it? It’s easy to miss. The only difference is that we swapped a GroupBy, which computed a Count aggregate that we then ignored, for a Distinct. This optimization (standard in commercial databases) has two huge benefits:

It reduced the memory requirements by more than a factor of 4, making the query tractable at its current scale.
It enabled the query to be run in a pipelined manner. Consider the SQL query:
```
SELECT x, COUNT(*) AS c FROM R GROUP BY x;
```
For a given (x1, c1) tuple that will appear in the answer, the query cannot output that tuple until it knows that all input tuples with x = x1 have been seen. In a hash-based aggregate, like Myria uses, this property means that no answers can be produced until the entire input R has been processed, and slows the query down.

(Note: Many databases will sort R (or use an index on x) so that they see all the values of x in order, meaning they can produce (x1,c) as soon as a tuple of R has a new value x2. We are working on this for Myria.)

In contrast, consider the SQL query:
```
SELECT DISTINCT x FROM R;
```
For this Distinct query, the output is simply the unique values of x. For this answer, we can output every new value x1 for x as soon as it appears in the input stream. (We have to keep x1 around to make sure that we do not produce it twice, but we do not need to wait until we have seen all values of x). When we produce x1 early, downstream computation (in this case, computing the count of the number of full bins for each cytogram) can continue immediately. This new query plan results in: better overlap between different parts of the computation, more efficient use of the parallel resources of our cluster, and ultimately, happier users who can do their science faster.

(Note: as above, if the values of x are sorted then we do not even need to remember all values x1 that we have ever seen — just the most recent version. We are working on this for Myria, too.)

Based on what we learned today, I created several new issues for Raco and for Myria and am well on the way to fixing them. The upshot of this work is that the “slow query”, written the first way, now results in the same plan as when I rewrote it the fast way.

I’ve said it before, and I’ll say it again: working with real users on real problems is the ~~only~~ most effective way to make sure your system is actually useful. The dividends for Myria of working with Sophie and other real scientists for Myria are huge.

There are comments.

blogroll

social