I spent a lot of time this weekend revisiting our new database design for KBMOD, and see last post. I had a good excuse to finally learn about the spatial database support in PostgreSQL — PostGIS. Through several iterations and redrafts, I sped some of our test queries up by a factor of about 28x (from 21s to 0.7s).
Here were the key insights:
- Rather than think of each image as a 2-D box (
ra
anddec
) with an associated timestamp, think of it as a 3-D box (ra
,dec
,time
). This box is still fairly small because each image covers only a few seconds or minutes. - The image 3-D box lives in a single column in a single table, so there are no on-the-fly joins and there is less data read from disk (probably).
-
It turns out that PostGIS cannot do a hybrid index-lookup on space and time at the same time. So the 2-D box + time-range version of the query takes two steps — first, intersect each trajectory spatially with all images, then filter out the examples where the timestamps do not sync up.
By treating the image as a 3-D box (and the trajectory as a line in 3-D space), we are able to do both these phases at once.
-
To intersect a line with a box, it suffices to simply check that their bounding boxes overlap. PostGIS is faster if you ask for overlapping bounds rather than run the
ST_Intersects
code.
I think we’re now close enough with the database that we should put the end-to-end pipeline together:
- pose a set of trajectories
- use the database to find all the images these trajectories may overlap with
- use the database to group these results by image, so that we have a set of (trajectory, image) pairs ordered by image.
- read each intersected image off disk once, pulling out the relevant pixels for all trajectories as a group.
- co-add the pixels and find new trans-Neptunian objects — aka, Pluto’s buddies!
Looking forward to our next meeting!
Comments !