Over the weekend, both Sophie Clayton and Andy Becker worked independently on their Data Science Incubator projects; I spent some time then and today answering emails :).
Sophie has been loading underway data (GPS, temperature, salinity, etc. from ships in motion) into SQLShare for cleaning. Every research vessel is its own special flower that represents dates and times in different ways, and we want to bring them all into a normalized format. We will then load the normalized data into Myria and join it with the SeaFlow data in her analyses.
(For now, the Microsoft SQL Server system underpinning SQLShare is better for messy data than Myria itself, because SQL Server handles more data types and corner cases than we do. This will change!)
We made Sophia’s queries run dramatically faster by materializing SQLShare datasets once all the fields were in the right types. Turns out, recomputing datetime
objects can be really slow when you want to do interval joins on them! We also discussed how to find bad rows in datasets, e.g., if you’re getting errors casting values to float
, you need to know about WHERE ISNUMERIC(x) <> 1
so you can find the bad values of x
.
Andy has been experiencing the joys of loading data into databases with indexes, foreign keys, auto commit, checkpointing, and all that. After some iteration, we figured out to check the Postgres logs and found that his remote COPY
commands were running out of memory. Chunking the data made it finish.
I finally got Myria’s web interface to correctly push queries into the Postgres without weird Google App Engine issues, fixed all the Travis-CI tests, and deployed it.
There are comments.