Using Big Data to Catalog our Universe

It’s a familiar problem: you have a lot of data but want to know what it means. Astronomers face the same problem, with millions of photos of the sky and no easy way to know what they show. Big Data problems are often described as transforming data into insights, and that is exactly what some scientists are doing with “Sky Surveys”. A sky survey is a collection of millions of telescope images, together with a record of when and where each one was taken. The Celeste collaboration, a group of scientists, has been using this data to catalogue our universe in a way that’s visual and understandable, and the work is still going on. And it’s never been done before…

Putting a telescope into orbit may spare the light its trip through our atmosphere, but it does not tell us what each photo means. There are also complications, such as diffraction spikes produced by the telescope itself and gravitational lensing that bends the light along its journey. The Celeste project has addressed such challenges in order to build a meaningful catalogue of our universe.

Collecting everything we know about the universe so far is most certainly a big data problem. The project’s computational performance is in the petascale range, meaning that the Celeste collaborators have performed computations at a rate exceeding a thousand million million (10^15) operations per second. And if you want to know more of the technical stuff, they did this with over 9000 CPUs (Central Processing Units), a high-productivity language from MIT called Julia, and a 178 terabyte dataset representing 188 million stars and galaxies.

In 1998, the Apache Point Observatory in New Mexico began imaging every visible object in the sky. The survey proudly reports that it has already created the most detailed three-dimensional maps of the Universe ever made, with deep multi-colour images of one-third of the sky and spectra for more than three million astronomical objects. It has published fourteen data releases so far and continues to release new datasets annually. It’s not hard to imagine that these ever-expanding datasets will offer even more opportunities for the Celeste collaboration in their analysis work.

Over the course of their first three years, the Celeste collaboration developed a new parallel computing method and used it to process the full dataset (about 178 terabytes). The result was the most accurate catalogue of 188 million astronomical objects, complete with state-of-the-art point and uncertainty estimates, produced in just 14.6 minutes.
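
To get a feel for those numbers, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes decimal terabytes (1 TB = 1000 GB); the function name is made up for illustration and has nothing to do with the Celeste code itself.

```python
# Figures quoted in this post.
DATASET_TB = 178          # size of the dataset, in terabytes
OBJECTS = 188_000_000     # stars and galaxies in the catalogue
RUNTIME_MIN = 14.6        # time taken to produce the catalogue, in minutes


def throughput(dataset_tb, objects, runtime_min):
    """Return average processing rates for the whole run.

    Assumes decimal units (1 TB = 1000 GB); this is an illustrative
    calculation, not a measurement reported by the collaboration.
    """
    seconds = runtime_min * 60
    return {
        "seconds": seconds,
        "objects_per_second": objects / seconds,
        "gigabytes_per_second": dataset_tb * 1000 / seconds,
    }


rates = throughput(DATASET_TB, OBJECTS, RUNTIME_MIN)
print(f"{rates['objects_per_second']:,.0f} objects per second")
print(f"{rates['gigabytes_per_second']:,.0f} GB per second")
```

On those assumptions, the run averages a little over 200,000 catalogued objects and roughly 200 gigabytes of data every second.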

The Celeste collaborators have taken on the challenge of building a catalogue of the universe and, as with all big data projects, they crave more data. Their work is an encouraging demonstration of how important and scalable big data can be.
