This week we have a little look at the background data storage and the improvement works we are currently doing. These are needed to a) make the entire photometry dataset available again for searches online, and b) speed up the light curve searches and prepare the system for further growth and future online applications, such as a photometry alert system.

Currently all the photometry data is stored in one huge table in the database. This table has now more than 300 million lines. At the time this was set up, we did not anticipate that we would gather such a large volume of data so quickly. Searching for light curves hence involves finding the entries at a given position amongst all the data. This is obviously slow, despite the fact that the columns containing the coordinates (Right Ascension and Declination) are indexed. Furthermore, in the case we need to recover the table from a backup, the indexing has to be redone. This makes this process very slow. Thus, we have to re-organise the data.

The currently proposed new working structure for the tables storing the data is shown in the above image. The photometry will initially still be stored in the same photometry table, during the submission process. So nothing at all will change for the user that processes the data. The final step of writing the data into the table will be even quicker, as we will not index the data as before. We have created a new Gaia cross-reference table. This contains all GaiaDR3 sources that are in 1.5deg x 1.5deg squares around each of the HOYS target fields. There are about 2.5million such sources. We have assigned each individual Gaia source to a specific HOYS target region. This assignment is obvious for isolated regions. If two fields are close and overlap, such as e.g. IC1396A and IC1396N, we have defined borders between them.

In order to cross match the Gaia sources against our photometry, we will set up a process which works in the background. It will take one line of the photometry table (the brightness measurement for a single star in a single image). It will then find all Gaia sources within a given radius (3 arcsec) around that source. If there is just one, that is the matching source. We will then move the line of photometry data into a new table that contains only the photometry of sources that are in the same HOYS field as the matching Gaia source. In order to identify which object in the photometry table for a given region is matched to which Gaia source, a new lookup table for each region is created. This serves as intermediary between the photometry tables and the reference catalogue. We also store the separation of our data and the reference source. If more than one source is matched within 3 arcseconds, the brightest object will be considered the correct match but all other potential matches are also stored. If nothing is matched, the data is stored in a separate table. These data could be due to transient objects, not included in the Gaia catalogue.

This might look a quite complicated structure, but it has huge benefits for the speed of restoring data (should this be needed again), and in particular the light curve search. We now only need to search the Gaia reference list for the closest match of a coordinate set. By default only bright Gaia sources (Gmag<18mag) will be searched. These are only about 500,000 objects, and hence much less than the 300 million entries currently. When we know the Gaia source queried, we can then very quickly find the list of photometry IDs in the relevant lookup table and extract those from the respective photometry file. Note this involves only looking up indexed integer values in much smaller tables, which is much faster. We estimate an improvement of at least x100 in the speed of accessing light curves. We will report back when we measured the improvement in speed 🙂