Science Data Processor Consortium Report – SKA eNews – July 2015

Overview

The Science Data Processor (SDP) consortium was set up to examine the design of the computing hardware platforms, software, and algorithms needed to process science data from the correlator or non-imaging processor into science data products for the SKA community. There are presently 18 full partners (ranging from individual institutes to alliances of institutes) and 5 associate partners within the consortium. Work has so far been shared according to expertise and available effort across the following areas:

  • Project Management;
  • System Engineering;
  • Architecture;
  • Computation;
  • Data;
  • Data Delivery;
  • Local Infrastructure;
  • Local Monitoring & Control;
  • Pipelines;
  • Science Support and Quality Management.

The SDP project recently underwent an external Preliminary Design Review (PDR). The recommendations from that review have resulted in a number of changes, not least to the project milestones and deliverables. These changes, including how the overall architecture for the SDP is evolving, have been the subject of much discussion. The role prototyping will play in down-selecting candidates for the various architectural components, together with how to go about developing a baseline implementation of the architecture, will form the focus of a consortium meeting to be held at ASTRON at the end of June.

In future eNews contributions we will explore the SDP areas and developments in more detail. As the SDP consortium we will present some of the interesting challenges that have arisen and the decisions that have had to be made, and our partners will share snapshots of the work they are undertaking, its wider impacts and local interests. For this eNews edition we are focusing on one of the significant questions the SDP must address that previous radio telescopes have largely been able to avoid.

How precious is our data?

In order to maintain throughput and avoid “pooling” of data, the SKA’s back-end computer will need to complete the processing of any given observation in the same length of time that it took the receivers to collect the data for that observation. This raises many issues for the data reduction pipelines, as it requires the processing time to be completely deterministic.
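
As a rough illustration of this constraint, the minimal Python sketch below checks whether a pipeline keeps pace with the incoming data; the data rates and function name are hypothetical, not SDP figures.

    def keeps_up(observation_hours, data_rate_tb_per_hour, processing_rate_tb_per_hour):
        """Return True if an observation can be processed before the next one arrives."""
        data_volume_tb = observation_hours * data_rate_tb_per_hour
        processing_hours = data_volume_tb / processing_rate_tb_per_hour
        return processing_hours <= observation_hours

    # Illustrative numbers only: a 6-hour observation ingested at 10 TB/hour
    # needs a sustained processing rate of at least 10 TB/hour to avoid backlog.
    print(keeps_up(6.0, 10.0, 10.0))  # True  – processing exactly keeps pace
    print(keeps_up(6.0, 10.0, 8.0))   # False – a “pool” of unprocessed data builds up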

This is a very different regime from the one in which astronomers have traditionally worked – in the past, they have preferred to re-analyse the data with different parameters to see what works best, or to clean images with interactively chosen numbers of iterations, rather than to run a data reduction pipeline with fixed values. More subtly, even consistently parametrised algorithms such as minimisation functions (necessary to find the best-fitting calibration solutions) may not converge within a fully determined timescale (leading to “stragglers”), or nodes may fail after being allocated work and never deliver results.
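
To make this concrete, the sketch below bounds an iterative solve so that its run time is deterministic, reporting non-convergence instead of letting a straggler run on; the toy solver and limits are our illustrative assumptions, not the SDP’s actual algorithms.

    def bounded_solve(update_step, initial_guess, tol=1e-6, max_iter=50):
        """Iterate update_step at most max_iter times and flag whether it converged."""
        solution = initial_guess
        for _ in range(max_iter):
            new_solution = update_step(solution)
            if abs(new_solution - solution) < tol:
                return new_solution, True   # converged within the fixed iteration budget
            solution = new_solution
        return solution, False              # budget exhausted: flag it rather than run on

    # Toy iteration standing in for a calibration solve: converges to sqrt(2).
    value, converged = bounded_solve(lambda x: 0.5 * (x + 2.0 / x), initial_guess=1.0)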

In order to build an SKA that can function, astronomers will need to hand over the controls to the SKA itself, so that observations and their data processing can be scheduled together. Decisions on whether it is absolutely necessary for the processing to produce a particular solution will then be handled automatically within the SDP. According to Dr Bojan Nikolic, SDP Consortium Project Engineer, a conventional “High Availability” (HA) system for the SDP – one in which overheads are included at all stages to ensure that all data products are recoverable after individual node failures and that all computations can be repeated if necessary – would be very costly.

He says: “Costs of traditional high availability are significant, sometimes leading to a doubling of the hardware cost.” He continues: “Since conventional high availability restarts failed processes (possibly on a different node or island), it impacts the determinism of time-to-compute. If used pervasively, this would require additional margin in the estimated time-to-compute, potentially leading to reduced overall efficiency of the SDP system.”

To work towards a deterministic processing time within a sensible cost margin, the SDP consortium has defined the concepts of “precious data” and the complementary “non-precious data”. Precious data is data without which the processing results are seriously compromised – for example antenna delay calibration values, without which all baselines to that antenna are irrecoverable – whereas non-precious data can be lost without disproportionately impacting the results – for example one small chunk of visibility data, dropped because no good solutions could be found for a particular type of calibration, or because the node performing the calculation on it ran slow or failed. The loss of non-precious data affects the signal-to-noise at a marginal (and predictable) level but does not corrupt the remaining data; this is shown visually in the figure below.

Figure 1: The loss of non-precious data (right-hand side) leads only to a gradual and predictable reduction in signal-to-noise, without corrupting any other areas of the data set, compared to the idealised case on the left-hand side. Black arrows represent the direction of data flow.
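
As a very simple sketch of how such a distinction might be acted upon when a piece of work fails – the names and structure here are our illustrative assumptions, not the SDP design:

    from dataclasses import dataclass

    PRECIOUS = "precious"          # e.g. antenna delay calibration values
    NON_PRECIOUS = "non-precious"  # e.g. one small chunk of visibility data

    @dataclass
    class WorkItem:
        name: str
        category: str  # PRECIOUS or NON_PRECIOUS

    def handle_failure(item, retry):
        """Retry precious work; drop non-precious work and accept the marginal SNR loss."""
        if item.category == PRECIOUS:
            return retry(item)  # must be recovered, otherwise whole baselines are lost
        return None             # discard: a small, predictable hit to signal-to-noise

    delay_cal = WorkItem("antenna delay calibration", PRECIOUS)
    vis_chunk = WorkItem("one chunk of visibility data", NON_PRECIOUS)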

The choice of where the boundary between precious and non-precious data lies is ultimately driven by a cost-benefit analysis. More robustness to data loss can be introduced by, for example, performing all precious calculations multiple times to reduce the chance that a solution is not obtained, or by building in a compute-rate and input/output-rate overhead so that there is some “slack” in the system that allows failed calculations to be repeated while still completing the data processing within the nominally allowed time.
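
The two options mentioned above could look something like the following sketch; again this is illustrative only, and the function names, failure model and budgets are assumptions rather than the SDP design.

    def redundant_solve(solve, attempts=3):
        """Run a precious calculation several times and keep the first solution obtained."""
        for _ in range(attempts):
            result = solve()
            if result is not None:
                return result
        return None  # even with redundancy a solution may be missed, but far less often

    def retry_within_slack(solve, slack_seconds, retry_cost_seconds):
        """The first attempt sits in the nominal time budget; repeats eat into the slack."""
        result = solve()
        while result is None and slack_seconds >= retry_cost_seconds:
            slack_seconds -= retry_cost_seconds
            result = solve()
        return result, slack_seconds  # None here means the loss is accepted as non-precious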

Nikolic concludes: “Implementing the concept of non-precious data in SDP would allow for graceful handling of failures by degradation of output result quality. This fits well with the SKA definition of availability and would also allow use of some cheaper components within the SDP.”

Implications for costs…

Suppose the SKA had an operating cost of around €65 million per year; a typical 6-hour observation on one instrument might then “cost” around €25,000, which is about 0.1% of the SDP budget for one instrument. An average “throw away” rate of 1% of the data entering the SDP, due to non-convergence or processing failures, would then cost €250 per 6 hours, or €365,000 per year (or 1% of the operations budget). If this data loss could be avoided by building a bigger SDP, the SKA would effectively be cheaper to run (per unit sensitivity) and more expensive to build. The break-even point will not be known until after the SKA is built, but if spending 1% extra on the SDP increased the effective availability of the SKA by 1%, then the extra €0.5M spent initially would be recovered within a year.
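
For readers who want to follow the arithmetic, the €250 and €365,000 figures reproduce as follows; this sketch uses only the illustrative numbers quoted above, not official SKA costings.

    observation_cost = 25_000.0  # € per 6-hour observation on one instrument
    throw_away_rate = 0.01       # 1% of the data lost to non-convergence or failures

    loss_per_observation = throw_away_rate * observation_cost     # €250 per 6 hours
    observations_per_year = 365 * 24 / 6                          # back-to-back 6-hour blocks
    loss_per_year = loss_per_observation * observations_per_year  # €365,000 per year

    print(f"Loss per 6-hour observation: €{loss_per_observation:,.0f}")
    print(f"Loss per year:               €{loss_per_year:,.0f}")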