JASMIN Forum 2012
Posted on September 25, 2012 (Last modified on November 8, 2023)

A forum addressing the challenges of large-scale data analysis in the atmospheric science community.
On 25 September 2012, NCAS hosted a research forum at the University of Reading to address the challenges of large-scale data analysis in the numerical weather prediction, oceanographic and climate science communities. The forum heard presentations about many NCAS activities, including UPSCALE, CASCADE, HiGEM and WISER, as well as presentations from the Earth observation (EO), atmospheric composition and oceanographic perspectives.
The particular context was the recent purchase of the JASMIN and CEMS facilities, which are managed by the Centre for Environmental Data Archival (CEDA) at the Rutherford Appleton Laboratory. JASMIN and CEMS are designed to enable centralised terabyte- to petabyte-scale data analysis for the NCAS and NCEO communities. The forum also took a wider view of the data analysis challenges faced by the community. In discussing these challenges and potential solutions, the forum aimed to uncover common problems, explore how recent improvements in technology can address them, and identify additional work needed, particularly in the area of parallel data analysis.
The forum started by introducing the JASMIN architecture and the plans for meeting the data analysis needs of NCAS researchers. We then heard presentations from many projects and communities tackling large-scale data analysis challenges, starting with those that have early experience with the JASMIN system. We finished with presentations on developments in analysis tools and visualisation, followed by a general discussion.
The age of "Big Data" is generally recognised as bringing challenges of scale in three areas: volume, rate and complexity. The forum heard many examples of projects facing challenges in all three. Every project is experiencing data volume growth, which has implications for the ability of individual institutions to meet their own storage requirements. With increases in volume come increases in transfer rates, both from the site of production (e.g. an HPC facility or a satellite) to where the data is stored, and from storage to the analysis platform. Even for moderate volumes of data, data complexity can overwhelm the ability of users to analyse it with the tools at their disposal. Amongst the wide variety of topics discussed, a few highlights are described below.
M. Mizielinski reported on the UPSCALE project as an early adopter of the JASMIN system. UPSCALE expects ~300 TiB (330 TB) of data over the lifetime of the project, with 1-2 TiB generated per day. At this rate, transfer out of the HPC facility becomes a significant issue. The network links and data transfer services put in place by CEDA between HERMIT and JASMIN have sustained up to 5-6 TiB/day of transfer to JASMIN, where further analysis has been possible through direct access to the data from UPSCALE's dedicated analysis VMs.
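To put those rates in perspective, here is a back-of-the-envelope conversion of daily transfer volumes into sustained network bandwidth. The TiB/day figures come from the talk; the arithmetic below is our own quick sketch:

```python
# Sustained bandwidth implied by a given daily transfer volume.
# The TiB/day figures are from the UPSCALE talk; the conversion is ours.
TIB = 2**40                      # bytes in a tebibyte
SECONDS_PER_DAY = 24 * 60 * 60

for tib_per_day in (1, 2, 5, 6):
    byte_rate = tib_per_day * TIB / SECONDS_PER_DAY   # bytes/s
    gbit_rate = byte_rate * 8 / 1e9                   # Gbit/s
    print(f"{tib_per_day} TiB/day ≈ {byte_rate / 1e6:.0f} MB/s"
          f" ≈ {gbit_rate:.2f} Gbit/s sustained")
```

Sustaining 5-6 TiB/day therefore means holding roughly half a gigabit per second around the clock, before any protocol overhead or retries.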
J. Remedios, representing the EO community, reported that after several years of being able to cope with the storage requirements of satellite products, they are once again struggling to cope with the next generation of products. Next-generation instruments such as the Infrared Sounder (IRS) on the MTG-S satellite will produce 700 TB/year from 2018 onwards. There is an increasing need for re-processing of EO data for climate applications and for the generation of multiple products from the same dataset. Synergies between different re-processing systems mean that co-hosting multiple datasets alongside analysis facilities is an obvious next step, one which they foresee being met by CEMS.
L. Shaffrey reported on the HiGEM project's experience of producing 60+ TB of climate model output, with decadal forecasts expected to total 75+ TB of NetCDF data. Storing the HiGEM/UJCC datasets at BADC has been very successful: it allowed easy access from different institutions and removed the hassle of dataset access and archiving from scientists, and problems with the timely transfer of data from supercomputers to BADC were resolved quickly by BADC. The conversion of data from the UK Met Office PP format to NetCDF was less successful, hampered by insufficient metadata. HiGEM found that the data volumes produced by contemporary climate models cause serious problems for the tools generally available to users.
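For flavour, the conversion step looks roughly like the sketch below, here using the Met Office's Iris library rather than whatever tooling HiGEM actually used; the file names are hypothetical, and the point is that a converter can only carry across metadata that exists in the source files:

```python
# A minimal PP-to-NetCDF conversion sketch using the Iris library.
# Illustrative only: not the HiGEM pipeline, and the paths are made up.
import iris

cubes = iris.load("higem_output.pp")

# A conversion can only preserve what the PP files contain; anything the
# NetCDF side needs beyond that (long names, provenance, etc.) must be
# supplied by hand, which is where insufficient metadata hurts.
for cube in cubes:
    cube.attributes.setdefault("source", "HiGEM model output")

iris.save(cubes, "higem_output.nc")
```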
Matt Evans described the data challenges faced by atmospheric composition research as both similar to and distinct from those of the modelling and EO communities. Composition modelling also faces rapidly expanding data volumes, but data complexity is a more significant challenge. Data in this field takes both numerical and symbolic forms, for instance the Leeds Master Chemical Mechanism, and their combination presents unique challenges for tool development.
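To illustrate what "symbolic" means here: a chemical mechanism is a set of reactions with rate expressions rather than a gridded array of numbers. The sketch below is a hypothetical representation of a single mechanism entry, with a generic Arrhenius-style rate; it is not the actual MCM format:

```python
# Hypothetical structure for one mechanism entry: species lists plus a
# symbolic rate expression to be evaluated against temperature etc.
# Not the MCM's own format.
from dataclasses import dataclass
from typing import List

@dataclass
class Reaction:
    reactants: List[str]
    products: List[str]
    rate_expression: str   # symbolic, e.g. Arrhenius form in T

# A generic illustrative entry (values are not taken from the MCM):
r = Reaction(
    reactants=["O3", "NO"],
    products=["NO2", "O2"],
    rate_expression="2.07e-12 * exp(-1400 / T)",
)
print(r.rate_expression)
```

Tools then have to evaluate such expressions alongside large numerical model fields, and it is that combination which makes generic analysis tools awkward for this community.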
In the tools section, J. Blower presented on the possibilities for visualisation through facilities like JASMIN and CEMS, emphasising the importance of visualisation for quality control, data discovery and collaboration. P. Stier presented the upcoming JASMIN-CIS project to develop a high-level toolbox for intercomparison of EO, model and observational datasets, which will be deployed on JASMIN. From CEDA, S. Pascoe presented some forward-looking ideas for building parallelism into data analysis workflows and how this might be achieved on JASMIN.
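As a flavour of the simplest kind of parallelism under discussion, a per-file reduction can be mapped across worker processes and the results combined afterwards. The sketch below is our own illustration rather than the approach presented, using only the Python standard library plus netCDF4, with hypothetical file paths and variable names:

```python
# File-level parallel analysis sketch: each worker reduces one file, the
# results are combined afterwards. Paths and variable names are made up.
import glob
from multiprocessing import Pool

import netCDF4

def file_mean(path):
    """Mean of one variable in one file; each worker opens its own file."""
    with netCDF4.Dataset(path) as ds:
        return float(ds.variables["air_temperature"][:].mean())

if __name__ == "__main__":
    paths = sorted(glob.glob("/data/upscale/*.nc"))
    with Pool() as pool:
        means = pool.map(file_mean, paths)
    print(sum(means) / len(means))
```

Note that averaging per-file means only equals the global mean when every file holds the same number of points; a real workflow would weight the partial sums, but the map-then-combine shape is the point here.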
It was clear from the discussion following the presentations that there is huge demand for centralised data storage and analysis facilities to meet the future needs of research. The JASMIN and CEMS facilities are already meeting this challenge, but a lot of work remains to capitalise on their initial success and develop them into mature infrastructure for Earth system research. The forum's discussions also touched on wider challenges that the community faces in the age of data-driven research, such as data format conversion, metadata generation and the training of researchers in informatics. Sharing facilities, tools and expertise can make an important contribution to the search for solutions to these challenges.