Creating an Ensemble Aggregation

Last modified: Tue, 03/14/2017 - 00:19

This section describes how to use the THREDDS Data Server and NCML to create a new data set from existing netCDF files or aggregations with an ensemble axis.

An Ensemble Axis Aggregation

Basic Aggregation Setup for the Time Axis

To use the ensemble facilities in LAS, you first have to prepare your data collection so that you have one data source for each ensemble member. That means that if one ensemble run consists of several files covering different time ranges, you will first need to create a time aggregation of those data. Once you have a collection of files or OPeNDAP aggregations where each one represents one ensemble member, you can create the aggregation along the ensemble axis.

In this example, we'll walk you through the process of creating a time aggregation and the ensemble aggregation. The example presented here is a small subset of the full ensemble, but should serve to illustrate all of the steps.

We begin with data files for two parameters, covering three years.

tas_CLIVAR_atm_monthly.198204-198303.nc
tas_CLIVAR_atm_monthly.198304-198403.nc
tas_CLIVAR_atm_monthly.198404-198503.nc
zg_CLIVAR_atm_monthly.198204-198303.nc
zg_CLIVAR_atm_monthly.198304-198403.nc
zg_CLIVAR_atm_monthly.198404-198503.nc

These data represent one of the ensemble runs, and we will end up creating an ensemble of three such runs, but first we have to get these files organized along the time axis. The NCML facilities in the Java netCDF library allow you to easily create a time aggregation of these data.

Below is the bare-bones XML needed to create the time aggregation.

 <dataset ID="CM2.1U_CDAef_v1.0_apf r1 Atmosphere" name="CM2.1U_CDAef_v1.0_apf r1 Atmosphere" urlPath="CM2.1U_CDApf_v1.0_r1Atmos_wo_vars">
 <dataType>Grid</dataType>
 <property name="viewer" value="http://data1.gfdl.noaa.gov:8380/lasV7/getUI.do?data_url=http://data1.gfdl.noaa.gov:8380/thredds/dodsC/CM2.1U_CDApf_v1.0_r1Atmos, Visualize with Live Access Server"/>
 <serviceName>ipcc</serviceName>
 <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <aggregation type="union">
 <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <aggregation dimName="time" type="joinExisting" timeUnitsChange="true">
 <scan location="/home/users/rhs/clivar/gfdl_cm2_1/CM2.1U_CDAef_v1.0_apf/r1/pp/atmos/ts/monthly" suffix="tas_*.nc"/>
 </aggregation>
 </netcdf>
 <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <aggregation dimName="time" type="joinExisting" timeUnitsChange="true">
 <scan location="/home/users/rhs/clivar/gfdl_cm2_1/CM2.1U_CDAef_v1.0_apf/r1/pp/atmos/ts/monthly" suffix="zg_*.nc"/>
 </aggregation>
 </netcdf>
 </aggregation>
 </netcdf>
 </dataset>

You should take note of several things about the XML above. First of all, the inner <aggregation> element using type="joinExisting" is where the files are organized along the time axis. As it turns out, these files use different base dates for the time units, so timeUnitsChange="true" is required to let the software know that it must extract the units from each file when computing the date/time values within the time axis of that file.
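
To illustrate what timeUnitsChange="true" has to account for, here is a minimal Python sketch of the underlying arithmetic: moving time values stored as "days since <base date>" onto a common base date. It is not the CDM implementation. The function name rebase_days and the sample time values are hypothetical, and the per-file base date is an assumption chosen to match the file names above.

```python
from datetime import date

def rebase_days(values, file_base, common_base):
    """Convert 'days since file_base' values to 'days since common_base'."""
    offset = (file_base - common_base).days
    return [v + offset for v in values]

# Assumption: the second file counts days from 1983-04-01, while the
# aggregated axis counts days from 1982-04-01.
common_base = date(1982, 4, 1)
second_file_base = date(1983, 4, 1)
second_file_times = [15.0, 45.5, 76.0]  # hypothetical mid-month values

rebased = rebase_days(second_file_times, second_file_base, common_base)
print(rebased)
```

Without this kind of per-file adjustment, the values from every file would be read against the first file's base date and the aggregated axis would repeat itself.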

To further organize the data, we use an outer type="union" aggregation so that both data parameters are available from a single URL. This is not strictly necessary to work with these data within LAS, since each variable can have its own data source URL, but it certainly makes access to the data easier for other clients.

The Complete Aggregation Setup for the Time Axis

We might expect that the setup above would be all that's necessary to prepare each ensemble run, but it turns out that we can do better. In its current implementation, the Unidata CDM library does not take the CF time bounds variable into account when aggregating along the time axis. As a result, in files like these, where the time axis "starts over" with a new base date in each file, the time bounds variable gets aggregated, but its values will be wrong. There are also some problems with the automatic generation of the time values used to construct the virtual time axis in the aggregation. To work around these limitations, we can use NCML to specify exactly which values to use for the time axis and the time bounds variable.

 <dataset ID="CM2.1U_CDAef_v1.0_apf r1 Atmosphere" name="CM2.1U_CDAef_v1.0_apf r1 Atmosphere" urlPath="CM2.1U_CDApf_v1.0_r1Atmos">
 <dataType>Grid</dataType>
 <property name="viewer" value="http://data1.gfdl.noaa.gov:8380/lasV7/getUI.do?data_url=http://data1.gfdl.noaa.gov:8380/thredds/dodsC/CM2.1U_CDApf_v1.0_r1Atmos, Visualize with Live Access Server"/>
 <serviceName>ipcc</serviceName>
 <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <aggregation type="union">
 <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <variable name="time" shape="time" type="float">
 <attribute name="units" value="days since 1982-04-01 00:00:00"/>
 <attribute name="bounds" value="time_bnds"/>
 <attribute name="_CoordinateAxisType" value="Time"/>
 <values start="15" increment="30.5"/>
 </variable>
 <variable name="time_bnds" shape="time bnds" type="float">
 <values>
 0 30 
 30 61 
 61 91 
 91 122 
 122 153 
 153 183 
 183 214 
 214 244 
 244 275 
 275 306 
 306 334 
 334 365 
 365 395
 395 426
 426 456
 456 487
 487 518
 518 548
 548 579
 579 609
 609 640
 640 671
 671 700
 700 731
 731 761
 761 792
 792 822
 822 853
 853 884
 884 914
 914 945
 945 975
 975 1006
 1006 1037
 1037 1065
 1065 1096
 </values>
 </variable>
 <aggregation dimName="time" type="joinExisting" timeUnitsChange="true">
 <scan location="/home/users/rhs/clivar/gfdl_cm2_1/CM2.1U_CDAef_v1.0_apf/r1/pp/atmos/ts/monthly" suffix="tas_*.nc"/>
 </aggregation>
 </netcdf>
 <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <variable name="time" shape="time" type="float">
 <attribute name="units" value="days since 1982-04-01 00:00:00"/>
 <attribute name="bounds" value="time_bnds"/>
 <attribute name="_CoordinateAxisType" value="Time"/>
 <values start="15" increment="30.5"/>
 </variable>
 <variable name="time_bnds" shape="time bnds" type="float">
 <values>
 0 30 
 30 61 
 61 91 
 91 122 
 122 153 
 153 183 
 183 214 
 214 244 
 244 275 
 275 306 
 306 334 
 334 365 
 365 395
 395 426
 426 456
 456 487
 487 518
 518 548
 548 579
 579 609
 609 640
 640 671
 671 700
 700 731
 731 761
 761 792
 792 822
 822 853
 853 884
 884 914
 914 945
 945 975
 975 1006
 1006 1037
 1037 1065
 1065 1096
 </values>
 </variable>
 <aggregation dimName="time" type="joinExisting" timeUnitsChange="true">
 <scan location="/home/users/rhs/clivar/gfdl_cm2_1/CM2.1U_CDAef_v1.0_apf/r1/pp/atmos/ts/monthly" suffix="zg_*.nc"/>
 </aggregation>
 </netcdf>
 </aggregation>
 </netcdf>
 </dataset>

The resulting file looks complicated, but mostly because we are forced to list each value of the time bounds array. If you have a large data collection, this could be quite tedious, and you might have to consider an automated method to generate these values. Using the same name as the existing time_bnds variable causes the values in the aggregation to be replaced by the values we supply in the NCML.
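
One possible way to automate the time bounds list is sketched below in Python. It assumes a standard Gregorian calendar and monthly data starting in April 1982, which reproduces the pairs listed in the NCML above (including the 29-day February of 1984). The function name monthly_bounds is hypothetical; adapt the start date and month count to your own collection.

```python
from datetime import date

def monthly_bounds(start_year, start_month, n_months):
    """Return [(lower, upper), ...] in days since the first day of the start month."""
    base = date(start_year, start_month, 1)
    edges = []
    year, month = start_year, start_month
    # n_months intervals need n_months + 1 month boundaries.
    for _ in range(n_months + 1):
        edges.append((date(year, month, 1) - base).days)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return list(zip(edges[:-1], edges[1:]))

bounds = monthly_bounds(1982, 4, 36)

# Text for the body of the <values> element, one "lower upper" pair per line.
values_text = "\n".join(f"{lo} {hi}" for lo, hi in bounds)
print(values_text)
```

The printed text can be pasted between the <values> tags of the time_bnds variable. If the model uses a non-Gregorian calendar (for example, a no-leap calendar), the month lengths would have to be supplied explicitly instead of being taken from datetime.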

With the time axis, the changes are even simpler. We can specify the times we want by assuming regularly spaced data starting in the middle of the month, with an increment of 30.5 days between values. This will land us near the middle of each month for the axis value, and the time bounds will specify the precise interval covered by each data grid along the time axis. We also used NCML to specify some attribute values for the time axis.
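
As a quick sanity check on this approximation, the following Python sketch compares the regularly spaced values (start 15, increment 30.5) with the month boundaries for the 36 months beginning April 1982. It assumes a Gregorian calendar; the helper name month_edges is hypothetical.

```python
from datetime import date

def month_edges(start_year, start_month, n_months):
    """Return the n_months + 1 month boundaries in days since the start month."""
    base = date(start_year, start_month, 1)
    edges = []
    year, month = start_year, start_month
    for _ in range(n_months + 1):
        edges.append((date(year, month, 1) - base).days)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return edges

n_months = 36
edges = month_edges(1982, 4, n_months)

# Same values that <values start="15" increment="30.5"/> generates.
times = [15 + 30.5 * i for i in range(n_months)]

intervals = list(zip(edges[:-1], edges[1:]))
inside = [lo <= t <= hi for t, (lo, hi) in zip(times, intervals)]
max_offset = max(abs(t - (lo + hi) / 2) for t, (lo, hi) in zip(times, intervals))

print("all values inside their bounds:", all(inside))
print("largest distance from a true mid-month (days):", max_offset)
```

For a three-year record every value stays inside its month. Because the average Gregorian month is slightly shorter than 30.5 days, the values drift slowly over longer records, so it is worth repeating the check for your own time range.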

Once this process is repeated for each of the three runs we can go on to the next step of preparing the ensemble aggregation.

Assembling the Ensemble Runs into an Aggregation

Once the above work is installed into a THREDDS Data Server for each ensemble run, you will have one data access URL for each run. The result can then be further aggregated with an additional axis. In this case you are creating a new axis so the aggregation type will be type="joinNew".

<?xml version="1.0"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
 <dimension name="ensemble" length="3"/>
 <variable name="ensemble" type="double">
 <attribute name="long_name" value="Ensemble of Realizations"/>
 <attribute name="_CoordinateAxisType" value="Ensemble"/>
 <attribute name="axis" value="E"/>
 <attribute name="standard_name" value="realization"/>
 </variable>
 <variable shape="ensemble" name="labels" type="String">
 <attribute name="long_name" value="Realizations"/>
 <values>Realization01 Realization02 Realization03</values>
 </variable>
 <variable name="plev">
 <attribute name="positive" value="down"/>
 </variable>
 <aggregation dimName="ensemble" type="joinNew">
 <variableAgg name="tas"/>
 <variableAgg name="zg"/>
 <netcdf location="http://dunkel.pmel.noaa.gov:8930/thredds/dodsC/CM2.1U_CDApf_v1.0_r1Atmos" coordValue="1"/>
 <netcdf location="http://dunkel.pmel.noaa.gov:8930/thredds/dodsC/CM2.1U_CDApf_v1.0_r2Atmos" coordValue="2"/>
 <netcdf location="http://dunkel.pmel.noaa.gov:8930/thredds/dodsC/CM2.1U_CDApf_v1.0_r3Atmos" coordValue="3"/>
 </aggregation>
</netcdf>

One word of caution: if you are aggregating existing files that have time as the unlimited dimension, you will have to modify the existing time axis using NCML to ask the server not to treat it as the unlimited dimension. Time will no longer be the outer dimension, and otherwise the standard client library will not be able to read the data source.

In the case of this aggregation, we build the actual aggregation axis using a variable of data type double and supply the value of the coordinate when specifying the URL of each aggregated ensemble run. When building the ensemble axis we also supply some attributes, <attribute name="_CoordinateAxisType" value="Ensemble"/>, <attribute name="axis" value="E"/>, and <attribute name="standard_name" value="realization"/> all of which serve to identify this axis as an ensemble axis to various software clients including LAS.
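
If you have many runs, the joinNew aggregation element can be generated rather than typed by hand. The Python sketch below builds only the <aggregation> part of the NCML shown above, assigning coordValue 1, 2, 3, ... in the order the URLs are given. The function name ensemble_ncml is hypothetical, and the coordinate variable attributes, labels, and plev fix are deliberately left out and would still need to be added.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

def ensemble_ncml(run_urls, variables):
    """Build a minimal joinNew aggregation over run_urls for the given variables."""
    # Serialize NcML elements without a namespace prefix.
    ET.register_namespace("", NCML_NS)
    root = ET.Element(f"{{{NCML_NS}}}netcdf")
    agg = ET.SubElement(
        root,
        f"{{{NCML_NS}}}aggregation",
        {"dimName": "ensemble", "type": "joinNew"},
    )
    for name in variables:
        ET.SubElement(agg, f"{{{NCML_NS}}}variableAgg", {"name": name})
    # One <netcdf> per run; the coordinate value is the 1-based run number.
    for number, url in enumerate(run_urls, start=1):
        ET.SubElement(
            agg,
            f"{{{NCML_NS}}}netcdf",
            {"location": url, "coordValue": str(number)},
        )
    return ET.tostring(root, encoding="unicode")

# URLs follow the pattern used in the example above.
urls = [
    f"http://dunkel.pmel.noaa.gov:8930/thredds/dodsC/CM2.1U_CDApf_v1.0_r{n}Atmos"
    for n in (1, 2, 3)
]
ncml_text = ensemble_ncml(urls, ["tas", "zg"])
print(ncml_text)
```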

In addition, the CF standard describes a way to provide text labels for an ensemble axis by creating a character variable with the same dimension as the actual coordinate variable of the axis. If you add such a variable, LAS will use it when building the user interface for interacting with this data set.

Finally, we fix the plev vertical coordinate so that it has the positive="down" attribute, a requirement for CF, so the entire collection can be recognized as a scientific data grid in the CDM software. Now that we have our aggregations built, we can configure LAS to use the ensemble.
