Citizen observations for better local forecasts

Installing the Python package in the repository provides the command-line tool maelstrom-train, which starts a training run based on a configuration defined in YAML files. The configuration includes options for the data loader, models, training, optimization, loss, and output. The exact configuration of each experiment is documented in etc/exp1.yml, etc/exp2.yml, and etc/exp3.yml. Use the following commands to run the experiments:
maelstrom-train --config etc/raw.yml --hardware gpu
maelstrom-train --config etc/exp1.yml --hardware gpu
maelstrom-train --config etc/exp2.yml --hardware gpu
maelstrom-train --config etc/exp3.yml --hardware gpu


This application is being delivered by The Norwegian Meteorological Institute.


The goal is to produce high-resolution (1x1 km) hourly temperature forecasts up to 58 hours in advance for the Nordic countries.


The dataset includes a range of predictors:
Predictor name                                         Unit   Has leadtime?
2m temperature control member                          °C     Yes
2m temperature 10% from ensemble                       °C     Yes
2m temperature 90% from ensemble                       °C     Yes
Cloud area fraction                                    1      Yes
Precipitation amount                                   mm     Yes
10m x wind component                                   m/s    Yes
10m y wind component                                   m/s    Yes
2m temperature bias previous day                       °C     Yes
2m temperature bias at initialization time             °C     No
True altitude                                          m      No
Model altitude                                         m      No
True land area fraction                                1      No
Model land area fraction                               1      No
Standard deviation of analysis at initialization time  °C     No
The MEPS modelling system changes frequently over time as model physics, data assimilation, and ensemble perturbation methods are improved. We chose the 2-year period 2020/03/01 to 2022/02/28, during which the model configuration remained relatively constant. Some dates are missing from the dataset due to gaps in MET Norway's archive. The dataset includes Numerical Weather Prediction (NWP) runs initialized at 03Z, 09Z, 15Z, and 21Z. Including all four forecast initialization times for every day of the two-year period would yield too large a dataset, so only one of the four runs is included for each day, with the provided initialization time alternating from day to day. Covering all four runs is important, as it forces the ML solution to deliver forecasts that work for any forecast initialization time. The data are stored in 675 netCDF files, one per forecast run.
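The alternation can be expressed as a simple cycle over the four initialization times. The exact date-to-run mapping below is our assumption for illustration; the text only states that the provided run alternates from day to day:

```python
RUN_HOURS = [3, 9, 15, 21]  # available initialization times (03Z, 09Z, 15Z, 21Z)

def run_hour_for(day_index: int) -> int:
    # hypothetical alternation scheme: cycle through the four runs, one per day
    return RUN_HOURS[day_index % 4]
```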

Experimental setup

The prediction problem is to predict three percentiles: 10%, 50%, and 90%. The output layer of each ML model therefore has three predictands. This application uses the quantile score as its loss function, as described in Deliverable D1.1.
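For reference, the quantile (pinball) score for a single quantile can be written in a few lines of numpy. How the three per-quantile scores are combined into one loss (here: a plain average) is our assumption:

```python
import numpy as np

def quantile_score(y_true, y_pred, q):
    # pinball loss: penalises under-forecasts by q and over-forecasts by (1 - q)
    e = y_true - y_pred
    return float(np.mean(np.maximum(q * e, (q - 1) * e)))

def total_quantile_loss(y_true, preds, quantiles=(0.1, 0.5, 0.9)):
    # average pinball loss over the three predicted percentiles (assumed combination)
    return sum(quantile_score(y_true, p, q) for p, q in zip(preds, quantiles)) / len(quantiles)
```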
Metric                                            Tier 2 dataset
Data size                                         6.2 TB
Grid size                                         2321 x 1796
Spatial resolution                                1 km
Lead times                                        59 (0, 1, 2, …, 58)
Time period                                       2020/03 - 2022/02
Number of predictors                              14
Predictor size (time, leadtime, y, x, predictor)  675 * 58 * 2321 * 1796 * 14
Target size (time, leadtime, y, x)                675 * 58 * 2321 * 1796
The models are trained on the first 50% of the dataset (2020/03/01 to 2021/02/28). Each experiment is trained for one epoch, i.e. a single pass over the training data, with the dates of the NWP runs read in random order. As a single training sample is large (there are only 365 samples in total), the domain was split into 63 patches of 256x256 points each to increase the number of parameter updates per epoch. This results in a total of 22,806 training batches and parameter updates for one epoch.

A small validation dataset was used to record the quality of the solution as training progresses. A 29 GB subset of the total dataset was used, consisting of 24 dates selected to sample all months of the year and all forecast initialization times. Additionally, the validation used only a subset of the full domain, covering part of the North Sea, southern Norway, and southern Sweden. The validation grid has a size of 512 x 1024, with its southwest corner at x-index 300 and y-index 550. This keeps the validation dataset small enough to fit in main memory while still including a wide variety of weather situations. Validation takes 42 s and is run every time 4 input files have been processed. This incurs a ~12% performance penalty, but allows the testing stage to select the best model parameters.

The model parameters with the best validation loss during training were saved and used for testing. The test dataset consists of the second half of the full dataset (2021/03/01 to 2022/02/28). Each predictor was normalised by subtracting its mean and dividing by its standard deviation; the same normalisation coefficients were used for all grid points, times, and lead times.
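The patch count and the per-predictor normalisation described above can be sketched in numpy (the array values are illustrative):

```python
import numpy as np

# the 2321 x 1796 grid yields 9 x 7 = 63 full 256 x 256 patches
ny, nx, patch = 2321, 1796, 256
n_patches = (ny // patch) * (nx // patch)

# per-predictor normalisation with one mean/std pair shared across
# all grid points, times, and lead times
def normalise(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    return (x - mean) / std
```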

Benchmark models

Four benchmark models are tested. The simplest is a linear regression (LINREG) model of the 14 predictors, implemented as a one-layer neural network with a linear activation function between the inputs and outputs. The second model is a dense neural network (DNN), in which five dense layers, each with five units, are connected by rectified linear unit (ReLU) activation functions; these layers are connected to the three outputs by linear activation functions. Weights are shared across grid points and lead times. The third model is based on 2D convolutional neural network (CNN) layers. The basic architecture consists of three convolutional layers, each with a 3x3 stencil and with 12, 5, and 5 filters respectively. A dense layer connects the convolutional layers to the outputs. ReLU is used as the activation between layers, and a linear activation is used before the output layer. Weights are shared among all lead times. The fourth model is a U-Net (Ronneberger et al., 2015) consisting of 3 levels with 16, 32, and 64 output channels. The model weights are shared among all lead times. The convolution size is 3x3 and the up- and downsampling ratio is 2. Additionally, the models are compared against the raw model output (RAW), which has not been postprocessed.
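As a sanity check on the parameter counts reported below, the 45 parameters of LINREG follow directly from a single dense (fully connected) layer mapping the 14 predictors to the 3 outputs:

```python
def dense_params(n_in: int, n_out: int) -> int:
    # weights plus one bias per output unit
    return n_in * n_out + n_out

linreg_params = dense_params(14, 3)  # 14 predictors -> 3 quantile outputs
```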

Experiment 1: Comparing benchmark models

In this experiment, we compared the performance of RAW, LINREG, DNN, CNN, and U-Net (see Figure 1 and Table 3 below). All models improve the forecast accuracy over RAW by 6-11%. CNN performed similarly to the simpler LINREG and DNN networks, suggesting that the added complexity is not able to extract more complex relationships between the input and output. The more complex UNET model, however, performed better than the other models. We did not analyse the reason for the performance increase, but we suspect it is related to the model's ability to detect large-scale features.

Experiment 2: Comparing leadtime-dependent weights vs. leadtime as a predictor

The prediction task produces forecasts for all 59 lead times simultaneously. We know that some predictors (e.g., bias_recent) are more relevant to early lead times than later ones. To introduce lead time information into the models, we compare two approaches: adding lead time as a (static) predictor field, and adding a locally connected layer at the end, in which each lead time has its own parameters. We tested both on the best model from experiment 1 (U-Net). Adding lead time as a predictor led to a small improvement of the scores for lead times 0-6 h, but no noticeable improvement thereafter. The lead time layer significantly improved scores for lead time 0, but reduced the quality for all subsequent lead times. The overall modest improvements from lead time information were surprising to us, but may indicate that lead time information can be inferred by the model (e.g. the spread of the ensemble is a proxy for lead time).
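The first variant can be sketched in numpy by broadcasting a normalised lead time index to a static field and appending it as an extra predictor channel. The (leadtime, y, x, predictor) layout, the 64x64 sample size, and the 0-1 scaling are our assumptions for illustration:

```python
import numpy as np

x = np.zeros((59, 64, 64, 14), dtype=np.float32)   # hypothetical sample: 59 lead times
lt = np.arange(59, dtype=np.float32) / 58.0        # lead time scaled to [0, 1]
lt_field = np.broadcast_to(lt[:, None, None, None], (59, 64, 64, 1))
x_aug = np.concatenate([x, lt_field], axis=-1)     # lead time as a 15th predictor
```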

Experiment 3: Adding new features

In this experiment, new features were added to the input. These include the x and y grid values, the month of the year, and the hour of the day. We expect these predictors will aid the model in differentiating seasonal, diurnal, and spatially varying biases. We tested this on the model that performed best in experiments 1 and 2 (U-Net). We did not, however, find a noticeable difference in quality when introducing these features.
Figure 1: Test scores for the different experiments as a function of forecast lead time.


Table 3: Parameters, test loss, and improvement over RAW for each model.

Model                                     Parameters  Test loss  Improvement
Experiment 1
RAW                                       0           0.2249 °C  0 %
LINREG                                    45          0.2094 °C  6.89 %
DNN                                       183         0.2067 °C  8.09 %
CNN                                       2,437       0.2091 °C  7.03 %
UNET                                      115,893     0.1995 °C  11.29 %
Experiment 2
UNET leadtime predictor                   116,040     0.1986 °C  11.69 %
UNET leadtime layer                       119,895     0.2043 °C  9.16 %
Experiment 3
UNET leadtime layer and extra predictors  116,628     0.2032 °C  9.65 %
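The improvement column is the relative reduction in test loss with respect to RAW, and can be verified directly from the table:

```python
def improvement(raw_loss: float, model_loss: float) -> float:
    # relative reduction in test loss compared to the raw NWP output, in percent
    return (raw_loss - model_loss) / raw_loss * 100.0

unet_improvement = improvement(0.2249, 0.1995)  # UNET in experiment 1, ~11.29 %
```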


We will investigate several approaches to further improve the predictive accuracy of the models and the computational performance of the training pipeline.

Improving predictive accuracy

The experiments showed that the U-Net is a promising model capable of beating simpler models. We still believe that the model can be further tuned to make better use of the input predictors; for example, adding extra features in experiment 3 did not lead to improved predictions. We will investigate whether the network structure can be altered in a way that better finds the signals that potentially exist in this dataset. The NWP model used in the dataset has known biases that we hope the ML solution will be able to identify. We will investigate whether these are indeed detected, and experiment with different network structures that are better able to detect these biases. We will also properly tune the many hyper-parameters of the U-Net, such as the number of levels and output channels. Since the dataset has a very high resolution (1x1 km), it is possible that the 3 levels in the U-Net are not sufficient to allow large-scale meteorological features to be recognized.
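A rough way to see the limitation: with a downsampling ratio of 2 per level, the coarsest U-Net level still operates on a comparatively fine grid. Counting levels - 1 downsampling steps is our assumption about how the levels are defined:

```python
def coarsest_resolution_km(levels: int, base_km: float = 1.0, ratio: int = 2) -> float:
    # each level below the top halves the resolution (assumed level counting)
    return base_km * ratio ** (levels - 1)

# with 3 levels, the coarsest grid is only 4 x 4 km, which may be too fine
# to aggregate large-scale meteorological features
```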

Improving computational performance

In the experiments, the data loader streams data from the disk to the GPUs. As the model is trained, the next batch of data is fetched and processed in parallel. The processing pipeline for the most expensive model (U-Net, in experiment 3) looks like this:
Diagram of typical processing and training times for one file (10 GB) when training a UNET model on an A100 GPU on JUWELS Booster. The two stages are run in parallel in a pipelined fashion. Since stage 2 is currently faster than stage 1, the pipeline is bound by the data processing step. This is consistent with the analysis in Deliverable 3.4, indicating that I/O is still a main bottleneck for the computing performance of the ML pipeline. Additionally, validation is performed after every 4th file and takes 29 seconds.


The data processing steps are done on the fly to give the ML modeller the flexibility to experiment with different data processing settings. This, however, leaves the pipeline bound by the speed of the data processing rather than the training, so GPU utilisation is far from optimal. We plan to alleviate this problem by parallelizing the code that performs the various data processing steps. We anticipate that, with some work, the pipeline will eventually be bound by the training.
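The two-stage pipeline can be sketched with only the standard library: a background worker prefetches and processes the next file while the current batch is being trained on. The file names and stage functions are placeholders, not the actual maelstrom-train implementation:

```python
import concurrent.futures as cf

def load_and_process(path):
    # stage 1 placeholder: read and process one netCDF file
    return f"processed {path}"

def train_on(batch):
    # stage 2 placeholder: one round of training on the batch
    return batch

files = [f"file_{i}.nc" for i in range(4)]  # hypothetical file names
trained = []
with cf.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(load_and_process, files[0])
    for nxt in files[1:] + [None]:
        batch = future.result()                          # wait for stage 1
        if nxt is not None:
            future = pool.submit(load_and_process, nxt)  # prefetch next file
        trained.append(train_on(batch))                  # stage 2 overlaps with prefetch
```

Because `max_workers=1`, exactly one file is in flight at a time, mirroring the two-stage pipeline in the diagram above.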

We will also investigate distributed training by exploiting data parallelism.