Targets and Truth Data
The RespiNow Hub addresses both nowcasting and short-term forecasting of various epidemiological indicators. These are detailed in a separate entry.
- Nowcasting addresses the statistical correction of partial/preliminary data. It is common for real-time surveillance data to be completed or revised retrospectively. This is because by the time a new version of a surveillance data set is published, not all relevant reports will have been received by the organization curating the data. Later versions of the data set will be updated with the additional information. Nowcasting typically, but not necessarily, corresponds to upward correction of data to account for delayed reports.
- Forecasting concerns the future epidemiological development and thus time points for which not even partial data is currently available.
As submissions are due on Thursdays, but reporting weeks (i.e., the week definition used in RKI's surveillance data) are from Monday through Sunday, we require a convention on how to index the weeks:
- The Sunday preceding the Thursday of submission is horizon 0 and is usually a nowcasting task (i.e., statistical correction of data points which already exist, but may still be subject to revisions).
- The Sunday following the Thursday of submission is horizon 1 and usually a forecasting task (i.e., prediction of data points which are not even partially known).
- The Sundays of previous weeks, indexed by horizons -1, -2, -3, correspond to further nowcasting tasks. Due to the definition of our nowcasting targets (see Section Targets), nowcasts should be submitted back to horizon -3.
- The Sundays of following weeks, indexed by horizons 2, 3, ... correspond to further forecasting tasks. Forecasts are typically only feasible a few weeks into the future and we are not planning to display forecasts more than 4 weeks into the future (more likely focusing on 2 weeks ahead).
Note that for some data sources (notably SurvStat) the data set available on Thursday already contains partial data for the ongoing week, i.e., contains a strongly incomplete data point which we label with the date of the following Sunday. Horizon 1 can then be treated like a nowcasting problem, even though the target_end_date is in the future.
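The week indexing convention above can be sketched in a few lines of Python. This is only an illustration, not code from the Hub: the function name `target_end_date` is borrowed from the field name mentioned above, and the submission date used in the example is an arbitrary Thursday.

```python
from datetime import date, timedelta


def target_end_date(submission_date: date, horizon: int) -> date:
    """Return the Sunday that labels the week at the given horizon."""
    # date.weekday() counts Monday as 0, so Thursday is 3.
    if submission_date.weekday() != 3:
        raise ValueError("submission_date must be a Thursday")
    # Horizon 0 is the Sunday preceding the submission Thursday (4 days earlier).
    horizon_zero = submission_date - timedelta(days=4)
    # Every further horizon shifts the Sunday by one full week.
    return horizon_zero + timedelta(weeks=horizon)


submission = date(2023, 11, 16)  # an arbitrary Thursday, for illustration only
for horizon in range(-3, 5):
    print(horizon, target_end_date(submission, horizon))
```

For this submission date, horizon 0 maps to Sunday 2023-11-12 and horizon 1 to Sunday 2023-11-19.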
In previous work on COVID-19 hospitalization nowcasting (see discussion in Wolffram et al. 2023) we realized that there can be pitfalls in defining the exact prediction target in a nowcasting exercise. The reason is that data revisions can in principle continue indefinitely, and it is unclear which later version of the data represents the actual target and should be used for evaluation. To ensure that the modelling task is well defined, we therefore limit the maximum time over which data can be corrected. For all targets we will consider the data as published 4 weeks after the respective Sunday as final (or, to be more precise, the data published on the Thursday 4 weeks and 4 days later). Nowcasts are thus only necessary for horizons 0 through -3. Note that this definition also applies to forecasting targets. Here, too, the goal is to predict what the data will look like after four weeks of time to stabilize.
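A short date calculation shows why nowcasts stop at horizon -3. The sketch below is illustrative only; the helper names `final_version_date` and `is_final` are made up for this example, and the submission date is an arbitrary Thursday.

```python
from datetime import date, timedelta


def final_version_date(target_end: date) -> date:
    """Thursday on which the data for the week ending target_end count as final."""
    # date.weekday() counts Monday as 0, so Sunday is 6.
    if target_end.weekday() != 6:
        raise ValueError("target_end must be a Sunday")
    return target_end + timedelta(weeks=4, days=4)


def is_final(target_end: date, data_version: date) -> bool:
    """True if the data version published on data_version is final for target_end."""
    return data_version >= final_version_date(target_end)


submission = date(2023, 11, 16)  # an arbitrary Thursday, for illustration only
for horizon in range(-4, 1):
    # Horizon 0 is the Sunday 4 days before the submission Thursday.
    sunday = submission - timedelta(days=4) + timedelta(weeks=horizon)
    print(horizon, sunday, is_final(sunday, submission))
```

With these definitions, the week at horizon -4 is already final in the data version published on the submission Thursday, while the weeks at horizons -3 through 0 are not and therefore still need a nowcast.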
In terms of the provided files, the nowcasting / forecasting target corresponds to the contents of the target file or equivalently the row sums of the columns up to value_4w in the reporting_triangle file. See the next section for details.
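As a rough illustration of the row-sum relationship, the following sketch computes the target for a single row of a reporting triangle. It assumes that delay columns follow the pattern `value_<k>w` (inferred from the column names `value_-1w` and `value_4w` mentioned on this page), that the row is complete, and that all other columns can be ignored. The `date` key and the counts are invented for the example.

```python
import re

# Matches delay columns such as value_-1w, value_0w, value_4w (assumed naming).
DELAY_COLUMN = re.compile(r"^value_(-?\d+)w$")


def target_from_triangle_row(row: dict, max_delay: int = 4) -> int:
    """Sum all delay columns of one row up to and including max_delay weeks."""
    total = 0
    for name, value in row.items():
        match = DELAY_COLUMN.match(name)
        if match and int(match.group(1)) <= max_delay:
            total += int(value)
    return total


# Invented example row; the counts are not real data.
row = {
    "date": "2023-10-01",
    "value_-1w": 2,
    "value_0w": 10,
    "value_1w": 5,
    "value_2w": 3,
    "value_3w": 1,
    "value_4w": 1,
    "value_5w": 4,
}
print(target_from_triangle_row(row))  # 22: the value_5w count is excluded
```

Reports arriving with a delay of more than 4 weeks (here the hypothetical `value_5w` column) do not enter the target, which is why the target can be lower than the latest version of the time series.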
For each target we provide four files, here with illustrative links to the SurvStat influenza data:
- latest_data-survstat-influenza.csv - the current version of the time series, which for the last couple of time points may still be subject to revisions. Data are on the scale of absolute counts.
- reporting_triangle-survstat-influenza.csv - the raw reporting triangle with counts stratified by reporting delay. Note that due to the indexing of weeks, for some data sources (like SurvStat), the reporting triangle will contain a column value_-1w with a negative reporting delay (i.e., values which have already been reported prior to the Thursday of the respective week). Note that this raw reporting triangle is obtained simply by computing increments between subsequent data versions. It is thus possible that occasionally negative entries occur, even if data are usually only corrected upwards. This can break modelling codes.
- reporting_triangle-survstat-influenza-preprocessed.csv - a pre-processed reporting triangle where negative values have been re-distributed to preceding weeks with positive values. For instance, a sequence of values (3, 2, -1) would become (3, 1, 0) and (3, 1, -2) would become (2, 0, 0).
- target-survstat-influenza.csv - as detailed above, the nowcasting target is defined as all reports up to a delay of 4 weeks. This file contains a simple time series restricted to these reports (incomplete for the last three weeks). This corresponds to the row sums of the columns up to value_4w in the reporting_triangle file. Values are typically slightly lower than in the latest_data file.
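The re-distribution of negative entries described for the pre-processed reporting triangle can be sketched as follows. This is one possible reading that reproduces the two examples given above; it is not taken from the Hub's preprocessing code, which may differ in detail. The function name `redistribute_negatives` is hypothetical.

```python
def redistribute_negatives(increments):
    """Push negative entries backwards onto the preceding entries.

    Walks the sequence from the last entry to the second one. Whenever an
    entry is negative, its value is added to the entry before it and the
    entry itself is set to zero. The total of the sequence is unchanged.
    """
    values = list(increments)
    for i in range(len(values) - 1, 0, -1):
        if values[i] < 0:
            values[i - 1] += values[i]
            values[i] = 0
    return values


print(redistribute_negatives([3, 2, -1]))  # [3, 1, 0]
print(redistribute_negatives([3, 1, -2]))  # [2, 0, 0]
```

Note that in this sketch the first entry can still end up negative if the negative corrections exceed everything reported before them. How such cases are handled is not specified here.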