Big Data

A Deep Learning-based Spatio-temporal NDVI Data Fusion Model

  • SUN Ziyu 1, 2
  • OUYANG Xihuang 1, 2
  • LI Hao 3
  • WANG Junbang 1, 2, *
  • 1. National Ecosystem Science Data Center, Key Laboratory of Ecosystem Network Observation and Modeling, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  • 2. University of Chinese Academy of Sciences, Beijing 100049, China
  • 3. Xiongan Institute of Innovation, Baoding, Hebei 071899, China
*WANG Junbang, E-mail:

Received date: 2023-05-14

  Accepted date: 2023-07-30

  Online published: 2023-12-27

Supported by

The National Natural Science Foundation of China(31971507)

The National Natural Science Foundation of China(31861143015)

The Joint Research Project of the People’s Government of Qinghai Province and Chinese Academy of Sciences(LHZX-2020-07)


Satellite remote sensing provides information on changes of the Earth's surface over large spatial scales and long time series, and it has been widely used in ecology. However, the impacts of human activities generally occur at smaller spatial scales and can be detected only over longer periods, which requires remote sensing data with both high spatial and high temporal resolution. The development of spatio-temporal data fusion algorithms provides an opportunity to meet this requirement. In this paper, we propose a deep learning-based residual convolutional neural network (Res-CNN) model with a new network architecture to fuse the NDVI retrievals from Landsat 8 and MODIS images. Experiments conducted in two different areas demonstrate improvements over existing algorithms. Model performance was evaluated by a linear regression between predictions and observations and quantified by the coefficient of determination (R2) and the regression coefficient (slope). Two well-established models, ESTARFM and FSDAF, were compared with the new model. The results showed that the predicted NDVI explained more of the variability in the Landsat-based NDVI, with R2 values of 0.768 and 0.807 at the urban and grassland sites, respectively. The predicted NDVI was highly consistent with the observations, with slopes of 1.01 and 0.989 and R-RMSE values of 95.76% and 93.58% at the urban and grassland sites, respectively. This study demonstrated that the Res-CNN model exhibits higher accuracy and stronger robustness than the traditional models. This research has important implications because it not only provides a model for spatio-temporal data fusion, but can also provide long time series data for the management and utilization of agricultural and grassland ecosystems at the regional scale.

Cite this article

SUN Ziyu , OUYANG Xihuang , LI Hao , WANG Junbang . A Deep Learning-based Spatio-temporal NDVI Data Fusion Model[J]. Journal of Resources and Ecology, 2024 , 15(1) : 214 -226 . DOI: 10.5814/j.issn.1674-764x.2024.01.019

1 Introduction

Normalized Difference Vegetation Index (NDVI) data at high spatial and temporal resolutions are required for making advancements in both ecological research and vegetation management (Gao et al., 2006; Zurita-Milla et al., 2008; Hilker et al., 2009; Wu et al., 2012a; Abadi et al., 2016; Wei et al., 2016; Wu et al., 2016; Zhu et al., 2017; Song et al., 2018; Tan et al., 2019). Building long-term time series and high spatio-temporal resolution satellite NDVI data can facilitate the more efficient management and use of remote sensing technology in precision agriculture, ecological protection and grassland resources; hence it has scientific and practical significance in the development and application of data fusion models.
Satellite remote sensing offers great potential to monitor and analyze terrestrial vegetation at various spatial and temporal scales owing to its synoptic coverage and repetitive measurements. However, in satellite sensor design, a tradeoff must be made between spatial, spectral, and temporal resolutions. Therefore, the requirement for data with both high frequency and high spatial resolution cannot be satisfied in most practical settings.
Although satellite-derived NDVI is the only viable option for such monitoring, satellite data are subject to tradeoffs between spatial and temporal resolution, and no readily available conventional satellite data achieve both high spatial and high temporal resolution. For example, MODIS offers daily coverage, but its spatial resolution is 500 m, which results in an unacceptable level of mixed pixels and uncertainty for many applications. Landsat provides a finer spatial resolution of 30 m, but its revisit interval is 16 days, and Landsat sensors are often unable to collect good-quality data because of cloud contamination and mechanical failures (e.g., the Scan Line Corrector failure of Landsat 7). To overcome these obstacles, fusion techniques, a fundamental task in remote sensing imagery, are emerging as a powerful way to obtain images that mitigate or transcend the individual limitations of the input datasets and therefore produce products with simultaneously high temporal and spatial resolution.
The spatial and temporal adaptive reflectance fusion model (STARFM) is known as the first published spatio-temporal data fusion model (Gao et al., 2006). It has been widely applied for monitoring environmental changes (Wu et al., 2012a; Gevaert and García-Haro, 2015). The model has since been substantially enhanced and refined, and a series of improved versions have been proposed (Hilker et al., 2009; Zhu et al., 2010; Zhu et al., 2016; Zhu et al., 2018). Other approaches include data fusion based on pixel reconstruction (Gevaert and García-Haro, 2015), wavelet transform (Acerbi-Junior et al., 2006), sparse representation and dictionary learning (Wei et al., 2015; Wei et al., 2016), mixed-pixel decomposition (Wu et al., 2012b; Wu et al., 2016), and filter-based intensity modulation (Liu et al., 2018). Owing to their open-source availability and strong predictive performance, STARFM and its improved versions have become widely used models in spatio-temporal data fusion (Emelyanova et al., 2013). However, they often require at least one pair of lower-temporal-higher-spatial resolution data (LTHS, such as Landsat) and higher-temporal-lower-spatial resolution data (HTLS, such as MODIS) acquired on the same date as input. STARFM also assumes that the temporal changes of all land cover classes within a coarse pixel are consistent, which makes it suitable only for homogeneous landscapes such as large-area croplands (Das and Ghosh, 2016). Moreover, Zhu et al. (2016) proposed a hybrid method named Flexible Spatiotemporal DAta Fusion (FSDAF), which integrates ideas from unmixing-based approaches, spatial interpolation and STARFM into one framework. FSDAF can improve spatio-temporal fusion accuracy in areas with heterogeneous landscapes and abrupt land-cover changes using only one pair of images as input, which effectively reduces the amount of data required and makes it easier to implement.
However, the relationship between the number of input image pairs and the quality of data fusion has not been clearly established (Zhu et al., 2016).
Because surface heterogeneity and abrupt changes in land cover types introduce errors into vegetation studies, these classical spatio-temporal methods are not always the preferred tool. In recent years, with the development of computer technology, improved data processing capability, and the increasing availability of remote sensing images, taking full advantage of these data to improve the quality of data fusion has become an inevitable direction for research (Das and Ghosh, 2016). In addition, deep learning provides new opportunities and methods to advance the performance of data fusion models (Masi et al., 2016; Yuan et al., 2018).
Deep learning is an emerging data-driven method that transforms the representation of the input data into higher-dimensional, finer feature-level representations, and can thereby predict data at various spatial scales (LeCun et al., 2015; Schmidhuber, 2015). As a representative model, the convolutional neural network (CNN) has been widely used in remote sensing in recent years (Masi et al., 2016; Liu et al., 2017; Song et al., 2018; Tan et al., 2018; Tan et al., 2019). It was first applied to spatio-temporal data fusion in 2018 (Song et al., 2018; Tan et al., 2018). Song et al. (2018) proposed a hybrid method (STFDCNN) for spatio-temporal image fusion, which first establishes a non-linear mapping between LTHS and HTLS images using a CNN, and then constructs a super-resolution network (SRCNN) by learning from the previously obtained LTHS images and the HTLS images at the prediction moment. Finally, the HTHS images at the prediction moment are obtained by high-pass filtering. Tan et al. (2018), based on the first law of geography, applied CNN directly to develop the deep convolutional spatio-temporal fusion network (DCSTFN). The HTLS and LTHS images are used for training to extract relevant features, from which the features at the prediction moment are obtained according to the first law of geography; the LTHS image at the prediction moment is then restored by deconvolution. Although CNN-based spatio-temporal data fusion models have proven more promising and efficient than traditional methods, their application is still at a preliminary exploratory stage and faces some issues (Wei et al., 2023).
CNN-based spatio-temporal data fusion models, however, face the following challenges. First, it is crucial to select appropriate LTHS reference images in spatio-temporal fusion, because all the detailed high-frequency information comes from them; predictions are therefore necessarily affected by their references, causing fusion results to resemble the references to some degree. The problem can be much worse when there are significant ground changes between the reference and prediction dates. These problems should be resolved or mitigated so that the quality of predicted images can be further improved. Second, images predicted by CNN models with feature-level fusion are not as sharp and clear as actual observations. The convolutional network minimizes its loss to make predictions as close to the ground truth as possible, so errors are balanced across pixels to reach a global optimum. Moreover, practice indicates that the activation function and network structure can affect the quality and efficiency of image reconstruction and make blurry outputs more likely (Tan et al., 2018; Belgiu and Stein, 2019; Tan et al., 2019; Zhang et al., 2019b). The activation function in a CNN model generally favors the globally optimal result in order to capture as much real information as possible, which can induce uncertainty in specific areas, as in traditional models (Zhao et al., 2016).
To date, a standard methodology for exploiting deep learning to achieve significant performance gains in spatio-temporal fusion is still lacking, and only a few studies have investigated the use of CNN-based image super-resolution approaches in the spatio-temporal fusion of remotely sensed images (Shin et al., 2016; Song et al., 2018; Wang and Wang, 2020).
In this paper, we construct a new Res-CNN model and use it to produce NDVI data from MODIS and Landsat 8 Operational Land Imager (OLI) images. The Res-CNN model introduces a new network structure and activation function, which reduce training difficulty and improve performance. Specifically, this paper aims: 1) to demonstrate the superior fusion performance of the proposed method over a set of contemporary traditional spatio-temporal image fusion methods; and 2) to generate and evaluate fused Landsat-MODIS NDVI images at a 30 m resolution in the Daxing and AHB areas. Ultimately, our objective with this new model is to develop spatio-temporal data fusion methods for multi-source satellite sensors and to provide data with both high spatial and high temporal resolution for ecosystem research and management.

2 Study area, satellite data, and synthetic datasets

A partial dataset from the literature (Kastrati et al., 2019; Misra, 2020) is used as the research data in this paper. It comprises a 10-pair Landsat-MODIS dataset (hereafter referred to as the Daxing Dataset) for Daxing District, Beijing (39°00ʹ03ʺN, 115°05ʹ55ʺE), and a 10-pair Landsat-MODIS dataset (hereafter referred to as the AHB dataset) for Ar Horqin Banner (43°21ʹ43ʺN, 119°02ʹ15ʺE), located in northeastern China. The Daxing Dataset was collected from September 2013 to January 2019. Beijing Daxing International Airport was built between December 2014 and September 2019 within this site, so the dataset represents gradual land cover change and also contains significant physical changes in plant growth. The AHB dataset was collected from September 2013 to January 2019.
The LTHS images of the Daxing Dataset are Landsat 8 OLI images with a spatial resolution of 30 m, and the corresponding HTLS images are from the MODIS product MOD09A1 with a spatial resolution of 500 m. In theory, daily MODIS data are more strongly correlated with Landsat 8 OLI data than 8-day MODIS data, so using daily data should yield a better fusion. In practice, however, the 8-day MODIS data gave better results than the daily data, probably because the 8-day composite removes clouds as far as possible, fills missing data, and retains the best available value for each pixel. Hence MOD09A1, the 8-day MODIS product, was chosen for this experiment.
The preprocessing of the raw data includes radiometric calibration, geometric correction, atmospheric correction, and projection transformation. The two data sources are then re-projected into a unified Universal Transverse Mercator (UTM) projection and cropped to the same spatial extent, yielding remote sensing image pairs of 1640×1640 pixels at a spatial resolution of 30 m for model input and accuracy evaluation. Both MODIS and Landsat data are resampled to a resolution of 30 m using the nearest neighbor method.
Fig. 1 The NDVI from the Daxing Dataset
At the current stage, there are two strategies for the spatio-temporal data fusion of NDVI: fusion at the band reflectance level and fusion at the NDVI level. The former fuses the reflectance of each band and then calculates NDVI, whereas the latter first calculates NDVI from the band reflectances and then fuses the NDVI (Ao et al., 2021). Fusion at the NDVI level is considered more accurate, requires less computation, and makes it easier to eliminate noise in NDVI in practical applications (Tian et al., 2013; Chen et al., 2015), so the NDVI-level fusion strategy is more suitable.
Fig. 2 The NDVI from the AHB Dataset
NDVI data at the corresponding times are generated from the band reflectances of the collected Landsat-MODIS dataset. NDVI is calculated as follows:
$NDVI=\frac{{{B}_{NIR}}-{{B}_{RED}}}{{{B}_{NIR}}+{{B}_{RED}}}$
where BNIR is the near-infrared reflectance of the remote sensing image and BRED is the red reflectance of the remote sensing image.
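As a minimal sketch of this calculation, NDVI = (BNIR − BRED) / (BNIR + BRED) can be evaluated pixel by pixel as below. The function names and the small reflectance arrays are hypothetical and chosen only for illustration; they are not taken from the Daxing or AHB data.

```python
def ndvi(nir, red):
    """Return NDVI for one pixel from near-infrared and red reflectance."""
    denom = nir + red
    if denom == 0:
        # NDVI is undefined when both reflectances are zero
        return float("nan")
    return (nir - red) / denom


def ndvi_image(nir_band, red_band):
    """Apply ndvi() to two equally sized 2-D lists of reflectance values."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2x2 reflectance bands (illustrative values only)
nir_band = [[0.45, 0.30], [0.05, 0.0]]
red_band = [[0.05, 0.10], [0.05, 0.0]]
ndvi_result = ndvi_image(nir_band, red_band)
```

Dense vegetation gives values close to 1, bare surfaces give values near 0, and the zero-denominator case is returned as NaN so that it can be masked later.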
Table 1 Description of the Daxing Dataset and the AHB Dataset
| Dataset | Image size (pixels) | Pairs | Timespan |
| --- | --- | --- | --- |
| Daxing | 1640×1640 | 10 | 2013/09/01-2019/01/21 |
| AHB | 2800×2480 | 12 | 2015/09/25-2018/10/03 |

3 Methods

3.1 Convolutional neural networks (CNN)

Originally, CNNs were applied to extract high-level features in image classification and recognition tasks. Their applications were later extended to image super-resolution and data fusion, with a direct mapping between input(s) and output (Dong et al., 2015; Liu et al., 2017). Currently, the applications of CNNs to image fusion are being actively explored (Chen et al., 2023). For example, a CNN model was used effectively to combine images of the same scene taken with different focus settings to obtain a sharper image (Liu et al., 2017). Applying CNN models to pansharpening of remote sensing imagery improves the accuracy of the sharpening, and it has been demonstrated that leveraging prior knowledge of remote sensing images can significantly enhance the efficacy of the approach (Masi et al., 2016; Wei et al., 2017).
Generally, a CNN architecture consists of a number of convolutional layers, activation functions, pooling layers and fully connected layers. A convolutional layer acts as a “filter” in the network, and its output is called a “feature map”. The activation function is very important for feature extraction and for the non-linear expression of features. Therefore, some criteria should be considered when a CNN is applied to the spatio-temporal fusion of remote sensing data. First, the gradient of the activation function should not approach 0 as the input tends to infinity; otherwise the gradient will vanish and the network will degenerate (Tan et al., 2019). Second, the function should tolerate some inflow of negative gradients; otherwise the non-linear expression of the features cannot be captured. In addition, a continuous function is more robust.
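The two criteria above can be illustrated numerically with the Mish activation that the FineNet sub-module uses, Mish(x) = x·tanh(softplus(x)), compared with ReLU. This is only a sketch: the helper names and the finite-difference gradient are illustrative choices, not part of the Res-CNN code.

```python
import math


def softplus(x):
    # Numerically stable form of log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    # Mish(x) = x * tanh(softplus(x)); smooth and non-zero for negative x
    return x * math.tanh(softplus(x))


def relu(x):
    return max(x, 0.0)


def numerical_grad(fn, x, eps=1e-6):
    # Central finite difference, used only to inspect the slope
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


# Gradients for a negative input and for a large positive input
grad_mish_neg = numerical_grad(mish, -0.5)
grad_relu_neg = numerical_grad(relu, -0.5)
grad_mish_large = numerical_grad(mish, 20.0)
```

For a negative input the ReLU gradient is exactly zero, whereas the Mish gradient is not, so some negative information still flows backward. For large positive inputs the Mish gradient stays close to 1 instead of decaying toward 0.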

3.2 The proposed method for CNN based spatiotemporal fusion

CNNs were originally proposed for image recognition tasks involving categorical variables, such as image classification and similar-image search. Recently, CNNs have been adapted to reconstruct missing images with high accuracy in spatio-temporal data fusion (Tan et al., 2018), which has opened the door for a wide range of CNN-based spatio-temporal image fusion methods (Wei et al., 2023).
In this context, SRCNN and DCSTFN use three-layer neural networks to downscale coarse images to high spatial resolution (Song et al., 2018; Tan et al., 2018). Both have specific limitations that were reported in previous research (Song et al., 2018; Tan et al., 2018). Mainly, they have a very basic architecture, whereas deep learning theory suggests that deeper networks perform better. However, as the network deepens, gradient vanishing and gradient explosion often occur (Krizhevsky et al., 2017; Zhang et al., 2019a), as reported in previous research (Song et al., 2018). These problems often reduce the quality of the reconstructed image, slow the convergence of the network, and make training longer and more difficult. These limitations greatly restrict performance.
To overcome this drawback, CNN-based follow-ups have built deeper networks while avoiding gradient vanishing and gradient explosion. We build a new model named Res-CNN, which stacks more convolutional layers into a more complex structure to yield more accurate inference. The overall framework is shown in Fig. 3.
Fig. 3 The framework of Res-CNN developed in this study

Note: @ is a separator between the number of feature maps and the size of feature maps; * is the separator between the width and height of the feature map; 3×3 represents the size of the convolutional kernel.

3.2.1 Realization of the model

The Res-CNN model is realized through three sub-modules, named FineNet, CoarseNet and ReconstructNet. These extract features from the input Landsat and MODIS images at time t0 and then predict the Landsat image at time t1 from the MODIS image at t1 (Fig. 3).

3.2.2 FineNet sub-module

The FineNet sub-module extracts features using a residual network skip-connection structure and spatial pyramid pooling (SPP). Each convolution in the residual module is followed by Batch Normalization and Mish activation. The model uses small 3×3 kernels for all convolutions, with a stride of 2 for feature map reduction and a padding of 1. The SPP contains three pooling operations with sizes of 5, 9, and 13, and the pooled results are combined to form new features. In Fig. 3, the number before “@” represents the number of channels (for example, in 3@W*H, “3” denotes 3 channels), “W” denotes the width and “H” denotes the height. The numbers before “W” and “H” are multiples of the width and height (Fig. 4).
Fig. 4 The framework of FineNet sub-module in this study

Note: @ is a separator between the number of feature maps and the size of feature maps; * is the separator between the width and height of the feature map.

3.2.3 CoarseNet sub-module

The CoarseNet sub-module performs convolution and deconvolution. The deconvolution expands the image so that it matches the size of the higher-resolution image, and each convolution in this network is followed by Batch Normalization and LeakyReLU activation. The convolutions use small 3×3 kernels; the deconvolution used for feature map expansion has a stride of 2, and its padding and output padding are both set to 1 (Fig. 5).
Fig. 5 The framework of CoarseNet sub-module in this study

Note: @ is a separator between the number of feature maps and the size of feature maps; * is the separator between the width and height of the feature map; 3×3 represents the size of the convolutional kernel.
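The kernel, stride and padding settings quoted for FineNet (3×3 kernel, stride 2, padding 1) and for the CoarseNet deconvolution (3×3 kernel, stride 2, padding 1, output padding 1) imply that a stride-2 convolution halves the feature map and the deconvolution doubles it. The sketch below checks this with the standard output-size formulas for convolution and transposed convolution (dilation 1 assumed). The 1640-pixel input is taken from the Daxing image size purely as an example; whether the network processes whole images or patches is not stated here.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    # Standard convolution output size (dilation = 1)
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out_size(size, kernel=3, stride=2, padding=1, output_padding=1):
    # Standard transposed-convolution output size (dilation = 1)
    return (size - 1) * stride - 2 * padding + kernel + output_padding


side = 1640                           # example side length in pixels
reduced = conv_out_size(side)         # 820: stride-2 convolution halves the map
restored = deconv_out_size(reduced)   # 1640: deconvolution doubles it again
```

With an output padding of 1 the deconvolution exactly inverts the size change of the stride-2 convolution for even-sized inputs, which is what allows coarse and fine feature maps to be matched in size.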

3.2.4 ReconstructNet sub-module

The ReconstructNet sub-module predicts the Landsat image at the next moment (Fig. 6). This network uses deconvolution and fully connected layers, and ReLU activation is applied after the convolution.
Fig. 6 The framework of ReconstructNet sub-module in this study

Note: @ is a separator between the number of feature maps and the size of feature maps; * is the separator between the width and height of the feature map; 3×3 represents the size of the convolutional kernel.

3.3 Experimental design

The Res-CNN model is implemented in Python using the deep learning framework PyTorch, an open-source machine learning library that enables GPU acceleration, supports dynamic neural networks, and provides a Python application programming interface (API), which is convenient for model development (Jarihani et al., 2014). The number of features in the Res-CNN model can be set freely, which makes the model more flexible to use. If the number of features is too large, training will consume a great deal of time; if it is too small, the model will not be able to extract enough image feature information for training. The optimizer chosen for network training is Adam (Tian et al., 2013), which iteratively updates the neural network weights based on the training data and performs well on non-stationary and non-linear fitting problems (Chen et al., 2015). The learning rate starts from 1×10⁻⁴ and is adjusted according to a cosine annealing strategy. The hyper-parameters required for the model are provided in Appendix A.
In the model training phase, the input dataset (10 pairs of NDVI images of the same area) is placed in a designated folder according to specific naming rules. Each image is trained for 100 iterations. In general, the larger the convolution kernel, the more image information can be obtained, the better the features that are acquired, the higher the accuracy of the model, and the better the fusion effect. The specific settings can be modified according to the hardware conditions.
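To make the learning-rate schedule concrete, the following sketch evaluates the usual cosine annealing formula, lr(t) = lr_min + (lr_max − lr_min)·(1 + cos(πt/T))/2, with the stated starting rate of 1×10⁻⁴. The minimum rate of 0 and the period T = 100 (matching the 100 iterations mentioned above) are assumptions made for illustration; the actual values are those listed in the paper's Appendix A.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=0.0):
    """Learning rate at a given step under a single cosine annealing cycle."""
    cos_term = (1 + math.cos(math.pi * step / total_steps)) / 2
    return lr_min + (lr_max - lr_min) * cos_term


# Assumed period of 100 steps; lr_min = 0 is also an assumption
total_steps = 100
schedule = [cosine_annealing_lr(t, total_steps) for t in range(total_steps + 1)]
```

The rate starts at 1×10⁻⁴, falls slowly at first, fastest around the midpoint, and levels off near the end, which tends to give stable final updates.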

3.4 Model accuracy assessment

The fusion results were compared with the actual Landsat 8 OLI image on 29 April 2016 in the Daxing Dataset and on 29 April 2018 in the AHB dataset, using both qualitative and quantitative assessments to check the fused NDVI products and summarize their performance differences. We selected 1500 sample points from the study area and compared the NDVI values of the predicted Landsat-like NDVI image against those of the true image for the prediction date using scatter plots, which provide an intuitive view of the difference between the fused and actual NDVI values.
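A sketch of this sampling-and-regression step is given below: 1500 pixels are drawn at random, and the predicted values are regressed on the actual values by ordinary least squares to obtain the slope and intercept of the scatter. The NDVI values are synthetic (a known linear relation with no noise), and the function names and random seeds are hypothetical.

```python
import random


def sample_pixels(pred, actual, n_samples, seed=0):
    """Draw the same random pixel indices from both flattened images."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(actual)), n_samples)
    return [pred[i] for i in idx], [actual[i] for i in idx]


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic flattened images: predicted = 0.98 * actual + 0.01
rng = random.Random(42)
actual_ndvi = [rng.uniform(0.0, 0.9) for _ in range(10000)]
pred_ndvi = [0.98 * a + 0.01 for a in actual_ndvi]

sample_pred, sample_actual = sample_pixels(pred_ndvi, actual_ndvi, 1500)
slope, intercept = linear_fit(sample_actual, sample_pred)
```

A slope close to 1 and an intercept close to 0 indicate that the fused NDVI neither systematically over- nor under-estimates the observed NDVI.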
We used several metrics for the quantitative assessment, because different fusion metrics have their limitations and each can reveal only some aspects of fused image quality (Lee et al., 2018). These include the coefficient of determination (R2), root mean square error (RMSE), structural similarity index (SSIM), spectral angle mapper (SAM), and Kling-Gupta efficiency (KGE).
The coefficient of determination (R2) measures the degree of fit of the regression function between the observed data and the predicted data. It is defined as follows:
${{R}^{2}}=1-\frac{\sum\limits_{(i,j)\in I}^{N}{{{({{f}_{\Delta }}({{x}_{i}},{{y}_{j}})-f({{x}_{i}},{{y}_{j}}))}^{2}}}}{\sum\limits_{(i,j)\in I}^{N}{{{(f({{x}_{i}},{{y}_{j}})-f(\overline{{{x}_{i}}},\overline{{{y}_{j}}}))}^{2}}}}$
where ${{f}_{\Delta }}({{x}_{i}},{{y}_{j}})$ and $f({{x}_{i}},{{y}_{j}})$ are the values of the predicted image and the actual image, respectively, $f(\overline{{{x}_{i}}},\overline{{{y}_{j}}})$ is the mean of the actual image, and N is the number of pixels. The value of R2 is usually in the range of 0 to 1, and a value closer to 1 indicates a better prediction.
Root mean square error (RMSE) is a common measure of the deviation between the predicted values and the actual values. It is defined as follows:
$RMSE\text{=}\sqrt{\frac{\sum\limits_{(i,j)\in I}{{{({{f}_{\Delta }}({{x}_{i}},{{y}_{j}})-f({{x}_{i}},{{y}_{j}}))}^{2}}}}{N}}$
A smaller value of RMSE means that the predicted value is closer to the actual value, and the model fusion is better.
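The two metrics can be sketched in a few lines for flattened images. Here R2 is computed as one minus the residual sum of squares divided by the total sum of squares about the mean of the actual image, and RMSE as the square root of the mean squared difference. The four-pixel example values are hypothetical.

```python
import math


def rmse(pred, actual):
    """Root mean square error between predicted and actual pixel values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / n)


def r_squared(pred, actual):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((p - a) ** 2 for p, a in zip(pred, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical flattened NDVI values (illustrative only)
actual = [0.2, 0.4, 0.6, 0.8]
pred = [0.25, 0.35, 0.65, 0.75]

example_rmse = rmse(pred, actual)      # each pixel is off by 0.05
example_r2 = r_squared(pred, actual)   # 1 - 0.01 / 0.2
```

In this toy case every pixel deviates by 0.05, so the RMSE is 0.05 and R2 is 0.95.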
Kling-Gupta efficiency (KGE), an index for estimating the predictive ability of the model, is used to measure the degree of influence of the model input on the model output. It is defined as follows:
$KGE=1-\sqrt{{{(r-1)}^{2}}+{{\left( \frac{\sigma {{f}_{\Delta }}({{x}_{i}},{{y}_{j}})}{\sigma f({{x}_{i}},{{y}_{j}})}-1 \right)}^{2}}+{{\left( \frac{{{f}_{\Delta }}(\overline{{{x}_{i}}},\overline{{{y}_{j}}})}{f(\overline{{{x}_{i}}},\overline{{{y}_{j}}})}-1 \right)}^{2}}}$
where r is the correlation coefficient between the predicted and actual values, $\sigma {{f}_{\Delta }}({{x}_{i}},{{y}_{j}})$ and $\sigma f({{x}_{i}},{{y}_{j}})$ denote the standard deviations of the predicted and actual values, and ${{f}_{\Delta }}(\overline{{{x}_{i}}},\overline{{{y}_{j}}})$ and $f(\overline{{{x}_{i}}},\overline{{{y}_{j}}})$ denote the means of the predicted and actual values. The closer the value of KGE is to 1, the better the fusion effect.
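A minimal sketch of the KGE calculation follows, combining the correlation coefficient, the ratio of standard deviations and the ratio of means exactly as in the three terms of the definition. The helper names and sample values are hypothetical, and population (divide-by-N) statistics are assumed.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    # Population standard deviation (divide by N)
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r(pred, actual):
    mp, ma = mean(pred), mean(actual)
    cov = sum((p - mp) * (a - ma) for p, a in zip(pred, actual)) / len(actual)
    return cov / (std(pred) * std(actual))


def kge(pred, actual):
    """Kling-Gupta efficiency from correlation, variability and bias ratios."""
    r = pearson_r(pred, actual)
    alpha = std(pred) / std(actual)    # variability ratio
    beta = mean(pred) / mean(actual)   # bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical values: a perfect prediction and one that doubles every pixel
actual = [0.2, 0.4, 0.6, 0.8]
kge_perfect = kge(actual, actual)
kge_doubled = kge([2 * a for a in actual], actual)
```

The doubled prediction is perfectly correlated with the actual values, yet its KGE is 1 − √2 because both the variability ratio and the bias ratio equal 2. This shows that KGE penalizes scale and bias errors that correlation alone would miss.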
The structural similarity index (SSIM) measures the similarity of two images and is used here to evaluate the similarity between the actual image and the predicted image. It is defined as follows:
$SSIM=\frac{[2{{f}_{\Delta }}(\overline{{{x}_{i}}},\overline{{{y}_{j}}})f(\overline{{{x}_{i}}},\overline{{{y}_{j}}})+{{C}_{1}}][2{{\sigma }_{{{f}_{\Delta }}f}}+{{C}_{2}}]}{[{{f}_{\Delta }}{{(\overline{{{x}_{i}}},\overline{{{y}_{j}}})}^{2}}+f{{(\overline{{{x}_{i}}},\overline{{{y}_{j}}})}^{2}}+{{C}_{1}}][\sigma {{f}_{\Delta }}{{({{x}_{i}},{{y}_{j}})}^{2}}+\sigma f{{({{x}_{i}},{{y}_{j}})}^{2}}+{{C}_{2}}]}$
where ${{\sigma }_{{{f}_{\Delta }}f}}$ denotes the covariance between the predicted and actual values, $\sigma {{f}_{\Delta }}({{x}_{i}},{{y}_{j}})$ and $\sigma f({{x}_{i}},{{y}_{j}})$ denote their standard deviations, and ${{f}_{\Delta }}(\overline{{{x}_{i}}},\overline{{{y}_{j}}})$ and $f(\overline{{{x}_{i}}},\overline{{{y}_{j}}})$ denote their means. C1 and C2 are parameters that enhance the stability of SSIM. The values of SSIM range from –1 to 1. The closer the value of SSIM is to 1, the more similar the predicted image is to the actual image.
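The sketch below computes a single-window (global) SSIM from the means, variances and covariance of two flattened images. The stabilizing constants follow the common convention C1 = (0.01·L)² and C2 = (0.03·L)², where L is the data range; these constants and the choice L = 2 for NDVI in [−1, 1] are assumptions, since the values used in the paper are not given here. Library implementations usually average SSIM over local windows, which this sketch does not do.

```python
def global_ssim(pred, actual, data_range=2.0, k1=0.01, k2=0.03):
    """Single-window SSIM over two flattened images of equal length."""
    n = len(actual)
    c1 = (k1 * data_range) ** 2   # assumed stabilizing constant
    c2 = (k2 * data_range) ** 2   # assumed stabilizing constant
    mu_p = sum(pred) / n
    mu_a = sum(actual) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_a = sum((a - mu_a) ** 2 for a in actual) / n
    cov = sum((p - mu_p) * (a - mu_a) for p, a in zip(pred, actual)) / n
    numerator = (2 * mu_p * mu_a + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_a ** 2 + c1) * (var_p + var_a + c2)
    return numerator / denominator


# Hypothetical values: identical images and a spatially reversed image
actual = [0.2, 0.4, 0.6, 0.8]
ssim_same = global_ssim(actual, actual)
ssim_reversed = global_ssim(list(reversed(actual)), actual)
```

Identical images give an SSIM of 1, while the reversed image has the same mean and variance but a negative covariance and therefore a negative SSIM.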
To validate the impact of different activation functions on Res-CNN, the overall accuracy (OA) and F1-score (Jia et al., 2021) were used as performance evaluation metrics. They are calculated as follows:
$OA=\frac{TP+TN}{TP+TN+FP+FN}\times 100\%$
$F1-score=2\times \frac{\frac{TP}{TP+FP}\times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\times 100\%$
where TP is the number of target-class samples that are correctly identified, FP is the number of samples incorrectly identified as the target class, FN is the number of target-class samples identified as other classes, and TN is the number of other-class samples that are correctly identified.
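Both scores follow directly from the four counts, as the short sketch below shows. The counts are hypothetical and only demonstrate the arithmetic; how pixels are assigned to classes for this evaluation is not described here.

```python
def overall_accuracy(tp, tn, fp, fn):
    """Overall accuracy in percent."""
    return (tp + tn) / (tp + tn + fp + fn) * 100


def f1_score(tp, fp, fn):
    """F1-score in percent: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall) * 100


# Hypothetical confusion-matrix counts (illustrative only)
tp, tn, fp, fn = 90, 85, 10, 15
oa = overall_accuracy(tp, tn, fp, fn)   # 175 / 200 * 100
f1 = f1_score(tp, fp, fn)               # equals 2*TP / (2*TP + FP + FN) * 100
```

Because F1 ignores TN, it is less affected than OA when the non-target class dominates the sample.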

4 Results

4.1 Ablation experiments

To validate the effectiveness of the Mish function in improving the performance of the Res-CNN network, ablation experiments were conducted on the Daxing Dataset using the Mish function and the ReLU function under otherwise identical experimental conditions. The results are shown in Table 2. Compared with the ReLU function, the Mish function achieved an increase of 2.9 percentage points in overall accuracy (OA) and 3 percentage points in F1-score. The Mish function therefore enhances the model's ability to extract complex and varied information and effectively improves the accuracy of data fusion.
Table 2 Accuracy comparison of ablation experiments for Res-CNN model
| Model | OA | F1-score |
| --- | --- | --- |
| Res-CNN + ReLU | 93.4% | 94.3% |
| Res-CNN + Mish | 96.3% | 97.3% |

4.2 Res-CNN model fusion performance

The efficiency of model training was evaluated from the changes in training accuracy and training loss for NDVI (Fig. 7 and Fig. 8). The training accuracy of the Res-CNN model shows almost no change after 100 iterations, and the network can be considered to have approached convergence. These results indicate that the Res-CNN model has a relatively high training efficiency.
Fig. 7 The changes of the loss and accuracy rate along the epoch in Res-CNN model in Daxing Dataset
Fig. 8 The changes of the loss and accuracy rate along the epoch in Res-CNN model in AHB Dataset
The red solid line represents the relationship between each epoch and R2 in the training process, and the green solid line represents the relationship between each epoch and R2 in the image reconstruction process. The red dotted line represents the relationship between each epoch and loss curve in the training process, and the green dotted line represents the relationship between each epoch and loss curve in the image reconstruction process.
More specifically, the loss curves declined further between iterations 10 and 20 in the training phase, which suggests that the activation function (here the Mish function) allows an inflow of negative gradients that can further improve the fusion results. The loss in the prediction phase is higher than the loss in training, because the training data and test data are independent of each other and the features extracted from the training data may not be fully reflected in the test data; for a good model, however, both losses should be small and differ little. The Batch Normalization (BN) layer enhances the generalization ability to a certain extent and makes training more efficient, and the attention mechanism can be regarded as a weighting of the weights that further improves the accuracy of feature recognition. The essence of a spatio-temporal data fusion model is to use the existing data to accurately predict the data at the prediction moment. The Res-CNN model tends to give stable results without large fluctuations during training and reconstruction. This suggests that the inclusion of the Mish activation function, the BN layer and the attention mechanism all contribute positively to the robustness and accuracy of the model.

4.3 Res-CNN model evaluation

The model developed in this paper performed better than the ESTARFM and FSDAF models according to the evaluation indicators (Tables 3 and 4). Specifically, on the Daxing and AHB datasets the overall R2 of the Res-CNN model is improved by 5% and 5.6% relative to ESTARFM and by 16.7% and 2.7% relative to FSDAF, respectively. On the Daxing dataset the RMSE is also smaller than that of the other two models. Overall, the Res-CNN model performs best among the three models. The KGE index of the Res-CNN model is higher than that of the ESTARFM and FSDAF models, illustrating that the input quality has the least influence on the Res-CNN model and that its stability and robustness are stronger. In terms of SSIM, the Res-CNN model is improved by 4% and 3.2% relative to ESTARFM and by 6% and 1.9% relative to FSDAF, respectively.
Table 3 Quantitative evaluation of the fusion result on 29 April 2014 for the Daxing dataset
| Daxing Dataset | Res-CNN | ESTARFM | FSDAF |
| RMSE | 1.09 | 1.28 | 1.34 |
| R2 | 0.768 | 0.737 | 0.658 |
| KGE | 0.78 | 0.695 | 0.745 |
| SSIM | 0.94 | 0.897 | 0.884 |
Table 4 Quantitative evaluation of the fusion result on 3 October 2018 for the AHB dataset
| AHB Dataset | Res-CNN | ESTARFM | FSDAF |
| RMSE | 1.06 | 0.93 | 0.983 |
| R2 | 0.807 | 0.764 | 0.786 |
| KGE | 0.898 | 0.867 | 0.888 |
| SSIM | 0.967 | 0.937 | 0.949 |
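For readers who want to reproduce indicators of this kind, the following is a minimal sketch of RMSE, R2 and KGE for paired observed and predicted values. It assumes that R2 is the squared Pearson correlation and that KGE follows the commonly used form 1 − sqrt((r − 1)² + (α − 1)² + (β − 1)²), with α the ratio of standard deviations and β the ratio of means; the exact definitions used for Tables 3 and 4 may differ. SSIM is omitted because it is a window-based image measure. The sample values are made up for illustration and are not data from this study.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def rmse(obs, sim):
    """Root mean square error between observations and predictions."""
    return math.sqrt(_mean([(s - o) ** 2 for o, s in zip(obs, sim)]))


def pearson_r(obs, sim):
    """Pearson correlation coefficient."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def r_squared(obs, sim):
    """Coefficient of determination, assumed here to be r squared."""
    return pearson_r(obs, sim) ** 2


def kge(obs, sim):
    """Kling-Gupta efficiency in its commonly used form (an assumption)."""
    r = pearson_r(obs, sim)
    mo, ms = _mean(obs), _mean(sim)
    sd_o = math.sqrt(_mean([(o - mo) ** 2 for o in obs]))
    sd_s = math.sqrt(_mean([(s - ms) ** 2 for s in sim]))
    alpha = sd_s / sd_o   # variability ratio
    beta = ms / mo        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Made-up NDVI values for illustration only.
obs = [0.21, 0.35, 0.48, 0.62, 0.74]
sim = [0.25, 0.33, 0.50, 0.58, 0.70]
print(f"RMSE={rmse(obs, sim):.4f}  R2={r_squared(obs, sim):.4f}  KGE={kge(obs, sim):.4f}")
```

A perfect prediction gives RMSE = 0 and R2 = KGE = 1, so lower RMSE and higher R2 and KGE both indicate a better fusion result.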
Although FSDAF is an improved version of the ESTARFM model, here it was applied to the fusion of NDVI rather than of single bands, which may have increased the error of FSDAF. According to the SSIM indicators and the output images of the three models, the Res-CNN model shows a higher spatial pattern similarity to the observed images than the other two models. The AHB dataset contains many small grassland patches (field widths of less than 250 m), which results in many mixed pixels in the HTLS image. This may be why the RMSE of the Res-CNN model is higher than that of the other two models on this dataset.
The fused NDVI from the three models was compared with the actual observations through visual interpretation, as shown by the area in the red box in Fig. 9. In the Daxing dataset, the ESTARFM and FSDAF results are lighter overall than the observed image, especially in the middle and upper right regions, which may be attributed to the way missing values are handled in the ESTARFM results. In the AHB dataset, the ESTARFM and FSDAF results are more blurred than the observed image, which may be attributed to the selection of the moving window and similar pixels in these two models. The Res-CNN-based NDVI, meanwhile, is visually much closer to the observed image and retains more detail in its spatial features. The Res-CNN model extracts image features by convolution and thereby avoids the problem of missing values, which makes it more reliable than the other two models from the viewpoint of visual interpretation.
From the viewpoint of model design, in the ESTARFM and FSDAF models the spatial information at the prediction time relies heavily on the MODIS image at the prediction time, and the MODIS images of lower spatial resolution are enlarged 16 times to match the higher spatial resolution of Landsat, which inevitably affects the spatial expression in the output data. In the Res-CNN model, image features are extracted through convolution, which can greatly enhance the expression of spatial information.
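Why a 16-fold enlargement cannot add spatial detail can be seen in a small sketch. It assumes nearest-neighbour resampling purely for illustration (the resampling method used by ESTARFM and FSDAF is not stated here), and the function name `upsample_nearest` and the 2×2 patch are hypothetical.

```python
def upsample_nearest(coarse, factor):
    """Repeat every coarse pixel factor x factor times (nearest-neighbour)."""
    fine = []
    for row in coarse:
        wide = [value for value in row for _ in range(factor)]
        fine.extend([list(wide) for _ in range(factor)])
    return fine


# Made-up 2x2 coarse NDVI patch standing in for a MODIS tile.
coarse = [[0.2, 0.6],
          [0.4, 0.8]]
fine = upsample_nearest(coarse, 16)

# The grid is 16 times larger along each axis ...
print(len(fine), len(fine[0]))

# ... but it still contains only the original coarse values, because every
# 16x16 block repeats a single number.
distinct = {value for row in fine for value in row}
print(sorted(distinct))
```

Smoother interpolation schemes change the appearance of the block edges but, like this one, only redistribute the coarse values; any finer structure has to come from the Landsat input or from learned features.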
A further evaluation was carried out by linear regression between the fusion results of the three models and the corresponding values of the known Landsat images (considered as “observations”), based on 1500 randomly sampled points (Figs. 10 and 11). The results showed that the Res-CNN model has a higher coefficient of determination (R2 = 0.773 and 0.804), a slope closer to 1, and a lower relative root mean square error (R-RMSE) of 95.76% and 93.58%. In contrast, the ESTARFM and FSDAF models had lower R2 (0.714, 0.721 and 0.643, 0.737), lower slopes (0.76, 0.97 and 0.86, 0.91), and higher R-RMSE (103.819%, 109.564% and 101.471%, 107.625%). The latter two models underestimated the NDVI observations, and the underestimation was more serious for observations with higher values. These results indicate that the Res-CNN model performed better than the other two models according to both the visual comparison and the statistical quantification.
Fig. 9 The NDVI observation results of different methods in the Daxing area on April 29, 2016 and AHB Dataset on April 29, 2018. (a) Observed, (b) Res-CNN, (c) ESTARFM, (d) FSDAF
Fig. 10 The linear regression-based evaluation on the different data fusion models against the observed NDVI on April 29, 2016 in the Daxing Data
Fig. 11 The linear regression-based evaluation on the different data fusion models against the observed NDVI on April 29, 2018 in the AHB Data
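The regression-based evaluation can be sketched as follows: draw 1500 random pixel positions, pair the observed and fused values, and fit an ordinary least-squares line. The sketch assumes that observations are placed on the x axis and predictions on the y axis, so that a slope below 1 corresponds to underestimation at high NDVI. The images are synthetic stand-ins and the function names are illustrative; R-RMSE is left out because its exact definition is not given in this section.

```python
import random


def linear_fit(x, y):
    """Ordinary least squares y = slope * x + intercept, plus R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def sample_pairs(observed, predicted, n_points, seed=0):
    """Draw n_points random pixel positions and return the paired values."""
    rng = random.Random(seed)
    rows, cols = len(observed), len(observed[0])
    positions = rng.sample(range(rows * cols), n_points)
    obs = [observed[p // cols][p % cols] for p in positions]
    pred = [predicted[p // cols][p % cols] for p in positions]
    return obs, pred


# Synthetic stand-ins for an observed and a fused NDVI image (not real data).
# The fused image is built to underestimate high values (true slope 0.9).
rng = random.Random(42)
observed = [[rng.uniform(0.1, 0.9) for _ in range(100)] for _ in range(100)]
predicted = [[0.9 * v + 0.03 + rng.gauss(0.0, 0.02) for v in row] for row in observed]

obs, pred = sample_pairs(observed, predicted, n_points=1500)
slope, intercept, r2 = linear_fit(obs, pred)
print(f"slope={slope:.3f}  intercept={intercept:.3f}  R2={r2:.3f}")
```

Fixing the random seed keeps the sampled positions identical across models, so that differences in slope and R2 reflect the models rather than the sample.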

5 Discussion

In this study, a Res-CNN model was developed based on CNN, introducing the deep learning method into the spatio-temporal fusion of remote sensing data. Compared with the two non-CNN models, our method has the following advantages: 1) It shows higher fusion accuracy and lower requirements on the quality of the input data. 2) It can be applied to produce NDVI time series data. 3) The trained network can be used directly to fuse a large number of images, which improves the efficiency of spatio-temporal data fusion.
Although the Res-CNN model is more flexible, it is still far from being adaptive. Through the BN layer, feedback is passed back to the training network so that new features are added without losing the original features of the image; the new activation function lets negative gradients flow through the network, which enhances the overall effect and improves the local accuracy. In future research, convolution kernels of different sizes could be used to obtain features at different scales, which could then be combined during training to obtain better prediction models (Yi et al., 2020).
Previous spatio-temporal data fusion models are believed to preserve the correctness of the spectral information to the greatest extent, but much of the spatial detail for the prediction moment comes from the MODIS data at the prediction moment, and additional auxiliary data will inevitably affect the representation of spatial information (Belgiu and Stein, 2019). Generative adversarial networks (GANs), a class of deep learning methods with high development potential that generate the best fusion results through continuous confrontation between a generative model and a discriminative model, may be another breakthrough for spatio-temporal data fusion models (Kingma and Ba, 2014; Yi et al., 2020). In addition, the evaluation of spatio-temporal data fusion cannot consider only the global effect; the variability among locally fused pixels is key to the reliability of spatio-temporal data fusion, especially since a deep learning-based framework may produce oscillating results. The interactive use of a Bayesian network and a CNN proposed in the literature has the potential to resolve the uncertainty between local and global scales (Lee et al., 2018). This approach has been widely used in computer vision, but so far its use in spatio-temporal data fusion is very limited. Therefore, designing a new network structure with an appropriate choice of activation function is of great importance for the reconstruction of remote sensing time series data.

6 Conclusions

This study developed a deep learning-based spatio-temporal data fusion method for remote sensing, Res-CNN, to reconstruct the image at the prediction moment. A new network structure was formed by introducing the BN layer and a new activation function; as a result, the predicted image gains sharpness and clarity and the prediction accuracy is greatly improved. A series of experiments in two different areas (the Daxing and AHB datasets) demonstrates the superiority of the Res-CNN model.
The new network structure of the Res-CNN model and the Mish activation function make the image at the prediction moment clearer. Compared with the ESTARFM and FSDAF models, the fusion accuracy is improved, which makes the predicted data more reliable and flexible in remote sensing applications. Therefore, the model developed in this paper is expected to play a greater role in areas where the input data are limited by heavy cloud cover, such as the Qinghai-Tibet Plateau. It can not only provide model support but also, through a big data platform, generate long time series of satellite remote sensing data with high temporal and spatial resolution, providing data support for ecological environmental protection and for the fine-scale management of grassland resources and agricultural land use, which is of important scientific and practical significance.

Appendix A

| | Landsat t0 | MODIS t0 | MODIS t1 |
| Inputs | 1024×1024 | 64×64 | 64×64 |
| Feature extraction | Res-Block 3×3 Conv2d, Stride=1, 1024×1024 | Two Layers, 3×3 Conv2d, Stride=1 | Two Layers, 3×3 Conv2d, Stride=1 |
| | Res-Block 3×3 Conv2d, Stride=2, 512×512 | Three Layers, 3×3 DeConv2d, Stride=2 | Three Layers, 3×3 DeConv2d, Stride=2 |
| | Res-Block 3×3 Conv2d, Stride=1, 512×512 | 3×3 Conv2d, Stride=1 | 3×3 Conv2d, Stride=1 |
Res-Block 3×3 Conv2d Stride=1 512×512
SPP 512×512
3×3 Conv2d Stride=1 512×512
Here a denotes the feature map of Landsat t0, b the feature map of MODIS t0, and c the feature map of MODIS t1.
Feature fusion a‒b+c 512×512
Feature reconstruction 3×3 DeConv2d Stride=2 1024×1024
FCN 1024×1024
FCN 1024×1024
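The sizes listed in Appendix A can be checked with a short sketch that tracks the feature-map width through each branch and applies the fusion a − b + c element-wise. It uses the usual output-size formulas for convolution and transposed convolution; the padding of 1 and output padding of 1 are assumptions chosen so that 3×3 kernels reproduce the listed sizes, and the tiny feature maps are made up for illustration.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output width of a 2-D convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output width of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def fuse(a, b, c):
    """Element-wise feature fusion a - b + c."""
    return [[x - y + z for x, y, z in zip(ra, rb, rc)]
            for ra, rb, rc in zip(a, b, c)]


# Landsat branch: 1024 -> stride-1 block (1024) -> stride-2 block (512).
landsat = conv_out(conv_out(1024, stride=1), stride=2)

# MODIS branch: 64 -> two stride-1 convolutions -> three stride-2 deconvolutions.
modis = 64
for _ in range(2):
    modis = conv_out(modis, stride=1)
for _ in range(3):
    modis = deconv_out(modis)

# Feature reconstruction: one stride-2 deconvolution back to the Landsat size.
reconstructed = deconv_out(landsat)
print(landsat, modis, reconstructed)

# Tiny made-up feature maps: the temporal change seen by MODIS (c - b)
# is added to the Landsat features at t0 (a).
a = [[0.50, 0.60], [0.70, 0.80]]   # Landsat t0 features
b = [[0.40, 0.50], [0.60, 0.70]]   # MODIS t0 features
c = [[0.45, 0.50], [0.65, 0.90]]   # MODIS t1 features
fused = fuse(a, b, c)
print(fused)
```

Under these assumptions both branches reach 512×512 before fusion, which is what allows the three feature maps to be combined element-wise.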
References

Abadi M, Barham P, Chen J, et al. 2016. TensorFlow: A system for large-scale machine learning. USENIX Association.

Acerbi-Junior F, Clevers J, Schaepman M E. 2006. The assessment of multi-sensor image fusion using wavelet transforms for mapping the Brazilian Savanna. International Journal of Applied Earth Observation and Geoinformation, 8(4): 278-288.


Ao Z R, Sun Y, Xin Q C A. 2021. Constructing 10-m NDVI time series from Landsat 8 and Sentinel 2 images using convolutional neural networks. IEEE Geoscience and Remote Sensing Letters, 18(8): 1461-1465.


Belgiu M, Stein A. 2019. Spatiotemporal image fusion in remote sensing. Remote Sensing, 11(7): 818. DOI: 10.3390/rs11070818.

Chen B, Huang B, Xu B. 2015. Comparison of spatiotemporal fusion models: A review. Remote Sensing, 7(2): 1798-1835.


Das M, Ghosh S K. 2016. Deep-STEP: A deep learning approach for spatiotemporal prediction of remote sensing data. IEEE Geoscience and Remote Sensing Letters, 13(12): 1984-1988.


Dong C, Loy C C, He K, et al. 2015. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2): 295-307.


Emelyanova I V, McVicar T R, Van Niel T G, et al. 2013. Assessing the accuracy of blending Landsat-MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection. Remote Sensing of Environment, 133: 193-209.


Gao F, Masek J, Schwaller M, et al. 2006. On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance. IEEE Transactions on Geoscience and Remote Sensing, 44(8): 2207-2218.


Gevaert C M, García-Haro F J. 2015. A comparison of STARFM and an unmixing-based algorithm for Landsat and MODIS data fusion. Remote Sensing of Environment, 156: 34-44.


Hilker T, Wulder M A, Coops N C, et al. 2009. A new data fusion model for high spatial-and temporal-resolution mapping of forest disturbance based on Landsat and MODIS. Remote Sensing of Environment, 113(8): 1613-1627.


Jarihani A A, McVicar T R, Van Niel T G, et al. 2014. Blending Landsat and MODIS data to generate multispectral indices: A comparison of “Index-then-Blend” and “Blend-then-Index” approaches. Remote Sensing, 6(10): 9213-9238.


Jia M M, Wang Z M, Mao D H, et al. 2021. Rapid, robust, and automated mapping of tidal flats in China using time series Sentinel-2 images and Google Earth Engine. Remote Sensing of Environment, 255, 112285. DOI: 10.1016/j.rse.2021.112285.

Kastrati Z, Imran A S, Yayilgan S Y. 2019. The impact of deep learning on document classification using semantically rich representations. Information Processing & Management, 56(5): 1618-1632.


Kingma D P, Ba J. 2014. Adam: A method for stochastic optimization. Computer Science.

Krizhevsky A, Sutskever I, Hinton G E. 2017. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90.


LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature, 521(7553): 436-444.


Lee S G, Sung Y, Kim Y G, et al. 2018. Variations of AlexNet and GoogLeNet to improve Korean character recognition performance. Journal of Information Processing Systems, 14(1): 205-217.

Liu M, Liu X, Wu L, et al. 2018. A modified spatiotemporal fusion algorithm using phenological information for predicting reflectance of paddy rice in southern China. Remote Sensing, 10(5): 772. DOI: 10.3390/rs10050772.

Liu Y, Chen X, Peng H, et al. 2017. Multi-focus image fusion with a deep convolutional neural network. Information Fusion, 36: 191-207.


Masi G, Cozzolino D, Verdoliva L, et al. 2016. Pansharpening by convolutional neural networks. Remote Sensing, 8(7): 594. DOI: 10.3390/rs8070594.

Misra D. 2020. Mish: A self regularized non-monotonic activation function. British Machine Vision Conference.

Schmidhuber J. 2015. Deep learning in neural networks: An overview. Neural Networks, 61: 85-117.


Shin H C, Roth H R, Gao M, et al. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5): 1285-1298.


Song H, Liu Q, Wang G, et al. 2018. Spatiotemporal satellite image fusion using deep convolutional neural networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(3): 821-829.


Tan Z, Di L, Zhang M, et al. 2019. An enhanced deep convolutional model for spatiotemporal image fusion. Remote Sensing, 11(24): 2898. DOI: 10.3390/rs11242898.

Tan Z, Yue P, Di L, et al. 2018. Deriving high spatiotemporal remote sensing images using deep convolutional network. Remote Sensing, 10(7): 1066. DOI: 10.3390/rs10071066.

Tian F, Wang Y, Fensholt R, et al. 2013. Mapping and evaluation of NDVI trends from synthetic time series obtained by blending Landsat and MODIS data around a coalfield on the Loess Plateau. Remote Sensing, 5(9): 4255-4279.


Wang X, Wang X. 2020. Spatiotemporal fusion of remote sensing image based on deep learning. Journal of Sensors, (6): 1-11.

Wei J, Chen L, Chen Z, et al. 2023. An experimental study of the accuracy and change detection potential of blending time series remote sensing images with spatiotemporal fusion. Remote Sensing, 15(15): 3763. DOI: 10.3390/rs15153763.

Wei J, Wang L, Liu P, et al. 2016. Spatiotemporal fusion of remote sensing images with structural sparsity and semi-coupled dictionary learning. Remote Sensing, 9(1): 21. DOI: 10.3390/rs9010021.

Wei Q, Bioucas-Dias J, Dobigeon N, et al. 2015. Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 53(7): 3658-3668.


Wei Y, Yuan Q, Shen H, et al. 2017. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geoscience and Remote Sensing Letters, 14(10): 1795-1799.


Wu M, Wang J, Niu Z, et al. 2012a. A model for spatial and temporal data fusion. Journal of Infrared & Millimeter Waves, 31(1): 80-84.

Wu M Q, Niu Z, Wang C Y, et al. 2012b. Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion model. Journal of Applied Remote Sensing, 6(1): 63507. DOI: 10.1117/1.JRS.6.063507.

Wu M, Wu C, Huang W, et al. 2016. An improved high spatial and temporal data fusion approach for combining Landsat and MODIS data to generate daily synthetic Landsat imagery. Information Fusion, 31: 14-25.


Yi D, Ahn J, Ji S. 2020. An effective optimization method for machine learning based on ADAM. Applied Sciences, 10(3): 1073. DOI: 10.3390/app10031073.

Yuan Q, Wei Y, Meng X, et al. 2018. A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(3): 978-989.


Zhang X, Jiang L, Yang D, et al. 2019a. Urine sediment recognition method based on multi-view deep residual learning in microscopic image. Journal of Medical Systems, 43: 1-10.


Zhang Y, Ling F, Foody G M, et al. 2019b. Mapping annual forest cover by fusing PALSAR/PALSAR-2 and MODIS NDVI during 2007-2016. Remote Sensing of Environment, 224: 74-91.


Zhao H, Gallo O, Frosio I, et al. 2016. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1): 47-57.


Zhu X, Cai F, Tian J, et al. 2018. Spatiotemporal fusion of multisource remote sensing data: Literature survey, taxonomy, principles, applications, and future directions. Remote Sensing, 10(4): 527. DOI: 10.3390/rs10040527.

Zhu X, Chen J, Gao F, et al. 2010. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sensing of Environment, 114(11): 2610-2623.


Zhu X, Helmer E H, Gao F, et al. 2016. A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sensing of Environment, 172: 165-177.


Zhu X, Tuia D, Mou L, et al. 2017. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4): 8-36.

Zurita-Milla R, Clevers J G, Schaepman M E. 2008. Unmixing-based Landsat TM and MERIS FR data fusion. IEEE Geoscience and Remote Sensing Letters, 5(3): 453-457.