Bias Correction of Multi-sensor Total Column Ozone Satellite Data for 1978–2017

This study constructs a merged total column ozone (TCO) dataset using 20 available satellite Level 2 TCO (L2SAT) datasets over 40 years from 1978 to 2017. The individual 20 datasets and the merged TCO dataset are corrected against ground-based Dobson and Brewer spectrophotometer TCO (GD) measurements. Two bias correction methods are used: simple linear regression (SLR) as a function of time and multiple linear regression (MLR) as a function of time, solar zenith angle, and effective ozone temperature. All of the satellite datasets are consistent with GD within ±2 – 3 %, except for some degraded data from the Total Ozone Mapping Spectrometer/Earth Probe during a period of degraded calibration and from the Ozone Mapping and Profiling Suite (OMPS) provided from NOAA at an early stage of measurements. OMPS data provided from NASA show fairly stable L2SAT– GD differences. The Global Ozone Monitoring Experiment/MetOp-A and -B datasets show abrupt changes of approximately 8 DU coincident with the change of retrieval algorithm. For the TCO merged datasets created by averaging all coincident data located within a grid cell from the 20 satellite-borne TCO datasets, the differences between corrected and uncorrected TCOs by MLR are generally positive at lower latitudes where the bias correction increases TCO because of low effective ozone temperature. In the trend analysis, the difference between corrected and uncorrected TCO trends by MLR shows clear seasonal and latitudinal dependency, whereas such seasonal and latitudinal dependency is lost by SLR. The root mean square difference of L2SAT–GD for the uncorrected merged datasets, 8.6 DU, is reduced to 8.4 DU after correction using SLR and MLR. Therefore, the Corresponding author: Hiroaki Naoe, Department of Climate and Geochemistry Research, Meteorological Research Institute, 1-1 Nagamine, Tsukuba, Ibaraki 305-0052, Japan E-mail: hnaoe@mri-jma.go.jp J-stage Advance Published Date: 3 February 2020 Journal of the Meteorological Society of Japan Vol. 98, No. 2 354


Introduction
The increased halogen loading in the stratosphere caused by human activity has led to a decrease in total ozone abundance (e.g., World Meteorological Organization 2019). Changes in the ozone layer are of major concern to scientists and the general public, and thus atmospheric ozone is being closely monitored to assess the recovery of the stratospheric ozone layer from chemical ozone depletion. In response to the observed ozone depletion, the Montreal Protocol on Substances that Delete the Ozone Layer was agreed upon in 1987, and the successful implementation of the protocol and its later amendments are expected to result in ozone recovery to pre-1980s levels (World Meteorological Organization 2019). Newchurch et al. (2003) performed a statistical analysis of satellite measurements that stratospheric ozone depletion, which is controlled mainly by gas-phase catalytic cycles, slowed globally after 1997. However, the estimated dates of return of the total column ozone (TCO) to pre-1980 values for the baseline scenario (RCP-6.0) are different for the mid-latitudes in the northern and southern hemispheres (NH and SH) and significantly delayed for the Antarctic ozone. They will probably occur before the mid-century (~ 2035)/ around the mid-century for annually averaged NH/SH mid-latitude ozone, and after the mid-century (~ 2060) for the Antarctic ozone hole (World Meteorological Organization 2019).
Global-scale satellite observations of TCO have been performed since the early 1970s, and regular and continuous ozone monitoring by the Total Ozone Mapping Spectrometer (TOMS) and Solar Backscatter Ultraviolet (SBUV) instruments onboard the Nimbus-7 satellite started in 1978(McPeters et al. 1996Bhartia et al. 2013). Since then, ozone observations by multiple instruments on different polar-orbiting platforms have been available. Some of these instruments are the Global Ozone Monitoring Experiment (GOME) onboard the European Remote Satellite-2 (ERS-2) (Bodeker et al. 2001), the Ozone Monitoring Instrument (OMI) onboard the Earth Observation System Aura satellite , and the Ozone Mapping and Profiling Suite (OMPS) onboard the Suomi National Polar-orbiting Partnership (S-NPP) satellite (Flynn et al. 2014).
Satellite-borne instruments have an important advantage for ozone measurements compared to sparsely distributed ground-based networks because they can provide daily global TCO maps with a resolution that is sufficient to detect meteorological variability across regions. However, they suffer from several disadvantages, including instrument calibration drift and occasional gaps in data continuity due to changes in retrieval algorithms and observation conditions. In contrast, ground-based instruments are regularly cross-calibrated and provide continuous time series data, but they are spatially scattered, and their coverage of the globe is far from complete.
So far, some differences within ±3 % were reported between the TCO measurements of TOMS or GOME instruments and ground-based observations (Fioletev et al. 2008). Antón et al. (2010) evaluated TOMS/ Earth Probe (EP) TCO data and concluded that TOMS data for the period of 2000 -2004 should not be used for trend analysis due to degraded calibration issues. Van Roozendael et al. (2012) reported an updated GOME Data Processor version 5 (GDP5) spectral fitting algorithm for reprocessing 16-year GOME data records and they indicated that the updated GDP5 showed a clear improvement with reduced solar zenith angle and seasonal dependences for nearly all regions compared with ground-based measurements. Loyola et al. (2011) evaluated GOME-2/MetOp-A TCO measurements for 3 years (2007 -2009) using GDP version 4.4 retrieval algorithm and they found that GOME-2 TCO slightly underestimated (overestimated) ground-based ozone by 0.5 -1 % (~ 0.5 %) over the mid-latitudes of the NH (SH). Meanwhile, Kroon et al. (2011) evaluated OMI ozone profiles against co-flying instruments and collocated ozone profiles and they showed an agreement of stratospheric ozone profiles within 20 % with global correlative data, with a systematic positive bias of ~ 60 % in the tropics and 30 % in the mid-latitude regions. Furthermore, TCO datasets of the European sensors GOME, SCanning Imaging Absorption SpectroMeter for Atmospheric CartograpHY (SCIAMACHY), and GOME-2 were reprocessed with GOME-type Direct FITting (GODFIT) version 3 retrieve algorithm (Lerot et al. 2014;Koukouli et al. 2015). The ozone datasets had good quality and less sensitivity to instrumental degradation owing to a new reflectance soft-calibration scheme. These studies suggest that in order to obtain accurate global ozone records, satellite measurements must be validated against measurements made by ground-based instruments. Therefore, a merged ozone dataset from multiple satellite-acquired ozone after the validation is suitable for trend analysis and climate change studies.
One of the major difficulties in assessing longterm global ozone variations is data inhomogeneity (Fioletov et al. 2002). Satellite measurements are obtained by different instruments and retrieved by dif ferent algorithms; furthermore, the instruments are subject to problems such as radiation damage (van der A et al. 2010). Although SBUV instruments are quite similar in design, different orbital characteristics between the satellites resulted in measurements made at different local times and solar zenith angles Frith et al. 2014). Thus, the drifting orbits for the series of SBUV/2 instruments on NOAA satellites are possibly the biggest source of uncertainty regarding the determination of long-term ozone change from these instruments. Operational satellite changes and observation interruptions result in systematic errors for datasets that change over time, and these errors must be taken into account when estimating the ozone recovery trend.
To date, several merged ozone datasets have been created. For example, Bodeker et al. (2001) constructed a merged ozone dataset from TOMS and GOME datasets after removing latitudinal and temporal drift and biases by intersatellite comparisons and by examining the differences between satellite overpass and ground-based measurements as a function of time and latitude. Meanwhile, Stolarski and Frith (2006) constructed a merged ozone dataset from TOMS and SBUV version 8 retrieved data by intersatellite comparisons and by averaging the data during instrument overlapped periods. Besides, merged monthly zonal mean version 8.6 ozone data were constructed from nine independent SBUV-type instruments operating in the 8 am -4 pm equator crossing time range Frith et al. 2014). These merged datasets validated by intersatellite comparison or overpass ground-based datasets are used for trend analysis based on monthly data.
On the other hand, ozone datasets are used for long-term reanalysis such as the European Centre for Medium-Range Weather Forecasts Reanalysis-Interim (ERA-Interim) (Dragani 2011), the Japanese 55-year Reanalysis conducted by the Japan Meteorological Agency (JRA-55) (Kobayashi et al. 2015), and Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2) (Wargan et al. 2017). However, these ozone datasets are not homogenized for the usage of the reanalysis. In contrast, van der A et al. (2010, 2015) constructed merged TCO datasets from the assimilation based on high temporal-resolution satellite data in which the differences between Level 2 (L2) retrieval datasets and groundbased observations were corrected for systematic biases as a function of solar zenith angle (SZA), viewing angle, time, and effective ozone temperature (T eff ).
In this study, we apply the method of van der A et al. (2010, 2015) for the bias correction of 20 satellite datasets comprising the latest TCO retrievals, to produce continuous and homogeneous TCO datasets covering 40 years from 1978 to 2017. We first evaluate biases in the individual satellite TCO datasets by comparing the satellite data with spatially and temporally collocated ground-based Dobson and Brewer spectrophotometer networks. Then, we create individual corrected datasets by weighted averaging all data within each grid cell. The empirically corrected merged TCO dataset with high temporal resolution (e.g., daily or hourly) constructed in this manner is thus suitable for trend analysis as well as assimilation of long-term reanalysis. Besides, the corrected datasets are useful for the post-processing of Dobson spectrophotometer data, because the retrieval algorithm for the Dobson data does not account for natural interannual variations in stratospheric temperature, which significantly affects ozone absorption coefficients (Koukouli et al. 2016).
This paper is organized as follows. In Section 2, we describe the L2 satellite-borne TCO datasets used, all of which are available from public archives. Section 3 presents our methods, including two bias correction methods, and methods used to calculate the differences between coincident satellite overpass data and ground-based observations and to produce Level 3 (L3) TCO datasets. Section 4 presents the corrected TCO data for biases by simple linear regression (SLR) as a function of time, and Section 5 presents the corrected data by multiple linear regression (MLR) as a function of time, SZA, and T eff . Section 6 investi-gates the TCO trend in the corrected and uncorrected merged TCO datasets, and Section 7 presents our concluding remarks.

Datasets
In this study, 20 L2 satellite TCO datasets covering 40 years from 1978 to 2017 are acquired from public archives ( Table 1). The short names used to refer to these datasets are mainly based on the name of collecting instruments. For datasets composed of data acquired by the same type of instrument but on a different satellite or by a different retrieval algorithm, the dataset name includes the name of the satellite as well as algorithm or provider. Thus, the 20 datasets used are: TOMS/N7, TOMS/EP, GOME, TOGOMI, SCIAMACHY, TOSOMI, OMI-TOMS, OMI-DOAS, GOME-2A, GOME-2B, OMPS-NASA, OMPS-NOAA, SBUV/N7, SBUV/N9, SBUV/N11, SBUV/ N14, SBUV/N16, SBUV/N17, SBUV/N18, and SBUV/N19. The instruments that acquired the ozone data and the datasets are briefly described below.

Ground-based data
Ground-based Dobson and Brewer spectrophotometer TCO measurements are collected from "Total Ozone" of the archive of the World Ozone and Ultraviolet Data Center (WOUDC). Daily TCO values, either the best representative values or daily averages, are usually submitted to WOUDC by a contributor. For example, the Japan Meteorological Agency has submitted the best representative values of Dobson observations and the daily averages of Brewer observations. By visually inspecting the differences between satellite overpass and ground-based TCO measurements as a function of time for each station, data with low number of observations, large data gap, sudden jumps, and very large offsets are rejected. Thus, 34 Brewer and 39 Dobson TCO measurement stations are selected in this study (Tables A1, A2, respectively). Single monochromator Brewer (and Dobson) spectrophotometers are susceptible to scattered light error for large SZA observations, whereas double-monochromator Brewer spectrophotometers  * The names of datasets are generally based on the instruments used to acquire the data. Exceptions are TOGOMI and TOSOMI, which were acquired by GOME and SCIAMACHY instruments, respectively. The SBUV/N9, /N11, /N14, /N16, /N17, /N18, and /N19 datasets were acquired by the second generation (SBUV/2) instruments. Both GOME-2A and -2B instruments were acquired by the GOME-2 instrument. OMPS products were acquired by the OMPS Nadir Mapper (NM). enable more precise observations (Karppinen et al. 2015). Thus, single or/and double Brewer measurements are listed in Table A1 for reference.

TOMS
The TOMS onboard Nimbus 7 (TOMS/N7) with TCO version 8 retrieval algorithm provided the daily TCO global distribution (TOMS Science Team 1996;McPeters et al. 1996). TOMS/N7 had an orbit altitude of 935 km and this instrument performed cross-track scanning with a 102° field of view (FOV) using 35 sample points at 3° intervals, corresponding to a 50 km resolution at the Earth's surface for near-nadir angle observations. This basic retrieval algorithm uses two wavelengths to derive TCO. The shorter wavelength, which has a stronger continuum ozone absorption in the Huggins band (310 -340 nm), is sensitive to TCO. Meanwhile, the longer wavelength, which has weak ozone absorption, characterizes the underlying surface and clouds.
The TOMS onboard the EP satellite (TOMS/EP) started measurements from an initial orbit of 500 km with an inclination of 98°. In December 1997, the orbit was elevated to an altitude of 739 km in order to increase the daily global coverage to 90 % (TOMS Science Team 1998;McPeters et al. 1998). The TOMS/EP TCO measurements were processed with TCO version 8 retrieval algorithm. After the first few years of observation, TOMS/EP had serious optical degradation and the calibration had to be empirically corrected. Thus, it is not recommended to use degraded data from EP for trend analysis (Antón et al. 2010).
TOMS instruments were also mounted on the Meteor-3 satellite of the former USSR and the Japanese ADvanced Earth Observing Satellite (ADEOS). The Meteor-3 satellite had three-diffuser systems (cover, working, and reference sequence) for backscatter ultraviolet instruments. It had a non-sunsynchronous near-circular and near-polar prograde orbit that slowly drifted with local time. The reference diffuser was usually exposed for one sequence every 106 days (half the period of orbital precession). This precession posed a serious problem for trend determination from the TOMS/Meteor-3 instrument. The ADEOS spacecraft failed after only about a year in orbit; hence, little data are available. TOMS/Meteor-3 and TOMS/ADEOS measurements are not used in this study because their L2 data were not available from public archives.
2.4 GOME GOME, onboard ERS-2, was a nadir-viewing scanning spectrometer designed to measure a range of atmospheric trace constituents in the troposphere and stratosphere (Deutsches Zentrum für Luft und Raumfahrt (DLR) 2012). The largest FOV of GOME was 320 × 40 km, and three cross-track measurements provided global coverage at the equator every 3 days. An operational TCO dataset was generated by GDP version 4, using a Differential Optical Absorption Spectroscopy (DOAS) algorithm (van Roozendael et al. 2006). A DOAS-type algorithm compares the spectral fitting of effective slant columns and air-mass factor computations (mainly from look-up tables). In accordance with the Ozone Climate Change Initiative (CCI) project of the European Space Agency (ESA) (Lerot et al. 2014), the TCO dataset was reprocessed using GODFIT algorithm version 4 (Lerot et al. 2010;Van Roozendael et al. 2012), and this updated reprocessed dataset is used in this study.
Another GOME nadir TCO dataset was retrieved using the Total Ozone retrieval scheme for GOME developed for the OMI instrument (TOGOMI version 2.1) (Valks et al. 2004; Tropospheric Emission Monitoring Internet Service 2005).

SCIAMACHY
The SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY (SCIAMACHY) onboard the ESA Environmental satellite (Envisat) had a nadir mode that detected upwelling solar radiation scattered in the atmosphere and reflected by the Earth's surface. It had a 30 × 60 km along-track FOV and a swath width of 960 km (Ozone CCI 2012;Ebojie et al. 2014;Lerot et al. 2014). An operational TCO dataset was generated using SCIAMACHY Data Processor (SDP) versions 3 -5 by DOAS technique (Lerot et al. 2009). The SCIAMACHY TCO dataset was also reprocessed using GODFIT algorithm to achieve a high level of consistency with other instruments (Lerot et al. 2014;Van Roozendael et al. 2012). This updated reprocessed dataset is used in this study.
Another SCIAMACHY TCO dataset was retrieved using the Total Ozone retrieval scheme for SCIAMACHY, which used OMI-DOAS algorithm (TOSOMI version 2.0) (Eskes et al. 2005;van der A and Eskes 2006), which was an application of TOGOMI algorithm to SCIAMACHY.

OMI
OMI is a Dutch-Finnish nadir-viewing near-UV/ visible charge-coupled device (CCD) spectrometer onboard NASA's Earth Observation System's Aura satellite (Bhartia 2002;OMI team 2012). OMI measurements cover the 264 -504 nm spectra with a spectral resolution of 0.42 -0.63 nm. The nominal surface FOV is 13 × 24 km at nadir with a swath width of 2600 km. Typically, 60 cross-track measurements are made per swath, and 1650 swath observations are made per orbit. The TCO measurements are processed using TOMS version 8 algorithm, which uses two wavelengths (317.5 nm and 331.2 nm) to obtain the TCO measurements. The other four wavelengths are used for diagnostics and error correction (Bhartia 2005).
Another TCO dataset is derived using the DOASbased TCO algorithm of the Royal Netherlands Meteorological Institute (KNMI) (Veefkind 2006), which was originally used to process data from the GOME instrument onboard ERS-2. Because this algorithm is most suitable for weakly absorbing trace gases, it estimates TCO data from longer wavelengths than those used by the TOMS algorithm.

OMPS
The OMPS Nadir Mapper (NM) onboard the S-NPP satellite is a TCO sensor operated by the Joint Polar Satellite System (JPSS) program (Godin 2014). OMPS-NM measurements are processed in near realtime at the NOAA Interface Data Processing Segment using the Environmental Data Record algorithm, which is similar to NASA's version 7 TCO retrieval algorithm (Flynn et al. 2014). OMPS-NM has a 110° cross-track FOV with 35 cross-track bins and a 0.27° along-track slit width, corresponding to a 50-km resolution at the Earth's surface. It provides instantaneous coverage of a 2800-km-wide swath at the Earth's surface. In this study, the OMPS-NOAA dataset comprises OMPS data from the JPSS NOAA site.
Another OMPS-NM TCO dataset is provided by NASA's Goddard Earth Science Data and Information Services Center, which uses an algorithm that is similar to NASA TOMS version 8 retrieval algorithm used for OMI (Jaross 2017;Kramarova et al. 2014). Because the CCD readout in the NASA product is split at the center, there are typically 36 cross-track measurements per swath and 400 swath observations per orbit. The OMPS-NASA dataset in this study comprises OMPS data from the NASA site.

Methods
All L2 binary and ASCII data are first transformed into a uniform ASCII format. Then, uncorrected L3 TCO datasets are created from the uniform L2 ASCII data. In this study, two bias correction methods are used. Biases of datasets are corrected by SLR as a function of time and by MLR as a function of time, SZA, and T eff . The systematic mean biases and drift of the individual datasets are evaluated from the differences between coincident L2 satellite overpass data and selected ground-based observations. Meanwhile, T eff is calculated if a L2 retrievals are not available. Then, the linear regression coefficients are used to create corrected L3 TCO datasets from the uniform L2 ASCII data. Finally, we describe a method for estimating the uncertainties of systematic mean biases and drift.

Uniform ASCII format
All datasets in binary NetCDF, HDF, HDF-EOS5 formats, or original ASCII formats are converted to our uniform ASCII format. Elements of the uniform format are observation time, center location (latitude and longitude) of a satellite pixel, four corner loca-tions of the pixel, TCO and TCO error, SZA, viewing zenith angle (VZA), T eff , surface flags (sea/land/ice), and error flags. The SBUV datasets contain one file for each observation day while the other datasets contain one file for each orbit. Ground-based observation data from WOUDC in a CSV format are converted to the uniform format.

Differences between satellite overpass TCO
observations and ground-based measurement data (SAT-GD diff ) The satellite overpass at a ground-based station is defined when the distance between the center of the satellite pixel and the ground-based station location is less than 50 -200 km, depending on the satellite pixel size. Because daily data for Brewer and Dobson measurements are derived from "TotalOzone" of the WOUDC, we assume that ground-based measurements would be at local noon. Thus, we consider a time window within ±4 h (i.e., from 8 am to 4 pm) for satellite overpass. The local noon is estimated from the longitude of the ground-based station. For spatiotemporal collocated satellite overpass and groundbased data, differences between coincident satellite L2 TCO observations (L2SAT) and daily TCO groundbased measurement data (GD) from the Brewer and Dobson spectrophotometers (Table A1 and A2, respectively), SAT-GD diff = L2SAT − GD, are calculated.
In this paper, several criteria are adopted to SZA and cloud amount based on the purpose of the analysis. To analyze the dependence of SZA, cloud amount, and T eff on SAT-GD diff , satellite overpass data when SZA < 89° and all the cloud amount are used. For the regression analysis, only satellite overpass data at SZA < 80° and cloud amount < 0.8 are used, since SAT-GD diff becomes large at large SZA, and the root mean square of SAT-GD diff (RMSD) are large when cloud amount is close to 1. To create TCO products at each grid mesh, L2SAT at SZA < 84° are used because these data are often flagged as "good". The cloud amount criterion is set at 0.8.

Effective ozone temperature (T eff )
The ozone absorption coefficient, which is used by retrieval algorithms to estimate TCO, is strongly dependent on the stratospheric temperature that changes with season and latitude. T eff is defined, following Koukouli et al. (2016), as the integral of the temperature weighted with ozone profile over altitude as follows: In this study, the biases of satellite data as a function of T eff are corrected as part of MLR fitting. However, the retrieval T eff is not available in some datasets. In that case, T eff is computed from JRA-55 (Kobayashi et al. 2015) temperature profile data, T (z) and the climatological monthly-mean ozone profile, O 3 (z) (Fortuin and Kelder 1998). The temperature data are extracted from the 6-hourly three-dimensional temperature distribution in JRA-55, and T (z) is derived at local noon from the reanalysis temperature for each ground station by interpolation (van der A et al. 2010).
Temperature correction (van der A et al. 2010; Koukouli et al. 2016) is applied to post-process the Dobson TCO as follows: where TCO corr is the corrected Dobson TCO generated using T eff , TCO Dobson is the Dobson measurement corresponding to the Dobson reference effective temperature of 226.7 K.

Bias correction
Biases of the individual and merged satellite TCO datasets are corrected in comparison with selected GD by two methods: SLR and MLR. In the first bias correction method, daily SAT-GD diff that is within three times the standard deviation at each groundbased station is averaged over all the stations. Then, to correct the mean bias (offset) and the trend (longterm drift) of the data, SLR is applied to daily stationaveraged SAT-GD diff as a function of time. Some datasets contain degraded data because of instrumental problems or because they were obtained during the early stage of the satellite deployment (Antón et al. 2010;Flynn et al. 2014). In such cases, the satellite measurement period is divided into two parts, and linear regression analysis is applied separately to the data collected during each sub-period. Moreover, SBUV datasets of satellites that drifted from an afternoon (ascending) orbit to a morning (descending) orbit are analyzed as different time series. We only use SBUV measurements in the 8 am -4 pm equator crossing time range, with a single exception for SBUV/N11 in the 1994 -1995 period to avoid a gap in the time series (Frith et al. 2014).
In the second bias correction method, MLR is applied to daily SAT-GD diff as a function of time, SZA, and T eff at all the stations rather than daily stationaveraged differences.

Level 3 datasets
We produce hourly L3 TCO datasets by spatially re-sampling L2SAT data to a linear Gaussian grid with TL159 resolution (320 longitudinal intervals × 160 latitudinal intervals). The L2 longitude and latitude data are first converted to 0.1° intervals and then remapped to a grid location. Besides, the start and end times of each hourly interval recorded in advance as line numbers of an orbit-based-text file and number of records, are used for the L3 production.
Here, only satellite TCO data collected on the sunlit side of Earth are used. Data flagged as "bad" as well as those that do not satisfy certain criteria (i.e., data collected when SZA > 84° (McPeters et al. 1996), cloud cover > 0.8, TCO values outside the 100 -700 DU range) are not used. Basically, for non-SBUV datasets, all data flagged as "good" are used. In the case of SBUV datasets, however, global data coverage during 1993 -96 and 2003 -04 would be extremely low if all data flagged as "bad" were excluded. To prevent data gaps, we disregarded quality flags and include all SBUV datasets for the L3 production.
Each gridded product in the dataset is generated by first multiplying data within a grid cell by a weighting factor that is inversely proportional to the TCO error, and then averaging the weighted data within the grid cell. In other words, the data with low errors within a grid cell are assumed to be more reliable. In some datasets, satellite FOVs at higher latitudes are larger than the grid cell size. In that case, the average grid cell value is patched to adjacent grid cells, where the number of adjacent grid cells is proportional to the ratio of the FOV area, calculated using spherical trigonometry, to the grid cell area. The daily and monthly L3 datasets are created by summing hourly L3 values.
Moreover, merged datasets are produced in a similar but separate manner by averaging the data located within each grid cell over all datasets multiplied by the weighting factor that is inversely proportional to the TCO error.

Uncertainties
Data averaging and the uncertainty in the measurements are outlined in Damadeo et al. (2014) and Petropavlovskikh et al. (2019). The uncertainty in the mean is representative of both the uncertainty in the measurements and the overall variance (Damadeo et al. 2014). A simple lag-1 autocorrelation correction is considered for the SLR model that is applied to daily station-averaged SAT-GD diff . Since we have an approach for using the station-averaged SAT-GD diff , these differences are assumed to be homoscedastic. No seasonal cycle is included in this regression because Fourier components representing the seasonal cycle are relatively small.
The trend analysis is conducted based on two different periods to describe the ozone decrease in the 1980s and 1990s, and the slow ozone increase since the early 2000s (Petropavlovskikh et al. 2019). Fourier components representing 12, 6, 4, and 3 months periodicity are included in the SLR, and other proxies such as the quasi-biennial oscillation, the 11-year solar cycle, etc., are not included.

Bias correction as a function of time by simple linear regression (SLR)
In this section, the satellite overpass retrievals of each dataset are compared with coincident GD made by Dobson and Brewer spectrophotometers. The systematic mean bias and drift of SAT-GD diff over time are assessed using SLR. The daily L2SAT is compared with corresponding daily GD recorded at each groundbased station as follows: First, daily SAT-GD diff is calculated for each site, and then the mean daily differences for all the sites are calculated. Second, SLR is applied to the daily mean differences as a function of time to determine the bias (offset) and trend (slope) of each dataset (Table 2). Figure 1 shows a 31-day-running mean of SAT-GD diff and the regression lines (shown in red) in the 20 datasets. Overall, all the datasets agree with GD within ±2 -3 % (< 10 DU), except for periods when the measurements were significantly degraded (TOMS/EP during 2000-2003, and OMPS-NOAA during 2012-2014. In many datasets, the slope (trend) of SAT-GD diff is within ±1 % (< 4 DU) per decade. Exceptions are TOMS/EP during 2000 -2003 due to instrument degradation (Antón et al. 2010), and OMPS-NOAA and GOME-2B because of unstable satellite measurements during the early deployment stage. After their early stage, the drift in these two acquired datasets is reduced (i.e., the slope is smaller). GOME and SCIAMACHY data that were processed with the GODFIT algorithm (Lerot et al. 2010) show almost no slope (within ±0.1 DU per decade) and mean biases of 2.9 and 0.3 DU, respectively (Table  2), compared with GD. The 31-day-running mean peak-to-peak amplitude ranges ~ 6 DU in GOME and 3 DU in SCIAMACHY (Figs. 1c, e, respectively). Meanwhile, the TOGOMI and TOSOMI are similar to those of GOME and SCIAMACHY, respectively, except that the offset in TOGOMI (0.5 %) is reduced in comparison with that in GOME (1 %), and the RMSD is 9.5 DU for TOSOMI compared with 8.7 DU for SCIAMACHY.
OMI-TOMS and OMI-DOAS datasets agree well with GD . The time series of SAT-GD diff (Figs. 1g, h, respectively) are almost flat throughout the year, both datasets have a mean bias of −3.1 DU and −1.8 DU, respectively. OMI-TOMS has a slightly larger drift and a smaller RMSD compared with OMI-DOAS. Similar to GOME and SCIAMACHY that were also processed with the DOAS-type algorithm, OMI-DOAS displays seasonal behavior with relatively smaller seasonal amplitude.

Merged TCO
Nov 1978 -Dec 2017 8.6 8.4 Errors of offset and drift are shown by twice the standard error. SBUV/N14 (S14), (q) SBUV/N16 (S16), (r) SBUV/N17 (S17), (s) SBUV/N18 (S18), and (t) SBUV/N19 (S19). A 31-day running mean is applied. changes on 1 July 2013, which coincided with the algorithm change from GDP 4.6 to GDP 4.7 (Figs. 1i, j) when the difference increases suddenly by ~ 8 DU. The seasonality associated with the data processed by DOAS-type algorithms can, for the most part, be attributed to the fact that the algorithms do not account for latitudinal and temporal changes in the effective temperature (Koukouli et al. 2016). In addition, data processed with DOAS-type algorithms acquired at high SZA show poor agreement with GD, apparently as a result of seasonal and geometrical problems (i.e., measurements at high SZA often located at high latitudes). OMPS-NASA shows a remarkable agreement with GD, with a mean bias of around zero and a slightly upward trend of 0.4 % per decade (Fig. 1l). In contrast, during the early stage of satellite deployment in 2012 -2014, OMPS-NOAA measurements are extremely degraded (Flynn et al. 2014), and SAT-GD diff shows an abnormally high slope. After 2015, however, the OMPS-NOAA SAT-GD diff exhibits stable behavior (Fig. 1k). Thus, we include only OMPS-NOAA from after 2015 in the merged TCO dataset.
The mean offsets of SAT-GD diff for datasets acquired with SBUV-type instruments are generally within 1 -3 %. For daily data without smoothing (not shown), fewer good matches are available for pure nadir observation instruments than those for a twodimensional CCD of OMI and a mapper of OMPS, and thus the variability in the SBUV anomalies is large. Because of poor sampling and low-quality SBUV instruments in the 1990s and early 2000s, SBUV/N9, /N11, and /N14 measurements exhibit large biases and significant drift due to large deviations of the equator crossing time (ECT) from an initial nearnoon ECT Frith et al. 2014). In early series SBUV datasets (SBUV/N7, /N9, /N11, and /N14; Figs. 1m-p), the peak-to-peak amplitude of 31-day-running mean fluctuations is within ±10 DU, whereas it is within ±5 DU in datasets acquired later (SBUV/N16, /N17, /N18, and /N19; Figs. 1q -t).
The latitudinal distributions of the mean relative SAT-GD diff , i.e., (L2SAT − GD)/L2SAT, in the 20 datasets are shown in Fig. 2. Mean biases of satellite TCO relative to GD at each station range from −2 % to 5 %. The biases are small and close to zero at higher latitudes in NH and positive near the equator and in SH. However, the SH positive biases may be driven by the negative bias stations in NH mid-latitudes. The spread (variance) ranges from 3 % to 5 % at most of the stations. From the figure, no significant change is observed in biases in the GD records in the early part of the satellite records compared with the later part.
Ozone trends would be best diagnosed from the temporal evolution of the global TCO average. Figure  3 shows the time series of uncorrected satellitemeasured TCO averaged over 60°S -60°N for each of the 20 datasets and the corresponding TCO data corrected by SLR as a function of time. Both the uncorrected and corrected time series show a general decline in ozone levels over time in the 1980s and early 1990s. The most visible effect of the trend correction is to reduce the trend over the period of 1978 -1995. Since the ground network was poorly calibrated in that period than in the post 2000 period, this might be due to uncertainty in the ground data. In this figure, there are a local maximum in 1988 -1991 and a minimum in 1993. The low TCO values during 1992 -1994, which are reduced by 5 -10 DU compared with the 1990 level, are attributable to the Mount Pinatubo eruption in June 1991 (Fioletov et al. 2002). However, the TCO values return to the late 1980s level in1998 -1999, and no long-term drift is apparent after 2000.
The TOMS/N7 L2SAT overestimates GD by 2 -4 DU during 1979 -1993 (i.e., differences are positive; Fig. 1a). As a result, the corrected TCO magnitudes are reduced to 280 -293 DU. Conversely, the TOMS/ EP L2SAT underestimates GD during 1996 -1999 (i.e., differences are negative; Fig. 1b). Thus, the corrected TCO values are higher than the uncorrected values. However, there are large differences between the corrected TOMS/EP dataset and other corrected datasets which include data for 1996 -1999 (i.e., GOME, TOGOMI, and SBUV/N9, /N11, and /N14) (Fig. 3b). This result suggests that not only the TOMS/EP TCO data from after 2000 when the instrument is known to have degradation issues, but also the TOMS/EP TCO data from before 2000 may be problematic. Hence, the data corrected by SLR may need to be carefully inspected before they are used to determine long-term ozone trends.
The satellite datasets for the 2000s agree well, within 5 DU, with one another, and the corrected datasets agree within 4 DU. In the 2010s, NASA-satellite TCO data (OMI-TOMS, OMI-DOAS, OMPS-NOAA, and OMPS-NASA) and SBUV-series data agree well with one another. However, GOME-2A and -2B TCO data are ~ 5 DU higher than the NASA-satellite and SBUV-series data (except for the OMPS-NOAA data from before 2015). The OMPS-NASA TCO dataset shows remarkable stability with respect to the groundbased network; the corrected globally-averaged OMPS-NASA TCO data are almost the same as those of the uncorrected data (Fig. 1l). Remarkably, intersatellite differences between the NASA satellites and GOME-2 series after around 2013 significantly decrease in magnitude from ~ 5 DU for uncorrected TCOs to 2 -3 DU for corrected TCOs, except for OMPS-NOAA.
In summary, the systematic mean biases and drift of TCO in the 20 satellite retrieval datasets as a function of time are removed by SLR fitting of the coincident satellite overpass and Dobson and Brewer GD data. In some datasets, measurements are extremely degraded during certain periods. In that case, linear regression analysis is applied separately to different parts of the satellite measurement period. In all the datasets, satellite overpass data agree with GD within ±2 -3 % when the significantly degraded TOMS/EP measurements during 2000 -2003 and OMPS-NOAA measurements during 2012 -2014 are excluded. Abrupt jumps of ~ 8 DU in GOME-2A and -2B coincident with a change in the retrieval algorithm. The RMSD of the merged TCO dataset, 8.6 DU, is reduced to 8.4 DU by bias correction with respect to time.

Bias correction by multiple linear regression (MLR)
Biases of all the available satellite TCO datasets and the multi-sensor merged TCO dataset relative to the selected GDs are corrected as a function of time, SZA, and T eff by MLR. First, number frequency distributions of satellite overpass measurements are presented for each dataset and then relative SAT-GD diff and RMSD are investigated as a function of SZA, cloud amount, and T eff . To assess the validity of this method for correcting biases in satellite TCO retrievals, we present the corrected TCO anomalies relative to uncorrected TCO values on a global map and examine the time series of globally averaged TCO.
In relation to SZA, frequency distributions of satellite overpass measurements (Fig. 4a) are unimodal or bimodal with peaks at SZA of 30 -40° or 60° or both. The distributions of number frequencies and relative differences are heavily influenced by the fact that most of the ground stations are located in NH mid-latitudes. Because the overpass SZA varies according to the season and latitude of the observations, the distribution largely reflects the geophysical location of the ground stations, which are mostly unevenly distributed in NH extratropics. Overall, the relative SAT-GD diff (Fig. 4b) and RMSD (Fig. 4c) are small when SZA is 20 -60°. However, there are significant differences among the datasets when SZA is high, especially for SZA > 80°, because errors in the GD are likely due to scattered light. The frequency distributions in relation to cloud amount (Fig. 4d) show cloud-free percentages of just 4 -5 % for satellite overpass measurements retrieved by DOAS-type algorithms and 35 -40 % for those by TOMS-type algorithms, except for OMPS-NOAA. The number frequencies of retrievals generally decrease as the cloud amount increases because overpass differences in this study use only GD in direct sun to ensure accuracy. Overall, the relative SAT-GD diff in most of the datasets shows little dependency on cloud amount (Fig. 4e). The SAT-GD diff in TOMS/N7 and GOME-2B (OMI-TOMS and OMPS-NOAA) shows positive (negative) biases, and biases of SCIAMACHY, TOSOMI, GOME-2A, and OMPS-NASA are near zero within ±1 %. The RMSD increases as increasing the cloud amount (Fig. 4f). Because clouds can substantially reduce the accuracy of satellite TCO products (Antón and Loyola 2011), these results suggest that cloud amount may be suitable for the consideration of TCO error rather than for the TCO itself. The frequency distributions of satellite overpass measurements in relation to T eff (Fig. 4g) show that the L2 retrievals of T eff for GOME, SCIAMACHY, TOGOMI, GOME-2A and GOME-2B peak at ~ 230 K whereas the modal of JRA-55-derived T eff peaks at ~ 220 K. Van Roozendael et al. (2012) indicated that the DOAS may overestimate T eff and that a difference of 10 K in T eff results in a 3 % change in TCO. The T eff dependency of SAT-GD diff and its RMSD (Figs. 4h,i) show large spreads among the datasets at low temperatures because of the small sample sizes and large atmospheric variability in winter months. The relative SAT-GD diff for GOME, SCIAMACHY, TOGOMI, and TOSOMI shows larger positive biases at lower T eff , whereas that for TOMS/EP, OMI-TOMS, and OMPS-NOAA shows large negative biases at lower T eff . The OMI-DOAS, OMPS-NASA, and SBUV ensemble mean biases are small, within ±1 %, when T eff is higher than 220 K.
MLR coefficients of SAT-GD diff as a function of time, SZA, and T eff are presented in Table 3 for the 20 datasets, along with their RMSDs. The RMSDs of all the datasets are ~ 10 DU (3 %), and bias correction by MLR reduces the RMSD by 1 -8 %, compared with the RMSDs of the uncorrected data (see RMSD1 in Table 2); GOME-2A and -2B show the largest RMSD reduction. The RMSD of uncorrected merged datasets, 8.6 DU, is reduced to 8.4 DU after correction by MLR. Figure 5 shows latitude time cross-sections of differences between corrected TCO by MLR and uncorrected TCO for the 20 datasets (panels 5a -t) and the merged dataset (panel 5u). Overall, the correction differences show latitudinal dependency and distinct seasonal variation. The positive TCO correction in TOMS/EP, OMI-TOMS, OMI-DOAS, and OMPS-NOAA and negative TCO correction in GOME, TOGOMI, and GOME-2B at almost whole latitudes are basically attributed to negative and positive SAT-GD diff , as shown in Fig. 1. Positive TCO correction at only higher latitudes is distinct in TOMS/N7. TCO correction is relatively small in SBUV datasets, due to small magnitude of SAT-GD diff , as shown in Figs. 1 and 2. These general characteristics in individual data- Data period: 1 TOMS/EP, July 1996 to December 1999 (but the bias is not corrected in the merged TCO data); 2 GOME-2A sets are well reflected in the merged dataset in Fig. 5u. Large gaps in SH after 2003 in the GOME and TOGOMI datasets are due to an ERS-2 recording failure . In the datasets processed by DOAS-type algorithms, TCO retrievals are not available when the Earth's surface is covered by snow or ice, because the algorithms cannot distinguish between the snow-or ice-covered ground and cloud . The effective light path is enhanced by multiple reflections in clouds, and when snow and ice is present, the highly reflecting surface produces an elongated "ghost" column; in both cases, TCO is overestimated . However, in GOME-2A and -2B, after the retrieval algorithm was changed in mid-October 2016, the monthly coverage over 60 -90°S increases to ~ 100 %, from ~ 40 % with the former algorithm.
It is interesting to examine how the spatial distribu-tion and temporal behavior of the merged data change as a result of bias correction by MLR. For example, Figs. 6a -c show longitude-latitude sections of monthly mean corrected minus uncorrected TCO differences, RMSD, and T eff in October 2008. In general, TCO in October is high over the east Asia at higher latitudes and over the Southern Ocean, whereas it is low over 30°N -30°S region and over Antarctica. A large part of the merged TCO data in 2008 is derived from the OMI-TOMS and OMI-DOAS datasets, both of which have negative biases. Thus, the corrected TCO results show increases in TCO in most parts of the world (Fig.  6a). Figure 6b shows that the RMSD is large at the equator and in Antarctica, and small in the Southern Ocean in this example. Figure 6c shows a low T eff in lower latitudes and a high T eff in the Southern Ocean. In Table 3, coefficients of T eff for OMI-TOMS and OMI-DOAS are positive, indicating that the bias correction in higher/lower T eff regions show decreases/increases in TCO in those regions. The key point is that this bias correction by MLR is not uniform globally because the correction in a global map depends on T eff and SZA, which are critical parameters affecting TCO retrievals (van der A et al. 2010).
The time series of bias correction by SLR and MLR at a mid-latitude station in NH is shown in Fig. 6d. Bias correction by SLR does not reveal any seasonal behavior because this correction uses only the mean offset and drift. In contrast, bias correction by MLR reveals clear seasonal behavior. The seasonal peak-topeak amplitude changes from ~ 3 DU before 2013 and to ~ 6 DU after 2013. This amplitude change coincides with the algorithm change in GOME-2A and -2B. The GOME-2A and -2B satellite overpass time series (Figs. 1i,j,respectively) show negative biases against GD before 2013 and positive biases after 2013. Figure 7 shows the time series of merged uncorrected and corrected TCOs by SLR and MLR averaged over 60°S -60°N, and the differences of corrected TCO relative to the uncorrected TCO; NH and SH averages are also shown. For comparison, the KNMI ozone multi-sensor reanalysis of van der A et al. (2010) is plotted. The reanalysis was constructed by data assimilation using reprocessed versions of 13 satellite corrected datasets for 1978 -2008 against GD for biases as a function of time, SZA, VZA, and T eff .
The behavior of the global, NH and SH averages of TCO corrected by SLR is similar to that corrected by MLR, and good agreement is also seen between both our corrected and merged TCO time series and the merged KNMI time series of van der A et al. (2010). This agreement indicates that the simplified method in which only the mean satellite bias and drift are corrected has some benefits as an assimilation data source in that it removes the discontinuity caused by satellite de-biased data, and it is expected to reduce apparent biases among individual satellite measurements. From Fig. 7b, the bias correction by MLR is higher than that by SLR. This difference could be caused by T eff of Dobson and retrieved satellite ozone as well as SZA. By comparing these time series with those of the individual satellite datasets (Fig. 3), we find that both SLR and MLR fittings correct the positive biases during the TOMS/N7 observation period of 1978 -1993 and the negative biases during the OMI period of 2004 -2017.
In summary, bias correction by MLR is not uniform globally and shows distinct seasonal variation because the correction depends on time, SZA, and T eff . The bias correction by MLR is higher than that by SLR. In addition, the time series after SLR correction for mean bias and drift shows good agreement with the TCO time series corrected by MLR (R 2 = 0.998). Therefore, the simplified method has some benefits as an assimilation data source in that it removes the discontinuity caused by satellite de-biased data, and it is expected to reduce apparent biases among individual satellite measurements.

Trend analysis
In this section, the TCO trend in the corrected and uncorrected merged TCO datasets is investigated. Apparent biases in individual satellite measurements are reduced in the MLR corrected merged dataset. Figure 8 shows the seasonal march of the TCO trend in NH and SH mid-latitudes along with the latitudinal distribution of the TCO trend during two periods, 1979 -1996 and 2000 -2017. These two different trends correspond to the ozone decrease in the 1980s and 1990s and the slow ozone increase since the early 2000s (Petropavlovskikh et al. 2019).
The negative TCO trend is the strongest in winter and spring with largest variability during 1979 -1996 in NH mid-latitudes (35 -60°N; Fig. 8a), as reported previously by Fioletov et al. (2002). During these years, the year-round trend is statistically significant (i.e., values are within twice the standard error) in both NH and SH mid-latitudes. The summer and fall uncertainties are smaller than those in winter and spring. In SH mid-latitudes (35 -60°S; Fig. 8b), the ozone decline is less seasonally dependent. Overall, TCO trend shows a statistically significant decline of 10 -20 DU per decade during 1979. During 2000 -2017, the trend is near zero in all seasons in both NH and SH mid-latitudes, except for an insignificant positive trend in January over 35 -60°N and a significant positive trend in April and May over 35 -60°S. In both hemispheres during 1979 -1996, the difference between the corrected and uncorrected TCO trends shows seasonal dependence, being positive in summer months. This influence may be attributed to T eff and/or SZA dependency for MLR. It is notable that bias correction by SLR loses seasonality for differences between corrected and uncorrected TCOs (not shown).
The TCO trend is near zero over the equator and substantially declines outside the 30°S -30°N zone during 1979 -1996 (Fig. 8c). The cumulative ozone loss over the mid-latitude during that period is ~ 5 %, in agreement with results of Fioletov et al. (2002) andWorld Meteorological Organization (2015). In this period, the corrected TCO data show smaller negative trends compared with the uncorrected data. During 2000 -2017 (Fig. 8d), the trend is near zero over 20 -30°S and 10 -60°N. Meanwhile, it is significantly positive over 10°S -10°N and insignificantly positive over 40 -60°S. In this period, the corrected data show a large positive trend over 10°S -30°N, compared with the uncorrected data. Overall, our trend correction by MLR is higher than that by SLR especially at lower latitudes because bias correction increases in TCO at low T eff .
In summary, the difference between MLR corrected and uncorrected TCO trends shows clear seasonal and latitudinal dependency while such seasonal and latitudinal dependence of bias correction by SLR against time is lost. The differences between MLR and SLR corrected TCO trends are generally positive at lower latitudes where bias correction by MLR increases in TCO due to low T eff .

Concluding remarks
In this study, we have constructed the merged dataset using 20 available satellite TCO datasets covering the 40 years from 1978 to 2017. The individual datasets and the merged dataset are corrected against selected GD by two methods: SLR against time and MLR against time, SZA, and T eff .
In the 20 datasets, SAT-GD diff shows little dependence of SZA when SZA < 60°. However, it shows large deviations when SZA > 80° with anomalies > 5 % because GD errors are likely due to scattered light at large SZA. The dependency of the cloud amount is small in most of the datasets and the (TCO) RMSD for the cloud amount increases as increasing the cloud amount. Although clouds can significantly reduce the accuracy of satellite TCO retrievals (Antón and Loyola 2011), this means that cloud amount may not be a suitable parameter for TCO bias correction and that it may be taken into consideration in TCO error correction. SAT-GD diff shows a large spread at low T eff , and this result is attributable to small sample sizes and probably also to large atmospheric variability in winter months.
Because SZA and T eff dependencies are geophysically linked (Koukouli et al. 2016), these two effects are important for validation studies. van der A et al. (2010) also indicated that SZA and T eff have clear seasonal components and that an explicit seasonal or latitudinal dependency almost disappears when their corrections were applied to the satellite observations. In our analysis of SAT-GD diff , all datasets agree within ±2 -3 % (< 10 DU), except for some significantly degraded measurements in TOMS-EP during 2000 -2003 and OMPS-NOAA during 2012 -2014. The systematic drifts are within ±1 % (< 4 DU) per decade, except in degraded TOMS/EP measurements, and in OMPS-NOAA and GOME-2B measurements during the early stage of satellite deployment. Later, OMPS-NOAA and GOME-2B retrievals exhibit stable behaviors.
The merged datasets are created by averaging all data located within a grid cell after multiplication with weights inversely proportional to TCO errors. The behavior of the merged TCO after bias correction by SLR is similar to that after correction by MLR. Both corrected merged TCO datasets agree well with the KNMI merged TCO reported by van der A et al. (2010). Therefore, correction for mean bias and drift also improves the merged TCO datasets. The RMSDs in the corrected datasets by SLR and MLR are reduced from 8.6 DU to 8.4 DU. This is an additional benefit for data correction and time-series homogenization. A drawback of bias correction by SLR is that bias correction of seasonal and latitudinal dependence is not detected, whereas bias correction by MLR shows clear seasonal and latitudinal dependence (Fig. 6).
For the trend analysis, bias correction by MLR shows clear seasonal and latitudinal dependence while such seasonal and latitudinal dependence of bias correction by SLR against time is lost. The differences between MLR corrected TCO and uncorrected TCO trends are generally positive at lower latitudes where bias correction increases in TCO due to low T eff . A further investigation is needed for this positive TCO correction in the low latitude.
Data availability. The homogenized monthly mean merged TCO datasets are available from the authors upon request via e-mail (biascorrection_l2sat_totalozone AT mri-jma.go.jp). of Education, Culture, Sports, Science, and Technology of Japan. The authors give many thanks to Atsuya Kinoshita, Itaru Uesato, Mikio Ueno, Nozomu Ohkawara, Makoto Inoue, Toshihiko Hirooka, and especially Kazuyuki Miyazaki. Without his constructive and scientific advices to claim JMA's ozone products, nothing about our ozone homogenization project would be accomplished at all.