Taking drift-diffusion analysis from the study of turbulent flows to the study of particulate matter smog and air pollutants dynamics

Drift-diffusion analysis has been introduced in physics as a method to study turbulent flows. In the current study, it is proposed to use the method to identify underlying dynamical models of particulate matter smog, ozone and nitrogen dioxide concentrations. Data from Chiangmai are considered, which is a major city in the northern part of Thailand that recently has witnessed a dramatic increase of hospitalization that are assumed to be related to extreme air pollution levels. Three variants of the drift-diffusion analysis method (kernel-density, binning, linear approximation) are considered. It is shown that all three variants explain the annual pollutant peaks during the first half of the year by assuming that the parameters of the physical-chemical evolution equations of the pollutants vary periodically throughout the year. Therefore, our analysis provides evidence that the underlying dynamical models of the three pollutants being considered are explicitly time-dependent.

Our departure point is a time-discrete sequence of observations of pollutant concentrations. This sequence will be referred to as historical trajectory X h (n) given for the time points n = 1, . . . , N (with N = 60, see below). In what follows, n will denote consecutive months. Our goal is to derive a stochastic model from the historical trajectory in analogy to the proposal by Friedrich-Peinke-Renner for historical financial data and to take seasonal effects into account. Following the Friedrich-Peinke-Renner method [24], we consider the increments Y n (τ) defined by Y n (τ) = X(n + τ) − X(n) ⇒ X(n + τ) = Y n (τ) + X(n) for τ 0 and Y n (0) = 0. Parameter τ defines a time scale. The increments Y n are assumed to satisfy an evolution equation that describes how Y n (τ) evolves from small scales of a few months (e.g., τ ≈ 1, 2, 3) to large scales of a year (e.g., τ ≈ 12, 13,14). In order to determine that evolution equation, we consider R increment trajectories of length S with τ = 0, . . . , S, n = 1, . . . , R, and R + S = N. The evolution equation for Y n is then obtained using the drift-diffusion analysis [1].
Although drift-diffusion analysis [1] is as such a non-parametric data analysis method, it requires to fix a priori the type of the stochastic model under consideration. In what follows, we consider a model given in terms of the stochastic iterative map In equation (1), f will be referred to as drift function (in analogy to the drift function of a Fokker-Planck equation [25,26]). The drift function f is assumed to depend on the month m of the year, where m depends on n and τ like m(n, τ) = v if v ∈ [1,11] and m(n, τ) = 12 if v = 0 with v = (n + τ) mod 12. In equation (1) (τ) denotes statistically independent random variables distributed like a normal distribution with mean zero and variance 2. The parameter g 0 is the noise strength or noise amplitude and, in general, may depend on the month of the year. Moreover, g 2 is the noise variance. For the sake of simplicity, it is assumed that g does not depend on the state Y n (i.e., an additive noise model is considered). In early studies by Friedrich and Peinke [1] and Stanton [27] on the drift-diffusion analysis, Friedrich, Peinke, and Stanton have determined representations for drift and diffusion coefficients of Markov diffusion processes in terms of conditional averages. In analogy to those representations, from equation (1) we obtain the Friedrich-Peinke-Stanton representation of the drift in terms of the conditional average For the noise variance we obtain which is not a conditional average because we assume that the noise term is state-independent (i.e., additive). The drift function f can approximately be described by means of several methods. The Friedrich-Peinke binning method [1,3] yields the estimator where χ j are indicator functions equal to 1 in appropriately defined intervals. We consider Y n ∈ [A, B) and use K bins of width ∆y such that y 1 = A, y K+1 = B, and y j = A + ( j − 1)∆y. The bin-intervals are I j = [y j , y j+1 ). The indicator function is χ j (z) = 1 if z ∈ I j and χ j (z) = 0 otherwise. In equation (4), δ mn is the Kronecker function that equals 1 if the (running) month n corresponds to a particular month m of the year and zero otherwise. That is, only those pairs Y n (τ), Y n (τ+1) contribute to c j (m) for which the (running) month n is the month of the year m of interest. Moreover, we have Z jm = S τ=0 [ R n=1,Y n (τ)∈I j δ mn ]. The kernel density estimation method suggested by Stanton [27] yields where the standard deviation h is given by h = sL −0.2 , where s is the sample standard deviation of all Y n scores that belong to a particular month m of the year (i.e., that show up on the sum and for which 24001-2 δ mn = 1 holds -these are the scores from which the density is estimated) and L is the number of such scores [28,29]. Moreover, For the relative small data sets that will be considered below, we will use the model that describes some dependency of f on z and features the smallest number of parameters. That is, we will consider the order p = 1. In this case, equation (1) becomes the linear regression model with A = A 0 and B = A 1 . The intercept and slope parameters A(m) and B(m) can be estimated by fitting the linear regression model equation (6) to scatter plots of Y n (τ + 1) versus Y n (τ) given for every month m.
In fact, the Y n (τ + 1) versus Y n (τ) scatter plots are used to determine f for all three approximations defined by equations (4), (5) and (6) since equations (4), (5) and (6) involve the data pairs Y n (τ + 1) and Y n (τ) for a fixed month m, that is, all pairs Y n (τ + 1) and Y n (τ) for which n corresponds to a particular month m of the year. Moreover, from equation (3) it follows that g of the linear regression model equation (6) can be estimated from the root-mean-squared error RMSE of the regression model like g(m) = RMSE(m)/ √ 2. Data were taken from the Pollution Control Department (PCD) of Thailand [30]. Pollutant data for PM 10 , O 3 , and NO 2 in N = 60 months from January 2010 to December 2014 were retrieved for the Provincial Hall measurement station in Chiangmai. Figure 1 shows the pollutant time series. The station measured raw PM 10 concentrations (in µg/m 3 ) as averaged values for every day. From the daily raw data, the PCD determined for each month the maximum scores. By contrast, O 3 and NO 2 , raw concentrations (in ppb) were measured by the station every hour. From those hourly raw data, maximum scores of the day and maximum values for the month were determined. The monthly extreme value data for PM 10 , O 3 , and NO 2 published on the PCD website [30] were retrieved and analyzed. As mentioned above, the study of extreme value data is of importance because extreme pollutant concentrations are related to increased health risks [18][19][20] and death rates [21,22]. All three pollutants PM 10  For each pollutant trajectory X(n), increment trajectories Y n (τ) were derived for reference time points n in the first three years (i.e., n = 1, . . . , R with R = 36) such that each trajectory covered a two years period (i.e., τ = 0, . . . , S with S = 23). From the trajectories Y n , scatter plots for each month m showing Y n (τ + 1) versus Y n (τ) were obtained. From the scatter plots, the drift functions f were determined by means of the 3 different approximations defined by equations (4), (5) and (6). Figure 2 shows the drift  By visual inspection of figure 2, the kernel density estimation method has the advantage to account for nonlinear characteristics of f in a smooth fashion. It has the disadvantage of being described by the whole data sets of Y n (τ + 1) and Y n (τ) pairs that contribute to the relevant scatter plots. That is, each smooth function f is characterized by a relatively large set of parameters given in terms of Y n (τ + 1) and Y n (τ) pairs. The linear approximation has the advantage of being conveniently characterized by two parameters A and B. It has the disadvantage of not capturing nonlinear effects. The drift function approximation by means of the binning method can be regarded as a compromise between the two other approximations. The stair-case like approximate drift functions obtained from the binning method account for nonlinearities. They can be described by a bin width ∆y, the bin centers y k,c = (y k+1 + y k )/2, and the function values f k . Consequently, to describe f with K bins, we need 2K + 1 parameters. Therefore, the number of parameters to describe f with the binning method is larger as compared to the number of parameters that characterize the linear regression model but smaller as compared to the number of Y n (τ + 1) and Y n (τ) parameters that are needed to approximate f by means of the kernel density estimation method. The linear regression model approximation, that has the advantage of requiring the smallest number of characterizing parameters, was determined in detail for all three pollutants. That is, the parameters A, B and g were estimated as described above. Figure 3 shows the model parameters thus obtained. For all three pollutants, the intercept parameter A varied across the months of the year in a characteristic pattern. For PM 10 and O 3 in January and February and for NO 2 in January it was positive and assumed the largest positive values. Subsequently, in March and April (PM 10 ), April and May (O 3 and NO 2 ), parameter A was negative and assumed the largest-in-the-amount negative values. These patterns, as discussed above in the context of PM 10 and figure 2, describe the mechanism that leads to the peaks in the original trajectories X(n) around February, March, and April, see figure 1. For the remaining months from May to December, the intercept parameters A were overall relatively small (i.e., close to zero). PM 10 and O 3 showed exceptions from this rule in September, where the parameter values A were positive and assumed 20% (PM 10 ) and 45% (O 3 ) of their respective maximal positive A parameters. For all three pollutants, the slope parameter B was found to be relatively close to unity. For PM 10 and NO 2 , the noise amplitude g showed clear seasonal peaks around January, February, and March (PM 10 ) and March and April (NO 2 ).
In order to validate the model, we tested the residuals occurring in equation (6). We determined the first ten lag-s autocorrelation coefficients of the residuals for each trajectory Y n (τ). The coefficients are shown in figure 3 (panels D, E, F) together with single-time-series thresholds [31] (solid lines) for statistically significant and Bonferroni adjusted [32] multiple-tests thresholds (dashed lines). We found that some of the correlation coefficients (in particular, lag-1 correlation coefficients of O 3 and NO 2 ) violated the single-time-series criterion for being not statistically significant. However, all correlation coefficients were found to be within the boundaries of the multiple-tests thresholds. Residuals of trajectories Y n (τ) were also tested for violation of normality using the Anderson-Darling normality test. For all PM 10 and NO 2 trajectories Y n (τ), the residuals did not violate the normality assumption. For O 3 , the normality assumption was violated in 4 out of R = 36 trajectories Y n . Overall, the correlation and normality tests supported the model assumptions.
We showed how to identify stochastic dynamical models for the evolution of air pollutants on the basis of single, historical trajectories of pollutant concentrations. To this end, we followed the earlier work on financial data and considered pollutant increments rather than the raw pollutant data. In addition, three different representation methods of the drift functions of the dynamical models were used. In doing so, we derived three main results: First, we found that all three representation methods were consistent with each other, see figure 2. Second, we were able to show that experimentally observed annual air pollutant peaks were caused by drift functions of physical-chemical air pollutant systems that change qualitatively from the pre-peak months (e.g., January and February) to the post-peak months (e.g., March and April), see figure 2 again. These qualitative changes in the drift functions are assumed to reflect periodic changes in the physical-chemical laws determining the evolution of the PM 10 , O 3 , and NO 2 pollutant concentrations. Third, it was found that the linear approximation representation method of drift functions (which is the most parsimony method) is sufficient to reproduce the emergence of the yearly pollutant peaks, see figure 1.