Euclidean Sum of Exponentially Decaying Contributions Tutorial

Exponentially Decaying Contributions (EDC)

Equation 1 describes the expected concentration, C(d), at a distance, d, from a source, where C(0) is the concentration at the source. The hyperparameter, αE, is the distance at which the concentration from the source has been reduced by 95%.

$$C(d) = C(0)\,\exp\left(-\frac{3d}{\alpha_E}\right) \tag{1}$$
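As a minimal sketch of the single-source decay, the function below assumes the form C(d) = C(0)·exp(−3d/αE), chosen because exp(−3) ≈ 0.05 matches the stated 95% reduction at d = αE. The function name `edc` and the example distances are illustrative, not part of the original tutorial.

```python
import numpy as np

def edc(d, c0, alpha_e):
    """Concentration at distance d from a single source.

    Assumed form: C(d) = C(0) * exp(-3 * d / alpha_e), so that the
    concentration is reduced by about 95% when d equals alpha_e.
    """
    d = np.asarray(d, dtype=float)
    return c0 * np.exp(-3.0 * d / alpha_e)

# Hypothetical distances (km) from a source with C(0) = 1 and a 2 km range.
distances_km = np.array([0.0, 1.0, 2.0, 4.0])
conc = edc(distances_km, c0=1.0, alpha_e=2.0)
print(conc)
```

At d = αE (here 2 km) the returned value is roughly 5% of C(0), which is the behaviour the hyperparameter is meant to encode.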

Sum of Exponentially Decaying Contributions (SEDC)

When there are N source nodes, the concentrations observed at M sampling points (i = 1, …, M) can be expressed as the sum of the EDC from each jth source (j = 1, …, N), with initial concentration C(0j), to each ith sampling point, using the Euclidean distance matrix, Dij. As before, the hyperparameter, αE, is the distance at which the concentration from a source has been reduced by 95%, also called the exponential decay range.

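$$C_i = \sum_{j=1}^{N} C(0_j)\,\exp\left(-\frac{3D_{ij}}{\alpha_E}\right), \quad i = 1, \ldots, M \tag{2}$$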
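The sum over sources can be sketched with a Euclidean distance matrix built by NumPy broadcasting. This again assumes the decay form exp(−3·Dij/αE); the function name `sedc`, the coordinates, and the unit source concentrations are hypothetical.

```python
import numpy as np

def sedc(source_xy, sample_xy, alpha_e, c0=1.0):
    """Sum of exponentially decaying contributions at each sampling point.

    source_xy: (N, 2) source coordinates; sample_xy: (M, 2) sampling points.
    c0: scalar or length-N array of source concentrations C(0_j).
    Assumed decay form: exp(-3 * D_ij / alpha_e).
    """
    source_xy = np.asarray(source_xy, dtype=float)
    sample_xy = np.asarray(sample_xy, dtype=float)
    c0 = np.broadcast_to(np.asarray(c0, dtype=float), (source_xy.shape[0],))
    # Euclidean distance matrix D with shape (M, N).
    diff = sample_xy[:, None, :] - source_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sum over sources j for every sampling point i.
    return (c0[None, :] * np.exp(-3.0 * dist / alpha_e)).sum(axis=1)

# Two hypothetical sources and two sampling points (coordinates in km).
sources = [[0.0, 0.0], [3.0, 4.0]]
samples = [[0.0, 0.0], [3.0, 0.0]]
x1 = sedc(sources, samples, alpha_e=2.0)
print(x1)
```

The first sampling point sits on a source, so its value is C(0) plus a small contribution from the second source 5 km away.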

Estimating the Hyperparameter Value

The exponential decay range may be obtained from experimental data; however, this can be costly when the number of source nodes, N, is large, so methods for estimating it are useful. Messier et al. (2014) estimate the exponential decay range by using Q different values of the hyperparameter to construct Q different predictor variables. Each predictor is then used in a regression against observations from sampling data. The hyperparameter value whose predictor gives the best-fitting model (e.g. lowest AIC or highest R-squared) is selected.
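The selection over Q candidate values can be sketched as a simple loop. This is an illustration under stated assumptions rather than the procedure of Messier et al.: the locations, the noise level, the candidate grid, and the helper names `sedc` and `r_squared` are all hypothetical, and R-squared from an ordinary least-squares line is used as the fit criterion.

```python
import numpy as np

def sedc(source_xy, sample_xy, alpha_e, c0=1.0):
    """SEDC predictor, assuming the decay form exp(-3 * D_ij / alpha_e)."""
    diff = sample_xy[:, None, :] - source_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return (c0 * np.exp(-3.0 * dist / alpha_e)).sum(axis=1)

def r_squared(x, y):
    """R-squared of the least-squares line y ~ intercept + slope * x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
sources = rng.uniform(0, 10, size=(7, 2))    # hypothetical source locations (km)
samples = rng.uniform(0, 10, size=(20, 2))   # hypothetical sampling locations (km)
# Hypothetical observations generated with a 2 km range plus small noise.
y = 0.2 + 0.25 * sedc(sources, samples, 2.0) + rng.normal(0, 0.01, 20)

# Q = 16 candidate ranges: 0.5, 1.0, ..., 8.0 km.
candidates = np.linspace(0.5, 8.0, 16)
scores = [r_squared(sedc(sources, samples, a), y) for a in candidates]
best_alpha = candidates[int(np.argmax(scores))]
print(best_alpha, max(scores))
```

A finer candidate grid gives a more precise estimate at the cost of more regressions.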

However, simulations of sources, observations, and different hyperparameters suggest that maximizing the coefficient associated with the predictor variable is often equivalent to maximizing the fit of the model (as seen in the demonstration below), so long as the constructed predictor variable is standardized (e.g. with the interquartile range method or a Z-score). The coefficient can be left as the standardized coefficient or, for some data, expressed as a risk ratio (RR), an abundance ratio (AR), or a relative abundance ratio (RAR), depending on the measurement and transformation of the response data. For example, if the response is binary (y is either 1 or 0) and is modeled on the log scale, an RR can be obtained by exponentiating the coefficient. If the response is a relative abundance of species with a log10 transformation, the RAR can be obtained by raising 10 to the power of the coefficient.
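The two standardizations and the back-transformations of the coefficient can be sketched as follows. The helper names and the example coefficient are hypothetical, and centering on the median in the IQR method is one common convention, assumed here.

```python
import numpy as np

def zscore(x):
    """Center on the mean and scale by the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def iqr_scale(x):
    """Center on the median and scale by the interquartile range."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
x_z = zscore(x)
x_iqr = iqr_scale(x)

coef = 0.3                 # hypothetical standardized regression coefficient
rar = 10.0 ** coef         # response was log10(relative abundance)
rr = np.exp(coef)          # response was modeled on the natural-log scale
print(x_z, x_iqr, rar, rr)
```

With either scaling, the coefficient is read as the change in the transformed response per one standardized unit of the predictor (one standard deviation or one IQR).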

Demonstration of Estimating the SEDC Using Simulated Data

In the demonstration below, the response data, y, were created using the following linear model:

$$y = \beta_0 + \beta_1 x_1 + \varepsilon \tag{3}$$

Here β0 is the intercept, set to 0.2, and β1 is the coefficient, set to 0.25. The constructed predictor variable, x1, represents the concentrations at the sampling locations, modeled using the SEDC from 7 source nodes of the same source type. Two scenarios are considered: in the first, the true exponential decay range is 2 km, and in the second it is 4 km. Some error, ε, has also been simulated.
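A simulation along these lines might look like the sketch below. The values β0 = 0.2, β1 = 0.25, 7 sources, 20 sampling sites, and the 2 km and 4 km ranges come from the text; the study-area size, the random locations, the noise standard deviation, and the decay form exp(−3·Dij/αE) are assumptions made for illustration.

```python
import numpy as np

BETA0, BETA1 = 0.2, 0.25          # intercept and coefficient from the text
N_SOURCES, N_SAMPLES = 7, 20      # counts from the text

rng = np.random.default_rng(42)
# Assumed 10 km x 10 km study area with uniformly random locations.
sources = rng.uniform(0, 10, size=(N_SOURCES, 2))
samples = rng.uniform(0, 10, size=(N_SAMPLES, 2))

def sedc(source_xy, sample_xy, alpha_e, c0=1.0):
    """SEDC predictor, assuming the decay form exp(-3 * D_ij / alpha_e)."""
    diff = sample_xy[:, None, :] - source_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return (c0 * np.exp(-3.0 * dist / alpha_e)).sum(axis=1)

def simulate_response(true_alpha_km, noise_sd=0.02):
    """y = beta0 + beta1 * x1 + eps, with x1 built from the true range."""
    x1 = sedc(sources, samples, true_alpha_km)
    eps = rng.normal(0.0, noise_sd, size=x1.shape)   # assumed Gaussian error
    return BETA0 + BETA1 * x1 + eps

y_2km = simulate_response(2.0)    # scenario 1
y_4km = simulate_response(4.0)    # scenario 2
print(y_2km[:3], y_4km[:3])
```

The estimation step then treats `y_2km` and `y_4km` as observed data and ignores the range used to generate them.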

GOAL: estimate these exponential decay ranges for each scenario without a priori knowledge of the true exponential decay range using a linear model.

If helpful, you can imagine that these sources are sprinklers that spray swine manure onto fields. In this example, we might be trying to measure the relative abundance of swine fecal matter in the air at 20 different sampling locations. These estimates can be used to build exposure maps that may help characterize the microbial exposures nearby residents might face.

https://indyweek.com/news/archives/updated-study-shows-n.c.-hog-farms-spray-hog-poop-neighbors-homes-cooper-vetoes-hb-467/

We can estimate the SEDC continuously across our whole study area by using an estimation grid. We can also construct our predictor variable, x1, by estimating the SEDC at the sampling locations. We can then regress our measured observations, or response data, y, at our 20 sampling sites on the constructed predictor variable, x1, and keep track of both the fit (R-squared) and the RAR. By doing so, we can find the optimal hyperparameter value to describe the exponential decay range.
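Evaluating the SEDC on an estimation grid can be sketched with `np.meshgrid`. The source coordinates, the 10 km extent, the 0.2 km grid spacing, and the decay form exp(−3·Dij/αE) are assumptions for illustration only.

```python
import numpy as np

def sedc(source_xy, sample_xy, alpha_e, c0=1.0):
    """SEDC at each point, assuming the decay form exp(-3 * D_ij / alpha_e)."""
    diff = sample_xy[:, None, :] - source_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return (c0 * np.exp(-3.0 * dist / alpha_e)).sum(axis=1)

# Three hypothetical sources in a 10 km x 10 km study area.
sources = np.array([[2.0, 2.0], [5.0, 5.0], [8.0, 3.0]])

# Estimation grid with 0.2 km spacing (51 x 51 nodes).
axis = np.linspace(0.0, 10.0, 51)
gx, gy = np.meshgrid(axis, axis)
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])

# SEDC at every grid node, reshaped back to the grid for mapping.
surface = sedc(sources, grid_xy, alpha_e=2.0).reshape(gx.shape)
print(surface.shape, surface.max())
```

The resulting `surface` array can be passed to any plotting routine that accepts gridded data, while the same `sedc` call with sampling coordinates yields the predictor x1.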

Below are two examples using simulated data. In the top example, the exponential decay range was set to 2 km, and in the bottom example it was set to 4 km. The locations of the sources are fixed; however, the way we model the contributions from those sources changes as we vary the hyperparameter αE. With our model, we can find the sum of contributions at any point in our study area and map this estimate by interpolating values from an estimation grid. To construct our predictor variable, we instead estimate the contributions at the sampling locations. We can then compare those contributions to our observations at those sampling locations using regression. Regression gives us the coefficients and, with a little extra work, estimates of fit (e.g. R-squared or AIC).

We can compare the R-squared values from this regression across different hyperparameter values. The hyperparameter value that gives the best fit is the most predictive, and the value that gives the largest coefficient is the most physically meaningful. These are essentially the same; however, the coefficient method can be slightly less computationally expensive because estimating R-squared requires one more step.
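One way to see why the two criteria tend to agree, assuming a simple linear regression with a single Z-score standardized predictor: write r for the sample correlation between x1 and y, and s_x, s_y for their sample standard deviations. The least-squares slope on the standardized predictor z = (x1 − mean(x1)) / s_x is then

$$\hat{\beta}_1 = \frac{\operatorname{cov}(z, y)}{\operatorname{var}(z)} = \operatorname{cov}(z, y) = r\, s_y, \qquad R^2 = r^2 .$$

Because s_y does not depend on αE, the hyperparameter value that maximizes a positive standardized coefficient also maximizes r, and therefore R-squared. With interquartile range scaling the slope becomes r·s_y·(IQR/s_x), and the ratio IQR/s_x can change with αE, so in that case the agreement is approximate rather than exact.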

Essentially, this is the maximization of an objective function, where the objective function is either a function of the error associated with the regression or of the coefficient values. This can be done by evaluating a set of candidate values, as shown below, or with optimization algorithms that are less computationally expensive (e.g. gradient descent).
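As one illustration of a search that needs fewer evaluations than a dense grid, the sketch below uses a golden-section search, a derivative-free method swapped in here in place of the gradient descent named above. It assumes the objective has a single peak on the search interval, which is not guaranteed for real data. The data, the interval, the decay form exp(−3·Dij/αE), and all function names are hypothetical.

```python
import math
import numpy as np

def sedc(source_xy, sample_xy, alpha_e, c0=1.0):
    """SEDC predictor, assuming the decay form exp(-3 * D_ij / alpha_e)."""
    diff = sample_xy[:, None, :] - source_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return (c0 * np.exp(-3.0 * dist / alpha_e)).sum(axis=1)

def r_squared(x, y):
    """R-squared of the least-squares line y ~ intercept + slope * x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var()

def golden_section_max(f, lo, hi, tol=1e-3):
    """Maximize a single-peaked function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                      # peak lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # peak lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0

rng = np.random.default_rng(1)
sources = rng.uniform(0, 10, size=(7, 2))    # hypothetical locations (km)
samples = rng.uniform(0, 10, size=(20, 2))
y = 0.2 + 0.25 * sedc(sources, samples, 2.0) + rng.normal(0, 0.01, 20)

# Objective: fit of the regression as a function of the candidate range.
objective = lambda a: r_squared(sedc(sources, samples, a), y)
alpha_hat = golden_section_max(objective, 0.5, 8.0)
print(alpha_hat)
```

Each iteration reuses one of the two previous evaluations, so only one new regression is fitted per step.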

Use the toggle located under the figures to see how different values of αE change the figures on the bottom showing the EDC, SEDC, the constructed predictor variable x1, and the resulting R-squared and RAR on the right.

[Interactive figure: Sum of Exponentially Decaying Contributions. Controls: true αE for the simulated example (2 km or 4 km) and estimation αE (kilometers).]

Bibliography

Messier, K.P., Kane, E., Bolich, R. and Serre, M.L. 2014. Nitrate variability in groundwater of North Carolina using monitoring and private well data models. Environmental Science & Technology 48(18), pp. 10804–10812.