preliminary
source code of this document: https://github.com/phineas-pta/hogg2010/blob/main/hogg2010.qmd
this document’s computations were carried out with Julia 1.11.5 and rendered with Quarto 1.7.31
references & resources
reference: David W. Hogg, Jo Bovy, Dustin Lang. 2010. Data analysis recipes: Fitting a model to data
- paper: https://arxiv.org/pdf/1008.4686
- source code: https://github.com/davidwhogg/DataAnalysisRecipes/blob/master/straightline/straightline.tex
- lecture: https://www.youtube.com/playlist?list=PLBB44462E5201DD1E
you will also find here other papers from the Data analysis recipes series by David W. Hogg, and their original TeX files are available in the GitHub repository linked above
- https://arxiv.org/pdf/2005.14199
- https://arxiv.org/pdf/1710.06068
- https://arxiv.org/pdf/1205.4446
- https://arxiv.org/pdf/0807.4820
additional resources: the following are various python implementations i’ve come across, but it’s worth noting that they often do not include solutions for every exercise
- https://astrowizici.st/teaching/phs5000/
- https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-robust-with-outlier-detection.html
- https://github.com/binado/fitaline/blob/main/fitaline.ipynb
- https://github.com/wdconinc/data-analysis-recipes-fitting-a-model-to-data/blob/master/Exercises_in_%22Data_Analysis_Recipes_Fitting_a_Model_to_Data%22.ipynb
- https://github.com/nhuntwalker/data_analysis_recipes/blob/master/worked_exercises.ipynb
- https://github.com/jimbarrett27/hogg2010/blob/master/solutions.ipynb
- https://github.com/nadanai263/fittingmodeltodata/blob/master/Fitting%20a%20model%20to%20data%20(Hogg%2C%20Bovy%2C%20Lang%202010).ipynb
- https://github.com/Lachimax/hblfit/blob/main/notebooks/1-standard-practise.ipynb
data
data to be used throughout the tutorial:
\(x\) | \(y\) | \(\sigma_x\) | \(\sigma_y\) | \(\rho_{xy}\) |
---|---|---|---|---|
201 | 592 | 9 | 61 | -0.84 |
244 | 401 | 4 | 25 | 0.31 |
47 | 583 | 11 | 38 | 0.64 |
287 | 402 | 7 | 15 | -0.27 |
203 | 495 | 5 | 21 | -0.33 |
58 | 173 | 9 | 15 | 0.67 |
210 | 479 | 4 | 27 | -0.02 |
202 | 504 | 4 | 14 | -0.05 |
198 | 510 | 11 | 30 | -0.84 |
158 | 416 | 7 | 16 | -0.69 |
165 | 393 | 5 | 14 | 0.30 |
201 | 442 | 5 | 25 | -0.46 |
157 | 317 | 5 | 52 | -0.03 |
131 | 311 | 6 | 16 | 0.50 |
166 | 400 | 6 | 34 | 0.73 |
160 | 337 | 5 | 31 | -0.52 |
186 | 423 | 9 | 42 | 0.90 |
125 | 334 | 8 | 26 | 0.40 |
218 | 533 | 6 | 16 | -0.78 |
146 | 344 | 5 | 22 | -0.56 |
The full uncertainty covariance matrix for each data point is given by \(\begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}\) with \(\sigma_{xy}=\rho_{xy}\sigma_x\sigma_y\)
given the authors’ backgrounds, these data most likely come from astrophysics: each measurement of \(x\) and \(y\) carries its own uncertainty and correlation, and the values may have been further manipulated for the exercise
prepare Julia environment
using Statistics: mean, var, quantile
using LinearAlgebra: Diagonal, dot, logabsdet, eigen, transpose, isposdef
using LogExpFunctions: logaddexp, logsumexp
using KernelDensity: kde
using Turing, StatsPlots
# using MarkdownTables: markdown_table
setprogress!(false); # disable progress bar in Turing.jl
[ Info: [Turing]: progress logging is disabled globally
[ Info: [AdvancedVI]: global PROGRESS is set as false
data entry
const N = 20;
const x = [201., 244., 47., 287., 203., 58., 210., 202., 198., 158., 165., 201., 157., 131., 166., 160., 186., 125., 218., 146.];
const y = [592., 401., 583., 402., 495., 173., 479., 504., 510., 416., 393., 442., 317., 311., 400., 337., 423., 334., 533., 344.];
const σ_x = [ 9., 4., 11., 7., 5., 9., 4., 4., 11., 7., 5., 5., 5., 6., 6., 5., 9., 8., 6., 5.];
const σ_y = [ 61., 25., 38., 15., 21., 15., 27., 14., 30., 16., 14., 25., 52., 16., 34., 31., 42., 26., 16., 22.];
const ρ_xy = [-.84, .31, .64, -.27, -.33, .67, -.02, -.05, -.84, -.69, .3, -.46, -.03, .5, .73, -.52, .9, .4, -.78, -.56];
the same data in tensor format: a 2-vector \(Z_i\) and a 2×2 covariance matrix \(S_i\) per point, to be used in sections 7 & 8
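the construction is not shown here; a minimal sketch of what it presumably looks like (`Z` and `S` are the names used later, e.g. in exercises 13 & 14):

```julia
const Z = [[x[i], y[i]] for i ∈ 1:N]; # each data point as a 2-vector Zᵢ
const S = map(1:N) do i # per-point 2×2 covariance matrix Sᵢ
    σ_xy = ρ_xy[i] * σ_x[i] * σ_y[i]
    [σ_x[i]^2 σ_xy; σ_xy σ_y[i]^2]
end;
```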
data points 5 through 20 in the table, which exclude outliers, to be used in various exercises
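again a sketch, assuming the subset keeps points 5–20 (the names x´, y´, σ_y´, N´, Z´, S´ are the ones used later; σ_x´ is my own guess):

```julia
const keep = 5:N; # hypothetical index range dropping the 4 outliers
const N´ = length(keep);
const x´ = x[keep];
const y´ = y[keep];
const σ_x´ = σ_x[keep];
const σ_y´ = σ_y[keep];
const Z´ = Z[keep];
const S´ = S[keep];
```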
options to run MCMC chains
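the exact settings are not recoverable from the rendered output; a plausible sketch (only `N_chains = 4` is visible from the sampler logs below, the other values are assumptions):

```julia
const N_chains = 4; # the “Found initial step size” messages appear 4 times per run
const N_samples = 5_000; # kept samples per chain (assumed)
const N_warmup = 1_000; # warm-up iterations (assumed)
const N_thinning = 1; # thinning factor (assumed)
```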
helper functions
"""print value with its uncertainty"""
function print_uncertainty(𝕩, δ𝕩; digits=2)
return "$(round(𝕩; digits=digits)) ± $(round(δ𝕩; digits=digits))"
end
"""scatter plot with uncertainties as ellipses"""
function plot_ellipses(ℕ, 𝕩, 𝕪, ℤ, 𝕊; n_σ=1)
p = scatter(𝕩, 𝕪, label=nothing)
for i ∈ 1:ℕ
covellipse!(p, ℤ[i], 𝕊[i], n_std=n_σ, label=nothing)
end
return p
end
"""flatten MCMC chain from matrix with N_samples rows and N_chains columns to a vector of length N_samples × N_chains"""
function flatten_chains(mcmc_chain)
return reduce(vcat, mcmc_chain.data)
end
#=
# formatter to use with MarkdownTables
value_to_markdown(𝕩::Number) = "$(round(𝕩; digits=3))"
value_to_markdown(𝕩::AbstractString) = 𝕩
value_to_markdown(𝕩) = "`$𝕩`"
=#
"""print summary of MCMC chains with interesting values"""
function print_summary(mcmc_chains)
# return markdown_table(summarystats(mcmc_chains)[:, [:mean, :std, :mcse, :rhat]]; formatter=value_to_markdown)
return summarystats(mcmc_chains)[:, [:mean, :std, :mcse, :rhat]]
end
function print_information_criterion(data_model, mcmc_chains)
# posterior log likelihood
post_logℒ = loglikelihood(data_model, mcmc_chains) # shape: N_samples × N_chains
# Deviance Information Criterion
# copied from https://github.com/StatisticalRethinkingJulia/ParetoSmoothedImportanceSampling.jl/blob/master/src/dic.jl
# deviance = (-2) .* post_logℒ
# DIC = mean(deviance) + var(deviance) / 2
# given 𝔼(a+b𝕏) = a+b𝔼(𝕏) and 𝕍𝕒𝕣(a+b𝕏) = b²𝕍𝕒𝕣(𝕏) we can simplify the formula for DIC
DIC = 2 * (var(post_logℒ) - mean(post_logℒ))
# Widely Applicable Information Criterion / Watanabe–Akaike information criterion
# copied from https://github.com/StatisticalRethinkingJulia/ParetoSmoothedImportanceSampling.jl/blob/master/src/waic.jl
lppd = dropdims(logsumexp(post_logℒ .- log(size(post_logℒ)[1]); dims=1); dims=1) # log pointwise predictive density
pD = dropdims(var(post_logℒ; dims=1, corrected=false); dims=1) # overfitting penalty
WAIC = (-2) * sum(lppd - pD)
println("Deviance Information Criterion: $DIC")
println("Widely Applicable Information Criterion: $WAIC")
println("Log predictive density: $(sum(lppd))")
println("effective number of parameters: $(sum(pD))")
end
const x_min, x_max = extrema(x);
const y_min, y_max = extrema(y);
const angle90 = π / 2; # useful in sections 7 & 8
quick look at data
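the plotting code is hidden; presumably something along the lines of the `plot_ellipses` helper defined above (a guess, not the author’s exact chunk):

```julia
plot_ellipses(N, x, y, Z, S)
```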
1 Standard practice
quick summary
- Objective: Fit a straight line \(y=m\,x+b\) to data using \(\chi^2\) minimization.
- Assumptions:
- Negligible uncertainties in \(x\)-direction.
- Known, Gaussian uncertainties in \(y\)-direction.
- Linear relationship with no intrinsic scatter.
- Method:
- Solve using matrix algebra.
- Uncertainties in \(m\) and \(b\) derived from the covariance matrix.
- Limitations: Rarely valid in practice due to real-world data complexities (e.g., outliers, unknown uncertainties).
conventional method for fitting a straight line \(y=m\,x+b\) to data points where only the y-values have known Gaussian uncertainties
Construct the matrices: \[ Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \quad A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \quad C = \begin{bmatrix} \sigma_{y_1}^{2} & 0 & \cdots & 0 \\ 0 & \sigma_{y_2}^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{y_N}^{2} \end{bmatrix} \]
the best-fit slope \(m\) and intercept \(b\) are given by: \(\begin{bmatrix} b \\ m \end{bmatrix} = X = \left[A^\top\,C^{-1}\,A\right]^{-1}\,\left[A^\top\,C^{-1}\,Y\right]\)
1.1 exercise 01
The “standard uncertainty variance” is to be found in the diagonals of the matrix: \(\begin{bmatrix} \sigma_b^2 & \sigma_{mb} \\ \sigma_{mb} & \sigma_m^2 \end{bmatrix} = \left[A^\top\,C^{-1}\,A\right]^{-1}\)
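the code for this exercise is hidden; a sketch consistent with the variable names reused in exercise 10 (Yₑₓ₀₁, Aₑₓ₀₁, Cₑₓ₀₁, Xₑₓ₀₁), fitting only the outlier-free points 5–20:

```julia
Yₑₓ₀₁ = reshape(y´, :, 1); # vector to 1-column matrix
Aₑₓ₀₁ = hcat(ones(N´), x´);
Cₑₓ₀₁ = Diagonal(σ_y´ .^ 2);
tmp = transpose(Aₑₓ₀₁) * inv(Cₑₓ₀₁); # intermediary result
cov_bmₑₓ₀₁ = inv(tmp * Aₑₓ₀₁); # covariance matrix of b & m
Xₑₓ₀₁ = cov_bmₑₓ₀₁ * tmp * Yₑₓ₀₁;
bₑₓ₀₁, mₑₓ₀₁ = Xₑₓ₀₁;
```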
standard uncertainty (square root of variance) on b is 18.246166749268173, on m is 0.10778047654050076
i.e. \(b=\) 34.05 ± 18.25 and \(m=\) 2.24 ± 0.11
1.2 exercise 02
Yₑₓ₀₂ = reshape(y, :, 1); # vector to 1-column matrix
Aₑₓ₀₂ = hcat(ones(N), x);
Cₑₓ₀₂ = Diagonal(σ_y .^ 2);
tmp = transpose(Aₑₓ₀₂) * inv(Cₑₓ₀₂); # intermediary result
cov_bm = inv(tmp * Aₑₓ₀₂); # covariance matrix of b & m
Xₑₓ₀₂ = cov_bm * tmp * Yₑₓ₀₂;
bₑₓ₀₂, mₑₓ₀₂ = Xₑₓ₀₂;
scatter(x, y, yerror=σ_y, label=nothing)
plot!(𝕩 -> mₑₓ₀₁*𝕩 + bₑₓ₀₁, label="exr 01")
plot!(𝕩 -> mₑₓ₀₂*𝕩 + bₑₓ₀₂, label="exr 02")
new values for the fitted line: \(b=\) 213.27 ± 14.39 and \(m=\) 1.08 ± 0.08
the outliers drastically affect the fit:
- the slope is reduced by more than half and the intercept increases by ~180 to compensate.
- the formal uncertainties actually decrease, even though the fit is visibly worse
1.3 exercise 03
Yₑₓ₀₃ = reshape(y´, :, 1); # vector to 1-column matrix
Aₑₓ₀₃ = hcat(ones(N´), x´, x´.^2);
Cₑₓ₀₃ = Diagonal(σ_y´ .^ 2);
tmp = transpose(Aₑₓ₀₃) * inv(Cₑₓ₀₃); # intermediary result
cov_bmq = inv(tmp * Aₑₓ₀₃); # covariance matrix of b, m & q
Xₑₓ₀₃ = cov_bmq * tmp * Yₑₓ₀₃;
bₑₓ₀₃, mₑₓ₀₃, qₑₓ₀₃ = Xₑₓ₀₃;
scatter(x´, y´, yerror=σ_y´, label=nothing)
plot!(𝕩 -> mₑₓ₀₁*𝕩 + bₑₓ₀₁, label="exr 01")
plot!(𝕩 -> qₑₓ₀₃*𝕩^2 + mₑₓ₀₃*𝕩 + bₑₓ₀₃, label="exr 03")
\(b=\) 72.89 ± 38.91 and \(m=\) 1.6 ± 0.58 and \(q=\) 0.0023 ± 0.002
2 The objective function
quick summary
objective function: to measure how well a model fits the data → optimize this function to find best-fit model
for the standard line-fitting case (Gaussian uncertainties in \(y\) only), the best-fit model maximizes the likelihood of observing the data given the line parameters \((m, b)\); this is equivalent to minimizing \(\chi^2\), the weighted sum of squared differences between the data and the fitted line
2.1 exercise 04
Gaussian likelihood: \[ \mathcal{L} = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma_{t_i}^2}} \exp{\left(- \frac{(t_i - T)^2}{2\sigma_{t_i}^2}\right)} \]
log-likelihood: when faced with a product of (potentially many) values close to zero, it’s a good idea to take the logarithm and sum instead: \[ \log{\mathcal{L}} = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^N \log{\sigma_{t_i}^2} - \sum_{i=1}^N \frac{(t_i - T)^2}{2\sigma_{t_i}^2} \]
bonus: \(\log{\mathcal{L}} = \mathrm{K} - \frac{1}{2} \chi^2\) with \(K\) some constant
derivative: \[ \frac{\partial \log{\mathcal{L}}}{\partial T} = -\frac{\partial}{\partial T} \sum_{i=1}^N \frac{(t_i - T)^2}{{2\sigma_{t_i}^2}} = \sum_{i=1}^N \frac{(t_i - T)}{\sigma_{t_i}^2} \]
This takes its maximum when the derivative is \(0\) \[ \sum_{i=1}^N \frac{(t_i - T)}{\sigma_{t_i}^2} = 0 = \sum_{i=1}^N \frac{t_i}{\sigma_{t_i}^2} -T\sum_{i=1}^N \frac{1}{\sigma_{t_i}^2} \]
we get the weighted mean as expected \[ \hat{T} = \frac{\sum_{i=1}^N \frac{t_i}{\sigma_{t_i}^2}}{\sum_{i=1}^N \frac{1}{\sigma_{t_i}^2}} \]
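a quick numerical illustration of this result, with made-up measurements:

```julia
t = [10.2, 9.8, 10.5]; σ_t = [.3, .4, .5]; # hypothetical repeated measurements of one quantity
w = 1 ./ σ_t .^ 2 # inverse-variance weights
T̂ = sum(w .* t) / sum(w) # the maximum-likelihood estimate is the weighted mean
```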
2.2 exercise 05
Equation (7) is \[ \chi^2 = [Y-AX]^\top C^{-1}[Y-AX] \] By the product rule, \[ \frac{\partial \chi^2}{\partial X} = \left(\frac{\partial}{\partial X}[Y-AX]^\top\right)C^{-1}[Y-AX] + [Y-AX]^\top\left(\frac{\partial}{\partial X} C^{-1}[Y-AX]\right) \] Differentiating each factor, \[ \frac{\partial \chi^2}{\partial X} = -A^\top C^{-1}[Y-AX] + [Y-AX]^\top C^{-1}[-A] \] Multiplying out, \[ \frac{\partial \chi^2}{\partial X} = -A^\top C^{-1}Y + A^\top C^{-1}AX - Y^\top C^{-1}A + [AX]^\top C^{-1}A \] Since \(C^{-1}\) is symmetric, \(Y^\top C^{-1}A\) is the transpose of \(A^\top C^{-1}Y\), and \([AX]^\top C^{-1}A = X^\top A^\top C^{-1}A\) is the transpose of \(A^\top C^{-1}AX\); writing every term as a column vector gives \[ \frac{\partial \chi^2}{\partial X} = -2A^\top C^{-1}Y + 2 A^\top C^{-1}AX \] \(X\) takes its maximum-likelihood (minimum-\(\chi^2\)) value when the derivative vanishes: \[ -2A^\top C^{-1}Y + 2 A^\top C^{-1}A\hat{X} = 0 \] Rearranging: \[ A^\top C^{-1}A\hat{X} = A^\top C^{-1}Y \] And so \[ \hat{X} = \left(A^\top C^{-1}A\right)^{-1}A^\top C^{-1}Y \] as given in equation (5)
3 Pruning outliers
quick summary
- Generative Model:
- Treat data as a mixture of “good” (on the line) and “bad” (outliers) points.
- Likelihood combines foreground (line) and background (outlier) distributions.
- Priors:
- Flat priors for \(m, b\); informative priors for outlier parameters (e.g., amplitude \(P_{\mathrm{bad}}\), background variance \(V_{\mathrm{bad}}\)).
- Marginalization:
- Use MCMC to integrate over nuisance parameters (\(P_{\mathrm{bad}},Y_{\mathrm{bad}},V_{\mathrm{bad}}\)) and obtain marginalized posterior for \(m, b\).
- Avoids arbitrary data rejection (e.g., sigma clipping) and provides robust uncertainties.
importance of modeling outliers (data points that don’t fit the assumed model) rather than simply ignoring them
one of the well-known methods to deal with outliers: sigma clipping (iteratively remove points lying farther than some threshold, e.g. 1σ/3σ/5σ, from the fit, re-fit, and repeat until stable) → a procedure which works very well in practice but doesn’t have an objective function to optimize, so it isn’t statistically justifiable
authors’ proposal: mixture model
- outliers come from a distribution with probability \(P_{\mathrm{bad}}\), example of distribution: \(\mathcal{N}\left(Y_{\mathrm{bad}},V_{\mathrm{bad}}\right)\)
- inliers come from straight line with probability \(1-P_{\mathrm{bad}}\), therefore distribution: \(\mathcal{N}\left(mx_i+b,\sigma_{y_i}^2\right)\)
We now model our data as being drawn from a mixture of 2 Gaussians, one which is the ‘true’ relation and one which is a distribution of outliers. This mixture model is generated by marginalising an ‘exponential’ model which contains \(N\) latent classifying labels \(q_i\) for each data point \[ \mathcal{L} \propto \prod_{i=1}^N\left[ \frac{1-P_{\mathrm{bad}}}{\sqrt{2\pi\sigma_{y_i}^2}} \exp\left(-\frac{[y_i-mx_i-b]^2}{2\sigma_{y_i}^2}\right) + \frac{P_{\mathrm{bad}}}{\sqrt{2\pi[V_{\mathrm{bad}}+\sigma_{y_i}^2]}} \exp\left(-\frac{[y_i-Y_{\mathrm{bad}}]^2}{2[V_{\mathrm{bad}}+\sigma_{y_i}^2]}\right) \right] \]
3.1 exercise 06
hidden unrecommended code
# DRAFT (SHOULDN’T USE): manually modify log likelihood
@model function modelₑₓ₀₆_draft(ℕ, 𝕩, 𝕪, σ_𝕪)
b ~ Normal(0, 5)
m ~ Normal(0, 5)
# for outliers
P_bad ~ Uniform(0, 1)
Y_bad ~ Uniform(y_min, y_max) # mean for all outliers
V_bad ~ InverseGamma(.001, .001) # additional variance for outliers
for i ∈ 1:ℕ
𝕪̂ᵢ = b + m * 𝕩[i]
σ_badᵢ = sqrt(σ_𝕪[i]^2 + V_bad)
# faster computation: marginalize discrete param
inlier = log1p(-P_bad) + logpdf(Normal(𝕪̂ᵢ , σ_𝕪[i]), 𝕪[i])
outlier = log( P_bad) + logpdf(Normal(Y_bad, σ_badᵢ), 𝕪[i])
Turing.@addlogprob! logaddexp(inlier, outlier)
end
end
# recommendation: use convenient constructors for mixture models
@model function modelₑₓ₀₆(ℕ, 𝕩, 𝕪, σ_𝕪)
b ~ Normal(0, 5)
m ~ Normal(0, 5)
# for outliers
P_bad ~ Uniform(0, 1)
Y_bad ~ Uniform(y_min, y_max) # mean for all outliers
V_bad ~ InverseGamma(.001, .001) # additional variance for outliers
for i ∈ 1:ℕ
𝕪̂ᵢ = b + m * 𝕩[i]
σ_badᵢ = sqrt(σ_𝕪[i]^2 + V_bad)
𝕪[i] ~ MixtureModel(Normal, [(𝕪̂ᵢ, σ_𝕪[i]), (Y_bad, σ_badᵢ)], [1 - P_bad, P_bad])
end
end
data_modelₑₓ₀₆ = modelₑₓ₀₆(N, x, y, σ_y);
chainsₑₓ₀₆ = sample(data_modelₑₓ₀₆, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
┌ Info: Found initial step size
└ ϵ = 0.2
┌ Info: Found initial step size
└ ϵ = 0.2
┌ Info: Found initial step size
└ ϵ = 0.2
┌ Info: Found initial step size
└ ϵ = 0.2
diagnostics of chains:
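presumably produced by the same helper calls used in the later exercises; a sketch:

```julia
print_information_criterion(data_modelₑₓ₀₆, chainsₑₓ₀₆)
print_summary(chainsₑₓ₀₆)
```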
Deviance Information Criterion: 221.7352952317259
Widely Applicable Information Criterion: 881.058088349788
Log predictive density: -431.39938552609465
effective number of parameters: 9.129658648799358
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b 2.0360 4.8289 0.0345 1.0001
m 2.4291 0.0454 0.0003 1.0001
P_bad 0.2955 0.1259 0.0009 1.0000
Y_bad 442.6367 51.3189 0.3730 1.0000
V_bad 13962.1324 63748.2562 456.7060 1.0000
all chains seem converged, we’re good!
a good thing about MCMC is that it samples directly from the marginalized posterior, so there is no integral over the nuisance parameters to compute after the run
3.2 exercise 07
data_modelₑₓ₀₇ = modelₑₓ₀₆(N, x, y, σ_y ./ 2);
chainsₑₓ₀₇ = sample(data_modelₑₓ₀₇, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
println('═' ^ 80)
print_information_criterion(data_modelₑₓ₀₇, chainsₑₓ₀₇)
print_summary(chainsₑₓ₀₇)
┌ Info: Found initial step size
└ ϵ = 0.018750000000000003
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.025
════════════════════════════════════════════════════════════════════════════════
Deviance Information Criterion: 240.1627443618504
Widely Applicable Information Criterion: 953.061520252609
Log predictive density: -462.2800752497833
effective number of parameters: 14.250684876521213
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b -0.5160 5.3314 0.0381 1.0000
m 2.4016 0.9425 0.0251 1.0013
P_bad 0.6163 0.1512 0.0028 1.0004
Y_bad 417.2441 37.9225 0.2856 1.0000
V_bad 15136.0479 9283.7038 76.0338 1.0002
p = plot(layout=2)
density!(p, chainsₑₓ₀₆[:P_bad], subplot=1, label=nothing, title="P_bad with original error")
density!(p, chainsₑₓ₀₇[:P_bad], subplot=2, label=nothing, title="P_bad with half error")
with the original errors: the peak is near 0.2, consistent with the 3–4 apparent outliers among the 20 points
with ½ the original errors: the peak sits at a larger value, meaning many more points are being treated as outliers
little extra
maximum likelihood estimates:
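the call producing the `ModeResult` below is hidden; with a recent Turing version it is likely something like the following (older versions use `optimize(model, MLE())` with Optim.jl loaded instead):

```julia
mleₑₓ₀₆ = maximum_likelihood(data_modelₑₓ₀₆) # returns a ModeResult
```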
ModeResult with maximized lp of -4066.13
[-8.324643296432741, 0.044615203183588636, 0.0818541357432142, 239.83585277205523, Inf]
maximum a posteriori estimates:
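analogously for the MAP estimate (same caveat about the Turing version):

```julia
mapₑₓ₀₆ = maximum_a_posteriori(data_modelₑₓ₀₆)
```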
ModeResult with maximized lp of -Inf
[0.7959712976074101, -9.16487460309687, 0.8256362265152977, 497.5010434363468, Inf]
sampling from the prior distribution:
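a sketch of how the prior-only samples below can be drawn, reusing the same model with Turing’s `Prior()` sampler:

```julia
chainsₑₓ₀₆_prior = sample(data_modelₑₓ₀₆, Prior(), MCMCThreads(), N_samples, N_chains);
print_summary(chainsₑₓ₀₆_prior)
```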
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b -0.1233 5.0757 0.0725 0.9998
m 0.1104 4.9841 0.0705 1.0006
P_bad 0.5036 0.2871 0.0042 1.0000
Y_bad 381.8850 121.5185 1.7256 0.9998
V_bad Inf NaN NaN 1.0739
compute probability of each point to be classified as outlier:
# Compute the un-normalized log probabilities for each cluster
log_prob_assign_outliers = zeros(N_chains, N_samples, N);
Threads.@threads for step ∈ 1:N_samples
for chain ∈ 1:N_chains
b = chainsₑₓ₀₆[:b][step, chain]
m = chainsₑₓ₀₆[:m][step, chain]
V_bad = chainsₑₓ₀₆[:V_bad][step, chain]
P_bad = chainsₑₓ₀₆[:P_bad][step, chain]
Y_bad = chainsₑₓ₀₆[:Y_bad][step, chain]
for i ∈ 1:N
ŷᵢ = b + m * x[i]
inliers_log_prob = log1p(-P_bad) + logpdf(Normal(ŷᵢ, σ_y[i]), y[i])
σ_badᵢ = sqrt(σ_y[i]^2 + V_bad)
outliers_log_prob = log(P_bad) + logpdf(Normal(Y_bad, σ_badᵢ), y[i])
# Bayes rule to compute the assignment probability: ℙ(outliers | data) ∝ ℙ(data | outliers) ℙ(outliers)
log_prob_assign_outliers[chain, step, i] = outliers_log_prob - logaddexp(inliers_log_prob, outliers_log_prob)
end
end
end
log_prob_assign_outliers_bis = dropdims(logsumexp(log_prob_assign_outliers; dims=2); dims=2); # shape: N_chains × N
proba_outlier = exp.(log_prob_assign_outliers_bis .- log(N_samples));
# Average across the MCMC chain
proba_outlier_bis = dropdims(mean(proba_outlier; dims=1); dims=1);
4 Uncertainties in the best-fit parameters
quick summary
- Standard \(\chi^2\) Uncertainties:
- Valid only under strict assumptions (correct uncertainties, no outliers).
- Often underestimate true uncertainties in real data.
- Bayesian Posterior:
- Uncertainties derived from posterior distribution (e.g., variance, credible intervals).
- Accounts for model complexity and outliers.
- Empirical Methods:
- Bootstrap: Resample data with replacement to estimate parameter distributions.
- Jackknife: Leave-one-out resampling to assess sensitivity to individual points.
- Both methods useful when uncertainties are unknown or unreliable.
This brief section reiterates how to determine the uncertainties (often represented by a covariance matrix) associated with the best-fit parameters obtained from the fitting process.
It stresses that the validity of these uncertainty estimates depends heavily on the correctness of the initial assumptions about the data’s error properties and the appropriateness of the model itself.
4.1 exercise 08
\[ \begin{bmatrix} \sigma_b^2 & \sigma_{mb} \\ \sigma_{mb} & \sigma_m^2 \end{bmatrix} = \left[A^\top\,C^{-1}\,A\right]^{-1} = \]
2×2 Matrix{Float64}:
207.188 -1.05427
-1.05427 0.00599181
bootstrap (sample with replacement)
M = 30; # number of bootstrap samples
bₑₓ₀₈_bootstrap = zeros(M);
mₑₓ₀₈_bootstrap = zeros(M);
Threads.@threads for j ∈ 1:M # bootstrap sample
idx_bs = rand(1:N, N) # bootstrap indices
Y_bs = reshape(y[idx_bs], :, 1) # vector to 1-column matrix
A_bs = hcat(ones(N), x[idx_bs])
C_bs = Diagonal(σ_y[idx_bs] .^ 2)
tmp = transpose(A_bs) * inv(C_bs) # intermediary result
bₑₓ₀₈_bootstrap[j], mₑₓ₀₈_bootstrap[j] = inv(tmp * A_bs) * tmp * Y_bs
end
\[ \sigma^2_{b_\mathrm{bootstrap}} = \frac{1}{M}\sum^M_{j=1}\left( b_{j_\mathrm{bootstrap}} - b_\mathrm{best-fit} \right)^2 = \]
\[ \sigma^2_{m_\mathrm{bootstrap}} = \frac{1}{M}\sum^M_{j=1}\left( m_{j_\mathrm{bootstrap}} - m_\mathrm{best-fit} \right)^2 = \]
\[ \sigma_{bm_\mathrm{bootstrap}} = \frac{1}{M}\sum^M_{j=1}\left( b_{j_\mathrm{bootstrap}} - b_\mathrm{best-fit} \right)\left( m_{j_\mathrm{bootstrap}} - m_\mathrm{best-fit} \right) = \]
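the numeric values depend on the random resampling, so they are not reproduced here; a sketch of how the estimates would be computed, taking the full-data fit of exercise 02 as the best-fit values (consistent with the covariance matrix printed above):

```julia
σ²_b_bootstrap = mean((bₑₓ₀₈_bootstrap .- bₑₓ₀₂) .^ 2);
σ²_m_bootstrap = mean((mₑₓ₀₈_bootstrap .- mₑₓ₀₂) .^ 2);
σ_bm_bootstrap = mean((bₑₓ₀₈_bootstrap .- bₑₓ₀₂) .* (mₑₓ₀₈_bootstrap .- mₑₓ₀₂));
```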
jackknife (leave-one-out)
bₑₓ₀₈_jackknife = zeros(N); # one estimate per left-out point
mₑₓ₀₈_jackknife = zeros(N);
Threads.@threads for i ∈ 1:N # jackknife sample
idx_jk = fill(true, N) # jackknife indices
idx_jk[i] = false # remove i-th point
Y_jk = reshape(y[idx_jk], :, 1) # vector to 1-column matrix
A_jk = hcat(ones(N-1), x[idx_jk])
C_jk = Diagonal(σ_y[idx_jk] .^ 2)
tmp = transpose(A_jk) * inv(C_jk) # intermediary result
bₑₓ₀₈_jackknife[i], mₑₓ₀₈_jackknife[i] = inv(tmp * A_jk) * tmp * Y_jk
end
\[ \sigma^2_{b_\mathrm{jackknife}} = \left( 1 - \frac{1}{N} \right)\sum^N_{i=1}\left( b_{i_\mathrm{jackknife}} - b_\mathrm{best-fit} \right)^2 = \]
\[ \sigma^2_{m_\mathrm{jackknife}} = \left( 1 - \frac{1}{N} \right)\sum^N_{i=1}\left( m_{i_\mathrm{jackknife}} - m_\mathrm{best-fit} \right)^2 = \]
\[ \sigma_{bm_\mathrm{jackknife}} = \left( 1 - \frac{1}{N} \right)\sum^N_{i=1}\left( b_{i_\mathrm{jackknife}} - b_\mathrm{best-fit} \right)\left( m_{i_\mathrm{jackknife}} - m_\mathrm{best-fit} \right) = \]
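likewise for the jackknife estimates (same assumption about the reference best-fit values):

```julia
σ²_b_jackknife = (1 - 1/N) * sum((bₑₓ₀₈_jackknife .- bₑₓ₀₂) .^ 2);
σ²_m_jackknife = (1 - 1/N) * sum((mₑₓ₀₈_jackknife .- mₑₓ₀₂) .^ 2);
σ_bm_jackknife = (1 - 1/N) * sum((bₑₓ₀₈_jackknife .- bₑₓ₀₂) .* (mₑₓ₀₈_jackknife .- mₑₓ₀₂));
```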
4.2 exercise 09
data_modelₑₓ₀₉ = modelₑₓ₀₆(N´, x´, y´, σ_y´);
chainsₑₓ₀₉ = sample(data_modelₑₓ₀₉, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
println('═' ^ 80)
print_information_criterion(data_modelₑₓ₀₉, chainsₑₓ₀₉)
print_summary(chainsₑₓ₀₉)
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.2
┌ Info: Found initial step size
└ ϵ = 0.2
┌ Info: Found initial step size
└ ϵ = 0.2
════════════════════════════════════════════════════════════════════════════════
Deviance Information Criterion: 157.69638955957026
Widely Applicable Information Criterion: 626.5719652376915
Log predictive density: -306.96896785400037
effective number of parameters: 6.317014764845421
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b 2.4314 4.8047 0.0410 1.0000
m 2.4195 0.0405 0.0003 1.0001
P_bad 0.0563 0.0550 0.0006 1.0005
Y_bad 380.9167 119.6706 0.8939 1.0003
V_bad Inf Inf NaN 1.0002
data_modelₑₓ₀₉_½err = modelₑₓ₀₆(N´, x´, y´, σ_y´ ./ 2);
chainsₑₓ₀₉_½err = sample(data_modelₑₓ₀₉_½err, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
print_information_criterion(data_modelₑₓ₀₉_½err, chainsₑₓ₀₉_½err)
print_summary(chainsₑₓ₀₉_½err)
┌ Info: Found initial step size
└ ϵ = 0.036865234375000014
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.05
Deviance Information Criterion: 181.33037188830056
Widely Applicable Information Criterion: 718.9223116410654
Log predictive density: -349.0744541335065
effective number of parameters: 10.386701687026314
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b -0.1162 5.6647 0.0450 1.0000
m 2.4327 0.0453 0.0006 1.0000
P_bad 0.4148 0.1514 0.0013 1.0005
Y_bad 366.5573 54.5760 0.4160 1.0004
V_bad 24548078.6263 3461841875.5036 24521508.9123 1.0004
p = plot(layout=2)
plot!(p, kde((flatten_chains(chainsₑₓ₀₉[ :b]), flatten_chains(chainsₑₓ₀₉[ :m]))), subplot=1, xlabel="b", ylabel="m", title="with original error")
plot!(p, kde((flatten_chains(chainsₑₓ₀₉_½err[:b]), flatten_chains(chainsₑₓ₀₉_½err[:m]))), subplot=2, xlabel="b", ylabel="m", title="with half error")
The posterior distributions become much broader with the reduction in uncertainties: the halved uncertainties are severely underestimated.
The posterior distributions are no longer Gaussian either; they break into several “islands”, each corresponding to a different choice of which points are treated as outliers
If individual data-points have been estimated by some means that effectively relies on shared information, then there will be large covariances among the data points. These covariances bring off-diagonal elements into the covariance matrix \(C\), which was previously trivially constructed under the assumption that all covariances are precisely \(0\).
Once the off-diagonal elements are non-zero, \(\chi^2\) must be computed by the matrix expression: \(\chi^2 = [Y-AX]^\top C^{-1}[Y-AX]\)
5 Non-Gaussian uncertainties
no exercise in this section
quick summary: can somehow model as gaussian anyway …
- Non-Gaussian Uncertainties:
- Replace Gaussian likelihood with appropriate distributions (e.g., biexponential, Student’s t); a sketch follows this list.
- Model upper/lower limits by truncating likelihoods (e.g., Poisson for binned data).
- Avoid Ad-Hoc Methods:
- Softening \(\chi^2\) (e.g., Huber loss) is discouraged without generative justification.
- Use MCMC to infer parameters of complex likelihoods.
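as an illustration of the Student’s t idea mentioned above, a minimal sketch (not from the paper) of a robust likelihood in Turing, with an assumed ν = 4 degrees of freedom:

```julia
@model function model_studentt(ℕ, 𝕩, 𝕪, σ_𝕪; ν=4)
    b ~ Normal(0, 5)
    m ~ Normal(0, 5)
    for i ∈ 1:ℕ
        zᵢ = (𝕪[i] - b - m * 𝕩[i]) / σ_𝕪[i] # standardized residual
        Turing.@addlogprob! logpdf(TDist(ν), zᵢ) - log(σ_𝕪[i]) # t likelihood with scale Jacobian
    end
end
```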
6 Goodness of fit and unknown uncertainties
quick summary
- \(\chi^2\) as a Metric:
- Expected value \(\chi^2 \approx N - 2\) for a good fit.
- Large deviations suggest model inadequacy or incorrect uncertainties.
- Bayesian Perspective:
- Cannot reject a model outright; compare models using Bayes factors.
- Infer unknown uncertainties by marginalizing over them in the posterior.
standard (frequentist) paradigm: \(\chi^2\) to measure goodness of fit
with a straight line: \(\chi^2\) is expected to lie in the range \([N-2] \pm \sqrt{2[N-2]}\), where 2 is the number of model parameters (\(m\) & \(b\))
6.1 exercise 10
\(\chi^2 = [Y-AX]^\top C^{-1}[Y-AX]\)
tmp = Yₑₓ₀₁ - Aₑₓ₀₁ * Xₑₓ₀₁; # intermediary result
χ²ₑₓ₀₁ = dot(tmp, inv(Cₑₓ₀₁), tmp);
χ²ₑₓ₀₁ / (N´ - 1) # χ² per degrees of freedom
1.2453846607493875
This is a pretty good fit. Aside from the plot showing a linear fit that generally follows the points, the reduced \(\chi^2\) is close to 1, indicating that on average the squared difference between the model and each point is roughly on par with (actually slightly larger than) the measured variance of each point. Can’t ask for much better than that when you’re assuming the functional form of the distribution.
tmp = Yₑₓ₀₂ - Aₑₓ₀₂ * Xₑₓ₀₂; # intermediary result
χ²ₑₓ₀₂ = dot(tmp, inv(Cₑₓ₀₂), tmp);
χ²ₑₓ₀₂ / (N - 1) # χ² per degrees of freedom
15.261248567473642
This is a horrid fit: the squared differences are roughly 15× larger than what the measured variances can account for.
6.2 exercise 11
Yₑₓ₁₁ = reshape(y´, :, 1); # vector to 1-column matrix
Aₑₓ₁₁ = hcat(ones(N´), x´);
Sₑₓ₁₁ = 1:2000;
χ²ₑₓ₁₁ = map(Sₑₓ₁₁) do s
σ²ₑₓ₁₁ = fill(s, N´)
Cₑₓ₁₁ = Diagonal(σ²ₑₓ₁₁)
tmp1 = inv(Cₑₓ₁₁)
tmp2 = transpose(Aₑₓ₁₁) * tmp1
Xₑₓ₁₁ = inv(tmp2 * Aₑₓ₁₁) * tmp2 * Yₑₓ₁₁
tmp3 = Yₑₓ₁₁ - Aₑₓ₁₁ * Xₑₓ₁₁
return dot(tmp3, tmp1, tmp3)
end
subsetNm2 = N´ - 2
scatter(Sₑₓ₁₁, χ²ₑₓ₁₁, yaxis=:log, xlabel="S", ylabel="χ²", label=nothing)
hline!([subsetNm2], label="χ²=N-2")
vline!([mean(σ_y´ .^ 2)], label="mean σ²")
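the bracketing values below were presumably found by locating where \(\chi^2\) crosses \(N-2\) on the grid; a sketch (\(\chi^2\) decreases monotonically with \(S\) here):

```julia
idx = findfirst(<(subsetNm2), χ²ₑₓ₁₁); # first grid value of S for which χ² < N´-2
Sₑₓ₁₁[idx-1], Sₑₓ₁₁[idx]
```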
the value of \(S\) that makes \(\chi^2 = N-2\) lies between: 938 and 939
mean variances \(\left(\sigma_{y_i}^2\right)_{i=1}^N\) of the true data set: 739.0625
The uniform offset \(S\) necessary to make \(\chi^2 = N-2\) is considerably larger than both the mean and median variances \(\left(\sigma_{y_i}^2\right)_{i=1}^N\) of the true data set. If the data really are linear with Gaussian measurement uncertainties, perhaps those uncertainties themselves are underestimated.
6.3 exercise 12
Gaussian likelihood: \[ \mathcal{L} = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\mathrm{S}}} \exp{\left(- \frac{(y_i - mx_i - b)^2}{2\mathrm{S}}\right)} \]
@model function modelₑₓ₁₂(ℕ, 𝕩, 𝕪)
b ~ Normal(0, 5)
m ~ Normal(0, 5)
S ~ InverseGamma(.001, .001) # common variance for all data
for i ∈ 1:ℕ
𝕪[i] ~ Normal(b + m * 𝕩[i], sqrt(S))
end
end
data_modelₑₓ₁₂ = modelₑₓ₁₂(N´, x´, y´)
chainsₑₓ₁₂ = sample(data_modelₑₓ₁₂, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
println('═' ^ 80)
print_information_criterion(data_modelₑₓ₁₂, chainsₑₓ₁₂)
print_summary(chainsₑₓ₁₂)
┌ Info: Found initial step size
└ ϵ = 0.00078125
┌ Info: Found initial step size
└ ϵ = 0.0015625
┌ Info: Found initial step size
└ ϵ = 0.00078125
┌ Info: Found initial step size
└ ϵ = 0.00078125
════════════════════════════════════════════════════════════════════════════════
Deviance Information Criterion: 158.01676347170306
Widely Applicable Information Criterion: 629.2810782107024
Log predictive density: -310.03761918999885
effective number of parameters: 4.602919915352328
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b 0.6802 4.9422 0.0348 1.0001
m 2.3760 0.0548 0.0004 1.0000
S 1067.2937 459.6777 3.2433 0.9999
7 Arbitrary 2-dimensional uncertainties
quick summary
- Generative Model for 2D Uncertainties:
- Extend to account for uncertainties in \(x\) and \(y\) directions.
- Project uncertainties onto the line’s direction to compute likelihood.
- Comparison to PCA:
- PCA is invalid for noisy data; generative model accounts for measurement errors.
more complex scenario of fitting a line when both the x and y measurements have uncertainties: using a covariance tensor for each data point to represent the uncertainties in x, the uncertainties in y, and any correlation between the x and y errors
covariance tensor \(S_i = \begin{bmatrix} \sigma_{x_i}^2 & \sigma_{xy_i} \\ \sigma_{xy_i} & \sigma_{y_i}^2 \end{bmatrix}\) with \(\sigma_{xy}=\rho_{xy}\sigma_x\sigma_y\)
likelihood \[ \mathcal{L} = \prod_{i=1}^N \frac{1}{2\pi\sqrt{\det(S_i)}}\exp\left(-\frac{1}{2}\,\left[Z_i - Z\right]^\top S_i^{-1} \left[Z_i - Z\right]\right) \] with \[ Z_i = \begin{bmatrix} x_i \\ y_i \end{bmatrix} \quad ; \quad Z = \begin{bmatrix} \hat{x_i} \\ \hat{y_i} \end{bmatrix} = \begin{bmatrix} \hat{x_i} \\ m\hat{x_i}+b \end{bmatrix} \]
instead of slope \(m\), use unit vector \(\vec{v}\) orthogonal to the line (i.e. normal vector) \[ \vec{v} = \frac{1}{\sqrt{1+m^2}} \begin{bmatrix} -m \\ 1 \end{bmatrix} = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} \] with \(\theta = \arctan m\) angle made between the line and the \(x\) axis
orthogonal displacement \(\Delta_i\) of each data point \((x_i,y_i)\) from the line: \(\Delta_i = \vec{v}^\top Z_i - b\cos\theta\)
each data point’s covariance tensor \(S_i\) projects down to an orthogonal variance \(\Sigma_i^2 = \vec{v}^\top S_i \vec{v}\)
then the log likelihood can be written as \[ \log\mathcal{L} = K - \frac{1}{2}\sum_{i=1}^N \left[\frac{\Delta_i^2}{\Sigma_{i}^2} +\log{|\Sigma_{i}^2|} + \log{(1+m^2)} \right] \] where \(K\) is some constant
the authors suggest: performing the fit or likelihood maximization not in terms of \((m,b)\) but rather \((\theta,b_\perp)\), where \(b_\perp = b\cos\theta\) the perpendicular distance of the line from the origin
In the astrophysics literature, there is a tradition, when there are uncertainties in both directions, of fitting the “forward” and “reverse” relations — that is, fitting y as a function of x and then x as a function of y — and then splitting the difference between the 2 slopes so obtained, or treating the difference between the slopes as a systematic uncertainty. This is unjustified.
Another common method for finding the linear relationship in data when there are uncertainties in both directions is principal components analysis. The method of PCA does return a linear relationship for a data set, in the form of the dominant principal component. However, this is the dominant principal component of the observed data, not of the underlying linear relationship that, when noise is added, generates the observations. For this reason, the output of PCA will be strongly drawn or affected by the individual data point noise covariances \(S_i\).
example of mixture model with outlier \[ Z_{\mathrm{bad}} = \begin{bmatrix} X_{\mathrm{bad}} \\ Y_{\mathrm{bad}} \end{bmatrix} \quad ; \quad V_{\mathrm{bad}} = \begin{bmatrix} V_{\mathrm{bad}_x} & \rho_{\mathrm{bad}}\sqrt{V_{\mathrm{bad}_x}V_{\mathrm{bad}_y}} \\ \rho_{\mathrm{bad}}\sqrt{V_{\mathrm{bad}_x}V_{\mathrm{bad}_y}} & V_{\mathrm{bad}_y} \end{bmatrix} \] \[ \mathcal{L} \propto \prod_{i=1}^N\left[ \frac{1-P_{\mathrm{bad}}}{\sqrt{2\pi\Sigma_{i}^2}} \cos(\theta) \exp\left(-\frac{\Delta_i^2}{2\Sigma_{i}^2}\right) + \frac{P_{\mathrm{bad}}}{\sqrt{2\pi\det{(V_{\mathrm{bad}}+S_i)}}} \exp\left(-\frac{1}{2}\,\left[Z_i - Z_{\mathrm{bad}}\right]^\top \left[V_{\mathrm{bad}}+S_i\right]^{-1} \left[Z_i - Z_{\mathrm{bad}}\right]\right) \right] \]
7.1 exercise 13
hidden unrecommended code
# DRAFT (SHOULDN’T USE): manually modify log likelihood
@model function modelₑₓ₁₃_draft(ℕ, ℤ, 𝕊)
b ~ Normal(0, 5) # intercept
θ ~ Uniform(-angle90, angle90) # angle of the fitted line, use this instead of slope
v⃗ = [-sin(θ), cos(θ)] # unit normal vector
m := tan(θ) # `:=` operator add variable to the chain
for i ∈ 1:ℕ
Δᵢ = dot(ℤ[i], v⃗) - b*v⃗[2] # orthogonal displacement of each data point from the line
Σᵢ² = dot(v⃗, 𝕊[i], v⃗) # orthogonal variance of projection of each data point to the line
Turing.@addlogprob! -.5*(Δᵢ^2 / Σᵢ² + log(abs(Σᵢ²)) + log1p(m^2))
end
end
# recommendation: multivariate normal distribution
@model function modelₑₓ₁₃(ℕ, ℤ, 𝕊)
b ~ Normal(0, 5)
m ~ Normal(0, 5)
for i ∈ 1:ℕ
xᵢ = ℤ[i][1]
ℤ̂ᵢ = [xᵢ, b + m * xᵢ]
ℤ[i] ~ MvNormal(ℤ̂ᵢ, 𝕊[i])
end
end
data_modelₑₓ₁₃ = modelₑₓ₁₃(N´, Z´, S´);
chainsₑₓ₁₃ = sample(data_modelₑₓ₁₃, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
println('═' ^ 80)
print_information_criterion(data_modelₑₓ₁₃, chainsₑₓ₁₃)
print_summary(chainsₑₓ₁₃)
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.025
════════════════════════════════════════════════════════════════════════════════
Deviance Information Criterion: 247.00888431179794
Widely Applicable Information Criterion: 984.7350833469969
Log predictive density: -488.041225247589
effective number of parameters: 4.326316425909415
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b 4.0385 4.7008 0.0353 1.0001
m 2.4210 0.0355 0.0003 1.0001
7.2 exercise 14
hidden unrecommended code
# DRAFT (SHOULDN’T USE): manually modify log likelihood
@model function modelₑₓ₁₄_draft(ℕ, ℤ, 𝕊)
b ~ Normal(0, 5) # intercept
θ ~ Uniform(-angle90, angle90) # angle of the fitted line, use this instead of slope
v⃗ = [-sin(θ), cos(θ)] # unit normal vector
m := tan(θ) # `:=` operator add variable to the chain
# for outliers
P_bad ~ Uniform(0, 1)
X_bad ~ Uniform(x_min, x_max)
Y_bad ~ Uniform(y_min, y_max)
σ_bad_x ~ InverseGamma(.001, .001)
σ_bad_y ~ InverseGamma(.001, .001)
ρ_bad ~ Uniform(-1, 1)
Z_bad = [X_bad, Y_bad]
V_bad_xy = ρ_bad * σ_bad_x * σ_bad_y
V_bad = [σ_bad_x^2 V_bad_xy; V_bad_xy σ_bad_y^2]
for i ∈ 1:ℕ
Δᵢ = dot(ℤ[i], v⃗) - b*v⃗[2] # orthogonal displacement of each data point from the line
Σᵢ² = dot(v⃗, 𝕊[i], v⃗) # orthogonal variance of projection of each data point to the line
tmp1 = ℤ[i] - Z_bad
tmp2 = 𝕊[i] + V_bad
inlier = log1p(-P_bad) - .5*log(abs(Σᵢ²)) + log(abs(v⃗[2])) - .5 * Δᵢ^2 / Σᵢ²
outlier = log(P_bad) - .5 * (logabsdet(tmp2)[1] + dot(tmp1, tmp2, tmp1))
Turing.@addlogprob! logaddexp(inlier, outlier)
end
end
@model function modelₑₓ₁₄(ℕ, ℤ, 𝕊)
b ~ Normal(0, 5)
m ~ Normal(0, 5)
# for outliers
P_bad ~ Uniform(0, 1)
X_bad ~ Uniform(x_min, x_max)
Y_bad ~ Uniform(y_min, y_max)
σ_bad_x ~ InverseGamma(.001, .001)
σ_bad_y ~ InverseGamma(.001, .001)
ρ_bad ~ Uniform(-1, 1)
Z_bad = [X_bad, Y_bad]
V_bad_xy = ρ_bad * σ_bad_x * σ_bad_y
V_bad = [σ_bad_x^2 V_bad_xy; V_bad_xy σ_bad_y^2]
# maybe more computationally efficient if use LKJ Cholesky for correlation matrix
# see: https://discourse.julialang.org/t/singular-exception-with-lkjcholesky/85020/25
for i ∈ 1:ℕ
Σ_badᵢ = 𝕊[i] + V_bad
if !isposdef(Σ_badᵢ) # sometimes isn’t positive-definitive
Turing.@addlogprob! -Inf # force sampler to reject a sample
break
else
xᵢ = ℤ[i][1]
ℤ̂ᵢ = [xᵢ, b + m * xᵢ]
ℤ[i] ~ MixtureModel(MvNormal, [(ℤ̂ᵢ, 𝕊[i]), (Z_bad, Σ_badᵢ)], [1 - P_bad, P_bad])
end
end
end
data_modelₑₓ₁₄ = modelₑₓ₁₄(N, Z, S);
chainsₑₓ₁₄ = sample(data_modelₑₓ₁₄, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
println('═' ^ 80)
print_information_criterion(data_modelₑₓ₁₄, chainsₑₓ₁₄)
print_summary(chainsₑₓ₁₄)
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.05
┌ Info: Found initial step size
└ ϵ = 0.025
════════════════════════════════════════════════════════════════════════════════
Deviance Information Criterion: 369.79616331784376
Widely Applicable Information Criterion: 1463.3422277068032
Log predictive density: -709.1683608352579
effective number of parameters: 22.502753018143757
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b 3.7896 4.7017 0.0377 1.0000
m 2.4312 0.0359 0.0003 1.0002
P_bad 0.1942 0.0861 0.0007 1.0000
X_bad 185.3930 52.9724 0.4222 1.0001
Y_bad 457.8542 56.7519 0.4168 1.0002
σ_bad_x 141.0835 96.3230 0.7048 0.9999
σ_bad_y 111.8610 87.5325 0.7000 1.0001
ρ_bad -0.5655 0.3885 0.0036 1.0002
bₑₓ₁₄ = mean(chainsₑₓ₁₄[:b]);
mₑₓ₁₄ = mean(chainsₑₓ₁₄[:m]);
plot_ellipses(N, x, y, Z, S)
plot!(𝕩 -> mₑₓ₁₄*𝕩 + bₑₓ₁₄, label="exr 14")
plot!(𝕩 -> mₑₓ₀₂*𝕩 + bₑₓ₀₂, label="exr 02")
little extra
compute probability of each point to be classified as outlier:
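the code for this part is hidden; a sketch mirroring the computation from section 3, adapted to the 2-D mixture of exercise 14 (variable names are my own):

```julia
log_prob_assign_outliersₑₓ₁₄ = zeros(N_chains, N_samples, N);
Threads.@threads for step ∈ 1:N_samples
    for chain ∈ 1:N_chains
        b = chainsₑₓ₁₄[:b][step, chain]
        m = chainsₑₓ₁₄[:m][step, chain]
        P_bad = chainsₑₓ₁₄[:P_bad][step, chain]
        Z_bad = [chainsₑₓ₁₄[:X_bad][step, chain], chainsₑₓ₁₄[:Y_bad][step, chain]]
        σ_bad_x = chainsₑₓ₁₄[:σ_bad_x][step, chain]
        σ_bad_y = chainsₑₓ₁₄[:σ_bad_y][step, chain]
        ρ_bad = chainsₑₓ₁₄[:ρ_bad][step, chain]
        V_bad_xy = ρ_bad * σ_bad_x * σ_bad_y
        V_bad = [σ_bad_x^2 V_bad_xy; V_bad_xy σ_bad_y^2]
        for i ∈ 1:N
            ẑᵢ = [x[i], b + m * x[i]] # point on the line at xᵢ
            inliers_log_prob = log1p(-P_bad) + logpdf(MvNormal(ẑᵢ, S[i]), Z[i])
            outliers_log_prob = log(P_bad) + logpdf(MvNormal(Z_bad, S[i] + V_bad), Z[i])
            # Bayes rule: ℙ(outlier | data) ∝ ℙ(data | outlier) ℙ(outlier)
            log_prob_assign_outliersₑₓ₁₄[chain, step, i] = outliers_log_prob - logaddexp(inliers_log_prob, outliers_log_prob)
        end
    end
end
proba_outlierₑₓ₁₄ = dropdims(mean(exp.(log_prob_assign_outliersₑₓ₁₄); dims=(1, 2)); dims=(1, 2)); # average over samples & chains
```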
7.3 exercise 15
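the “reverse” fit code is hidden; a sketch that regresses \(x\) on \(y\), weighting by the \(x\)-uncertainties (σ_x´ is assumed from the data-subset block):

```julia
Yₑₓ₁₅ = reshape(x´, :, 1); # now the “response” is x
Aₑₓ₁₅ = hcat(ones(N´), y´);
Cₑₓ₁₅ = Diagonal(σ_x´ .^ 2);
tmp = transpose(Aₑₓ₁₅) * inv(Cₑₓ₁₅); # intermediary result
bₑₓ₁₅, mₑₓ₁₅ = inv(tmp * Aₑₓ₁₅) * tmp * Yₑₓ₁₅; # x = mₑₓ₁₅ * y + bₑₓ₁₅
```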
plot_ellipses(N´, x´, y´, Z´, S´)
plot!(𝕩 -> mₑₓ₀₁*𝕩 + bₑₓ₀₁, label="forward fit")
plot!(𝕩 -> (𝕩 - bₑₓ₁₅) / mₑₓ₁₅, label="reverse fit")
- forward fit: slope = 2.24 and intercept = 34.05
- reverse fit: slope = 2.64 and intercept = -49.94
7.4 exercise 16
Z̄´ = [mean(x´), mean(y´)];
Z´_centered = [Z´[i] - Z̄´ for i ∈ 1:N´];
Q = sum([z*transpose(z) for z ∈ Z´_centered])
2×2 Matrix{Float64}:
25057.0 55542.8
55542.8 1.36261e5
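the dominant eigenvector printed below is presumably extracted along these lines (`eigen` is already imported above):

```julia
eigenₑₓ₁₆ = eigen(Q); # eigen-decomposition of the scatter matrix
eigvecs_max = eigenₑₓ₁₆.vectors[:, argmax(eigenₑₓ₁₆.values)]; # eigenvector of the largest eigenvalue
```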
2-element Vector{Float64}:
0.3824362551881647
0.923981877916257
mₑₓ₁₆ = eigvecs_max[end] / eigvecs_max[begin];
bₑₓ₁₆ = Z̄´[2] - mₑₓ₁₆ * Z̄´[1]; # all principal components go through the means
scatter(x´, y´, label=nothing)
plot!(𝕩 -> mₑₓ₁₆*𝕩 + bₑₓ₁₆, label="exr 16")
plot!(𝕩 -> mₑₓ₀₁*𝕩 + bₑₓ₀₁, label="exr 01")
slope = 2.42 and intercept = -4.6
8 Intrinsic scatter
quick summary
- Intrinsic Scatter:
- Add a parameter \(V\) to model intrinsic variance (e.g., perpendicular to the line).
- Model scatter perpendicular to the line, marginalizing over true positions.
- Avoids ad-hoc “scatter subtraction” and provides principled uncertainty estimates.
introduce an intrinsic Gaussian variance \(V\), orthogonal to the line
the log likelihood can be written as \[ \log\mathcal{L} = K - \frac{1}{2}\sum_{i=1}^N \left[\frac{\Delta_i^2}{\Sigma_{i}^2+V} +\log{|\Sigma_{i}^2+V|} + \log{(1+m^2)} \right] \] where \(K\) is some constant
limitations: it models only the distribution orthogonal to the relationship
8.1 exercise 17
@model function modelₑₓ₁₇(ℕ, ℤ, 𝕊)
b ~ Normal(0, 5)
θ ~ Uniform(-angle90, angle90) # angle of the fitted line, use this instead of slope
v⃗ = [-sin(θ), cos(θ)] # unit normal vector
m := tan(θ) # `:=` operator add variable to the chain
V ~ InverseGamma(.001, .001) # intrinsic Gaussian variance orthogonal to the line
for i ∈ 1:ℕ
Δᵢ = dot(ℤ[i], v⃗) - b*v⃗[2] # orthogonal displacement of each data point from the line
Σᵢ² = dot(v⃗, 𝕊[i], v⃗) # orthogonal variance of projection of each data point to the line
tmp = Σᵢ² + V # intermediary result
Turing.@addlogprob! -.5*(Δᵢ^2 / tmp + log(abs(tmp)) + log1p(m^2))
end
end
data_modelₑₓ₁₇ = modelₑₓ₁₇(N´, Z´, S´);
chainsₑₓ₁₇ = sample(data_modelₑₓ₁₇, NUTS(), MCMCThreads(), N_samples, N_chains; num_warmup=N_warmup, thinning=N_thinning);
println('═' ^ 80)
print_information_criterion(data_modelₑₓ₁₇, chainsₑₓ₁₇)
print_summary(chainsₑₓ₁₇)
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.025
┌ Info: Found initial step size
└ ϵ = 0.025
════════════════════════════════════════════════════════════════════════════════
Deviance Information Criterion: 123.47737305379421
Widely Applicable Information Criterion: 491.75357604976875
Log predictive density: -242.6821637024135
effective number of parameters: 3.194624322470837
Summary Statistics
parameters mean std mcse rhat
Symbol Float64 Float64 Float64 Float64
b 2.0692 4.9119 0.0354 1.0001
θ 1.1751 0.0070 0.0001 1.0002
m 2.3946 0.0474 0.0003 1.0002
V 7.5046 20.5960 0.1584 1.0000
bₑₓ₁₇ = mean(chainsₑₓ₁₇[:b]);
mₑₓ₁₇ = mean(chainsₑₓ₁₇[:m]);
move_up = sqrt(mean(chainsₑₓ₁₇[:V])) / cos(mean(chainsₑₓ₁₇[:θ]));
plot_ellipses(N´, x´, y´, Z´, S´)
plot!(𝕩 -> mₑₓ₁₇*𝕩 + bₑₓ₁₇, label="exr 17")
plot!(𝕩 -> mₑₓ₁₇*𝕩 + bₑₓ₁₇ + move_up, linestyle=:dash, label=nothing)
plot!(𝕩 -> mₑₓ₁₇*𝕩 + bₑₓ₁₇ - move_up, linestyle=:dash, label=nothing)
plot!(𝕩 -> mₑₓ₀₁*𝕩 + bₑₓ₀₁, label="exr 01")
8.2 exercise 18
# confidence level α → 2 quantiles: 1/2 ± α/2
level9x = quantile(chainsₑₓ₁₇[:V].data, [.5+.95/2 .5+.99/2]) # use matrix not vector for vline() later
density(chainsₑₓ₁₇[:V], title="V", label=nothing)
vline!(level9x, label=["95%" "99%"])
the posterior is very skewed, so the upper limit is much more informative than the lower limit, which is very close to \(0\) and therefore tells us little.