Merged
Commits
49 commits
e9713c1
Created quick start page, various small edits, componentArrays added
Alexi-M Jun 30, 2025
42e4bee
added graphs and new nll structure (not working)
Alexi-M Jul 2, 2025
d21f36f
fix that should fix something something merge issues (please work)
Alexi-M Jul 2, 2025
02fc116
fix fix fix (pt 30823423)
Alexi-M Jul 2, 2025
22a2cf8
fix export
Moelf Jul 2, 2025
c5f4902
fix color cycle
Moelf Jul 3, 2025
a4d7b5e
fixed likelihoodSpec and added test cases
Alexi-M Jul 3, 2025
d93fe08
removed all references to roofitnll functor
Alexi-M Jul 3, 2025
d8da6f7
removed vector{vector} support for LikelihoodSpec
Alexi-M Jul 3, 2025
9a43a10
fixed last instance of roofitnll
Alexi-M Jul 3, 2025
160e183
Added chi2 to likelihoodspec struct
Alexi-M Jul 8, 2025
34b7f84
Merge branch 'main' into graphs
Alexi-M Jul 8, 2025
9a1754b
minor docs changes and large data testing
Alexi-M Jul 8, 2025
4646214
minor fixes to makie ext, reintroduced vector{vector} input to likeli…
Alexi-M Jul 9, 2025
2c0ebb9
New example that I didnt actually check
Alexi-M Jul 9, 2025
9f941a5
consistent parameter formatting in runtests.jl
Alexi-M Jul 10, 2025
f7dd9ff
formatting update
Alexi-M Jul 10, 2025
62a59cc
Merge branch 'main' into documentation!!!
Alexi-M Jul 10, 2025
d0056a8
removed random ß
Alexi-M Jul 10, 2025
c51a772
fixed code duplication issue
Alexi-M Jul 10, 2025
5ca8754
fix aqua
Moelf Jul 10, 2025
38d5d88
wrapperless methods for likelihoodspec-optimization integration
Alexi-M Jul 10, 2025
02e943c
no more revisions
Alexi-M Jul 10, 2025
86d1552
fixed errors and tests (and new example)
Alexi-M Jul 10, 2025
31fbebb
fix
Alexi-M Jul 10, 2025
fa838f0
fix fix fix
Alexi-M Jul 10, 2025
07a1bc1
why
Alexi-M Jul 10, 2025
93f0f02
just work this time
Alexi-M Jul 10, 2025
ead2871
if this errors one more time
Alexi-M Jul 10, 2025
45636ad
please
Alexi-M Jul 10, 2025
056339a
whywhywhywhywhy
Alexi-M Jul 10, 2025
c40250c
documentation (math???)
Alexi-M Jul 11, 2025
bb1bb47
loaded doc test
Alexi-M Jul 11, 2025
b54ddc0
readme requested change
Alexi-M Jul 11, 2025
25aac2d
autodocs references test
Alexi-M Jul 11, 2025
03bd89b
latex small commit
Alexi-M Jul 11, 2025
1aa25a5
more latex
Alexi-M Jul 11, 2025
db34df3
likelihood spec et chi2 optimization
Alexi-M Jul 14, 2025
fb6507e
chi2 equations
Alexi-M Jul 14, 2025
7b27543
Merge branch 'main' into documentation!!!
Alexi-M Jul 14, 2025
885dcb0
rewrote intro doc without functor mention, vector valued methods for …
Alexi-M Jul 16, 2025
2ed5d09
removed param freezing
Alexi-M Jul 16, 2025
cb16a97
random is no longer random
Alexi-M Jul 16, 2025
31834ea
project.toml changes
Alexi-M Jul 16, 2025
4bb4c1c
why.
Alexi-M Jul 16, 2025
504ed32
compat for random = "1"
Alexi-M Jul 16, 2025
679372f
better docs structure & likelihoodspec method documentation
Alexi-M Jul 16, 2025
39b7ea7
Merge branch 'main' into new_documentation
Alexi-M Jul 17, 2025
f317b17
more docs
Alexi-M Jul 17, 2025
168 changes: 162 additions & 6 deletions docs/src/90-quickstart.md
@@ -1,20 +1,176 @@
```@contents
Pages = ["showcase.md"]
```

# [Quick Start](@id quickstart)
To optimize the fit of one function over a data set,
```julia
using FHist, Optimize, ForwardDiff
using BinnedDistributionFit, FHist, Optimization, ComponentArrays, ForwardDiff
# User inputs
d(x, ps...) = x^ps[1]+ps[2] # <-- Enter function here
h = Hist1D(; bincounts = 1:100, binedges = 1:100) # <-- Enter histogram here
pdf(x, ps...) = x^ps[1] + ps[2] # <-- Enter function here
hist = Hist1D(; bincounts = 1:100, binedges = 1:101) # <-- Enter histogram here

support = extrema(binedges(h1))
support = extrema(binedges(hist))

NLL = LikelihoodSpec(ExtendPdf(d, support), h1)
NLL = LikelihoodSpec(ExtendPdf(pdf, support), hist)

# Guess of the values of ps
ps = [1, 2]
# Optimization implementation
optf = OptimizationFunction(NLL, AutoForwardDiff())
prob = OptimizationProblem(optf, ComponentArray(norms=integral(h),p1 = ps))
prob = OptimizationProblem(optf, ComponentArray(norms = [sum(bincounts(hist))], p1 = ps))
sol = solve(prob, Optimization.LBFGS())

@show sol.u
```

# Introduction
This page is a quick start guide to the BinnedDistributionFit package. To create a fit for one function, first create an `ExtendPdf`, which wraps the desired function together with its support. The function is expected to take some number of free parameters, labeled `ps` throughout the documentation, though parameters are not required to create an `ExtendPdf`.

Once an `ExtendPdf` is created, we can then evaluate it at a single point using the [`scalar_eval`](@ref BinnedDistributionFit.scalar_eval) function. To evaluate a pdf at multiple points, we can use [`vector_eval`](@ref BinnedDistributionFit.vector_eval) instead.

```julia
f(x, _) = x^2 + 2x
d1 = ExtendPdf(f, (1, 2))
scalar_eval(d1, 2)
# Output: 8
vector_eval(d1, [1, 2])
# Output: [3, 8]

g(x, ps) = x^ps[1] + x*ps[2]
d2 = ExtendPdf(g, (1, 2))
scalar_eval(d2, 2, [2, 3])
# Output: 10
vector_eval(d2, [1, 2], [2, 3])
# Output: [4, 10]
```

## NLL Loss Function
`BinnedDistributionFit` uses the FHist package to create histograms. Given a `Hist1D`, we can score how well a function fits the data using several methods. The simplest is negative log likelihood scoring, as it does not require the standard deviation of each histogram bin.

We create a [`LikelihoodSpec`](@ref BinnedDistributionFit.LikelihoodSpec) with our data and pdf, which we can then call with specific values of the parameters of pdf.
```julia
pdf = ExtendPdf((x, _) -> x, (1, 3))
hist = Hist1D(; binedges = 1:3, bincounts = [2.0, 4.0])
NLL = LikelihoodSpec(
pdf, hist;
num_integrator = BinnedDistributionFit.QuadGKIntegrator()
)
```
The method of integration can be specified. By default, it is set to `SimpleSumIntegrator()`.
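As a sketch, the same spec can be built with either integrator; the constructor names are taken from this page, but whether `SimpleSumIntegrator` must be qualified with the module name is an assumption:

```julia
using BinnedDistributionFit, FHist

pdf = ExtendPdf((x, _) -> x, (1, 3))
hist = Hist1D(; binedges = 1:3, bincounts = [2.0, 4.0])

# Default integrator: a simple sum, equivalent to passing it explicitly
NLL_default = LikelihoodSpec(pdf, hist)
NLL_sum = LikelihoodSpec(pdf, hist;
    num_integrator = BinnedDistributionFit.SimpleSumIntegrator())

# Adaptive quadrature via QuadGK, as used in the example above
NLL_quad = LikelihoodSpec(pdf, hist;
    num_integrator = BinnedDistributionFit.QuadGKIntegrator())
```

For a pdf that is linear across each bin, as here, the two integrators should agree closely; the quadrature-based one matters for strongly curved pdfs.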

In this case, as there are no parameters `ps` specified when `pdf` was created, we call `NLL` with just the norm of `pdf`. Either a vector of vectors or a `ComponentArray` may be used to call `NLL`.
```julia
NLL([[2.0]])
# Output: 1.6827899396467232

NLL(ComponentArray(norm = [2.0]))
# Output: 1.6827899396467232
```
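As a sanity check, this value can be reproduced by hand from the single-pdf NLL formula in the introduction. The sketch below assumes the pdf is evaluated at bin centers; the normalization integral ``\int_1^3 x\,dx = 4`` is exact here:

```julia
# Manual evaluation of the single-pdf NLL formula:
# -Σ y_i * log(pdf(x_i)/∫pdf) - (N_obs*log(N_exp) - N_exp)
y = [2.0, 4.0]        # bin counts
centers = [1.5, 2.5]  # bin centers of binedges = 1:3
Z = 4.0               # ∫₁³ x dx = 4
N_exp = 2.0           # the norm passed to NLL
N_obs = sum(y)        # 6 observed events
nll = -sum(y .* log.(centers ./ Z)) - (N_obs * log(N_exp) - N_exp)
# nll ≈ 1.68279, matching NLL([[2.0]]) above
```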

## Using Multiple Functions

In addition to the `ExtendPdf` struct, `BinnedDistributionFit` also provides the `SumOfPdfs` struct to handle multiple pdfs evaluated over the same dataset. The same `ExtendPdf` syntax is used; internally, summing two `ExtendPdf`s produces a `SumOfPdfs`.

The order of arguments for both `NLL` methods (the vector-of-vectors form and the `ComponentArray` form) is the same: `norms, pN...`, where `pN` is the Nth set of parameters.
```julia
pdf1 = ExtendPdf((x, _) -> x, (1,3))
pdf2 = ExtendPdf((x, _) -> x^2, (1,3))

hist = Hist1D(; binedges = 1:3, bincounts = [2.0, 4.0])
NLL = LikelihoodSpec(
pdf1 + pdf2, hist;
num_integrator = BinnedDistributionFit.QuadGKIntegrator()
)

NLL([[2.0, 0.5], [], []])
# Output: 0.8497340452248006

NLL(ComponentArray(norms = [2.0, 0.5], p1 = [0], p2 = [0]))
# Output: 0.8497340452248006
```

## Chi Squared Loss Function
Another loss function implemented in `BinnedDistributionFit` is the chi squared function (referred to as `chi2`). The NLL and chi squared loss functions are not commensurable; only one should be used when analyzing a given data set.

We use the keyword argument `loss_type` when defining the `LikelihoodSpec` to select chi squared as the loss type; the default is `NLL`.
```julia
pdf = ExtendPdf((x, _) -> x-0.5, (1,3))
hist = Hist1D(; binedges = 1:3, bincounts = [1, 2], sumw2 = [1, 1])
x = LikelihoodSpec(
pdf, hist;
loss_type = BinnedDistributionFit.CSQ(),
num_integrator = BinnedDistributionFit.QuadGKIntegrator()
)

x([3.0])
# Output: 0

x(ComponentArray(norm = [3.0]))
# Output: 0
```
Note that the `sumw2` field must contain only nonzero values; otherwise the value of the `LikelihoodSpec` will be `Inf`.
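To see why the example above evaluates to zero, here is a hand computation of the single-pdf chi squared formula (a sketch; it assumes the pdf is evaluated at bin centers):

```julia
# Single-pdf chi squared by hand:
# chi2 = Σ (y_i - norm * pdf(x_i) / ∫pdf)^2 / σ_i^2
userpdf(x) = x - 0.5
centers = [1.5, 2.5]   # bin centers of binedges = 1:3
y = [1.0, 2.0]         # bin counts
sumw2 = [1.0, 1.0]     # per-bin variances σ²
Z = 3.0                # ∫₁³ (x - 0.5) dx = 3
expected = 3.0 .* userpdf.(centers) ./ Z   # = [1.0, 2.0], exactly the counts
chi2 = sum((y .- expected) .^ 2 ./ sumw2)  # = 0.0, matching x([3.0]) above

# A zero entry in sumw2 makes a per-bin division by σ² = 0, hence Inf:
(1.0 - 0.9)^2 / 0.0    # Inf
```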

Example with multiple pdfs:
```julia
pdf1 = ExtendPdf((x, _) -> x-0.5, (1, 4))
pdf2 = ExtendPdf((x, _) -> 3x^2, (1, 4))

hist = Hist1D(; binedges = 1:4, bincounts = [1, 2, 3], sumw2 = [1, 2, 1])
x = LikelihoodSpec(
pdf1 + pdf2, hist;
loss_type = BinnedDistributionFit.CSQ(),
num_integrator = BinnedDistributionFit.QuadGKIntegrator()
)

x([[3.0, 4.0], [], []])
# Output: 1.73774816049

x(ComponentArray(norms = [3.0, 4.0], p1 = [0], p2 = [0]))
# Output: 1.73774816049
```

## Usage with Optimization.jl
`BinnedDistributionFit` can integrate with various optimization packages. This example shows how to fit and plot one pdf using `Optimization` and `CairoMakie`. First, we create a sample distribution and draw 50,000 random points from it.
```@setup Plotting_Example
using Optimization, ForwardDiff, Distributions, CairoMakie, BinnedDistributionFit
using ComponentArrays, FHist
CairoMakie.activate!(; type="svg")
```
```@example Plotting_Example; continued = true
L_dist = Laplace(50, 20)
L_data = rand(L_dist, 50000)
```
Since we know the shape of the data, we can use a Laplace distribution as our fit function. We create a pdf with two unknown parameters, both stored in the vector `ps`.
```@example Plotting_Example; continued = true
f(x, ps) = pdf(Laplace(ps[1], ps[2]), x)
```
We then use the core components of the `BinnedDistributionFit` package: `ExtendPdf` and `LikelihoodSpec`. We create a histogram to wrap the data we generated and then wrap that and the `ExtendPdf` into `LikelihoodSpec`, our loss function.
```@example Plotting_Example; continued = true
hist = BinnedDistributionFit.Hist1D(L_data; binedges = 0:100)
pdf_input = BinnedDistributionFit.ExtendPdf(f, (0,100))

NLL = BinnedDistributionFit.LikelihoodSpec(pdf_input, hist)
```
We then generate an initial guess for the parameter values and an overall norm. Note that each element of the `ComponentArray` must be a vector of floats (e.g. `Vector{Float64}`).
```@example Plotting_Example; continued = true
para_guess = ComponentArray(norm = [47000.], p1 = [70., 30.])
```
Using the `Optimization` package we define the function and problem to optimize.
```@example Plotting_Example; continued = true
optf = OptimizationFunction(NLL, AutoForwardDiff())
prob = OptimizationProblem(
optf, para_guess;
# lower and upper bounds on the parameters
lb = ComponentArray(norm = [eps()], p1 = [eps(), eps()]),
ub = ComponentArray(norm = [100000], p1 = [100, 50])
)

sol = solve(prob, Optimization.LBFGS())
```
Finally, we can use the `BinnedDistributionFit` extension of `Makie` to plot the fitted pdf and the histogram.
```@example Plotting_Example
fig = BinnedDistributionFit.plotthing(NLL, sol.u)

fig # hide
```
21 changes: 16 additions & 5 deletions docs/src/index.md
@@ -26,16 +26,19 @@ CurrentModule = BinnedDistributionFit
Documentation for [BinnedDistributionFit](https://github.com/Moelf/BinnedDistributionFit.jl).



# What is BinnedDistributionFit?
BinnedDistributionFit is a package which defines several loss functions to compare probability density functions (pdfs) or sums thereof to specific histogram data. BinnedDistributionFit supports any number of independently normalized pdfs. Either negative log likelihood or chi squared regressions may be used.

# Mathematical Background
`BinnedDistributionFit` supports two methods for calculating likelihood: Negative log likelihood and the chi squared test.
The question that this package seeks to answer, in simple terms, is this: given some data, what is the most likely set of parameters that fits some function to that data? Symbolically, we may write ``p(\text{ps}\ | \ O)``, or, "the probability of ``\text{ps}`` given ``O``." This measure is called likelihood.

`BinnedDistributionFit` supports two methods for calculating likelihood: Negative log likelihood and the reduced chi squared statistic. Both of these yield measurements of how accurately a function fits to particular data, with better fits resulting in lower scores.

## Negative Log Likelihood Function
Given a histogram and a pdf, we compare the observed and expected values over the support (domain) of the pdf, using likelihood as the measure of error. We work with the negative log likelihood (NLL): taking the logarithm turns the product of per-bin likelihoods into a sum of their logs, and negating that sum turns the fit into a minimization problem.
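The step from a product of per-bin likelihoods to a sum of logs can be written explicitly:

```math
-\ln\mathcal{L} = -\ln\prod_{i=1}^{N_{\text{bins}}}\mathcal{L}_i = -\sum_{i=1}^{N_{\text{bins}}}\ln\mathcal{L}_i
```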

The negative log likelihood function has two terms: a per-bin term and an overall Poisson term. The per-bin term sums the products of the observed counts with the logs of the expected values (for a given set of parameters ``\text{ps}``). The Poisson term arises because the content of each histogram bin is described by a Poisson distribution.

With histogram data ``y`` at each of a set of points ``x_i`` on the support ``S``, we can perform our evaluation like so
```math
-\ln(\mathcal{L})=-\sum_{i=1}^{N_{\text{bins}}}y_i\ln\left(\sum_{j=1}^{N_{\text{pdfs}}}\frac{N_j\cdot\text{UserPdf}_j(x_i,\text{parameters})}{N_{\text{expected}}\cdot\int_S\text{UserPdf}_j(x,\text{parameters})dx}\right)-\left(N_{\text{observed}}\ln N_{\text{expected}}-N_{\text{expected}}\right).
@@ -47,10 +50,18 @@ When only one pdf is defined this equation reduces to
-\ln(\mathcal{L})=-\sum_{i=1}^{N_{\text{bins}}}y_i\ln\left(\frac{\text{UserPdf}(x_i,\text{parameters})}{\int_S\text{UserPdf}(x,\text{parameters})dx}\right)-\left(N_{\text{observed}}\ln N_{\text{expected}}-N_{\text{expected}}\right).
```

## Chi Squared Function
Compared to the NLL function, the chi squared measure of error is much simpler. However, it has the downside that it requires some measure of the standard error of each bin. This package uses the built-in `Hist1D` field `sumw2` to define the variance (``\sigma^2``) of the histogram. The formula for this method is
## Reduced Chi Squared Statistic

Given the likelihood function ``L(x | \theta)``, we can define ``\lambda(\theta)`` as
```math
\frac{L(x | \theta)}{L(x | \hat{\theta})}.
```
``L(x | \hat{\theta})`` is the maximum value of the likelihood function, attained at the best-fit parameters ``\hat{\theta}``. The ``\chi^2`` statistic can then be written as
```math
\chi^2 = -2\ln \lambda (\theta).
```
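For approximately Gaussian bin contents, this statistic reduces to the familiar per-bin sum (a standard result, stated here for context rather than taken from the package):

```math
-2\ln\lambda(\theta) \approx \sum_{i=1}^{N_{\text{bins}}}\frac{\left(y_i-\mu_i(\theta)\right)^2}{\sigma_i^2},
```
where ``\mu_i(\theta)`` is the expected content of bin ``i``.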
Compared to the NLL function, the chi squared measure of error is much simpler. However, it has the downside that it requires some measure of the standard error of each bin. This package uses the built-in `Hist1D` field `sumw2` to define the variance ``\sigma^2`` of the histogram. The formula for the method implemented in this package is
```math
\chi^2 = \sum_{i=1}^{N_{\text{bins}}}\frac{\left(y_i-\sum_{j=1}^{N_{\text{pdfs}}}\frac{N_j\cdot\text{UserPdf}_j(x_i,\text{parameters})}{N_{\text{expected}}\cdot\int_S\text{UserPdf}_j(x,\text{parameters})dx}\right)^2}{\sigma_i^2}.
```
This formulation uses the same variables as the previous equation.

50 changes: 31 additions & 19 deletions src/BinnedDistributionFit.jl
@@ -44,23 +44,35 @@ struct ExtendPdf{F, S} <: AbstractPdf
support::S
end

"""
scalar_eval(d::ExtendPdf, x, ps = (); kw...)
Evaluates a pdf at a single point. See [`ExtendPdf`](@ref) for details on the struct.

For example, if we have a pdf `d` with parameters `params`, we can evaluate the pdf at a single point `x` as follows:

```julia
params = [1.0, 2.0]
scalar_eval(d, x, params)
```
"""
function scalar_eval(d::ExtendPdf, x, ps=(); kw...)
d.func(x, ps; kw...)
function scalar_eval(d::ExtendPdf, x, ps = (); kw...)
return d.func(x, ps; kw...)
end

function vector_eval(d::ExtendPdf, x::AbstractVector, ps=(); kw...)
d.func.(x, Ref(ps); kw...)
"""
vector_eval(d::ExtendPdf, x::AbstractVector, ps = (); kw...)
Evaluates a pdf at a set of points. See [`ExtendPdf`](@ref) for details on the struct.

For example, if we have a pdf `d` with parameters `params`, we can evaluate the pdf at a set of points `x` as follows:

```julia
params = [1.0, 2.0]
vector_eval(d, x, params)
```
"""
function vector_eval(d::ExtendPdf, x::AbstractVector, ps = (); kw...)
return d.func.(x, Ref(ps); kw...)
end

"""
@@ -74,11 +86,11 @@ Type representing a sum of pdfs.
## Examples

```julia
julia> d1 = ExtendPdf((x, _)->x, (1,3))
julia> d1 = ExtendPdf((x, _) -> x, (1,3))

julia> d2 = ExtendPdf((x, _)->x^2, (1,3))
julia> d2 = ExtendPdf((x, _) -> x^2, (1,3))

julia> sd = d1+d2
julia> sd = d1 + d2
SumOfPdfs{Vector{ExtendPdf{F, Tuple{Int64, Int64}} where F}, Tuple{Int64, Int64}}(ExtendPdf{F, Tuple{Int64, Int64}} where F[ExtendPdf{var"#13#14", Tuple{Int64, Int64}}(var"#13#14"(), (1, 3)), ExtendPdf{var"#15#16", Tuple{Int64, Int64}}(var"#15#16"(), (1, 3))], (1, 3))

julia> BinnedDistributionFit.scalar_eval(sd, 2.0)
@@ -89,7 +101,7 @@ julia> BinnedDistributionFit.vector_eval(sd, [2.0, 3.0])
6.0
12.0

julia> sd2 = d1+d2+d3
julia> sd2 = d1 + d2 + d3

julia> BinnedDistributionFit.scalar_eval(sd2, 2.0, [[], [], [1,2]])
ERROR: BoundsError: attempt to access 2-element Vector{Any} at index [3]
@@ -107,26 +119,26 @@ function Base.:+(a::ExtendPdf, b::ExtendPdf)
if a.support != b.support
error("+: Supports of ExtendPdfs $a and $b are not equal")
end
SumOfPdfs([a, b], a.support)
return SumOfPdfs([a, b], a.support)
end

function Base.:+(a::SumOfPdfs, b::ExtendPdf)
if a.support != b.support
error("+: Supports of the SumOfPdfs $a and ExtendPdf $b are not equal")
end
SumOfPdfs([a.pdfs; b], a.support)
return SumOfPdfs([a.pdfs; b], a.support)
end
function Base.:+(a::ExtendPdf, b::SumOfPdfs)
if a.support != b.support
error("+: Supports of the ExtendPdf $a and SumOfPdfs $b are not equal")
end
SumOfPdfs([a; b.pdfs], a.support)
return SumOfPdfs([a; b.pdfs], a.support)
end
function Base.:+(a::SumOfPdfs, b::SumOfPdfs)
if a.support != b.support
error("+: Supports of SumOfPdfs $a and $b are not equal")
end
SumOfPdfs([a.pdfs; b.pdfs], a.support)
return SumOfPdfs([a.pdfs; b.pdfs], a.support)
end

"""
@@ -142,8 +154,8 @@ params2 = [3.0, 4.0]
scalar_eval(sumofpdfs([pdf1, pdf2]), x, [params1, params2])
```
"""
function scalar_eval(d::SumOfPdfs, x, vps=fill(nothing, length(d.pdfs)); kw...)
mapreduce(+, d.pdfs, vps) do d, ps
function scalar_eval(d::SumOfPdfs, x, vps = fill(nothing, length(d.pdfs)); kw...)
return mapreduce(+, d.pdfs, vps) do d, ps
scalar_eval(d, x, ps; kw...)
end
end
@@ -152,15 +164,15 @@ end
vector_eval(d::SumOfPdfs, x::AbstractVector, vps::Vector{Vector}; kw...)
Evaluates the sum of pdfs over an array of points. `vps` needs to be a vector of vectors of parameters because each pdf in the sum has its own set of parameters.

For example, if we have two pdfs `pdf1` and `pdf2` with parameters `params1` and `params2`, respectively, we can evaluate the sum of the pdfs over an array point `[a, b, c]` as follows:
For example, if we have two pdfs `pdf1` and `pdf2` with parameters `params1` and `params2`, respectively, we can evaluate the sum of the pdfs over an array `[a, b, c]` as follows:
```julia
params1 = [1.0, 2.0]
params2 = [3.0, 4.0]
vector_eval(pdf1 + pdf2, [a, b, c], [params1, params2])
```
"""
function vector_eval(d::SumOfPdfs, x::AbstractVector, vps=fill(nothing, length(d.pdfs)); kw...)
mapreduce(+, d.pdfs, vps) do d, ps
function vector_eval(d::SumOfPdfs, x::AbstractVector, vps = fill(nothing, length(d.pdfs)); kw...)
return mapreduce(+, d.pdfs, vps) do d, ps
vector_eval(d, x, ps; kw...)
end
end