Update `Statistex.variance` to use Welford's online algorithm #11

weaversam8 · 2025-07-02T22:37:44Z

Earlier today I opened bencheeorg/benchee#472, and I started thinking about how Statistex could be used to accomplish this. Statistex already supports calculating several statistics while ignoring the list of samples, if the prerequisite parts are available to the caller (e.g. average can be called with :total and :sample_size)

One statistic that didn't support this today was variance (and by extension standard_deviation), since both require the entire list of samples to operate. Algorithms to calculate variance "on-line" exist, such as Welford's Online Algorithm, but they depend on a different statistic, m2. This statistic is "the running sum of squared differences from the current mean", and really only has a meaning in the context of computing variance.

In this PR, I've:

added this m2 statistic
updated the variance statistic to use m2 when calculating variance instead of average
I also exposed this m2 statistic on the top level Statistex struct, even if it's not a particularly useful statistic on its own.
- (I figured the ability for users to incrementally update statistics like variance and standard_deviation was in keeping with Statistex's philosophy of avoiding recomputation.)

This PR also updates some typespecs to better capture the :ignored behavior for statistics that support it, and tweaks the last few digits of floating point results from the variance and standard_deviation statistic. I'm not a statistics expert, but I believe that my floating point tweaks represent a shift towards more accuracy, not less, given the notes about loss of precision on the above linked Wikipedia page:

[The Welford's online] algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient because of the division operation inside the loop.

I didn't open an issue before I opened this PR, so I know these changes might not be desired - if so, please let me know! I'd love to discuss better ways to handle what I described in bencheeorg/benchee#472 if this isn't the best approach.

- Also exposes `m2` statistic so this algorithm can be used from user code - Added test for the above - Update dialyzer specs to better tolerate `:ignored` option when calling certain statistic functions

PragTob

👋

First of all, thanks a lot for the PR and the ideas!

On a high level, I'm always happy for statistex to support more and help people more so from that pov I love this! You also correctly picked up that it's very much within statistex' philosophy to let you provide as many intermediary values as possible if you want to. 👍 💚

The high level questions to me are:

Does this have an impact on performance for the "normal" use case? Online algorithms are pretty cool but usually it comes at a cost, so that the "everything at once" algorithms are faster. Would love to have benchmark showing how fast each approach is - if only for my benchmarking nerdyness.
The question on if we're gonna do bencheeorg/benchee#472 is also still open - I just wanna keep that in mind, but statistex is a general purpose library so even if we don't I'm happy to have this in statistex (pending performance results as a separate way if slower)

Also 👋 I believe this is the first PR review you're getting from me. I leave a lot of comments. That's normal. It doesn't mean I don't like the PR, in fact I think this PR is fantastic! I'm just rather detail oriented and want to understand things so I'll leave small comments on wording/naming and ask questions. Just because I leave a comment doesn't mean things have to change, I'm often just genuinely curios.

On the bigger level for this PR:

we need to keep backwards compatibility with options that people can pass in (the missing :average in variance)
The performance results are interesting to me to see if we can, as this PR does, just always use the m2 variant as performance is comparable or if we need to make it opt in/separate API/whatever
since I think most people don't know what m2 is we might need to add some more docs and/or name it differently (I just want people who see m2 in the struct to be able to quickly find the answer to "what the hell is this"?)

Again thanks a lot, sorry for the many comments I know they can seem intimidating but they're definitely not meant that way 😅 Let me see if the train wifi allows me to upload a bunny pic...

PragTob · 2025-07-06T11:28:20Z

README.md

 * `mix test` to run tests
 * `mix dialyzer` to run dialyzer for type checking, might take a while on the first invocation (try building plts first with `mix dialyzer --plt`)
 * `mix credo` to find code style problems
+* To generate code coverage information:


This isn't accurate for this project as we use excoveralls - although to be honest haven't used the old school --cover in a long while. So it's mix coveralls.html :)

PragTob · 2025-07-06T11:29:50Z

lib/statistex.ex

+        m2: 17142.857142857145,
+        variance: 2857.1428571428573,
+        standard_deviation: 53.45224838248488,
+        standard_deviation_ratio: 0.10845383729779542,


Thanks for the comment, but the differences are so far back I don't think they matter/a normal test with "to a precision of 5 digits" wouldn't even have caught it.

PragTob · 2025-07-06T11:35:38Z

lib/statistex.ex

      ** (ArgumentError) Passed an empty list ([]) to calculate statistics from, please pass a list containing at least one number.
  """
-  @spec standard_deviation(samples, keyword) :: float
+  @spec standard_deviation(samples | :ignored, keyword) :: float


So, this confused me a bit :) I can now see why you did it, but I don't think it should be part of the type specs. Me passing in :ignored in some of the doc tests was just my cheeky way of showing that if you provide all the other data needed then truly the samples aren't even touched and hence don't matter. There is no special meaning to :ignored - you could pass in a pid or whatever. I don't think people should do that though.

I think if we wanted to highlight/encourage do that more we should supply a function/interface that just doesn't take any samples and either works on direct arguments or the keyword list but raises if any of them aren't there :)

PragTob · 2025-07-06T11:36:29Z

lib/statistex.ex

        ** (ArgumentError) Passed an empty list ([]) to calculate statistics from, please pass a list containing at least one number.
  """
-  @spec standard_deviation_ratio(samples, keyword) :: float
+  @spec standard_deviation_ratio(samples | :ignored, keyword) :: float


(same :ignored comment as above)

PragTob · 2025-07-06T11:37:59Z

lib/statistex.ex

+  @doc """
+  Calculate the running sum of squared differences from the current mean.
+
+  This value is only used when trying to calculate the variance in a single pass, using Welford's online algorithm.


A link similar to the one in your PR may be great here!

PragTob · 2025-07-06T11:43:56Z

lib/statistex.ex

+          0
+
+        {0, 0} ->
+          0


I think here an if works clearer as you can do == 0 and so we don't need to make it 2 separate cases for 0.0 and 0.

PragTob · 2025-07-06T11:45:42Z

lib/statistex.ex

+
+        _ ->
+          Keyword.get_lazy(options, :average, fn ->
+            average(:ignored, sample_size: count, total: total)


I think here I'd just extract a (private) calc_average function passing sample_size and total in as the args - it'll be extremely tiny but yeah - avoids the :ignored bit 👀

PragTob · 2025-07-06T11:46:23Z

lib/statistex.ex

+  def m2([], _), do: raise(ArgumentError, @empty_list_error_message)
+
+  def m2(samples, options) when is_list(samples) do
+    count = Keyword.get(options, :sample_size, 0)


I know it's nitpicky/focussed on wording but as the rest of the code calls this sample_size I think it'd be worth to keep with that wording :) (or since we keep updating it, maybe something like current_sample_size?)

PragTob · 2025-07-06T11:47:45Z

lib/statistex.ex

+    count = count + 1
+    delta = sample - mean
+    mean = mean + delta / count
+    delta2 = sample - mean


is there a better name for delta2? I haven't read the algorithm description yet but would hope there is a better name to use 😁

PragTob · 2025-07-06T11:50:39Z

lib/statistex.ex


-    average =
-      Keyword.get_lazy(options, :average, fn -> average(samples, sample_size: sample_size) end)
+    m2 = Keyword.get_lazy(options, :m2, fn -> m2(samples) end)


not allowing to provide :average any more would be a breaking change which we should avoid (as I wouldn't want to bump a major for this). I suppose that means we'd need to keep the "old" offline way to calculate variance around as well.

weaversam8 · 2025-07-07T14:45:00Z

Thanks for the detailed comments! No worries about the quantity, I also prefer detailed notes instead of one larger comment. I'll take a look at making some of these tweaks soon.

I was also curious about the benchmarking, and my first thought was to use Benchee. @PragTob would you be OK if I added some tests that benchmarked this library using Benchee? Even if that might make a circular dependency between Benchee and Statistex?

PragTob · 2025-10-05T18:22:47Z

@weaversam8 if you can make the circular dependency work, for sure! Otherwise, a separate repo may be the solution. I've definitely benchmarked benchee using benchee :)

Also sorry for the silence... I'll appologize in some other thread 😅

Update Statistex.variance to use Welford's online algorithm

6ff7750

- Also exposes `m2` statistic so this algorithm can be used from user code - Added test for the above - Update dialyzer specs to better tolerate `:ignored` option when calling certain statistic functions

PragTob reviewed Jul 6, 2025

View reviewed changes

Update Statistex.variance to use Welford's online algorithm #11

Are you sure you want to change the base?

Update Statistex.variance to use Welford's online algorithm #11

Uh oh!

Conversation

weaversam8 commented Jul 2, 2025

Uh oh!

PragTob left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

weaversam8 commented Jul 7, 2025

Uh oh!

PragTob commented Oct 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Update `Statistex.variance` to use Welford's online algorithm #11

Update `Statistex.variance` to use Welford's online algorithm #11