-
Notifications
You must be signed in to change notification settings - Fork 4
Update Statistex.variance to use Welford's online algorithm
#11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Also exposes `m2` statistic so this algorithm can be used from user code - Added test for the above - Update dialyzer specs to better tolerate `:ignored` option when calling certain statistic functions
PragTob
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋
First of all, thanks a lot for the PR and the ideas!
On a high level, I'm always happy for statistex to support more and help people more so from that pov I love this! You also correctly picked up that it's very much within statistex' philosophy to let you provide as many intermediary values as possible if you want to. 👍 💚
The high level questions to me are:
- Does this have an impact on performance for the "normal" use case? Online algorithms are pretty cool but usually it comes at a cost, so that the "everything at once" algorithms are faster. Would love to have benchmark showing how fast each approach is - if only for my benchmarking nerdyness.
- The question on if we're gonna do bencheeorg/benchee#472 is also still open - I just wanna keep that in mind, but statistex is a general purpose library so even if we don't I'm happy to have this in statistex (pending performance results as a separate way if slower)
Also 👋 I believe this is the first PR review you're getting from me. I leave a lot of comments. That's normal. It doesn't mean I don't like the PR, in fact I think this PR is fantastic! I'm just rather detail oriented and want to understand things so I'll leave small comments on wording/naming and ask questions. Just because I leave a comment doesn't mean things have to change, I'm often just genuinely curios.
On the bigger level for this PR:
- we need to keep backwards compatibility with options that people can pass in (the missing
:averageinvariance) - The performance results are interesting to me to see if we can, as this PR does, just always use the m2 variant as performance is comparable or if we need to make it opt in/separate API/whatever
- since I think most people don't know what m2 is we might need to add some more docs and/or name it differently (I just want people who see
m2in the struct to be able to quickly find the answer to "what the hell is this"?)
Again thanks a lot, sorry for the many comments I know they can seem intimidating but they're definitely not meant that way 😅 Let me see if the train wifi allows me to upload a bunny pic...
| * `mix test` to run tests | ||
| * `mix dialyzer` to run dialyzer for type checking, might take a while on the first invocation (try building plts first with `mix dialyzer --plt`) | ||
| * `mix credo` to find code style problems | ||
| * To generate code coverage information: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't accurate for this project as we use excoveralls - although to be honest haven't used the old school --cover in a long while. So it's mix coveralls.html :)
| m2: 17142.857142857145, | ||
| variance: 2857.1428571428573, | ||
| standard_deviation: 53.45224838248488, | ||
| standard_deviation_ratio: 0.10845383729779542, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment, but the differences are so far back I don't think they matter/a normal test with "to a precision of 5 digits" wouldn't even have caught it.
| ** (ArgumentError) Passed an empty list ([]) to calculate statistics from, please pass a list containing at least one number. | ||
| """ | ||
| @spec standard_deviation(samples, keyword) :: float | ||
| @spec standard_deviation(samples | :ignored, keyword) :: float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this confused me a bit :) I can now see why you did it, but I don't think it should be part of the type specs. Me passing in :ignored in some of the doc tests was just my cheeky way of showing that if you provide all the other data needed then truly the samples aren't even touched and hence don't matter. There is no special meaning to :ignored - you could pass in a pid or whatever. I don't think people should do that though.
I think if we wanted to highlight/encourage do that more we should supply a function/interface that just doesn't take any samples and either works on direct arguments or the keyword list but raises if any of them aren't there :)
| ** (ArgumentError) Passed an empty list ([]) to calculate statistics from, please pass a list containing at least one number. | ||
| """ | ||
| @spec standard_deviation_ratio(samples, keyword) :: float | ||
| @spec standard_deviation_ratio(samples | :ignored, keyword) :: float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same :ignored comment as above)
| @doc """ | ||
| Calculate the running sum of squared differences from the current mean. | ||
| This value is only used when trying to calculate the variance in a single pass, using Welford's online algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A link similar to the one in your PR may be great here!
| 0 | ||
|
|
||
| {0, 0} -> | ||
| 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here an if works clearer as you can do == 0 and so we don't need to make it 2 separate cases for 0.0 and 0.
|
|
||
| _ -> | ||
| Keyword.get_lazy(options, :average, fn -> | ||
| average(:ignored, sample_size: count, total: total) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here I'd just extract a (private) calc_average function passing sample_size and total in as the args - it'll be extremely tiny but yeah - avoids the :ignored bit 👀
| def m2([], _), do: raise(ArgumentError, @empty_list_error_message) | ||
|
|
||
| def m2(samples, options) when is_list(samples) do | ||
| count = Keyword.get(options, :sample_size, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's nitpicky/focussed on wording but as the rest of the code calls this sample_size I think it'd be worth to keep with that wording :) (or since we keep updating it, maybe something like current_sample_size?)
| count = count + 1 | ||
| delta = sample - mean | ||
| mean = mean + delta / count | ||
| delta2 = sample - mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a better name for delta2? I haven't read the algorithm description yet but would hope there is a better name to use 😁
|
|
||
| average = | ||
| Keyword.get_lazy(options, :average, fn -> average(samples, sample_size: sample_size) end) | ||
| m2 = Keyword.get_lazy(options, :m2, fn -> m2(samples) end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not allowing to provide :average any more would be a breaking change which we should avoid (as I wouldn't want to bump a major for this). I suppose that means we'd need to keep the "old" offline way to calculate variance around as well.
|
Thanks for the detailed comments! No worries about the quantity, I also prefer detailed notes instead of one larger comment. I'll take a look at making some of these tweaks soon. I was also curious about the benchmarking, and my first thought was to use Benchee. @PragTob would you be OK if I added some tests that benchmarked this library using Benchee? Even if that might make a circular dependency between Benchee and Statistex? |
|
@weaversam8 if you can make the circular dependency work, for sure! Otherwise, a separate repo may be the solution. I've definitely benchmarked benchee using benchee :) Also sorry for the silence... I'll appologize in some other thread 😅 |

Earlier today I opened bencheeorg/benchee#472, and I started thinking about how Statistex could be used to accomplish this. Statistex already supports calculating several statistics while ignoring the list of samples, if the prerequisite parts are available to the caller (e.g.
averagecan be called with:totaland:sample_size)One statistic that didn't support this today was
variance(and by extensionstandard_deviation), since both require the entire list of samples to operate. Algorithms to calculate variance "on-line" exist, such as Welford's Online Algorithm, but they depend on a different statistic,m2. This statistic is "the running sum of squared differences from the current mean", and really only has a meaning in the context of computing variance.In this PR, I've:
m2statisticvariancestatistic to usem2when calculatingvarianceinstead ofaveragem2statistic on the top levelStatistexstruct, even if it's not a particularly useful statistic on its own.varianceandstandard_deviationwas in keeping with Statistex's philosophy of avoiding recomputation.)This PR also updates some typespecs to better capture the
:ignoredbehavior for statistics that support it, and tweaks the last few digits of floating point results from thevarianceandstandard_deviationstatistic. I'm not a statistics expert, but I believe that my floating point tweaks represent a shift towards more accuracy, not less, given the notes about loss of precision on the above linked Wikipedia page:I didn't open an issue before I opened this PR, so I know these changes might not be desired - if so, please let me know! I'd love to discuss better ways to handle what I described in bencheeorg/benchee#472 if this isn't the best approach.