Add CI runner with LLVM Flang on Windows/MinGW #753
mmuetzel wants to merge 3 commits into ElmerCSC:devel
|
There is a new error that I haven't seen before: |
|
Hmm. I never had issues with the CLANG64 build step; the only problems have been with the tests. I'm actually trying to build the tip of the It finishes with the build step, but the tests all fail :( |
|
I don't see anything equivalent to The current build error occurs for every runner in the CI. I guess it is something new on the devel branch. |
Might be a regression from 8103e68. |
|
What I added was the ability for ElmerSolver and ElmerGrid to write the branch name to stdout: See the commit: It passed all the tests after merging into devel. Perhaps you have a conflict in config.h so that it is missing "ELMER_FEM_BRANCH", because ElmerGrid cannot find it here even though it does on the devel branch. |
|
Just guessing: Maybe the issue arises because GitHub runs the CI on a "merge branch" (when the PR is from a fork). Maybe, |
|
Maybe also relevant: So, the CI apparently runs on a detached head (that is on no branch at all). There should also be a fallback for the case where the sources are built from a distribution tarball (instead of from a git checkout). |
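For illustration only, a minimal shell sketch of such a fallback. The function name `detect_branch` and the placeholder `unknown` are made up here; this is not the change from any PR, and the real build rules live in CMake rather than in a shell script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a branch label for the source tree in "$1".
# Falls back to "unknown" for tarballs, detached heads, or when git is missing.
detect_branch() {
  local src_dir="${1:-.}" branch=""
  if command -v git >/dev/null 2>&1; then
    # symbolic-ref fails (non-zero exit, no output with -q) on a detached
    # head and outside of a git work tree.
    branch="$(git -C "$src_dir" symbolic-ref --short -q HEAD 2>/dev/null)" || branch=""
  fi
  printf '%s\n' "${branch:-unknown}"
}

detect_branch .
```

On a CI checkout with a detached head, this prints the placeholder instead of failing, which is the behavior a tarball build would need as well.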
|
See #754 for a potential fix. |
|
Currently, 47 tests failed out of 981 using the CLANG64 environment. |
|
That MPIEXEC did the trick. I'm only getting ~50 tests failing now instead of everything: I am using the |
|
My first guess would be that at least some of the test failures are due to the missing MATC; building without it has never been tried (!?!). Decoupling it completely would need much more work, so it might be easier to fix the compilation (which has worked on every platform everywhere for the last 35 years or so...)? |
|
@juharu this is the error I'm getting with MATC enabled in CLANG64: |
|
Right ok, it's perhaps not about the MATC on the elmersolver side (forget the patch above).. Maybe you |
|
|
I believe the reason why MATC cannot be enabled is because it doesn't work in combination with OpenMP on Windows. (Access to thread-local storage across DLL borders doesn't work well.) I think I know how to fix that. I'll open a PR for that shortly. Edit: See #755. |
|
Still 47 tests failed out of 981. |
|
When I build ElmerFEM with `-O0 -g` and run the `circuits_harmonic_foil` test with gdb attached, I get the following backtrace from the segmentation fault:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffaff164a92 in Fortran::runtime::AssignTicket::Continue(Fortran::runtime::WorkQueue&) ()
from D:\repo\elmerfem\.build-clang\fem\src\libelmersolver.dll
(gdb) bt
#0 0x00007ffaff164a92 in Fortran::runtime::AssignTicket::Continue(Fortran::runtime::WorkQueue&) ()
from D:\repo\elmerfem\.build-clang\fem\src\libelmersolver.dll
#1 0x00007ffaff166030 in _FortranAAssign () from D:\repo\elmerfem\.build-clang\fem\src\libelmersolver.dll
#2 0x00007ffafe0d5976 in pelementbase::dlinenodalpbasisall (gradphi=...) at D:/repo/elmerfem/fem/src/PElementBase.F90:157
#3 0x00007ffafe16f56f in elementdescription::nodalfirstderivatives (n=2, dlbasisdx=..., element=..., u=0.57735026918962573, v=0, w=0,
usolver=0x78b67040) at D:/repo/elmerfem/fem/src/ElementDescription.F90:2439
#4 0x00007ffafe178db8 in elementdescription::elementinfo (element=..., nodes=..., u=0.57735026918962573, v=0, w=0, detj=0, basis=...,
dbasisdx=<error reading variable: Location address is not set.>, ddbasisddx=<error reading variable: Location address is not set.>,
secondderivatives=<error reading variable: Cannot access memory at address 0x0>,
bubbles=<error reading variable: Cannot access memory at address 0x0>, basisdegree=<error reading variable: Location address is not set.>,
edgebasis=<error reading variable: Location address is not set.>, rotbasis=<error reading variable: Location address is not set.>,
usolver=<error reading variable: Cannot access memory at address 0x0>) at D:/repo/elmerfem/fem/src/ElementDescription.F90:3234
#5 0x00007ffafed23265 in defutils::vectorelementedgedofs (bc=0x78bc5990, element=0x7aaee4c8, n=2, parent=..., np=4, integral=...,
edofs=<error reading variable: Cannot access memory at address 0x0>, secondfamily=<error reading variable: Cannot access memory at address 0x0>,
faceelement=<error reading variable: Cannot access memory at address 0x0>,
quadraticapproximation=<error reading variable: Cannot access memory at address 0x0>,
simplicialmesh=<error reading variable: Cannot access memory at address 0x0>) at D:/repo/elmerfem/fem/src/DefUtils.F90:6797
#6 0x00007ffaff047c7f in elmersolver::initcond () at D:/repo/elmerfem/fem/src/ElmerSolver.F90:2309
#7 0x00007ffaff03c2e0 in elmersolver::setinitialconditions () at D:/repo/elmerfem/fem/src/ElmerSolver.F90:1831
#8 0x00007ffaff0254cb in elmersolver (initialize=0) at D:/repo/elmerfem/fem/src/ElmerSolver.F90:554
#9 0x00007ff78cb617e8 in solver () at D:/repo/elmerfem/fem/src/Solver.F90:57
(gdb)
Does that give a hint as to what might be going wrong? |
|
Thanks, I'll try to look. By the way, is there any chance of increasing the default stack size on Windows?
Br, Juha
|
|
gfortran with "valgrind" doesn't show anything suspicious, but at this exact place within the initial condition setting (in ElmerSolver.F90, around line 2310)
the Flang build on my Ubuntu had some excess stack usage, which was resolved by a BLOCK ... END BLOCK construct somewhere there.
That in itself may be an indication that something is a bit off somewhere...
I'll try the Flang compilation...
|
|
@juharu Yes, Clang accepts |
|
Afaict, the default stack size on Windows is 1 MB. There is no way to change the stack size at load or run time. But like @hmartinez82 wrote, there are flags to increase the stack size when linking an executable. Which value would be reasonable for ElmerSolver (or ElmerSolver_mpi)? |
|
"Which value would be reasonable for ElmerSolver (or ElmerSolver_mpi)?"
My Ubuntu default stack is 8 MB, which seems quite enough after the code changes I made to reduce stack usage because of Flang.
But those changes also included the BLOCK ... END BLOCK thing.
I just thought that this could maybe be tested quickly?
|
|
"valgrind" reports a clean run for the "circuit_harmonic_foil" with flang compiled ElmerSolver on |
|
I pushed a commit that changes the stack size to 8 MB for the CI runners on Windows. If that makes a difference, it might make sense to add that to the CMake build system rules instead. |
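As a rough sketch of what such a setting can look like (an assumption for illustration, not the content of the commit): MinGW-style linkers accept a `--stack` option with the reserve size in bytes, which can be passed through the compiler driver with `-Wl,`. The variable names below are made up.

```shell
#!/usr/bin/env bash
# Sketch: build a linker flag requesting an 8 MB stack for a Windows executable.
stack_mb=8
stack_bytes=$((stack_mb * 1024 * 1024))

# Assumption: the flag is passed through the compiler driver to a
# MinGW-style linker (GNU ld or lld's MinGW driver), which takes --stack.
stack_ldflag="-Wl,--stack,${stack_bytes}"

echo "$stack_ldflag"
```

A string like this could, for example, be appended to `CMAKE_EXE_LINKER_FLAGS` when configuring; whether that is the best place for it in the Elmer build rules is left open here.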
|
With that, only 14 tests failed out of 981 (from previously 47 failed tests): Good catch! I don't think that an 8 MB stack is excessive for scientific software. I'll look into how that could be integrated into the CMake build system rules. |
Ok, thanks for the testing. This is obviously unexpected to me, but I'm not really sure what the Fortran standard says about this, how it might be interpreted, or whether it actually specifies this context at all... On the other hand, Flang on Ubuntu seems to work OK? Maybe this is not about Fortran after all? |
|
If I understand correctly, the issue is because Flang "stores" the program arguments in the runtime. But the Flang runtime is static (at least on Windows). The command line arguments get (correctly) initialized for the runtime in the executable. But the DLL presumably carries its own copy of the static runtime, for which they are never initialized. That sounds like a design flaw in the Flang runtime to me... Not sure how to best solve that. (It might be best if the Flang runtime were a shared library. But that sounds hard to do without support from upstream LLVM.) A workaround might be moving the command line argument handling from the library to the executable in Elmer. Would that be possible? |
|
|
I opened #760 which leaves the actual handling of the command line arguments where it is. But it collects them in the executable and passes them to the DLL as arguments. |
|
Not sure what is happening currently: Some CI jobs are failing before they are even started. Maybe, we just need to wait some time before they figure out what is wrong and fix it. |
|
The issue with some runners not starting yesterday was probably this: https://www.githubstatus.com/incidents/90hj03y5tj3c I rebased on a current head of the Two of them involve Zoltan. One Lua. The test There is no MUMPS package for the CLANG64 environment of MSYS2 currently. Afaict, CPardiso is part of the Intel MKL (which isn't used by the MinGW runners either). |
|
|
The log file for the Lua test just ends with the following for the CLANG64 CI runner: When running the test locally, the log file continues like this: So, something odd might be happening before That test passes for me locally in the CLANG64 environment (also when I run it repeatedly). |
|
The Lua test passes if I build with What is odd that Maybe, something about the |
|
at first look, both "ninlen" and "tninlen" seem to be declared as simple int variables, so this seems |
|
I'm not an OpenMP expert. But the following change avoids the segmentation fault for me:
fem/src/GeneralUtils.F90 | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/fem/src/GeneralUtils.F90 b/fem/src/GeneralUtils.F90
index 8240cf21b..9c1e75649 100644
--- a/fem/src/GeneralUtils.F90
+++ b/fem/src/GeneralUtils.F90
@@ -1242,9 +1242,7 @@ CONTAINS
closed_region = .FALSE.
END IF
- !$OMP PARALLEL DEFAULT(NONE) &
- !$OMP FIRSTPRIVATE(tcmdstr, tninlen, lstat) &
- !$OMP SHARED(lua_result, result_len, closed_region, i, j, inlen, first_bang)
+ !$OMP CRITICAL(LuaEval)
IF(closed_region) THEN
lstat = lua_dostring( LuaState, &
'return tostring('// tcmdstr(1:tninlen-1) // ')'//c_null_char, 1)
@@ -1252,7 +1250,7 @@ CONTAINS
IF (i == 1 .and. first_bang .and. j == inlen) THEN ! ' # <luacode>' case, do not do 'return tostring(..)'.
! Instead, just execute the line in the lua interpreter
- lstat = lua_dostring( LuaState, tcmdstr(1:tninlen) // c_null_char, 1)
+ lstat = lua_dostring( LuaState, tcmdstr(1:tninlen) // c_null_char, 1)
ELSE ! 'abc = # <luacode>' case, oneliners only
@@ -1260,10 +1258,8 @@ CONTAINS
'return tostring('// tcmdstr(1:tninlen) // ')'//c_null_char, 1)
END IF
END IF
- !$OMP CRITICAL
lua_result => lua_popstring(LuaState, result_len)
- !$OMP END CRITICAL
- !$OMP END PARALLEL
+ !$OMP END CRITICAL(LuaEval)
matcstr(1:result_len) = lua_result(1:result_len)
ninlen = result_len
I.e., expand the critical section around the entire code that uses Does that look reasonable? (Edit on 28-01-2026: No, it does not.) |
|
There is no indication of any trouble with either gfortran or flang-21 on my Ubuntu. To me that says |
|
Thanks, I'll forward your patch to Juhani Kataja, who (I think) included the Lua stuff. |
|
I opened #761 for the OpenMP change regarding the |
|
No problems with the Zoltan cases on my Ubuntu & gcc either, other than that building Zoltan needed "-Wno-incompatible-pointer-types" to be added to the C compiler options. |
|
It looks like LLVM Flang has problems isolating variables from the surrounding context for OpenMP sections (in nested subroutines). LLVM 22 is currently in its release candidate phase. Afaict, there have been some improvements for the OpenMP implementation in Flang in that version. Maybe we should just wait until LLVM 22 is released (and MSYS2 is updated to that version). We can continue here with the newer version of LLVM Flang then. I'll leave this open for now if you don't mind. |
|
MSYS2 now distributes a MUMPS package for their CLANG64 environment (see: msys2/MINGW-packages#28452). LLVM is still at version 21. So, I expect no change with respect to the OpenMP errors. (But a couple more tests might be running now that MUMPS can be enabled in that environment.) |
|
In the latest round of CI, 8 out of 1007 tests failed on the new runner: I haven't looked at the details. |
The CLANG64 environment of MSYS2 is based on an LLVM toolchain. That means the compilers (`clang`, `clang++`, `flang`), linker (`lld`), other "binutils" (e.g., `ar`, `nm`), and runtime libraries (compiler runtime, OpenMP, ...) are from LLVM. Use that environment to build a larger part of ElmerFEM with LLVM Flang (compared to the CI runner using Flang on Ubuntu).
Use a shell array for the arguments for the MUMPS library to simplify the `pkg-config` commands.
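A minimal sketch of the shell-array idea, for readers unfamiliar with it. The variable names and the `-DEXAMPLE_MUMPS_LIBS` option are hypothetical placeholders, and the library string stands in for what a `pkg-config --libs ...` call would return; the actual workflow may look different.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: collect optional CMake arguments in a bash array.
# The option names are illustrative assumptions, not taken from the CI workflow.
use_mumps=yes

mumps_args=()
if [ "$use_mumps" = yes ]; then
  mumps_args+=("-DWITH_Mumps=ON")
  # In a real workflow, a value like this would come from `pkg-config --libs ...`.
  mumps_args+=("-DEXAMPLE_MUMPS_LIBS=-ldmumps -lmumps_common")
fi

# "${mumps_args[@]}" expands to one word per element, so elements with
# embedded spaces survive intact and an empty array expands to nothing.
printf '%s\n' cmake -S . -B build "${mumps_args[@]}"
```

The sketch only prints the command line, one argument per line, so it is easy to see that the element containing spaces stays a single argument.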
|
MSYS2 updated their packages to LLVM 22. Re-based to check if that makes a difference. |
|
Apparently, that didn't make a difference. The following 8 out of 1007 tests still failed: I still haven't looked into the details of, or any commonality between, the failing tests. |
The CLANG64 environment of MSYS2 is based on an LLVM toolchain. That means the compilers (`clang`, `clang++`, `flang`), linker (`lld`), other "binutils" (e.g., `ar`, `nm`), and runtime libraries (compiler runtime, OpenMP, ...) are from LLVM.
Use that environment to build a larger part of ElmerFEM with LLVM Flang (compared to the CI runner using Flang on Ubuntu).
Some features need to be disabled for the time being:
- There is no MUMPS package for the CLANG64 environment of MSYS2 (see "mumps: Update to 5.8.2 and enable CLANG builds", msys2/MINGW-packages#27438).
- Attempting to enable MATC leads to linker errors. If I understand correctly, that is because the symbol `listheaders` isn't exported correctly. (`ld.bfd` seems to be more lenient about that.)
The proposed build rules are able to build ElmerFEM in that environment. But a lot of tests are failing. I haven't looked yet if these failures can be put into some categories.
@hmartinez82: I wrote that I'd ping you when I opened this PR. Does building with these rules work in the CLANGARM64 environment? I guess you'd need to disable MPI (because Microsoft hasn't (yet?) released MSMPI for Windows on ARM).
Are a similar number of tests failing for you?