
Add CI runner with LLVM Flang on Windows/MinGW #753

Open
mmuetzel wants to merge 3 commits into ElmerCSC:devel from mmuetzel:ci-mingw

Conversation

@mmuetzel
Contributor

@mmuetzel mmuetzel commented Jan 22, 2026

The CLANG64 environment of MSYS2 is based on a LLVM toolchain. That means the compilers (clang, clang++, flang), linker (lld), other "binutils" (e.g., ar, nm), and runtime libraries (compiler runtime, OpenMP, ...) are from LLVM.

Use that environment to build a larger part of ElmerFEM with LLVM Flang (compared to the CI runner using Flang on Ubuntu).

Some features need to be disabled for the time being:

  • There is no MUMPS package for the CLANG64 environment of MSYS2 (see mumps: Update to 5.8.2 and enable CLANG builds msys2/MINGW-packages#27438).
  • Attempting to enable MATC leads to linker errors. If I understand correctly, that is because the symbol listheaders isn't exported correctly. (ld.bfd seems to be more lenient about that.)
  • And finally, quadruple precision floating-point isn't supported by the Flang runtime that is distributed by MSYS2 either.

The proposed build rules are able to build ElmerFEM in that environment. But a lot of tests are failing. I haven't yet checked whether these failures fall into categories.

@hmartinez82: I wrote that I'd ping you when I opened this PR. Does building with these rules work in the CLANGARM64 environment? I guess you'd need to disable MPI (because Microsoft hasn't (yet?) released MSMPI for Windows on ARM).
Are a similar number of tests failing for you?

@mmuetzel
Contributor Author

There is a new error that I haven't seen before:

FAILED: [code=1] elmergrid/src/CMakeFiles/ElmerGrid.dir/fempre.c.obj 
D:\a\_temp\msys64\ucrt64\bin\cc.exe -DCONTIG="" -DDISABLE_MATC -DELMER_BROKEN_MPI_IN_PLACE -DHAVE_EXECUTECOMMANDLINE -DMINGW32 -DUSE_ARPACK -DUSE_ISO_C_BINDINGS -DWIN32 -ID:/a/elmerfem/elmerfem/elmergrid/src/metis -ID:/a/elmerfem/elmerfem/build/elmergrid/src -Wno-error=incompatible-pointer-types -fopenmp -O3 -DNDEBUG -std=gnu99 -MD -MT elmergrid/src/CMakeFiles/ElmerGrid.dir/fempre.c.obj -MF elmergrid\src\CMakeFiles\ElmerGrid.dir\fempre.c.obj.d -o elmergrid/src/CMakeFiles/ElmerGrid.dir/fempre.c.obj -c D:/a/elmerfem/elmerfem/elmergrid/src/fempre.c
D:/a/elmerfem/elmerfem/elmergrid/src/fempre.c: In function 'main':
D:/a/elmerfem/elmerfem/elmergrid/src/fempre.c:86:71: error: 'ELMER_FEM_BRANCH' undeclared (first use in this function)
   86 |   printf("Version: %s-%s (Rev: %s, Compiled: %s)\n",ELMER_FEM_VERSION,ELMER_FEM_BRANCH,&
      |                                                                       ^~~~~~~~~~~~~~~~
D:/a/elmerfem/elmerfem/elmergrid/src/fempre.c:86:71: note: each undeclared identifier is reported only once for each function it appears in

@hmartinez82

hmartinez82 commented Jan 22, 2026

Humm. I never had issues with the CLANG64 build step; the only problems have been the tests. I'm actually trying to build the tip of the devel branch with this package recipe:


_realname=elmerfem
pkgbase=mingw-w64-${_realname}
pkgname=("${MINGW_PACKAGE_PREFIX}-${_realname}-git")
pkgver=9.0.r4035.g82bb0e9
pkgrel=1
_commit=82bb0e9cc379f074037184d91676edd78a76d61a
pkgdesc="Finite element software for multiphysical problems (mingw-w64)"
arch=('any')
mingw_arch=('ucrt64' 'clang64' 'clangarm64')
url='https://www.elmerfem.org/'
msys2_repository_url="https://github.com/ElmerCSC/elmerfem"
msys2_references=(
  'aur: elmerfem'
)
license=('spdx:GPL-2.0-or-later')
depends=("${MINGW_PACKAGE_PREFIX}-adios2"
         "${MINGW_PACKAGE_PREFIX}-anari-sdk"
         "${MINGW_PACKAGE_PREFIX}-boost"
         "${MINGW_PACKAGE_PREFIX}-cli11"
         "${MINGW_PACKAGE_PREFIX}-cgns"
         "${MINGW_PACKAGE_PREFIX}-eigen3"
         "${MINGW_PACKAGE_PREFIX}-fast_float"
         "${MINGW_PACKAGE_PREFIX}-ffmpeg"
         $([[ ${MINGW_PACKAGE_PREFIX} == *-clang-* ]] || echo "${MINGW_PACKAGE_PREFIX}-gcc-libgfortran")         
         "${MINGW_PACKAGE_PREFIX}-gl2ps"
         "${MINGW_PACKAGE_PREFIX}-liblas"
         "${MINGW_PACKAGE_PREFIX}-libmariadbclient"
         $([[ ${CARCH} == aarch64 ]] || echo "${MINGW_PACKAGE_PREFIX}-msmpi")
         "${MINGW_PACKAGE_PREFIX}-netcdf-fortran"
         "${MINGW_PACKAGE_PREFIX}-opencascade"
         "${MINGW_PACKAGE_PREFIX}-openslide"
         "${MINGW_PACKAGE_PREFIX}-openvdb"
         "${MINGW_PACKAGE_PREFIX}-omp"
         "${MINGW_PACKAGE_PREFIX}-openblas"
         $([[ ${CARCH} == aarch64 ]] || echo "${MINGW_PACKAGE_PREFIX}-parmetis")
         "${MINGW_PACKAGE_PREFIX}-pdal"
         "${MINGW_PACKAGE_PREFIX}-qt6-declarative"
         "${MINGW_PACKAGE_PREFIX}-qwt-qt6"
         "${MINGW_PACKAGE_PREFIX}-scnlib"
         "${MINGW_PACKAGE_PREFIX}-suitesparse"
         "${MINGW_PACKAGE_PREFIX}-utf8cpp"
         "${MINGW_PACKAGE_PREFIX}-vtk"
         )
makedepends=(git
             "${MINGW_PACKAGE_PREFIX}-cc"
             "${MINGW_PACKAGE_PREFIX}-fc"
             "${MINGW_PACKAGE_PREFIX}-cmake")
source=("${_realname}"::"git+https://github.com/ElmerCSC/elmerfem.git#commit=${_commit}")
sha256sums=('971bad89b9b5070fdff8dcd29abf24967431080570f00d46678d82abac86477a')

_apply_patch_with_msg() {
  for _patch in "$@"
  do
    msg2 "Applying ${_patch}"
    patch -p1 -i "${srcdir}/${_patch}"
  done
}

pkgver() {
  cd "${srcdir}/${_realname}"
  git describe --long --abbrev=7 | sed 's/\([^-]*-g\)/r\1/;s/-/./g' | sed 's/release\.//g'
}

prepare() {
  cd ${_realname}

  # _apply_patch_with_msg \
  #   002-fix-build-with-newer-qt5.patch \
  #   003-Add-local-variable-i.patch \
  #   004-Fix-the-evaluation-of-tensor-components.patch \
  #   005-add-i-to-local-scope.patch \
  #   006-fix-some-compile-errors-in-DCRComplexSolve.patch \
  #   007-flang-logical-xor.patch \
  #   008-fix-openmp.patch
}

build() {
  mkdir -p "${srcdir}/build-${MSYSTEM}" && cd "${srcdir}/build-${MSYSTEM}"

  declare -a _extra_config
  if check_option "debug" "n"; then
    _extra_config+=("-DCMAKE_BUILD_TYPE=Release")
  else
    _extra_config+=("-DCMAKE_BUILD_TYPE=Debug")
  fi

  if [[ ${MINGW_PACKAGE_PREFIX} == *-clang-* ]]; then
    export FFLAGS="-I${MINGW_PREFIX}/include"
  fi

  CFLAGS+=" -Wno-implicit-function-declaration -Wno-old-style-definition" \
  CXXFLAGS+=" -Wno-deprecated-declarations" \
  MSYS2_ARG_CONV_EXCL="-DCMAKE_INSTALL_PREFIX=" \
    "${MINGW_PREFIX}"/bin/cmake.exe \
      -Wno-dev \
      -G"Ninja" \
      -DCMAKE_INSTALL_PREFIX="${MINGW_PREFIX}" \
      "${_extra_config[@]}" \
      -DCPACK_BUNDLE_EXTRA_WINDOWS_DLLS=OFF \
      -DBUILD_SHARED_LIBS=ON \
      -DBLA_VENDOR="OpenBLAS" \
      -DHAVE_QP=OFF \
      -DWITH_ELMERGUI=ON \
      -DWITH_ElmerIce=ON \
      -DWITH_LUA=ON \
      -DWITH_MATC=OFF \
      -DWITH_MPI=ON \
      -DWITH_Mumps=OFF \
      -DWITH_OCC=ON \
      -DWITH_OpenMP=ON \
      -DWITH_PARAVIEW=ON \
      -DWITH_QT6=ON \
      -DWITH_VTK=ON \
      -DWITH_Zoltan=OFF \
      -DCREATE_PKGCONFIG_FILE=ON \
      ../${_realname}

  "${MINGW_PREFIX}"/bin/cmake.exe --build .
}

check() {
  cd "${srcdir}/build-${MSYSTEM}"

  "${MINGW_PREFIX}"/bin/ctest.exe -LE slow || true
}

package() {
  cd "${srcdir}/build-${MSYSTEM}"

  DESTDIR="${pkgdir}" "${MINGW_PREFIX}"/bin/cmake.exe --install .
}

It finishes the build step, but the tests all fail :(

@mmuetzel
Contributor Author

mmuetzel commented Jan 22, 2026

I don't see anything equivalent to -DMPIEXEC_EXECUTABLE="$(cygpath -m "${MSMPI_BIN}")/mpiexec.exe" in your configure step.
Did you install MSMPI? You'd need to set the path to the executable, or indeed the tests won't run.

The current build error occurs for every runner in the CI. I guess it is something new on the devel branch.

@mmuetzel
Contributor Author

mmuetzel commented Jan 22, 2026

The current build error occurs for every runner in the CI. I guess it is something new on the devel branch.

Might be a regression from 8103e68.
Afaict, ELMER_FEM_BRANCH is not set by the CMake rules if the execute_process command returns anything other than 0.
@raback: Should it be set to "unknown" or something like that in case that git command fails?
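For illustration, the suggested fallback might look like this in shell terms (a sketch only; the actual change would live in the CMake rules, and the variable name here just mirrors the macro):

```shell
# Hypothetical sketch of the fallback suggested above (not the actual CMake
# rules): use the branch name when git can report one, else "unknown".
cd "$(mktemp -d)"   # somewhere without a git checkout, like a release tarball
branch=$(git branch --show-current 2>/dev/null || true)
ELMER_FEM_BRANCH=${branch:-unknown}
echo "ELMER_FEM_BRANCH=${ELMER_FEM_BRANCH}"
```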

@raback
Contributor

raback commented Jan 22, 2026

What I added was the ability to write the branch to stdout in ElmerSolver and ElmerGrid:
"MAIN: Version: 26.1-devel (Rev: 82bb0e9, Compiled: 2026-01-22)"

See the commit:
8103e68

It passed all the tests after the merge on devel. Perhaps you have some conflict in config.h that drops "ELMER_FEM_BRANCH", because ElmerGrid cannot find it even though it does exist on the devel branch:
"Version: 26.1-devel (Rev: 82bb0e9, Compiled: 2026-01-22)"

@mmuetzel
Contributor Author

Just guessing: maybe the issue arises because GitHub runs the CI on a "merge branch" (when the PR is from a fork), and git branch --show-current results in an error on these "merge branches" for some reason?

@mmuetzel
Contributor Author

Maybe also relevant:
https://github.com/ElmerCSC/elmerfem/actions/runs/21247614714/job/61140719873#step:3:70

  /usr/bin/git checkout --progress --force refs/remotes/pull/753/merge
  Note: switching to 'refs/remotes/pull/753/merge'.
  
  You are in 'detached HEAD' state. You can look around, make experimental
  changes and commit them, and you can discard any commits you make in this
  state without impacting any branches by switching back to a branch.
  
  If you want to create a new branch to retain commits you create, you may
  do so (now or later) by using -c with the switch command. Example:
  
    git switch -c <new-branch-name>
  
  Or undo this operation with:
  
    git switch -
  
  Turn off this advice by setting config variable advice.detachedHead to false
  
  HEAD is now at ddbee19 Merge 807576b5e64af7f03555607af946c4361a5ad5c1 into 82bb0e9cc379f074037184d91676edd78a76d61a

So, the CI apparently runs on a detached HEAD (that is, on no branch at all).
That is probably why git fails when it is asked for the name of the branch.

There should also be a fallback for the case where the sources are built from a distribution tarball (instead of from a git checkout).
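The suspected cause is easy to reproduce locally with a throwaway repository: on a detached HEAD, `git branch --show-current` prints nothing at all.

```shell
# Reproduce the suspected cause in a throwaway repository: on a detached
# HEAD, `git branch --show-current` prints an empty string, so any build
# rule expecting a branch name gets nothing back.
cd "$(mktemp -d)"
git init -q repo && cd repo
git -c user.name=ci -c user.email=ci@example.com commit -q --allow-empty -m init
git checkout -q --detach HEAD    # mimic the CI's refs/remotes/pull/.../merge checkout
branch=$(git branch --show-current)
echo "branch='${branch}'"        # prints: branch=''
```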

@mmuetzel
Contributor Author

See #754 for a potential fix.

@mmuetzel
Contributor Author

Currently, 47 tests failed out of 981 using the CLANG64 environment.

	  5 - AdvDiffFCT (Failed)                               serial transient
	 50 - Classical2DShell (Failed)                         p-fem serial
	101 - ContactPatch3DZoltan_np4 (Failed)                 n-t parallel partition zoltan
	149 - DirichletNeumannZoltan_np3 (Failed)               parallel partition zoltan
	186 - EM_port_eigen_3D (Failed)                         complex_eigen eigen emwave serial
	187 - EM_port_eigen_3D_prisms (Failed)                  complex_eigen eigen emwave serial
	193 - ElastPelem2dPmultg (Failed)                       p-fem serial
	201 - ElasticBeamRestart (Failed)                       restart serial transient
	345 - LinearFormsAssembly (Failed)                      benchmark serial threaded
	348 - Lua (Failed)                                      lua quick serial
	357 - MeshLevelsRestart (Failed)                        quick restart serial
	385 - NaturalConvectionRestart (Failed)                 restart serial transient
	386 - NaturalConvectionRestartCycle (Failed)            restart serial
	455 - PoissonML (Failed)                                quick serial
	472 - RenameRestart (Failed)                            restart serial transient
	487 - RotatingBCMagnetoDynamicsGeneric (Failed)         mortar serial whitney
	512 - SD_ElastPelem2dPmultg (Failed)                    p-fem serendipity serial
	527 - SD_NaturalConvectionRestartCycle (Failed)         restart serendipity serial
	534 - SD_Shell_BenchmarkCase2_High_Order (Failed)       serendipity serial shell
	537 - SD_Shell_OpenHemisphere_High_Order (Failed)       serendipity serial shell
	566 - Shell_BenchmarkCase2_High_Order (Failed)          serial shell
	641 - TEAM30a_3ph_transient (Failed)                    harmonic restart serial
	675 - VectorHelmholtzWaveguideQuadBlock (Failed)        emwave serial whitney
	728 - circuits_harmonic_foil (Failed)                   3D circuits harmonic mgdyn serial whitney
	729 - circuits_harmonic_foil_anl_rotm (Failed)          3D circuits harmonic mgdyn rotm serial whitney
	730 - circuits_harmonic_foil_wvector (Failed)           3D circuits harmonic mgdyn serial whitney wvector
	731 - circuits_harmonic_homogenization_coil_solver (Failed) 3D circuits harmonic homogenization mgdyn serial stranded whitney
	734 - circuits_harmonic_stranded (Failed)               3D circuits harmonic mgdyn serial whitney
	735 - circuits_harmonic_stranded_homogenization (Failed) 3D circuits harmonic homogenization mgdyn serial stranded whitney
	739 - circuits_transient_stranded_full_coil (Failed)    3D circuits mgdyn serial stranded transient whitney
	827 - mgdyn_harmonic (Failed)                           benchmark serial whitney
	828 - mgdyn_harmonic_loss (Failed)                      serial whitney
	829 - mgdyn_harmonic_wire (Failed)                      serial vtu whitney
	830 - mgdyn_harmonic_wire_Cgauge (Failed)               serial vtu whitney
	831 - mgdyn_harmonic_wire_Cgauge_automatic (Failed)     serial vtu whitney
	832 - mgdyn_harmonic_wire_impedanceBC (Failed)          serial whitney
	833 - mgdyn_harmonic_wire_impedanceBC2 (Failed)         serial whitney
	848 - mgdyn_steady_quad_extruded_restart (Failed)       extrude failing restart serial whitney
	849 - mgdyn_steady_quad_extruded_restart_np3 (Failed)   extrude failing parallel restart whitney
	853 - mgdyn_steady_wire_periodic (Failed)               mortar serial whitney
	872 - p-FEM_two_solvers (Failed)                        p-fem quick serial
	873 - p-FEM_with_varying_p (Failed)                     p-fem quick serial
	958 - AIFlowSolve (Failed)                              elmerice elmerice-fast
	977 - ForceToStress_parallel (Failed)                   elmerice elmerice-fast
	1004 - Permafrost_Biot (Failed)                          elmerice permafrost
	1013 - TemperateIceTest (Failed)                         elmerice elmerice-fast
	1014 - TemperateIceTestFct (Failed)                      elmerice elmerice-fast

@hmartinez82

That MPIEXEC did the trick: I'm only getting ~50 tests failing now instead of everything. I am using the devel branch: msys2/MINGW-packages#27519

@juharu
Contributor

juharu commented Jan 23, 2026

My first guess would be that at least some of the test failures are due to missing MATC - never tried without it (!?!). It would need much more work to decouple completely; might be easier to fix the compilation (which has worked on every platform everywhere for the last 35 years or so ..)?

@hmartinez82

@juharu this is the error I'm getting with MATC enabled in CLANG64:

[1/1] Linking CXX executable ElmerGUI\Application\ElmerGUI.exe
FAILED: [code=1] ElmerGUI/Application/ElmerGUI.exe
C:\WINDOWS\system32\cmd.exe /C "cd . && C:\msys64\clang64\bin\clang++.exe -march=nocona -msahf -mtune=generic -O2 -pipe -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong -Wp,-D__USE_MINGW_ANSI_STDIO=1 -Wno-deprecated-declarations -fopenmp=libomp -O3 -DNDEBUG -mwindows @CMakeFiles\ElmerGUI.rsp -o ElmerGUI\Application\ElmerGUI.exe -Wl,--out-implib,ElmerGUI\Application\libElmerGUI.dll.a -Wl,--major-image-version,0,--minor-image-version,0 && cd ."
ld.lld: error: unable to automatically import from listheaders with relocation type IMAGE_REL_AMD64_SECREL in ElmerGUI/Application/CMakeFiles/ElmerGUI.dir/vtkpost/matc.cpp.obj
ld.lld: error: unable to automatically import from listheaders with relocation type IMAGE_REL_AMD64_SECREL in ElmerGUI/Application/CMakeFiles/ElmerGUI.dir/vtkpost/matc.cpp.obj
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed

@juharu
Contributor

juharu commented Jan 23, 2026

Right ok, it's perhaps not about the MATC on the elmersolver side (forget the patch above).. Maybe you could disable MATC only within ElmerGUI somehow, instead of for the whole package (if that is what is done now)?

@juharu
Copy link
Copy Markdown
Contributor

juharu commented Jan 23, 2026

Right ok, it's perhaps not about the MATC on the elmersolver side (forget the patch above).. Maybe you could disable MATC only within ElmerGUI somehow, instead of the whole package (if that is what is done now)?

Yes, well, maybe the cmake option for MATC is really only for ElmerGUI, and the test failures may be about something else ...
I don't have a Windows machine, so I can't easily find out what the troubles really are about ...

@mmuetzel
Contributor Author

mmuetzel commented Jan 23, 2026

I believe the reason MATC cannot be enabled is that it doesn't work in combination with OpenMP on Windows. (Access to thread-local storage across DLL boundaries doesn't work well.)

I think I know how to fix that. I'll open a PR for that shortly.

Edit: See #755.

@mmuetzel
Contributor Author

mmuetzel commented Jan 23, 2026

Rebased on a current head of the devel branch with enabled MATC after #755 has been merged. (Thanks, @juharu.)
Let's see if that makes a difference when it comes to the failing tests. (Probably a low chance...)

@mmuetzel
Contributor Author

Still 47 tests failed out of 981.
But ElmerGUI now links with MATC and OpenMP enabled. At least there is that. 😉

@mmuetzel
Contributor Author

When I build ElmerFEM with -O0 -g and run circuits_harmonic_foil with gdb attached, I get the following backtrace from the segmentation fault:

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffaff164a92 in Fortran::runtime::AssignTicket::Continue(Fortran::runtime::WorkQueue&) ()
   from D:\repo\elmerfem\.build-clang\fem\src\libelmersolver.dll
(gdb) bt
#0  0x00007ffaff164a92 in Fortran::runtime::AssignTicket::Continue(Fortran::runtime::WorkQueue&) ()
   from D:\repo\elmerfem\.build-clang\fem\src\libelmersolver.dll
#1  0x00007ffaff166030 in _FortranAAssign () from D:\repo\elmerfem\.build-clang\fem\src\libelmersolver.dll
#2  0x00007ffafe0d5976 in pelementbase::dlinenodalpbasisall (gradphi=...) at D:/repo/elmerfem/fem/src/PElementBase.F90:157
#3  0x00007ffafe16f56f in elementdescription::nodalfirstderivatives (n=2, dlbasisdx=..., element=..., u=0.57735026918962573, v=0, w=0,
    usolver=0x78b67040) at D:/repo/elmerfem/fem/src/ElementDescription.F90:2439
#4  0x00007ffafe178db8 in elementdescription::elementinfo (element=..., nodes=..., u=0.57735026918962573, v=0, w=0, detj=0, basis=...,
    dbasisdx=<error reading variable: Location address is not set.>, ddbasisddx=<error reading variable: Location address is not set.>,
    secondderivatives=<error reading variable: Cannot access memory at address 0x0>,
    bubbles=<error reading variable: Cannot access memory at address 0x0>, basisdegree=<error reading variable: Location address is not set.>,
    edgebasis=<error reading variable: Location address is not set.>, rotbasis=<error reading variable: Location address is not set.>,
    usolver=<error reading variable: Cannot access memory at address 0x0>) at D:/repo/elmerfem/fem/src/ElementDescription.F90:3234
#5  0x00007ffafed23265 in defutils::vectorelementedgedofs (bc=0x78bc5990, element=0x7aaee4c8, n=2, parent=..., np=4, integral=...,
    edofs=<error reading variable: Cannot access memory at address 0x0>, secondfamily=<error reading variable: Cannot access memory at address 0x0>,
    faceelement=<error reading variable: Cannot access memory at address 0x0>,
    quadraticapproximation=<error reading variable: Cannot access memory at address 0x0>,
    simplicialmesh=<error reading variable: Cannot access memory at address 0x0>) at D:/repo/elmerfem/fem/src/DefUtils.F90:6797
#6  0x00007ffaff047c7f in elmersolver::initcond () at D:/repo/elmerfem/fem/src/ElmerSolver.F90:2309
#7  0x00007ffaff03c2e0 in elmersolver::setinitialconditions () at D:/repo/elmerfem/fem/src/ElmerSolver.F90:1831
#8  0x00007ffaff0254cb in elmersolver (initialize=0) at D:/repo/elmerfem/fem/src/ElmerSolver.F90:554
#9  0x00007ff78cb617e8 in solver () at D:/repo/elmerfem/fem/src/Solver.F90:57
(gdb)

Does that give a hint as to what might be going wrong?

@juharu
Contributor

juharu commented Jan 26, 2026 via email

@juharu
Contributor

juharu commented Jan 26, 2026 via email

@hmartinez82

hmartinez82 commented Jan 26, 2026

@juharu Yes, Clang accepts -Wl,--stack,<bytes> and I'd be surprised if Flang is not the same since it's the same backend.

@mmuetzel
Contributor Author

Afaict, the default stack size on Windows is 1 MB, and there is no way to change the stack size at load or run time.
But as @hmartinez82 wrote, there are flags to increase the stack size when linking an executable.

Which value would be reasonable for ElmerSolver (or ElmerSolver_mpi)?
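For reference, the flag form under discussion (a sketch; 8 MiB is the value proposed in this thread, not a measured requirement, and MSVC-style linkers use /STACK: instead):

```shell
# The GNU/LLVM linker flag form for reserving a larger stack when linking a
# Windows executable; 8 MiB as suggested here (an assumption, not a measured
# requirement).
stack_bytes=$((8 * 1024 * 1024))
printf -- '-Wl,--stack,%s\n' "$stack_bytes"   # prints: -Wl,--stack,8388608
```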

@juharu
Contributor

juharu commented Jan 26, 2026 via email

@juharu
Contributor

juharu commented Jan 26, 2026

"valgrind" reports a clean run for "circuits_harmonic_foil" with a flang-compiled ElmerSolver on
my ubuntu...

@mmuetzel
Contributor Author

I pushed a commit that changes the stack size to 8 MB for the CI runners on Windows.

If that makes a difference, it might make sense to add that to the CMake build system rules instead.

@mmuetzel
Contributor Author

With that, only 14 tests failed out of 981 (from previously 47 failed tests):

	101 - ContactPatch3DZoltan_np4 (Failed)                 n-t parallel partition zoltan
	149 - DirichletNeumannZoltan_np3 (Failed)               parallel partition zoltan
	201 - ElasticBeamRestart (Failed)                       restart serial transient
	348 - Lua (Failed)                                      lua quick serial
	357 - MeshLevelsRestart (Failed)                        quick restart serial
	385 - NaturalConvectionRestart (Failed)                 restart serial transient
	386 - NaturalConvectionRestartCycle (Failed)            restart serial
	472 - RenameRestart (Failed)                            restart serial transient
	527 - SD_NaturalConvectionRestartCycle (Failed)         restart serendipity serial
	641 - TEAM30a_3ph_transient (Failed)                    harmonic restart serial
	848 - mgdyn_steady_quad_extruded_restart (Failed)       extrude failing restart serial whitney
	849 - mgdyn_steady_quad_extruded_restart_np3 (Failed)   extrude failing parallel restart whitney
	958 - AIFlowSolve (Failed)                              elmerice elmerice-fast
	977 - ForceToStress_parallel (Failed)                   elmerice elmerice-fast

Good catch!

I don't think that an 8 MB stack is excessive for scientific software. I'll look into how that could be integrated into the CMake build system rules.

@juharu
Contributor

juharu commented Jan 26, 2026

Is that expected?

Ok, thanks for the testing.

Not expected to me, obviously, but I'm not really sure what the Fortran standard says about this, how it might be interpreted, or whether it actually specifies the context at all ...?

... OTOH the flang on ubuntu seems to work ok? So maybe it's not about fortran after all?

@mmuetzel
Contributor Author

mmuetzel commented Jan 26, 2026

If I understand correctly, the issue is that Flang "stores" the program arguments in its runtime, but the Flang runtime is static (at least on Windows). The command line arguments get (correctly) initialized for the runtime instance in the executable.
But the DLL (where the ElmerSolver subroutine lives) has its own separate instance of the Flang runtime, in which the command line arguments were never set.

That sounds like a design flaw in the Flang runtime to me...

Not sure how best to solve that. (It might be best if the Flang runtime were a shared library. But that sounds hard to do without support from upstream LLVM.)

A workaround might be moving the command line argument handling from the library to the executable in Elmer. Would that be possible?

@juharu
Contributor

juharu commented Jan 26, 2026

A workaround might be moving the command line argument handling from the library to the executable in Elmer. Would that be possible?

Yes, I had the same thought; maybe I'll have a look tomorrow!

@mmuetzel
Contributor Author

mmuetzel commented Jan 26, 2026

I opened #760 which leaves the actual handling of the command line arguments where it is. But it collects them in the executable and passes them to the DLL as arguments.

@mmuetzel
Contributor Author

mmuetzel commented Jan 26, 2026

Rebased again after #759 was merged. (Thanks, @juharu.)
While at it, I dropped the commit that increased the stack size in the CI rules. (That should now happen automatically when targeting Windows.)

@mmuetzel
Contributor Author

Not sure what is currently happening: some CI jobs are failing before they even start.
I noticed the same happening in other repositories too, so something likely isn't working correctly on the GitHub side.

Maybe we just need to wait a while until they figure out what is wrong and fix it.

@mmuetzel
Contributor Author

The issue with some runners not starting yesterday was probably this: https://www.githubstatus.com/incidents/90hj03y5tj3c

I rebased on a current head of the devel branch again. After that, 4 tests failed out of 981:

The following tests FAILED:
	101 - ContactPatch3DZoltan_np4 (Failed)                 n-t parallel partition zoltan
	149 - DirichletNeumannZoltan_np3 (Failed)               parallel partition zoltan
	348 - Lua (Failed)                                      lua quick serial
	977 - ForceToStress_parallel (Failed)                   elmerice elmerice-fast

Two of them involve Zoltan; one involves Lua.

The test ForceToStress_parallel is failing with this in the log files:

  WARNING:: CheckLinearSolverOptions: Only MUMPS and CPardiso direct solver interface implemented in parallel, trying MUMPS!
  ERROR:: CheckLinearSolverOptions: MUMPS solver has not been installed.
  ERROR:: CheckLinearSolverOptions: MUMPS solver has not been installed.
  
  ERROR:: CheckLinearSolverOptions: MUMPS solver has not been installed.
  
  ERROR:: CheckLinearSolverOptions: MUMPS solver has not been installed.

There is no MUMPS package for the CLANG64 environment of MSYS2 currently. Afaict, CPardiso is part of the Intel MKL (which isn't used by the MinGW runners either).
So, that test should probably be skipped if ElmerSolver was built without either solver interface.

@juharu
Contributor

juharu commented Jan 27, 2026

I rebased on a current head of the devel branch again. After that, 4 tests failed out of 981:

Nice! Maybe we'll get there in the end. Does the Lua test log file show anything? Nothing seems broken on ubuntu ...

So, that test should probably be skipped if ElmerSolver was built without either solver interface.
I'll make it so.

@mmuetzel
Contributor Author

mmuetzel commented Jan 27, 2026

The log file for the Lua test just ends with the following for the CLANG64 CI runner:

  LoadInputFile: First time visiting
  LoadInputFile: Reading base load of sif file
  

When running the test locally, the log file continues like this:

LoadInputFile: First time visiting
LoadInputFile: Reading base load of sif file
LoadInputFile: Loading input file: case.sif
LoadInputFile: Reading base load of sif file

So, something odd might be happening before LoadInputFile: Loading input file: case.sif is written to the log. (Or the output to stdout is buffered, and the crash actually happens later than that.)

That test passes for me locally in the CLANG64 environment (also when I run it repeatedly).
But I'm not using the same configuration as the CI runner. (E.g., I'm currently building without MPI and with -O0 -g and -DCMAKE_BUILD_TYPE="Debug" to be able to attach gdb and get reasonable debugging behavior. I haven't figured out how to use gdb with executables that require being run by mpiexec.)
I'll see whether I can reproduce the crash by making my configuration incrementally more similar to the CI's.

@mmuetzel
Copy link
Copy Markdown
Contributor Author

The Lua test passes if I build with -O0. It fails with -O1.
For -O1 -g, I get the following backtrace from the segmentation fault:

Thread 1 received signal SIGSEGV, Segmentation fault.
readandtrim::trimluaexpression () at D:/repo/elmerfem/fem/src/GeneralUtils.F90:1236
1236               tninlen = ninlen
(gdb) bt
#0  readandtrim::trimluaexpression () at D:/repo/elmerfem/fem/src/GeneralUtils.F90:1236
#1  0x00007ffba3080ae9 in generalutils::readandtrim (unit=28, str=..., echo=.FALSE.,
    literal=<error reading variable: Cannot access memory at address 0x0>, noeval=.FALSE.) at D:/repo/elmerfem/fem/src/GeneralUtils.F90:1054
#2  0x00007ffba3050816 in loadinputfile::sectioncontents (list=0x0, infileunit=28, scanonly=.TRUE.)
    at D:/repo/elmerfem/fem/src/ModelDescription.F90:1668
#3  0x00007ffba304357b in modeldescription::loadinputfile (model=..., infileunit=28, baseload=.TRUE., scanonly=.TRUE., runc=.FALSE.,
    controlonly=.FALSE.) at D:/repo/elmerfem/fem/src/ModelDescription.F90:1057
#4  0x00007ffba3059fd9 in modeldescription::loadmodel (boundariesonly=.FALSE., numprocs=1, mype=-1549971564, meshindex=12581512)
    at D:/repo/elmerfem/fem/src/ModelDescription.F90:2641
#5  0x00007ffba35a633f in elmersolver (initialize=0, args=..., noargs=<optimized out>) at D:/repo/elmerfem/fem/src/ElmerSolver.F90:406
#6  0x00007ff752b21892 in solver () at D:/repo/elmerfem/fem/src/Solver.F90:86
(gdb)

What is odd is that tninlen is apparently a "null pointer" (or whatever that is called in Fortran) at that point:

(gdb) p ninlen
$1 = 13
(gdb) p tninlen
Cannot access memory at address 0x0

Maybe something about the !$OMP directives around there is not quite correct?

@juharu
Contributor

juharu commented Jan 27, 2026

At first look, both "ninlen" and "tninlen" seem to be declared as simple integer variables, so this seems odd...

@mmuetzel
Contributor Author

mmuetzel commented Jan 27, 2026

I'm not an OpenMP expert. But the following change avoids the segmentation fault for me:

 fem/src/GeneralUtils.F90 | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fem/src/GeneralUtils.F90 b/fem/src/GeneralUtils.F90
index 8240cf21b..9c1e75649 100644
--- a/fem/src/GeneralUtils.F90
+++ b/fem/src/GeneralUtils.F90
@@ -1242,9 +1242,7 @@ CONTAINS
              closed_region = .FALSE.
            END IF
 
-           !$OMP PARALLEL DEFAULT(NONE) &
-           !$OMP FIRSTPRIVATE(tcmdstr, tninlen, lstat) &
-           !$OMP SHARED(lua_result, result_len, closed_region, i, j, inlen, first_bang)
+           !$OMP CRITICAL(LuaEval)
            IF(closed_region) THEN
              lstat = lua_dostring( LuaState, &
                  'return tostring('// tcmdstr(1:tninlen-1) // ')'//c_null_char, 1)
@@ -1252,7 +1250,7 @@ CONTAINS
              IF (i == 1 .and. first_bang .and. j == inlen) THEN  ! ' # <luacode>' case, do not do 'return tostring(..)'.
                ! Instead, just execute the line in the lua interpreter
 
-             lstat = lua_dostring( LuaState, tcmdstr(1:tninlen) // c_null_char, 1)
+               lstat = lua_dostring( LuaState, tcmdstr(1:tninlen) // c_null_char, 1)
 
              ELSE ! 'abc = # <luacode>' case, oneliners only
 
@@ -1260,10 +1258,8 @@ CONTAINS
                    'return tostring('// tcmdstr(1:tninlen) // ')'//c_null_char, 1)
              END IF
            END IF
-           !$OMP CRITICAL
            lua_result => lua_popstring(LuaState, result_len)
-           !$OMP END CRITICAL
-           !$OMP END PARALLEL
+           !$OMP END CRITICAL(LuaEval)
 
            matcstr(1:result_len) = lua_result(1:result_len)
            ninlen = result_len

I.e., expand the critical section around the entire code that uses LuaState and remove the parallel section in that function completely.

Does that look reasonable? (Edit on 28-01-2026: No, it does not.)

@juharu
Contributor

juharu commented Jan 27, 2026

There is no indication of any trouble with either gfortran or flang-21 on my Ubuntu. To me that says either compiler problems (targeting MinGW), or maybe it's about stack size again? There is a somewhat large "MAXLEN" parameter in that routine; maybe you could try dividing it by, say, 10?

@juharu
Contributor

juharu commented Jan 27, 2026

Thanks, I'll forward your patch to Juhani Kataja, who (I think) included the Lua stuff.

@mmuetzel
Contributor Author

I opened #761 for the OpenMP change regarding the Lua test. I think that is the right fix for an actual issue. (But like I already wrote: not an OpenMP expert.)

@juharu
Contributor

juharu commented Jan 28, 2026

No problems with the Zoltan cases on my Ubuntu & gcc either, other than that building Zoltan needed "-Wno-incompatible-pointer-types" added to the C compiler options.
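For reference, one way to pass such a flag at configure time is via the C flags cache variable (an illustrative invocation, not the project's documented build command; the WITH_Zoltan option name is assumed here):

```sh
# Illustrative only: add the warning suppression to the C compiler options
cmake -DCMAKE_C_FLAGS="-Wno-incompatible-pointer-types" \
      -DWITH_Zoltan:BOOL=TRUE ..
```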

@mmuetzel
Contributor Author

It looks like LLVM Flang has problems isolating variables from the surrounding context in OpenMP regions (in nested subroutines).

LLVM 22 is currently in its release candidate phase. AFAICT, there have been some improvements to the OpenMP implementation in Flang in that version. Maybe we should just wait until LLVM 22 is released (and MSYS2 is updated to that version). We can then continue here with the newer version of LLVM Flang.

I'll leave this open for now if you don't mind.

@mmuetzel
Contributor Author

MSYS2 now distributes a MUMPS package for their CLANG64 environment (see: msys2/MINGW-packages#28452).

LLVM is still at version 21. So, I expect no change with respect to the OpenMP errors. (But a couple more tests might be running now that MUMPS can be enabled in that environment.)

@mmuetzel
Contributor Author

In the latest round of CI, 8 out of 1007 tests failed on the new runner:

  The following tests FAILED:
  	105 - ContactPatch3DZoltan_np4 (Failed)                 n-t parallel partition zoltan
  	154 - DirichletNeumannZoltan_np3 (Failed)               parallel partition zoltan
  	354 - Lua (Failed)                                      lua quick serial
  	446 - PartitioningZoltanQuads_np4 (Failed)              parallel zoltan
  	597 - Shell_with_Solid_Beam_Par_np3 (Failed)            block parallel shell
  	602 - Shell_with_Solid_EigenanalysisPar_np3 (Failed)    block eigen parallel shell
  	657 - TEAM30a_3ph_transient_par_np3 (Failed)            harmonic parallel restart
  	702 - WinkelBmNavierInternalFETI_np4 (Failed)           mumps parallel
  Errors while running CTest

I haven't looked at the details.

The CLANG64 environment of MSYS2 is based on a LLVM toolchain. That means the
compilers (`clang`, `clang++`, `flang`), linker (`lld`), other "binutils"
(e.g., `ar`, `nm`), and runtime libraries (compiler runtime, OpenMP, ...) are
from LLVM.

Use that environment to build a larger part of ElmerFEM with LLVM Flang
(compared to the CI runner using Flang on Ubuntu).
Use a shell array for the arguments for the MUMPS library to simplify the
`pkg-config` commands.
@mmuetzel
Contributor Author

MSYS2 updated their packages to LLVM 22. Re-based to check if that makes a difference.

@mmuetzel
Contributor Author

mmuetzel commented Mar 25, 2026

Apparently, that didn't make a difference. The following 8 out of 1007 tests still failed:

	105 - ContactPatch3DZoltan_np4 (Failed)                 n-t parallel partition zoltan
	154 - DirichletNeumannZoltan_np3 (Failed)               parallel partition zoltan
	354 - Lua (Failed)                                      lua quick serial
	446 - PartitioningZoltanQuads_np4 (Failed)              parallel zoltan
	597 - Shell_with_Solid_Beam_Par_np3 (Failed)            block parallel shell
	602 - Shell_with_Solid_EigenanalysisPar_np3 (Failed)    block eigen parallel shell
	657 - TEAM30a_3ph_transient_par_np3 (Failed)            harmonic parallel restart
	702 - WinkelBmNavierInternalFETI_np4 (Failed)           mumps parallel

I still haven't looked into the details or any commonality between the failing tests.
