August 16, 2023

Offline Leads meeting – Aug 16, 2023

Attendees: Herb Greenlee, Thomas Junk, Tingjun Yang, Erica Snider, Katherine Lato

LArSoft Report:

  • Experiment meetings for developing the 2024 work plan will start in September. These are the one-on-one meetings in which we develop the work plan for the next year.
    • Please let us know if there is someone other than the offline lead whom we should talk with for your experiment.
  • Progress on this year’s plan has been steady in some areas, but not fast.
    • Thread safety: all larevt services are now thread safe. Those that access the database use concurrent caching. Now working on experiment code: ArgoNeuT is complete, SBND is in progress (work started in spring).
    • The discrepancy between single-threaded and MT results in the SN pipeline demonstrator is understood, but work to fix it is still underway (ongoing since winter).
    • Geometry refactoring to accommodate pixels: the geometry has been factored into a wire readout geometry and a volume geometry. Now adapting experiment code. After validation, we will introduce a pixel readout geometry (ongoing since winter).
    • GPUaaS: working to extend support to PyTorch, which will allow PyTorch GNNs to be used.
      • Herb: We have restarted a Reco Group. One of the conveners, who is at Tufts, has been tasked with integrating the DL reco workflows. Several people have tried to run LArSoft on non-standard architectures. We’re thinking of appointing someone in MicroBooNE to be the HPC tzar.
        • Is GPUaaS targeting HPC, or stand-alone GPUs?
        • Erica: GPUaaS is specifically for running inference from grid nodes. We can also develop algorithms to run directly on HPC systems or on GPUs. GPUaaS currently supports only TensorFlow; it is being extended to PyTorch, which should be of interest to MicroBooNE and DUNE.
      • Tom: Regarding the network layer: if you have a lot of GPUs, you don’t want to share them over the network. If you want to use GPUs both locally and remotely, do you need two versions of everything?
        • Erica: No. GPUaaS uses a Triton server to access GPUs. That server can run remotely or on the same node as the job running the client, so it can also serve jobs locally.
      • Tom: Are we going to do this (run on local GPUs or through GPUaaS) for future algorithms?
        • Erica: Grid-like resources will continue to be important, and that is the problem GPUaaS solves. The future problem is making algorithms run natively on GPUs/HPC. Data must be structured properly for GPU processing to be efficient, so some code will need to be restructured to make use of portability packages. Marc Paterno is available to help assess the changes needed and to modify or write the algorithms. [Some algorithms will be best optimized by a complete rewrite; others can be adapted as described here.]
  • The budget for next year looks very tight. The impact on LArSoft is not yet known. Paradoxically, it might help, but that remains to be seen.
  • Spack
    • Effort is again being put into making LArSoft’s external dependencies buildable. There is a large set of products to deal with, but headway has been made. We believe writing build recipes for most products will be straightforward, with a couple of exceptions, notably TensorFlow.
    • The tutorial has been given to Mu2e. We are working with Mu2e to produce a clean, usable build, and hope the tutorial will eventually be given to the LArSoft community.
    • CI’s first nightly build of the art suite completed! It was built on both SLF 7 and AL 9. The build scripts are currently invoked independently of Jenkins; we need to move to using Jenkins on the backend to reduce boilerplate.
      • If we end up on AL 9, the plan is to retire UPS.
    • Experiments continue to struggle to keep the develop branches of their repositories current, even though integration releases are now less frequent than the usual once per week.
      • Should we revisit this release model? We will make it a topic in the discussions with each experiment.


MicroBooNE – Herb Greenlee

  • Main concern is the cetbuildtools and Spack transition, and in particular how to keep MCC9.1 (which is old) alive through the transition. The current effort to facilitate the Spack transition is focused on CMakeLists.txt files: there are 400 to 500 of them in ub*, and about 80% are done.
    • Was going to investigate whether the same thing can be done with MCC9.
  • The CMakeLists.txt work has taken a couple of weeks, part-time. It will go faster next time, since we now know what to do and have helper scripts ready. Not sure whether the underlying packages will require a rebuild of art (currently on version 3.1.2). Herb is talking with Kyle. Curious whether other experiments have migrated to cetmodules, as Chris recommends.
    • Tom thinks that DUNE updated about a year ago, using the migrate script from Chris. There is still a lot of UPS in the CMakeLists.txt files, and it is not clear how to make it go away; not everything works if UPS is replaced with find_package.
    • Need to get rid of art_make calls, since art_make is deprecated in cetmodules.
    • Tom noted that many of the ROOT targets need to be specified one by one.
    • Herb commented that this should be easy, since cetmodules builds in a lot of transitivity: all build targets are transitive, so the list of dependencies gets shorter relative to cetbuildtools.
    • Tom thought this might be fragile, for instance if someone removes something and thereby breaks the transitivity. He cited the example of a GArSoft package that turned out to be one of the LAr pieces.
  • Otherwise, just keep MicroBooNE informed about AL9 and what is coming.
    • Would not be opposed to keeping UPS alive on AL9.

DUNE – Tom Junk

  • Same issues already discussed
  • Want a sustainable Spack solution that we can hand off to junior release managers.
    • Tom noted that one individual is unilaterally attempting a Spack build of the entire suite, and believes that person is about two-thirds of the way through the product stack. No details are known about what they are doing, other than that they are hacking through it on their own.
  • DUNE requested an AL 9 build node, but the request is on hold for now, since the collaboration can’t use AL9 yet (except on Tom’s personal desktop, which runs AL9). So there are no AL9 resources for the collaboration.
    • An AL 9 build node can be changed to another type, so it would not be a waste.
    • Tom: SL7 can be run in a container on AL9. The grid container lacks some needed packages, so he added a number of development RPMs to the SL7 container on his desktop node.
    • Tom has questions about privileges in a container, but that is not a topic for this meeting.

ArgoNeuT – Tingjun Yang

  • ArgoNeuT has decided to stop updating its code and will not need any new art releases in the future. ArgoNeuT code is now an island, like lariatcode.

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.