All posts by klato

January 30, 2024

The January 2024 LArSoft Offline Leads status update was handled via email and a Google document.

LArSoft – Erica Snider

  • Work is proceeding to accelerate DUNE physics processing using GPUs for select LArSoft algorithms, starting with the PDFastSimPAR module. The plan is to first optimize serial execution, then parallelize on CPUs, and, if needed to achieve performance goals, parallelize on GPUs. Update reports will be given this month at both the LArSoft Coordination Meeting and the DUNE FD reco group.
  • A new machine learning algorithm, NuGraph2, is available within LArSoft. The algorithm performs hit classification and clustering using a graph neural network (GNN). It is being developed for use in MicroBooNE and will later be used in ICARUS. Details were discussed at the Dec 12 LCM. Documentation on the algorithm and how to use it will be posted to LArSoft.org in the coming weeks.
  • A Spack build of LArSoft became available last month. SciSoft is planning to provide a tutorial on how to build experiment code under Spack using this new release. Experiments wishing to participate will need to use machines under AL9 to run the builds.
  • Geometry re-factoring:  working on completing documentation, the last step needed prior to releasing it as a LArSoft v10 release candidate for experiment validation.

DUNE – Heidi Schellman, Tingjun Yang, Tom Junk

DUNE will start work this period on the Spack migration now that the LArSoft Spack build has been made available. We have already been in contact with Steve White, Marc Mengel, Patrick Gartung and Kyle Knoepfel on the subject. We will need to keep SL7 build nodes and interactive nodes as long as possible. There have been proposals to shut them down in March, but we would like to keep them longer. We are also attending the container task force meetings. We have a container that works for builds, and will test the current worker node container again.

LArSoft comment:  The project pushed back on the original March 20 suggestion, and SCF Dept Head agreed to allow at least some build nodes to remain up beyond that date. A suggested compromise shutdown date is mid-May. LArSoft (tentatively) responded that this would be acceptable. EOL is end of June.

ICARUS – Daniele Gibin, Tracy Usher

No Report

LArIAT – Jonathan Asaadi

No Report

MicroBooNE – Herbert Greenlee

No Report

SBND – Andrzej Szelc

No Report

SBN Data/Infrastructure – Steven J. Gardiner, Giuseppe B. Cerati

No Report

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.

November 15, 2023

LArSoft Offline Leads 11/15/23 Meeting

Attendees: Herb Greenlee, Tom Junk, Will Foreman, Giuseppe Cerati, Erica Snider, Katherine Lato

At the meeting, we went over the draft of the 2024 work plan for LArSoft. 

Number one priority continues to be multi-threading and High Performance Computing (HPC) with several people working on this in 2024. Notably, SciSoft has effort available to make algorithms run on GPUs. Requirement is that they be relatively slow, in LArSoft, and amenable to acceleration with a GPU. There was a follow-on question about GPU as a service. This is described at: https://larsoft.org/using-gpu-as-a-service-in-larsoft/. SciSoft believes this is ready to run at scale, so just need to work out the logistics of spinning up the GPU server somewhere.

Spack migration has a deadline of Q2 2024 in order to provide time for experiments to complete their migrations in advance of the June 2024 SL7 EOL. AL9 requires Spack, since there will be no UPS support. This work seems close, though we still do not have a detailed timeline to completion. Comment: Experiments have changes to make, so do not want to be pinched for time. Response: Given that all experiment code has been migrated to cetmodules, the procedure for getting from there to a Spack build should be straightforward. Expect that any issues will be related to getting the experiment-specific product stacks under Spack. Q: Containers should provide some cover? A: Yes, they should provide a buffer to the June 2024 EOL. Generally, the process will be that AL9 and SL7 will co-exist for some time to allow experiments to migrate. Spack has to come first.

Support for a multi-experiment event display has been on the wish list for a while. It stays in the plan so that upper management understands that multiple experiments have requested it.

A new item in this year’s work plan is a set of updates to the LArSoft infrastructure. These include, but are not limited to:

  • Support for TPC-dependent sampling frequencies. Sampling frequencies vary across TPCs in protoDUNE, while LArSoft supports only a single value.
  • Support for non-planar cathode geometries to facilitate tracking across non-planar cathodes. 
  • Support for TPC-dependent drift velocities and electron lifetimes. 
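As an illustration of what "TPC-dependent" values could look like, here is a minimal sketch in Python. It is not LArSoft code: the class name, field names, and numbers are all hypothetical, and it assumes that each property keeps a single default value with optional per-TPC overrides.

```python
from dataclasses import dataclass, field


@dataclass
class DetectorProperties:
    """Hypothetical container: one default value plus optional per-TPC overrides."""

    default_sampling_mhz: float
    default_drift_velocity_cm_per_us: float
    sampling_mhz_by_tpc: dict = field(default_factory=dict)
    drift_velocity_by_tpc: dict = field(default_factory=dict)

    def sampling_mhz(self, tpc_id: int) -> float:
        # Fall back to the single global value when no override exists.
        return self.sampling_mhz_by_tpc.get(tpc_id, self.default_sampling_mhz)

    def drift_velocity(self, tpc_id: int) -> float:
        return self.drift_velocity_by_tpc.get(
            tpc_id, self.default_drift_velocity_cm_per_us
        )

    def drift_ticks(self, tpc_id: int, drift_distance_cm: float) -> float:
        # ticks = drift time (us) * sampling frequency (MHz, i.e. ticks per us)
        drift_time_us = drift_distance_cm / self.drift_velocity(tpc_id)
        return drift_time_us * self.sampling_mhz(tpc_id)


# Illustrative numbers only; not taken from any experiment configuration.
props = DetectorProperties(
    default_sampling_mhz=2.0,
    default_drift_velocity_cm_per_us=0.16,
    sampling_mhz_by_tpc={1: 2.5},  # TPC 1 samples at a different rate
)
```

With a lookup of this kind, code that converts a drift distance into ticks asks for the value belonging to a specific TPC instead of assuming one detector-wide number.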

Appendix B is a short summary of our major observations from one-on-one meetings with each experiment in September and October of 2023. Common items include: 

  1. Event display that is useful in the current environment.
  2. HPC.
  3. Faster processing.
  4. Event generators.

Round Robin:

  • SBND:  Will Foreman
    • working to integrate blip reco into SBND code. Once completed, will migrate to LArSoft. But first want to make sure it is working in SBND
  • MicroBooNE:  all questions answered during work plan discussion
  • SBN: Giuseppe
    • SBND has allocation to run on Polaris at Argonne. Will be running simulation. Running in a container. First asked about Spack, but was not available, so opted for container. That has been demonstrated to work.
  • DUNE:  will send comments later.

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.

August 16, 2023

Offline Leads meeting – Aug 16, 2023

Attendees: Herb Greenlee, Thomas Junk, Tingjun Yang, Erica Snider, Katherine Lato

LArSoft Report:

  • Experiment meetings for 2024 plan development coming up starting in Sept. This is when we meet one-on-one to develop the work plan for the next year.
    • Please let us know if there is someone other than an offline lead that we should talk with for your experiment?
  • Progress on this year’s plan has been steady in some areas, but not fast
    • Thread safety:  all larevt services thread safe. Ones that access DB use concurrent caching. Now working on expt code. ArgoNeuT completed, SBND in progress (work started in spring)
    • Discrepancy between single-threaded and MT results in SN pipeline demonstrator understood, but work still underway to fix (on-going since winter)
    • Geometry refactoring to accommodate pixels:  factored into wire readout geometry and volume geometry. Adapting experiment code. After validation, will introduce pixel readout geom (on-going since winter)
    • GPUaaS:  working to extend to PyTorch. Allows PyTorch GNNs
      • Herb: Have re-started a Reco Group. One of the conveners is at Tufts, and has been tasked with integrating the DL reco workflows. Several people have tried to run LArSoft on non-standard architectures. We’re thinking of appointing someone in MicroBooNE to be the HPC tzar.
        • Is GPUaaS targeting HPC, or stand-alone GPUs?
        • Erica: This GPUaaS is specifically for inferencing from grid nodes. We can also develop algorithms to run directly on HPC or on a GPU. However, GPUaaS currently runs only with TensorFlow. It’s being extended to PyTorch, which should be of interest to MicroBooNE and DUNE.
      • Tom: Network layer–don’t want to share if you have a lot of GPUs. If you just want to use it nearby and remote, do you need two versions of everything?
        • Erica: No. Our GPUaaS uses a Triton server to access GPUs. That server can be run remotely or on the same node as the job running the client, so it can also serve jobs locally.
      • Tom: Are we going to do this (run on local GPUs or through GPUaaS) for future algorithms?
        • Erica: Grid-like resources will continue to be important, which is the problem GPUaaS solves. The future problem is to make algorithms run natively on GPUs/HPC. Data needs to be structured properly to be efficient on GPUs. Will need to restructure some code and make use of portability packages. Marc Paterno is available to help with assessing changes needed, and modifying / writing the algorithms. [Some algorithms will be best optimized by completely rewriting. Others can be adapted as noted here.]
  • Budget for next year looks like it will be very tight. Do not know the impact on LArSoft. Paradoxically might help, but all remains to be seen.
  • Spack
    • Effort is again being put into making LArSoft’s external dependencies buildable under Spack. While there is a large set of products to deal with, headway has been made. We believe building recipes for most products will be straightforward, with a couple of exceptions, notably TensorFlow.
    • The tutorial has been given to Mu2e.  We are working with Mu2e to produce a clean usable build. Hope they will eventually give the tutorial to LArSoft community
    • CI’s first nightly build of the art suite completed! It was built on both SLF 7 and AL 9!  The build scripts are invoked independently from Jenkins.  We need to move to using Jenkins on the backend to reduce boilerplate.
      • If we end up on AL 9, plan is to retire UPS.
    • Experiments continue to struggle with keeping develop branch of repositories current, even as rate of integration releases is less than the usual once per week.
      • Should we discuss this release model? Will make this a topic for discussions with each experiment.
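The remark in the GPUaaS discussion above that data needs to be structured properly for GPUs can be illustrated with a small sketch. It assumes the usual array-of-structures versus structure-of-arrays distinction is what is meant. The hit fields and values are hypothetical, and plain Python typed arrays stand in for device buffers.

```python
from array import array

# Array-of-structures: one record per hit, convenient for CPU-side code.
# Field names and values are illustrative only.
hits_aos = [
    {"channel": 10, "peak_time": 120.5, "charge": 35.0},
    {"channel": 11, "peak_time": 121.0, "charge": 40.0},
    {"channel": 12, "peak_time": 119.5, "charge": 25.0},
]


def to_columns(hits):
    """Repack hit records into contiguous, typed columns (structure-of-arrays)."""
    return {
        "channel": array("i", (h["channel"] for h in hits)),
        "peak_time": array("d", (h["peak_time"] for h in hits)),
        "charge": array("d", (h["charge"] for h in hits)),
    }


hits_soa = to_columns(hits_aos)

# A per-field operation now walks one contiguous buffer instead of
# hopping between records, which is the access pattern GPUs favor.
total_charge = sum(hits_soa["charge"])
```

Each column is a flat buffer of a single type, so it could be copied to an accelerator in one transfer. Restructuring real algorithms is more involved than this repacking step.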

MicroBooNE – Herb Greenlee

  • Main concern is the transition away from cetbuildtools and to Spack, and in particular, how to keep MCC9.1 (which is old) alive through the transition. Current effort to facilitate the Spack transition is aimed at addressing CMakeLists.txt files. Have 400-500 in ub*, and are about 80% done with that.
    • Was going to investigate whether the same thing can be done with MCC9.
  • It’s taken a couple of weeks working on it part-time. It will go faster next time, since they know what to do now and have helper scripts ready. Not sure if underlying packages may require a rebuild of art. Currently on version 3.1.2. Herb is talking with Kyle. Curious whether other experiments have done cetmodules as Chris recommends.
    • Tom thinks that DUNE has updated, about a year ago. Used the migrate script from Chris. Still have a lot of UPS in CMakeLists.txt files, and don’t know how to make it go away. Not everything works if UPS is replaced with find_package.
    • Needs to get rid of art_make’s, since that is deprecated in cetmodules
    • Tom knows that a lot of the root targets need to be specified one by one
    • Herb commented that this should be easy, since cetmodules builds in a lot of transitivity. All build targets are transitive, so list of dependencies gets shorter relative to cetbuildtools
    • Tom thought it might be fragile, if someone removes something that breaks transitivity, for instance. Example of a GArSoft package that turned out to be one of the LAr pieces.
  • Otherwise, just keep MicroBooNE informed of AL9 and what is coming. 
    • Would not be opposed to keeping UPS alive in AL9
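The transitivity point in the exchange between Herb and Tom above can be sketched as a small dependency graph. This is not how cetmodules or CMake is implemented, and the target names are hypothetical. It only shows why transitive targets shorten dependency lists, and why removing an intermediate link can break downstream code.

```python
def transitive_deps(target, direct_deps):
    """Return every target reachable from `target` through declared dependencies."""
    seen = set()
    stack = list(direct_deps.get(target, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(direct_deps.get(dep, ()))
    return seen


# Hypothetical targets: each one lists only its direct dependencies.
direct = {
    "my_module": ["reco_lib"],
    "reco_lib": ["geometry_lib"],
    "geometry_lib": ["root_core"],
}

# my_module names a single dependency but still links against all three.
full = transitive_deps("my_module", direct)

# The fragility: if reco_lib stops depending on geometry_lib, my_module
# silently loses geometry_lib and root_core as well.
pruned = dict(direct, reco_lib=[])
after_removal = transitive_deps("my_module", pruned)
```

In this sketch my_module declares one dependency and inherits the rest, which is the shortening Herb describes. The second call shows the failure mode Tom raises.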

DUNE – Tom Junk

  • Same issues already discussed
  • Want a sustainable Spack solution that we can hand off to junior release managers.
    • Noted that there is a unilateral effort by one individual to do a Spack build of the entire suite. Believes that person is about ⅔ of the way through the entire product stack. Knows no details of what they are doing, other than that they’re hacking through this on their own.
  • Requested an AL 9 build node, but on hold for now, since they can’t use it yet (except on Tom’s personal desktop, which is running AL9). So no AL9 resources for the collaboration.
    • Can change an AL 9 build node to another type, so it’s not a waste.
    • Tom: can run SL7 in a container on AL9. The grid container doesn’t have some packages needed, so added a bunch of development RPMs to the SL7 container on his desktop node. 
    • Has questions about privilege in a container, but that is not a topic for this meeting.

ArgoNeuT – Tingjun Yang

  • Have decided to stop updating code. Don’t need any new art releases for ArgoNeuT code in the future. So are now an island, like lariatcode.

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.

May 2023 Meeting Notes

Offline Leads Meeting, May 30

Attendees: Erica Snider, Tom Junk, Tracy Usher, Miquel Nebot, Tingjun Yang, Steven Gardiner, Katherine Lato

LArSoft Report: Erica

  • Work plan update
    • Saba Sehrish has been working to pick up the work to make services thread safe. 
      • She had completed a number of changes, but was pulled away before she was able to commit them to the head of develop. Has nearly finished updating all the relevant feature branches to be able to merge these.
      • Most important priority is to address any outstanding services that access database, and to make sure they use the concurrent caching currently available in LArSoft. 
    • Spack migration timelines are still uncertain.
      • The Spack team is confident that major changes to Spack aren’t needed, but is not yet sure whether the current implementation of Spack can support what is needed, hence the uncertainty of the timeline.
      • One note they emphasized is that we will be able to support builds under Spack long before we have a complete development environment.
    • Pixel detectors within LArSoft. 
      • There has been steady progress refactoring the geometry service as needed, and this work is nearly completed. 
      • Once that is done, the current geometry will be implemented via two separate components, a volume geometry hierarchy, and a readout geometry description. The latter knows about the former, but not vice versa. 
      • The next step will be to implement a pixel readout geometry description. A draft of this already exists. Full implementation will require working with people from DUNE ND LAr.
    • No progress on neutrino event refactoring. 
  • AL9 support
    • Current CSAID plan is to not spend the effort to bring UPS forward to AL9. If this is an issue, talk to your Spokespeople and have them bring it to the attention of CSAID management. 
  • Compilers
    • Recently included the possibility to build under clang 14 (c14) and gcc 12.1 (e26) starting with LArSoft v09_74_01 released on May 5. Default builds remain unchanged.
    • c14 and e26 will become the default qualifiers when we migrate to art 3.13 (which will be soon)
  • Experiments:  please remember that keeping experiment code in pace with integration releases is necessary in order to have the benefit of the CI system for code updates and release validation. 

DUNE – Tom Junk

  • Release information, we’re at 9.75 at DUNE. 
  • One of our customers of SPACK is sending emails. Misner? He wants to run stuff on __ LArCP3 and link it with. We may have to upgrade DUNE’s software SPACK. There were some weird warnings. We contacted original developers, who reassured us that they were valid. 

SBND – Miquel

  • Worked through the update from the last release. Resolved an issue with external dependencies. They were holding us back, but they are aware.

ArgoNeuT – Tingjun

  • Person who has been making updates has a new position, so won’t be continuing. Management says to drop ArgoNeuT support. If anyone wants to do work, they will have to do the updating. Believe there is no active work being done.
  • So, ArgoNeuT is in the same position as LArIAT. There haven’t been any LArIAT releases for a long time.
  • Thanks for the support for so many years.
  • Probably makes sense to remove them from the CI tests. 

SBN: Steven

  • Miquel covered some. Do you want a message to take to MicroBooNE?
  • In response to Steven’s request for more information on neutrino event refactoring, Erica explained that LArSoft is built against a particular version of GENIE. The same is true of every generator. Currently, the only way to break that connection is to produce a text file from the generator, and the text file is difficult to deal with. The idea was to break that dependency so that you could change the version of GENIE at run-time and not have LArSoft built against a particular one. That would be a model for all other generators that are built within LArSoft. Experiments pretty closely track each other for what version they use, but there’s no reason why we should have to coordinate that through LArSoft. Working with Robert Hatcher to do that. Developed a plan at the beginning of last year. It looked good. It’s been sitting ever since. Tried to get him to pick it up recently. He may have taken it back to GENIE management.
  • Steven may follow up with Robert directly.

ICARUS – Tracy:

  • We are making progress for converting to the new LArG4. Converting to the geometry. Giuseppe picked it up recently and he’s been making progress. General schedule has a large simulation/reconstruction in fall with SBND. Targeting to have the work done at the end of summer. Giuseppe needs to work with Hans.
  • We are impacted by space charge stuff. Erica: I’ll try to get a meeting this week.
  • We have a soft production release. The main production is going to be done in fall. What we’re doing now is what we think we’ll do for the summer.
  • ICARUS workshop in 3 weeks, if we can find a room at Fermilab. 
  • We’re making good progress. ICARUS has its own challenges with the horizontal wires and other things, but we’re starting to make good progress.
  • Unified event display – continued lament. Titus would be good, but it doesn’t display what ICARUS needs. They don’t have the cycles to fix it themselves. All along it’s been ‘made to work’ which makes it difficult to understand. ICARUS breaks everything. 
    • Erica:  heard. Currently have no one available to work on this. The project continues to lobby for the effort needed.

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.

April 2023 Meeting Notes

The April 2023 LArSoft Offline Leads status update was handled via email and a Google document.

LArSoft – Erica Snider

  • Geometry refactoring to accommodate pixels
    • Have made good progress in removing the detector-specific readout geometry description from the generic detector volume description. Final elements of the design are completed, and the work to refactor the wire readout geometry description is at an advanced stage (Kyle Knoepfel). We expect to start testing this new geometry within the next two weeks. Once completed, a pixel readout geometry will be introduced based on an existing prototype written by Tom Junk. 
  • Thread safety work status
    • Saba Sehrish has picked up work she performed over a year ago on making LArSoft services thread safe. This work included introducing concurrent caching needed to make services that access databases thread safe. She began her current effort by working to bring forward the feature branches of that previous body of work to the head of develop. She will then turn to the service changes needed by ICARUS to make their production workflow thread safe.
  • art 3.12 update
    • We expect to be ready to migrate to art 3.12 on the time scale of a week or two. There will be some changes to user code needed. Kyle Knoepfel will talk about these at the April 18 LCM.
  • AL9 support
    • LArSoft will be supported under AL9 as SL7 enters end of life. In preparation for this, the SciSoft team has acquired the necessary infrastructure (build and CI nodes) to begin working on this support. An important point to take note of is that CSAID does not now support UPS under AL9, and has no plans to do so in the future. Consequently, the migration to the Spack-based build and development environments must be completed prior to providing full support for AL9. Since the timeline for the Spack migration is not yet known, however, we do not know when full support for AL9 will be available. We will keep the community informed as the situation changes.
  • Multi-threading workshop

DUNE – Heidi Schellman, Tingjun Yang, Michael Kirby, Tom Junk

Just to make people aware: Shekhar Mishra has been attempting to build LArSoft and dunesw on AL9.1, as part of an AI/ML project for ICEBERG. He says he has been able to build LArSoft on AL9.1 and is working on dunesw. Not all dependent products of dunesw are built with mrb or have known build shims, such as duneanaobj and srproxy. Some products have hand-built UPS products: dunedaqdataformats, dunedetdataformats, highfive and nlohmann_json. The hand-built UPS products do not involve compiled code. They are header-only libraries and should just copy over to Shekhar’s setup. We have been answering Shekhar’s questions but not actually doing work to support this path. We have suggested using SL7 in a container on his AL9 machine as an interim solution.

ICARUS – Daniele Gibin, Tracy Usher

No Report

LArIAT – Jonathan Asaadi

No Report

MicroBooNE – Herbert Greenlee

SBND – Andrzej Szelc

No Report

SBN Data/Infrastructure – Steven J. Gardiner, Giuseppe B. Cerati

No Report

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.

February 2023 Meeting Notes

Offline Leads – Feb. 22, 2023

Attendees: Giuseppe Cerati, Erica Snider, Katherine Lato

LArSoft report

  • SciSoft / LArSoft effort
    1. Expected gains in effort from the Division, as suggested earlier from CSAID management, will probably not materialize, so the Division is unlikely to have additional effort to apply to SciSoft support
    2. Both Saba Sehrish and Marc Paterno are completing project work that has occupied their attention for the past few years, and will have time to devote to LArSoft
    3. Saba will pick up multi-threading work that she was pursuing earlier. Will talk to Tracy and Giuseppe about the ICARUS production workflows that were the focus of that. In the longer term, has an interest in working on expanding GPU processing capabilities within LArSoft
    4. Marc has interest in GPU programming, so will be looking for algorithms that lend themselves to GPU solutions and writing code for that.
      • Saba’s GPU interests may complement this in that she might have interest in enabling GPU applications in a production setting
      • Both are needed to advance GPU usage
  • LArSoft 2023 work plan update
    1. Multi-threading
      • Mike is continuing to make progress on DUNE SN pipeline. Currently working to fix issues in TrajCluster, which uses globals extensively. Has fixed a number of other bugs along the way. 
    2. Geometry changes to accommodate pixel detector readouts
      • Tom Junk, Tingjun Yang, and Kyle Knoepfel have been meeting regularly to discuss Geometry changes needed to accommodate pixels, which mostly involves refactoring readout geometry elements from those pertaining to physical volumes and planes.
      • There has been considerable progress on this work. Kyle’s presentation at the last Coordination Meeting describes the state of his most current work, which involves re-factoring of the ChannelMap and GeometryCore classes to break circular runtime dependencies. This work is nearing completion.
      • Remaining steps include removing WireGeo objects from PlaneGeo (since the former represents readout geometry while the latter represents a physical structure); introducing readout geometry classes, where a new wire geometry class would complete the refactoring of wire readout descriptions from the physical geometry; and providing a pixel readout geometry class, where we already have a draft class from Tom Junk.
      • Discussion
        • How visible will these changes be to the experiments?  Do not want to be in a position where they need to branch from the mainline develop branch too soon in order to avoid disruption to production schedules.
          • A:  the changes to iteration patterns will be visible, but have already been incorporated into the code. The biggest change will be that all readout geometry questions will need to be redirected from GeometryCore to a new class that depends on the type of readout you are using. So existing code that uses wires will need to adapt to that change.
          • SciSoft will make appropriate changes to experiment repositories. We will not be able to automate some of the changes, but we can provide scripts to flag what we can’t change in private code
          • SciSoft will work closely with the experiments as this rollout takes place.
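The separation discussed above, with readout geometry factored out of the physical volume description and readout questions directed to a class that depends on the readout type, can be sketched as follows. This is a rough Python illustration with hypothetical class names and a deliberately simplified channel count. It is not the LArSoft C++ design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeGeometry:
    """Physical volumes only; holds no reference to any readout description."""

    n_tpcs: int
    planes_per_tpc: int


class WireReadoutGeometry:
    """Wire readout description; depends on the volume geometry, not vice versa."""

    def __init__(self, volumes: VolumeGeometry, wires_per_plane: int):
        self.volumes = volumes
        self.wires_per_plane = wires_per_plane

    def n_channels(self) -> int:
        # Simplification for illustration: one channel per wire.
        v = self.volumes
        return v.n_tpcs * v.planes_per_tpc * self.wires_per_plane


class PixelReadoutGeometry:
    """Pixel readout description built on the same volume geometry."""

    def __init__(self, volumes: VolumeGeometry, pixels_per_tpc: int):
        self.volumes = volumes
        self.pixels_per_tpc = pixels_per_tpc

    def n_channels(self) -> int:
        return self.volumes.n_tpcs * self.pixels_per_tpc


# Illustrative numbers only.
volumes = VolumeGeometry(n_tpcs=2, planes_per_tpc=3)
wire_readout = WireReadoutGeometry(volumes, wires_per_plane=100)
pixel_readout = PixelReadoutGeometry(volumes, pixels_per_tpc=5000)
```

Client code that used to ask a single geometry object about channels would instead ask whichever readout object matches the detector. The volume description stays unchanged when a new readout type is added.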

SBN Data/Infrastructure – Giuseppe Cerati

  • Machine learning
    • Suggested that the priority of integrating machine learning into LArSoft should be raised significantly. Seems a missing piece, especially in context of GPUs.
      • Noted NuSonic provides GPU capability for inference problems within LArSoft
    • Lisa Goodenough reached out and offered help to use NuSonic. Some people have expressed concern about a lack of provenance information with NuSonic, e.g., what version of TensorFlow was being run. The inference is shipped to something external, so it is not certain what version was run. Less of a problem than not having a solution at all, in Giuseppe’s opinion.
      • Perhaps we can define conventions on information to return with the job. All of the inferencing configuration should be specified as part of the art configuration in sufficient detail to allow replication. So then just need to focus on information that needs to be retrieved.
  • Memory management
    1. Geant4 simulation is largest memory consumer for ICARUS – typically over 8 GB, which requires 5 slots. Occasionally even 6 slots are required.
    2. SciSoft has informed them that using the new LArG4 interface will reduce memory consumption, so completing that migration is becoming a higher priority. The lack of staffing is an issue.
      • SciSoft can offer mostly consulting help, but it is good to keep us in the loop in case there are places we can more directly help.
    3. Another approach is to drop art objects that are not being used. Gianluca recently pointed out that dropping data products on input works to reduce memory, but unused transient data products are retained until the end of the event. Posed the question, can we drop transient products in the middle of the job to help with in-event memory management?
      • This is outlined in an email from Feb 7 to Kyle and Erica.
      • Will respond.
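The slot counts quoted in the memory management item above follow from rounding the job's memory up to a whole number of slots. The sketch below assumes 2 GB of memory per slot. That figure is not stated in these notes, but it is consistent with "over 8 GB" requiring 5 slots.

```python
import math


def slots_needed(memory_gb: float, gb_per_slot: float = 2.0) -> int:
    """Number of slots a job must request to cover its memory footprint.

    gb_per_slot=2.0 is an assumption, not a number taken from the notes.
    """
    return math.ceil(memory_gb / gb_per_slot)


# Illustrative footprints: just over 8 GB, and just over 10 GB.
typical = slots_needed(8.5)
occasional = slots_needed(10.5)
```

Under that assumption, a job slightly over 8 GB needs 5 slots and one slightly over 10 GB needs 6. Trimming the footprint back under a slot boundary frees a whole slot.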

Please email Katherine Lato or Erica Snider for any corrections or additions to these notes.