Offline leads meeting – 7/13/22
Attendees: Tom Junk, Chris Backhouse, Andrzej Szelc, Kyle Knoepfel, Tingjun Yang, Tracy Usher, Erica Snider, Katherine Lato
- Multi-threading work
- Mike Wang is continuing work on a DUNE dataprep workflow used for SN processing. Had been investigating a difference in results from hit finding when run in single-threaded vs multi-threaded mode. The recob::Hit PR discussed at the July 12 LCM was one outcome of this work, and fixed (at least one of) the differences. From this point, he will continue to add workflow elements until the workflow is entirely thread-safe / multi-threaded.
- Spack migration: phase 2
- Have started working on Phase 2 of Spack migration, which will involve additional adaptations to Spack to support the full set of functionality needed to manage coherent releases. Will also need to understand and possibly remedy dependency structure of code in order to make Spack happy.
- Chris Green kindly provided the following high-level list of tasks that make up Phase 2. (With a sixth step added by Tom Junk.)
- (1) The experiments must convert all of their code to use Cetmodules and modern CMake best practices (a la LArSoft phase 1).
- (2) The experiments must also produce and/or verify Spack recipes for their own packages, and for all external dependencies not directly supported by SciSoft.
- (3) The current LArSoft stack and its dependencies must be verified to be buildable by Spack. There have been many changed/added dependencies since the last time this was done, so this is not a trivial task.
- (4) We must have a system usable by LArSoft and experimental release managers capable of building and releasing a fixed and reproducible distribution of their code and all dependencies via Spack for all supported platforms and compilers. These distributions must be installable on supported systems with maximum (re-)use of pre-built and cached binaries, and minimum rebuilding of packages unchanged from one release to the next.
- (5) We must have a multi-package development system capable of using and producing Spack-built binary packages for distribution via BuildCache.
- (6) Validate everything on the release current at this point, obtain sign-off from all experiments, then execute the migration.
Note that items (1) and (2) involve changes to experiment code and repositories. The largest uncertainties in the scope and scale of work lie in items (4) and (5). Until these are understood, we cannot provide detailed task lists or timelines. In the meantime, experiments should work on (1) and (2), and open tickets or communicate with SciSoft team members when they encounter problems or have questions.
- Tom: Add step to verify that the Spack-built code runs and produces results comparable to those of the UPS version.
- Erica: Yes! (added above)
- Kyle: Does Chris talk about wrapping UPS products in Spack?
- Erica: He did in conversations about the migration, but it was not clear (to me) exactly how that fits into the plan – whether it pertains to some or all legacy things for instance.
- Kyle: Chris is presenting the big work required with migration? If we have bridge technologies, that’s not covered yet?
- Erica: Correct. I asked for the big picture at this point so that we have a framework for discussing status and more detailed planning.
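The validation step Tom proposed (checking that a Spack-built release gives results comparable to the UPS-built one) could be sketched roughly as below. This is only an illustration under stated assumptions: the function name `compare_summaries`, the idea of reducing each job's output to a dictionary of named summary quantities, and the tolerance values are all hypothetical and not part of any existing LArSoft or SciSoft tool. The numbers in the usage lines are placeholders, not real results.

```python
import math


def compare_summaries(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return a list of differences between two {name: value} summaries.

    'reference' stands for quantities from the UPS-built job and 'candidate'
    for the same quantities from the Spack-built job (both hypothetical).
    """
    problems = []
    for name in sorted(set(reference) | set(candidate)):
        if name not in reference:
            problems.append(f"{name}: missing in reference")
        elif name not in candidate:
            problems.append(f"{name}: missing in candidate")
        elif not math.isclose(reference[name], candidate[name],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append(f"{name}: {reference[name]} != {candidate[name]}")
    return problems


# Placeholder values for illustration only.
ups_build = {"n_hits": 1200.0, "mean_charge": 35.25}
spack_build = {"n_hits": 1200.0, "mean_charge": 35.25}
differences = compare_summaries(ups_build, spack_build)
print("builds agree" if not differences else "\n".join(differences))
```

An empty list means the two builds agree within tolerance. Which quantities to compare, and how tight the tolerance should be, would be for each experiment to decide.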
- Workshop planning discussion
- Points where we are seeking input
- Feedback on the proposal circulated
- Thoughts on specific problems / pieces of code that need to be made thread-safe or multi-threaded
- Once code is identified, then the experiments should start identifying the teams that will come to the workshop to work on things.
- What if any tutorials might be helpful at the beginning of the workshop?
- We’re looking at 3 or 4 days for this. When might be a good time? Or maybe better, when are bad times?
- Andrzej: Is this more a thing for experts, or people to learn? Saw comments about tutorials. And in person?
- Erica: In my mind, a dual purpose. Acquaint more people with multi-threading techniques and solve particular problems of immediate relevance to the experiments.
- Target will be for experienced C++ coders. So not beginning grad students if we are to solve a real problem.
- There are advantages to working in person: can engage with experts more easily. But expect this will not be practical.
- Also in the proposal, to work in small teams, each working together on a single piece of code. Hack-a-thon style.
- Work on code that matters.
- Have seen this model work with the right technology. So have to put some effort into identifying “google docs for coding.”
- Andrzej: thinks the hack-a-thon idea makes it more enticing. Having some kind of introduction at the beginning would be good. We haven’t identified where the problems are.
- Erica: Also in the proposal, first have the experiments talk about what problems they’re trying to solve with multi-threading. Particular solutions will depend on the code. “These are the problems. These are the approaches to fix it.” Like for the database, need concurrent caching. Art provides this. Could provide a tutorial for how to use concurrent caching. So target tutorials to the solutions needed. Or might encounter an unanticipated problem along the way and decide a tutorial would help, so stop and learn about a solution.
- Ensuing discussion concluded that workshop / hack-a-thon would be best if focused on cases where we know there is a problem, but do not yet know where, and do not yet know the solution. For things where we do know a solution, we might not need a workshop / hack-a-thon session.
- Seemed to be general agreement on this point (?)
- Andrzej: Should each experiment identify the problem, talk to the LArSoft team for advice, and then everyone comes in with a defined problem?
- Yes. It’s important that everyone comes in with a well-defined problem.
- Do not want to front-load too much work, but this seems a reasonable approach. If we can’t find such problems, then we don’t need to waste people’s time with a workshop, and can instead focus on facilitating fixing the specific pieces of code that need fixing.
- Kyle: suggested reviewing slides / talks from the previous workshops on multi-threading (though the team would be amenable to repeating some of them)
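As a rough illustration of the "concurrent caching" idea raised in the discussion above for database access, here is a minimal sketch of a cache that several threads can query at once while each key is loaded only once. It is written in Python for brevity and is not art's concurrent caching facility or its API; the class name `ConcurrentCache` and the function `fake_db_lookup` are hypothetical stand-ins, and a production version would be C++ code inside the framework.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConcurrentCache:
    """Thread-safe cache in which each key is loaded at most once."""

    def __init__(self, loader):
        self._loader = loader          # hypothetical stand-in for a database query
        self._lock = threading.Lock()  # guards the two dictionaries below
        self._values = {}
        self._key_locks = {}

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Load outside the global lock so different keys can load in parallel.
        with key_lock:
            with self._lock:
                if key in self._values:  # another thread finished the load first
                    return self._values[key]
            value = self._loader(key)
            with self._lock:
                self._values[key] = value
            return value


calls = []  # records each simulated database access


def fake_db_lookup(run_number):
    calls.append(run_number)
    return {"run": run_number}


cache = ConcurrentCache(fake_db_lookup)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(cache.get, [1, 2, 1, 2, 3, 3, 1, 2]))
print(sorted(calls))
```

The per-key lock is the design choice worth noting: threads asking for the same run wait for a single load, while threads asking for different runs do not block each other.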
Links to relevant slides and videos of talks:
- 2017 presentation Introduction to multi-threading
- 2019 Presentation – Multi-threaded art
- 2019 Presentation – Making code thread-safe
- 2019 Presentation (powerpoint download) – Experience learning to make code thread-safe
- 2019 Presentation Introduction to multi-threading and vectorization
DUNE: Tom Junk
- He ran the cetmodules migration script Chris had in Feb., and made all the “required” changes, but not all of the “recommended” changes. There are a bunch of find_ups_products. Do those need to go away? [Yes, believe so.] Not using cetmodules yet, did it in a practice run, but can flip the switch at any point.
- Not done with a similar thing for GArSoft. Haven’t tracked down the alternative [libraries??]. Required latest version of Tensorflow and products from LArSoft that use Tensorflow. That all works. Currently, GArSoft is stuck on Pandora.
- Thinking about how to handle large scale of raw / processed digits. Talking with many people. Tied in with multi-threading, although multi-threading may be icing on the cake, since plan to manage by constructing workflows that operate at APA level [from file i/o through data prep and deconvolution]. Issues with file I/O we still have to deal with. Have been consulting with Kyle on this.
- Kyle: only framework support applicable is stuff (like removing cache) which Tom is aware of. Or altering the way data products are stored. That’s a big change. They’re reading one APA at a time. Things could be improved a bit. Framework does support the concept of an abstract delayed reader. That doesn’t get away from the basic problem they’re having.
- Tracy: Before ICARUS could run multi-threaded, there are some services that need to be changed. Two of them, maybe Detector ones.
- Kyle: DetectorPropertiesService and DetectorClocksService are already thread safe. ChannelMappingService and the services that access things in databases are still issues. Saba & Kyle made a lot of progress, but didn’t get it finished. There is a dedicated branch for this.
- This particular work was one of the casualties of the bleeding of effort from the project team. So have not made progress on it since Saba left.
- Tracy: We would like to make use of this. We’re running single threaded jobs on three grid slots, effectively throwing away two cores.
- Erica: The loss of effort has hurt us. The important thing now is to know exactly what services are the impediments in your case.
- Tracy: I’ll try to follow up.
DUNE: Tingjun Yang
- Working on simulating neutrino interactions in the Near Detector. He summarized this at the last LArSoft meeting. We figured out a way to save energy deposits in both detectors. Identified a few places we need to make the framework (LArSoft) more flexible to accommodate different detector types (e.g., the geometry system). Hans provided a workaround for one of the problems, and Gianluca made some improvements to the Geometry service.
- Next want to work on the drift and detector response simulation. Need to think about how to get the location of the pixels, determine direction inside volumes, etc., which will require changes to the geometry.
- Erica: started work on this with Hans and Kyle (and Tingjun). Believe everyone agrees on the conceptual design, but need more discussion and more planning to make a detailed design that we can start implementing. Have been busy the past month, but will try to continue this work before the end of the month.
SBND: Andrzej Szelc
- Had an SBND collaboration meeting at the end of June at Fermilab. People want to use different generators, some BSM generators. No one seems to know about the LArSoft work to make this easier, or the GENIE work. And would like GENIE 3.2 as soon as it comes out.
- Erica: SciSoft team is getting weekly reminders about the need for this. Believe the holdup until now has been Spack-related work, but now that Phase 1 is completed, should be able to prioritize getting GENIE updated.
- Reconstruction for the photon detection system is progressing.
SBN Data/Infrastructure: Chris Backhouse
- Nothing to report.
ICARUS: Tracy Usher
- Nothing to report.