Lawrencium Emergency Upgrade
- Date & Time: Thursday, April 30, from 12 PM to 5 PM
- Purpose: To maintain the security of the entire system, we need to apply a security patch to the operating system and reboot all nodes across the cluster. Jobs will be canceled, data access and transfer will not be possible, and Open OnDemand access will be unavailable. We expect this work to be done by the end of the day.
- Date & Time: Tuesday, April 14, from 9 AM to 5 PM
- Purpose: Our team will work with DDN, the vendor for the Lustre file system on /global/scratch, to perform critical hardware maintenance. This includes replacing a core system drive and resolving network connectivity issues to improve overall system stability.
- Due to a cooling failure in Building 50B, we powered down the cluster. There is no estimated timeline for return yet. Updates will be shared as we progress.
- Due to a storage hardware failure, lrc-xfer and Globus are down. Users won’t be able to transfer data until it’s fixed.
- There will be a brief interruption to identity services, particularly those related to Shibboleth, Grouper, identity linking, password reset, group creation, and RADIUS. Users may experience delays when logging in to the terminal and the OOD web portal during the maintenance window, 4-7 PM on October 6th and 8th.
- SSH access has been fixed, and most of the offline nodes are back online, with scratch access restored.
- SSH access to the cluster login nodes is experiencing an unplanned service interruption, resulting in failed or slow logins. (Resolved)
- The Scratch file system may not be accessible on some compute nodes, and some nodes remain offline at the moment.
- Due to technical difficulties in the data center upgrade, power will be restored on 3rd September; after two days of rigorous testing on the system, the cluster will be back online on 5th September by 5 PM.
- As part of a DOE-funded infrastructure modernization project, all power utilities in the data center (50B-2275) will be powered down.
- We are investigating the root cause of the issue and working to resolve it as soon as possible. The switch configuration may have been altered during LBLNet maintenance work over the weekend.
- Users will experience brief connection issues on the cluster. Jobs will continue to run.
- We are seeing improvement in file system response. The issues related to OOD and Globus are resolved, and the Slurm reservation has been released.
- We are working closely with our storage vendor, DDN, to diagnose the problem and bring the file system back to normal. The Slurm reservation will stay in place until the work is complete.
- Users won’t be able to submit new jobs, and jobs already in the queue won’t start until the work is complete and the reservation is released.
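For context on the mechanism: during maintenance, a Slurm reservation is placed over the compute nodes, and the scheduler holds back any job that would overlap the reserved window. Below is a minimal sketch, assuming a login node with the standard Slurm client tools (scontrol, squeue, with squeue's --me flag from recent Slurm versions) on the PATH, of how a user could check whether the reservation is still active and why their jobs are pending:

```python
import subprocess

def slurm(args):
    """Run a Slurm client command and return its stdout as text."""
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout

# Active reservations: while a maintenance reservation covers the compute
# nodes, newly submitted jobs remain pending until it is released.
print(slurm(["scontrol", "show", "reservation"]))

# Pending jobs for the current user, with the scheduler's pending reason
# (typically ReqNodeNotAvail while a maintenance reservation is in place).
print(slurm(["squeue", "--me", "--states=PD", "--format=%i %j %r"]))
```

Once the reservation no longer appears in the scontrol output, queued jobs should start as usual.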
- A cooling issue at the data center affected the file system. We are still working to resolve some lingering issues; lrc-xfer has been restarted. Please get in touch with us if you encounter any problems.
- Users may experience unresponsive interactive commands on the Scratch file system.
- The IdM team upgraded machines supporting identity services, which affected cluster authentication. Logins may take up to 30 seconds.
- The scheduled decommissioning date is March 7th. If you have been using LR3 for your jobs, please migrate your workloads to other partitions.
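For anyone scripting the move, here is a hypothetical sketch that retargets an existing batch script's #SBATCH partition directive away from lr3; the file name job.sh and the target partition lr4 are placeholders, not a recommendation:

```python
import re
from pathlib import Path

# Hypothetical example: retarget a batch script from the retiring lr3
# partition. "job.sh" and "lr4" are placeholders; check `sinfo` for the
# partitions actually available to your account.
script = Path("job.sh")
text = script.read_text()
text = re.sub(r"^(#SBATCH\s+--partition=)lr3[ \t]*$", r"\g<1>lr4",
              text, flags=re.MULTILINE)
script.write_text(text)
```

Running `sinfo --summarize` on a login node lists the partitions currently accepting jobs.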
- Due to the impact on compute nodes and the file system, you might have noticed problems with job completion.
- HPC services, including Open OnDemand and Globus, have been restored.
- Due to an electrical shutdown planned by PIMD and facilities at Building 50B, we need to power down the Lawrencium cluster at noon on December 1st. The cluster will be back online on December 11th at 5 p.m.
- Users now have full access to the cluster.
- Users will experience slow I/O until the file system recovers from the restart.
- Users will experience a disruption for about 45 minutes.
- Users are now able to log in and access data. There will be a short downtime on Oct 25th, from 9 am to noon, to fix the file system issue.
- Due to a network issue, the login nodes, OOD, and DTN are unreachable. We are working on a fix. We appreciate your patience during the process.
- Users with email addresses outside the LBL domain cannot log in to the OOD portal. We are working on a fix and will post an update as soon as the issue is resolved.
- System-wide migration to Rocky-8 OS is complete
- Downtime is scheduled for the system-wide Rocky-8 OS rollout.
- OOD was offline on Monday due to file system-related issues.
- We have contacted the vendor again and will keep you posted.
- Globus service is not available.
- Slurm upgrade and system maintenance
- An investigation is underway
- lrc-xfer, as well as the Globus Google Cloud Storage collection and the Globus AWS S3 collection, will be offline for about 15 minutes.
- Data center power outage and scratch file system upgrade.
- Data center power outage and cluster maintenance work July 20-24
- Lawrencium users may have experienced sluggish Slurm commands and delays in job scheduling today (June 12th); this was due to a new firewall configuration we recently applied. The problem should now be resolved. If you continue to see any issues, please email us at hpcshelp@lbl.gov.
- Slurm upgrade is complete
- Slurm upgrade 8am Dec 13th – 5pm Dec 14th
- The scratch parallel file system is back to service 10:30 AM Dec 5
- We are working to restore the service 8:00 AM Nov 3
- Slurm upgrade is complete; the Lawrencium cluster was returned to service at 5:30 PM on Oct 6
- Slurm upgrade 8am Oct 4th – 5pm Oct 6
- Slurm patching complete 5:00pm Tuesday, April 26
- Slurm patches 8:00am Tuesday, April 26 – 5:00pm Tuesday, April 26
- This maintenance addresses problems related to MPI jobs, including poor performance with OpenMPI and issues with pre-compiled IntelMPI binaries.
- Service restored at 5:30pm Wednesday, Feb 2
- Slurm upgrade 8:00am Tuesday, Feb 1 – 5:00pm Thursday, Feb 3
- Service restored at 1:40pm on Sunday, 12/19 2021
- The Lawrencium login authentication server is currently down, and users can’t log in. We have contacted the IDM team at LBNL and will keep you posted. Service has resumed.
- There is a scheduled power outage in 50B-2275 to accommodate testing of the existing Emergency Power Off system. All of the HPC computing systems will be shut down and unavailable starting at 5:00pm on Friday, 12/17. We anticipate returning the systems to production by 5:00pm on Sunday, 12/19.
Lawrencium Maintenance on April 14
Lawrencium cluster powered down due to power overload at the data center
lrc-xfer and Globus down October 14th
IDM will undergo brief maintenance on October 6th and 8th
Lawrencium Cluster back to normal
Temporary Service Interruption Affecting SSH Access
Lawrencium Downtime extended until 5th September 2025
Lawrencium Upcoming Downtime: 19th August 2025, 8 AM to 29th August 2025, 5 PM
Lawrencium data transfer node is unreachable. 16th June, 2025
LBLNet performing network device maintenance in the data center. 14th June, 2025
Lawrencium scratch file system repair work complete. 5:41 AM May 13th, 2025
Lawrencium scratch file system repair work continues. 11:00 AM May 8th, 2025
Slurm reservation on compute node for Lawrencium scratch file system repair work. 1:30 PM May 7th, 2025
Lawrencium scratch file system is back. 12 PM, May 7th, 2025
Lawrencium scratch file system down. May 6th, 2025
Lawrencium login delays April 2nd, 2025
LR3 partition is retiring. February 27th, 2025
The lab-wide power disruption around 2:30 PM on Sunday, Jan 26, affected the Lawrencium Supercluster. January 28th, 2025
Lawrencium Supercluster is back online after a week-long downtime. December 11th, 2024
Lawrencium Downtime (Dec 1 – Dec 11) due to a Power Outage at Building 50B. November 27th, 2024
The software, firmware, and operating system upgrade on the scratch file system is complete. November 26th, 2024
Two-day file system downtime from 8 a.m. Monday, November 25th, until 5 p.m. Tuesday, November 26th, 2024. November 23rd, 2024
Scratch file system: Back online 3 PM November 14th, 2024
Scratch file system: Restart of scratch file system at 3 PM November 14th, 2024
Network issue partially resolved, Week-long Downtime canceled, October 24th, 2024
Network issue on the cluster, October 24th 2024
Open OnDemand login issue for LBL affiliate users Friday, July 5th 2024
Lawrencium Supercluster Back in Service Wednesday, July 3rd 2024, 5pm.
Lawrencium Downtime scheduled for three days, Monday, July 1st – Wednesday, July 3rd 2024
Open OnDemand is back to service on Tuesday, May 21
Globus is back in service Tuesday, Nov 28
Globus is still offline Monday, Nov 27
Lawrencium is back online on Nov 22nd.
Lawrencium downtime scheduled Nov 21-22
LR7 partition is back online. 3:30pm, Friday, Sep 1st
Nodes in the LR7 partition are offline due to elevated temperatures from a low coolant level. 3:30pm, Wednesday, Aug 23rd.
The lrc-xfer (Designated Data Transfer node) will have a brief downtime at 11:45am Tuesday, Aug 8th
Lawrencium Supercluster back to service July 25th.
Lawrencium Supercluster Downtime
Lawrencium Supercluster New Firewall Configured June 12th
Lawrencium Supercluster Downtime
LBLnet network upgrade 8am May 20th – 5pm May 21st
Open OnDemand (OOD) portal login issue, 9:00 AM May 2
The Open OnDemand portal is having a login issue, and we are working on resolving the problem. We will keep you updated on the progress. Sorry for the inconvenience.
Lawrencium cluster login issue persists on March 13th.
Lawrencium users may still experience login failures related to the maintenance work performed last week (March 7th-10th). The IDM team, which provides the authentication service, is conducting an investigation. We will keep you posted on any progress.
Lawrencium Scheduler is back to service 10:30 AM Dec 14
Lawrencium Supercluster Downtime
/global/scratch/users/ is back to service.
Lawrencium Scheduler Slurm is back to service 9:30 AM Nov 3
Lawrencium Supercluster Back to Service
Lawrencium Supercluster Downtime
Lawrencium Supercluster back online
Lawrencium Supercluster Downtime
Compute Node Reboots: 6:30 PM Thursday, Feb 10 – 1:00 PM Friday, Feb 11
Slurm Upgrade is Complete
Lawrencium Supercluster Downtime
Lawrencium Supercluster Service restored post-power outage
Login Failure: Sunday, 12/12 (Service is back 10:40am)
Lawrencium Supercluster Power Outage