(Available for LBNL researchers)
Overview
In recognition of the increasing importance of research computing across many scientific disciplines, LBNL has made a significant investment in developing the LBNL Condo Cluster program, as a way to grow and sustain midrange computing for Berkeley Lab. The Condo program is intended to provide Berkeley Lab researchers with state-of-the-art, professionally-administered computing systems and ancillary infrastructure, with the intent of improving competitiveness on grants, and achieving economies of scale with centralized computing systems and data center facilities.
The model for sustaining the Condo program is premised on faculty and principal investigators using equipment purchase funds from their grants or other available funds to purchase compute nodes (individual servers), which are then added to the Lab’s Lawrencium compute cluster. This allows PI-owned nodes to take advantage of the high-speed InfiniBand interconnect and high-performance Lustre parallel filesystem storage associated with Lawrencium. Operating costs for managing and housing PI-owned compute nodes are waived in exchange for letting other users make use of any idle compute cycles on the PI-owned nodes. PI owners have priority access to computing resources equivalent to those purchased with their funds, and can also use additional nodes for their research when needed. This gives the PI much greater flexibility than owning a standalone cluster.
This program is intended for PIs who would otherwise purchase a small (4-node) to medium-scale (72-node) standalone Linux cluster. Projects with larger compute needs or many users or groups should consider setting up a dedicated cluster so that they can better prioritize shared access among their users. Please request a condo account at myLRC with the relevant documents.
Program Details
Compute node equipment is purchased and maintained on a 4-year lifecycle, after which the PI owning the nodes will be notified that the nodes must be upgraded during year 5. If the hardware is not upgraded by the end of year 5, the PI may donate the equipment to Condo or take possession of it (removing the equipment from LC3 and transferring it to another location is at the PI’s expense). Nodes left in the cluster after five years may be removed and disposed of at the discretion of the HPCS program manager.
All Lawrencium and condo users have a 10GB home directory on the Lab’s shared HPC infrastructure and are charged $25/mo. for account maintenance which includes backups of their home directory. Users or projects needing more space for persistent data can purchase storage shelves that can be hosted by the Lab’s HPC infrastructure. Storage shelves are purchased and maintained on a 5-yr lifecycle after which the PI must renew the storage purchase at the then-prevailing price or remove the data within 3 months.
Once a PI has decided to participate, the PI or a designee works with the HPC Services manager and operations team to procure the desired number of compute nodes and storage. The process generally takes about three months from start to finish. In the interim, a test condo queue with a small allocation will be set up for the PI’s users in anticipation of the new equipment. Users may also submit jobs to the general Lawrencium queues on the cluster, but that use will incur the CPU usage fee of $0.01 per service unit. Such jobs are subject to the general queue limits, and guaranteed access to contributed cores is not provided until the purchased nodes are provisioned.
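As a rough illustration of the usage fee described above, the following Python sketch estimates the charge for a job run on the general Lawrencium queues. It assumes that one service unit corresponds to one core-hour, which this page does not state; the function name `estimate_job_cost` and the `su_per_core_hour` parameter are hypothetical and exist only for this example.

```python
# Fee per service unit (SU), taken from the program description.
SU_RATE_USD = 0.01


def estimate_job_cost(nodes: int, cores_per_node: int, hours: float,
                      su_per_core_hour: float = 1.0) -> float:
    """Estimate the charge in USD for one job on the general queues.

    ASSUMPTION: 1 SU = 1 core-hour by default. Adjust su_per_core_hour
    if the actual conversion for a given partition differs.
    """
    service_units = nodes * cores_per_node * hours * su_per_core_hour
    return round(service_units * SU_RATE_USD, 2)


if __name__ == "__main__":
    # Example: two 56-core nodes for 12 hours (1344 SUs under the assumption).
    print(estimate_job_cost(nodes=2, cores_per_node=56, hours=12))
```

Actual charges depend on how service units are defined for each partition, so the result should be read as an order-of-magnitude estimate only.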
Recommended Equipment
Compute node with the following specifications:
General Computing Node (current Condo node configuration) | |
---|---|
Processors | Dual-socket, 28-core, 2.0GHz Intel Ice Lake Xeon 6330 processors (56 cores/node) |
Memory | 256GB (16 x 16GB) DDR4 RDIMMs |
Interconnect | 100Gb/s Mellanox ConnectX-6 HDR-100 InfiniBand interconnect |
Hard Drive | 1.92TB NVMe drive (local swap and log files) |
Warranty | 5 yrs |

GPU Computing Node for Machine Learning and Image Processing | |
---|---|
Processors | Single-socket, 64-core, 2.0GHz AMD 7713 processor (64 cores/node) |
Memory | 512GB (8 x 64GB) 3200MHz DDR4 RDIMMs |
Interconnect | 100Gb/s Mellanox ConnectX-6 HDR InfiniBand interconnect |
GPU | 4 Nvidia A40 GPU accelerator boards |
Hard Drive | 240GB SSD (local swap and log files) |
Warranty | 5 yrs |

Condo Storage: PIs can purchase storage in increments of eight 14TB Nearline SAS 7200RPM disks, which are added to our Hitachi HNAS G370 storage subsystem. Each increment provides 84TB of usable persistent storage.
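The relationship between raw and usable capacity in a storage increment can be checked with a few lines of Python. This is only a sketch: the disk count, disk size, and usable figure come from the paragraph above, while the helper name `capacity_summary` and the assumption that usable capacity scales linearly with the number of increments are illustrative.

```python
DISKS_PER_INCREMENT = 8   # from the program description
DISK_SIZE_TB = 14         # from the program description
USABLE_TB = 84            # usable capacity per increment, from the description


def capacity_summary(increments: int = 1) -> dict:
    """Summarize raw vs. usable capacity for a number of storage increments.

    ASSUMPTION: usable capacity scales linearly with the increment count.
    """
    raw_tb = increments * DISKS_PER_INCREMENT * DISK_SIZE_TB
    usable_tb = increments * USABLE_TB
    return {
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        "overhead_tb": raw_tb - usable_tb,
        "usable_fraction": usable_tb / raw_tb,
    }


if __name__ == "__main__":
    print(capacity_summary(1))
```

For a single increment this gives 112TB raw against 84TB usable, an overhead of 28TB, which equals the capacity of two disks. That would be consistent with a dual-parity layout, although the actual configuration is not stated here.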
Prospective condo owners should contact HPC Services Manager Gary Jung before purchasing any equipment to ensure compatibility.