====== Caviness: Rebuild of First-Generation OSS and NFS Nodes ======

The in-rack storage servers (Lustre OSS and NFS) require a higher number of PCIe expansion lanes than the compute nodes.

In the first-generation OSS and NFS nodes, getting the requisite number of PCIe lanes required a less-common node design from the vendor, with a few trade-offs (2OU chassis, no integrated console video port, a single multiplexed LAN port carrying both IPMI and data).

One major support issue with the first-generation OSS and NFS nodes is that system logging is written to RAM rather than to a persistent storage medium.
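
The check and fix below are only a sketch: it assumes the nodes use systemd-journald, and the actual logging stack and stateless-image constraints on these nodes are not documented on this page.

<code>
# Sketch only -- assumes systemd-journald; the actual logging configuration
# on these nodes is not documented here.

# A RAM-backed journal lives under /run, a persistent one under /var:
ls -d /run/log/journal /var/log/journal 2>/dev/null

# Once a node has writable persistent local storage, the journal can be
# made persistent:
mkdir -p /var/log/journal
systemctl restart systemd-journald
</code>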

Each first-generation OSS and NFS node was designed with the following complement of storage devices, originally slated for specific uses (an illustrative pool sketch follows the table):

^Physical location^Qty^Device^Purpose^
|internal|1|240 GB SSD|L2ARC|
|external JBOD|2|400 GB SSD|ZIL (mirror)|
|external JBOD|10|8000 GB HDD|RAIDZ2 + hot spare (OSS); RAIDZ3 + hot spare (NFS)|
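
For reference, the intended layout on an NFS node corresponds roughly to a pool assembled as follows. This is a sketch only: the device names are placeholders, and the actual ''zpool create'' invocations used on Caviness are not reproduced here.

<code>
# Sketch only -- placeholder device names, not the actual build commands.
# 9 HDDs in the raidz3 plus 1 hot spare, the two 400 GB JBOD SSDs as a
# mirrored log (ZIL), and the internal 240 GB SSD as cache (L2ARC):
zpool create r00nfs0 \
    raidz3 hdd1 hdd2 hdd3 hdd4 hdd5 hdd6 hdd7 hdd8 hdd9 \
    spare hdd10 \
    log mirror ssd_jbod1 ssd_jbod2 \
    cache ssd_internal
</code>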

===== Issues =====

There are two issues with the original intent for the storage devices present in the first-generation storage servers.

==== L2ARC ====

Having the L2ARC device present in the storage node itself is not an issue for the NFS servers.

In operation, the NFS server's pool includes the internal SSD as a cache device; the OSS pools include no cache device at all (see the ''zpool iostat'' output below).
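
Whether a cache device is actually being exercised can be confirmed from the L2ARC counters that ZFS on Linux exposes. The statistic names below are standard ZFS-on-Linux kstats; the output is not captured from these nodes.

<code>
# L2ARC size and hit counters; all-zero values indicate the cache device
# is not being used.
grep -E '^l2_(size|hits|misses)' /proc/spl/kstat/zfs/arcstats
</code>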

==== ZIL ====

ZFS leverages the intent log (ZIL) to accelerate the client side of synchronous transactions: a synchronous write can be acknowledged as soon as it reaches the dedicated log device, rather than when the full transaction group is committed to the main pool.

In an HPC environment, strict synchronous-write semantics are generally traded away in favor of throughput, so fully-synchronous behavior is disabled on these pools.
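
The standard ZFS control for this is the per-dataset ''sync'' property; how it is actually set on these particular pools is not shown on this page, so the following is only an illustrative sketch.

<code>
# Sketch only -- illustrates the standard ZFS control for synchronous
# behavior; the actual settings on the Caviness pools are not shown here.
zfs get sync r00nfs0           # report the current policy
zfs set sync=disabled r00nfs0  # never honor synchronous-write requests
</code>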

With fully-synchronous behavior disabled, the ZIL serves no purpose. On the NFS server:

<code>
[root@r00nfs0 ~]# zpool iostat -v
                                            capacity     operations    bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
r00nfs0
  raidz3
    35000c500950389c7
    35000c50095039373
    35000c50095208afb
    35000c500950394cb
    35000c5009503950f
    35000c50095039557
    35000c5009503964b
    35000c5009520775f
    35000c5009520835f
logs                                          -      -      -      -      -      -
  mirror
    35000cca0950134c0
    35000cca095015e6c
cache
  Micron_5100_MTFDDAK240TCC_172619295CF8
----------------------------------------  -----  -----  -----  -----  -----  -----
</code>

and on an OSS server with Lustre on top of the pool:

<code>
[root@r00oss0 ~]# zpool iostat -v
                         capacity     operations    bandwidth
pool                   alloc   free   read  write   read  write
---------------------  -----  -----  -----  -----  -----  -----
ost0pool
  raidz2
    35000c500950395b3
    35000c5009515c8a3
    35000c50095039577
    35000c50095038d97
    35000c5009520053f
    35000c50095092ad7
    35000c5009503960b
    35000c5009515fa93
    35000c500950396cb
logs
  mirror
    350011731014cb4ec
    350011731014cbdac
---------------------  -----  -----  -----  -----  -----  -----
</code>

In short, none of the first-generation OSS and NFS servers are using the 400 GB SSDs in the external JBOD.

===== Design of Second-Generation =====

With the ZIL unused and the L2ARC device unused on the OSS servers, the second-generation storage servers were designed accordingly (a pool-layout sketch follows the table):

^Physical location^Qty^Device^Purpose^
|internal|1|960 GB SSD|swap, local OS filesystems|
|external JBOD|1|480 GB SSD|L2ARC|
|external JBOD|11|12000 GB HDD|RAIDZ2 + hot spare (OSS); RAIDZ3 + hot spare (NFS)|
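
The corresponding second-generation OSS pool would be assembled roughly as follows. Again this is a sketch with placeholder pool and device names, not the actual build procedure; the notable differences from the first generation are that there is no log (ZIL) vdev at all and the cache device comes from the JBOD rather than from inside the node.

<code>
# Sketch only -- placeholder names. 10 HDDs in the raidz2 plus 1 hot spare,
# the 480 GB JBOD SSD as cache (L2ARC), and no log (ZIL) vdev:
zpool create ostNNpool \
    raidz2 hdd1 hdd2 hdd3 hdd4 hdd5 hdd6 hdd7 hdd8 hdd9 hdd10 \
    spare hdd11 \
    cache ssd_jbod1
</code>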

The L2ARC has been moved to the JBOD, so that each OSS server's cache device lives in the external enclosure alongside the rest of its pool rather than inside the node chassis.

Additionally, the larger internal SSD gives each node persistent local storage for swap and OS filesystems, addressing the support issue of system logs being written to RAM.

The second-generation node design will require changes to the Warewulf VNFS and provisioning (to partition and format the internal SSD, etc.).
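
A minimal sketch of what that provisioning step might look like is shown below. Everything in it is hypothetical: the device name, partition sizes, filesystem choices, and mount point are placeholders, not the actual Caviness provisioning scripts.

<code>
# Hypothetical provisioning sketch -- device name, sizes, and filesystem
# layout are placeholders, not the actual Caviness configuration.
DEV=/dev/sda

parted -s "$DEV" mklabel gpt \
    mkpart swap linux-swap 1MiB 64GiB \
    mkpart local xfs 64GiB 100%

mkswap "${DEV}1"
mkfs.xfs "${DEV}2"

# e.g. mount the second partition to hold persistent system logs:
mkdir -p /mnt/local
mount "${DEV}2" /mnt/local
</code>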

===== First-Generation Rebuild =====

To match the second-generation design, the first-generation nodes' SSDs could be repurposed for the OSSs and their OSTs:

^Physical location^Qty^Device^Purpose^
|internal|1|240 GB SSD|**swap, local OS filesystems**|
|external JBOD|2|400 GB SSD|**L2ARC + spare**|
|external JBOD|10|8000 GB HDD|RAIDZ2 + hot spare|

The 400 GB ZIL mirror can be destroyed and removed from the ZFS pool, and one device added back as an L2ARC, without taking the OSS node offline.
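
A sketch of that online conversion on one OSS pool follows. The pool and disk names are taken from the ''zpool iostat'' output above, but the log vdev's internal name (shown here as ''mirror-1'') should be confirmed with ''zpool status'' before removal; treat the exact identifiers as placeholders.

<code>
# Confirm the name of the log mirror vdev (reported as e.g. mirror-1):
zpool status ost0pool

# Remove the mirrored log (ZIL) vdev from the pool...
zpool remove ost0pool mirror-1

# ...and add one of the freed 400 GB SSDs back as a cache (L2ARC) device:
zpool add ost0pool cache 350011731014cb4ec
</code>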

===== Proposed Timeline =====

When the second-generation hardware is integrated into Caviness, a modified VNFS and provisioning profile will be required.

The second-generation hardware is currently expected to be delivered and in operation by September 2019. Initial integration of a single second-generation OST into the existing Lustre file system for testing seems advisable.
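
For that test integration, formatting and mounting a single ZFS-backed OST would look roughly like the following. Everything here is a placeholder sketch: the file system name, OST index, MGS NID, and pool/dataset names are illustrative, not the actual Caviness values.

<code>
# Hypothetical sketch -- fsname, OST index, MGS NID, and pool/dataset names
# are placeholders, not the actual Caviness values.
mkfs.lustre --ost --backfstype=zfs \
    --fsname=testfs --index=12 \
    --mgsnode=10.0.0.1@o2ib \
    ost12pool/ost12

mkdir -p /mnt/ost12
mount -t lustre ost12pool/ost12 /mnt/ost12
</code>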