Tuesday, July 06, 2021

Building a Raspberry Pi 4 MPI Cluster in 2021

 


We are using Ubuntu on Raspberry Pi 4 boards with 8GB RAM, 64GB flash drives, coupled using a cheap gigabit Ethernet switch. I wanted to use PoE (power over Ethernet), but that requires a "hat", an additional daughter board to extract the power, and it's moderately expensive compared to the cost of ordinary power supplies. Moreover, PoE-capable switches are more expensive and a shade harder to get ahold of. (Apologies for the mess in the photo, we should straighten that out. We also don't yet have a permanent location for this. Another group in our lab 3-D printed holders and a 19" rack mount frame for theirs, but we haven't gotten that far yet.)

Getting it all running was rather a pain. Here were our pain points and some advice:

  • We accidentally installed 32-bit ARMv7 Ubuntu on some machines and 64-bit ARM64 on others. This problem won't become apparent until you compile and run your own code on multiple nodes via MPI, at which point it will tell you "Exec format error," and you'll have to go back and make all the nodes agree on architecture & chipset support. This was the last major problem we had to solve, but I mention it first since it's one you want to get right up front. All things being equal, unless you're creating a mixed cluster with older hardware, you probably want the 64-bit installation.
  • My first mistake was mixing installs of MPICH and OpenMPI. They are two separate implementations of MPI. Either is apparently fine, but don't mix them. If you just do
    sudo apt install mpi
    you will get OpenMPI, but without the headers and development tools, so you won't be able to compile. You also need the package mpi-default-dev.
  • You need openssh-server, but that's usually included in a default Ubuntu setup. You'll likely also need to install gcc, make, git and gdb.
  • We're still tinkering with the best way to share setup info, including username databases and SSH keys for students and the like, but what we've settled on for the moment is Ansible, a popular networked systems management tool.
  • We set things up to share the executable via NFS. (We're not doing data-intensive stuff, just introductory programming exercises for now, so we're not sharing some major data farm.) Getting permissions right here took a little bit of work.
  • Our biggest pain point, which took the longest to solve, was getting the firewall settings right. Even though ompi_info tells me it's not compiled for IPv6, the basic ssh that is used to initiate communications apparently runs over IPv6 anyway, if v6 is configured on your systems. Took us a couple of hours to figure this one out. Even when we briefly turned off the firewall entirely for debugging purposes, we were getting timeouts that baffled us. (ss was a big help here in figuring out which connections were being attempted, but it takes a little grepping to sort the wheat from the chaff.) (And random, 35-year-long rant: what is it with UNIX folks and short commands/tool names? "ss"? What is that?!? At least "netstat" has some mnemonic relationship to what it does.)
    Also, the default setting for Ubuntu firewall is "all outbound traffic allowed, no inbound traffic allowed," so even if you think you have the firewall entirely off, that might not mean what you think it means!
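For the firewall bullet, here is a rough sketch of the kind of grepping I mean. It assumes the iproute2 version of ss, which prints IPv6 addresses in square brackets. The helper name ipv6_only is made up for illustration, and the ufw lines in the comments assume Ubuntu's ufw front end and a hypothetical cluster subnet of 192.168.1.0/24, so adjust to your own network.

```shell
# ipv6_only: hypothetical helper. Reads `ss` output on stdin and keeps only
# the lines that mention an IPv6 address (ss prints those in [brackets]).
ipv6_only() {
    grep -F '[' || true   # "no matches" is not an error here
}

# Typical use on a node while an mpirun launch is hanging (not run here):
#   ss -tn | ipv6_only                         # TCP sockets, numeric, IPv6 only
#   ss -tn '( dport = :22 or sport = :22 )'    # just the ssh traffic
#   sudo ufw status verbose                    # shows default incoming/outgoing policy
#   sudo ufw allow from 192.168.1.0/24         # ASSUMED subnet, replace with yours
```

If the ssh connections between nodes show up in the IPv6-only view, the firewall rules need to cover v6 as well as v4.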
When your setup is close to working right, 

mpirun -np 2 --host raspi1,raspi2 hostname 

should print out the names of your hosts. (Replace raspi1 and raspi2 with DNS names or IP addresses for your machines.) That just executes the command hostname on each host, showing that your communication is working. Since each machine has its own copy of that command, it won't reveal the first problem above, the ARMv7/ARM64 mismatch.
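One way to catch the architecture mismatch with the same trick is to have each node report its own architecture instead of its hostname. This is only a sketch: same_arch is a made-up helper name, and it assumes dpkg is present on every node (true for Ubuntu), where dpkg --print-architecture reports arm64 for the 64-bit install and armhf for the 32-bit one.

```shell
# same_arch: hypothetical helper. Reads one architecture string per line on
# stdin and succeeds only if there is at least one line and all lines match.
same_arch() {
    [ "$(sort -u | wc -l)" -eq 1 ]
}

# Typical use (not run here; raspi1 and raspi2 as in the command above):
#   mpirun -np 2 --host raspi1,raspi2 dpkg --print-architecture | same_arch \
#       && echo "architectures match" || echo "MIXED ARCHITECTURES"
```

Running the mpirun part on its own, without the pipe, simply prints one architecture name per process, which is often enough to spot the odd node out.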
 
That's just some quick notes in case you're running into similar problems. I'll try to flesh this out later.
