I've been looking into why an error while communicating with a slave node results in that node forever reporting BUSY when the master tries to recover it, and I think I've found the cause.
On the master, if a node fails for whatever reason (e.g., a network error occurs during a film file transfer), the master marks the server as inactive. In various places, it then calls RenderFarm::reconnectFailed(). reconnectFailed() walks through the server list, looking for any servers marked as inactive, and calls RenderFarm::connect() on each one to try to reconnect to the slave.
This is where the problem is. The slave is not expecting a ServerConnect handshake, since the slave itself did not fault. So when the master tries to reconnect to it, it's sending ServerConnect with no SessionID and the slave is rejecting it, saying that it is BUSY. If the slave is killed and re-started, such that it is back to the idle state, then on the next reconnectFailed(), the connect() will succeed and the master will again upload all the scene files to the slave and get it running again.
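The failure mode can be illustrated with a small model of the slave side. This is only a sketch: the Slave struct, the SlaveState and Reply enums, and the idea that the caller passes in the session ID are all hypothetical and are not taken from the actual render server code.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the slave's handshake handling (illustrative names only).
enum class SlaveState { Idle, Rendering };
enum class Reply { Ok, Busy };

struct Slave {
    SlaveState state = SlaveState::Idle;
    std::string sid;  // session ID associated with the current session

    // ServerConnect carries no existing session ID, so the slave only
    // accepts it while idle. A slave that never faulted is still in its
    // session and answers BUSY.
    Reply serverConnect(const std::string& newSid) {
        assert(!newSid.empty());  // assumption: a session always has a non-empty ID
        if (state != SlaveState::Idle) {
            return Reply::Busy;
        }
        state = SlaveState::Rendering;
        sid = newSid;
        return Reply::Ok;
    }

    // Models killing and restarting the slave process.
    void restart() {
        state = SlaveState::Idle;
        sid.clear();
    }
};
```

In this model, a second serverConnect() against a slave that is still in its session returns Reply::Busy every time, and only restart() makes the next attempt succeed, which matches the behaviour described above.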
So RenderFarm::connect() should be updated (with whatever corresponding changes in the render server are needed). If there is an SID known for the slave, it should first attempt to issue the server status command with that SID. If that succeeds, then it knows the slave is OK and can mark it as active again. If that fails with BUSY, then it's the wrong SID for the slave and it should give up. If the slave instead reports that it is IDLE, then connect() falls through to the existing ServerConnect logic.
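The decision logic proposed above could look roughly like the following. This is a sketch under assumptions, not a patch: planReconnect, the Status and Action enums, and the queryStatus callback (standing in for "issue the server status command with this SID") are hypothetical names, and treating an unreachable slave as "give up for now" is my own assumption rather than something the existing code does.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical outcomes of the status command and the resulting decisions.
enum class Status { Ok, Busy, Idle, Unreachable };
enum class Action { MarkActive, GiveUp, FullConnect };

// Decide what connect() should do for a server that was marked inactive.
// queryStatus is a stand-in for sending the status command with the known SID.
Action planReconnect(const std::string& knownSid,
                     const std::function<Status(const std::string&)>& queryStatus) {
    assert(static_cast<bool>(queryStatus));  // a callable must be supplied

    if (knownSid.empty()) {
        return Action::FullConnect;  // no session to resume: existing ServerConnect path
    }
    switch (queryStatus(knownSid)) {
        case Status::Ok:
            return Action::MarkActive;   // slave still holds our session
        case Status::Idle:
            return Action::FullConnect;  // slave was restarted: redo the handshake
        case Status::Busy:
            return Action::GiveUp;       // wrong SID for this slave
        case Status::Unreachable:
            return Action::GiveUp;       // assumption: stay inactive until the next pass
    }
    return Action::GiveUp;
}
```

Keeping the decision separate from the network I/O means the three cases can be exercised without a running slave; only Action::FullConnect would lead to scene files being uploaded again.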