Logic flaw in server re-connect code

Postby cwichura » Thu Mar 01, 2012 9:36 am

I've been looking into why an error while communicating with a slave node results in that node forever reporting BUSY when the master tries to recover it. And I think I've found it.

On the master, if a node fails for whatever reason (e.g., a network error occurs during a film file transfer), the master marks the server as inactive. In various places, it then calls RenderFarm::reconnectFailed(). reconnectFailed() then walks through the server list, looking for any servers marked as inactive, and calls RenderFarm::connect() on each one to try to re-connect to the slave.

This is where the problem is. The slave is not expecting a ServerConnect handshake, since the slave itself did not fault. So when the master tries to reconnect to it, it's sending ServerConnect with no SessionID and the slave is rejecting it, saying that it is BUSY. If the slave is killed and re-started, such that it is back to the idle state, then on the next reconnectFailed(), the connect() will succeed and the master will again upload all the scene files to the slave and get it running again.

So RenderFarm::connect() should be updated (with whatever corresponding changes in the render server are needed). If there is an SID known for the slave, it should first attempt to issue the server status command with that SID. If that succeeds, then it knows the slave is OK and can mark it as active again. If that fails with BUSY, then it's the wrong SID for the slave and it should give up. If the slave's server info command reports that it is IDLE, then it proceeds to fall through to the existing ServerConnect logic.
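
The order of checks proposed above could be sketched roughly like this. It is only an illustration in Python, not the actual RenderFarm::connect() code: the `Status` values, the `Server` record, and the `query_status` / `full_connect` callables are hypothetical stand-ins for whatever the real protocol exposes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    """Hypothetical replies to a status query; not the real protocol values."""
    OK = "OK"      # slave recognises the SID and its session is intact
    BUSY = "BUSY"  # slave is in a session, but not the one we named
    IDLE = "IDLE"  # slave has no session (e.g. it was restarted)


@dataclass
class Server:
    name: str
    sid: Optional[str] = None  # session ID from the last successful connect
    active: bool = False


def reconnect(server: Server,
              query_status: Callable[[Server], Status],
              full_connect: Callable[[Server], bool]) -> bool:
    """Try to recover one inactive server; return True if it is active again."""
    if server.sid is not None:
        status = query_status(server)
        if status is Status.OK:
            # Session survived on the slave: just mark it active again.
            server.active = True
            return True
        if status is Status.BUSY:
            # Wrong SID for this slave: give up rather than retry forever.
            return False
        # Status.IDLE: fall through to the normal handshake below.
    # No known SID, or the slave is idle: do the full connect.
    # (Assumed here: full_connect also uploads the scene and stores a new SID.)
    server.active = full_connect(server)
    return server.active
```

The point of the ordering is that the expensive full connect, with its scene upload, only happens when the slave really has lost its session.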

Thanks
cwichura
 
Posts: 486
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Thu Mar 01, 2012 10:41 am

And as a side note: I've noticed that lux frequently reports an I/O error while retrieving samples from a server when the resume file writer thread kicks in on the master during the download. I think this is the main reason I keep having slaves get knocked off; not because of an actual network problem, but because of an internal bug in lux. I've no idea where to look for this one, though...
cwichura
 
Posts: 486
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby jeanphi » Thu Mar 01, 2012 10:49 am

Hi,

Thanks for the report.

Jeanphi
jeanphi
Developer
 
Posts: 7284
Joined: Mon Jan 14, 2008 7:21 am

Re: Logic flaw in server re-connect code

Postby Lord Crc » Thu Mar 01, 2012 2:35 pm

Thanks, I'll have a look. I plan to do some more work on the networking bit before we release the next stable.
May contain traces of nuts.
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Fri Mar 02, 2012 2:34 am

I'm adding a ServerReconnect command to re-establish the session. If it fails it will try to do a regular connection attempt to establish a new session. Hopefully this should solve this issue.
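
For anyone following along, here is a toy model of that flow. Only the ServerConnect and ServerReconnect command names come from this thread; the `Slave` class, the method names, and the use of `None` to stand for a BUSY refusal are made up for illustration, and the code in the repository may well differ.

```python
import uuid
from typing import Optional


class Slave:
    """Toy model of a render slave that holds at most one session."""

    def __init__(self) -> None:
        self.sid: Optional[str] = None

    def server_connect(self) -> Optional[str]:
        """Start a new session; returns None (think BUSY) if one exists."""
        if self.sid is not None:
            return None
        self.sid = uuid.uuid4().hex
        return self.sid

    def server_reconnect(self, sid: str) -> bool:
        """Re-attach to the existing session if the SID matches."""
        return self.sid is not None and sid == self.sid

    def restart(self) -> None:
        """Simulate killing and restarting the slave (back to idle)."""
        self.sid = None


def recover(slave: Slave, known_sid: Optional[str]) -> Optional[str]:
    """Return the SID of a usable session, or None if the slave stays lost."""
    if known_sid is not None and slave.server_reconnect(known_sid):
        return known_sid           # session re-established, nothing to re-upload
    return slave.server_connect()  # otherwise try to establish a new session


slave = Slave()
sid = slave.server_connect()

# Master-side error: the slave is still in its session, so a plain
# connect keeps being refused, which is the bug described in this thread.
print(slave.server_connect())      # None
print(recover(slave, sid) == sid)  # True
```

If the slave has been restarted in the meantime, `server_reconnect` fails and `recover` falls back to a fresh session with a new SID.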

Now I just need to figure out a reliable way to test this :)
May contain traces of nuts.
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Fri Mar 02, 2012 6:07 am

I managed to test it by forcing a fail condition on film retrieval, and it seems to work fine here. Code is in the repository.
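
One generic way to force that kind of failure (not necessarily how it was done for this fix) is to wrap the transfer call so that a chosen invocation raises an I/O error. A minimal Python sketch, where `FailOnce` and `fetch_film` are hypothetical names and not part of lux:

```python
from typing import Any, Callable


class FailOnce:
    """Wrap a callable so that exactly one chosen call raises OSError."""

    def __init__(self, func: Callable[..., Any], fail_on_call: int = 1) -> None:
        self.func = func
        self.fail_on_call = fail_on_call  # 1-based index of the call to break
        self.calls = 0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise OSError("injected failure")
        return self.func(*args, **kwargs)


def fetch_film() -> str:
    """Hypothetical stand-in for downloading the film from a slave."""
    return "film data"


# Break only the second retrieval; calls before and after it succeed.
flaky_fetch = FailOnce(fetch_film, fail_on_call=2)
```

Calling `flaky_fetch()` three times gives one success, one injected `OSError`, and another success, which is enough to drive the master through its mark-inactive and reconnect path without touching the network.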
May contain traces of nuts.
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Sat Mar 03, 2012 8:24 pm

Thanks. I had a look at your diffs on the bug tracker website. I'll have to wait for the next weekly build to test it myself, since I don't have a build environment for lux set up.
cwichura
 
Posts: 486
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Sat Mar 03, 2012 11:28 pm

Which platform are you on? Sadly, I can only whip up some Windows builds.
May contain traces of nuts.
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby SATtva » Sun Mar 04, 2012 2:48 am

I'm going to make Linux weeklies this evening anyway.
Linux builds packager
To ask is a moment's shame; not to ask is a lifetime's shame.
SATtva
Developer
 
Posts: 6164
Joined: Tue Apr 07, 2009 12:19 pm
Location: from Siberia with love

Re: Logic flaw in server re-connect code

Postby cwichura » Sun Mar 04, 2012 2:46 pm

My primary machine and slave are Windows 7 64-bit. But I've also used EC2 spot instances running Linux to get some additional slave power when the local machines aren't enough. ("By the fireplace" kept my master (a quad-core i7 at 2.2 GHz) and local slave (a hexa-core i7 at 3.2 GHz) running for three days, rendering it at 5120x2880, after which I added in 5 EC2 8-CPU spot instances for another 36 hours just to get it to around 2000 S/p.)
cwichura
 
Posts: 486
Joined: Sun Feb 12, 2012 11:31 pm
