Logic flaw in server re-connect code

Discussion related to the implementation of new features & algorithms to the Core Engine.

Moderators: jromang, tomb, zcott, coordinators

Re: Logic flaw in server re-connect code

Postby Lord Crc » Sun Mar 04, 2012 3:15 pm

New weeklies are up for linux and windows :)
May contain traces of nuts.
User avatar
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby jeanphi » Sun Mar 04, 2012 3:46 pm

And Mac.

Jeanphi
jeanphi
Developer
 
Posts: 7241
Joined: Mon Jan 14, 2008 7:21 am

Re: Logic flaw in server re-connect code

Postby cwichura » Mon Mar 05, 2012 11:55 am

I've got it running on my master plus two slaves (all Win7 64-bit). So far, seems to be happy. I haven't had a transfer error yet, though, so haven't tested the re-connect part. Will report back as things happen.

But thanks for the quick response on getting this looked at! Losing work that the servers cranked through really sucks, especially when they are an Amazon EC2 instance that you're paying for the compute time on...

And I also wanted to say that, as a network engineer by day, I was pleasantly surprised the first time I ran lux with slaves and saw it connect over IPv6. More apps should be made IPv6 aware!
cwichura
 
Posts: 481
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Mon Mar 05, 2012 2:50 pm

cwichura wrote:I've got it running on my master plus two slaves (all Win7 64-bit). So far, seems to be happy. I haven't had a transfer error yet, though, so haven't tested the re-connect part. Will report back as things happen.


Great :) Data loss is something we try to take seriously and prevent if possible.

cwichura wrote:And I also wanted to say that, as a network engineer by day, I was pleasantly surprised the first time I ran lux with slaves and saw it connect over IPv6. More apps should be made IPv6 aware!


Hehe, well if you find any bugs don't be shy ;) I didn't have any working ipv6 net at home (haven't updated my router to support it yet), so could only test with the loopback device. But I fully agree with the sentiment, it was in fact my driver for adding support in the first place :)
May contain traces of nuts.
User avatar
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Mon Apr 02, 2012 8:18 am

I have tracked down a flaw in the new re-connect logic. In renderfarm::connect(), pretty much the first thing it does is clear the sid field. If you're in a case where the network is interrupted, reconnect() is called and when that fails, connect() is then called, destroying the sid field. When the network comes back, it still tries to call reconnect(), but the sid is no-longer valid, so the reconnect perpetually fails.
cwichura
 
Posts: 481
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Fri Apr 06, 2012 7:57 am

cwichura wrote:I have tracked down a flaw in the new re-connect logic. In renderfarm::connect(), pretty much the first thing it does is clear the sid field. If you're in a case where the network is interrupted, reconnect() is called and when that fails, connect() is then called, destroying the sid field. When the network comes back, it still tries to call reconnect(), but the sid is no-longer valid, so the reconnect perpetually fails.


Ah shoot, didn't think about that possibility. I've hopefully added a fix for this, but I couldn't test it properly, I've only managed to test that it works with establishing a new session if the slave dies and is brought up again. I'm making some new "weekly" builds today which will include this fix, please let me know if it helps with this issue.

edit: To elaborate, the fix is that it will only try to establish a new session if the slave explicitly rejected the reconnection attempt. Otherwise it will just try to reconnect again next time.
May contain traces of nuts.
User avatar
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Fri Apr 06, 2012 12:53 pm

I can't test this just now, since I don't have access to the slaves with the holiday today. But I've been looking at the code, and I'm not sure if the new change is correct.

Please correct me if I'm reading it wrong, but this is what I see:

reconnectFailed() only tries to reconnect when serverInfoList[i].active is false. It does not change the state of .active before calling reconnect(). So on entry to reconnect(), .active is still false.

In reconnect(), the possible results are:
1) TCP connection succeeds, and handshake succeeds. .active set to true, reconnect() returns true.
2) TCP connection succeeds, handshake fails. .active set to false, reconnect() returns false
3) TCP connection fails, causing an exception to be thrown. (Or any other exception occurs.) .active unchanged, so remains false, reconnect() returns false

So the problem is that cases 2 and 3 have the same result of false/false, which will cause reconnectFailed() to call connect() again. And connect() will blast the SID. However, case 3 should NOT be causing it to proceed with connect(). I think your test worked, since you said you disrupted the slave, so full restart by connect() would be required. Where I see the most problems is when the network separating the master and the slave becomes unavailable, the slave is still happily computing away and reconnecting it should work fine, except the SID gets clobbered so the slave rejects the master.

Since reconnect() is only called by reconnectFailed(), I think changing the result of case 3 to return true, but leave .active false should fix this. Then reconnectFailed() will not call connect(), but active will sill be false, so Lux will still attempt to reconnect the next time reconnectFailed() is called.

Or am I reading things wrong?
cwichura
 
Posts: 481
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Fri Apr 06, 2012 2:55 pm

Argh, you're correct. I had limited time to get out the weeklies before the guest arrived and my mind silently inserted a ! before the if statement inside the loop in reconnectFailed.

I'll try to get a proper fix out by tomorrow.
May contain traces of nuts.
User avatar
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Mon Apr 16, 2012 8:55 am

Just following up on this. I haven't seen anything posted to the source repository for this, and I see you all are now gearing up to make a 1.0 release candidate, so am hoping this doesn't slip through the cracks with the RC shuffle.
cwichura
 
Posts: 481
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Mon Apr 16, 2012 9:23 am

Shoot. I've made a change but it seems it never went into the main repository for some reason, I think it got forgotten due to the repository corruption we had last week.

It won't make RC1, but I'll make a "weekly" shortly after with an implementation of http://www.luxrender.net/mantis/view.php?id=614 so it'll be in there.
May contain traces of nuts.
User avatar
Lord Crc
Developer
 
Posts: 4932
Joined: Sat Nov 17, 2007 2:10 pm

PreviousNext

Return to Architecture & Design

Who is online

Users browsing this forum: No registered users and 1 guest