Logic flaw in server re-connect code


Re: Logic flaw in server re-connect code

Postby cwichura » Mon Apr 16, 2012 9:39 am

Along those lines, did anything come of the changes we discussed about streaming to memory versus disk? I'm less concerned about that, as I can mostly work around that issue by running the servers with the -W option. But the reconnect bug is much more serious. (And it at least sounds like it should be fixed before RC2 comes along.)

Thanks!
cwichura
 
Posts: 351
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Mon Apr 16, 2012 9:55 am

I have plans to do some more work on the networking code before the 1.0 release; hopefully I'll get to that as well.
May contain traces of nuts.
Lord Crc
Developer
 
Posts: 4450
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Mon Apr 16, 2012 10:05 am

I read through the BTS ticket you referenced about sending only stuff that is needed. I guess I'd like to throw out that there needs to be some cache management if this is added. Otherwise, one will eventually end up with a slave having a ton of cached objects that are never going to be referenced again. So maybe have a config value for max cache size (with a default of 1GB) and then delete the oldest unreferenced item when space is needed. But now you also need to keep track of the cache items; I dunno if there is an open source library for cache management that would help with this.
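A minimal sketch of the eviction policy suggested above, assuming a byte budget and a per-item reference count. The class name `SlaveFileCache` and all of its methods are hypothetical and are not part of lux; the sketch only tracks sizes and does not touch any files.

```python
from collections import OrderedDict


class SlaveFileCache:
    """Hypothetical byte-budgeted cache index: evicts the oldest unreferenced entry first."""

    def __init__(self, max_bytes=1 << 30):  # 1 GB default, as suggested in the post
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> [size, refcount], oldest first

    def add(self, key, size):
        """Register an item of `size` bytes, evicting old unreferenced items if needed."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return
        self._evict_until_fits(size)
        self._entries[key] = [size, 0]
        self.used_bytes += size

    def acquire(self, key):
        """Mark an item as in use by the current render and refresh its age."""
        self._entries[key][1] += 1
        self._entries.move_to_end(key)

    def release(self, key):
        """Drop one reference; the item becomes evictable once the count reaches zero."""
        self._entries[key][1] -= 1

    def keys(self):
        return list(self._entries)

    def _evict_until_fits(self, size):
        # Walk from oldest to newest and drop unreferenced items until the new
        # item fits. If everything is referenced, the budget may be exceeded.
        for key in list(self._entries):
            if self.used_bytes + size <= self.max_bytes:
                break
            entry_size, refcount = self._entries[key]
            if refcount == 0:
                del self._entries[key]
                self.used_bytes -= entry_size
```

A real cache would also have to delete the evicted file from disk and persist the index, which is the bookkeeping the post is worried about.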

But I have to admit, it seems like a lot of work for not much gain. Sending files to the slaves on startup has never been an issue for me. The only time where this is slow is if I'm using an EC2 instance for additional slave power. But in that case, it's a spot instance that has just been spun up and so will have no cached objects anyway...
cwichura
 
Posts: 351
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Mon Apr 16, 2012 10:16 am

cwichura wrote: I read through the BTS ticket you referenced about sending only stuff that is needed. I guess I'd like to throw out that there needs to be some cache management if this is added. Otherwise, one will eventually end up with a slave having a ton of cached objects that are never going to be referenced again. So maybe have a config value for max cache size (with a default of 1GB) and then delete the oldest unreferenced item when space is needed. But now you also need to keep track of the cache items; I dunno if there is an open source library for cache management that would help with this.


Yeah, I've thought about it. The trickiest bit is in fact ensuring the cleanup is executed at all (i.e. when the process is killed with Ctrl+C etc.).

cwichura wrote: But I have to admit, it seems like a lot of work for not much gain. Sending files to the slaves on startup has never been an issue for me. The only time where this is slow is if I'm using an EC2 instance for additional slave power. But in that case, it's a spot instance that has just been spun up and so will have no cached objects anyway...


If you have a large scene and you're doing test renders on a local farm, then sending all the files every time can add up.

However there's another advantage here which is indirect: much lower memory usage. Currently lux stores a full copy of all the scene data, including external files, in memory. This is so that it can send it to a slave if the slave is connected after the render has started. This essentially doubles the memory usage.

The implementation of request 614 will not store external files in memory, but instead stream them from disk per request.
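To illustrate the difference from keeping a full copy in memory, here is a rough sketch of streaming a file from disk in fixed-size chunks. The function names, the chunk size, and the `sink` object (anything with a `write` method, such as a socket file) are assumptions for illustration and do not reflect the actual lux code.

```python
CHUNK_SIZE = 64 * 1024  # hypothetical chunk size


def stream_file(path, chunk_size=CHUNK_SIZE):
    """Yield the file's contents chunk by chunk instead of loading it whole."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def send_file(path, sink, chunk_size=CHUNK_SIZE):
    """Write the file to `sink` and return the number of bytes sent."""
    total = 0
    for chunk in stream_file(path, chunk_size):
        sink.write(chunk)
        total += len(chunk)
    return total
```

With this approach the master only ever holds one chunk per transfer in memory, at the cost of requiring the file to still exist on disk when a late slave asks for it.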
Lord Crc
Developer
 
Posts: 4450
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Mon Apr 16, 2012 10:49 am

So the assumption is that the cache exists only as long as the slave instance is running? So if you stop and start a master but not the slaves, they'll still have the cached data. But if the slave itself is recycled, then its cache is wiped. Correct?

If so, then that simplifies things a lot, since you won't have a case of an ever-growing cache on the slave. It might almost make more sense to do cache cleanup on startup in addition to shutdown, for cases where the slave terminates abruptly and the cleanup code never gets to run.
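The startup-plus-shutdown cleanup idea could look roughly like the sketch below, assuming the cache is a single directory owned by one slave process. The function names and the directory layout are hypothetical.

```python
import atexit
import shutil
import signal
import sys
from pathlib import Path


def wipe_cache(cache_dir):
    """Remove the cache directory; a missing directory is not an error."""
    shutil.rmtree(cache_dir, ignore_errors=True)


def prepare_cache(cache_dir):
    """Wipe leftovers from a previous (possibly crashed) run, then start clean."""
    cache_dir = Path(cache_dir)
    wipe_cache(cache_dir)                   # startup cleanup covers abrupt exits
    cache_dir.mkdir(parents=True)
    atexit.register(wipe_cache, cache_dir)  # shutdown cleanup for normal exits
    return cache_dir


def install_signal_handlers():
    """Turn SIGTERM into a normal exit so that atexit handlers get to run.

    Ctrl+C raises KeyboardInterrupt in Python, which already leads to a normal
    interpreter exit unless it is swallowed. Must be called from the main thread.
    """
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))
```

Nothing can run on a hard kill or power loss, which is why the wipe at startup is the part that actually guarantees the cache does not grow across runs.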

Streaming objects from disk would be a nice upgrade. Would this also include support for allowing multiple slaves to spin up at the same time, rather than having to wait for each one to start completely before moving to the next like it currently does?
cwichura
 
Posts: 351
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Mon Apr 16, 2012 10:57 am

Actually, having luxconsole be two processes might be a good idea for slaves. The parent process can handle cache cleanup, in which case it doesn't matter if the child panics. It also allows for the parent to automatically restart the child if it panics.

(I've had problems recently where a scene using path+kdtree was crashing lux, but when I switched it to path+qbvh it ran without crashing. Bidir seems pretty stable, but this scene was using a volumeintegrator, which requires path. And I've no idea what was causing it to crash, since the crash occurred after some random amount of time -- sometimes five minutes, sometimes an hour or two. So trying to delete things until it stopped crashing wasn't really an option. Unfortunately, I discovered the path+kdtree crash after it had already crashed the slave machine and it was over a weekend, so I didn't have access to get in and restart the slave.)
cwichura
 
Posts: 351
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Mon Apr 16, 2012 1:25 pm

I think both advanced cache management and server management (in terms of restarting etc.) are best handled for now by a small system or Python script.

If we start adding an "agent" process or similar we should look into full-blown render farm management.
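As a rough idea of what such a small Python script might do for the restart case, the sketch below reruns a command whenever it exits with a non-zero status. The function name `supervise` and its parameters are hypothetical, and the command line to pass in is left open since it depends on the local setup.

```python
import subprocess
import time


def supervise(command, max_restarts=5, delay=1.0):
    """Run `command`, restarting it after each abnormal exit.

    Returns the number of restarts once the command exits cleanly, and raises
    RuntimeError if it keeps failing after `max_restarts` restarts.
    """
    restarts = 0
    while True:
        returncode = subprocess.call(command)
        if returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"command failed {restarts + 1} times, last exit code {returncode}"
            )
        restarts += 1
        time.sleep(delay)  # brief pause so a crash loop does not spin the CPU
```

The restart cap is a design choice: a scene that crashes the renderer every time would otherwise restart forever.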
Lord Crc
Developer
 
Posts: 4450
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Mon Apr 16, 2012 1:50 pm

cwichura wrote: Would this also include support for allowing multiple slaves to spin up at the same time, rather than having to wait for each one to start completely before moving to the next like it currently does?


This is somewhat of an orthogonal issue, however the current implementation shouldn't make it any more difficult to implement multithreaded initialization of slaves.
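For illustration only, concurrent initialization of several slaves can be sketched with a thread pool. Here `init_one` stands in for whatever connects to a slave and sends it the scene; it and `init_all` are hypothetical names, not lux functions.

```python
from concurrent.futures import ThreadPoolExecutor


def init_all(slaves, init_one, max_workers=8):
    """Run init_one(slave) for every slave concurrently.

    Returns a dict mapping each slave to its result, or to the exception it
    raised, so that one failing slave does not abort the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {slave: pool.submit(init_one, slave) for slave in slaves}
        for slave, future in futures.items():
            try:
                results[slave] = future.result()
            except Exception as exc:  # keep going; report the failure per slave
                results[slave] = exc
    return results
```

Since the work is dominated by network transfer, threads are sufficient here; total startup time then approaches that of the slowest slave rather than the sum of all of them.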
Lord Crc
Developer
 
Posts: 4450
Joined: Sat Nov 17, 2007 2:10 pm

Re: Logic flaw in server re-connect code

Postby cwichura » Sat Apr 28, 2012 8:06 am

Just got bit by the reconnect bug again due to a power failure.

The master is my laptop (a quad-core i7 w/ 24GB of RAM), and so the power failure took out the laptop, but not the slave, which is in a datacenter room. As the laptop also had eight render threads working, it consumed its battery pretty quickly once the power went out and put itself to sleep. When power came back, the laptop was resumed from sleep. But lux attempted to reconnect instantly, before the laptop was able to reconnect to the WiFi network. So lux failed to connect, then overwrote the SID, and the slave became a zombie.

I dunno how hard it would be, but perhaps another thing to consider is having lux hold off on any network requests for a bit when it gets a message that the OS was just resumed? Though this becomes less important once the reconnect bug gets fixed.
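The hold-off idea could be sketched as a small grace-period gate. How the application learns about an OS resume is platform-specific and not shown; `ResumeGate`, `notify_resumed`, and the 30-second default are all assumptions made for this sketch.

```python
import time


class ResumeGate:
    """Hypothetical helper: delays network attempts for a while after an OS resume."""

    def __init__(self, grace_seconds=30.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self._clock = clock
        self._resumed_at = None

    def notify_resumed(self):
        """Call this from whatever platform hook reports that the OS has resumed."""
        self._resumed_at = self._clock()

    def seconds_to_wait(self):
        """Return how long the caller should still wait before using the network."""
        if self._resumed_at is None:
            return 0.0
        elapsed = self._clock() - self._resumed_at
        return max(0.0, self.grace_seconds - elapsed)
```

The networking loop would check `seconds_to_wait()` before each reconnect attempt and sleep for that long if it is non-zero.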
cwichura
 
Posts: 351
Joined: Sun Feb 12, 2012 11:31 pm

Re: Logic flaw in server re-connect code

Postby Lord Crc » Sat Apr 28, 2012 8:52 am

cwichura wrote: I dunno how hard it would be, but perhaps another thing to consider is having lux hold off on any network requests for a bit when it gets a message that the OS was just resumed? Though this becomes less important once the reconnect bug gets fixed.


I think it's fixed in my current code. I'll commit it along with the implementation for issue 614; hopefully I can do that and get a build out tomorrow.

edit: the current code only requests a new session if the slave actively refused resuming the old session.
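The rule described in the edit can be sketched as follows, assuming the master can tell an explicit refusal apart from a failed connection. The `Reply` values and the `try_resume` and `new_session` callables are hypothetical stand-ins, not the actual lux protocol.

```python
from enum import Enum


class Reply(Enum):
    RESUMED = "resumed"          # slave accepted the old session
    REFUSED = "refused"          # slave answered and rejected the old session
    UNREACHABLE = "unreachable"  # no answer at all (e.g. network still down)


def reconnect(session_id, try_resume, new_session):
    """Return (session_id, connected) after one reconnect attempt.

    Only an explicit refusal replaces the session ID. If the slave cannot be
    reached, the old ID is kept so that a later attempt can still resume.
    """
    reply = try_resume(session_id)
    if reply is Reply.RESUMED:
        return session_id, True
    if reply is Reply.REFUSED:
        return new_session(), True
    return session_id, False
```

In the power-failure scenario from the earlier post, the first attempt would come back as unreachable, the SID would survive, and the next attempt after WiFi is back could resume the old session.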
Lord Crc
Developer
 
Posts: 4450
Joined: Sat Nov 17, 2007 2:10 pm
