Team render issues

kizo:
Hi,

Like some other users, I also have issues with TR. The issues are clients disconnecting and differences in output, plus other limitations like not being able to save in the CXR format.

As the beta will expire on the 21st of Feb., we are trying to prepare the workflow, but the TR issues just seem to prevent production. In its current state I feel like TR is totally useless and unreliable, and in fact any node licences are worthless, which is a shame for a newly released product. I love Corona and hope to keep using it. We have NO issues working on single machines, but distributed rendering, and with it using the full potential of our hardware, is essential.
I have been trying to get some help from the official support, but unfortunately it takes way too long to get any answer, so I will try here. Maybe I will have better luck.

TEAM RENDER ISSUE: CLIENTS DISCONNECTING AND SLOW RENDER TIME

I have spent a lot of time testing things out. I'm not a network specialist, so it's rather hard to understand what's going on and whether the network itself is causing issues, but I did a series of tests to identify the differences in settings and the results obtained. Note that we used the same machines and network setup with Vray DR and Thea Render distributed (CPU+GPU on the network) without similar issues.

We are running a 1 Gbps network and have a dedicated range of fixed IPs for the machines. I have adjusted all firewall settings, disabled Bonjour, and I'm using IP addresses to connect. I have followed the Cineversity series of videos on TR and made all adjustments. The Corona license server is running. The tests were made with the hotfix release.

I made all tests using 9 machines.

From the tests I have made so far, one conclusion is that the smaller the packet size I set in the Corona render settings, the less the nodes disconnect.

In fact, with a size of 5 or 10 and the interval always at 10 s, none of the nodes disconnected. With a size of 50 they start disconnecting, and with 100, 200 or even 500 they also disconnect, seemingly sooner. The error we get when a node disconnects is: "Frame synchronization failed: Communication error", but if I ping the node or test the connection, it is available, so I can only assume that there were too many chunks at the same time, but I'm not sure. Also, if that's the case, having smaller chunks would mean a lot more of them at the same time... but when chunks are smaller no nodes disconnect at all.

A small chunk size seems to be OK for smaller resolution renders. The tests I made with a 1k square render, a small packet size of 10 and a 10 s interval rendered without any nodes disconnecting. With all nodes, the scene would render to the set noise level in 5-6 min (the same scene on a single 2990WX took 19 min).

On a 4k square render of the same scene some interesting facts are shown:

A) packet size = 10 MB // render time (stamped) = 0:42:00 // no nodes disconnected

B) packet size = 100 MB // render time (stamped) = 0:21:34 // 4 nodes disconnected (3 almost at the same time, some 3-4 min after starting the render)

Test A spent around 1/4 of the time just collecting all the chunks: it finished rendering in about 30 min and took about 10 more to get all the chunks.

In test B the render finished in 21 min. Even though 4 of the machines disconnected, this is way faster, but the issue is that not all of the power is used.
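
For reference, the two 4k tests can be compared with a few lines of arithmetic. This is only a sketch using the figures quoted above (stamped times of 0:42:00 and 0:21:34, and roughly 10 minutes of chunk collection in test A); the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(stamp: str) -> int:
    """Convert an h:mm:ss render stamp to seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

test_a = to_seconds("0:42:00")  # 10 MB packets, no disconnects
test_b = to_seconds("0:21:34")  # 100 MB packets, 4 nodes disconnected

# Approximate chunk-collection time for test A, as estimated in the post.
collect_a = 10 * 60

speedup_b_over_a = test_a / test_b
collect_share_a = collect_a / test_a

print(f"Test B is {speedup_b_over_a:.2f}x faster than test A")
print(f"Chunk collection is about {collect_share_a:.0%} of test A")
```

With these numbers test B comes out a little under twice as fast as test A, and chunk collection accounts for a bit less than a quarter of test A's stamped time.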

I have tried many combinations, changing the interval only, the packet size only, or both. It's a very time-consuming process. In all my tests nodes disconnect with larger packets, while they don't with smaller ones, but then rendering takes way too long.

PICTURE VIEWER OUTPUT DIFFERENT FROM VFB

In all the tests made so far using TR through the PV (the only possible way), but on a single machine too, the image saved from the PV is always darker.
The C4D project settings are set to linear workflow and sRGB.

I also had a situation where the lights didn't match between the .jpg saved from the PV and the .jpg saved from the VFB. It seems the VFB is showing and saving the LightMix, while the PV saves the beauty pass. That would explain why the color and intensity of the light were different. Please check the attached images.
If the above is true, how would one save a non-layered format out of the PV and have it match the VFB?

RENDER TIME INCONSISTENCY

While testing I noticed that different render times are reported for the same job. I'm not sure why that is, but it makes it harder to know the real render time and to estimate. Check the attached example.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It would be great to get any help from support, but I'm looking forward to user input too. Anyone out there using TR without the above issues?

I must say I'm rather disappointed with how network rendering works. I have 3 x Corona + 3 nodes licences and one Corona + 10 nodes licence, along with 10 machines waiting to render. And I'm now facing a major slowdown in production and can't find a solution. I'm not getting any help from support either.
It feels like the beta stage should have lasted longer, even if it was rather long. But making distributed rendering work should be part of a released product, and unfortunately Cinema 4D Team Render doesn't seem to be the best choice for that.

Cheers
kizo

TomG:
Hi Kizo!

To me, it sounds like your network is being overloaded with traffic. 10 machines is a lot to run on a 1 Gb network, and I know there are people with fewer machines who are using a 10GbE network, so I am thinking collisions on the network are the problem, where "keep alive" messages sent between the machines to confirm they are still there are being lost, and so either the Client concludes the Server isn't there or vice versa.

This also fits with the fact that a larger packet size causes the dropouts, while a smaller packet size does not. A test that might help confirm this: you mention "packet size= 100 mb// render time (stamped)=0:21:34// 4 nodes disconnected"; it would be interesting to know whether a disconnect still happens with the 100 MB packet size when using only workstation + 2 clients. Then workstation + 4 nodes, and so on. If the larger packet size works when there are fewer nodes in use, that would also suggest the network is becoming overloaded, resulting in some keep-alives getting lost among data collisions.

(BTW, you should have heard all this from support already)

The other possibly interesting test would be to use Team Render but with one of the native C4D render engines - I haven't looked into that to see if it has a similar control over packet size, but if it does you could try and replicate the same settings there and see if that has the same issue too.

A search for 10gbe will show threads where people are using 10 gig Ethernet networks, so it could be that this is just a lot of machines and a lot of data, more than the 1 gig network can manage. I haven't looked too far into the question for C4D in general, but I do see posts like this https://www.c4dcafe.com/ipb/forums/topic/92259-team-render-woes-bandwidth-issues/ where it is suggested that a gigabit network can do about 100mbps, a figure I thought was interesting since that is the packet size where you mention things start to drop out.
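
As a rough sanity check of the bandwidth argument, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every client sends one full packet per interval, that all 9 machines mentioned earlier act as clients, and that all traffic converges on a single 1 Gbit/s link (125 MB/s theoretical, before protocol overhead). None of this is confirmed for how Corona actually uses TR, and `link_utilization` is a made-up helper.

```python
GIGABIT_MB_PER_S = 1000 / 8  # 1 Gbit/s = 125 MB/s theoretical maximum

def link_utilization(clients: int, packet_mb: float, interval_s: float,
                     link_mb_per_s: float = GIGABIT_MB_PER_S) -> float:
    """Average share of the link used if every client sends one full
    packet per interval (a worst-case assumption)."""
    demand_mb_per_s = clients * packet_mb / interval_s
    return demand_mb_per_s / link_mb_per_s

# 10 s interval, with some of the packet sizes from the tests in this thread.
for packet_mb in (10, 100, 500):
    share = link_utilization(clients=9, packet_mb=packet_mb, interval_s=10)
    print(f"{packet_mb:>3} MB packets: {share:.0%} of the link")
```

Under these assumptions 10 MB packets use a small fraction of the link, 100 MB packets come close to saturating it, and 500 MB packets exceed it. With a fixed interval, a smaller packet size means less data per interval rather than more packets, which would be consistent with the smaller sizes not causing dropouts.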

We're continuing to look into this and discuss the subject of networking here, including considering whether some home grown networking solution may be better than C4D TR, but in the meantime this could just be a limitation of C4D TR.

If you can do those tests (100 MB packet size, fewer nodes; TR with a native render engine with similar packet sizes, if controllable), that could further confirm whether the network might be an issue. Let us know!
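
The suggested tests could be tracked as a small matrix. A minimal sketch, assuming a 10 s interval and steps of two clients up to eight; the field names are hypothetical and are not Corona or TR settings.

```python
from itertools import product

client_counts = [2, 4, 6, 8]   # workstation + N clients
packet_sizes_mb = [100]        # the packet size singled out for this test
interval_s = 10                # assumed, as in the earlier tests

test_plan = [
    {"clients": clients, "packet_mb": packet_mb, "interval_s": interval_s,
     "disconnects": None}      # fill in after each run
    for clients, packet_mb in product(client_counts, packet_sizes_mb)
]

for run in test_plan:
    print(run)
```

Adding more packet sizes to `packet_sizes_mb` extends the plan to every combination of client count and size.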

kizo:
Hi Tom,

I will try to do further testing.
But if the network is the issue, how would you explain that rendering with Vray DR or Thea Render distributed a few years back worked without any issues?

Also, if the network is the bottleneck, wouldn't it make more sense for it to get congested with far more packets being sent at the same time, as when a smaller packet size is used?

In either of these scenarios it seems like all my node licences won't be of any use, and if the issue is Team Render related, I can see the Corona team pointing to Maxon to solve the issue with TR.
If you research, you can see TR issues with other engines as well as the C4D ones. So those issues have been here for a long time; it's no news. And that makes the choice to use TR by Corona even more strange.

As I stated in my emails, all I need is to be able to continue the production. As it seems now, that won't be possible.

Cheers
kizo

TomG:
Using TR isn't strange in that if there is a good existing solution, it makes sense not to spend development time replacing it. The jury is still out on whether TR is a good solution, though :)

TY for the further testing, it will be interesting to know the result of that! And as a note, we are not pointing fingers at Maxon, nor expecting them to solve it, in part since we are using TR in a somewhat different way than its native intention. What we are trying to find out is whether the way we use it supports 10 machines across a 1 gig network. If that turns out not to be the case, we will have to consider what the next steps will be. It's certainly a much rarer thing for people to be using that many clients, making it an edge case with much less data for us to investigate how things work in that scenario, which is why your testing is very valuable to us.

TomG:
The other interesting test would be changing the interval at which data is sent - this could also reduce the likelihood of collisions. That's the second part of the Manual TR settings for Corona, and may be something to test too. Unfortunately at the moment I don't have any suggested values to try, but raising that (so that machines send data less often) may also be a solution to get everything working as required. Hope this helps!
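
One way to pick interval values to try is to work backwards from a target link load. The sketch below is an assumption-laden estimate, not a recommendation from the Corona team: it assumes each client sends one full packet per interval over a shared 1 Gbit/s (125 MB/s) link, and the 50% target is arbitrary. The function name and parameters are hypothetical.

```python
import math

def min_interval_s(clients: int, packet_mb: float,
                   link_mb_per_s: float = 125.0,
                   target_utilization: float = 0.5) -> int:
    """Smallest whole-second interval that keeps average traffic at or
    below the target share of the link, assuming one full packet per
    client per interval."""
    budget_mb_per_s = link_mb_per_s * target_utilization
    return math.ceil(clients * packet_mb / budget_mb_per_s)

# 9 clients, with packet sizes taken from the tests in this thread.
for packet_mb in (50, 100, 200, 500):
    print(packet_mb, "MB ->", min_interval_s(9, packet_mb), "s")
```

Under these assumptions a 100 MB packet size with 9 clients would need an interval of about 15 s rather than 10 s to stay at half the link's theoretical capacity.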
