Topic: Team render issues

2019-02-07, 13:28:23

kizo

  • Active Users
  • **
  • Posts: 28
    • View Profile
Hi,

Like some other users, I also have issues with TR (Team Render). The issues are clients disconnecting and differences in output, plus other limitations like not being able to save in the CXR format.

As the beta will expire on 21 February, we are trying to prepare our workflow, but the TR issues seem to be blocking production. In its current state I feel TR is totally useless and unreliable, and in fact any node licences are worthless, which is a shame for a newly released product. I love Corona and hope to keep using it. We have NO issues working on single machines, but distributed rendering, using the full potential of our hardware, is essential.
I have been trying to get some help from official support, but unfortunately it takes way too long to get any answer, so I will try here. Maybe I will have better luck.



TEAM RENDER ISSUE: CLIENTS DISCONNECTING AND SLOW RENDER TIME

I have spent a lot of time testing things out. I'm not a network specialist, so it's rather hard to understand what's going on and whether the network itself is causing the issues, but I did a series of tests to identify how different settings affect the results. Note that we used the same machines and network setup with V-Ray DR and Thea Render distributed (CPU+GPU on the network) without similar issues.

We are running a 1 Gbit/s network and have a dedicated range of fixed IPs for the machines. I have adjusted all firewall settings, disabled Bonjour, and I'm using IP addresses to connect. I have followed the Cineversity series of videos on TR and made all the adjustments. The Corona license server is running. The tests were made with the hotfix release.

I made all tests using 9 machines.

From the tests I have made so far, one conclusion is that the smaller the packet size I set in the Corona render settings, the less often the nodes disconnect.

In fact, with a size of 5 or 10 and the interval always at 10 s, none of the nodes disconnected. With a size of 50 they start disconnecting, and with 100, 200 or even 500 they also disconnect, seemingly sooner. The error we get when a node disconnects is: "Frame synchronization failed: Communication error", but if I ping the node or test the connection it is available, so I can only assume that there were too many chunks at the same time, but I'm not sure. Also, if that's the case, having smaller chunks would mean a lot more of them at the same time... but when chunks are smaller no nodes disconnect at all.

A small chunk size seems to be OK for smaller resolution renders. The tests I made with a 1k square render, a small packet size of 10 and a 10 s interval rendered without any nodes disconnecting. With all nodes, the scene rendered to the set noise level in 5-6 min (the same scene on a single 2990WX took 19 min).

A 4k square render of the same scene shows some interesting results:

A) packet size = 10 MB // render time (stamped) = 0:42:00 // no nodes disconnected

B) packet size = 100 MB // render time (stamped) = 0:21:34 // 4 nodes disconnected (3 almost at the same time, some 3-4 min after starting the render)

Test A spent around a quarter of its time just collecting the chunks: it finished rendering in about 30 min and then took about 10 min to gather them all.

Test B finished rendering in 21 min. Even though 4 of the machines disconnected, this is way faster, but the issue is that not all of the available power is used.
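The two results can be put side by side with a little arithmetic. This is only a rough sketch using the figures quoted above; the 10 min collection time for test A is an approximate value, and the helper names are made up for illustration.

```python
# Stamped render times from the two 4k tests, in minutes.
test_a_total = 42.0            # 10 MB packets, no nodes disconnected
test_a_collect = 10.0          # approximate time spent collecting chunks
test_b_total = 21 + 34 / 60    # 100 MB packets, 4 nodes disconnected


def collect_fraction(total_min, collect_min):
    """Share of the total time spent only on collecting chunks."""
    return collect_min / total_min


def speedup(slow_min, fast_min):
    """How many times faster the second run was than the first."""
    return slow_min / fast_min


print(f"Test A collection share: {collect_fraction(test_a_total, test_a_collect):.0%}")
print(f"Test B vs test A: {speedup(test_a_total, test_b_total):.2f}x faster")
```

In other words, test B was nearly twice as fast even with four nodes lost, which is why the small packet size is not a satisfying workaround.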

I have tried many combinations, changing only the interval, only the packet size, or both. It's a very time-consuming process. In all my tests nodes disconnect with larger packets, while they don't with smaller ones, but then rendering takes way too long.


PICTURE VIEWER OUTPUT DIFFERENT FROM VFB

In all the tests made so far using TR through the PV (the only possible way), but on a single machine too, there is a difference: the image saved from the PV is always darker.
The C4D project settings are set to linear workflow and sRGB.

I also had a situation where the lights in the .jpg saved from the PV didn't match those in the .jpg saved from the VFB. It seems the VFB is showing and saving the LightMix, while the PV saves the beauty pass. That would explain why the color and intensity of the light were different. Please check the attached images.
If the above is true, how would one save a non-layered format out of the PV and have it match the VFB?

RENDER TIME INCONSISTENCY

While testing I noticed that different render times are reported for the same job. I'm not sure why that is, but it makes it harder to know the real render time and to estimate. Check the attached example.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It would be great to get some help from support, but I'm looking forward to user input too. Is anyone out there using TR without the above issues?

I must say I'm rather disappointed with how network rendering works. I have 3 x Corona + 3 nodes licences and one Corona + 10 nodes licence, along with 10 machines waiting to render. And I'm now facing a major slowdown in production and can't find a solution. I'm not getting any help from support either.
It feels like the beta stage should have lasted longer, even though it was rather long. Making distributed rendering work should be part of a released product, and unfortunately Cinema 4D Team Render doesn't seem to be the best choice for that.

Cheers
kizo









2019-02-07, 14:05:10
Reply #1

TomG

  • Administrator
  • Active Users
  • *****
  • Posts: 5434
    • View Profile
Hi Kizo!

To me, it sounds like your network is being overloaded with traffic. 10 machines is a lot to run on a 1 Gb network, and I know there are people with fewer machines who are using a 10GbE network, so I am thinking collisions on the network are the problem: "keep alive" messages sent between the machines to confirm they are still there are being lost, and so either the Client concludes the Server isn't there or vice versa.

This also fits with the fact that a larger packet size causes the dropouts while a smaller packet size does not. A test that might help confirm this: you mention "packet size= 100 mb// render time (stamped)=0:21:34// 4 nodes disconnected". It would be interesting to know whether a disconnect happens with the 100 MB packet size using only the workstation + 2 clients. Then workstation + 4 nodes, and so on. If the larger packet size works when fewer nodes are in use, that would also suggest the network is becoming overloaded, resulting in some keep-alives getting lost among data collisions.

(BTW, you should have heard all this from support already)

The other possibly interesting test would be to use Team Render but with one of the native C4D render engines - I haven't looked into that to see if it has a similar control over packet size, but if it does you could try and replicate the same settings there and see if that has the same issue too.

A search for 10gbe will show threads where people are using 10 gig ethernet networks, so it could be that this is just a lot of machines and a lot of data, more than the 1 gig network can manage. I haven't looked too far into the question for C4D in general, but I do see posts like this https://www.c4dcafe.com/ipb/forums/topic/92259-team-render-woes-bandwidth-issues/ where it is suggested that a gigabit network can do about 100 MB/s, a figure I thought was interesting since that is the packet size where you mention things start to drop out.
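To put numbers on the link speed, here is a minimal sketch that converts a nominal Ethernet rate into megabytes per second and estimates how long one packet occupies the wire. It assumes the Corona packet size is in megabytes and ignores protocol overhead, so real figures will be somewhat worse; the function names are illustrative only.

```python
def link_mb_per_s(gbit_per_s):
    """Theoretical payload rate of a link in megabytes per second."""
    return gbit_per_s * 1000 / 8


def packet_seconds(packet_mb, gbit_per_s):
    """Seconds one packet keeps the link busy at the theoretical rate."""
    return packet_mb / link_mb_per_s(gbit_per_s)


for gbit in (1.0, 10.0):
    print(f"{gbit:>4} Gbit/s = {link_mb_per_s(gbit):.0f} MB/s, "
          f"100 MB packet takes {packet_seconds(100, gbit):.2f} s")
```

A 1 Gbit/s link tops out at 125 MB/s in theory, so a practical figure of about 100 MB/s is plausible, and a single 100 MB packet then ties up the link for close to a second.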

We're continuing to look into this and discuss the subject of networking here, including considering whether some home grown networking solution may be better than C4D TR, but in the meantime this could just be a limitation of C4D TR.

If you can do those tests (100 MB packet size, fewer nodes; TR with a native render engine with similar packet sizes, if controllable), that could further confirm whether the network might be an issue. Let us know!
Tom Grimes | chaos-corona.com
Product Manager | contact us

2019-02-07, 14:14:34
Reply #2

kizo

Hi Tom,

I will try to do further testing.
But if the network is the issue, how would you explain that rendering with V-Ray DR or Thea Render distributed a few years back worked without any issues?

Also, if the network is the bottleneck, wouldn't it make more sense for it to get congested when far more packets are being sent at the same time, as when a smaller packet size is used?

In any of these scenarios it seems like all my node licences won't be of any use, and if the issue is Team Render related I can see the Corona team pointing to Maxon to solve the issue with TR.
If you research it, you can see TR issues with other engines as well as with the C4D ones. So those issues have been here for a long time; it's no news. And that makes Corona's choice to use TR even more strange.

As I stated in my emails, all I need is to be able to continue production. As it seems now, that won't be possible.

Cheers
kizo

2019-02-07, 14:25:39
Reply #3

TomG

Using TR isn't strange in that if there is a good existing solution, it makes sense not to spend development time replacing it. The jury is still out on whether TR is a good solution, though :)

TY for the further testing, it will be interesting to know the result of that! And as a note, we are not pointing fingers at Maxon, nor expecting them to solve it - in part, since we are using TR in a somewhat different way than its native intention. What we are trying to find out is whether the way we use it supports 10 machines across a 1 gig network. If that turns out to be the case, we will have to consider what the next steps will be. It's certainly a much rarer thing for people to be using that many clients, making it an edge case with much less data for us to investigate how things work in that scenario, which is why your testing is very valuable to us.

2019-02-07, 14:29:28
Reply #4

TomG

The other interesting test would be changing the interval at which data is sent - this could also reduce the likelihood of collisions. That's the second part of the Manual TR settings for Corona, and may be something to test too. Unfortunately at the moment I don't have any suggested values to try, but raising that (so that machines send data less often) may also be a solution to get everything working as required. Hope this helps!

2019-02-07, 14:38:21
Reply #5

kizo

Hi Tom,

thanks for the further explanation.

Please can you reply to my question:

But if the network is the issue, how would you explain that rendering with V-Ray DR or Thea Render distributed a few years back worked without any issues?

Also, if the network is the bottleneck, wouldn't it make more sense for it to get congested when far more packets are being sent at the same time, as when a smaller packet size is used?

Thanks
kizo


P.S. I understand 10 nodes might not be so common, and I'm willing to test anything you like just to get things working. If you need more specific tests, just let me know. I have a huge load of work but I will find the time for that.
« Last Edit: 2019-02-07, 14:43:03 by kizo »

2019-02-07, 14:41:25
Reply #6

TomG

I can't comment on V-Ray and Thearender as I don't know how they were using TR compared to how we use it.

As for packet size, not necessarily: smaller packets sent more often would leave more gaps for keep-alives to go in between them. Think of a highway: when lots of small cars are joining the highway, there is always a gap possible by delaying one car to slip in another one (a keep-alive), but when a truck made of 18 connected trailers is joining the highway, there is no gap, and our keep-alive car has to sit and wait for the whole super long truck to get onto the highway before it can join (by which time it may be too late for it to reach its destination).

(EDIT: which is why sending the super large trucks less often may open up gaps for the keep-alives to get in there, as one possibility)
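The highway picture can be turned into a toy model. The sketch below is not how Team Render actually schedules traffic; it assumes a single first-in-first-out link, all nodes sending their packet as one burst at the start of every interval, packet sizes in megabytes, and about 100 MB/s of usable bandwidth. The function name and the keep-alive period are hypothetical.

```python
def max_keepalive_delay(packet_mb, nodes, interval_s,
                        link_mb_per_s=100.0,
                        keepalive_every_s=1.0,
                        duration_s=60.0):
    """Worst wait for a keep-alive that must queue behind a render burst.

    Valid only while one burst is shorter than the interval.
    """
    burst_s = nodes * packet_mb / link_mb_per_s  # time the burst holds the link
    worst = 0.0
    t = 0.0
    while t < duration_s:
        offset = t % interval_s          # time since the current burst started
        if offset < burst_s:             # link is still busy with the burst
            worst = max(worst, burst_s - offset)
        t += keepalive_every_s
    return worst


for size in (10, 100):
    delay = max_keepalive_delay(packet_mb=size, nodes=9, interval_s=10.0)
    print(f"{size:>3} MB packets: keep-alive may wait up to {delay:.1f} s")
```

Under these assumptions a keep-alive can be stuck for several seconds behind 100 MB packets but under a second behind 10 MB packets. Whether that is long enough to trigger a disconnect depends on the keep-alive timeout, which is not stated in this thread.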

2019-02-07, 14:47:33
Reply #7

kizo


A good description of the packet size thing, thanks for that; much better suited to my non-tech brain.

As for not commenting on V-Ray or any other engine, that's a bit of a political answer. I don't know the ins and outs of the tech behind it, but I do know I rendered distributed on the same network with 10 and more machines and never had issues. From my POV it's a good indication that something is wrong with either TR or Corona's usage of it. BTW, none of them used TR; that might tell something.
« Last Edit: 2019-02-07, 14:53:00 by kizo »

2019-02-07, 15:34:22
Reply #8

kizo

Tom, I would also like to get an answer on the color difference issue, if possible. Here is the issue again for easier following, but please check the attachments in the 1st post in this thread.

Thanks


"PICTURE VIEWER OUTPUT DIFFERENT FROM VFB

In all the tests made so far using TR through the PV (the only possible way), but on a single machine too, there is a difference: the image saved from the PV is always darker.
The C4D project settings are set to linear workflow and sRGB.

I also had a situation where the lights in the .jpg saved from the PV didn't match those in the .jpg saved from the VFB. It seems the VFB is showing and saving the LightMix, while the PV saves the beauty pass. That would explain why the color and intensity of the light were different. Please check the attached images.
If the above is true, how would one save a non-layered format out of the PV and have it match the VFB?
"

2019-02-07, 16:05:19
Reply #9

TomG

I'll have to leave that one to someone else, as I have no knowledge or research done into that one :) One for Ben, or the devs!

2019-02-07, 17:53:04
Reply #10

houska

  • Former Corona Team Member
  • Active Users
  • **
  • Posts: 1512
  • Cestmir Houska
    • View Profile

Hi kizo,

after reading your description, it seems to me that you somehow managed to save only the non-lightmix version of the image out of the PV. The functionality of the "Save as..." option of the C4D Picture Viewer depends on the current layer mode (Image vs. Single-Pass vs. Multi-Pass) and possibly on what layer you have selected.

As for the different lightness, we are aware of a very slight (almost imperceptible) difference between the PV and the VFB, and it's probably a result of different sRGB handling. But it seems to me (based on your description) that you have a much bigger difference between the two. Might I ask whether you have any PostProcessing filters enabled? And if so, are the results from the PV and the VFB the same after disabling PostProcessing?
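For illustration only, here is one common way two viewers can disagree slightly on "sRGB": one applies the piecewise sRGB transfer function and the other a plain 2.2 gamma. Whether the PV and the VFB actually differ in this way is an assumption, not something confirmed in this thread.

```python
def srgb_encode(linear):
    """Piecewise sRGB transfer function for a linear value in [0, 1]."""
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055


def gamma22_encode(linear):
    """Plain power-law approximation often used in place of sRGB."""
    return linear ** (1 / 2.2)


for value in (0.001, 0.01, 0.18, 0.5, 1.0):
    s, g = srgb_encode(value), gamma22_encode(value)
    print(f"linear {value:<5} sRGB {s:.4f}  gamma 2.2 {g:.4f}  diff {s - g:+.4f}")
```

The two curves stay within a few percent of each other, with the largest gaps in the darkest tones. That fits a barely visible lightness shift, but not a clearly darker image or differently colored lights.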

2019-02-07, 19:53:47
Reply #11

Nelaton

  • Active Users
  • **
  • Posts: 56
    • View Profile
Hello Houska,

I think kizo is saying that we cannot automatically save the result of the LightMix interactive pass as it appears in the Corona frame buffer.

Moreover, besides this problem of mismatch between the two viewers, we see the exact same problem as kizo regarding Team Render and stills rendering: clients are disconnecting when raising the packet size + interval in manual mode. So I +1 everything he says; to us, the problem is not on our side.

Regards,

Nelaton
« Last Edit: 2019-02-07, 20:01:57 by Nelaton »

2019-02-07, 19:57:53
Reply #12

TomG

As a note, raising the packet size is what may cause clients to disconnect (larger packets, no room for keep alives to be sent across). Raising the packet size is only recommended for high resolutions, where the default packet size may be too small (causing slow rendering). Smaller packet sizes make disconnects less likely (we believe, if it is network traffic causing this, which is what we hope to find out from the tests).

How many clients are you using to render, and what packet sizes are you setting, when you get those disconnects? That information would be very useful to the issue at hand.

Cheers!
   Tom

2019-02-07, 20:19:52
Reply #13

Nelaton

Hi Tom,
We are using 8 clients, the value we tested is 100 MB for the packet size, and we render A3 at 144 dpi.
We also tested smaller packet sizes, but the render time increased so drastically that it was preferable for us to render locally.
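For reference, A3 at 144 dpi can be converted to a pixel resolution as follows. This is a small sketch assuming the standard ISO A3 sheet of 297 x 420 mm; the helper name is made up.

```python
MM_PER_INCH = 25.4
A3_MM = (297, 420)  # ISO 216 A3, short side x long side


def print_pixels(size_mm, dpi):
    """Pixel dimensions of a print size at a given resolution."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in size_mm)


width, height = print_pixels(A3_MM, 144)
print(f"A3 at 144 dpi: {width} x {height} px "
      f"({width * height / 1e6:.1f} megapixels)")
```

That works out to roughly 4 megapixels per still.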

One remark: when rendering animations (with TR), we have no clients disconnecting and decent render times per frame (around 10 min/frame).
Not sure this is a solution, but I'm wondering why it works (clients are not disconnecting) with animations and not with still-frame rendering.
 
Cheers,

Nelaton
« Last Edit: 2019-02-07, 20:35:21 by Nelaton »

2019-02-07, 20:37:30
Reply #14

TomG

Thanks Nelaton! It is still pointing to the same issue in that case, that the network is getting overloaded (is it by chance a 1 Gig ethernet, rather than a 10 Gig ethernet? Sorry for the questions, but all this information helps us a lot!). The fact that this is a large number of Clients again also suggests that, along with your packet size, great info thanks!

For animations, if you are using the Team Render Server, that would also make sense, as the Clients are not sending back packets of results but only a single image once completed, so they are "silent" across the network other than keep alives while they are rendering (while TR to Picture Viewer has the clients frequently sending back their latest results, in whatever packet size is set).

I wonder if raising the Client Update Interval would help in these cases - larger packets can be sent less often, which should reduce network congestion. I don't have any good figures from experimentation, but maybe 30 seconds, or even 60 seconds, or 90 seconds would be good (the progress of the render isn't so important here, as you should already have a good idea of what the final result will look like - you just want all machines working effectively to produce the final image, so if you don't get updates from them except every 30 seconds that shouldn't disrupt workflow, and may ease the problem of network traffic).
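As a back-of-the-envelope check on the suggested intervals, the sketch below estimates what share of the link the render packets would occupy. It assumes every client sends one full packet per interval, packet sizes are in megabytes, and the link delivers about 100 MB/s; none of these are confirmed Team Render behaviour, and the function name is illustrative.

```python
def link_utilisation(nodes, packet_mb, interval_s, link_mb_per_s=100.0):
    """Fraction of the link's capacity consumed by render packets."""
    return nodes * packet_mb / (interval_s * link_mb_per_s)


for interval in (10, 30, 60, 90):
    share = link_utilisation(nodes=8, packet_mb=100, interval_s=interval)
    print(f"interval {interval:>2} s -> about {share:.0%} of the link busy")
```

With 8 clients and 100 MB packets, a 10 s interval would keep the link busy most of the time, while 30 s or more leaves it idle most of the time, which is the effect the longer interval is meant to achieve.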

Thanks for your patience and information while this gets researched and investigated further (both you and Kizo too)