Topic: Full crash on AMD Threadripper 2990WX / Gigabyte X399: Known issues?

2019-05-11, 16:33:01

wider-reality

  • Active Users
  • **
  • Posts: 6
    • View Profile
Hello out there,

We have a workstation that we built for rendering only. We also use it as the main node: it runs a Cinema 4D instance that acts as the «Server» for Team Render.

From time to time the machine has crashed completely. By that I mean it is not just the application crashing or a blue screen: not even the power button responds, so I have to unplug the computer to switch it off and restart it. Since I installed the newer hotfix V2 last weekend, these crashes happen more often, in fact every time I try to render an animation (but not when I render only a still image).

At first I thought it must be purely a hardware error. That's why I ran several stress tests, keeping the computer at 100% CPU usage for about 5 hours, but no crash happened. I also used a tool to load the RAM while the CPU was running at 100%, but still no crash.

Then I rendered different animations with the Physical renderer only. They ran for several hours with no crash at all.

Only rendering animations with Corona reliably crashes the machine, after 30 minutes at the latest. I found one post here from another user with a Threadripper 1950X, but on a different mainboard. He solved the issue with a firmware update of his mainboard. So I did the same, but it seems this only made things worse: now the system crashes after 3-4 minutes of render time.

I understand that this is more likely a hardware issue. But as it happens only with Corona, I wanted to ask whether anyone else has seen such issues, or whether the developers might have a workaround.

Our system:
- Gigabyte X399 Designare mainboard
- AMD Threadripper 2990WX processor
- 32GB (2x16GB) of Corsair RAM
- ASUS RTX 2080 GPU
- 1000W PSU
- Windows 10 Pro, V1809

Thanks for any hint or help.
-Stephan-
« Last Edit: 2019-05-11, 16:45:29 by wider-reality »

2019-05-11, 17:48:09
Reply #1

Nejc Kilar

  • Corona Team
  • Active Users
  • ****
  • Posts: 1251
    • View Profile
    • My personal website
Hey Stephan,

Quick question, what did you use to stress test the CPU? Prime95?
Nejc Kilar | chaos-corona.com
Educational Content Creator | contact us

2019-05-12, 01:12:11
Reply #2

wider-reality

Hello nkilar,

I started with HeavyLoad for RAM and CPU stress testing. Because it only loaded the CPU to 60%, I ran StressMyPC at the same time.

I stopped the test after 5 hours.

What I have noticed so far is that only one scene causes reproducible crashes. It is a motion tracking scene that uses an MP4 file and renders out several channels (beauty, shadow and alpha).

Cheers,
-Stephan-

2019-05-13, 16:53:38
Reply #3

wider-reality

In the meantime I found a kind of workaround. After several tests with modified versions of the scene (a motion tracking scene, where we have now split the movie into single frames to rule out caching problems), I changed the Team Render settings from using all threads to using only 90% of them, so 58 instead of all 64 threads.

Since this change, the computer has been running «stable». At least it has now been rendering for hours instead of just minutes, and it is still running.
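For anyone who wants to check the numbers on their own machine, here is a minimal Python sketch of the arithmetic behind that thread cap. The helper name `capped_threads` is made up for illustration; it is not a Cinema 4D or Corona setting and only shows where 58 out of 64 comes from.

```python
def capped_threads(total_threads: int, fraction: float) -> int:
    """Number of render threads when only `fraction` of all threads is used.

    Hypothetical helper for illustration; the renderer's own rounding
    may differ from Python's round().
    """
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    # Always leave at least one thread for rendering.
    return max(1, round(total_threads * fraction))


# 2990WX: 32 cores / 64 threads, capped at 90%.
print(capped_threads(64, 0.90))  # 58
```

To get the thread count of the local machine instead of hard-coding 64, `os.cpu_count()` from the standard library can be passed as `total_threads`.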

Are other users perhaps experiencing problems with such a setup, an AMD Threadripper 2990WX on a Gigabyte X399 board?

2019-05-13, 18:23:59
Reply #4

Nejc Kilar

I'd give Prime95 a good run and have a couple of extra benchmarks running at the same time. This really sounds like a hardware issue to me but I can't really say much more.

I'm guessing either something with the PSU or RAM. :\

2019-05-13, 19:19:16
Reply #5

Juraj

  • Active Users
  • **
  • Posts: 4761
    • View Profile
    • studio website
Definitely run the latest Prime95. It uses AVX instructions, so it simulates Corona's load better. And note your temperatures (all CPU cores and the VRM).

Is your CPU cooler properly seated? What kind is it? Does your case get good airflow? The X399 Designare has an 8-phase VRM, so it will get very hot with the 2990WX (this board was not made for the 2990WX).

Please follow my new Instagram for latest projects, tips&tricks, short video tutorials and free models
Behance  Probably best updated portfolio of my work
lysfaere.com Please check the new stuff!

2019-05-14, 18:27:48
Reply #6

wider-reality

Thank you for your input, especially about Prime95! (Compared to the other stress tests, this program used all the power of my computer ;) )

I have now run Prime95 for five hours without any crash. With the animation we tried to render, crashes happened after 5 to 30 minutes.

We have rarely had crashes with other scenes so far (maybe 2-3 in 6 months), and I wouldn't worry if they were "normal" crashes.

But since these are reproducible complete crashes, where the display cuts out (black screen, no signal) and neither the keyboard nor the power button can reset the machine, I still have to ask myself whether it is a hardware or a software problem (my guess is a combined software/hardware problem on my mainboard).

About the temperatures and the PSU: the case and CPU temperatures are in the normal range, about 45°C in the case and a maximum of 65°C on the CPU in continuous operation. The VRM temperature also reads only between 45°C and 50°C. The cooling takes place via a liquid cooling with three fans.
The power supply is rated for 1000 watts, and the system draws a stable 400 watts or so in the stress test.

I'm afraid there's nothing else I can do at the moment but observe the situation.

Thank you very much for your support!

2019-05-14, 18:54:16
Reply #7

houska

  • Former Corona Team Member
  • Active Users
  • **
  • Posts: 1512
  • Cestmir Houska
    • View Profile
Hmm, this is really strange.

Everything points to a hardware fault, but only Corona and this particular scene together with 100% CPU usage are causing crashes, if I understand correctly. That would actually hint at a software issue. Normally, I would suggest trying to pinpoint the problem that causes the crash, but by your own account you have already been trying to do that.

Did you try to remove half of the scene and try to render it to see if it crashes? If it doesn't, try the other half and see if that part crashes. Once you identify the crashing part, start splitting it further. That way, you should find the culprit, if it's a single object or material/shader.
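The halving procedure above is a plain bisection. Here is a minimal Python sketch of it, assuming a single object or material is responsible. `find_culprit` and the `crashes` callback are hypothetical names: in practice `crashes` stands for you rendering the reduced scene by hand and noting whether the machine went down, and the object names in the demo are invented.

```python
from typing import Callable, Optional, Sequence


def find_culprit(
    objects: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect a scene's objects to find the one that triggers the crash.

    Assumes a single culprit. Returns None if the full scene does not
    crash, or if neither half crashes on its own (which would mean the
    crash needs objects from both halves together).
    """
    candidates = list(objects)
    if not candidates or not crashes(candidates):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half, second_half = candidates[:mid], candidates[mid:]
        if crashes(first_half):
            candidates = first_half
        elif crashes(second_half):
            candidates = second_half
        else:
            return None  # no single half reproduces the crash
    return candidates[0]


# Simulated run with invented object names; "footage_material" plays the culprit.
scene = ["floor", "tracker_plane", "shadow_catcher", "camera", "footage_material"]
print(find_culprit(scene, lambda subset: "footage_material" in subset))
```

Each step costs one or two test renders, so a scene with a few dozen objects needs roughly a dozen renders at most. With crashes taking up to 30 minutes to show up, a render that survives well past that can be treated as "does not crash".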

2019-05-15, 19:32:06
Reply #8

Juraj

Quote from wider-reality: "The cooling takes place via a liquid cooling with three fans."

But which AIO liquid cooler is it? Threadripper has 4 big dies, so the cooler needs to cover the whole heatspreader. For the 2990WX there is basically only one AIO cooler on the market (short of a custom loop) that does this, the Enermax Liqtech II.
Every other cooler covers the heatspreader only partially. That means that while the overall temperature can be around 65°C, some cores can spike much higher. It's much better to use an air tower like the Noctua NH-U14S TR4-SP3 than other AIO coolers.

Do you see any spikes (shown as 'max temp')?

Still, it should throttle, not crash. Did you update to the latest BIOS?