Anim8or Community
		General Category => Ongoing Anim8or Development => Topic started by: Steve on July 08, 2018, 03:08:38 pm
		
			
			- 
				I just posted a new development release  v1.01.1329 (http://www.anim8or.com/download/preview/files/animcl1329.zip) dated July 7, 2018 that uses all of your CPU's cores to render ART images. It also has an improved FastAA algorithm that improves shadow edges, especially for soft shadows, and properly ignores shadows from objects marked to not cast shadows.
 
 Note: There are some minor issues, particularly with noise in movies when the image isn't changing. These are caused by the separate threads calling the random number generator in a different order for each render. I'm working on fixing this.
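One common way to make noise independent of thread scheduling is to give each pixel its own random generator, seeded from the pixel's coordinates, instead of sharing one generator across threads. The sketch below is a hypothetical Python illustration, not Anim8or's code. `pixel_samples` and `render` are made-up names, and the seed deliberately leaves out the frame number so an unchanged scene gets identical noise in every frame.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def pixel_samples(x, y, n, seed=0):
    """Return n jittered (dx, dy) sample offsets for pixel (x, y).

    The generator is seeded from the pixel coordinates only, so the offsets
    are the same no matter which thread renders the pixel or in what order.
    Adding the frame number to the seed would make the noise change per frame.
    """
    rng = random.Random(f"{seed}:{x}:{y}")  # str seeds are hashed deterministically
    return [(rng.random(), rng.random()) for _ in range(n)]

def render(width, height, n, workers, row_order):
    """Compute sample offsets for every pixel, visiting rows in row_order."""
    def render_row(y):
        return y, [pixel_samples(x, y, n) for x in range(width)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = dict(pool.map(render_row, row_order))
    return [rows[y] for y in range(height)]

serial = render(8, 6, 4, workers=1, row_order=range(6))
threaded = render(8, 6, 4, workers=4, row_order=reversed(range(6)))
print(serial == threaded)  # True
```

Because each pixel owns its random sequence, the thread count and visiting order no longer affect the result.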
 
 Please give this build a try. Let me know of any problems that you have, and how much it speeds up your renders.
- 
				Edit | Rotate is disabled for component selections in this build. Is that intentional?
			
- 
				Steve, thanks!
 
 However, there's a typo in the link above: it has a spurious "[b" at the end. Here's a corrected url:
 
 http://www.anim8or.com/download/preview/files/animcl1329.zip
 
 
- 
				nemyax: Oops! It looks like Edit | Rotate was disabled in build 1325 as well. I'll look into it.
 
 selden: Thanks, I fixed it :)
- 
				Hi Steve
 
 I've got Raxx's PHUR script loading at startup. When I start 1329 I get:
 
 Compiling "C:\Program Files (x86)\anim8or\scripts\PHUR v1.4.a8s":
 2531 lines 0 errors
 Compiling "C:\Program Files (x86)\anim8or\scripts\PHUR_Comb.a8s":
 error on line 47: undefined member reference 'LockAmbiantDiffuse'
 error on line 47: incompatible types in assignment
 47 lines 2 errors
 
 I'm on Win7 x64.
 
 (This is not something I'm using at the moment; it's just left over from experiments. Hope this is useful feedback.)
 
 Thanks for the hard work :)
 
 
- 
				Hi Steve
 
 I have a simple 200-frame, 1920 x 1080 animation on a green background, rendered with ART, anti-aliased, using the Xvid codec.
 I'm running on an 8 core i7 4800MQ 2.7GHz.
 
 1329 renders in 31 minutes  :(  [processor is running at 96% during render]
 1321 renders in 15 minutes!     [ processor is running at 15% during render]
 
 Both tests are running with the PC using about 5GB of 8GB RAM, anim8or is using about 250MB during both tests.  Multithreaded is checked.
 
 Let me know if I might have set it up incorrectly.
 
 
- 
				AlecJames: I haven't tested on an 8 core CPU yet. It may be using too much cache. Try adding the int attribute ChunkSize with a value of 32 and see if that helps. (Use the menu command Scene->Attributes to add an attribute)
 
 Limiting the number of threads might help, too, but unfortunately the attribute I added to do that is broken.
 
 AlecJames: Oops! I broke it in v1.01.1325. I'll fix it!!!
- 
				I ran some rough tests, ChunkSize vs. frames rendered in 1 min:
 
 ChunkSize   Render speed (frames per min)
 Not set     5.5
 16          5
 32          1.9
 64          3.5
 128         7
 256         12.6
 384         18.5
 448         18
 512         21.5
 640         20
 768         19.5
 1024        15
 4096        9.5
 
 Hope it helps :)
 
- 
				It looks like the default setting of ChunkSize = 100 is far from optimal for your computer and your scene. This may take a bit of experimenting to figure out how to automatically choose the right parameters.
			
- 
				Gave it a shot. The new build is definitely rendering it slower, but the image seems to come out cleaner in 1329 than in 1321.
 
 1321 apparently has the FastAA setting in it as well, so if it was coded in that version then I'm not sure why it's looking cleaner. Is there a setting you programmed in that's increasing render quality and slowing down render times in 1329? The number of rays seems to double in the new version.
 
 Windows 10 Home
 Intel i7-6820HK 2.70 GHz (Quad core, 8 threads)
 
 Below are the results.
 
 Also discovered some clear spots in the shadowing for the 1329 renders.
 
- 
				I don't know my threads from my cores  :-[  :)
			
- 
				AlecJames: A core is a complete set of hardware registers, etc., and can run any program. Cores execute in parallel, independently of each other. So a 4 core CPU can continuously execute 4 programs at the same time. If it is executing more, then all but 4 are paused at any given time, and the OS swaps them in and out as necessary to make it appear that they are all always running.
 
 A thread is a stream of instructions being executed by the CPU, i.e. software. You can run many, many threads at once. Windows normally has dozens and dozens of them active. But at any one moment in time, there can only be as many executing as there are cores.
 
 When Anim8or is rendering with the ART ray tracer, it uses as many threads as there are cores.
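A minimal sketch of that thread-per-core setup, with worker threads pulling image rows from a shared queue. It is a hypothetical Python illustration, not Anim8or's code: `render_image` and the `shade` callback are invented names, and because of CPython's global interpreter lock these threads show the structure rather than a real speedup. Note that `os.cpu_count()` returns the number of logical processors, so a 4-core hyper-threaded CPU reports 8.

```python
import os
import queue
import threading

def render_image(width, height, shade, num_threads=None):
    """Render rows in parallel, one worker thread per logical CPU by default.

    shade(x, y) stands in for the per-pixel ray tracing work.
    """
    if num_threads is None:
        num_threads = os.cpu_count() or 1  # logical CPUs; may be None, so fall back to 1
    rows = queue.Queue()
    for y in range(height):
        rows.put(y)
    image = [None] * height

    def worker():
        while True:
            try:
                y = rows.get_nowait()
            except queue.Empty:
                return  # no rows left, so this thread is done
            image[y] = [shade(x, y) for x in range(width)]  # each row written by one thread

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image

img = render_image(4, 3, lambda x, y: x + 10 * y)
print(img[2])  # [20, 21, 22, 23]
```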
- 
				Raxx: Are you using ambient occlusion? I haven't tried any examples with the new fast AA.
 
 I changed the fast AA heuristics. It should make soft shadows better and faster, among other things. Here's how it works for N samples/pixel:
 
 1. A small number of samples, sqrt(N), is taken. If all of the following criteria are satisfied, then those are averaged for the final value:
 
 a) all samples are of the same material, including the closest sample in the 4 adjacent pixels,
 b) the min Z and max Z differ by less than 5%, including the closest sample in the 4 adjacent pixels,
 c) each light is either visible by all samples or by none, including the closest sample in the 4 adjacent pixels,
 d) the average normals of this pixel and the 4 adjacent pixels are within 30 degrees of each other.
 
 2. Otherwise, all N samples are evaluated (a "full evaluation").
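A minimal sketch of the four checks above as a hypothetical Python function. The `Sample` fields and function names are invented, and two readings are assumptions: the 5% depth tolerance is taken relative to the farthest depth, and (d) is read as "this pixel's average normal is within 30 degrees of each neighbor's average normal".

```python
import math
from dataclasses import dataclass

@dataclass
class Sample:
    material: int    # id of the material hit by this sample
    z: float         # depth of the hit point
    light_mask: int  # bit k is set if light k is visible from the hit point
    normal: tuple    # unit surface normal (nx, ny, nz)

def average_normal(samples):
    """Normalized mean of the sample normals."""
    sx = sum(s.normal[0] for s in samples)
    sy = sum(s.normal[1] for s in samples)
    sz = sum(s.normal[2] for s in samples)
    length = math.sqrt(sx * sx + sy * sy + sz * sz) or 1.0  # avoid dividing by zero
    return (sx / length, sy / length, sz / length)

def angle_deg(a, b):
    """Angle in degrees between two unit vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def fast_aa_ok(first, neighbor_nearest, neighbor_normals,
               z_tolerance=0.05, max_angle=30.0):
    """True if the first sqrt(N) samples may simply be averaged.

    first            -- the sqrt(N) samples taken for this pixel
    neighbor_nearest -- the closest sample from each of the 4 adjacent pixels
    neighbor_normals -- the average normal of each of the 4 adjacent pixels
    """
    group = list(first) + list(neighbor_nearest)
    # (a) a single material across the group
    if len({s.material for s in group}) != 1:
        return False
    # (b) depth spread under 5% (assumed relative to the farthest depth)
    zs = [s.z for s in group]
    if max(zs) - min(zs) >= z_tolerance * max(zs):
        return False
    # (c) every light is visible from all samples or from none
    if len({s.light_mask for s in group}) != 1:
        return False
    # (d) this pixel's average normal is close to each neighbor's
    mine = average_normal(first)
    return all(angle_deg(mine, n) <= max_angle for n in neighbor_normals)

up = (0.0, 0.0, 1.0)
pixel = [Sample(1, 10.0, 0b01, up) for _ in range(4)]
neighbors = [Sample(1, 10.2, 0b01, up) for _ in range(4)]
print(fast_aa_ok(pixel, neighbors, [up] * 4))  # True
pixel[0] = Sample(1, 10.0, 0b00, up)           # one sample now sits in shadow
print(fast_aa_ok(pixel, neighbors, [up] * 4))  # False
```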
 
 The previous fast AA heuristic used color contrast between the samples and did not use materials or visibility of lights.
 
 You can see what criteria are used with the environment variable int ShowFastAA.
 
 1 - Green is fast AA, black is full evaluation,
 2 - Red is large delta Z, gray is fast AA, black is other full evaluation,
 3 - Yellow is different lights visible,
 4 - Cyan is divergent normals,
 5 - Magenta is multiple materials,
 6 - Orange is reflective material (currently disabled),
 7 - Blue is large color contrast (currently disabled),
 8 - Violet is multiple materials, not counting adjacent pixels' nearest samples.
 
 I attached renders of a room with 4 soft shadow local lights that show how this works.
 
- 
				Steve: When I saw Raxx had said "Quad core, 8 threads", I wondered why, because I always thought that I could have thousands of threads running on as many CPU cores as I saw in the Task Manager performance tab. Why 8 threads?
 
 I found that CPU threads relate to hyper-threading, where a single CPU core runs 2 (hyper) threads to act as 2 virtual cores. They context switch when one thread is idle waiting on some resource. I guess you don't get 2x performance, but it's better than 1x. (I found a really good description this morning but can't find it now :( )
 
 Since they are virtual cores, they appear in Task Manager as separate cores (so I thought I had 8 cores, but I've got 4 x 2 hyper-threaded cores). I must have seen the "hyperthreading" term loads of times over the years and just assumed it was some marketing term :)
 
 
 
- 
				It is possible to achieve an 8x throughput improvement with a quad core, 8 thread Intel CPU. Simplifying somewhat, each thread is only allocated every other cycle. So if 1 thread uses 1000 cycles of CPU time and finishes in 2000 cycles of clock time, then 2 threads can run on the same core and finish in 2001 cycles of clock time. So Raxx's CPU could run up to 8x faster than a single-threaded render.
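The same idealized arithmetic written out, assuming, as the example does, that each thread can only use every other cycle, and ignoring memory bandwidth and cache limits:

```latex
\begin{aligned}
t_{\text{1 thread}} &= 2000 \text{ clock cycles} \quad (1000 \text{ of them executing}) \\
t_{\text{2 threads, 1 core}} &\approx 2001 \text{ clock cycles} \\
\text{gain per core} &= \frac{2 \times 2000}{2001} \approx 2 \\
\text{ideal speedup} &\approx 4 \text{ cores} \times 2 = 8
\end{aligned}
```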
 
 There are other resources that limit the performance: memory bandwidth, cache size, etc., so in practice the actual result is less than the theoretical maximum. Changing the "chunk" size in Anim8or changes how memory and the cache are accessed so, depending on the details of the scene, the optimal size can vary greatly.
 
 I'm also working on other ways to limit the memory requirements.
- 
				Steve,
 
 Can you provide a way to limit the number of threads to fewer than the number of virtual cores available?
 
 It'd be nice to be able to do some interactive work (perhaps improving a model or text editing) on the computer while a render is in progress with minimal interference from the rendering.
- 
				selden: Yes, there will be a max threads setting in the next drop.
			
- 
				Great!
			
- 
				While rendering an image, after a while there might be one or two threads left running for a long time, that are rendering spots that have a lot going on. I think it'd be nice if the renderer would automatically split remaining chunks into smaller chunks and assign them to different threads, continuously until the render is completed, up to a minimum chunk size. Possible?
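A simpler, well-known relative of this idea is guided self-scheduling (the scheme behind OpenMP's `schedule(guided)`): rather than splitting chunks that are already running, hand out large chunks early and progressively smaller ones as the remaining work shrinks, never going below a minimum size. This hypothetical Python sketch shows that swapped-in technique, not anything Anim8or currently does; `guided_chunks`, the 1000-unit workload and the 32-unit minimum are assumptions.

```python
import math

def guided_chunks(total, threads, min_chunk=32):
    """Yield chunk sizes for guided self-scheduling.

    Each chunk is the remaining work divided by the thread count (rounded up),
    but never smaller than min_chunk, except for the final leftover piece.
    """
    remaining = total
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / threads))
        size = min(size, remaining)  # the last chunk may be smaller
        yield size
        remaining -= size

sizes = list(guided_chunks(1000, threads=4))
print(sizes)  # [250, 188, 141, 106, 79, 59, 45, 33, 32, 32, 32, 3]
```

The big early chunks keep scheduling overhead low, while the small trailing chunks let idle threads share the last of the work.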
			
- 
				Raxx: Rendering with smaller chunks can also help. I'm still experimenting with ways to handle trailing chunks.
 
 As to why your example is slower with 1329, it's caused by the light mask heuristic. Because ambient occlusion sends "shadow" rays in all directions, different samples for the same pixel on much of the models can see or not see the sky. This causes FastAA to revert to a full evaluation of all 256 samples/pixel. In the attached image these are the yellow pixels. The fast pixels are dark gray.
 
 The other image is from 1325 where the orange pixels are the fast ones. As you can see almost the entire image is "fast".
 
 I'm trying to come up with better heuristics for ambient occlusion.
- 
				Nice. This was a long time coming. And now I finally have a beast of a desktop to take advantage of it too. It's interesting to see which parts the renderer struggles with, though. Often what I assumed would be the simplest turned out to take the longest. 
			
- 
				HOLY CRAP!!! Multi-Threaded rendering has arrived, testing NOW :)
 
 Trev
 
 rendered a test (Diffuse Inter-Reflection test as shown on another topic)
 
 83 secs down to 14 !!!! WOW
 
 Also, anyone using DIR should disable FastAA, as it's actually slower since it always has to do a "full evaluation".
 
 My only suggestion now would be to "build" an image interlaced, i.e., blocks of solid -> blocky(coarse) -> final image(fine)
 
 Also, my tests show that ChunkSize 32 is the best for speed, anything bigger gets held up on one or more chunks (like the mirror ball).
 32, on the other hand, is small enough that each chunk finishes in a timely manner, and since the remaining work isn't tied to a particular thread, any free thread can "take over" what still needs rendering.
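The effect of chunk size can be reproduced with a toy model. This hypothetical Python sketch simulates threads pulling fixed-size chunks from a shared queue in image order, with the earliest idle thread always taking the next chunk, and reports when the last chunk finishes. The scene is invented: 1080 rows costing 1 unit each, except a 200-row band (standing in for something like the mirror ball) that costs 5 units per row. Chunks are counted in rows purely for illustration; the thread doesn't say what unit Anim8or's ChunkSize uses.

```python
import heapq

def makespan(row_costs, chunk_rows, threads):
    """Finish time when `threads` workers pull fixed-size chunks from a shared queue.

    Chunks are handed out in image order; whichever thread becomes idle
    first takes the next one.
    """
    chunks = [sum(row_costs[i:i + chunk_rows])
              for i in range(0, len(row_costs), chunk_rows)]
    idle_at = [0.0] * threads  # min-heap of the times each thread becomes idle
    for cost in chunks:
        start = heapq.heappop(idle_at)  # earliest-idle thread takes this chunk
        heapq.heappush(idle_at, start + cost)
    return max(idle_at)

# Hypothetical scene: rows 400-599 are 5x more expensive than the rest.
costs = [5.0 if 400 <= row < 600 else 1.0 for row in range(1080)]
for size in (32, 128, 512):
    print(size, makespan(costs, size, threads=8))
# 32 256.0
# 128 576.0
# 512 960.0
```

With 8 threads the ideal finish time is 1880 / 8 = 235 units, so 32-row chunks land within 10% of it, while larger chunks leave one thread grinding through the expensive band. The model ignores per-chunk overhead, which is why very small chunks eventually stop paying off.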
 
 Can you up the thread count to 16 too? I'm only using 70% CPU (anyone have anything more?)
 
- 
				Well, this is all rather exciting! Unfortunately I'm in the middle of a really hectic fortnight (and away from home for most of it), so I can't test this update straight away, but I'll give it a spin as soon as I'm able.
			
- 
				Trevor: it doesn't really help to use more threads than there are in your CPU. If your CPU isn't 100% busy, it's probably waiting on memory (i.e. there is more data being accessed than will fit in the CPU's cache). It might even speed up the render to use fewer threads, depending on the data set.
 
 On my CPU (2 hyper threaded cores, 4 threads total) most renders use 100% of the CPU.
 
- 
				so, on your dual core, how many threads do you run? 2 or 4? (Since hyperthreading makes windows think you have 4 logical cores)
 I have 8 cores, 16 logical cores and not all cores light up, only 8 do at a time. (obviously some cores also run other processes like windows, but they are not full)
 
 Trev
- 
				I have 2 physical cores and 4 logical cores = 4 threads. I suspect that you have 4 physical cores and 8 logical cores; that's what Intel calls an "8 core CPU".
			
- 
				Steve: these are not problems for me but I noticed recently and while you are in ART- 
 
 #096-014 - ART Renders Don't Show Panaroma Backgrounds
 
 I think there is also an issue with image background.
 
 A second camera does not show the panorama in the work space or rendered view.
 
- 
				AlecJames: Thanks, I'll look into it.
			
- 
				I have 2 physical cores and 4 logical cores = 4 threads. I suspect that you have 4 physical cores and 8 logical cores; that's what Intel calls an "8 core CPU".
 
 
 Erm... no... I have 8 cores, 16 threads, I'm not mistaken :P
 
 Hmm, looking at the cores, it would seem that they are all doing something, since right before pressing render they were all more or less idle; however, only 8 red lines ever show up in the preview.
 
 Maybe I'm missing something?
 
 OK, so after doing some tests with affinity and priority, it seems you were right when you commented earlier that fewer cores could be faster: limiting Anim8or to the first thread of each core actually works out faster than allowing all threads to run.
 
 
 Trev
- 
				Trevor: You are correct. Your CPU has 8 multi threaded cores for 16 threads. Anim8or currently limits the max thread count to 8 but I'll increase that to 16 in the next build.
 
 (P.S. Can I exchange CPUs with you :) ?)
- 
				This is a game changer!  Rendering speeds are fantastic, attached took about 2 minutes with AA at 100 on i7 quad core (8 threads), image size 1920x1080.  No problems to report at this stage, testing animation at the moment.
			
- 
				I've done some more testing and run some render time comparisons between Anim8or V1.0, Anim8or build 1329 with multi-threading, and Carrara. It's tricky to run a fair comparison with Carrara since it is an entirely different package with a different set of parameters, but I have attempted to match the AA, materials and lighting as closely as possible on a quick still life with lots of lights (14), ART attributes and soft shadows. Materials and lights in Carrara have been set to match Anim8or as closely as possible, and the model is identical. Renders are attached and times are below:
 
 Anim8or V1.0 AA100: 50m 11s
 Anim8or V1.0 AA100 (fast AA): 11m 10s
 Anim8or build 1329 AA100 multi-threading on: 5m 47s
 Carrara: 9m 5s
 
 There are some minor issues with the 1329 multi-thread render (some previously noted):
 
 - Graininess in the lower-right of the glass ball
- Apparent lack of AA on objects visible behind the glass ball - some jaggy edges (may be the same issue as above)
- Slight step in shadow at near corner of wooden ramp, though it is actually more pronounced on both V1.0 renders
 
 Aside from this, the handling of shadows in general and the quality of the lighting are significantly better in 1329 than in V1.0 and, remarkably, it rendered quicker than Carrara. This isn't a 'Carrara vs Anim8or' thread by any means, as both have their advantages (Carrara for its massively powerful materials engine, Anim8or for its workflow and simple UV editor); this comparison is only about rendering times.
- 
				ENSONIQ5: Nice comparisons. I'm really pleased at how well the latest Anim8or compares to Carrara. Carrara's lighting and glass effects are noticeably better, but overall Anim8or's speed and soft shadows seem to match Carrara's.
			
- 
				Agreed, they compare very well indeed and the speed is really impressive.  If anything I think the Anim8or 1329 render has the edge, there is a richer saturation and it's a more pleasing render overall.
			
- 
				Those Christmas lights just scream "Corona required" :P
 
 hmm, actually, just having a think about it, how hard would it be to have 2d sprites attached to the Z-Near and locked to the on-screen position of "target" nodes?
 
 Bit more of a "Game Engine" feature than a "3D Modeler" feature, but it would be great for renderings (scanline or ART)
 
 Trev
- 
				Trevor: Definitely possible, I'm planning an animation test with multi-threading and will include coronas :)
			
- 
				Oh, haha, I know we can do it manually; I meant it more as an An8 feature for Steve :P It would certainly make ART unique.
 
 haha, I couldn't resist testing my old lightsaber in the new ART
 
 10:33 for full self illumination (DIR) with 64AA - there are no "Lights" in the scene
 
 Final image has corona as close to Z-Near as possible.
 
 Trev
- 
				Trevor, what do you mean by "has corona as close to Z-Near as possible"?
			
- 
				Trevor: I just posted build 1330 (http://www.anim8or.com/download/preview/files/animcl1330.zip) with max-threads of 16. Give it a try!
 
 AlecJames: Backgrounds of all kinds (Panorama, Image and Cube Map) should work for all camera views in all renderers in build 1330.
 
 selden: the int attribute MaxThreads sets the maximum number of threads. So if you have a 4 core CPU you can set it to 3 to leave a core free for you to use while it's rendering. Alternatively, you can set it to -1 to use all but one thread, -2 to use all but two, etc.
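A hypothetical sketch of how a MaxThreads value like this might resolve to an actual thread count. `resolve_max_threads` is an invented name. Treating 0 or an unset attribute as "use all threads", capping positive values at the number of logical CPUs, and never going below 1 are assumptions; the cap of 16 reflects the limit in build 1330.

```python
import os

THREAD_CAP = 16  # build 1330 raises the maximum thread count to 16

def resolve_max_threads(max_threads=None, logical_cpus=None):
    """Turn a MaxThreads attribute value into a thread count (hypothetical)."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    if not max_threads:    # assumption: unset or 0 means "use all threads"
        count = logical_cpus
    elif max_threads > 0:  # explicit count, assumed capped at the CPU count
        count = min(max_threads, logical_cpus)
    else:                  # -1 = all but one thread, -2 = all but two, ...
        count = logical_cpus + max_threads
    return max(1, min(count, THREAD_CAP))

print(resolve_max_threads(3, logical_cpus=4))     # 3
print(resolve_max_threads(-1, logical_cpus=4))    # 3
print(resolve_max_threads(None, logical_cpus=32)) # 16
```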
 
 Raxx: the main difference with the multi-threaded Fast AA for your scene is the way Anim8or handles soft shadows.
 
 Previously Anim8or would take sqrt(n) samples for n-sample anti-aliasing and check the maximum color difference between them. If it was too large, Anim8or would take another sqrt(n) samples. Then, if the color still wasn't converging fast enough, it would sample the full number.
 
 Now the color difference isn't used, because highly varying textures could cause a slowdown. Instead Anim8or uses a mask of which lights are visible for each sample. If the mask differs among the first sqrt(n) samples, then Anim8or samples all the rest. This works very well for normal shadow edges and for soft shadow edges that aren't "too soft", i.e. where the number of partially shadowed pixels is small. Your scene has several soft shadowed lights that are very close to the objects, which means a lot of pixels are softly lit, so Anim8or takes all samples for them. Since you are using 100-sample FastAA, this adds a lot of time.
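A minimal sketch of that sample-count decision, reduced to the light-mask test alone (the full FastAA check also compares materials, depth and normals). `fast_aa_pixel` and the `trace` callback are hypothetical names, and colors are single floats to keep it short.

```python
import math

def fast_aa_pixel(n, trace):
    """Shade one pixel with n-sample FastAA using only the light-mask test.

    trace(i) returns (color, light_mask) for sample i, where bit k of
    light_mask is set if light k is visible from that sample.
    Returns (color, number_of_samples_traced).
    """
    k = max(1, math.isqrt(n))  # first pass: sqrt(n) samples
    first = [trace(i) for i in range(k)]
    if len({mask for _, mask in first}) == 1:
        # Every sample sees the same lights: trust the small average.
        return sum(c for c, _ in first) / k, k
    # Masks disagree (shadow edge or penumbra): trace the remaining samples too.
    samples = first + [trace(i) for i in range(k, n)]
    return sum(c for c, _ in samples) / n, n

lit = fast_aa_pixel(100, lambda i: (1.0, 0b1))
penumbra = fast_aa_pixel(100, lambda i: (1.0, 0b1) if i % 2 else (0.2, 0b0))
print(lit)          # (1.0, 10)
print(penumbra[1])  # 100
```

Ambient occlusion behaves like the penumbra case here: individual samples randomly see or miss the sky, so the masks rarely agree and almost every pixel falls through to the full sample count.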
 
 I have a more sophisticated algorithm that I want to try in later builds that should produce much smoother soft shadows with fewer samples. It involves filtering the light separately from the surface material. I don't know when I'll have it ready, so keep your fingers crossed!
- 
				Thanks Steve, here are the results
 
 Interior DIR test down from 165 to 100 secs (Chunk 32, AA 64) [Original 1,300s]
 Exterior DIR Test down from 14 to 8 secs (Chunk 32, AA 64) [Original 86s]
 
 Lightsaber DIR down from 10:33 to 6:30 (Chunk 32, AA 64)
 
 These speeds are phenomenal :)
 
 Just to test ChunkSize again, I rendered the interior at 128 and it took 130s (with 30s spent waiting on 1 chunk to finish), and at 64 it took 106s. So yes, smaller chunks are much better, since they spread the complexity evenly across all cores right to the end.
 I tried 16, but it defaults to 100, leaving the lone chunk to finish.
 
 Trev
- 
				Trevor: Thanks for the results. As to why ChunkSize=16 doesn't work, Anim8or doesn't use values less than 32 because the overhead of the extra work required increases rapidly below that point. I suppose I should clamp the size to 32 for smaller values instead of using the default.
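The difference between the current fallback and the suggested clamp, as a hypothetical helper. `effective_chunk_size` is an invented name; 100 and 32 are the default and minimum chunk sizes given in this thread.

```python
DEFAULT_CHUNK_SIZE = 100  # used when ChunkSize is not set
MIN_CHUNK_SIZE = 32       # below this, per-chunk overhead grows quickly

def effective_chunk_size(chunk_size=None, clamp=True):
    """Resolve a ChunkSize attribute value (hypothetical helper).

    clamp=False mimics the current behavior (small values fall back to the
    default); clamp=True is the proposed one (small values round up to 32).
    """
    if chunk_size is None:
        return DEFAULT_CHUNK_SIZE
    if chunk_size < MIN_CHUNK_SIZE:
        return MIN_CHUNK_SIZE if clamp else DEFAULT_CHUNK_SIZE
    return chunk_size

print(effective_chunk_size(16, clamp=False))  # 100
print(effective_chunk_size(16))               # 32
print(effective_chunk_size(64))               # 64
```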
			
- 
				You can't just multiply like that. That chip would probably not be ideal for a desktop, because most applications are not multi-threaded well enough to use all that power.
 Where it would be good is on, say, a production server that is compiling things with gcc. gcc is very well multi-threaded and would take advantage of the power very well.