 Post subject: Wish List item
PostPosted: Mon Sep 26, 2011 9:55 pm 

Joined: Sun Jul 10, 2011 2:15 am
Posts: 12
Currently the most CPU-intensive part of using PolyVox is running the mesh extractor, letting Marching Cubes do its job. In my experimentation, it took 9 minutes 24 seconds to generate a mesh for 1024x1024x32 voxels. For those of us seeking to render larger worlds, that's pretty brutal.

So here's the wishlist item: a GPU implementation of Marching Cubes, integrated into PolyVox (given available headers and libraries to build it). Nvidia has released complete source for a CUDA version, both demonstrating it's possible and providing a reference implementation with a liberal license.

Of course an OpenCL adaptation would be preferable, so ATI cards could play too.

Also, I want a pony.

Thank you Santa.


 Post subject: Re: Wish List item
PostPosted: Tue Sep 27, 2011 5:52 am 

Joined: Wed Jan 26, 2011 3:20 pm
Posts: 203
Location: Germany
That's impossible. Are you using Debug mode?
I get about 15ms per complex 16x16x16 region, which means
Code:
; voxels / voxels per region * ms per region / ms per sec / sec per min
; 1024*1024*32/16/16/16*15/1000/60
   2.048

so roughly 2 minutes. Regions that consist mainly of one material are much faster.

Obviously that's still a significant amount of time.
But if you want to do this on a GPU, won't it interrupt your frame rendering?
I mean, nothing cooler than a speedup during level loading... but later in the game, when the world changes, the extraction has to happen on the CPU again or it will interfere with rendering.
If it's a singleplayer game I'd simply store the world after playing it once.
If it's a multiplayer game, the other players can send you the chunks.

Well... I'm curious about David's answer :D


 Post subject: Re: Wish List item
PostPosted: Tue Sep 27, 2011 9:30 pm 
Developer

Joined: Sun May 04, 2008 6:35 pm
Posts: 1827
As ker said, the performance should be much better than this. The July 2009 demo includes a 1024x1024x256 map (8x larger than yours) and on my two-year-old PC it performs the surface extraction in about 15 seconds. That's across four threads, so call it a minute on one thread; your map is an eighth of the size, which would still suggest a time of about 7 seconds.

So what's going wrong?
  • Check you are building in release mode.
  • Are you using the LargeVolume? If so, try SimpleVolume or RawVolume instead, as these provide faster access to the data. 1024x1024x32 isn't big - it's only 32MB uncompressed.
  • Are you creating a single large mesh? There is some chance that four 512x512x32 meshes can be generated faster than a single 1024x1024x32 mesh. I haven't looked into this as I expect meshes to be fairly small.
  • Does your time include only SurfaceExtractor::execute(), and not the time required to upload to the GPU, etc.? (See the rough timing sketch after this list.)
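For reference, the kind of measurement I mean is just wrapping execute() in a timer. Something like this rough sketch against SimpleVolume (I'm writing this from memory, so the exact SurfaceExtractor constructor signature may differ in your snapshot):
Code:
// Rough sketch: time ONLY the surface extraction, excluding GPU upload
// and Ogre conversion. Assumes a SurfaceExtractor(volume, region, mesh)
// constructor - check your PolyVox snapshot for the exact signature.
#include <ctime>
#include <iostream>
#include "PolyVoxCore/SimpleVolume.h"
#include "PolyVoxCore/SurfaceExtractor.h"
#include "PolyVoxCore/SurfaceMesh.h"

using namespace PolyVox;

void timeExtraction(SimpleVolume<MaterialDensityPair44>& volData)
{
    SurfaceMesh<PositionMaterialNormal> mesh;
    SurfaceExtractor< SimpleVolume<MaterialDensityPair44> >
        extractor(&volData, volData.getEnclosingRegion(), &mesh);

    std::clock_t start = std::clock();
    extractor.execute(); // the only call being timed
    std::clock_t end = std::clock();

    std::cout << "execute() took "
              << 1000.0 * (end - start) / CLOCKS_PER_SEC
              << "ms" << std::endl;
}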

Otherwise it's possible that I've broken performance since that 2009 demo, but I won't assume that just yet...

Regarding OpenCL/CUDA, I currently have no plans here. Basically I just haven't seen the need, so I haven't verified how great the benefit is and what the trade-offs are. There are lots of other features I would rather look at before getting to this.


 Post subject: Re: Wish List item
PostPosted: Wed Sep 28, 2011 11:08 pm 

Joined: Sun Jul 10, 2011 2:15 am
Posts: 12
I am using LargeVolume, yes. My intent was to eventually support a map of 32768x32768 voxels (plus some indeterminate height to be worked out after some experimentation).

  • Yes, it's a release build.
  • Actual data available is 1024x1024x256, but in the section of the world I'm experimenting with, nothing higher than the 32nd voxel is occupied by anything but air, so the extraction region was automatically optimized down to 1024x1024x32.
  • The quoted time is for a single giant mesh (several million triangles).
  • Quoted time is solely for SurfaceExtractor::execute(). Time to shove the SurfaceMesh data into an Ogre::ManualObject was an additional 2 minutes.

I have terrain streaming as RLE-compressed chunks from a server, with hundreds of chunks packed into each network message. Chunk dimensions are currently the PolyVox default block size of 32x32x32 voxels. RLE-encoded air comes across the wire as barely 20 bytes for the whole block. Chunks in the bottom 32 voxels are, of course, somewhat bigger. The extraction optimization mentioned above happens during voxel insertion into LargeVolume: a flag gets set if a block has non-zero data in it, and that block's region gets added to the extraction region used once voxel data loading is complete (see the sketch below).
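The bookkeeping is roughly this (simplified; the names are illustrative, not my actual client code):
Code:
// Simplified sketch of the dirty-region tracking described above.
#include <algorithm>

struct DirtyRegion
{
    bool empty;
    int lx, ly, lz, ux, uy, uz;
    DirtyRegion() : empty(true), lx(0), ly(0), lz(0), ux(0), uy(0), uz(0) {}
};

// Called whenever a streamed 32^3 chunk contains anything but air.
void markChunkOccupied(DirtyRegion& r, int cx, int cy, int cz)
{
    const int side = 32;
    const int lx = cx * side, ly = cy * side, lz = cz * side;
    const int ux = lx + side - 1, uy = ly + side - 1, uz = lz + side - 1;

    if (r.empty)
    {
        r.empty = false;
        r.lx = lx; r.ly = ly; r.lz = lz;
        r.ux = ux; r.uy = uy; r.uz = uz;
        return;
    }
    // Grow the extraction region to cover this chunk too.
    r.lx = std::min(r.lx, lx); r.ux = std::max(r.ux, ux);
    r.ly = std::min(r.ly, ly); r.uy = std::max(r.uy, uy);
    r.lz = std::min(r.lz, lz); r.uz = std::max(r.uz, uz);
}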

My client has a second mode of operation. It can extract meshes for individual blocks as they arrive from the server. This, of course, blows out my batch count something fierce, but it demonstrates a reasonable minimum. In that mode, extracting a single block from LargeVolume takes between 9 and 15 milliseconds (again, populated blocks only; air blocks omitted). Subsequent conversion from SurfaceMesh to Ogre mesh takes an additional 1 to 3 milliseconds, typically.

Given the per-block extraction results, you'd expect a total extraction time of somewhere in the neighborhood of 10240 milliseconds. Ten seconds, perhaps. It would appear that LargeVolume is radically non-linear when SurfaceExtractor is traversing it across block boundaries. Note that these results are with the default value for the number of uncompressed blocks allowed, and maximum loaded blocks increased to prevent any block removal at all (1048576 blocks). I suppose dramatically increasing the number of allowed uncompressed blocks should help.

One thing you said confused me. SurfaceExtractor is threaded? I saw no evidence of threading in the source.

Addendum: When my client is in per-block extraction mode, it has a secondary phase where it consolidates groups of 2x2 blocks of voxels into a single mesh, re-extracting and swapping out the previously generated four meshes for a single mesh. (My first attempt at batch count reduction.) Given the timings above, that process should take between 36 and 60 milliseconds. My logs show it takes an absolute minimum of 75 milliseconds, very frequently takes over 90 milliseconds, and spikes as high as 120 milliseconds. So the time required to extract four blocks' worth is eight times the time to extract one block's worth. That non-linearity is what leads to the over-nine-minute time.


Last edited by DragonM on Tue Oct 04, 2011 3:25 am, edited 1 time in total.

 Post subject: Re: Wish List item
PostPosted: Thu Sep 29, 2011 10:56 am 
Developer

Joined: Sun May 04, 2008 6:35 pm
Posts: 1827
DragonM wrote:
Given the per-block extraction results, you'd expect a total extraction time of somewhere in the neighborhood of 10240 milliseconds. Ten seconds, perhaps. It would appear that LargeVolume is radically non-linear when SurfaceExtractor is traversing it across block boundaries.


Intuitively you would expect the running time of the SurfaceExtractor to scale linearly with the number of voxels being processed, but as you have observed that may not be the case. I can't say for sure without doing some tests, but I suspect the main reason is that the SurfaceExtractor avoids generating duplicate vertices. I.e., each time a vertex is about to be generated it tests whether that vertex already exists.

Obviously this is good for your generated mesh, but it means that extraction will slow down as the number of vertices increases. I thought I had a fast way of checking for these duplicates (constant time, using a lookup array) but I guess it can still fail for really large meshes.

I'm away from my development machine for a few days so I'm only speculating here. It may be possible to turn off the checking for duplicate vertices, though of course you'll then get a lot more vertex data.
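To give an idea of what I mean by the lookup array, here is an illustration of the technique (not the actual PolyVox code):
Code:
// Illustration only - not the PolyVox source. One slot per (x, y, edge)
// in the current slice; -1 means no vertex has been generated there yet.
#include <vector>

class VertexSliceCache
{
public:
    VertexSliceCache(int width, int height)
        : m_width(width), m_indices(width * height * 3, -1) {}

    // edge is 0..2: which of the three cell-owned edges holds the vertex.
    int& slot(int x, int y, int edge)
    {
        return m_indices[(y * m_width + x) * 3 + edge];
    }

private:
    int m_width;
    std::vector<int> m_indices;
};

// When emitting a vertex: re-use the stored index if present, otherwise
// create the vertex and remember its index - O(1) either way.
int getOrCreateVertex(VertexSliceCache& cache, int x, int y, int edge,
                      std::vector<float>& positions,
                      float vx, float vy, float vz)
{
    int& idx = cache.slot(x, y, edge);
    if (idx < 0)
    {
        idx = (int)(positions.size() / 3);
        positions.push_back(vx);
        positions.push_back(vy);
        positions.push_back(vz);
    }
    return idx;
}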

DragonM wrote:
Note that these results are with the default value for the number of uncompressed blocks allowed, and maximum loaded blocks increased to prevent any block removal at all (8192 blocks). I suppose dramatically increasing the number of allowed uncompressed blocks should help.


Possibly; it depends where the bottleneck is. The LargeVolume is potentially slow when accessing across block boundaries, and it may have to decompress or page data, but I haven't used it enough to really say. It would be worth narrowing down whether the problem is the SurfaceExtractor or the LargeVolume access, so it would be useful if you could swap in SimpleVolume for testing.

DragonM wrote:
One thing you said confused me. SurfaceExtractor is threaded? I saw no evidence of threading in the source.


No, in my project I implement threading at a higher level, and simply have several surface extractors running on different parts of the volume (not safe on LargeVolume though).


 Post subject: Re: Wish List item
PostPosted: Mon Oct 03, 2011 9:24 pm 

Joined: Sun Jul 10, 2011 2:15 am
Posts: 12
I have test results for RawVolume and SimpleVolume. This required changing one typedef, the allocation statement for the volume, and the extractor definition. Three lines of code changed. Very convenient.

SimpleVolume took 12,325 milliseconds to extract 1024x1024x32 voxels. Converting to Ogre mesh took an additional 10,418 milliseconds, resulting in approximately 2.5 million triangles, which my GPU is happy to render in one batch at 70 fps.

RawVolume took slightly less time and produced no mesh data whatsoever. I can't imagine why. The construction of the volume and the extractor are identical except for substituting PolyVox::RawVolume for PolyVox::SimpleVolume. (I'm using #if/#elif/#elif/#endif to keep all three versions in the code at once; see below.)
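For the curious, the switching is just this (VOLUME_TYPE is my own define, and I'm showing a density-pair voxel type purely for illustration):
Code:
// My own switch, not anything PolyVox defines:
// 0 = LargeVolume, 1 = SimpleVolume, 2 = RawVolume
#define VOLUME_TYPE 1

#if VOLUME_TYPE == 0
typedef PolyVox::LargeVolume<PolyVox::MaterialDensityPair44> WorldVolume;
#elif VOLUME_TYPE == 1
typedef PolyVox::SimpleVolume<PolyVox::MaterialDensityPair44> WorldVolume;
#elif VOLUME_TYPE == 2
typedef PolyVox::RawVolume<PolyVox::MaterialDensityPair44> WorldVolume;
#endif

// The volume allocation and the extractor definition both use
// WorldVolume, so only these lines change between test runs.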

So it seems for wide-ranging traversals, something in LargeVolume is disastrously expensive.

And it appears to be the decompression/recompression cycle: with the default settings, the extractor's traversal causes severe block decompression/recompression thrashing. When I call setMaxNumberOfUncompressedBlocks(1048576) to match the maximum number of blocks in memory, I get a mesh extraction time of 14,881 milliseconds (triangle count is, of course, the same). This is troublesome because my client's memory usage climbs to over 800MB during extraction.

I hope you have time to revisit SurfaceExtractor. :)


 Post subject: Re: Wish List item
PostPosted: Tue Oct 04, 2011 7:10 am 
Developer

Joined: Sun May 04, 2008 6:35 pm
Posts: 1827
DragonM wrote:
I have test results for RawVolume and SimpleVolume. This required changing one typedef, the allocation statement for the volume, and the extractor definition. Three lines of code changed. Very convenient.


That's the idea :-)

DragonM wrote:
SimpleVolume took 12,325 milliseconds to extract 1024x1024x32 voxels. Converting to Ogre mesh took an additional 10,418 milliseconds, resulting in approximately 2.5 million triangles, which my GPU is happy to render in one batch at 70 fps.


Ok, that's more the kind of time I was expecting.

DragonM wrote:
RawVolume took slightly less time and produced no mesh data whatsoever. I can't imagine why. The construction of the volume and the extractor are identical except for substituting PolyVox::RawVolume for PolyVox::SimpleVolume. (I'm using #if/#elif/#elif/#endif to keep all three versions in the code at once.)


I don't know what would cause this, but I need to add some more volume-related unit tests anyway, so I'll try to ensure consistent behaviour across the different volumes. But I don't think RawVolume would be any faster than SimpleVolume.

DragonM wrote:
So it seems for wide-ranging traversals, something in LargeVolume is disastrously expensive.

And it appears to be the uncompression/recompression cycle.


Yep, that's quite possible. It may be too slow or it may just be called too often. I wrote that compression code myself and I'll freely admit that compression is outside my area of expertise. There has been a request in the past for the compression to be replaced by a dedicated compression library, which I do think is a good idea, but I don't expect to change it myself soon (I can provide guidance if you want to implement this).

This would probably be faster than my own approach, and I suspect the compression would be much better as well, especially for density values (rather than just material values), as my RLE compression doesn't work so well there.
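For illustration, the core idea is just this (a simplified sketch, not the actual PolyVox codec):
Code:
// Simplified RLE sketch - not the actual PolyVox code. Runs of identical
// values collapse to (count, value) pairs, which is why a block of pure
// air shrinks to a few bytes but smoothly varying densities barely
// compress at all.
#include <cstddef>
#include <utility>
#include <vector>

std::vector< std::pair<unsigned int, unsigned char> >
rleEncode(const std::vector<unsigned char>& voxels)
{
    std::vector< std::pair<unsigned int, unsigned char> > runs;
    for (std::size_t i = 0; i < voxels.size(); ++i)
    {
        if (!runs.empty() && runs.back().second == voxels[i])
            ++runs.back().first;                           // extend run
        else
            runs.push_back(std::make_pair(1u, voxels[i])); // new run
    }
    return runs;
}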

DragonM wrote:
When I call setMaxNumberOfUncompressedBlocks(1048576) to match the maximum number of blocks in memory, I get a mesh extraction time of 14,881 milliseconds. Triangle count is, of course, the same. The extractor's traversal causes severe block decompression/recompression thrashing. This is troublesome because my client's memory usage climbs to over 800MB during extraction.


The surface extractor processes the voxels in a linear fashion, iterating over each voxel in a line, each line in a slice, and each slice in the desired region. As you say, this is bad because if you have a 32x32x32 block it will enter that block, process 32 voxels, and then exit; it will enter and exit 32x32 times, with other blocks processed in between.

Obviously this is bad, but there are advantages to the slice-by-slice approach as well (in particular when catching the duplicated vertices). It does need revisiting though.

In your particular case you should not need to set the number of uncompressed blocks as high as 1048576. Instead, think about how many blocks actually fit inside the region; in your case that would be 32x32x1 = 1024, i.e. something like the sketch below.
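Code:
// Back-of-envelope sizing for the uncompressed-block cache. The setter
// is the one you quoted - double-check the exact name/casing in your
// PolyVox snapshot.
#include "PolyVoxCore/LargeVolume.h"

void sizeBlockCache(
    PolyVox::LargeVolume<PolyVox::MaterialDensityPair44>& volData)
{
    const int regionX = 1024, regionY = 1024, regionZ = 32;
    const int blockSide = 32;

    // Blocks touched by the extraction region: 32 * 32 * 1 = 1024.
    const int blocksNeeded = (regionX / blockSide)
                           * (regionY / blockSide)
                           * (regionZ / blockSide);

    volData.setMaxNumberOfUncompressedBlocks(blocksNeeded);
}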

DragonM wrote:
I hope you have time to revisit SurfaceExtractor. :)


Progress on PolyVox has been slow recently, but that is because I am focusing on a project built on PolyVox rather than working on the library itself. However, I'm not using either the LargeVolume or the marching cubes SurfaceExtractor, so I won't spend much time here. I expect this project to be done around the end of the year so hopefully I can get back to the library then.


 Post subject: Re: Wish List item
PostPosted: Tue Oct 04, 2011 9:31 pm 

Joined: Sun Jul 10, 2011 2:15 am
Posts: 12
David Williams wrote:
Progress on PolyVox has been slow recently, but that is because I am focusing on a project built on PolyVox rather than working on the library itself. However, I'm not using either the LargeVolume or the marching cubes SurfaceExtractor, so I won't spend much time here. I expect this project to be done around the end of the year so hopefully I can get back to the library then.

Sorry to hear it. Er, I mean, congratulations, good for you. ;)

I've determined it's time for me to try out resampling, which will, of course, use SimpleVolume as a destination and so bypass SurfaceExtractor's abusive behavior towards LargeVolume. Having one batch in 13 seconds (it turns out conversion to Ogre mesh takes only 1201 ms when the client has enough RAM to work with) is nice, but 2.5 million triangles is a bit much. A lower LOD is in order, so I won't be trying to do major extractions from LargeVolume. I have a sneaking suspicion VolumeResampler will thrash LargeVolume too, though, and that's more of a problem. However, it's probably more amenable to per-block processing, since it doesn't have to worry about culling vertices.

Given your preoccupation, I'll probably try my hand at revising VolumeResampler as needed and submitting a patch.

I anticipate needing int Volume::getBlockSize() const (see the sketch below). At the moment, client code of Volume doesn't necessarily know what block size the Volume is using, and since block size is customizable in the constructor it's not technically an implementation detail, so exposing it would be both helpful and appropriate. I would prefer to have block-sensitive code ask the Volume it's working with what its block size is, rather than trying to use the same constant with which the Volume was constructed.
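Roughly this (hypothetical, of course; m_uBlockSideLength just stands in for whatever the volume already stores internally):
Code:
// The proposed accessor - hypothetical, not existing PolyVox code.
class Volume
{
public:
    int getBlockSize(void) const { return m_uBlockSideLength; }

private:
    int m_uBlockSideLength; // set from the constructor argument
};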


 Post subject: Re: Wish List item
PostPosted: Tue Oct 04, 2011 10:44 pm 
Developer

Joined: Sun May 04, 2008 6:35 pm
Posts: 1827
DragonM wrote:
I have a sneaking suspicion VolumeResampler will thrash LargeVolume too though, and that's more of a problem. However, it's probably more amenable to per-block processing, since it doesn't have to worry about culling vertices.


Agreed on both points. Git master contains the start of an 'IteratorController' class (volume samplers will soon be renamed to iterators, as they will also provide write access). Eventually that will be used to move iterators across the volume in a cache-efficient manner (e.g. a Z-order curve or a Hilbert curve; see the sketch below). But this isn't really implemented yet, and I haven't made firm decisions about exactly what I'm going to do.

However, I am using it a bit in my project so it may get some consideration over the coming weeks.
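For illustration, a Z-order (Morton) index just interleaves the coordinate bits so that spatially close voxels get close linear indices (standard bit-twiddling, not PolyVox code):
Code:
// Standard 3D Morton encoding - not PolyVox code. Interleaves the bits
// of x, y and z so nearby positions map to nearby linear indices.
#include <stdint.h>

uint64_t part1By2(uint64_t v)
{
    v &= 0x1fffffULL; // 21 bits per axis fit in 63 bits
    v = (v | (v << 32)) & 0x1f00000000ffffULL;
    v = (v | (v << 16)) & 0x1f0000ff0000ffULL;
    v = (v | (v << 8))  & 0x100f00f00f00f00fULL;
    v = (v | (v << 4))  & 0x10c30c30c30c30c3ULL;
    v = (v | (v << 2))  & 0x1249249249249249ULL;
    return v;
}

uint64_t mortonEncode(uint32_t x, uint32_t y, uint32_t z)
{
    return part1By2(x) | (part1By2(y) << 1) | (part1By2(z) << 2);
}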

DragonM wrote:
Given your preoccupation, I'll probably try my hand at revising VolumeResampler as needed and submitting a patch.


Be aware that the SmoothLodExample appears to be broken in Git head. I don't know when it happened but I'll try to fix it. You are definitely in experimental code with this part of PolyVox ;-)

DragonM wrote:
I anticipate needing int Volume::getBlockSize() const. At the moment, client code of Volume doesn't necessarily know what block size the Volume is using, and since block size is customizable in the constructor it's not technically an implementation detail, so exposing it would be both helpful and appropriate. I would prefer to have block-sensitive code ask the Volume it's working with what its block size is, rather than trying to use the same constant with which the Volume was constructed.


I think that's ok in principle. Keep in mind that it won't apply to all volume types (such as the RawVolume, and maybe an OctreeVolume in the future) as these are not block based. But it should be OK for SimpleVolume/LargeVolume.


 Post subject: Re: Wish List item
PostPosted: Thu Oct 06, 2011 9:59 pm 
Developer

Joined: Sun May 04, 2008 6:35 pm
Posts: 1827
David Williams wrote:
Be aware that the SmoothLodExample appears to be broken in Git head. I don't know when it happened but I'll try to fix it. You are definitely in experimental code with this part of PolyVox ;-)

Ok, it's working again. It was actually the RawVolumeSampler which was broken.

