b46d0be5bf8f
—
Leonard Ritter
a month ago

* initial check-in articles written for liminal

8 files changed, 1535 insertions(+), 0 deletions(-)

A => doc/articles/gpu-delaunay-transform.txt
A => doc/articles/half-edge-tet.txt
A => doc/articles/linkdemo.py
A => doc/articles/maxnorm.txt
A => doc/articles/nicer-teardown.txt
A => doc/articles/node-dom.txt
A => doc/articles/split-scheme.txt
A => doc/articles/twistpool.c

A => doc/articles/gpu-delaunay-transform.txt +160 -0

@@ 0,0 1,160 @@

Parallelizable Gather-based Delaunay Transforms
-----------------------------------------------

by `Leonard Ritter <https://twitter.com/paniq>`_, `Duangle GbR <http://duangle.com>`_

Today I figured out how to do GPU-friendly Delaunay transforms, and I'd like to
describe how the algorithm works before I forget everything.

The algorithm flips edges in a half-edge triangle mesh wherever a flip
would improve the quality of the adjacent triangles (an equilateral triangle
has perfect quality). The triangle mesh I'm using is embedded in R2, that is,
it's on a 2D plane, has no holes, is not a manifold, and is typically a heavily
subdivided square; but the algorithm should work just as well for 2-manifolds in 3D.

The mesh is not guaranteed to be Delaunay conforming after a single transform,
and there is no halting condition. The algorithm is designed to be run in
realtime, continuously refining a mesh with moving vertices each frame. The
algorithm supports constrained and unconstrained Delaunay transforms.

Overview
^^^^^^^^

There are three passes to this algorithm, inspired by the approach of
Navarro et al. in [1]_. Navarro et al.'s approach is scatter-based and requires
atomics to work, but was a good starting point for figuring out how a lo-fi
version might work. Each pass can be massively parallelized (as the iteration
is order-independent), requires no GPGPU implementation or compute shaders, and
is therefore a good fit for OpenGL 3.2 compatible hardware.

The first pass iterates all edges of the mesh and evaluates flip candidates based
on whether a flip would improve the quality. The second pass iterates all
triangles and does the flip, using a simple precedence heuristic to avoid
collisions. The third pass fixes broken adjacency information.

Data Structure
^^^^^^^^^^^^^^

The data structure I use for the triangle mesh is a reduced version of the
structure described by Fabian Giesen in his Half-edges Redux article [2]_.
Each triangle has only three edges sorted in counter-clockwise order so that
``edgeId(triangleId, corner) = triangleId * 3 + corner``. That makes it possible
to store the vertex indices for each edge (which map to the vertex opposite
each edge inside the same triangle, *not* each edge's starting vertex) as a
linear element array, the kind of format that OpenGL prefers for indexed
triangles. There are no wedges in this structure, and the algorithm needs to be
modified if there are.

I also store an array of opposites for each half-edge (which I will call twins
for brevity), so that ``vertex(twin(edgeId))`` gives the vertex opposite each edge
in the adjacent triangle. That means each potential quad that an edge is the
diagonal of is described by the vertices ``vertex(edgeId)``,
``vertex(next(edgeId))``, ``vertex(twin(edgeId))`` and ``vertex(prev(edgeId))``
in counter-clockwise order. The algorithm needs the twin array to work,
as well as a temporary copy of the twin array.

Here is a breakdown of the data structure::

    V vertices
    T triangles
    E=3*T half-edges

    // position[vertexId] = position of each vertex in R2
    // Read in pass 1.
    vec2 position[V];

    // vertex[edgeId] = vertexId of inner vertex opposite a half-edge
    // Read in passes 1 & 2, written in pass 2.
    uint vertex[E];

    // twin[edgeId] = edgeId of the twin half-edge connecting the same vertices
    // Read in passes 1, 2 & 3, written in pass 3.
    uint twin[E];

    // temporary flag array, three bits per triangle (one bit per half-edge).
    // Written in pass 1, read in pass 2.
    uint8 candidates[T];

    // temporary twin array to avoid a race condition.
    // Written in pass 2, read in pass 3.
    uint tmptwin[E];

Algorithm
^^^^^^^^^

The algorithm runs in three passes, which are as follows:

#. For each triangle, for each of its half-edges, collect the four adjacent
   vertices that form a quad with this half-edge as its diagonal and estimate the
   quality of triangles with both possible diagonals. I use a quality estimator
   described by Müller et al. in [3]_. If a flip would improve the quality,
   store a bit for this edge in ``candidates``. By default, ``candidates[triangleId]``
   must be initialized to zero.

   It is important that the quality computations use the exact vertex order from
   both sides of each edge to compute the exact same floating point values and
   arrive at the exact same predicate, or floating point imprecisions will cause
   asymmetrical results, and subsequent steps will destroy the mesh. I got good
   results with a fixed point implementation of the quality estimator that is
   insensitive to vertex order.

   If the transform is supposed to be constrained, then constrained edges should
   simply never be tagged.

#. For each triangle, find the first half-edge that is tagged by ``candidates``
   as a flip candidate. If no half-edge is tagged, then skip. If more
   than one half-edge is tagged, the tagged half-edge with the lowest lexicographic
   index wins.

   The lexicographic index is a ``uint64`` built from the two half-edge IDs
   that make up a single edge:
   ``index(edgeId) = min(edgeId,twin[edgeId]) | (max(edgeId,twin[edgeId]) << 32)``.

   The half-edge candidate can still collide with tagged half-edges in the
   neighboring triangle. If its lexicographic index is *not* the lowest among its
   tagged neighbors, then skip. Otherwise, do this triangle's part of the flip:

   The flip is executed by mapping the half-edge's end vertex to the
   outer opposing vertex so that ``vertex[prev(edgeId)] = vertex[twin[edgeId]]``.
   Because this moves two half-edges, the local twin information must also be
   corrected, so that ``tmptwin[next(edgeId)] = next(twin[edgeId])`` and
   ``tmptwin[edgeId] = twin[next(twin[edgeId])]``.

   Because both triangles participating in this flip arrive at the same conclusions,
   do the same operation independently, and the flip operation itself is symmetric,
   the connectivity will be restored without collision or ambiguity.

   If no operation is performed, ``tmptwin`` must still be initialized so that
   ``tmptwin[edgeId] = twin[edgeId]``.

#. There's one last problem to correct, which is the invalid adjacency information
   stored in half-edges bordering on triangles that have been flipped.

   For each half-edge, determine if its twin has changed by comparing
   ``tmptwin[edgeId] != twin[edgeId]``. If the twin has changed, then we simply
   update ``twin[edgeId] = tmptwin[edgeId]``.

   We also check if this half-edge's information is still up to date by
   verifying the bi-directionality of the link using the predicate
   ``edgeId == tmptwin[tmptwin[edgeId]]``. If the predicate fails, then we have
   to search for the new, correct adjacent half-edge, whose location is
   fortunately invariant.

   We now need to discern whether ``edgeId`` belongs to a half-edge
   bordering on a flipped triangle, using the predicate
   ``tmptwin[tmptwin[tmptwin[edgeId]]] == tmptwin[edgeId]``.

   If the comparison holds, we update the half-edge information to the
   invariant location of its new half-edge so that
   ``twin[edgeId] = prev(tmptwin[tmptwin[edgeId]])``, and we're done.


References
^^^^^^^^^^

.. [1] Navarro et al.
   `A parallel GPU-based algorithm for Delaunay edge-flips <http://swp.dcc.uchile.cl/TR/2011/TR_DCC-20110315-003.pdf>`_
.. [2] Fabian Giesen
   `Half-edges redux <https://fgiesen.wordpress.com/2012/04/03/half-edges-redux/>`_
.. [3] Müller et al.
+ `Air Meshes for Robust Collision Handling <http://matthias-mueller-fischer.ch/publications/airMeshesPreprint.pdf>`_
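As a concrete illustration of the indexing rules the article relies on, here is a small Python sketch; the actual implementation would live in GLSL, and the snake_case helper names are mine. ``next``/``prev`` follow from the ``edgeId = triangleId * 3 + corner`` layout, and the lexicographic precedence index is the one used in pass 2.

```python
# Illustrative Python transcription of the half-edge indexing helpers
# (hypothetical names; the article's implementation targets GLSL).

def next_edge(edge_id):
    # next half-edge within the same triangle, counter-clockwise
    return edge_id - edge_id % 3 + (edge_id + 1) % 3

def prev_edge(edge_id):
    # previous half-edge within the same triangle
    return edge_id - edge_id % 3 + (edge_id + 2) % 3

def lex_index(edge_id, twin):
    # uint64 precedence index built from both half-edge IDs of an edge
    a, b = edge_id, twin[edge_id]
    return min(a, b) | (max(a, b) << 32)
```

Both triangles adjacent to an edge compute the same ``lex_index`` value from either of its half-edges, which is what makes the collision heuristic of pass 2 symmetric.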

A => doc/articles/half-edge-tet.txt +200 -0

@@ 0,0 1,200 @@+Symmetric Half-Edge Tetrahedron Predicates +Author: Leonard Ritter <leonard.ritter@duangle.com> +2015-04-19 + +This document describes a compact and efficient half-edge mapping for +tetrahedra. Its indices have been purposefully arranged so that all +invariant adjacency and incidence predicates of a tetrahedron can be computed +from one another. + +All queries ultimately make use of one single bit-arithmetic predicate, +faceVertex(). For increased performance, computed results can be cached in +bit tables. + +The vertices are numbered this way: + + 0 + + + /|\ + / | \ +1 +-----+ 2 + \ | / + \|/ + + + 3 + +The numbering scheme describes a hamiltonian cycle, and is therefore suited for +patch rendering. + +Faces run counter-clockwise when observed from the outside; if one would like +the hull to face inward instead, one can flip vertices V1 and V2 in the map +above and have faces run counterclockwise; The topology is not altered, so none +of the predicates are affected by this choice. + +Wherever edges are mentioned in the following sections, it is implied that +half-edges are meant. There are no full-edge indices in this scheme. + +Face -> Vertices +================ + +F0: V3 V2 V1 +F1: V2 V3 V0 +F2: V1 V0 V3 +F3: V0 V1 V2 + +Face indices have been chosen so that face F lies opposite vertex V if V = F. +This way, for any base face F, (F+1)%4, (F+2)%4, (F+3)%4 form the faces +of the pyramid cone, with vertex V = F as its apex. + +vertexOppositeFace(vertex) -> vertex +faceOppositeVertex(face) -> face + +The phase for each triangle has been picked so that the vertex of index 0 of +each face F equals F^3 or 3-F, and each vertex is referenced across a +diagonal of the above face * vertex 3x4 matrix. 
+ +The above table can be computed from the function: + +faceVertex(face,corner) -> ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3 + +Vertex -> Faces +=============== + +V0: F1 F3 F2 +V1: F0 F2 F3 +V2: F3 F1 F0 +V3: F2 F0 F1 + +The inverse table maps each vertex to one of three incident faces sharing +an edge with the opposite face F = V, in the clockwise order of its edges. + +The table is computed from the function: + +cornerOpposite(corner) -> (corner + 2) % 3 +vertexFace(vertex,corner) -> faceVertex(vertex, cornerOpposite(corner)) + +When a face is passed instead of a vertex, the function returns the +index of the face adjacent to the edge starting at the given corner. + +Edge -> (Start Vertex, End Vertex) +================================== + + E0: V0 -> V1 + E1: V0 -> V3 + E2: V0 -> V2 + E3: V1 -> V0 + E4: V1 -> V2 + E5: V1 -> V3 + E6: V2 -> V3 + E7: V2 -> V1 + E8: V2 -> V0 + E9: V3 -> V2 +E10: V3 -> V0 +E11: V3 -> V1 + +Edges have been mapped so that all edges sharing the same starting vertex +are grouped and pointing towards the vertices of the opposing triangle in +counterclockwise order, starting at the vertex opposite to each face's first +edge. + +The table can be computed from the two functions: + +edgeStart(edge) -> floor(edge / 3) +edgeEnd(edge) -> vertexFace(edgeStart(edge), edge) + +The hamiltonian edge path is E0 E4 E6 E10, equivalent to the function + +hamiltonian(index) -> ((index & 1) << 2) + 3 * (index & 2) + +Vertex -> Edges +=============== + +Because of the grouping, computing an edge with a given start vertex is trivial: + +vertexEdge(vertex,corner) -> vertex * 3 + corner + +When a face instead of a vertex is passed, the function returns the edges +that connect the face-opposing vertex with the edge-opposing vertex. 
+ + +Face -> Edges +============= + +F0: E9 E7 E5 +F1: E6 E10 E2 +F2: E3 E1 E11 +F3: E0 E4 E8 + +Due to the previously made indexing choices, the starting edge of a face +can be computed from its vertex index and the desired corner. + +faceEdge(face,corner) -> vertexEdge(faceVertex(face,corner), corner) + +Edge -> (Face, Corner) +====================== + + E0: F3 C0 + E1: F2 C1 + E2: F1 C2 + E3: F2 C0 + E4: F3 C1 + E5: F0 C2 + E6: F1 C0 + E7: F0 C1 + E8: F3 C2 + E9: F0 C0 +E10: F1 C1 +E11: F2 C2 + +You may notice that the mapping is equivalent to the face -> vertex mapping, +due to the F = V equality when F and V oppose each other. + +The inverse can be computed as well: + +edgeCorner(edge) -> edge % 3 +edgeFace(edge) -> faceVertex(edgeStart(edge), edgeCorner(edge)) + +Edge -> Opposite Edge +===================== + + E0: E3 + E1: E10 + E2: E8 + E3: E0 + E4: E7 + E5: E11 + E6: E9 + E7: E4 + E8: E2 + E9: E6 +E10: E1 +E11: E5 + +The half-edge opposite to a half-edge is again computed the same way, +making use of a flipped Face -> Edge predicate: + +edgeOpposite(edge) -> vertexEdge(vertexFace(edgeStart(edge), edge), edgeCorner(edge)) + +Edge -> Next Edge +================= + + E0: E4 + E1: E11 + E2: E6 + E3: E1 + E4: E8 + E5: E9 + E6: E10 + E7: E5 + E8: E0 + E9: E7 +E10: E2 +E11: E3 + +The functions used to retrieve the previous / next edge in counterclockwise +order within a face are: + +edgeNext(edge) -> faceEdge(edgeFace(edge), (edge + 1) % 3) +edgePrev(edge) -> faceEdge(edgeFace(edge), cornerOpposite(edge)) + +
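Since every predicate above bottoms out in faceVertex(), the whole scheme is easy to check mechanically. The following Python transcription (snake_case names are mine) regenerates the Face -> Vertices, Edge -> End Vertex, opposite-edge and next-edge tables from this document:

```python
# Python sketch of the bit-arithmetic tetrahedron predicates; all tables
# in this document can be recomputed from face_vertex() alone.

def face_vertex(face, corner):
    # faceVertex(face,corner) -> ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3
    return ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3

def corner_opposite(corner):
    return (corner + 2) % 3

def vertex_face(vertex, corner):
    return face_vertex(vertex, corner_opposite(corner))

def vertex_edge(vertex, corner):
    # edges sharing a start vertex are grouped consecutively
    return vertex * 3 + corner

def edge_start(edge):
    return edge // 3

def edge_corner(edge):
    return edge % 3

def edge_end(edge):
    return vertex_face(edge_start(edge), edge_corner(edge))

def edge_face(edge):
    return face_vertex(edge_start(edge), edge_corner(edge))

def face_edge(face, corner):
    return vertex_edge(face_vertex(face, corner), corner)

def edge_opposite(edge):
    return vertex_edge(vertex_face(edge_start(edge), edge_corner(edge)),
                       edge_corner(edge))

def edge_next(edge):
    return face_edge(edge_face(edge), (edge + 1) % 3)
```

Asserting the computed tables against the ones written out above is a useful regression test if the index arrangement is ever changed.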

A => doc/articles/linkdemo.py +262 -0

@@ 0,0 1,262 @@

"""

The Liminal RNode System - Scheduling Renderjobs In A Crazy World
by Leonard Ritter, Duangle GbR (2014.4.11)


Structuring rendering in a 3D application to allow for drastic alterations
in pipeline ordering isn't easy; you usually end up wanting all kinds of
permutations, like "i want _this_ actor with _this_ mesh, but this time into
_this_ framebuffer, but with the same camera data; screw that; we need
the same data in two different rendertargets, oh and I forgot we need
splitscreen, so all this needs to be split into four viewports - except
for this object, which needs to be on top of it all."

...and the graphics engine coder works his ass off for two more weeks; usually
these changes ripple through the entire codebase, starting at providing
basic support at the engine level, over to the game programmer, who then has
to make use of the new flags and refactor the rendering setup.

"""

# so here's how to set up scenes in our Liminal engine now;
# it's a bit like linking generators and effects in a modular sequencer,
# only that it works on packets (aka renderjobs) instead of streams of
# audio impulses.

# I haven't seen a game engine which organizes its renderpath like this yet,
# so I'm pretty excited to demo this and get everyone's input on what I
# think is a pretty lean and elegant way to organize 3D rendering in a game.
+ + +# first, generate a default, uv mapped cube mesh with normals +mesh = generate_cube_mesh() +mesh.name = "cube" + +# setup a deferred material for the mesh +material = GMaterial.new() +material.name = "CubeMaterial" +material.use_backface_culling = True +material.use_shadow = True +material.albedo = vec3(1.0,1.0,1.0) +material.metallic_factor = 0.0 +material.roughness_factor = 1.0 +material.add_shader_texture('albedo_map', noise1_texture) +material.add_shader_texture('height_map', noise1_texture) +material.add_shader_texture('gloss_map', noise1_texture) +material.add_shader_texture('glow_map', zero_texture()) + +# create a new actor to give the mesh a location +# the actor has no conception of anything else; just an orientation in space +cube2 = Actor() +cube2.name = 'cube2' +# give it a physical body for collision testing +cube2.body = GeomBody.new(ws.world, Box.new(ws.space, vec3(64,64,1))) +cube2.body.kinematic = True +# apply position and size +cube2.position = vec3(0,0,-10) +cube2.scale = vec3(32.0,32.0,0.5) + +# now concatenate our renderjob: +# we want this mesh with this material at this actor's location, +# so cube2 now depends on material, which depends on mesh +cube2 << material << mesh + +# the order of assignments is only interesting in relation to the graph, +# the only requirement is that the mesh is the a source, as meshes are +# the only nodes that generate renderjobs; other nodes only augment them. +# +# this would be equally valid, and a good setup for when many actors +# share the same material: +material << cube2 << mesh + +# now setup an empty vizgroup which we use to bundle a group of objects, +# possibly to control visibility later. This would be kind of like a "scene" +vizgroup = RNode() +vizgroup.name = 'scene' +# make vizgroup depend on cube2 +vizgroup << cube2 + +# The RNode is the basic building block for the render graph; materials, meshes, +# actors, framebuffers, etc. 
and a lot of other default resources already +# implement the RNode interface so they can be linked; But it's pretty easy +# to write your own RNodes, or use a RNode as proxy to re-use another RNode in +# two different unrelated paths. + +# get the singleton for the deferred renderer's resources, which is also +# an RNode +grm = GRenderManager() + +# link the vizgroup to the framebuffer for the deferred "scene" - +grm.scene_framebuffer << vizgroup + +# now the last thing we need to complete the job is a camera to actually +# specify where the scene is in relation to the viewer; +camera = DebugCamera(lock_up=True) +camera.name = 'camera' +camera.far = 300.0 +camera.fov = 90.0 +# give the camera a collision box +camera.body = GeomBody.new(ws.world, Box.new(ws.space, vec3(1,1,1))) +camera.body.kinematic = True +# and position it +camera.look_at(vec3(0,-70.0,0), vec3(0,0,0), vec3(0,0,1)) + +# all resources the deferred renderer has set up are set, except for the view +# data, so let's just link the entire grm through it and we're done; the +# scene is on screen. +# +# retrieve render manager singleton +rm = RenderManager() +# complete the chain; batch is the root node - now the renderer can 'see' +# the content. 
rm.batch << camera << grm

# For debugging, I have written a small 90 line function to have graphviz
# draw a map of the graph, so I can see at any point what the pipeline looks
# like

"""

"""

# the declaration for RNodes is only 70 lines, and writing new RNodes is simple;
# here is an RNode that changes the renderpass of all jobs going through it:
class Renderpass(RNode):
    def __init__(self, renderpass):
        RNode.__init__(self)
        self.renderpass = renderpass

    def process(self, batch, jobs):
        RNode.process(self, batch, jobs)
        for job in jobs:
            # we could also insert new orders into the jobs array here
            # but let's just alter the renderpass
            job.key.renderpass = self.renderpass

# rnodes are only traversed on graph change, not per frame, and the work can
# be done in a thread, as the renderer isn't required for processing.
#
# this is how the renderbatch then generates all jobs while traversing
# the graph:

def render_nodes(self):
    # this could be done in a more concise way in Python,
    # but I wanted to keep the implementation parallel to a possible
    # future C implementation; this one doesn't need any hash tables
    # or binary searches, just arrays.
+ + # get a new selection id + RNode.NODE_ID += 1 + node_id = RNode.NODE_ID + + # array of nodes in topological order + nodes = [] + # how many times each node's content is referenced + noderefs = [] + + # recursive visitor function + def visit_node(node): + # if node has already been seen, skip + if node._node_id == node_id: return + # visit children first + for n in node.sources: + visit_node(n) + # increase noderefs for that node + noderefs[n._node_index] += 1 + # assign the node's order index + node._node_index = len(nodes) + # update the select id to mark the node as visited + node._node_id = node_id + # append to topologically sorted array + nodes.append(node) + # and initialize the refcount for this node + noderefs.append(0) + + # start at batch, which is the root + visit_node(self) + + # debug out: print order in which nodes are processed, (which + # is not the order in which they are rendered, btw) + for i,n in enumerate(nodes): + print('#{}: {}'.format(i,n)) + + # array of each node's jobs + all_jobs = [] + # linearly go through node in sorted order + for k,node in enumerate(nodes): + # list of jobs that this node will receive + jobs = [] + # for all inputs of this node + for source in node.sources: + # retrieve the jobs that the predecessor has + src_index = source._node_index + src_jobs = all_jobs[src_index] + # make sure we have at least one reference left + assert noderefs[src_index] > 0 + # if this is not the last reference + if noderefs[src_index] > 1: + # make a copy of the jobs and append + for job in src_jobs: + jobs.append(self.copy_job(job)) + else: + # last reference: they terk er jerbs + for job in src_jobs: + jobs.append(job) + all_jobs[src_index] = None + # and decrease reference + noderefs[src_index] -= 1 + + # process the jobs + try: + node.process(self, jobs) + except: + print(node) + raise + # and append to job list + all_jobs.append(jobs) + + # as jobs are already appended to the renderbatch on create/copy, + # we are already done; the 
batch can now be sorted and is ready
    # for rendering.

# And, just for completion, this is the basic RNode class declaration:

class RNode(Named):
    # selection id counter
    NODE_ID = 0

    def __init__(self):
        Named.__init__(self)
        # set of input nodes
        self._sources = set()
        # variables used during graph traversal
        self._node_index = 0
        self._node_id = 0

    # this function does a number on the jobs; any operation is OK:
    # alteration, duplication, deletion, etc. Change of order has no effect
    # on anything.
    def process(self, batch, jobs):
        pass

    @property
    def sources(self):
        return self._sources

    # overloaded << operator to provide linking syntax sugar
    def __lshift__(self, other):
        assert self.name, '{} is unnamed'.format(self)
        if isinstance(other, collections.Iterable):
            for node in other:
                assert node.name, '{} is unnamed'.format(node)
            self.sources.update(other)
            return other
        else:
            assert other.name, '{} is unnamed'.format(other)
            self.sources.add(other)
            return other

# That's it. Thanks for reading :)
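To make the linking idea easy to try outside the engine, here is a heavily stripped-down, self-contained sketch of the same pattern. ``MiniNode``, ``MiniMesh`` and ``MiniMaterial`` are hypothetical stand-ins for the real classes, and the per-node reference counting and job copying of ``render_nodes()`` is deliberately omitted:

```python
# Minimal standalone distillation of the RNode idea: nodes are linked
# with <<, sources generate jobs, intermediate nodes augment them on the
# way to the root. (Simplified stand-in names, not the engine API.)

class MiniNode:
    def __init__(self, name):
        self.name = name
        self.sources = []

    def __lshift__(self, other):
        # receiver << source: the receiver depends on `other`
        self.sources.append(other)
        return other

    def emit(self):
        # only meshes generate jobs; plain nodes just pass them through
        return []

    def process(self, jobs):
        return jobs

    def collect(self):
        jobs = list(self.emit())
        for source in self.sources:
            jobs.extend(source.collect())
        return self.process(jobs)

class MiniMesh(MiniNode):
    def emit(self):
        return [{'mesh': self.name}]

class MiniMaterial(MiniNode):
    def process(self, jobs):
        for job in jobs:
            job['material'] = self.name
        return jobs

# mirror the article's setup: batch << material << mesh
batch = MiniNode('batch')
material = MiniMaterial('CubeMaterial')
mesh = MiniMesh('cube')
batch << material << mesh
```

Calling ``batch.collect()`` then yields one job tagged with both the mesh and the material, which is the essence of what the real graph traversal produces before sorting.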

A => doc/articles/maxnorm.txt +110 -0

@@ 0,0 1,110 @@

Here is a collection of signed distance field functions with maxnorm metric
(Chebyshev distance or L^inf distance).

#define SIGNED_DIST_MAX 1000.0

float sumx(vec3 v) {
    return v.x + v.y + v.z;
}

float max_element(vec3 v) {
    return max3(v.x, v.y, v.z);
}

float max_norm(vec3 v) {
    return max_element(abs(v));
}

float sd_cube(vec3 p, float r) {
    return max_norm(p) - r;
}

vec2 solve_quadratic(float a, float b, float c) {
    // ax^2 + bx + c = 0, a non-zero
    float q = b*b - 4.0*a*c;
    if (q < 0.0) {
        return vec2(SIGNED_DIST_MAX);
    }
    float r0 = -b/(2.0*a);
    float r1 = sqrt(q)/(2.0*a);
    return vec2(r0 - r1, r0 + r1);
}

float sd_ellipsoid(vec3 p, vec3 r) {
    // move ellipse so that target point is at origin, centre in positive space
    // f(v) = (v.x - c.x)^2/r.x^2 + (v.y - c.y)^2/r.y^2 + (v.z - c.z)^2/r.z^2 - 1
    vec3 c = abs(p);
    vec3 c2 = c*c;
    vec3 r2 = r*r;
    float d = SIGNED_DIST_MAX;

    // gather terms of quadratic
    vec3 qa = 1.0/r2;
    vec3 qb = -2.0*c/r2;
    vec3 qc = c2/r2;
    float qcs = sumx(qc) - 1.0;

    // check corners:
    // solve f(v)=0 for v.x=v.y=v.z=t
    {
        vec2 t0 = abs(solve_quadratic(sumx(qa), sumx(qb), qcs));
        d = min3(d, t0.x, t0.y);
    }

    // interior of convex shape always hits corners first, so early out
    if (qcs <= 0.0) {
        return -d;
    }

    // check edges:
    // df/dx=0 => v.x=c.x, solve f(v)=0 for v.x=c.x, v.y=v.z=t
    // then do the same for y and z cases
    {
        vec2 t = abs(solve_quadratic(qa.y + qa.z, qb.y + qb.z, qc.y + qc.z - 1.0));
        d = min(d, max(min(t.x, t.y), c.x));
    }
    {
        vec2 t = abs(solve_quadratic(qa.x + qa.z, qb.x + qb.z, qc.x + qc.z - 1.0));
        d = min(d, max(min(t.x, t.y), c.y));
    }
    {
        vec2 t = abs(solve_quadratic(qa.x + qa.y, qb.x + qb.y, qc.x + qc.y - 1.0));
        d = min(d, max(min(t.x, t.y), c.z));
    }

    // check faces:
    // df/dx=df/dy=0 => v.xy=c.xy, so f(v)=0 => |v.z - c.z|=r.z
    {
        d = min(d, max3(c.x, c.y, abs(c.z - r.z)));
        d = min(d, max3(c.x, abs(c.y - r.y), c.z));
        d = min(d, max3(abs(c.x - r.x), c.y, c.z));
    }

    // done
    return d;
}

float smoothmin(float a, float b, float r, out float e)
{
    e = max(r - abs(a - b), 0.0);
    return min(a, b) - e*e*0.25/r;
}

/* todo:

cubic strokes
cylinders
cones
triangular prisms
torus
round cubes (super ellipsoids with variable power per dim)
pyramids

operations:
* add
* subtract
* color
* soft blend

*/
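The simpler helpers above are easy to sanity-check on the CPU. Here is a Python transcription of ``max_norm``, ``sd_cube`` and ``solve_quadratic`` (the GLSL is the source of truth; this is just for quick verification at the REPL):

```python
# CPU-side transcription of the basic maxnorm helpers, for sanity checks.

SIGNED_DIST_MAX = 1000.0

def max_norm(v):
    # Chebyshev / L^inf norm of a 3-vector
    return max(abs(c) for c in v)

def sd_cube(p, r):
    # signed distance to an origin-centered axis-aligned cube of
    # half-extent r, under the maxnorm metric
    return max_norm(p) - r

def solve_quadratic(a, b, c):
    # roots of ax^2 + bx + c = 0, a non-zero; mirrors the GLSL version,
    # returning (SIGNED_DIST_MAX, SIGNED_DIST_MAX) for complex roots
    q = b * b - 4.0 * a * c
    if q < 0.0:
        return (SIGNED_DIST_MAX, SIGNED_DIST_MAX)
    r0 = -b / (2.0 * a)
    r1 = q ** 0.5 / (2.0 * a)
    return (r0 - r1, r0 + r1)
```

Note that under the maxnorm metric the cube SDF is exactly the sphere SDF of the Euclidean metric, which is what makes this family of functions attractive.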

A => doc/articles/nicer-teardown.txt +182 -0

@@ 0,0 1,182 @@

Nicer Teardown In Game Engines
==============================

by `@paniq <https://twitter.com/paniq>`_

A cool thing I figured out today is how to organize teardown in a
game engine without going nuts. I'm going to explain the thinking steps that
led to the solution before presenting it. It's a solution that is both
surprisingly fun to use and quite efficient.

A typical allocation scheme for a static codebase is to only allocate
memory at the beginning of a game. This allows for a simple stack
allocator which does not have to deal with fragmentation because it
never frees anything until the game ends. Then the stack allocator
just resets, and that's it. The user never has to manually free memory
or design destructor cascades.

That's a nice, simple way to do cleanup, but what about resources
borrowed from the operating system, like file handles, vertex buffers,
and so forth? What about memory allocated by third party libraries?
Could we do something similar there?

In C++, the destructor cascade pattern is typically used for this,
and its advantage is that the generated code is completely static.
The disadvantage is that *every* handle now needs to be wrapped in
a class type, and it becomes a major nuisance to incorporate naked
C APIs into the project.

Considering that setup and teardown are the inseparable Taiji
of a resource's lifetime, that one depends on the other and neither
is ever seen alone, is there a way to always keep both routines
closely together, in order to ease their maintenance?

Regarding this, there's an interesting addition in Golang, a feature
that Terra supports as well: the `defer <http://blog.golang.org/defer-panic-and-recover>`_
keyword. ``defer`` registers a call to be executed at the end of a scope,
and makes it possible to keep releases close to allocations.

``defer`` is meant for function level resource management, typically
short lived file handles.
Our game engine is large, modular, with many +things ideally held separate from another. Could we get a ``defer`` for +this scope? + +A dynamic solution that would work for a scripting language (although in a +syntactically unattractive way) would look like this in practice:: + + // initializing an engine module or plug-in + + function init_module () { + + // example allocation of a system resource + var handle = allocDrawingContext(); + + // register a deferred destructor for the teardown + memory.on_free(function () { + + // we can do more than one thing in this function + freeDrawingContext(handle); + + }); + + } + +One can build a relatively similar construct in C++, alas, the disadvantage +to classic destructor cascades is that we can't get nicely inlined static +teardown code this way. We also need to manage the list of function pointers to +be called on teardown. + +Suppose we could extend the compiler, or even better yet, use a language +that allowed to script the compiler using static macros, a language like +Terra, could we then get a more static solution for the code above? + +The question is leading. Of course we can! + +Here's how such an approach would look in Terra:: + + terra init_module () + -- example allocation of system resources + var handle : int = allocDrawingContext() + + -- register a deferred release for the teardown + -- + -- side note: of course we could (and should) wrap allocDrawingContext + -- to do it transparently, but in this example, let's bleed some + -- internals. 
+ [memory.on_free(freeDrawingContext, handle, int)] + + -- do a few more + var handle2 : int = allocDrawingContext() + [memory.on_free(freeDrawingContext, handle2, int)] + + var handle3 : int = allocDrawingContext() + [memory.on_free(freeDrawingContext, handle3, int)] + + end + + terra main() + init_module() + memory.teardown() + end + +And this is what the generated IL for ``main()`` looks like for LLVM:: + + define void @"$main"() { + entry: + %0 = tail call i32 @"$allocDrawingContext"() + store i32 %0, i32* @dvar, align 4 + %1 = tail call i32 @"$allocDrawingContext"() + store i32 %1, i32* @dvar1, align 4 + %2 = tail call i32 @"$allocDrawingContext"() + store i32 %2, i32* @dvar2, align 4 + tail call void @"$freeDrawingContext"(i32 %2) + %3 = load i32* @dvar1, align 4 + tail call void @"$freeDrawingContext"(i32 %3) + %4 = load i32* @dvar, align 4 + tail call void @"$freeDrawingContext"(i32 %4) + ret void + } + +"Sorcery!" you might exclaim like a caricature of a 14th century cleric, but +far from it. + +``memory.on_free()`` is declared as follows:: + + memory.on_free = function (func, ctx, ctxtype) + + -- create global variable holding the resource if argument + -- isn't already one + local dvar = (terralib.isglobalvar(ctx) and ctx) or global(ctxtype) + + -- insert at front so objects get destroyed in reverse + table.insert(memory, 1, {func=func, ctx=dvar}) + + -- initialize global variable + return quote + dvar = ctx + end + end + +and the teardown macro looks like this:: + + memory.teardown = macro(function () + + -- generate statement list of static destructor calls + local stmts = {} + for _,o in ipairs(memory) do + table.insert(stmts, `o.func(o.ctx)) + + end + return stmts + end) + +One point of criticism is that we're now storing each resource twice: once +within the context it was allocated for, and once in a global variable that +hogs a spot in the data segment. 
+ +Unlike C++, it is practical and safe in Terra to use a global variable for +resources, as global variables are hygienic and hidden by the closures they +were defined in at compile time. + +It would therefore be safe to do this:: + + -- global handles to be auto-destroyed later + + local handle = global(int) + local handle2 = global(int) + local handle3 = global(int) + + terra init_module () + handle = allocDrawingContext() + [memory.on_free(freeDrawingContext, handle)] + + handle2 = allocDrawingContext() + [memory.on_free(freeDrawingContext, handle2)] + + handle3 = allocDrawingContext() + [memory.on_free(freeDrawingContext, handle3)] + end + +and that reduces memory and instruction footprint to its theoretical minimum. + +
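For comparison with the static Terra version, the dynamic scripting-language variant sketched earlier in this article fits in a few lines of Python. ``Memory``, ``alloc`` and ``free`` are illustrative stand-ins, and unlike the Terra macro nothing here is inlined at compile time:

```python
# Dynamic sketch of the deferred-teardown registry (illustrative names).

class Memory:
    def __init__(self):
        self._deferred = []

    def on_free(self, func, ctx):
        # insert at the front so resources are released in reverse
        # order of acquisition
        self._deferred.insert(0, (func, ctx))

    def teardown(self):
        for func, ctx in self._deferred:
            func(ctx)
        self._deferred.clear()

# usage sketch with a fake resource API that records its calls
log = []
def alloc(n): log.append(('alloc', n)); return n
def free(n): log.append(('free', n))

memory = Memory()
for i in range(3):
    handle = alloc(i)
    memory.on_free(free, handle)
memory.teardown()
```

Running this releases the three handles in reverse order, which is the same guarantee the Terra ``memory.teardown`` macro provides statically.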

A => doc/articles/node-dom.txt +155 -0

@@ 0,0 1,155 @@

NOM - A Graph-Based DOM

by Leonard Ritter
Duangle GbR

First revision: 2014/09/14


For my noodles graph programming I ended up rewriting the model.

I used to have several separate classes for each type: Type, Function, Call,
Argument, Result, Module, Scope, Link.

Then I realized I first need to have an extensible data descriptor model which
is general enough to permit building the above classes as specialized data - and
maybe more.

Therefore I folded all these classes back into one, the Node, which is
comparable to a DOMObject or a JSONObject, with a few changes
particular to the model's application:

* the data is not edited in text form, so there is no "language" for this,
  just an API for (visual) editing, whose behavior must be
  configurable/hintable.

* the data structure can be split over several files, usually in a tree-like
  dependency structure.

* references need to be loosely coupled and overridable.

* configurable directed links/edges between nodes are required.

* a metadata/type/inheritance system is selectively required.

All in all, the result is comparable to my earlier Datenwerk system, with a few
major simplifications and amendments.

Here's a description of the features of the model:

Tree Structure
--------------

As with DOM/JSON, the node is a tree node, in that it has one parent and a list
of children, and that it is deleted when its parent is deleted.

Parent-children relationships are implicit and hard-referenced, in that they can
be implemented with a native pointer. There are soft references which I'll talk
about later.

All root nodes of loaded documents are attached to a constant, unrenameable,
unreparentable and undeletable single root node named "#" that exists only
during runtime and must not be saved.

Data
----

A node can have a hard reference to a piece of data, which is neither typed
nor specified; in DOM terms, we would call this "CDATA".
Typically, it can be +a number, a string or text block, a boolean, a table, an object, a sound buffer, +an image buffer, etc. The type of the data is opaque to the model and only +of relevance to particular applications. It is deleted with its owner. + +Typically, a user would tag this node with a child node or metanode describing +its type, or the type would be implicitly inferrable from context; +visualizations would have specific requirements as to how to make this +data reflectable. + +Names +----- + +Each node carries a non-empty name, which is comparable to DOM ids, but the +node name is only unique within its scope, similar to files in directories. +To reference a node, one needs the full path originating from the "#" node, in +the form of a list of node names. + +A name can either be readable and user-defined, or a random id (e.g. uuid) in +cases where the node has no textual representation and requires no repairable +linkability. Random ids need to be prefixed with "#" to distinguish them from +readable names. This has an effect on visualization and soft-reference rules. + +We call user-defined names explicit names and random ids implicit names. + +Paths +----- + +A path is what constitutes a so-called soft reference. + +Path name resolution is *not* the same as in most file systems. The name +resolution for the first path element travels upwards, treating the tree +structure as scope, comparable to namespace resolution in C, C++, Java, Lua, +Javascript, and a few others. From then on it resolves down, descending +into child nodes. + +Therefore, relative paths can directly start from the first commonly shared +parent node, if the parent is not immediate. + +That means, if two nodes in upper scopes share the same name, the path +will always resolve to the node in the nearest scope. + +Metanodes +--------- + +A node can use another node as a metanode, comparable to instances of classes +in C++, Python or Java; prototypes in Javascript; and metatables in Lua. 
+ +Metanodes are soft-referenced, which means they are referenced with a path, +in this case ideally a path with 1 to 2 elements. + +If the path to a metanode can not be resolved, the node is marked as incomplete +by the editor, and will lose all of the styling and behavior imposed by the +metanode until the reference is restored. A metanode path can not contain +implicit names. + +Aliases +------- + +When a path is resolved down and the path iterator can not find a child node +with the given name, it will continue the search in the node's metanode +instead. + +This way, one can import, alias or typedef nodes across widely separated +scopes. + +Links +----- + +This is the new bit that, to my knowledge, isn't part of any mainstream object +descriptor format, and which adds noodle-ability to this model. + +A node can be attached between two other nodes, called a source and a sink, +describing a directional graph, which effectively turns the node into an edge, +or link (not to be confused with HTML/XML links). + +Source and sink are referenced weakly; neither needs to share the same scope +as the link. + +Deletion rules are tricky: if either the source or sink path can not be resolved, +and the first unresolvable path name is implicit, the node is deleted. If both +paths can not be resolved, explicit or not, the node is always deleted. + +There are some more theoretical implications for links as nodes, for which +I know no immediate use cases: + +Because links are nodes, a link can have child nodes and a metanode. A node +can also use a link as a metanode. + +Because links are nodes, links can link links. This theoretically allows one to +build link cascades and Sierpinski triangles, but I don't know what those are +good for.
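The upward-then-downward resolution described under "Paths", combined with the metanode fallback from "Aliases", can be sketched in a few lines. This is a minimal illustrative sketch, not an actual NOM API: all class and function names (`Node`, `resolve_path`, etc.) are hypothetical, and metanodes are hard-referenced here for brevity where NOM would use a soft path.

```python
# Hypothetical sketch of NOM path resolution: the first path element is
# searched upward through enclosing scopes; remaining elements descend
# into children, falling back to the metanode (the alias rule).

class Node:
    def __init__(self, name, parent=None, metanode=None):
        self.name = name
        self.parent = parent
        self.children = {}
        # NOTE: a real NOM metanode is a soft path, not a pointer
        self.metanode = metanode
        if parent is not None:
            parent.children[name] = self

def resolve_first(node, name):
    # travel upward, treating the tree structure as nested scopes
    scope = node
    while scope is not None:
        if name in scope.children:
            return scope.children[name]
        scope = scope.parent
    return None

def find_child(node, name):
    # descend; if the child is missing, continue the search in the metanode
    while node is not None:
        if name in node.children:
            return node.children[name]
        node = node.metanode
    return None

def resolve_path(origin, path):
    node = resolve_first(origin, path[0])
    for name in path[1:]:
        if node is None:
            return None
        node = find_child(node, name)
    return node

# two nodes named "util" in nested scopes: the nearer one wins
root = Node("#")
a = Node("a", root)
util_outer = Node("util", a)
b = Node("b", a)
util_inner = Node("util", b)
leaf = Node("leaf", b)
# aliasing: "lib" has no children, but its metanode supplies "int"
types = Node("types", root)
t_int = Node("int", types)
lib = Node("lib", root, metanode=types)

assert resolve_path(leaf, ["util"]) is util_inner        # nearest scope
assert resolve_path(leaf, ["a", "util"]) is util_outer   # explicit scope
assert resolve_path(leaf, ["lib", "int"]) is t_int       # metanode alias
```

The asserts at the end double as a demonstration of the nearest-scope rule and the alias fallback.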

A => doc/articles/split-scheme.txt +149 -0

@@ 0,0 1,149 @@+ An Invertible Split Scheme for Cascading Shadow Maps + + by Leonard Ritter + Duangle GbR + + Second revision: 2016/01/01 + +
+DISCLAIMER: This is original research and has not been verified by a +second or third party yet. + + +This is a proposal for a new function for the practical split +scheme[1] for shadow map cascades by Zhang et al. The new +function can be inverted in a shader to easily project +distance z back to layer index i. An intuitive replacement +for the weighting factor sigma is proposed as well. + + +In the original article[1], a split scheme function was suggested to +find good boundaries for individual shadow map cascades, so that +given a layer index i, the optimal depth z for that index could +be derived; given a normalized scalar x which is retrieved via + + x = i / layer_count + +and a sigma constant S in the range 0..1, which would allow the +user to interpolate linearly between an equidistant depth distribution +and an ideal exponential one to bias the resolution distribution, +the original split scheme equation would be + + z = pow(f / n, x) n S + ((f - n) x + n) (1 - S) = f(x) + +n and f designate the distance of the near and far planes respectively. + +Unfortunately, this function can't be inverted in closed form. With this +solution, it is necessary to precompute and pass the layer boundaries as +coefficients to the pixel shader. This also limits applicability, +as the number of addressable layers is bounded by storage. + +The New Split Scheme +==================== + +Through experimentation and some guesswork, I found an alternative +way to bias the distribution, which is purely exponential and +has no additive polynomial component, so inversion is trivial. + +This bias effectively projects the near and far distances to a +slope further up the exponential curve, which dampens the initial +slow momentum and causes the function to approach a more linear +scaling. 
+ +The original sigma constant S is replaced by its square root, C = sqrt(S), +and the split scheme function is altered so that + + z = (f - n - f / C) (1 - pow(f / ((n - f) C + f), x)) + n = f(x) + +At the original proposed default of S = 0.5, the function starts +with near-identical momentum, but converges marginally sooner +towards f; across all values of S, the first half of the function +distributes very similarly, so this function can easily be used +as a drop-in replacement. + +The only limitation of this modified function is that, while there +is a solution for C = 0, z and x can not be computed there (the term +f / C divides by zero); this poses no issue, as S = 0 describes a simple +linear mapping that is trivial to invert, and is undesirable in the +context of efficient depth layer usage anyway. + +Analysis +======== + +http://i.imgur.com/9OeUMBz.png + +The above picture shows the distribution for sigma = 0.5; the +green and yellow curves show the boundaries at sigma 0.0 and 1.0, +which are the same for both old and new functions. +The purple curve is the original split scheme, the blue curve +is the modified one. + + +Shader Usage +============ + +In the shader, the layer index can now be retrieved by resolving x from a +distance z and multiplying it by layer_count, yielding the interpolated +index of the layer to sample. + +To calculate x from z, we invert the new split scheme function so that + + x = log(((z - far) C + far) / ((near - far) C + far)) + / log(far / ((near - far) C + far)) = f(z) + +For optimization, the inner factors and the second log term can be stored in +three constant variables U, V, W: + + d = far * (1 - C) + near * C + + U = C / d + + V = far * (1 - C) / d + + W = layer_count * log(2) / log(far / d) + +With these optimizations, the resolve reduces to + + i = log2(z U + V) W + + +Future Work +=========== + +I did not find the original weighting parameter sigma to be particularly +easy to understand and apply. 
In practice, I found myself aiming to get +an optimal balance between resolution and the range of the first cascade +layer by guessing values for sigma; there was no clear relationship +between the parameter and its implications. + +Ideally, I would prefer a new balancing variable z1 instead of sigma, which +defines the distance at which the first shadow map layer ends, and is +used to calculate the bias C. + +Unfortunately, I could not find a way to solve the equation for C, +so I leave this for further exploration. Until a simpler method can be found, +bisecting on C would be the simplest choice; a routine would +methodically attempt to solve the split scheme function by adjusting C until +z = z1 for x = 1 / layer_count. + + +References +========== + +[1] Parallel-Split Shadow Maps on Programmable GPUs + http://http.developer.nvidia.com/GPUGems3/gpugems3_ch10.html + + +Thanks to Sean Barrett and Fabian Giesen for help and suggestions.
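As a sanity check, the forward scheme, its inverse, and the U/V/W-optimized resolve described above can be verified against each other numerically. This is an illustrative sketch following the equations in the article; the near, far and layer_count values are arbitrary test inputs, not part of the proposal.

```python
import math

# Numeric sanity check of the invertible split scheme.
# near/far/layer_count are arbitrary test inputs.
near, far, layer_count = 1.0, 1000.0, 8
S = 0.5
C = math.sqrt(S)

def split_z(x):
    # forward: z = (f - n - f/C) (1 - (f / ((n - f) C + f))^x) + n
    d = (near - far) * C + far
    return (far - near - far / C) * (1.0 - (far / d) ** x) + near

def split_x(z):
    # inverse: x = log(((z - f) C + f) / d) / log(f / d)
    d = (near - far) * C + far
    return math.log(((z - far) * C + far) / d) / math.log(far / d)

# optimized shader-style resolve using precomputed U, V, W
d = far * (1.0 - C) + near * C
U = C / d
V = far * (1.0 - C) / d
W = layer_count * math.log(2.0) / math.log(far / d)

def layer_index(z):
    return math.log2(z * U + V) * W

# endpoints map to the near and far planes...
assert abs(split_z(0.0) - near) < 1e-9
assert abs(split_z(1.0) - far) < 1e-6
# ...and forward/inverse/optimized resolve agree at every boundary
for i in range(layer_count + 1):
    x = i / layer_count
    z = split_z(x)
    assert abs(split_x(z) - x) < 1e-9
    assert abs(layer_index(z) - i) < 1e-6
```

The loop confirms that `i = log2(z U + V) W` recovers the exact layer boundary index produced by the forward function, so the precomputed constants can be passed to the pixel shader as-is.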

A => doc/articles/twistpool.c +317 -0

@@ 0,0 1,317 @@+/* + +Twisting Pool Allocator +======================= +written by Leonard Ritter (leonard.ritter@duangle.com) + +This file is in the public domain + +I don't know if I was the first one to stumble upon this technique, so +I can't guarantee there's no patent on it, but let's hope there's not; +then this counts as prior art. + +-------------------------------------------------------------------------------- + +This is a proof of concept implementation for a pool allocator that guarantees +compactness (unsorted gapless iteration without indirections) while preserving +element ids (using one order-optimized indirection), with insertion, deletion +and lookup in O(1) time. + +Because all id <-> index assignments are symmetric swaps (either +identity-mapped or twisted), only a single table is required to resolve +index from id and id from index. + +Insertion and deletion together form a self-sorting process. Both operations +attempt to untwist identifiers that share the same status (obtained/released) +and so, regardless of deallocation order, only a few entries are accessed and +fragmentation is generally low; in my random allocation test, the average number +of fragmented entries appeared to be ~log2(capacity). It is possible to produce +a worst case by obtaining all ids, then releasing every other id. The upper +bound on fragmented entries appears to be capacity/3. + +The data being managed must be allocated separately, and the user must +copy at most one element (possibly none) after a call to obtain/release. + +With this implementation, the memory requirement is +sizeof(unsigned int) * capacity, typically 4 bytes per entry. + +It might however be possible to implement the map with a compact sorted array or +hashtable, storing only twisted entries, which should tremendously reduce memory +usage and therefore notably improve cache efficiency. 
+ +-------------------------------------------------------------------------------- + +*/ + +#include <stdbool.h> +#include <assert.h> + +// use whatever size you prefer; this implementation operates directly on +// global memory; a productive implementation would probably use the heap. +#define CAPACITY 10000 +#define N (CAPACITY+1) + +/* index: offset of element in data array; elements are moved + to keep the array compact and contiguous + + id: numerical name assigned to a data element; never changes. + therefore, elements should typically be referenced by their id, + not their index. + + id 0 and index 0 are synonymous with no element + */ + +/* symmetrical map + maps index -> id; compact, gapless + last index == count + happens to also map id -> index; has gaps + */ +unsigned int map[N]; + +/* all ids are mapped, even the unused ones; + the ideal mapping is id == index + the first free index is count+1 + if count == capacity, the pool is full. + */ +unsigned int count; + +// pool constructor +void construct () { + unsigned int i; + // initially no elements + count = 0; + for (i = 0; i < N; ++i) { + // start out with identity mapping; + // all unused ids are mapped + map[i] = i; + } +} + +// lookup index from id or id from index +unsigned int resolve(unsigned int id_or_index) { + return map[id_or_index]; +} + +bool is_index_valid (unsigned int index) { + return (index > 0) && (index <= count); +} + +bool is_id_valid (unsigned int id) { + return (id > 0) && (id <= CAPACITY) && is_index_valid(resolve(id)); +} + +bool is_identity (unsigned int id_or_index) { + return (map[id_or_index] == id_or_index); +} + +// swap the ids of two indices (or the indices of two ids) +void swap (unsigned int a, unsigned int b) { + // ids can only be twisted or untwisted, map must not be shuffled any further + assert((a == b) + || ((map[a] == a) && (map[b] == b)) + || ((map[a] == b) && (map[b] == a))); + unsigned int t = map[a]; + map[a] = map[b]; + map[b] = t; +} + +typedef struct 
_pair { + unsigned int _0; + unsigned int _1; +} pair; + +/* allocate a new id from the pool + user must copy slot[index(id)] to slot[count] after + allocation. obtain() returns a { src, dst } tuple indicating + how data must be moved; if (dst == src), then no move is necessary. + src is also identical to the newly obtained id. + if the pool is full, obtain returns { 0, 0 } + */ +pair obtain () { + if (count < CAPACITY) { + unsigned int index; + unsigned int id; + + // increase number of elements + ++count; + // index of new last element + index = count; + // id of new last element + id = map[index]; + // if id not identical to index (is twisted) + if (id != index) { + // swap with index that matches this id, + // so that index(id) == id + swap(index, id); + } + // return new id/index and index of moved item + pair result = { id, index }; + return result; + } else { + // out of space + pair result = { 0, 0 }; + return result; + } +} + +/* release an obtained id to the pool + user must copy slot[count] to slot[index(id)] after + deletion. release(id) returns a { src, dst } tuple indicating + how data must be moved; if (src == dst), then no move is necessary. 
+ */ +pair release (unsigned int id) { + unsigned int index; + unsigned int last_index; + unsigned int last_id; + + assert(is_id_valid(id)); + + // index of element to be deleted + index = map[id]; + + // if element is twisted, then untwist + if (id > count) + swap(index, id); + // index and id of element to take its place + last_index = count; + last_id = map[last_index]; + + // swap indices so that tailing element fills the gap (twist) + // if last element is twisted, then untwist first + if (last_id > count) { + swap(last_index, last_id); + swap(index, last_id); + } else { + swap(index, last_index); + } + + // decrease number of elements + --count; + // return index of element to be moved and index of gap + pair result = { last_index, index }; + return result; +} + +// ---------------------------------------------------------------------------- +// test + +#include <stdlib.h> +#include <stdio.h> + +// our "data array", which just stores the id again, so we can easily verify +// if an id still matches its assigned content +unsigned int data[N]; + +void dump() { + unsigned int i; + + printf("index -> ID:\n"); + for (i = 1; i <= CAPACITY; ++i) { + if (i == (count + 1)) + printf("-------\n"); + unsigned int id = resolve(i); + unsigned int ri = resolve(id); + printf("%u\t%u\t%s\n", i, id, ((i != ri)?"!":"")); + } + printf("%u used elements\n\n", count); +} + +void move(pair k) { + if (k._0 != k._1) { + data[k._1] = data[k._0]; + } +} + +unsigned int verify_data () { + unsigned int i; + unsigned int twisted = 0; + for (i = 1; i <= count; ++i) { + unsigned int id = resolve(i); + assert(data[i] == id); + assert(resolve(id) == i); + if (!is_identity(i)) + ++twisted; + } + return twisted; +} + +unsigned int test_obtain () { + pair k = obtain(); + move(k); + data[k._0] = k._0; + return k._0; +} + +void test_release (unsigned int id) { + assert(id != 0); + pair k = release(id); + move(k); +} + +#include <string.h> +#include <time.h> + +// array of ids in use 
+unsigned int used[CAPACITY]; + +int main (int argc, char** argv) { + int i; + unsigned int mintwisted = N; + unsigned int maxtwisted = 0; + unsigned int total = 0; + unsigned int steps = 0; + unsigned int used_count = 0; + + memset(used, 0, sizeof(used)); + + construct(); + + srand((unsigned int)time(NULL)); +#if 1 + // do random obtains/releases, see if something breaks + for (i = 0; i < 100000; ++i) { + if (((rand() % 100) < 50) && (used_count > 0)) { + unsigned int used_index = rand() % used_count; + unsigned int id = used[used_index]; + // remove from used array and fill + used_count--; + used[used_index] = used[used_count]; + used[used_count] = 0; + test_release(id); + unsigned int t = verify_data(); + mintwisted = (mintwisted < t)?mintwisted:t; + maxtwisted = (maxtwisted > t)?maxtwisted:t; + total += t; + ++steps; + } else { + unsigned int k = test_obtain(); + if (k != 0) { + assert(used_count < CAPACITY); + used[used_count] = k; + ++used_count; + } + verify_data(); + } + } +#else + // attempt to fabricate a worst case + for (i = 0; i < CAPACITY; ++i) { + test_obtain(); + } + for (i = 1; i <= CAPACITY; i += 2) { + test_release(i); + unsigned int t = verify_data(); + mintwisted = (mintwisted < t)?mintwisted:t; + maxtwisted = (maxtwisted > t)?maxtwisted:t; + total += t; + ++steps; + } +#endif + + dump(); + printf("releases: %u\nmin twisted: %u\nmax twisted: %u\ntotal twisted: %u\naverage: %f\n", + steps, mintwisted, maxtwisted, total, (double)total / (double)steps); + printf("OK.\n"); + + return 0; +}