b46d0be5bf8f — Leonard Ritter a month ago
* initial check-in articles written for liminal
A => doc/articles/gpu-delaunay-transform.txt +160 -0
@@ 0,0 1,160 @@
+Parallelizable Gather-based Delaunay Transforms
+-----------------------------------------------
+by `Leonard Ritter <https://twitter.com/paniq>`_, `Duangle GbR <http://duangle.com>`_
+Today I figured out how to do GPU-friendly Delaunay transforms, and I'd like to
+describe how the algorithm works before I forget everything.
+What the algorithm does is to flip edges in a half-edge triangle mesh where a flip
+would improve the quality of the adjacent triangles (an equilateral triangle
+has perfect quality). The triangle mesh I'm using is embedded in R2, that is,
+it's on a 2D plane, has no holes, is not a manifold, and is typically a heavily
+subdivided square; but the algorithm should work just as well for 2-manifolds in 3D too.
+The mesh is not guaranteed to be Delaunay conforming after a single transform,
+and there is no halting condition. The algorithm is designed to be run in
+realtime, continuously refining a mesh with moving vertices each frame. The
+algorithm supports constrained and unconstrained Delaunay transforms.
+Overview
+^^^^^^^^
+There are three passes to this algorithm, inspired by the approach of
+Navarro et al. in _. Navarro et al.'s approach is scatter-based and requires atomics
+to work, but it was a good starting point for figuring out how a lo-fi version
+might work. Each pass can be massively parallelized (the iteration
+is order-independent), requires no GPGPU implementation or compute shaders, and
+is therefore a good fit for OpenGL 3.2 compatible hardware.
+The first pass iterates all edges of the mesh and evaluates flip candidates based
+on whether a flip would improve the quality. The second pass iterates all
+triangles and does the flip, using a simple precedence heuristic to avoid
+collisions. The third pass fixes broken adjacency information.
+Data Structure
+^^^^^^^^^^^^^^
+The data structure I use for the triangle mesh is a reduced version of the
+structure described by Fabian Giesen in his Half-edges Redux article _.
+Each triangle has only three edges sorted in counter-clockwise order so that
+``edgeId(triangleId, corner) = triangleId * 3 + corner``. That makes it possible
+to store the vertex indices for each edge (which map to the vertex opposite
+each edge inside the same triangle, *not* each edge's starting vertex) as a
+linear element array, the kind of format that OpenGL prefers for indexed
+triangles. There are no wedges in this structure, and the algorithm needs to be
+modified if there are.
+I also store an array of opposite half-edges, one per half-edge (which I will call twins for
+brevity), so that ``vertex(twin(edgeId))`` gives the vertex opposite each edge
+in the adjacent triangle. That means each potential quad that an edge is the
+diagonal of is described by the vertices ``vertex(edgeId)``,
+``vertex(next(edgeId))``, ``vertex(twin(edgeId))`` and ``vertex(prev(edgeId))``
+in counter-clockwise order. The algorithm needs the twin array to work,
+as well as a temporary copy of the twin array.
+Here is a breakdown of the data structure::
+    V vertices
+    T triangles
+    E=3*T half-edges
+    // position[vertexId] = position of each vertex in R2
+    // Read in pass 1.
+    vec2 position[V];
+    // vertex[edgeId] = vertexId of inner vertex opposite a half-edge
+    // Read in passes 1 & 2, written in pass 2.
+    uint vertex[E];
+    // twin[edgeId] = edgeId of the twin half-edge connecting the same vertices
+    // Read in passes 1, 2 & 3, written in pass 3.
+    uint twin[E];
+    // temporary flag array, three bits per triangle (one bit per half-edge).
+    // Written in pass 1, read in pass 2.
+    uint8 candidates[T];
+    // temporary twin array to avoid a race condition.
+    // Written in pass 2, read in pass 3.
+    uint tmptwin[E];
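+Written out in Python (my own transcription, with plain lists standing in for
+the GPU-side arrays; ``-1`` marks an unpaired boundary half-edge), the indexing
+and the quad lookup are:

```python
def edge_id(triangle_id, corner):
    # edges of a triangle are stored consecutively, in CCW order
    return triangle_id * 3 + corner

def next_e(e):
    # next half-edge, counter-clockwise within the same triangle
    return (e // 3) * 3 + (e + 1) % 3

def prev_e(e):
    # previous half-edge within the same triangle
    return (e // 3) * 3 + (e + 2) % 3

def quad(vertex, twin, e):
    # the four corners of the quad that edge e is the diagonal of, in CCW order
    return (vertex[e], vertex[next_e(e)], vertex[twin[e]], vertex[prev_e(e)])
```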
+Algorithm
+^^^^^^^^^
+The algorithm performs three passes, which are as follows:
+#. For each triangle, for each of its half-edges, collect the four adjacent
+   vertices that form a quad with this half-edge as its diagonal and estimate the
+   quality of triangles with both possible diagonals. I use a quality estimator
+   described by Müller et al. in _. If a flip would improve the quality,
+   store a bit for this edge in ``candidates``. By default, ``candidates[triangleId]``
+   must be initialized to zero.
+   It is important that the quality computations use the exact vertex order from
+   both sides of each edge to compute the exact same floating point values and
+   arrive at the exact same predicate, or floating point imprecisions will cause
+   asymmetrical results, and subsequent steps will destroy the mesh. I got good
+   results with a fixed point implementation of the quality estimator that is
+   insensitive to vertex order.
+   If the transform is supposed to be constrained, then constrained edges should
+   simply never be tagged.
+#. For each triangle, find the first half-edge that is tagged by ``candidates``
+   as a flip candidate. If no half-edge is tagged, then skip. If more
+   than one half-edge is tagged, the tagged half-edge with the lowest lexicographic
+   index wins.
+   The lexicographic index is a ``uint64`` built from the two half-edge IDs
+   that make up a single edge.
+   ``index(edgeId) = min(edgeId,twin[edgeId]) | (max(edgeId,twin[edgeId]) << 32)``.
+   The half-edge candidate can still collide with tagged half-edges in the
+   neighboring triangle. If its lexicographic index is *not* the lowest among its
+   tagged neighbors, then skip. Otherwise, do this triangle's part of the flip:
+   The flip is executed by mapping the half-edge's end vertex to the
+   outer opposing vertex so that ``vertex[prev(edgeId)] = vertex[twin[edgeId]]``.
+   Because this moves two half-edges, the local twin information must also be
+   corrected, so that ``tmptwin[next(edgeId)] = next(twin[edgeId])`` and
+   ``tmptwin[edgeId] = twin[next(twin[edgeId])]``.
+   Because both triangles participating in this flip arrive at the same conclusions,
+   do the same operation independently, and the flip operation itself is symmetric,
+   the connectivity will be restored without collision or ambiguity.
+   If no operation is performed, tmptwin must still be initialized so that
+   ``tmptwin[edgeId] = twin[edgeId]``.
+#. There's one last problem to correct, which is the invalid adjacency information
+   stored in half-edges bordering on triangles that have been flipped.
+   For each half-edge, determine if its twin has changed by comparing
+   ``tmptwin[edgeId] != twin[edgeId]``. If the twin has changed, then we simply
+   update ``twin[edgeId] = tmptwin[edgeId]``.
+   We also check if this half-edge's information is still up to date by
+   verifying the bi-directionality of the link using the predicate
+   ``edgeId == tmptwin[tmptwin[edgeId]]``. If the predicate fails, then we have
+   to search for the new, correct adjacent half-edge, whose location is
+   fortunately invariant.
+   We now need to discern whether ``edgeId`` does belong to a half-edge
+   bordering on a flipped triangle using the predicate
+   ``tmptwin[tmptwin[tmptwin[edgeId]]] == tmptwin[edgeId]``.
+   If the comparison holds true, we update the half-edge information to the
+   invariant location of its new half-edge so that
+   ``twin[edgeId] = prev(tmptwin[tmptwin[edgeId]])``, and we're done.
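+Put together, the three passes can be sketched sequentially in Python. This is
+my own transcription, not the article's GPU code: the standard incircle test
+stands in for the Müller et al. quality estimator, and ``-1`` marks an unpaired
+boundary half-edge, which is never tagged.

```python
def next_e(e): return (e // 3) * 3 + (e + 1) % 3
def prev_e(e): return (e // 3) * 3 + (e + 2) % 3

def incircle(a, b, c, d):
    # > 0 iff d lies inside the circumcircle of the CCW triangle (a, b, c)
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    return ((ax * ax + ay * ay) * (bx * cy - cx * by)
          - (bx * bx + by * by) * (ax * cy - cx * ay)
          + (cx * cx + cy * cy) * (ax * by - bx * ay))

def pass1(position, vertex, twin):
    # tag half-edges whose flip would improve the triangulation
    T = len(vertex) // 3
    candidates = [0] * T
    for e in range(3 * T):
        if twin[e] < 0:        # boundary/constrained edges are never tagged
            continue
        a, b, c, d = (position[vertex[e]], position[vertex[next_e(e)]],
                      position[vertex[twin[e]]], position[vertex[prev_e(e)]])
        # flip if the outer vertex c lies inside the circumcircle of (a, b, d)
        if incircle(a, b, d, c) > 0:
            candidates[e // 3] |= 1 << (e % 3)
    return candidates

def pass2(vertex, twin, candidates):
    # each triangle independently performs its half of a winning flip
    new_vertex, tmptwin = list(vertex), list(twin)
    def index(e):
        # lexicographic uint64 index; only called on tagged (interior) edges
        return min(e, twin[e]) | (max(e, twin[e]) << 32)
    for t in range(len(candidates)):
        tagged = [t * 3 + c for c in range(3) if candidates[t] & (1 << c)]
        if not tagged:
            continue
        e = min(tagged, key=index)   # lowest lexicographic index wins locally
        t2 = twin[e] // 3
        rivals = [t2 * 3 + c for c in range(3) if candidates[t2] & (1 << c)]
        if any(index(r) < index(e) for r in rivals):
            continue                 # a neighbouring candidate takes precedence
        new_vertex[prev_e(e)] = vertex[twin[e]]
        tmptwin[next_e(e)] = next_e(twin[e])
        tmptwin[e] = twin[next_e(twin[e])]
    return new_vertex, tmptwin

def pass3(twin, tmptwin):
    # repair adjacency of half-edges bordering on flipped triangles
    new_twin = list(twin)
    for e in range(len(twin)):
        t = tmptwin[e]
        if t != twin[e]:
            new_twin[e] = t
        if t >= 0 and tmptwin[t] != e:
            # stale link: if the neighbour borders a flipped triangle,
            # its new location is invariant
            tt = tmptwin[t]
            if tt >= 0 and tmptwin[tt] == t:
                new_twin[e] = prev_e(tt)
    return new_twin
```

+On a two-triangle quad whose diagonal violates the Delaunay condition, one
+iteration of the three passes flips the diagonal and leaves consistent twins.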
+References
+^^^^^^^^^^
+..  Navarro et al.
+    `A parallel GPU-based algorithm for Delaunay edge-flips <http://swp.dcc.uchile.cl/TR/2011/TR_DCC-20110315-003.pdf>`_
+..  Fabian Giesen
+    `Half-edges redux <https://fgiesen.wordpress.com/2012/04/03/half-edges-redux/>`_
+..  Müller et al.
+    `Air Meshes for Robust Collision Handling <http://matthias-mueller-fischer.ch/publications/airMeshesPreprint.pdf>`_

A => doc/articles/half-edge-tet.txt +200 -0
@@ 0,0 1,200 @@
+Symmetric Half-Edge Tetrahedron Predicates
+Author: Leonard Ritter <leonard.ritter@duangle.com>
+2015-04-19
+This document describes a compact and efficient half-edge mapping for
+tetrahedra. Its indices have been purposefully arranged so that all
+invariant adjacency and incidence predicates of a tetrahedron can be computed
+from one another.
+All queries ultimately make use of one single bit-arithmetic predicate,
+faceVertex(). For increased performance, computed results can be cached in
+bit tables.
+The vertices are numbered this way:
+     0
+     +
+    /|\
+   / | \
+1 +-----+ 2
+   \ | /
+    \|/
+     +
+     3
+The numbering scheme describes a Hamiltonian cycle, and is therefore suited for
+patch rendering.
+Faces run counter-clockwise when observed from the outside; if one would like
+the hull to face inward instead, one can flip vertices V1 and V2 in the map
+above and have faces run clockwise. The topology is not altered, so none
+of the predicates are affected by this choice.
+Wherever edges are mentioned in the following sections, it is implied that
+half-edges are meant. There are no full-edge indices in this scheme.
+Face -> Vertices
+================
+F0: V3 V2 V1
+F1: V2 V3 V0
+F2: V1 V0 V3
+F3: V0 V1 V2
+Face indices have been chosen so that face F lies opposite vertex V if V = F.
+This way, for any base face F, (F+1)%4, (F+2)%4, (F+3)%4 form the faces
+of the pyramid cone, with vertex V = F as its apex.
+vertexOppositeFace(vertex) -> vertex
+faceOppositeVertex(face) -> face
+The phase for each triangle has been picked so that the vertex of index 0 of
+each face F equals F^3 (equivalently 3-F), and each vertex is referenced across
+a diagonal of the above 4x3 face-to-vertex table.
+The above table can be computed from the function:
+faceVertex(face,corner) -> ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3
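+Transcribed to Python (the snake_case naming is mine), the formula reproduces
+the table exactly:

```python
def face_vertex(face, corner):
    # faceVertex(face, corner) from above, bit for bit
    return ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3

# rows are faces F0..F3, columns are corners 0..2
table = [[face_vertex(f, c) for c in range(3)] for f in range(4)]
```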
+Vertex -> Faces
+===============
+V0: F1 F3 F2
+V1: F0 F2 F3
+V2: F3 F1 F0
+V3: F2 F0 F1
+The inverse table maps each vertex to one of three incident faces sharing
+an edge with the opposite face F = V, in the clockwise order of its edges.
+The table is computed from the function:
+cornerOpposite(corner) -> (corner + 2) % 3
+vertexFace(vertex,corner) -> faceVertex(vertex, cornerOpposite(corner))
+When a face is passed instead of a vertex, the function returns the
+index of the face adjacent to the edge starting at the given corner.
+Edge -> (Start Vertex, End Vertex)
+==================================
+ E0: V0 -> V1
+ E1: V0 -> V3
+ E2: V0 -> V2
+ E3: V1 -> V0
+ E4: V1 -> V2
+ E5: V1 -> V3
+ E6: V2 -> V3
+ E7: V2 -> V1
+ E8: V2 -> V0
+ E9: V3 -> V2
+E10: V3 -> V0
+E11: V3 -> V1
+Edges have been mapped so that all edges sharing the same starting vertex
+are grouped and pointing towards the vertices of the opposing triangle in
+counterclockwise order, starting at the vertex opposite to each face's first
+edge.
+The table can be computed from the two functions:
+edgeStart(edge) -> floor(edge / 3)
+edgeEnd(edge) -> vertexFace(edgeStart(edge), edge)
+The Hamiltonian edge path is E0 E4 E6 E10, equivalent to the function
+hamiltonian(index) -> ((index & 1) << 2) + 3 * (index & 2)
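+In the same Python transcription, the two functions and the Hamiltonian path
+check out against the table above:

```python
def face_vertex(face, corner):
    return ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3

def corner_opposite(corner):
    return (corner + 2) % 3

def vertex_face(vertex, corner):
    return face_vertex(vertex, corner_opposite(corner))

def edge_start(edge):
    return edge // 3

def edge_end(edge):
    # the full edge index may be passed as the corner;
    # corner_opposite wraps it modulo 3
    return vertex_face(edge_start(edge), edge)

def hamiltonian(index):
    return ((index & 1) << 2) + 3 * (index & 2)
```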
+Vertex -> Edges
+===============
+Because of the grouping, computing an edge with a given start vertex is trivial:
+vertexEdge(vertex,corner) -> vertex * 3 + corner
+When a face instead of a vertex is passed, the function returns the edges
+that connect the face-opposing vertex with the edge-opposing vertex.
+Face -> Edges
+=============
+F0: E9  E7  E5
+F1: E6 E10  E2
+F2: E3  E1 E11
+F3: E0  E4  E8
+Due to the previously made indexing choices, the starting edge of a face
+can be computed from its vertex index and the desired corner.
+faceEdge(face,corner) -> vertexEdge(faceVertex(face,corner), corner)
+Edge -> (Face, Corner)
+======================
+ E0: F3 C0
+ E1: F2 C1
+ E2: F1 C2
+ E3: F2 C0
+ E4: F3 C1
+ E5: F0 C2
+ E6: F1 C0
+ E7: F0 C1
+ E8: F3 C2
+ E9: F0 C0
+E10: F1 C1
+E11: F2 C2
+You may notice that the mapping is equivalent to the face -> vertex mapping,
+due to the F = V equality when F and V oppose each other.
+The inverse can be computed as well:
+edgeCorner(edge) -> edge % 3
+edgeFace(edge) -> faceVertex(edgeStart(edge), edgeCorner(edge))
+Edge -> Opposite Edge
+=====================
+ E0:  E3
+ E1: E10
+ E2:  E8
+ E3:  E0
+ E4:  E7
+ E5: E11
+ E6:  E9
+ E7:  E4
+ E8:  E2
+ E9:  E6
+E10:  E1
+E11:  E5
+The half-edge opposite to a half-edge is again computed the same way,
+making use of a flipped Face -> Edge predicate:
+edgeOpposite(edge) -> vertexEdge(vertexFace(edgeStart(edge), edge), edgeCorner(edge))
+Edge -> Next Edge
+=================
+ E0: E4
+ E1: E11
+ E2: E6
+ E3: E1
+ E4: E8
+ E5: E9
+ E6: E10
+ E7: E5
+ E8: E0
+ E9: E7
+E10: E2
+E11: E3
+The functions used to retrieve the previous / next edge in counterclockwise
+order within a face are:
+edgeNext(edge) -> faceEdge(edgeFace(edge), (edge + 1) % 3)
+edgePrev(edge) -> faceEdge(edgeFace(edge), cornerOpposite(edge))
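+All of the derived predicates can be verified against the tables in this
+document in one go (Python transcription; the helper chain is repeated so the
+snippet is self-contained):

```python
def face_vertex(face, corner):
    return ((((face & 2) + corner) & 3) ^ (face & 1)) ^ 3

def corner_opposite(corner):
    return (corner + 2) % 3

def vertex_face(vertex, corner):
    return face_vertex(vertex, corner_opposite(corner))

def vertex_edge(vertex, corner):
    return vertex * 3 + corner

def face_edge(face, corner):
    return vertex_edge(face_vertex(face, corner), corner)

def edge_start(edge):
    return edge // 3

def edge_corner(edge):
    return edge % 3

def edge_face(edge):
    return face_vertex(edge_start(edge), edge_corner(edge))

def edge_opposite(edge):
    return vertex_edge(vertex_face(edge_start(edge), edge), edge_corner(edge))

def edge_next(edge):
    return face_edge(edge_face(edge), (edge + 1) % 3)

def edge_prev(edge):
    return face_edge(edge_face(edge), corner_opposite(edge))
```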

@@ 0,0 1,262 @@
+"""
+The Liminal RNode System - Scheduling Renderjobs In A Crazy World
+by Leonard Ritter, Duangle GbR (2014.4.11)
+Structuring rendering in a 3D application to allow for drastic alterations
+    in pipeline ordering isn't easy; you usually end up wanting all kinds of
+    permutations, like "I want _this_ actor with _this_ mesh, but this time into
+    _this_ framebuffer, but with the same camera data; screw that; we need
+    the same data in two different rendertargets, oh and I forgot we need
+    splitscreen, so all this needs to be split into four viewports - except
+    for this object, which needs to be on top of it all."
+...and the graphics engine coder works his ass off for two more weeks; usually
+these changes ripple through the entire codebase, starting with basic support
+at the engine level and moving over to the game programmer, who then has
+to make use of the new flags and refactor the rendering setup.
+"""
+# so here's how to setup scenes in our Liminal engine now;
+# it's a bit like linking generators and effects in a modular sequencer,
+# only that it works on packets (aka renderjobs) instead of streams of
+# audio impulses.
+# I haven't seen a game engine which organizes its renderpath like this yet,
+# so I'm pretty excited to demo this and get everyone's input on what I
+# think is a pretty lean and elegant way to organize 3D rendering in a game.
+# first, generate a default, uv mapped cube mesh with normals
+mesh = generate_cube_mesh()
+mesh.name = "cube"
+# setup a deferred material for the mesh
+material = GMaterial.new()
+material.name = "CubeMaterial"
+material.use_backface_culling = True
+material.albedo = vec3(1.0,1.0,1.0)
+material.metallic_factor = 0.0
+material.roughness_factor = 1.0
+# create a new actor to give the mesh a location
+# the actor has no conception of anything else; just an orientation in space
+cube2 = Actor()
+cube2.name = 'cube2'
+# give it a physical body for collision testing
+cube2.body = GeomBody.new(ws.world, Box.new(ws.space, vec3(64,64,1)))
+cube2.body.kinematic = True
+# apply position and size
+cube2.position = vec3(0,0,-10)
+cube2.scale = vec3(32.0,32.0,0.5)
+# now concatenate our renderjob:
+# we want this mesh with this material at this actor's location,
+# so cube2 now depends on material, which depends on mesh
+cube2 << material << mesh
+# the order of assignments is only interesting in relation to the graph,
+# the only requirement is that the mesh is a source, as meshes are
+# the only nodes that generate renderjobs; other nodes only augment them.
+#
+# this would be equally valid, and a good setup for when many actors
+# share the same material:
+material << cube2 << mesh
+# now setup an empty vizgroup which we use to bundle a group of objects,
+# possibly to control visibility later. This would be kind of like a "scene"
+vizgroup = RNode()
+vizgroup.name = 'scene'
+# make vizgroup depend on cube2
+vizgroup << cube2
+# The RNode is the basic building block for the render graph; materials, meshes,
+# actors, framebuffers, etc. and a lot of other default resources already
+# implement the RNode interface so they can be linked; but it's pretty easy
+# to write your own RNodes, or use a RNode as proxy to re-use another RNode in
+# two different unrelated paths.
+# get the singleton for the deferred renderer's resources, which is also
+# an RNode
+grm = GRenderManager()
+# link the vizgroup to the framebuffer for the deferred "scene" -
+grm.scene_framebuffer << vizgroup
+# now the last thing we need to complete the job is a camera to actually
+# specify where the scene is in relation to the viewer;
+camera = DebugCamera(lock_up=True)
+camera.name = 'camera'
+camera.far = 300.0
+camera.fov = 90.0
+# give the camera a collision box
+camera.body = GeomBody.new(ws.world, Box.new(ws.space, vec3(1,1,1)))
+camera.body.kinematic = True
+# and position it
+camera.look_at(vec3(0,-70.0,0), vec3(0,0,0), vec3(0,0,1))
+# all resources the deferred renderer needs are now set up, except for the view
+# data, so let's just link the entire grm through it and we're done; the
+# scene is on screen.
+#
+# retrieve render manager singleton
+rm = RenderManager()
+# complete the chain; batch is the root node - now the renderer can 'see'
+# the content.
+rm.batch << camera << grm
+# For debugging, I have written a small 90 line function to have graphviz
+# draw a map of the graph, so I can see at any point what the pipeline looks
+# like.
+"""
+"""
+# the declaration for RNodes is only 70 lines, and writing new RNodes is simple;
+# here is an RNode that changes the renderpass of all jobs going through it:
+class Renderpass(RNode):
+    def __init__(self, renderpass):
+        RNode.__init__(self)
+        self.renderpass = renderpass
+    def process(self, batch, jobs):
+        RNode.process(self, batch, jobs)
+        for job in jobs:
+            # we could also insert new orders into the jobs array here
+            # but let's just alter the renderpass
+            job.key.renderpass = self.renderpass
+# rnodes are only traversed on graph change, not per frame, and the work can
+# be done in a thread, as the renderer isn't required for processing.
+#
+# this is how the renderbatch then generates all jobs while traversing
+# the graph:
+def render_nodes(self):
+    # this could be done in a more concise way in python,
+    # but I wanted to keep the implementation parallel to a possible
+    # future C implementation; this one doesn't need any hash tables
+    # or binary searches, just arrays.
+    # get a new selection id
+    RNode.NODE_ID += 1
+    node_id = RNode.NODE_ID
+    # array of nodes in topological order
+    nodes = []
+    # how many times each node's content is referenced
+    noderefs = []
+    # recursive visitor function
+    def visit_node(node):
+        # if node has already been seen, skip
+        if node._node_id == node_id: return
+        # visit children first
+        for n in node.sources:
+            visit_node(n)
+            # increase noderefs for that node
+            noderefs[n._node_index] += 1
+        # assign the node's order index
+        node._node_index = len(nodes)
+        # update the select id to mark the node as visited
+        node._node_id = node_id
+        # append to topologically sorted array
+        nodes.append(node)
+        # and initialize the refcount for this node
+        noderefs.append(0)
+    # start at batch, which is the root
+    visit_node(self)
+    # debug out: print order in which nodes are processed, (which
+    # is not the order in which they are rendered, btw)
+    for i,n in enumerate(nodes):
+        print('#{}: {}'.format(i,n))
+    # array of each node's jobs
+    all_jobs = []
+    # linearly go through the nodes in sorted order
+    for k,node in enumerate(nodes):
+        # list of jobs that this node will receive
+        jobs = []
+        # for all inputs of this node
+        for source in node.sources:
+            # retrieve the jobs that the predecessor has
+            src_index = source._node_index
+            src_jobs = all_jobs[src_index]
+            # make sure we have at least one reference left
+            assert noderefs[src_index] > 0
+            # if this is not the last reference
+            if noderefs[src_index] > 1:
+                # make a copy of the jobs and append
+                for job in src_jobs:
+                    jobs.append(self.copy_job(job))
+            else:
+                # last reference: they terk er jerbs
+                for job in src_jobs:
+                    jobs.append(job)
+                all_jobs[src_index] = None
+            # and decrease reference
+            noderefs[src_index] -= 1
+        # process the jobs
+        try:
+            node.process(self, jobs)
+        except:
+            print(node)
+            raise
+        # and append to job list
+        all_jobs.append(jobs)
+    # as jobs are already appended to the renderbatch on create/copy,
+    # we are already done; the batch can now be sorted and is ready
+    # for rendering.
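+# to see the refcounted handover in isolation, here is a tiny self-contained
+# sketch; MiniNode and run_batch are simplified stand-ins, not the engine API:

```python
class MiniNode:
    # a simplified stand-in for RNode: a name, input nodes, and an
    # optional job-processing callback
    def __init__(self, name, process=None):
        self.name = name
        self.sources = []
        self._process = process

    def __lshift__(self, other):
        self.sources.append(other)
        return other

def run_batch(root):
    # topological sort via depth-first search
    nodes, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for source in node.sources:
            visit(source)
        nodes.append(node)
    visit(root)
    # count how many consumers each node's job list has
    refs = {id(n): 0 for n in nodes}
    for n in nodes:
        for s in n.sources:
            refs[id(s)] += 1
    # push jobs through the graph in topological order
    all_jobs = {}
    for node in nodes:
        jobs = []
        for source in node.sources:
            src_jobs = all_jobs[id(source)]
            if refs[id(source)] > 1:
                # not the last consumer: hand over copies
                jobs.extend(dict(job) for job in src_jobs)
            else:
                # last consumer: hand over the originals
                jobs.extend(src_jobs)
                all_jobs[id(source)] = None
            refs[id(source)] -= 1
        if node._process:
            node._process(jobs)
        all_jobs[id(node)] = jobs
    return all_jobs[id(root)]
```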
+# And, just for completion, this is the basic RNode class declaration:
+class RNode(Named):
+    # selection id counter
+    NODE_ID = 0
+    def __init__(self):
+        Named.__init__(self)
+        # set of input nodes
+        self._sources = set()
+        # variables used during graph traversal
+        self._node_index = 0
+        self._node_id = 0
+    # this function does a number on the jobs; any operation is OK:
+    # alteration, duplication, deletion, etc. Change of order has no effect
+    # on anything.
+    def process(self, batch, jobs):
+        pass
+    @property
+    def sources(self):
+        return self._sources
+    def __lshift__(self, other):
+        assert self.name, '{} is unnamed'.format(self)
+        if isinstance(other, collections.Iterable):
+            for node in other:
+                assert node.name, '{} is unnamed'.format(node)
+            self.sources.update(other)
+            return other
+        else:
+            assert other.name, '{} is unnamed'.format(other)
+            self.sources.add(other)
+            return other
+# That's it. Thanks for reading :)

A => doc/articles/maxnorm.txt +110 -0
@@ 0,0 1,110 @@
+Here is a collection of signed distance field functions using the max-norm
+metric (Chebyshev distance, or L^inf distance).
+#define SIGNED_DIST_MAX 1000.0
+float sumx(vec3 v) {
+    return v.x + v.y + v.z;
+}
+// max3/min3 are not core GLSL built-ins, so define them here
+float max3(float a, float b, float c) {
+	return max(a, max(b, c));
+}
+float min3(float a, float b, float c) {
+	return min(a, min(b, c));
+}
+float max_element(vec3 v) {
+	return max3(v.x, v.y, v.z);
+}
+float max_norm(vec3 v) {
+	return max_element(abs(v));
+}
+float sd_cube(vec3 p, float r) {
+	return max_norm(p) - r;
+}
+vec2 solve_quadratic(float a, float b, float c) {
+	// ax^2 + bx + c = 0, a non-zero
+	float q = b*b - 4.0*a*c;
+	if (q < 0.0) {
+		return vec2(SIGNED_DIST_MAX);
+	}
+	float r0 = -b/(2.0*a);
+	float r1 = sqrt(q)/(2.0*a);
+	return vec2(r0 - r1, r0 + r1);
+}
+float sd_ellipsoid(vec3 p, vec3 r) {
+	// move ellipse so that target point is at origin, centre in positive space
+	// f(v) = (v.x - c.x)^2/r.x^2 + (v.y - c.y)^2/r.y^2 + (v.z - c.z)^2/r.z^2 - 1
+	vec3 c = abs(p);
+	vec3 c2 = c*c;
+	vec3 r2 = r*r;
+    float d = SIGNED_DIST_MAX;
+	// gather terms of quadratic
+	vec3 qa = 1.0/r2;
+	vec3 qb = -2.0*c/r2;
+	vec3 qc = c2/r2;
+    float qcs = sumx(qc) - 1.0;
+    // check corners:
+    // solve f(v)=0 for v.x=v.y=v.z=t
+    {
+		vec2 t0 = abs(solve_quadratic(sumx(qa), sumx(qb), qcs));
+        d = min3(d, t0.x, t0.y);
+    }
+	// interior of convex shape always hits corners first, so early out
+	if (qcs <= 0.0) {
+		return -d;
+	}
+	// check edges:
+	// df/dx=0 => v.x=c.x, solve f(v)=0 for v.x=c.x, v.y=v.z=t
+    // then do the same for y and z cases
+    {
+        vec2 t = abs(solve_quadratic(qa.y + qa.z, qb.y + qb.z, qc.y + qc.z - 1.0));
+        d = min(d, max(min(t.x, t.y), c.x));
+	}
+    {
+        vec2 t = abs(solve_quadratic(qa.x + qa.z, qb.x + qb.z, qc.x + qc.z - 1.0));
+        d = min(d, max(min(t.x, t.y), c.y));
+	}
+    {
+        vec2 t = abs(solve_quadratic(qa.x + qa.y, qb.x + qb.y, qc.x + qc.y - 1.0));
+        d = min(d, max(min(t.x, t.y), c.z));
+	}
+	// check faces:
+    // df/dx=df/dy=0 => v.xy=c.xy, so f(v)=0 => |v.z - c.z|=r.z
+	{
+		d = min(d, max3(c.x, c.y, abs(c.z - r.z)));
+		d = min(d, max3(c.x, abs(c.y - r.y), c.z));
+		d = min(d, max3(abs(c.x - r.x), c.y, c.z));
+	}
+	// done
+	return d;
+}
+float smoothmin(float a, float b, float r, out float e)
+{
+    e = max(r-abs(a - b), 0.0);
+    return min(a, b)-e*e*0.25/r;
+}
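+For testing outside a shader, the metric and the two simplest functions port
+directly to Python (my port, not part of the GLSL collection above):

```python
def max_norm(v):
    # Chebyshev / L^inf length of a vector
    return max(abs(c) for c in v)

def sd_cube(p, r):
    # signed max-norm distance to an origin-centred cube of half-extent r
    return max_norm(p) - r

def smoothmin(a, b, r):
    # smooth minimum; also returns the blend weight e
    e = max(r - abs(a - b), 0.0)
    return min(a, b) - e * e * 0.25 / r, e
```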
+/* todo:
+cubic strokes
+cylinders
+cones
+triangular prisms
+torus
+round cubes (super ellipsoids with variable power per dim)
+pyramids
+operations:
+* subtract
+* color
+* soft blend
+*/

A => doc/articles/nicer-teardown.txt +182 -0
@@ 0,0 1,182 @@
+Nicer Teardown In Game Engines
+==============================
+A cool thing I figured out today is how to organize teardown in a
+game engine without going nuts. I'm going to explain the thinking steps that
+led to the solution, concluding with it. It's a solution that is both surprisingly
+fun to use and quite efficient.
+A typical allocation scheme for a static codebase is to only allocate
+memory at the beginning of a game. This allows for a simple stack
+allocator which does not have to deal with fragmentation because it
+never frees anything until the game ends. Then the stack allocator
+just resets, and that's it. The user never has to manually free memory
+That's a nice, simple way to do cleanup, but what about resources
+borrowed from the operating system, like file handles, vertex buffers,
+and so forth? What about memory allocated by third party libraries?
+Could we do something similar there?
+In C++, the destructor cascade pattern is typically used for this,
+and its advantage is that the generated code is completely static.
+The disadvantage is that *every* handle now needs to be wrapped in
+a class type, and it becomes a major nuisance to incorporate naked
+C APIs into the project.
+Considering that setup and teardown are the inseparable Taiji
+of a resource's lifetime, that one depends on the other and neither
+is ever seen alone, is there a way to always keep both routines
+closely together, in order to ease their maintenance?
+Regarding this, there's an interesting addition in Golang, a feature
+that Terra supports as well, the `defer <http://blog.golang.org/defer-panic-and-recover>`_
+keyword. ``defer`` registers a call to be executed at the end of a scope,
+and makes it possible to keep releases close to allocations.
+``defer`` is meant for function level resource management, typically
+short lived file handles. Our game engine is large, modular, with many
+things ideally held separate from another. Could we get a ``defer`` for
+this scope?
+A dynamic solution that would work for a scripting language (although in a
+syntactically unattractive way) would look like this in practice::
+  // initializing an engine module or plug-in
+  function init_module () {
+    // example allocation of a system resource
+    var handle = allocDrawingContext();
+    // register a deferred destructor for the teardown
+    memory.on_free(function () {
+      // we can do more than one thing in this function
+      freeDrawingContext(handle);
+    });
+  }
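+The same dynamic scheme is easy to prototype in Python (hypothetical names,
+mirroring the pseudocode above):

```python
class Memory:
    # a registry of deferred destructors, run at engine teardown
    def __init__(self):
        self._deferred = []

    def on_free(self, func):
        # insert at the front so resources are released in reverse
        # order of allocation
        self._deferred.insert(0, func)

    def teardown(self):
        for func in self._deferred:
            func()
        self._deferred = []

memory = Memory()
released = []

def alloc_drawing_context(handle):
    # stands in for a system allocation returning a raw handle;
    # the deferred release is registered right next to it
    memory.on_free(lambda: released.append(handle))
    return handle

def init_module():
    alloc_drawing_context(1)
    alloc_drawing_context(2)
```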
+One can build a relatively similar construct in C++, alas, the disadvantage
+to classic destructor cascades is that we can't get nicely inlined static
+teardown code this way. We also need to manage the list of function pointers to
+be called on teardown.
+Suppose we could extend the compiler, or better yet, use a language
+that allows scripting the compiler with static macros, a language like
+Terra. Could we then get a more static solution for the code above?
+The question is leading. Of course we can!
+Here's how such an approach would look in Terra::
+    terra init_module ()
+        -- example allocation of system resources
+        var handle : int = allocDrawingContext()
+        -- register a deferred release for the teardown
+        --
+        -- side note: of course we could (and should) wrap allocDrawingContext
+        -- to do it transparently, but in this example, let's bleed some
+        -- internals.
+        [memory.on_free(freeDrawingContext, handle, int)]
+        -- do a few more
+        var handle2 : int = allocDrawingContext()
+        [memory.on_free(freeDrawingContext, handle2, int)]
+        var handle3 : int = allocDrawingContext()
+        [memory.on_free(freeDrawingContext, handle3, int)]
+    end
+    terra main()
+        init_module()
+        memory.teardown()
+    end
+And this is what the LLVM IR generated for ``main()`` looks like::
+    define void @"\$main"() {
+    entry:
+      %0 = tail call i32 @"\$allocDrawingContext"()
+      store i32 %0, i32* @dvar, align 4
+      %1 = tail call i32 @"\$allocDrawingContext"()
+      store i32 %1, i32* @dvar1, align 4
+      %2 = tail call i32 @"\$allocDrawingContext"()
+      store i32 %2, i32* @dvar2, align 4
+      tail call void @"\$freeDrawingContext"(i32 %2)
+      %3 = load i32* @dvar1, align 4
+      tail call void @"\$freeDrawingContext"(i32 %3)
+      %4 = load i32* @dvar, align 4
+      tail call void @"\$freeDrawingContext"(i32 %4)
+      ret void
+    }
+"Sorcery!" you might exclaim like a caricature of a 14th century cleric, but
+far from it.
+``memory.on_free()`` is declared as follows::
+    memory.on_free = function (func, ctx, ctxtype)
+        -- create global variable holding the resource if argument
+        local dvar = (terralib.isglobalvar(ctx) and ctx) or global(ctxtype)
+        -- insert at front so objects get destroyed in reverse
+        table.insert(memory, 1, {func=func, ctx=dvar})
+        -- initialize global variable
+        return quote
+            dvar = ctx
+        end
+    end
+and the teardown macro looks like this::
+    memory.teardown = macro(function ()
+        -- generate statement list of static destructor calls
+        local stmts = {}
+        for _,o in ipairs(memory) do
+            table.insert(stmts, `o.func(o.ctx))
+        end
+        return stmts
+    end)
+One point of criticism is that we're now storing each resource twice: once
+within the context it was allocated for, and once in a global variable that
+hogs a spot in the data segment.
+Unlike C++, it is practical and safe in Terra to use a global variable for
+resources, as global variables are hygienic and hidden by the closures they
+were defined in at compile time.
+It would therefore be safe to do this::
+    -- global handles to be auto-destroyed later
+    local handle = global(int)
+    local handle2 = global(int)
+    local handle3 = global(int)
+    terra init_module ()
+        handle = allocDrawingContext()
+        [memory.on_free(freeDrawingContext, handle)]
+        handle2 = allocDrawingContext()
+        [memory.on_free(freeDrawingContext, handle2)]
+        handle3 = allocDrawingContext()
+        [memory.on_free(freeDrawingContext, handle3)]
+    end
+and that reduces memory and instruction footprint to its theoretical minimum.

A => doc/articles/node-dom.txt +155 -0
@@ 0,0 1,155 @@
+                        NOM - A Graph-Based DOM
+                           by Leonard Ritter
+                              Duangle GbR
+                       First revision: 2014/09/14
+For my noodles graph programming I ended up rewriting the model.
+I used to have several separate classes for each concept: Type, Function, Call.
+Then I realized I first needed an extensible data descriptor model which
+is general enough to permit building the above classes as specialized data - and
+maybe more.
+Therefore I folded all these classes back into one, the Node, which is
+comparable to a DOMObject or a JSONObject, with a few changes
+particular to the model's application:
+ * the data is not edited in text form, so there is no "language" for this,
+   just an API for (visual) editing, whose behavior must be
+   configurable/hintable.
+ * the data structure can be split over several files, usually in a tree-like
+   dependency structure.
+ * references need to be loosely coupled and overridable.
+ * configurable directed links/edges between nodes are required.
+ * a metadata/type/inheritance system is selectively required.
+All in all, the result is comparable to my earlier Datenwerk system, with a few
+major simplifications and amendments.
+Here's a description of the features of the model:
+Tree Structure
+--------------
+As with DOM/JSON, the node is a tree node, in that it has one parent and a list
+of children, and that it is deleted when its parent is deleted.
+Parent-children relationships are implicit and hard-referenced, in that they can
+be implemented with a native pointer. There are also soft references, which I'll
+talk about below.
+All root nodes of loaded documents are attached to a constant, unrenameable,
+unreparentable and undeletable single root node named "#" that exists only
+during runtime and must not be saved.
+Data
+----
+A node can have a hard reference to a piece of data, which is neither typed
+nor specified; in DOM terms, we would call this "CDATA". Typically, it can be
+a number, a string or text block, a boolean, a table, an object, a sound buffer,
+an image buffer, etc. The type of the data is opaque to the model and only
+of relevance to particular applications. It is deleted with its owner.
+Typically, a user would tag this node with a child node or metanode describing
+its type, or the type would be implicitly inferrable from context;
+visualizations would have specific requirements as to how to make this
+data reflectable.
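The model does not prescribe a representation, but the tree-plus-data relationships described so far could be sketched in C roughly like this; the struct layout and function names are hypothetical:

```c
#include <stdlib.h>
#include <string.h>

typedef struct Node Node;
struct Node {
    char *name;          /* non-empty; implicit names are prefixed with "#" */
    Node *parent;        /* hard reference; NULL only for the runtime root */
    Node **children;     /* owned: deleted recursively with this node */
    size_t child_count;
    void *data;          /* opaque CDATA payload, deleted with its owner */
};

/* create a node and hard-link it into its parent */
static Node *node_new(Node *parent, const char *name, void *data) {
    Node *n = (Node *)calloc(1, sizeof(Node));
    n->name = (char *)malloc(strlen(name) + 1);
    strcpy(n->name, name);
    n->parent = parent;
    n->data = data;
    if (parent) {
        parent->children = (Node **)realloc(parent->children,
            (parent->child_count + 1) * sizeof(Node *));
        parent->children[parent->child_count++] = n;
    }
    return n;
}

/* deleting a node deletes its children and its data with it */
static void node_free(Node *n) {
    size_t i;
    for (i = 0; i < n->child_count; ++i)
        node_free(n->children[i]);
    free(n->children);
    free(n->data);
    free(n->name);
    free(n);
}
```

Deleting the runtime root tears down the entire loaded tree, which matches the ownership rules above.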
+Names
+-----
+Each node carries a non-empty name, which is comparable to DOM ids, but the
+node name is only unique within its scope, similar to files in directories.
+To reference a node, one needs the full path originating from the "#" node,
+in the form of a list of node names.
+A name can either be readable and user-defined, or a random id (e.g. uuid) in
+cases where the node has no textual representation and requires no repairable
+linkability. Random ids need to be prefixed with "#" to distinguish them from
+readable names. This has an effect on visualization and soft-reference rules.
+We call user-defined names explicit names and random ids implicit names.
+Paths
+-----
+A path is what constitutes a so-called soft reference.
+Path name resolution is *not* the same as in most file systems. The name
+resolution for the first path element travels upwards, treating the tree
+structure as scope, comparable to namespace resolution in C, C++, Java, Lua,
+Javascript, and a few others. From then on it resolves down, descending
+into child nodes.
+Therefore, relative paths can directly start from the first commonly shared
+parent node, if the parent is not immediate.
+That means, if two nodes in upper scopes share the same name, the path
+will always resolve to the node in the nearest scope.
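As a sketch, the upward-then-downward resolution could look like this in C; the node record and all names here are illustrative stand-ins, not part of the model:

```c
#include <stddef.h>
#include <string.h>

/* minimal stand-in node for demonstrating resolution */
typedef struct PNode PNode;
struct PNode {
    const char *name;
    PNode *parent;
    PNode **children;
    size_t child_count;
};

/* find a direct child by name; NULL if absent */
static PNode *find_child(PNode *scope, const char *name) {
    size_t i;
    for (i = 0; i < scope->child_count; ++i)
        if (strcmp(scope->children[i]->name, name) == 0)
            return scope->children[i];
    return NULL;
}

/* resolve a path relative to origin: the first element is searched
   upwards through enclosing scopes (nearest scope wins), the remaining
   elements descend into child nodes */
static PNode *resolve_path(PNode *origin, const char **path, size_t len) {
    PNode *cur = NULL;
    PNode *scope;
    size_t i;
    for (scope = origin; scope != NULL; scope = scope->parent) {
        cur = find_child(scope, path[0]);
        if (cur)
            break;
    }
    for (i = 1; cur != NULL && i < len; ++i)
        cur = find_child(cur, path[i]);
    return cur;
}
```

The first loop is what gives nearest-scope precedence; everything after the first path element behaves like a plain file-system descent.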
+Metanodes
+---------
+A node can use another node as a metanode, comparable to instances of classes
+in C++, Python or Java; prototypes in Javascript; and metatables in Lua.
+Metanodes are soft-referenced, which means they are referenced with a path,
+in this case ideally a path with 1 to 2 elements.
+If the path to a metanode can not be resolved, the node is marked as incomplete
+by the editor, and will lose all of the styling and behavior imposed by the
+metanode until the reference is restored. A metanode path can not contain
+implicit names.
+Aliases
+-------
+When a path is resolved down and the path iterator can not find a child node
+with the given name, it will continue the search in the node's metanode.
+This way, one can import, alias or typedef nodes across widely separated
+scopes.
+Links
+-----
+This is the new bit that to my knowledge isn't a part of any mainstream object
+descriptor format I know, and which adds noodle-ability to this model.
+A node can be attached between two other nodes, called a source and a sink,
+describing a directional graph, which effectively turns the node into an edge.
+Source and sink are referenced weakly; neither needs to share the same scope.
+Deletion rules are tricky: if either source or sink path can not be resolved,
+and the first unresolvable path name is implicit, the node is deleted. If both
+paths can not be resolved, explicit or not, the node is always deleted.
+There are some more theoretical implications for links as nodes, for which
+I know no immediate use cases:
+Because links are nodes, a link can have child nodes and a metanode. A node
+can also use a link as a metanode. And because links can connect links, one can
+build link cascades and Sierpinski triangles, but I don't know what those are
+good for.

A => doc/articles/split-scheme.txt +149 -0
@@ 0,0 1,149 @@
+                           by Leonard Ritter
+                              Duangle GbR
+                        Second revision: 2016/01/01
+DISCLAIMER: This is original research and has not been verified by a
+second or third party yet.
+This is a proposal for a new formulation of the practical split scheme
+function, so that the function can be inverted in a shader to easily project
+a distance z back to its layer index i. An intuitive replacement
+for the weighing factor sigma is proposed as well.
+In the original article, a split scheme function was suggested from which,
+given a layer index i, the optimal depth z for that index could
+be derived; given a normal scalar x which is retrieved via
+                                         i
+                                x = -----------
+                                    layer_count
+and a sigma constant S in the range 0..1, which would allow the
+user to interpolate linearly between an equidistant depth distribution
+and an ideal exponential one to bias the resolution distribution,
+the original split scheme equation would be
+                          f
+                  z = pow(-, x) n S + ((f - n) x + n) (1 - S) = f(x)
+                          n
+n and f designate the distance of the near and far planes respectively.
+Unfortunately this function can't be inverted. With this solution,
+it is necessary to precompute and pass the layer boundaries as
+coefficients to the pixel shader. This also limits applicability,
+as the number of addressable layers is bounded by storage.
+The New Split Scheme
+====================
+Through experimentation and some guesswork, I found an alternative
+way to bias the distribution, which is purely exponential and
+has no additive polynomial component, so inversion is trivial.
+This bias effectively projects the near and far distances to a
+slope further up the exponential curve, which dampens the initial
+slow momentum and causes the function to approach a more linear
+scaling.
+The original sigma constant S is replaced by its square root, C = sqrt(S),
+and the split scheme function is altered so that
+                         f                 f
+            z = (f - n - -) (1 - pow(-------------, x)) + n = f(x)
+                         C           (n - f) C + f
+At the original proposed default of S = 0.5, the function starts
+with near identical momentum, but converges marginally sooner
+towards f; across all values of S, the first half of the function
+distributes very similarly, so this function can easily be used
+as a drop-in replacement.
+The only limitation of this modified function is that at C = 0, neither z
+nor x can be computed, as the term f/C divides by zero. This poses no
+issue, as S = 0 describes a simple linear mapping that is trivial to
+invert, and undesirable in the context of efficient depth layer usage anyway.
+Analysis
+========
+http://i.imgur.com/9OeUMBz.png
+The above picture shows the distribution for sigma = 0.5; the
+green and yellow curves show the boundaries at sigma 0.0 and 1.0,
+which are the same for both old and new functions.
+The purple curve is the original split scheme, the blue curve
+is the modified one.
+The Resolve
+===========
+In the shader, the layer index can now be obtained by resolving x from a
+distance z and multiplying it with layer_count, yielding the interpolated
+index of the layer to sample.
+To calculate x from z, we invert the new split scheme function so that
+                (z - far) C + far                  far
+       x = log(--------------------) / log(--------------------) = f^-1(z)
+               (near - far) C + far        (near - far) C + far
+For optimization, the inner factors and the second log term can be stored in
+three constant variables U, V, W:
+                           d = far * (1 - C) + near * C
+                                         C
+                                     U = -
+                                         d
+                                   far * (1 - C)
+                               V = -------------
+                                         d
+                                                                far
+                                 W = layer_count * log(2) / log(---)
+                                                                 d
+With these optimizations, the resolve reduces to
+                              i = log2(z U + V) W
+Future Work
+===========
+I did not find the original weighing parameter sigma to be particularly
+easy to understand and apply. In practice, I found myself aiming to get
+an optimal balance between resolution and the range of the first cascade
+layer by guessing values for sigma; there was no clear relationship
+between the parameter and its implications.
+Ideally, I would prefer a new balancing variable z1 instead of sigma, which
+defines the distance at which the first shadow map layer ends, and is
+used to calculate the bias C.
+Unfortunately, I could not find a way to solve the equation towards C,
+so I leave this for further exploration. Until a simpler method can be found,
+bisecting towards C would be the nearest choice. A routine would
+methodically attempt to solve the split scheme function by adjusting C until
+z = z1 for x = 1 / layer_count.
+References
+==========
+ Parallel-Split Shadow Maps on Programmable GPUs
+    http://http.developer.nvidia.com/GPUGems3/gpugems3_ch10.html
+Thanks to Sean Barrett and Fabian Giesen for help and suggestions.

A => doc/articles/twistpool.c +317 -0
@@ 0,0 1,317 @@
+/*
+Twisting Pool Allocator
+=======================
+written by Leonard Ritter (leonard.ritter@duangle.com)
+This file is in the public domain
+I don't know if I was the first one to stumble upon this technique, so
+I can't guarantee there's no patent on it, but let's hope there's not,
+in which case this counts as prior art.
+--------------------------------------------------------------------------------
+This is a proof of concept implementation for a pool allocator that guarantees
+compactness (unsorted gapless iteration without indirections) while preserving
+element ids (using one order-optimized indirection), with insertion, deletion
+and lookup in O(1) time.
+Because all id <-> index assignments are symmetric swaps (either
+identity-mapped or twisted), only a single table is required to resolve
+index from id and id from index.
+Insertion and deletion form a self-sorting process. Both operations attempt
+to untwist identifiers that share the same status (obtained/released) and
+so, regardless of deallocation order, only few entries are accessed and
+fragmentation is generally low; in my random allocation test, the average number
+of fragmented entries appeared to be ~log2(capacity). It is possible to produce
+a worst case by obtaining all ids, then releasing every other id. The upper
+bound on fragmented entries appears to be capacity/3.
+The data being managed must be allocated separately, and the user must
+copy at most one element after a call to obtain/release.
+With this implementation, the memory requirement is
+sizeof(unsigned int) * capacity, typically 4 bytes per entry.
+It might however be possible to implement the map with a compact sorted array or
+hashtable, storing only twisted entries, which should tremendously reduce memory
+usage and therefore notably improve cache efficiency.
+--------------------------------------------------------------------------------
+*/
+#include <stdbool.h>
+#include <assert.h>
+// use whatever size you prefer; this implementation operates directly on
+// global memory; a production implementation would probably use the heap.
+#define CAPACITY 10000
+#define N (CAPACITY+1)
+/*  index: offset of element in data array; elements are moved
+    to keep the array compact and contiguous
+    id: numerical name assigned to a data element; never changes.
+    therefore, elements should typically be referenced by their id,
+    not their index.
+    id 0 and index 0 are synonymous with no element
+    */
+/*  symmetrical map
+    maps index -> id; compact, gapless
+    last index == count
+    happens to also map id -> index; has gaps
+    */
+unsigned int map[N];
+/*  all ids are mapped, even the unused ones;
+    the ideal mapping is id == index
+    the first free index is count+1
+    if count == capacity, the pool is full.
+    */
+unsigned int count;
+// pool constructor
+void construct () {
+    unsigned int i;
+    // initially no elements
+    count = 0;
+    for (i = 0; i < N; ++i) {
+        // start out with identity mapping;
+        // all unused ids are mapped
+        map[i] = i;
+    }
+}
+// lookup index from id or id from index
+unsigned int resolve(unsigned int id_or_index) {
+    return map[id_or_index];
+}
+bool is_index_valid (unsigned int index) {
+    return (index > 0) && (index <= count);
+}
+bool is_id_valid (unsigned int id) {
+    return (id > 0) && (id <= CAPACITY) && is_index_valid(resolve(id));
+}
+bool is_identity (unsigned int id_or_index) {
+    return (map[id_or_index] == id_or_index);
+}
+// swap the ids of two indices (or the indices of two ids)
+void swap (unsigned int a, unsigned int b) {
+    // ids can only be twisted or untwisted, map must not be shuffled any further
+    assert((a == b)
+            || ((map[a] == a) && (map[b] == b))
+            || ((map[a] == b) && (map[b] == a)));
+    unsigned int t = map[a];
+    map[a] = map[b];
+    map[b] = t;
+}
+typedef struct _pair {
+    unsigned int _0;
+    unsigned int _1;
+} pair;
+/*  allocate a new id from the pool
+    user must copy slot[index(id)] to slot[count] after
+    allocation. obtain() returns a { src, dst } tuple indicating
+    how data must be moved; if (dst == src), then no move is necessary.
+    src is also identical to the newly obtained id
+    if the pool is full, obtain returns { 0, 0 }
+    */
+pair obtain () {
+    if (count < CAPACITY) {
+        unsigned int index;
+        unsigned int id;
+        // increase number of elements
+        ++count;
+        // index of new last element
+        index = count;
+        // id of new last element
+        id = map[index];
+        // if id not identical to index (is twisted)
+        if (id != index) {
+            // swap with index that matches this id,
+            // so that index(id) == id
+            swap(index, id);
+        }
+        // return new id/index and index of moved item
+        pair result = { id, index };
+        return result;
+    } else {
+        // out of space
+        pair result = { 0, 0 };
+        return result;
+    }
+}
+/*  release an obtained id to the pool
+    user must copy slot[count] to slot[index(id)] after
+    deletion. (release id) returns a (src dst) tuple indicating
+    how data must be moved; if (src == dst), then no move is necessary.
+    */
+pair release (unsigned int id) {
+    unsigned int index;
+    unsigned int last_index;
+    unsigned int last_id;
+    assert(is_id_valid(id));
+    // index of element to be deleted
+    index = map[id];
+    // if element is twisted, then untwist
+    if (id > count)
+        swap(index, id);
+    // index and id of element to take its place
+    last_index = count;
+    last_id = map[last_index];
+    // swap indices so that the trailing element fills the gap (twist)
+    // if last element is twisted, then untwist first
+    if (last_id > count) {
+        swap(last_index, last_id);
+        swap(index, last_id);
+    } else {
+        swap(index, last_index);
+    }
+    // decrease number of elements
+    --count;
+    // return index of element to be moved and index of gap
+    pair result = { last_index, index };
+    return result;
+}
+// ----------------------------------------------------------------------------
+// test
+#include <stdlib.h>
+#include <stdio.h>
+// our "data array", which just stores the id again, so we can easily verify
+// if an id still matches its assigned content
+unsigned int data[N];
+void dump() {
+    unsigned int i;
+    printf("index -> ID:\n");
+    for (i = 1; i <= CAPACITY; ++i) {
+        if (i == (count + 1))
+            printf("-------\n");
+        unsigned int id = resolve(i);
+        unsigned int ri = resolve(id);
+        printf("%u\t%u\t%s\n", i, id, ((i != ri)?"!":""));
+    }
+    printf("%u used elements\n\n", count);
+}
+void move(pair k) {
+    if (k._0 != k._1) {
+        data[k._1] = data[k._0];
+    }
+}
+unsigned int verify_data () {
+    unsigned int i;
+    unsigned int twisted = 0;
+    for (i = 1; i <= count; ++i) {
+        unsigned int id = resolve(i);
+        assert(data[i] == id);
+        assert(resolve(id) == i);
+        if (!is_identity(i))
+            ++twisted;
+    }
+    return twisted;
+}
+unsigned int test_obtain () {
+    pair k = obtain();
+    move(k);
+    data[k._0] = k._0;
+    return k._0;
+}
+void test_release (unsigned int id) {
+    assert(id != 0);
+    pair k = release(id);
+    move(k);
+}
+#include <string.h>
+#include <time.h>
+// array of ids in use
+unsigned int used[CAPACITY];
+int main (int argc, char** argv) {
+    int i;
+    unsigned int mintwisted = N;
+    unsigned int maxtwisted = 0;
+    unsigned int total = 0;
+    unsigned int steps = 0;
+    unsigned int used_count = 0;
+    memset(used, 0, sizeof(used));
+    construct();
+    srand((unsigned int)time(NULL));
+#if 1
+    // do random obtains/releases, see if something breaks
+    for (i = 0; i < 100000; ++i) {
+        if (((rand() % 100) < 50) && (used_count > 0)) {
+            unsigned int used_index = rand() % used_count;
+            unsigned int id = used[used_index];
+            // remove from used array and fill
+            used_count--;
+            used[used_index] = used[used_count];
+            used[used_count] = 0;
+            test_release(id);
+            unsigned int t = verify_data();
+            mintwisted = (mintwisted < t)?mintwisted:t;
+            maxtwisted = (maxtwisted > t)?maxtwisted:t;
+            total += t;
+            ++steps;
+        } else {
+            unsigned int k = test_obtain();
+            if (k != 0) {
+                assert(used_count < CAPACITY);
+                used[used_count] = k;
+                ++used_count;
+            }
+            verify_data();
+        }
+    }
+#else
+    // attempt to fabricate a worst case
+    for (i = 0; i < CAPACITY; ++i) {
+        test_obtain();
+    }
+    for (i = 1; i <= CAPACITY; i += 2) {
+        test_release(i);
+        unsigned int t = verify_data();
+        mintwisted = (mintwisted < t)?mintwisted:t;
+        maxtwisted = (maxtwisted > t)?maxtwisted:t;
+        total += t;
+        ++steps;
+    }
+#endif
+    dump();
+    printf("releases: %u\nmin twisted: %u\nmax twisted: %u\ntotal twisted: %u\naverage: %f\n",
+        steps, mintwisted, maxtwisted, total, (double)total / (double)steps);
+    printf("OK.\n");
+    return 0;
+}