Welcome to the first Development INF (short for "information")! In each Development INF article, I will list the roadblocks, gotchas, tips, and tricks that I found during development; in this case, DirectX 11.

1. DirectX11 is part of WindowsSDK
For old-time DirectX developers: we used to have to install the standalone DirectX SDK. However, starting with Windows 8, Microsoft has folded it into the Windows 8 SDK. Consequently, this creates compile issues for projects that still reference the old DirectX paths.

In Visual Studio, we used $(DXSDK_DIR)\Include and $(DXSDK_DIR)\Lib to locate the headers and libs; this is no longer the case. The headers are now under $(WindowsSDK_IncludePath) and the libs under $(WindowsSdkDir)\lib\x64.
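For example, in a .vcxproj the project settings could look like the sketch below (the exact lib subdirectory varies with SDK version and target architecture, so treat the paths as an illustration):

```xml
<ItemDefinitionGroup>
  <ClCompile>
    <AdditionalIncludeDirectories>$(WindowsSDK_IncludePath);%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
  </ClCompile>
  <Link>
    <AdditionalLibraryDirectories>$(WindowsSdkDir)lib\x64;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
  </Link>
</ItemDefinitionGroup>
```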

2. D3DX11 library is deprecated
The D3DX11 library is deprecated: we should no longer include the d3dx11.h header or use the D3DX11* functions. We need to find a replacement for each function.

3. DirectXMath replaces XNAMath
//#include <xnamath.h> - deprecated
#include <DirectXMath.h>
using namespace DirectX; // The math structs and funcs are in this namespace
4. Effects API is no longer part of DirectX
You can download and compile it yourself from http://fx11.codeplex.com/. It's a good idea to build it once and put the resulting library in your library folder.
Recent games have been heading towards Physically Based Rendering, and this requires a solid understanding of lighting theory more than ever before. This time, I'm posting my notes on Radiometry and Photometry.


Radiometry is, basically, the ideas and mathematical tools used to describe light propagation and reflection. Radiative transfer is the study of the transfer of radiant energy; it operates at the geometric optics level, where the macroscopic properties of light suffice to describe how light interacts with objects much larger than its wavelength.

Four Radiometric Quantities:

1. Flux (Radiant Flux/Power) - total amount of energy passing through a surface or region of space per unit time (J/s or Watt). Total emission from light sources is generally described in terms of flux.
2. Irradiance (E) - area density of flux arriving at a surface (Watt/m2)
Radiant Exitance (M) - area density of flux leaving a surface (Watt/m2)
3. Radiant Intensity (I) - flux density per solid angle (Watt/sr)
4. Radiance (L) - flux density per unit area, per unit solid angle (Watt/m2-sr)
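These four are just successive densities of the same flux. In standard differential form (with Q the radiant energy, A the surface area, omega the solid angle, and theta the angle between the surface normal and the direction of interest):

```latex
\begin{aligned}
\Phi &= \frac{dQ}{dt} && [\mathrm{W}] \\
E &= \frac{d\Phi}{dA} && [\mathrm{W/m^2}] \\
I &= \frac{d\Phi}{d\omega} && [\mathrm{W/sr}] \\
L &= \frac{d^2\Phi}{dA\, d\omega \cos\theta} && [\mathrm{W/(m^2\,sr)}]
\end{aligned}
```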


Photometry is the study of visible electromagnetic radiation in terms of its (human) perception. Each spectral radiometric quantity can be converted to its corresponding photometric quantity.
Luminance (Y) measures how bright a spectral power distribution appears to a human observer. Its unit is cd/m2, also called the nit. The candela (cd) is the photometric equivalent of radiant intensity (W/sr).

Connecting Radiometry with Photometry
A photometric quantity is obtained by integrating the corresponding spectral radiometric quantity against the spectral response curve V(lambda) (the relative sensitivity of the human eye to various wavelengths).
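Concretely, for luminous flux (the other quantities follow the same pattern), with 683 lm/W being the standard luminous efficacy constant at the peak of photopic sensitivity:

```latex
\Phi_v = 683\,\frac{\mathrm{lm}}{\mathrm{W}} \int \Phi_{e,\lambda}(\lambda)\, V(\lambda)\, d\lambda
```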

Quantity                                   Radiometric Name    Radiometric Unit   Photometric Name     Photometric Unit
power                                      Radiant Flux        W                  Luminous Flux        lumen (lm)
power per unit area                        Irradiance          W/m2               Illuminance          lm/m2 = lux (lx)
power per unit solid angle                 Radiant Intensity   W/sr               Luminous Intensity   lm/sr = candela (cd)
power per unit area per unit solid angle   Radiance            W/m2-sr            Luminance            lm/m2-sr = cd/m2 = nit


1. Radiometry and Photometry FAQ - http://fp.optics.arizona.edu/Palmer/rpfaq/rpfaq.pdf
2. Philips Lighting Theory - http://www.lighting.philips.com/main/connect/lighting_university/internet-courses/lighting-theory/lighting-theory/
3. Moving Frostbite to PBR - http://www.frostbite.com/wp-content/uploads/2014/11/course_notes_moving_frostbite_to_pbr.pdf
4. IES File Format - http://www.cn-hopu.com/upload/file/IES.pdf
5. Radiometry Summary - http://cs.brown.edu/~ls/teaching/RadiometrySummary.pdf

During the early years of my programming, I tended to make everything super generic and as automatic as possible. For example, if I wanted a rendering system that renders multiple layers (opaque, translucent, etc.), I would build a generic system where each layer registers with a render manager. The render manager then just iterates through all registered layers and does something (update/render) with them.

Something like this:
// Assume we have RenderManager class and RenderLayer class

// Initialization
void RenderManager::Init()
{
    // Create and register each layer
    RegisterLayer( "Opaque" );
    RegisterLayer( "Translucent" );
    RegisterLayer( "PostFX" );
}

void RenderManager::Render()
{
    for (int i = 0; i < numLayers; ++i)
        layers[i].Render();
}
Isn't this awesome? We can create as many layers as we want, and when adding a new layer, we just need to modify Init()!

Issue #1: Ordering

One thing we have to resolve is how we determine the order of the layers. Is it the same as the order of registration? Or maybe we can add some sorting value when registering? Whichever method you choose, changing the order should be easy to do.

Issue #2: Readability

Ok, with some way of ordering the layers, the system looks great! Surely this is the best system, right? The problem is that the project grows bigger and more people get involved. Say we hit a problem when rendering the translucent layer; we have to inject some debug code such as:
void RenderManager::Render()
{
    for (int i = 0; i < numLayers; ++i)
    {
        if ( layers[i].Name == "Translucent" )
            DebugBreak();   // break (or log) only on the layer we care about
        layers[i].Render();
    }
}
If you opted for the "sort value" system to determine the order, you also need to write a Dump() function just to print out the ordering.

In this case, the better way is to do it explicitly, laying out the rendering order directly in the code. The code becomes self-documenting and easy to debug:
void RenderManager::Render()
{
    opaqueLayer.Render();
    translucentLayer.Render();
    postFXLayer.Render();
}

When you encounter this code, it is much easier to read! Of course, the downside is that you need to modify Render() when adding a new layer. However, this happens infrequently, so it makes sense to use the explicit approach.

So, do you think we have to change all our systems to the explicit approach? Nope, surely not; there are certain things that benefit from the generic approach. You need to evaluate whether the explicit approach makes sense. The explicit approach is not a bad thing; sometimes it's better than the generic approach.

If there's one thing to take away from here, it's that our code will be read by other people (including ourselves), so we have to craft it to be easy to understand (and to debug).

This post aims to help Windows users/developers switch to Mac. If you are a developer who wants to jump to Mac for iOS/Mac development, this post will help you :)

Must-Have Apps

  • Sublime Text 2 - this is more for Notepad++ replacement in Mac
  • gfxCardStatus - menu bar app that will show you which graphics card is in use (discrete or integrated)
  • XtraFinder - miss Windows Explorer? This is a must-install that will make Finder a little closer to Windows Explorer.
When working on a large project with multiple features, you may find yourself facing a lot of if-checks. For example:
if ( doCheck || doCheckA || doCheckB )
if ( doCheck || doCheckA )
if ( doCheck || doCheckB )
if ( doCheck )
The problem with this code is that as the number of features grows, you will have more and more permutations of if-checks to take care of. Let's see something a little bit better:
if ( doCheck
     || doCheckA
     || doCheckB )
The code is orthogonal; however, imagine you have lots and lots of these: your codebase will be super hard to read! I like to organize this a little differently, which gets the best of both worlds, orthogonality and clarity:
bool doSomeCheck = doCheck;    // General check on all conditions
doSomeCheck |= doCheckA;       // Check on Feature A
doSomeCheck |= doCheckB;       // Check on Feature B

// The logic below will be uncluttered and easy to see
if ( doSomeCheck )
Depending on the situation, it might be better to refactor it into a function too:
static bool ShouldDoThis()
{
    bool doSomeCheck = doCheck;
    doSomeCheck |= doCheckA;
    doSomeCheck |= doCheckB;
    return doSomeCheck;
}

// so the existing code becomes
if ( ShouldDoThis() )
There's a minor catch to organizing it the way I described: you lose short-circuit evaluation, because instead of using || we are essentially just using |. So there might be places where this technique is not appropriate. However, most of the time the penalty is small, so you should be able to use this style.
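To make that tradeoff concrete, here's a minimal sketch (CheckA() and the call counter are made up for illustration) showing that |= always evaluates its right-hand side, while || short-circuits:

```cpp
#include <cassert>

static int g_numCallsA = 0;   // counts how often the feature check runs

// Hypothetical feature check (for illustration only).
static bool CheckA()
{
    ++g_numCallsA;
    return true;
}

static bool ShortCircuit(bool doCheck)
{
    // With ||, CheckA() is skipped when doCheck is already true.
    return doCheck || CheckA();
}

static bool BitwiseOr(bool doCheck)
{
    // With |=, CheckA() runs unconditionally.
    bool doSomeCheck = doCheck;
    doSomeCheck |= CheckA();
    return doSomeCheck;
}
```

Both return the same result; the difference is only in how often CheckA() gets called. If the checks are cheap and side-effect free (the common case for feature flags), the penalty is negligible.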

Jorge Jimenez has released his slides regarding "Next Generation Post Processing in Call of Duty: Advanced Warfare". This talk is part of SIGGRAPH 2014 Advances in Real-Time Rendering in Games Course.

1. SIGGRAPH Course Website - http://advances.realtimerendering.com/s2014/index.html
2. Jorge's Website - http://www.iryoku.com/next-generation-post-processing-in-call-of-duty-advanced-warfare

Check out the awesome presentation slides! I feel honored to have been mentioned by Jorge in his slide, thanks Jorge!
To understand the concept of Compute Shaders, let's start from the basics.

Compute Shader (CS) Threads
A thread is the basic CS processing element.

1. CPU kicks off CS thread groups.
// Total number of thread groups = nX * nY * nZ
pDevice->Dispatch( nX, nY, nZ );

2. Each CS declares the number of threads in its thread group via the [numthreads] attribute.
// Total number of threads per thread group = X * Y * Z
[numthreads(X, Y, Z)]
void cs_main(...)

// CPU
pDevice->Dispatch( 3, 2, 1 );

// CS
[numthreads(4, 4, 1)]
void cs_main(...)

// # of thread groups = 3*2*1 = 6
// # of threads per group = 4*4*1 = 16
// # of total threads = 6 * 16 = 96

N.B: Picture taken from GDC09 Slide "Shader Model 5.0 and Compute Shader"

CS Parameter Input
void cs_main(uint3 groupID          : SV_GroupID,
             uint3 groupThreadID    : SV_GroupThreadID,
             uint3 dispatchThreadID : SV_DispatchThreadID,
             uint  groupIndex       : SV_GroupIndex)

// groupID          : index [0..nX), [0..nY), [0..nZ)
// groupThreadID    : index [0..X), [0..Y), [0..Z)
// dispatchThreadID : global thread offset = groupID.xyz * (X,Y,Z) + groupThreadID.xyz
// groupIndex       : flattened version of groupThreadID
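Since these IDs are just index arithmetic, the mapping can be mirrored on the CPU. A sketch (the struct and function names are mine) using the Dispatch(3, 2, 1) / numthreads(4, 4, 1) example from above:

```cpp
#include <cassert>

struct Uint3 { unsigned x, y, z; };

// Mirrors SV_DispatchThreadID: global thread offset for a given group
// and thread-in-group index. (X, Y, Z) are the numthreads dimensions.
Uint3 DispatchThreadID(Uint3 groupID, Uint3 groupThreadID,
                       unsigned X, unsigned Y, unsigned Z)
{
    return { groupID.x * X + groupThreadID.x,
             groupID.y * Y + groupThreadID.y,
             groupID.z * Z + groupThreadID.z };
}

// Mirrors SV_GroupIndex: flattened version of groupThreadID.
unsigned GroupIndex(Uint3 t, unsigned X, unsigned Y)
{
    return t.z * X * Y + t.y * X + t.x;
}
```

For example, groupID (2,1,0) with groupThreadID (3,3,0) gives dispatchThreadID (11, 7, 0): the last thread of the whole dispatch in x and y.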

That's it for now, we will continue with more stuff in Part 3!
This page will discuss/contain links about game programming concepts for game development.

In the rendering world, several articles discuss bent cones; for example, Bent Normals and Cones in Screen Space, and also GPU Pro 3: Screen-Space Bent Cones: A Practical Approach.

In this post, I want to share how I compute the bent cone (bent normal and max cone angle). The paper Bent Normals and Cones in Screen Space discusses how to compute both (although it's a bit math-y); here is my take on it.

Computing the bent normal is quite easy: you just shoot rays from your sampling point (pixel/vertex), average the unoccluded ray directions, and normalize the result.

For the max angle, it turns out we can correlate it with AO:
   A  = half opening angle of the cone
   AO = ambient occlusion value (stored as the unoccluded fraction: 1 = fully unoccluded)

AO = UnoccludedArea / TotalArea

TotalArea = hemisphere area
          = 2 * pi * r * r
UnoccludedArea = spherical area covered by the cone (apex angle 2A)
               = SolidAngle(2A) * r * r
               = 2 * pi * (1 - cosA) * r * r

AO = 1 - cosA
cosA = 1 - AO
This basically means, to compute the cone angle, it's enough to compute AO.
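In code, the whole derivation collapses to one line (the function name is mine; AO here is the unoccluded fraction, matching the derivation above):

```cpp
#include <cassert>
#include <cmath>

// Half opening angle A (radians) of the bent cone, from the AO value.
// From AO = 1 - cosA  =>  A = acos(1 - AO).
float ConeHalfAngleFromAO(float ao)
{
    return std::acos(1.0f - ao);
}
```

Sanity checks: AO = 0 (fully occluded) gives A = 0 (a degenerate cone), and AO = 1 (fully unoccluded) gives A = pi/2 (the whole hemisphere).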
Some useful online math utility:
  • Graph Toy - iq's simple on-line graph visualizer
Here is a collection of links/slides from GDC 2014, GTC 2014, and Build 2014. I will update this page as I go:

Game Developer Conference (GDC) 2014:
  • Infamous: Second Son Engine Post Mortem - really interesting talk, recurring theme for next-gen rendering, i.e. physically based shading, screen-space reflection (SSR), compute shader.

    Notes on their rendering:
    • Physically based HDR shading on 1080p with SMAA T2x! This produces really nice and sharp image!
    • Tiled Deferred+ Renderer - build light list per-tile that can be used for both deferred and forward rendering (forward: pre-integrated skin, anisotropic cloth/hair, glass, shadows(?))
    • Lighting:
    • Shadows: 
      • Used Normal Offset Shadows to combat shadow acne
      • Filtering: PCF with varying kernel size, mostly 8x8, hair uses additional screen space marching (can they just use screen space marching?)
      • Resolved to screen?? - I don't quite get it
  • The Infamous: Second Son Particle System Architecture - 
  • The Visual Effects of Infamous: Second Son
  • Adding High-End Graphical Effects to GT Racing 2 on Android x86 (Adrian Voinea, Steven Hughes) - talks about how they add Bloom, DoF, LightShaft and Heat Haze. Pretty standard stuff, if you have been working on console.

GPU Technology Conference (GTC) 2014:
  • Direct3D 12 API Preview (video, slides) - talks about console level efficiency (CPU efficiency and parallelism). DX12 tries to achieve this by giving more control over GPU memory resources (pipeline state object, resource binding, descriptor heaps and tables, bundles). DX12 is going bindless!
Shadows are an important visual cue in rendering and have made their way into recent games via shadow mapping techniques. Most shadow mapping techniques only deal with shadows from opaque objects; there is not a lot of material about translucent object shadows.

Translucent Shadows in Starcraft II Review
In Starcraft II - Effects & Techniques, Dominic Filion describes how they render translucent shadows in Starcraft II. Here's how the rendering works:

* Requires a second shadow map and a color buffer. Let's name them the translucent shadow map and the translucent shadow buffer.

Shadow Maps Rendering
* Opaque Shadow Map: render opaque objects to opaque shadow map
* Translucent Shadow Map: render translucent objects to translucent shadow map (z-write on, z-test on with less equal, no alpha test, records depth of closest transparency)
* Translucent Shadow Buffer: clear to white, sort translucent objects front-to-back, use the opaque shadow map as the z-buffer, z-write off, z-test on with less-equal; records color information for transparencies in front of opaque objects

Scene Rendering
* Test both the opaque shadow map and the translucent shadow map
* If the translucent shadow map test fails, modulate by the color of the translucent shadow buffer
* Modulate by the result of the opaque shadow map test (binary test)

Things to note:

  • The technique completely separates opaque shadows and translucent shadows (that's why they have 2 shadow maps)
  • This technique handles self-shadowing correctly for both opaque objects and translucent objects
  • Translucent shadow buffer might not be necessary if we don't need colored shadows
  • If we are just doing hard shadows or 1 sample per shadow map, the opaque shadows and translucent shadows can be combined simply by modulation, i.e
  • vis = opaqueShadowVis * translucentShadowVis
  • In general case, if we want soft-shadows, we basically need to sum the visibility and divide by number of samples, i.e.
  • vis = 0
    for (i=0; i != num_samples; i++)
        vis += opaqueShadowVis[i] * translucentShadowVis[i];
    vis /= num_samples;
  • Remember, here we are talking about binary tests! So opaqueShadowVis and translucentShadowVis are point sampled.
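The soft-shadow combination above can be sketched as a small helper (the names are mine; each sample is the 0/1 result of a point-sampled shadow test):

```cpp
#include <cassert>
#include <cstddef>

// Average the per-sample products of the two binary shadow tests.
// 1 = lit, 0 = shadowed; the result is a visibility value in [0, 1].
float CombinedVisibility(const int* opaqueVis, const int* translucentVis,
                         size_t numSamples)
{
    float vis = 0.0f;
    for (size_t i = 0; i != numSamples; ++i)
        vis += float(opaqueVis[i] * translucentVis[i]);
    return vis / float(numSamples);
}
```

A point is lit only where both tests pass, so a single fully shadowed sample in either map darkens the result.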
Every once in a while I find some C/C++ tips and tricks. This page is going to be the repository of those tips and tricks.

Print the first n characters with printf()
const char* pStr = "C Tips and Tricks.";
printf( "%.*s\n", 6, pStr);  // This will print "C Tips"

Specifying enum type
enum EnumAsByte : unsigned char { ... };
enum EnumAsUnsignedShort : unsigned short { ... };

I have always been worried about C++ operator overloading because it can generate a bunch of temporaries, which may not be efficient. On the other hand, C++ compilers perform Return Value Optimization (RVO), which should eliminate those temporaries. So, I decided to check how the compiler (Visual Studio 2012) behaves.

Here's how my test class looks like:
struct Vec3
{
    float x, y, z;
    Vec3() {}
    Vec3(const Vec3& v) : x(v.x), y(v.y), z(v.z) {}
    Vec3(float _x, float _y, float _z) : x(_x), y(_y), z(_z) {}

    // *this = a*b
    void Mul(const Vec3& a, const Vec3& b)
    {
        x = a.x * b.x;
        y = a.y * b.y;
        z = a.z * b.z;
    }

    // *this += a
    void Add(const Vec3& a)
    {
        x += a.x;
        y += a.y;
        z += a.z;
    }
};

inline Vec3 operator*(const Vec3& a, const Vec3& b)
{   // RVO: return with temporary
    return Vec3(a.x * b.x, a.y * b.y, a.z * b.z);
}

inline Vec3 operator+(const Vec3& a, const Vec3& b)
{    // RVO: return with temporary
     return Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
}
The test is simple: I want to check how a*b + c behaves, using member functions vs. operator overloading:
// Function test
Vec3 res;
res.Mul(a, b);
res.Add(c);

// Operator overloading test
Vec3 res = a*b + c;
Using function, the assembly looks like this:
 Vec3 res1;
001B1277  movss       xmm0,dword ptr ds:[1B40E4h]  
001B127F  mulss       xmm0,dword ptr ds:[1B40F0h]  
001B1287  movss       xmm1,dword ptr ds:[1B40E8h]  
001B128F  movss       xmm2,dword ptr ds:[1B40ECh]  
001B1297  mulss       xmm1,dword ptr ds:[1B40F4h]  
001B129F  mulss       xmm2,dword ptr ds:[1B40F8h]  
001B12A7  movss       xmm4,dword ptr ds:[1B40D8h]  
001B12AF  movss       xmm3,dword ptr ds:[1B40DCh]  
001B12B7  addss       xmm4,xmm0  
001B12BB  movss       xmm0,dword ptr ds:[1B40E0h]  
001B12C3  addss       xmm3,xmm1  
001B12C7  addss       xmm0,xmm2  
with operator overloading, it looks like this:
 Vec3 res = a*b + c;
01001277  movss       xmm0,dword ptr ds:[10040F0h]  
0100127F  mulss       xmm0,dword ptr ds:[10040E4h]  
01001287  movss       xmm1,dword ptr ds:[10040F4h]  
0100128F  movss       xmm2,dword ptr ds:[10040F8h]  
01001297  mulss       xmm1,dword ptr ds:[10040E8h]  
0100129F  mulss       xmm2,dword ptr ds:[10040ECh]  
010012A7  movss       xmm4,dword ptr ds:[10040D8h]  
010012AF  movss       xmm3,dword ptr ds:[10040DCh]  
010012B7  addss       xmm4,xmm0  
010012BB  movss       xmm0,dword ptr ds:[10040E0h]  
010012C3  addss       xmm3,xmm1  
010012C7  addss       xmm0,xmm2  
So, they are identical! This means I shouldn't worry too much about using operator overloading. Next, let's try using a single function Madd() to do this and see how it fares against operator overloading. Madd() is simply implemented like this:
void Madd(const Vec3& a, const Vec3& b, const Vec3& c)
{
    x = a.x * b.x + c.x;
    y = a.y * b.y + c.y;
    z = a.z * b.z + c.z;
}
Let's see the assembly:
 Vec3 res3;
 res3.Madd(a, b, c);
011D1277  movss       xmm2,dword ptr ds:[11D40E4h]  
011D127F  movss       xmm1,dword ptr ds:[11D40E8h]  
011D1287  movss       xmm0,dword ptr ds:[11D40ECh]  
011D128F  mulss       xmm2,dword ptr ds:[11D40F0h]  
011D1297  mulss       xmm1,dword ptr ds:[11D40F4h]  
011D129F  mulss       xmm0,dword ptr ds:[11D40F8h]  
011D12A7  addss       xmm2,dword ptr ds:[11D40D8h]  
011D12AF  addss       xmm1,dword ptr ds:[11D40DCh]  
011D12B7  addss       xmm0,dword ptr ds:[11D40E0h]  
Arghhhh!! Madd() generates better code, both in instruction count and register usage, than operator overloading / the two-function approach (Mul() and Add()). Let's look closely at what the code is doing in the operator overloading / two-function case:
01031277  movss       xmm0,dword ptr ds:[10340F0h]   // xmm0  = a.x    (load a.x from mem)
0103127F  mulss       xmm0,dword ptr ds:[10340E4h] // xmm0 *= b.x    (xmm0 = a.x * b.x)
01031287  movss       xmm1,dword ptr ds:[10340F4h]   // xmm1  = a.y    (load a.y from mem)
0103128F  movss       xmm2,dword ptr ds:[10340F8h]   // xmm2  = a.z    (load a.z from mem)
01031297  mulss       xmm1,dword ptr ds:[10340E8h]   // xmm1 *= b.y    (xmm1 = a.y * b.y)
0103129F  mulss       xmm2,dword ptr ds:[10340ECh]   // xmm2 *= b.z    (xmm2 = a.z * b.z)
010312A7  movss       xmm4,dword ptr ds:[10340D8h]   // xmm4  = c.x    (load c.x from mem)
010312AF  movss       xmm3,dword ptr ds:[10340DCh]   // xmm3  = c.y    (load c.y from mem)
010312B7  addss       xmm4,xmm0     // xmm4 += xmm0   (xmm4 = c.x + a.x * b.x)
010312BB  movss       xmm0,dword ptr ds:[10340E0h]   // xmm0  = c.z    (load c.z from mem)
010312C3  addss       xmm3,xmm1     // xmm3 += xmm1   (xmm3 = c.y + a.y * b.y)
010312C7  addss       xmm0,xmm2     // xmm0 += xmm2   (xmm0 = c.z + a.z * b.z)
As you can see, the compiler does six loads: the components of a and c (b is read directly via memory operands in the mulss instructions). I'm guessing this is because the compiler can't see the global picture of what's going on; it optimizes each function (Mul() and Add()) separately. In contrast, in Madd(), the compiler sees everything and is able to perform a better optimization.
This page will consolidate intersection pseudo-code between geometric primitives.

Ray Segment vs Triangle
// Möller-Trumbore algorithm
// Ray:    r(t)   = rayOrg + rayDir * t
// Tri:    f(u,v) = (1-u-v)*p0 + u*p1 + v*p2
// Find (t,u,v) such that r(t) = f(u,v), with t,u,v in [0,1] and u+v <= 1
// Ref: 1. Real time rendering 3rd edition, pg. 746
//      2. http://www.scratchapixel.com/lessons/3d-basic-lessons/lesson-9-ray-triangle-intersection/m-ller-trumbore-algorithm/
bool IntersectRaySegmentTriangle(Vec3 rayOrg, Vec3 rayDir, 
                                 Vec3 p0, Vec3 p1, Vec3 p2,
                                 float& t, float& u, float& v) 
    Vec3 e1 = p1 - p0;
    Vec3 e2 = p2 - p0;
    Vec3 q  = cross(rayDir, e2);

    float det = dot(e1, q);
    if (det == 0)
        return false;
    float detInv = 1 / det;

    Vec3 s = rayOrg - p0;
    u = dot(s,q) * detInv;
    if (u < 0 || u > 1)
        return false;

    Vec3 r = cross(s, e1);
    v = dot(rayDir, r) * detInv;
    if (v < 0 || u+v > 1 )
        return false;

    t = dot(e2, r) * detInv;
    return (t >= 0 && t <= 1.0f);
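For reference, here's a self-contained C++ version of the pseudo-code above, with a minimal Vec3 and an epsilon test for the determinant (safer than an exact compare with floats):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b)      { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Moller-Trumbore ray-segment vs triangle test (see pseudo-code above).
bool IntersectRaySegmentTriangle(Vec3 rayOrg, Vec3 rayDir,
                                 Vec3 p0, Vec3 p1, Vec3 p2,
                                 float& t, float& u, float& v)
{
    const float kEps = 1e-8f;
    Vec3 e1 = p1 - p0;
    Vec3 e2 = p2 - p0;
    Vec3 q  = cross(rayDir, e2);

    float det = dot(e1, q);
    if (std::fabs(det) < kEps)      // ray parallel to the triangle plane
        return false;
    float detInv = 1.0f / det;

    Vec3 s = rayOrg - p0;
    u = dot(s, q) * detInv;
    if (u < 0.0f || u > 1.0f)
        return false;

    Vec3 r = cross(s, e1);
    v = dot(rayDir, r) * detInv;
    if (v < 0.0f || u + v > 1.0f)
        return false;

    t = dot(e2, r) * detInv;
    return (t >= 0.0f && t <= 1.0f);
}
```

For example, a unit segment from (0.25, 0.25, -1) along +z hits the triangle (0,0,0), (1,0,0), (0,1,0) at t = 1 with barycentrics u = v = 0.25.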

Double buffering is pretty much standard in games. We have two buffers: the front buffer and the back buffer. The idea is that the GPU works on the back buffer while the monitor displays the front buffer, and the two are flipped when the GPU is done. In addition, to prevent screen tearing, we usually turn on v-sync.

Assume you are making a 60 fps game (16.667 ms per frame) and your monitor refresh rate is 60 Hz. With double buffering and v-sync on, if your render time is > 16.667 ms, you will miss the v-sync and have to wait for the next one (your renderer stalls). Effectively this doubles your frame time to ~33 ms.

With programmer art, this is how you visualize a steady 16 ms (one char is one frame, * = frame renders fine, . = frame is dropped):

****************
If your game renders longer than 16 ms, you will be dropping frames consistently:

*.*.*.*.*.*.*.*.
Oh noo, this is badd!! Is there a way to fix this problem? Well, the easiest way is to make sure you always render within the ~16 ms budget. Practically speaking, though, there can be spikes in the game because of some effect turning on, an explosion, etc. These occasional spikes are the ones that kill you. Luckily there's a way to alleviate this issue... enter Triple Buffering.

Triple Buffering
Let me say this first, triple buffering is not a magic that will solve your slow render time. You still need to render within the budget. However, triple buffering can smooth out your occasional spikes.

As the name implies, triple buffering utilizes 3 buffers: the front buffer, back buffer 0, and back buffer 1. The idea is that the GPU works on back buffer 0 and can immediately switch to back buffer 1 when it's done, without waiting for back buffer 0 to be displayed. This gives each frame one extra vsync interval of slack before it must be displayed by the monitor. However, as you can predict, if your game consistently renders slowly, it will eventually drop frames.

Now, imagine your game renders at 1.5x of the budget: even with triple buffering you will still be dropping frames regularly, which looks like:

**.**.**.**.
Imagine you are rendering at 15 ms and there's a spike to 17 ms. With double buffering, you will feel the stutter immediately. With triple buffering, you will feel it too, but less often. The advantage of triple buffering is smoothing out spikes; remember, these are spikes, so by definition they don't last long and only occur occasionally. With triple buffering, you will get a silky smooth frame rate, just because of the additional one frame of buffering.

All is well and good, so what's the catch? There's overhead to triple buffering. Basically, you have to allocate one more frame buffer. In addition, you get one extra frame of latency before the frame gets displayed on the monitor.

If your game consistently renders slower than budget, for example 1.5x, you will have uneven frame dropping (described above), which can be more annoying than consistent frame dropping. So it's best to have an option to enable/disable triple buffering in your game and let the player choose. In the normal situation, where your game renders within budget with occasional spikes, I would just turn on triple buffering (assuming you have enough memory for the extra buffer).
The GPU has become a general purpose processor! Or at least it is becoming more and more general, as proven by the existence of GPGPU APIs such as DirectCompute, CUDA, and OpenCL. It's time to start learning Compute Shaders (CS); in this case, DirectCompute from D3D11.

Past GPGPU Coders...
Believe it or not, GPGPU actually existed before Compute Shaders arrived. However, you had to structure everything in terms of graphics: to launch a GPGPU computation you had to render geometry, and you basically used Pixel Shaders to do the computation.

While this style of GPGPU coding can still work today, we can do much better! Compute Shaders allow us to use the GPU much like we write regular code. The first benefit is that you don't need to care about the graphics pipeline; you just dispatch your Compute Shaders and that's it. In addition, Compute Shaders bypass the graphics pipeline (primitive assembly, rasterization, etc.), so you have the potential to run faster than GPGPU via Pixel Shaders... at least in theory.

Setting Up Simple Framework
In order to start learning Compute Shaders, we need a framework; a simple one that allows us to focus on writing Compute Shaders and learning their performance characteristics. A good place to start is the BasicCompute11 sample from the DirectX SDK.

I'd start from that sample, but we need a little bit more. We should upgrade to VS2012+ so that we can use the Visual Studio Graphics Debugger (VSGD) to profile the application. In addition, since we want to learn the performance characteristics of Compute Shaders, we need to be able to time them. There are a couple of references on how to do this:
  1. Nathan Reed: GPU Profiling 101 - http://www.reedbeta.com/blog/2011/10/12/gpu-profiling-101/
  2. MJP: Profiling in DX11 with Queries - http://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/
  3. OpenVIDIA: Events: Basic Profiling and Synchronization - http://openvidia.sourceforge.net/index.php/DirectCompute#Events:_Basic_Profiling_.26_Synchronization
I prefer doing it via D3D11 queries, specifically D3D11_QUERY_TIMESTAMP_DISJOINT and D3D11_QUERY_TIMESTAMP. However, don't forget to wait for the query data to become available when calculating the elapsed time of the compute shader. Basically, here's how I profile compute shaders:
void RunComputeShader(...)
{
    // Do some CS init, i.e. setting shader, resources, constant buffer

    pContext->Begin( g_pQueryDisjoint );
    pContext->End( g_pQueryBeginCS );   // timestamp queries are end-only

    pContext->Dispatch( x, y, z );

    pContext->End( g_pQueryEndCS );
    pContext->End( g_pQueryDisjoint );

    // Do some CS uninit

    // Collect time stamps

    // Wait for data to become available
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT tsDisjoint;
    while (pContext->GetData(g_pQueryDisjoint, &tsDisjoint, sizeof(tsDisjoint), 0) == S_FALSE) {}
    if (tsDisjoint.Disjoint)
        return;   // timestamps are unreliable this frame, skip

    UINT64 beginCSTimeStamp;
    UINT64 endCSTimeStamp;

    while (pContext->GetData(g_pQueryBeginCS, &beginCSTimeStamp, sizeof(UINT64), 0) == S_FALSE) {}
    while (pContext->GetData(g_pQueryEndCS, &endCSTimeStamp, sizeof(UINT64), 0) == S_FALSE) {}

    // Convert to real time
    float computeShaderElapsed = float(endCSTimeStamp - beginCSTimeStamp) / float(tsDisjoint.Frequency) * 1000.0f;
    printf("Compute shader done in %f ms\n", computeShaderElapsed);
}
For completeness, here's how I create and destroy the queries:
    // create
    D3D11_QUERY_DESC queryDisjointDesc;
    queryDisjointDesc.Query     = D3D11_QUERY_TIMESTAMP_DISJOINT;
    queryDisjointDesc.MiscFlags = 0;

    if (FAILED(g_pDevice->CreateQuery(&queryDisjointDesc, &g_pQueryDisjoint)))
        printf("Could not create timestamp disjoint query!");

    D3D11_QUERY_DESC queryDesc;
    queryDesc.Query     = D3D11_QUERY_TIMESTAMP;
    queryDesc.MiscFlags = 0;

    if (FAILED(g_pDevice->CreateQuery(&queryDesc, &g_pQueryBeginCS)))
        printf("Could not create start-frame timestamp query");

    if (FAILED(g_pDevice->CreateQuery(&queryDesc, &g_pQueryEndCS)))
        printf("Could not create end-frame timestamp query");
    // destroy
    SAFE_RELEASE( g_pQueryDisjoint );
    SAFE_RELEASE( g_pQueryBeginCS );
    SAFE_RELEASE( g_pQueryEndCS );    

That will allow us to start plunging into the world of Compute Shaders!
Just wanted to post my own version of XCode keyboard shortcuts.

XCode Editor:

  • Command + Shift + B - Build
  • Command + Control + J - Jump to definition
  • Command + Shift + O - Open file...
  • Command + Ctrl + Up/Down - Switch between header/implementation file
  • Command + Ctrl + Left/Right - Go back/forward on opened files
  • Command + ] - Indent multiple lines
  • Command + [ - Unindent multiple lines
  • Command + / - Comment/uncomment multiple lines
Collection of useful data structures

HashMap / HashSet

This page will discuss/contain links about next-generation rendering topics. When talking about next-generation, it's helpful to be specific: what I mean by next-generation is the PS4/Xbone generation.

Update: This is becoming my links dumping ground...

Linear Space Lighting

HDR Rendering / Tonemapping / Color Management

Physically Based Rendering

Sparse Voxel Octree

Global Illumination/Area Lights

Order Independent Transparency

Order Independent Transparency (OIT) is a rendering technique that doesn't require rendering geometry in sorted order.