I always worry when using C++ operator overloading, because it can generate a bunch of temporaries and may not be efficient. On the other hand, compilers can apply Return Value Optimization (RVO) to elide those temporaries. So, I decided to check how the compiler (Visual Studio 2012) actually behaves.

Here's what my test class looks like:
struct Vec3
{
    float x, y, z;
    Vec3() {}
    Vec3(const Vec3& v) : x(v.x), y(v.y), z(v.z) {}
    Vec3(float _x, float _y, float _z) : x(_x), y(_y), z(_z) {}

    // *this = a*b
    void Mul(const Vec3& a, const Vec3& b)
    {
        x = a.x * b.x;
        y = a.y * b.y;
        z = a.z * b.z;
    }
    // *this += a
    void Add(const Vec3& a)
    {
        x += a.x;
        y += a.y;
        z += a.z;
    }
};

inline Vec3 operator*(const Vec3& a, const Vec3& b)
{   // RVO: return with temporary
    return Vec3(a.x * b.x, a.y * b.y, a.z * b.z);
}

inline Vec3 operator+(const Vec3& a, const Vec3& b)
{    // RVO: return with temporary
     return Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
}
The test is simple: I want to check how a*b + c behaves, using functions vs. operator overloading:
// Function test
Vec3 res;
res.Mul(a,b);
res.Add(c);

// Operator overloading test
Vec3 res = a*b + c;
Using the functions (Mul() and Add()), the assembly looks like this:
 Vec3 res1;
 res1.Mul(a,b);
 res1.Add(c);
001B1277  movss       xmm0,dword ptr ds:[1B40E4h]  
001B127F  mulss       xmm0,dword ptr ds:[1B40F0h]  
001B1287  movss       xmm1,dword ptr ds:[1B40E8h]  
001B128F  movss       xmm2,dword ptr ds:[1B40ECh]  
001B1297  mulss       xmm1,dword ptr ds:[1B40F4h]  
001B129F  mulss       xmm2,dword ptr ds:[1B40F8h]  
001B12A7  movss       xmm4,dword ptr ds:[1B40D8h]  
001B12AF  movss       xmm3,dword ptr ds:[1B40DCh]  
001B12B7  addss       xmm4,xmm0  
001B12BB  movss       xmm0,dword ptr ds:[1B40E0h]  
001B12C3  addss       xmm3,xmm1  
001B12C7  addss       xmm0,xmm2  
With operator overloading, it looks like this:
 Vec3 res = a*b + c;
01001277  movss       xmm0,dword ptr ds:[10040F0h]  
0100127F  mulss       xmm0,dword ptr ds:[10040E4h]  
01001287  movss       xmm1,dword ptr ds:[10040F4h]  
0100128F  movss       xmm2,dword ptr ds:[10040F8h]  
01001297  mulss       xmm1,dword ptr ds:[10040E8h]  
0100129F  mulss       xmm2,dword ptr ds:[10040ECh]  
010012A7  movss       xmm4,dword ptr ds:[10040D8h]  
010012AF  movss       xmm3,dword ptr ds:[10040DCh]  
010012B7  addss       xmm4,xmm0  
010012BB  movss       xmm0,dword ptr ds:[10040E0h]  
010012C3  addss       xmm3,xmm1  
010012C7  addss       xmm0,xmm2  
So, they are identical! That means I shouldn't worry too much about using operator overloading. Next, let's try a single function Madd() that does the whole a*b + c and see how it fares against operator overloading. Madd() is simply implemented like this:
void Madd(const Vec3& a, const Vec3& b, const Vec3& c)
{
    x = a.x * b.x + c.x;
    y = a.y * b.y + c.y;
    z = a.z * b.z + c.z;
}
Let's see the assembly:
 Vec3 res3;
 res3.Madd(a, b, c);
011D1277  movss       xmm2,dword ptr ds:[11D40E4h]  
011D127F  movss       xmm1,dword ptr ds:[11D40E8h]  
011D1287  movss       xmm0,dword ptr ds:[11D40ECh]  
011D128F  mulss       xmm2,dword ptr ds:[11D40F0h]  
011D1297  mulss       xmm1,dword ptr ds:[11D40F4h]  
011D129F  mulss       xmm0,dword ptr ds:[11D40F8h]  
011D12A7  addss       xmm2,dword ptr ds:[11D40D8h]  
011D12AF  addss       xmm1,dword ptr ds:[11D40DCh]  
011D12B7  addss       xmm0,dword ptr ds:[11D40E0h]  
Arghhhh!! Madd() generates better code, in both instruction count and register usage, than operator overloading / the two functions (Mul() and Add()). Let's look closely at what the operator overloading / two-function code is doing:
01031277  movss       xmm0,dword ptr ds:[10340F0h]   // xmm0  = a.x    (load a.x from mem)
0103127F  mulss       xmm0,dword ptr ds:[10340E4h] // xmm0 *= b.x    (xmm0 = a.x * b.x)
01031287  movss       xmm1,dword ptr ds:[10340F4h]   // xmm1  = a.y    (load a.y from mem)
0103128F  movss       xmm2,dword ptr ds:[10340F8h]   // xmm2  = a.z    (load a.z from mem)
01031297  mulss       xmm1,dword ptr ds:[10340E8h]   // xmm1 *= b.y    (xmm1 = a.y * b.y)
0103129F  mulss       xmm2,dword ptr ds:[10340ECh]   // xmm2 *= b.z    (xmm2 = a.z * b.z)
010312A7  movss       xmm4,dword ptr ds:[10340D8h]   // xmm4  = c.x    (load c.x from mem)
010312AF  movss       xmm3,dword ptr ds:[10340DCh]   // xmm3  = c.y    (load c.y from mem)
010312B7  addss       xmm4,xmm0     // xmm4 += xmm0   (xmm4 = c.x + a.x * b.x)
010312BB  movss       xmm0,dword ptr ds:[10340E0h]   // xmm0  = c.z    (load c.z from mem)
010312C3  addss       xmm3,xmm1     // xmm3 += xmm1   (xmm3 = c.y + a.y * b.y)
010312C7  addss       xmm0,xmm2     // xmm0 += xmm2   (xmm0 = c.z + a.z * b.z)
As you can see, the compiler does six register loads (the components of a and c). I'm guessing this is because the compiler can't see the global picture of what's going on: it's optimizing each function (Mul() and Add()) separately. In contrast, in Madd() the compiler sees everything and is able to perform better optimization.
This page consolidates intersection pseudo-code for geometric primitives.

Ray Segment vs Triangle
// Möller-Trumbore algorithm
//
// Ray:    r(t)   = rayOrg + rayDir * t
// Tri:    f(u,v) = (1-u-v)*p0 + u*p1 + v*p2
//
// Find (t,u,v) such that r(t) = f(u,v), with t,u,v in [0,1] and u+v <= 1
//
// Ref: 1. Real-Time Rendering, 3rd edition, pg. 746
//      2. http://www.scratchapixel.com/lessons/3d-basic-lessons/lesson-9-ray-triangle-intersection/m-ller-trumbore-algorithm/
bool IntersectRaySegmentTriangle(Vec3 rayOrg, Vec3 rayDir, 
                                 Vec3 p0, Vec3 p1, Vec3 p2,
                                 float& t, float& u, float& v) 
{
    Vec3 e1 = p1 - p0;
    Vec3 e2 = p2 - p0;
    Vec3 q  = cross(rayDir, e2);

    float det = dot(e1, q);
    if (det == 0)   // ray is parallel to the triangle plane
        return false;

    float detInv = 1.0f / det;

    Vec3 s = rayOrg - p0;
    u = dot(s, q) * detInv;
    if (u < 0 || u > 1)
        return false;

    Vec3 r = cross(s, e1);
    v = dot(rayDir, r) * detInv;
    if (v < 0 || u+v > 1 )
        return false;

    t = dot(e2, r) * detInv;
    return (t >= 0 && t <= 1.0f);
}

Double buffering is pretty much standard in games. We have two buffers: a front buffer and a back buffer. The idea is that the GPU renders to the back buffer while the monitor displays the front buffer, and the buffers are flipped when the GPU is done. In addition, to prevent screen tearing, we usually turn on v-sync, so the flip only happens on a vertical sync.

Assume you are making a 60 fps game (16.667 ms per frame) and your monitor's refresh rate is 60 Hz. With double buffering and v-sync on, if your render time is > 16.667 ms, you will miss the v-sync and have to wait for the next one (your renderer stalls). Effectively, this doubles your frame time to ~33 ms.

With programmer art, here's how to visualize a steady 16 ms frame time (one char is one frame, * = frame renders fine, . = frame is dropped):

*******

If your game consistently renders longer than 16 ms (but under 33 ms), you will be dropping every other frame:

*.*.*.*.

Oh no, this is bad!! Is there a way to fix this problem? Well, the easiest way is to make sure you always render within the ~16 ms budget. Practically speaking, though, there can be spikes in the game because of some effect turning on, an explosion, etc. These occasional spikes are the ones that kill. Luckily there's a way to alleviate this issue... enter triple buffering.

Triple Buffering
Let me say this first: triple buffering is not magic that will fix a slow render time. You still need to render within the budget. However, triple buffering can smooth out occasional spikes.

As the name implies, triple buffering uses three buffers: a front buffer, back buffer 0, and back buffer 1. The idea is that the GPU renders to back buffer 0 and can immediately switch to back buffer 1 when it's done, without waiting for the flip. This gives each finished frame up to one extra frame of slack before it must be displayed by the monitor. However, as you can predict, if your game consistently renders slowly, it will eventually drop frames.

Now, imagine your game renders at 1.5x the budget: every third v-sync will have no new frame, which looks like:

**.**.**.**.

Imagine you are rendering at 15 ms and there's a spike to 17 ms. With double buffering, you will feel the stutter immediately. With triple buffering, you may still feel it, but far less often, because the extra buffered frame can absorb an isolated spike. The advantage of triple buffering is smoothing out spikes; remember, this is a spike, so by definition it doesn't last long and only occurs occasionally. With triple buffering you get a silky smooth frame rate, just because of that additional one frame of buffer time.

All is well and good, so what's the catch? Triple buffering has overhead. Basically, you have to allocate one more frame buffer. In addition, you get one extra frame of input lag, because a finished frame waits one more frame before it gets displayed on the monitor.

If your game consistently renders slower, for example at 1.5x the budget, you will get the uneven frame dropping described above, which can be more annoying than dropping frames consistently. So it's best to have an option to enable/disable triple buffering in your game and let the player choose. In the normal situation, where your game renders within budget with occasional spikes, I would just turn triple buffering on (assuming you have the resources to spare).