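It turns out not to be straightforward. If you ever want to render a trapezoid mapped to square texture coordinates, i.e. (0,0) - (1,1), it won't turn out right: the quad is drawn as two triangles, each triangle interpolates the coordinates on its own, and the texture bends visibly along the diagonal.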

Turns out there's an easy way to fix this. Basically, instead of passing in float2 texture coordinates, you pass in a third coordinate as well, so that the texture coordinates can be projected (divided by that third component) per pixel.
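To make the idea concrete, here is a minimal CPU-side sketch in C. It is not necessarily the same as the linked solution: it assumes the common form of the trick, where each vertex stores (u*q, v*q, q) and q is the length of the parallel edge the vertex sits on. The names TexCoord3, make_projective_uv, lerp_tc, recover_u and recover_v are made up for illustration; in a real renderer the interpolation is done by the rasterizer and the divide happens in the pixel shader.

```c
#include <assert.h>

/* Homogeneous texture coordinate; the coordinate actually sampled is (s/q, t/q). */
typedef struct { float s, t, q; } TexCoord3;

/* Hypothetical helper: build the 3-component coordinate for one vertex.
   Assumption: edge_width is the length of the parallel edge this vertex lies on. */
static TexCoord3 make_projective_uv(float u, float v, float edge_width)
{
    assert(edge_width > 0.0f);
    TexCoord3 tc = { u * edge_width, v * edge_width, edge_width };
    return tc;
}

/* Plain linear interpolation between two vertices, standing in for the rasterizer. */
static TexCoord3 lerp_tc(TexCoord3 a, TexCoord3 b, float k)
{
    TexCoord3 tc = {
        a.s + k * (b.s - a.s),
        a.t + k * (b.t - a.t),
        a.q + k * (b.q - a.q)
    };
    return tc;
}

/* The per-pixel divide, standing in for what the pixel shader would do. */
static float recover_u(TexCoord3 tc) { return tc.s / tc.q; }
static float recover_v(TexCoord3 tc) { return tc.t / tc.q; }
```

For a trapezoid whose bottom edge is 2 units wide and whose top edge is 1 unit wide, the point halfway up the left side recovers v = 1/3 rather than 1/2. That is the perspective-style foreshortening that makes the texture look right across both triangles.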

The solution can be found here

Edit: It turns out, there's a more generic solution, i.e. quadrilateral interpolation:
Other references that might be useful:
Just want to share a collection of tricks for optimizing branches on the CPU.

Bounds Checking

Checking bounds [0,max)
// int i, max;
// if (i >= 0 && i < max) {}
if ((unsigned int) i < (unsigned int)max) {}
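The trick works because a negative int converts to an unsigned value above INT_MAX, so it fails the single comparison just like an index that is too large. A small sketch wrapping it in a function, assuming max is non-negative (the name in_half_open_range is made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when 0 <= i < max, using one unsigned comparison.
   Assumption: max >= 0. A negative i wraps to a value above INT_MAX,
   which is never less than a valid max. */
static bool in_half_open_range(int i, int max)
{
    assert(max >= 0);
    return (unsigned int)i < (unsigned int)max;
}
```

For example, in_half_open_range(-1, 10) and in_half_open_range(10, 10) are both false, while in_half_open_range(0, 10) and in_half_open_range(9, 10) are true.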

Checking bounds [min,max]
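// int i, min, max;  (requires min <= max)
// if (i >= min && i <= max) {}
// Cast before subtracting so the subtraction wraps instead of overflowing a signed int.
if ((unsigned int)i - (unsigned int)min <= (unsigned int)max - (unsigned int)min) {}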
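The same idea as a function, as a sketch under the assumption that min <= max. This version converts each operand to unsigned before subtracting, so the subtraction wraps around instead of risking signed overflow when i and min are far apart. The name in_closed_range is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when min <= i <= max, using one unsigned comparison.
   Assumption: min <= max. Subtracting min shifts the range to [0, max - min];
   any i below min wraps to a large unsigned value and fails the test. */
static bool in_closed_range(int i, int min, int max)
{
    assert(min <= max);
    return (unsigned int)i - (unsigned int)min
        <= (unsigned int)max - (unsigned int)min;
}
```

For example, in_closed_range(3, 3, 7) and in_closed_range(7, 3, 7) are true, while in_closed_range(2, 3, 7) and in_closed_range(8, 3, 7) are false. Negative ranges work too: in_closed_range(-5, -10, -1) is true.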