Nov 30, 2009

Better, but ...


Better, but there are still some bugs:
1> I don't know the root cause of the distortion (back surface).
2> Some refractions have disappeared.
3> Both the bottom-right and bottom-left corners are rendered with incorrect lighting.

Nov 25, 2009

Tired~

I feel tired today, and I still don't have a better idea of how to implement some of the details. So I just added a few lines to get this (hard-coded).

Nov 24, 2009

BusyBusyBusy.......


I have been very busy recently (due to a new job), but coding is still exciting! Above is my new toy!

Nov 6, 2009

CUDA Note[5]="float";


#include <cstdio>
#include <cuda_runtime.h>

//-----------------------------------------------------------------------------
__global__ void __FloatTest(unsigned int* pIn)
{
    unsigned int fMask = *pIn;

    // Weighted sum of the four bytes of the input word.
    float iMask =
        (float)((fMask & 0xff000000) >> 24) * 0.1122f +
        (float)((fMask & 0x00ff0000) >> 16) * 0.2233f +
        (float)((fMask & 0x0000ff00) >>  8) * 0.3344f +
        (float)((fMask & 0x000000ff) >>  0) * 0.4455f;

    // Write the float result back over the input word.
    *((float*)pIn) = iMask;
}

//-----------------------------------------------------------------------------
void FloatTest()
{
    const unsigned int fMask = 0x22446688;

    float iMask = 0.0f;
    float* pMask = 0;

    ::cudaMalloc(&pMask, sizeof(float));

    ::cudaMemcpy(pMask, &fMask, sizeof(unsigned int), cudaMemcpyHostToDevice);

    __FloatTest<<<1, 1>>>((unsigned int*)pMask);

    ::cudaMemcpy(&iMask, pMask, sizeof(float), cudaMemcpyDeviceToHost);

    ::cudaFree(pMask);

    ::printf("gpu : %f\n", iMask);

    iMask =
        (float)((fMask & 0xff000000) >> 24) * 0.1122f +
        (float)((fMask & 0x00ff0000) >> 16) * 0.2233f +
        (float)((fMask & 0x0000ff00) >>  8) * 0.3344f +
        (float)((fMask & 0x000000ff) >>  0) * 0.4455f;

    ::printf("cpu : %f\n", iMask);
}


output :
gpu : 113.695999
cpu : 113.695999


//-----------------------------------------------------------------------------
__global__ void __FloatTest(unsigned int* pIn)
{
    unsigned int fMask = *pIn;

    float iMask =
        (float)((fMask & 0xff000000) >> 24) * 0.112233f +
        (float)((fMask & 0x00ff0000) >> 16) * 0.223344f +
        (float)((fMask & 0x0000ff00) >>  8) * 0.334455f +
        (float)((fMask & 0x000000ff) >>  0) * 0.445566f;

    *((float*)pIn) = iMask;
}

//-----------------------------------------------------------------------------
void FloatTest()
{
    const unsigned int fMask = 0x22446688;

    float iMask = 0.0f;
    float* pMask = 0;

    ::cudaMalloc(&pMask, sizeof(float));

    ::cudaMemcpy(pMask, &fMask, sizeof(unsigned int), cudaMemcpyHostToDevice);

    __FloatTest<<<1, 1>>>((unsigned int*)pMask);

    ::cudaMemcpy(&iMask, pMask, sizeof(float), cudaMemcpyDeviceToHost);

    ::cudaFree(pMask);

    ::printf("gpu : %f\n", iMask);

    iMask =
        (float)((fMask & 0xff000000) >> 24) * 0.112233f +
        (float)((fMask & 0x00ff0000) >> 16) * 0.223344f +
        (float)((fMask & 0x0000ff00) >>  8) * 0.334455f +
        (float)((fMask & 0x000000ff) >>  0) * 0.445566f;

    ::printf("cpu : %f\n", iMask);
}


output :
gpu : 113.714699
cpu : 113.714706

So keep in mind that float results may differ between the CPU and the GPU. The likely cause: the CUDA compiler fuses each multiply and add into a single FMAD instruction, which truncates the intermediate product, while the CPU rounds every multiply and add separately.
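
If matching results matters, the intrinsics __fmul_rn and __fadd_rn are documented never to be merged into FMADs. Below is a minimal sketch of the kernel rewritten with them (my code, not part of the original test; __FloatTestExact is a made-up name, and it assumes the CPU side is compiled to plain single-precision math):

//-----------------------------------------------------------------------------
// Sketch: force separate round-to-nearest multiplies and adds so the
// compiler cannot fuse them into FMAD.
__global__ void __FloatTestExact(unsigned int* pIn)
{
    unsigned int fMask = *pIn;

    float iMask =            __fmul_rn((float)((fMask & 0xff000000) >> 24), 0.112233f);
    iMask = __fadd_rn(iMask, __fmul_rn((float)((fMask & 0x00ff0000) >> 16), 0.223344f));
    iMask = __fadd_rn(iMask, __fmul_rn((float)((fMask & 0x0000ff00) >>  8), 0.334455f));
    iMask = __fadd_rn(iMask, __fmul_rn((float)((fMask & 0x000000ff) >>  0), 0.445566f));

    *((float*)pIn) = iMask;
}

The price is speed: the fused version is one instruction per term, while the separated version pays for every multiply and add individually.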

Nov 1, 2009

MSN Doodle


So, do I have any talent for being a comics-and-text writer? XD

Oct 31, 2009

funny type !


#include <cstdio>

struct NullType
{};

template <
    typename X,
    typename Y,
    typename Z = NullType,
    typename W = NullType>
struct Caster
{
    X x;
    Y y;
    Z z;
    W w;
};

int main()
{
    ::printf("%d\n", (int)sizeof(NullType));
    ::printf("%d\n", (int)sizeof(Caster<char, char>));
    ::printf("%d\n", (int)sizeof(Caster<int, int>));
    ::printf("%d\n", (int)sizeof(Caster<int, int, char, char>));
    ::printf("%d\n", (int)sizeof(Caster<char, char, char, char>));
}


There is nothing new in this post; you can find the null type in "Modern C++ Design". But the output is really funny (compiled with VC9):
1
4
12
12
4

So... NullType occupies 1 byte even though it's empty. And it takes part in the normal padding rules (Caster<int, int> is 4 + 4 + 1 + 1 = 10 bytes, padded up to 12 to keep 4-byte alignment), which I'm not interested in right now XD.
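
For the record, the usual way to not pay that byte is the empty base optimization. A small sketch (mine, not from the book; AsMember and AsBase are made-up names) showing that an empty class costs a byte as a member but usually nothing as a base:

#include <cstdio>

struct NullType
{};

// As a member, the empty NullType still needs its own address,
// so it costs 1 byte plus padding.
struct AsMember
{
    NullType n;
    int x;
};

// As a base class, the compiler may give it zero size
// (the empty base optimization).
struct AsBase : NullType
{
    int x;
};

int main()
{
    ::printf("%d\n", (int)sizeof(AsMember)); // typically 8 on VC9 (1 + 3 padding + 4)
    ::printf("%d\n", (int)sizeof(AsBase));   // typically 4
}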

Oct 30, 2009

CUDA Note[4]="cast & align";


#include <cstdio>
#include <cuda_runtime.h>

//-----------------------------------------------------------------------------
__global__ void RCastTest0(unsigned int* rgTar, unsigned int* rgSrc)
{
    // Aligned read.
    rgTar[0] = rgSrc[0];

    // Unaligned read: a u32 from a 2-byte offset.
    rgTar[1] = *(unsigned int*)((unsigned char*)rgSrc + 2);
}

//-----------------------------------------------------------------------------
void RCastTest()
{
    unsigned int rgTestSrcHost[2] = {0x11223344, 0xaabbccdd};
    unsigned int rgTestTarHost[2];
    unsigned int* rgTestSrcDevice = 0;
    unsigned int* rgTestTarDevice = 0;
    unsigned int* rgTestDevice = 0;

    ::cudaMalloc(&rgTestDevice, 4 * sizeof(unsigned int));

    rgTestTarDevice = rgTestDevice;
    rgTestSrcDevice = rgTestDevice + 2;

    ::cudaMemcpy(
        rgTestSrcDevice,
        rgTestSrcHost,
        2 * sizeof(unsigned int),
        cudaMemcpyHostToDevice);

    //--cast in cuda
    RCastTest0<<<1, 1>>>(rgTestTarDevice, rgTestSrcDevice);

    ::cudaMemcpy(
        rgTestTarHost,
        rgTestTarDevice,
        2 * sizeof(unsigned int),
        cudaMemcpyDeviceToHost);

    //--cast in cpu
    unsigned int dCasted = *(unsigned int*)((unsigned char*)(rgTestSrcHost) + 2);

    ::printf("CUDA (align) : 0x%08X\n", rgTestTarHost[0]);
    ::printf("CUDA (un-align) : 0x%08X\n", rgTestTarHost[1]);
    ::printf("CPU (un-align) : 0x%08X\n", dCasted);

    ::cudaFree(rgTestDevice);
}


A simple test of casting in CUDA. Device memory is aligned when it is allocated (to 256 bytes). Everything looks fine until you try the usual optimization tricks from C. For example, when making a grayscale image from an R8G8B8 one, you can read three u8 values and compute the luminance, or read one u32 and extract the channels with bit operations. The u32 version is fine on the CPU, and should perform even better in CUDA, since accessing global memory is pretty slow. But as this test shows, you can't do it the way you would on the CPU: when reading from global memory, CUDA aligns the internal read address down to the size of the cast type (4 bytes for u32, 2 for u16, etc.), so the unaligned u32 read silently returns the aligned word instead.
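
So if a kernel really must read a u32 from an unaligned address, it has to assemble the value from bytes itself. A minimal sketch (my code, assuming the little-endian layout seen in the test above); note that each byte is its own slow global read, which is exactly what the trick below avoids:

//-----------------------------------------------------------------------------
// Sketch: build an unaligned little-endian u32 one byte at a time.
__device__ unsigned int LoadUnaligned32(const unsigned char* p)
{
    return ((unsigned int)p[0] <<  0) |
           ((unsigned int)p[1] <<  8) |
           ((unsigned int)p[2] << 16) |
           ((unsigned int)p[3] << 24);
}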

So there is a trick for grayscale conversion. If I compute the luma by reading three u8 values per pixel, every pixel needs 4 global memory accesses (3 reads, 1 write). But if I process 4 pixels in one thread, I can read three u32 words (with the first one 4-byte aligned) and write 4 bytes: 7 accesses for 4 pixels, an average of 1.75 per pixel! A sketch follows.
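
A sketch of that idea (my code, not measured; the BT.601 luma weights and the assumption that the pixel count is a multiple of 4 are mine, and Rgb888ToGray is a made-up name):

//-----------------------------------------------------------------------------
// Sketch: each thread converts 4 R8G8B8 pixels. The 12 input bytes arrive as
// 3 aligned u32 reads; 4 single-byte writes follow (7 accesses / 4 pixels).
__device__ unsigned char Luma(unsigned int r, unsigned int g, unsigned int b)
{
    // BT.601 weights (an assumption; the post never fixes the weights).
    return (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
}

__global__ void Rgb888ToGray(unsigned char* pGray, const unsigned int* pRgb, int cPixel)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4; // first of this thread's 4 pixels
    if (i + 3 >= cPixel)
        return;

    int k = (i / 4) * 3;           // 4 pixels = 12 bytes = 3 u32 words
    unsigned int d0 = pRgb[k + 0]; // bytes (little-endian): r0 g0 b0 r1
    unsigned int d1 = pRgb[k + 1]; // bytes: g1 b1 r2 g2
    unsigned int d2 = pRgb[k + 2]; // bytes: b2 r3 g3 b3

    pGray[i + 0] = Luma( d0        & 0xff, (d0 >>  8) & 0xff, (d0 >> 16) & 0xff);
    pGray[i + 1] = Luma( d0 >> 24        ,  d1        & 0xff, (d1 >>  8) & 0xff);
    pGray[i + 2] = Luma((d1 >> 16) & 0xff,  d1 >> 24        ,  d2        & 0xff);
    pGray[i + 3] = Luma((d2 >>  8) & 0xff, (d2 >> 16) & 0xff,  d2 >> 24        );
}

Packing the four results into one u32 and writing it with a single store would bring it down further, to 4 accesses for 4 pixels, i.e. 1 per pixel.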

p.s.
output :
CUDA (align) : 0x11223344
CUDA (un-align) : 0x11223344
CPU (un-align) : 0xCCDD1122