Waiting & Waiting: 2009-10

Oct 31, 2009

Funny type!


#include <stdio.h>

struct NullType
{};

template <
    typename X,
    typename Y,
    typename Z = NullType,
    typename W = NullType>
struct Caster
{
    X   x;
    Y   y;
    Z   z;
    W   w;
};

int main()
{
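    // sizeof yields a size_t, so cast it before printing with %d.
    ::printf("%d\n", (int)sizeof(NullType));
    // NOTE: the template arguments of the four Caster instantiations are missing here.
    ::printf("%d\n", (int)sizeof(Caster));
    ::printf("%d\n", (int)sizeof(Caster));
    ::printf("%d\n", (int)sizeof(Caster));
    ::printf("%d\n", (int)sizeof(Caster));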
}

There is nothing new in this post; you can find the null type in "Modern C++ Design". But the output is really funny (compiled with VC9):
1
4
12
12
4

So... NullType occupies 1 byte even though it's empty. And it follows a special padding rule (which I'm not interested in right now XD).
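Here is a small sketch of why sizes like these show up. The two instantiations (`CharPair`, `IntPair`) and the `IntWithEmptyBase` struct are my own picks for illustration, not the ones from the original test, and the numbers in the comments assume a 4-byte `int`.

```cpp
#include <cstddef>

struct NullType {};

template <typename X, typename Y,
          typename Z = NullType, typename W = NullType>
struct Caster {
    X x;
    Y y;
    Z z;
    W w;
};

// Hypothetical instantiations, chosen only for illustration.
// Every member needs its own address, so each NullType member takes 1 byte.
typedef Caster<char, char> CharPair;  // 1 + 1 + 1 + 1 = 4 bytes, no padding needed
typedef Caster<int, int>   IntPair;   // 4 + 4 + 1 + 1 = 10, rounded up to 12 (multiple of 4)

// As a base class, an empty type normally adds no size at all
// (the empty base optimization).
struct IntWithEmptyBase : NullType {
    int value;
};

// Small helpers so the sizes can be printed or checked at run time.
std::size_t SizeOfNullType()         { return sizeof(NullType); }
std::size_t SizeOfCharPair()         { return sizeof(CharPair); }
std::size_t SizeOfIntPair()          { return sizeof(IntPair); }
std::size_t SizeOfIntWithEmptyBase() { return sizeof(IntWithEmptyBase); }
```

So the "padding rule" is most likely just the normal one: the struct size gets rounded up to a multiple of its strictest member alignment.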

Oct 30, 2009

CUDA Note[4]="cast & align";


#include <stdio.h>
#include <cuda_runtime.h>

//-----------------------------------------------------------------------------
__global__ void RCastTest0(unsigned int* rgTar, unsigned int* rgSrc)
{
    rgTar[0] = rgSrc[0];

    rgTar[1] = *(unsigned int*)((unsigned char*)rgSrc + 2);
}

//-----------------------------------------------------------------------------
void RCastTest()
{
    unsigned int  rgTestSrcHost[2] = {0x11223344, 0xaabbccdd};
    unsigned int  rgTestTarHost[2];
    unsigned int* rgTestSrcDevice = 0;
    unsigned int* rgTestTarDevice = 0;
    unsigned int* rgTestDevice = 0;

    ::cudaMalloc(&rgTestDevice, 4 * sizeof(unsigned int));

    rgTestTarDevice = rgTestDevice;
    rgTestSrcDevice = rgTestDevice + 2;

    ::cudaMemcpy(
        rgTestSrcDevice,
        rgTestSrcHost,
        2 * sizeof(unsigned int),
        cudaMemcpyHostToDevice);

    //--cast in cuda
    RCastTest0<<<1, 1>>>(rgTestTarDevice, rgTestSrcDevice);

    ::cudaMemcpy(
        rgTestTarHost,
        rgTestTarDevice,
        2 * sizeof(unsigned int),
        cudaMemcpyDeviceToHost);

    //--cast in cpu
    unsigned int dCasted = *(unsigned int*)((unsigned char*)(rgTestSrcHost) + 2);

    ::printf("CUDA (align)    : 0x%08X\n", rgTestTarHost[0]);
    ::printf("CUDA (un-align) : 0x%08X\n", rgTestTarHost[1]);
    ::printf("CPU  (un-align) : 0x%08X\n", dCasted);

    ::cudaFree(rgTestDevice);
}

A simple test of casting in CUDA. Device memory is aligned (to 256 bytes) when it is allocated. Everything is fine as long as you forget the optimization tricks from C. For example, when making a grayscale image from an r8g8b8 one, you can read 3 u8 and calculate the luminance, or read one u32 and then extract the channels with bit operations. This is OK on the CPU, and it should pay off even more in CUDA since accessing global memory is pretty slow. But as this test shows, you can't do it the same way as on the CPU. When reading from global memory, CUDA aligns the read address to the size of the type being cast to (4 for u32, 2 for u16, etc.), so the unaligned read returns the same word as the aligned one.

So there is a trick for grayscale. If I just calculate luma by reading 3 u8, every pixel needs 4 global memory accesses (3 reads, 1 write). But if I process 4 pixels in one thread, I can read 3 u32 (the first of which is 4-byte aligned), and the average number of reads and writes drops to 1.75 per pixel (3 reads + 4 writes for 4 pixels)!
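A rough host-side sketch of the 4-pixels-per-thread unpacking follows. The function names `Luma` and `Gray4` are made up, the luma weights (77/150/29 out of 256) are my assumption since the formula is not stated above, and the words are assumed to be little-endian loads of r8g8b8 data.

```cpp
#include <cstdint>

// Integer luma approximation (weights 77/150/29 out of 256).
// ASSUMPTION: these weights are only an example formula.
inline std::uint32_t Luma(std::uint32_t r, std::uint32_t g, std::uint32_t b)
{
    return (77u * r + 150u * g + 29u * b) >> 8;
}

// Converts 4 packed r8g8b8 pixels (12 bytes, given as 3 little-endian
// 32-bit words) into 4 gray bytes packed into one 32-bit word.
std::uint32_t Gray4(std::uint32_t w0, std::uint32_t w1, std::uint32_t w2)
{
    // Byte layout in memory: R0 G0 B0 R1 | G1 B1 R2 G2 | B2 R3 G3 B3
    const std::uint32_t y0 = Luma(w0 & 0xFF, (w0 >> 8) & 0xFF, (w0 >> 16) & 0xFF);
    const std::uint32_t y1 = Luma(w0 >> 24, w1 & 0xFF, (w1 >> 8) & 0xFF);
    const std::uint32_t y2 = Luma((w1 >> 16) & 0xFF, w1 >> 24, w2 & 0xFF);
    const std::uint32_t y3 = Luma((w2 >> 8) & 0xFF, (w2 >> 16) & 0xFF, w2 >> 24);

    // Pack the four gray values, pixel 0 in the lowest byte.
    return y0 | (y1 << 8) | (y2 << 16) | (y3 << 24);
}
```

In a kernel, the three words would come from three aligned u32 reads. Because the four results are packed here, they could also go out as a single aligned u32 write, as long as the output buffer is aligned too.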

p.s.
output:
CUDA (align)    : 0x11223344
CUDA (un-align) : 0x11223344
CPU  (un-align) : 0xCCDD1122
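To make the difference between the two results concrete, here is a plain C++ emulation. `ReadUnaligned` and `ReadRoundedDown` are hypothetical helper names; the second one only mimics the behaviour observed in the output above and is not a statement about how the hardware works internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads a 32-bit value starting at any byte offset, like the CPU cast did.
std::uint32_t ReadUnaligned(const unsigned char* base, std::size_t offset)
{
    std::uint32_t value;
    std::memcpy(&value, base + offset, sizeof(value));
    return value;
}

// Emulates the observed device result: the offset is rounded down to a
// multiple of sizeof(uint32_t) before the read takes place.
std::uint32_t ReadRoundedDown(const unsigned char* base, std::size_t offset)
{
    const std::size_t aligned = offset & ~(sizeof(std::uint32_t) - 1);
    return ReadUnaligned(base, aligned);
}
```

With the bytes of {0x11223344, 0xaabbccdd} on a little-endian host, `ReadUnaligned` at offset 2 gives 0xCCDD1122 while `ReadRoundedDown` at offset 2 gives 0x11223344, the same pair as in the test.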

Oct 29, 2009

CUDA Note [3] = "cudart.dll";

cudart.dll does not seem to be necessary for your client application if you develop CUDA with the driver API. But the driver API is not that friendly. (BTW, I have not given it a try yet.)
It also does not seem necessary to separate the CUDA code from the main code into a different binary, since cudart.dll doesn't depend on anything special. Separating them only benefits coworkers who haven't installed the CUDA SDK.
If you develop with the CUDA runtime, you have to ship cudart.dll too.

So... I guess I made the wrong decision (separating the CUDA code into another binary) due to a misunderstanding.

CUDA Note [2] = "Driver Version";

cudaGetDeviceCount(&cDevice) may find no CUDA device if your CUDA runtime is newer than the driver. Currently you can download the v2.3 runtime from NVIDIA, but the matching notebook driver is still in beta. If you mix this runtime with another driver, all the samples will crash because they can't find any CUDA device.
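One way to catch this before the samples crash is to compare the two version numbers up front. The sketch below is plain C++ with made-up helper names. In a real program the two integers would come from the runtime API calls cudaDriverGetVersion and cudaRuntimeGetVersion, which report versions as 1000 * major + 10 * minor.

```cpp
// CUDA encodes versions as 1000 * major + 10 * minor, e.g. 2.3 -> 2030.
inline int VersionMajor(int version) { return version / 1000; }
inline int VersionMinor(int version) { return (version % 1000) / 10; }

// Returns true when the installed driver is at least as new as the runtime
// the application was built against. A driver version of 0 (no driver
// installed) therefore counts as unsupported.
inline bool DriverSupportsRuntime(int driverVersion, int runtimeVersion)
{
    return driverVersion >= runtimeVersion;
}
```

If `DriverSupportsRuntime` returns false, the application can print a readable message (using `VersionMajor` and `VersionMinor`) instead of failing later.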

Oct 24, 2009

CUDA Note [1] = "design strategy";

There are so many "CLs":

OpenCL
DX11 compute shader
CUDA
Stream SDK

The first problem for me is how to integrate them. I started this kind of programming with CUDA. But it relies on NVIDIA, and the worst thing is that the binary depends on the CUDA runtime... that means I have to handle everything myself when there is no NVIDIA graphics card. My solution will probably be COM. Besides, I'll start studying DX11 once I have Windows 7. I guess OpenCL will not be so good in the beginning.
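The idea of hiding each "CL" behind one interface, with a fallback when no GPU is present, could look roughly like this. It is plain C++ rather than real COM, and every name in it (`IComputeBackend`, `CpuBackend`, `FakeGpuBackend`, `CreateBackend`) is invented for the sketch.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical interface that the main program codes against.
struct IComputeBackend {
    virtual ~IComputeBackend() {}
    virtual std::string Name() const = 0;
    virtual void Scale(std::vector<float>& data, float factor) = 0;
};

// Plain CPU implementation, always available.
struct CpuBackend : IComputeBackend {
    virtual std::string Name() const { return "cpu"; }
    virtual void Scale(std::vector<float>& data, float factor) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] *= factor;
    }
};

// Stand-in for a GPU implementation that would live in its own binary.
// It reuses the CPU code here so the sketch stays self-contained.
struct FakeGpuBackend : CpuBackend {
    virtual std::string Name() const { return "gpu"; }
};

// Picks a backend from the number of usable devices found at start-up.
std::unique_ptr<IComputeBackend> CreateBackend(int gpuDeviceCount)
{
    if (gpuDeviceCount > 0)
        return std::unique_ptr<IComputeBackend>(new FakeGpuBackend());
    return std::unique_ptr<IComputeBackend>(new CpuBackend());
}
```

The caller only sees `IComputeBackend`, so a machine without an NVIDIA card just gets the CPU path.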

Oct 16, 2009

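Childhood toys

While sorting through my clutter I dug out some toys from my childhood. Judging from these things, I really didn't have much of a childhood!

Nine linked rings (九連環), bought at 昨日世界.

A yo-yo without its string

仙人擺渡, plus a few small things

Huarong Dao (華容道, the Klotski sliding puzzle), also bought at 昨日世界!

I also made some small toys myself, such as a 貴妃稱, but I threw them all away XD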

Oct 13, 2009

CUDA Note [0] = "How to integrate with VC ?";

1. How to build "*.cu" in VC:

Find "Cuda.Rules" in \\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\
Copy "Cuda.Rules" to \\MSVS8\VC\VCProjectDefaults\
Open the VC solution.
Right-click your project and select "Custom Build Rules".
Check "CUDA Build Rule v#.#.#".
Click OK.
Right-click the "*.cu" file in Solution Explorer.
Select "CUDA Build Rule v#.#.#" in the "Tool" option.
VC can then build the "*.cu" files in your project.

2. How to highlight *.cu syntax in VC:

Go to Tools > Options in VC.
Open "File Extension" under Text Editor.
Type "cu" in the "Extension" edit box and select an editor.
Click OK and restart VC.
Now the highlighting rule you selected is applied.
I use the VC++ syntax highlighting to edit *.cu.

Oct 10, 2009

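Seal carving: cat paw

I never thought I would carve a pictorial seal (肖形章), let alone a cat paw! The hardest part of the whole seal is probably the layout. The cat paw carries a lot of visual weight: breaking into the paw print itself would ruin the point, and for the same reason a broken border doesn't suit it either. In the end I only cut away the upper-left corner. Whether the layout works is beyond what my skill can judge, but it should be pretty "古錐" (cute), right?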
