Dec 24, 2009
Slam Dunk
Dec 14, 2009
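I wish...
- to spend at least two hours reading or coding
- to spend one hour eating breakfast while listening to English
- to spend two hours playing games or doing something that clears my head
- to spend two hours commuting by bus
- to spend one hour on dinner and daily necessities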
Dec 2, 2009
Dec 1, 2009
Nov 30, 2009
Better, but ...
Nov 25, 2009
Tired~
Nov 24, 2009
Nov 6, 2009
CUDA Note[5]="float";
//-----------------------------------------------------------------------------
__global__ void __FloatTest(unsigned int* pIn)
{
unsigned int fMask = *pIn;
float iMask =
(float)((fMask & 0xff000000) >> 24) * 0.1122f +
(float)((fMask & 0x00ff0000) >> 16) * 0.2233f +
(float)((fMask & 0x0000ff00) >> 8) * 0.3344f +
(float)((fMask & 0x000000ff) >> 0) * 0.4455f;
*((float*)pIn) = iMask;
}
//-----------------------------------------------------------------------------
void FloatTest()
{
const unsigned int fMask = 0x22446688;
float iMask = 0.0f;
float* pMask = 0;
::cudaMalloc(&pMask, sizeof(float));
::cudaMemcpy(pMask, &fMask, sizeof(unsigned int), cudaMemcpyHostToDevice);
__FloatTest<<<1, 1>>>((unsigned int*)pMask);
::cudaMemcpy(&iMask, pMask, sizeof(float), cudaMemcpyDeviceToHost);
::cudaFree(pMask);
::printf("gpu : %f\n", iMask);
iMask =
(float)((fMask & 0xff000000) >> 24) * 0.1122f +
(float)((fMask & 0x00ff0000) >> 16) * 0.2233f +
(float)((fMask & 0x0000ff00) >> 8) * 0.3344f +
(float)((fMask & 0x000000ff) >> 0) * 0.4455f;
::printf("cpu : %f\n", iMask);
}
output :
gpu : 113.695999
cpu : 113.695999
//-----------------------------------------------------------------------------
__global__ void __FloatTest(unsigned int* pIn)
{
unsigned int fMask = *pIn;
float iMask =
(float)((fMask & 0xff000000) >> 24) * 0.112233f +
(float)((fMask & 0x00ff0000) >> 16) * 0.223344f +
(float)((fMask & 0x0000ff00) >> 8) * 0.334455f +
(float)((fMask & 0x000000ff) >> 0) * 0.445566f;
*((float*)pIn) = iMask;
}
//-----------------------------------------------------------------------------
void FloatTest()
{
const unsigned int fMask = 0x22446688;
float iMask = 0.0f;
float* pMask = 0;
::cudaMalloc(&pMask, sizeof(float));
::cudaMemcpy(pMask, &fMask, sizeof(unsigned int), cudaMemcpyHostToDevice);
__FloatTest<<<1, 1>>>((unsigned int*)pMask);
::cudaMemcpy(&iMask, pMask, sizeof(float), cudaMemcpyDeviceToHost);
::cudaFree(pMask);
::printf("gpu : %f\n", iMask);
iMask =
(float)((fMask & 0xff000000) >> 24) * 0.112233f +
(float)((fMask & 0x00ff0000) >> 16) * 0.223344f +
(float)((fMask & 0x0000ff00) >> 8) * 0.334455f +
(float)((fMask & 0x000000ff) >> 0) * 0.445566f;
::printf("cpu : %f\n", iMask);
}
output :
gpu : 113.714699
cpu : 113.714706
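So keep in mind that floating-point results may differ between the CPU and the GPU. Here the two values differ only in the last digits (113.714699 vs. 113.714706), which is what you would expect if the two sides round intermediate results differently, for example if one of them fuses the multiply and the add.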
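Since the CPU and GPU values can differ in the last bits, comparing them with == is fragile. Below is a minimal sketch of a tolerance-based comparison; `nearlyEqual` and the 1e-6 tolerance are my own illustrative choices, not something from the CUDA toolkit.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b agree to within relTol,
// measured relative to the larger of the two magnitudes.
bool nearlyEqual(float a, float b, float relTol)
{
    const float diff  = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= relTol * scale;
}
```

With the two values printed above, `nearlyEqual(113.714699f, 113.714706f, 1e-6f)` accepts them as equal, while an exact comparison would not.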
Nov 1, 2009
Oct 31, 2009
funny type !
struct NullType
{};
template <
typename X,
typename Y,
typename Z = NullType,
typename W = NullType>
struct Caster
{
X x;
Y y;
Z z;
W w;
};
int main()
{
::printf("%d\n", sizeof(NullType));
::printf("%d\n", sizeof(Caster));
::printf("%d\n", sizeof(Caster));
::printf("%d\n", sizeof(Caster));
::printf("%d\n", sizeof(Caster));
}
There is nothing new in this post. You can find the null type in "Modern C++ Design". But the output is really funny (compiled with VC9):
1
4
12
12
4
So... NullType occupies 1 byte even though it is empty. And it takes part in the struct's padding like any other member (which I'm not interested in now XD).
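The template arguments of the `sizeof(Caster)` lines are not preserved above, so the following sketch uses arguments I picked myself (`char, char` and `int, int`) just to show where sizes like 4 and 12 can come from. It assumes a 4-byte, 4-byte-aligned `int`.

```cpp
#include <cstddef>

struct NullType {};

template <typename X, typename Y, typename Z = NullType, typename W = NullType>
struct Caster
{
    X x;
    Y y;
    Z z;
    W w;
};

// An empty class still has a nonzero size, so distinct objects get distinct addresses.
static_assert(sizeof(NullType) >= 1, "empty class has nonzero size");

// Assumed arguments: four 1-byte members with alignment 1, so no padding is needed.
static_assert(sizeof(Caster<char, char>) == 4, "1 + 1 + 1 + 1");

// Assumed arguments: 4 + 4 + 1 + 1 = 10 bytes of members,
// rounded up to a multiple of alignof(int) == 4, which gives 12.
static_assert(sizeof(Caster<int, int>) == 12, "10 rounded up to 12");
```

The two `NullType` members each cost one byte here because they are data members; the empty-base optimization only applies when an empty class is used as a base.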
Oct 30, 2009
CUDA Note[4]="cast & align";
//-----------------------------------------------------------------------------
__global__ void RCastTest0(unsigned int* rgTar, unsigned int* rgSrc)
{
rgTar[0] = rgSrc[0];
rgTar[1] = *(unsigned int*)((unsigned char*)rgSrc + 2);
}
//-----------------------------------------------------------------------------
void RCastTest()
{
unsigned int rgTestSrcHost[2] = {0x11223344, 0xaabbccdd};
unsigned int rgTestTarHost[2];
unsigned int* rgTestSrcDevice = 0;
unsigned int* rgTestTarDevice = 0;
unsigned int* rgTestDevice = 0;
::cudaMalloc(&rgTestDevice, 4 * sizeof(unsigned int));
rgTestTarDevice = rgTestDevice;
rgTestSrcDevice = rgTestDevice + 2;
::cudaMemcpy(
rgTestSrcDevice,
rgTestSrcHost,
2 * sizeof(unsigned int),
cudaMemcpyHostToDevice);
//--cast in cuda
RCastTest0<<<1, 1>>>(rgTestTarDevice, rgTestSrcDevice);
::cudaMemcpy(
rgTestTarHost,
rgTestTarDevice,
2 * sizeof(unsigned int),
cudaMemcpyDeviceToHost);
//--cast in cpu
unsigned int dCasted = *(unsigned int*)((unsigned char*)(rgTestSrcHost) + 2);
::printf("CUDA (align) : 0x%08X\n", rgTestTarHost[0]);
::printf("CUDA (un-align) : 0x%08X\n", rgTestTarHost[1]);
::printf("CPU (un-align) : 0x%08X\n", dCasted);
::cudaFree(rgTestDevice);
}
A simple test of casting in CUDA. Device memory is aligned when it is allocated (to 256 bytes). Everything is fine as long as you forget the optimization tricks from C. For example, when making a grayscale image from an R8G8B8 one, you can read 3 u8 values and compute the luminance, or read one u32 and compute with bit operations. The second way is OK on the CPU, and it should perform better in CUDA since accessing global memory is pretty slow. But as this test shows, you can't do it the same way as on the CPU. When reading from global memory, CUDA aligns the read address internally to the size of the type being read (4 for u32, 2 for u16, etc.).
So there is a trick for grayscale. If I compute luma by reading 3 u8 values, every pixel needs 4 global memory accesses (3 reads, 1 write). But if I process 4 pixels in one thread, I can read 3 u32 values (the first of which is 4-byte aligned), and the average comes to 1.75 accesses per pixel (3 reads plus 4 writes, divided by 4 pixels)!
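The 4-pixels-per-thread idea can be sketched on the host side as plain bit operations. This is only an illustration: `luma`, `grayFourPixels`, the 77/150/29 integer weights, and the little-endian R, G, B byte order are my assumptions, not taken from the original kernel.

```cpp
#include <cstdint>

// Integer luma approximation: (77 R + 150 G + 29 B) / 256; the weights sum to 256.
inline std::uint8_t luma(std::uint32_t r, std::uint32_t g, std::uint32_t b)
{
    return static_cast<std::uint8_t>((77u * r + 150u * g + 29u * b) >> 8);
}

// Unpacks 4 RGB pixels (12 bytes) from three aligned 32-bit words.
// Assumes little-endian words: the first byte in memory sits in the lowest 8 bits.
void grayFourPixels(std::uint32_t w0, std::uint32_t w1, std::uint32_t w2,
                    std::uint8_t out[4])
{
    out[0] = luma( w0        & 0xff, (w0 >>  8) & 0xff, (w0 >> 16) & 0xff);
    out[1] = luma((w0 >> 24) & 0xff,  w1        & 0xff, (w1 >>  8) & 0xff);
    out[2] = luma((w1 >> 16) & 0xff, (w1 >> 24) & 0xff,  w2        & 0xff);
    out[3] = luma((w2 >>  8) & 0xff, (w2 >> 16) & 0xff, (w2 >> 24) & 0xff);
}
```

Every word is read at a 4-byte-aligned address, so the alignment behavior shown in the test never comes into play.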
p.s.
output :
CUDA (align) : 0x11223344
CUDA (un-align) : 0x11223344
CPU (un-align) : 0xCCDD1122
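If a u32 really has to be read from an odd byte offset, one way that does not rely on unaligned loads is to assemble it from single bytes. The sketch below is host-side C++ with a hypothetical helper name, `readLE32`; the same byte-wise approach should carry over to a kernel, though I have not measured its cost there.

```cpp
#include <cstdint>

// Assembles a little-endian u32 from four single-byte reads,
// so it works at any byte offset regardless of alignment.
inline std::uint32_t readLE32(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) <<  8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Applied at byte offset 2 of the little-endian byte image of {0x11223344, 0xaabbccdd}, this yields 0xCCDD1122, the same value as the CPU line in the output above.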
Oct 29, 2009
CUDA Note [3] = "cudart.dll";
- cudart.dll does not seem to be necessary for your client application if you develop CUDA with the driver API. But the driver API is not that friendly. (BTW, I have not given it a try yet.)
- It does not seem necessary to separate the CUDA code from the main code into a different binary, since cudart.dll depends on nothing special. Doing so only benefits coworkers who don't have the CUDA SDK installed.
- If you develop with the CUDA runtime, you have to ship cudart.dll, too.
CUDA Note [2] = "Driver Version";
Oct 24, 2009
CUDA Note [1] = "design strategy";
- OpenCL
- DX11 computing shader
- CUDA
- Stream SDK
Oct 16, 2009
Childhood toys
Oct 13, 2009
CUDA Note [0] = "How to integrate with VC ?";
- Find “Cuda.Rules” in \\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\
- Copy “Cuda.Rules” to \\MSVS8\VC\VCProjectDefaults\
- Open VC solution.
- Right click your “project” and select “Custom Build Rules”.
- Check “CUDA Build Rule v#.#.#”.
- Click ok.
- Right click “*.cu” file in solution explorer.
- Select “CUDA Build Rule v#.#.#” in option “Tool”.
- Now VC can build the "*.cu" files in your project.
- Go to Tools > Options in VC.
- Open "File Extension" under Text Editor.
- Add "cu" in the "Extension" edit box and select an Editor.
- Click OK and reopen VC.
- Now the highlighting rule you selected is applied.
- I use VC++ syntax highlighting to edit *.cu.
Oct 10, 2009
Seal carving: cat's paw
Sep 30, 2009
malloc fail
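Mark says XP is less strict about memory protection, so touching an address you shouldn't touch does not necessarily blow up right away. In my case malloc suddenly started failing while the whole program was only using about 13 MB, so I ruled out a leak inside a loop. But because it was only an allocation failure, the program appeared to run fine; it just produced no output. So I turned on some debugging exceptions, and sure enough the program threw at the faulty spot. After checking carefully I found that a buffer size had been miscalculated.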
Sep 16, 2009
pack problem again
struct TGA_HEADER
{
unsigned char cbIDField;
unsigned char iColorMapType;
unsigned char iImageType;
//--1 byte padding
short ofsColorMap;
short cColorMap;
unsigned char cColorMapBits;
//--1 byte padding
short dX;
short dY;
short dWidth;
short dHeight;
unsigned char dColorDepth;
unsigned char descImage;
};
I tried to write a TGA file yesterday and kept failing. When the image was dragged into the DirectX Texture Tool, it always told me something was wrong with the file. But the header should be 18 bytes, and the sum of all fields in the struct is 18 bytes, too. Finally, I started to check for a packing issue as before, and the problem emerged! You can find comments marking the 1-byte padding in the code. Fixing this only needs 2 extra lines:
#include
struct TGA_HEADER
{....};
#include
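The header names of the two #include lines are not preserved above. As an alternative sketch, the same effect can be had with `#pragma pack`, which MSVC, GCC, and Clang all accept. The struct names `TgaHeaderDefault` and `TgaHeaderPacked` are mine, and the sizes assume a 2-byte `short`.

```cpp
// Default layout: the compiler inserts 1 byte of padding before each
// 2-byte-aligned short that follows an odd number of bytes.
struct TgaHeaderDefault
{
    unsigned char cbIDField;
    unsigned char iColorMapType;
    unsigned char iImageType;
    short ofsColorMap;
    short cColorMap;
    unsigned char cColorMapBits;
    short dX;
    short dY;
    short dWidth;
    short dHeight;
    unsigned char dColorDepth;
    unsigned char descImage;
};

#pragma pack(push, 1) // pack members on 1-byte boundaries from here on
struct TgaHeaderPacked
{
    unsigned char cbIDField;
    unsigned char iColorMapType;
    unsigned char iImageType;
    short ofsColorMap;
    short cColorMap;
    unsigned char cColorMapBits;
    short dX;
    short dY;
    short dWidth;
    short dHeight;
    unsigned char dColorDepth;
    unsigned char descImage;
};
#pragma pack(pop) // restore the previous packing

static_assert(sizeof(TgaHeaderDefault) == 20, "two padding bytes are added");
static_assert(sizeof(TgaHeaderPacked) == 18, "matches the 18-byte TGA header");
```

The `static_assert` on the packed struct is cheap insurance: if the packing ever gets lost, the build fails instead of the file.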
Sep 1, 2009
A rough lit pre pass
Aug 29, 2009
regular expression for real
- 0x204080FF
- 3.14159f
- 7
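The regular expression itself is not preserved in this post. Assuming the goal was to recognize the three kinds of literal listed above (hex, float with an optional `f` suffix, and plain integer), a rough sketch with `std::regex` could look like this; the pattern and the name `isNumericLiteral` are my own guesses, not the original expression.

```cpp
#include <regex>
#include <string>

// Hypothetical matcher for the three literal forms listed above.
bool isNumericLiteral(const std::string& text)
{
    static const std::regex pattern(
        R"(0[xX][0-9a-fA-F]+)"   // hex literal, e.g. 0x204080FF
        R"(|[0-9]+\.[0-9]*f?)"   // float literal, e.g. 3.14159f
        R"(|[0-9]+)");           // integer literal, e.g. 7
    // regex_match requires the whole string to match, so no anchors are needed.
    return std::regex_match(text, pattern);
}
```

This deliberately ignores exponents, signs, and suffixes other than `f`; a real tokenizer would need more cases.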
Aug 28, 2009
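Next stop?
Things I regret:
- Never reached level 3!
- Never wrote a joke on p4!
- Never had a fight with the review team!
- No 3D!
- No gf!
- Switched teams!
- Good coworkers!
- That's all!
- Worked at a big company!
- Worked at a software company!
- Understand 3D better now!
- Coding skill is better now (same reason as above).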
- .-.. . --. .- -.-. -.--
- ..-. .. -..- -.-. --- -- .--. --- -. . -. -
- .-- . .- -.- ... -.- .. .-.. .-..
syntax highlighter for HLSL10
Texture2D g_NSZTexture;
Texture2D g_LitTexture;
float4x4 g_mWorld;
float4x4 g_mWorldViewProjection;
VtxNormalPass VSNormalPass(
float4 pos : POSITION,
float3 nor : NORMAL,
float2 tex : TEXCOORD0)
{
VtxNormalPass output;
output.spos = mul(pos, g_mWorldViewProjection);
output.wpos = mul(pos, g_mWorld);
output.wnor = mul(nor, g_mWorld);
return output;
}
PixOutput PSNormalPass(VtxNormalPass input)
{
float3 nor = normalize(input.wnor);
float phi = acos(nor.y);
float len = sin(phi);
float the = acos(clamp(nor.x / len, -1.0f, 1.0f));
if (nor.z <= 0.0f)
the = 6.283185f - the;
float z = length((input.wpos - g_vEyePos).xyz);
PixOutput output;
output.color = float4(the, phi, z, 4.0);
return output;
}
technique10 NormalPass
{
pass P0
{
SetVertexShader(CompileShader(vs_4_0, VSNormalPass()));
SetGeometryShader(NULL);
SetPixelShader(CompileShader(ps_4_0, PSNormalPass()));
SetRasterizerState(CullBack);
SetDepthStencilState(EnableDepth, 1);
SetBlendState(BSOverWrite, float4(0.0f, 0.0f, 0.0f, 0.0f), 0xFFFFFFFF);
}
}
Aug 26, 2009
Trigonometric function explosion
float3 nor = normalize(input.wnor);
float phi = acos(nor.y);
float len = sin(phi);
float the = acos(nor.x / len);
In the end, changing it to this fixed it:
float the = acos(clamp(nor.x / len, -1.0f, 1.0f));
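My reading of the fix: rounding error can push `nor.x / len` slightly outside [-1, 1], and `acos` of such a value is NaN. The C++ sketch below reproduces that on the CPU side; `safeAcos` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: clamp the argument first so rounding error
// cannot push it outside acos's domain of [-1, 1].
inline float safeAcos(float x)
{
    return std::acos(std::min(1.0f, std::max(-1.0f, x)));
}
```

For an input such as `1.0000001f`, `std::acos` returns NaN while `safeAcos` returns 0.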
Aug 22, 2009
Aug 17, 2009
ui rendering optimization
Jul 30, 2009
screen space motion blur
Jul 26, 2009
Anything wrong in glShaderSourceARB ?
char v[128] =
"void main(void)\n"
"{\n"
" gl_Position = ftransform();\n"
"}\n";
glShaderSourceARB(shaderVtx, 1, (const GLcharARB**)&v, 0);
My application always crashed in "strlen". But if I change the code to:
char v[128] =
"void main(void)\n"
"{\n"
" gl_Position = ftransform();\n"
"}\n";
const GLcharARB* ppV[1] = {(GLcharARB*)v};
glShaderSourceARB(shaderVtx, 1, ppV, 0);
Everything is fine. Did I forget how to write C? (Although I inserted this into Objective-C.) It looks like it is due to a compiler setting, since the first version works in the Xcode sample.
20090802
OK, nothing was wrong except my careless style. This is a very basic pointer usage problem. "v" in the code is the start address of a string, not a pointer variable that holds that address (although it decays to a pointer type). That means "&v" is not the address of a pointer stored in memory; it is just the address of the array itself, so the function ends up reading the characters of the string as if they were a pointer. But why does the sample work? Because the sample passes "v" to another function, which creates a new local pointer to v, so that pointer really exists in memory. Luckily, no one would read this article. XD
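The difference can be checked without OpenGL. In this sketch, `firstString` is a hypothetical stand-in for an API that, like glShaderSourceARB, expects an array of string pointers.

```cpp
// Hypothetical stand-in for an API taking an array of string pointers.
const char* firstString(const char** strings)
{
    return strings[0];
}

// &v and v are the same address: taking the address of an array does not
// create a pointer object, so casting &v to char** makes the callee
// interpret the string's characters as a pointer value.
bool arrayAddressEqualsFirstElement()
{
    char v[128] = "void main(void) {}";
    return static_cast<void*>(&v) == static_cast<void*>(v);
}

// The working version: p is a real pointer object in memory,
// and &p is a genuine pointer to a pointer.
bool correctCallWorks()
{
    char v[128] = "void main(void) {}";
    const char* p = v;
    return firstString(&p) == v;
}
```

Both functions return true, which is exactly why the cast version compiles cleanly and then crashes inside strlen.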
Jul 21, 2009
Hello Watermelon !
Jul 17, 2009
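Before playing with WPF 3D
- Sometimes you don't need face culling, and skipping it gives better performance. WPF always does face culling, so when you don't need it, the only workaround is to use two sets of triangles.
- More and more effects need multiple passes, and WPF doesn't seem to care about that.
- More and more effects need to switch render targets, but because WPF's rendering is fully encapsulated, there is no way for me to mess with it.
- The geometry members are incomplete: there are only vertex, index, texture coordinate, and normal. When writing San Angeles, I wanted to set colors on vertices but didn't know how.
- Lights always affect the whole scene tree.
- More to be added as I think of them.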
Jul 16, 2009
Hello SyntaxHighlighter !
#include <stdio.h>
int main()
{
printf("Hello SyntaxHighlighter !");
}
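Finally got it working. If you're interested, the original project is here. Back at 彩富 I even wrote a dedicated conversion program; I never expected there to be a prettier way! A few things to note (refer to this page's source):
- The JavaScript source can be placed in a widget provided by Blogspot (the bar at the very bottom of this page).
- The CSS link has to go directly in the head.
- On a blog, you have to add dp.SyntaxHighlighter.BloggerMode(); before HighlightAll.
The current headache is that the scripts are hosted on Google Pages, and Google is going to shut Google Pages down. There will be a migration, but Google Sites does not allow uploading JS files, so I don't know whether this will stop working by then!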
Jul 13, 2009
San Angeles in WPF
Jun 30, 2009
projection in vertex shader
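A projection tweaked in the vertex shader:
I don't know why the image below looks a bit distorted, but it seems the vertex shader can be used to make the z value linear, which should avoid z-fighting. I'm not sure how much it costs in performance.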
deferred lighting
Jun 29, 2009
UNORM v.s. UNORM_SRGB
DXGI_FORMAT_R8G8B8A8_UNORM :
DXGI_FORMAT_R8G8B8A8_UNORM is the result I want! By default, the DX10 DXUT back buffer format only picks DXGI_FORMAT_R8G8B8A8_UNORM_SRGB!
p.s. The texture comes from here.
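Ninth-grade math textbook
If the problem is A and the answer is Z, there may be an optimal solution M. My train of thought often becomes ABCZ or ADEMZ: B or D is one of the first solutions to surface in my head, an approach that is "logically" feasible, and I keep extending from there until a chain of logically feasible steps reaches Z. Because M is much more obscure than the other paths, I usually don't find the optimal solution but can quickly find a workable one. The downside is that once the problem is complex enough, this method easily walks into a dead end. Even when programming, the whole process flows out this way: I can usually write a working program quickly and then spend time optimizing it. And even if M is found right at the start, nobody can guarantee there is no better path between A and Z; maybe if you fold the paper in half, A and Z simply meet.
A ninth-grade math assignment had this problem: given an arc and a point A outside the arc, draw a straight line connecting A to the center of the arc's circle. The problem is simple: draw two non-overlapping chords on the arc, and their perpendicular bisectors intersect at the center. But that's not how I did it back then. Because the figure in the problem was drawn so conveniently, I only had to take A as a center and draw an arc that crossed the original arc at two points, which saved a lot of work... So did I show signs of becoming a programmer as early as junior high?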
Jun 22, 2009
when to SSE ?
A long time ago, when I was playing with Jaina, I didn't know why my Jaina couldn't move as smoothly as Blizzard's. But there was a rumor about how to make Jaina move more like a real young girl, and SSE was the first solution.
I haven't found any document about the cost of SSE (ohh... I'm lazy, you know that...). But in my experience, there is a simple rule. The most common use of SSE is matrix multiplication, and there are many, many multiplications in bone skinning animation. But you'll find it costs more CPU power if you use SSE just to multiply "one" vector by "one" matrix.
The simple rule is:
if the number of vector multiplications is more than double the number of vector I/O operations, SSE will give higher performance.
For example, the dot product of 2 vectors needs 3 I/O operations (2 vectors in, one value out), but only 1 vector multiplication. Another example is a vector times a matrix: 6 I/O operations (5 in, 1 out) with 4 vector multiplications... so we still can't get better performance.
But in the bone animation case, there are few bones and many, many points. That means many points are multiplied by the same matrix in each frame. If you have to multiply N points, you need:
- 4 vector reads for the matrix.
- N vector reads for the points.
- N vector writes for the points.
- 4 vector multiplications per point.
Ignoring the first item, this just meets my simple rule. So there is a chance to make my Jaina move more smoothly. (Just write a function that multiplies many vectors by one matrix.)
BTW, there are many zeros in a typical 3D matrix... but that's another story.
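The "many vectors, one matrix" function could look roughly like the sketch below. The types `Vec4` and `Mat4`, the function name `transformPoints`, and the row-vector convention (out = v * M) are my assumptions. It uses SSE intrinsics when the compiler targets SSE and falls back to scalar code otherwise.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; }; // row-major; rows are multiplied by the vector's components

// Multiplies n points by one matrix: out[i] = in[i] * mat.
void transformPoints(const Mat4& mat, const Vec4* in, Vec4* out, std::size_t n)
{
#if SKETCH_HAS_SSE
    // The 4 matrix rows are loaded once and reused for every point.
    const __m128 r0 = _mm_loadu_ps(mat.m[0]);
    const __m128 r1 = _mm_loadu_ps(mat.m[1]);
    const __m128 r2 = _mm_loadu_ps(mat.m[2]);
    const __m128 r3 = _mm_loadu_ps(mat.m[3]);
    for (std::size_t i = 0; i < n; ++i)
    {
        // Broadcast each component, then do 4 vector multiplications per point.
        const __m128 x = _mm_set1_ps(in[i].x);
        const __m128 y = _mm_set1_ps(in[i].y);
        const __m128 z = _mm_set1_ps(in[i].z);
        const __m128 w = _mm_set1_ps(in[i].w);
        const __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, r0), _mm_mul_ps(y, r1)),
            _mm_add_ps(_mm_mul_ps(z, r2), _mm_mul_ps(w, r3)));
        _mm_storeu_ps(&out[i].x, r); // Vec4 is 4 contiguous floats
    }
#else
    for (std::size_t i = 0; i < n; ++i)
    {
        const Vec4 v = in[i];
        out[i].x = v.x * mat.m[0][0] + v.y * mat.m[1][0] + v.z * mat.m[2][0] + v.w * mat.m[3][0];
        out[i].y = v.x * mat.m[0][1] + v.y * mat.m[1][1] + v.z * mat.m[2][1] + v.w * mat.m[3][1];
        out[i].z = v.x * mat.m[0][2] + v.y * mat.m[1][2] + v.z * mat.m[2][2] + v.w * mat.m[3][2];
        out[i].w = v.x * mat.m[0][3] + v.y * mat.m[1][3] + v.z * mat.m[2][3] + v.w * mat.m[3][3];
    }
#endif
}
```

Unaligned loads and stores are used to keep the sketch safe for any buffer; with 16-byte-aligned data the aligned variants would be the natural next step.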
Jun 21, 2009
Jun 20, 2009
WarHammer[2];
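Look up at the sky and swing the character's view around, and the game stutters!
Look down to pick up money... so smooth! How can that be? See the red circle in the first image? Now compare it with this one:
The forest is full of triangles! It looks like densely packed layers pasted on top of each other! The bushes on the ground are the same:
This looks much richer than WoW! It's just that in WoW's Stranglethorn Vale my steps were still light, and I never felt too fat and in need of a diet, even when the character I controlled was a big clumsy bull! The forest looks richer by comparison, but it seems to have been paid for with triangles. No wonder moving around is such a struggle!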
WarHammer[1];
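The most, most, most obvious problem with this game is that characters often get stuck for no reason while moving! After two weeks of play, I would often run and jump around and then think I was stuck, and have to run and jump around some more to get free! Supposedly, one of Warhammer's features is collision between players. Collision between characters shouldn't be a hard problem, probably just OOBB! But playing it really makes me miserable: while casting spells on the move, I not only have to watch which direction the enemy is coming from, but also watch whether some small-bodied race is blocking the way ahead. Once blocked, you can't move forward at all; holding forward doesn't make you slowly slide past. Monsters, on the other hand, have no collision, and when they attack my character they often exploit this, passing through the character's body and attacking from behind!
In the image above, I can't get through a gap between a brazier and a flag, because the bounding box catches on the brazier and I can't move forward at all.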
Jun 19, 2009
Jun 17, 2009
WarHammer[0];
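I've been playing Warhammer lately, but I probably can't keep going much longer. There are lots of bugs, and it doesn't even run smoothly on the G1S: with 3 GB of RAM plus an 8600GT it still can't run smoothly, so this game must have some problems. The setting is somewhat interesting, but from the first day I have felt that neither the art nor the programming can compete with Blizzard. I'm tired today, so I'll just mention one strange rendering bug and leave the rest for when I have time.
The image isn't very clear, but inside the red frame at the top left there is another "box". Oh no, that's a misunderstanding: it is actually a lamp, and the surroundings are a scene full of smoke (or blowing sand?). Such an out-of-place box shows up because of z check and z write. The lamp looks like a billboard with a texture (texture animation? maybe), so it is a translucent object, and the sheet of smoke is a translucent object too. Both are drawn only after most of the scene has been drawn, because translucent rendering gives wrong results if it isn't sorted. My guess at how the problem happens: the draw order of the lamp glow and the smoke is not fixed, sometimes the lamp first and sometimes the smoke, so the problem doesn't appear all the time. Next, z write is on when the lamp glow is drawn, and z check is on when the smoke is drawn. If the lamp is drawn first, its z write (with a z value that may even be wrong) means the smoke fails the z check in that area and can't be drawn there, and in the end a box appears.
ZZzzZzZzzzz... falling asleep...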
why UnAdvise ?
I have been trying to demo something with DirectShow recently. While working with a filter today, I made a mistake by accident: I did something like UnAdviseFooEventSink in the destructor of the event sink object. Let's check what happens:
- When the event sink object is created, its reference count is 1.
- When Foo Advises the sink object, the sink object's reference count increases to 2.
- When I don't need the sink object anymore and release it, the reference count decreases to 1.
- Now the owner of the sink object is Foo, and I want the sink object to be UnAdvised when it is deleted.
- But since Foo still owns the sink object, the sink object won't be deleted (so its destructor never runs) unless Foo does extra work.
Thanks to the debugging functions in the DirectShow base classes, I got some ugly asserts and tried to fix it.
BTW, the same issue happens in Cocoa on the Mac, too. If 2 objects own each other (retain each other), they should be disconnected before the final outside reference is released.
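The same trap can be reproduced without COM by using `std::shared_ptr` as the reference count. Everything here (`Sink`, `Foo`, `advise`, `unadvise`, `liveSinksAfterRelease`) is a hypothetical stand-in for the DirectShow objects, not their actual API.

```cpp
#include <memory>
#include <utility>

struct Sink
{
    static int alive;
    Sink()  { ++alive; }
    ~Sink() { --alive; } // cleanup placed here runs only after the LAST owner lets go
};
int Sink::alive = 0;

struct Foo
{
    std::shared_ptr<Sink> sink;
    void advise(std::shared_ptr<Sink> s) { sink = std::move(s); } // Foo becomes an owner
    void unadvise() { sink.reset(); }                             // Foo gives up ownership
};

// Returns how many sinks are still alive right after the caller drops its own reference.
int liveSinksAfterRelease(bool unadviseFirst)
{
    Foo foo;
    std::shared_ptr<Sink> sink = std::make_shared<Sink>(); // count 1
    foo.advise(sink);                                      // count 2
    if (unadviseFirst)
        foo.unadvise();                                    // count 1
    sink.reset();                                          // count 1 or 0
    return Sink::alive;
}
```

Without the explicit `unadvise`, the sink outlives the caller's release, so a destructor that tries to do the unadvise is never reached while Foo is alive.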
Jun 1, 2009
some methods to optimize gui rendering.
- pack the images of all controls into a single texture.
- update the position of dynamic controls by updating the vertex buffer.
- update the visibility of controls by updating the index buffer.
- collect all characters in the same texture (or in as few textures as possible).
- pray.
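As an illustration of the index-buffer item in the list above: if every control is a quad with 4 vertices in one shared vertex buffer, hiding a control only means leaving its 6 indices out. The function name `buildQuadIndices`, the vertex order, and the 16-bit indices are my assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds an index list that references only the visible quads.
// Quad i is assumed to own vertices 4i .. 4i+3 in the shared vertex buffer.
std::vector<std::uint16_t> buildQuadIndices(const std::vector<bool>& visible)
{
    static const std::uint16_t kOffsets[6] = {0, 1, 2, 2, 1, 3}; // two triangles per quad
    std::vector<std::uint16_t> indices;
    for (std::size_t i = 0; i < visible.size(); ++i)
    {
        if (!visible[i])
            continue; // hidden control: contributes no indices, vertices stay untouched
        const std::size_t base = i * 4;
        for (std::uint16_t off : kOffsets)
            indices.push_back(static_cast<std::uint16_t>(base + off));
    }
    return indices;
}
```

The vertex buffer never changes when visibility changes; only this small index list is re-uploaded.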
May 28, 2009
Prince of Persia !
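I've been playing the new Prince of Persia lately. I haven't finished it, but I have already run into a ghost incident:
Taking a closer look...
How on earth did the princess fly up there! The visuals of Prince of Persia still amaze me (I had only played The Sands of Time before). Not much has changed; the biggest change is... this installment is about a Persian princess who meets a very capable passerby. The princess is just too powerful, so the passerby only has to finish the mission before he grows old and feeble, and even if you jump off a cliff a hundred times, the princess won't rescue you only ninety-nine times! What's harder to take is that as the story goes on, the princess and the passerby start flirting... What is this! Do I deserve to be blinded by lovebirds even while playing a video game! \_/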