Fast affine texture mapping II (fatmap2.txt)
--------------------------------------------
by
Mats Byggmastar
MRI
3D programmer in Doomsday
mri@penti.sit.fi
17 April 1997, Jakobstad, Finland
Feel free to upload this document anywhere you find appropriate,
as long as you keep it in it's original, unmodified form.
This is free information, you may not charge anything for it.
Companies with selfrespect are encouraged to contact me,
if any of this code is to be used as part of a commercial product.
Table of contents
-----------------
1. About this document
2. Disclaimer
3. About the source code
4. Convex polygons
5. Drawing convex polygons
6. Constant texture gradients and polygons
7. 16:16 bit fixed point idiv and imul
8. Fixed point ceil() function
9. Subpixel accuracy
10. Subtexel accuracy
11. Avoiding divide overflow in slope calculations
12. Loop counters in inner loops
13. 8:16 bit inner loop
14. Arbitrary size bitmap inner loop
15. 8:15 bit tiled inner loop
16. Polygons per second
17. Greetings
1. About this document
-----------------------
This document is the follow-up to fatmap.txt released 19 Jun. 1996. The
aim of this second document is to make a more accurate affine texture
mapper. I have also skipped triangles and concentrated on convex
polygons to make clipping more efficient. The tiled bitmap inner loop
scheme described in fatmap.txt has now also been realized in a 6 clock
tick inner loop that is put into action here with good results.
As in the previous document the inner loops in assembly are developed
for the Intel Pentium processor.
People looking for descriptions on perspective correct texture mappers
should visit Game Developer Magazine at http://www.gdmag.com and
backorder the issues with Chris Hecker's articles on this subject!
I, the author, am a 25 year old computer and telecom engineer (B.Sc.)
currently working as a teacher in a vocational school for students in
the age of 16-19 years studying computers, electronic, automation and
power electric. I do real-time 3D graphics programming mostly as a
hobby and am an active member of the Finnish demogroup Doomsday. My
dream is to one day work full-time with 3D graphics.
Special thanks to Harriet Mattfolk for proofreading this document and
helping me with the english syntax.
2. Disclaimer
--------------
The author takes no responsibility, if something in this document or
the accompanying source or executable causes any kind of data loss or
damage to your hardware.
3. About the source code
-------------------------
Except for the inner loops and some misc. functions the source is
written in C++ very close to C. To make the most of the code I've made
the inner loops as inline assembler. I personally have the whole
mapping functions in fine-tuned assembly, but it would be pointless to
include such a source in this type of document.
The source code that should accompany this document is divided into 8
files:
misc.h - Misc. declarations of structures and functions
clip.cpp - Sutherland-Hodgman 2D clipping
draw.cpp - Calculates deltas, clips and calls the mappers
main.cpp - Simple test program
flat.cpp - Flat filler
gouraud.cpp - Gouraud filler
txtmap.cpp - Texture mapper
txtarb.cpp - Arbitrary size texture mapper
txttil.cpp - Tiled texture mapper
The last five files are the ones of interest. The other files were just
included so I would be able to make the test program. Clip.cpp is my
very own implementation of the Sutherland-Hodgman clipping algorithm
and might be of some interest. You can find the theory behind the
Sutherland-Hodgman algorithm in any graphics textbook.
From time to time I see people looking for a fast way to fill polygons
with a solid color. I have therefore included the flat filler. You
might want to replace the call to memset() with some other inner loop.
I recently also participated in a discussion on the newsgroup
comp.graphics.algorithms concerning floating point vs. fixed point in a
gouraud filler so I decided to include my version of a gouraud filler.
The source should be compiled with Watcom C/C++ as the inline assembly
is Watcom specific. It has only been tested with Watcom C/C++ version
10.0. The compiled version of the test program (main.exe) is included.
You will need the DOS4GW protected mode stub to run it.
4. Convex polygons
-------------------
First, let us define what a convex polygon is. A convex polygon must
not have any of it's corners pointing inwards to the center of the
polygon. In other words, the angle between two edges joining in a
vertex must not be greater than 180 degree. This means that triangles
are always convex polygons. The opposite of convex is concave and they
can have corners pointing inwards.
______ ______________ ____ ____
/\ / \ \ / | \ / |
/ \ / \ \ / | \/ |
/ \ \ / / / | |
/_________\ \ ______ / /__________ / |________________|
convex convex concave concave
The mappers presented in this document can only draw convex polygons
with one exception. The third polygon above will be handled correctly
even if it is concave. This is because it is concave in such a way that
no scanlines need to be split while rendering it. The fourth polygon
will not be handled properly because scanlines must be split.
5. Drawing convex polygons
---------------------------
Drawing convex polygons is just as simple and fast as drawing
triangles. The only difference is that the vertex scanning code needs
to be in a loop for polygons. If we only render triangles we know that
we always have 3 vertices which makes the code a bit more
straightforward.
First step in a polygon function is to search for the top and bottom
vertex. So we scan the vertex array and search for min and max y. Let
us take the following 4 vertex polygon as example:
v1
/ \
/ \
/ \
/ \
v2 / / v0
\ /
\ /
\ /
\ /
\ /
v3
Note that the vertices must be in an anti clockwise order. We find that
v1 is the top vertex and v3 is the bottom vertex.
We now know that when scanning the left side of the polygon, we should
start at v1 and move forward through the array until we come to v3. In
general when moving forward, we should wrap to the start of the array
if we move past the array. I.e. if we come to the last vertex, our next
vertex will be v0.
v1
/
/
/
/
v2 /
\
\
\
\
\
v3
For the right side we start at v1 and move backwards through the array
until we come to v3. In general when moving backwards, we should wrap
to the end of the array if we move outside the array. I.e. when we come
to v0, our next vertex will be the last vertex in the array.
v1
\
\
\
\
/ v0
/
/
/
/
/
v3
In our example we have 2 sections on the left side:
v1 - v2 and v2 - v3
and 2 sections on the right side:
v1 - v0 and v0 - v3
The second step in a polygon function is to calculate the x coordinate
slope for the first section on the right side. If the section had zero
height, try the next until a non-zero section is found.
The third step is to calculate x coordinate slope and any other slope
(e.g. texture u and v) for the first section on the left side. If the
section had zero height, try the next until a non-zero section is
found.
If all sections on a side has zero height, the whole polygon has zero
height and should not be plotted.
Now we can start to interpolate the values along the left and the right
side as we draw each scanline using an inner loop.
v1
/-\
/------\
/-----------\
/ \
v2 / v0
When we come to the end of a section, we search for the next non-zero
height section and calculate the new slope. If all sections on one side
are done, i.e. if we have come to the bottom vertex, the polygon is
done.
6. Constant texture gradients and polygons
-------------------------------------------
With triangles the texture gradients (texture u,v slopes over the
triangle surface) are guaranteed to be constant. Therefore we can
calculate the gradients (dudx, dvdx) once and use the same values for
all scanlines of the triangle.
If we start out with a triangle and clip it into a polygon, the texture
gradients for the polygon are still constant over the whole surface. So
if we calculate the gradients before the triangle gets clipped, we can
still use the constant texture gradient method for polygons.
Calculating the texture gradients for a triangle can be done in the
following way:
v0
/\
/ \
/ \
/_________\
v1 v2
double d = (v0.x - v2.x) * (v1.y - v2.y) -
(v1.x - v2.x) * (v0.y - v2.y);
if(d == 0.0) return;
double id = 1.0/d * 65536.0;
long dudx = ((v0.u - v2.u) * (v1.y - v2.y) -
(v1.u - v2.u) * (v0.y - v2.y)) * id;
long dvdx = ((v0.v - v2.v) * (v1.y - v2.y) -
(v1.v - v2.v) * (v0.y - v2.y)) * id;
The vertex data x,y,u,v is assumed to be in floating point and the
resulting dudx,dvdx is in 16:16 bit fixed point.
7. 16:16 bit fixed point idiv and imul
---------------------------------------
All mappers work with 16:16 bit fixed point internally. I will not
explain how fixed point works here, you can look that up elsewhere.
I'll just point out that we need to perform the fixed point divides and
multiplies in assembler rather than in C. So from here on you can
assume that code like:
long dxdy = (width << 16) / height;
long x = v1.x + ((prestep * dxdy) >> 16);
need to be performed by some inline assembly functions:
long dxdy = idiv16(width, height);
long x = v1.x + imul16(prestep, dxdy);
8. Fixed point ceil() function
-------------------------------
We need a ceil() function for the subpixel and subtexel accuracy.
Definition of ceil() is:
ceil(x) returns the smallest integer not < x
e.g. ceil(1.0) returns 1
ceil(1.5) returns 2
ceil(-2.0) returns -2
ceil(-1.5) returns -1
If we limit x to only positive numbers we can make a 16:16 bit fixed
point version of ceil() very easily. We add 0xffff to x and right-shift
by 16.
inline long ceil(long x)
{
x += 0xffff;
return x >> 16;
}
Note that this function do not give the correct result if x is
negative. This is no problem as x will never be negative the way ceil()
is used in the mappers.
9. Subpixel accuracy
---------------------
The first thing to note when aiming for subpixel accurate polygon
drawing is that we must never truncate screen coordinates to integers.
Today it's common to use floating point all the way trough the 3D
engine and convert the projected screen coordinates to integers just
before entering the polygon function. This conversion should of course
be a float to fixed point conversion so that we do not lose the
fractional part. The fractional part is in fact the subpixel position
that will affect how the slope of a polygon edge will be rendered.
Without subpixel accuracy the polygons will jump one pixel at a time
when the object is moving slowly over the screen. With subpixel
accuracy the pixels that makes up the edges between the polygons will
slowly "float", making the edges seem to move slowly over the screen.
Calculating the slope of a subpixel accurate section:
long scanlines = ceil(v2.y) - ceil(v1.y);
if(scanlines <= 0) return; // Section had zero height
long height = v2.y - v1.y;
long dxdy = ((v2.x - v1.x) << 16) / height;
long prestep = (ceil(v1.y) << 16) - v1.y;
long left_x = v1.x + ((prestep * dxdy) >> 16);
The height of the section, or the number of scanlines, will be an
integer. When calculating the slope (dxdy) we will not use this height,
rather the real height that includes the fractional part. We apply the
subpixel accuracy to the slope by adjusting the initial x coordinate by
the amount that was truncated when selecting the top y coordinate using
ceil().
Interpolating along subpixel accurate left and right sections:
for( )
{
long x1 = ceil(left_x); // Scanline start x
long width = ceil(right_x) - x1; // Scanline width
if(width > 0)
{
Now draw the scanline from x1,y, width pixels wide.
}
left_x += left_dxdy;
right_x += right_dxdy;
y++;
}
10. Subtexel accuracy
---------------------
Calculate the u,v (dudy,dvdy) slopes along a section the same way as
the x coordinate slope and prestep the initial values the same way:
long height = v2.y - v1.y;
long dudy = ((v2.u - v1.u) << 16) / height;
long dvdy = ((v2.v - v1.v) << 16) / height;
long prestep = (ceil(v1.y) << 16) - v1.y;
long left_u = v1.u + ((prestep * dudy) >> 16);
long left_v = v1.v + ((prestep * dvdy) >> 16);
When we interpolate with subtexel accuracy we must prestep u,v before
drawing each scanline. This is because we round the start of the
scanline (x1) upward to the nearest pixel using ceil().
for( )
{
long x1 = ceil(left_x); // Scanline start x
long width = ceil(right_x) - x1; // Scanline width
if(width > 0)
{
long prestep = (x1 << 16) - left_x;
long u = left_u + ((prestep * dudx) >> 16);
long v = left_v + ((prestep * dvdx) >> 16);
Now draw the scanline from x1,y,u,v, width pixels wide.
}
left_x += left_dxdy;
left_u += left_dudy;
left_v += left_dvdy;
right_x += right_dxdy;
y++;
}
11. Avoiding divide overflow in slope calculations
--------------------------------------------------
For some sections that are very thin, the slope calculation might
overflow.
Look at the following case:
v1.x = 0000:0000 v2.x = 0002:0000
v1.y = 0000:0000 v2.y = 0000:0001
ceil(v2.y) - ceil(v1.y) will report that this is one scanline even if
the actual height is just a fraction of a pixel.
height = v2.y - v1.y = 0000:00001
width = v2.x - v1.y = 0002:00000
so performing (width << 16) / height will cause divide overflow. We
are aiming for nearly perfect affine mapping so we can't just take some
default value for the slope. One way to avoid the overflow is to
multiply width by the inverse of height using only 18:14 bit precision.
long height = v2.y - v1.y;
long inv_height = (0x10000 << 14) / height;
long dxdy = ((v2.x - v1.x) * inv_height) >> 14;
Note that this method can only be used for this special case where the
section height is less than one pixel. Other sections that are more
than one pixel should of course be calculated as usual.
12. Loop counters in inner loops
--------------------------------
There is a neat way to combine movement of destination pointer and loop
counter in inner loops. It might not save much (half a clock tick on a
Pentium) but you might find other ways of using it.
Assume that we want to draw a horizontal line on the screen with the
following data:
al = color
ecx = width of line
edi = destination pointer to first pixel on the left
The standard way would be:
inner:
mov [edi], al ; plot
inc edi ; advance destination pointer
dec ecx ; decrease loop counter
jnz inner
Another way is to combine destination pointer and width and plot the
line from right to left:
inner:
mov [edi+ecx-1], al
dec ecx
jnz inner
In the above loop we got rid of one instruction. On the other hand, we
are writing to memory from higher to lower addresses. This is bad for
the write buffers. We should write to increasing addresses instead.
That can be done in this way:
lea edi, [edi+ecx]
neg ecx ; loop counter goes from -width to 0
inner:
mov [edi+ecx], al
inc ecx
jnz inner
The destination is first moved to the end of the line and the loop
counter is negated. The first pixel will therefore be plotted at
start+width-width, i.e. at the start of the line.
neg ecx
can be replaced with:
xor ecx, -1
inc ecx
which most of the time can be paired with some other instructions in
the setup code. neg is a non pairable, 1 tick instruction. We will then
end up with:
lea edi, [edi+ecx]
xor ecx, -1
inc ecx
inner:
mov [edi+ecx], al
inc ecx
jnz inner
13. 8:16 bit inner loop
-----------------------
After fatmap.txt was released I was sent a very nice 8:16 bit 4 clock
tick inner loop made by Russel Simmons (Armitage/Beyond). In it's
original form it plotted the scanlines from right to left. I have in a
later stage changed this and present the new version here which plots
the scanlines from left to right.
; bitmap (256x256) must be aligned on 64k. (low 16 bits zero)
; eax = u frac : -
; ebx = bitmap ptr : v int : u int
; ecx = scanline width
; edx = v frac : v int step : u int step
; esi = u frac step : 0 : 0
; edi = destination ptr
; ebp = v frac step : 0 : 0
lea edi, [edi+ecx]
xor ecx, -1
inc ecx
inner:
mov al, [ebx] ; get color
add edx, ebp ; v frac += v frac step
adc bh, dh ; v int += v int step (+carry from v frac)
add eax, esi ; u frac += u frac step
adc bl, dl ; u int += u int step (+carry from u frac)
mov [edi+ecx], al ; plot pixel
inc ecx
jnz inner
This is a good inner loop with enough precision, which also can handle
texture wrapping. Note that the texture must be aligned on 64k.
One way of aligning texture maps on 64k is to first make a buffer which
is 64k byte larger than the texture map and then align a pointer inside
that buffer:
char source[256*256]; // source bitmap
char bigbuffer[256*256*2]; // 2*64k byte buffer
char * aligned = (char *) (((int)(bigbuffer + 0xffff)) & ~0xffff);
memcpy(aligned, source, 256*256);
Now access the bitmap using the aligned pointer.
14. Arbitrary size bitmap inner loop
------------------------------------
The idea for the following 5 clock tick inner loop was taken from Chris
Hecker's scanline subdivision texture mapper. This loop doesn't rely on
the fact that textures are always 256x256 pixels. We use a 2 dword
lookup table instead to perform the v * width calculation. Therefore it
can handle bitmaps of any size. The fractional part in this loop can be
up to 32 bits but we only use 16 bits here. Note that the bitmap does
not need to be aligned.
The trick in this loop is to convert the carry flag from the fractional
v addition into an index in the lookup table. Then to get the texture
increment from the lookup table.
; bitmap can be any size, setup duvdxstep table according to width
; dvdxfrac = v frac step : 0
; eax = u frac : 0
; ebx = v frac : 0
; ecx = scanline width
; edx = u frac step : 0
; esi = bitmap ptr
; edi = destination ptr
lea edi, [edi+ecx]
xor ecx, -1
add ebx, [dvdxfrac] ; v frac += v frac step
inc ecx
sbb ebp, ebp ; -1 if carry from add
inner:
add eax, edx ; u frac += u frac step
mov bl, [esi] ; get color
adc esi, [duvdxstep+4+ebp*4] ; move texture pointer
add ebx, [dvdxfrac] ; v frac += v frac step
sbb ebp, ebp ; -1 if carry from add
mov [edi+ecx], bl ; plot pixel
inc ecx
jnz inner
The duvdxstep lookup table is set up in the following way:
long duvdxstep[2];
duvdxstep[0] = (dudx >> 16) + (dvdx >> 16) * mapwidth + mapwidth;
duvdxstep[1] = (dudx >> 16) + (dvdx >> 16) * mapwidth;
This means that when we get that carry from the v addition and ebp
becomes -1, we will address duvdxstep[0] and add one extra line in the
bitmap.
A drawback of this inner loop is that it can not handle texture
wrapping. I.e. you must never let the u,v coordinates be outside the
texture map or you will read random data outside the texture map or
even cause a protection fault.
15. 8:15 bit tiled inner loop
-----------------------------
In fatmap.txt I described a method of arranging the texture map as 8x8
tiles for more efficient cache usage. Back then I did not know any way
of doing the bit-swapping inner loop in less than 11 clock ticks. But
one day it just came to me and I could immediately write a 8 clock tick
8:16 bit version. That version did not even work in theory, but it was
a starting point. The inner loop below has evolved from the original 8
clock tick version over a two month period. It was developed on paper
only and the first time I actually tested any of the loops, was this
version and it worked perfectly.
The version below runs at 6 clock ticks and uses 8:15 bit
interpolation. Here 8x8 tiles are used but any type of scheme can be
used, just modify the bit-masks. The 256x256 bitmap need not be
aligned, but for the tiling scheme to be effective the bitmap should be
aligned on 32 byte. Texture wrapping is allowed.
; bitmap (256x256) must be stored as 8x8 tiles
; tildudx = wwwww11111111www1fffffffffffffffb (w=whole, f=frac)
; tildvdx = 11111wwwwwwww1111fffffffffffffffb
; eax = u wwwww00000000www0fffffffffffffffb
; ebx = v 00000wwwwwwww0000fffffffffffffffb
; ecx = scanline width
; edi = destination ptr
; esi = bitmap ptr
lea edi, [edi+ecx-1]
xor ecx, -1
lea ebp, [eax+ebx] ; u+v
inc ecx
inner:
add eax, [tildudx] ; u += tildudx
add ebx, [tildvdx] ; v += tildvdx
shr ebp, 16 ; (u+v) >> 16
and eax, 11111000000001110111111111111111b ; clear bitgaps
and ebx, 00000111111110000111111111111111b
inc ecx
mov dl, [esi+ebp] ; get color
lea ebp, [eax+ebx] ; u+v
mov [edi+ecx], dl ; plot pixel
jnz inner
Converting 16:16 bit fixed point dudx,dvdx to the tiled format can be
done the following way:
tildudx = (((dudx << 8) & 0xf8000000) +
((dudx >> 1) & 0x00007fff) +
(dudx & 0x00070000)) | 0x07f88000;
tildvdx = (((dvdx << 3) & 0x07f80000) +
((dvdx >> 1) & 0x00007fff)) | 0xf8078000;
Note that we fill out the bit-gaps with 1's. This is because we must
force the bits to jump over the gaps when we add in the inner loop.
These 1's must be cleared out after the add. We do this with the 'and'
instructions in the inner loop.
Before entering the inner loop we also convert u,v to the tiled format:
u = ((u << 8) & 0xf8000000) +
((u >> 1) & 0x00007fff) +
(u & 0x00070000);
v = ((v << 3) & 0x07f80000) +
((v >> 1) & 0x00007fff);
It should be noticed that 15 bits are only used for the fractional part
and that it is right-shifted 1 bit. This is done because we must not
let the fractional part overflow and clutter the texture pointer when
we add u+v using the lea instruction. We have placed a bit-gap between
the fractional part and the whole part to avoid this.
The bitmap itself can be tiled the following way:
char source[256*256]; // linear source bitmap
char tiled[256*256];
for(int v=0; v<256; v++) {
for(int u=0; u<256; u++) {
int dst = ((u<<8) & 0xf800)+(u & 0x0007)+((v<<3) & 0x07f8);
tiled[dst] = source[u+v*256];
}
}
16. Polygons per second
-----------------------
From time to time I see coders talking about the speed of their 3D
engine by specifying how many polygons per second can be plotted. In my
opinion, this is completely irrelevant as they do not specify under
what condition they have made the test. The speed will be _very_
dependent on the size and orientation of the polygons as well as how
the texture is mapped onto the polygons. Many other factors will
contribute to the speed, even the type of memory manager or DOS
extender.
With the simple test program accompanying this document I try to
compare the speed between the different mappers. In this case, a
comparison is valid as they work under the exact same conditions. I get
the following results on my Pentium 120:
Flat polys/sec 3494
Gouraud polys/sec 1849
Texture polys/sec 1214
Arbitrary texture polys/sec 1196
Tiled texture polys/sec 1520
These figures do not seem very impressive, but keep in mind that the
triangles can be very large as I allow x,y to be in the range
[0...320],[0...200]. In any case you should notice that the flat filler
(using standard memset() as inner loop) is about twice as fast as the
others. The arbitrary size mapper and the ordinary mapper should run at
about the same speed. The tiled mapper will come out as a winner,
because we allow the u,v values to vary a lot over the triangle
surface, i.e. the cache issues will become more important than a quick
setup and a fast inner loop.
17. Greetings
-------------
Members of Doomsday
Members of SoCS
Members of Esteem
Phil Carmody
Nix / The Black Lotus
Submissive / Cubic Team & $eeN
Armitage / Beyond
Jare / Iguana
Wog / Orange
#coders. No one mentioned, no one forgotten.
All my students in YiJ 1996-1997; Data3, AD2, Elm2, El1A, El1B
...my heart in Oslo
.eof.