Source documentation of gnuboy (all there is anyways...)
Helps with understanding the code.
git-svn-id: svn://svn.rockbox.org/rockbox/trunk@6195 a1c6a512-1295-4272-9138-f99709370657
parent 7107dd8e1f
commit 80a8ea19ca
1 changed file with 472 additions and 0 deletions
apps/plugins/rockboy/HACKING

HACKING ON THE GNUBOY SOURCE TREE


BASIC INFO

In preparation for the first release, I'm putting together a simple
document to aid anyone interested in playing around with or improving
the gnuboy source. First of all, before working on anything, you
should know my policies as maintainer. I'm happy to accept contributed
code, but there are a few guidelines:

* Obviously, all code must be able to be distributed under the GNU
GPL. This means that your terms of use for the code must be equivalent
to or weaker than those of the GPL. Public domain and MIT-style
licenses are perfectly fine for new code that doesn't incorporate
existing parts of gnuboy, e.g. libraries, but anything derived from or
built upon the GPL'd code can only be distributed under GPL. When in
doubt, read COPYING.

* Please stick to a coding and naming convention similar to the
existing code. I can reformat contributions if I need to when
integrating them, but it makes it much easier if that's already done
by the coder. In particular, indentation is a single tab (char 9), and
all symbols are all lowercase, except for macros which are all
uppercase.

* All code must be completely deterministic and consistent across all
platforms. This results in the two following rules...

* No floating point code whatsoever. Use fixed point or better yet
exact analytical integer methods as opposed to any approximation.

* No threads. Emulation with threads is a poor approximation if done
sloppily, and it's slow anyway even if done right since things must be
kept synchronous. Also, threads are not portable. Just say no to
threads.

* All non-portable code belongs in the sys/ or asm/ trees. #ifdef
should be avoided except for general conditionally-compiled code, as
opposed to little special cases for one particular cpu or operating
system. (e.g. #ifdef USE_ASM is ok, #ifdef __i386__ is NOT!)

* That goes for *nix code too. gnuboy is written in ANSI C, and I'm
not going to go adding K&R function declarations or #ifdef's to make
sure the standard library is functional. If your system is THAT
broken, fix the system, don't "fix" the emulator.

* Please no feature-creep. If something can be done through an
external utility or front-end, or through clever use of the rc
subsystem, don't add extra code to the main program.

* On that note, the modules in the sys/ tree serve the singular
purpose of implementing calls necessary to get input and display
graphics (and eventually sound). Unlike in poorly-designed emulators,
they are not there to give every different target platform its own gui
and different set of key bindings.

* Furthermore, the main loop is not in the platform-specific code, and
it will never be. Windows people, put your code that would normally go
in a message loop in ev_refresh and/or sys_sleep!

* Commented code is welcome but not required.

* I prefer asm in AT&T syntax (the style used by *nix assemblers and
likewise DJGPP) as opposed to Intel/NASM/etc style. If you really must
use a different style, I can convert it, but I don't want to add extra
dependencies on nonstandard assemblers to the build process. Also,
portable C versions of all code should be available.

* Have fun with it. If my demands stifle your creativity, feel free to
fork your own projects. I can always adapt and merge code later if
your rogue ideas are good enough. :)
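
To make the "no floating point" guideline above concrete, here is a
minimal sketch of an exact integer rate converter. It is not taken from
gnuboy; the struct, the function name and the 44100 Hz sample rate are
assumptions chosen for illustration. The remainder is carried between
calls, so results are identical on every platform and never drift.

```c
#include <assert.h>

/* Hypothetical helper, not part of gnuboy: converts emulated cycles
 * into output samples using only integer arithmetic. */
struct ratio_acc {
	unsigned long num;	/* samples per second, e.g. 44100 */
	unsigned long den;	/* cycles per second, e.g. 1 << 21 */
	unsigned long rem;	/* carried remainder, always < den */
};

/* Returns the number of whole samples produced by `cycles` cycles.
 * cycles * num must fit in an unsigned long; one frame's worth of
 * cycles at 44100 Hz stays well below 2**32. */
static unsigned long ratio_advance(struct ratio_acc *r, unsigned long cycles)
{
	unsigned long total, out;

	assert(r->den != 0 && r->rem < r->den);
	total = r->rem + cycles * r->num;
	out = total / r->den;
	r->rem = total % r->den;
	return out;
}
```

Feeding it exactly one emulated second of cycles, in any chunking,
yields exactly `num` samples, which is the property a floating point
accumulator cannot guarantee across hosts.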

OK, enough of that. Now for the fun part...


THE SOURCE TREE STRUCTURE

[documentation]
README               - general information related to using gnuboy
INSTALL              - compiling and installation instructions
HACKING              - this file, obviously
COPYING              - the gnu gpl, grants freedom under condition of preserving it

[build files]
Version              - doubles as a C and makefile include, identifies version number
Rules                - generic build rules to be included by makefiles
Makefile.*           - system-specific makefiles
configure*           - script for generating *nix makefiles

[non-portable code]
sys/*/*              - hardware and software platform-specific code
asm/*/*              - optimized asm versions of some code, not used yet
asm/*/asm.h          - header specifying which functions are replaced by asm
asm/i386/asmnames.h  - #defines to fix _ prefix brain damage on DOS/Windows

[main emulator stuff]
main.c               - entry point, event handler...basically a mess
loader.c             - handles file io for rom and ram
emu.c                - another mess, basically the frame loop that calls cpu_emulate
debug.c              - currently just cpu trace, eventually interactive debugging
hw.c                 - interrupt generation, gamepad state, dma, etc.
mem.c                - memory mapper, read and write operations
fastmem.h            - short static functions that will inline for fast memory io
regs.h               - macros for accessing hardware registers
save.c               - savestate handling

[cpu subsystem]
cpu.c                - main cpu emulation
cpuregs.h            - macros for cpu registers and flags
cpucore.h            - data tables for cpu emulation
asm/i386/cpu.s       - entire cpu core, rewritten in asm

[graphics subsystem]
fb.h                 - abstract framebuffer definition, extern from platform-specifics
lcd.c                - main control of refresh procedure
lcd.h                - vram, palette, and internal structures for refresh
asm/i386/lcd.s       - asm versions of a few critical functions
lcdc.c               - lcdc phase transitioning

[input subsystem]
input.h              - internal keycode definitions, etc.
keytables.c          - translations between key names and internal keycodes
events.c             - event queue

[resource/config subsystem]
rc.h                 - structure defs
rccmds.c             - command parser/processor
rcvars.c             - variable exports and command to set rcvars
rckeys.c             - keybindings

[misc code]
path.c               - path searching
split.c              - general purpose code to split strings into argv-style arrays


OVERVIEW OF PROGRAM FLOW

The initial entry point is main() in main.c, which processes the
command line, calls the system/video initialization routines, loads
the rom/sram, and passes control to the main loop in emu.c. Note that
the system-specific main() hook has been removed since it is not
needed.

There have been significant changes to gnuboy's main loop since the
original 0.8.0 release. The former state.c is no more, and the new
code that takes its place, in lcdc.c, is now called from the cpu loop,
which, although slightly unfortunate for performance reasons, is
necessary to handle some strange special cases.

Still, unlike some emulators, gnuboy's main loop is not the cpu
emulation loop. Instead, a main loop in emu.c, which handles video
refresh, polling events, sleeping between frames, etc., calls
cpu_emulate, passing it an ideal number of cycles to run. The actual
number of cycles for which the cpu runs will vary slightly depending
on the length of the final instruction processed, but it should never
be more than 8 or 9 beyond the ideal cycle count passed, and the
actual number will be returned to the calling function in case it
needs this information. The cpu code now takes care of all timer and
lcdc events in its main loop, so the caller no longer needs to be
aware of such things.
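
One way a caller might use the returned cycle count is sketched below.
This is an illustration only, not the code in emu.c: fake_cpu_emulate
is a stand-in whose "instructions" all take 3 cycles, and carrying the
overshoot into the next frame is an assumption about what a caller
could do with the return value.

```c
#include <assert.h>

/* Stand-in for cpu_emulate: runs at least `cycles` cycles and returns
 * the number actually executed. Every fake instruction takes 3
 * cycles, so it can overshoot the request by up to 2. */
static int fake_cpu_emulate(int cycles)
{
	int done = 0;

	while (done < cycles)
		done += 3;
	return done;
}

/* Hypothetical frame loop: request one frame's worth of cycles minus
 * whatever the previous call overshot, so the error does not add up. */
static long run_frames(int frames, int cycles_per_frame)
{
	long total = 0;
	int debt = 0;	/* cycles executed beyond the previous target */
	int i;

	assert(cycles_per_frame > 0);
	for (i = 0; i < frames; i++) {
		int want = cycles_per_frame - debt;
		int ran = fake_cpu_emulate(want);

		debt = ran - want;
		total += ran;
		/* video refresh, event polling and sleeping would go here */
	}
	return total;
}
```

With this scheme the total after any number of frames exceeds the
ideal total by at most one instruction's overshoot.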

Note that all cycle counts are measured in CGB double speed MACHINE
cycles (2**21 Hz), NOT hardware clock cycles (2**23 Hz). This is
necessary because the cpu speed can be switched between single and
double speed during a single call to cpu_emulate. When running in
single speed or DMG mode, all instruction lengths are doubled.
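
The unit convention above amounts to a little arithmetic, sketched
here. The helper names are hypothetical and do not appear in the
gnuboy source; only the 2**21 and 2**23 figures and the doubling rule
come from the text.

```c
#include <assert.h>

#define UNITS_PER_SEC (1L << 21)	/* CGB double speed machine cycles */

/* Cost of an instruction in 2**21 Hz units. `mcycles` is its length
 * in machine cycles at the current cpu speed; `double_speed` is
 * nonzero only while the CGB runs in double speed mode. */
static int cost_in_units(int mcycles, int double_speed)
{
	assert(mcycles > 0);
	return double_speed ? mcycles : mcycles * 2;
}

/* Hardware clock ticks (2**23 Hz) that correspond to `units`. */
static long units_to_clock_ticks(long units)
{
	return units * 4;
}
```

For example, an instruction that takes 4 machine cycles costs 4 units
in double speed mode and 8 units in single speed or DMG mode.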

As for the LCDC state, things are much simpler now. No more huge
glorious state table, no more P/Q/R, just a couple simple functions.
Aside from the number of cycles left before the next state change, all
the state information fits nicely in the locations the Game Boy itself
provides for it -- the LCDC, STAT, and LY registers.

If the special cases for the last line of VBLANK look strange to you,
good. There's some weird stuff going on here. According to documents
I've found, LY changes from 153 to 0 early in the last line, then
remains at 0 until the end of the first visible scanline. I don't
recall finding any roms that rely on this behavior, but I implemented
it anyway.

That covers the basics. As for flow of execution, here's a simplified
call tree that covers most of the significant function calls taking
place in normal operation:

 main                                       sys/
  \_ real_main                              main.c
      |_ sys_init                           sys/
      |_ vid_init                           sys/
      |_ loader_init                        loader.c
      |_ emu_reset                          emu.c
      \_ emu_run                            emu.c
          |_ cpu_emulate                    cpu.c
          |   |_ div_advance                cpu.c *
          |   |_ timer_advance              cpu.c *
          |   |_ lcdc_advance               cpu.c *
          |   |   \_ lcdc_trans             lcdc.c
          |   |       |_ lcd_refreshline    lcd.c
          |   |       |_ stat_change        lcdc.c
          |   |       |   \_ lcd_begin      lcd.c
          |   |       \_ stat_trigger       lcdc.c
          |   \_ sound_advance              cpu.c *
          |_ vid_end                        sys/
          |_ sys_elapsed                    sys/
          |_ sys_sleep                      sys/
          |_ vid_begin                      sys/
          \_ doevents                       main.c

 (* included in cpu.c so they can inline; also in cpu.s)


MEMORY READ/WRITE MAP

Whenever possible, gnuboy avoids emulating memory reads and writes
with a function call. To this end, two pointer tables are kept -- one
for reading, the other for writing. They are indexed by bits 12-15 of
the address in Game Boy memory space, and yield a base pointer from
which the whole address can be used as an offset to access Game Boy
memory with no function calls whatsoever. For regions that cannot be
accessed without function calls, the pointer in the table is NULL.

For example, reading from address addr can be accomplished by testing
to make sure mbc.rmap[addr>>12] is not NULL, then simply reading
mbc.rmap[addr>>12][addr].
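
The lookup just described can be sketched as follows. This is a
simplified illustration, not the code in mem.c or fastmem.h: the names
space, slow_read, map_init and readb are hypothetical, bank switching
is not modeled, and every mapped page simply points at the start of
one flat 64k array so that indexing by the full address stays in
bounds.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte;

static byte space[0x10000];	/* flat stand-in for Game Boy memory */
static byte *rmap[16];		/* indexed by bits 12-15 of the address */

/* Slow path for pages with no direct pointer. A real implementation
 * would decode the address; this one returns a dummy value. */
static byte slow_read(int addr)
{
	(void)addr;
	return 0xFF;
}

static void map_init(void)
{
	int page;

	for (page = 0; page < 16; page++)
		rmap[page] = space;
	rmap[0xF] = NULL;	/* e.g. I/O registers need a function call */
}

/* Fast read: one table lookup, one test, one indexed load. */
static byte readb(int addr)
{
	byte *base;

	assert(addr >= 0 && addr <= 0xFFFF);
	base = rmap[addr >> 12];
	return base ? base[addr] : slow_read(addr);
}
```

A write table would work the same way, with NULL entries for regions
such as rom where a write has to be intercepted.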

And for the disbelievers in this optimization, here are some numbers
to compare. First, FFL2 with memory tables disabled:

   %   cumulative    self               self     total
  time    seconds   seconds    calls  us/call  us/call  name
 28.69       0.57      0.57                             refresh_2
 13.17       0.84      0.26  4307863     0.06     0.06  mem_read
 11.63       1.07      0.23                             cpu_emulate

Now, with memory tables enabled:

 38.86       0.66      0.66                             refresh_2
  8.42       0.80      0.14   156380     0.91     0.91  spr_enum
  6.76       0.91      0.11   483134     0.24     1.31  lcdc_trans
  6.16       1.02      0.10                             cpu_emulate
   .
   .
   .
  0.59       1.61      0.01   216497     0.05     0.05  mem_read

As you can see, not only does mem_read take up (proportionally) 1/20
as much time, since it is rarely called, but the main cpu loop in
cpu_emulate also runs considerably faster with all the function call
overhead and cache misses avoided.

These tests were performed on a K6-2/450 with the assembly cores
enabled; your mileage may vary. Regardless, however, I think it's
clear that using the address mapping tables is quite a worthwhile
optimization.


LCD RENDERING CORE DESIGN

The LCD core presently used in gnuboy is very much a high-level one,
performing the task of rasterizing scanlines as many independent steps
rather than one big loop, as is often seen in other emulators and the
original gnuboy LCD core. In some ways, this is a bit of a tradeoff --
there's a good deal of overhead in rebuilding the tile pattern cache
for roms that change their tile patterns frequently, such as full
motion video demos. Even still, I consider the method we're presently
using far superior to generating the output display directly from the
gameboy tiledata -- in the vast majority of roms, tiles are changed so
infrequently that the overhead is irrelevant. Even if the tiles are
changed rapidly, the only chance for overhead beyond what would be
present in a monolithic rendering loop lies in (host cpu) cache misses
and the possibility that we might (tile pattern) cache a tile that has
changed but that will never actually be used, or that will only be
used in one orientation (horizontally and vertically flipped versions
of all tiles are cached as well). Such tile caching issues could be
addressed in the long term if they cause a problem, but I don't see it
hurting performance too significantly at the present. As for host cpu
cache miss issues, I find that putting multiple data decoding and
rendering steps together in a single loop harms performance much more
significantly than building a 256k (pattern) cache table, on account
of interfering with branch prediction, register allocation, and so on.

Well, with those justifications given, let's proceed to the steps
involved in rendering a scanline:

updatepatpix() - updates tile pattern cache.

tilebuf() - reads gb tile memory according to its complicated tile
addressing system which can be changed via the LCDC register, and
outputs nice linear arrays of the actual tile indices used in the
background and window on the present line.

Before continuing, let me explain the output format used by the
following functions. There is a byte array scan.buf, accessible by
macro as BUF, which is the output buffer for the line. The structure
of this array is simple: it is composed of 6 bpp gameboy color
numbers, where the bits 0-1 are the color number from the tile, bits
2-4 are the (cgb or dmg) palette index, and bit 5 is 0 for background
or window, 1 for sprite.
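
The bit layout just described can be written down as a few helpers.
The function names below are hypothetical and are not taken from
lcd.c; only the bit positions come from the text.

```c
#include <assert.h>

typedef unsigned char byte;

/* Pack one 6 bpp output pixel:
 * bits 0-1 color number, bits 2-4 palette index, bit 5 sprite flag. */
static byte pix_pack(int color, int pal, int is_sprite)
{
	assert(color >= 0 && color < 4);
	assert(pal >= 0 && pal < 8);
	return (byte)(color | (pal << 2) | (is_sprite ? 0x20 : 0));
}

static int pix_color(byte p)  { return p & 3; }
static int pix_pal(byte p)    { return (p >> 2) & 7; }
static int pix_sprite(byte p) { return (p >> 5) & 1; }
```

For instance, color 3 from palette 5 on a sprite packs to 0x37, and a
background pixel with color 0 and palette 0 is simply 0.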

What is the justification for using a strange format like this, rather
than raw host color numbers for output? Well, believe it or not, it
improves performance. It's already necessary to have the gameboy color
numbers available for use in sprite priority. And, when running in
mono gb mode, building this output data is VERY fast -- it's just a
matter of doing 64 bit copies from the tile pattern cache to the
output buffer.

Furthermore, using a unified output format like this eliminates the
need to have separate rendering functions for each host color depth or
mode. We just call a one-line function to apply a palette to the
output buffer as we copy it to the video display, and we're done. And,
if you're not convinced about performance, just do some profiling.
You'll see that the vast majority of the graphics time is spent in the
one-line copy function (render_[124] depending on bytes per pixel),
even when using the fast asm versions of those routines. That is to
say, any overhead in the following functions is for all intents and
purposes irrelevant to performance. With that said, here they are:

bg_scan() - expands the background layer to the output buffer.

wnd_scan() - expands the window layer.

spr_scan() - expands the sprites. Note that this requires spr_enum()
to have been called already to build a list of which sprites are
visible on the current scanline and sort them by priority.
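
As a rough illustration of the enumeration step that spr_scan() relies
on, here is a sketch that collects the sprites covering one scanline.
It is not the spr_enum() from lcd.c: the struct, the function name and
the ordering rule are assumptions. The ordering used here (smaller x
first, earlier OAM entry first on ties) follows how mono Game Boy
sprite priority is commonly documented; gnuboy's actual rule may
differ.

```c
#include <assert.h>

struct obj { int y, x, tile, flags; };	/* one raw OAM entry */

/* Stores in `out` the indices of up to `max` sprites that cover
 * scanline `line`, ordered by x with OAM order breaking ties.
 * `height` is 8 or 16. Returns the number of sprites found. */
static int enum_sprites(const struct obj *oam, int count, int line,
	int height, int *out, int max)
{
	int i, j, n = 0;

	assert(height == 8 || height == 16);
	for (i = 0; i < count && n < max; i++) {
		int top = oam[i].y - 16;	/* OAM y is offset by 16 */

		if (line < top || line >= top + height)
			continue;
		/* stable insertion: equal x keeps the earlier entry first */
		for (j = n; j > 0 && oam[out[j-1]].x > oam[i].x; j--)
			out[j] = out[j-1];
		out[j] = i;
		n++;
	}
	return n;
}
```

The caller passes a `max` matching the hardware limit it wants to
emulate; real hardware shows at most 10 sprites per line.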

It should be noted that the background and window functions also have
color counterparts, which are considerably slower due to merging of
palette data. At this point, they're staying down around 8% time
according to the profiler, so I don't see a major need to rewrite them
anytime soon. It should be considered, however, that a different
intermediate format could be used for gbc, or that asm versions of
these two routines could be written, in the long term.

Finally, some notes on palettes. You may be wondering why the 6 bpp
intermediate output can't be used directly on 256-color display
targets. After all, that would give a huge performance boost. The
problem, however, is that the gameboy palette can change midscreen,
whereas none of the presently targeted host systems can handle such a
thing, much less do it portably. For color roms, using our own
internal color mappings in addition to the host system palette is
essential. For details on how this is accomplished, read palette.c.
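
The palette lookup performed while copying a line to the display can
be sketched as below for a 16 bpp target. This is an illustration in
the spirit of the render_[124] functions mentioned earlier, not their
actual code; the name copy_line_16 and the 64-entry table layout are
assumptions.

```c
#include <assert.h>

typedef unsigned char byte;

/* Copies `n` pixels from the 6 bpp intermediate buffer to a 16 bpp
 * destination, translating each one through a 64-entry table that
 * maps intermediate values to host pixel values. */
static void copy_line_16(unsigned short *dest, const byte *src,
	const unsigned short *pal, int n)
{
	assert(n >= 0);
	while (n--)
		*dest++ = pal[*src++ & 63];
}
```

Because the table is consulted on every line, a palette change in the
middle of the screen only requires updating the table before the next
line is copied.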

Now, in the long term, it MAY be possible to use the 6 bpp color
"almost" directly for mono roms. Note that I say almost. The idea is
this. Using the color number as an index into a table is slow. It
takes an extra read and causes various pipeline stalls depending on
the host cpu architecture. But, since there are relatively few
possible mono palettes, it may actually be possible to set up the host
palette in a clever way so as to cover all the possibilities, then use
some fancy arithmetic or bit-twiddling to convert without a lookup
table -- and this could presumably be done 4 pixels at a time with
32bit operations. This area remains to be explored, but if it works,
it might end up being the last hurdle to getting realtime emulation
working on very low-end systems like i486.


SOUND

Rather than processing sound after every few instructions (and thus
killing cache locality), we update sound in big chunks. Yet this in no
way affects precise sound timing, because sound_mix is always called
before reading or writing a sound register, and at the end of each
frame.

The main sound module interfaces with the system-specific code through
one structure, pcm, and a few functions: pcm_init, pcm_close, and
pcm_submit. While the first two should be obvious, pcm_submit needs
some explaining. Whenever realtime sound output is operational,
pcm_submit is responsible for timing, and should not return until it
has successfully processed all the data in its input buffer (pcm.buf).
On *nix sound devices, this typically means just waiting for the write
syscall to return, but on systems such as DOS where low level IO must
be handled in the program, pcm_submit needs to delay until the current
position in the DMA buffer has advanced sufficiently to make space for
the new samples, then copy them.

For special sound output implementations like write-to-file or the
dummy sound device, pcm_submit should write the data immediately and
return 0, indicating to the caller that other methods must be used for
timing. On real sound devices that are presently functional,
pcm_submit should return 1, regardless of whether it buffered or
actually wrote the sound data.
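
The return value contract can be illustrated with a dummy device. Only
the names pcm, pcm.buf and pcm_submit come from the text above; the
struct layout, the len and pos fields, and the helper names in this
sketch are assumptions.

```c
#include <assert.h>

typedef unsigned char byte;

/* Hypothetical layout; only `buf` is named in the text. */
struct pcm_sketch {
	byte *buf;	/* samples waiting to be played */
	int len;	/* capacity of buf in bytes */
	int pos;	/* number of bytes currently queued */
};

static long bytes_discarded;

/* Dummy device: consume everything immediately and return 0 so the
 * caller knows it has to do its own frame timing. */
static int pcm_submit_dummy(struct pcm_sketch *pcm)
{
	assert(pcm->pos >= 0 && pcm->pos <= pcm->len);
	bytes_discarded += pcm->pos;	/* a file writer would write here */
	pcm->pos = 0;
	return 0;
}

/* Caller side: sleep only when the sound device did not pace us. */
static int frame_needs_sleep(struct pcm_sketch *pcm)
{
	return !pcm_submit_dummy(pcm);
}
```

A realtime device would instead block inside pcm_submit until the
data has been accepted and then return 1, in which case the caller
skips its own sleep.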

And yes, for unices without OSS, we hope to add piped audio output
soon. Perhaps the Sun audio device and a few others as well.


OPTIMIZED ASSEMBLY CODE

A lot can be said on this matter. Nothing has been said yet.


INTERACTIVE DEBUGGER

Apologies, there is no interactive debugger in gnuboy at present. I'm
still working out the design for it. In the long run, it should be
integrated with the rc subsystem, kinda like a cross between gdb and
Quake's ever-famous console. Whether it will require a terminal device
or support the graphical display remains to be determined.

In the meantime, you can use the debug trace code already
implemented. Just "set trace 1" from your gnuboy.rc or the command
line. Read debug.c for info on how to interpret the output, which is
condensed as much as possible and not quite self-explanatory.


PORTING

On all systems on which it is available, the gnu compiler should
probably be used. Writing code specific to non-free compilers makes it
impossible for free software users to actively contribute. On the
other hand, compiler-specific code should always be kept to a minimum,
to make porting to or from non-gnu compilers easier.

Porting to new cpu architectures should not be necessary. Just make
sure you unset IS_LITTLE_ENDIAN in the makefiles to enable the big
endian default if the target system is big endian. If you do have
problems building on certain cpus, however, let us know. Eventually,
we will also want asm cpu and graphics code for popular host cpus, but
this can wait, since the c code should be sufficiently fast on most
platforms.
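
To show why the IS_LITTLE_ENDIAN setting matters, here is a sketch of
a 16-bit register pair accessed both whole and as two bytes. The LO/HI
macros, the union and the helper names are illustrative assumptions,
not quoted from the gnuboy headers. The compiler check at the top only
stands in for the makefile setting so the sketch builds on its own; it
relies on GCC/Clang predefined macros and on unsigned short being 16
bits wide.

```c
#include <assert.h>

/* Stand-in for the makefile: gnuboy itself would get this from the
 * build configuration rather than from a compiler test. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define IS_LITTLE_ENDIAN
#endif

#ifdef IS_LITTLE_ENDIAN
#define LO 0
#define HI 1
#else
#define LO 1	/* big endian default */
#define HI 0
#endif

/* A 16-bit register pair viewed whole or as two 8-bit halves. */
union regpair {
	unsigned short w;
	unsigned char b[2];
};

static int high_byte(unsigned short value)
{
	union regpair r;

	r.w = value;
	return r.b[HI];
}

static int low_byte(unsigned short value)
{
	union regpair r;

	r.w = value;
	return r.b[LO];
}
```

If the setting does not match the host, the two halves swap silently,
which is why it has to be right before anything else is debugged.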

The bulk of porting efforts will probably be spent on adding support
for new operating systems, and on systems with multiple video (or
sound, once that's implemented) architectures, new interfaces for
those. In general, the operating system interface code goes in a
directory under sys/ named for the os (e.g. sys/nix/ for *nix
systems), and display interfaces likewise go in their respective
directories under sys/ (e.g. sys/x11/ for the x window system
interface).

For guidelines in writing new system and display interface modules, I
recommend reading the files in the sys/dos/, sys/svga/, and sys/nix/
directories. These are some of the simpler versions (aside from the
tricky dos keyboard handling), as opposed to all the mess needed for
x11 support.

Also, please be aware that the existing system and display interface
modules are somewhat primitive; they are designed to be as quick and
sloppy as possible while still functioning properly. Eventually they
will be greatly improved.

Finally, remember your obligations under the GNU GPL. If you produce
any binaries that are compiled strictly from the source you received,
and you intend to release those, you *must* also release the exact
sources you used to produce those binaries. This is not pseudo-free
software like Snes9x where binaries usually appear before the latest
source, and where the source only compiles on one or two platforms;
this is true free software, and the source to all binaries always
needs to be available at the same time or sooner than the
corresponding binaries, if binaries are to be released at all. This of
course applies to all releases, not just new ports, but from
experience I find that ports people usually need the most reminding.


EPILOGUE

That's it for now. More info will eventually follow. Happy hacking!