Graphics research has typically approached the problem of image generation from two separate directions: realism and interactivity. Photo-realistic approaches are primarily concerned with simulating increasingly detailed physical properties, requiring ever more computational resources and time per image. Interactive applications tend to rely on existing graphics hardware and algorithmic simplifications to obtain high frame rates.
One time-honored technique for generating photo-realistic imagery is ray tracing. Ray tracers have yielded some of the most realistic images produced to date. Ray tracing provides a simple and elegant method for generating complex effects, naturally incorporating hidden surface removal, shadows, reflection and refraction, which can be difficult to produce using other methods. This simplicity comes with a much higher per-pixel resource cost than other techniques, such as scan-line rendering, and so ray tracing is commonly regarded as useful only for non-interactive rendering jobs. As hardware performance continues to improve, this perception is changing; however, ray tracing is still not a ubiquitous technique for rendering scenes at interactive speeds.
The image quality of interactive systems tends to be much lower than that of images generated by offline image synthesis, due primarily to reliance upon existing graphics hardware. Graphics hardware can remove much of the computational burden from the host CPU, but the application is then dependent on the capabilities of the hardware. Graphics hardware is becoming increasingly sophisticated, yet still has difficulty producing some effects, such as refraction. It is not a coincidence that when interactivity is not a constraint, many rendering applications forgo hardware-accelerated techniques for more realistic methods.
Ideally, users would be able to have the best of both worlds: unlimited photo-realistic interactivity; however, this goal is still a few years off. Many current design tools are therefore divided into two sections: an interactive design session and an off-line rendering session. This approach often requires several iterations, as problems that were not visible during the design phase become apparent in the rendering phase.
This research seeks to strike a balance between the two approaches to rendering. By using a ray tracing engine for the base rendering module, the user obtains access to flexible shading models and realistic imagery. Further, by utilizing pixel reprojection, it becomes possible to reuse this data inexpensively. This paper presents a rendering architecture that rapidly generates realistic imagery by exploiting frame-to-frame coherence. The benefits can be applied directly to areas where both interactivity and fidelity are desired, such as design work or virtual reality environments.
The ray tracing algorithm lends itself well to parallelization. Since each initial ray cast into a scene is essentially independent, the calculation of each pixel does not depend on any previous results. Parallelization has therefore been a common method of increasing the speed of ray tracing engines, and many parallel implementations have been created. In fact, ray tracing has reached interactive speeds on high-end machines using this technique. Parker et al. discuss many trade-offs to consider when moving a ray tracer into a heavily parallel setting.
Pixel reprojection can be performed on any image that has an associated depth buffer. Mark et al. discuss geometric pixel reprojection in detail, which they refer to as a 3D warp. By using both image coordinates and the associated depth value of each pixel in a source frame, it is possible to accurately compute both translation and rotation transforms, placing the pixel into a destination frame. Although every pixel that is reprojected will be correctly placed, the resultant image will not necessarily be complete. There will usually be several areas of the image that have no information. These areas appear as holes in object surfaces. These holes are symptoms of two major problems: occluders and sub-sampling artifacts.
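As a concrete illustration, the warp of a single pixel under camera translation can be sketched as follows. This is a minimal sketch assuming a simple pinhole camera model with focal length f and principal point (cx, cy); all names are illustrative, and Mark et al.'s full formulation also handles rotation:

```python
def unproject(u, v, depth, f, cx, cy):
    # Invert the pinhole projection: recover a camera-space point
    # from image coordinates and the stored depth value.
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)

def project(x, y, z, f, cx, cy):
    # Standard pinhole projection back into image coordinates.
    return (f * x / z + cx, f * y / z + cy)

def warp_pixel(u, v, depth, f, cx, cy, translation):
    # 3D warp for a translating camera: unproject the source pixel,
    # shift it into the destination camera's frame (the camera has
    # moved by `translation`), and reproject.
    x, y, z = unproject(u, v, depth, f, cx, cy)
    tx, ty, tz = translation
    return project(x - tx, y - ty, z - tz, f, cx, cy)
```

With zero translation a pixel maps to itself; moving the camera sideways shifts nearby pixels farther than distant ones, which is exactly what leaves disocclusion holes behind foreground objects.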
There are several methods for reducing or eliminating the occluder and sub-sampling problems. Most work by obtaining more information about the scene, such as combining multiple viewpoints of the same scene, or augmenting a single reference image with multiple depth layers (LDIs) as described by Gortler et al. Other common approaches to filling holes involve interpolation techniques such as splatting or mesh stretching.
When generating successive frames in an animation or interactive environment, it is most common to simply calculate each frame independently. Although some global information may be retained (such as view-independent scene preprocessing used for bounding box hierarchies or radiosity calculations), these tend to be the exceptions rather than the rule. This research seeks to increase the amount of information shared between frames by leveraging frame-to-frame coherence.
This work is most similar to that done by Badt, Adelson and Hodges. The primary distinction between this and earlier projects is that this framework provides an interactive environment. Interactivity imposes new constraints and adds issues that were not concerns in the previous works. Badt had access to the entire animation, so he could rely on making retroactive corrections to images as errors were detected in the course of rendering new frames. Also, in both Badt's work and Adelson and Hodges' refinements, large amounts of data are saved to disk for each frame. Using the disk as a storage device is not desirable in an interactive environment. Mobile objects must also be handled in a new manner. Mobile objects are not addressed in Adelson and Hodges' work while Badt's algorithms made assumptions based on the ability to access the entire animation.
The following sections describe the architecture proposed for the new hybrid rendering framework.
Computation is performed on an enhanced image buffer. In addition to color information, additional data is stored at each pixel location. Figure 2 enumerates the data that is tracked for each entry in the image buffer. This record provides information needed by the rendering process in order to efficiently reuse calculations from previous frames.
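Figure 2 is not reproduced here, but from the fields referenced throughout the paper (world coordinates, time-to-live counters, pixel types, deferred-shading data, and object identifiers), a plausible sketch of the per-pixel record is the following. The exact layout and field names are assumptions; the authoritative list is Figure 2:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PixelRecord:
    # Fields inferred from the text; the authoritative list is Figure 2.
    color: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    world: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # world-space hit point
    normal: Tuple[float, float, float] = (0.0, 0.0, 1.0)  # for deferred shading
    material: int = 0      # material type; may modify time-to-live
    object_id: int = 0     # identifies mobile objects
    ttl: int = 0           # time-to-live counter
    ptype: str = "empty"   # pixel type, e.g. "empty" or "deferred" (figure 3)
```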
In order to perform pixel reprojection, there must exist a source image buffer and a destination image buffer. The destination buffer is always assumed to be empty. The source buffer provides the information that will be remapped into the destination buffer. Since this is not an in-place mapping, frameless rendering techniques cannot be applied to stage one of the rendering process.
Viewpoint motion requires that every pixel in the current frame be changed. This is done by transforming the world coordinates of each pixel in the source buffer by the new viewport matrix. The result is a set of coordinates that either fall outside the destination buffer or can be reused.
Basic pixel reprojection simply maps each pixel in the source buffer to a single pixel positioned in the destination buffer. If this transformation results in multiple source pixels mapping to a single location, the pixel closest to the viewer will be displayed. This one-to-one mapping can leave a significant number of holes in the destination buffer. This problem is handled by explicitly ray tracing these areas during the second stage of rendering.
When a pixel is moved from the source buffer into the destination buffer, its time-to-live counter is decremented. A pixel with a zero time-to-live counter will not be reprojected, leaving a hole that stage two will fill.
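Putting the stage-one rules together, a minimal single-threaded sketch of the reprojection pass might read as follows. The dict-based buffers and function names are illustrative, not RRIG's actual data structures; `transform` stands in for the new viewport matrix applied to a pixel's world coordinates:

```python
def stage_one(source, dest, width, height, transform):
    # Reproject every live source pixel through the new viewport
    # transform. `transform` maps a world-space point to destination
    # image coordinates plus a depth value. On collisions the pixel
    # nearest the viewer wins; expired or out-of-frame pixels simply
    # leave holes for stage two to fill.
    for pix in source:
        if pix["ttl"] <= 0:
            continue                       # expired: hole for stage two
        u, v, depth = transform(pix["world"])
        if not (0 <= u < width and 0 <= v < height):
            continue                       # outside the destination buffer
        prior = dest.get((u, v))
        if prior is None or depth < prior["depth"]:
            dest[(u, v)] = dict(pix, ttl=pix["ttl"] - 1, depth=depth)
```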
Pixel movement due to object motion is a more complicated problem and is discussed in greater detail in section 4.2.
During this stage, each pixel in the destination buffer is examined. The function performed on each pixel varies depending on the pixel type. See figure 3 for a list of all possible pixel types. A pixel can be flagged empty either as a result of its time-to-live count expiring, or as a hole left from stage one. Every pixel is considered empty for the first frame. This provides the application with its initial base frame, from which all successive frames are then derived. When an empty pixel is filled, its time-to-live count is set to a user-defined base value plus a random jitter. This jitter spreads the cost of retracing pixels over several frames, so the frame rate does not suddenly drop a fixed number of frames after a substantial image update. The time-to-live count can additionally be modified according to material type or object identifier. This ability is helpful in rendering moving objects (see section 4.2 for more detail). Deferred pixels are pixels that require deferred shading in order to incorporate viewpoint-specific effects caused by specular or reflective surfaces. In this case, the surface normal and material types are passed to the shading routine, and the color for the pixel is recalculated. Although this is more expensive than simply reusing the original color, it is more cost effective than having to retrace the pixel.
Empty: Retrace pixel, set initial time-to-live
Deferred: Reshade pixel using deferred shading
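A sketch of the per-pixel dispatch in stage two, using the actions listed in figure 3. The base time-to-live and jitter values are illustrative, and `trace` and `shade` stand in for the ray tracer and the deferred shader:

```python
import random

BASE_TTL = 8     # user-defined base time-to-live (illustrative value)
MAX_JITTER = 4   # random jitter spreads retracing across frames

def refine(pixel, trace, shade):
    # Stage two: retrace empty pixels and reset their time-to-live
    # with jitter; reshade deferred pixels from the stored normal
    # and material rather than casting a new ray.
    if pixel["ptype"] == "empty":
        pixel["color"] = trace(pixel)
        pixel["ttl"] = BASE_TTL + random.randint(0, MAX_JITTER)
        pixel["ptype"] = "filled"
    elif pixel["ptype"] == "deferred":
        pixel["color"] = shade(pixel["normal"], pixel["material"])
    return pixel
```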
Once a tile has been processed, it may be optionally anti-aliased. Since anti-aliasing is a view-dependent process, it is only performed at display time, otherwise the pixel information reused between frames would contain greater error. To this end, the tile will be reduced in size and placed into a separate pending display buffer. This means that a larger image will be maintained internally than is displayed to the user, effectively performing super-sampling.
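One simple way to realize the reduction step is a 2x2 box filter over the internal super-sampled buffer. The paper does not specify the exact filter, so plain averaging is an assumption here:

```python
def downsample(image):
    # Average each 2x2 block of the internal super-sampled buffer,
    # halving both dimensions for display (e.g. 512x512 -> 256x256).
    out = []
    for y in range(0, len(image), 2):
        row = []
        for x in range(0, len(image[0]), 2):
            block = (image[y][x] + image[y][x + 1] +
                     image[y + 1][x] + image[y + 1][x + 1])
            row.append(block / 4.0)
        out.append(row)
    return out
```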
Mobile objects present an interesting challenge to the pixel reprojection module. Basic image reprojection techniques assume that the only motion in a scene is caused by movement of the viewer. This allows the transformation matrix to be applied equally to every pixel in the frame buffer. However, if a pixel belongs to a moving object, the transformation generates incorrect images. The simple solution is to force pixels derived from moving objects to refresh every frame. Alternatively it is possible to associate a unique transform matrix with each object and attach this information to the pixel record.
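The per-object-transform alternative can be sketched as follows, where each pixel record carries an object identifier and mobile objects supply their own world-space transforms (all names are illustrative):

```python
def warp_world_point(point, object_id, view_transform, object_transforms):
    # Apply the object's own motion first (if it is mobile), then the
    # shared viewpoint transform; pixels on static geometry see only
    # the viewpoint transform. `object_transforms` maps an object id
    # to a world-space transform function for the current frame.
    if object_id in object_transforms:
        point = object_transforms[object_id](point)
    return view_transform(point)
```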
A program adhering to the architecture outlined in the previous section has been implemented. This program was dubbed RRIG (Rapid, Realistic Image Generator). The following sections describe various design decisions, specific programming details and discoveries made along the way.
In order to perform pixel warping in a parallel fashion, RRIG divides the image buffer into n horizontal slices, where n is the number of worker threads. Each thread then performs the pixel reprojection as outlined in section 3.2. Since reprojection has a linear cost with respect to the number of pixels to be computed, each thread will perform roughly the same amount of work. This means that worker load balance is not a problem in stage one. Notice that each thread is writing to a common destination buffer. Although pixels reprojected within a thread are properly depth-ordered, pixels from different strips that map to the same location in the destination buffer are not guaranteed to resolve in the correct depth order. In practice, however, warped pixels are recalculated when their time-to-live counters expire, so this artifact is not a major issue.
The image refinement stage is primarily composed of ray tracing. Although ray tracing trivially parallelizes, the time to cast each individual ray can vary greatly. This makes load balancing a very real issue. RRIG approaches this problem by dividing the destination buffer computed by stage one into a series of m x n tiles, where m and n are user-defined run-time variables. For each frame, the master thread creates m*n tiles. These tiles are then processed in a linear fashion by the worker threads using the method outlined in section 4.1.2. Each worker obtains a set of tiles using one of two methods.
Initially, all threads will obtain init tiles, where init = (m*n) / (2w) and w is the number of worker threads, so the first round of grants covers half of the frame's tiles.
Upon each successive request, the worker will ask for half the number of tiles as the previous request, unless this would cause the worker to request zero tiles. In this case, the worker will request a single tile, and continue doing so until all the remaining tiles are exhausted. This method of load balancing allows dynamic requests without incurring a large amount of interprocess communication overhead.
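The sequence of request sizes one worker makes under this scheme can be sketched as follows. The initial grant of total/(2*workers) tiles is consistent with the one-eighth-of-the-screen-per-worker figure quoted for four workers in the next section:

```python
def request_sizes(total_tiles, workers):
    # Dynamic reduction: the first grant gives each worker
    # total/(2*workers) tiles, each later request asks for half the
    # previous amount, and requests bottom out at a single tile.
    sizes = [max(1, total_tiles // (2 * workers))]
    while sizes[-1] > 1:
        sizes.append(max(1, sizes[-1] // 2))
    return sizes
```

Front-loading large grants keeps communication overhead low early in the frame, while the shrinking tail lets slow tiles be redistributed near the end.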
Figure 4 shows a map of this algorithm in action. In this example, four worker threads are processing a 512x512 pixel display. The scene being processed is a close-up of the Mayan pyramid described later in section 6. The tiles are processed from the bottom of the figure to the top. Initially, each thread is allocated one-eighth of the screen. When finished, each worker requests a new block covering one-sixteenth of the screen, and so on. Notice that at the end, the scene is essentially being processed one tile at a time, but primarily only by two workers. The remaining workers are busy processing earlier, more time-consuming tiles.
When a worker has completed its allocated tile set, it peeks at the tile set of the worker next in the ring. If it finds uncompleted tiles in its neighbor's tile set, the worker ``steals'' half of the remaining tiles. However, if the adjacent worker has already completed its work, then the original worker continues to the next worker in the ring. If a worker travels completely around the ring without finding new work, it goes to sleep until the next frame.
A simple example of this process is outlined in Figure 5. In this figure, Worker A has just completed its tile set. It peeks at Worker B's load, but there are no tiles available. Worker A then examines Worker C's load and discovers that there are four available tiles. It then takes two of these tiles for itself.
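The peek-and-steal walk around the ring can be sketched as follows. The per-worker locking that actually distributes contention is omitted for brevity, and the queue representation is illustrative:

```python
def steal(worker_id, queues):
    # Walk the ring starting at the next worker; take half of the
    # first non-empty tile queue found, or give up (and sleep until
    # the next frame) after a full circuit.
    n = len(queues)
    for step in range(1, n):
        victim = queues[(worker_id + step) % n]
        if victim:
            take = max(1, len(victim) // 2)
            return [victim.pop() for _ in range(take)]
    return []
```

Tracing the Figure 5 example: worker A (id 0) skips B's empty queue, finds four tiles at C, and takes two of them.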
Figure 6 shows a map of the task stealing algorithm at work. This example is processing the identical scene outlined for Figure 4, and again is using four workers. Initially, each worker is allocated tiles composing one-quarter of the screen. This tile set is processed from the bottom to the top. As a worker finishes its tile set, it attempts to steal work from other workers. This can be seen most clearly in the area being processed by the worker delineated in black. Sections originally allocated to this worker are eventually processed by every worker in the scene.
Task-stealing provides a method of load balancing that is similar to other dynamic request methods. However, instead of requiring the workers to all lock a single master queue when requesting new work, the locks are distributed among the workers. This distinction becomes important when there are a large number of tiles in a scene.
Early on, the decision was made to keep world coordinates in the pixel record. The alternative is to use an inverse perspective matrix when performing pixel warping. Storing the world coordinates provides two major advantages: first, positional errors do not accumulate as rapidly when warping pixels; second, it allows the ray tracer to cast jittered rays into the scene, rather than relying on a ray passing through the center of each pixel. This reduces texture artifacts to some degree.
Originally, the color records were kept in a separate, contiguous memory location. This allowed the image to be easily copied directly from memory to the screen using a standard OpenGL call. Unfortunately, this incurred heavy cache thrashing during the rendering stages.
To reduce this problem, the color field was integrated back into the pixel record. To allow OpenGL to display the image, an additional marshalling step is now required. When a worker finishes refining a tile, it copies the color fields into a separate display buffer. Anti-aliasing, if desired, is performed during this marshalling process.
Extending the pixel reprojection stage with a splatting technique was explored. However, several issues arose that made splatting an awkward fit into the rendering architecture. The primary difficulty is that in order to reduce visual artifacts, alpha blending must be used to soften the edges of the splat. If performed in software, this would require a much greater overhead in the pixel mapping routines, since an ordered z-buffer would need to be kept. Alternatively, it would be possible to use OpenGL calls to perform the compositing in hardware. However, in order to generate the successive frame, RRIG needs to know the color of each pixel from the previous frame. While it is possible to read the OpenGL display buffer, it is very expensive, and would also require a further step of writing the color information to each pixel record. It was felt that the performance cost of splatting in stage one of the rendering pipeline would eliminate the benefits obtained for stage two.
In addition to displaying the moving objects themselves, RRIG must also be concerned with secondary visual effects, such as moving shadows. Doing this in an efficient manner is problematic. Figure 7 shows a shadow cast by a ball moving across a surface. The shadow in frame one consists of the regions labeled A and B. The shadow in frame two consists of the regions labeled B and C. It is fairly trivial to mark a pixel as ``in moving shadow'' when it is initially traced. When the object advances, pixels marked as being in shadow can be recalculated using deferred shading. This will correctly remove region A from beneath the shadow. The difficulty arises in correctly updating region C. In order to do this, RRIG must somehow determine that pixels in region C need to be updated. This becomes quite problematic. A few possible solutions might be to construct shadow volumes (eliminating many of the advantages that ray tracing lends to rendering), to tag every surface upon which any moving shadow falls as ``dirty'' (giving incorrect results if the shadow crosses surfaces), or simply to lower the time-to-live value of every pixel in the scene. RRIG currently handles this situation by ignoring shadows cast by moving objects.
A number of experiments were run to determine the performance of RRIG. These experiments were run on a lightly loaded machine with 32 250 MHz MIPS R10000 (IP27) processors and 8192 megabytes of main memory. Run times were measured in wall-clock time and converted into a measure of frames per second.
The test scene consisted of a Mayan pyramid with two L-system generated trees. The static scene was composed of 80,991 triangles. Added to this was a 4032-triangle teapot which moved along a linear path on top of the pyramid. The pyramid pillars, comprising the majority of the geometry in the scene, were shaded using a marble solid texture based on a Perlin noise model. The base of the pyramid was shaded using a solid checker pattern. Illumination for the scene was provided by two shadow-casting point light sources.
Although RRIG is usually used dynamically, with the user controlling the camera with the mouse, it can be used in a batch manner for generating specific timings. A camera path exploring the scene was recorded while RRIG utilized 24 workers (25 processors total). This path consisted of 135 data points. This path was then rendered, varying the parameters of frame size, load-balancing method, number of processors and anti-aliasing. Results were measured for 2, 4, 8, 16 and 24 worker threads, which were migrated to separate processors. Each experiment also utilized an additional processor for the master thread. When rendering in batch mode, RRIG does not use future knowledge (i.e. future camera and object positions), but continues to run dynamically, simply reading input from a file rather than from the mouse. When rendering, RRIG attempts to normalize the path. This causes experiments with slower rendering times than the recorded path to skip frames. So, for instance, running the scene with only two processors may cause RRIG to display only 20 frames along the path rather than each of the possible 135 frames.
Table 1 outlines the results using a frame size of 512x512 pixels and the dynamic reduction load-balancing scheme. The display size of the anti-aliased run is 256x256. Internally, RRIG computes a 512x512 display, then reduces this in size to perform smoothing.
Workers | Image Warp | Ray Trace | Oversampled
Table 2 summarizes the results using the same parameters as Table 1, but with task stealing as the load-balancing scheme.
Workers | Image Warp | Ray Trace | Oversampled
Table 3 outlines the results using a frame size of 768x768 pixels and the dynamic reduction load-balancing scheme. Again, the reduction step performed by the anti-aliasing routine results in a display size of 384x384 for the anti-aliased run.
Workers | Image Warp | Ray Trace | Oversampled
Finally, Table 4 outlines the results using a frame size of 768x768 pixels and the task stealing load-balancing scheme.
Workers | Image Warp | Ray Trace | Oversampled
Figure 8 records the average amount of time spent by each worker in the program subsections. The timings were recorded using the same camera path as the experiments detailed above. The scene was rendered using a 512x512 pixel display, no anti-aliasing, and dynamic reduction as the load balancing scheme.
There are some interesting trends that can be drawn from the data. First of all, it is clear that load balancing via task stealing is more efficient than via dynamic reduction. This is particularly true when there are a larger number of processors. This is due to the use of a single locked queue in the latter method versus locks distributed among the workers in the former.
It is interesting to note that the anti-aliased (oversampled) frames are often faster than their unprocessed counterparts. This is particularly evident at the 768x768 resolution. This may be a result of having the master thread update the display, forcing it to complete the OpenGL display call before generating new work.
Although the augmented ray tracing engine always outperformed the straight ray tracing engine, it did not scale as well. The reason for this can be seen in figure 8. Notice that although the image refinement stage (composed mostly of ray tracing new pixels) scaled fairly linearly, the pixel warping stage did not. In fact, it took almost as long to remap the pixel buffer with 8 processors as it did with 24 processors. This effect is significant enough that during development of RRIG, when run on 60 195 MHz MIPS R10000 nodes, the straight ray tracing engine outperformed the augmented engine by up to 47 percent. Clearly, in order to improve performance when using large numbers of processors, the pixel warping stage needs to be reexamined.
Although not clearly visible in the data, subjectively RRIG has a better immediate response time than ray tracing alone. When users are actively running the system, the pixel reprojection routines tend to provide a more sensitive feel. This is because, in a local area with small movements, the ray tracing engine running alone will require approximately the same amount of time to render each frame. However, in the same circumstance, RRIG can generate a displayable image much faster than having to retrace the entire scene.
As it stands, the current implementation of RRIG performs most efficiently when it utilizes 4 to 8 processors. Multiprocessor desktop machines are becoming increasingly popular. With the recent introduction of Intel's quad-Xeon chipset and AMD's multiprocessor Athlon architecture, it is quite possible that the RRIG rendering framework could be on the desktop within nine to twelve months.
Another approach to increasing RRIG's scalability might be to extend McMillan's list-priority mapping technique. McMillan's method guarantees that pixels will arrive in back-to-front order by processing pixels in a specific order without having to perform a Z-buffer test. Parallelizing this algorithm requires classifying the image buffer into two distinct regions. The first class contains all areas that can be processed in an arbitrary order. The second class contains potential ``overlap areas'' and must be processed after the first class regions are completed. In order to classify the second class regions, the maximum vertical distance d by which a given pixel can be displaced must be determined. This can be found by reprojecting either the pixel with the minimum range value in the source buffer or a pixel on the hither clipping plane, and finding the difference from its original position. Once d is known, the overlap areas can be found at the borders of adjoining tiles as shown in figure 9. Figure 9 shows four workers dividing up the source buffer. First, each worker processes the class one regions (white and red) using McMillan's method. After every worker has completed this step, class two regions (yellow) are processed, resulting in potential overlap with the red areas. This algorithm assumes that d is smaller than the height of the class one tiles. If this is not the case, the algorithm should either reduce the number of worker threads (increasing the size of the tile), or utilize a different algorithm.
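The bound d can be sketched as follows. This uses the minimum-depth pixel variant; the hither-plane alternative would substitute a synthetic pixel at the near-plane depth, and all names here are illustrative:

```python
def max_vertical_displacement(pixels, warp):
    # Conservative bound on vertical pixel motion: reproject the
    # source pixel with the minimum depth and measure how far it
    # moves vertically; no pixel farther from the viewer can move a
    # greater vertical distance under the same transform.
    nearest = min(pixels, key=lambda p: p["depth"])
    _, v_new = warp(nearest["u"], nearest["v"], nearest["depth"])
    return abs(v_new - nearest["v"])
```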
A second major area for future work is the efficient incorporation of secondary visual effects from moving objects. One possible method might be to track regions likely to require updates. This would require an image processing stage that notes pixels that are no longer in shadow or no longer reflect a moving object. These pixels could then be used as a base to predict which new areas may be affected. Determining a viable method of performing this task in a real-time system promises to be an interesting area of exploration.
There are several other avenues for investigation that would have a lesser impact on system performance. One of these is an examination of the effects of tile size during image refinement. The current tile size was determined in an ad hoc manner, and a more rigorous method may result in a slight performance increase. Another parameter to test is a pixel's time-to-live value. Increasing this value should result in a potential performance gain, but at the cost of image accuracy. Finally, augmenting the enriched buffer with a limited layered depth image of 2-3 layers may improve responsiveness but incur a much higher expense in memory and frame calculation time.