
Performance Lab

This performance lab is based on the one by Bryant and O’Hallaron for Computer Systems: A Programmer’s Perspective, Third Edition.

Due: Wednesday, October 5, 11:59pm

This assignment deals with optimizing memory-intensive code. Image processing offers many examples of functions that can benefit from optimization. In this lab, we will consider two image processing operations: pinwheel, which rotates each quadrant of an image counter-clockwise by 90 degrees, and motion, which “blurs” an image to simulate motion toward the top-left of the image.

These instructions are long, but reaching the threshold results required for full credit may not take much time. How much further clever optimizations can go is anyone’s guess.

Image Operations

For this lab, we will consider an image to be represented as a two-dimensional matrix M, where M_{i,j} denotes the value of the (i,j)th pixel of M. Pixel values are triples of red, green, and blue (RGB) values. We will only consider square images. Let N denote the number of rows (or columns) of an image. Rows and columns are numbered, in C-style, from 0 to N-1.

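Given this representation, the pinwheel operation in the first quadrant can be implemented quite simply as the combination of the following two matrix operations on that quadrant:

Transpose: for each (i,j) pair, M_{i,j} and M_{j,i} are interchanged.

Exchange rows: row i is exchanged with row N/2-1-i.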

For example, applying pinwheel to [figure: original image] produces [figure: image with each quadrant rotated].

The motion operation is implemented by replacing every pixel value with a combination of nine pixels: the pixels that form a 3x3 block with the target pixel in the top left. Pixels in the source image are weighted as follows:

           j      j+1    j+2
  i        1/2    1/32   0
  i+1      1/32   1/4    1/32
  i+2      0      1/32   1/8

That is, the new value of M_{i,j} is computed as

\[
\frac{M_{i,j}}{2} + \frac{M_{i+1,j+1}}{4} + \frac{M_{i+2,j+2}}{8} + \frac{M_{i,j+1}}{32} + \frac{M_{i+1,j}}{32} + \frac{M_{i+1,j+2}}{32} + \frac{M_{i+2,j+1}}{32}
\]

For example, applying motion to [figure: original image] produces [figure: motion-blurred image].

Setup

Start by copying perflab-handout.zip to a protected directory in which you plan to do your work. Then, run the command:

  $ unzip perflab-handout.zip

This will cause a number of files to be unpacked into the directory. The only file you will be modifying and handing in is "kernels.c". The "driver.c" program is a driver program that allows you to evaluate the performance of your solutions. Use the command make driver to generate the driver code and run it with the command ./driver.

Looking at the file "kernels.c" you’ll notice a C structure student into which you should insert the requested identifying information about yourself. Do this right away so you don’t forget.

Data Structures

The core data structure deals with image representation. A pixel is a struct as shown below:

  typedef struct {
     unsigned short red;   /* R value */
     unsigned short green; /* G value */
     unsigned short blue;  /* B value */
  } pixel;

As can be seen, RGB values have 16-bit representations (“16-bit color”). An image I is represented as a one-dimensional array of pixels, where the (i,j)th pixel is I[RIDX(i,j,n)], where n is the dimension of the image matrix, and RIDX is a macro defined as follows:

  #define RIDX(i,j,n) ((i)*(n)+(j))

See the file "defs.h" for this code.

Pinwheel

The following C function computes the result of pinwheeling the source image src and stores the result in destination image dest. dim is the dimension of the image.

  void naive_pinwheel(int dim, pixel *src, pixel *dest) {
    int i, j;

    /* top-left quadrant */
    for (i = 0; i < dim/2; i++)
      for (j = 0; j < dim/2; j++)
        dest[RIDX(dim/2-1-j, i, dim)] = src[RIDX(i, j, dim)];

    /* top-right quadrant */
    for (i = 0; i < dim/2; i++)
      for (j = 0; j < dim/2; j++)
        dest[RIDX(dim/2-1-j, dim/2+i, dim)] = src[RIDX(i, dim/2+j, dim)];

    /* bottom-left quadrant */
    for (i = 0; i < dim/2; i++)
      for (j = 0; j < dim/2; j++)
        dest[RIDX(dim-1-j, i, dim)] = src[RIDX(dim/2+i, j, dim)];

    /* bottom-right quadrant */
    for (i = 0; i < dim/2; i++)
      for (j = 0; j < dim/2; j++)
        dest[RIDX(dim-1-j, dim/2+i, dim)] = src[RIDX(dim/2+i, dim/2+j, dim)];
  }

The above code scans the rows of the source image matrix, copying to the columns of the destination image matrix. Your task is to rewrite this code to make it run as fast as possible using techniques like code motion, loop unrolling and blocking.

See the file "kernels.c" for this code.

Motion

The motion-blurring function takes as input a source image src and returns the blurred result in the destination image dst. Here is part of an implementation:

  void naive_motion(int dim, pixel *src, pixel *dst) {
    int i, j;

    for (i = 0; i < dim; i++)
      for (j = 0; j < dim; j++)
        dst[RIDX(i, j, dim)] = weighted_combo(dim, i, j, src);
  }

The function weighted_combo performs the weighted combination of the pixels around the (i,j)th pixel. Your task is to optimize motion (and weighted_combo) to run as fast as possible. (Note: The function weighted_combo is a local function and you can get rid of it altogether to implement motion in some other way.)

This code (and an implementation of weighted_combo) is in the file "kernels.c".

Performance measures

Our main performance measure is CPE, or cycles per element. If a function takes C cycles to run for an image of size N×N, the CPE value is C/N^2. When you build and run driver, its output shows CPE results for 5 different values of N. The baseline measurements were made on a CADE lab1-n machine.

The ratios (speedups) of the optimized implementation over the naive one will constitute the score for your implementation. To summarize the overall effect over different values of N, we will compute the geometric mean of the results for these 5 values.
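Written out, the scoring arithmetic looks roughly like the following. The symbol S_k and the subscripts are notation introduced here for illustration, and the direction of the ratio (baseline CPE over your CPE) is inferred from the sample driver output on this page, where a CPE lower than the baseline yields a speedup above 1.

```latex
\mathrm{CPE} = \frac{C}{N^2}, \qquad
S_k = \frac{\mathrm{CPE}^{\text{baseline}}_k}{\mathrm{CPE}^{\text{yours}}_k}, \qquad
\text{mean speedup} = \Bigl(\prod_{k=1}^{5} S_k\Bigr)^{1/5}
```

For instance, per-size speedups of 1.2, 0.8, 0.9, 1.0, and 1.0 have product 0.864, and the fifth root of 0.864 is about 0.97. Because the geometric mean multiplies the ratios, a slowdown at one size pulls the score down even when other sizes improve.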

Assumptions

To make life easier, you can assume that N is a multiple of 32. Your code must run correctly for all such values of N but we will measure its performance only for the 5 values reported by driver.

Infrastructure

We have provided support code to help you test the correctness of your implementations and measure their performance. This section describes how to use this infrastructure. The exact details of each part of the assignment are described in the following section.

Note: The only source file you will be modifying is "kernels.c".

Versioning

You will be writing many versions of the pinwheel and motion routines. To help you compare the performance of all the different versions you’ve written, we provide a way of “registering” functions.

For example, the file "kernels.c" that we have provided you contains the following function:

  void register_pinwheel_functions() {
     add_pinwheel_function(&pinwheel, pinwheel_descr);
  }

This function contains one or more calls to add_pinwheel_function. In the above example, add_pinwheel_function registers the function pinwheel along with a string pinwheel_descr which is an ASCII description of what the function does. See the file "kernels.c" to see how to create the string descriptions. This string can be at most 256 characters long.

A similar function for your motion kernels is provided in the file "kernels.c".

Driver

The source code you will write will be linked with object code that we supply into a driver binary. To create this binary, you will need to execute the command

  $ make driver

You will need to re-make driver each time you change the code in "kernels.c".

To test your implementations, you can then run the command:

  $ ./driver

The driver can be run in four different modes:

If run without any arguments, driver will run all of your versions (default mode). Other modes and options can be specified by command-line arguments to driver, as listed below:

Optimizing Pinwheel (50 points)

In this part, you will optimize pinwheel to achieve as low a CPE as possible. You should compile driver and then run it with the appropriate arguments to test your implementations.

For example, running driver with the supplied naive version (for pinwheel) generates the output shown below:

  $ ./driver
  Name: Harry Q. Bovik
  Email: bovik@nowhere.edu

  Pinwheel: Version = naive_pinwheel: Naive baseline implementation:
  Dim             64      128     256     512     1024    Mean
  Your CPEs       3.0     3.4     6.9     10.4    13.2
  Baseline CPEs   3.7     2.9     6.3     10.4    13.3
  Speedup         1.2     0.8     0.9     1.0     1.0     1.0

Some advice: Modern compilers on modern processors do a good job with even the “naive” variant of pinwheel. Still, you should be able to make some improvement to the original implementation.

Optimizing Motion (50 points)

In this part, you will optimize motion to achieve as low a CPE as possible.

For example, running driver with the supplied naive version (for motion) generates the output shown below:

  $ ./driver

  Motion: Version = naive_motion: Naive baseline implementation:
  Dim             32      64      128     256     512     Mean
  Your CPEs       188.5   194.3   197.0   198.4   199.2
  Baseline CPEs   188.0   194.0   197.0   198.0   199.0
  Speedup         1.0     1.0     1.0     1.0     1.0     1.0

Some advice: A human reasoning about the motion code can arrive at a much bigger speedup compared to a modern compiler on a modern processor. The threshold for full credit in this case is much lower than you should be able to achieve.

Coding Rules

You may write any code you want, as long as it satisfies the following:

You can only modify code in "kernels.c". You are allowed to define macros, additional global variables, and other procedures in this file.

Evaluation

Your solutions for pinwheel and motion will each count for 50% of your grade. The score for each will be based on the following:

Hand In Instructions

When you have completed the lab, you will hand in one file, "kernels.c", that contains your solution. Use Canvas to hand in your work.

Good luck!