Homework 1: Simple Performance Measurements

This assignment will make you more familiar with how to perform simple performance measurements on x86 machines. You should do this assignment on the openlab machines.

NOTE: YOU CANNOT PUBLICLY RELEASE SOLUTIONS TO THIS HOMEWORK. It's ok to show your work to your future employer as a private github/gitlab repo, however any public release is prohibited.

Download main.c and look it over. This is a skeleton for a simple UNIX program.

To compile main.c, you need a C compiler, such as gcc. On Openlab machines, you can compile the skeleton with the following command:

 $ gcc main.c -o perf-hw1 
This will produce a perf-hw1 executable, which you can run:
 $ ./perf-hw1 
In the rest of this part of the assignment, you will explore how to measure the latency of individual parts of your program.

Measuring latency of individual instructions with RDTSC

In general, measuring the performance of individual instructions is hard. Start this homework by reading the following white paper from Intel: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures.

Preparing measurements

Look over the main.c file you downloaded. It contains the definition of the rdtsc function that we will use to access the time-stamp counter (TSC). You can compile this code with the following command:
gcc -O2 -g main.c -o perf-hw1
Here we're telling gcc to use the -O2 optimization level and to generate debug information (-g) so we can look at the source code while stepping through the program with GDB.
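A function like rdtsc is typically a thin wrapper around the rdtsc instruction. A minimal sketch is shown below; this is only an illustration and may not match main.c exactly (the Intel white paper, for instance, recommends serializing variants built around cpuid and rdtscp):

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    /* RDTSC returns the 64-bit time-stamp counter split across EDX:EAX;
     * reassemble it into a single 64-bit value. */
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}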

Warmup: Function invocations

Now imagine we would like to know how many cycles it takes to perform a simple function invocation, e.g.
 c = foo(a, b); 
This is a bit hard to measure because the precision of the rdtsc instruction does not allow us to measure the cost of a single function invocation. An obvious strategy is to perform the invocation in a loop for N iterations and then divide the total by N. Let's introduce a simple helper function that does this for us:
int __attribute__ ((noinline)) foo(int a, int b) {
    return a + b;
}

int test_func_invocation() {
    uint64_t start;
    uint64_t end;
    volatile int ret = 0;

    start = rdtsc();
    for (int i = 0; i < runs; i++) {
        ret = foo(ret, i); 
    }   
    end = rdtsc();

    printf("func invocation cycles:%lu\n", (end - start)/runs); 
    return 0;

}
Here we take TSC measurements before and after the loop and report the total number of cycles divided by the number of iterations (runs). There are a couple of unusual things: 1) the __attribute__ ((noinline)) is required to prevent the compiler from inlining foo and optimizing the call away, and 2) ret is declared volatile, again to prevent the compiler from optimizing out the function invocation entirely.

Don't forget to invoke this helper function from main (a minimal sketch of main follows the command below). Now you can compile the code and run it again:

./perf-hw1
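If you are unsure how to wire everything together, a minimal sketch of main might look like the following. The runs constant is an assumption here (judging by the objdump output later in this section it is 100000000), and your skeleton may already define it:

/* runs must be visible to the helper functions that use it, so declare it
 * before them; 100000000 is an assumed value based on the objdump output. */
static const int runs = 100000000;

int main(void) {
    test_func_invocation();
    return 0;
}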
Let's run it a couple of times. The number of cycles on my machine stays the same, which means our measurement is reasonably stable.

Let's take a look at what code is actually running. You can examine the generated code either with GDB or with objdump.

Let's first use GDB. You can start it like this:

gdb ./perf-hw1
If you have never used GDB before, I suggest you quickly look over this homework from the CS238P Operating Systems class: GDB Homework

You can set a breakpoint on the function you would like to explore, e.g.,

(gdb) b test_func_invocation
And run the program with
(gdb) r
Now switch to the split layout with
(gdb) layout split
You can see both the source code and the generated assembly. You can single-step individual instructions with the "step instruction" command, or si:
(gdb) si
If you don't want to step into foo each time, you can use the "next instruction" (ni) command.

Alternatively, you can look at the generated code with the objdump tool, which you can invoke like this:

objdump -M intel -d perf-hw1
You can search for the test_func_invocation function and locate the main body of the loop we're measuring:
     8b0:       8b 7c 24 0c             mov    edi,DWORD PTR [rsp+0xc]
     8b4:       89 d6                   mov    esi,edx
     8b6:       e8 95 ff ff ff          call   850 
     8bb:       83 c2 01                add    edx,0x1
     8be:       89 44 24 0c             mov    DWORD PTR [rsp+0xc],eax
     8c2:       81 fa 00 e1 f5 05       cmp    edx,0x5f5e100
     8c8:       75 e6                   jne    8b0 
The loop body first loads ret, which lives on the stack, into the register that carries the first function argument (the calling convention passes the first argument in edi and the second in esi). It then moves the value of i into the second argument register and calls foo with call. After the call, the code increments i, which is kept in edx, and stores the result, returned in eax (again per the calling convention), back into ret's slot on the stack. Finally, it compares i with 0x5f5e100 (hex for 100000000) and, if they are not equal, repeats the loop. What to submit: take a look at the provided template and fill in the cycles you measured.

Addition instruction

Now let's try to measure the cost of executing a simple addition instruction. Let's create another helper function:
int test_add() {
    uint64_t start;
    uint64_t end;
    volatile int ret = 0;

    start = rdtsc();
    for (int i = 0; i < runs; i++) {
        ret = ret + i;
    }
    end = rdtsc();

    printf("add cycles:%lu\n", (end - start)/runs);
    return ret;
}
If we compile and run this measurement, we will see that the addition is measured to take 3 cycles. This is higher than expected: recall that when we looked at Agner Fog's instruction tables, add was listed at 1 cycle, and our pipeline should be able to execute an addition in one cycle.

It looks like the addition is not measured correctly due to the high overhead of maintaining the loop. Let's come up with a better way to measure the overhead of individual instructions. We'll introduce a macro that allows us to execute something 10000 times:

#define OP(x) { x; }

#define OP10(x)   OP(x) OP(x) OP(x) OP(x) \
                 OP(x) OP(x) OP(x) OP(x) \
                 OP(x) OP(x)

#define OP100(x)   OP10(x) OP10(x) OP10(x) OP10(x) \
                  OP10(x) OP10(x) OP10(x) OP10(x) \
                  OP10(x) OP10(x)

#define OP1000(x)   OP100(x) OP100(x) OP100(x) OP100(x) \
                   OP100(x) OP100(x) OP100(x) OP100(x) \
                   OP100(x) OP100(x)

#define OP10000(x)   OP1000(x) OP1000(x) OP1000(x) OP1000(x) \
                    OP1000(x) OP1000(x) OP1000(x) OP1000(x) \
                    OP1000(x) OP1000(x)
Here the OP10000(x) macro expands recursively, generating 10000 instances of x. Now we add the following helper function:
int test_add_op() {
    uint64_t start;
    uint64_t end;
    volatile int ret = 0;

    start = rdtsc();

    asm volatile ("":::"%rax");
    OP10000({asm volatile ("add %rbx, %rax");});
    end = rdtsc();

    printf("add op cycles:%lu\n", (end - start));
    return ret;
}
Now we can measure how many cycles it takes to execute the add instruction 10000 times. Here we're using inline assembly to generate the specific instruction we would like to execute. The instruction is written in AT&T syntax, where the destination comes last, so add %rbx, %rax adds the %rbx register to %rax.

Note that this assembly instruction clobbers the %rax register and might break the surrounding code if %rax contains something useful.

To handle this, I add asm volatile ("":::"%rax");, telling the compiler that this empty assembly sequence clobbers %rax. The compiler obeys and generates the surrounding code so that %rax is either not used or saved on the stack. (Strictly speaking there are no guarantees, but in practice this trick works.)
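An alternative (a suggestion of mine, not part of the provided code) is to declare the clobbers on the add instruction itself using extended asm, which removes the need for the separate empty asm statement:

/* Same measurement, but the clobbers (%rax and the flags) are declared on
 * the instruction itself. Note the doubled %%: in extended asm a single %
 * introduces an operand, so literal register names are written as %%rax. */
OP10000({asm volatile ("add %%rbx, %%rax" : : : "rax", "cc");});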

Note that if you invoke the program several times, the measured number keeps changing. The CPU constantly adjusts its frequency to optimize power consumption. If you're using bash, you can run the program multiple times with the following shell command:

 for i in {1..15}; do ./perf-hw1; done 
You can watch the frequency of your CPU cores change with the following command:
 watch "cat /proc/cpuinfo | grep MHz"
Let's also add a little routine that warms up the core before we take the measurement, pushing the core to its highest available frequency:
void warmup() {

    for(unsigned long i = 0; i < 1000; i++) {
        asm volatile ("":::"%rax");
        OP10000({asm volatile ("add %rbx, %rax");});
    }
    return;
}
This warmup routine takes roughly 6M cycles and is enough to warm up the core. Run this routine before taking your measurement.

Finally, the kernel can also migrate your program between cores as it runs. To prevent this, we can pin the program to a specific core, e.g., core 3 (please pick one at random on the openlab machines), with the following command:

for i in {1..15}; do taskset -c 3 ./perf-hw1; done
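If you prefer, you can also pin the program from inside the code with sched_setaffinity. This is an optional alternative to taskset, not something the assignment requires; a sketch under that assumption:

/* _GNU_SOURCE must be defined before any header is included. */
#define _GNU_SOURCE
#include <sched.h>

/* Optional alternative to taskset: pin the calling process to the given core.
 * Call this at the start of main, before warmup() and the measurements. */
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);
}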
Now the question is what to report. Clearly some measurements are outliers, and the system reaches a stable state only after some time. Let's report the average of the last five measurements. What to submit: take a look at the provided template and fill in the cycles you measured with both techniques, along with short explanations of the assembly code generated by each method.

More measurements

Now that you understand the basics of taking performance measurements, your task is to conduct similar measurements for the following (create helper functions similar to the ones above and fill in the same things in the template: explanations of the assembly code and the performance measurements):
  1. Division
  2. Multiplication
  3. Malloc (loop method only, no asm)
  4. Copy from one array of 256 characters into another using a loop (loop method only, no asm)
  5. memcpy() of 256 bytes (loop method only, no asm)

Hints

The compiler might try to optimize out your memory allocations or memory copies. Try touching the values after you take the measurements by, for example, reading each element and adding it to the value you return from the function.
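To make this concrete, here is a rough sketch of the shape such a helper might take for the memcpy measurement (item 5). It is only a starting point, not a complete solution: the buffer names and sizes are illustrative, and you should check the generated assembly to confirm the copy actually stays inside the loop:

#include <string.h>

int test_memcpy() {
    uint64_t start;
    uint64_t end;
    char src[256];
    char dst[256];
    volatile int ret = 0;

    memset(src, 1, sizeof(src));

    start = rdtsc();
    for (int i = 0; i < runs; i++) {
        memcpy(dst, src, sizeof(src));
    }
    end = rdtsc();

    /* Touch the copied bytes after the measurement so the compiler cannot
     * simply discard the copies as dead code. */
    for (int i = 0; i < 256; i++)
        ret += dst[i];

    printf("memcpy cycles:%lu\n", (end - start)/runs);
    return ret;
}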

Submit your work

Submit your solution through Gradescope (CS250P Computer Architecture). You need to submit main.c and template.md. You can resubmit as many times as you wish.

Updated: September 2021