ECE/CS 752 Spring 2008
Homework 4

Due: Monday, May 5 or Wednesday, May 7, 2008, at the beginning of lecture. The homework is due on whichever of these two days you are not presenting your project.

All work must be done individually. No late assignments will be accepted.

Problem 1

Solve problem 11.1 in Chapter 11 of the Shen/Lipasti textbook.

Problem 2

Solve problem 11.8 from Chapter 11 of the Shen/Lipasti textbook.

Problem 3

Answer problem 11.10 from Chapter 11 of the Shen/Lipasti textbook.

Problem 4

Using the results from HW3/Problem 4a, compute the dynamic energy per data access for each of the L1 data caches simulated. Assume that all L1 misses are satisfied from a 1MB 8-way set-associative L2 cache with 64B blocks (that is, assume the L2 never misses). Obtain power estimates for each of the caches (including the L2) using the online interface to Cacti 5.1 at http://quid.hpl.hp.com:9082/cacti/index.y?new, assuming 70nm technology and a single bank.

Identify the lowest-dynamic-energy L1 cache option of the ones studied.

Comment on the performance of this option, and on what would happen if you also considered static (leakage) power, based on the leakage estimates from Cacti 5.1.
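The energy calculation in Problem 4 can be sketched in C as follows. This is only one possible model, not a prescribed solution: it assumes every data access pays the L1 dynamic read energy and every L1 miss additionally pays one L2 access, and it ignores fill and writeback energy. The function names and the nJ units are hypothetical; the inputs would come from your HW3 miss rates and from the Cacti 5.1 output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: average dynamic energy per L1 data access.
 * Assumed model: every access reads the L1, and each L1 miss adds
 * one L2 access (the L2 never misses). Units are whatever Cacti
 * reports, e.g. nJ per access. */
double energy_per_access(double e_l1, double miss_rate, double e_l2)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return e_l1 + miss_rate * e_l2;
}

/* Hypothetical helper: index of the L1 configuration with the
 * smallest per-access energy. Requires n > 0. */
size_t lowest_energy_index(const double *energy, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (energy[i] < energy[best])
            best = i;
    }
    return best;
}
```

With placeholder inputs (not Cacti results) of 1.0 for the L1, a miss rate of 0.5, and 4.0 for the L2, energy_per_access returns 3.0. A larger L1 typically lowers the miss-rate term while raising the L1 term, which is the tradeoff the problem asks you to examine.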

Problem 5

This problem uses the Niagara (Sun T2000) system niagara.ece.wisc.edu, which you can access using ssh. Those of you with pre-existing ECE domain accounts can use those to log in to the machine (nere, bbuchli, seemuth, mgilchr). Otherwise, I have created an account for you that matches your @wisc.edu email userid. The initial password is your campus ID number.

Your assignment is to understand the effects of threads that share a cache on the Niagara machine. To get started, look at the vectorsum.c program in /home/mikko/hw4. This program spawns four threads and binds them to the processor numbers you specify on the command line (0-31). Processors 0-3 are threads on physical processor 0, 4-7 on physical processor 1, and so on. Each thread then sums all the entries of a 1024-entry array, repeating this 100 times. The program uses the TICK() macro to count execution cycles for the interval during which the loop executes.
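The processor numbering described above can be captured in a small helper when you plan which bindings to try. This is a minimal sketch, assuming only the mapping stated in this handout (32 processor numbers, four consecutive numbers per physical processor); the names core_of and shares_core are hypothetical and do not appear in vectorsum.c.

```c
#include <assert.h>

#define THREADS_PER_CORE 4   /* processors 0-3 share physical processor 0 */
#define NUM_VCPUS        32  /* valid processor numbers are 0-31 */

/* Physical processor that a given processor number belongs to. */
int core_of(int cpu)
{
    assert(cpu >= 0 && cpu < NUM_VCPUS);
    return cpu / THREADS_PER_CORE;
}

/* Nonzero if two processor numbers are threads on the same
 * physical processor. */
int shares_core(int a, int b)
{
    return core_of(a) == core_of(b);
}
```

For example, processor numbers 4 and 7 map to physical processor 1 and so share it, while 3 and 4 fall on different physical processors.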

a. Report the number of cycles needed to execute each thread for various combinations of processor bindings (all threads on the same processor, each on its own processor, two threads sharing a processor).

b. Use the cputrack command for collecting performance counter data (man cputrack to learn more about it) to understand what is causing the variation in performance for each configuration from part (a). Explain your findings.

c. Modify the program to vary the size of the array and the number of threads to see if you can uncover any additional interesting interactions between threads sharing the machine.