Difference between revisions of "User:Barre/ITK Registration Optimization"

From NAMIC Wiki
===Quick Links===
* [http://goog-perftools.sourceforge.net/ Google Performance Tools]
* [http://kcachegrind.sourceforge.net/cgi-bin/show.cgi kcachegrind], [http://docs.kde.org/development/en/kdesdk/kcachegrind/index.html The KCachegrind Handbook], [http://brent.izolo.com/blog/?p=4 Running Kcachegrind on Mac OSX 10.4]
* Also check my [http://www.google.com/notebook/public/14106771154920524977/BDRnWSgoQ_Nq_jY0i BW-NAC Google Notebook] for fresher links.

Revision as of 17:20, 22 February 2007

ITK Registration Optimization (BW NAC) Project

My (Sebastien Barre) notes so far. Once the dust settles, the relevant sections will be moved to the project pages listed below.

Project

The ultimate specific goal is B-Spline registration optimization for Linux and Windows on multi-core and multi-processor shared-memory machines. [...] Also, set up tools and a reporting mechanism so that ITK speed can be monitored and reported by us and others. BWH is the driving force behind this work.

Contacts

Quick Links

Source Code

Testing Data

Suggested Benchmarks

Per Stephen's suggestion:

time LinearInterpolate checker10_5.mha 2 2 2 res.mha
time LinearInterpolate checker10_5.mha 4 4 4 res.mha
time LinearInterpolate checker100_50.mha 2 2 2 res.mha
time LinearInterpolate checker100_50.mha 4 4 4 res.mha
time LinearInterpolate checker1000_500.mha 2 2 2 res.mha
time LinearInterpolate checker1000_500.mha 4 4 4 res.mha

The idea is to create a BatchMake script that runs our tests and submits the results to a central database. The tests can then run nightly on select machines to monitor our progress.
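As a rough sketch of what each benchmark run would need to record before results are submitted, the helper below times a callable several times and keeps the minimum and median wall-clock durations. The names `TimingResult` and `TimeRepeated` are hypothetical and are not part of BatchMake or the BWHItkOptimization repository; a C++11 compiler providing `std::chrono::steady_clock` is assumed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical summary of repeated timings, in seconds.
struct TimingResult
{
  double minSeconds;
  double medianSeconds;
  std::size_t runs;
};

// Run `work` `runs` times and summarize the wall-clock durations.
inline TimingResult TimeRepeated(const std::function<void()>& work, std::size_t runs)
{
  if (runs == 0)
  {
    throw std::invalid_argument("runs must be greater than zero");
  }
  std::vector<double> samples;
  samples.reserve(runs);
  for (std::size_t i = 0; i < runs; ++i)
  {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(std::chrono::duration<double>(stop - start).count());
  }
  std::sort(samples.begin(), samples.end());
  // For an even number of samples this picks the upper of the two middle values.
  return TimingResult{ samples.front(), samples[samples.size() / 2], runs };
}
```

The callable could, for instance, wrap a `std::system()` call to one of the LinearInterpolate commands above, in which case process start-up cost is included in every sample. Reporting the minimum alongside the median is one possible choice for damping noise from other processes on the test machine.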

More (Stephen):

linear interpolation/resampling
b-spline interpolation/resampling
metric evaluation
b-spline gradient calculation
optimization using linear
optimization using b-spline

Potential Issues with Timing

  • Repository was updated so that it can compile on Unix.

__rdtsc()

CallMonWin includes <intrin.h>, a header that does not exist in Microsoft compilers prior to Visual Studio 8/2005, in order to call __rdtsc(). It seems, however, that one can issue the underlying rdtsc instruction directly from inline assembly:

A few articles advise against the use of __rdtsc(), especially in a multicore/multithread context:

The suggested alternative is to use Performance Counters. Hardware counters are actually not an OS feature per se, but a CPU feature that has been around for some time. They provide high-resolution timers that can be used to monitor a wide range of resources:

The issue remains on how to access those counters in a cross-platform way:

  • PAPI: "The Performance API (PAPI) project specifies a standard API for accessing hardware performance counters".
    Stephen/Christian reported that Dual Core CPUs were not supported, but it seems from the release notes for PAPI 3.5 (2006-11-09) that both Intel Core2Duo and Pentium D (i.e. dual core) are indeed supported.
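For plain elapsed-time measurement, one possible way to avoid raw __rdtsc() while staying cross-platform is to wrap each operating system's own high-resolution timer behind a single function. The sketch below is not PAPI and does not read hardware event counters; it only assumes `QueryPerformanceCounter` on Win32 and `clock_gettime(CLOCK_MONOTONIC)` on POSIX systems that provide it (platforms lacking `clock_gettime` would need another branch). The name `MonotonicNanoseconds` is hypothetical.

```cpp
#include <cstdint>

#if defined(_WIN32)
#  include <windows.h>
#else
#  include <time.h>
#endif

// Hypothetical helper: monotonic time in nanoseconds from an OS-level timer.
inline std::int64_t MonotonicNanoseconds()
{
#if defined(_WIN32)
  LARGE_INTEGER frequency;
  LARGE_INTEGER counter;
  QueryPerformanceFrequency(&frequency); // ticks per second
  QueryPerformanceCounter(&counter);     // current tick count
  // Split into whole seconds and remainder to avoid 64-bit overflow.
  const std::int64_t seconds = counter.QuadPart / frequency.QuadPart;
  const std::int64_t remainder = counter.QuadPart % frequency.QuadPart;
  return seconds * 1000000000LL + (remainder * 1000000000LL) / frequency.QuadPart;
#else
  timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
#endif
}
```

Elapsed time is then the difference between two calls made around the code under test.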

Process Priority

Whatever our choices, several articles also suggest bumping the application's priority to real-time before testing, so that the wall-clock results are as realistic as possible. It is, however, very important to set it back to normal afterwards (see last paragraph). Example (Win32):

	#include <windows.h>

	// Save the current process priority class and thread priority.
	DWORD dwPriorityClass = GetPriorityClass(GetCurrentProcess());
	int nPriority = GetThreadPriority(GetCurrentThread());
	// Raise both to the highest level for the timed section.
	SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
	SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
	[...]
	// Restore the saved priorities.
	SetThreadPriority(GetCurrentThread(), nPriority);
	SetPriorityClass(GetCurrentProcess(), dwPriorityClass);
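Since forgetting to restore the priority is the main risk here, one option is to tie the save/restore pair to a scope. The class below is a minimal sketch of that idea: `ScopedHighPriority` is a hypothetical name, the Win32 branch reuses the calls from the example above, and the POSIX branch is an assumption that substitutes the lowest nice value (-20) for a true real-time class, which usually requires elevated privileges and may therefore fail.

```cpp
#if defined(_WIN32)
#  include <windows.h>
#else
#  include <sys/time.h>
#  include <sys/resource.h>
#endif

// Hypothetical RAII guard: raises priority on construction and
// restores the saved values on destruction, even if an exception is thrown.
class ScopedHighPriority
{
public:
  ScopedHighPriority()
  {
#if defined(_WIN32)
    m_PriorityClass = GetPriorityClass(GetCurrentProcess());
    m_ThreadPriority = GetThreadPriority(GetCurrentThread());
    m_Raised = SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS) != 0;
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
#else
    m_Nice = getpriority(PRIO_PROCESS, 0);
    // Assumption: nice -20 stands in for "real-time"; fails without privileges.
    m_Raised = setpriority(PRIO_PROCESS, 0, -20) == 0;
#endif
  }

  ~ScopedHighPriority()
  {
#if defined(_WIN32)
    SetThreadPriority(GetCurrentThread(), m_ThreadPriority);
    SetPriorityClass(GetCurrentProcess(), m_PriorityClass);
#else
    setpriority(PRIO_PROCESS, 0, m_Nice);
#endif
  }

  // True if the request to raise the priority was accepted.
  bool Raised() const { return m_Raised; }

  ScopedHighPriority(const ScopedHighPriority&) = delete;
  ScopedHighPriority& operator=(const ScopedHighPriority&) = delete;

private:
#if defined(_WIN32)
  DWORD m_PriorityClass;
  int m_ThreadPriority;
#else
  int m_Nice;
#endif
  bool m_Raised;
};
```

Declaring a `ScopedHighPriority` object at the top of the timed block would then restore the original priority when the block exits, whatever the exit path.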

Thread Affinity

We should consider using SetThreadAffinityMask to make sure that the starting time is recorded on the same processor as the ending time (Win32). Whether that would constrain the rest of the program to run on a single processor is a very good question. Also check Sleep(0), reported in a few discussions, including this long one.
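A minimal sketch of pinning the calling thread to one processor is shown below. `PinCurrentThreadToSingleCpu` is a hypothetical name; the Win32 branch assumes the first logical processor is part of the process affinity mask, and the Linux branch uses the glibc-specific `sched_getaffinity`/`sched_setaffinity` calls to pick the first processor the thread is already allowed to run on. Other platforms simply report failure.

```cpp
#if defined(_WIN32)
#  include <windows.h>
#elif defined(__linux__)
#  include <sched.h>
#endif

// Hypothetical helper: restrict the calling thread to a single CPU.
// Returns true on success.
inline bool PinCurrentThreadToSingleCpu()
{
#if defined(_WIN32)
  // Bit 0 selects the first logical processor; returns 0 on failure.
  return SetThreadAffinityMask(GetCurrentThread(), 1) != 0;
#elif defined(__linux__)
  cpu_set_t allowed;
  CPU_ZERO(&allowed);
  // A pid of 0 means "the calling thread".
  if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
  {
    return false;
  }
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
  {
    if (CPU_ISSET(cpu, &allowed))
    {
      cpu_set_t single;
      CPU_ZERO(&single);
      CPU_SET(cpu, &single);
      return sched_setaffinity(0, sizeof(single), &single) == 0;
    }
  }
  return false;
#else
  return false; // No affinity API assumed on other platforms.
#endif
}
```

Both calls act on the calling thread only, so threads that already exist keep their own masks. On Linux, however, threads created afterwards inherit the creator's mask, so it seems safer to pin only around the timed section or to restore the original mask before spawning worker threads.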

Test Platforms

The primary target platforms are the 8-, 16-, and 32-processor machines at BWH. However, preliminary tests have been performed on KHQ computers.

KHQ

A full software stack was compiled on several machines at Kitware. Each component was built in two flavors, shared/debug and static/release:

  • Tcl/Tk 8.4
  • VTK (cvs)
  • ITK (cvs)
  • ITK Applications (cvs)
  • FLTK (1.1 svn)
  • BWHItkOptimization (cvs)

All platforms are so far described in the BWHItkOptimization/Results directory:

Host     | #CPU | CPU                        | Freq     | RAM  | Arch    | OS                               | Login
amber2   | 2    | Pentium Xeon               | 2.8 GHz  | 4 GB | 64 bits | Linux 2.6 (Red Hat Enterprise 4) | kitware (ssh, vnc; cd ~/barre)
fury     | 1    | Pentium 4 (hyperthread)    | 2.8 GHz  | 1 GB | 32 bits | Linux 2.6 (Fedora Core 4)        | barre
panzer   | 1    | Intel Core Duo (dual core) | 1.66 GHz | 1 GB | 32 bits | Mac OS X 10.4.8                  | barre
sanakhan | 1    | Pentium M                  | 1.8 GHz  | 1 GB | 32 bits | Windows XP SP2                   | barre
tetsuo   | 1    | Pentium D (dual core)      | 3.2 GHz  | 2 GB | 32 bits | Windows XP SP2                   | barre

BWH

I do not yet have access to the machines at BWH. Stephen does, and will let me know.

Status

kcachegrind profiling and timing runs are being performed on amber2. Stay tuned.