'''ITK Registration Optimization (BW NAC) Project'''

My (Sebastien Barre) notes so far. Once the dust settles, the relevant sections will be moved to the project pages listed below.

===Project===

The ultimate specific goal is B-Spline registration optimization for Linux and Windows on multi-core and multi-processor, shared-memory machines. [...] Also, set up tools and a reporting mechanism so that ITK speed can be monitored and reported by us and others. BWH is the driving force behind this work.

===Contacts===

* [mailto:stephen.aylward@gmail.com Stephen Aylward]
* [mailto:julien.jomier@kitware.com Julien Jomier]
* [mailto:brad.davis@kitware.com Brad Davis]
* [mailto:sebastien.barre@kitware.com Sebastien Barre]
* [mailto:LydiaN@alleninstitute.org Lydia Ng]

===Quick Links===

===Source Code===

 cvs -d :pserver:<login>@public.kitware.com:/cvsroot/BWHITKOptimization login

Enter your VTK <login> and password, then:

 cvs -d :pserver:<login>@public.kitware.com:/cvsroot/BWHITKOptimization co BWHITKOptimization

to check out the code.

You can browse the repository online using [http://public.kitware.com/cgi-bin/viewcvs.cgi/?root=BWHITKOptimization ViewCVS] as well.

===Testing Data===

* [http://insight-journal.org/dspace/handle/1926/459 NAMIC: Deformable registration speed optimization] (DSpace @ Insight-Journal)

===Suggested Benchmarks===

All tests should cout two values (see the sketch after the test list below):
* the time required
* a measure of the error (0 = no error; 1 = 100% error)

Tests to be developed and suggested parameter settings:
* LinearInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
** NumThreads = 1, 2, 4, and #OfCores if > 4 (for every platform)
** DimSize = 100, 200 (meaning: 100^3 and 200^3 images)
** Factor = 1.5, 2, 3 (thereby producing up to 600^3 images)
** = 24 tests (approx. time on dual-core for all tests = 1.5 minutes)
* BSplineInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
** NumThreads = 1, 2, 4, and #OfCores if > 4 (for every platform)
** DimSize = 100, 200 (meaning: 100^3 and 200^3 images)
** Factor = 1.5, 2, 3 (thereby producing up to 600^3 images)
** = 24 tests (approx. time on dual-core for all tests = ??)
* SincInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
* BSplineTransformLinearInterpTest <numThreads> <dimSize> <nodes> [<outputImage>]
* MeanReciprocalSquaredDifferenceMetricTest
* MeanSquaresMetricTest
* NormalizedCorrelationMetricTest
* GradientDifferenceMetricTest
* MattesMutualInformationMetricTest
* MutualInformationMetricTest
* NormalizedMutualInformationMetricTest
* MutualInformationHistogramMetricTest
* NormalizedMutualInformationHistogramMetricTest
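
Below is a minimal sketch of that reporting convention (an illustration only: it assumes itk::RealTimeClock as the wall-clock source, and the benchmark body and error computation are placeholders, not the actual test code):

<pre>
#include <cstdlib>
#include <iostream>
#include "itkRealTimeClock.h"

int main(int argc, char *argv[])
{
  if (argc < 4)
    {
    std::cerr << "Usage: " << argv[0]
              << " <numThreads> <dimSize> <factor> [<outputImage>]" << std::endl;
    return EXIT_FAILURE;
    }

  itk::RealTimeClock::Pointer clock = itk::RealTimeClock::New();
  const double start = clock->GetTimeStamp();

  // ... run the interpolation / metric benchmark here ...

  const double elapsed = clock->GetTimeStamp() - start;
  const double error = 0.0; // 0 = no error, 1 = 100% error

  // Two values on stdout: the time required, then the error measure.
  std::cout << elapsed << std::endl;
  std::cout << error << std::endl;
  return EXIT_SUCCESS;
}
</pre>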
 
  
 
===Potential Issues with Timing===

* Repository was updated so that it can compile on Unix.

====__rdtsc()====

CallMonWin includes <intrin.h> to call __rdtsc(), a header that does not exist in Microsoft compilers prior to Visual Studio 8/2005. It seems, however, that one can call __rdtsc() directly from assembly:
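
For instance, a minimal wrapper along these lines (an illustration assuming 32-bit MSVC inline assembly, not code from the project repository):

<pre>
// Read the CPU time-stamp counter without <intrin.h> (32-bit MSVC only).
// RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX.
unsigned __int64 ReadTSC()
{
  unsigned __int64 tsc;
  __asm
    {
    rdtsc
    mov dword ptr [tsc], eax
    mov dword ptr [tsc + 4], edx
    }
  return tsc;
}
</pre>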
 
A few articles advise against the use of __rdtsc(), especially in a multi-core/multi-threaded context:

The suggested alternative is to use performance counters. Hardware counters are not an OS feature per se, but a CPU feature that has been around for some time. They provide high-resolution timers that can be used to monitor a wide range of resources:

The issue remains how to access those counters in a cross-platform way:

* [http://icl.cs.utk.edu/papi/overview/index.html PAPI]: "''The Performance API (PAPI) project specifies a standard API for accessing hardware performance counters''".<br>Stephen/Christian reported that dual-core CPUs were not supported, but it seems from the [http://icl.cs.utk.edu/viewcvs/viewcvs.py/PAPI/papi/RELEASENOTES.txt?view=markup&revision=1.8.10.2.2.2 release notes for PAPI 3.5] (2006-11-09) that both the Intel Core 2 Duo and the Pentium D (i.e. dual core) are indeed supported.
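
As an illustration of how PAPI's timer entry points would bracket a benchmark (a sketch assuming the PAPI 3.x C API, not code we have integrated yet):

<pre>
#include <iostream>
#include "papi.h"

int main()
{
  // PAPI_library_init() must be called once before any other PAPI call.
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    {
    std::cerr << "PAPI initialization failed" << std::endl;
    return 1;
    }

  const long long start = PAPI_get_real_usec(); // wall-clock, in microseconds

  // ... section to be timed ...

  const long long elapsed = PAPI_get_real_usec() - start;
  std::cout << elapsed << " usec" << std::endl;
  return 0;
}
</pre>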
 
===Process Priority===

Whatever our choices, several articles also suggest bumping the application's priority to real-time before performing timing tests, to make sure the wall-clock results are as realistic as possible. It is, however, very important to set it back to normal afterwards.

* Windows:

See the [http://www.devx.com/MicrosoftISV/Article/16293 last paragraph of this article]. Use GetPriorityClass, SetPriorityClass, GetThreadPriority, SetThreadPriority. After experimenting with that API, it seems that users with Administrative privileges will be able to access REALTIME_PRIORITY_CLASS, whereas users with only User privileges will only get HIGH_PRIORITY_CLASS. Note that the code below will not fail for normal users; HIGH will be picked instead of REALTIME. In any case, I quote: "Use extreme care when using the high-priority class, because a high-priority class application can use nearly all available CPU time."; indeed, mouse interaction becomes pretty much impossible, and some applications like IM clients will disconnect after losing their socket connection. Let's stick to HIGH.

<pre>
// Save the current process and thread priorities.
DWORD dwPriorityClass = GetPriorityClass(GetCurrentProcess());
int nPriority = GetThreadPriority(GetCurrentThread());

// Raise them for the timed section; normal users will silently get
// HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS.
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

[...] // code to be timed

// Restore the original priorities.
SetThreadPriority(GetCurrentThread(), nPriority);
SetPriorityClass(GetCurrentProcess(), dwPriorityClass);
</pre>

* Unix:

The getpriority(), setpriority(), and nice() functions can be used to change the priority of processes. The getpriority() call returns the current nice value for a process, process group, or user. The returned nice value is in the range [-NZERO, NZERO-1]; NZERO is defined in /usr/include/limits.h. The default process priority is always 0 on UNIX. The setpriority() call sets the current nice value for a process, process group, or user to value + NZERO. It is important to note that setting a higher priority is only allowed if you are root or if the program has its suid bit set, in order to prevent rogue programs/viruses from claiming system resources. In practice, this is likely to prevent us from increasing the priority on Unix.
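
A minimal sketch of that call sequence (illustration only; as noted, lowering the nice value will fail for non-root users):

<pre>
#include <cstdio>
#include <sys/time.h>
#include <sys/resource.h>

int main()
{
  // Current nice value of the calling process (0 is the default).
  int old_nice = getpriority(PRIO_PROCESS, 0);

  // Try to raise the priority (i.e. lower the nice value); only root or a
  // suid-root binary will succeed.
  if (setpriority(PRIO_PROCESS, 0, -10) != 0)
    {
    std::perror("setpriority");
    }

  // ... section to be timed ...

  // Restore the original priority.
  setpriority(PRIO_PROCESS, 0, old_nice);
  return 0;
}
</pre>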

'''Update''': [http://public.kitware.com/cgi-bin/viewcvs.cgi/Code/Utilities/?root=BWHITKOptimization itkHighPriorityRealTimeClock], a subclass of itkRealTimeClock, has been created. Since we wanted to stay compatible with the itkRealTimeClock API, no Start() and Stop() methods were created to increase (respectively restore) the process/thread priority; this is done automatically from the class constructor (respectively destructor) instead. The drawback of this approach is that a class using an itkHighPriorityRealTimeClock as a member variable would have its priority bumped as soon as it is created. This does not apply to us per se, as we favor allocating clock objects right before the section that needs to be timed. As noted above, this is likely not to help us on Unix.
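
For reference, on Windows the constructor/destructor approach boils down to something like the following (an illustration of the idea with a hypothetical class name, not the actual itkHighPriorityRealTimeClock source):

<pre>
#include <windows.h>

// Hypothetical helper: the constructor bumps the process/thread priority,
// the destructor restores the previous values (same idea as the ITK class).
class ScopedHighPriority
{
public:
  ScopedHighPriority()
    {
    m_PriorityClass  = GetPriorityClass(GetCurrentProcess());
    m_ThreadPriority = GetThreadPriority(GetCurrentThread());
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    }
  ~ScopedHighPriority()
    {
    SetThreadPriority(GetCurrentThread(), m_ThreadPriority);
    SetPriorityClass(GetCurrentProcess(), m_PriorityClass);
    }
private:
  DWORD m_PriorityClass;
  int   m_ThreadPriority;
};
</pre>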
  
 
===Thread Affinity===

We should consider setting the thread affinity to make sure that the starting time is recorded on the same thread as the ending time. Whether that would constrain the rest of the program to run on a single thread is a very good question.

* Windows:

Use [http://msdn2.microsoft.com/en-us/library/ms686247.aspx SetThreadAffinityMask]. Also check Sleep(0), reported in a few discussions, including [http://msdn2.microsoft.com/en-us/library/ms686247.aspx this long one].
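
A sketch of how SetThreadAffinityMask would be used around a timed section (illustration only; pinning to the first CPU is an arbitrary choice):

<pre>
#include <windows.h>

void TimedSection()
{
  // Pin the calling thread to CPU 0; the return value is the previous
  // affinity mask (or 0 on failure).
  DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);

  // ... start timer, run the benchmark, stop timer ...

  if (previous)
    {
    SetThreadAffinityMask(GetCurrentThread(), previous);
    }
}
</pre>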
 
* Unix:

Use [http://www.google.com/search?lr=&ie=UTF-8&oe=UTF-8&q=sched_setaffinity sched_setaffinity], which however seems to be Linux-only (not POSIX).
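
For reference, a sketch using the glibc cpu_set_t interface (illustration only, assuming a reasonably recent glibc; see the caveats below):

<pre>
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1 // for sched_setaffinity() / CPU_SET with glibc
#endif
#include <sched.h>
#include <cstdio>

int main()
{
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(0, &mask); // bind to the first CPU

  // A pid of 0 means the calling process.
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
    {
    std::perror("sched_setaffinity");
    }

  // ... section to be timed ...
  return 0;
}
</pre>
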
I read a few articles ([http://www.linuxjournal.com/article/6799 here], [http://www.open-mpi.org/software/plpa/ here], [http://www.uwsg.iu.edu/hypermail/linux/kernel/0409.0/1974.html here]) and came to the conclusion that:
* the sched_setaffinity API is not reliable across Linux vendors/kernels (see the [http://www.open-mpi.org/software/plpa/ Portable Linux Processor Affinity] (PLPA) library, though),
* there is too much risk of influencing the algorithms being timed,
* binding CPU affinity for timing purposes only makes sense for real-time applications, where the events to be timed are very short in duration (in such cases, calling a timer's start() could happen on one CPU/thread and calling stop() could happen on another, potentially resulting in negative timings). This does not apply to us, as we are trying to measure performance over a reasonable amount of time (minutes to hours).

However, once our optimizations have been tested, it might be interesting to see if CPU affinity can be used to improve cache performance: "[...] But the real problem comes into play when processes bounce between processors: they constantly cause cache invalidations, and the data they want is never in the cache when they need it. Thus, cache miss rates grow very large. CPU affinity protects against this and improves cache performance. A second benefit of CPU affinity is a corollary to the first. If multiple threads are accessing the same data, it might make sense to bind them all to the same processor. Doing so guarantees that the threads do not contend over data and cause cache misses. [...]"

===Test Platforms===

The primary target platforms are the 8-, 16-, and 32-processor machines at BWH. However, preliminary tests have been performed on KHQ computers.

A full software stack was compiled on several machines at Kitware and BWH/SPL. Some components were built in two flavors, shared/debug and/or static/release, but static/release should be used for submitting dashboards. The stack includes:
 
* Tcl/Tk 8.4
* VTK (cvs)
* ITK (cvs)
* ITK Applications (cvs)
* FLTK (1.1 svn)
* BWHItkOptimization (cvs)
 
  
Some platforms are described in the [http://public.kitware.com/cgi-bin/viewcvs.cgi/Results/?root=BWHITKOptimization&sortby=date BWHITKOptimization/Results] directory:

{| border="1" width="100%" align="center" cellspacing="0" cellpadding="3"
|- bgcolor="#abcdef"
! Host !! #CPUs !! CPU !! Freq (GHz) !! RAM (GB) !! Arch (bits) !! OS !! Login !! [http://public.kitware.com/dashboard.php?name=BWHITKOptimization Dash]
|-
| KW: [http://public.kitware.com/cgi-bin/viewcvs.cgi/Results/amber2.kitware?root=BWHITKOptimization&sortby=date&view=markup amber2] || 2 || Intel Xeon || 2.8 || 4 || 64 || Linux 2.6 (RHEL 4) || kitware (~/barre) || No
|-
| KW: [http://public.kitware.com/cgi-bin/viewcvs.cgi/Results/fury.kitware?root=BWHITKOptimization&sortby=date&view=markup fury] || 1 || Intel Pentium 4 (hyperthreading) || 2.8 || 1 || 32 || Linux 2.6 (Fedora 4) || [mailto:sebastien.barre@kitware.com barre], jjomier, aylward || Yes
|-
| KW: mcpluto || 1 || Intel Pentium D || 3.0 || 4 || 64 || Linux 2.6 (Debian Etch) || [mailto:brad.davis@kitware.com davisb] || ?
|-
| KW: [http://public.kitware.com/cgi-bin/viewcvs.cgi/Results/panzer.kitware?root=BWHITKOptimization&sortby=date&view=markup panzer] || 1 || Intel Core Duo || 1.66 || 1 || 32 || Mac OS X 10.4.8 || [mailto:sebastien.barre@kitware.com barre], jjomier, aylward || Yes
|-
| KW: [http://public.kitware.com/cgi-bin/viewcvs.cgi/Results/sanakhan.kitware?root=BWHITKOptimization&sortby=date&view=markup sanakhan] || 1 || Intel Pentium M || 1.8 || 1 || 32 || Windows XP SP2 || [mailto:sebastien.barre@kitware.com barre] || No
|-
| KW: [http://public.kitware.com/cgi-bin/viewcvs.cgi/Results/tetsuo.kitware?root=BWHITKOptimization&sortby=date&view=markup tetsuo] || 1 || Intel Pentium D || 3.2 || 2 || 32 || Windows XP SP2 || [mailto:sebastien.barre@kitware.com barre] || Yes
|-
| SPL: vision || 6 || Sun SPARC || ? || 24 || 64 || Solaris 8 || [mailto:sebastien.barre@kitware.com barre] || Not yet
|-
| SPL: forest || 10 || Sun SPARC || ? || 10 || 64 || Solaris 8 || [mailto:sebastien.barre@kitware.com barre] || Not yet
|-
| SPL: john || 16 || AMD Opteron || 2.4 || 128 || 64 || Linux 2.6 (Fedora 5) || [mailto:sebastien.barre@kitware.com barre] || Not yet
|-
| SPL: b2_d6_1 || 4 || Intel Xeon || 2.8 || 8 || 64 || Linux 2.6 (Fedora 5) || [mailto:sebastien.barre@kitware.com barre] || Not yet
|}
  
====Directory Structure====

* Each Kitware (KW) machine has its own space:
** source trees can be found in <tt>~/src</tt> (or <tt>~/barre/src</tt>),
** build trees can be found in <tt>~/build</tt> (or <tt>~/barre/build</tt>),
** crontabs and scripts can be found in <tt>~/bin</tt> (or <tt>~/barre/bin</tt>),
** dashboards (i.e. nightly source and build trees) can be found in <tt>~/build/my dashboards</tt> (or <tt>~/barre/build/my dashboards</tt>).

* All SPL machines share the same user space, with a limited quota. A larger space was allocated for our project and can be found in <tt>/project/na-mic/barre</tt>:
** source trees, common to all machines, can be found in <tt>/project/na-mic/barre/src</tt>,
** all other trees can be found in a per-machine subdirectory of <tt>/project/na-mic/barre/machines/</tt> (for example <tt>/project/na-mic/barre/machines/vision</tt> or <tt>/project/na-mic/barre/machines/forest</tt>):
*** build trees can be found in <tt>/project/na-mic/barre/machines/''machine''/build</tt>,
*** crontabs and scripts can be found in <tt>/project/na-mic/barre/machines/''machine''/bin</tt>,
*** dashboards (i.e. nightly source and build trees) can be found in <tt>/project/na-mic/barre/machines/''machine''/dashboards</tt>.

===Graphs===

Parameters relevant to all tests:
* # threads (only use 1, 1/2 the max # in the machine, and the max # in the machine?)
* problem run time
** dimsize * factor (for interpolators)
** # samples * iterations (for metrics)
** dimsize * iterations (for transforms)
* optimization ratio (optimized time vs. unoptimized time)
* memory
* # logical CPUs (i.e. physical CPUs or cores; processing units)
* Whetstone score per CPU/core

Parameters relevant to registration only:
* # samples (random sampling scenario)

Graphs:

Y: optimization ratio
* One graph per machine
** X: # threads (multiple lines: one per dimsize * factor)
** X: dimsize * factor (multiple lines: one per # threads)
* One graph for all machines
** X: # threads (multiple lines: one per # CPUs; min variance graph)
** X: dimsize * factor (multiple lines: one per memory size, # threads fixed > 1)
** X: # samples (multiple lines: one per machine, everything else fixed)

Y: absolute run time
* One graph per machine (bar graph): for a given problem size, using the max # of threads
** X: method (unoptimized, optimized)
  
Ask Stephen how we are going to report/detect whether new algorithms are using more memory than the old ones.
* Using more memory is okay... as long as the tests can still be run.

====Graph Library====

* [http://www.aditus.nu/jpgraph/ JpGraph] is the PHP graph library that has been used so far by BatchMake and is used by the Kitware and NA-MIC bug trackers (PHPBugTracker and Mantis). Newer versions of this library CAN NOT be used in a commercial context anymore; that requires buying [http://www.aditus.nu/jpgraph/proversion.php JpGraph Pro]. This licensing change apparently took place at some point in the past (see [http://fcp.surfsite.org/modules/newbb/viewtopic.php?topic_id=35963&start=0 here] and [http://packages.debian.org/changelogs/pool/main/libp/libphp-jpgraph/libphp-jpgraph_1.5.2-10.1/libphp-jpgraph.copyright here]). We seem to be using JpGraph 1.12.2 at Kitware, under a Qt license (QPL 1.0).
* It is suggested that we use [http://pear.veggerby.dk/ Image::Graph] instead (see [http://pear.veggerby.dk/samples/ samples/screenshots]). As a [http://pear.php.net/package/Image_Graph PEAR package], it is ''very'' easy to install (<tt>pear install --alldeps Image_Graph-alpha</tt>) and is released under the LGPL license. It looks pretty easy to use: [http://pear.veggerby.dk/wiki/image_graph:getting_started_guide Getting Started Guide].
  
===sshfs===

* We set up a special user on the MIDAS server side for people to mount the filesystem remotely using [http://fuse.sourceforge.net/sshfs.html sshfs]. Security is taken care of through the use of [http://sublimation.org/scponly/wiki/index.php/Main_Page scponly] as the remote shell (praise [http://en.wikipedia.org/wiki/Chroot chroot]). The sshfs package is based on [http://fuse.sourceforge.net/ FUSE] (Filesystem in Userspace), which allows non-privileged users to mount filesystems securely.
* Most Linux distributions include an [http://www.linuxjournal.com/article/8904 sshfs package] (free). It is just a matter of installing that package once.
* Windows users can use [http://en.wikipedia.org/wiki/SftpDrive sftpdrive], a commercial Windows Explorer extension that maps virtual drives to any ssh (scp)/sftp server.
* Mac OS X users can use Google's [http://code.google.com/p/macfuse/ MacFUSE] (free).

===Status===

* kcachegrind profiling and timing are being performed on amber2. Stay tuned.
* valgrind is not supported on the x86_64 architecture :( Now using fury instead of amber2.
* <tt>RegTests/RunLinearInterpTest.sh.in</tt> is configured automatically to run and time <tt>LinearInterp</tt> with various combinations of the threads, size, and factor parameters.
** It was run on fury (release static): <tt>Results/fury.kitware.timings-rel.txt</tt>
** It was run on fury (debug): <tt>Results/fury.kitware.timings-dbg.txt</tt>
** It was run on amber2 (release static): <tt>Results/amber2.kitware.timings-rel.txt</tt>