Difference between revisions of "ITK Registration Optimization"

From NAMIC Wiki
Jump to: navigation, search
Line 8: Line 8:
  
 
# [http://insight-journal.org/InsightJournalManager/view_reviews.php?pubid=172 Aylward, Stephen; Jomier, Julien; Barre, Sebastien; Davis, Brad; Ibanez, Luis, "Optimizing ITK’s Registration Methods for Multi-processor, Shared-Memory Systems." MICCAI Open Source and Open Data Workshop, 2007] [http://insight-journal.org/InsightJournalManager/download_publication.php?pubid=172&revision=2&name=OptimizingITKRegistrationMethods.pdf&pdf=1 (Download PDF)]
 
# [http://insight-journal.org/InsightJournalManager/view_reviews.php?pubid=172 Aylward, Stephen; Jomier, Julien; Barre, Sebastien; Davis, Brad; Ibanez, Luis, "Optimizing ITK’s Registration Methods for Multi-processor, Shared-Memory Systems." MICCAI Open Source and Open Data Workshop, 2007] [http://insight-journal.org/InsightJournalManager/download_publication.php?pubid=172&revision=2&name=OptimizingITKRegistrationMethods.pdf&pdf=1 (Download PDF)]
 +
 +
* One remaining, high priority task is to complete the integration of the new, threaded, registration methods into ITK.  Luis and Sebastien have adapted the new methods to be 100% backward compatible with ITK's existing classes.  This is a major effort involving approximately 50,000 lines of new code and over 400 new tests in ITK.  The new registration framework is going to be significantly better tested as well as significantly faster than the existing ITK registration framework.  Once it is ported, helper-classes will be added to ITK, and modules using those helper classes will be distributed with Slicer.  We have chosen to spend the time to integrate with ITK because it will serve the broader community, it will benefit from the support of the broader community, it will avoid having to incorporate another SVN checkout into Slicer's build process, and it will keep us from having to maintain and monitor separate dashboards for this effort.
  
 
= Goals =
 
= Goals =
Line 51: Line 53:
 
== Data ==
 
== Data ==
  
* [http://www.insight-journal.org/dspace/handle/1926/459 NA-MIC MIDAS collection for this project]
+
* Now distributed with CVS
  
 
= Workplan =
 
= Workplan =
Line 59: Line 61:
 
## Cross platform and multi-threaded
 
## Cross platform and multi-threaded
 
## Timing and profiling
 
## Timing and profiling
#* Status
 
#*# Instrumenting modular tests
 
#*#* Extending itk's cross-platform high precision timer
 
#*#* Adding thread affinity to ensure valid timings
 
#*#* Adding method for increasing process priority
 
#*#* [http://public.kitware.com/cgi-bin/viewcvs.cgi/?root=BWHITKOptimization ViewCVS]
 
#*# Profiling complete registration solutions for use cases
 
#*#* Using CacheGrind on single and multi-core linux systems
 
#*#* Using LTProf on windows
 
 
# Develop performance dashboard for collecting results
 
# Develop performance dashboard for collecting results
 
## Each test will report time and accuracy to a central server
 
## Each test will report time and accuracy to a central server
 
## The performance of a test, over time, for a given platform can be viewed on one page
 
## The performance of a test, over time, for a given platform can be viewed on one page
 
## The performance of a set of tests, at one point in time, for all platforms can be viewed on one page
 
## The performance of a set of tests, at one point in time, for all platforms can be viewed on one page
#* Status
+
 
#*# BatchMake database communication code being isolated
 
#*# Performance dashboard web pages being designed
 
  
 
== Develop tests ==
 
== Develop tests ==
 
# Develop modular tests
 
# Develop modular tests
#* Status
 
#*# Developed itkCheckerboardImageSource so no IO required
 
#*# Developed all tests listed in the "Modular Tests" section below
 
#*# Developing second generation tests that compare optimized performance with standard performance in a single executable
 
# Develop C-style tests
 
## Tests should represent the non-ITK way of doing image analysis
 
##* Use standard C/C++ arrays and pointers to access blocks of memory as images
 
 
# Develop complete registration solutions for use cases
 
# Develop complete registration solutions for use cases
#* Status
 
#*# Centralized data and provide easy access
 
#*# Identified relevant registration algorithms
 
#*#* rigid, affine, bspline, multi-level bspline, and Demons'
 
#*#* normalized mutual information, mean squared difference, and cross correlation
 
#*# Developing traditional ITK-style implementations
 
  
== Compute performance on target platforms ==
 
* Completing the development of the reporting dashboard, the standard C tests, and the full registration applications (see above).
 
* Developing optimized versions of select methods in parallel with deploying the testing suite (see below).
 
  
 
== ITK Optimization ==
 
== ITK Optimization ==
Line 104: Line 79:
 
*** Sacrifice some memory and algorithm initialization speed to gain algorithm operation speed increases
 
*** Sacrifice some memory and algorithm initialization speed to gain algorithm operation speed increases
 
*** Call multi-threaded functions when possible
 
*** Call multi-threaded functions when possible
 
 
* Integrate metrics with transforms and interpolators for tailored performance
 
* Integrate metrics with transforms and interpolators for tailored performance
  
Line 112: Line 86:
 
** Added multi-threading to GetValue function
 
** Added multi-threading to GetValue function
 
*** Partitions the samples - thereby distributes the computation of the transforms and interpolations across threads
 
*** Partitions the samples - thereby distributes the computation of the transforms and interpolations across threads
*** Original runtime is O(C1+C2*M) given M samples from the fixed image and new runtime is O(C1+(C2*M)/(C3*N)) given N cores.  The ratio of C2/C3 varies by the type of transform used.  When a BSplineTransform is used, the ratio C2/C3 approaches 1.0.
 
 
*** Added the pre-computation of the FixedImageMarginalPDF for the sample to reduce the need for the thread mutex lock
 
*** Added the pre-computation of the FixedImageMarginalPDF for the sample to reduce the need for the thread mutex lock
 
**** Required the concept of an AdjustedFixedImageMarginalPDF that is updated when a fixed image voxel does not map into the moving image and thereby isn't valid for the current computations.  By only updating when samples are missed, mutex lock to update a cross-thread data structure is needed less often.
 
**** Required the concept of an AdjustedFixedImageMarginalPDF that is updated when a fixed image voxel does not map into the moving image and thereby isn't valid for the current computations.  By only updating when samples are missed, mutex lock to update a cross-thread data structure is needed less often.
Line 121: Line 94:
 
* GetDerivative
 
* GetDerivative
 
** Following the same convention as used with GetValue
 
** Following the same convention as used with GetValue
 
 
 
  
  
Line 150: Line 120:
 
SECOND GENERATION TEST
 
SECOND GENERATION TEST
 
* Computes runtime for GetValue, GetDerivative, and GetValueAndDerivative for standard ITK implementation and for the optimized version being developed.  Also computes difference (if any) between their answers and reports as an error measure
 
* Computes runtime for GetValue, GetDerivative, and GetValueAndDerivative for standard ITK implementation and for the optimized version being developed.  Also computes difference (if any) between their answers and reports as an error measure
# OptMattesMutualInformationImageToImageMetricTest
 
  
 
= Larger tests =
 
= Larger tests =

Revision as of 13:53, 3 August 2007

Home < ITK Registration Optimization

Quick Links

  1. Dashboard for this project
  2. Dashboard for BatchMake
  3. Batchboard (nightly experiment results) for this project

Results and Publications

  1. Aylward, Stephen; Jomier, Julien; Barre, Sebastien; Davis, Brad; Ibanez, Luis, "Optimizing ITK’s Registration Methods for Multi-processor, Shared-Memory Systems." MICCAI Open Source and Open Data Workshop, 2007 (Download PDF)
  • One remaining, high priority task is to complete the integration of the new, threaded, registration methods into ITK. Luis and Sebastien have adapted the new methods to be 100% backward compatible with ITK's existing classes. This is a major effort involving approximately 50,000 lines of new code and over 400 new tests in ITK. The new registration framework is going to be significantly better tested as well as significantly faster than the existing ITK registration framework. Once it is ported, helper-classes will be added to ITK, and modules using those helper classes will be distributed with Slicer. We have chosen to spend the time to integrate with ITK because it will serve the broader community, it will benefit from the support of the broader community, it will avoid having to incorporate another SVN checkout into Slicer's build process, and it will keep us from having to maintain and monitor separate dashboards for this effort.

Goals

There are two components to this research

  1. Identify registration algorithms that are suitable for non-rigid registration problems that are endemic to NA-MIC
  2. Develop implementations of those algorithms that take advantage of multi-core and multi-processor hardware.

Algorithmic Requirements and Use Cases

  • Requirements
    1. relatively robust, with few parameters to tweak
    2. runs on grey scale images
    3. has already been published
    4. relatively fast (ideally speaking a few minutes for volume to volume).
    5. not patented
    6. can be implemented in ITK and parallelized.

Hardware Platform Requirements and Use Cases

  • Requirements
    1. Shared memory
    2. Single and multi-core machines
    3. Single and multi-processor machines
    4. AMD and Intel - Windows, Linux, and SunOS
  • Use-cases
    1. Intel Core2Duo
    2. Intel quad-core Xeon processors, Visual Studio 8, Windows Vista (Kitware: redwall)
    3. 6 CPU Sun, Solaris 8 (SPL: vision)
    4. 12 CPU Sun, Solaris 8 (SPL: forest and ocean)
    5. 16 core Opteron (SPL: john, ringo, paul, george)
    6. 16 core, Sun Fire, AMDOpteron (UNC: Styner)

Data

  • Now distributed with CVS

Workplan

Establish testing and reporting infrastructure

  1. Identify timing tools
    1. Cross platform and multi-threaded
    2. Timing and profiling
  2. Develop performance dashboard for collecting results
    1. Each test will report time and accuracy to a central server
    2. The performance of a test, over time, for a given platform can be viewed on one page
    3. The performance of a set of tests, at one point in time, for all platforms can be viewed on one page


Develop tests

  1. Develop modular tests
  2. Develop complete registration solutions for use cases


ITK Optimization

  • Target bottlenecks
    • Multi-thread metric calculation
      • Initial target is MattesMutualInformationImageToImageMetric
    • Optimize code
      • Sacrifice some memory and algorithm initialization speed to gain algorithm operation speed increases
      • Call multi-threaded functions when possible
  • Integrate metrics with transforms and interpolators for tailored performance

MattesMutualInformationImageToImageMetric

Optimizations Employed

  • GetValue
    • Added multi-threading to GetValue function
      • Partitions the samples - thereby distributes the computation of the transforms and interpolations across threads
      • Added the pre-computation of the FixedImageMarginalPDF for the sample to reduce the need for the thread mutex lock
        • Required the concept of an AdjustedFixedImageMarginalPDF that is updated when a fixed image voxel does not map into the moving image and thereby isn't valid for the current computations. By only updating when samples are missed, mutex lock to update a cross-thread data structure is needed less often.
      • Each thread now has its own copy of the joinPDF. After threads complete, jointPDFs from each thread are summed. This eliminates mutex from the main loop over samples.
      • SUMMARY: Speedup on a dual-core system is about 30% (reduction in computation time) when using linear transform and linear interpolation and about 45% when using bspline transform and bspline interpolation.
    • Algorithm optimization (aside from adding multi-threading)
      • None at this level. Will be done for transforms and interpolators.
  • GetDerivative
    • Following the same convention as used with GetValue


Modular tests

All tests send two values to performance dashboards

  • the time required
  • an measure of the error (0 = no error; 1 = 100% error)

Tests being developed and their parameter spaces

  1. NearestNeighborInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
  2. LinearInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
  3. BSplineInterpTest <numThreads> <dimSize> <factor> <bSplineOrder> [<outputImage>]
  4. SincInterpTest <numThreads> <dimSize> <factor> [<outputImage>]
    • Uses the Welch window function
  5. BSplineTransformLinearInterpTest <numThreads> <dimSize> <numNodesPerDim> <bSplineOrder> [<outputImage>]
    • 3 nodes are also added outside of the image for interpolation
  6. MeanSquaresImageToImageMetricTest <numThreads> <dimSize> <iterations>
  7. CorrelationCoefficientHistogramMetricTest <numThreads> <dimSize> <iterations>
  8. NormalizedCorreltationImageToImageMetricTest <numThreads> <dimSize> <iterations>
  9. MattesMutualInformationImageToImageMetricTest <numThreads> <dimSize> <numSamples> <iterations>
    • MattesMutualInformationMetric defaults to BSpline interpolator - this test overrides to use nearest neighbor interpolation
  10. MutualInformationImageToImageMetricTest <numThreads> <dimSize> <numSamples> <iterations>
  11. NormalizedMutualInformationHistrogramMetricTest <numThreads> <dimSize> <iterations>

SECOND GENERATION TEST

  • Computes runtime for GetValue, GetDerivative, and GetValueAndDerivative for standard ITK implementation and for the optimized version being developed. Also computes difference (if any) between their answers and reports as an error measure

Larger tests

These are top down tests that implement registration tasks as a user would. These tests are run on real medical image data. The goal is to use these larger, realistic tests to focus our optimization effort.

  • AlignMomentsTest <fixed image> <moving image> [number of threads] [transformed moving image] [post-registration difference image] [pre-registration difference image]
  • AlignRigidMSETest <fixed image> <moving image> [number of threads] [number of iterations] [transformed moving image] [post-registration difference image] [pre-registration difference image]
  • AlignRigidMMITest <fixed image> <moving image> [number of threads] [number of iterations] [transformed moving image] [post-registration difference image] [pre-registration difference image]

Events

Related Pages

Performance Measurement