   #[1]Monte Carlo eXtreme: GPU-based Monte Carlo Simulations

Frequently Asked Questions about MCX

          1. [2]I am getting a "kernel launch timed-out" error, what is
          that?

          2. [3]When should I use the atomic version of MCX ?

          3. [4]How do I interpret MCX's output data?

          4. [5]Does MCX support multiple GPUs in a single computer?

          5. [6]Will you consider porting MCX to MPI to run on my cluster?

          6. [7]What is the maximum number of media types that MCX can
          handle?

          7. [8]What is the maximum number of detectors in MCX?

          8. [9]My simulation created an empty history file, why is that?

1. I am getting a "kernel launch timed-out" error, what is that?

   Answer: This error happens only when you are using a non-dedicated GPU.
   A non-dedicated GPU refers to a graphics card that is used both for
   display and GPU computation. Because you connect your display to the
   card and the nvidia graphics driver imposes a time-limit to the
   response time of a kernel (a function running on a GPU). This
   time-limit is referred to as the "driver watch-dog time limit". For
   Linux, this limit is usually about 10 seconds; for Windows, this limit
   is about 2 seconds. When a kernel is running on a GPU for longer than
   this limit, the driver will kill this kernel for safety purposes.

   If you have only one graphics card on your system and you have to use
   it in a non-dedicated way (i.e. connect to your monitor and for MCX
   simulations), MCX allows you to slice the entire simulations into
   chunks, so that the run-time for each chunk can be smaller than the
   watch-dog time limit. This is done by setting the "-r" (repetition)
   parameter.

   If you have a high-end dual-GPU graphics card, you can simply run MCX
   without worrying about this limit. Because MCX automatically selects
   the second GPU to perform the simulation, which it is often not
   connected to a monitor (if this guess is wrong, one can use -G to
   manually select the dedicated GPU). Alternatively, if you can install a
   second graphics card to your machine and connect your display to one of
   the cards (the weaker one), this will make the other card a dedicated
   CUDA device.

   For Windows users, you may modify the TdrDelay value in the registry to
   effectively extend this time-out limit. You can find more info in
   [10]this thread. For Linux/Unix users, you can kill the x-window and
   run mcx in the pure console mode (you may boot into [11]he "text" mode,
   or if you are already in the graphics mode, you may [12]stop it from a
   terminal). After killing the graphics interface, you may run mcx on a
   non-dedicated GPU without the watch-dog limit.

2. When should I use the atomic version of MCX ?

   Answer: In an MCX Monte Carlo simulation, we need to save photon
   weights to the global memory from many parallel threads. This may cause
   problems when multiple threads write to the same global memory address
   at the same time, which we referred to as the race condition. To avoid
   race conditions, CUDA provides a set of "atomic" operations, where
   read-compute-write process in a thread can not be interrupted by other
   threads. Unfortunately, there is a great speed penalty for using these
   functions. As we have shown in Fig. 7 in our [13]paper, the atomic
   version of MCX can only achieve about 75x acceleration at an optimal
   thread number around 500~1000, compared with 300x acceleration with the
   non-atomic versions. More importantly, the performance of atomic MCX
   can not be scaled with better graphics hardware.

   Fortunately, in our algorithm, the photon propagation for each thread
   is maximally asynchronized, together with the exponential decay of the
   fluence, the overall impact of the race condition is negligible (as
   shown in Fig. 4 of [14]Fang2009). For a range of scattering
   coefficients, the accumulation-miss due to race condition is around 1%
   for over 1000 threads. Using the non-atomic version can give quite
   accurate solutions as long as it is a few voxels away from the source.

   In the newer versions of MCX (v0.4.9 or later), we provide a new
   binary, Cachebox MCX, that takes advantage of the atomic operations in
   the shared memory. This binary can ensure an accurate solution near the
   source, and is only 20% slower than the non-atomic version (i.e. the
   Vanilla MCX). If the accuracy near the source is important for your
   application, please use this binary for your simulation. More details
   can be found in [15]this document.

3. How do I interpret MCX's output data?

   Please read the [16]output interpretation of MMC (Mesh-based Monte
   Carlo). The meanings of the outputs from both software are almost
   identical. The only difference is that MCX saves the output on a
   voxelated grid, and MMC saves on a mesh.

4. Does MCX support multiple GPUs in a single computer?

   Answer: No, but you can launch parallel jobs and utilize all GPUs you
   have simultaneously. You must use the "-G" option to specify a
   different GPU ID for each job, and seed them independently by using
   different input files (change the second row of each input file). All
   of these can be done automatically using [17]GNU Parallel. You can find
   examples of running parallel jobs using this tool at [18]this page.
   This approach also supports running MCX over a distributed system such
   as a cluster.

   In addition, the author of MCX has also developed an OpenCL code that
   is targeted at a heterogeneous computing platform involving multiple
   GPUs and CPUs. This code will be announced at a later time.

5. Will you consider porting MCX to MPI to run on my cluster?

   Answer: There are simple alternatives, and you can find my arguments on
   this at [19]this link]. The support for distributed systems is similar
   to the support for multiple GPUs in the same box. You are recommended
   to use [20]GNU Parallel to manage parallel jobs. Examples can be found
   [21]here.

6. What is the maximum number of media types that MCX can handle?

   Answer: In MCX v0.5, the maximum number of media types is 128. If you
   need more than 128 media, you can check out a special branch of mcx by
   the following command:

  svn checkout --username anonymous_user [22]https://mcx.svn.sourceforge.net/svnroot/mcx/mcextreme_cuda/branches/media16 mcx

   In this branch, we use an "unsigned short" instead of "unsigned char"
   to represent a medium index. For CUDA devices with 64k constant memory,
   this branch can use a maximum of 3712 tissue types. You do need to
   create your volume input (*.bin) in "unsigned short" per voxel.

7. What is the maximum number of detectors in MCX?

   Answer: The build-in maximum detector number in MCX is 256. You can
   modify this number in the mcx_const.h unit.

8. My simulation created an empty history file, why is that?

   Answer: This is typically caused by the detector position offset due to
   the incorrectly assumed coordinate system origin.

   In MCX, the default coordinate system is the MATLAB volume index (in
   {x,y,z} float triplet, all starting from 1.0). As a result, the origin
   of the volume (the corner of the diagonal direction of the first voxel)
   is (1,1,1) instead of (0,0,0). If you want to use (0,0,0) as the
   origin, you can do so by adding "--srcfrom0 1" in the command line. The
   following two figures (the bottom face of an 8x8x8 volume) show the
   differences between these two options:

       default or --srcfrom0 0                       --srcfrom0 1
  [23]upload:detmask_coordinates.png
                                     [24]upload:detmask_coordinates_srcfrom0.png

   You can find more discussions here:

   [25]http://groups.google.com/group/mcx-users/browse_thread/thread/e5e01
   40d7e73e4bf?hl=en

   A photon detection event only happens when a photon escapes from the
   target to the exterior space. This includes two situations:

    1. a photon moving from a non-zero voxel to a 0-voxel
    2. a photon moving beyond the bounding box of the volume

   Thus, in order for a detector to capture an escaped photon, it MUST be
   located on the interface between the zero/non-zero voxels, or on the
   bounding box (within the detector radius). This makes it very sensitive
   to the coordinate origin issue above when the detector radius is 1mm or
   less, because if you mistakenly offset your detector by 1mm, the
   detector will capture nothing, thus giving you an empty history file.

   To help better use of this feature, starting from MCX 0.5.2, we allow
   users to specify coordinate origin types in the input file. The 3rd row
   of an input file now accommodates a 4th input, specifying the srcfrom0
   flag. For example
  30.0 30.0 0.0 1

   sets the srcfrom0 flag to 1 (the last integer). As a result, the volume
   origin is set to (0,0,0). This is equivalent to
  31.0 31.0 1.0 0

   or
  31.0 31.0 1.0

   This setting will be effective for both source and detector positions.

References

   1. http://mcx.sourceforge.net/cgi-bin/index.cgi?action=rss
   2. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#I_am_getting_a_kernel_launch_timed_out_error_what_is_that
   3. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#When_should_I_use_the_atomic_version_of_MCX
   4. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#How_do_I_interpret_MCX_s_output_data
   5. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#Does_MCX_support_multiple_GPUs_in_a_single_computer
   6. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#Will_you_consider_porting_MCX_to_MPI_to_run_on_my_cluster
   7. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#What_is_the_maximum_number_of_media_types_that_MCX_can_handle
   8. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#What_is_the_maximum_number_of_detectors_in_MCX
   9. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/FAQ#My_simulation_created_an_empty_history_file_why_is_that
  10. http://stackoverflow.com/questions/17186638/modifying-registry-to-increase-gpu-timeout-windows-7
  11. http://askubuntu.com/questions/281858/boot-to-command-line-12-04-grub-options-not-working
  12. http://askubuntu.com/questions/148321/how-do-i-stop-gui
  13. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/Reference
  14. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/Reference
  15. http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/ReleaseNotes/0.4.9
  16. http://mcx.sourceforge.net/cgi-bin/index.cgi?MMC/Doc/FAQ#How_do_I_interpret_MMC_s_output_data
  17. http://www.gnu.org/software/parallel/
  18. http://mcx.sourceforge.net/cgi-bin/index.cgi?MMC/Doc/MMCCluster
  19. http://mcx.sourceforge.net/cgi-bin/index.cgi?MMC/Doc/FAQ#Will_you_consider_porting_MMC_to_MPI_to_run_on_my_cluster
  20. http://www.gnu.org/software/parallel/
  21. http://mcx.sourceforge.net/cgi-bin/index.cgi?MMC/Doc/MMCCluster
  22. https://mcx.svn.sourceforge.net/svnroot/mcx/mcextreme_cuda/branches/media16
  23. http://mcx.sf.net/upload/detmask_coordinates.png
  24. http://mcx.sf.net/upload/detmask_coordinates_srcfrom0.png
  25. http://groups.google.com/group/mcx-users/browse_thread/thread/e5e0140d7e73e4bf?hl=en
