Parallel Computing 23 (1997) 943-952

Profiling for efficient parallel volume visualisation

Cemal Köse *,1, Alan Chalmers

Department of Computer Science, University of Bristol, Bristol, BS8 1UB, UK

Received 1 June 1996; revised 21 November 1996
Abstract

Visualising a multi-dimensional volume data set is a computationally intensive process. Parallel processing offers the potential for achieving the visualisation in a reasonable time. This paper discusses how an appropriate data management approach and the correct management of tasks in the parallel implementation can improve overall system performance. A number of profiling and load balancing strategies are considered to exploit any coherence that may exist as the view point moves.

Keywords: Volume visualisation; Prefetching; Load balancing; Profiling
1. Introduction

Volume visualisation allows meaningful and intuitive information to be extracted from three-dimensional volume data. Volume data sets, such as medical data from CAT or MRI scans, are becoming more freely available. The ability to interactively visualise these provides a powerful tool to scientists and engineers, allowing them to examine a complex three dimensional volume from a variety of orientations and to investigate its structure and complexity [3]. The volume is visualised by rendering it from selected view points. The rendering of the volume is achieved by converting the volume data into a two dimensional image. In the simplest form a volume is defined as a collection of voxels, or small cubic cells. Each voxel stores a collection of attributes pertaining to a unit of space. These attributes may consist of, for example, a scalar value representing material density or a vector representing flow direction. Each voxel thus has visual properties such as colour, opacity
* Corresponding author. E-mail: [email protected]
1 Supported by the Black Sea Technical University, Turkey.

0167-8191/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved
PII S0167-8191(97)00036-7
and reflectivity. The rendering process samples the three dimensional space at discrete grid points. Classification, re-sampling, shading and smoothing or filtering operations are now applied to produce the image from the current view point. The computational effort required to render a single image of a complex volume is significant and may take many minutes, even hours, to render on a conventional machine. Parallel processing offers the potential of significantly reducing this rendering time. However, volume data sets exhibit certain characteristics, such as very large data requirements and variations in computational complexity, which complicate their visualisation on multiprocessor systems [4,10,11]. These issues must be addressed before an efficient parallel implementation may be possible.
2. Ray casting

Of all the techniques developed for volume rendering, ray casting is perhaps the most simple and is well suited for parallel processing [2,4,5,9]. The ray casting method casts a group of rays from the view point through the image plane to the volume data [5]. As each ray travels through the volume data, the data is interpolated to generate new sample points at the intersection points along the path of the ray. Opacity and intensity are accumulated along the path of a ray. This path is terminated when the volume has been passed through, or the accumulated opacity along the ray surpasses a desired tolerance. Terminating the path of any ray whose accumulated opacity is greater than the desired tolerance is a significant technique for optimising the ray casting method [5]. The volume is finally shaded according to light transmission and reflection.

The rendering algorithm thus traces the rays through the voxels until they hit a surface and then assigns an intensity inversely proportional to the distance from the eye. The radiation transfer equation with single scattering approximations is used to simulate transmission of light through the volume and model reflectance from the layered volume. Opacity and inverse transparency are defined as scalar functions and evaluated at the nearest face of each cell along the ray's path. This path is stepped along until the entire cell has been traversed with evaluations of the scalar field, shading function, opacity, and texture mapping [5].
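The front-to-back accumulation and early termination described above can be sketched as follows. This is a minimal illustration only: the voxel grid layout, the (colour, opacity) voxel representation, the fixed step size and the opacity cutoff are assumptions for the sketch, not details of the paper's transputer implementation.

```python
def cast_ray(volume, origin, direction, step=1.0, opacity_cutoff=0.95):
    """Accumulate colour and opacity front to back along one ray.

    `volume[z][y][x]` is assumed to hold a (colour, opacity) pair per
    voxel; the path is terminated early once accumulated opacity
    exceeds the cutoff, since everything behind is hidden.
    """
    colour, alpha = 0.0, 0.0
    t = 0.0
    while True:
        x = int(origin[0] + t * direction[0])
        y = int(origin[1] + t * direction[1])
        z = int(origin[2] + t * direction[2])
        # Stop when the ray leaves the volume.
        if not (0 <= z < len(volume) and 0 <= y < len(volume[0])
                and 0 <= x < len(volume[0][0])):
            break
        c, a = volume[z][y][x]
        # Front-to-back compositing: later samples are attenuated by
        # the opacity already accumulated along the ray.
        colour += (1.0 - alpha) * a * c
        alpha += (1.0 - alpha) * a
        # Early termination: an effectively opaque accumulation hides
        # the rest of the volume, so further sampling is wasted work.
        if alpha >= opacity_cutoff:
            break
        t += step
    return colour, alpha
```

With a uniformly half-opaque volume the accumulated opacity approaches 1 geometrically, so lowering the cutoff terminates the path after fewer samples, which is exactly the saving early termination provides for high density objects.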
3. Parallel volume visualisation

The computational complexity has led to many parallel algorithms being developed recently, for example [7,8]. Early parallel approaches targeted volume rendering directly on specialised, and thus expensive, hardware. Here, we consider parallel volume visualisation on a general purpose MIMD system: a network of transputers. A single computational element of a parallel rendering algorithm may be chosen as the calculation of the local colour and opacity contribution of an intersection of a voxel of the volume data with a ray cast through a pixel of the image plane. Parallel volume rendering may now be classified as either image partitioning or volume partitioning depending on how these computational elements are combined as tasks in the parallel
Fig. 1. Division of data and tasks (a) Image partitioning (b) Volume partitioning.
implementation [10]. Fig. 1 shows the difference between these two approaches in two dimensions. In this figure we assume there are three processing elements and that a third of the volume data is accommodated at each processing element. In image-partitioning the image plane is initially evenly partitioned amongst the processors. This is equivalent to a balanced data driven computational model [1]. Each processor is responsible for computing the pixel values for its allocated image portion. The work load at each processor is proportional to the number of pixels of the image plane to be computed. As can be seen in Fig. 1(a), with large distributed volume data sets, image partitioning may require a processing element to fetch data items from other processing elements in order to complete its tasks. To ensure an even load balance it must be possible to migrate some tasks from those processing elements allocated complex tasks to those whose initial allocation contained computationally easier tasks. As each processing element is responsible for the rendering tasks of its region, there is no need for an additional combination of partial results to produce the final image.

The volume-partitioning method performs the reconstruction and re-sampling tasks with the portion of the volume data held at each processing element. The large volume data set is distributed amongst the processing elements and thus each processing element may only compute partial results of the tasks from its allocated portion of the volume data. In order to render the final image, it is necessary, therefore, to combine the partial results computed by several processing elements, as shown for a single pixel in Fig. 1(b). The advantage of this method is, of course, that there is no need for a processing element to fetch potentially large amounts of volume data from other processing elements. In this paper we have used image partitioning for our parallel implementation of volume visualisation in order to avoid the communication overhead associated with combining the partial results and so that we may exploit the early termination optimisation of ray casting.
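The per-pixel combination of partial results that volume partitioning requires can be sketched as front-to-back compositing of ray segments. The (colour, alpha) segment representation, with colour pre-multiplied within each segment, is an assumption for illustration; the paper does not specify the compositing formulation it would use.

```python
def composite_partials(partials):
    """Combine per-PE (colour, alpha) ray segments for one pixel.

    `partials` must be ordered front to back along the ray; each
    segment's colour is assumed pre-multiplied by its own opacity.
    """
    colour, alpha = 0.0, 0.0
    for c, a in partials:
        # Standard 'over' operator: the segment behind is attenuated
        # by the opacity accumulated in front of it.
        colour += (1.0 - alpha) * c
        alpha += (1.0 - alpha) * a
    return colour, alpha
```

Because the 'over' operator is associative, segments can also be combined pairwise in a tree, which is the basis of schemes such as binary-swap compositing [7].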
4. Task management

Image partitioning requires each processing element to perform the local colour and opacity calculations for the rays cast through its allocated portion of the image plane. The volume data is evenly distributed as a resident set amongst the processing elements and remains in situ at the processing elements throughout the entire volume visualisation. Completion of a task at one processing element may thus require some essential data to be fetched from other processing elements. Efficient management of this data is essential if system performance is not to be adversely affected [4]. Image partitioning does have one important advantage over volume partitioning: it is able to fully exploit the 'early termination' optimisation of ray casting. Early termination may occur if an opaque layer hides the rest of the volume from a cast ray or the opacity accumulation exceeds a certain level. A front-to-back opacity accumulation technique is able to detect this situation and thus stop any further computation of tasks on the path of the considered ray. Such early termination may save a substantial amount of computation, especially when considering high density objects [6]. The use of early termination, and the movement of the view point, means that the computational complexity of tasks will vary. Such variations may result in significant load imbalances within the system unless a load balancing scheme is adopted [1]. Task management ensures that tasks may migrate from processing elements which are struggling with high complexity tasks to those which have finished all their less difficult tasks.

4.1. Preferred bias allocation

The preferred bias method of task management is a way of allocating tasks to processing elements which combines the simplicity of a balanced data driven model with the flexibility of the demand driven approach [1]. The balanced data driven model allocates all tasks to specific processing elements prior to computation proceeding. On the other hand, with the demand driven model, once a processing element has completed a task it 'demands' another. The requesting processing element will be assigned the next available task packet from the task pool, and thus no processing element is bound to any area of the problem domain. As no data-dependencies exist between different pixels of the image plane, the order of task completion is unimportant. Once all tasks have been computed, the image is rendered. In the preferred bias method, the problem domain, that is the image plane, is divided into equal regions with each region being assigned to a particular processing element, as is done in the balanced data driven approach. However, in this method, these regions are purely conceptual in nature. A demand driven model of computation is still used, but the tasks are not now allocated in an arbitrary fashion to the processing elements. Rather, a task is dispatched to a processing element from its conceptual portion. Only once all tasks from a processing element's conceptual portion have been completed will that processing element be allocated its next task from the portion of another processing element which has yet to complete its conceptual portion of tasks.
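The preferred bias dispatch rule can be sketched as a small task pool. The deque-per-region representation and the "steal from the region with most work left" tie-break are illustrative assumptions; the paper only specifies that a PE exhausts its own conceptual portion before taking tasks from another's.

```python
from collections import deque

def make_regions(tasks, n_pes):
    """Split the task list into equal conceptual regions, one per PE."""
    size = (len(tasks) + n_pes - 1) // n_pes
    return [deque(tasks[i * size:(i + 1) * size]) for i in range(n_pes)]

def next_task(regions, pe):
    """Preferred bias: serve from the PE's own conceptual region first.

    Only when the PE's own region is empty is a task taken from another
    region (here: the one with the most uncompleted tasks).
    """
    if regions[pe]:
        return regions[pe].popleft()
    donor = max(range(len(regions)), key=lambda p: len(regions[p]))
    if regions[donor]:
        return regions[donor].popleft()
    return None  # every region is exhausted; the image is complete
```

A demand driven loop would call `next_task` each time a PE finishes a task, so allocation stays biased towards each PE's own region while still balancing the tail of the computation.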
4.2. Cache profiling

The data associated with rays cast through the same pixels of the image plane will be different as the view point changes. So, although the resident sets of the volume data may be optimised for tasks allocated from one view point, for example the initial view, this will no longer be the case once the view point moves. Obviously, it is inefficient to redistribute all the data before starting tasks from the new view. Cache profiling attempts to allocate tasks for this new view point to those processing elements which are predicted to have the highest percentage of relevant resident data. Prior to any volume visualisation commencing, the resident volume data sets must be distributed to the processing elements in a predetermined fashion, typically optimised to the initial view point [10]. The system controller is aware of this initial distribution and can use this information for cache profiling after each subsequent view point move. Using the rotation and transformation information obtained from the user, the system controller 'clips' the volume relative to a fixed position within the data set and uses the results of this to allocate tasks to each processing element's conceptual region. For example, if the user rotates the volume by an amount α from one frame to the next, then the system controller can 'rotate' the known data locations by -α. Note that the data is not physically rotated; this temporary calculation is simply performed to determine the task allocation for each conceptual region. To determine the optimum allocation requires the x, y and z coordinates of these temporary locations to be compared, that is 'clipped'. As we will see in the results section, such an operation is complex and may actually reduce overall system performance. A simpler operation is to 'clip' relative to a single coordinate, for example x.
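The cheap single-coordinate version of this idea can be sketched as below. The rotation axis (z), the representation of each PE's resident data by a single centre point, and the nearest-in-x assignment rule are all assumptions made for the sketch; the paper's controller works on the actual known data locations.

```python
import math

def rotate_x(point, angle):
    """x-coordinate of `point` after rotation by `angle` about the z axis."""
    x, y, _ = point
    return x * math.cos(angle) - y * math.sin(angle)

def profile_tasks(task_x, pe_data_centres, alpha):
    """1D cache profiling sketch.

    The user rotated the volume by `alpha`, so the resident data is
    'rotated' by -alpha (a temporary calculation only; nothing moves).
    Each task, keyed by the x coordinate it samples, is allocated to
    the PE whose rotated data centre is nearest in x alone.
    """
    allocation = {}
    for task, x in task_x.items():
        allocation[task] = min(
            pe_data_centres,
            key=lambda pe: abs(rotate_x(pe_data_centres[pe], -alpha) - x))
    return allocation
```

Comparing only x makes the per-task cost a handful of arithmetic operations, which is why this 1D clip avoids the overhead that makes the full 3D comparison counterproductive.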
Although such a '1D clipping' no longer provides the optimum allocation of tasks to conceptual regions, it may be performed quickly. This avoids the benefits of cache profiling being swamped by the overheads of the clipping calculations.

4.3. Load balancing

Any variations in computational complexity between different conceptual regions of the image plane can result in significant load imbalances, with some processing elements standing idle while others struggle to complete their complex portions. It is essential, therefore, that tasks can migrate from processing elements which are struggling to those which have finished their conceptual allocation. The conceptual regions for each processing element are chosen to maximise the data coherence within the region and thus reduce the number of remote data fetches. On completion of its own conceptual region, if load balancing is to be maintained, a processing element must obtain a new task from somewhere else within the system. A processing element which now acquires a task from an arbitrary conceptual region will no longer be able to exploit data coherence. Profiling the data currently in its cache and making use of any coherence between this and tasks available at other processing elements would seem in principle to reduce the number of subsequent remote data fetches. However, the overhead required to examine and achieve a 'good match' of data and tasks is prohibitive and actually degrades system performance.
Instead, here we have examined four options for choosing from where a new task should be selected. In all cases a supplied task contains the profile of its data requests. This allows the data associated with the task to be prefetched by its new 'owner', thus overlapping computation with communication. To increase the exploitation of coherence, once a source processing element has been identified, the requesting processing element will keep obtaining tasks from this source while there are still tasks there that need to be completed. In order to keep each processing element informed of the current work load of all other processing elements, each processing element globally broadcasts a short message on completion of each task. Although there will be inevitable latency difficulties with such a system, nevertheless the state of the information provided will be sufficient for the purposes of selecting from which processing element to acquire a task. The four possibilities from where a processing element may find a task are:
(i) Simply ask for a task from the neighbouring processing element with the most uncompleted tasks.
(ii) Fetch the task from the processing element which has the most tasks available. If more than one has an equivalent number, then select the processing element which is physically closer. This source processing element then supplies the task directly.
(iii) The source processing element is selected as above, but the task is not returned directly to the requester. Instead the task is passed to the neighbouring processing element on the path to the requester. This processing element then supplies one of its own tasks to its neighbour on the path, and so on, until a task is supplied to the original requester.
(iv) A request for a task is sent to the neighbouring processing element on the path to the one with the most available tasks. This neighbour supplies the task and then in turn asks its neighbour on the same path for a task, and so on, until a task is acquired from the selected source.
The difference between these four methods is shown in Fig. 2. The advantage of requesting and supplying via neighbours is the reduction in communication overhead. This is particularly important as the volume data associated with all tasks may also be prefetched across a minimum distance. Such approaches are similar to the way in which a pile of sand disperses.
Fig. 2. Different task request strategies.
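The source-selection rule shared by strategies (ii)-(iv) can be sketched as follows. The `task_counts` and `distances` structures are hypothetical names standing in for the state built from the short completion broadcasts and the known torus topology; they are not data structures from the paper.

```python
def select_source(task_counts, distances, requester):
    """Pick the source PE for strategies 2-4.

    Choose the PE with the most uncompleted tasks; break ties by
    physical (network) distance from the requester.  `task_counts` is
    built from the short message each PE broadcasts on completing a
    task, so it may be slightly stale -- the text argues this is
    sufficient for source selection.
    """
    candidates = [pe for pe in task_counts
                  if pe != requester and task_counts[pe] > 0]
    if not candidates:
        return None  # no PE has work to give away
    # Sort key: most tasks first, then shortest distance on a tie.
    return min(candidates,
               key=lambda pe: (-task_counts[pe], distances[requester][pe]))
```

Strategy 2 would then fetch directly from the selected source, while strategies 3 and 4 route the request or the supplied task hop by hop through the neighbours on the path to it.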
[Plot: speed-up against number of PEs for linear speed-up, fetch on demand, and virtual memory.]
Fig. 3. Data management strategies.
[Plot: speed-up against number of PEs for linear speed-up, 3D clipping, 1D clipping, and no clipping.]
Fig. 4. A comparison of cache profiling methods.
[Plot: speed-up against number of PEs for linear speed-up, load balancing strategies 1-4, and no load balancing.]
Fig. 5. The different load balancing strategies.
5. Results
To illustrate the performance improvements provided by profiling, two volume data sets were visualised: a volume frame of the Mandelbrot set in the quaternions and a medical MRI scan of a human head. The results shown are the averages obtained after the volume data was rotated eight times about an arbitrary axis. All these results have been obtained on a Meiko system of sixty-four T800 transputers arranged in a torus configuration with a volume data size of 128 × 128 × 128 voxels. A prefetching data management strategy was used in all cases [4]. This strategy enables the data to be fetched from a remote location before it is required, in contrast with a simple fetch-on-demand approach. The performance improvement of the prefetch technique may be seen in Fig. 3.
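The benefit of prefetching comes from overlapping communication with computation: because a supplied task carries the profile of its data requests, the data for the next task can be fetched while the current one is being computed. A minimal sketch of that overlap, using a background thread in place of the transputer communication links (the `fetch_data` and `compute` callables are hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def render_with_prefetch(tasks, fetch_data, compute):
    """Compute task i while the data profile of task i+1 is fetched.

    `fetch_data(task)` stands in for the remote fetch described by the
    task's data-request profile; `compute(task, data)` stands in for
    the local rendering work.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as fetcher:
        pending = fetcher.submit(fetch_data, tasks[0]) if tasks else None
        for i, task in enumerate(tasks):
            data = pending.result()  # wait only if the fetch is slower
            if i + 1 < len(tasks):
                # Start fetching the next task's data before computing.
                pending = fetcher.submit(fetch_data, tasks[i + 1])
            results.append(compute(task, data))
    return results
```

When fetch and compute times are comparable, the fetch latency is hidden behind useful work, which is the improvement over fetch-on-demand shown in Fig. 3.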
Cache profiling helps select tasks to place in the conceptual region of a processing element from one frame to the next based on the contents of the processing element's data cache. The benefits of this technique can be seen in Fig. 4. The overheads of a full 3D clipping operation are such that system performance is worse than if no cache profiling had been used. The less optimum, but much quicker, 1D clipping is able to give a performance benefit of 3.8%. Fig. 5 compares the different load balancing strategies. As can be seen, strategy 2, direct-request direct-supply, is marginally better than strategy 3, direct-request neighbour-supply.

[Plot: average visualisation time against number of PEs for image partitioning and volume partitioning.]
Fig. 6. A comparison of image partitioning and volume partitioning.
Finally, Fig. 6 shows the average times taken to visualise the volumes for both the volume partitioning strategy and image partitioning including our cache profiling and load balancing enhancements. For low numbers of processing elements volume partitioning is marginally more efficient. However, as the number of processing elements increases, the benefits of the lower communication overheads of our image partitioning strategy become more significant. For 64 processing elements, image partitioning is more than 13 s faster than volume partitioning.
6. Conclusions

Without any load balancing, the image partitioning strategy for parallel volume visualisation was only able to achieve an average speed-up of 37.7. By allowing tasks to migrate from those processing elements which are busy with computationally complex tasks to those which have completed their less complex tasks, we have been able to increase the average speed-up to 48.7 on 64 processing elements.
At the start of each new frame, preferred bias allocation in conjunction with cache profiling allows the tasks to be assigned to those processing elements which contain the highest percentage of necessary data in their local caches. However, although it is possible to determine an optimum allocation of tasks based on the full 3D coordinates of the data and the new view point, such a computation is too complex to provide any benefit. Instead, by using a simpler, but less optimum, allocation it was possible to achieve a modest improvement in system performance. Despite the system performance improvements that can be achieved by cache profiling and careful task management, further improvements will still be needed if interactive volume visualisation is to be achieved. Future work will consider a hybrid of the image and volume partitioning schemes to combine the effective task management strategies that are possible with image partitioning with the reduced need to fetch remote data that is a feature of volume partitioning. In addition we will investigate complexity reduction schemes which will enable us to quickly render an approximation of the volume data between successive view points. Once the desired new view point has been reached, progressive refinement techniques will then be used to obtain the desired image quality.
References

[1] A.G. Chalmers, J.P. Tidmus, Practical Parallel Processing: An Introduction to Problem Solving in Parallel, International Thomson Publishing, London, 1996.
[2] V. Goel, A. Mukherjee, An optimal parallel algorithm for volume ray casting, Visual Comput. 12 (1996) 26-39.
[3] A. Kaufman, K. Höhne, P. Schröder, Research issues in volume visualisation, IEEE Comput. Graph. Appl. (1994) 63-67.
[4] C. Köse, A.C. Chalmers, Memory management strategies for parallel volume rendering, in: B. O'Neill (Ed.), 19th World Occam and Transputer User Group Meeting, IOS Press, Nottingham, March 1996.
[5] M.S. Levoy, Volume rendering: Display of surfaces from volume data, IEEE Comput. Graph. Appl. 8 (3) (1988).
[6] M.S. Levoy, Efficient ray tracing of volume data, ACM Trans. Graph. 9 (3) (1990).
[7] K. Ma et al., Parallel volume rendering using binary-swap compositing, IEEE Comput. Graph. Appl. (1994).
[8] P. Mackerras, B. Corrie, Exploiting data coherence to improve parallel volume rendering, IEEE Parallel Distrib. Technol. (1994) 8-16.
[9] U. Neumann, Communication costs for parallel volume rendering algorithms, IEEE Comput. Graph. Appl. (1994) 49-58.
[10] U. Neumann, Volume reconstruction and parallel rendering algorithms: A comparative study, Ph.D. thesis, The University of North Carolina at Chapel Hill, Department of Computer Science, 1993.
[11] R. Yagel, R. Machiraju, Data-parallel volume-rendering algorithms, Visual Comput. 11 (6) (1995) 319-338.