Optimizing & Parallelizing a Large Commercial Code for Modeling Oil-well Networks
Abstract
In this project, a complex, serial application that models networks of oil wells is analyzed for today's parallel architectures. By heavy use of the profiling tool Valgrind, several serial optimizations are achieved, causing up to a 30-50x speedup on previously dominant sections of the code, on different architectures. Our initial main goal is to parallelize our application for GPGPUs (General Purpose Graphics Processing Units) such as the NVIDIA GeForce 8800GTX. However, our optimized application is shown not to have a high enough computational intensity to be suitable for the GPU platforms, with the data transfer over the PCI-express port showing to be a serious bottleneck. We then target our applications for another, more common, parallel architecture -- the multi-core CPU. Instead of focusing on the low-level hotspots found by the profiler, a new approach is taken. By analyzing the functionality of the application and the problem it is to solve, the high-level structure of the application is identified. A thread pool in combination with a task queue is implemented using PThreads in Linux, which fit the structure of the application. It also supports nested parallel queues, while maintaining all serial dependencies. However, the sheer size and complexity of the serial application, introduces a lot of problems when trying to go multithreaded. A tight coupling of all parts of the code, introduces several race conditions, creating erroneous results for complex cases. Our focus is hence shifted to developing models to help analyze how suitable applications with traversal of dependence-tree structures, such as our oil well network application is, given benchmarks of the node times. First, we benchmark the serial execution of each child in the network and predict the overall parallel performance by computing dummy tasks reflecting these times on the same tree structure on two given well networks, a large and a small case. Based on these benchmarks, we then predict the speedup of these two cases, with the assumption of balanced loads on each level in the network. Finally, the minimum amount of time needed to calculate a given network is predicted. Our predictions of low scalability, due to the nature of the oil networks in the test cases, are then shown. This project thus concludes that the amount of work needed to successfully introduce multithreading in this application might not be worth it, due to all the serial dependencies in the problem the application tries to solve. However, if there are multiple individual networks to be calculated, we suggest using Grid technology to manage multiple individual instances of the application simultaneously. This can be done either by using script files or by adding DRMAA API calls in the application. This, in combination with further serial optimizations, is the way to go for good speedup for these types of applications.