Optimization
Introduction
As we approach the end of Moore's Law, the conventional approach of improving performance by packing in more transistors is no longer viable. It is therefore imperative to adopt other methods of improving application performance. Cultivating a general mindset for optimization, where the objective is to use the available resources effectively and efficiently, can be a powerful mechanism for substantially enhancing the performance of applications. This technical blog presents a few simple optimization techniques that can be incorporated during the development, execution, and profiling of applications to achieve considerable performance gains. The techniques range from software development practices, like function inlining, likely/unlikely condition hints, loop tiling, and loop unrolling, to execution mechanisms like task isolation and CPU pinning, to performance monitoring and profiling.
Likely, Unlikely condition flow – Branch Prediction
Branch instructions can be expensive because the processor pipeline must be flushed whenever a branch is mispredicted. Consequently, branch prediction is a fundamental and extensively researched problem, where the objective is to predict, with high probability, the control flow through different conditional blocks. Most contemporary platforms support some form of branch prediction in hardware. This can either be static prediction (a fixed taken/not-taken prediction) or dynamic prediction (where the prediction is not fixed but changes dynamically based on the history of what happened previously in the conditional blocks under consideration). The likely and unlikely macros are compiler hints that a programmer can use to aid branch prediction by telling the compiler which way a condition is expected to go. The compiler uses this information to optimize the generated code: the likely block is placed on the fall-through path so that it executes without a taken branch, while the unlikely block is moved out of the hot path and reached only via a jump. The informed usage of the likely and unlikely macros can result in a considerable performance gain in a variety of scenarios. For example, the unlikely hint can be used for the conditional block corresponding to an index out-of-bounds check, which is expected to be a rare occurrence, whereas the likely hint can be used for the index-in-range block, which is expected to be the more frequent case.
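As a minimal sketch, assuming GCC or Clang, the likely and unlikely macros are conventionally defined on top of the __builtin_expect builtin; the macro definitions and the get_element function below are illustrative:

#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int get_element(const int *arr, size_t len, size_t idx)
{
    if (unlikely(idx >= len)) {
        /* out-of-bounds access: expected to be rare, kept off the hot path */
        return -1;
    }
    /* in-range access: expected to be the common, fall-through path */
    return arr[idx];
}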
Loop Unrolling
Loop unrolling is a loop transformation technique that trades binary size for execution time. It reduces (or completely removes) the number of loop iterations by replicating the loop body as a repeated sequence of similar, independent instructions. This improves the program's execution time by reducing or eliminating the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration. However, loop unrolling comes with its disadvantages, like increased binary executable size, a higher number of instruction cache misses, and reduced readability of the code. Therefore, loop unrolling has to be used judiciously, depending on the type of application, to achieve a performance gain.
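A minimal hand-unrolled sketch, summing an array with an assumed unroll factor of 4 (in practice, compilers often perform this transformation automatically at higher optimization levels):

#include <stddef.h>

int sum_unrolled(const int *arr, size_t n)
{
    int sum = 0;
    size_t i = 0;
    /* process four elements per iteration, cutting the loop-control overhead */
    for (; i + 4 <= n; i += 4) {
        sum += arr[i];
        sum += arr[i + 1];
        sum += arr[i + 2];
        sum += arr[i + 3];
    }
    /* handle any remaining elements */
    for (; i < n; i++)
        sum += arr[i];
    return sum;
}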
Inline
Function calls involve substantial overheads due to the various processes involved, like storing the memory address of the instruction following the function call, copying the arguments of the function to the stack, and storing the function return value. The inline keyword is a hint aimed at eliminating these overheads by requesting that the compiler perform inline expansion of the function under consideration. This is realized by inserting the function's code at each call site instead of emitting a call. The consequent reduction in function call overheads can result in a substantial performance gain. However, this can come with its own disadvantages, like increased binary executable size, a higher number of instruction cache misses, and overheads due to increased register utilization. Thus, inlining has to be used prudently, depending on the type of application, to achieve a performance gain.
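A small illustrative sketch (the function names are hypothetical; note that inline is only a request, and the compiler may honor or ignore it based on its own heuristics):

#include <stddef.h>

/* small, frequently called function: a good inlining candidate */
static inline int square(int x)
{
    return x * x;
}

int sum_of_squares(const int *arr, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += square(arr[i]);   /* the call overhead disappears once inlined */
    return sum;
}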
Prefetching
Prefetching, specifically cache prefetching, is a performance enhancement technique where instructions or data are fetched from their original storage in slower memory into a faster local memory like the cache before they are actually needed. Most modern processors support some form of hardware-based prefetching: a dedicated hardware mechanism watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream, and prefetches the instructions and/or data into the processor's cache. Modern-day compilers, through their extensive code analysis, can perform software-based prefetching by inserting additional "prefetch" instructions into the program. Further, in specific scenarios where the programmer has extensive knowledge of the memory access patterns that their application will encounter, "__builtin_prefetch" calls can be placed at strategic locations to prefetch the required data. However, arriving at an accurate prefetch distance, i.e., how far in advance the data has to be prefetched, can be an extremely challenging endeavor. The required cache lines have to be prefetched early enough so that the necessary data is available in the cache before it is required. At the same time, since a prefetch replaces existing cache lines with the prefetched ones, an overly aggressive prefetch can cause cache misses for data that is currently required and in use.
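A minimal sketch using the GCC/Clang __builtin_prefetch builtin; the prefetch distance of 16 elements is an assumed, tunable value that has to be measured for the target platform:

#include <stddef.h>

long sum_with_prefetch(const long *arr, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            /* arguments: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(&arr[i + 16], 0, 1);
        sum += arr[i];
    }
    return sum;
}

In practice, a simple sequential stream like this is usually handled well by the hardware prefetcher; software prefetching tends to pay off for irregular or indirect access patterns.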
False Sharing
False sharing is a performance-degrading usage pattern that can be encountered in systems with distributed, coherent caches. False sharing occurs when threads on different processors modify variables that reside on the same cache line. When a thread modifies a variable in its cache, the whole cache line on which the variable resides is marked as invalid. If another thread attempts to access a completely different variable on the same cache line, then the modified cache line is written back to memory, and the thread fetches the cache line from memory. This occurs because cache coherency is maintained on a cache-line basis and not for individual variables or elements. With false sharing, a thread is forced to fetch a more recent copy of a cache line from memory, even though the variable it is attempting to access has not been modified. Therefore, it is important to keep note of and account for false sharing and its possibilities while designing the data structures in our program. It is possible to avoid false sharing by reordering the variables or adding padding (unused bytes) between the variables to ensure that they do not occupy the same cache line. It is also advised not to use temporary, thread-specific data organized in arrays that are indexed by the thread IDs. The use of thread-local variables can be a helpful alternative in this scenario.
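A minimal sketch of the padding/alignment approach, assuming a 64-byte cache line (typical, but worth verifying for the target CPU); NUM_THREADS and the structure name are illustrative:

#include <stdalign.h>

#define CACHE_LINE_SIZE 64
#define NUM_THREADS 4

/* each counter is aligned to its own cache line, so the structure is padded
 * out to 64 bytes and two threads never write to the same line */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) unsigned long count;
};

/* one counter per thread: adjacent entries no longer share a cache line */
struct padded_counter counters[NUM_THREADS];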
Memory Locking
In certain scenarios, as in the case of real-time applications that require deterministic timing, paging is a major cause of unexpected execution delays that can cause the desired timing boundaries to be exceeded. Memory locking through system calls like mlock() and mlockall() is a technique aimed at preventing delays due to page faults. It involves locking part or all of the calling process's virtual address space into RAM, thereby preventing that memory from being paged out to swap space. The memory that can be locked includes the pages of the code, data, and stack segments, as well as shared libraries, user-space kernel data, shared memory, and memory-mapped files. All the locked pages are guaranteed to be resident in RAM until they are unlocked by the corresponding unlock system calls. Therefore, the judicious usage of memory locking for time-critical portions of an application can eliminate delays due to page faults and result in a performance gain.
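A minimal sketch for a Linux/POSIX system, locking all current and future pages into RAM before entering a time-critical phase:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* ... time-critical work: no page faults from pages swapped out to disk ... */

    munlockall();   /* release the locks once the critical phase is over */
    return 0;
}

Note that locking large amounts of memory may require raising the RLIMIT_MEMLOCK resource limit or appropriate privileges.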
Aligned Allocation
Let's consider a case where a 4-byte integer is allocated in the memory of a 32-bit system across two words. To elaborate, the first two bytes of the integer occupy the last two bytes of the ith word, whereas the latter two bytes of the integer occupy the first two bytes of the (i+1)th word. On most CPU architectures, an instruction that loads a type of x bytes completes in a single memory access only when the data is aligned on a boundary that is a multiple of x. In this case, that would imply that an instruction to load a 4-byte integer consumes one memory cycle only when the integer is aligned on a 4-byte boundary. Therefore, when a 4-byte integer is spread across two words, as explained above, multiple memory accesses are required to fetch the two halves of the integer. The solution to this expensive problem is to ensure that data of different types are always allocated on their corresponding boundaries (char 1-byte aligned; short 2-byte aligned; int, float, and pointers 4-byte aligned on a 32-bit system) by providing appropriate padding. For primitive data types, malloc is guaranteed to return a memory block suitably aligned for any of them. But in situations that mandate a stricter alignment, as in the case of vector instructions (for example, aligned AVX-512 loads and stores operate on 64-byte registers and require 64-byte alignment), aligned_alloc can be used to allocate memory with the required alignment. In the case of user-defined data types, like structs, alignment is achieved by structure padding, where padding bytes are inserted between the different members of the structure to ensure that each member is aligned appropriately. However, in certain scenarios with restrictions on memory, it may be necessary to avoid the default padding. A preliminary solution would be to reorder the members of the structure to minimize the padding bytes. Further, there are various ways to completely disable struct padding and ensure that the structure is packed without any padding bytes; one such way is to use the #pragma pack(1) directive. Therefore, a comprehensive knowledge of alignment and the various scenarios encountered can aid in arriving at the design best suited to the requirements and constraints of the problem under consideration.
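A small illustrative sketch of both ideas, structure packing and C11 aligned_alloc (the struct sizes shown in the comments assume a typical compiler with 4-byte int alignment):

#include <stdio.h>
#include <stdlib.h>

struct padded {   /* sizeof == 8: three padding bytes are inserted after c */
    char c;
    int  i;
};

#pragma pack(1)
struct packed {   /* sizeof == 5: no padding, but i may now be misaligned */
    char c;
    int  i;
};
#pragma pack()

int main(void)
{
    /* aligned_alloc (C11): the size must be a multiple of the alignment */
    double *buf = aligned_alloc(64, 1024 * sizeof(double));   /* 64-byte aligned, e.g. for AVX-512 */
    if (buf == NULL)
        return 1;
    /* ... use the aligned buffer ... */
    free(buf);

    printf("padded: %zu bytes, packed: %zu bytes\n",
           sizeof(struct padded), sizeof(struct packed));
    return 0;
}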
Memory Management
Memory allocation calls like malloc in C/C++ are considerably expensive (and may, in turn, issue system calls such as brk or mmap). In certain applications, it may not be acceptable to pay this penalty on every allocation. An alternative solution is to design a custom memory management system. First, the constraints and requirements of the application must be thoroughly analyzed to determine the total memory necessary over the lifecycle of the program's execution. This chunk of memory can then be allocated with a single malloc call at the very beginning of the program. Subsequent requests for memory can then be satisfied from this chunk rather than through repeated malloc calls. This amounts to a preallocated memory pool managed by the application. The entire chunk can then be freed with a single free call at the end of the program. There is one other possibility, where the memory requirements of the application exceed the total memory that can be allocated at once. The conventional solution is to perform a sequence of malloc and free calls to ensure that the memory requirements of the application are satisfied. This can again be expensive due to the repeated malloc and free calls. The alternative would be to allocate a memory chunk of the maximum feasible size at the beginning and design a complete, custom memory manager with bookkeeping structures and allocation strategies (first-fit, best-fit, worst-fit), where the free chunks are maintained as a linked list so they can be allocated and freed accordingly. Although this involves substantial effort in designing a custom memory manager and additional memory overhead for the bookkeeping structures, it can lead to considerable performance improvements in applications where the penalties of repeated malloc calls are unacceptable.
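A minimal sketch of the simpler case: a bump-style memory pool that performs one upfront malloc and hands out aligned slices of it. The names pool_t, pool_init, pool_alloc, and pool_destroy are hypothetical, and a full allocator with free lists and fit strategies would be considerably more involved:

#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;     /* start of the single upfront allocation */
    size_t         size;     /* total pool size in bytes */
    size_t         offset;   /* current bump position */
} pool_t;

int pool_init(pool_t *p, size_t size)
{
    p->base = malloc(size);              /* the only malloc over the pool's lifetime */
    p->size = size;
    p->offset = 0;
    return p->base ? 0 : -1;
}

void *pool_alloc(pool_t *p, size_t n)
{
    n = (n + 15) & ~(size_t)15;          /* keep every allocation 16-byte aligned */
    if (p->offset + n > p->size)
        return NULL;                     /* pool exhausted */
    void *ptr = p->base + p->offset;
    p->offset += n;
    return ptr;
}

void pool_destroy(pool_t *p)
{
    free(p->base);                       /* the only free over the pool's lifetime */
    p->base = NULL;
}

Individual allocations from such a pool are not freed one by one; the whole pool is released at once, which matches the first scenario described above.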
Locality of Reference
Locality of reference is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality, namely temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively short time duration. Spatial locality refers to the use of data elements within relatively close storage locations. The organization of the memory hierarchy is based on this concept of locality of reference. A typical memory hierarchy ranges from the CPU registers at the top, through the different levels of cache (L1, L2, and L3) and RAM, down to disk storage at the lowest level. As we move up the hierarchy from disk storage toward the caches and registers, the size of the memory keeps decreasing while its speed keeps increasing. Caches are a prime example of exploiting locality of reference. They exploit temporal locality by storing recently referenced data, expecting it to be reused in the immediate future. They exploit spatial locality by loading data from the lower levels of memory in units of cache lines, the basic unit of cache storage. When a particular data element has to be loaded, the entire memory block of one cache-line size containing that element is loaded into the cache, in anticipation of accesses to neighboring data. A comprehensive knowledge of locality of reference and its adoption in the memory hierarchy, specifically the cache, can lead to important insights that result in performance gains. Let's consider the example of matrix multiplication to highlight the effect of locality of reference on performance.
Let C = A × B represent the matrix multiplication operation, where A is an l×m matrix, B is an m×n matrix, and C is the resulting l×n product matrix.
The typical structure of the matrix multiplication nested loops, written in C, is as follows:
for (int i = 0; i < l; i++) {
    for (int j = 0; j < n; j++) {
        double temp = C[i][j];
        for (int k = 0; k < m; k++)
            temp += A[i][k] * B[k][j];   /* B is traversed down a column: poor spatial locality */
        C[i][j] = temp;
    }
}
A drastic performance gain, especially for larger matrices that cannot fit completely in the caches, can be achieved by switching the looping order of j and k as follows:
for (int i = 0; i < l; i++) {
    for (int k = 0; k < m; k++) {
        double temp = A[i][k];
        for (int j = 0; j < n; j++)
            C[i][j] += temp * B[k][j];   /* both C and B are traversed along a row: cache friendly */
    }
}
This follows from the fact that matrices are stored in row-major order in C/C++ (and many other languages). Therefore, the probability of finding elements of the same row in the cache is higher than that of finding elements of different rows, due to the spatial locality that caches exploit by loading whole cache lines. In the first case, the reads of A[i][k] hit the cache (as the i index is constant and the k index is contiguous in the innermost loop), but the reads of B[k][j] do not (as the k index, which selects the row of B, changes on every iteration of the innermost loop). Therefore, a penalty is incurred due to the cache misses. In the second case, the reads and writes of C[i][j] and the reads of B[k][j] are both cache friendly (as the j index, which selects the column, is the only one changing while the i and k indices are constant in the innermost loop). Therefore, the cache-miss penalty is largely eliminated.
Loop Tiling
Loop tiling (also known as loop blocking) is a program optimization technique aimed at improving the locality of reference. It modifies the memory access pattern of the loop in order to reuse data already present in the data cache before that data gets evicted and needs to be reloaded from memory. The basic idea behind loop tiling is to split the processing of data into smaller segments called tiles or blocks. As these blocks are smaller than the original working set, they fit into the data caches more readily, thereby reducing cache misses significantly. Loop tiling is typically applied to a loop nest that iterates repeatedly over the same dataset, as in the sketch below.
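An illustrative sketch of a tiled matrix multiplication in C; the tile size of 64 is an assumed, tunable parameter that should be chosen so that the active tiles of B and C fit in the target cache, and C is assumed to be zero-initialized:

#define TILE 64

void matmul_tiled(int l, int m, int n,
                  const double *A, const double *B, double *C)
{
    for (int kk = 0; kk < m; kk += TILE) {
        for (int jj = 0; jj < n; jj += TILE) {
            int kmax = (kk + TILE < m) ? kk + TILE : m;
            int jmax = (jj + TILE < n) ? jj + TILE : n;
            for (int i = 0; i < l; i++) {
                for (int k = kk; k < kmax; k++) {
                    double temp = A[i * m + k];
                    for (int j = jj; j < jmax; j++)
                        C[i * n + j] += temp * B[k * n + j];   /* reuse the cached tile of B */
                }
            }
        }
    }
}

Compared with the loop-interchanged version shown earlier, each tile of B is reused across many rows of A while it is still resident in the cache.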
Compilers
The primary purpose of compilers is to translate the high-level source code written in one programming language into low-level machine code that can be executed on the underlying hardware. However, modern compilers have evolved drastically and support a wide range of optimizations intended to improve the performance of applications significantly. A large number of the optimizations touched upon in this blog can be achieved through the compiler. One such way is to use compiler directives, which are essentially suggestions to the compiler to optimize a given code section in a particular manner. Further, compilers can apply these optimizations in a fully automated manner, without any user intervention, when the different optimization levels, such as -O1, -O2, and -O3, are enabled. These cover a variety of optimizations, ranging from simple ones like inlining to sophisticated ones like auto-vectorization, auto-parallelization, etc., that the compiler performs after extensive analysis of the program under consideration. LLVM is one such compiler framework that provides interesting mechanisms for additional optimizations. LLVM is designed around a portable, language-agnostic Intermediate Representation (IR) that completely decouples the front end and back end of the compiler; the LLVM IR is therefore both language- and target-independent. LLVM allows programmers to write custom optimization passes that can be integrated into the compiler framework with relatively little effort. This opens up a whole new set of possibilities where passes specific to the requirements of an application can be implemented, resulting in considerable performance gains. Therefore, a good understanding of the underlying compiler, its supported features and options, the strategic usage of compiler directives and optimization levels, and the implementation of custom passes where applicable can drastically enhance the performance of applications.
Task Isolation
Real-time systems and high-bandwidth networking applications with user-space drivers require guaranteed access to the CPU without even brief interruptions. The nohz_full (adaptive-tick) mode in Linux allows for partial task isolation, where the number of interrupts that a particular CPU receives is decreased substantially. However, it does not eliminate interruptions completely; the running task can still be interrupted by page faults, delayed work queues, etc. The isolcpus boot parameter in Linux goes further: the specified CPUs are removed from the general scheduler and are used only by processes that explicitly set their affinity to them, thereby preventing those tasks from competing with the rest of the workload for CPU time. These features reduce interruptions on the isolated CPUs but do not fully eliminate them; task isolation is an attempt to finish the job by removing all interruptions, so that a process that enters isolation mode can run in user space with no interference from the kernel or other processes. Therefore, the informed usage of task isolation can aid the accurate functioning of time-sensitive applications where there is no room for interruptions.
Processor Affinity or CPU Pinning
A pre-emptive multitasking operating system continually reschedules jobs on a multicore processor to optimize overall system performance. Each time a job or thread is pre-empted and another is scheduled, the operating system may assign the new job or thread to whichever core is free. As a result, the core on which a given thread or process runs can differ every time it is scheduled. Each time a thread is assigned to a core, the processor must copy the thread's relevant data and instructions into that core's cache if they are not already there, meaning the cache has to be warmed up again whenever a thread is rescheduled onto a new core. Processor affinity, or CPU pinning, enables applications to bind or unbind a process or thread to a specific core or to a range of cores or CPUs. If a thread is pinned, the operating system ensures that it executes only on the assigned core(s) or CPU(s) each time it is scheduled. Processor affinity takes advantage of the fact that remnants of a previous execution, such as warm cache contents, may still be valid when the same process or thread is scheduled again on the same processor. So, the cache may still be warm when a thread's execution resumes after being pre-empted.
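A minimal Linux sketch using sched_setaffinity to pin the calling process to a single core; the choice of core 2 is arbitrary, and pthread_setaffinity_np can be used analogously for individual threads:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                                    /* allow execution on core 2 only */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = the calling process */
        perror("sched_setaffinity");
        return 1;
    }

    /* ... from here on, the scheduler keeps this process on core 2 ... */
    return 0;
}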
Performance Monitoring
Monitoring the performance of an application and its various components is often a requirement to accurately identify candidates for optimization. The need to monitor performance at a granular scale is particularly acute for real-time applications with stringent constraints. In this direction, computing the time taken by a particular block of code is commonplace. This can be achieved using the various timing APIs provided by different libraries. However, the resolution of these measurements (usually seconds or microseconds) might not be granular enough for many applications. Moreover, they may not be completely dependable in terms of accuracy due to factors like dynamic frequency scaling, turbo boost, etc. The RDTSC and RDTSCP instructions provide a means to measure the time consumed accurately in such scenarios. These instructions read the constant-rate Time Stamp Counter (TSC) present in the underlying processor. The TSC ticks at a constant frequency, known as the TSC frequency, that is independent of the current CPU frequency, which may ebb and flow due to DVFS and other power-saving optimizations. Therefore, the TSC ticks count the passage of time, not the number of CPU cycles elapsed. These instructions can be placed before and after the code section under consideration; the difference between the two readings gives the number of TSC ticks that elapsed during the execution of that section, and dividing this difference by the TSC frequency gives the time consumed at very high (nanosecond-level) resolution. A plain RDTSC instruction can be reordered by the out-of-order execution machinery of modern processors, in which case the measured interval may not correspond to the code section under consideration. Using the RDTSCP instruction helps in such situations, as RDTSCP waits for all preceding instructions to complete before reading the counter.
Further, there are various comprehensive tools for performance monitoring and analysis that can provide a detailed account of the application under consideration. The Linux perf utility and Intel VTune are two such tools that can be used for extensive profiling. Intel VTune provides both a CLI and a GUI, with features to detect hotspots, memory consumption, anomalies, the degree and effectiveness of parallelization, microarchitectural bottlenecks, memory access details, etc. Similarly, the Linux perf utility provides a CLI to profile events like cycles consumed, cache misses at the different cache levels, stalled cycles, branch misses, page faults, context switches, etc. Both tools make use of the Performance Monitoring Units (PMUs) present in modern processors to count events and consequently calculate various profiling metrics for an application. The use of such profiling tools can help in detecting candidates for optimization, which can significantly enhance the performance of applications.
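A minimal sketch for x86 with GCC or Clang, using the __rdtsc and __rdtscp intrinsics; the TSC frequency of 3.0 GHz is an assumed placeholder that must be determined for the actual machine (for example, from the kernel log) in order to convert ticks into time:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    unsigned int aux;
    uint64_t start = __rdtsc();

    /* ... code section under measurement ... */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;

    uint64_t end = __rdtscp(&aux);       /* waits for preceding instructions to complete */

    double tsc_ghz = 3.0;                /* assumed TSC frequency in GHz (ticks per ns) */
    printf("elapsed: %llu ticks (~%.0f ns)\n",
           (unsigned long long)(end - start),
           (double)(end - start) / tsc_ghz);
    return 0;
}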
Conclusion
In this era of efficiency and sustainability, an ingrained mindset for optimization can be extremely significant and transformative. It must not merely be an afterthought; rather, optimization must be consciously taken into account during the formative phases of an application. At the same time, it is equally important to maintain a fine balance where optimization is not overdone at the cost of feasibility as highlighted by the caveats in the different sections of this blog.