Efficient C++
Dov Bulka and David Mayhew
1. The tracing war story
- Object definitions trigger silent execution in the form of
object constructors and destructors. Think of it as silent
execution rather than silent overhead: the work is real, but
not necessarily wasted.
- Just because we pass an object by reference does not guarantee
good performance. It would be helpful if we didn't have to construct
and destroy the object in the first place.
- Don't waste effort on computations whose results are not likely
to be used.
- Don't aim for the world record in design flexibility. A
char pointer can sometimes do simple jobs as well as, and
more efficiently than, a string.
- Inline. Eliminate the function call overhead that comes with
small, frequently invoked function calls.
2. Constructors and destructors
- Constructors and destructors may be as efficient as hand-crafted
C code.
- The construction (destruction) of an object triggers recursive
construction (destruction) of parent and member objects.
- Make sure that your code actually uses the objects that it
creates and the computations that they perform.
- The object life cycle is not free of cost. Don't create an
object unless you are going to use it.
- Compilers must initialize contained member objects prior to
entering the constructor body. Use the "member initialization list".
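
A minimal sketch of the member-initialization-list point (the
class and member names are invented for illustration):

    #include <string>

    class Slow {
        std::string name;
    public:
        // name is default-constructed before the body runs and
        // then assigned: two steps where one would do.
        Slow(const std::string& n) { name = n; }
    };

    class Fast {
        std::string name;
    public:
        // The member initialization list constructs name directly
        // from n: a single copy construction.
        Fast(const std::string& n) : name(n) {}
    };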
3. Virtual functions
- The cost of a virtual function stems from the inability to inline
calls that are dynamically bound at run-time (not from the
presence and de-referencing of the v-table pointer).
- Templates are more performance-friendly than inheritance
hierarchies - they push type resolution to compile-time. [virtual
functions are run-time polymorphism, templates are compile-time
polymorphism]
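
A sketch of the contrast (the Shape/Square types are invented
examples, not the book's code):

    // Run-time polymorphism: area() is resolved through the v-table,
    // so the call through a Shape& cannot be inlined.
    struct Shape {
        virtual ~Shape() {}
        virtual double area() const = 0;
    };

    struct Square : Shape {
        double side;
        explicit Square(double s) : side(s) {}
        double area() const { return side * side; }
    };

    double total_virtual(const Shape& s) { return s.area(); }

    // Compile-time polymorphism: the template is instantiated per
    // concrete type, so area() binds statically and can be inlined.
    template <typename T>
    double total_template(const T& s) { return s.area(); }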
4. The RVO (return value optimization)
- If you must return an object by value, the RVO will help
performance by eliminating the need for creation and destruction
of a local object.
- The application of the RVO is up to the discretion of the
compiler implementation.
- You will have a better shot at RVO by defining a
"computational constructor" [Complex::Complex( const Complex&
x, const Complex& y ) : real( x.real + y.real ), imag( x.imag +
y.imag ) { }]
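
Filling in the surrounding class makes the idea concrete. The
computational constructor is the snippet above; the rest of the
scaffolding is assumed for illustration:

    class Complex {
        friend Complex operator+(const Complex&, const Complex&);
        double real, imag;
    public:
        Complex(double r = 0.0, double i = 0.0) : real(r), imag(i) {}
        // Computational constructor: builds the sum directly, which
        // gives the compiler an easy shot at applying the RVO.
        Complex(const Complex& x, const Complex& y)
            : real(x.real + y.real), imag(x.imag + y.imag) {}
    };

    // The return value can be constructed straight into the caller's
    // storage; there is no named local object to copy and destroy.
    inline Complex operator+(const Complex& x, const Complex& y)
    {
        return Complex(x, y);
    }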
5. Temporaries
- A temporary object could penalize performance twice in the
form of constructor and destructor computations.
- Declaring a constructor explicit will prevent the
compiler from using it for type conversion behind your back.
- A temporary object is often created by the compiler to fix
a type mismatch. You can avoid it by overloading the function
for the argument type actually being passed.
- Avoid object copy if you can. Pass and return objects by
reference.
- You can eliminate temporaries by using compound operators
like operator+=.
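
A short sketch of the last two points (Buffer is an invented
class; the string code uses the standard library):

    #include <string>

    class Buffer {
        int size_;
    public:
        // explicit stops the compiler from silently converting a
        // plain int into a temporary Buffer behind your back.
        explicit Buffer(int size) : size_(size) {}
    };

    void concat(std::string& s, const std::string& a,
                const std::string& b)
    {
        // a + b materializes a temporary string to hold the sum;
        // it is assigned to s and then destroyed.
        s = a + b;

        // The compound operator updates s in place: no temporary.
        s = a;
        s += b;
    }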
6. Single-threaded memory pooling
- Flexibility trades off with speed. As the power and
flexibility of memory management increases, execution speed
decreases.
- The global memory manager is general purpose and
consequently expensive.
- Specialized memory managers can be more than an order of
magnitude faster than the global one.
- If you allocate mostly memory blocks of a fixed size, a
specialized fixed-size memory manager will provide a
significant performance boost (see the sketch after this
list).
- A similar boost is available if you allocate mostly memory
blocks that are confined to a single thread (skip the
concurrency issues that the global new and delete
must handle).
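
A minimal single-threaded, fixed-size pool in the spirit of
this chapter (a sketch under my own names, not the book's exact
class): allocation and deallocation become a couple of pointer
operations, and freed blocks are recycled instead of
round-tripping through global new/delete.

    #include <cstddef>
    #include <new>

    class FixedPool {
        union Node { Node* next; };
        Node* free_ = nullptr;        // free list of recycled blocks
        const std::size_t size_;
    public:
        explicit FixedPool(std::size_t blockSize)
            : size_(blockSize < sizeof(Node) ? sizeof(Node)
                                             : blockSize) {}
        ~FixedPool() {
            while (free_) {
                Node* n = free_;
                free_ = n->next;
                ::operator delete(n);
            }
        }
        void* allocate() {
            if (!free_)                      // miss: fall back on
                return ::operator new(size_);  // global new
            Node* n = free_;                 // hit: pop a recycled
            free_ = n->next;                 // block - no locking,
            return n;                        // no size lookup
        }
        void deallocate(void* p) {
            Node* n = static_cast<Node*>(p); // recycle the block
            n->next = free_;
            free_ = n;
        }
    };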
7. Multi-threaded memory pooling
- If you develop a set of efficient single-threaded allocators,
you can easily extend their reach into multi-threaded environments
by the use of templates.
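
A sketch of that extension, assuming the FixedPool sketch above
and using std::mutex as a stand-in for whatever lock class your
platform provides. The template composes any single-threaded
pool type M with any lock type L:

    #include <cstddef>
    #include <mutex>

    template <typename M, typename L = std::mutex>
    class MTPool {
        M pool_;                      // the single-threaded allocator
        L lock_;                      // serializes access to it
    public:
        explicit MTPool(std::size_t blockSize) : pool_(blockSize) {}
        void* allocate() {
            std::lock_guard<L> guard(lock_);
            return pool_.allocate();
        }
        void deallocate(void* p) {
            std::lock_guard<L> guard(lock_);
            pool_.deallocate(p);
        }
    };

    // Usage (hypothetical): MTPool<FixedPool> pool(sizeof(Widget));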
8. Inlining basics
- Inlining is the replacement of a method call with the code
for the method (see the sketch after this list).
- Inlining improves performance by removing call overhead and
allowing cross-call optimizations to take place.
- Inlining is primarily an execution-time optimization, though
it can also result in smaller executable images.
- Inlining efficiency is not an issue in the case of functions
whose cost is not dominated by call and return overhead. [p57]
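
A minimal sketch of the mechanism (invented function names):

    inline int square(int x) { return x * x; }

    int f(int a)
    {
        // With inlining, the compiler replaces the call with the
        // body, as if we had written: return a * a + 1;  There is
        // no call/return overhead, and cross-call optimizations
        // (constant folding, better register allocation) apply.
        return square(a) + 1;
    }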
9. Inlining - performance considerations
- Literal arguments combined with inlining give the compiler
significant opportunities for performance improvements.
- Inlining can increase code size. Large code size suffers a
higher rate of cache misses and page faults than smaller code.
- Nontrivial inlining decisions should be based on sample
execution profiles, not gut feelings.
- Consider rewriting high frequency methods that have a large
static size (lots of compiled code) but a small dynamic size
(few instructions executed on a typical call): extract the
significant dynamic component and inline only that (see the
sketch after this list).
- Trivial and singleton methods can always be inlined.
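
A sketch of that rewrite (the names and buffer layout are
invented). The append operation has a large static size (the
growth logic) but a small dynamic size (most calls just store a
byte), so only the hot path is inlined:

    #include <cstddef>

    void grow_and_append(char c);   // the large, rarely taken slow
                                    // path, defined out of line
    extern char buf[];
    extern std::size_t len, cap;

    inline void append(char c)
    {
        if (len < cap)
            buf[len++] = c;         // small dynamic component, inlined
        else
            grow_and_append(c);     // big static component, a real call
    }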
10. Inlining tricks
- The goal is to find a program's fast path and inline it,
though inlining this path may not be trivial.
- Conditional inlining lets you prevent inlining from
occurring, typically via a preprocessor flag. This decreases
compile time and simplifies debugging during the earlier
phases of development (see the sketch after this list).
- Selective inlining is a technique that inlines methods only
in some places. It can offset some of the code size explosion
potential of inlining a method by inlining method calls only on
performance-critical paths.
- Directly recursive methods cannot be inlined [p144].
Recursive inlining is an ugly but effective technique for
improving the performance of recursive methods.
- Care needs to be taken with local static variables.
- Inlining is aimed at call elimination. Be sure of the real
cost of calls on your system before using inlining.
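
A sketch of conditional inlining via the preprocessor
(INLINING_OFF and INLINE are invented macro names):

    // Compile with -DINLINING_OFF during development to keep
    // functions out of line (faster builds, easier debugging);
    // drop the flag for release builds to get inlining back.
    #ifdef INLINING_OFF
    #define INLINE          /* expands to nothing: out-of-line */
    #else
    #define INLINE inline
    #endif

    INLINE int min_of(int a, int b) { return a < b ? a : b; }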
11. STL
- The STL is an uncommon combination of abstraction,
flexibility, and efficiency.
- Some containers are more efficient than others for a
particular usage pattern (see the sketch after this list).
- Unless you know something about the problem domain that the
STL doesn't, it is unlikely that you will beat the performance
of an STL implementation by a wide enough margin to justify the
effort.
- It is possible, however, to exceed the performance of an
STL implementation in some specific scenarios.
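
One concrete instance of the usage-pattern point (a sketch):
front insertion is constant time on a deque but linear on a
vector, so the "right" container depends on how you use it.

    #include <deque>
    #include <vector>

    void fill_front(std::deque<int>& d, std::vector<int>& v)
    {
        for (int i = 0; i < 1000; ++i) {
            d.push_front(i);          // O(1) per insertion
            v.insert(v.begin(), i);   // O(n) shift per insertion
        }
    }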
12. Reference counting
- Reference counting is not an automatic performance winner.
Reference counting, execution speed, and resource conservation
form a delicate interaction that must be evaluated carefully
if performance is an important consideration. Reference
counting may help or hurt performance depending on the usage
pattern. The case in favor of reference counting is
strengthened by any one of the following items:
- the target object is a large resource consumer
- the resource in question is expensive to allocate and free
- a high degree of sharing is likely due to lots of copying
and assignment
- the creation or destruction of a reference is relatively
cheap
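
A minimal intrusive reference-counting sketch (single-threaded;
the class names are mine). Copies of the handle share one
BigResource instead of duplicating it, and the count tracks the
sharers:

    class BigResource {
        friend class Handle;
        int refs_ = 0;
        // ... large, expensive-to-copy state ...
    };

    class Handle {
        BigResource* p_;
        void release() { if (--p_->refs_ == 0) delete p_; }
    public:
        Handle() : p_(new BigResource) { p_->refs_ = 1; }
        Handle(const Handle& h) : p_(h.p_) { ++p_->refs_; }
        Handle& operator=(const Handle& h) {
            ++h.p_->refs_;   // increment first: safe on self-assignment
            release();
            p_ = h.p_;
            return *this;
        }
        ~Handle() { release(); }
    };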
13. Coding optimizations
- Ensure that the results of computations are actually used.
- Defer a computation to the point where it is actually
needed. Premature computations may never be used on some
execution flows.
- If a computation has already been performed, use the
previous result instead of computing a new one (see the
sketch after this list).
- Take advantage of simplifying assumptions. Reduced
flexibility increases speed.
- Some unnecessary flexibility is hidden in library calls.
Familiarize yourself with the hidden cost of system calls, and
roll your own implementation where the effort can be justified.
- Minimize memory management calls - they are relatively
expensive.
- Optimize the processing of the 20% of the data that is used
80% of the time.
- Write cache-friendly code.
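
A sketch of reusing a previous result by caching just the last
argument (single-threaded; slow_transform is an assumed
expensive function defined elsewhere):

    #include <cmath>

    double slow_transform(double x);   // assumed expensive

    double cached_transform(double x)
    {
        static double last_x = NAN, last_result = 0.0;
        if (x != last_x) {             // miss: compute and remember
            last_result = slow_transform(x);
            last_x = x;
        }
        return last_result;            // hit: reuse the previous result
    }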
14. Design optimizations
- Performance and flexibility are often enemies. On the 20%
of your software that executes 80% of the time, performance
often comes first at the expense of flexibility.
- Caching offers leverage at large and small levels of
granularity. You can avoid big blobs of computation by simply
stashing away the result of previous computations.
- The use of efficient algorithms and data structures is
necessary, but not sufficient.
- Some computations may be necessary only on a subset of the
likely execution scenarios. Those computations should be
deferred until the need for the result is demonstrated.
- Large-scale software often produces obsolete code. Periodic
purges of obsolete code boost performance and overall
"hygiene".
15. Scalability
- Amdahl's Law puts an upper limit on the potential
scalability of an application: with serial fraction s running
on N processors, speedup <= 1 / (s + (1 - s)/N). Scalability
is limited by the portions of the computation that must be
performed serially. The trick to scalability is to reduce
and, if possible, eliminate serial code.
- Split a monolithic task into multiple subtasks that are
conducive to parallel execution by concurrent threads.
- Critical sections should contain critical code and nothing
else. Code that does not directly manipulate shared resources
should not reside in the critical section.
- Cache. At times, it may be possible to eliminate execution
visits to a critical section by caching the result of an earlier
visit.
- Share nothing. If you only need a small, fixed number of
resource instances, you should avoid the use of public resource
pools. Make those instances thread-private and recycle them.
- Partial sharing. It is better to have two identical pools,
each with half the contention, than one pool that absorbs it
all.
- Lock granularity. Don't fuse resources under the protection
of the same lock unless they are always updated together.
- False sharing. Don't place two hot locks in close proximity
in the class definition. You don't want them to share the same
cache line and trigger "cache consistency" storms.
- When a lock is freed, are all waiting threads woken up, or
just one? Lock implementations that wake up all waiting
threads threaten the scalability of the application.
- Investigate the implementations of system and library calls.
Some are hiding significant portions of serial code.
- Shared data that is read-mostly will benefit from
reader/writer locks - they eliminate contention among reader
threads.
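
A sketch of the reader/writer point, using C++17's
std::shared_mutex as a stand-in for whatever reader/writer lock
your platform provides:

    #include <shared_mutex>

    class Config {
        mutable std::shared_mutex m_;
        int value_ = 0;
    public:
        int get() const {
            std::shared_lock<std::shared_mutex> r(m_);  // readers don't
            return value_;                              // block readers
        }
        void set(int v) {
            std::unique_lock<std::shared_mutex> w(m_);  // writers
            value_ = v;                                 // exclude all
        }
    };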
16. System architecture dependencies
- Cache thrash. One of the interesting effects of having
a cache in a multiprocessor system is the effect of cache
coherency protocols on cache performance. Cache thrash
is typically the result of multiprocessor systems that do
not maintain any "processor affinity" for their processes.
Processor affinity amounts to keeping track of which
processor a process ran on last time it executed, and trying
to increase the likelihood that it will execute on that same
processor the next time it executes. Cache thrash is an
antiscaling artifact in large-scale SMP systems. It is not
a C++ issue. It is purely an operating system issue. But
if processor affinity characteristics ever filter down into
multithreading/multiprocessor APIs, it is important to
understand the implications and leverage that exist. [p277]
- Avoid branching. Almost all modern processors are now
pipelined (instructions are decomposed into "stages" and these
stages can execute in parallel). Unfortunately, conditional
branches tend to stall the pipeline. [p279]
- Prefer simple calculations to small branches (see the
sketch after this list). [p280]
- The further the memory you want to use is from the processor,
the longer it takes to access. The resources closest to the
processor - registers - are limited in capacity but extremely
fast. Optimizing their use can be very valuable.
- Virtual memory is not free. Indiscriminate reliance on
system-maintained virtual structures can have significant
negative performance ramifications.
- Context switches are expensive - avoid them.
- We feel that the coming shift in processor architecture will
significantly disadvantage monolithic threading approaches (e.g.
internally managed asynchronous I/O).
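
On the "prefer simple calculations to small branches" point, a
sketch (invented function names): both loops count matches, but
the second replaces a data-dependent branch with arithmetic the
pipeline can stream through.

    int count_matches_branchy(const int* a, int n, int key)
    {
        int count = 0;
        for (int i = 0; i < n; ++i)
            if (a[i] == key)          // data-dependent branch: hard
                ++count;              // to predict, risks stalls
        return count;
    }

    int count_matches_branchless(const int* a, int n, int key)
    {
        int count = 0;
        for (int i = 0; i < n; ++i)
            count += (a[i] == key);   // the comparison yields 0 or 1
        return count;
    }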