Efficient C++
Dov Bulka and David Mayhew
1. The tracing war story
- Object definitions trigger silent execution in the form of
object constructors and destructors. Think of it as silent
execution rather than silent overhead: the work is real, but
not necessarily wasted.
- Just because we pass an object by reference does not guarantee
good performance. It would be helpful if we didn't have to construct
and destroy the object in the first place.
- Don't waste effort on computations whose results are not likely
to be used.
- Don't aim for the world record in design flexibility. A
char pointer can sometimes do simple jobs as well as, and
more efficiently than, a string.
- Inline. Eliminate the function call overhead that comes with
small, frequently invoked function calls.
2. Constructors and destructors
- Constructors and destructors may be as efficient as hand-crafted
C code.
- The construction (destruction) of an object triggers recursive
construction (destruction) of parent and member objects.
- Make sure that your code actually uses the objects that it
creates and the computations that they perform.
- The object life cycle is not free of cost. Don't create an
object unless you are going to use it.
- Compilers must initialize contained member objects prior to
entering the constructor body. Use the "member initialization list".
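
A minimal sketch of the member-initialization-list point (the
class and member names are invented for illustration):

    #include <string>

    class Slow {
        std::string name;
    public:
        // name is default-constructed before the body runs and
        // then assigned: two steps where one would do.
        Slow(const std::string& n) { name = n; }
    };

    class Fast {
        std::string name;
    public:
        // The member initialization list constructs name directly
        // from n: a single copy construction.
        Fast(const std::string& n) : name(n) {}
    };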
3. Virtual functions
- The cost of a virtual function stems from the inability to inline
calls that are dynamically bound at run-time (not from the
presence and de-referencing of the v-table pointer).
- Templates are more performance-friendly than inheritance
hierarchies - they push type resolution to compile-time. [virtual
functions are run-time polymorphism, templates are compile-time
polymorphism]
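
A sketch of the contrast (the Shape/Square types are invented
examples, not the book's code):

    // Run-time polymorphism: area() is resolved through the v-table,
    // so the call through a Shape& cannot be inlined.
    struct Shape {
        virtual ~Shape() {}
        virtual double area() const = 0;
    };

    struct Square : Shape {
        double side;
        explicit Square(double s) : side(s) {}
        double area() const { return side * side; }
    };

    double total_virtual(const Shape& s) { return s.area(); }

    // Compile-time polymorphism: the template is instantiated per
    // concrete type, so area() binds statically and can be inlined.
    template <typename T>
    double total_template(const T& s) { return s.area(); }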
4. The RVO (return value optimization)
- If you must return an object by value, the RVO will help
performance by eliminating the need for creation and destruction
of a local object.
- The application of the RVO is up to the discretion of the
compiler implementation.
- You will have a better shot at RVO by defining a
"computational constructor" [Complex::Complex( const Complex&
x, const Complex& y ) : real( x.real + y.real ), imag( x.imag +
y.imag ) { }]
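
Filling in the surrounding class makes the idea concrete. The
computational constructor is the snippet above; the rest of the
scaffolding is assumed for illustration:

    class Complex {
        friend Complex operator+(const Complex&, const Complex&);
        double real, imag;
    public:
        Complex(double r = 0.0, double i = 0.0) : real(r), imag(i) {}
        // Computational constructor: builds the sum directly, which
        // gives the compiler an easy shot at applying the RVO.
        Complex(const Complex& x, const Complex& y)
            : real(x.real + y.real), imag(x.imag + y.imag) {}
    };

    // The return value can be constructed straight into the caller's
    // storage; there is no named local object to copy and destroy.
    inline Complex operator+(const Complex& x, const Complex& y)
    {
        return Complex(x, y);
    }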
5. Temporaries
- A temporary object could penalize performance twice in the
form of constructor and destructor computations.
- Declaring a constructor explicit will prevent the
compiler from using it for type conversion behind your back.
- A temporary object is often created by the compiler to fix
a type mismatch. You can avoid it by overloading the function
for the argument type actually being passed.
- Avoid object copy if you can. Pass and return objects by
reference.
- You can eliminate temporaries by using compound operators
like operator+=.
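
A short sketch of the last two points (Buffer is an invented
class; the string code uses the standard library):

    #include <string>

    class Buffer {
        int size_;
    public:
        // explicit stops the compiler from silently converting a
        // plain int into a temporary Buffer behind your back.
        explicit Buffer(int size) : size_(size) {}
    };

    void concat(std::string& s, const std::string& a,
                const std::string& b)
    {
        // a + b materializes a temporary string to hold the sum;
        // it is assigned to s and then destroyed.
        s = a + b;

        // The compound operator updates s in place: no temporary.
        s = a;
        s += b;
    }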
6. Single-threaded memory pooling
- Flexibility trades off with speed. As the power and
flexibility of memory management increases, execution speed
decreases.
- The global memory manager is general purpose and
consequently expensive.
- Specialized memory managers can be more than an order of
magnitude faster than the global one.
- If you allocate mostly memory blocks of a fixed size, a
specialized fixed-size memory manager will provide a
significant performance boost (see the sketch after this
list).
- A similar boost is available if you allocate mostly memory
blocks that are confined to a single thread (skip the
concurrency issues that the global new and delete
must handle).
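
A minimal single-threaded, fixed-size pool in the spirit of
this chapter (a sketch under my own names, not the book's exact
class): allocation and deallocation become a couple of pointer
operations, and freed blocks are recycled instead of
round-tripping through global new/delete.

    #include <cstddef>
    #include <new>

    class FixedPool {
        union Node { Node* next; };
        Node* free_ = nullptr;        // free list of recycled blocks
        const std::size_t size_;
    public:
        explicit FixedPool(std::size_t blockSize)
            : size_(blockSize < sizeof(Node) ? sizeof(Node)
                                             : blockSize) {}
        ~FixedPool() {
            while (free_) {
                Node* n = free_;
                free_ = n->next;
                ::operator delete(n);
            }
        }
        void* allocate() {
            if (!free_)                      // miss: fall back on
                return ::operator new(size_);  // global new
            Node* n = free_;                 // hit: pop a recycled
            free_ = n->next;                 // block - no locking,
            return n;                        // no size lookup
        }
        void deallocate(void* p) {
            Node* n = static_cast<Node*>(p); // recycle the block
            n->next = free_;
            free_ = n;
        }
    };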
7. Multi-threaded memory pooling
- If you develop a set of efficient single-threaded allocators,
you can easily extend their reach into multi-threaded environments
by the use of templates.
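
A sketch of that extension, assuming the FixedPool sketch above
and using std::mutex as a stand-in for whatever lock class your
platform provides. The template composes any single-threaded
pool type M with any lock type L:

    #include <cstddef>
    #include <mutex>

    template <typename M, typename L = std::mutex>
    class MTPool {
        M pool_;                      // the single-threaded allocator
        L lock_;                      // serializes access to it
    public:
        explicit MTPool(std::size_t blockSize) : pool_(blockSize) {}
        void* allocate() {
            std::lock_guard<L> guard(lock_);
            return pool_.allocate();
        }
        void deallocate(void* p) {
            std::lock_guard<L> guard(lock_);
            pool_.deallocate(p);
        }
    };

    // Usage (hypothetical): MTPool<FixedPool> pool(sizeof(Widget));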
8. Inlining basics
- Inlining is the replacement of a method call with the code
for the method (see the sketch after this list).
- Inlining improves performance by removing call overhead and
allowing cross-call optimizations to take place.
- Inlining is primarily an execution-time optimization, though
it can also result in smaller executable images.
- Inlining efficiency is not an issue in the case of functions
whose cost is not dominated by call and return overhead. [p57]
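
A minimal sketch of the mechanism (invented function names):

    inline int square(int x) { return x * x; }

    int f(int a)
    {
        // With inlining, the compiler replaces the call with the
        // body, as if we had written: return a * a + 1;  There is
        // no call/return overhead, and cross-call optimizations
        // (constant folding, better register allocation) apply.
        return square(a) + 1;
    }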
9. Inlining - performance considerations
- Literal arguments combined with inlining give the compiler
significant opportunities for performance improvements.
- Inlining can increase code size. Large code size suffers a
higher rate of cache misses and page faults than smaller code.
- Nontrivial inlining decisions should be based on sample
execution profiles, not gut feelings.
- Consider rewriting high frequency methods that have a large
static size (lots of compiled code) but a small dynamic size
(few instructions executed on a typical call): extract the
significant dynamic component and inline only that (see the
sketch after this list).
- Trivial and singleton methods can always be inlined.
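
A sketch of that rewrite (the names and buffer layout are
invented). The append operation has a large static size (the
growth logic) but a small dynamic size (most calls just store a
byte), so only the hot path is inlined:

    #include <cstddef>

    void grow_and_append(char c);   // the large, rarely taken slow
                                    // path, defined out of line
    extern char buf[];
    extern std::size_t len, cap;

    inline void append(char c)
    {
        if (len < cap)
            buf[len++] = c;         // small dynamic component, inlined
        else
            grow_and_append(c);     // big static component, a real call
    }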
10. Inlining tricks
- The goal is to find a program's fast path and inline it,
though inlining this path may not be trivial.
- Conditional inlining lets you prevent inlining from
occurring, typically via a preprocessor flag. This decreases
compile time and simplifies debugging during the earlier
phases of development (see the sketch after this list).
- Selective inlining is a technique that inlines methods only
in some places. It can offset some of the code size explosion
potential of inlining a method by inlining method calls only on
performance-critical paths.
- Directly recursive methods cannot be inlined [p144].
Recursive inlining is an ugly but effective technique for
improving the performance of recursive methods.
- Care needs to be taken with local static variables.
- Inlining is aimed at call elimination. Be sure of the real
cost of calls on your system before using inlining.
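
A sketch of conditional inlining via the preprocessor
(INLINING_OFF and INLINE are invented macro names):

    // Compile with -DINLINING_OFF during development to keep
    // functions out of line (faster builds, easier debugging);
    // drop the flag for release builds to get inlining back.
    #ifdef INLINING_OFF
    #define INLINE          /* expands to nothing: out-of-line */
    #else
    #define INLINE inline
    #endif

    INLINE int min_of(int a, int b) { return a < b ? a : b; }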
11. STL
- The STL is an uncommon combination of abstraction,
flexibility, and efficiency.
- Some containers are more efficient than others for a
particular usage pattern (see the sketch after this list).
- Unless you know something about the problem domain that the
STL doesn't, it is unlikely that you will beat the performance
of an STL implementation by a wide enough margin to justify the
effort.
- It is possible, however, to exceed the performance of an
STL implementation in some specific scenarios.
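
One concrete instance of the usage-pattern point (a sketch):
front insertion is constant time on a deque but linear on a
vector, so the "right" container depends on how you use it.

    #include <deque>
    #include <vector>

    void fill_front(std::deque<int>& d, std::vector<int>& v)
    {
        for (int i = 0; i < 1000; ++i) {
            d.push_front(i);          // O(1) per insertion
            v.insert(v.begin(), i);   // O(n) shift per insertion
        }
    }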
12. Reference counting
- Reference counting is not an automatic performance winner.
Reference counting, execution speed, and resource conservation
form a delicate interaction that must be evaluated carefully
if performance is an important consideration. Reference
counting may help or hurt performance depending on the usage
pattern. The case in favor of reference counting is
strengthened by any one of the following items:
- the target object is a large resource consumer
- the resource in question is expensive to allocate and free
- a high degree of sharing is likely due to lots of copying
and assignment
- the creation or destruction of a reference is relatively
cheap
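
A minimal intrusive reference-counting sketch (single-threaded;
the class names are mine). Copies of the handle share one
BigResource instead of duplicating it, and the count tracks the
sharers:

    class BigResource {
        friend class Handle;
        int refs_ = 0;
        // ... large, expensive-to-copy state ...
    };

    class Handle {
        BigResource* p_;
        void release() { if (--p_->refs_ == 0) delete p_; }
    public:
        Handle() : p_(new BigResource) { p_->refs_ = 1; }
        Handle(const Handle& h) : p_(h.p_) { ++p_->refs_; }
        Handle& operator=(const Handle& h) {
            ++h.p_->refs_;   // increment first: safe on self-assignment
            release();
            p_ = h.p_;
            return *this;
        }
        ~Handle() { release(); }
    };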
13. Coding optimizations
- Ensure that the results of computations are actually used.
- Defer a computation to the point where it is actually
needed. Premature computations may never be used on some
execution flows.
- If a computation has already been performed, use the
previous result instead of computing a new one (see the
sketch after this list).
- Take advantage of simplifying assumptions. Reduced
flexibility increases speed.
- Some unnecessary flexibility is hidden in library calls.
Familiarize yourself with the hidden cost of system calls, and
roll your own implementation where the effort can be justified.
- Minimize memory management calls - they are relatively
expensive.
- Optimize the processing of the 20% of the data that is used
80% of the time.
- Write cache-friendly code.
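
A sketch of reusing a previous result by caching just the last
argument (single-threaded; slow_transform is an assumed
expensive function defined elsewhere):

    #include <cmath>

    double slow_transform(double x);   // assumed expensive

    double cached_transform(double x)
    {
        static double last_x = NAN, last_result = 0.0;
        if (x != last_x) {             // miss: compute and remember
            last_result = slow_transform(x);
            last_x = x;
        }
        return last_result;            // hit: reuse the previous result
    }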
14. Design optimizations
- Performance and flexibility are often enemies. On the 20%
of your software that executes 80% of the time, performance
often comes first at the expense of flexibility.
- Caching offers leverage at large and small levels of
granularity. You can avoid big blobs of computation by simply
stashing away the result of previous computations.
- The use of efficient algorithms and data structures is
necessary, but not sufficient.
- Some computations may be necessary only on a subset of the
likely execution scenarios. Those computations should be
deferred until the need for the result is demonstrated.
- Large-scale software often produces obsolete code. Periodic
purges of obsolete code boost performance and overall
"hygiene".
15. Scalability
- Amdahl's Law puts an upper limit on the potential
scalability of an application: with serial fraction s running
on N processors, speedup <= 1 / (s + (1 - s)/N). Scalability
is limited by the portions of the computation that must be
performed serially. The trick to scalability is to reduce
and, if possible, eliminate serial code.
- Split a monolithic task into multiple subtasks that are
conducive to parallel execution by concurrent threads.
- Critical sections should contain critical code and nothing
else. Code that does not directly manipulate shared resources
should not reside in the critical section.
- Cache. At times, it may be possible to eliminate execution
visits to a critical section by caching the result of an earlier
visit.
- Share nothing. If you only need a small, fixed number of
resource instances, you should avoid the use of public resource
pools. Make those instances thread-private and recycle them.
- Partial sharing. It is better to have two identical pools,
each with half the contention, than one pool that absorbs it
all.
- Lock granularity. Don't fuse resources under the protection
of the same lock unless they are always updated together.
- False sharing. Don't place two hot locks in close proximity
in the class definition. You don't want them to share the same
cache line and trigger "cache consistency" storms.
- When a lock is freed, are all waiting threads woken up, or
just one? Lock implementations that wake up all waiting
threads threaten the scalability of the application.
- Investigate the implementations of system and library calls.
Some are hiding significant portions of serial code.
- Shared data that is read-mostly will benefit from
reader/writer locks - they eliminate contention among reader
threads.
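
A sketch of the reader/writer point, using C++17's
std::shared_mutex as a stand-in for whatever reader/writer lock
your platform provides:

    #include <shared_mutex>

    class Config {
        mutable std::shared_mutex m_;
        int value_ = 0;
    public:
        int get() const {
            std::shared_lock<std::shared_mutex> r(m_);  // readers don't
            return value_;                              // block readers
        }
        void set(int v) {
            std::unique_lock<std::shared_mutex> w(m_);  // writers
            value_ = v;                                 // exclude all
        }
    };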
16. System architecture dependencies
- Cache thrash. One of the interesting effects of having
a cache in a multiprocessor system is the effect of cache
coherency protocols on cache performance. Cache thrash
is typically the result of multiprocessor systems that do
not maintain any "processor affinity" for their processes.
Processor affinity amounts to keeping track of which
processor a process ran on last time it executed, and trying
to increase the likelihood that it will execute on that same
processor the next time it executes. Cache thrash is an
antiscaling artifact in large-scale SMP systems. It is not
a C++ issue. It is purely an operating system issue. But
if processor affinity characteristics ever filter down into
multithreading/multiprocessor APIs, it is important to
understand the implications and leverage that exist. [p277]
- Avoid branching. Almost all modern processors are now
pipelined (instructions are decomposed into "stages" and these
stages can execute in parallel). Unfortunately, conditional
branches tend to stall the pipeline. [p279]
- Prefer simple calculations to small branches (see the
sketch after this list). [p280]
- The further the memory you want to use is from the processor,
the longer it takes to access. The resources closest to the
processor - registers - are limited in capacity but extremely
fast. Optimizing their use can be very valuable.
- Virtual memory is not free. Indiscriminate reliance on
system-maintained virtual structures can have significant
negative performance ramifications.
- Context switches are expensive - avoid them.
- We feel that the coming shift in processor architecture will
significantly disadvantage monolithic threading approaches (e.g.
internally managed asynchronous I/O).
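
On the "prefer simple calculations to small branches" point, a
sketch (invented function names): both loops count matches, but
the second replaces a data-dependent branch with arithmetic the
pipeline can stream through.

    int count_matches_branchy(const int* a, int n, int key)
    {
        int count = 0;
        for (int i = 0; i < n; ++i)
            if (a[i] == key)          // data-dependent branch: hard
                ++count;              // to predict, risks stalls
        return count;
    }

    int count_matches_branchless(const int* a, int n, int key)
    {
        int count = 0;
        for (int i = 0; i < n; ++i)
            count += (a[i] == key);   // the comparison yields 0 or 1
        return count;
    }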