Introduction to Static Analysis for C and C++

Summary:

This post explains what static analysis is for C and C++ and why it matters. You will learn how to use static analysis as part of your development process and pick up tips on how to automate it.

 

Introduction to Static Analysis

The “four stages of competence” is a well-known learning model that depicts a learner’s stages while acquiring a skill. This blog post is for you if you have never heard of static code analysis. By the end of this blog post, if you decide to learn more about static code analysis, you will have successfully transitioned from the “I don’t know that I don’t know” state to the “I know that I don’t know” state. 😁

At their heart, compilers are tools that transform human-readable text into machine-readable code. An executable is born if the compiler doesn’t encounter any error during this transformation from the source code. 99% of the time, a program is born with logical errors/vulnerabilities/functional defects that the compiler knows nothing about. (What about the remaining 1%? They are miracle births that are celebrated far and wide. 🦄 )

Compilers do more than faithfully follow the programmer’s instructions: they also give hints on how to improve a program. These hints are known as warnings, and how many of them you see depends on the warning level set in the compiler options. The warnings are the result of static code analysis that the compiler performs during compilation.

The “90/10” rule is a well-known paradigm that applies to many areas of computer science. For example, 90 percent of a program’s execution time is spent in 10 percent of the code. Extrapolating this rule to warnings thrown by the compiler, we can claim that 90 percent of programmers fix only 10 percent of compiler warnings. “They are just warnings,” I hear you say!

Bad things happen when the compiler’s well-intended warnings are not heeded. Programmers also make subtle mistakes that the compiler cannot catch at all. These mistakes manifest themselves as software errors 🐞 or vulnerabilities 💥. Let us go through some examples:

Ariane 5 Disaster

What can be worse than losing a rocket about 40 seconds into its launch, blowing up 370 million dollars of taxpayers’ money? All because the flight software tried to cram a 64-bit floating-point value into a 16-bit signed integer.

[Figure: bits 0 to 15 of a 16-bit value, compared with bits 0 to 63 of a 64-bit value]

A 64-bit floating-point value can be a very large number, far beyond 32,767, the largest value a 16-bit signed integer can hold. The value did not fit, but the software performed the conversion anyway, without a range check! You can read more about the disaster here.
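To make the problem concrete, here is a minimal C++ sketch of a guarded conversion from a 64-bit floating-point value to a 16-bit signed integer. The helper name checked_narrow is made up for this post and has nothing to do with the actual flight software; it only illustrates the kind of range check that was missing.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: converts a 64-bit double to a 16-bit signed integer
// only when the value fits. Returns false and leaves `out` untouched otherwise.
bool checked_narrow(double value, std::int16_t& out) {
    const double lo = std::numeric_limits<std::int16_t>::min();  // -32768
    const double hi = std::numeric_limits<std::int16_t>::max();  //  32767
    if (std::isnan(value) || value < lo || value > hi) {
        return false;  // out of range: the caller has to handle this case
    }
    out = static_cast<std::int16_t>(value);  // in range; the fraction is truncated
    return true;
}
```

Converting an out-of-range floating-point value to an integer type is undefined behavior in C++, so the check has to happen before the cast, not after it.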

Therac-25 radiation overdose

Therac-25 is the story of how a machine intended to save the lives of patients ultimately became their killer, all because of bugs in its software. Race conditions and integer overflows caused the machine to malfunction and deliver high doses of radiation.

Race conditions happen when multiple threads executing a block of code are not properly synchronized. For example:

Thread 1          Thread local storage   Variable   Thread 2          Thread local storage
Read Value        0                      0          Read Value        0
Increment Value   1                      0          Increment Value   1
Write Value       1                      1          Write Value       1

The variable ends up as 1 instead of the expected 2: the data race produces an incorrect result because the threads run concurrently without synchronization. Such bugs are non-deterministic and quite hard to track down.
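The lost update can be sketched in C++ as follows. The first function replays the interleaving in a single thread so that the result is deterministic; the second shows one possible fix using std::atomic. Both function names are made up for illustration, and a mutex would work just as well as an atomic here.

```cpp
#include <atomic>
#include <thread>

// Replays the interleaving step by step in a single thread, so the outcome
// is deterministic: both "threads" read 0, and one of the updates is lost.
int replay_lost_update() {
    int shared = 0;
    int local1 = shared;  // Thread 1 reads 0
    int local2 = shared;  // Thread 2 reads 0
    ++local1;             // Thread 1 increments its copy to 1
    ++local2;             // Thread 2 increments its copy to 1
    shared = local1;      // Thread 1 writes 1
    shared = local2;      // Thread 2 writes 1 again
    return shared;        // 1, not the expected 2
}

// One possible fix: make the read-increment-write step indivisible.
int count_with_atomic(int increments_per_thread) {
    std::atomic<int> shared{0};
    auto work = [&shared, increments_per_thread] {
        for (int i = 0; i < increments_per_thread; ++i) {
            shared.fetch_add(1);  // atomic read-modify-write
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return shared.load();  // always 2 * increments_per_thread
}
```

With a plain int instead of std::atomic<int>, the second function would contain a data race, and its result could differ from run to run.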

Integer overflows happen when an operation results in a value outside the range that can be represented with a given number of bits. Assume that a 16-bit unsigned integer is holding its maximum value of 65535 like so:

1111111111111111

(a 16-bit value with all bits set)

If we try to add 1 to this number, an integer overflow happens because 65536 cannot be represented in a 16-bit integer. With an unsigned integer, the behavior is defined and predictable: the value wraps around to 0. That is still a bug if the program did not expect it. With a signed integer, overflow is undefined behavior in C and C++, so the program may misbehave in ways that are hard to predict.
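Here is a small C++ sketch of both cases. The function names are made up for this post; the checked addition is just one common way to test for signed overflow before it happens.

```cpp
#include <cstdint>
#include <limits>

// Unsigned arithmetic wraps around: 65535 + 1 becomes 0 in 16 bits.
std::uint16_t increment_u16(std::uint16_t value) {
    // value + 1 is computed in a wider type (integer promotion);
    // the cast reduces the result modulo 2^16.
    return static_cast<std::uint16_t>(value + 1);
}

// Signed overflow is undefined behavior, so test the operands first.
// Returns false (and leaves `sum` untouched) if a + b would overflow.
bool checked_add(int a, int b, int& sum) {
    if (b > 0 && a > std::numeric_limits<int>::max() - b) return false;
    if (b < 0 && a < std::numeric_limits<int>::min() - b) return false;
    sum = a + b;
    return true;
}
```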

Neither race conditions nor integer overflows are normally detected during compilation, although a special class of integer overflows involving only constants can be detected.

Heartbleed – vulnerability in OpenSSL

This is one of those rare bugs that has a webpage of its own! OpenSSL is a cryptographic library used to secure information. It is open source and well supported on all modern operating systems. A missing bounds check in the handling of the TLS heartbeat extension can be exploited to reveal sensitive information. Though not caught by the compiler, a missing bounds check can be caught by a static code analysis tool.

A critical software vulnerability can even be weaponized. The most famous such case is Stuxnet, a worm that specifically targeted industrial automation software and disrupted the programmable logic controllers it ran on, acting as a weapon.
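The idea behind the missing check can be sketched in a few lines of C++. This is not OpenSSL’s code; the function build_echo_reply and its parameters are made up for this post. The point is that the length claimed by the sender must be compared against the amount of data that actually arrived before anything is copied.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical echo handler: copies `claimed_length` bytes of the received
// payload into `reply`, but only if that many bytes were really received.
bool build_echo_reply(const std::vector<std::uint8_t>& received,
                      std::size_t claimed_length,
                      std::vector<std::uint8_t>& reply) {
    if (claimed_length > received.size()) {
        return false;  // the bounds check: never read past the received data
    }
    reply.assign(received.begin(),
                 received.begin() + static_cast<std::ptrdiff_t>(claimed_length));
    return true;
}
```

Without the if statement, a request that claims more bytes than it carries would make the handler read whatever happens to sit in memory after the payload.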

Aside: What is a bounds check, and why is it important?

Arrays are variables that store values of the same type in contiguous memory locations. Assume we have an array like so:

[Figure: an array of eight values stored at Index 0 through Index 7]

The bounds of the array are (0, 7), both inclusive. Some languages like C# or Java check at run time whether every array access is within these bounds. Languages like C and C++ leave array access to the programmer, and the compiler adds no bounds check. This leads to subtle bugs, many of which can be caught during static code analysis.
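In C++, the check has to be written by hand or requested explicitly. The sketch below uses a made-up helper, read_checked, on an eight-element std::array; the standard library’s std::array::at() offers a similar check and throws std::out_of_range instead of returning false.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: reads element `index` only if it lies within the
// array bounds (0 to 7 here). Returns false for any other index.
bool read_checked(const std::array<int, 8>& values, std::size_t index, int& out) {
    if (index >= values.size()) {
        return false;  // values[index] would read past the end of the array
    }
    out = values[index];
    return true;
}
```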

How to manage code quality?

There are different ways in which the software industry tries to deal with such bugs. One is the tacit understanding that not all bugs can be eliminated. The following can and should be done to minimize software errors:

  • Code reviews – The practice of overseeing code before it gets into production is a great way of arresting bugs. The effectiveness of this approach depends greatly on the capability of the reviewer.
  • Software testing – By some estimates, more than 50% of the time in software projects is spent on testing. Testing can catch bugs, and automated test runs ensure that fixed bugs don’t come back. But testing is costly and can also lead to a false sense of security.
  • Static code analysis – Unlike code reviews that need manual reviewers, static code analysis uses tools to check programs. This checking can even be integrated into the nightly builds to generate daily build reports. A disadvantage would be the cost associated with the tool.

What is static code analysis?

During compilation, source code is transformed into intermediate representations like Abstract Syntax Tree (AST) and Control Flow Graph (CFG). Compilers use these intermediate representations to run data flow analysis (DFA) algorithms to do code optimizations. During the code optimization stage, it is possible to determine the unused variables and unused code (dead code). The primary goal of a compiler is to transform such intermediate representations into executable code. In contrast, the primary goal of a static code analysis tool is to use the intermediate representations to find issues in the code.

What can static code analysis do for you?

Static code analysis can

  • Detect code that deviates from a coding standard (e.g., MISRA C)
  • Detect code that can lead to resource or memory leaks
  • Detect code that can lead to null pointer dereferencing
  • Detect concurrency issues in code leading to race conditions
  • Detect incorrect use of APIs
  • Detect conditionals that always evaluate to either true or false
  • Detect operator precedence issues
  • More…

(Check out the appendix for more details on some of these issues.)
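Two of the items above are easy to show in a few lines of C++. The snippet below is only a sketch with made-up function names: the comments describe the pattern an analyzer would typically flag, and the code shows a version that avoids it.

```cpp
#include <cstddef>
#include <cstring>

// A conditional that is always true: in
//     for (std::size_t i = n - 1; i >= 0; --i)
// the index is unsigned, so `i >= 0` can never be false and the loop
// never terminates. Counting down with `i-- > 0` avoids the problem.
int sum_reversed(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = n; i-- > 0;) {
        total += data[i];
    }
    return total;
}

// Null pointer dereference: std::strlen(text) must not be called with a
// null pointer, so the pointer is checked before it is used.
std::size_t length_or_zero(const char* text) {
    if (text == nullptr) {
        return 0;
    }
    return std::strlen(text);
}
```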

Most static code analysis tools are well integrated into the development environment. This gives the programmer a chance to run static code analysis on demand. Mostly this opportunity comes after the “last elusive bug” is found, or the last customer feature is done – which is never. 🕸

Hence the ideal way to run static code analysis is to integrate it with the source control management and its nightly build setup. A free, open-source tool, CppCheck, lets you seamlessly analyze C and C++ code on your PC. You can even find free-to-use (🤑) GitHub Actions that allow you to run CppCheck on your repository hosted on GitHub.

Although CppCheck is free, it has its share of limitations and does not provide as comprehensive a set of rules and checks as commercial static analysis tools. So, I recommend using CppCheck if you are not using anything at the moment and have a tight budget. However, if you work in an industry (such as automotive, medical devices, aviation, etc.) that needs to comply with certain safety standards, then I strongly suggest you use a commercial tool such as C++Test by Parasoft or QA-MISRA by QA Systems. The commercial tools require some expertise to set up as part of your testing automation, and that is something that Novodes can happily help you with.

Appendix – Technical follow-up of the paragraph on what static code analysis can do for you

  1. AST: Abstract Syntax Tree is a data structure obtained as a result of lexing and parsing a program. For more information, see https://en.wikipedia.org/wiki/Abstract_syntax_tree
  2. CFG: Control Flow Graph. A graph in which the basic blocks of a program are the nodes and the possible transfers of control between them are the edges. Compilers build this data structure as a basis for their optimization passes. Most static code analysis tools require it as a prerequisite. For more information, see https://en.wikipedia.org/wiki/Control-flow_graph
  3. DFA: Data flow analysis. Data flow analysis sets up recurrence equations, the solutions of which can decide if some particular analysis or optimization (e.g., liveness analysis, code hoisting, copy propagation, and common sub-expression elimination) can be done. For more information, see https://en.wikipedia.org/wiki/Data-flow_analysis
  4. Incorrect use of API: Every API – be it a web service request, a third-party library call, or even the call to a standard library function – has a contract that needs to be followed. Take, for example, the standard C function strtok:

char * strtok ( char * str, const char * delimiters );

The contract says: On a first call, the function expects a C string as an argument for str, whose first character is used as the starting location to scan for tokens. In subsequent calls, the function expects a null pointer and uses the position right after the end of the last token as the new starting location for scanning.

Without understanding and following this contract, we are all but guaranteed to have bugs.
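A minimal C++ sketch of a caller that follows the contract might look like this. The wrapper split_tokens is a made-up name; the only library function it relies on is std::strtok from <cstring>.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical wrapper that splits `text` at any character in `delimiters`.
std::vector<std::string> split_tokens(const std::string& text, const char* delimiters) {
    // strtok overwrites delimiters in its input with '\0', so it gets
    // a private, null-terminated copy instead of the caller's string.
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    // First call: pass the string. Subsequent calls: pass a null pointer.
    for (char* token = std::strtok(buffer.data(), delimiters);
         token != nullptr;
         token = std::strtok(nullptr, delimiters)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Passing the string again on every call, or passing a string literal that must not be modified, are typical contract violations that a static analysis tool can point out. Note also that strtok keeps hidden state between calls, so it should not be used from several threads at once.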

  5. Operator precedence issues: An expression can involve multiple operators, the precedence of which might not be as intended by the programmer. Take, for example, the following statement:

if (isUser = AuthenticateUser(username, password) == FAIL) {

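The expression involves the equality operator (==) and the assignment operator (=). The equality operator has higher precedence, so the comparison is evaluated first and isUser receives its true/false result rather than the value returned by AuthenticateUser. This is a classic operator precedence bug.

The fix is to add parentheses that force the intended order of evaluation: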

if ((isUser = AuthenticateUser(username, password)) == FAIL) {
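The difference is easy to see in a small runnable sketch. Everything here is a stand-in made up for this post: AuthenticateUser is reduced to a stub that returns the user id 42 on success and FAIL otherwise, and the two functions return the value that ends up in isUser.

```cpp
// Hypothetical stand-ins for the names used in the statement above.
const int FAIL = -1;

int AuthenticateUser(bool credentials_ok) {
    return credentials_ok ? 42 : FAIL;  // 42 is a made-up user id
}

// Without parentheses: parsed as isUser = (AuthenticateUser(...) == FAIL),
// so isUser holds the result of the comparison (0 or 1).
int is_user_without_parentheses(bool credentials_ok) {
    int isUser = 0;
    isUser = AuthenticateUser(credentials_ok) == FAIL;
    return isUser;
}

// With parentheses: isUser holds the return value of AuthenticateUser,
// and the comparison against FAIL is made on that value.
int is_user_with_parentheses(bool credentials_ok, bool& failed) {
    int isUser = 0;
    failed = ((isUser = AuthenticateUser(credentials_ok)) == FAIL);
    return isUser;
}
```

For a successful login, the first version leaves 0 in isUser while the second leaves 42, which is what the programmer intended.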

 
