AUDACIA Software - Type-safe format strings
AUDACIA Software

Type-safe format strings

Contributing to the rehabilitation of printf.
Moritz Beutel, July 6th, 2009



  1. printf() in C
  2. The C++ alternative: Streams
  3. CbdeFormat: making printf() typesafe
  4. Example 1: Migration to C++Builder 2009/Unicode
  5. Example 2: Debugging complex format strings
  6. Example 3: Localization bug
  7. Enabling Format String Checking for your own functions
  8. References
  9. Comments



printf() in C


Almost every high-level language has some notion of a string formatting mechanism. Even C, which some claim to be just a PDP-11 assembler that thinks it is a language, has the printf() function and its numerous variants. With its flexibility and simplicity, C's printf has often served as a prototype for more recent languages, be it PHP, Delphi, C#/.NET or Java.

In C, however, printf() has a severe flaw: it is not type-safe. The format string is parsed at runtime, but printf() does not have knowledge about the types of its variadic arguments. This problem has proven to be a prevalent source of errors, which can even be misused for injecting code. A few harmless examples follow:
#include <stddef.h>
#include <stdio.h>
#include <string.h>

void bar (const char* arg)
{
  char buf[200];
  wchar_t wbuf[200];
  int i;

    /* The function expects two variadic parameters, but we pass only one.
     * Therefore, the pointer value of arg is taken as the field width, and
     * whatever happens to follow it on the stack as the string pointer.
     * The possible consequences range from undefined content of buf to an
     * access violation due to buffer overflow.
     */
  sprintf (buf, "%*s", arg);

    /* Illusory safety: the function expects const wchar_t* but receives
     * const char*. A wchar_t-based string is terminated with two (or four,
     * depending on the platform) null-bytes, which don't necessarily occur
     * in a char-string. Same consequences as above.
     */
  if (strlen (arg) < 200)
    swprintf (wbuf, L"%s", arg);

    /* In the scanf() family, '*' does not request a width argument but
     * suppresses the assignment: the number is parsed and discarded, &i is
     * never written to, and i remains uninitialized without any diagnostic.
     */
  sscanf (buf, "%*d", &i);
}
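
The code-injection risk mentioned above arises when user-controlled text is used as the format string itself. The following is only a minimal sketch of the usual remedy; the function name echoUserText and the buffer size are made up for illustration:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns the text that would be printed for the
// given user input.
std::string echoUserText (const std::string& userText)
{
    char buf[64];

        // DANGEROUS (deliberately not done here):
        //     std::snprintf (buf, sizeof buf, userText.c_str ());
        // Every '%' sequence in userText would be interpreted as a
        // conversion specification and would read (or, with %n, write)
        // memory that was never passed as an argument.

        // Safe: the user text is only ever an argument, never the format.
    std::snprintf (buf, sizeof buf, "%s", userText.c_str ());
    return buf;
}
```

With this pattern, input such as "%s%n%x" is reproduced verbatim instead of being interpreted.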


The printf variants suffer from additional vulnerabilities even if parameterized properly. For instance, sprintf(), swprintf() and all variants of scanf() can easily cause buffer overflows:

#include <stdio.h>
#include <string.h>

void foo (const char* arg)
{
  char buf[200];

    /* Corrupts the stack if strlen (arg) > 192.
     */
  sprintf (buf, "arg: '%s'", arg);

    /* Corrupts the stack if the user decides to provide 200 or more
     * characters of input (the terminating null byte needs room, too).
     */
  scanf ("%s\n", buf);
}



The C++ alternative: Streams


Most programming languages that borrowed the basic concept of printf are not susceptible to buffer overflows (due to dynamic string management), nor do they lack type safety since they do not rely on variadic stack parameters but rather pass implicitly constructed arrays in different varieties.

However, the designers of the C++ language chose a totally different approach. All traditional C standard library functions are available in C++, and the new language facilities made it possible to prevent the buffer overflow problem by using a string class. But the concept of IO streams, which can easily be extended with support for custom types, encouraged the impression that printf() was not sufficiently modern for the language due to its lack of extensibility.

Because of that, printf() and the related set of functions are widely frowned upon in C++; the different kinds of streams are the preferred solution for string formatting. Streams combine type safety and extensibility with the flexibility of printf(): thanks to ADL (argument-dependent lookup), the stream operators can be overloaded for custom classes in their own namespace; manipulators can be used to adjust the output format of streams; and any kind of stream (file streams, console I/O streams, string streams) can be used with every stream operator.
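
To make the extensibility argument concrete, here is a small sketch. The namespace geometry, the Point type and the describe() function are invented for this example:

```cpp
#include <ostream>
#include <sstream>
#include <string>

namespace geometry
{
    struct Point { int x; int y; };

        // Found through argument-dependent lookup because it lives in the
        // same namespace as Point.
    std::ostream& operator << (std::ostream& os, const Point& p)
    {
        return os << '(' << p.x << ", " << p.y << ')';
    }
}

// The same operator works with any std::ostream: std::cout, file streams
// or, as here, a string stream.
std::string describe (const geometry::Point& p)
{
    std::ostringstream os;
    os << "position: " << p;
    return os.str ();
}
```

A printf()-style function has no comparable hook: the set of conversion specifications is fixed by the library.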

The problem is that it looks just plain ugly.

As an example, look at the translation of a concise format statement from C to C++:
int iVal = 2;
int hVal = 0x0C;
const char* filename = "test.fil";
const char* username = "juser";
float fValue = 23.8;

    // C
#include <stdio.h>

void doItWithPrintf (void)
{
    printf ("%4d  0x%02X  Testfile: [%10s]  user: [%-10s]\nTrial %8.2f \n",
            iVal, hVal, filename, username, fValue);
}


    // C++
#include <iostream>
#include <iomanip>
#include <cstdio>

using namespace std;

void doItWithIOStreams (void)
{
  cout << setw(4) << iVal;
  cout << "  0x" << hex << uppercase << setprecision (2) << setfill ('0') << setw (2) << hVal;
  cout << setfill (' ') << dec; // reset things
  cout << "  Testfile: [" << setw (10) << filename << "]";
  cout << "  user: [" << setw (10) << left << username << right << "]";
  cout << "\n";
  cout << "Trial " << setw (8) << fixed << setprecision (2) << fValue;
  cout << endl;
}


Even in less extreme cases, it is obvious that stream-based formatting makes it difficult to internationalize strings properly. Assume that the following compiler error is to be translated using gettext:
#include <cstdio>
#include <iostream>
#include <libintl.h>    // gettext()

void e2094 (const char* op, const char* lhstype, const char* rhstype)
{
        // the C way
    std::fprintf (stderr, gettext ("E2094: Operator '%s' not implemented in "
                                   "type '%s' for arguments of type '%s'"),
                  op, lhstype, rhstype);

        // the C++ way?
    std::cerr << gettext ("E2094: Operator '") << op
              << gettext ("' not implemented in type '") << lhstype
              << gettext ("' for arguments of type '") << rhstype
              << '\'' << std::endl;
}
Instead of a single string with placeholders, we now have several string fragments. As you can imagine, the translators will not be amused, and if they do not know the context, they will be even less amused. Another problem becomes visible when looking at the German translation:

The German stative passive usually places the participle after the noun it relates to. When string fragments are internationalized with gettext as above, this word order is simply impossible to express.

There are more good reasons to avoid C++'s IO streams for formatting. The most flexible alternative I know of is the Boost Format library, which manages to combine most of the advantages of printf and IO streams - and even compensates for some of the flaws in C's printf(): it supports multiple references to the same parameter and parameter reordering.
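
Why reordering and repeated references matter to translators can be shown with a hand-written sketch. This is not Boost Format's implementation; it merely borrows its positional "%N%" notation, and the function name formatPositional is made up:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replaces each "%N%" (N = 1..9) in fmt with args[N - 1]; everything else
// is copied unchanged. Deliberately minimal: no escaping, no format flags,
// and placeholders without a matching argument are dropped silently.
std::string formatPositional (const std::string& fmt,
                              const std::vector <std::string>& args)
{
    std::string result;
    for (std::size_t i = 0; i < fmt.size (); ++i)
    {
        const bool isPlaceholder = fmt[i] == '%' && i + 2 < fmt.size ()
            && fmt[i + 1] >= '1' && fmt[i + 1] <= '9' && fmt[i + 2] == '%';
        if (isPlaceholder)
        {
            const std::size_t index = fmt[i + 1] - '1';
            if (index < args.size ())
                result += args[index];
            i += 2;     // skip the digit and the closing '%'
        }
        else
            result += fmt[i];
    }
    return result;
}
```

Because every placeholder names its argument, a translated string may use the arguments in any order, or more than once, without any change to the calling code.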


CbdeFormat: making printf() typesafe


For newly created projects, the Boost Format library is likely the best approach, but it might not be an option in some scenarios: if you do not want to have your project depend on Boost, or if you are dealing with lots of legacy code that uses printf()-style functions extensively and needs to be maintained or migrated.

Interestingly, C++ provides all the prerequisites necessary for systematically preventing such errors. As a first step, memory management can be automated to avert the danger of buffer overflows in sprintf():
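#include <string>
#include <cstdarg>
#include <cstdio>
#pragma hdrstop

std::string str_printf (const char* format, ...)
{
    std::va_list args;
    std::string retval;

        // If passing 0 as buffer pointer, the function only calculates
        // the required buffer length.
    va_start (args, format);
    int length = std::vsnprintf (0, 0, format, args);
    va_end (args);
    if (length <= 0)
        return retval;      // formatting error or empty result

        // A va_list must not be reused after a v*printf() call, so it
        // is initialized again for the second pass.
    retval.resize (length);
    va_start (args, format);
    std::vsnprintf (&retval[0], retval.size () + 1, format, args);
    va_end (args);

    return retval;
}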
Unfortunately, such a construct does not exist in the standard C++ libraries. As with much other basic functionality missing from the standard libraries, 3rd-party application frameworks usually provide a suitable implementation, such as String::sprintf() in C++Builder and CString::Format() in ATL/MFC. (Note that C++Builder also supports SysUtils::Format() and String::Format(), but these functions use the more powerful but divergent Delphi format string syntax and do not perform optimally when used in C++ code. On the other hand, these functions are type-safe out of the box.)

Both of the mentioned printf() variants avoid the buffer overflow problem, but they are still type-unsafe. While GCC provides static format string checking with the -Wformat switch, this does not work across compilers, nor if the format string is only obtained at runtime, e.g. when returned by gettext().
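
For reference, this is roughly how GCC's static check can be extended to a custom function. The format attribute is a documented GCC extension (Clang accepts it as well); the macro SKETCH_PRINTF_LIKE and the function checkedFormat are names invented for this sketch:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// GCC and Clang: declare that parameter fmtIndex is a printf()-style
// format string and that the arguments to check start at firstArg.
// Other compilers see an empty macro and perform no check.
#if defined (__GNUC__)
#  define SKETCH_PRINTF_LIKE(fmtIndex, firstArg) \
       __attribute__ ((format (printf, fmtIndex, firstArg)))
#else
#  define SKETCH_PRINTF_LIKE(fmtIndex, firstArg)
#endif

std::string checkedFormat (const char* format, ...) SKETCH_PRINTF_LIKE (1, 2);

std::string checkedFormat (const char* format, ...)
{
    char buf[256];                  // fixed size keeps the sketch short
    std::va_list args;
    va_start (args, format);
    std::vsnprintf (buf, sizeof buf, format, args);
    va_end (args);
    return buf;
}

// checkedFormat ("%s", 42);       // -Wformat flags the int argument
```

The check only applies when the format string is a literal visible to the compiler, which is exactly the limitation described above.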

However, the new language features of C++, most notably templates and type inference, allow for type-checking at runtime. My particular implementation for C++Builder, CbdeFormat, shall be the primary subject of this article.
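
The basic idea can be sketched in a few lines. Note that this is not CbdeFormat's actual code: the namespace sketch and all names in it are invented, only a handful of conversions are recognized, and C++11 variadic templates are assumed for brevity.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

namespace sketch
{

enum ArgKind { KindInt, KindDouble, KindString, KindWideString, KindOther };

    // Maps an argument type to the kind of conversion it can satisfy.
template <typename T> struct KindOf         { static const ArgKind value = KindOther; };
template <> struct KindOf <int>             { static const ArgKind value = KindInt; };
template <> struct KindOf <unsigned>        { static const ArgKind value = KindInt; };
template <> struct KindOf <float>           { static const ArgKind value = KindDouble; };
template <> struct KindOf <double>          { static const ArgKind value = KindDouble; };
template <> struct KindOf <char*>           { static const ArgKind value = KindString; };
template <> struct KindOf <const char*>     { static const ArgKind value = KindString; };
template <> struct KindOf <wchar_t*>        { static const ArgKind value = KindWideString; };
template <> struct KindOf <const wchar_t*>  { static const ArgKind value = KindWideString; };

    // Parses the format string at runtime and compares every conversion
    // with the argument kinds recorded at compile time.
inline bool verifyFormat (const char* format, const std::vector <ArgKind>& kinds)
{
    std::size_t next = 0;
    for (const char* p = format; *p != '\0'; ++p)
    {
        if (*p != '%')
            continue;
        ++p;
        if (*p == '%')
            continue;                       // "%%" is a literal percent sign
        while (*p != '\0' && std::strchr ("-+ #0123456789.", *p) != 0)
            ++p;                            // skip flags, width, precision
        const bool wide = (*p == 'l');
        if (wide)
            ++p;
        if (wide && *p != 's')
            return false;                   // "%ld" etc.: outside this sketch
        ArgKind expected;
        switch (*p)
        {
        case 'd': case 'i': case 'u': case 'x': case 'X':
            expected = KindInt; break;
        case 'f': case 'e': case 'g':
            expected = KindDouble; break;
        case 's':
            expected = wide ? KindWideString : KindString; break;
        default:
            return false;                   // not supported by this sketch
        }
        if (next >= kinds.size () || kinds[next] != expected)
            return false;
        ++next;
    }
    return next == kinds.size ();           // no surplus arguments either
}

    // Template argument deduction records the type of every argument.
template <typename... Args>
bool formatMatches (const char* format, Args...)
{
    const std::vector <ArgKind> kinds { KindOf <Args>::value... };
    return verifyFormat (format, kinds);
}

} // namespace sketch
```

A call such as sketch::formatMatches ("Hello %s!", L"juser") returns false because a wide string is passed where "%s" expects a narrow one. A real implementation would report the mismatch in detail instead of returning a flag.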

CbdeFormat works with C++Builder 2006, 2007 and 2009 (BCC does not support Variadic Macros in versions prior to 5.8). Although I wrote the library for use with C++Builder, it does not rely on any C++Builder specifics and should easily be adaptable to other environments.

The implementation matches most of my requirements for a type-safe printf() variant:
  • No changes are required for normal code.
  • The error messages are relatively comprehensive (position in code, format string, expected and actual parameter types).
  • As it might often be necessary to have type checking even in Release versions, the code added for every invocation of a printf() function has a low runtime overhead and is kept to an extreme minimum.
  • The table-driven format string parser can easily be adapted to support custom format syntax extensions.
  • A simple and well-defined way exists to add runtime type-checking to custom format functions.

Of course, this leaves a few disadvantages to be mentioned:
  • The type checking is always done at runtime, even if all information was available at compile time. In some cases, checking at compile time is not possible as there is no way to tokenize strings at compile time, but even errors that could be caught during compilation (such as passing non-intrinsic types as arguments) are raised at runtime, mostly to preserve consistency, but also because I tend to prefer precise, informative runtime errors over cryptic, misplaced compiler errors.
  • CbdeFormat manages an extensible list of functions such as printf(), sprintf(), scanf() etc. which are replaced by type-safe equivalents by the preprocessor. This implies that for every function of the same name, a type-safe version must be declared (although you usually do not have this problem in practice).
  • Code Completion and Parameter Insight do not work inside macro calls.

The library can be downloaded in the C++Builder section. The supplied installer performs all steps required for successful integration of CbdeFormat within C++Builder 2006, 2007 and 2009. In detail:
  • First, the required header files are copied to the IDE's include directory.
  • As C++Builder doesn't yet support Variadic Templates, I chose to implement the type-safe wrapper functions the canonical way: multiple overloads, as used for std::tr1::function or Variant::OleProcedure(). The formatgen utility is used to generate the corresponding header files.
  • Using patch, the installer adds a few lines to some system header files (stdio.h, dstring.h, wstring.h, ustring.h, crtdbg.h) to enable the type-safe wrappers for the most commonly used printf()-style functions.
  • After updating the header files, the installer builds the library with the appropriate compiler.
  • C++Builder 2006 still uses the old approach of central PCH caches located in $(BDS)\lib (vcl100.csm, vcl100.#??). The installer deletes these files to ensure that the header file changes are in effect. This is not required for C++Builder 2007 and 2009, which both use project-specific PCHs.
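
The "multiple overloads" technique mentioned in the list above can be sketched as follows. This only illustrates the pattern and is not the code emitted by formatgen; the names TypeCode and argTypes are invented:

```cpp
#include <string>

namespace sketch
{
        // One-letter code per supported argument type; '?' for anything else.
    template <typename T> struct TypeCode      { static char get () { return '?'; } };
    template <> struct TypeCode <int>          { static char get () { return 'i'; } };
    template <> struct TypeCode <double>       { static char get () { return 'd'; } };
    template <> struct TypeCode <const char*>  { static char get () { return 's'; } };

        // Without variadic templates, one overload per argument count is
        // written (or generated by a tool) up to a fixed maximum.
    inline std::string argTypes ()
    {
        return std::string ();
    }

    template <typename A1>
    std::string argTypes (A1)
    {
        return std::string (1, TypeCode <A1>::get ());
    }

    template <typename A1, typename A2>
    std::string argTypes (A1 a1, A2)
    {
        return argTypes (a1) + TypeCode <A2>::get ();
    }

    template <typename A1, typename A2, typename A3>
    std::string argTypes (A1 a1, A2 a2, A3)
    {
        return argTypes (a1, a2) + TypeCode <A3>::get ();
    }
}
```

Here sketch::argTypes (1, 2.0, "x") yields "ids". Supporting more arguments simply means more overloads of the same shape, which is why generating them with a tool is attractive.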

Now CbdeFormat is installed, ready for use - and enabled by default for Debug builds.

The following sections cover a few more or less realistic use cases.


Example 1: Migration to C++Builder 2009/Unicode



Due to the Unicode transition in C++Builder 2009, the String::sprintf function now takes wide arguments. This is one of the more subtle changes in C++Builder 2009; most other changes cause compilation errors which are easy to locate and fix. Let's look at a little snippet of example code which might appear in similar form in many real-world C++Builder applications:
    const unsigned majorVersion = 1, minorVersion = 2;

    AnsiString theMessage = AnsiString ().sprintf ("Hello %s!\n"
        "This machine is running since %d minutes. "
        "High time for a coffee break!",
        EdtUserName->Text.c_str (), GetTickCount () / 1000 / 60);
    AnsiString theTitle = AnsiString ().sprintf (
        "%s v%d.%2d Professional Edition",
        Application->Title, majorVersion, minorVersion);

    MessageBox (Handle, theMessage.c_str (), theTitle.c_str (),
        MB_ICONINFORMATION);


When C++Builder 2009 imports older projects, it usually sets the TCHAR mapping to "char". This implies that neither UNICODE nor _UNICODE is defined, so all Windows functions are mapped to their ANSI variants. Unlike Delphi programmers, C++Builder programmers mostly use AnsiString explicitly instead of the String typedef, and most functions from the C and C++ standard libraries and from most 3rd-party C or C++ libraries still use char-based strings as opposed to the wchar_t-based ones which are now the default in Windows and the VCL. This setting therefore simplifies the migration somewhat. (Of course, if you want your application to properly support Unicode, you will have to change the mapping to "wchar_t" and adjust your string handling code accordingly.)

The migration of Delphi projects has to deal with different problems. AnsiString is hardly used in Delphi code; most programmers simply use the native string type, String. Unlike AnsiString, which is explicitly defined as a single-byte string, String is designed to be a generic type whose meaning can change when required. It has changed its meaning once before: in Turbo Pascal and Delphi 1, strings were allocated on the stack and limited to 255 chars. With the advent of Win32 and its 32-bit address space, Delphi 2 changed the default string type to AnsiString, which features copy-on-write semantics and is dynamically allocated. (This history is the reason why String is indexed from 1: since the developers of Turbo Pascal wanted to avoid the inherent problems of ASCIIZ strings and the cumbersome manual string management, they decided to allocate strings on the stack and to store the length in the first byte, which effectively limited them to 255 chars. In a DOS environment, this was a bearable trade-off, but not anymore in Win32.) In Delphi 2009, the String and Char types are UTF-16-based, and therefore most code can be made Unicode-ready without much hassle. This doesn't mean that the migration is totally seamless for Delphi code: a lot of older code breaks in subtle ways if it assumes that SizeOf (Char) = 1, uses pointer arithmetic with PChar or stores binary data in strings (which was justifiable for some time, since Delphi 4 was the first version to support dynamic arrays).

Anyway, the code above resembles real-world C++Builder code closely enough to serve as an example. Due to the TCHAR mapping, C++Builder 2009 compiles the code without errors, but strange things happen at runtime:



To find the error, let's install CbdeFormat and rebuild the application. Now CbdeFormat is active and checks format strings for errors. When the code is executed again, CbdeFormat throws an exception, and we see this dialog:



After adding SystemCppException to the project, we see the actual error message:



This tells us that the first format string argument is of type wchar_t* but should be a char*. (Other deviations are noted as well, but most of these can be converted implicitly and thus don't raise an exception; examples are the implicit const_cast<> (from T* to const T*) and the conversion between integral types like int and unsigned long.)

The reason for this discrepancy is the Unicode VCL: independently of the project settings, all strings used and exposed by Delphi code such as the VCL are really UnicodeString in C++Builder 2009. TEdit::Text is no exception.
In this case, the problem can be sidestepped by either adjusting the format string ("%ls" instead of "%s") or by casting the argument to AnsiString explicitly: AnsiString (EdtUserName->Text).c_str ().
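
The first of these two fixes can be tried outside the VCL as well. The following sketch uses std::wstring as a stand-in for UnicodeString; the function name greet is made up, and the result of the "%ls" conversion depends on the current locale for non-ASCII characters:

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the VCL code above: the user name arrives as a
// wide string, but the format function is the narrow-character variant.
std::string greet (const std::wstring& userName)
{
    char buf[128] = "";

        // "%s" would expect const char* here; "%ls" tells the narrow
        // snprintf() to expect const wchar_t* and to convert it.
    int written = std::snprintf (buf, sizeof buf, "Hello %ls!",
                                 userName.c_str ());
    if (written < 0)
        return std::string ();      // conversion failed in this locale
    return buf;
}
```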

After fixing this problem, we run right into the next one:



Implausible as it may seem, I've actually seen C++Builder code passing strings directly to sprintf(). This usually works since AnsiString and UnicodeString only contain a raw pointer to the string data, but that is a detail of the implementation that may change at any time, and I don't see any point in relying on it - after all, what is c_str() good for?

Adjust the code just as above, and you'll see the message box as originally intended:




Example 2: Debugging complex format strings


In a recent application I wrote code similar to the following:
    std::vector <String> values;
    values.push_back (String ().sprintf (_D ("File path: %s"),
        image->getImagePath ().c_str ()));
    values.push_back (String ().sprintf (_D ("Width: %f %s"),
        ppimage.getXLength (), ppimage.getXUnit ().c_str ()));
    values.push_back (String ().sprintf (_D ("Height: %f %s"),
        ppimage.getYLength (), ppimage.getYUnit ().c_str ()));
    values.push_back (String ().sprintf (_D ("Range: %.3f %s"),
        ppimage.getMaxVal () - ppimage.getMinVal (),
        ppimage.getZUnit ().c_str ()));
    values.push_back (String ().sprintf (_D ("X Resolution: %.3f %s/px"),
        ppimage.getXLength () / ppimage.getWidth (),
        ppimage.getXUnit ().c_str ()));
    values.push_back (String ().sprintf (_D ("Y Resolution: %.3f %s/px"),
        ppimage.getYLength () / ppimage.getHeight (),
        ppimage.getYUnit ().c_str ()));

(Yes, this is ugly, and yes, the code was preliminary.)
The situation is a bit confusing. image, for example, is an object of type ImageManager*, a class that holds and manages images. ImageManager uses some classes from the Delphi RTL and, to remain consistent, the Delphi string type, i.e. System::String. ppimage, however, is of type pp::Image&, which contains the actual image data in an array of double values and provides file persistence functionality. It is implemented in platform-independent C++ and uses std::string. Furthermore, ppimage.getXLength() returns a double value (in units of ppimage.getXUnit()) whereas ppimage.getWidth() returns an unsigned int (in pixels). Given all this, you can probably guess what the initial output looked like:




CbdeFormat catches all these errors. However, by default it raises an exception when it encounters an invalid format statement, thereby interrupting code execution. This is often appropriate, but it can be annoying while debugging since it basically requires us to rebuild the program after fixing one format statement before we can address the next format statement error. In the case above, we would need to rebuild five times. This is a bit inconvenient, and thus we choose to install another error handler instead. format_error.hpp contains the following interface for doing this:
  // format_error.hpp
namespace cbde
{

...

typedef void (*FormatStringErrorHandlerT) (const FormatStringErrorDescriptor& fse);

void setFormatStringErrorHandler (FormatStringErrorHandlerT newHandler);
FormatStringErrorHandlerT getFormatStringErrorHandler (void);

    // This is the default.
void formatStringErrorException (const FormatStringErrorDescriptor& fse);

void formatStringErrorMessageAndAbort (const FormatStringErrorDescriptor& fse);

    // Asks the user what to do next. Suitable for debugging purposes only.
void formatStringErrorDebug (const FormatStringErrorDescriptor& fse);


} // namespace cbde
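
An interface of this shape also allows custom handlers. The following self-contained sketch imitates the mechanism with invented names (namespace sketch, ErrorDescriptor, collectError and so on); it does not use the CbdeFormat headers, and the members of the real FormatStringErrorDescriptor are not shown in this article:

```cpp
#include <string>
#include <vector>

namespace sketch
{
        // Stand-in for the error descriptor; it merely carries a message.
    struct ErrorDescriptor { std::string message; };

    typedef void (*ErrorHandlerT) (const ErrorDescriptor& error);

    std::vector <std::string> collectedErrors;

        // Custom handler: records the error and returns, so execution
        // continues with the next format statement.
    void collectError (const ErrorDescriptor& error)
    {
        collectedErrors.push_back (error.message);
    }

    ErrorHandlerT currentHandler = collectError;

    void setErrorHandler (ErrorHandlerT newHandler) { currentHandler = newHandler; }
    ErrorHandlerT getErrorHandler () { return currentHandler; }

        // Called by a (hypothetical) checker whenever a mismatch is found.
    void reportError (const std::string& message)
    {
        ErrorDescriptor error;
        error.message = message;
        currentHandler (error);
    }
}
```

A handler that returns normally, like collectError here, lets several errors accumulate in a single run, whereas a throwing handler stops at the first one.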
For our case, the debug handler is appropriate. Install it as follows:
  // Project.cpp
...
#include <cbde/format_error.hpp>

//---------------------------------------------------------------------------
USEFORM("MainUnit.cpp", FrmMain);
...
//---------------------------------------------------------------------------
WINAPI _tWinMain(HINSTANCE, HINSTANCE, LPTSTR, int)
{
    try
    {
        cbde::setFormatStringErrorHandler (cbde::formatStringErrorDebug);
        ...
Now CbdeFormat displays a message that allows us to continue execution:



If a format string bug does not cause a crash, we can address multiple consecutive format string issues this way without having to rebuild after every fix. In our case, the final result looks like this:




Example 3: Localization bug


Internationalizing your program with gettext gives you the advantage of an easily extensible resource mechanism: every user can download poEdit [...]

[...] CBDE_FORMAT_CHECK_DEBUG (which includes position information: file name, line number, function) or CBDE_FORMAT_CHECK (without position information) in the project options.

With CBDE_FORMAT_CHECK defined, the erroneous translation causes this error message:



Note that the overhead is minimal. The format string parser increases the executable size by approximately 2 KB; it doesn't perform heap allocations unless it finds errors in your format strings.

The additional code is minimal as well. Without format string checking, BCC generates this code:
; MainUnit.cpp.42: std::fprintf (stderr, gettext ("E2094: Operator '%s' is...
004019B4 8B4D10           mov ecx,[ebp+$10]
004019B7 51               push ecx
004019B8 8B450C           mov eax,[ebp+$0c]
004019BB 50               push eax
004019BC 8B5508           mov edx,[ebp+$08]
004019BF 52               push edx
004019C0 68FA114700       push $004711fa
004019C5 E8BAFFFFFF       call _gettext
004019CA 59               pop ecx
004019CB 50               push eax
004019CC 8B0DD0CD4700     mov ecx,[$0047cdd0]
004019D2 83C130           add ecx,$30
004019D5 51               push ecx
004019D6 E8ADF30600       call _fprintf
004019DB 83C414           add esp,$14

With CBDE_FORMAT_CHECK activated, the statement results in the following code:
; MainUnit.cpp.42: std::fprintf (stderr, gettext ("E2094: Operator '%s' is...
004019B7 8B7510           mov esi,[ebp+$10]
004019BA 8B7D0C           mov edi,[ebp+$0c]
004019BD 8B4508           mov eax,[ebp+$08]
004019C0 8945D4           mov [ebp-$2c],eax
004019C3 68FA214700       push $004721fa
004019C8 E8B7FFFFFF       call _gettext
004019CD 8BD8             mov ebx,eax
004019CF A1D0DD4700       mov eax,[$0047ddd0]
004019D4 59               pop ecx
004019D5 83C030           add eax,$30
004019D8 8945D0           mov [ebp-$30],eax
004019DB 6A03             push $03
004019DD 68D0224700       push $004722d0
004019E2 53               push ebx
004019E3 E830280000       call cbde::verifyPrintfFormatString(const char *,...
004019E8 83C40C           add esp,$0c
004019EB 56               push esi
004019EC 57               push edi
004019ED 8B55D4           mov edx,[ebp-$2c]
004019F0 52               push edx
004019F1 53               push ebx
004019F2 8B4DD0           mov ecx,[ebp-$30]
004019F5 51               push ecx
004019F6 E815F60600       call _fprintf
004019FB 83C414           add esp,$14
This is equivalent to the following C++ code and is therefore close to optimal:
    static const unsigned argTypeTable[3] = {
        cbde::TypeID <decltype (op)>::value,
        cbde::TypeID <decltype (lhstype)>::value,
        cbde::TypeID <decltype (rhstype)>::value,
    };
    const char* theFormatString = gettext ("E2094: Operator '%s' is not "
                                           "implemented in type '%s' "
                                           "for arguments of type '%s'");
    cbde::verifyPrintfFormatString (theFormatString, argTypeTable,
        sizeof (argTypeTable) / sizeof (unsigned));
    std::fprintf (stderr, theFormatString, op, lhstype, rhstype);


Enabling Format String Checking for your own functions


The installer configures CbdeFormat for a fixed set of format string functions - but you can easily add your own functions. Let me demonstrate how to do this using the example of the str_printf() function I showed above:
  • First, the CbdeFormat header files must be regenerated. Before doing that, open "defaultHeaderSettings.dat" located in the CbdeFormat installation directory with your favorite text editor and add the function name to the list:
    object TFormatHeaderSettings: TPersistenceWrapper
      Persistent.MaxFormatParams = 12
      Persistent.MaxFixedParams = 3
      Persistent.FormatNames.Strings = (
        'str_printf'
        'wstr_printf'
        'printf'
        'wprintf'
        'sprintf'
        'swprintf'
        'fprintf'
        'fwprintf'
        'scanf'
        'wscanf'
        'sscanf'
        'swscanf'
        'fscanf'
        'fwscanf'
        'snprintf'
        'snwprintf'
        '_snprintf'
        '_snwprintf'
        'cat_printf'
        'cat_sprintf')
    end
  • Now call FORMATGEN defaultHeaderSettings.dat "$(BDS)\include\cbde" from the command line (replacing $(BDS) with the path of C++Builder).

  • Next, the actual header file needs to be adjusted slightly:
    #ifndef _STR_PRINTF_HPP
    #define _STR_PRINTF_HPP
    
    #include <string>
    
    + #include <cbde/format_definition_begin.hpp>
    
    std::string str_printf (const char* format, ...);
    
    +     // Printf|Scanf, format_string_type, return_type, func_name
    + CBDE_FORMAT_DECLARE_SAFE (Printf, const char*, std::string, str_printf)
    
    + #include <cbde/format_definition_end.hpp>
    
    #endif // _STR_PRINTF_HPP
  • A simple change must also be applied to the source file:
    #include <string>
    #include <cstdarg>
    #include <cstdio>
    #pragma hdrstop
    
    #include "str_printf.hpp"
    
    + #include <cbde/format_definition_begin.hpp>
    
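    std::string str_printf (const char* format, ...)
    {
        std::va_list args;
        std::string retval;
    
            // If passing 0 as buffer pointer, the function only calculates
            // the required buffer length.
        va_start (args, format);
        int length = std::vsnprintf (0, 0, format, args);
        va_end (args);
        if (length <= 0)
            return retval;      // formatting error or empty result
    
            // A va_list must not be reused after a v*printf() call, so it
            // is initialized again for the second pass.
        retval.resize (length);
        va_start (args, format);
        std::vsnprintf (&retval[0], retval.size () + 1, format, args);
        va_end (args);
    
        return retval;
    }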



References


[1] Wikipedia: Format string vulnerabilities, as of June 8th, 2009
[2] Boost Format library
[3] Using the GNU Compiler Collection (GCC): Options to Request or Suppress Warnings (-Wformat)
[4] Joel Spolsky: Back to Basics, December 11th, 2001


Comments


