Type-safe format strings

Contributing to the rehabilitation of printf.
printf() in C

Almost every high-level language has some notion of a string formatting mechanism. Even C, which some claim to be just a PDP-11 assembler that thinks it is a language, has the printf() function and its numerous variants. Thanks to its flexibility and simplicity, C's printf() often served as the prototype for more recent languages, be it PHP, Delphi, C#/.NET or Java.

In C, however, printf() has a severe flaw: it is not type-safe. The format string is parsed at runtime, but printf() has no knowledge of the types of its variadic arguments. This has proven to be a prevalent source of errors, which can even be exploited to inject code. A few harmless examples follow:
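As an illustration of the kind of mistake meant here, the following minimal sketch lists a few typical mismatches as comments next to one correct call. The function name format_count() and the use of snprintf() are assumptions made for this example only; the erroneous calls are commented out because executing them is undefined behavior.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats an integer correctly. The commented-out
// calls show mistakes that a compiler may accept without complaint.
std::string format_count(int count)
{
    char buffer[32];
    // Wrong: "%s" expects a char*, but an int is passed.
    //   std::snprintf(buffer, sizeof buffer, "%s", count);
    // Wrong: "%d" expects an int, but a double is passed.
    //   std::snprintf(buffer, sizeof buffer, "%d", 3.14);
    // Wrong: two conversions, but only one argument.
    //   std::snprintf(buffer, sizeof buffer, "%d of %d", count);
    // Wrong: untrusted text used as the format string itself.
    //   std::snprintf(buffer, sizeof buffer, user_input);
    // Right: conversion and argument type agree.
    std::snprintf(buffer, sizeof buffer, "%d", count);
    return buffer;
}
```

None of the commented-out lines is rejected by the language itself; at best, a compiler with format checking enabled emits a warning.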
The printf variants suffer from additional vulnerabilities even if parameterized properly. For instance, sprintf(), swprintf() and all variants of scanf() can easily cause buffer overflows:
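A minimal sketch of the overflow problem and of the usual bounded workaround, assuming a compiler that provides snprintf() (C99/C++11); the function name format_greeting() is hypothetical. The unsafe sprintf() call is shown only as a comment.

```cpp
#include <cstddef>
#include <cstdio>

// Writes "Hello, <name>!" into buffer and reports whether it fit completely.
bool format_greeting(char* buffer, std::size_t size, const char* name)
{
    // Unsafe variant: std::sprintf(buffer, "Hello, %s!", name);
    // It writes past the end of buffer as soon as name is too long.
    int needed = std::snprintf(buffer, size, "Hello, %s!", name);
    // snprintf() never writes more than size bytes and returns the length
    // the complete output would have had, so truncation can be detected.
    return needed >= 0 && static_cast<std::size_t>(needed) < size;
}
```

The bounded variant avoids the overflow, but it still leaves the type-safety problem untouched.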
The C++ alternative: Streams

Most programming languages that borrowed the basic concept of printf() are not susceptible to buffer overflows (due to dynamic string management), nor do they lack type safety, since they do not rely on variadic stack parameters but rather pass implicitly constructed arrays of one kind or another.

However, the designers of the C++ language chose a totally different approach. All traditional C standard library functions are available in C++, and the new language features made it possible to prevent the buffer overflow problem by using a string class. But the concept of IO streams, which can easily be extended with support for custom types, encouraged the impression that printf() was not appropriately modern for the language due to its lack of extensibility. Because of that, printf() and the related set of functions are widely discouraged in C++; the different kinds of streams are the preferred solution for string formatting.

Streams combine type safety and extensibility with the flexibility of printf(): thanks to argument-dependent lookup (ADL), the stream operators can be overloaded for custom classes in their own namespace; manipulators can be used to adjust the output format of streams; any kind of stream (file streams, console I/O streams, stringstreams) can be used with every stream operator, and so on.

The problem is that it looks just plain ugly. As an example, look at the translation of a concise format statement from C to C++:
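The contrast can be sketched as follows. The format string and both function names are chosen for this illustration only; the two functions are meant to produce identical output, one with a printf()-style format string and one with stream manipulators.

```cpp
#include <cstdio>
#include <iomanip>
#include <sstream>
#include <string>

// printf style: one compact format string.
std::string format_with_printf(unsigned id, double value)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "0x%08X: %10.3f", id, value);
    return buffer;
}

// Stream style: the same output, spelled out with manipulators.
std::string format_with_streams(unsigned id, double value)
{
    std::ostringstream out;
    out << "0x" << std::hex << std::uppercase << std::setfill('0')
        << std::setw(8) << id
        << ": " << std::dec << std::setfill(' ') << std::fixed
        << std::setprecision(3) << std::setw(10) << value;
    return out.str();
}
```

Note that most manipulators are sticky: the fill character and number base have to be reset by hand before the second value is written.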
Even in less extreme cases, it is obvious that stream-based formatting makes it difficult to internationalize strings properly. Assume that the following compiler error is to be translated using gettext:

Instead of a single string with placeholders, we now have several string fragments. As you can imagine, the translators will not be amused, and if they do not know the context, they will be even less amused. Another problem becomes visible when looking at the German translation:

The German state passive usually puts the participle after the noun it relates to. When internationalizing string fragments with gettext as above, this is plainly impossible.

There are more good reasons to avoid C++'s IO streams for formatting. The most flexible alternative I know of is the Boost Format library, which manages to combine most of the advantages of printf() and IO streams - and even compensates for some of the flaws in C's printf(): it supports multiple references to the same parameter as well as parameter reordering.

CbdeFormat: making printf() type-safe

For newly created projects, the Boost Format library is likely the best approach, but it might not be an option in some scenarios: if you do not want to have your project depend on Boost, or if you are dealing with lots of legacy code that uses printf()-style functions extensively and needs to be maintained or migrated.

Interestingly, C++ provides all the prerequisites necessary for systematically preventing such errors. As a first step, memory management can be automated to avert the danger of buffer overflows in sprintf():

Unfortunately, such a construct does not exist in the standard C++ libraries. As with much other basic functionality missing from the standard libraries, 3rd-party application frameworks usually provide a suitable implementation, such as String::sprintf() in C++Builder and CString::Format() in ATL/MFC.
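The automated memory management described above can be sketched in standard C++ roughly as follows. The name str_printf() is the one used later in this article; the body is an assumption, not the article's listing, and it relies on vsnprintf() and va_copy from C++11, which older compilers may not provide.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Sketch of a sprintf() variant that returns a std::string and sizes its
// buffer automatically. It is still not type-safe: arguments are variadic.
std::string str_printf(const char* format, ...)
{
    va_list args;
    va_start(args, format);
    va_list args_copy;
    va_copy(args_copy, args);
    // First pass: ask vsnprintf() how many characters are needed.
    int length = std::vsnprintf(nullptr, 0, format, args_copy);
    va_end(args_copy);

    std::string result;
    if (length > 0) {
        std::vector<char> buffer(static_cast<std::size_t>(length) + 1);
        // Second pass: format into a buffer of exactly the right size.
        std::vsnprintf(buffer.data(), buffer.size(), format, args);
        result.assign(buffer.data(), static_cast<std::size_t>(length));
    }
    va_end(args);
    return result;
}
```

A call such as str_printf("%s has %d items", "cart", 3) can no longer overflow a buffer, but a mismatched argument is still undefined behavior.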
(Note that C++Builder also supports SysUtils::Format() and String::Format(), but these functions use the more powerful but divergent Delphi format string syntax and do not perform optimally when used from C++ code. On the other hand, these functions are type-safe out of the box.)

Both of the printf() variants mentioned avoid the buffer overflow problem, but they are still type-unsafe. While GCC provides static format string checking with the -Wformat switch, this does not work across compilers or if the format string is obtained at runtime, e.g. when returned by gettext(). However, the newer language features of C++, most notably templates and type inference, make it possible to capture the argument types at compile time and check them against the format string at runtime.

My particular implementation for C++Builder, CbdeFormat, is the primary subject of this article. CbdeFormat works with C++Builder 2006, 2007 and 2009 (BCC does not support variadic macros in versions prior to 5.8). Although I wrote the library for use with C++Builder, it does not rely on any C++Builder specifics and should be easy to adapt to other environments. The implementation meets most of my requirements for a type-safe printf() variant:
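To make the general idea of template-based runtime checking concrete, here is a minimal sketch. It is not CbdeFormat's implementation: it uses C++11 variadic templates rather than the macro-based approach mentioned above, it supports only a handful of argument types, it ignores '*' widths, and all names (parse_conversions, matches, check_format) are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Extracts the conversions of a format string, e.g. "%5d %ls" -> {"d", "ls"}.
std::vector<std::string> parse_conversions(const std::string& format)
{
    const std::string flags = "-+ #0123456789.";
    const std::string modifiers = "hlLzjt";
    std::vector<std::string> result;
    for (std::size_t i = 0; i < format.size(); ++i) {
        if (format[i] != '%') continue;
        ++i;
        if (i < format.size() && format[i] == '%') continue;  // literal "%%"
        // Skip flags, width and precision.
        while (i < format.size() && flags.find(format[i]) != std::string::npos) ++i;
        std::string conversion;
        // Keep length modifiers, since they change the expected type.
        while (i < format.size() && modifiers.find(format[i]) != std::string::npos)
            conversion += format[i++];
        if (i < format.size()) conversion += format[i];
        result.push_back(conversion);
    }
    return result;
}

// One overload per supported argument type; the compiler picks it by type.
inline bool matches(const std::string& c, int)            { return c == "d" || c == "i" || c == "c"; }
inline bool matches(const std::string& c, unsigned)       { return c == "u" || c == "x" || c == "X" || c == "o"; }
inline bool matches(const std::string& c, double)         { return c == "f" || c == "e" || c == "g"; }
inline bool matches(const std::string& c, const char*)    { return c == "s"; }
inline bool matches(const std::string& c, const wchar_t*) { return c == "ls"; }

// End of recursion: every conversion must have consumed an argument.
inline void check_args(const std::vector<std::string>& conversions, std::size_t index)
{
    if (index != conversions.size())
        throw std::invalid_argument("too few arguments for format string");
}

// Checks one argument against its conversion, then recurses on the rest.
template <typename T, typename... Rest>
void check_args(const std::vector<std::string>& conversions, std::size_t index,
                const T& arg, const Rest&... rest)
{
    if (index >= conversions.size())
        throw std::invalid_argument("too many arguments for format string");
    if (!matches(conversions[index], arg))
        throw std::invalid_argument("argument " + std::to_string(index + 1) +
                                    " does not match conversion %" + conversions[index]);
    check_args(conversions, index + 1, rest...);
}

// Entry point: the format string is parsed at runtime, while the argument
// types are known from template argument deduction.
template <typename... Args>
void check_format(const std::string& format, const Args&... args)
{
    check_args(parse_conversions(format), 0, args...);
}
```

With this sketch, check_format("%s", L"wide") throws because a wchar_t* is passed where a char* is expected, whereas check_format("%ls", L"wide") passes. Argument types without a matches() overload, such as a string object, fail to compile.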
Of course, this leaves a few disadvantages to be mentioned:
The library can be downloaded in the C++Builder section. The supplied installer performs all steps required for successful integration of CbdeFormat within C++Builder 2006, 2007 and 2009. In detail:
Now CbdeFormat is installed, ready for use - and enabled by default for Debug builds. The following sections cover a few more or less realistic use cases.

Example 1: Migration to C++Builder 2009/Unicode

Due to the Unicode transition in C++Builder 2009, the String::sprintf function now takes wide arguments. This is one of the more subtle changes in C++Builder 2009; most other changes cause compilation errors which are easy to locate and fix. Let's look at a small snippet of example code which might appear in similar form in many real-world C++Builder applications:
When C++Builder 2009 imports older projects, it usually sets the TCHAR mapping to "char". This implies that neither UNICODE nor _UNICODE is defined, so all Windows functions are mapped to their ANSI variants. Unlike Delphi programmers, C++Builder programmers mostly use AnsiString explicitly instead of the String typedef, and most functions from the C and C++ standard libraries and from most 3rd-party C or C++ libraries still use char-based strings as opposed to the wchar_t-based ones which are now the default in Windows and the VCL. This setting somewhat simplifies the migration. (Of course, if you want your application to properly support Unicode, you will have to change the mapping to "wchar_t" and adjust your string handling code accordingly.)

The migration of Delphi projects has to deal with different problems. AnsiString is hardly used in Delphi code; most programmers simply use the native string type, String. The String type is designed to be a generic type that can be changed when required, unlike AnsiString, which is explicitly defined as a single-byte string. The String type has changed its meaning once before: in Turbo Pascal and Delphi 1, strings were allocated on the stack and had a limit of 255 chars. With the advent of Win32 and therefore 32 bits of memory address space, Delphi 2 changed the default string type to AnsiString, which features copy-on-write semantics and is dynamically allocated. (The history of String is the reason for String being indexed from 1 onwards: since the developers of Turbo Pascal wanted to avoid the inherent problems of ASCIIZ strings and the cumbersome manual string management, they decided to allocate strings on the stack and to store the length in the first byte, which effectively limited them to 255 chars. In a DOS environment, this was a bearable trade-off, but not anymore in Win32.) In Delphi 2009, the String and Char types are UTF-16-based, and therefore most code can be made Unicode-ready without much hassle.
This doesn't mean that the migration is totally seamless for Delphi code - a lot of older code is broken in subtle ways: if it assumes that SizeOf (Char) = 1, uses pointer arithmetic with PChar or stores binary data in strings (which was justifiable for some time, since Delphi 4 was the first version to support dynamic arrays).

Anyway, the code above resembles real-world C++Builder code closely enough to serve as an example. Due to the TCHAR mapping, C++Builder 2009 compiles the code without errors, but strange things happen at runtime:

To find the error, let's install CbdeFormat and rebuild the application. Now CbdeFormat is active and checks format strings for errors. When the code is executed again, CbdeFormat throws an exception, and we see this dialog:

After adding SystemCppException to the project, we see the actual error message:

This tells us that the first format string parameter is of type wchar_t* but should be a char*. (Other deviations are reported as well, but most of these can be converted implicitly and thus don't raise an exception; examples are the implicit qualification conversion from T* to const T* and the conversion between integral types like int and unsigned long.) The reason for this discrepancy is the Unicode VCL: independently of the project settings, all strings used and exposed by Delphi code such as the VCL are really UnicodeString in C++Builder 2009. TEdit::Text is no exception. In this case, the problem can be sidestepped either by adjusting the format string ("%ls" instead of "%s") or by casting the argument to AnsiString explicitly: AnsiString (EdtUserName->Text).c_str ().

After fixing this problem, we run right into the next one:

Implausible as it may seem, I've actually seen C++Builder code passing strings directly to sprintf().
This usually works since AnsiString and UnicodeString only contain a raw pointer to the string data, but that is a detail of the implementation that may change at any time, and I don't see any point in relying on it - after all, what is c_str() good for? Adjust the code just as above, and you'll see the message box as originally intended:

Example 2: Debugging complex format strings

In a recent application I wrote code similar to the following:
(Yes, this is ugly, and yes, the code was preliminary.) The situation is a bit confusing. image, for example, is of type ImageManager*, a pointer to a class that holds and manages images. ImageManager uses some classes from the Delphi RTL and, to remain consistent, the Delphi string type, i.e. System::String. ppimage, however, is of type pp::Image&, which contains the actual image data in an array of double values and provides file persistence functionality. It is implemented in platform-independent C++ and uses std::string. Further, ppimage.getXLength() returns a double value (measured in the unit reported by ppimage.getXUnit()), whereas ppimage.getWidth() returns an unsigned int (in pixels). Given all this, you may already guess what the initial output looked like:

CbdeFormat catches all these errors. However, by default it raises an exception when it encounters an invalid format statement, thereby interrupting code execution. This is often appropriate, but it can be annoying while debugging, since it basically requires us to rebuild the program after fixing one format statement before we can address the next format statement error. In the case above, we would need to rebuild five times. This is a bit inconvenient, and thus we choose to install another error handler instead. format_error.hpp contains the following interface for doing this:

For our case, the debug handler is appropriate. Install it as follows:

Now CbdeFormat displays a message that allows us to continue execution:

If a format string bug does not cause a crash, we can address multiple consecutive format string issues this way without having to rebuild after every fix.
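The general pattern of a replaceable error handler can be sketched as follows. This is not the interface from format_error.hpp; every name here (FormatErrorHandler, set_format_error_handler, report_format_error) is hypothetical and only illustrates how a throwing default can be swapped for a handler that lets execution continue.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical handler type: receives the description of a format error.
using FormatErrorHandler = std::function<void(const std::string&)>;

// Default behavior, as described above: raise an exception.
inline void throwing_handler(const std::string& message)
{
    throw std::runtime_error(message);
}

// Holds the currently installed handler.
inline FormatErrorHandler& active_handler()
{
    static FormatErrorHandler handler = throwing_handler;
    return handler;
}

// Installs a new handler and returns the previous one so it can be restored.
inline FormatErrorHandler set_format_error_handler(FormatErrorHandler handler)
{
    FormatErrorHandler previous = active_handler();
    active_handler() = handler;
    return previous;
}

// Called by the (hypothetical) checker whenever it finds a mismatch.
inline void report_format_error(const std::string& message)
{
    active_handler()(message);
}
```

A debugging handler would typically log or display the message and return, so that the next format statement can be checked in the same run.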
In our case, the final result looks like this:

Example 3: Localization bug

Internationalizing your program with gettext gives you the advantage of an easily extensible resource mechanism: every user can download poEdit [...] CBDE_FORMAT_CHECK_DEBUG (which includes position information: file name, line number, function) or CBDE_FORMAT_CHECK (without position information) in the project options. With CBDE_FORMAT_CHECK defined, the erroneous translation causes this error message:

Note that the overhead is minimal. The format string parser increases the executable size by approximately 2 KB, and it doesn't perform heap allocations unless it finds errors in your format strings. The additional code is minimal as well. Without format string checking, BCC generates this code:
With CBDE_FORMAT_CHECK activated, the statement results in the following code:

This is equivalent to the following C++ code and therefore close to optimal:
Enabling Format String Checking for your own functions

The installer configures CbdeFormat for a fixed set of format string functions - but you can easily add your own functions. Let me demonstrate how to do this using the example of the str_printf() function I showed above:
References

[1] Wikipedia: Format string vulnerabilities, as of 08.06.2009
[2] Boost Format library
[3] Using the GNU Compiler Collection (GCC): Options to Request or Suppress Warnings (-Wformat)
[4] Joel Spolsky: Back to Basics, 11.12.2001