High Performance XML Parsing in C++

In my last article (XML-J, Vol. 1, issue 3) I made the case for using custom classes derived from XML Schemas to represent XML documents in C++ applications. That article focused primarily on the problems of generating XML documents from program objects, and explained how custom classes have significant advantages over standards like DOM and SAX in terms of performance, object orientation and maintainability of source code.

Here I'll describe a unique methodology for parsing XML data into C++ classes that provides all the object-oriented benefits detailed in the first article, with increased performance (compared to traditional generic XML parsers).

The Problem with Conventional Parsers
C++ programmers have been dealing with parsing technologies for years. Most of you remember writing simple language parsers in school, probably building the basic syntax parser with tools like Lex and Yacc. So, for C++ developers, the idea of a syntax parser isn't especially intimidating.

The basic grammar for XML is pretty simple compared to that of a programming language such as C++ or Java, but there's one problem unique to XML parsing that is daunting: unlike conventional programming languages, XML doesn't have a fixed set of tags (i.e., keywords). Imagine trying to develop a general-purpose grammar for a programming language with a user-defined set of keywords!

To solve the general problem of XML parsing, it's necessary to build a parser that can be dynamically fed a list of tags and rules for the specific dialect of XML to be parsed. In the terminology of XML standards, that means specifying an XML Schema file to a DOM parser so that it knows how to parse and validate the specific dialect of the input XML file.

If an application reads and writes a variety of dialects of XML documents, the DOM model is appropriate because it doesn't require source code changes for incremental support for a new dialect of XML. This is typically the case for integration broker applications, as described in my last article, in which the broker is reading, transforming and forwarding all kinds of XML documents within and between organizations.

However, as I also described, there's a large class of applications in which only a few types of XML are spoken and these don't often change. For these, the overhead of DOM and the lack of application-specific object orientation is a major drawback.

Static Parsers Derived from XML Schemas
Just as it's beneficial in some environments to derive C++ classes from XML Schemas for writing XML documents, it can also be beneficial to derive classes to read XML documents from schemas.

The typical process for creating a language parser in C++ is to hand-code the Lex rules and Yacc grammar, then generate the lexer and parser from these dialect-specific input files (see Figure 1).

This process is tedious, however, and must be redone for each dialect of XML that your application needs to parse. It's doable, but the same logic that you'd hand-code in the rules and grammar is already encapsulated in the XML Schema file. A more efficient approach is to develop a translation program that can convert the XML Schema file into the equivalent Lex rules and Yacc grammar for the XML dialect (see Figure 2).
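To make the translation step concrete, here is a minimal sketch of what such a generator might emit. It is not the translator used for the example project: the ElementDecl structure, the function names, the token naming scheme and the pc/cpu/ram element names are all assumptions made for illustration, and the model ignores attributes, optional or repeated children, and whitespace inside tags.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, heavily simplified view of one element declaration taken
// from a schema: the element name and its required children, in order.
struct ElementDecl {
    std::string name;
    std::vector<std::string> children;  // empty: element holds only text
};

// Upper-case an element name so it can serve as a token name.
std::string TokenName(const std::string& element) {
    std::string token;
    for (unsigned char c : element) {
        token += static_cast<char>(std::toupper(c));
    }
    return token;
}

// Emit one pair of Lex rules per element: the start tag and end tag each
// become a distinct token, so the tag set is fixed at generation time.
std::string EmitLexRules(const std::vector<ElementDecl>& decls) {
    std::ostringstream out;
    for (const ElementDecl& d : decls) {
        assert(!d.name.empty());
        out << "\"<" << d.name << ">\"  { return " << TokenName(d.name)
            << "_START; }\n";
        out << "\"</" << d.name << ">\" { return " << TokenName(d.name)
            << "_END; }\n";
    }
    return out.str();
}

// Emit one Yacc production per element: start token, then either a TEXT
// token or the child nonterminals in order, then the end token.
std::string EmitYaccRules(const std::vector<ElementDecl>& decls) {
    std::ostringstream out;
    for (const ElementDecl& d : decls) {
        assert(!d.name.empty());
        const std::string token = TokenName(d.name);
        out << d.name << "\n    : " << token << "_START";
        if (d.children.empty()) {
            out << " TEXT";
        } else {
            for (const std::string& child : d.children) out << ' ' << child;
        }
        out << ' ' << token << "_END\n    ;\n\n";
    }
    return out.str();
}
```

Because every tag becomes its own token and every element its own production, a document that breaks the declared structure is rejected by the generated parser as a plain syntax error. A real translator would also have to map element names that are not valid C identifiers onto legal token names.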

The example project in Listing 1 shows a generated grammar for a sample XML DTD file called acmepc.dtd. You'll see the generated Yacc input in acmepcxml_parser.y and the Lex input in acmepcxml_lexer.l. All the classes and parser for this project are contained in the C++ namespace acmepcxml.

Using the generated custom parser is simple. Just create an instance of the acmepcxml::XMLImporter class, initialize it with its Initialize() member and import the XML data into the schema-derived classes with the ImportFromFile() member. The importer exposes a base class root node of the class tree via the GetXObject() member. This base class is then dynamically cast back to the acmepc class that contains the context of the specific XML dialect defined by the acmepc.dtd schema (see Listing 1).
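The calling sequence just described might look roughly like the sketch below. Only the names XMLImporter, Initialize(), ImportFromFile(), GetXObject() and acmepc come from the text; the base class name XObject, every signature and return type, the source member and the stub bodies are assumptions, supplied so the example compiles on its own. The stub does no parsing at all.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-ins for the generated code. Signatures, members and bodies are
// assumptions; the real generated classes are not reproduced here.
namespace acmepcxml {

class XObject {  // hypothetical name for the common base class
public:
    virtual ~XObject() = default;
};

class acmepc : public XObject {
public:
    std::string source;  // hypothetical member, filled by the stub below
};

class XMLImporter {
public:
    bool Initialize() {
        initialized_ = true;
        return true;
    }

    // The generated importer would run the Lex/Yacc parser here. This stub
    // only builds an empty root object so the calling sequence can be shown.
    bool ImportFromFile(const std::string& path) {
        assert(!path.empty());
        if (!initialized_) return false;
        std::unique_ptr<acmepc> doc(new acmepc);
        doc->source = path;
        root_ = std::move(doc);
        return true;
    }

    // Root of the class tree, exposed through the base class.
    XObject* GetXObject() const { return root_.get(); }

private:
    bool initialized_ = false;
    std::unique_ptr<XObject> root_;
};

}  // namespace acmepcxml

// Import a document and recover the dialect-specific root class.
// Returns nullptr if the import fails or the root has another type.
acmepcxml::acmepc* ImportAcmePc(acmepcxml::XMLImporter& importer,
                                const std::string& path) {
    if (!importer.Initialize()) return nullptr;
    if (!importer.ImportFromFile(path)) return nullptr;
    return dynamic_cast<acmepcxml::acmepc*>(importer.GetXObject());
}
```

In this sketch the importer owns the object tree, so the returned pointer is valid only while the importer is alive. Checking the result of dynamic_cast guards against a document whose root element belongs to a different dialect.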

Advantages of Custom Parser Approach
There are four primary advantages to creating a custom parser rather than using a generic parser like DOM.

  1. First and foremost, it's fast. I've run benchmarks that show the custom parser to be up to three times faster than the fastest DOM parser I could find while also having a smaller in-memory footprint. The primary reason it's so much faster than DOM seems to be that it doesn't have to do dynamic validation of the XML input. Instead, validation is enforced by the automata generated by Yacc from the input files, which are derived from the XML Schema.

  2. The generated parser can integrate tightly with the derived classes described in my previous article. There is no two-step process of parsing into the DOM hierarchy, then populating classes from the DOM data structures. The custom parser creates the schema-derived classes directly, without the need for the intermediate step. The generated parser can also integrate tightly with framework technologies you might be using, such as STL and MFC class libraries.

  3. You get all the source code to the components that link into your application. By using the freely available Flex and Bison tools, the output source code will run on virtually every operating system imaginable. I've been very successful, for example, in running Flex and Bison on Windows NT and using the output C/C++ code on a variety of platforms with no source code changes needed.

  4. The final advantage, and the coolest of all, is that using Lex and Yacc enables you to handle those pesky XML entities more easily. I use this feature to automatically expand entities on input so my program doesn't have to worry about them. XML entities can be preprocessed just as a macro is preprocessed by a compiler when parsing a C input file. The class instances created by the custom parser contain data with entity references fully expanded. I can't stress enough how many headaches this little feature can save you when dealing with documents with lots of entities.
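The macro-style entity expansion in the last point can be illustrated outside the lexer. The sketch below is plain C++ with a lookup table rather than Lex rules, so it shows the idea and not the generated code; the function names, the EntityTable alias and the vendor entity used in the usage note are assumptions. It performs a single pass, does not re-expand replacement text, and ignores numeric character references.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

using EntityTable = std::map<std::string, std::string>;

// The five entities predefined by the XML specification.
EntityTable PredefinedEntities() {
    return {{"amp", "&"}, {"lt", "<"}, {"gt", ">"},
            {"quot", "\""}, {"apos", "'"}};
}

// Replace every "&name;" whose name is in the table with its replacement
// text. Unknown references and stray '&' characters are copied unchanged.
std::string ExpandEntities(const std::string& input,
                           const EntityTable& table) {
    std::string output;
    output.reserve(input.size());
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] == '&') {
            const std::size_t semi = input.find(';', i + 1);
            if (semi != std::string::npos) {
                const auto it =
                    table.find(input.substr(i + 1, semi - i - 1));
                if (it != table.end()) {
                    output += it->second;
                    i = semi + 1;  // skip past the whole reference
                    continue;
                }
            }
        }
        output += input[i++];
    }
    return output;
}
```

With a table built from PredefinedEntities() plus a hypothetical vendor entity mapped to "Acme", the input "&vendor; R&amp;D" expands to "Acme R&D". In a generated lexer the same table would be filled from the entity declarations in the DTD, so the schema-derived classes only ever see expanded text.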

Conclusion
While XML processing may be new to the C++ community, the skills and technologies that have matured over the last decade in this community can still be very useful in handling XML data formats. In my last article I described the benefits of deriving C++ class definitions from XML Schemas. Here, I've gone a bit further to show how to derive parser grammars for XML dialects from the XML Schema.

As the XML Schema standard nears acceptance, there will be many other opportunities to reuse the work of schema designers to automatically derive programming source code, relational database schemas and other artifacts that otherwise would have to be coded by hand. C++ developers should look for these opportunities as ways to reduce the amount of repetitive work required to add or update support for specific XML dialects.

More Stories By Ken Blackwell

Ken Blackwell is the chief technical officer of Bristol Technology, Inc., where he oversees product architecture and research in XML, middleware and transaction analysis technologies.
