Comments
Richard Davies wrote: The UK has a good crop of technology pioneers in cloud computing - for example ElasticHosts, FlexiScale, Flexiant, OnApp - and also some strong government initiatives such as G-Cloud. We will have to see whether this kind of technical leadership converts into swift mass-market adoption or not.
Cloud Computing
Conference & Expo
November 2-4, 2009 NYC
Register Today and SAVE !..

2008 West
DIAMOND SPONSOR:
Data Direct
SOA, WOA and Cloud Computing: The New Frontier for Data Services
PLATINUM SPONSORS:
Red Hat
The Opening of Virtualization
GOLD SPONSORS:
Appsense
User Environment Management – The Third Layer of the Desktop
Cordys
Cloud Computing for Business Agility
EMC
CMIS: A Multi-Vendor Proposal for a Service-Based Content Management Interoperability Standard
Freedom OSS
Practical SOA” Max Yankelevich
Intel
Architecting an Enterprise Service Router (ESR) – A Cost-Effective Way to Scale SOA Across the Enterprise
Sensedia
Return on Assests: Bringing Visibility to your SOA Strategy
Symantec
Managing Hybrid Endpoint Environments
VMWare
Game-Changing Technology for Enterprise Clouds and Applications
Click For 2008 West
Event Webcasts

2008 West
PLATINUM SPONSORS:
Appcelerator
Get ‘Rich’ Quick: Rapid Prototyping for RIA with ZERO Server Code
Keynote Systems
Designing for and Managing Performance in the New Frontier of Rich Internet Applications
GOLD SPONSORS:
ICEsoft
How Can AJAX Improve Homeland Security?
Isomorphic
Beyond Widgets: What a RIA Platform Should Offer
Oracle
REAs: Rich Enterprise Applications
Click For 2008 Event Webcasts
In many cases, the end of the year gives you time to step back and take stock of the last 12 months. This is when many of us take a hard look at what worked and what did not, complete performance reviews, and formulate plans for the coming year. For me, it is all of those things plus a time when I u...
SYS-CON.TV
Unlocking Microsoft Office Documents
An open source alternative

If you've ever written software to be used by business managers, you will no doubt have received requests for interoperability with the Microsoft Office Applications. "Get me the report in Excel; HTML doesn't cut it and I need to run my own analysis on it"; "Can you index the zillion word documents I have so that the whole organization can search on them?"; "I have all this data in Excel; do I have to enter it again on this Web page?".... These are things we commonly hear as application developers, which is not surprising given the ubiquity of MS Office.

Does this mean you're forced to tie your application to Windows to interface with the COM APIs of Excel or Word? Apart from the fact that you don't want your language or platform decision to be constrained by a lack of choice, it's also important to note that these APIs can be unstable because they're automating a desktop application. Because of this, they are unreliable for any server-side deployment. For the Java developer, however, the power of Jakarta POI is close at hand.

POI is a pure Java application library for reading and writing the Microsoft OLE2 Compound Document Format (OLE2CDF) file formats. This format is used by (among others) various MS Office applications. As the name suggests, this is a format for storing multiple documents (or streams) in one file, for example, storing an embedded spreadsheet along with a presentation. Within this structure are stored the records that contain the application-specific data.

POI is structured along these lines. At its base it has a component known as the POIFS or the POI File System, which is the most complete implementation of the OLE2CDF structure in Java. Layered above this are the components to read the Excel record structures (HSSF) or the Word record structures (HWPF).

HSSF
HSSF is the component of POI that allows you to read, write, and manipulate Excel spreadsheets from pure Java applications. It consists of code that understands the Excel record formats, and wraps them up in an easy-to-use API.

How easy does HSSF make reading Excel files? See for yourself!


  InputStream in = new FileInputStream("data.xls"));
  HSSFWorkbook wb     = new HSSFWorkbook(in);
  HSSFSheet sheet = wb.getSheetAt(0);      // the 1st sheet
  HSSFRow row     = sheet.getRow(1);       // get the 2rd row
  HSSFCell cell   = row.getCell((short)1); // the 2nd cell of the 2nd row

The model of an Excel document in HSSF begins with the HSSFWorkbook object. This object provides access to the sheets (by name or number), which in turn provides access to the rows (HSSFRow) in the sheet. Each row provides access to the individual cells (HSSFCell) it contains.

From the cell object you can retrieve data contained in that cell via accessor methods, depending on the type of data. Listing 1 provides an example.

Given this object model, writing is equally simple. Instead of "get"-ing rows and columns, you "create" them and then "set" the values in the cells as in Listing 2. Once again, start with the HSSFWorkbook class, whose default constructor provides a new workbook object; then populate the workbook by creating a sheet in which you create rows. In each row create the cells you need. Finally, populate the cells with the data. As Listing 2 shows, a cell can contain integers, floats, strings, and dates.

Styles
All that is fine, but plain data is usually not sufficient to keep your users happy. HSSF therefore has a whole range of features designed to let you use a variety of styles and formats that Excel supports.

To start applying styles to cells, first create an instance of an HSSFStyle class:

HSSFStyle myStyle = wb.createCellStyle() // wb is an HSSFWorkbook object

The style object will now provide you with methods to set various style parameters, such as foreground and background colors, fonts, borders, and data formats, via conventionally named setters.

Data Formats
A key component of a cell's style is its data format. This specifies, for example, the number of decimal places in a number, or the format of a date. The data format is set using the setDataFormat method of HSSFStyle. This method takes an integer, which is an index to a format, since Excel keeps a list of indexed built-in formats (and user-defined formats are appended to this list and indexed in a similar fashion).

It's easy to get the index, however. For a built-in format, use the static getBuiltinFormat method in the HSSFDataFormat class. Give it the format string and it will return the correct index, the proper index for you. To set a format:

myStyle.setDataFormat(HSSFDataFormat.getBuiltinFormat("d-mmm-yy");

For a user-defined format, first get an instance of HSSFDataFormat from an HSSFWorkbook object to ensure that your format is registered with the workbook:


  HSSFDataFormat df = wb.createDataFormat();
  myStyle.setDataFormat(df.getFormat("dd%MMM%yyyy"));

If you don't want to worry about which formats are user defined (it's documented in the Javadocs for HSSFDataFormat), simply use the nonstatic method and it will take care of this issue internally.

When you have defined the style you want, just set it to the cell:

cell.setCellStyle(myStyle);

Reuse the same style object for cells that are similarly formatted - do not create new style objects for each cell, since Excel has an upper limit on the number of styles that can be referenced in a workbook. For example, you could create one style object for the table headers, one for the body, and one for the footer and use them throughout your spreadsheet.

Formulas
Probably one of the most important features of HSSF is the ability to populate cells with formulas. This allows you to create dynamic spreadsheets and facilitate the user's ability to change the data and perform her own analysis (which is indeed the power of spreadsheets, and the number one reason why you would want to output Excel files).

Formulas are created using the setCellFormula method of an HSSFCell object. The input to this method is a string containing the formula you want at that cell. It should be in the same format that you would type into the edit box in Excel (without a leading "="), thus:

cell.setCellFormula("A1+A2^2");

You could use any built-in VBA function, or even a user-defined function, in the formulas:


  Cell.setCellFormula("average(A1:B1)");
  cell.setCellFormula("mySpecialFunction(A1/A2)");

If you need to provide your users with the ability to copy-paste or drag an Excel formula in the resultant sheet correctly, you might want to use absolute references instead of relative. If formulas with relative cell references (the default, e.g., A1) are copied from one cell and pasted to another, the cell references in the formulas change relative to the destination cell.

cell.setCellFormula("A1/$A$25");

However, if the formula contains references that are absolute, they stay the same irrespective of the destination cell. Absolute references are specified by adding a $ symbol to the reference, viz. $A$1. Note that the row and the column can be individually addressed while specifying absolute references, viz. A$1 vs $A1.

You can also reference other sheets in the same workbook in the formula. HSSF does not yet support the ability to write formulas referencing external workbook files.


  Cell.setCellFormula("SUM(Sheet1!A1-Sheet1!A2)");
  // formula in cell A1 of Sheet2.

Note, however, that the formula results are not calculated by HSSF, which is really a file format reader and writer, not a functional replacement for a spreadsheet application. The formula is merely written into the file in the proper format and evaluated when the file is opened in Excel.

Finally
Among other advanced features, HSSF allows you to create merged cell regions. You can also set headers and footers for sheets, as well as set print areas, to ensure the data prints well. You can create split and freeze panes, set zoom options, or enable sheet protection. Additionally, you can create and manipulate named ranges. Later versions (see sidebar - A Guide to POI Versions) also let you programmatically create drawings in sheets.

However, there are always features of an Excel file that POI does not yet support. In such scenarios, templates are invaluable. The idea is to create an empty Excel spreadsheet populated with the attributes that POI doesn't support. You could, for example, create a chart in the spreadsheet referencing named ranges, or create a pivot table in a certain area. At runtime, in Java code, you could read the workbook in with POI and fill in the cells with data from your application. Now when the user opens the workbook in Excel, it comes loaded with data, charts, and pivot tables. Listing 3 provides an example.

Hopefully this overview of HSSF has convinced you that HSSF has almost all it takes to create professionally produced Excel spreadsheets that'll be a joy to your users, and leave them asking for more.

Word Documents with HWPF
The HWPF (Horrible Word Processing Format) component of POI is a Java library for reading and writing Word documents. It's still in early beta but is relatively stable and it is the only open source Java solution we know of for programmatically accessing and/or creating a Word document.

I am going to give a short introduction to the high-level structure of a Word document. These are basic concepts that can be applied to most styled document formats and they will make later sections of this article easier to digest.

A Word document can be modeled as a tree-like structure. Figure 1 illustrates this. The document has sections, a section has paragraphs, and a paragraph has character runs. Each instance of these is associated with a range of text.

  • A section can be correlated with a chapter in a book. A section contains obscure properties like the page border and the number of columns.
  • A paragraph follows the traditional definition of a paragraph. It contains more familiar properties that most Microsoft Word users know. The justification (left, center, right) and the indent setting are good examples.
  • A character run is a consecutive run of characters that share the same formatting. These contain the most common and visible properties. Some examples are font family, font size, bold, italic, and underline.
This provides you with enough information to use Java to read and manipulate this model. To get started, we have to create an HWPFDocument object from a physical Word file.


1 FileInputStream in =
2   new FileInputStream("C:\\test.doc");
3 HWPFDocument doc =
4   new HWPFDocument(in);

The Section, Paragraph, and CharacterRun classes represent the document tree that I explained earlier. I walk that tree in Listing 4.

First, I get the Range object for the entire document. This is the entry point to the object model. The Range class is an important piece of the HWPF API. It represents an arbitrary range of text in the document, with one to many sections, paragraphs, and character runs. The Section, Paragraph, and CharacterRun classes extend the Range class.

The methods numSections(), numParagraphs(), and numCharacterRuns() and the correlating getters are actually implemented in the Range class. Of course, if you call numSections() on a Paragraph object, it will return one. That would be the parent Section of that Paragraph object.

Another important method in the Range class is text(). This can be used to get the plain text for a particular range. To get the text for a document, use the following code:

String plainText = doc.getRange().text();

Once we have an instance of a Section, Paragraph, or CharacterRun object, we can read its properties by calling its various getters.


//Check the number of columns
//for this section
Section sect = r.getSection(x);
sect.getNumColumns();

//See if a paragraph is set to
//have a page break before it.
Paragraph par = sect.getParagraph(y);
boolean breakBefore = par.pageBreakBefore()

//Get the font name of a
//character run
CharacterRun run =
  par.getCharacterRun(z);
String font = run.getFontName();

These are quick examples. There are dozens of settings and there isn't enough space to cover them all. I encourage you to read the Javadoc to see what is possible.

Tables
Behind the scenes, tables are just a group of paragraphs with certain flags set. HWPF attempts to hide the juicy details but it still needs a little help (see Listing 5).

Listing 5 touches every paragraph in the document, looking for one with the table flag set. When it finds one, it passes it to the getTable method on line 8. Notice on line 12 that it's necessary to increment x so that the paragraphs that were part of the table aren't processed again.

Tables have TableRows, which in turn have TableCells. All these classes extend Range so you can use all the methods that I've already talked about for getting the contents of these entities.

Lists
Unlike tables, lists don't have a beginning and an end, because entries in a list can be inserted anywhere in the document and the list numbering can pick up wherever it left off. The ListEntry class is used to represent an entry, and it extends the Paragraph class. Look at how I get a list entry in the following example:


1 for (int x = 0; x < numPars; x++)
2 {
3   Paragraph par =
4     range.getParagraph(x);
5
6   if (par instanceof ListEntry))
7   {
8     ListEntry entry = (ListEntry)par;
9
10    //do something with the entry...
11  }
12 }

Adding New Content
There may be a time when you want to generate new Word documents or modify an existing document using Java. My first word of advice is to make sure that this is absolutely necessary. In most cases, a nonproprietary file format such as PDF, RTF, or HTML is the better choice. There are free libraries available for all of these. In the cases of RTF and HTML, the standard JDK provides the javax.swing.text package to manipulate the file formats. A rule of thumb for creating Word documents is: Will the eventual recipients of these documents want to edit them? If not, the PDF or HTML format is a better choice. If they do wish to edit them, consider using RTF instead of the Word file format.

The writing functionality of HWPF is somewhat experimental so expect some bugs and limited features. Modifying an existing document or creating a new Word document from scratch starts the same way - simply create a new HWPFDocument as shown in an earlier example. The only difference is that if you want to create one from scratch, you start with a blank document. The POI distribution comes with one called "blank.doc."

To commit any changes to a physical file and see what they do, you must write out the modified document. The following code writes out a Word document that contains any changes made to the original object model.


FileOutputStream docOut =
  new FileOutputStream(
     "C:\\testout.doc");
doc.write(docOut);

To be safe, I wouldn't recommend overwriting the original document. HWPF attempts to keep things that it doesn't directly support in the file, but this doesn't guarantee that they will be there when it writes the file out again.

The Section, Paragraph, and CharacterRun classes define setters that allow the various properties of existing content to be changed. The Range class defines the following methods for adding text and paragraphs to a document.

  • insertBefore(String text): Inserts a string into the document at the beginning of the Range. Assumes the properties of the character run at the beginning of this range.
  • insertAfter(String text): Inserts a string into the document at the end of the Range. Assumes the properties of the character run at the end of this range.
  • insertBefore(String text, Character-Properties props): Inserts a string into the beginning of the Range with the properties given by props.
  • insertAfter(String text, Character-Properties props): Inserts a string into the end of the Range with the properties given by props.
  • InsertBefore (ParagraphProperties props, int styleIndex): Inserts a new empty paragraph at the beginning of this Range. Based on the style at index styleIndex in the stylesheet.
  • InsertAfter (ParagraphProperties props, int styleIndex): Inserts a new empty paragraph at the beginning of this Range. Based on the style at index styleIndex in the stylesheet.
All of the insert methods return the Range that the insertion is now a part of. For example, when inserting a paragraph using insertAfter(ParagraphProperties props, int styleIndex), a Paragraph object is returned. Since Paragraph extends Range, all of the above methods can be used to fill this paragraph with text. The ParagraphProperties and CharacterProperties are similar to the Paragraph and CharacterRun classes. The difference is that classes ending with "Properties" are not associated with a location in a document. There are also SectionProperties and TableProperties.

The methods that insert a paragraph require a style index. Paragraphs and character runs store their settings as deltas from a style stored in the stylesheet. Styles provide a convenient way to maintain a consistent look and feel in a document. They also help a person creating a Word document through a user interface to be more efficient. To a programmer this may not matter. No matter what the style is, whatever properties are set for a particular Paragraph or CharacterRun object will appear in the document. I recommend just using the number 0 for a style index. This will always refer to the "Normal" style in the stylesheet.

Editing Tables
Because of the complexity, the range class does not currently define methods for inserting tables. However, the TableCell class extends Range, so all of the insert methods defined in Range can be used to add content to the individual table cells of an existing Table.

Adding Lists
Adding a list is a little tricky. Unlike most objects in the document, a list is not associated with a range of text. There are paragraphs that are associated with a list and these paragraphs are actual entries in a list. Before an entry can be added to a document, a list must be created. The following code creates a list.


1 HWPFList list = new HWPFList(true,
2   doc.getStyleSheet());
3
4 int listID = doc.registerList(list);

The HWPFList constructor takes two arguments. The first one is a boolean determining whether the list should be bulleted (if the argument is false, the list will be numbered), and the second is the stylesheet of the document to which the list will belong. The register List method that I call on the method on line 4 is defined in HWPFDocument. It returns a unique ID that's needed when adding a list entry to the document.

The Range class defines more insert methods for adding list entries.

  • insertBefore(ParagraphProperties props, int listID, int level, int style-Index)
  • insertAfter(ParagraphProperties props, int listID, int level, int styleIndex)
What is different from the normal paragraph insert is that both of the above methods require the list ID and the level. The level argument refers to the indent level of the list. At this point, the level argument is ignored because HWPF only supports writing simple, one-level lists. Figure 2 shows a screenshot of the Word document created using the code in Listing 6.

Summary
POI has its weaknesses. The biggest by far is the memory consumption in the Excel component (HSSF). The POI team has recognized this problem and is trying to address it in a coming release. The Word component's (HWPF's) biggest problem is that it isn't very mature. Right now it only provides very limited functionality. Even the Excel side of POI could use improvement on its support of some key Excel features, such as charting and images

If POI doesn't cut it, there is a wide selection of commercial libraries for working with Excel, such as SoftArtisans OfficeWriter. SoftArtisans (www.softartisans.com) is the only vendor I could find that also offers a product that can create Word documents in pure Java. OfficeWriter also supports every feature of Word and Excel.

With the new agreement between Sun and Microsoft, we may one day see the opening of the Microsoft file formats. While you wait for this day to come, POI provides a free open source alternative.

References

  • Apache POI: jakarta.apache.org/poi
  • WinCVS: www.wincvs.org
  • SoftArtisans OfficeWriter: officewriter.softartisans.com/officewriter-240.aspx

    SIDEBAR

    A Guide to POI Versions
    As an open source project, POI's development is carried out in a public repository by a group of volunteers. As a result, the code is quite dynamic, and this guide will help you navigate the multiple versions you'll find in the wild. In general, note that releases with beta, dev, or RC attached to their names are flagged as development releases, while releases without these postfixes are flagged as production releases.

    The 1.5.1 version released early 2002 was the preferred production version for a long time. But after a long series of new features, followed by a longer period of bugfixes and stabilization, the 2.0 version was released in January 2004.

    Subsequently, the 2.5 version was released in late February 2004 to incorporate a major new piece of functionality - the ability to create drawings in Excel sheets via what is known as the Escher Layer.

    Meanwhile, development had been ongoing in an experimental branch to enable the reading and writing of Word documents (HWPF). Unfortunately, it's necessary to download this piece of POI directly from CVS and compile it yourself. There are many excellent and free client applications for accessing CVS repositories such as WinCVS and jCVS.

    SIDEBAR 2

    Getting Started
    Getting started with POI couldn't be easier. Download the version you want from www.apache.org/dyn/closer.cgi/jakarta/poi/ as a zip or tar.gz archive. From the archive extract poi-<version>-<date>.jar. Add this file to your classpath and you should be set. POI has an optional dependency on log4j, but that's needed only if you turn on logging (which is disabled by default).

  • About Ryan Ackley
    Ryan Ackley is a developer on the Apache POI project. He works as a senior programmer for Space Gateway Support at the Kennedy Space Center, FL. He is also a Master’s candidate for computer engineering at the University of Central Florida.

    About Avik Sengupta
    Avik Sengupta is a committer on the Jakarta POI project, and is chief technology officer at Itellix Software Solutions.

    In order to post a comment you need to be registered and logged in.

    Register | Sign-in

    Reader Feedback: Page 1 of 1

    Hi,
    I want read an images from MS word .doc file and place in a specified position in same format and place.
    is it possible ?

    Please reply to this..........

    i dont know how to get POI.jar(which contains HWPF Class ) please help me to download it)

    Nice one Ryan & Avik, I''m sure this will generate a great deal more interest in Apache POI


    Your Feedback
    jettimadhu wrote: Hi, I want read an images from MS word .doc file and place in a specified position in same format and place. is it possible ? Please reply to this..........
    ponnuchmay wrote: i dont know how to get POI.jar(which contains HWPF Class ) please help me to download it)
    Piers wrote: Nice one Ryan & Avik, I''m sure this will generate a great deal more interest in Apache POI
    SOA World Latest Stories
    Quick Response (QR) codes are intended to help direct users quickly and easily to information about products and services, but they are also starting to be used for social engineering exploits. This article looks at the emergence of QR scan scams and the rising concern for users today....
    The Chinese company that claims it owns the iPad trademark says it plans to seek a ban on iPad exports out of China, threatening global supplies. According to what a lawyer for Proview Technology (Shenzhen) Co Ltd told Reuters, the firm is petitioning Chinese customs to stop shipment...
    Cisco Wednesday filed suit in the European Union’s second-highest court, the General Court in Luxembourg, challenging the European Commission’s rubber stamp last October of Microsoft’s $8.5 billion acquisition of Skype. Cisco says it isn’t opposed to the merger, but figures the EC sh...
    2011 was a year of rapid adoption for public and private cloud services. Instant and on-demand server provisioning was the driving force behind the massive growth. On top, cloud server templates and script automation simplified application installation for simple and pre-defined applic...
    As more enterprises are adopting clouds, the nature of cloud computing is changing. Previously, clouds were used to test applications or for non-mission critical applications. Today, enterprises are using clouds for cost-saving advantages and launching more mission critical application...
    Building a cloud computing environment with on-demand access to compute, network, and storage resources requires an elastic infrastructure at multiple levels. Virtualization combined with x86 servers has transformed the way we scale out compute resources. Unfortunately, legacy Fibre Ch...
    Subscribe to the World's Most Powerful Newsletters
    Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
    Click to Add our RSS Feeds to the Service of Your Choice:
    Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
    myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
    Publish Your Article! Please send it to editorial(at)sys-con.com!

    Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


    SYS-CON Featured Whitepapers
    ADS BY GOOGLE