Package de.schlichtherle.io

Start here: Provides transparent read/write access to archive files and their entries as if they were (virtual) directories and files.

Interface Summary
ArchiveDetector Detects archive files solely by scanning file paths - usually by testing for file name suffixes such as .zip.
ArchiveStatistics A proxy interface which encapsulates statistics about the total set of archives operated by this package.
FileFactory This interface is not intended for public use!

Class Summary
AbstractArchiveDetector Implements the FileFactory part of the ArchiveDetector interface.
ArchiveEntryMetaData This class is not intended for public use!
DefaultArchiveDetector An ArchiveDetector which matches file paths against a pattern of archive file suffixes in order to detect prospective archive files and look up their corresponding ArchiveDriver in its registry.
File A drop-in replacement for its superclass which provides transparent read/write access to archive files and their entries as if they were (virtual) directories and files.
FileInputStream A drop-in replacement for FileInputStream which provides transparent read access to archive entries as if they were (virtual) files.
FileOutputStream A drop-in replacement for FileOutputStream which provides transparent write access to archive entries as if they were (virtual) files.
FileReader A drop-in replacement for FileReader which provides transparent read access to archive entries as if they were (virtual) files.
FileWriter A drop-in replacement for FileWriter which provides transparent write access to archive entries as if they were (virtual) files.
InputArchiveMetaData This class is not intended for public use!
OutputArchiveMetaData This class is not intended for public use!
RaesFiles Saves and restores the contents of arbitrary files to and from the RAES file format for encryption and decryption.
RaesFileUtils Deprecated. Use the base class instead.

Exception Summary
ArchiveBusyException Thrown if an archive file could not get updated because some input or output streams for its entries are still open.
ArchiveBusyWarningException Thrown if an archive file has been successfully updated, but some input or output streams for its entries have been forced to close.
ArchiveEntryStreamClosedException Thrown if an input or output stream for an archive entry has been forced to close when the archive file was (explicitly or implicitly) unmounted.
ArchiveException Represents a chain of exceptions thrown by the File.umount() and File.update() methods to indicate an error condition which does incur loss of data.
ArchiveInputBusyException Like its super class, but indicates the existence of open input streams.
ArchiveInputBusyWarningException Like its super class, but indicates the existence of open input streams.
ArchiveOutputBusyException Like its super class, but indicates the existence of open output streams.
ArchiveOutputBusyWarningException Like its super class, but indicates the existence of open output streams.
ArchiveWarningException Represents a chain of exceptions thrown by the File.umount() and File.update() methods to indicate an error condition which does not incur loss of data and may be ignored.
ChainableIOException Represents a chain of IOExceptions.
ContainsFileException Thrown if two paths refer to the same file or contain each other.
FileBusyException Thrown if an archive entry cannot get accessed because either (a) the client application is trying to input or output to the same archive file concurrently and the respective archive driver does not support this, or (b) the archive file needs an implicit unmount which cannot get performed because the client application is still using some other open streams for the same archive file.
InputIOException Thrown if an IOException happened on the input side rather than the output side when copying an InputStream to an OutputStream.

Package de.schlichtherle.io Description

Start here: Provides transparent read/write access to archive files and their entries as if they were (virtual) directories and files. Archive files may be arbitrarily nested and the nesting level is only limited by heap and file system size.


Contents

  1. Basic Operations
  2. Atomicity of File System Operations
  3. Updating Archive Files
  4. Miscellaneous

Basic Operations

In order to create a new archive file, the client application can simply use File.mkdir().

In order to delete it, File.delete() can be used. Similar to a regular directory, this is only possible if the archive file is empty. Alternatively, the client application could use File.deleteAll() in order to delete the virtual directory in one go, regardless of its contents.
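
For instance, a minimal sketch (assuming "archive.zip" is recognized as an archive file by the default ArchiveDetector):

File archive = new File("archive.zip"); // de.schlichtherle.io.File
archive.mkdir();     // creates an empty archive file
// ... create some entries ...
archive.deleteAll(); // deletes the virtual directory, regardless of contents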

To read an archive entry, the client application can simply create a FileInputStream or a FileReader with the path or a File instance as its constructor parameter. Note that you cannot create a FileInputStream or a FileReader to read an archive file itself (unless it's a false positive, i.e. a regular file or directory with an archive file suffix).

Likewise, to write an archive entry, the client application can simply create a FileOutputStream or a FileWriter with the path or a File instance as its constructor parameter. Note that you cannot create a FileOutputStream or a FileWriter to write an archive file itself (unless it's a false positive, i.e. a regular file or directory with an archive file suffix).
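
For example, a sketch which reads a text entry line by line (the entry name is made up for illustration):

BufferedReader in = new BufferedReader(
        new FileReader("archive.zip/readme.txt")); // de.schlichtherle.io.FileReader
try {
    for (String line; (line = in.readLine()) != null; )
        System.out.println(line);
} finally {
    in.close(); // ALWAYS close the stream!
}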

If the client application just needs to copy data, however, using one of the copy methods in the File class is highly recommended over using File(In|Out)putStream directly. These methods use asynchronous I/O (though they return synchronously), pooled big buffers, pooled threads (on JSE 5 and later), and do not need to decompress/recompress archive entry data when copying from one archive file to another for supported archive types. In addition, they are guaranteed to fail gracefully and to close their streams even if an IOException occurs, which many Java applications fail to do.
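
As a sketch, copying a file into an archive could then look like this; it assumes the copyTo(java.io.File) method of the File class (please check the File class Javadoc for the complete set of copy methods):

File src = new File("plain.txt");
File dst = new File("archive.zip/plain.txt");
if (!src.copyTo(dst)) // guaranteed to close its streams, even on failure
    System.err.println("Copy failed!");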

Note that there is no equivalent to java.io.RandomAccessFile in this package because it's impossible to seek within compressed archive entry data.

Using Archive Entry Streams

When using streams, the client application should always close them in a finally-block like this:

FileOutputStream out = new FileOutputStream(file);
try {
    // Do I/O here...
} finally {
    out.close(); // ALWAYS close the stream!
}

This ensures that the stream is closed even if an exception occurs.

Note that for various (mostly archive driver specific) reasons, the close() method may throw an IOException, too. The client application needs to deal with this appropriately, for example by enclosing the entire block with another try-catch-block like this:

try {
    FileOutputStream out = new FileOutputStream(file);
    try {
        // Do I/O here...
    } finally {
        out.close(); // ALWAYS close the stream!
    }
} catch (IOException ex) {
    ex.printStackTrace();
}

This idiom is not at all specific to TrueZIP: Streams often utilize OS resources such as file descriptors or database or network connections. All OS resources are limited, however, and sometimes they are even exclusively allocated for a stream, so a stream should always be closed again as soon as possible, especially in long running server applications. Relying on finalize() to do this during garbage collection is unsafe. Unfortunately, many Java applications and libraries fail in this respect.

TrueZIP is affected by open archive entry streams in the following ways:

  1. File.umount() and File.update() cannot update an archive file while input or output streams for its entries are open: Depending on their parameters, they either throw an ArchiveBusyException or force the streams to close and throw an ArchiveBusyWarningException (see the Exception Summary above).
  2. Once an entry stream has been forced to close, any subsequent access throws an ArchiveEntryStreamClosedException.

In order to prevent these exceptions, TrueZIP automatically closes entry streams when they are garbage collected. However, the client application should never rely on this because the delay and order in which streams are processed by the finalizer thread is not specified and any unwritten data gets lost in output streams.


Atomicity of File System Operations

In general, a file system operation is either atomic or not. In its strict sense, an atomic operation meets the following conditions:

  1. The operation either completely succeeds or completely fails. If it fails, the state of the file system is not changed.
  2. Third parties can't monitor or influence the changes as they are in progress. They can only see the result.

All reliable file system implementations meet the first condition and so does TrueZIP. However, the situation is different for the second condition: Since TrueZIP maintains the state of the virtual file system on the heap and in temporary files (see "Updating Archive Files" below), third parties which bypass this state (see "Third Party Access" below) can monitor or even influence changes while they are in progress.

This implies that TrueZIP cannot provide any operations which are atomic in the strict sense. However, many file system operations in this package are declared to be virtually atomic according to their Javadoc. A virtually atomic operation meets the following conditions:

  1. The operation either completely succeeds or completely fails. If it fails, the state of the (virtual) file system is not changed.
  2. If the path does not contain any archive files, the operation is always delegated to the real file system and third parties can't monitor or influence the changes as they are in progress. They can only see the result.
  3. Otherwise, all File instances which recognize the same set of archive files in the path and share the same definition of classes in this package can't monitor or influence the changes as they are in progress. They can only see the result.

These conditions apply regardless of whether the File instances are used by different threads or not. In other words, TrueZIP is as thread-safe as you could expect from a real file system.


Updating Archive Files

To provide random read/write access to archive files, TrueZIP needs to associate some state with every recognized archive file on the heap and in the folder for temporary files while the client application is operating on the VFS.

TrueZIP automatically mounts the VFS from an archive file on the first access. The client application can then operate on the VFS in an arbitrary manner. Finally, an archive file must get unmounted in order to update it with the cumulated modifications. Note that an archive entry gets modified by any operation which creates, modifies or deletes it.
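
A minimal sketch of this life cycle:

File entry = new File("archive.zip/readme.txt");
boolean exists = entry.exists(); // first access mounts the VFS from the archive file
// ... operate on the VFS in an arbitrary manner ...
File.umount(); // updates the archive file with the cumulated modifications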

Explicit vs. Implicit Unmounting

Archive file unmounting is performed semi-automatically:

  1. Explicit unmounting happens whenever the client application calls File.umount() or File.update().
  2. Implicit unmounting happens whenever TrueZIP needs it, e.g. when an archive file must get remounted to avoid writing duplicated entries (see "Performance Considerations" below) and, at the latest, in TrueZIP's JVM shutdown hook.

Explicit unmounting is required to support third-party access to an archive file (see below) or to monitor progress (see below). It also allows some control over any exceptions thrown: Both umount() and update() may throw an ArchiveWarningException or an ArchiveException. The client application may catch these exceptions and act on them individually (see below).

However, calling umount() or update() too often may increase the overall runtime: On each call, all remaining entries in the archive file are copied to the archive file again if the archive file already existed. If the client application explicitly unmounts the archive file after each modification, this may lead to an overall runtime of O(s*s), where s is the size of the archive file in bytes (see below).

In comparison, implicit unmounting provides the best performance because archive files are only updated if there's really a need to. It also works reliably: The JVM shutdown hook is always run unless the JVM crashes (note that an uncaught throwable terminates the JVM, but does not crash it - a JVM crash is an extremely rare situation which indicates a bug in the JVM implementation, not a bug in the JRE or the application). Furthermore, it obviates the need to introduce a call to umount() or update() in legacy applications.

The disadvantage is that the client application cannot easily detect and deal with any exceptions thrown as a result of updating an archive file: Depending on where the implicit unmount happens, either an arbitrary IOException is thrown, a boolean value is returned, or - when called from the JVM shutdown hook - just a stack trace is printed. In addition, updating an existing archive file takes linear runtime (see below). However, using long running JVM shutdown hooks is generally discouraged: They can't use java.util.logging, they can't use a GUI to monitor progress (see below) and they can only get debugged on JSE 5 or later.

Third Party Access

Because TrueZIP associates some state with any archive file which is read and/or write accessed by the client application, it requires exclusive access to these archive files until they get unmounted again.

Third parties must not concurrently access these archive files or their entries unless the precautions outlined below have been taken!
In this context, third parties are:
  1. Instances of the class java.io.File which are not instances of the class de.schlichtherle.io.File.
  2. Instances of the class de.schlichtherle.io.File which do not recognize the same set of archive files in the path due to the use of a differently working ArchiveDetector.
  3. Other definitions of the classes in this package which have been loaded by different class loaders.
  4. Other system processes.

As a rule of thumb, the same archive file or entry within an archive file should not be accessed by different File classes (java.io.File versus de.schlichtherle.io.File) or File instances with different ArchiveDetector parameters. This ensures that the state associated with an archive file is not shadowed or bypassed.

To ensure that all File instances recognize the same set of archive files in a path, it's recommended not to use constructors or methods of the File class with explicit ArchiveDetector parameters unless there is good reason to.

To ensure that all File instances share the same definition of classes in this package, it's recommended to add TrueZIP's JAR to the boot class path or the extension class path.

If the prerequisites for these recommendations don't apply or if the recommendations can't be followed, the client application may call File.umount() (File.update() will not work) to perform an explicit unmount. This clears all state information so that the third party can then safely access any archive file. In addition, the client application must make sure not to access the same archive file or any of its entries in any way while the third party is still accessing it.
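
For example, a sketch of a safe hand-over to a third party:

File.umount(); // clears all state information for all archive files
// Now a third party, e.g. a plain java.io.File instance, can safely
// access the archive file - as an ordinary file, not a virtual directory.
java.io.File plain = new java.io.File("archive.zip");
System.out.println(plain.length() + " bytes");
// Do not access "archive.zip" or its entries until the third party is done!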

Failure to comply with these guidelines may result in unpredictable behavior and may even cause loss of data!

Exception Handling

umount() and update() are guaranteed to process all archive files which are in use or have been touched by the client application. However, processing some of these archive files may fail for a number of I/O related reasons. Hence, during processing, a sequential chain of archive exceptions is constructed and thrown upon termination unless it's empty. Note that sequential exception chaining is a concept which is completely orthogonal to Java's general exception cause chaining: In a sequential archive exception chain, each archive exception may still have a chain of other exceptions as its cause (most likely IOExceptions).

Archive exceptions fall into two categories:

  1. The class ArchiveWarningException is the root of all warning exception types. These exceptions are thrown if an archive file has been completely updated, but some warning conditions apply. No data has been lost.
  2. Its super class ArchiveException is the root of all other exception types (unless it's an ArchiveWarningException again). These exceptions are thrown if an archive file could not get updated completely. This implies loss of some or all data in the respective archive file.

Note that the effect which is indicated by an archive exception is local: An exception thrown when processing an archive file does not imply an archive exception or loss of data when processing another archive file.

When the archive exception chain is thrown by umount() or update(), it's first sorted according to (1) descending order of priority and (2) ascending order of appearance, and the resulting head exception is then thrown. Since ArchiveWarningExceptions have a lower priority than ArchiveExceptions, they are always pushed back to the end of the chain, so that an application can use the following simple idiom to detect whether only some warnings or at least one severe error has occurred:

try {
    File.umount(); // with or without parameters
} catch (ArchiveWarningException oops) {
    // Only instances of the class ArchiveWarningException exist in
    // the sequential chain of exceptions. We decide to ignore this.
} catch (ArchiveException ouch) {
    // At least one exception occured which is not just an
    // ArchiveWarningException. This is a severe situation that
    // needs to be handled.

    // Print the sequential chain of exceptions in order of
    // descending priority and ascending appearance.
    //ouch.printStackTrace();

    // Print the sequential chain of exceptions in order of
    // appearance instead.
    ouch.sortAppearance().printStackTrace();
}
Note that the Throwable.getMessage() method (and hence Throwable.printStackTrace()) will concatenate the detail messages of the exceptions in the sequential chain in the given order.

Performance Considerations

Unmounting a modified archive file is a linear runtime operation: If the size of the resulting archive file is s bytes, the operation always completes in O(s), even if only a single, small archive entry has been modified within a very large archive file. Unmounting an unmodified or newly created archive file is a constant runtime operation: It always completes in O(1). These magnitudes are independent of whether unmounting was performed explicitly or implicitly.

Now if the client application modifies each entry in a loop and accidentally triggers unmounting the archive file on each iteration, then the overall runtime increases to O(s*s)! Here's an example:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // O(1)
    File.umount(); // O(i + 1) !!
}
// Overall: O(n*n) !!!

The quadratic runtime is caused by calling umount() within the loop. Moving it out of the loop fixes the issue:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // O(1)
}
File.umount(); // new file: O(1); modified: O(n)
// Overall: O(n)

In essence: The client application should never call umount() or update() in a loop which modifies an archive file.

The situation gets more complicated with implicit remounting: If an entry which has already been modified is to be modified again, TrueZIP implicitly remounts the archive file in order to avoid writing duplicated entries to it (which would waste space and may even confuse other utilities). Here's an example:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // First modification: O(1)
    entry.createNewFile(); // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!

Each call to createNewFile() is a modification operation. Hence, on the second call to this method, TrueZIP needs to do an implicit remount which writes all entries in the archive file created so far to disk again.

Unfortunately, a modification operation is not always so easy to spot. Consider the following example to create an archive file with empty entries which all share the same last modification time:

long time = System.currentTimeMillis();
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // First modification: O(1)
    entry.setLastModified(time); // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!

When setLastModified() gets called, the entry has already been written and so an implicit remount is triggered, which writes all entries in the archive file created so far to disk again.

Detail: This deficiency is caused by archive file formats: All currently supported archive types require an entry's meta data (including the last modification time) to be written before its content. So if the meta data is to be modified, the archive entry and hence the whole archive file needs to get rewritten, which is what the implicit remount is doing.

To avoid accidental remounting when copying data, you should consider using the advanced copy methods instead. These methods are easy to use, work reliably and provide superior performance.
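
For example, the setLastModified() loop above could be avoided entirely by copying existing files into the archive with a single operation per entry; this sketch assumes the static File.cp_p(src, dst) method, which copies the data and preserves the last modification time in one go:

File src = new File("plain.txt"); // assumed to exist in the real file system
File dst = new File("archive.zip/plain.txt");
try {
    File.cp_p(src, dst); // one modification: data and last modification time
} catch (IOException ex) {
    ex.printStackTrace();
}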

Monitoring Progress

When unmounting, the client application can monitor the progress from another thread using File.getLiveArchiveStatistics(). The returned instance is a proxy which returns live statistics about the updating process.

Here's an example how to monitor unmounting progress on standard error output after an initial delay of two seconds:

import java.text.MessageFormat;

class ProgressMonitor extends Thread {
    Long[] args = new Long[2];
    ArchiveStatistics liveStats = File.getLiveArchiveStatistics();

    ProgressMonitor() {
        setPriority(Thread.MAX_PRIORITY);
        setDaemon(true);
    }

    public void run() {
        boolean run = false;
        for (long sleep = 2000; ; sleep = 200, run = true) {
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException shutdown) {
                break;
            }
            showProgress();
        }
        if (run) {
            showProgress();
            System.err.println();
        }
    }

    void showProgress() {
        // Round up to kilobytes.
        args[0] = new Long(
                (liveStats.getUpdateTotalByteCountRead() + 1023) / 1024);
        args[1] = new Long(
                (liveStats.getUpdateTotalByteCountWritten() + 1023) / 1024);
        System.err.print(MessageFormat.format(
                "Top level archive IO: {0} / {1} KB        \r", args));
    }

    void shutdown() {
        interrupt();
        try {
            join();
        } catch (InterruptedException interrupted) {
            interrupted.printStackTrace();
        }
    }
}

// ...

ProgressMonitor monitor = new ProgressMonitor();
monitor.start();
try {
    File.umount();
} finally {
    monitor.shutdown();
}

Conclusions

Here are some guidelines to find the right balance between performance and control:

  1. When the JVM terminates, calling umount() is recommended in order to handle exceptions explicitly, but not required because TrueZIP's JVM shutdown hook takes care of unmounting anyway and prints the stack trace of any exceptions to the standard error output.
  2. Otherwise, in order to achieve best performance, umount() or update() should not get called unless either third party access or explicit exception handling is required.
  3. For the same reason, these methods should never get called in a loop which modifies an archive file.
  4. umount() is generally preferred over update() for safety reasons.

Miscellaneous

Virtual Directories in Archive Files

The top level entries in an archive file form its root directory. The root directory is never written to the output when an archive file is modified.

To the client application, the root directory behaves like any other directory and is addressed by naming the archive file in a path: For example, the client application may list its contents by calling File.list() or File.listFiles().

The root directory receives its last modification time from the archive file whenever it's read. Likewise, the archive file will receive the root directory's last modification time whenever it's written.

While this is a proper emulation of the behavior of real file systems, it may confuse users if only entries which are located one level or more below the root directory have been changed in an existing archive file: In this case, the last modification time of the root directory is not updated and hence the archive file's last modification time will not reflect the changes in the deeper directory levels.

As a workaround, the client application can use the idiom File.isArchive() && File.isDirectory() to detect an archive file and explicitly change the last modification time of its root directory by calling File.setLastModified(long).
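
For example:

File file = new File("archive.zip");
if (file.isArchive() && file.isDirectory())
    file.setLastModified(System.currentTimeMillis()); // touch the root directory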

An archive file may contain directories for which no entry is present in the file, although their directory tree contains at least one member for which an entry is actually present. Similarly, if File.isLenient() returns true (which is the default), an archive entry may be created in an archive file although its parent directory hasn't been explicitly created by calling File.mkdir() before.

Such a directory is called a ghost directory: Like the root directory, a ghost directory is not written to the output whenever an archive file is modified. This is to mimic the behavior of most archive utilities which do not create archive entries for directories.

To the client application, a ghost directory behaves like a regular directory with the exception that its last modification time returned by File.lastModified() is 0L. If the client application sets the last modification time explicitly using File.setLastModified(long), then the ghost directory reincarnates as a regular directory and will be output to the archive file.
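
For example:

File dir = new File("archive.zip/dir"); // possibly a ghost directory
if (dir.isDirectory() && dir.lastModified() == 0L)
    // Reincarnate the ghost directory so it gets written to the output.
    dir.setLastModified(System.currentTimeMillis());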

Note that a ghost directory can only exist within an archive file, but not every directory within an archive file is actually a ghost directory.

Entry Names in Archive Files

File paths may be composed of elements which either refer to regular nodes in the real file system (directories, files or special files), including top level archive files, or refer to entries within an archive file.

As usual in Java, elements in a path which refer to regular nodes may be case sensitive or not in TrueZIP's VFS, depending on the real file system and/or the platform.

However, elements in a path which refer to archive entries are always case sensitive. This enables the client application to address all files in existing archive files, regardless of the operating system they've been created on.

For existing archive files, redundant elements in entry names such as the empty string (""), the dot (".") directory, or the dot-dot ("..") directory are removed in the VFS when the archive file is read and not retained when the archive file is modified.

If an entry name contains characters which have no representation in the character set of the corresponding archive file type, then all file operations to create the archive entry will fail gracefully according to the documented contract of the respective operation. This is to protect the client application from creating archive entries which cannot get encoded and decoded again correctly. For example, the Euro sign (€) does not have a representation in the IBM437 character set and hence cannot be used for entries in ordinary ZIP files unless TrueZIP's configuration is customized to use another charset.
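
For example, trying to create such an entry in an ordinary ZIP file is expected to fail gracefully; whether this surfaces as a false return value or an IOException depends on the documented contract of the operation used:

File entry = new File("archive.zip/\u20ac.txt"); // Euro sign: no representation in IBM437
try {
    if (!entry.createNewFile())
        System.err.println("Could not create the entry!");
} catch (IOException ex) {
    ex.printStackTrace(); // the entry name cannot get encoded
}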

If an archive file contains entries with absolute entry names, such as /readme.txt rather than readme.txt, the client application cannot address these entries using the VFS in this package. However, these entries are retained like any other entry whenever the client application modifies the archive file. This should not impose problems as absolute entry names should never be used anyway and I'm not aware of any recent tools which would create them.

If an archive file contains both a file and a directory entry with the same name, it's up to the individual methods how they behave in this case. This can happen only with archive files created by external tools. Both File.isDirectory() and File.isFile() will return true in this case, and in fact they are the only methods the client application can rely upon to act properly in this situation: Many other methods use a combination of isDirectory() and isFile() calls and will show undefined behavior.
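
A sketch of the only reliable check in this situation:

File entry = new File("archive.zip/name");
if (entry.isDirectory() && entry.isFile()) {
    // Both a file and a directory entry with the same name exist in the
    // archive file: avoid methods with undefined behavior for this entry.
}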

The good news is that both the file and the directory coexist in the virtual archive file system implemented by this package. Thus, whenever the archive file is modified, both entries will be retained and no data gets lost. This allows you to use another tool to fix the issue in the archive file. TrueZIP never allows the client application to create such an archive file, however.