Executive Summary

Archive extraction is one of the most trusted operations in modern computing. From package managers to backup systems, we routinely extract TAR files without a second thought. But what happens when the archive itself is malicious?

This analysis examines how an archive traversal technique uses deeply nested directory structures and symbolic link chains to bypass validation mechanisms and write files outside the intended extraction directory. By understanding how path resolution works at the filesystem level, we can see why simple validation fails and how attackers exploit this gap.

Introduction: The Trust Assumption

Every day, systems extract thousands of TAR archives. They unpack dependencies, restore backups, and process uploaded files. The developers who wrote these extraction routines made a simple assumption: the paths inside the archive will stay inside the extraction directory.

This assumption is dangerously wrong.

A TAR archive is not just a container. It is a sequence of filesystem instructions. Each entry tells the operating system exactly what to create and where to put it. When you extract an archive, you are executing those instructions. If the archive contains malicious instructions, you are executing malicious code—without ever running a program.

Understanding the Building Blocks

Before we dive into the exploit, we need to understand two fundamental concepts: TAR internals and symbolic links.

What’s Inside a TAR Archive?

A TAR archive is a sequential list of file entries. Each entry contains:

  • File name and path: Where the file should be created
  • File type: Regular file, directory, symbolic link, hard link
  • Metadata: Permissions, ownership, timestamps
  • Content: The actual file data (for regular files)
  • Link target: Where a symlink points (for link entries)

The extraction process reads these entries sequentially and performs exactly what each entry describes. If an entry says “create a symlink at path X pointing to target Y,” the extractor does it. If an entry says “write data to file Z,” the extractor does that too.
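
To make the entry list concrete, here is a minimal sketch using Python's standard tarfile module. It builds a tiny archive in memory and prints the fields described above for each entry. The helper names (build_sample_archive, describe_entries) and the file names are made up for illustration.

```python
import io
import tarfile


def build_sample_archive() -> bytes:
    """Build a tiny in-memory TAR: one directory, one file, one symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        directory = tarfile.TarInfo("docs")
        directory.type = tarfile.DIRTYPE
        directory.mode = 0o755
        tar.addfile(directory)

        data = b"hello\n"
        regular = tarfile.TarInfo("docs/readme.txt")
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))

        link = tarfile.TarInfo("docs/latest")
        link.type = tarfile.SYMTYPE
        link.linkname = "readme.txt"  # link target, stored in the header
        tar.addfile(link)
    return buf.getvalue()


def describe_entries(archive: bytes) -> list:
    """Return (name, kind, link target) for each entry, in archive order."""
    rows = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for member in tar:
            if member.isdir():
                kind = "directory"
            elif member.issym():
                kind = "symlink"
            elif member.islnk():
                kind = "hard link"
            elif member.isfile():
                kind = "file"
            else:
                kind = "other"
            rows.append((member.name, kind, member.linkname))
    return rows


for row in describe_entries(build_sample_archive()):
    print(row)
```

Nothing is extracted here. The point is only that every entry is a small record of "what to create and where", which the extractor later acts on in order.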

What Is a Symbolic Link?

A symbolic link (symlink) is a special filesystem entry that acts as a pointer. When you access a symlink, the operating system transparently redirects you to its target.

shortcut -> /etc/passwd

If a program writes to shortcut, the data actually goes to /etc/passwd. The program never knows the redirection happened, because the kernel handles it automatically. This automatic redirection is what makes symlinks dangerous in archive extraction.
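
The redirection is easy to observe safely. The sketch below works entirely inside a temporary directory and uses a harmless target.txt in place of /etc/passwd; the function name and file names are illustrative assumptions. It needs a platform where os.symlink is available.

```python
import os
import tempfile


def write_through_symlink():
    """Write to a symlink and report what ended up in its target."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        shortcut = os.path.join(tmp, "shortcut")

        with open(target, "w") as fh:
            fh.write("original\n")

        # shortcut -> target.txt
        os.symlink(target, shortcut)

        # The program only ever names "shortcut" ...
        with open(shortcut, "w") as fh:
            fh.write("overwritten via shortcut\n")

        # ... but the data landed in the target, and shortcut is still a link.
        with open(target) as fh:
            content = fh.read()
        return content, os.path.islink(shortcut)


print(write_through_symlink())
```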

The Core Vulnerability: Path Resolution vs. Path Validation

The vulnerability exists in the gap between what the extraction program checks and what the filesystem actually does.

The Extraction Program’s View:

  1. Read entry: “Create file at restore_dir/subfolder/file.txt”
  2. Validate that restore_dir/subfolder/file.txt starts with restore_dir/
  3. Create the file

The Filesystem’s View:

  1. Start at restore_dir
  2. Enter subfolder
  3. If subfolder is a symlink, follow it to its target
  4. Create file.txt at the resolved location

If subfolder points outside restore_dir, the file ends up somewhere completely different. The validation passed because it checked the original path string, not the resolved filesystem path.
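
The gap between the two views can be reproduced in a few lines. This is a sketch under simple assumptions: a temporary directory stands in for the real filesystem, and the directory name elsewhere is a made-up stand-in for any location outside restore_dir.

```python
import os
import tempfile


def string_check_vs_resolution():
    """Compare a prefix check on the path string with the resolved path."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.realpath(tmp)
        restore = os.path.join(base, "restore_dir")
        elsewhere = os.path.join(base, "elsewhere")
        os.mkdir(restore)
        os.mkdir(elsewhere)

        # subfolder is a symlink that leaves restore_dir
        os.symlink(elsewhere, os.path.join(restore, "subfolder"))

        candidate = os.path.join(restore, "subfolder", "file.txt")

        # The extraction program's view: the string looks fine.
        string_ok = candidate.startswith(restore + os.sep)

        # The filesystem's view: follow the symlink, then compare.
        resolved = os.path.realpath(candidate)
        resolved_ok = os.path.commonpath([restore, resolved]) == restore

        return string_ok, resolved_ok, os.path.relpath(resolved, base)


print(string_check_vs_resolution())
```

The first value is True and the second is False: the same path passes the string check and fails the resolved-path check.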

Anatomy of the Exploit: Building the Malicious Archive

The exploit constructs a TAR archive that abuses this gap. Let’s walk through each step of its construction.

Phase One: Creating Deep Directory Structures

The first step creates a deeply nested directory structure. The exploit uses directory names that are extremely long—hundreds of characters.

dddddddddddddddddddddddddddddddddddddddddddddddddd/
└── dddddddddddddddddddddddddddddddddddddddddddddddddd/
    └── dddddddddddddddddddddddddddddddddddddddddddddddddd/
        └── ...

Why so deep? Several reasons:

  • Bypassing validation: Some validation routines normalize paths or scan for traversal patterns such as ../. Deep nesting produces paths long enough that a normalization or resolution step may give up partway and return a result that was never fully checked.
  • Path length limits: On Linux, PATH_MAX is typically 4096 bytes and NAME_MAX is 255 bytes. Paths approaching these limits can cause unexpected behavior in path resolution functions.
  • Complexity: The deeper the structure, the harder it is for automated tools to analyze what the archive actually does.
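
Some rough arithmetic shows how little nesting is needed to reach those limits. The sketch below assumes the typical Linux values quoted above and counts only the nested components and their separators; the function names are illustrative.

```python
PATH_MAX = 4096  # typical Linux value (assumption; check your platform)
NAME_MAX = 255   # longest single path component on most Linux filesystems


def nested_path_length(component_len: int, depth: int) -> int:
    """Length of `depth` components of `component_len` chars joined by '/'."""
    return depth * component_len + (depth - 1)


def depth_to_exceed(limit: int, component_len: int) -> int:
    """Smallest nesting depth whose joined path is longer than `limit`."""
    depth = 1
    while nested_path_length(component_len, depth) <= limit:
        depth += 1
    return depth


# 50-character names, as in the listing above
print(depth_to_exceed(PATH_MAX, 50))
# names at the NAME_MAX limit
print(depth_to_exceed(PATH_MAX, NAME_MAX))
```

With 50-character names the path passes 4096 characters at 81 levels; with 255-character names it takes only 17.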

Phase Two: Building the Symlink Chain

Inside this deep structure, the exploit creates multiple symbolic links at different levels.


dddddddddddd/
├── a -> dddddddddddddddddddddddddddddddddddddddddddddddddd/
├── dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   ├── b -> dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   ├── dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   │   ├── c -> dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   │   ├── ...

Each symlink (a, b, c, etc.) points deeper into the directory chain. This creates a web of redirections that any path resolution must navigate.

Phase Three: Creating the Escape Link

Now the exploit creates a special symlink with a very long name, up to 254 characters.

dddddddddddd/dddddddddddd/dddddddddddd/llllllllllllllllllllllll...

This long-named symlink points upward, out of the deep structure:

target = '../../../../' (repeated enough times to climb out)

Then comes the master stroke: a symlink named escape that combines everything.

escape -> dddddddddddd/dddddddddddd/.../llllllllll.../../../../../target/path

When resolved, this path does something remarkable:

  1. Follow escape to the deep chain
  2. Navigate through the chain of symlinks (a, b, c, etc.)
  3. Reach the long-named symlink, which points back up the chain
  4. After climbing out, the remaining ../../../../target/path resolves to an absolute path

The final resolved location is completely outside the extraction directory.
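
A scaled-down version of this resolution can be traced safely. The sketch below is a simplification, not the real archive layout: short names (d1, d2, up) stand in for the long directory and symlink names, everything lives in a temporary directory, and nothing is written through the link.

```python
import os
import tempfile


def resolve_escape_chain():
    """Resolve a small 'escape' link that goes down a chain, then back up."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.realpath(tmp)
        restore = os.path.join(base, "restore")
        deep = os.path.join(restore, "d1", "d2")
        os.makedirs(deep)

        # "up" lives in restore/d1/d2 and climbs three levels: out of restore.
        os.symlink(os.path.join("..", "..", ".."), os.path.join(deep, "up"))

        # "escape" is relative, enters the deep structure, then uses "up".
        os.symlink(
            os.path.join("d1", "d2", "up", "outside.txt"),
            os.path.join(restore, "escape"),
        )

        resolved = os.path.realpath(os.path.join(restore, "escape"))
        inside = os.path.commonpath([restore, resolved]) == restore
        return os.path.relpath(resolved, base), inside


print(resolve_escape_chain())
```

Every link target is a relative path that starts inside restore, yet the resolved location is a sibling of restore rather than something under it.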

Phase Four: Writing Through the Escape

The final step is deceptively simple. The archive contains a regular file entry, also named escape:

Entry: escape
Type: Regular file
Content: [payload data]

When the extractor processes this entry, it attempts to write the payload to escape. But escape is now a symlink. The kernel intervenes:

  1. Extractor: “Write data to restore_dir/escape”
  2. Kernel: “restore_dir/escape is a symlink pointing elsewhere”
  3. Kernel: “Redirecting write to resolved path”

The payload is written to the target of the symlink—a location outside the restore directory.

The Moment of Escape: What Actually Happens

Let’s trace the exact moment the security boundary breaks.

Before extraction:

Restore directory: /restore/ (empty)

During extraction (symlink creation):

/restore/deep/dir/structure/
/restore/deep/dir/structure/a -> (deeper)
/restore/deep/dir/structure/b -> (deeper)
...
/restore/escape -> /deep/chain/../../../../outside/path

Everything still appears inside /restore/.

During extraction (file write):

Extractor calls: write(/restore/escape, payload)

Kernel path resolution:

  1. Start at /restore/escape
  2. escape is a symlink, resolve to its target
  3. Navigate the deep chain and upward traversal
  4. Final path: /outside/path

The payload lands in /outside/path. The extractor never knows the path changed.

Why Simple Validation Fails

Many developers attempt to prevent path traversal with checks like:

if '../' in member.path:
    reject()

This fails for several reasons:

Reason 1: Validation checks the string, not the resolved path
The exploit contains no ../ in the final escape entry. The traversal is hidden inside the symlink target, which validation may not check.

Reason 2: Validation runs before symlink resolution
The extractor might check escape and see a path that stays inside /restore/. It doesn’t realize that escape is a symlink that redirects elsewhere.

Reason 3: Deep paths confuse normalization
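Path normalization and resolution functions may fail on extremely long paths or paths with many components. A resolver that stops early when an intermediate path exceeds PATH_MAX can return a partially resolved result, and a check built on that result can approve a path that the kernel will resolve differently.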
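
The first two reasons can be shown without touching the filesystem. In this sketch, naive_check is a stand-in for the '../' test above, and the two entries mirror the escape pair from the walkthrough; the link target ../../outside/path is a made-up example value.

```python
import tarfile


def naive_check(member: tarfile.TarInfo) -> bool:
    """The string test from above: accept any name without '../'."""
    return "../" not in member.name


def build_members():
    """Two entries with the same name: a symlink, then a regular file."""
    link = tarfile.TarInfo("escape")
    link.type = tarfile.SYMTYPE
    link.linkname = "../../outside/path"  # traversal lives here, not in name

    payload = tarfile.TarInfo("escape")   # default type is a regular file
    return [link, payload]


members = build_members()

# Every entry name passes the naive check ...
all_names_pass = all(naive_check(m) for m in members)

# ... while the traversal sits in a field the check never looked at.
hidden_targets = [
    m.linkname for m in members
    if m.issym() and ".." in m.linkname.split("/")
]

print(all_names_pass, hidden_targets)
```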

How to Identify Vulnerable Systems

Code Review Indicators

Look for these patterns in extraction code:

  • Unsafe extraction functions: tar.extractall(), tar.extract() without path validation
  • Missing symlink handling: No checks for SYMTYPE or LNKTYPE entries
  • String-based validation: Checking for ../ without resolving the full path
  • No post-extraction verification: Failing to verify where files actually landed

Dynamic Testing

Create a test archive to probe for vulnerabilities:

  1. Create a symlink pointing to a safe test location (e.g., /tmp/test-write)
  2. Add a file entry with the same name as the symlink
  3. Extract the archive in a controlled environment
  4. Check if the file appears at the symlink target

If the file appears outside the extraction directory, the system is vulnerable.
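
Steps 1 and 2 can be sketched as follows. The helper names are hypothetical, the entry name probe is arbitrary, and /tmp/test-write is the example location from the list above. The second helper is an extra static check that reports names appearing first as a link and later as a regular file; it is a heuristic for the specific pattern described here, not a complete scanner.

```python
import io
import tarfile


def build_probe_archive(link_target: str = "/tmp/test-write") -> bytes:
    """Archive with a symlink entry followed by a file entry of the same name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        link = tarfile.TarInfo("probe")
        link.type = tarfile.SYMTYPE
        link.linkname = link_target
        tar.addfile(link)

        data = b"probe marker\n"
        regular = tarfile.TarInfo("probe")
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))
    return buf.getvalue()


def link_then_file_names(archive: bytes) -> list:
    """Names that appear first as a link and later as a regular file."""
    seen_links = set()
    hits = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for member in tar:
            if member.issym() or member.islnk():
                seen_links.add(member.name)
            elif member.isfile() and member.name in seen_links:
                hits.append(member.name)
    return hits


print(link_then_file_names(build_probe_archive()))
```

The archive bytes can then be written to disk and handed to the extractor under test, inside a disposable container or VM.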

Proper Mitigation: Safe Extraction

Secure extraction requires multiple layers of defense.

Validate Resolved Paths

Never trust the path string from the archive. Always resolve the full path and verify it stays within the target directory.

import os

def safe_extract_member(member, target_dir):
    # Resolve the target directory itself so both sides of the comparison
    # are canonical (abspath alone does not follow symlinks)
    target_real = os.path.realpath(target_dir)

    # Join with the member name and resolve any symlinks already on disk
    resolved = os.path.realpath(os.path.join(target_real, member.name))

    # Verify the resolved path is still inside the target directory
    if os.path.commonpath([target_real, resolved]) != target_real:
        raise ValueError(f"Path escape detected: {member.name}")

    # Extract the member
    # ... extraction code ...

Restrict Symlinks and Hard Links

Consider whether symlinks are truly needed. If not, reject them entirely.

if member.issym() or member.islnk():
    raise Exception("Symlinks and hard links are not allowed")

If symlinks are required, validate their targets using the same resolved-path approach.
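
One way to apply the resolved-path approach to link targets is sketched below. The function and helper names are hypothetical. It assumes symlink targets are relative to the directory containing the link and hard link targets are relative to the archive root, and it inherits whatever limits os.path.realpath has on very long paths.

```python
import os
import tarfile
import tempfile


def link_target_stays_inside(member: tarfile.TarInfo, target_dir: str) -> bool:
    """Return True if the link's target resolves inside target_dir."""
    dest = os.path.realpath(target_dir)

    # Absolute targets ignore the extraction directory entirely.
    if os.path.isabs(member.linkname):
        return False

    if member.issym():
        # Symlink targets are relative to the directory holding the link.
        candidate = os.path.join(dest, os.path.dirname(member.name), member.linkname)
    else:
        # Hard link targets name another member, relative to the archive root.
        candidate = os.path.join(dest, member.linkname)

    resolved = os.path.realpath(candidate)
    return os.path.commonpath([dest, resolved]) == dest


def make_symlink(name: str, linkname: str) -> tarfile.TarInfo:
    """Demo helper: build a symlink entry without touching the disk."""
    info = tarfile.TarInfo(name)
    info.type = tarfile.SYMTYPE
    info.linkname = linkname
    return info


with tempfile.TemporaryDirectory() as dest:
    results = {
        "sibling": link_target_stays_inside(make_symlink("a/link", "b.txt"), dest),
        "climbs out": link_target_stays_inside(make_symlink("a/link", "../../outside"), dest),
        "absolute": link_target_stays_inside(make_symlink("a/link", "/etc/passwd"), dest),
    }

print(results)
```

Only the first link is accepted. Because the check resolves against what is already on disk, it has to run for each entry as it is extracted, not once up front.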

Use Safe Extraction Filters

Modern libraries include safer extraction options. For Python’s tarfile, use appropriate filters:

tar.extractall(path, filter='data')  # Safer, but verify version

Note that even filter='data' has had vulnerabilities: CVE-2025-4517 allowed writes outside the extraction directory in affected Python versions. Always keep libraries updated.

Extract with Minimal Privileges

Never extract archives as root. Use a dedicated user with limited permissions. Even if an archive escapes, the damage is contained.
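
A guard at the top of an extraction routine can enforce this. The sketch below is POSIX-only (os.geteuid does not exist on Windows), and the function name is an illustrative assumption. The optional euid parameter exists only so the check can be exercised without changing users.

```python
import os
from typing import Optional


def ensure_unprivileged(euid: Optional[int] = None) -> int:
    """Raise if the effective user is root; otherwise return the euid."""
    if euid is None:
        euid = os.geteuid()  # POSIX only
    if euid == 0:
        raise PermissionError("refusing to extract archives as root")
    return euid


# Typical use: call ensure_unprivileged() before opening the archive.
print(ensure_unprivileged(1000))
```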

Conclusion

Archive extraction vulnerabilities persist because they exploit a fundamental gap between developer assumptions and filesystem behavior. The path string in an archive is not the final destination—it’s just the starting point for a resolution process that can traverse symlinks, follow pointers, and end up anywhere.

The exploit we’ve analyzed demonstrates how sophisticated these attacks can be. Deep directory structures, symlink chains, and carefully crafted path resolution create a mechanism that bypasses naive validation while appearing completely normal to the extraction program.

Understanding this mechanism is the first step toward building secure systems. The second step is implementing proper validation—not of path strings, but of fully resolved filesystem locations. Only by checking where a file actually lands can we ensure that extracting an archive doesn’t mean extracting control of our systems.

