Python Tar Path Traversal with Symlink Exploitation
Executive Summary
Archive extraction is one of the most trusted operations in modern computing. From package managers to backup systems, we routinely extract TAR files without a second thought. But what happens when the archive itself is malicious?
This analysis examines how an archive traversal technique uses deeply nested directory structures and symbolic link chains to bypass validation mechanisms and write files outside the intended extraction directory. By understanding how path resolution works at the filesystem level, we can see why simple validation fails and how attackers exploit this gap.
Introduction: The Trust Assumption
Every day, systems extract thousands of TAR archives. They unpack dependencies, restore backups, and process uploaded files. The developers who wrote these extraction routines made a simple assumption: the paths inside the archive will stay inside the extraction directory.
This assumption is dangerously wrong.
A TAR archive is not just a container. It is a sequence of filesystem instructions. Each entry tells the operating system exactly what to create and where to put it. When you extract an archive, you are executing those instructions. If the archive contains malicious instructions, you are executing malicious code—without ever running a program.
Understanding the Building Blocks
Before we dive into the exploit, we need to understand two fundamental concepts: TAR internals and symbolic links.
What’s Inside a TAR Archive?
A TAR archive is a sequential list of file entries. Each entry contains:
- File name and path: Where the file should be created
- File type: Regular file, directory, symbolic link, hard link
- Metadata: Permissions, ownership, timestamps
- Content: The actual file data (for regular files)
- Link target: Where a symlink points (for link entries)
The extraction process reads these entries sequentially and performs exactly what each entry describes. If an entry says “create a symlink at path X pointing to target Y,” the extractor does it. If an entry says “write data to file Z,” the extractor does that too.
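To make the entry list concrete, here is a minimal sketch that builds a small, harmless archive in memory and reads its entries back in order using Python's standard tarfile module. The helper names (describe_entries, build_sample_archive) and the file names are illustrative assumptions, not part of any real archive.

```python
import io
import tarfile

def build_sample_archive():
    """Build a tiny in-memory archive: one directory, one file, one symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        directory = tarfile.TarInfo("docs")
        directory.type = tarfile.DIRTYPE
        tar.addfile(directory)

        data = b"hello\n"
        regular = tarfile.TarInfo("docs/readme.txt")
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))

        link = tarfile.TarInfo("docs/latest")
        link.type = tarfile.SYMTYPE
        link.linkname = "readme.txt"  # the link target is stored in the header
        tar.addfile(link)
    return buf.getvalue()

def describe_entries(archive_bytes):
    """Return (name, kind, link_target) for each entry, in archive order."""
    rows = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tar:
        for member in tar:
            if member.issym():
                kind = "symlink"
            elif member.islnk():
                kind = "hardlink"
            elif member.isdir():
                kind = "directory"
            elif member.isreg():
                kind = "file"
            else:
                kind = "other"
            rows.append((member.name, kind, member.linkname))
    return rows

if __name__ == "__main__":
    for row in describe_entries(build_sample_archive()):
        print(row)
```

Nothing is extracted here. The point is only that every entry, including a link and its target, is plain data that an extractor later acts on.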
Symbolic Links: Pointers in the Filesystem
A symbolic link (symlink) is a special filesystem entry that acts as a pointer. When you access a symlink, the operating system transparently redirects you to its target.

shortcut -> /etc/passwd
If a program writes to shortcut, the data actually goes to /etc/passwd. The program never knows the redirection happened—the kernel handles it automatically.
This automatic redirection is what makes symlinks dangerous in archive extraction.
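The redirection is easy to observe without touching any system file. The sketch below assumes a POSIX system where the current user may create symlinks; it works entirely inside a temporary directory, and the file names are placeholders.

```python
import os
import tempfile

def write_through_symlink():
    """Write to a symlink and report which file actually received the data."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "real_target.txt")
        shortcut = os.path.join(tmp, "shortcut")

        # Create an empty target, then a symlink pointing at it.
        open(target, "w").close()
        os.symlink(target, shortcut)

        # The program only ever names the symlink...
        with open(shortcut, "w") as fh:
            fh.write("written via shortcut")

        # ...but the data ends up in the target, and the link is still a link.
        with open(target) as fh:
            content = fh.read()
        still_a_link = os.path.islink(shortcut)
    return content, still_a_link

if __name__ == "__main__":
    print(write_through_symlink())
```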
The Core Vulnerability: Path Resolution vs. Path Validation

The vulnerability exists in the gap between what the extraction program checks and what the filesystem actually does.
The Extraction Program’s View:
- Read entry: “Create file at restore_dir/subfolder/file.txt”
- Validate that restore_dir/subfolder/file.txt starts with restore_dir/
- Create the file
The Filesystem’s View:
- Start at restore_dir
- Enter subfolder
- If subfolder is a symlink, follow it to its target
- Create file.txt at the resolved location
If subfolder points outside restore_dir, the file ends up somewhere completely different. The validation passed because it checked the original path string, not the resolved filesystem path.
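The two views can be compared side by side. This is a minimal sketch under the assumption of a POSIX system; restore_dir, subfolder, and outside are created inside a temporary directory purely for illustration.

```python
import os
import tempfile

def compare_checks():
    """Return (string_check_passes, resolved_check_passes, resolved_location)."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)
        restore_dir = os.path.join(tmp, "restore_dir")
        outside = os.path.join(tmp, "outside")
        os.mkdir(restore_dir)
        os.mkdir(outside)

        # subfolder is a symlink that leaves restore_dir.
        os.symlink(outside, os.path.join(restore_dir, "subfolder"))

        candidate = os.path.join(restore_dir, "subfolder", "file.txt")

        # The extraction program's view: compare path strings only.
        string_ok = candidate.startswith(restore_dir + os.sep)

        # The filesystem's view: follow symlinks, then compare.
        resolved = os.path.realpath(candidate)
        resolved_ok = resolved.startswith(restore_dir + os.sep)

        return string_ok, resolved_ok, os.path.relpath(resolved, tmp)

if __name__ == "__main__":
    print(compare_checks())
```

The string check accepts the path while the resolved check rejects it, which is exactly the gap described above.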
Anatomy of the Exploit: Building the Malicious Archive
The exploit constructs a TAR archive that abuses this gap. Let’s walk through each step of its construction.
Phase One: Creating Deep Directory Structures
The first step creates a deeply nested directory structure. The exploit uses directory names that are extremely long—hundreds of characters.
dddddddddddddddddddddddddddddddddddddddddddddddddd/
└── dddddddddddddddddddddddddddddddddddddddddddddddddd/
    └── dddddddddddddddddddddddddddddddddddddddddddddddddd/
        └── ...
Why so deep? Several reasons:
- Bypassing validation: Some validation routines normalize paths or check for traversal patterns. Deep nesting can confuse these routines.
- Path length limits: Filesystems have limits like PATH_MAX (typically 4096) and NAME_MAX (255). Paths approaching these limits can cause unexpected behavior in path resolution functions.
- Complexity: The deeper the structure, the harder it is for automated tools to analyze what the archive actually does.
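The limits that apply on a given machine can be queried rather than assumed. A small sketch, assuming a Unix-like system where os.pathconf is available; path_limits is a hypothetical helper name.

```python
import os

def path_limits(directory="."):
    """Return (PATH_MAX, NAME_MAX) as reported for the given directory."""
    path_max = os.pathconf(directory, "PC_PATH_MAX")
    name_max = os.pathconf(directory, "PC_NAME_MAX")
    return path_max, name_max

if __name__ == "__main__":
    print(path_limits())
```

The values depend on the operating system and filesystem, so code that validates paths should not hard-code them.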
Phase Two: Planting Symlinks Throughout the Structure
Inside this deep structure, the exploit creates multiple symbolic links at different levels.
dddddddddddd/
├── a -> dddddddddddddddddddddddddddddddddddddddddddddddddd/
├── dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   ├── b -> dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   ├── dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   │   ├── c -> dddddddddddddddddddddddddddddddddddddddddddddddddd/
│   │   ├── ...
Each symlink (a, b, c, etc.) points deeper into the directory chain. This creates a web of redirections that any path resolution must navigate.
Phase Three: Creating the Escape Symlink
Now the exploit creates a special symlink with a very long name—up to 254 characters.
dddddddddddd/dddddddddddd/dddddddddddd/llllllllllllllllllllllll...
This long-named symlink points upward, out of the deep structure:
target = '../../../../' (repeated enough times to climb out)
Then comes the master stroke: a symlink named escape that combines everything.
escape -> dddddddddddd/dddddddddddd/.../llllllllll.../../../../../target/path
When resolved, this path does something remarkable:
- Follow escape to the deep chain
- Navigate through the chain of symlinks (a, b, c, etc.)
- Reach the long-named symlink, which points back up the chain
- After climbing out, the remaining ../../../../target/path resolves to an absolute path
The final resolved location is completely outside the extraction directory.
Phase Four: Writing Through the Escape
The final step is deceptively simple. The archive contains a regular file entry, also named escape:
Entry: escape
Type: Regular file
Content: [payload data]
When the extractor processes this entry, it attempts to write the payload to escape. But escape is now a symlink. The kernel intervenes:
- Extractor: “Write data to restore_dir/escape”
- Kernel: “restore_dir/escape is a symlink pointing elsewhere”
- Kernel: “Redirecting write to resolved path”
The payload is written to the target of the symlink—a location outside the restore directory.
The Moment of Escape: What Actually Happens
Let’s trace the exact moment the security boundary breaks.
Before extraction:
Restore directory: /restore/ (empty)
During extraction (symlink creation):
/restore/deep/dir/structure/
/restore/deep/dir/structure/a -> (deeper)
/restore/deep/dir/structure/b -> (deeper)
...
/restore/escape -> deep/chain/../../../../outside/path
Everything still appears inside /restore/.
During extraction (file write):
Extractor calls: write(/restore/escape, payload)
Kernel path resolution:
- Start at /restore/escape
- escape is a symlink, resolve to its target
- Navigate the deep chain and upward traversal
- Final path: /outside/path
The payload lands in /outside/path. The extractor never knows the path changed.
Why Simple Validation Fails
Many developers attempt to prevent path traversal with checks like:
if '../' in member.path:
    reject()
This fails for several reasons:
Reason 1: Validation checks the string, not the resolved path
The exploit contains no ../ in the final escape entry. The traversal is hidden inside the symlink target, which validation may not check.
Reason 2: Validation runs before symlink resolution
The extractor might check escape and see a path that stays inside /restore/. It doesn’t realize that escape is a symlink that redirects elsewhere.
Reason 3: Deep paths confuse normalization
Path normalization functions may fail on extremely long paths or paths with many components, causing them to return incorrect results.
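The first two reasons can be shown without extracting anything. The sketch below builds a two-entry archive in memory (a symlink named escape followed by a regular file of the same name) and applies the string check to each entry name. The link target ../outside/path is a placeholder, and naive_name_check and audit are hypothetical helper names.

```python
import io
import tarfile

def naive_name_check(member):
    """The string check from the text: accept unless the entry name contains '../'."""
    return "../" not in member.name

def build_demo_archive():
    """In-memory archive: a symlink with an upward target, then a same-named file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        link = tarfile.TarInfo("escape")
        link.type = tarfile.SYMTYPE
        link.linkname = "../outside/path"  # placeholder target, never extracted here
        tar.addfile(link)

        payload = b"demo"
        regular = tarfile.TarInfo("escape")
        regular.size = len(payload)
        tar.addfile(regular, io.BytesIO(payload))
    return buf.getvalue()

def audit(archive_bytes):
    """Return (name, passes_naive_check, link_target_has_dotdot) per entry."""
    results = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tar:
        for member in tar:
            results.append(
                (member.name, naive_name_check(member), "../" in member.linkname)
            )
    return results

if __name__ == "__main__":
    for row in audit(build_demo_archive()):
        print(row)
```

Both entries pass the name check. The only place the traversal appears is the symlink's target field, which the naive check never reads.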
How to Identify Vulnerable Systems
Code Review Indicators
Look for these patterns in extraction code:
- Unsafe extraction functions: tar.extractall(), tar.extract() without path validation
- Missing symlink handling: No checks for SYMTYPE or LNKTYPE entries
- String-based validation: Checking for ../ without resolving the full path
- No post-extraction verification: Failing to verify where files actually landed
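Reviewing the extraction code can be paired with reviewing the archives it is expected to handle. The following sketch scans an archive, without extracting it, for entries whose path reuses or passes through a link declared earlier in the same archive. It is a heuristic under stated assumptions, not a complete detector: find_writes_through_links and build_sample are hypothetical names, and the sample entries are placeholders.

```python
import io
import posixpath
import tarfile

def find_writes_through_links(archive_bytes):
    """Return (entry_name, link_name) pairs for entries that reuse or pass
    through a symlink or hard link declared earlier in the archive."""
    link_names = set()
    findings = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tar:
        for member in tar:
            name = posixpath.normpath(member.name)
            parts = name.split("/")
            # Check every prefix of the path, including the full name.
            for i in range(1, len(parts) + 1):
                prefix = "/".join(parts[:i])
                if prefix in link_names:
                    findings.append((member.name, prefix))
                    break
            if member.issym() or member.islnk():
                link_names.add(name)
    return findings

def build_sample():
    """Placeholder archive: one path through a link, one reused link name."""
    entries = [
        ("sub", tarfile.SYMTYPE, "elsewhere"),
        ("sub/file.txt", tarfile.REGTYPE, ""),
        ("ok.txt", tarfile.REGTYPE, ""),
        ("escape", tarfile.SYMTYPE, "elsewhere"),
        ("escape", tarfile.REGTYPE, ""),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, kind, linkname in entries:
            info = tarfile.TarInfo(name)
            info.type = kind
            info.linkname = linkname
            tar.addfile(info)  # zero-length entries need no file object
    return buf.getvalue()

if __name__ == "__main__":
    print(find_writes_through_links(build_sample()))
```

A finding is not proof of malice, since legitimate archives occasionally replace a link with a file, but it marks the entries that deserve a closer look.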
Dynamic Testing
Create a test archive to probe for vulnerabilities:
- Create a symlink pointing to a safe test location (e.g., /tmp/test-write)
- Add a file entry with the same name as the symlink
- Extract the archive in a controlled environment
- Check if the file appears at the symlink target
If the file appears outside the extraction directory, the system is vulnerable.
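The final check can be partly automated. As a sketch, the helper below walks an extraction directory after the fact and reports every entry whose resolved location lies outside it. find_escapes and demo are hypothetical names, and the demo assumes a POSIX system where symlinks can be created in a temporary directory.

```python
import os
import tempfile

def find_escapes(extract_root):
    """Return (relative_path, resolved_path) for entries that resolve outside the root."""
    root = os.path.realpath(extract_root)
    escapes = []
    # followlinks=False: symlinked directories are listed but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            resolved = os.path.realpath(path)
            if os.path.commonpath([root, resolved]) != root:
                escapes.append((os.path.relpath(path, root), resolved))
    return sorted(escapes)

def demo():
    """Simulate an extraction result containing one ordinary file and one escaping link."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)
        root = os.path.join(tmp, "restore")
        outside = os.path.join(tmp, "outside")
        os.mkdir(root)
        os.mkdir(outside)
        open(os.path.join(root, "inside.txt"), "w").close()
        os.symlink(outside, os.path.join(root, "link"))
        found = find_escapes(root)
        return [(rel, os.path.relpath(res, tmp)) for rel, res in found]

if __name__ == "__main__":
    print(demo())
```

This only detects links that remain in the tree. A payload that has already been written elsewhere still has to be found by checking the test location directly.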
Proper Mitigation: Safe Extraction
Secure extraction requires multiple layers of defense.
Validate Resolved Paths
Never trust the path string from the archive. Always resolve the full path and verify it stays within the target directory.
import os

def safe_extract_member(member, target_dir):
    # Resolve the target directory itself, so that a symlinked
    # target_dir is compared in the same form as the member path
    target_abs = os.path.realpath(target_dir)
    # Join with member name and resolve symlinks already on disk
    member_path = os.path.join(target_abs, member.name)
    resolved = os.path.realpath(member_path)
    # Verify the resolved path is still inside target
    if not resolved.startswith(target_abs + os.sep):
        raise ValueError(f"Path escape detected: {member.name}")
    # Extract the member
    # ... extraction code ...
Handle Symlinks Safely
Consider whether symlinks are truly needed. If not, reject them entirely.
if member.issym() or member.islnk():
    raise Exception("Symlinks and hard links are not allowed")
If symlinks are required, validate their targets using the same resolved-path approach.
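One way to apply the resolved-path approach to a symlink entry is sketched below. It assumes that a relative symlink target is interpreted relative to the directory containing the link, which is how the filesystem treats it. link_target_stays_inside is a hypothetical helper, it covers symlinks only (hard link targets in a TAR archive are named relative to the archive root and need separate handling), and because it relies on os.path.realpath it inherits the limitations with very long paths noted earlier.

```python
import os
import tarfile
import tempfile

def link_target_stays_inside(member, target_dir):
    """Return True if a symlink entry's target resolves inside target_dir."""
    root = os.path.realpath(target_dir)
    if os.path.isabs(member.linkname):
        return False
    # A relative target is resolved from the directory that holds the link.
    link_dir = os.path.dirname(os.path.join(root, member.name))
    resolved = os.path.realpath(os.path.join(link_dir, member.linkname))
    return os.path.commonpath([root, resolved]) == root

def demo():
    """Check three placeholder symlink entries against a temporary directory."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, target in [
            ("docs/latest", "readme.txt"),      # stays inside
            ("docs/up", "../../etc/passwd"),    # climbs out
            ("docs/abs", "/etc/passwd"),        # absolute target
        ]:
            info = tarfile.TarInfo(name)
            info.type = tarfile.SYMTYPE
            info.linkname = target
            results.append(link_target_stays_inside(info, tmp))
    return results

if __name__ == "__main__":
    print(demo())
```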
Use Safe Extraction Filters
Modern libraries include safer extraction options. For Python’s tarfile, use appropriate filters:
tar.extractall(path, filter='data') # Safer, but verify version
Note that even filter='data' had vulnerabilities (CVE-2025-4517) in some versions. Always keep libraries updated.
Extract with Minimal Privileges
Never extract archives as root. Use a dedicated user with limited permissions. Even if an archive escapes, the damage is contained.
Conclusion
Archive extraction vulnerabilities persist because they exploit a fundamental gap between developer assumptions and filesystem behavior. The path string in an archive is not the final destination—it’s just the starting point for a resolution process that can traverse symlinks, follow pointers, and end up anywhere.
The exploit we’ve analyzed demonstrates how sophisticated these attacks can be. Deep directory structures, symlink chains, and carefully crafted path resolution create a mechanism that bypasses naive validation while appearing completely normal to the extraction program.
Understanding this mechanism is the first step toward building secure systems. The second step is implementing proper validation—not of path strings, but of fully resolved filesystem locations. Only by checking where a file actually lands can we ensure that extracting an archive doesn’t mean extracting control of our systems.
References
CVE-2025-4517: Python tarfile arbitrary file write https://nvd.nist.gov/vuln/detail/CVE-2025-4517 https://www.rapid7.com/db/vulnerabilities/redhat_linux-cve-2025-4517/
Python tarfile extraction filters documentation https://docs.python.org/3/library/tarfile.html#tarfile-extraction-filter
Supply chain risks in archive extraction https://linuxsecurity.com/news/security-vulnerabilities/python-tarfile-supply-chain-risk
Red Hat Security Advisory RHSA-2025:10026 https://access.redhat.com/errata/RHSA-2025:10026