
E2E Tests with AI: Technical Hurdles and Why It's More Complicated Than You Think
“Can’t we just use ChatGPT to write our E2E tests?” You probably hear this question frequently as a developer. The answer is more complicated than a simple yes or no. In this article, I’ll show you the technical reality behind AI-powered test automation – and why the problem is more interesting (and difficult) than it initially appears.
The Problem with Simple Approaches
Approach 1: Code Generation with LLMs
The most obvious approach is to have an LLM generate test code:
// AI-generated E2E test
test('User can login and update profile', async ({ page }) => {
  await page.goto('https://example.com');
  await page.fill('input[name="username"]', 'testuser');
  await page.fill('input[name="password"]', 'password123');
  await page.click('button[type="submit"]');
  await page.waitForNavigation();
  // and so on...
});
Technical Problems:
- Context Window Limitation: Large codebases quickly exceed the LLM’s context window
- Missing Visual Context: The LLM sees code but not the running application
- Unreliable Selectors: Generated CSS selectors break with UI changes
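To make the last point concrete, here is a hypothetical example; the class names and DOM structure are invented purely for illustration:

// A brittle selector an LLM might derive from today's markup
await page.click('#root > div:nth-child(2) > div > button.css-1q2w3e');

// After a harmless refactor (a new wrapper div, a regenerated CSS class),
// the same button lives at a different path and the test fails:
// #root > div:nth-child(2) > section > div > button.css-9z8y7x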
Approach 2: Browser Automation with LLMs
The next step: Let the LLM control the browser directly. The LLM receives screenshots and DOM, analyzes the current state, and decides what to do next.
Why this works technically:
- Modern vision models (GPT-4V, Claude 3.5 Sonnet) can interpret screenshots surprisingly well
- LLMs understand natural language test descriptions
- Browser automation APIs (Playwright, Selenium) are well documented
Why it fails in practice:
- Performance: Each test step requires an API call to the LLM (100-500ms+)
- Cost: A single E2E test can cost $0.10-1.00+
- Scaling: CI pipelines become unaffordable with dozens of tests
- Determinism: The same test can take a different path on every run, because the model's decisions aren't deterministic
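To see where the latency and cost come from, here is a minimal sketch of such a control loop, assuming a hypothetical callLLM() helper that wraps whichever model API you use; every iteration of the loop pays for one model round-trip:

// Minimal sketch of an LLM-in-the-loop test driver (not production code).
// callLLM() is an assumed helper around your model API of choice.
async function runAiDrivenTest(page, testDescription) {
  for (let step = 0; step < 30; step++) {
    const screenshot = await page.screenshot(); // visual context (Buffer)
    const dom = await page.content();           // textual context (HTML string)

    // One slow, paid API round-trip per step
    const action = await callLLM({ testDescription, screenshot, dom });

    if (action.type === 'done') return;
    if (action.type === 'click') await page.click(action.selector);
    if (action.type === 'fill') await page.fill(action.selector, action.value);
  }
  throw new Error('Step limit reached without finishing the test');
}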
Hybrid Approach: Record & Replay
The most promising approach combines AI generation with traditional test execution:
- Recording Phase: AI executes the test once and records actions
- Code Generation: Actions are translated into stable test code
- Replay Phase: Tests run without AI involvement
- Regeneration: AI can re-record the test when errors occur
class HybridTestRecorder {
  async recordTest(testDescription) {
    const actions = [];
    // AI-controlled execution with recording
    while (!this.isTestComplete()) {
      const screenshot = await this.takeScreenshot();
      const action = await this.aiAgent.nextAction(screenshot, testDescription);
      actions.push({
        type: action.type,
        selector: this.generateStableSelector(action.element),
        data: action.data,
        screenshot: screenshot,
      });
      await this.executeAction(action);
    }
    return this.generatePlaywrightCode(actions);
  }
}
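Building on that sketch, the regeneration step could look roughly like this; runGeneratedTest() is an assumed helper that executes the generated Playwright code, and the single-retry policy is just one possible design:

// Sketch of the replay/regeneration loop around the recorder above.
async function runWithSelfHealing(recorder, testDescription, generatedCode) {
  try {
    // Fast path: deterministic replay, no AI involved
    await runGeneratedTest(generatedCode);
  } catch (error) {
    console.warn('Replay failed, re-recording with AI:', error.message);
    // Slow path: let the AI re-record the scenario once and retry
    const regeneratedCode = await recorder.recordTest(testDescription);
    await runGeneratedTest(regeneratedCode);
  }
}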
Why Existing Tools Aren’t Sufficient
Problem with Generic AI Assistants
ChatGPT, Claude & Co. aren’t optimized for browser automation:
- No specialized DOM processing
- No robust selector strategies
- No integration with test frameworks
- No understanding of test maintenance
Problem with Simple Browser Automation Tools
Tools like Claude Desktop with Playwright MCP only solve surface problems:
- Work for demos, not for production
- No selector optimization
- No test management features
- No CI/CD integration
The Three Core Technical Problems
1. The Context Problem
The DOM of a modern web app can quickly grow to 2-10 MB. At roughly four characters per token, that is on the order of half a million to a few million tokens, so even GPT-4 with a 128k-token context window cannot take in the complete DOM of a modern single-page application.
Solution Approaches:
- DOM Pruning: Remove irrelevant elements
- Semantic DOM: Extract only interactive elements
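A minimal sketch of such a semantic extraction could look like this, assuming it runs inside the page (for example via Playwright's page.evaluate()); the attribute list is an illustrative selection, not a complete one:

// Rough sketch: reduce the DOM to interactive elements before prompting the LLM.
// Intended to run in the browser context, e.g. await page.evaluate(extractSemanticDom)
function extractSemanticDom() {
  const interactive = document.querySelectorAll(
    'a, button, input, select, textarea, [role="button"], [role="link"], [contenteditable]'
  );
  return Array.from(interactive).map((el) => ({
    tag: el.tagName.toLowerCase(),
    text: (el.textContent || '').trim().slice(0, 80),
    name: el.getAttribute('aria-label') || el.getAttribute('name') || null,
    testId: el.getAttribute('data-testid') || null,
    visible: el.offsetParent !== null, // crude visibility check
  }));
}

A few hundred entries like this fit comfortably into a prompt where the raw HTML would not.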
2. The Visual-DOM Mapping Problem
The LLM sees a “Search” button in the screenshot, but the DOM only contains:
<button class="btn-primary">
  <svg viewBox="0 0 24 24">
    <path
      d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3 9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z"
    />
  </svg>
</button>
How is the LLM supposed to know that this SVG icon is the “Search” button it is looking for?
Technical Solution Approaches:
- Computer vision for icon recognition
- Multi-modal embedding for visual-DOM alignment
- Accessible name computation according to ARIA standards
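For the last point, here is a deliberately simplified sketch of accessible-name computation; the real accname algorithm in the ARIA spec covers many more cases (labelledby chains, form labels, alt text), so treat this as an illustration rather than a spec-compliant implementation:

// Simplified sketch of accessible-name computation
function getAccessibleName(el) {
  // 1. aria-labelledby wins over everything else
  const labelledby = el.getAttribute('aria-labelledby');
  if (labelledby) {
    const ref = document.getElementById(labelledby.split(' ')[0]);
    if (ref) return ref.textContent.trim();
  }
  // 2. Explicit aria-label
  const ariaLabel = el.getAttribute('aria-label');
  if (ariaLabel) return ariaLabel.trim();
  // 3. Visible text content
  if (el.textContent.trim()) return el.textContent.trim();
  // 4. title attribute as a last resort
  return (el.getAttribute('title') || '').trim();
}

For the icon-only button above, this returns an empty string, which is exactly why the visual approaches in the first two bullets are needed as a fallback.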
3. The Selector Stability Problem
A reliable E2E test needs stable selectors. The LLM must not only find an element but also generate a robust selector for it.
Selector Hierarchy (from stable to fragile):
- data-testid attributes (ideal, but rarely available)
- Semantic selectors (role, aria-label)
- Relative positioning to known elements
- CSS classes and IDs
- XPath with absolute positions (fragile)
// Robust selector algorithm
function generateSelector(element, dom) {
  if (element.dataset.testid) {
    return `[data-testid="${element.dataset.testid}"]`;
  }
  if (element.getAttribute('aria-label')) {
    return `[aria-label="${element.getAttribute('aria-label')}"]`;
  }
  // Fallback: Relative positioning
  const nearbyLandmark = findNearestLandmark(element, dom);
  if (nearbyLandmark) {
    return `${nearbyLandmark.selector} >> ${getRelativeSelector(element)}`;
  }
  // Last resort: CSS path
  return generateCSSPath(element);
}
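In the hybrid recorder sketched earlier, generateStableSelector() would typically delegate to a function like this one, so that every recorded action is stored with the most robust selector available rather than whatever the AI happened to click on.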
The Value of Specialized Solutions
A well-thought-out AI test automation solution should provide:
Technical Excellence:
- Multi-agent architecture for optimal specialization
- Intelligent context optimization
- Robust selector generation
- Efficient caching and retry mechanisms
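To make the last point less abstract, caching could be as simple as keying the AI's decision on a hash of the observed state, so an unchanged screen never pays for a second model call; the class below is an illustrative sketch, not part of any particular product:

// Sketch: memoize AI decisions keyed by a hash of screenshot + instruction
const crypto = require('crypto');

class CachedAiAgent {
  constructor(aiAgent) {
    this.aiAgent = aiAgent;
    this.cache = new Map();
  }

  async nextAction(screenshot, testDescription) {
    const key = crypto
      .createHash('sha256')
      .update(screenshot)        // Buffer from page.screenshot()
      .update(testDescription)
      .digest('hex');

    if (!this.cache.has(key)) {
      this.cache.set(key, await this.aiAgent.nextAction(screenshot, testDescription));
    }
    return this.cache.get(key);
  }
}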
Developer Experience:
- Seamless integration into existing test suites
- Export to common frameworks (Playwright, Cypress, Selenium)
- Test versioning and diffing
- CI/CD pipeline integration
Business Value:
- Self-service for product owners without a developer bottleneck
- Dramatic reduction in test maintenance time
- Better test coverage through lower creation costs
Conclusion: Why You Should Suggest a Specialized Solution to Your PO
AI test automation is a fascinating problem that goes far beyond “LLM + Browser.” The technical challenges are real and complex:
- Context management for large DOMs
- Visual-DOM mapping for robust element identification
- Selector strategies for maintainable tests
- Multi-agent orchestration for optimal performance
As a developer, you understand this complexity. When your product owner asks: “Can’t we just use ChatGPT?”, you can explain why a specialized solution is necessary.
The value for you:
- Less time spent on flaky test maintenance
- More test coverage without proportionally more effort
- Better collaboration with business stakeholders
- Focus on more interesting development tasks
The future of E2E test automation doesn’t lie in generic AI tools, but in specialized systems that thoughtfully solve these technical challenges.