
E2E Tests with AI: Technical Hurdles and Why It's More Complicated Than You Think
“Can’t we just use ChatGPT to write our E2E tests?” You probably hear this question frequently as a developer. The answer is more complicated than a simple yes or no. In this article, I’ll show you the technical reality behind AI-powered test automation – and why the problem is more interesting (and difficult) than it initially appears.
The Problem with Simple Approaches
Approach 1: Code Generation with LLMs
The most obvious approach is to have an LLM generate test code:
// AI-generated E2E test
test('User can login and update profile', async ({ page }) => {
  await page.goto('https://example.com');
  await page.fill('input[name="username"]', 'testuser');
  await page.fill('input[name="password"]', 'password123');
  await page.click('button[type="submit"]');
  await page.waitForNavigation();
  // and so on...
});
Technical Problems:
- Context Window Limitation: Large codebases quickly exceed the LLM’s context window
- Missing Visual Context: The LLM sees code but not the running application
- Unreliable Selectors: Generated CSS selectors break with UI changes
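To make the last point concrete, here is a hypothetical example; the class names and DOM structure are invented purely for illustration:

// A brittle selector an LLM might derive from today's markup
await page.click('#root > div:nth-child(2) > div > button.css-1q2w3e');

// After a harmless refactor (a new wrapper div, a regenerated CSS class),
// the same button lives at a different path and the test fails:
// #root > div:nth-child(2) > section > div > button.css-9z8y7x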
Approach 2: Browser Automation with LLMs
The next step: Let the LLM control the browser directly. The LLM receives screenshots and DOM, analyzes the current state, and decides what to do next.
Why this works technically:
- Modern vision models (GPT-4V, Claude 3.5 Sonnet) can interpret screenshots surprisingly well
- LLMs understand natural language test descriptions
- Browser automation APIs (Playwright, Selenium) are well documented
Why it fails in practice:
- Performance: Each test step requires an API call to the LLM (100-500ms+)
- Cost: A single E2E test can cost $0.10-1.00+
- Scaling: CI pipelines become unaffordable with dozens of tests
- Determinism: The same test can take a different path on every run, because the model's decisions aren't deterministic
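To see where the latency and cost come from, here is a minimal sketch of such a control loop, assuming a hypothetical callLLM() helper that wraps whichever model API you use; every iteration of the loop pays for one model round-trip:

// Minimal sketch of an LLM-in-the-loop test driver (not production code).
// callLLM() is an assumed helper around your model API of choice.
async function runAiDrivenTest(page, testDescription) {
  for (let step = 0; step < 30; step++) {
    const screenshot = await page.screenshot(); // visual context (Buffer)
    const dom = await page.content();           // textual context (HTML string)

    // One slow, paid API round-trip per step
    const action = await callLLM({ testDescription, screenshot, dom });

    if (action.type === 'done') return;
    if (action.type === 'click') await page.click(action.selector);
    if (action.type === 'fill') await page.fill(action.selector, action.value);
  }
  throw new Error('Step limit reached without finishing the test');
}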
Hybrid Approach: Record & Replay
The most promising approach combines AI generation with traditional test execution:
- Recording Phase: AI executes the test once and records actions
- Code Generation: Actions are translated into stable test code
- Replay Phase: Tests run without AI involvement
- Regeneration: AI can re-record the test when errors occur
class HybridTestRecorder {
  async recordTest(testDescription) {
    const actions = [];
    // AI-controlled execution with recording
    while (!this.isTestComplete()) {
      const screenshot = await this.takeScreenshot();
      const action = await this.aiAgent.nextAction(screenshot, testDescription);
      actions.push({
        type: action.type,
        selector: this.generateStableSelector(action.element),
        data: action.data,
        screenshot: screenshot,
      });
      await this.executeAction(action);
    }
    return this.generatePlaywrightCode(actions);
  }
}
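Building on that sketch, the regeneration step could look roughly like this; runGeneratedTest() is an assumed helper that executes the generated Playwright code, and the single-retry policy is just one possible design:

// Sketch of the replay/regeneration loop around the recorder above.
async function runWithSelfHealing(recorder, testDescription, generatedCode) {
  try {
    // Fast path: deterministic replay, no AI involved
    await runGeneratedTest(generatedCode);
  } catch (error) {
    console.warn('Replay failed, re-recording with AI:', error.message);
    // Slow path: let the AI re-record the scenario once and retry
    const regeneratedCode = await recorder.recordTest(testDescription);
    await runGeneratedTest(regeneratedCode);
  }
}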
Why Existing Tools Aren’t Sufficient
Problem with Generic AI Assistants
ChatGPT, Claude & Co. aren’t optimized for browser automation:
- No specialized DOM processing
- No robust selector strategies
- No integration with test frameworks
- No understanding of test maintenance
Problem with Simple Browser Automation Tools
Tools like Claude Desktop with Playwright MCP only solve surface problems:
- Work for demos, not for production
- No selector optimization
- No test management features
- No CI/CD integration
The Three Core Technical Problems
1. The Context Problem
The DOM of a modern web app can quickly grow to 2-10 MB. At roughly four characters per token, that is on the order of half a million to a few million tokens, so even GPT-4 with a 128k-token context window cannot take in the complete DOM of a modern single-page application.
Solution Approaches:
- DOM Pruning: Remove irrelevant elements
- Semantic DOM: Extract only interactive elements
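A minimal sketch of such a semantic extraction could look like this, assuming it runs inside the page (for example via Playwright's page.evaluate()); the attribute list is an illustrative selection, not a complete one:

// Rough sketch: reduce the DOM to interactive elements before prompting the LLM.
// Intended to run in the browser context, e.g. await page.evaluate(extractSemanticDom)
function extractSemanticDom() {
  const interactive = document.querySelectorAll(
    'a, button, input, select, textarea, [role="button"], [role="link"], [contenteditable]'
  );
  return Array.from(interactive).map((el) => ({
    tag: el.tagName.toLowerCase(),
    text: (el.textContent || '').trim().slice(0, 80),
    name: el.getAttribute('aria-label') || el.getAttribute('name') || null,
    testId: el.getAttribute('data-testid') || null,
    visible: el.offsetParent !== null, // crude visibility check
  }));
}

A few hundred entries like this fit comfortably into a prompt where the raw HTML would not.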
2. The Visual-DOM Mapping Problem
The LLM sees a “Search” button in the screenshot, but the DOM only contains:
<button class="btn-primary">
  <svg viewBox="0 0 24 24">
    <path
      d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3 9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z"
    />
  </svg>
</button>
How is the LLM supposed to know that this SVG icon is the “Search” button it is looking for?
Technical Solution Approaches:
- Computer vision for icon recognition
- Multi-modal embedding for visual-DOM alignment
- Accessible name computation according to ARIA standards
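For the last point, here is a deliberately simplified sketch of accessible-name computation; the real accname algorithm in the ARIA spec covers many more cases (labelledby chains, form labels, alt text), so treat this as an illustration rather than a spec-compliant implementation:

// Simplified sketch of accessible-name computation
function getAccessibleName(el) {
  // 1. aria-labelledby wins over everything else
  const labelledby = el.getAttribute('aria-labelledby');
  if (labelledby) {
    const ref = document.getElementById(labelledby.split(' ')[0]);
    if (ref) return ref.textContent.trim();
  }
  // 2. Explicit aria-label
  const ariaLabel = el.getAttribute('aria-label');
  if (ariaLabel) return ariaLabel.trim();
  // 3. Visible text content
  if (el.textContent.trim()) return el.textContent.trim();
  // 4. title attribute as a last resort
  return (el.getAttribute('title') || '').trim();
}

For the icon-only button above, this returns an empty string, which is exactly why the visual approaches in the first two bullets are needed as a fallback.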
3. The Selector Stability Problem
A reliable E2E test needs stable selectors. The LLM must not only find an element but also generate a robust selector for it.
Selector Hierarchy (from stable to fragile):
- data-testid attributes (ideal, but rarely available)
- Semantic selectors (role, aria-label)
- Relative positioning to known elements
- CSS classes and IDs
- XPath with absolute positions (fragile)
// Robust selector algorithm
function generateSelector(element, dom) {
  if (element.dataset.testid) {
    return `[data-testid="${element.dataset.testid}"]`;
  }
  if (element.getAttribute('aria-label')) {
    return `[aria-label="${element.getAttribute('aria-label')}"]`;
  }
  // Fallback: Relative positioning
  const nearbyLandmark = findNearestLandmark(element, dom);
  if (nearbyLandmark) {
    return `${nearbyLandmark.selector} >> ${getRelativeSelector(element)}`;
  }
  // Last resort: CSS path
  return generateCSSPath(element);
}
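In the hybrid recorder sketched earlier, generateStableSelector() would typically delegate to a function like this one, so that every recorded action is stored with the most robust selector available rather than whatever the AI happened to click on.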
The Value of Specialized Solutions
A well-thought-out AI test automation solution should provide:
Technical Excellence:
- Multi-agent architecture for optimal specialization
- Intelligent context optimization
- Robust selector generation
- Efficient caching and retry mechanisms
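To make the last point less abstract, caching could be as simple as keying the AI's decision on a hash of the observed state, so an unchanged screen never pays for a second model call; the class below is an illustrative sketch, not part of any particular product:

// Sketch: memoize AI decisions keyed by a hash of screenshot + instruction
const crypto = require('crypto');

class CachedAiAgent {
  constructor(aiAgent) {
    this.aiAgent = aiAgent;
    this.cache = new Map();
  }

  async nextAction(screenshot, testDescription) {
    const key = crypto
      .createHash('sha256')
      .update(screenshot)        // Buffer from page.screenshot()
      .update(testDescription)
      .digest('hex');

    if (!this.cache.has(key)) {
      this.cache.set(key, await this.aiAgent.nextAction(screenshot, testDescription));
    }
    return this.cache.get(key);
  }
}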
Developer Experience:
- Seamless integration into existing test suites
- Export to common frameworks (Playwright, Cypress, Selenium)
- Test versioning and diffing
- CI/CD pipeline integration
Business Value:
- Self-service for product owners without a developer bottleneck
- Dramatic reduction in test maintenance time
- Better test coverage through lower creation costs
Conclusion: Why You Should Suggest a Specialized Solution to Your PO
AI test automation is a fascinating problem that goes far beyond “LLM + Browser.” The technical challenges are real and complex:
- Context management for large DOMs
- Visual-DOM mapping for robust element identification
- Selector strategies for maintainable tests
- Multi-agent orchestration for optimal performance
As a developer, you understand this complexity. When your product owner asks: “Can’t we just use ChatGPT?”, you can explain why a specialized solution is necessary.
The value for you:
- Less time spent on flaky test maintenance
- More test coverage without proportionally more effort
- Better collaboration with business stakeholders
- Focus on more interesting development tasks
The future of E2E test automation doesn’t lie in generic AI tools, but in specialized systems that thoughtfully solve these technical challenges.