
Agentic Coding Annotator - Online Tasks

Turing
Full-time · Remote · Mid-level

Required Skills

Python, JavaScript, Java, C++, SQL

Job Description

Evaluate and improve datasets for agentic coding models through detailed reviews.

About Turing

Turing is one of the world’s fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. Turing helps customers in two ways: working with the world’s leading AI labs to advance frontier model capabilities in thinking, reasoning, coding, agentic behavior, multimodality, multilinguality, STEM, and frontier knowledge; and leveraging that work to build real-world AI systems that solve mission-critical priorities for companies.


Role Overview

We are looking for strong, detail-oriented software practitioners to help evaluate and improve datasets for agentic coding models.

This role involves working with realistic coding tasks in an agentic coding harness, reviewing model trajectories, verifying solutions, and producing high-quality annotations.

Depending on the assignment, the work may include:

  • Online evaluations: Interacting with blinded models on predefined tasks, then ranking and grading the resulting trajectories
  • Offline evaluations: Designing realistic coding tasks, calibrating them through user simulation, writing task-specific rubrics, and grading generated trajectories

This is not a basic annotation role. It requires strong engineering judgment and the ability to read and debug code, validate behavior, follow detailed process rules, and make consistent evaluation decisions across model runs.

We are specifically looking for candidates with enough engineering maturity to independently handle real-world software tasks, not just toy problems or superficial code review exercises.


What the Day-to-Day Looks Like

  • Execute realistic coding tasks within the assigned agentic coding harness while maintaining model blindness and session independence
  • Follow task instructions, milestones, planned interactions, and evaluation guardrails consistently across runs
  • Verify model outputs by reading code, running commands, checking logs, and validating generated artifacts
  • Write clear, specific, evidence-based rationales for trajectory rankings and assessments
  • Design multi-step, realistic coding tasks (offline work), including user intent, milestones, and expected behaviors
  • Create and refine task-specific rubrics and binary evaluation criteria
  • Review completed work for quality, consistency, completeness, and schema compliance
  • Identify and escalate broken environments, unclear instructions, or process gaps with supporting evidence

Requirements

Software Engineering Fluency (Mandatory)

  • 5+ years of experience in software engineering, QA, developer tooling, data/ML engineering, or similar code-heavy roles
  • Strong hands-on experience in at least 1–2 programming languages or ecosystems
  • Ability to:
    Read and understand unfamiliar codebases
    Run and interpret tests, scripts, and CLI tools
    Debug issues and reason about edge cases or partial fixes
    Evaluate whether an implementation is functionally correct

Representative languages include:
Python, JavaScript/TypeScript, Rust, Java, C/C++, Bash/CLI environments, Haskell, Swift, SQL, or other production-relevant ecosystems


Terminal & Tooling Skills (Mandatory)

  • Comfortable working in Linux/Ubuntu-like environments
  • Proficient with:
    Terminal workflows
    Git basics
    Code editors or IDEs
    Package managers and test runners
    JSON, YAML, and Markdown
  • Familiarity with Docker and reproducible environments (strong plus, especially for offline work)

Coding-Agent Workflow Familiarity (Mandatory)

  • Comfortable working with or quickly adapting to agentic coding environments, such as:
    OpenCode
    Claude Code
    Cursor
    Similar coding-agent tools
  • Ability to execute and evaluate tasks within agent-driven workflows

Quality Judgment & Annotation Accuracy (Mandatory)

  • Ability to:
    Compare multiple model trajectories and identify meaningful differences
    Distinguish correctness from style, communication quality, and agent behavior
    Evaluate solutions consistently using defined rubrics
    Follow detailed process instructions without deviation
    Maintain consistency across repeated or similar evaluations
    Write concise, evidence-based rationales (not generic summaries)

Work Style

  • Highly detail-oriented and process-driven
  • Comfortable with repetitive, high-precision evaluation work
  • Able to maintain consistency across long tasks and multiple model runs
  • Proactively flags ambiguity instead of making assumptions
  • Balances realism with strict evaluation consistency

Perks of Freelancing With Turing

  • Work on cutting-edge AI projects with leading foundation model companies
  • Collaborate on high-impact work at the frontier of LLM evaluation and reasoning
  • Remote, flexible opportunities with global teams
  • Competitive compensation based on experience and project scope

Offer Details

  • Commitment Required: 8 hours per day, with a 4-hour overlap with PST.
  • Employment Type: Contractor position (Note: this role does not include medical/paid leave).
  • Duration of Contract: 5 weeks; expected start date is next week.