Jobs API

Overview

Jobs are individual executions of your scraping recipes. Each job tracks the progress, results, and any errors that occur during the scraping process.

Endpoints

Get Job Results

Retrieve the results of a specific job.

GET /recipes/{id}/jobs/{jobId}/results

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| id | string | Recipe ID |
| jobId | string | Job ID |

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| page | integer | Page number for paginated results |
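
For illustration, here is a minimal sketch of paging through a job's results. The base URL, the x-api-key header, and the assumption that each page returns a plain array of rows are placeholders rather than documented behavior; adapt them to your host, authentication scheme, and actual response shape.

// Minimal sketch: collect every result page for a job.
// BASE_URL and the x-api-key header are illustrative assumptions.
const BASE_URL = "https://api.example.com";

async function fetchAllResults(apiKey, recipeId, jobId) {
  const rows = [];
  let page = 1;

  while (true) {
    const res = await fetch(
      `${BASE_URL}/recipes/${recipeId}/jobs/${jobId}/results?page=${page}`,
      { headers: { "x-api-key": apiKey } }
    );
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);

    const data = await res.json();
    // Assumes each page is a plain array of rows; stop on an empty page.
    if (!Array.isArray(data) || data.length === 0) break;

    rows.push(...data);
    page += 1;
  }

  return rows;
}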

Job Lifecycle

Status Flow

graph LR
A[Created] --> B[Pending]
B --> C[Running]
C --> D[Completed]
C --> E[Failed]

Status Definitions

| Status | Description | Next States |
| --- | --- | --- |
| pending | Job is queued and waiting to start | running |
| running | Job is currently executing | completed, failed |
| completed | Job has finished successfully | - |
| failed | Job encountered an error | - |

Job Object

{
  "id": "string",
  "status": "pending | running | completed | failed",
  "startDate": "2024-01-01T00:00:00Z",
  "endDate": "2024-01-01T00:00:00Z",
  "totalPaginationPages": 0,
  "totalPagesScraped": 0,
  "totalRows": 0,
  "error": "string"
}
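
A common pattern is to poll the job until it reaches a terminal state. The sketch below assumes a job detail endpoint of the form GET /recipes/{id}/jobs/{jobId} that returns the job object above; that path, BASE_URL, and the x-api-key header are assumptions for illustration, so verify them against your API reference.

// Minimal polling sketch: wait until the job completes or fails.
// The job detail endpoint path is an assumption, not documented here.
async function waitForJob(apiKey, recipeId, jobId, intervalMs = 5000) {
  while (true) {
    const res = await fetch(`${BASE_URL}/recipes/${recipeId}/jobs/${jobId}`, {
      headers: { "x-api-key": apiKey },
    });
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);

    const job = await res.json();
    if (job.status === "completed") return job;
    if (job.status === "failed") throw new Error(job.error);

    // Still pending or running: wait before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}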

Progress Monitoring

Track your job's progress using these metrics:

| Metric | Description | Example |
| --- | --- | --- |
| totalPaginationPages | Total number of pages to process | 10 |
| totalPagesScraped | Number of pages processed so far | 7 |
| totalRows | Number of data rows extracted | 350 |

Progress Calculation

const progress = (totalPagesScraped / totalPaginationPages) * 100;
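
A slightly fuller sketch that guards against division by zero while the total page count is still unknown (field names match the job object above):

// Returns completion as a percentage, or null when totalPaginationPages
// has not been determined yet (still 0).
function jobProgress(job) {
  if (!job.totalPaginationPages) return null;
  return (job.totalPagesScraped / job.totalPaginationPages) * 100;
}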

Error Handling

When a job fails, the error message is stored in the job object's error field. Common error scenarios include:

  • Network timeouts
  • Rate limiting
  • Invalid selectors
  • Website structure changes

Example Error Response

{
  "id": "job_123",
  "status": "failed",
  "error": "Rate limit exceeded: Too many requests to target website"
}

Best Practices

  1. Monitoring

    • Regularly check job status
    • Set up notifications for job completion/failure
    • Monitor progress for long-running jobs
  2. Error Recovery

    • Implement retry logic for failed jobs
    • Use exponential backoff for rate limits (see the sketch after this list)
    • Log detailed error information
  3. Resource Management

    • Limit concurrent jobs
    • Set appropriate timeouts
    • Clean up completed job data
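
As referenced in the error recovery item above, here is a minimal sketch of retrying with exponential backoff. The startJob(recipeId) helper that launches a job is hypothetical; substitute whatever call your integration actually uses to submit a job, and tune the attempt and delay limits to your workload.

// Retries a hypothetical startJob(recipeId) call with exponential backoff.
// startJob, maxAttempts, and baseDelayMs are illustrative placeholders.
async function runWithBackoff(recipeId, maxAttempts = 5, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await startJob(recipeId);
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;

      // Double the delay on every attempt: 1s, 2s, 4s, 8s, ...
      const delay = baseDelayMs * 2 ** attempt;
      console.error(
        `Attempt ${attempt + 1} failed (${err.message}); retrying in ${delay} ms`
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}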