Introduction
Search1API's Deepcrawl API changes how developers collect web content for large language models. This asynchronous crawling solution processes entire websites with minimal configuration, making it a strong fit for building comprehensive knowledge bases for Retrieval-Augmented Generation (RAG) systems. At just 20 credits per request, Deepcrawl gives AI applications an affordable way to access and utilize web-based information.
Authentication
Like all Search1API endpoints, you'll need to authenticate using your Bearer token:
Authorization: Bearer your_api_key_here
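In Python, that header is a one-line dictionary you can reuse across every Deepcrawl call. A minimal sketch, with `your_api_key_here` as a placeholder for your real key:

```python
# Replace with your actual Search1API key; this value is a placeholder.
API_KEY = "your_api_key_here"

# Every Search1API request carries the key as a Bearer token in the
# Authorization header; JSON bodies also need a Content-Type header.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```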
How It Works
Deepcrawl operates on a simple yet powerful asynchronous model:
- Initiate a crawl task with your target URL
- Receive a task ID for tracking
- Check status until completion (typically within 1 minute)
- Access the comprehensive crawl results
The API offers two flexible crawling modes to match your specific needs:
- Sitemap mode (default): Processes only links defined in the website's sitemap.xml
- All mode: Discovers and crawls all findable links throughout the website
Here's how to start a crawl:
POST https://api.search1api.com/deepcrawl
{
"url": "https://search1api.com",
"type": "sitemap"
}
The API responds with a task ID for tracking:
{
"taskId": "abc123xyz",
"status": "queued"
}

You can then check the status using:
GET https://api.search1api.com/deepcrawl/status/{taskId}
Which returns the current status:
{
"taskId": "abc123xyz",
"status": "processing",
"message": "Crawling in progress"
}

Once complete (typically within a minute), you'll receive the full results of your crawl task.
Key Features
Asynchronous Processing
Deepcrawl handles resource-intensive crawling tasks in the background, letting your application continue working while the API does the heavy lifting. Results are typically available within just 1 minute.
Flexible Crawling Strategies
Choose between targeted sitemap crawling for efficiency or comprehensive link discovery for completeness. This flexibility lets you balance between speed and thoroughness based on your specific needs.
Complete Website Processing
Rather than handling individual pages, Deepcrawl systematically processes entire websites, ensuring your knowledge base is comprehensive and up-to-date.
Simple Task Management
The intuitive task-based interface makes it easy to initiate, monitor, and manage multiple crawl operations with minimal code and configuration.
Perfect for RAG Applications
What is RAG?
Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. Instead of relying solely on the model's training data, RAG systems retrieve relevant information from a knowledge base before generating responses, dramatically improving accuracy and relevance.
Building Knowledge Bases with Deepcrawl
Deepcrawl is the ideal solution for creating comprehensive, up-to-date knowledge bases for RAG systems:
- Comprehensive Content Collection: Capture complete website content in a single operation
- Structured Data Organization: Content is organized logically based on site structure
- Efficient Processing: Asynchronous design handles large websites without overwhelming your application
- Rapid Implementation: From API call to usable knowledge base in minutes, not hours or days
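To go from crawl results to a RAG-ready knowledge base, the crawled pages are typically split into overlapping chunks for embedding. Here's a minimal sketch; the shape of the crawl results is an assumption (a list of dicts with `url` and `content` keys), so adjust the field names to the actual response your account returns:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks for embedding.
    overlap must be smaller than chunk_size, or the loop won't advance."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def build_documents(crawl_results):
    """Flatten crawled pages into source-tagged chunks for a vector store.
    NOTE: the {"url": ..., "content": ...} page shape is an assumption,
    not the documented Deepcrawl response schema."""
    documents = []
    for page in crawl_results:
        for chunk in chunk_text(page.get("content", "")):
            documents.append({"source": page.get("url"), "text": chunk})
    return documents
```

Each chunk keeps a pointer back to its source URL, so the RAG system can cite where retrieved passages came from.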
Implementation Tips
Choosing the Right Crawl Mode
- When to use Sitemap Mode:
- For well-organized websites with comprehensive sitemaps
- When you need only the most important site content
- For faster, more efficient crawls
- When targeting specific sections defined in the sitemap
- When to use All Mode:
- For websites without sitemaps or with incomplete sitemaps
- When you need absolutely all available content
- For creating exhaustive knowledge bases
- When discovering hidden or unlisted content is important
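One practical way to apply these guidelines is to probe for a sitemap before choosing the mode. This is a sketch, not part of the Deepcrawl API itself: it assumes a sitemap lives at the conventional `/sitemap.xml` path and uses a plain HEAD request, with the HTTP session injectable so the check can be stubbed in tests:

```python
import requests

def choose_crawl_mode(site_url, session=requests):
    """Return "sitemap" when the site exposes a sitemap.xml, else "all".

    Assumption: the sitemap sits at the conventional /sitemap.xml path;
    sites can declare other locations in robots.txt, which this sketch
    does not check.
    """
    sitemap_url = site_url.rstrip("/") + "/sitemap.xml"
    try:
        response = session.head(sitemap_url, timeout=5, allow_redirects=True)
        if response.status_code == 200:
            return "sitemap"
    except requests.RequestException:
        pass
    return "all"
```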
Optimizing Your RAG Implementation
import requests
import time

def create_knowledge_base(website_url, crawl_type="sitemap"):
    """Start a Deepcrawl task and return the results once it completes."""
    # Step 1: Initialize the crawl
    headers = {
        'Authorization': 'Bearer your_api_key_here',
        'Content-Type': 'application/json'
    }
    data = {
        'url': website_url,
        'type': crawl_type
    }

    # Start the crawl task
    response = requests.post(
        'https://api.search1api.com/deepcrawl',
        headers=headers,
        json=data
    )
    response.raise_for_status()
    task_id = response.json()['taskId']

    # Step 2: Check status until complete
    while True:
        status_response = requests.get(
            f'https://api.search1api.com/deepcrawl/status/{task_id}',
            headers=headers
        )
        status_response.raise_for_status()
        status_data = status_response.json()

        if status_data['status'] == 'completed':
            # Process the results for your RAG system
            return status_data['results']
        elif status_data['status'] == 'failed':
            raise Exception(f"Crawl failed: {status_data['message']}")

        # Wait before checking again (typically ready within 1 minute)
        time.sleep(10)

Error Handling
Implement appropriate retry logic and error handling for rare cases when a crawl might take longer than expected or encounter temporary issues with the target website.
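One way to sketch that retry logic is a polling helper with exponential backoff and an overall timeout. Both the status check and the sleep function are injectable here (an illustrative pattern, not part of the Deepcrawl API), which also makes the helper easy to unit-test:

```python
import time

def wait_for_crawl(check_status, max_wait=300, initial_delay=5, sleep=time.sleep):
    """Poll a crawl task with exponential backoff until it finishes.

    check_status() should return the status JSON (a dict with at least a
    "status" key), e.g. a wrapper around GET /deepcrawl/status/{taskId}.
    Raises TimeoutError if the task is still running after max_wait seconds.
    """
    delay = initial_delay
    waited = 0
    while waited < max_wait:
        data = check_status()
        status = data.get("status")
        if status == "completed":
            return data
        if status == "failed":
            raise RuntimeError(f"Crawl failed: {data.get('message', 'unknown error')}")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, 60)  # back off, capped at 60s between polls
    raise TimeoutError("Crawl did not finish within the allotted time")
```

The backoff keeps polling light on long crawls while still catching fast completions within the first few seconds.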
Why Developers Choose Deepcrawl API
Deepcrawl stands out because it:
- Works asynchronously, freeing your application from waiting for results
- Processes complete websites in a single operation
- Delivers results quickly, typically within a minute
- Offers flexible crawling strategies to match your specific needs
- Integrates seamlessly with RAG implementations
Getting Started
Visit our API documentation to start building powerful knowledge bases for your AI applications today. Transform how your large language models access and utilize web content!
Whether you're enhancing a chatbot, building an AI research assistant, or creating a domain-specific knowledge system, Deepcrawl API provides the foundation for more accurate, relevant, and valuable AI-generated content.