Authentication is provided via Basic Auth. For anything non-trivial, we recommend that you create a separate user for your service.
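As a minimal sketch of what Basic Auth means on the wire, the snippet below builds the Authorization header by hand. The function name and the service account credentials are hypothetical; most HTTP clients can produce this header for you.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header for HTTP Basic Auth."""
    # Basic Auth is the base64 encoding of "username:password".
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical dedicated service user; substitute your own credentials.
headers = basic_auth_header("my-service", "s3cret")
```

Because the credentials are only encoded, not encrypted, they should only ever be sent over HTTPS.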
All endpoints should be accessible via their regular URLs in HTML form thanks to our browsable API.
Concurrency control ensures the correct processing of data under concurrent operations by clients.
We implement optimistic concurrency control using DRF-extensions, and their documentation describes the purpose and workflow pretty well: http://chibisov.github.io/drf-extensions/docs/#concurrency-control
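As a rough illustration of optimistic concurrency control from the client side, the sketch below re-reads the resource and retries whenever an update is rejected with 412 Precondition Failed. The `fetch` and `send` callables are hypothetical stand-ins for your own HTTP calls (a GET that returns the ETag and body, and a PUT or PATCH sent with an If-Match header); they are not part of this API.

```python
from typing import Callable, Tuple

# fetch() returns (etag, resource); send(etag, resource) returns an HTTP status code.
Fetch = Callable[[], Tuple[str, dict]]
Send = Callable[[str, dict], int]


def update_with_retry(fetch: Fetch, send: Send, changes: dict, max_attempts: int = 3) -> int:
    """Apply `changes`, re-reading the resource whenever the server answers 412."""
    status = 412
    for _ in range(max_attempts):
        etag, resource = fetch()        # note the current ETag
        resource.update(changes)        # apply our edit to the fresh copy
        status = send(etag, resource)   # the caller sends this with If-Match: <etag>
        if status != 412:               # 412 means our copy was stale; loop and retry
            break
    return status
```

Capping the number of attempts keeps a busy resource from causing an endless retry loop; the final status is returned so the caller can decide what to do next.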
Typical flow:
1. Retrieve the resource (note the ETag header)
2. Update the resource (entry_points and If-Match header); if it has changed in the meantime => 412
3. Retrieve the resource again (note the new ETag header)
4. Update the resource (entry_points and newer If-Match header) => 200
Any errors that occur are reported under an error key in the response.
Each key in the dictionary will be the field name, and the values will be lists of strings of any error messages corresponding to that field.
The non_field_errors key may also be present, and will list any general validation errors.
{
"docs": "https://portal.hanzoarchives.com/api/docs",
"error": {
"organization": [
"This field is required."
],
"name": [
"This field is required."
]
}
}
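An error body of this shape can be turned into readable messages with a few lines of code. This is a minimal sketch; the function name is hypothetical and the labelling of non_field_errors as "general" is just one possible presentation.

```python
def flatten_errors(body: dict) -> list[str]:
    """Turn the `error` dictionary of a response body into a flat list of messages."""
    messages = []
    for field, problems in body.get("error", {}).items():
        # non_field_errors holds general validation errors not tied to one field.
        label = "general" if field == "non_field_errors" else field
        for problem in problems:
            messages.append(f"{label}: {problem}")
    return messages


example = {
    "docs": "https://portal.hanzoarchives.com/api/docs",
    "error": {
        "organization": ["This field is required."],
        "name": ["This field is required."],
    },
}
for line in flatten_errors(example):
    print(line)
```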
We use consistent identifiers across all of our endpoints.
Whenever referencing an object relation, you should find that the identifier used is as per the following quick reference guide:
Name | Identifier |
---|---|
Archive Unit | crawlkey (Organization code/Archive Unit name) |
Crawl | uuid |
Export | uuid |
Plugin | slug |
Scope | slug |
Organization | code |
User | username |
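As an illustration of the Archive Unit identifier, the sketch below builds a crawlkey from an organization code and a name, then percent-encodes it for use in a URL path. The helper names are hypothetical, and the URL layout is an assumption taken from the `url` field shown in the archive unit response examples in this document.

```python
from urllib.parse import quote

API_ROOT = "https://portal.hanzoarchives.com/api"


def make_crawlkey(organization_code: str, name: str) -> str:
    """Combine the organization code and archive unit name into a crawlkey."""
    return f"{organization_code}/{name}"


def archive_unit_url(crawlkey: str) -> str:
    """Percent-encode the crawlkey for a URL path (the "/" separator is kept)."""
    return f"{API_ROOT}/archive-units/{quote(crawlkey)}"


crawlkey = make_crawlkey("EXAMPLE", "Example Archive")
print(archive_unit_url(crawlkey))
```

In practice, prefer the `url` value returned by the API over a hand-built one whenever a response is available.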
Requests that return multiple items will be paginated to 100 items by default.
You can specify further pages with the ?page parameter.
For some resources, you can also set a custom page size up to 1000 with the ?per_page parameter.
curl 'https://portal.hanzoarchives.com/api/crawls?page=2&per_page=100'
Note that page numbering is 1-based and that omitting the ?page parameter will return the first page.
For more information on our pagination implementation, check out GitHub's guide on Traversing with Pagination, on which ours is based.
The pagination info is included in the Link header. It is important to follow these Link header values instead of constructing your own URLs.
Link: <https://portal.hanzoarchives.com/api/crawls?page=2&per_page=100>; rel="next",
<https://portal.hanzoarchives.com/api/crawls?page=10&per_page=100>; rel="last"
Linebreak is included for readability.
This Link response header contains one or more Hypermedia link relations, some of which may require expansion as URI templates.
The possible rel values are:
Name | Description |
---|---|
next | The link relation for the immediate next page of results. |
last | The link relation for the last page of results. |
first | The link relation for the first page of results. |
prev | The link relation for the immediate previous page of results. |
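To follow these relations rather than building URLs by hand, a client needs to split the Link header into its rel values. The sketch below does this with a regular expression; the function name is hypothetical, and it deliberately ignores URI template expansion and any link parameters other than rel.

```python
import re

# Matches one `<url>; rel="name"` entry of a Link header.
_LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(value: str) -> dict[str, str]:
    """Map each rel value (next, last, first, prev) to its URL."""
    return {rel: url for url, rel in _LINK_PATTERN.findall(value)}


header = (
    '<https://portal.hanzoarchives.com/api/crawls?page=2&per_page=100>; rel="next", '
    '<https://portal.hanzoarchives.com/api/crawls?page=10&per_page=100>; rel="last"'
)
links = parse_link_header(header)
print(links.get("next"))
```

A typical loop keeps requesting `links.get("next")` until that key is absent, which signals the last page.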
In addition to the link header, we also expose headers that describe the page the response represents and totals for the full queryset.
Name | Description |
---|---|
X-Page | The page number |
X-Per-Page | The number of results per page |
X-Total | The total number of results |
X-Total-Pages | The total number of pages |
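These headers can be read into a small structure to report progress while paging. This is a sketch under the assumption that all four headers are present and hold integers; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PageInfo:
    page: int
    per_page: int
    total: int
    total_pages: int

    @property
    def has_next(self) -> bool:
        """True while there are pages after the current one."""
        return self.page < self.total_pages


def read_page_info(headers: Mapping[str, str]) -> PageInfo:
    """Convert the pagination response headers into integers."""
    return PageInfo(
        page=int(headers["X-Page"]),
        per_page=int(headers["X-Per-Page"]),
        total=int(headers["X-Total"]),
        total_pages=int(headers["X-Total-Pages"]),
    )
```

HTTP header names are case-insensitive, so pass the case-insensitive header mapping of your HTTP client where it offers one; a plain dict requires the exact spelling used here.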
Archives subscribe to a particular plugin and scope. The plugins and scopes available to your organization depend on what has been set up for you by our engineers.
For more information regarding what plugins, scopes and settings are available to your organization, sign in to your account or use the list plugins endpoint.
An archive within Hanzo Archives is the top-level representation of your capture, i.e. it stores all of the information required to perform a crawl of the website(s) you wish to capture. Archives track a plugin (and a scope of the plugin), which instructs our crawler how to interact with the content you want to capture. This plugin/scope also yields settings, which are stored against the archive.
A crawl in Hanzo Archives represents a capture of your archive at a given point in time. The bulk of the configuration is typically performed on the archive, so performing an additional crawl is relatively trivial.
You may have seen the request a capture form on the portal website. This form automates the creation of an archive, a crawl within it, and an export of that crawl, which is essentially how the API works:
{
"name": "Example Archive",
"crawlkey": "EXAMPLE/Example Archive", # generated from {{ organization code }}/{{ name }} if omitted
"organization": "EXAMPLE",
"entry_points": ["http://example.com"], # copied from seeds if omitted
"seeds": ["http://example.com"],
"plugin": "webpage",
"scope": "one_page_and_one_hop",
"settings": {
"all_video": "on",
"include_referered": "yes"
},
"tags": ["my first archive"],
"teams": ["managers"]
}
For more information on what settings you'll need to pass for your chosen plugin/scope see the settings documentation.
{
"organization": "EXAMPLE",
"archive_unit": "EXAMPLE/Example Archive",
"status": "requested:user"
}
{
"name": "ESIV-1",
"organization": "EXAMPLE",
"crawl": "41f087bc-ae7f-4ce8-9bd8-83e9a5a19373",
"status": "requested:user",
"type": "load_file"
}
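Read together, the three request bodies above appear to correspond to the archive, the crawl and the export, in that order. The sketch below builds the same bodies in Python. The function names are hypothetical, and no endpoint paths are assumed: sending each body is left to your HTTP client.

```python
def archive_payload(organization, name, seeds, plugin, scope, settings=None):
    """Body for creating an archive (crawlkey and entry_points are derived if omitted)."""
    return {
        "name": name,
        "organization": organization,
        "seeds": list(seeds),
        "plugin": plugin,
        "scope": scope,
        "settings": dict(settings or {}),
    }


def crawl_payload(organization, crawlkey):
    """Body for requesting a crawl of an existing archive."""
    return {"organization": organization, "archive_unit": crawlkey, "status": "requested:user"}


def export_payload(organization, crawl_uuid, name, export_type="load_file"):
    """Body for requesting an export of an existing crawl."""
    return {
        "name": name,
        "organization": organization,
        "crawl": crawl_uuid,
        "status": "requested:user",
        "type": export_type,
    }
```

Each step depends on an identifier from the previous one: the crawl needs the archive's crawlkey, and the export needs the crawl's uuid.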
Settings for the crawler vary depending on your chosen plugin/scope combination, any custom configuration set up for your organization, and whether you or a Hanzo engineer has customised your Archive Unit.
If you want to see what settings you need to pass when either creating or updating an Archive Unit, you can perform an OPTIONS request on either the list archive units or update an archive unit endpoints.
You'll find the information you need within the JSON structure at:
actions.(POST|PUT).plugin.choices[your plugin].scopes[your scope].settings
Any fields marked as field_required and with no default value specified will need to be passed via the settings property as key: value pairs in your create/partial update/update request.
Validation is performed when an Archive Unit is saved to ensure that the settings required by the plugin (and scope) have been supplied. If required settings are omitted or invalid values are passed, the API will return an error describing the problem(s).
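A client can run a similar check before saving. The sketch below assumes you have already extracted the settings list from the OPTIONS response, and that each entry is a dictionary with a `key`, a `field_required` flag and an optional `default`. That entry shape and the function name are assumptions for illustration; inspect a real OPTIONS response to confirm the actual structure.

```python
def missing_required_settings(available: list[dict], supplied: dict) -> list[str]:
    """List required settings that have no default and were not supplied."""
    missing = []
    for setting in available:
        required = setting.get("field_required", False)
        has_default = setting.get("default") is not None  # assumed key name
        if required and not has_default and setting["key"] not in supplied:
            missing.append(setting["key"])
    return missing


# Illustrative entries only; real plugins and scopes define their own settings.
available = [
    {"key": "all_video", "field_required": True},
    {"key": "include_referered", "field_required": True, "default": "off"},
    {"key": "max_depth", "field_required": False},
]
print(missing_required_settings(available, {}))
```

The server-side validation remains authoritative; a check like this only saves a round trip for the most common mistake.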
Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
jira_issue | string | |
jira_status | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array | |
teams | array | |
url | string | |
portal | string | |
portal_url | string | |
settings_url | string | |
plugin_module_settings | object | |
plugin_modules | array | |
created_at | datetime | |
updated_at | datetime | |
{
"name": "Example Archive",
"organization": "EXAMPLE",
"seeds": [
"http://www.example.com",
"http://blog.example.com"
],
"plugin": "webpage",
"scope": "one_page_and_one_hop",
"settings": {
"warcloader_url": "http://warcloader-1.hanzoman.com:1647/"
}
}
{
"name": "Example Archive",
"organization": "EXAMPLE",
"auth": {
"username": "test",
"password": "test"
},
"entry_points": [
"http://www.example.com"
],
"seeds": [
"http://www.example.com",
"http://blog.example.com"
],
"plugin": "webpage",
"scope": "one_page_and_one_hop",
"settings": {
"warcloader_url": "http://warcloader-1.hanzoman.com:1647/"
},
"metadata": {
"example_key": "example_value"
}
}
Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array[name] | |
teams | array[slug] | |
plugin_module_settings | object | |
plugin_modules | array[package_name] | |
{
"name": "Example Archive",
"crawlkey": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"entry_points": [
"http://www.example.com",
"http://blog.example.com"
],
"seeds": [
"http://www.example.com",
"http://blog.example.com"
],
"metadata": null,
"notes": null,
"jira_issue": null,
"jira_status": null,
"autoexport": false,
"plugin": "webpage",
"scope": "one_page_and_one_hop",
"settings": {},
"tags": [],
"teams": [],
"url": "https://portal.hanzoarchives.com/api/archive-units/EXAMPLE/Example%20Archive",
"portal": "https://portal.hanzoarchives.com/captures/example-archive",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive",
"settings_url": "https://portal.hanzoarchives.com/api/archive-units/EXAMPLE/Example%20Archive/settings",
"plugin_module_settings": null,
"plugin_modules": [],
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
{
"name": "Example Archive",
"crawlkey": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"entry_points": [
"http://www.example.com"
],
"seeds": [
"http://www.example.com",
"http://blog.example.com"
],
"metadata": {
"example_key": "example_value"
},
"notes": null,
"jira_issue": null,
"jira_status": null,
"autoexport": false,
"plugin": "webpage",
"scope": "one_page_and_one_hop",
"settings": {},
"tags": [],
"teams": [],
"url": "https://portal.hanzoarchives.com/api/archive-units/EXAMPLE/Example%20Archive",
"portal": "https://portal.hanzoarchives.com/captures/example-archive",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive",
"settings_url": "https://portal.hanzoarchives.com/api/archive-units/EXAMPLE/Example%20Archive/settings",
"plugin_module_settings": null,
"plugin_modules": [],
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
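The datetime fields in these responses, such as created_at and updated_at, can be converted into timezone-aware values. This sketch assumes every timestamp uses the second-precision UTC form shown in the examples (ending in "Z") and that a field may be null; the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp such as "2025-03-14T07:37:08Z" into an aware UTC datetime."""
    if value is None:  # fields may be null in a response
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


created = parse_timestamp("2025-03-14T07:37:08Z")
print(created.isoformat())
```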
Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
jira_issue | string | |
jira_status | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array | |
teams | array | |
url | string | |
portal | string | |
portal_url | string | |
settings_url | string | |
plugin_module_settings | object | |
plugin_modules | array | |
created_at | datetime | |
updated_at | datetime | |
{
"name": "Example Archive",
"crawlkey": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"entry_points": [
"http://example.com"
],
"seeds": [
"http://example.com"
],
"metadata": null,
"notes": null,
"jira_issue": null,
"jira_status": null,
"autoexport": false,
"plugin": "website",
"scope": "default",
"settings": {},
"tags": [],
"teams": [],
"url": "https://portal.hanzoarchives.com/api/archive-units/EXAMPLE/Example%20Archive",
"portal": "https://portal.hanzoarchives.com/captures/example-archive",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive",
"settings_url": "https://portal.hanzoarchives.com/api/archive-units/EXAMPLE/Example%20Archive/settings",
"plugin_module_settings": null,
"plugin_modules": [],
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
jira_issue | string | |
jira_status | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array | |
teams | array | |
url | string | |
portal | string | |
portal_url | string | |
settings_url | string | |
plugin_module_settings | object | |
plugin_modules | array | |
created_at | datetime | |
updated_at | datetime | |

Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array[name] | |
teams | array[slug] | |
plugin_module_settings | object | |
plugin_modules | array[package_name] | |

Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
jira_issue | string | |
jira_status | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array | |
teams | array | |
url | string | |
portal | string | |
portal_url | string | |
settings_url | string | |
plugin_module_settings | object | |
plugin_modules | array | |
created_at | datetime | |
updated_at | datetime | |

Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array[name] | |
teams | array[slug] | |
plugin_module_settings | object | |
plugin_modules | array[package_name] | |

Field | Type | Description |
---|---|---|
name | string | |
crawlkey | string | A unique identifier for this archive, constructed from {organization_code}/{name} |
organization | code | |
entry_points | array[string] | An array of URLs from which any new crawls can be entered via native access |
seeds | array[string] | An array of URLs the crawler starts from for any new crawls |
metadata | object | Additional user metadata store |
notes | string | |
jira_issue | string | |
jira_status | string | |
autoexport | boolean | |
plugin | slug | |
scope | slug | |
settings | object | Settings to be passed to the crawler (keys required depends on the plugin/scope) |
tags | array | |
teams | array | |
url | string | |
portal | string | |
portal_url | string | |
settings_url | string | |
plugin_module_settings | object | |
plugin_modules | array | |
created_at | datetime | |
updated_at | datetime | |

Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |

Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |

Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
organization | code | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
driver | string | Address of the crawl driver |
seeds | array[string] | An array of URLs the crawler starts from |
crawldata | object | A metadata object maintained by the crawler |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array | |
plugin | object | Plugin name, slug, scope and sha as derived on creation |
settings | object | Settings derived from the settings cascade on creation |
processing | int | The number of pages processing |
remaining | int | The number of pages remaining |
captured | int | The number of pages captured |
errored | int | The number of pages errored |
excluded | int | The number of pages excluded |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
captured_at | datetime | The date that the crawler started capturing |
capture_last_active_at | datetime | The date that the crawler last reported activity |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
created_at | datetime | |
updated_at | datetime | |
{
"archive_unit": "EXAMPLE/Example Archive",
"status": "requested:user"
}
{
"archive_unit": "EXAMPLE/Example Archive",
"entry_points": [
{
"uri": "http://blog.example.com/new-post-1"
},
{
"uri": "http://blog.example.com/new-post-2"
},
{
"uri": "http://blog.example.com/new-post-3"
}
],
"status": "requested:user",
"aggregate": true,
"components": [
"58a6f8dd-24e0-4c56-9309-efd831bb77ab",
"05e27bad-39b0-4500-a81a-1e8b504c0723",
"908e266a-ccb2-4726-bb54-f95a1a9700c5"
]
}
{
"archive_unit": "EXAMPLE/Example Archive",
"entry_points": [
{
"uri": "http://blog.example.com/new-post-1"
},
{
"uri": "http://blog.example.com/new-post-2"
}
],
"seeds": [
"http://blog.example.com/new-post-1",
"http://blog.example.com/new-post-2"
],
"status": "requested:user",
"partial": true
}
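The three request bodies above (a plain crawl, an aggregate crawl and a partial crawl) differ only in a few optional fields, so they can be produced by one builder. This is a sketch with a hypothetical function name; the client-side check mirrors the documented rule that an aggregate crawl requires components.

```python
def crawl_request(archive_unit, *, entry_points=None, seeds=None,
                  aggregate=False, partial=False, components=None):
    """Build a crawl request body for the given archive unit crawlkey."""
    if aggregate and not components:
        raise ValueError("an aggregate crawl requires at least one component crawl UUID")
    body = {"archive_unit": archive_unit, "status": "requested:user"}
    if entry_points:
        # Entry points are sent as objects with a `uri` key.
        body["entry_points"] = [{"uri": uri} for uri in entry_points]
    if seeds:
        body["seeds"] = list(seeds)
    if aggregate:
        body["aggregate"] = True
        body["components"] = list(components)
    if partial:
        body["partial"] = True
    return body


partial_body = crawl_request(
    "EXAMPLE/Example Archive",
    entry_points=["http://blog.example.com/new-post-1"],
    seeds=["http://blog.example.com/new-post-1"],
    partial=True,
)
```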
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
seeds | array[string] | An array of URLs the crawler starts from |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array[uuid] | |
settings | object | Settings derived from the settings cascade on creation |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
{
"name": "example-example-archive-202503140737",
"uuid": "3c78b345-f1a9-4bec-8e52-aa3c3387a3d6",
"archive_unit": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"status": "requested:user",
"entry_points": [
{
"uri": "http://example.com"
}
],
"driver": null,
"seeds": [
"http://example.com"
],
"crawldata": null,
"metadata": null,
"aggregate": false,
"partial": false,
"components": [],
"plugin": {
"requires_qa": false,
"name": "Website",
"sha": null,
"requires_eng": false,
"scope": {
"name": "Default",
"slug": "default"
},
"slug": "website"
},
"settings": {
"warcloader_ingestor_image": "icr.io/chronicle-prod/warcloader/ingestor:1e2b71cd",
"page_handler_image": "icr.io/chronicle-prod/hanzo-page-handler:5b3ce3b6",
"single_crawl_per_crawldb": "True",
"customer_code": "EXAMPLE",
"db_instance_type": "t3.micro",
"warcloader_postgres_image": "icr.io/chronicle-prod/warcloader/postgres:1e2b71cd",
"restrict_domains": "on",
"job_server_security_groups": "prod/jobservers",
"db_security_groups": "prod/frontiers",
"reset_errors": "on",
"ntp_server": "169.254.169.123",
"task_log_queue": "s3://hanzo.software/task_queues",
"portal_api_endpoint": "https://portal.hanzoarchives.com/api/",
"thomas_image": "icr.io/chronicle-prod/thomas-the-crawl-engine:f0a71682",
"include_referered": "off",
"warcloader_timeout": "1200",
"mitmproxy_image": "icr.io/chronicle-prod/hanzo-mitmproxy:94aaf1c3",
"proxy_image": "icr.io/chronicle-prod/hanzo-qt-warcproxy:a274c6db",
"entrypoint": [
"http://example.com"
],
"frontier_image": "icr.io/chronicle-prod/miyamoto/frontier:5a284946",
"warcloader_url": "http://warcloader.inf.hanzoman.com:1647",
"job_server_instance_type": "t3.large",
"max_depth": "2",
"extract_handler_image": "icr.io/chronicle-prod/hanzo-extract-handler:58d8ca9a",
"instance_manager_url": "http://instance-manager.inf.hanzoman.com:1666/",
"tags": [],
"frontier_db_image": "icr.io/chronicle-prod/miyamoto/frontier-db:4ec4b271",
"max_workers": "5",
"final_frontier_image": "icr.io/chronicle-prod/final-frontier:dde45433",
"crawl_id": "3c78b345-f1a9-4bec-8e52-aa3c3387a3d6",
"chromium_image": "icr.io/chronicle-prod/chrome:107.0.5304.87-1_fonts",
"job_server_ami": "ami-0d8f8c6131ba8d0b0",
"arkwright_image": "icr.io/chronicle-prod/mr-arkwright:ibm-cloud___20241209_120232",
"warcloader_queue_server_image": "icr.io/chronicle-prod/warcloader/queue_server:1e2b71cd",
"manifest_path": "s3://hanzo.manifests/",
"global_setup": "off",
"crawl": "example-example-archive-202503140737",
"customer": "",
"warcloader_aggregator_image": "icr.io/chronicle-prod/warcloader/aggregator:1e2b71cd",
"snapshot_video_use_proxy": "no",
"seeds": [
"http://example.com"
],
"output_path": "s3://hanzoenterprise/RaC/",
"au": "Example Archive",
"capture_scope": "default",
"job_server_subnets": "prod/semi-private/*",
"db_ami": "ami-0875fac396b63e4e1",
"warcloader_crons_image": "icr.io/chronicle-prod/warcloader/crons:1e2b71cd",
"db_subnets": "prod/private/*"
},
"processing": null,
"remaining": null,
"captured": null,
"errored": null,
"excluded": null,
"url": "https://portal.hanzoarchives.com/api/crawls/3c78b345-f1a9-4bec-8e52-aa3c3387a3d6",
"attachments_url": "https://portal.hanzoarchives.com/api/crawls/3c78b345-f1a9-4bec-8e52-aa3c3387a3d6/attachments",
"portal": "https://portal.hanzoarchives.com/captures/example-archive/3c78b345-f1a9-4bec-8e52-aa3c3387a3d6",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive/3c78b345-f1a9-4bec-8e52-aa3c3387a3d6",
"captured_at": null,
"capture_last_active_at": null,
"first_completed_at": null,
"last_completed_at": null,
"storage_state": "hot",
"nearline_storage_after_date": null,
"retrieved_until": null,
"nearline_storage_metadata": null,
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
{
"name": "example-example-archive-202503140737",
"uuid": "5f9a13ce-2d10-4693-945c-225b7850f466",
"archive_unit": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"status": "requested:user",
"entry_points": [
{
"uri": "http://blog.example.com/new-post-1"
},
{
"uri": "http://blog.example.com/new-post-2"
},
{
"uri": "http://blog.example.com/new-post-3"
}
],
"driver": null,
"seeds": [
"http://example.com"
],
"crawldata": null,
"metadata": null,
"aggregate": true,
"partial": false,
"components": [],
"plugin": {
"requires_qa": false,
"name": "Website",
"sha": null,
"requires_eng": false,
"scope": {
"name": "Default",
"slug": "default"
},
"slug": "website"
},
"settings": {
"warcloader_ingestor_image": "icr.io/chronicle-prod/warcloader/ingestor:1e2b71cd",
"page_handler_image": "icr.io/chronicle-prod/hanzo-page-handler:5b3ce3b6",
"single_crawl_per_crawldb": "True",
"customer_code": "EXAMPLE",
"db_instance_type": "t3.micro",
"warcloader_postgres_image": "icr.io/chronicle-prod/warcloader/postgres:1e2b71cd",
"restrict_domains": "on",
"job_server_security_groups": "prod/jobservers",
"db_security_groups": "prod/frontiers",
"reset_errors": "on",
"ntp_server": "169.254.169.123",
"task_log_queue": "s3://hanzo.software/task_queues",
"portal_api_endpoint": "https://portal.hanzoarchives.com/api/",
"thomas_image": "icr.io/chronicle-prod/thomas-the-crawl-engine:f0a71682",
"include_referered": "off",
"warcloader_timeout": "1200",
"mitmproxy_image": "icr.io/chronicle-prod/hanzo-mitmproxy:94aaf1c3",
"proxy_image": "icr.io/chronicle-prod/hanzo-qt-warcproxy:a274c6db",
"entrypoint": [
"http://blog.example.com/new-post-1",
"http://blog.example.com/new-post-2",
"http://blog.example.com/new-post-3"
],
"frontier_image": "icr.io/chronicle-prod/miyamoto/frontier:5a284946",
"warcloader_url": "http://warcloader.inf.hanzoman.com:1647",
"job_server_instance_type": "t3.large",
"max_depth": "2",
"extract_handler_image": "icr.io/chronicle-prod/hanzo-extract-handler:58d8ca9a",
"instance_manager_url": "http://instance-manager.inf.hanzoman.com:1666/",
"tags": [],
"frontier_db_image": "icr.io/chronicle-prod/miyamoto/frontier-db:4ec4b271",
"max_workers": "5",
"final_frontier_image": "icr.io/chronicle-prod/final-frontier:dde45433",
"crawl_id": "5f9a13ce-2d10-4693-945c-225b7850f466",
"chromium_image": "icr.io/chronicle-prod/chrome:107.0.5304.87-1_fonts",
"job_server_ami": "ami-0d8f8c6131ba8d0b0",
"arkwright_image": "icr.io/chronicle-prod/mr-arkwright:ibm-cloud___20241209_120232",
"warcloader_queue_server_image": "icr.io/chronicle-prod/warcloader/queue_server:1e2b71cd",
"manifest_path": "s3://hanzo.manifests/",
"global_setup": "off",
"crawl": "example-example-archive-202503140737",
"customer": "",
"warcloader_aggregator_image": "icr.io/chronicle-prod/warcloader/aggregator:1e2b71cd",
"snapshot_video_use_proxy": "no",
"seeds": [
"http://example.com"
],
"output_path": "s3://hanzoenterprise/RaC/",
"au": "Example Archive",
"capture_scope": "default",
"job_server_subnets": "prod/semi-private/*",
"db_ami": "ami-0875fac396b63e4e1",
"warcloader_crons_image": "icr.io/chronicle-prod/warcloader/crons:1e2b71cd",
"db_subnets": "prod/private/*"
},
"processing": null,
"remaining": null,
"captured": null,
"errored": null,
"excluded": null,
"url": "https://portal.hanzoarchives.com/api/crawls/5f9a13ce-2d10-4693-945c-225b7850f466",
"attachments_url": "https://portal.hanzoarchives.com/api/crawls/5f9a13ce-2d10-4693-945c-225b7850f466/attachments",
"portal": "https://portal.hanzoarchives.com/captures/example-archive/5f9a13ce-2d10-4693-945c-225b7850f466",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive/5f9a13ce-2d10-4693-945c-225b7850f466",
"captured_at": null,
"capture_last_active_at": null,
"first_completed_at": null,
"last_completed_at": null,
"storage_state": "hot",
"nearline_storage_after_date": null,
"retrieved_until": null,
"nearline_storage_metadata": null,
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
{
"name": "example-example-archive-202503140737",
"uuid": "e6b89145-ee52-4950-92d6-b331edeb5850",
"archive_unit": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"status": "requested:user",
"entry_points": [
{
"uri": "http://blog.example.com/new-post-1"
},
{
"uri": "http://blog.example.com/new-post-2"
}
],
"driver": null,
"seeds": [
"http://blog.example.com/new-post-1",
"http://blog.example.com/new-post-2"
],
"crawldata": null,
"metadata": null,
"aggregate": false,
"partial": true,
"components": [],
"plugin": {
"requires_qa": false,
"name": "Website",
"sha": null,
"requires_eng": false,
"scope": {
"name": "Default",
"slug": "default"
},
"slug": "website"
},
"settings": {
"warcloader_ingestor_image": "icr.io/chronicle-prod/warcloader/ingestor:1e2b71cd",
"page_handler_image": "icr.io/chronicle-prod/hanzo-page-handler:5b3ce3b6",
"single_crawl_per_crawldb": "True",
"customer_code": "EXAMPLE",
"db_instance_type": "t3.micro",
"warcloader_postgres_image": "icr.io/chronicle-prod/warcloader/postgres:1e2b71cd",
"restrict_domains": "on",
"job_server_security_groups": "prod/jobservers",
"db_security_groups": "prod/frontiers",
"reset_errors": "on",
"ntp_server": "169.254.169.123",
"task_log_queue": "s3://hanzo.software/task_queues",
"portal_api_endpoint": "https://portal.hanzoarchives.com/api/",
"thomas_image": "icr.io/chronicle-prod/thomas-the-crawl-engine:f0a71682",
"include_referered": "off",
"warcloader_timeout": "1200",
"mitmproxy_image": "icr.io/chronicle-prod/hanzo-mitmproxy:94aaf1c3",
"proxy_image": "icr.io/chronicle-prod/hanzo-qt-warcproxy:a274c6db",
"entrypoint": [
"http://blog.example.com/new-post-1",
"http://blog.example.com/new-post-2"
],
"frontier_image": "icr.io/chronicle-prod/miyamoto/frontier:5a284946",
"warcloader_url": "http://warcloader.inf.hanzoman.com:1647",
"job_server_instance_type": "t3.large",
"max_depth": "2",
"extract_handler_image": "icr.io/chronicle-prod/hanzo-extract-handler:58d8ca9a",
"instance_manager_url": "http://instance-manager.inf.hanzoman.com:1666/",
"tags": [],
"frontier_db_image": "icr.io/chronicle-prod/miyamoto/frontier-db:4ec4b271",
"max_workers": "5",
"final_frontier_image": "icr.io/chronicle-prod/final-frontier:dde45433",
"crawl_id": "e6b89145-ee52-4950-92d6-b331edeb5850",
"chromium_image": "icr.io/chronicle-prod/chrome:107.0.5304.87-1_fonts",
"job_server_ami": "ami-0d8f8c6131ba8d0b0",
"arkwright_image": "icr.io/chronicle-prod/mr-arkwright:ibm-cloud___20241209_120232",
"warcloader_queue_server_image": "icr.io/chronicle-prod/warcloader/queue_server:1e2b71cd",
"manifest_path": "s3://hanzo.manifests/",
"global_setup": "off",
"crawl": "example-example-archive-202503140737",
"customer": "",
"warcloader_aggregator_image": "icr.io/chronicle-prod/warcloader/aggregator:1e2b71cd",
"snapshot_video_use_proxy": "no",
"seeds": [
"http://blog.example.com/new-post-1",
"http://blog.example.com/new-post-2"
],
"output_path": "s3://hanzoenterprise/RaC/",
"au": "Example Archive",
"capture_scope": "default",
"job_server_subnets": "prod/semi-private/*",
"db_ami": "ami-0875fac396b63e4e1",
"warcloader_crons_image": "icr.io/chronicle-prod/warcloader/crons:1e2b71cd",
"db_subnets": "prod/private/*"
},
"processing": null,
"remaining": null,
"captured": null,
"errored": null,
"excluded": null,
"url": "https://portal.hanzoarchives.com/api/crawls/e6b89145-ee52-4950-92d6-b331edeb5850",
"attachments_url": "https://portal.hanzoarchives.com/api/crawls/e6b89145-ee52-4950-92d6-b331edeb5850/attachments",
"portal": "https://portal.hanzoarchives.com/captures/example-archive/e6b89145-ee52-4950-92d6-b331edeb5850",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive/e6b89145-ee52-4950-92d6-b331edeb5850",
"captured_at": null,
"capture_last_active_at": null,
"first_completed_at": null,
"last_completed_at": null,
"storage_state": "hot",
"nearline_storage_after_date": null,
"retrieved_until": null,
"nearline_storage_metadata": null,
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
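The page counters in a crawl response (processing, remaining, captured, errored, excluded) are null until the crawler reports activity, as in the examples above. The sketch below turns them into a rough completion fraction. Treating captured, errored and excluded pages as "finished" is an assumption made for illustration, not a documented definition, and the function name is hypothetical.

```python
from typing import Optional

COUNTERS = ("processing", "remaining", "captured", "errored", "excluded")


def crawl_progress(crawl: dict) -> Optional[float]:
    """Return the fraction of known pages that are finished, or None if nothing is reported."""
    # Null (None) counters are treated as zero.
    counts = {name: crawl.get(name) or 0 for name in COUNTERS}
    total = sum(counts.values())
    if total == 0:
        return None
    # Assumption: captured, errored and excluded pages need no further work.
    finished = counts["captured"] + counts["errored"] + counts["excluded"]
    return finished / total


print(crawl_progress({"processing": 1, "remaining": 3, "captured": 5, "errored": 1, "excluded": 0}))
```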
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
organization | code | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
driver | string | Address of the crawl driver |
seeds | array[string] | An array of URLs the crawler starts from |
crawldata | object | A metadata object maintained by the crawler |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array | |
plugin | object | Plugin name, slug, scope and sha as derived on creation |
settings | object | Settings derived from the settings cascade on creation |
processing | int | The number of pages processing |
remaining | int | The number of pages remaining |
captured | int | The number of pages captured |
errored | int | The number of pages errored |
excluded | int | The number of pages excluded |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
captured_at | datetime | The date that the crawler started capturing |
capture_last_active_at | datetime | The date that the crawler last reported activity |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
created_at | datetime | |
updated_at | datetime | |
{
"name": "-example-archive-202503140737",
"uuid": "dd96a443-434e-4f92-b126-ed0445a9f636",
"archive_unit": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"status": "crawling",
"entry_points": [
{
"uri": "http://example.com"
}
],
"driver": null,
"seeds": [
"http://example.com"
],
"crawldata": null,
"metadata": null,
"aggregate": false,
"partial": false,
"components": [],
"plugin": {
"requires_qa": false,
"name": "Website",
"sha": null,
"requires_eng": false,
"scope": {
"name": "Default",
"slug": "default"
},
"slug": "website"
},
"settings": {
"warcloader_ingestor_image": "icr.io/chronicle-prod/warcloader/ingestor:1e2b71cd",
"page_handler_image": "icr.io/chronicle-prod/hanzo-page-handler:5b3ce3b6",
"single_crawl_per_crawldb": "True",
"customer_code": "EXAMPLE",
"db_instance_type": "t3.micro",
"warcloader_postgres_image": "icr.io/chronicle-prod/warcloader/postgres:1e2b71cd",
"restrict_domains": "on",
"job_server_security_groups": "prod/jobservers",
"db_security_groups": "prod/frontiers",
"reset_errors": "on",
"ntp_server": "169.254.169.123",
"task_log_queue": "s3://hanzo.software/task_queues",
"portal_api_endpoint": "https://portal.hanzoarchives.com/api/",
"thomas_image": "icr.io/chronicle-prod/thomas-the-crawl-engine:f0a71682",
"include_referered": "off",
"warcloader_timeout": "1200",
"mitmproxy_image": "icr.io/chronicle-prod/hanzo-mitmproxy:94aaf1c3",
"proxy_image": "icr.io/chronicle-prod/hanzo-qt-warcproxy:a274c6db",
"entrypoint": [
"http://example.com"
],
"frontier_image": "icr.io/chronicle-prod/miyamoto/frontier:5a284946",
"warcloader_url": "http://warcloader.inf.hanzoman.com:1647",
"job_server_instance_type": "t3.large",
"max_depth": "2",
"extract_handler_image": "icr.io/chronicle-prod/hanzo-extract-handler:58d8ca9a",
"instance_manager_url": "http://instance-manager.inf.hanzoman.com:1666/",
"tags": [],
"frontier_db_image": "icr.io/chronicle-prod/miyamoto/frontier-db:4ec4b271",
"max_workers": "5",
"final_frontier_image": "icr.io/chronicle-prod/final-frontier:dde45433",
"crawl_id": "dd96a443-434e-4f92-b126-ed0445a9f636",
"chromium_image": "icr.io/chronicle-prod/chrome:107.0.5304.87-1_fonts",
"job_server_ami": "ami-0d8f8c6131ba8d0b0",
"arkwright_image": "icr.io/chronicle-prod/mr-arkwright:ibm-cloud___20241209_120232",
"warcloader_queue_server_image": "icr.io/chronicle-prod/warcloader/queue_server:1e2b71cd",
"manifest_path": "s3://hanzo.manifests/",
"global_setup": "off",
"crawl": "-example-archive-202503140737",
"customer": "",
"warcloader_aggregator_image": "icr.io/chronicle-prod/warcloader/aggregator:1e2b71cd",
"snapshot_video_use_proxy": "no",
"seeds": [
"http://example.com"
],
"output_path": "s3://hanzoenterprise/RaC/",
"au": "Example Archive",
"capture_scope": "default",
"job_server_subnets": "prod/semi-private/*",
"db_ami": "ami-0875fac396b63e4e1",
"warcloader_crons_image": "icr.io/chronicle-prod/warcloader/crons:1e2b71cd",
"db_subnets": "prod/private/*"
},
"processing": 12,
"remaining": 243,
"captured": 145,
"errored": 15,
"excluded": 214,
"url": "https://portal.hanzoarchives.com/api/crawls/dd96a443-434e-4f92-b126-ed0445a9f636",
"attachments_url": "https://portal.hanzoarchives.com/api/crawls/dd96a443-434e-4f92-b126-ed0445a9f636/attachments",
"portal": "https://portal.hanzoarchives.com/captures/example-archive/dd96a443-434e-4f92-b126-ed0445a9f636",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive/dd96a443-434e-4f92-b126-ed0445a9f636",
"captured_at": "2025-03-14T07:37:08Z",
"capture_last_active_at": "2025-03-14T07:37:08Z",
"first_completed_at": null,
"last_completed_at": null,
"storage_state": "hot",
"nearline_storage_after_date": null,
"retrieved_until": null,
"nearline_storage_metadata": null,
"created_at": "2025-03-14T07:37:08Z",
"updated_at": "2025-03-14T07:37:08Z"
}
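The page counters in the example above (`processing`, `remaining`, `captured`, `errored`, `excluded`) can be combined into a rough progress figure. The following is a minimal sketch, not part of the API: the helper name `summarize_progress` is hypothetical, and it assumes the five counters are disjoint so that their sum approximates the total number of known pages. It returns `None` while any counter is still `null`, as in a crawl that has not yet started reporting.

```python
from typing import Optional

# Counter fields as they appear in the crawl object.
COUNTER_FIELDS = ("processing", "remaining", "captured", "errored", "excluded")


def summarize_progress(crawl: dict) -> Optional[dict]:
    """Summarize page counters, or return None if any counter is missing/null."""
    counts = {field: crawl.get(field) for field in COUNTER_FIELDS}
    if any(value is None for value in counts.values()):
        return None
    # Assumption: the counters do not overlap, so their sum is the page total.
    total = sum(counts.values())
    # Assumption: captured, errored and excluded pages need no further work.
    finished = counts["captured"] + counts["errored"] + counts["excluded"]
    return {
        "total": total,
        "finished": finished,
        "fraction_finished": finished / total if total else 0.0,
    }


# Counter values copied from the example crawl above.
example = {"processing": 12, "remaining": 243, "captured": 145, "errored": 15, "excluded": 214}
summary = summarize_progress(example)
```

With the example values, `summary["total"]` is 629 and `summary["finished"]` is 374.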
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
organization | code | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
driver | string | Address of the crawl driver |
seeds | array[string] | An array of URLs the crawler starts from |
crawldata | object | A metadata object maintained by the crawler |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array | |
plugin | object | Plugin name, slug, scope and sha as derived on creation |
settings | object | Settings derived from the settings cascade on creation |
processing | int | The number of pages processing |
remaining | int | The number of pages remaining |
captured | int | The number of pages captured |
errored | int | The number of pages errored |
excluded | int | The number of pages excluded |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
captured_at | datetime | The date that the crawler started capturing |
capture_last_active_at | datetime | The date that the crawler last reported activity |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
seeds | array[string] | An array of URLs the crawler starts from |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array[uuid] | |
settings | object | Settings derived from the settings cascade on creation |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
organization | code | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
driver | string | Address of the crawl driver |
seeds | array[string] | An array of URLs the crawler starts from |
crawldata | object | A metadata object maintained by the crawler |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array | |
plugin | object | Plugin name, slug, scope and sha as derived on creation |
settings | object | Settings derived from the settings cascade on creation |
processing | int | The number of pages processing |
remaining | int | The number of pages remaining |
captured | int | The number of pages captured |
errored | int | The number of pages errored |
excluded | int | The number of pages excluded |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
captured_at | datetime | The date that the crawler started capturing |
capture_last_active_at | datetime | The date that the crawler last reported activity |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
seeds | array[string] | An array of URLs the crawler starts from |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array[uuid] | |
settings | object | Settings derived from the settings cascade on creation |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
archive_unit | crawlkey | |
organization | code | |
status | string | |
entry_points | array[object] | An array of objects with the URI from which the crawl can be entered via native access as well as the UUID of the page instance |
driver | string | Address of the crawl driver |
seeds | array[string] | An array of URLs the crawler starts from |
crawldata | object | A metadata object maintained by the crawler |
metadata | object | Additional user metadata store |
aggregate | boolean | Whether this crawl is an aggregation of one or more other crawls (requires `components`) |
partial | boolean | Whether this crawl is a partial capture i.e. custom seeds/settings were supplied specifically for use in an aggregation |
components | array | |
plugin | object | Plugin name, slug, scope and sha as derived on creation |
settings | object | Settings derived from the settings cascade on creation |
processing | int | The number of pages processing |
remaining | int | The number of pages remaining |
captured | int | The number of pages captured |
errored | int | The number of pages errored |
excluded | int | The number of pages excluded |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
captured_at | datetime | The date that the crawler started capturing |
capture_last_active_at | datetime | The date that the crawler last reported activity |
first_completed_at | datetime | |
last_completed_at | datetime | |
storage_state | choice | |
nearline_storage_after_date | datetime | |
retrieved_until | datetime | |
nearline_storage_metadata | object | |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
auth_provider | slug | |
url | string | |
scopes_url | string | |
settings_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
auth_provider | slug | |
url | string | |
scopes_url | string | |
settings_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
url | string | |
settings_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
url | string | |
settings_url | string | |
Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |
Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |
Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |
Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |
Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |
Field | Type | Description |
---|---|---|
key | string | |
value | string | |
field_name | string | |
field_description | string | |
field_type | choice | |
field_options | object | |
field_required | boolean | |
type | string | |
url | string | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
crawl_details | crawl | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
attachments_count | int | |
attachments_size | int | |
exported_at | datetime | The date that the exporter started exporting |
created_at | datetime | |
updated_at | datetime | |
{
"name": "ESIV-1",
"organization": "EXAMPLE",
"crawl": "433a2584-f699-4523-ad61-c098d0171522",
"status": "requested:user",
"type": "load_file"
}
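A request body like the one above could be sent from Python roughly as follows. This is a sketch under stated assumptions rather than an official client: the helper name `build_export_request`, the collection URL `/api/exports` (inferred from the `url` values in the example responses), and the credentials are all hypothetical. Basic Auth is used because it is the documented authentication scheme. The request object is only built here, not sent.

```python
import base64
import json
import urllib.request

# Assumption: API root inferred from the example URLs in this document.
API_ROOT = "https://portal.hanzoarchives.com/api"


def build_export_request(name, organization, crawl_uuid, username, password,
                         status="requested:user", export_type="load_file"):
    """Build (but do not send) a POST request that mirrors the example body."""
    body = {
        "name": name,
        "organization": organization,  # organization code
        "crawl": crawl_uuid,           # crawl uuid
        "status": status,
        "type": export_type,
    }
    # Basic Auth header: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{API_ROOT}/exports",  # assumed collection endpoint
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Basic {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical service user; values for the body are copied from the example above.
request = build_export_request(
    "ESIV-1", "EXAMPLE", "433a2584-f699-4523-ad61-c098d0171522",
    username="service-user", password="secret",
)
```

Passing `request` to `urllib.request.urlopen` would perform the call; validation failures would then be reported under the `error` key described earlier.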
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
exported_at | datetime | The date that the exporter started exporting |
{
"name": "ESIV-1",
"uuid": "f01eb185-e256-4012-8b15-6caf505d0512",
"crawl_details": "433a2584-f699-4523-ad61-c098d0171522",
"organization": "EXAMPLE",
"status": "requested:user",
"type": "load_file",
"credentials": null,
"url": "https://portal.hanzoarchives.com/api/exports/f01eb185-e256-4012-8b15-6caf505d0512",
"attachments_url": "https://portal.hanzoarchives.com/api/exports/f01eb185-e256-4012-8b15-6caf505d0512/attachments",
"portal": "https://portal.hanzoarchives.com/exports/f01eb185-e256-4012-8b15-6caf505d0512",
"portal_url": "https://portal.hanzoarchives.com/exports/f01eb185-e256-4012-8b15-6caf505d0512",
"exported_at": null,
"created_at": "2025-03-14T07:37:09Z",
"updated_at": "2025-03-14T07:37:09Z"
}
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
crawl_details | crawl | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
attachments_count | int | |
attachments_size | int | |
exported_at | datetime | The date that the exporter started exporting |
created_at | datetime | |
updated_at | datetime | |
{
"name": "ESIV-10",
"uuid": "75cdfdab-73ee-4894-b359-616d1ce5eb24",
"crawl": "f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
"crawl_details": {
"name": "-example-archive-202503140737",
"uuid": "f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
"archive_unit": "EXAMPLE/Example Archive",
"organization": "EXAMPLE",
"status": "requested:user",
"auth": null,
"entry_points": [
{
"uri": "http://example.com"
}
],
"driver": null,
"seeds": [
"http://example.com"
],
"crawldata": null,
"metadata": null,
"aggregate": false,
"draft": false,
"partial": false,
"components": [],
"plugin": {
"requires_qa": false,
"name": "Website",
"sha": null,
"requires_eng": false,
"scope": {
"name": "Default",
"slug": "default"
},
"slug": "website"
},
"settings": {
"warcloader_ingestor_image": "icr.io/chronicle-prod/warcloader/ingestor:1e2b71cd",
"page_handler_image": "icr.io/chronicle-prod/hanzo-page-handler:5b3ce3b6",
"single_crawl_per_crawldb": "True",
"customer_code": "EXAMPLE",
"db_instance_type": "t3.micro",
"warcloader_postgres_image": "icr.io/chronicle-prod/warcloader/postgres:1e2b71cd",
"restrict_domains": "on",
"job_server_security_groups": "prod/jobservers",
"db_security_groups": "prod/frontiers",
"reset_errors": "on",
"ntp_server": "169.254.169.123",
"task_log_queue": "s3://hanzo.software/task_queues",
"portal_api_endpoint": "https://portal.hanzoarchives.com/api/",
"thomas_image": "icr.io/chronicle-prod/thomas-the-crawl-engine:f0a71682",
"include_referered": "off",
"warcloader_timeout": "1200",
"mitmproxy_image": "icr.io/chronicle-prod/hanzo-mitmproxy:94aaf1c3",
"proxy_image": "icr.io/chronicle-prod/hanzo-qt-warcproxy:a274c6db",
"entrypoint": [
"http://example.com"
],
"frontier_image": "icr.io/chronicle-prod/miyamoto/frontier:5a284946",
"warcloader_url": "http://warcloader.inf.hanzoman.com:1647",
"job_server_instance_type": "t3.large",
"max_depth": "2",
"extract_handler_image": "icr.io/chronicle-prod/hanzo-extract-handler:58d8ca9a",
"instance_manager_url": "http://instance-manager.inf.hanzoman.com:1666/",
"tags": [],
"frontier_db_image": "icr.io/chronicle-prod/miyamoto/frontier-db:4ec4b271",
"max_workers": "5",
"final_frontier_image": "icr.io/chronicle-prod/final-frontier:dde45433",
"crawl_id": "f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
"chromium_image": "icr.io/chronicle-prod/chrome:107.0.5304.87-1_fonts",
"job_server_ami": "ami-0d8f8c6131ba8d0b0",
"arkwright_image": "icr.io/chronicle-prod/mr-arkwright:ibm-cloud___20241209_120232",
"warcloader_queue_server_image": "icr.io/chronicle-prod/warcloader/queue_server:1e2b71cd",
"manifest_path": "s3://hanzo.manifests/",
"global_setup": "off",
"crawl": "-example-archive-202503140737",
"customer": "",
"warcloader_aggregator_image": "icr.io/chronicle-prod/warcloader/aggregator:1e2b71cd",
"snapshot_video_use_proxy": "no",
"seeds": [
"http://example.com"
],
"output_path": "s3://hanzoenterprise/RaC/",
"au": "Example Archive",
"capture_scope": "default",
"job_server_subnets": "prod/semi-private/*",
"db_ami": "ami-0875fac396b63e4e1",
"warcloader_crons_image": "icr.io/chronicle-prod/warcloader/crons:1e2b71cd",
"db_subnets": "prod/private/*"
},
"processing": 12,
"remaining": 243,
"captured": 145,
"errored": 15,
"excluded": 214,
"url": "https://portal.hanzoarchives.com/api/crawls/f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
"attachments_url": "https://portal.hanzoarchives.com/api/crawls/f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7/attachments",
"portal": "https://portal.hanzoarchives.com/captures/example-archive/f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
"portal_url": "https://portal.hanzoarchives.com/captures/example-archive/f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
"captured_at": null,
"capture_last_active_at": null,
"first_completed_at": null,
"last_completed_at": null,
"storage_state": "hot",
"nearline_storage_after_date": null,
"retrieved_until": null,
"nearline_storage_metadata": null,
"created_at": "2025-03-14T07:37:09Z",
"updated_at": "2025-03-14T07:37:09Z"
},
"organization": "EXAMPLE",
"status": "exporting",
"type": "load_file",
"credentials": null,
"url": "https://portal.hanzoarchives.com/api/exports/75cdfdab-73ee-4894-b359-616d1ce5eb24",
"attachments_url": "https://portal.hanzoarchives.com/api/exports/75cdfdab-73ee-4894-b359-616d1ce5eb24/attachments",
"portal": "https://portal.hanzoarchives.com/exports/esiv-10",
"portal_url": "https://portal.hanzoarchives.com/exports/esiv-10",
"exported_at": "2025-03-14T07:37:09Z",
"created_at": "2025-03-14T07:37:09Z",
"updated_at": "2025-03-14T07:37:09Z"
}
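An export response carries the crawl twice: `crawl` holds the crawl UUID, while `crawl_details` embeds the crawl object. The sketch below shows one way a client might read such a response. The helper names `parse_timestamp` and `export_overview` are hypothetical, and the timestamp format is assumed from the example values (UTC with a trailing `Z`). The sample dictionary is a trimmed copy of the example above.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse timestamps such as "2025-03-14T07:37:09Z"; null values stay None."""
    if value is None:
        return None
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def export_overview(export: dict) -> dict:
    """Collect the fields a client typically needs from an export response."""
    details = export.get("crawl_details")
    # Only read the nested name when crawl_details is actually an object.
    crawl_name = details.get("name") if isinstance(details, dict) else None
    return {
        "export": export["name"],
        "status": export["status"],
        "crawl_uuid": export.get("crawl"),
        "crawl_name": crawl_name,
        "exported_at": parse_timestamp(export.get("exported_at")),
    }


# Trimmed copy of the example response above.
sample = {
    "name": "ESIV-10",
    "crawl": "f10a49e9-f2b7-4bed-9aee-ede8fb6ad0b7",
    "crawl_details": {"name": "-example-archive-202503140737"},
    "status": "exporting",
    "exported_at": "2025-03-14T07:37:09Z",
}
overview = export_overview(sample)
```

The `isinstance` check keeps the helper usable for responses in which `crawl_details` is not an embedded object.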
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
crawl_details | crawl | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
attachments_count | int | |
attachments_size | int | |
exported_at | datetime | The date that the exporter started exporting |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
exported_at | datetime | The date that the exporter started exporting |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
crawl_details | crawl | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
attachments_count | int | |
attachments_size | int | |
exported_at | datetime | The date that the exporter started exporting |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
exported_at | datetime | The date that the exporter started exporting |
Field | Type | Description |
---|---|---|
name | string | |
uuid | string | |
crawl | uuid | |
crawl_details | crawl | |
organization | code | |
status | string | |
type | choice | The type of export, a `load_file` export is a standardised concordance load file |
credentials | object | Any credentials required to open the export |
url | string | |
attachments_url | string | |
portal | string | |
portal_url | string | |
attachments_count | int | |
attachments_size | int | |
exported_at | datetime | The date that the exporter started exporting |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
slug | string | |
uuid | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
slug | string | |
uuid | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
slug | string | |
uuid | string | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
slug | string | |
uuid | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
teams | array | |
is_archived | boolean | |
s3_bucket | string | |
profile_archive_units | string | |
search_archive_units | string | |
extra_information | object | |
This view is defined separately so that the detail view below accepts POST while the other detail view does not; DRF does not handle this automatically.
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
is_archived | boolean | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
teams | array | |
is_archived | boolean | |
s3_bucket | string | |
profile_archive_units | string | |
search_archive_units | string | |
extra_information | object | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
teams | array | |
is_archived | boolean | |
s3_bucket | string | |
profile_archive_units | string | |
search_archive_units | string | |
extra_information | object | |
Adds a related search crawl to the investigation.
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
is_archived | boolean | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
teams | array | |
is_archived | boolean | |
s3_bucket | string | |
profile_archive_units | string | |
search_archive_units | string | |
extra_information | object | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
teams | array | |
is_archived | boolean | |
s3_bucket | string | |
profile_archive_units | string | |
search_archive_units | string | |
extra_information | object | |
Field | Type | Description |
---|---|---|
name | string | |
description | string | |
status | choice | |
slug | string | |
uuid | string | |
job_id | string | |
url | string | |
portal_url | string | |
created_by | user | |
created_at | datetime | |
updated_at | datetime | |
teams | array | |
is_archived | boolean | |
s3_bucket | string | |
profile_archive_units | string | |
search_archive_units | string | |
extra_information | object | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
code | string | |
archive_units_count | int | |
teams_count | int | |
users_count | int | |
has_captures | boolean | |
has_change | boolean | |
has_search | boolean | |
url | string | |
logo_url | string | |
portal_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
code | string | |
has_captures | boolean | |
has_change | boolean | |
has_search | boolean | |
logo_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
code | string | |
archive_units_count | int | |
teams_count | int | |
users_count | int | |
has_captures | boolean | |
has_change | boolean | |
has_search | boolean | |
url | string | |
logo_url | string | |
portal_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
code | string | |
archive_units_count | int | |
teams_count | int | |
users_count | int | |
has_captures | boolean | |
has_change | boolean | |
has_search | boolean | |
url | string | |
logo_url | string | |
portal_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
code | string | |
has_captures | boolean | |
has_change | boolean | |
has_search | boolean | |
logo_url | string | |
Field | Type | Description |
---|---|---|
name | string | |
slug | string | |
code | string | |
archive_units_count | int | |
teams_count | int | |
users_count | int | |
has_captures | boolean | |
has_change | boolean | |
has_search | boolean | |
url | string | |
logo_url | string | |
portal_url | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
crawl | uuid | |
duration_days | int | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
crawl | uuid | |
duration_days | int | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
crawl | uuid | |
duration_days | int | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
crawl | uuid | |
duration_days | int | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
crawl | uuid | |
duration_days | int | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |
Field | Type | Description |
---|---|---|
crawl | uuid | |
duration_days | int | |
error | string | |
Field | Type | Description |
---|---|---|
uuid | string | |
crawl | uuid | |
initiated_date | datetime | |
completed_date | datetime | |
until_date | datetime | |
duration_days | int | |
initiated_by | string | |
error | string | |