crawlee-one / Exports / CrawleeOneIO

# Interface: CrawleeOneIO\<TEnv, TReport, TMetadata\>
Interface for storing and retrieving:
- Scraped data
- Requests (URLs) to scrape
- Cache data
This interface is based on Crawlee/Apify, but defined separately to allow drop-in replacement with other integrations.
Type parameters:

Name | Type |
---|---|
`TEnv` | extends `object` = `object` |
`TReport` | extends `object` = `object` |
`TMetadata` | extends `object` = `object` |
- createDefaultProxyConfiguration
- generateEntryMetadata
- generateErrorReport
- getInput
- isTelemetryEnabled
- openDataset
- openKeyValueStore
- openRequestQueue
- runInContext
- triggerDownstreamCrawler
• **createDefaultProxyConfiguration**: `<T>(input: undefined | T | Readonly<T>) => MaybePromise<undefined | ProxyConfiguration>`

▸ `<T>(input): MaybePromise<undefined | ProxyConfiguration>`
Creates a proxy configuration and returns a promise resolving to an instance of ProxyConfiguration that is already initialized.
Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections.
For more details and code examples, see ProxyConfiguration.
Type parameters:

Name | Type |
---|---|
`T` | extends `object` |

Parameters:

Name | Type |
---|---|
`input` | `undefined` \| `T` \| `Readonly<T>` |

Returns `MaybePromise<undefined | ProxyConfiguration>`
src/lib/integrations/types.ts:128
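As a sketch of how an integration might satisfy this member, here is a minimal round-robin implementation. The `ProxyConfigurationLike` and `ProxyInput` shapes are assumptions of this example, not part of CrawleeOneIO; the real `ProxyConfiguration` comes from Crawlee.

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical minimal stand-in for Crawlee's ProxyConfiguration.
interface ProxyConfigurationLike {
  newUrl(): Promise<string | undefined>;
}

// Hypothetical input shape carrying a list of proxy server URLs.
interface ProxyInput {
  proxyUrls?: string[];
}

// Build a round-robin proxy configuration from the crawler input.
// Returns undefined when no proxies are configured, matching the
// `MaybePromise<undefined | ProxyConfiguration>` return type.
const createDefaultProxyConfiguration = <T extends ProxyInput>(
  input: undefined | T | Readonly<T>,
): MaybePromise<undefined | ProxyConfigurationLike> => {
  const urls = input?.proxyUrls ?? [];
  if (!urls.length) return undefined;
  let next = 0;
  return {
    // Each call hands out the next proxy URL in rotation.
    newUrl: async () => urls[next++ % urls.length],
  };
};
```

Returning `undefined` (rather than throwing) lets crawlers run without a proxy when none is configured.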
• **generateEntryMetadata**: `<Ctx>(ctx: Ctx) => MaybePromise<TMetadata>`

▸ `<Ctx>(ctx): MaybePromise<TMetadata>`
Generates an object with info on the current context, which will be appended to the scraped entry.
Type parameters:

Name | Type |
---|---|
`Ctx` | extends `CrawlingContext<unknown, Dictionary, Ctx>` |

Parameters:

Name | Type |
---|---|
`ctx` | `Ctx` |

Returns `MaybePromise<TMetadata>`
src/lib/integrations/types.ts:138
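A minimal sketch of such a metadata generator. The `CtxLike` and `EntryMetadata` shapes are assumptions of this example (the real `Ctx` is Crawlee's `CrawlingContext`, and `TMetadata` is whatever the integration declares):

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical subset of the crawling context used by this sketch.
interface CtxLike {
  request: { url: string; loadedUrl?: string };
}

// Example shape of TMetadata for this sketch.
interface EntryMetadata {
  url: string;
  loadedUrl: string | null;
  dateHandled: string;
}

// Derive per-entry metadata from the context, so every scraped entry
// records where and when it was captured.
const generateEntryMetadata = <Ctx extends CtxLike>(ctx: Ctx): MaybePromise<EntryMetadata> => ({
  url: ctx.request.url,
  loadedUrl: ctx.request.loadedUrl ?? null,
  dateHandled: new Date().toISOString(),
});
```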
• **generateErrorReport**: `(input: CrawleeOneErrorHandlerInput, options: PickRequired<CrawleeOneErrorHandlerOptions<CrawleeOneIO<TEnv, TReport, object>>, "io">) => MaybePromise<TReport>`

▸ `(input, options): MaybePromise<TReport>`
Generates an object with info on the current context, which will be sent to the error Dataset.
Parameters:

Name | Type |
---|---|
`input` | `CrawleeOneErrorHandlerInput` |
`options` | `PickRequired<CrawleeOneErrorHandlerOptions<CrawleeOneIO<TEnv, TReport, object>>, "io">` |

Returns `MaybePromise<TReport>`
src/lib/integrations/types.ts:133
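A sketch of one possible report generator. `ErrorHandlerInput` and `ErrorReport` here are hypothetical stand-ins for `CrawleeOneErrorHandlerInput` and `TReport`, and the options parameter is only a placeholder for the real `PickRequired<...>` type:

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical stand-in for CrawleeOneErrorHandlerInput.
interface ErrorHandlerInput {
  error: Error;
  url?: string | null;
}

// Example shape of TReport for this sketch.
interface ErrorReport {
  errorName: string;
  errorMessage: string;
  pageUrl: string | null;
  time: string;
}

// Flatten a caught error plus its context into a JSON-friendly report
// that can be pushed to an error Dataset.
const generateErrorReport = (
  input: ErrorHandlerInput,
  _options?: { io: unknown }, // placeholder for the real options type
): MaybePromise<ErrorReport> => ({
  errorName: input.error.name,
  errorMessage: input.error.message,
  pageUrl: input.url ?? null,
  time: new Date().toISOString(),
});
```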
• **getInput**: `<Input>() => Promise<null | Input>`

▸ `<Input>(): Promise<null | Input>`
Returns a promise of an object with the crawler input. For example, in Apify this retrieves the actor input from the default KeyValueStore associated with the current actor run.
Type parameters:

Name | Type |
---|---|
`Input` | extends `object` |

Returns `Promise<null | Input>`
src/lib/integrations/types.ts:57
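As a sketch, an in-memory integration could satisfy this member by reading from a simulated default key-value store (the `Map` backing is an assumption of this example; the `"INPUT"` key mirrors where Apify stores actor input):

```typescript
// In-memory stand-in for the default key-value store that holds the
// crawler input under the "INPUT" key.
const defaultStore = new Map<string, unknown>([
  ["INPUT", { startUrls: ["https://example.com"] }],
]);

// Resolve the crawler input, or null when none was provided.
const getInput = async <Input extends object>(): Promise<null | Input> => {
  return (defaultStore.get("INPUT") as Input | undefined) ?? null;
};
```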
• **isTelemetryEnabled**: `() => MaybePromise<boolean>`

▸ `(): MaybePromise<boolean>`

Returns `MaybePromise<boolean>`
src/lib/integrations/types.ts:131
• **openDataset**: `(id?: null | string) => MaybePromise<CrawleeOneDataset<object>>`

▸ `(id?): MaybePromise<CrawleeOneDataset<object>>`
Opens a dataset and returns a promise resolving to an instance of the CrawleeOneDataset.
Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud.
Parameters:

Name | Type |
---|---|
`id?` | `null` \| `string` |

Returns `MaybePromise<CrawleeOneDataset<object>>`
src/lib/integrations/types.ts:35
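A minimal in-memory sketch of this member. `DatasetLike` is a hypothetical subset of `CrawleeOneDataset`, and keying the backing store by ID means repeated opens of the same ID share data, as a real integration would:

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical minimal dataset contract; the real CrawleeOneDataset has more methods.
interface DatasetLike<T extends object> {
  pushData(data: T | T[]): Promise<void>;
  getItems(): Promise<T[]>;
}

// Backing storage keyed by dataset ID, so the same ID always resolves
// to the same items.
const datasets = new Map<string, object[]>();

const openDataset = (id?: null | string): MaybePromise<DatasetLike<object>> => {
  const key = id ?? "default";
  if (!datasets.has(key)) datasets.set(key, []);
  const items = datasets.get(key)!;
  return {
    pushData: async (data) => {
      items.push(...(Array.isArray(data) ? data : [data]));
    },
    getItems: async () => [...items],
  };
};
```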
• **openKeyValueStore**: `(id?: null | string) => MaybePromise<CrawleeOneKeyValueStore>`

▸ `(id?): MaybePromise<CrawleeOneKeyValueStore>`
Opens a key-value store and returns a promise resolving to an instance of the CrawleeOneKeyValueStore.
Key-value stores are used to store records or files, along with their MIME content type. The records are stored and retrieved using a unique key. The actual data is stored either on a local filesystem or in the cloud.
Parameters:

Name | Type |
---|---|
`id?` | `null` \| `string` |

Returns `MaybePromise<CrawleeOneKeyValueStore>`
src/lib/integrations/types.ts:52
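An in-memory sketch of this member. `KeyValueStoreLike` is a hypothetical subset of `CrawleeOneKeyValueStore` (the real store also tracks MIME content types):

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical minimal key-value store contract.
interface KeyValueStoreLike {
  getValue<T>(key: string): Promise<T | null>;
  setValue(key: string, value: unknown): Promise<void>;
}

// Backing storage keyed by store ID; records are stored and retrieved
// by a unique key, as the description above states.
const stores = new Map<string, Map<string, unknown>>();

const openKeyValueStore = (id?: null | string): MaybePromise<KeyValueStoreLike> => {
  const key = id ?? "default";
  if (!stores.has(key)) stores.set(key, new Map());
  const records = stores.get(key)!;
  return {
    getValue: async <T>(k: string) => (records.get(k) as T | undefined) ?? null,
    setValue: async (k, v) => {
      records.set(k, v);
    },
  };
};
```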
• **openRequestQueue**: `(id?: null | string) => MaybePromise<CrawleeOneRequestQueue>`

▸ `(id?): MaybePromise<CrawleeOneRequestQueue>`
Opens a request queue and returns a promise resolving to an instance of the CrawleeOneRequestQueue.
RequestQueue represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.
Parameters:

Name | Type |
---|---|
`id?` | `null` \| `string` |

Returns `MaybePromise<CrawleeOneRequestQueue>`
src/lib/integrations/types.ts:44
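A minimal in-memory sketch of this member. `RequestQueueLike` is a hypothetical subset of `CrawleeOneRequestQueue`; the FIFO order gives breadth-first crawling, and deduplication by URL is what deep crawls rely on when the same link is discovered repeatedly:

```typescript
type MaybePromise<T> = T | Promise<T>;

interface RequestLike {
  url: string;
}

// Hypothetical minimal queue contract.
interface RequestQueueLike {
  addRequest(req: RequestLike): Promise<void>;
  fetchNextRequest(): Promise<RequestLike | null>;
}

// Sketch only: the id parameter is accepted but queues are not shared
// across opens here, unlike a real integration.
const openRequestQueue = (_id?: null | string): MaybePromise<RequestQueueLike> => {
  const seen = new Set<string>();
  const pending: RequestLike[] = [];
  return {
    addRequest: async (req) => {
      if (seen.has(req.url)) return; // deduplicate by URL
      seen.add(req.url);
      pending.push(req);
    },
    // FIFO order => breadth-first crawling; a LIFO pop would give depth-first.
    fetchNextRequest: async () => pending.shift() ?? null,
  };
};
```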
• **runInContext**: `(userFunc: () => unknown, options?: ExitOptions) => Promise<void>`

▸ `(userFunc, options?): Promise<void>`
Equivalent of Actor.main.

Runs the main user function that performs the job of the actor and terminates the process when the user function finishes.

The `Actor.main()` function is optional and is provided merely for your convenience. It is mainly useful when you're running your code as an actor on the Apify platform. However, if you want to use Apify SDK tools directly inside your existing projects, e.g. running in an Express server, on Google Cloud functions or AWS Lambda, it's better to avoid it since the function terminates the main process when it finishes!

The `Actor.main()` function performs the following actions:

- When running on the Apify platform (i.e. the `APIFY_IS_AT_HOME` environment variable is set), it sets up a connection to listen for platform events. For example, to get a notification about an imminent migration to another server. See Actor.events for details.
- It checks that either the `APIFY_TOKEN` or `APIFY_LOCAL_STORAGE_DIR` environment variable is defined. If not, the function sets `APIFY_LOCAL_STORAGE_DIR` to `./apify_storage` inside the current working directory. This is to simplify running code examples.
- It invokes the user function passed as the `userFunc` parameter.
- If the user function returned a promise, waits for it to resolve.
- If the user function throws an exception or some other error is encountered, prints error details to console so that they are stored to the log.
- Exits the Node.js process, with zero exit code on success and non-zero on errors.
Parameters:

Name | Type |
---|---|
`userFunc` | `() => unknown` |
`options?` | `ExitOptions` |

Returns `Promise<void>`
src/lib/integrations/types.ts:116
• **triggerDownstreamCrawler**: `<TInput>(targetActorId: string, input?: TInput, options?: { build?: string }) => Promise<void>`

▸ `<TInput>(targetActorId, input?, options?): Promise<void>`
Equivalent of Actor.metamorph.

This function should:

- Start a crawler/actor by its ID,
- Pass the given input to the downstream crawler,
- Make the same storage available to the downstream crawler, i.e. the downstream crawler should use the same "default" storages as the current crawler.

Read more about Actor.metamorph:

Actor.metamorph transforms this actor run to an actor run of a given actor. The system stops the current container and starts the new container instead. All the default storages are preserved and the new input is stored under the `INPUT-METAMORPH-1` key in the same default key-value store.
Type parameters:

Name | Type |
---|---|
`TInput` | extends `object` |

Parameters:

Name | Type | Description |
---|---|---|
`targetActorId` | `string` | ID of the crawler/actor that should be triggered. |
`input?` | `TInput` | Input for the crawler/actor. Must be JSON-serializable (it will be stringified to JSON). |
`options?` | `Object` | - |
`options.build?` | `string` | Tag or number of the target build to metamorph into (e.g. `beta` or `1.2.345`). If not provided, the run uses the build tag or number from the default actor run configuration (typically `latest`). |

Returns `Promise<void>`