Skip to content

Commit

Permalink
完善文档
Browse files Browse the repository at this point in the history
  • Loading branch information
Boris-code committed Oct 31, 2022
1 parent 31215a6 commit cab2096
Show file tree
Hide file tree
Showing 13 changed files with 532 additions and 347 deletions.
11 changes: 6 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,22 +20,24 @@

### 1.拥有强大的监控,保障数据质量

![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2021/09/14/16316112326191.jpg)
![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2022/10/12/16655595870715.jpg)

监控面板:[点击查看详情](http://feapder.com/#/feapder_platform/feaplat)

### 2. 内置多维度的报警(支持 钉钉、企业微信、邮箱)
### 2. 内置多维度的报警(支持 钉钉、企业微信、飞书、邮箱)

![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718974597.jpg)
![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/29/16092335882158.jpg)
![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718683378.jpg)


### 3. 简单易用,内置三种爬虫,可应对各种需求场景
### 3. 简单易用,内置四种爬虫,可应对各种需求场景

- `AirSpider` 轻量爬虫:学习成本低,可快速上手

- `Spider` 分布式爬虫:支持断点续爬、爬虫报警、数据自动入库等功能
- `Spider` 分布式爬虫:支持断点续爬、爬虫报警等功能,可加快爬虫采集速度

- `TaskSpider` 任务爬虫:从任务表里取任务做,内置支持对接redis、mysql任务表,亦可扩展其他任务来源

- `BatchSpider` 批次爬虫:可周期性的采集数据,自动将数据按照指定的采集周期划分。(如每7天全量更新一次商品销量的需求)

Expand All @@ -44,7 +46,6 @@
## 文档地址

- 官方文档:http://feapder.com
- 国内文档:https://boris-code.gitee.io/feapder
- 境外文档:https://boris.org.cn/feapder
- github:https://github.com/Boris-code/feapder
- 更新日志:https://github.com/Boris-code/feapder/releases
Expand Down
11 changes: 6 additions & 5 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,21 +16,23 @@

### 1.拥有强大的监控,保障数据质量

![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2021/09/14/16316112326191.jpg)
![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2022/10/12/16655595870715.jpg)

监控面板:[点击查看详情](http://feapder.com/#/feapder_platform/feaplat)

### 2. 内置多维度的报警(支持 钉钉、企业微信、邮箱)
### 2. 内置多维度的报警(支持 钉钉、企业微信、飞书、邮箱)

![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718974597.jpg)
![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/29/16092335882158.jpg)
![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718683378.jpg)

### 3. 简单易用,内置三种爬虫,可应对各种需求场景
### 3. 简单易用,内置四种爬虫,可应对各种需求场景

- `AirSpider` 轻量爬虫:学习成本低,可快速上手

- `Spider` 分布式爬虫:支持断点续爬、爬虫报警、数据自动入库等功能
- `Spider` 分布式爬虫:支持断点续爬、爬虫报警等功能,可加快爬虫采集速度

- `TaskSpider` 任务爬虫:从任务表里取任务做,内置支持对接redis、mysql任务表,亦可扩展其他任务来源

- `BatchSpider` 批次爬虫:可周期性的采集数据,自动将数据按照指定的采集周期划分。(如每7天全量更新一次商品销量的需求)

Expand All @@ -39,7 +41,6 @@
## 文档地址

- 官方文档:http://feapder.com
- 国内文档:https://boris-code.gitee.io/feapder
- 境外文档:https://boris.org.cn/feapder
- github:https://github.com/Boris-code/feapder
- 更新日志:https://github.com/Boris-code/feapder/releases
Expand Down
3 changes: 2 additions & 1 deletion docs/_sidebar.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,8 @@
* [响应-Response](source_code/Response.md)
* [代理使用说明](source_code/proxy.md)
* [用户池说明](source_code/UserPool.md)
* [浏览器渲染](source_code/浏览器渲染.md)
* [浏览器渲染-Selenium](source_code/浏览器渲染-Selenium.md)
* [浏览器渲染-Playwright](source_code/浏览器渲染-Playwright)
* [解析器-BaseParser](source_code/BaseParser.md)
* [批次解析器-BatchParser](source_code/BatchParser.md)
* [Spider进阶](source_code/Spider进阶.md)
Expand Down
258 changes: 258 additions & 0 deletions docs/source_code/浏览器渲染-Playwright.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
# 浏览器渲染-Playwright

采集动态页面时(Ajax渲染的页面),常用的有两种方案。一种是找接口拼参数,这种方式比较复杂但效率高,需要一定的爬虫功底;另外一种是采用浏览器渲染的方式,直接获取源码,简单方便

框架支持playwright渲染下载,每个线程持有一个playwright实例


## 使用方式:

1. 修改配置文件的渲染下载器:

```
RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader"
```
2. 使用
```python
def start_requests(self):
yield feapder.Request("https://news.qq.com/", render=True)
```
在返回的Request中传递`render=True`即可
框架支持`chromium`、`firefox`、`webkit` 三种浏览器渲染,可通过[配置文件](source_code/配置文件)进行配置。相关配置如下:
```python
PLAYWRIGHT = dict(
user_agent=None, # 字符串 或 无参函数,返回值为user_agent
proxy=None, # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
headless=False, # 是否为无头浏览器
driver_type="chromium", # chromium、firefox、webkit
timeout=30, # 请求超时时间
window_size=(1024, 800), # 窗口大小
executable_path=None, # 浏览器路径,默认为默认路径
download_path=None, # 下载文件的路径
render_time=0, # 渲染时长,即打开网页等待指定时间后再获取源码
wait_until="networkidle", # 等待页面加载完成的事件,可选值:"commit", "domcontentloaded", "load", "networkidle"
use_stealth_js=False, # 使用stealth.min.js隐藏浏览器特征
page_on_event_callback=None, # page.on() 事件的回调 如 page_on_event_callback={"dialog": lambda dialog: dialog.accept()}
storage_state_path=None, # 保存浏览器状态的路径
url_regexes=None, # 拦截接口,支持正则,数组类型
save_all=False, # 是否保存所有拦截的接口, 配合url_regexes使用,为False时只保存最后一次拦截的接口
)
```

- `feapder.Request` 也支持`render_time`参数, 优先级大于配置文件中的`render_time`

- 代理使用优先级:`feapder.Request`指定的代理 > 配置文件中的`PROXY_EXTRACT_API` > webdriver配置文件中的`proxy`

- user_agent使用优先级:`feapder.Request`指定的header里的`User-Agent` > 框架随机的`User-Agent` > webdriver配置文件中的`user_agent`

## 设置User-Agent

> 每次生成一个新的浏览器实例时生效
### 方式1:

通过配置文件的 `user_agent` 参数设置

### 方式2:

通过 `feapder.Request`携带,优先级大于配置文件, 如:

```python
def download_midware(self, request):
request.headers = {
"User-Agent": "xxxxxxxx"
}
return request
```

## 设置代理

> 每次生成一个新的浏览器实例时生效
### 方式1:

通过配置文件的 `proxy` 参数设置

### 方式2:

通过 `feapder.Request`携带,优先级大于配置文件, 如:

```python
def download_midware(self, request):
request.proxies = {
"https": "https://xxx.xxx.xxx.xxx:xxxx"
}
return request
```

## 设置Cookie

通过 `feapder.Request`携带,如:

```python
def download_midware(self, request):
request.headers = {
"Cookie": "key=value; key2=value2"
}
return request
```

或者

```python
def download_midware(self, request):
request.cookies = {
"key": "value",
"key2": "value2",
}
return request
```

或者

```python
def download_midware(self, request):
request.cookies = [
{
"domain": "xxx",
"name": "xxx",
"value": "xxx",
"expirationDate": "xxx"
},
]
return request
```

## 拦截数据示例

> 注意:主函数使用run方法运行,不能使用start
```python
from playwright.sync_api import Response
from feapder.utils.webdriver import (
PlaywrightDriver,
InterceptResponse,
InterceptRequest,
)

import feapder


def on_response(response: Response):
print(response.url)


class TestPlaywright(feapder.AirSpider):
__custom_setting__ = dict(
RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader",
PLAYWRIGHT=dict(
user_agent=None, # 字符串 或 无参函数,返回值为user_agent
proxy=None, # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
headless=False, # 是否为无头浏览器
driver_type="chromium", # chromium、firefox、webkit
timeout=30, # 请求超时时间
window_size=(1024, 800), # 窗口大小
executable_path=None, # 浏览器路径,默认为默认路径
download_path=None, # 下载文件的路径
render_time=0, # 渲染时长,即打开网页等待指定时间后再获取源码
wait_until="networkidle", # 等待页面加载完成的事件,可选值:"commit", "domcontentloaded", "load", "networkidle"
use_stealth_js=False, # 使用stealth.min.js隐藏浏览器特征
# page_on_event_callback=dict(response=on_response), # 监听response事件
# page.on() 事件的回调 如 page_on_event_callback={"dialog": lambda dialog: dialog.accept()}
storage_state_path=None, # 保存浏览器状态的路径
url_regexes=["wallpaper/list"], # 拦截接口,支持正则,数组类型
save_all=True, # 是否保存所有拦截的接口
),
)

def start_requests(self):
yield feapder.Request(
"http://www.soutushenqi.com/image/search/?searchWord=%E6%A0%91%E5%8F%B6",
render=True,
)

def parse(self, reqeust, response):
driver: PlaywrightDriver = response.driver

intercept_response: InterceptResponse = driver.get_response("wallpaper/list")
intercept_request: InterceptRequest = intercept_response.request

req_url = intercept_request.url
req_header = intercept_request.headers
req_data = intercept_request.data
print("请求url", req_url)
print("请求header", req_header)
print("请求data", req_data)

data = driver.get_json("wallpaper/list")
print("接口返回的数据", data)

print("------ 测试save_all=True ------- ")

# 测试save_all=True
all_intercept_response: list = driver.get_all_response("wallpaper/list")
for intercept_response in all_intercept_response:
intercept_request: InterceptRequest = intercept_response.request
req_url = intercept_request.url
req_header = intercept_request.headers
req_data = intercept_request.data
print("请求url", req_url)
print("请求header", req_header)
print("请求data", req_data)

all_intercept_json = driver.get_all_json("wallpaper/list")
for intercept_json in all_intercept_json:
print("接口返回的数据", intercept_json)

# 千万别忘了
driver.clear_cache()


if __name__ == "__main__":
TestPlaywright(thread_count=1).run()
```
可通过配置的`page_on_event_callback`参数自定义事件的回调,如设置`on_response`的事件回调,亦可直接使用`url_regexes`设置拦截的接口

## 操作浏览器对象示例

> 注意:主函数使用run方法运行,不能使用start
```python
import time

from playwright.sync_api import Page

import feapder
from feapder.utils.webdriver import PlaywrightDriver


class TestPlaywright(feapder.AirSpider):
__custom_setting__ = dict(
RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader",
)

def start_requests(self):
yield feapder.Request("https://www.baidu.com", render=True)

def parse(self, reqeust, response):
driver: PlaywrightDriver = response.driver
page: Page = driver.page

page.type("#kw", "feapder")
page.click("#su")
page.wait_for_load_state("networkidle")
time.sleep(1)

html = page.content()
response.text = html # 使response加载最新的页面
for data_container in response.xpath("//div[@class='c-container']"):
print(data_container.xpath("string(.//h3)").extract_first())


if __name__ == "__main__":
TestPlaywright(thread_count=1).run()
```
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# 浏览器渲染
# 浏览器渲染-Selenium

采集动态页面时(Ajax渲染的页面),常用的有两种方案。一种是找接口拼参数,这种方式比较复杂但效率高,需要一定的爬虫功底;另外一种是采用浏览器渲染的方式,直接获取源码,简单方便

Expand Down Expand Up @@ -73,16 +73,6 @@ def download_midware(self, request):

通过 `feapder.Request`携带,优先级大于配置文件, 如:

```python
def download_midware(self, request):
request.proxies = {
"http": "http://xxx.xxx.xxx.xxx:xxxx"
}
return request
```

或者

```python
def download_midware(self, request):
request.proxies = {
Expand Down Expand Up @@ -114,6 +104,21 @@ def download_midware(self, request):
return request
```

或者

```python
def download_midware(self, request):
request.cookies = [
{
"domain": "xxx",
"name": "xxx",
"value": "xxx",
"expirationDate": "xxx"
},
]
return request
```

## 操作浏览器对象

通过 `response.browser` 获取浏览器对象
Expand Down
Loading

0 comments on commit cab2096

Please sign in to comment.