Deploying a Cloud Spider with Scrapy + GitHub Actions
Last edited by skyone on 2021-3-22 13:00.
Original: my blog. Title: Deploying a cloud spider with Scrapy + GitHub Actions. Reposting is prohibited; partial quotations must credit the source.

To learn the basics of GitHub Actions, I wrote a spider as practice. It downloads images from the bilibili albums section. The article has three steps:
[*]Write the spider
[*]Write the workflow file
[*]Push the code to GitHub
If you'd rather not type the code yourself, just fork my repository:

```
# https
https://github.com/skyone-wzw/action.git
# ssh
git@github.com:skyone-wzw/action.git
```

The point is that a GitHub Actions cloud spider needs no server of your own, and it can even reach the outside internet! Ahem. Anyway, you know what I mean.
GitHub Actions provides free runners for open-source projects, roughly: Xeon E5, 2 vCPU, 7 GB RAM. Public repositories can use them for free with no overall time limit, and a single job may run for up to 6 hours. For a spider this brings several advantages:
[*]Free
[*]Fast, unrestricted network access
[*]Scheduled crawling
[*]Concurrency (up to 20 jobs at the same time)
[*]No worries about running out of disk space (nearly 60 GB available)
Writing a Scrapy spider

Create the project

Create a new project folder, open a terminal in it, and initialize git:

```shell
git init
```

Create the Scrapy project and generate the spider (note that `scrapy genspider` needs both a spider name and a domain):

```shell
scrapy startproject bili
cd bili
scrapy genspider picture bilibili.com
```

Project created; now open the folder in your IDE. I use PyCharm.

A class to hold image metadata

Open bili/items.py. Barring surprises, there is already a BiliItem(scrapy.Item) class; change it to:

```python
import scrapy


class BiliItem(scrapy.Item):
    url = scrapy.Field()        # image URL
    title = scrapy.Field()      # image title
    author = scrapy.Field()     # author name
    id = scrapy.Field()         # image id
    uid = scrapy.Field()        # author uid
    extension = scrapy.Field()  # image file extension
```

Analyzing the bilibili API

Open the bilibili albums page with the Network tab of the browser dev tools open. Each time more content loads, a GET request fires:

https://i.w3tt.com/2021/03/22/qq4mB.png

Exactly what we want. After a few tries it is easy to work out:

GET https://api.vc.bilibili.com/link ... _num=0&page_size=20
Accept: application/json

Response:

GET https://api.vc.bilibili.com/link ... _num=0&page_size=20
HTTP/1.1 200 OK
Date: Sun, 21 Mar 2021 06:30:48 GMT
Content-Type: application/json
Transfer-Encoding: chunked
Connection: keep-alive
L: v
X-Trace-Id: 29bd4f5ef5186679
Http-Status-Code: 200
Bili-Status-Code: 0
Server: swoole-http-server
Set-Cookie: l=v
Expires: Sun, 21 Mar 2021 06:30:47 GMT
Cache-Control: no-cache
X-Cache-Webcdn: BYPASS from hw-sh3-webcdn-08
{
"code": 0,
"msg": "success",
"message": "success",
"data": {
"total_count": 500,
"items": [
{
"user": {
"uid": 21690338,
"head_url": "https://i1.hdslb.com/bfs/face/b6a132c21e444401228099c8cc07edab810bc9db.jpg",
"name": "ZWPAN盼"
},
"item": {
"doc_id": 1085297,
"poster_uid": 21690338,
"pictures": [
{
"img_src": "https://i0.hdslb.com/bfs/album/2785de69bb019f85c88ee0d9681468c779e3950f.jpg",
"img_width": 3029,
"img_height": 9506,
"img_size": 6425
}
],
"title": "108个小电视表情",
"category": "illustration",
"upload_time": 1512023535,
"already_liked": 0,
"already_voted": 0
}
},
{ "_": "too many to list, the rest omitted" }
]
}
}
Response code: 200 (OK); Time: 490ms; Content length: 12879 bytes
[*]URL: https://api.vc.bilibili.com/link_draw/v2/Doc/list
[*]Method: GET
[*]Params:
[*]category: image category, one of
[*]all: all categories
[*]illustration: illustrations
[*]comic: comics
[*]draw: other
[*]type: ranking rule, one of
[*]hot: sort by popularity
[*]new: sort by upload time
[*]page_num: page number, starting from 0
[*]page_size: images per page
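Before writing the spider, it helps to confirm exactly which fields we will read from this JSON. A quick sketch; the sample below is abbreviated from the response shown above, trimmed to just the fields the spider uses:

```python
import json

# Abbreviated sample of the album-list API response shown above.
sample = """
{
  "code": 0,
  "data": {
    "total_count": 500,
    "items": [
      {
        "user": {"uid": 21690338, "name": "ZWPAN盼"},
        "item": {
          "doc_id": 1085297,
          "title": "108个小电视表情",
          "pictures": [
            {"img_src": "https://i0.hdslb.com/bfs/album/2785de69bb019f85c88ee0d9681468c779e3950f.jpg"}
          ]
        }
      }
    ]
  }
}
"""

data = json.loads(sample)
entry = data["data"]["items"][0]
url = entry["item"]["pictures"][-1]["img_src"]
# The file extension is simply everything after the last dot in the URL.
print(entry["item"]["title"], url.split(".")[-1])  # 108个小电视表情 jpg
```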
Writing the spider

First create a configuration file setting.py in the project root:

```python
# picture settings
PICTURE_MAX_PAGE = 20
PICTURE_SLEEP_TIME = 0.5
PICTURE_CATEGORY = "all"
PICTURE_TYPE = "hot"
```
[*]PICTURE_MAX_PAGE: number of pages to crawl, 20 images per page
[*]PICTURE_SLEEP_TIME: how long to pause between images
[*]PICTURE_CATEGORY: image category to crawl, one of
[*]all: all categories
[*]illustration: illustrations
[*]comic: comics
[*]draw: other
[*]PICTURE_TYPE: ranking rule, one of
[*]hot: sort by popularity
[*]new: sort by upload time
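The spider below assembles the request URL by string concatenation; an equivalent approach that avoids manual escaping is urllib.parse.urlencode. A small sketch, with the setting values inlined here so the snippet is self-contained:

```python
from urllib.parse import urlencode

# Values mirroring setting.py above (inlined for this sketch)
PICTURE_CATEGORY = "all"
PICTURE_TYPE = "hot"

API = "https://api.vc.bilibili.com/link_draw/v2/Doc/list"


def page_url(page_num: int) -> str:
    """Build the album-list URL for one page of results."""
    query = urlencode({
        "page_size": 20,
        "type": PICTURE_TYPE,
        "category": PICTURE_CATEGORY,
        "page_num": page_num,
    })
    return API + "?" + query


print(page_url(0))
# → https://api.vc.bilibili.com/link_draw/v2/Doc/list?page_size=20&type=hot&category=all&page_num=0
```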
Open bili/spiders/picture.py and import a few libraries (setting is the setting.py we just created in the project root; it is importable because scrapy crawl is run from the root):

```python
import json
import time

import scrapy

import setting
from bili.items import BiliItem
```

Now rewrite class PictureSpider(scrapy.Spider). We start crawling from https://h.bilibili.com/ so we don't have to set a Referer header ourselves:

```python
start_urls = ['https://h.bilibili.com/']
```

Then loop over the API URL we analyzed above. Remember to construct the requests and set the callback:

```python
def parse(self, response, **kwargs):
    for i in range(setting.PICTURE_MAX_PAGE):
        yield scrapy.Request(
            'https://api.vc.bilibili.com/link_draw/v2/Doc/list?page_size=20'
            '&type=' + setting.PICTURE_TYPE +
            '&category=' + setting.PICTURE_CATEGORY +
            '&page_num=' + str(i),
            callback=self.picture_info,
            dont_filter=True
        )

def picture_info(self, response, **kwargs):
    data = json.loads(response.text)
    for item in data["data"]["items"]:
        img = BiliItem()
        img["url"] = item["item"]["pictures"][-1]["img_src"]
        img["title"] = item["item"]["title"]
        img["id"] = item["item"]["doc_id"]
        img["author"] = item["user"]["name"]
        img["uid"] = item["user"]["uid"]
        img["extension"] = img["url"].split('.')[-1]
        yield img
        time.sleep(setting.PICTURE_SLEEP_TIME)
```

Complete code of bili/spiders/picture.py:

```python
import json
import time

import scrapy

import setting
from bili.items import BiliItem


class PictureSpider(scrapy.Spider):
    name = 'picture'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://h.bilibili.com/']

    def parse(self, response, **kwargs):
        for i in range(setting.PICTURE_MAX_PAGE):
            yield scrapy.Request(
                'https://api.vc.bilibili.com/link_draw/v2/Doc/list?page_size=20'
                '&type=' + setting.PICTURE_TYPE +
                '&category=' + setting.PICTURE_CATEGORY +
                '&page_num=' + str(i),
                callback=self.picture_info,
                dont_filter=True
            )

    def picture_info(self, response, **kwargs):
        data = json.loads(response.text)
        for item in data["data"]["items"]:
            img = BiliItem()
            img["url"] = item["item"]["pictures"][-1]["img_src"]
            img["title"] = item["item"]["title"]
            img["id"] = item["item"]["doc_id"]
            img["author"] = item["user"]["name"]
            img["uid"] = item["user"]["uid"]
            img["extension"] = img["url"].split('.')[-1]
            yield img
            time.sleep(setting.PICTURE_SLEEP_TIME)
```

Downloading the images

The image URLs are collected by the picture_info() method above; now the BiliPipeline pipeline downloads them. Open bili/pipelines.py. Scrapy's built-in image downloader feels too cumbersome to me, so we just use the requests library:

```python
import requests
from os import path, makedirs


class BiliPipeline:
    def process_item(self, item, spider):
        name = str(item["id"]) + '.' + item["extension"]
        res = requests.get(item["url"])
        image_dir = path.join(path.abspath('.'), 'image')
        makedirs(image_dir, exist_ok=True)  # make sure the output folder exists
        with open(path.join(image_dir, name), 'wb') as file:
            file.write(res.content)
        print(name)
        return item
```

Enable the pipeline

Open bili/settings.py, find the following lines, and uncomment them:

```python
ITEM_PIPELINES = {
    'bili.pipelines.BiliPipeline': 300,
}
```

To cut down on unnecessary log output while the spider runs, you can also add:

```python
LOG_LEVEL = "WARNING"
```

Testing the spider locally

Start the spider:

```shell
# switch to the project root first
scrapy crawl picture
```

Debugging the spider in PyCharm

Click "Edit Configurations" in the top right:

https://i.w3tt.com/2021/03/22/qqDVs.png

Add a configuration and choose Python:

https://i.w3tt.com/2021/03/22/qqLqK.png

Configure it as follows:

https://i.w3tt.com/2021/03/22/qqEAa.png
[*]Script path: <your Python install directory>\Lib\site-packages\scrapy\cmdline.py
[*]Parameters: crawl <spider name>
[*]Working directory: the project root

Then you can happily set breakpoints and debug. Well done, JetBrains.

Writing the workflow file

Create .github/workflows/blank.yml in the project root. You can copy and paste the file below as-is. When I have time I may write a separate introduction to GitHub Actions, though no promises... There are plenty of articles about it online anyway, for example on Ruan Yifeng's blog.

```yaml
name: Spider

on:
  workflow_dispatch:

jobs:
  spider:
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@master
      - name: 'Set up Python'
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Run a single-line script
        run: |
          pip install requests
          pip install lxml
          pip install scrapy
          scrapy crawl picture
      - name: Upload artifact
        uses: actions/upload-artifact@master
        with:
          name: img
          path: image
```

Deploying to GitHub Actions

Create a new repository on GitHub and push the code to it. The usual git three-step:

```shell
git add .             # stage
git commit -m "init"  # commit
git push              # push
```

Open the repository page on GitHub and click through:

https://i.w3tt.com/2021/03/22/qqGQS.png

Wait a little while and, barring surprises, the spider's results will appear; click to download:

https://i.w3tt.com/2021/03/22/qqQCN.png
https://i.w3tt.com/2021/03/22/qq0iC.jpg
https://i.w3tt.com/2021/03/22/qqc4L.jpg

One gripe: the 1 MB image limit is too small, I still had to upload these to my own OSS ( ^ω^)
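The advantage list at the top mentions scheduled crawling, but the workflow above only has a manual workflow_dispatch trigger. To have GitHub run the spider automatically, a cron trigger can be added to the on: block. A sketch (times are in UTC; pick your own schedule):

```yaml
on:
  workflow_dispatch:     # keep the manual trigger
  schedule:
    - cron: '0 2 * * *'  # additionally run every day at 02:00 UTC
```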