Preface

This post will probably end up being a long one, so it may also take quite a while to write.

Basic usage

Writing the Spider class

The project structure that gets generated along with the config files is fine as it is; all you really have to write yourself is the parse part. A few other things are worth mentioning.
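For reference, the spider skeleton that scrapy genspider produces looks roughly like this (the name and domain below are just placeholders); parse is the only part you actually fill in:

import scrapy


class ExampleSpider(scrapy.Spider):
    # "example" and example.com are placeholder values from the generator
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # the only method you really have to write yourself
        pass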

About response and Request

response

The response object is all the data Scrapy got back after crawling the page. You can wrap the response in scrapy.Selector and then run XPath queries on it to find the tags you want; my code is attached at the end.
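A minimal sketch, assuming this sits on your Spider subclass (the XPath expressions are only examples):

from scrapy import Selector

def parse(self, response):  # a callback on your Spider subclass
    sel = Selector(response)
    # extract() returns a list of matching strings
    hrefs = sel.xpath("//a/@href").extract()
    # extract_first() returns the first match, or None if nothing matched
    title = sel.xpath("//title/text()").extract_first()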

Request

Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)

Request is a class. Its first parameter is the URL you want to crawl next, and further down the list there is a meta parameter that takes a dict; that dict ends up in response.meta inside the callback function. The overall flow is: a request is sent, the response is handed to the callback, and the callback is a function you write yourself to continue the crawl. That is basically all there is to it.
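A minimal sketch of the meta round trip, assuming two callbacks on your Spider subclass (the URL and key name are made up):

from scrapy import Request

def parse(self, response):                        # inside your Spider subclass
    # anything you put into meta travels along with the request
    yield Request("http://example.com/next",      # placeholder URL
                  meta={"page_title": "something"},
                  callback=self.parse_next)

def parse_next(self, response):
    # ...and comes back out on response.meta inside the callback
    title = response.meta["page_title"]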

Writing the Item

The Item is easy to explain; the code pretty much says it all:

class PictureItem(scrapy.Item):
    name = scrapy.Field()
    chapter = scrapy.Field()
    page = scrapy.Field()
    url = scrapy.Field()

Write an Item like this and then use it exactly as you would a dictionary. It is similar to the model (entity) classes in Django, I think, except that the field type here is trivially simple: there is only one, Field, so that is all there is to it.
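Using it really is just like using a dict (the values below are made up):

item = PictureItem()
item["name"] = "some comic"        # made-up values, only to show the dict-style access
item["chapter"] = "chapter 01"
item["page"] = 1
item["url"] = "http://images.dmzj.com/example.jpg"

Inside a spider callback you would then yield the item, or collect the items into a list and return it, as the code at the end does.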

Writing the Pipeline

This is the class that receives each Item and processes it. When I used it, it persisted the items into MongoDB, but because I did not tune the performance well it often got stuck at this stage on Windows; I am still looking for a way to speed this part up.

There is not much else to say about it: the main job is to implement process_item(self, item, spider), and once that is written you are essentially done. The database connection can be created when the pipeline class is instantiated.
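As a rough sketch of what such a pipeline can look like, assuming pymongo and a local MongoDB (the database and collection names are made up):

import pymongo


class MongoPipeline(object):
    def __init__(self):
        # create the connection once, when the pipeline object is created
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["manhua"]["pictures"]  # placeholder names

    def process_item(self, item, spider):
        # dict(item) turns a scrapy Item into a plain dict for insertion
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

The pipeline also has to be enabled under ITEM_PIPELINES in settings.py, otherwise Scrapy never calls it.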

One extra note to add here:

If several spiders crawl different kinds of Item, the Pipeline has to pick out the Items it actually needs to handle by itself, and only then process them.
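One way to do that separation (again only a sketch) is an isinstance check at the top of process_item; checking spider.name works just as well:

from manhuaparser.items import PictureItem


class PicturePipeline(object):
    def process_item(self, item, spider):
        if not isinstance(item, PictureItem):
            return item  # not ours: hand it on to the next pipeline untouched
        # ... handle the PictureItem here, e.g. insert it into MongoDB as above
        return item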

Middleware

I have not needed this part yet, so it will stay empty for now.

Closing remarks

Overall, the Scrapy framework is fairly simple, but it still does not quite fit my needs. The reason is simple, though it may also be that I have not learned the framework thoroughly enough: my own spider code feels very tightly coupled. I feel a lot of it could be decoupled, but while writing against this framework I never actually decoupled it. Maybe that is because I did not use the middlewares and such, or maybe it only feels highly coupled to me.

Appendix: the Spider code

import json
import re
from urllib.parse import urljoin, unquote

import js2py
from scrapy import Selector, Spider, Request

from manhuaparser.items import PictureItem


class ManhuaSpider(Spider):
    name = "manhua"
    allowed_domains = ["www.dmzj.com","manhua.dmzj.com"]
    start_urls = [
        "http://manhua.dmzj.com/xfgj/",
        "http://manhua.dmzj.com",
    ]

    def parse(self, response):
        sel = Selector(response)
        href = sel.xpath("//a/@href").extract()
        href = filter(lambda x: re.search(r"javascript:void\(0\)", x) is None, href)
        for i in href:
            j = i
            if re.search("http",i) is None:
                j = urljoin(response.url,i)
            yield Request(j,callback=self.parse_selector)

    def parse_selector(self,response):
        # print("parse_selector",response.url)
        sel = Selector(response)
        chapter = sel.xpath("//div[@class=\"middleright_mr\"]").extract()
        new_chapter = sel.xpath("//ul[@class=\"list_con_li\"]").extract()
        if len(chapter) > 0:
            # print("has chapter")
            return self.parse_manhua_page(response)
        elif len(new_chapter) > 0:
            # print("has new chapter")
            return self.parse_new_manhua_page(response)
        else:
            return self.parse_normal_page(response)

    def parse_normal_page(self,response):
        # print("parse_normal_page",response.url)
        sel = Selector(response)
        href = sel.xpath("//a/@href").extract()
        for i in href:
            j = i
            if re.search("http",i) is None:
                j = urljoin(response.url,i)
            yield Request(j,callback=self.parse_selector)

    def parse_manhua_page(self,response):
        # print("parse_manhua_page",response.url)
        sel = Selector(response)
        title = sel.xpath("//title").extract_first()
        href = sel.xpath("//div[@class=\"middleright_mr\"]").xpath(".//a/@href").extract()
        href = filter(lambda x:re.search("http",x) is None,href)
        for i in href:
            yield Request(urljoin(response.url,i),meta={"manhua_title":title},callback=self.parse_chapter_page)

    def parse_new_manhua_page(self,response):
        # print("parse_new_manhua_page",response.url)
        sel = Selector(response)
        title = sel.xpath("//title").extract_first()
        href = sel.xpath("//ul[@class=\"list_con_li\"]").xpath(".//a/@href").extract()
        href = filter(lambda x: re.search("http", x) is None, href)
        for i in href:
            yield Request(urljoin(response.url, i), meta={"manhua_title": title}, callback=self.parse_new_chapter_page)

    def parse_chapter_page(self,response):
        # print("parse_chapter_page",response.url)
        sel = Selector(response)
        title = sel.xpath("//title").extract_first()
        # print("title is: ",title)
        script = sel.xpath("//script").extract_first()
        # extract_first() may return None, and a first <script> tag with a src
        # attribute means this is not a chapter page; fall back to the link parser
        if script is None or re.search("src", script) is not None:
            return self.parse_selector(response)
        # print(type(script))
        # print("script is: ",script)
        try:
            script = re.findall(r"eval\(.+\)", script)
            # print("find script is: ",script)
            # print("length is: ",len(script))
            max_one = script.pop(0)
            """
            for i in script:
                if len(i) > len(max_one):
                    max_one = i
            """
            pictures = json.loads(js2py.eval_js(max_one))
            items = []
            for index, picture in enumerate(pictures, start=1):
                item = PictureItem()
                item["name"] = response.meta["manhua_title"]
                item["chapter"] = title
                item["page"] = index
                item["url"] = urljoin("http://images.dmzj.com",unquote(picture))
                items.append(item)
            return items
        except (IndexError, TypeError):
            return self.parse_selector(response)

    def parse_new_chapter_page(self,response):
        # print("parse_chapter_page",response.url)
        sel = Selector(response)
        title = sel.xpath("//title").extract_first()
        # print("title is: ",title)
        script = sel.xpath("//script").extract_first()
        # extract_first() may return None here as well; treat that like a non-chapter page
        if script is None or re.search("src", script) is not None:
            return self.parse_selector(response)
        # print(type(script))
        # print("script is: ",script)
        try:
            script = re.findall(r"eval\(.+\)", script)
            # print("find script is: ",script)
            # print("length is: ",len(script))
            max_one = script.pop(0)
            """
            for i in script:
                if len(i) > len(max_one):
                    max_one = i
            """
            pictures = json.loads(js2py.eval_js(max_one))
            items = []
            for index, picture in enumerate(pictures, start=1):
                item = PictureItem()
                item["name"] = response.meta["manhua_title"]
                item["chapter"] = title
                item["page"] = index
                item["url"] = urljoin("http://images.dmzj.com",unquote(picture))
                items.append(item)
            return items
        except (IndexError, TypeError):
            return self.parse_selector(response)