Preface: you need to install the scrapy and pymongo libraries yourself, and install the MongoDB database.
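For reference, the two Python libraries can be installed with pip (this assumes a working Python environment; the MongoDB server itself is installed separately):

pip install scrapy pymongo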

1. Create a Scrapy project named jdspider

scrapy startproject jdspider

2. Create a spider file based on CrawlSpider, named jd, for the target site jd.com

scrapy genspider -t crawl jd jd.com
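This generates jdspider/spiders/jd.py with roughly the following CrawlSpider skeleton (the exact template varies by Scrapy version); step 6 below replaces it with the real rules and parsing code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JdSpider(CrawlSpider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder generated by the template; filled in later
        item = {}
        return item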

3. Configure the spider settings (settings.py): do not obey robots.txt, set a download delay, disable cookies, and enable AutoThrottle.

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = False
AUTOTHROTTLE_ENABLED = True

4. Define the data to scrape (items.py): the fields are the product name (name), product price (price), and product detail page URL (url).

import scrapy

class JdspiderItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

5. Set a random user agent for the spider

Define USER_AGENTS = [] in settings.py (search online for user-agent strings yourself), for example:

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
]

Define the RandomUserAgent middleware in middlewares.py

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from jdspider.settings import USER_AGENTS

class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # pick a random user agent for each outgoing request
        ua = random.choice(USER_AGENTS)
        request.headers.setdefault('User-Agent', ua)

Enable the RandomUserAgent middleware in settings.py

DOWNLOADER_MIDDLEWARES = {
    'jdspider.middlewares.RandomUserAgent': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

6. Complete the following in the spider file jd.py

Define the link extraction rules: extract links to the phone list pages and links to the phone detail pages.
Define the data extraction method for product detail pages (parse_item).
Define the price extraction method for product detail pages (parse_price).
(The real request URL for the product price is:
https://p.3.cn/prices/mgets?callback=jQuery5302887&type=1&area=24_2144_2145_0&pdtk=&pduid=15679981877981235997960&pdpin=&pin=null&pdbp=0&skuIds=J_100001550349&ext=11100000&source=item-pc
where skuIds=J_100001550349 is the product id; a standalone check of this endpoint is sketched right below.)
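Before wiring this into the spider, the price endpoint can be tried in isolation. The sketch below is not part of the project; it assumes the requests library is available and that the endpoint still responds in the JSONP format described in the spider code:

import json
import requests

# sku id taken from a product detail page URL, e.g. https://item.jd.com/100001550349.html
skuid = '100001550349'
price_url = ('https://p.3.cn/prices/mgets?callback=jQuery5302887&type=1'
             '&area=24_2144_2145_0&pdtk=&pduid=15679981877981235997960'
             '&pdpin=&pin=null&pdbp=0&skuIds=J_' + skuid +
             '&ext=11100000&source=item-pc')
text = requests.get(price_url).text
# the response is JSONP, e.g. jQuery5302887([{"id":"J_100001550349","p":"899.00",...}]);
# strip the callback wrapper and parse the JSON object inside
data = json.loads(text[text.index('{'):text.index('}') + 1])
print(data['p'])  # the value of "p" is the price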

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jdspider.items import JdspiderItem
import json


class JdSpider(CrawlSpider):
    name = 'jd'
    allowed_domains = ['jd.com', 'p.3.cn']
    start_urls = ['https://list.jd.com/list.html?cat=9987,653,655']

    rules = (
        # follow the phone list pages
        Rule(LinkExtractor(allow='https://list.jd.com/list.html?.*cat=9987,653,655.*'), follow=True),
        # parse the phone detail pages
        Rule(LinkExtractor(allow=('https://item.jd.com/.*html',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = JdspiderItem()
        item['name'] = response.xpath('/html/body/div[6]/div/div[2]/div[1]/text()').extract_first()
        item['name'] = item['name'].strip()
        item['price'] = response.xpath('//span[starts-with(@class,"price")]/text()').extract_first()
        item['url'] = response.url
        # the sku id is the numeric part of the detail page URL
        skuid = response.url
        skuid = skuid.replace('https://item.jd.com/', '')
        skuid = skuid.replace('.html', '')
        price_url = ('https://p.3.cn/prices/mgets?callback=jQuery5302887&type=1'
                     '&area=24_2144_2145_0&pdtk=&pduid=15679981877981235997960'
                     '&pdpin=&pin=null&pdbp=0&skuIds=J_' + skuid +
                     '&ext=11100000&source=item-pc')
        yield scrapy.Request(price_url, meta={'item': item}, callback=self.parse_price)

    def parse_price(self, response):
        item = response.meta['item']
        p = response.text
        # The response looks like:
        # jQuery5302887([{"cbf":"","id":"J_100001550349","m":"9999.00","op":"1099.00","p":"899.00"}]);
        # the value of "p" in this string is the price; the code below extracts it.
        s1 = p.index('{')
        s2 = p.index('}')
        pp = p[s1:s2 + 1]
        pp = json.loads(pp)
        item['price'] = pp['p']
        yield item

7. Save the scraped data to the MongoDB database

Modify JdspiderPipeline in pipelines.py

from pymongo import MongoClient


class JdspiderPipeline(object):
    def open_spider(self, spider):
        # IP address of your own MongoDB server
        self.client = MongoClient('172.16.37.62', 27017)
        self.db = self.client.jddb
        self.collection = self.db.jddb_collection

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

Enable JdspiderPipeline in settings.py

ITEM_PIPELINES = {
    'jdspider.pipelines.JdspiderPipeline': 300,
}
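With the pipeline enabled, run the spider from the project directory with scrapy crawl jd. The scraped items can then be inspected directly in MongoDB; a minimal verification sketch, assuming the same host, database, and collection names used in the pipeline above:

from pymongo import MongoClient

# connect to the same MongoDB instance and collection used by the pipeline
client = MongoClient('172.16.37.62', 27017)
collection = client.jddb.jddb_collection

print(collection.count_documents({}))  # number of items saved
print(collection.find_one())           # one sample item: name, price, url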
