Python Problem Log #13: Scraping Multiple Pages with the BeautifulSoup Module and a Loop

by Gemma, November 19, 2024

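This Week's Goal

Learn the basic usage and applications of web crawlers

Tasks

Send browser information with a request to a website and retrieve the page data
Extract page titles with BeautifulSoup
Add a cookie to the request
Scrape multiple pages with a loop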

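Project Practice

Goal: scrape the article titles from pages 1-5 of the PTT Gossiping board

Create a function that sends the request and retrieves the data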

import urllib.request as req
def getData(url):
    # Attach a cookie and a browser user-agent to the request
    request=req.Request(url,headers={
        "cookie":"over18=1",
        "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"})
    # Send the request and decode the response body as UTF-8 text
    with req.urlopen(request) as response:
        data=response.read().decode("utf-8")
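
To check which headers will be attached without touching the network, the request can be built on its own and inspected. This is only a minimal sketch: build_request is a hypothetical helper name that does not appear in the original program, and the comment about the over18 cookie reflects how PTT appears to handle its age confirmation. Note that urllib.request stores header names in capitalized form, so "cookie" is looked up as "Cookie".

```python
import urllib.request as req

def build_request(url):
    # Hypothetical helper: builds the same Request as getData, without sending it
    return req.Request(url, headers={
        "cookie": "over18=1",  # assumed to mark the age confirmation as accepted
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    })

request = build_request("https://www.ptt.cc/bbs/Gossiping/index.html")
# urllib capitalizes header names when it stores them
print(request.get_header("Cookie"))
print(request.has_header("User-agent"))
```

Building the request separately like this makes it easier to confirm the cookie and user-agent are set before debugging anything on the parsing side.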

Parse the page data with the BeautifulSoup module

    from bs4 import BeautifulSoup
    root=BeautifulSoup(data,"html.parser")

Extract the article titles

    titles=root.find_all("div",class_="title")
    for title in titles:
        if title.a:
            print(title.a.string)
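
The same extraction can be tried offline on a small HTML string. The markup below is an assumption modeled on the div class="title" structure used above, not a copy of a real PTT page. The entry without a link stands in for an article that has no a tag, which is the case the if title.a check guards against.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not real PTT output
sample_html = """
<div class="title"><a href="/bbs/Gossiping/M.1.html">First post</a></div>
<div class="title">(this article has been deleted)</div>
<div class="title"><a href="/bbs/Gossiping/M.2.html">Second post</a></div>
"""

root = BeautifulSoup(sample_html, "html.parser")
found = []
for title in root.find_all("div", class_="title"):
    if title.a:  # skip entries that have no <a> tag
        found.append(title.a.string)
print(found)
```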

Note: the following code is wrong and does not fetch any data

for title in titles:
    if title.a:
        print(titles.a.string)

Error message

AttributeError: ResultSet object has no attribute 'a'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

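The variable names have to be consistent: titles.a should be title.a. Inside the loop, each item is a single title element, whereas titles is the whole ResultSet returned by find_all(), which has no attribute a.

Corrected code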

for title in titles:
    if title.a:
        print(title.a.string)
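
The difference between the two names can be observed directly: find_all() returns a list-like ResultSet, and only the individual elements inside it are tags that have an a attribute. A short sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical one-entry page used only for illustration
root = BeautifulSoup('<div class="title"><a href="/x">Hello</a></div>', "html.parser")
titles = root.find_all("div", class_="title")

print(type(titles).__name__)     # type of the whole result
print(type(titles[0]).__name__)  # type of one element in it

try:
    titles.a                     # the mistake: asking the list for .a
    raised = False
except AttributeError:
    raised = True
print(raised)

first_text = titles[0].a.string  # correct: ask a single element
print(first_text)
```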

Find the previous-page link ("‹ 上頁") so the program can follow it automatically

    nextLink=root.find("a",string="‹ 上頁")
    return nextLink["href"]
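
If no matching link is found, find() returns None, and indexing None with ["href"] raises a TypeError. A more defensive variant could look like the sketch below. find_prev_link is a hypothetical helper name, and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

def find_prev_link(root):
    # Hypothetical helper: returns the href of the "‹ 上頁" link, or None if unavailable
    link = root.find("a", string="‹ 上頁")
    if link is None:
        return None
    return link.get("href")  # .get avoids a KeyError when the tag has no href

# Made-up markup for illustration
with_link = BeautifulSoup('<a href="/bbs/Gossiping/index100.html">‹ 上頁</a>', "html.parser")
without_link = BeautifulSoup('<a href="/bbs/Gossiping/index.html">最新</a>', "html.parser")

print(find_prev_link(with_link))
print(find_prev_link(without_link))
```

A caller can then stop the crawl when the helper returns None instead of crashing partway through.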

Main program: crawl multiple pages with a loop

pageURL="https://www.ptt.cc/bbs/Gossiping/index.html"
count=0
while count<5:
    pageURL="https://www.ptt.cc"+getData(pageURL)
    count+=1
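
Because getData returns a relative path, each iteration joins it onto the site root before the next request. The same idea can be written with a for loop and urllib.parse.urljoin, shown here with a stand-in function so it runs without network access. fake_get_data, crawl, and the index100.html path are invented for illustration.

```python
from urllib.parse import urljoin

def fake_get_data(url):
    # Stand-in for getData: pretends every page links back to index100.html
    return "/bbs/Gossiping/index100.html"

def crawl(start_url, pages, fetch):
    visited = []
    page_url = start_url
    for _ in range(pages):
        visited.append(page_url)
        # urljoin resolves the relative href against the current page URL
        page_url = urljoin(page_url, fetch(page_url))
    return visited

visited = crawl("https://www.ptt.cc/bbs/Gossiping/index.html", 3, fake_get_data)
print(visited)
```

Passing the fetch function in as a parameter keeps the loop testable; in the real program it would be getData.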

Complete program

import urllib.request as req
from bs4 import BeautifulSoup

def getData(url):
    # Send the request with a cookie and a browser user-agent
    request=req.Request(url,headers={
        "cookie":"over18=1",
        "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"})
    with req.urlopen(request) as response:
        data=response.read().decode("utf-8")

    # Parse the HTML and print every article title
    root=BeautifulSoup(data,"html.parser")
    titles=root.find_all("div",class_="title")
    for title in titles:
        if title.a:  # skip entries that have no <a> tag
            print(title.a.string)

    # Return the relative URL of the previous page
    nextLink=root.find("a",string="‹ 上頁")
    return nextLink["href"]

# Main program: start at index.html and fetch five pages
pageURL="https://www.ptt.cc/bbs/Gossiping/index.html"
count=0
while count<5:
    pageURL="https://www.ptt.cc"+getData(pageURL)
    count+=1

Output

The output is long, so only a few lines are shown here

[ 好雷] 神鬼戰士2 緊扣羅馬建城神話的故事     
[好雷] 東京攻略[港片]（2000）
[問片] 中年男找街邊女郎純聊天的電影
[情報] 劇場版「進擊的巨人」完結篇1/3上映     
[問片] 一部偵探假裝上吊但假死的影片
[  雷] 櫻桃號
[好雷]《紅色一號》以滿滿心意，打磨出聖誕魔力
[新聞] 「土生花開」記錄下金枝演社 重現吳朋奉珍貴身影
[公告] 電影板板規 2022/12/5
[公告] 禁政治版規 及 投票結果
