
Python Problem Log #11 - Accessing a Website with a Web Crawler

by Gemma, November 16, 2024

This Week's Goal

Learn the basic usage and applications of a web crawler

Tasks

Access a website with browser information attached to the request and retrieve the page data
Use BeautifulSoup to extract the page title

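Project Practice

Problem Encountered

I had already added browser information to the request to avoid being blocked as a bot, but the data still would not come through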

import urllib.request as req
url="https://www.ptt.cc/bbs/movie/index.html"
request=req.Request(url, headers={
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
})
with req.urlopen(url) as response:
    data=response.read().decode("utf-8")
print(data)

Log

Error message: the request still fails with Forbidden

urllib.error.HTTPError: HTTP Error 403: Forbidden
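
The traceback above is raised as urllib.error.HTTPError, so it can be caught instead of stopping the script. A minimal sketch, assuming a helper named fetch_html with an open_func parameter (both hypothetical, introduced here only so the error path can be exercised without a network call):

```python
import urllib.error
import urllib.request as req


def fetch_html(request, open_func=req.urlopen):
    """Return the decoded page body, or None if the server answers with an HTTP error."""
    try:
        with open_func(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (403 here); err.reason is the text ("Forbidden")
        print(f"HTTP {err.code}: {err.reason}")
        return None
```

Called with either a URL string or a Request object, this returns None on a 403 rather than raising, which makes the status easier to spot while debugging.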

Checking the source code revealed:

with req.urlopen(url) as response:

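urlopen was still being passed url instead of request, so the Request object carrying the user-agent header was never sent and urllib fell back to its default Python-urllib user agent. After changing the argument to request, the page can be fetched.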
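
The difference between the two arguments can be checked without touching the network. A small sketch, relying on two urllib details: Request stores header names capitalized (so "user-agent" becomes "User-agent"), and the default opener carries a Python-urllib user agent that is used when a request has none. The variable names are illustrative only:

```python
import urllib.request as req

url = "https://www.ptt.cc/bbs/movie/index.html"
browser_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # shortened for the example

# The Request object is what carries the custom header
request = req.Request(url, headers={"user-agent": browser_agent})
print(request.get_header("User-agent"))

# A bare URL string has no headers, so urlopen(url) falls back to the opener default
default_headers = dict(req.build_opener().addheaders)
print(default_headers["User-agent"])
```

The first print shows the browser string and the second shows a value starting with Python-urllib/, which is the kind of user agent a site may reject with 403.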

Corrected source code

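import urllib.request as req

url = "https://www.ptt.cc/bbs/movie/index.html"
# Attach a browser-like User-Agent so the site does not reject the request as a bot
request = req.Request(url, headers={
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
})
# Pass the Request object (not the bare url) so the custom headers are actually sent
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")
print(data)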

Output

The output is too long to copy and paste here
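
Since the full HTML is long, printing only the page title is a quicker check that the fetch worked. The task list plans to use BeautifulSoup for this; the sketch below swaps in the standard library's html.parser so it runs without installing anything. TitleParser and sample_html are hypothetical stand-ins; in practice the data string from the corrected code would be fed in instead.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self._in_title:
            self.title += data


# Hypothetical sample page; replace with the fetched `data` string
sample_html = "<html><head><title>Sample board</title></head><body><p>Hello</p></body></html>"

parser = TitleParser()
parser.feed(sample_html)
page_title = parser.title.strip()
print(page_title)
```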
