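Goals for this week
Learn the basics of web crawling and how to apply them
Tasks
- Send browser information with a request to a website and retrieve the page data
- Use BeautifulSoup to extract page titles
Project practice
Problem encountered
Browser information was added to the request to avoid being blocked as a bot, but the data still does not come back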
import urllib.request as req
url="https://www.ptt.cc/bbs/movie/index.html"
request=req.Request(url, headers={
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
})
with req.urlopen(url) as response:
    data=response.read().decode("utf-8")
print(data)
Log
Error message: the request still fails with Forbidden
urllib.error.HTTPError: HTTP Error 403: Forbidden
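Inspecting the source code revealed:
with req.urlopen(url) as response:
url was passed to urlopen instead of request, so the custom user-agent header was never sent. After changing it to request, the page can be fetched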
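To see why passing url instead of request leads to the 403, it can help to compare the User-Agent each call would send. The sketch below makes no network request; it only inspects the objects urllib builds. BROWSER_UA, custom_ua and default_ua are names introduced here for illustration, and the assumption is that the server rejects urllib's default User-Agent.

```python
import urllib.request as req

url = "https://www.ptt.cc/bbs/movie/index.html"

# BROWSER_UA is an illustrative name, not part of the original script.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)

# The Request object carries the custom header. urllib stores header
# names capitalized, so "user-agent" is looked up as "User-agent".
request = req.Request(url, headers={"user-agent": BROWSER_UA})
custom_ua = request.get_header("User-agent")

# A plain URL string has no headers attached, so the opener falls back
# to its own default, which identifies the client as Python-urllib.
default_ua = dict(req.build_opener().addheaders)["User-agent"]

print("with urlopen(request):", custom_ua)
print("with urlopen(url):    ", default_ua)
```

The second line printed starts with Python-urllib/, which is the value a server sees when the Request object is built but never passed to urlopen.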
Corrected source code
import urllib.request as req
url="https://www.ptt.cc/bbs/movie/index.html"
request=req.Request(url, headers={
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
})
with req.urlopen(request) as response:
    data=response.read().decode("utf-8")
print(data)
Output
The returned HTML is too long to paste here
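One possible way to make the url/request mix-up harder to repeat is to wrap the corrected script in a small helper that builds the Request and opens it in the same place. This is only a sketch: fetch_html, BROWSER_UA and the opener parameter are hypothetical names added for illustration (opener exists so the function can be exercised without touching the network), and returning None on an HTTP error is a design assumption rather than something taken from the original notes.

```python
import urllib.request as req
from urllib.error import HTTPError

# Illustrative constant holding the same User-Agent string used in the notes.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)


def fetch_html(url, user_agent=BROWSER_UA, opener=req.urlopen):
    """Return the page body as text, or None if the server replies with an HTTP error."""
    request = req.Request(url, headers={"user-agent": user_agent})
    try:
        # Open the Request object, not the bare URL, so the header is sent.
        with opener(request) as response:
            return response.read().decode("utf-8")
    except HTTPError as err:
        # Mirrors the message seen in the log, e.g. "HTTP Error 403: Forbidden".
        print(f"HTTP Error {err.code}: {err.reason}")
        return None
```

Calling fetch_html("https://www.ptt.cc/bbs/movie/index.html") would then return the HTML text on success; printing only a slice such as the first 500 characters keeps the output readable.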