Python: Asyncio Asynchronous Programming - Two Ways to Do Web Scraping

So far we have looked at asynchronous programming with asyncio. In this post I'd like to show a web scraping example, one of the most common things asyncio is used for.

Web scraping with aiohttp

This is the approach I introduced in the previous post. With aiohttp, HTTP-based asynchronous I/O is easy to write.

import asyncio
from time import time
import aiohttp

# List of websites to crawl
urls = [
    "https://www.naver.com",
    "https://www.daum.net",
    "https://www.tistory.com/",
    "https://www.google.com"
]

# Request a URL through the session and return the response body as text
async def fetch_url(session, url):
    async with session.get(url) as response:
        return await response.text()

# Open one session and request all URLs concurrently
async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

        # Note: response.text() returns a decoded str, so len() counts characters, not raw bytes
        for i, content in enumerate(results):
            print(f"URL {i+1}: {urls[i]} - {len(content)} bytes fetched")

if __name__ == "__main__":
    start_time = time()
    asyncio.run(main())
    print(f"Time taken: {time() - start_time:.2f} seconds")

The result is shown below. On my PC it took 1.68 seconds.

URL 1: https://www.naver.com - 237666 bytes fetched
URL 2: https://www.daum.net - 559320 bytes fetched
URL 3: https://www.tistory.com/ - 33537 bytes fetched
URL 4: https://www.google.com - 20106 bytes fetched
Time taken: 1.68 seconds
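
To see why the total time stays close to the slowest single request rather than the sum of all of them, here is a minimal sketch that needs no network access. The function `fake_fetch` and its delays are hypothetical stand-ins for real HTTP requests; only `asyncio.gather` and `asyncio.sleep` are the real library calls.

```python
import asyncio
from time import perf_counter

# Hypothetical stand-in for a network request: waits `delay` seconds, then returns a label
async def fake_fetch(name, delay):
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_concurrently(delays):
    # Start all coroutines together; gather returns results in input order
    coros = [fake_fetch(f"req{i}", d) for i, d in enumerate(delays)]
    return await asyncio.gather(*coros)

def timed_run(delays):
    # Measure the wall-clock time of the whole concurrent batch
    start = perf_counter()
    results = asyncio.run(run_concurrently(delays))
    return results, perf_counter() - start

if __name__ == "__main__":
    results, elapsed = timed_run([0.3, 0.1, 0.2])
    print(results)
    print(f"elapsed: {elapsed:.2f} seconds (the delays add up to 0.60)")
```

The elapsed time should land near the longest delay, because the waits overlap instead of running back to back.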

Example using urlopen from urllib.request

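urlopen is not an asynchronous function, so unlike aiohttp it cannot request several URLs concurrently on its own. As explained in the previous post, the blocking urlopen call therefore has to be passed to the event loop's run_in_executor, which runs it in a worker thread and lets the coroutine await the result.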

import asyncio
from time import time
from urllib.request import urlopen
from concurrent.futures import ThreadPoolExecutor

# List of websites to crawl
urls = [
    "https://www.naver.com",
    "https://www.daum.net",
    "https://www.tistory.com/",
    "https://www.google.com"
]


# Blocking helper: open the URL and read the whole body inside the worker thread
def fetch_blocking(url):
    with urlopen(url) as res:
        return res.read()

# Hand the blocking helper to the executor so the event loop is never blocked
async def fetch_url(executor, url):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, fetch_blocking, url)

async def main():
    with ThreadPoolExecutor() as executor:
        tasks = [asyncio.ensure_future(fetch_url(executor, url)) for url in urls]
        results = await asyncio.gather(*tasks)

        for i, content in enumerate(results):
            print(f"URL {i+1}: {urls[i]} - {len(content)} bytes fetched")

if __name__ == "__main__":
    start_time = time()
    asyncio.run(main())
    print(f"Time taken: {time() - start_time:.2f} seconds")

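The asyncio.ensure_future function used in the example above belongs to Python's asyncio library. It schedules an asynchronous operation so that it is guaranteed to run: given a coroutine, it wraps it in a Task object, and given an object that is already a Future, it returns that object unchanged. For plain coroutines, the asyncio documentation names asyncio.create_task as the preferred way to create new Tasks.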
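
The two behaviors of asyncio.ensure_future described above can be checked with a small sketch. The coroutine `compute` and the function `demo` are hypothetical names made up for illustration; `asyncio.create_task` is included only for comparison.

```python
import asyncio

# Hypothetical coroutine used only for this illustration
async def compute():
    await asyncio.sleep(0)
    return 42

async def demo():
    # A coroutine object is wrapped in a Task and scheduled on the running loop
    task = asyncio.ensure_future(compute())
    wrapped_is_task = isinstance(task, asyncio.Task)

    # An object that is already a Future (a Task is one) is returned unchanged
    same_object = asyncio.ensure_future(task) is task

    # create_task only accepts coroutines and always returns a new Task
    other = asyncio.create_task(compute())

    return wrapped_is_task, same_object, await task, await other

if __name__ == "__main__":
    print(asyncio.run(demo()))  # prints (True, True, 42, 42)
```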

The output has the same format as before. On my PC this run took 1.09 seconds.

URL 1: https://www.naver.com - 191699 bytes fetched
URL 2: https://www.daum.net - 623405 bytes fetched
URL 3: https://www.tistory.com/ - 35561 bytes fetched
URL 4: https://www.google.com - 21583 bytes fetched
Time taken: 1.09 seconds

aiohttp vs urlopen

Comparing aiohttp and urlopen gives the following.

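Item	aiohttp	urlopen
Processing model	Asynchronous	Synchronous (blocking)
Speed	Fast, because network requests run concurrently	Slow when used on its own, because requests are handled one after another
Code difficulty	Relatively complex	Simple
Library dependency	Requires aiohttp to be installed	Standard library only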

Because urlopen is synchronous, calling every URL one after another makes the total run time grow with each request, so the example above combines asyncio with a thread pool to run the calls in parallel.
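
As a more compact variation on the same thread-based idea, Python 3.9 and later provide `asyncio.to_thread`, which runs a blocking function in the event loop's default thread pool. The sketch below is not the code from this post: `blocking_fetch` is a hypothetical stand-in that sleeps instead of calling urlopen, and the example.com URLs are placeholders that are never actually requested.

```python
import asyncio
import time

# Hypothetical blocking function standing in for urlopen(url).read()
def blocking_fetch(url):
    time.sleep(0.2)  # simulate a blocking network wait
    return f"body of {url}".encode()

async def fetch_all(urls):
    # Each blocking call runs in its own worker thread of the default pool
    return await asyncio.gather(*(asyncio.to_thread(blocking_fetch, u) for u in urls))

def timed(urls):
    # Measure the wall-clock time of the whole batch
    start = time.perf_counter()
    bodies = asyncio.run(fetch_all(urls))
    return bodies, time.perf_counter() - start

if __name__ == "__main__":
    placeholder_urls = [f"https://example.com/{i}" for i in range(3)]
    bodies, elapsed = timed(placeholder_urls)
    print([len(b) for b in bodies])
    print(f"elapsed: {elapsed:.2f} seconds (three sequential calls would take about 0.60)")
```

Whether to pass an explicit ThreadPoolExecutor, as in the post, or rely on the default pool mostly depends on whether you need to control the number of worker threads.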

Conclusion

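aiohttp is a good fit when asynchronous processing is needed, and it performs very well on work where the network is the main bottleneck. urlopen, on the other hand, is useful for simple jobs or when you want to put a script together quickly.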

Understanding both approaches lets you pick the one that best fits a given project's requirements.
