{ "cells": [ { "cell_type": "markdown", "source": [ "# Web scraping con BeautifulSoup\n", "\n", "En este script hacemos una pequeña introducción al uso de BeautifulSoup para extraer datos de una página HTML.\n", "\n", "\n", "En ocasiones deseamos trabajar con datos que están dispersos en páginas HTML.\n", "\n", "En este ejemplo mostramos cómo obtener datos de precios de libros del sitio web de una librería ficticia\n", "\n", " http://books.toscrape.com/\n", "\n", "Para cada uno de los libros en venta en este sitio, deseamos saber lo siguiente:\n", " - título\n", " - precio\n", " - disponibilidad\n", " - rating\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 1, "outputs": [], "source": [ "import pandas as pd\n", "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "import re\n", "\n", "URL_LIBROS = 'http://books.toscrape.com/catalogue/'" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Esta funciones nos ayudará a convertir los datos a pandas" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 2, "outputs": [], "source": [ "def extraer_info(articulo):\n", " \"\"\"\n", " La información de cada libro viene contenida en una etiqueta \"article\", como en este ejemplo\n", "\n", "
\n", "
\n", " \"A\n", " \n", "

\n", " \n", " \n", " \n", " \n", " \n", "

\n", "

\n", " A Light in the ...\n", "

\n", "
\n", "

£51.77

\n", "

\n", " \n", "\n", " In stock\n", "\n", "

\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", " Esta función extrae el título, precio, disponibilidad y rating y lo devuelve como una tupla.\n", "\n", " \"\"\"\n", " titulo = articulo.h3.a['title']\n", " precio = articulo.find('p', attrs=['price_color']).get_text()[1:]\n", " disponible = articulo.find('p', attrs=['instock availability']).get_text().strip()\n", " rating = articulo.find('p', attrs=[re.compile('star-rating ([A-Za-z]+)')])['class']\n", " return (titulo, precio, disponible, rating.split()[1])\n", "\n", "def procesar_pagina(n):\n", " \"\"\"\n", " Procesa la página número \"n\" del sitio http://books.toscrape.com/\n", "\n", " Devuelve un dataframe de pandas con columnas título, precio, disponibilidad y rating,\n", " donde cada fila corresponde a un libro.\n", " \"\"\"\n", "\n", " with urlopen(URL_LIBROS + f'page-{n}.html') as archivo:\n", " soup = BeautifulSoup(archivo, 'xml', from_encoding='UTF-8')\n", " datos_crudos = [extraer_info(articulo) for articulo in soup.findAll('article')]\n", " return pd.DataFrame(datos_crudos, columns=('Título', 'Precio', 'Disponibilidad', 'Rating'))" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Procesar toda la librería" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 3, "outputs": [], "source": [ "datos = pd.concat([procesar_pagina(n) for n in range(1, 51)], ignore_index=True)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Convertir datos de precio a tipo numérico" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 4, "outputs": [], "source": [ "datos['Precio'] = datos['Precio'].astype(float)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Información de la tabla de datos" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 10, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1000 entries, 0 to 999\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Título 1000 non-null object \n", " 1 Precio 1000 non-null float64\n", " 2 Disponibilidad 1000 non-null object \n", " 3 Rating 1000 non-null object \n", "dtypes: float64(1), object(3)\n", "memory usage: 31.4+ KB\n" ] } ], "source": [ "datos.info()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Descripción de las columnas numéricas" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "data": { "text/plain": " Precio\ncount 1000.00000\nmean 35.07035\nstd 14.44669\nmin 10.00000\n25% 22.10750\n50% 35.98000\n75% 47.45750\nmax 59.99000", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Precio
count1000.00000
mean35.07035
std14.44669
min10.00000
25%22.10750
50%35.98000
75%47.45750
max59.99000
\n
" }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datos.describe()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "¿Son más caros los libros mejor calificados?" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 12, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precio por rating\n", "\n", "Rating\n", "Five 35.374490\n", "Four 36.093296\n", "One 34.561195\n", "Three 34.692020\n", "Two 34.810918\n", "Name: Precio, dtype: float64\n" ] } ], "source": [ "print(\"Precio por rating\", datos.groupby('Rating')['Precio'].mean(), sep='\\n\\n')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Convirtamos los datos de Rating a Categorías" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precio por rating, ya ordenados\n", "\n", "Rating\n", "One 34.561195\n", "Two 34.810918\n", "Three 34.692020\n", "Four 36.093296\n", "Five 35.374490\n", "Name: Precio, dtype: float64\n" ] } ], "source": [ "datos['Rating'] = pd.Categorical(datos['Rating'], categories=('One', 'Two', 'Three', 'Four', 'Five'), ordered=True)\n", "\n", "print(\"Precio por rating, ya ordenados\", datos.groupby('Rating')['Precio'].mean(), sep='\\n\\n')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }