
Scraping data from web pages in R with the XML package


In recent years a lot of data has been released publicly in different formats, but sometimes the data we're interested in is still inside the HTML of a web page: let's see how to get it.

One of the existing packages for doing this job is the XML package. This package allows us to read and create XML and HTML documents; among its many features, there's a function called readHTMLTable() that analyzes the parsed HTML and returns the tables present in the page. The details are available in the package's official documentation.
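As a quick illustration (not from the original post), readHTMLTable() can also be run on HTML parsed from a string, which makes it easy to experiment offline; the table below is made up for the example:

```r
library(XML)

# a tiny HTML document with a single table, parsed from a string
html <- "<html><body><table>
  <tr><th>year</th><th>pop</th></tr>
  <tr><td>1990</td><td>56719</td></tr>
  <tr><td>1991</td><td>56759</td></tr>
</table></body></html>"

doc <- htmlParse(html, asText = TRUE)

# readHTMLTable() returns a list with one data frame per <table> in the page
tables <- readHTMLTable(doc, colClasses = c("integer", "integer"))

length(tables)   # number of tables found in the document
tables[[1]]      # the parsed table as a data frame
```

The <th> row is detected as the header, so the data frame's columns are named year and pop.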

Let's start. 
Suppose we're interested in the Italian demographic info present in this page http://sdw.ecb.europa.eu/browse.do?node=2120803 from the EU website. We start by loading and parsing the page:

library(XML)

url <- "http://sdw.ecb.europa.eu/browse.do?node=2120803"
parsed <- htmlParse(url)

Now that we have the parsed HTML, we can use the readHTMLTable() function to get a list of all the tables present in the page; we'll call the function with these parameters:

  • parsed: the parsed HTML
  • skip.rows: the rows we want to skip (at the beginning of this table there are a couple of rows that don't contain data but just formatting elements)
  • colClasses: the datatype of the different columns of the table (in our case all the columns have integer values); the rep() function is used to replicate the "integer" value 31 times
table <- readHTMLTable(parsed, skip.rows=c(1,3,4,5), colClasses = c(rep("integer", 31)))
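To see what skip.rows does, here's a hypothetical miniature table with a non-data formatting row at the top (the values are invented for the example, not the ECB data):

```r
library(XML)

html <- "<table>
  <tr><td>formatting row</td><td>spacer</td></tr>
  <tr><td>1990</td><td>56719</td></tr>
  <tr><td>1991</td><td>56759</td></tr>
</table>"

doc <- htmlParse(html, asText = TRUE)

# skip.rows drops the listed row numbers before the colClasses conversion,
# so the non-numeric formatting row doesn't break the "integer" parsing
t1 <- readHTMLTable(doc, header = FALSE, skip.rows = 1,
                    colClasses = c("integer", "integer"))[[1]]
```

Without skip.rows, the text in the first row would make the integer conversion fail for that column.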

As we can see from the page source code, this web page contains six HTML tables; the one that contains the data we're interested in is the fifth, so we extract that one from the list of tables, as a data frame:

values <- as.data.frame(table[5])
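Note that table[5] uses single-bracket indexing, which returns a one-element list; as.data.frame() then flattens it. Double-bracket indexing extracts the data frame directly. A small base-R illustration (the list here is made up, not the ECB tables):

```r
tables <- list(first = data.frame(a = 1:2), second = data.frame(b = 3:4))

sub <- tables[2]            # a list of length 1, still wrapping the data frame
df1 <- as.data.frame(sub)   # flattens the one-element list into a data frame
df2 <- tables[[2]]          # the data frame itself, directly

# both routes yield the same values; df1's column name is derived from
# the list element's name rather than the original column name
```

So values <- table[[5]] would be the more direct idiom, though both work.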

Just for convenience, we rename the columns with the period and Italian data:

# renames the columns for the period and Italy
colnames(values)[1] <- 'Period'
colnames(values)[19] <- 'Italy'

The Italian data runs from 1990 to 2014, so we have to subset only those rows and, of course, only the two columns with the period and the Italian data:

# subsets the data: we are interested only in the first and the 19th column (period and Italian info)
ids <- values[c(1,19)]

# Italy has only 25 years of info, so we cut away the other rows
ids <- as.data.frame(ids[1:25,])
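The two subsetting steps can be sketched on synthetic data (the numbers below are invented, not the ECB figures):

```r
# a fake table with 30 rows and three columns, standing in for the
# data frame produced by readHTMLTable()
values <- data.frame(Period = 1990:2019,
                     Other  = seq_len(30),
                     Italy  = 56719 + seq_len(30))

# single-bracket indexing with a vector of column positions keeps a data frame
ids <- values[c(1, 3)]   # only the Period and Italy columns
ids <- ids[1:25, ]       # only the first 25 rows (1990-2014)
```

Because we keep two columns, the row subset ids[1:25, ] stays a data frame, so the extra as.data.frame() wrapper in the post is harmless but not strictly needed here.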

Now we can plot these data calling the plot function with these parameters:

  • ids: the data to plot
  • xlab: the label of the X axis
  • ylab: the label of the Y axis
  • main: the title of the plot
  • pch: the symbol to draw for every point (19 is a solid circle: see ?points for an overview)
  • cex: the size of the symbol
plot(ids, xlab="Year", ylab="Population in thousands", main="Population 1990-2014", pch=19, cex=0.5)

and here is the result:

[Plot: Italian population in thousands, 1990-2014]
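If you want the plot written to a file instead of an interactive device (for example on a headless server), the same plot() call can be wrapped in a graphics device; this is a generic base-R sketch with made-up data:

```r
# synthetic stand-in for the scraped Period/Italy data frame
ids <- data.frame(Period = 1990:2014, Italy = 56719 + 1:25)

out <- tempfile(fileext = ".png")
png(out, width = 600, height = 400)   # open a PNG graphics device
plot(ids, xlab = "Year", ylab = "Population in thousands",
     main = "Population 1990-2014", pch = 19, cex = 0.5)
dev.off()                             # close the device and write the file
```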


Here's the full code, also available on my GitHub:

library(XML)

# sets the URL
url <- "http://sdw.ecb.europa.eu/browse.do?node=2120803"

# let the XML library parse the HTML of the page
parsed <- htmlParse(url)

# reads the HTML table present inside the page, paying attention
# to the data types contained in the HTML table
table <- readHTMLTable(parsed, skip.rows=c(1,3,4,5), colClasses = c(rep("integer", 31) ))

# this web page contains six HTML tables, but the one that contains the data
# is the fifth
values <- as.data.frame(table[5])

# renames the columns for the period and Italy
colnames(values)[1] <- 'Period'
colnames(values)[19] <- 'Italy'

# now subsets the data: we are interested only in the first and 
# the 19th column (period and Italy info)
ids <- values[c(1,19)]

# Italy has only 25 years of info, so we cut away the other rows
ids <- as.data.frame(ids[1:25,])

# plots the data
plot(ids, xlab="Year", ylab="Population in thousands", main="Population 1990-2014", pch=19, cex=0.5)

from: http://andreaiacono.blogspot.com/2014/01/scraping-data-from-web-pages-in-r-with.html
