xml - Extract links from html table

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to extract the links for the resources of type "Specimen" from the table on http://ipt.humboldt.org.co/. I can get the table from the webpage using the following code:

library(XML)
# Parse the page and collect every <table> node
sitePage <- htmlParse("http://ipt.humboldt.org.co/")
tableNodes <- getNodeSet(sitePage, "//table")
# Convert the first table to a data frame
siteTable <- readHTMLTable(tableNodes[[1]])

However, the links are missing from the result of readHTMLTable.

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:22:56+0000

It ended up being an intricate XPath expression:

library(XML)
sitePage <- htmlParse("http://ipt.humboldt.org.co/")
hyperlinksYouNeed <- getNodeSet(sitePage, "//table[@id='resourcestable']
                                          //td[5][.='Specimen']
                                          /preceding-sibling
                                          ::td[3]
                                          /a
                                          /@href")

Here is the XPath expression explained piece by piece:

//table[@id='resourcestable'] -> Select the main table on the page, whose id is 'resourcestable'.
//td[5][.='Specimen'] -> Keep only the fifth cell of a row when its text is exactly 'Specimen', i.e. the rows whose Type is Specimen.
/preceding-sibling -> Switch to the preceding-sibling axis, which looks backwards along the row.
::td[3] -> Take the third cell counting backwards from the Type cell. Be careful: preceding-sibling counts in reverse, so td[1] is the cell immediately before the Type cell, td[2] is the Organisation column and td[3] is the Name column we want.
/a -> Get the a node inside that cell.
/@href -> Finally, take that node's href attribute.
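
The expression can be tried offline on a small inline table. The sketch below is only an illustration: the HTML snippet, its column layout (Logo, Name, Organisation, Subtype, Type) and the href values are assumptions made up for the example, not the real page. It also uses xpathSApply with xmlGetAttr instead of selecting /@href directly, which is one common way to get the links back as a plain character vector.

```r
library(XML)

# Hypothetical miniature of the resources table. The column layout and
# the href values are assumptions for illustration, not the real page.
html <- '
<table id="resourcestable">
  <tr><th>Logo</th><th>Name</th><th>Organisation</th><th>Subtype</th><th>Type</th></tr>
  <tr><td></td><td><a href="resource?r=a">A</a></td><td>Org 1</td><td>x</td><td>Specimen</td></tr>
  <tr><td></td><td><a href="resource?r=b">B</a></td><td>Org 2</td><td>y</td><td>Checklist</td></tr>
  <tr><td></td><td><a href="resource?r=c">C</a></td><td>Org 3</td><td>z</td><td>Specimen</td></tr>
</table>'

# Parse the string itself rather than treating it as a file name or URL
doc <- htmlParse(html, asText = TRUE)

# Same path as in the answer, stopping at the <a> node and reading
# its href attribute with xmlGetAttr
links <- xpathSApply(
  doc,
  "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a",
  xmlGetAttr, "href"
)
print(links)
```

Only the rows whose fifth cell is 'Specimen' (A and C in this made-up table) should contribute a link. On the real page the hrefs may be relative, in which case they would still need to be combined with the site's base URL.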
