Two sources of natural history information that I use often are the allaboutbirds.org and birdsoftheworld.org databases, both managed by the Cornell Lab of Ornithology and both based on the exhaustive Handbook of the Birds of the World. These online databases contain a trove of valuable natural history information, but extracting the data into a format usable for quantitative applications is not straightforward.
One solution to this problem is html web scraping, and the purpose of this post is to provide a brief overview of some simple web scraping packages in R and how to use them. This seemed especially useful to write up because I could find few online tutorials on scraping ecological information from textual databases.
It might seem that simply downloading the full html text and applying various regular expressions would be the easiest way to approach this problem, and for very simple applications that may be true. But this approach requires complex code with many if/else statements, is prone to errors, and ignores the structured nature of the underlying html. Fortunately, with a basic understanding of how html is structured and formatted, extracting any piece of information you want from an online data source is relatively simple and repeatable.
To begin, we're going to extract some basic life history data for the Painted Bunting from allaboutbirds.org. Information about this species can be accessed freely at https://www.allaboutbirds.org/guide/Painted_Bunting/. The page looks something like this:
In the gray summary box you can find important information on the species' preferred habitat, food, nesting, behavior, and conservation status. The steps below describe how to quickly and accurately extract this information and save it in a usable format.
Step 1: Loading html text into R
To start, we need to access the html text and load it into R. There are several functions that will do this, one of which is the 'read_html' function in the rvest package.
>library(rvest)
>bunting_html<-read_html('https://www.allaboutbirds.org/guide/Painted_Bunting/')
Step 2: Identifying relevant html nodes
The trick to web scraping is understanding how the html data are structured using elements and node sets. Fortunately, you don't need to know all that much about html to perform simple web scrapes, but reviewing the basics is certainly helpful. The most important thing to understand is that html data are organized in a hierarchical and predictable structure: a series of headings, sub-headings, and associated content in the form of text, graphics, and whatever other elements are found within the webpage.
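For example, the life history box on a species page boils down to markup along these lines (a simplified sketch for illustration, not the page's exact html):
<html>
  <body>
    ...
    <ul class="LH-menu">
      <li><img src="/guide/images/icons/icon-scrub.png" alt="Habitat Scrub"></li>
      <li><img src="/guide/images/icons/icon-seeds.png" alt="Food Seeds"></li>
      ...
    </ul>
    ...
  </body>
</html>
Each tag opens an element, elements nest inside one another, and attributes like class and alt carry the labels we will use to find and extract content.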
To visualize this architecture, you can use the inspect function in your web browser. I am using Firefox below, where this is accomplished by right-clicking anywhere within the body of the web page.
You will also notice that moving your cursor to various lines in the inspector tool will automatically highlight the areas of the webpage that are generated by that particular line of code. This feature is very helpful for identifying which sections of the html code you need to access to get the data you're looking for. Similarly, where you right-click within the webpage to pull up the inspector tool determines which html node set is highlighted.
The section we are interested in is identified by the line <ul class="LH-menu">, which is highlighted in blue in the above image. If you hover over this line, it will highlight the section in the gray box that contains the five life history data points that we are interested in scraping.
We need the name of this node to access it in R. The name of a node is not always obvious. We know that it is a 'ul' node, but we don't want all of the 'ul' nodes present on this page, only the specific node associated with the highlighted portion of the webpage. To get the name, simply right-click on the appropriate line in the inspector tool and select 'Copy CSS selector.' The exact menu item may differ slightly in other web browsers, but it should be very similar.
Paste this somewhere, and we can see that the name of the node is '.LH-menu'.
Step 3: Accessing the relevant information
We will also use the rvest package to parse specific elements of the html data. Using rvest requires some familiarity with pipes via the '%>%' operator; if you have used other tidyverse packages, you will likely already be familiar with this.
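In short, the pipe passes the object on its left as the first argument to the function on its right, so the two calls below are equivalent:
>## without the pipe
>html_nodes(bunting_html, '.LH-menu')
>## with the pipe
>bunting_html %>% html_nodes('.LH-menu')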
To get the data from the .LH-menu node, we use the 'html_nodes' function:
>bunting_data<-bunting_html %>% html_nodes('.LH-menu')
This returns all of the content associated with the five life history categories shown above.
This, however, is a bit more information than we need. With some searching within the inspector tool, we can see that the data we want (i.e., the five life history categories shown in the gray box above) are available within the 'img' sub-nodes. Accessing sub-nodes works the same way as above:
>bunting_data2<- bunting_data %>% html_nodes('img')
which returns:
{xml_nodeset (5)}
[1] <img src="/guide/images/icons/icon-scrub.png" alt="Habitat Scrub">
[2] <img src="/guide/images/icons/icon-seeds.png" alt="Food Seeds">
[3] <img src="/guide/images/icons/icon-shrub.png" alt="Nesting Shrub">
[4] <img src="/guide/images/icons/icon-ground-forager.png" alt="Behavior Ground Forager">
[5] <img src="/guide/images/icons/icon-low-concern.png" alt="Conservation Low Concern">
The data we are interested in appear in the alt attribute on the right-hand side of each line.
And this basically gets us all the way there. Figuring out the correct nodes to call is the biggest challenge with web scraping. Every situation will be different, and will likely require some tinkering and trial and error. Using the inspector tool in your web browser will help you identify the sections, node names, and relevant text that you need.
At this point, we could just extract the relevant text from each of the five nodes in bunting_data2 using gsub or str_extract from the stringr package. For example:
>library(stringr)
>str_extract_all(word(as.character(bunting_data2[1]), sep="alt=", -1), '\\w+')
returns:
[[1]]
[1] "Habitat" "Scrub"
But we can also do this more directly by identifying the values of the attributes found within the nodes themselves. In the bunting_data2 object, each node contains two attributes, 'src' and 'alt', with the 'alt' attribute containing the life history information we are after.
Accessing attributes requires the xml2 package (other packages exist as well). To get the value of the 'alt' attribute from each node, we use:
>library(xml2)
>bunting_attrs<-xml_attr(bunting_data2, 'alt')
We then split each string on spaces with str_split from the stringr package and drop the leading category names:
>bunting_data_3<-str_split(bunting_attrs, " ")
>bunting_data_final<-lapply(bunting_data_3, function(x){paste(x[-1], collapse=" ")})
bunting_data_final is a list of length five containing the value for each of the life history categories we are looking for.
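To make the result easier to work with, we can attach the category names, which are the first word of each alt string (a small sketch using the objects created above):
>## pair each category name with its value
>categories<-sapply(bunting_data_3, function(x){x[1]})
>bunting_summary<-setNames(unlist(bunting_data_final), categories)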
Step 4: Automating across multiple urls/species
Of course, going through the above steps takes much longer than simply looking at the webpage and transferring the data we want into Excel. In most cases, though, we wish to access the same information for many species, perhaps all of the species in North America or even the globe.
For allaboutbirds.org, iterating over species is quite straightforward because of how the urls are structured. Specifically, species pages are identified using common names, with words separated by underscores. As such, a list of common names can be used to do something like:
>SpeciesList<-c('Painted_Bunting', 'American_Robin', 'Blue_Jay', 'American_Crow')
>html_data<-vector(length=length(SpeciesList), mode="list")
>for(i in 1:length(SpeciesList)){
> web_data<-paste("https://www.allaboutbirds.org/guide/", SpeciesList[i], sep="")
> html_data[[i]]<-read_html(web_data) ## modify this last part based on your needs
>}
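From here, the earlier steps can be applied to each stored page. For example, reusing the node names identified above:
>## pull the alt attributes for every species in the list
>lh_data<-lapply(html_data, function(h){
> h %>% html_nodes('.LH-menu') %>% html_nodes('img') %>% xml_attr('alt')
>})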
Accessing multiple pages is more difficult for birdsoftheworld.org because the individual species pages are referenced using unique six-letter codes. To access multiple pages, we need to know these codes, but from what I can tell they are unique to birdsoftheworld.org; I was unable to find them from any other source. Fortunately, we can scrape the codes directly from the website:
>species_page<-read_html('https://birdsoftheworld.org/bow/specieslist')
>taxonomyNodes<- species_page %>% html_nodes('a')
>taxonomyNodes_codes<-taxonomyNodes %>% html_attr('href')
>BOW_codes<-lapply(taxonomyNodes_codes, function(x){str_match(x, '/bow/species/(.*?)/cur/introduction')[,2]})
>BOW_codes<-unlist(BOW_codes)
With the BOW_codes in hand, the individual species pages can be easily accessed.
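For example, following the url pattern used in the regular expression above (a sketch; bow_urls is a name I am introducing here):
>BOW_codes<-BOW_codes[!is.na(BOW_codes)] ## keep only the links that matched
>bow_urls<-paste('https://birdsoftheworld.org/bow/species/', BOW_codes, '/cur/introduction', sep="")
>bow_html<-lapply(bow_urls[1:5], read_html) ## test on the first five species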
Final Note
The birdsoftheworld.org pages contain much more information on most species than does allaboutbirds.org. Furthermore, the information present in each species page is more variable, including how the page content is structured. I mention this because, in my experience, scraping data from birdsoftheworld.org is much more prone to errors. One solution is to add a tryCatch call to the loop. This useful function prevents termination of the loop when an error occurs for a particular species (e.g., if data are missing or the url turns out to be wrong). Something along the lines of:
>tryCatch(suppressWarnings(scrape_function()), error = function(e) NA)
where scrape_function() represents the specific scraping task that you are trying to perform.
In addition to tryCatch, you may also need to include one or more if/else statements in your loop to make sure that you are accessing the correct information.
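Putting the pieces together, a robust version of the allaboutbirds.org loop might look something like this (a sketch; scrape_species is a hypothetical helper that wraps the steps from earlier):
>scrape_species<-function(species_name){
> url<-paste("https://www.allaboutbirds.org/guide/", species_name, sep="")
> page<-read_html(url)
> nodes<-page %>% html_nodes('.LH-menu') %>% html_nodes('img')
> xml_attr(nodes, 'alt')
>}
>bird_results<-lapply(SpeciesList, function(s){
> tryCatch(suppressWarnings(scrape_species(s)), error=function(e) NA)
>})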