Big Data and Environmental Information
Environmental information gathered by modern computational methods has all the hallmarks of “Big Data” including volume, variety, and velocity. This includes information collected by remote sensing from earth observing satellites, information collected by terrestrial, freshwater, and marine sensor networks, and data collected with portable devices such as radio telemetry or global positioning satellite transponders. Even small animals can be paired with “passive integrated transponders” that transform an organism into a data point – and tissue samples and specimens can be taken back to the lab, herbarium, or other natural history collection, with further analysis and data generation and curation done at the molecular and genetic level.
The volume of data available to be collected from the environment approaches infinity, with complex edges like a fractal. In fact, the National Ecological Observatory Network states that the defining characteristic of ecological data is complexity. The complex interactions between elements of the environment, such as distributions of species, interactions between species, and interactions between biotic and abiotic factors, generate a wealth of data points. A February 3, 2014 article entitled “Ecologists urged to avail themselves of big data in studies” suggests that advances in data science and computational methods require a new generation of ecologists who can work with “big datasets at large scale” and mesh together both large and small datasets to better understand the complex dynamics of the natural world.
An adage in the wildlife biology field is to “know your stats and know your critters.” However, a 2012 paper by Strasser and Hampton concerning undergraduate training in data management for ecology students suggests that knowledge of data management tools and methods is lacking in undergraduate curricula. At the University of Tennessee, students of the Environmental Studies Program are encouraged to take an “Environmental Information Science” course available through the College of Communication’s undergraduate minor in Information Science and Technology. Graduate level training is available from the University of New Mexico’s Environmental Information Management Institute, and the University of Tennessee offers a course in “Environmental Informatics.”
These course offerings are needed. In addition to “knowing the stats” and “knowing the critters,” students of ecology are increasingly expected to work with large datasets. Further, working with data that has already been collected is less costly than collecting new data. Collected data becomes increasingly valuable over time, and personnel who can effectively manage and safeguard that investment in research effort bring a new skillset to the field of ecology, beyond skills in quantitative analysis or biology. In “The Age of Big Data,” Steve Lohr wrote that the U.S. would need just shy of 200,000 workers skilled in data analysis, and 1.5 million well versed in data management.
The need for data literacy in the environmental domain is underscored by the variety of applications, what the President’s Council of Advisors on Science and Technology in 2011 called “extreme heterogeneity of the data.” Along with remote-sensed data and ongoing field collection of data, museum collections hold millions of specimens. Many sites are in the process of digitizing their holdings for greater access. The U.S. Geological Survey’s Biodiversity Information Serving Our Nation program provides federated access to over 256 museum and herbarium collections, for a total of over 100 million occurrence records. Taken together, these records can provide new opportunities for inquiry in biodiversity science: with increased data access come increased opportunities to conduct data-driven investigations.
The extreme volume, variety, and velocity of data necessitate a suite of tools to collect, curate, and visualize environmental information. Professional organizations such as the Federation of Earth Science Information Partners, the Organization of Fish and Wildlife Information Managers, and the DataONE Users Group offer continual training in data management. Tutorials and methods abound on the Web.
A basic problem for any data scientist working with ecological data might be to curate his or her own collection of tools. DataONE addressed this by creating an online database of tools and software, which contains several hundred examples of computer tools that range across the data life cycle, from planning, to collecting, assuring quality, describing, preserving, discovering, integrating, and analyzing. The database is searchable and includes brief descriptions and a rudimentary controlled vocabulary. Unfortunately, the end user cannot modify the database: he or she must rely on the website manager to fix expired links, and cannot add new tools as they emerge.
Here, a solution is proposed by integrating the database into a collection of online bookmarks, and expanding on the collection with new tools and best practices. Categories closely follow the DataONE Tools and Software Database, and include the following: Biodiversity, Databases, GIS, Metadata, Repository, Visualization. As the material is available online, feeds are presented in Table 1.
Table 1. XML Feeds of Annotated Bibliography
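Feeds of this kind can also be read programmatically. The following is a minimal sketch in Python, assuming the bookmark feeds follow the common RSS 2.0 layout with one `<category>` element per tag; the sample feed, its titles, and its example.org links are placeholders invented for illustration, not entries from the actual collection, and `group_by_category` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical RSS 2.0 snippet. The titles, links, and tags are placeholders,
# not entries from the real bookmark collection.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example bookmark feed</title>
    <item>
      <title>Example metadata editor</title>
      <link>https://example.org/metadata-editor</link>
      <category>Metadata</category>
    </item>
    <item>
      <title>Example mapping tool</title>
      <link>https://example.org/mapping-tool</link>
      <category>GIS</category>
      <category>Visualization</category>
    </item>
  </channel>
</rss>"""


def group_by_category(feed_xml):
    """Return {category: [(title, link), ...]} for every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    grouped = defaultdict(list)
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link", default="")
        # An item may carry several tags; list it under each one.
        categories = [c.text for c in item.findall("category") if c.text]
        for category in categories or ["Uncategorized"]:
            grouped[category].append((title, link))
    return dict(grouped)


if __name__ == "__main__":
    for category, tools in sorted(group_by_category(SAMPLE_FEED).items()):
        print(category)
        for title, link in tools:
            print(f"  {title}: {link}")
```

Grouping by tag mirrors the category list used for the bookmarks (Biodiversity, Databases, GIS, Metadata, Repository, Visualization), so a tool tagged with two categories appears under both.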
Note: PDF version of Assignment 1 online: IS-592-ASSN1-JesselT
Related presentation in “Big Data in Ecology: Volume, Variety and Velocity in Environmental Information.”
Posted on February 6, 2014, in Big Data Analytics, Coursework, Scholarly Life and tagged Big Data Analytics, Ecological Informatics, Environmental Information.