Monthly Archives: February 2014
This would not send with a .csv file so I am trying a .txt file.
Just change the name to .csv so you can view the attachment in Excel.
Our first INSC 592 Big Data Analytics assignment included a discussion of what “Big Data” means to each student and their respective career goals.
Because I love wildlife and the outdoors, my career goals include environmental information management. The earth itself is a big place, so it’s easy to understand why volume, variety, and velocity might intersect as a source of big data.
This slide deck was designed to help my fellow students understand the sources of “big” data in the environment, and ways that the information can flow from field to data file.
The presentation was well received and I think produced the intended effect in illustrating the sources of volume, variety, and velocity for environmental data.
One of the more useful outcomes of this assignment was that I produced a collection of bookmarks available for download. The bookmarks are based on the helpful (albeit cumbersome) “Software Tools Catalog” database available from DataONE. My approach was to bookmark the links in Diigo, which allows for long-term curation, wider availability, customizable tags, and flexible output, such as an RSS feed like the one for the keyword “repository” <https://www.diigo.com/rss/user/mountainsol/repository?type=all&sort=created>. A collection for a particular tag can also be linked to; for example, “visualization”: https://www.diigo.com/user/mountainsol/visualization.
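The two links above follow a simple pattern, so the feed and collection URLs for any tag can be built the same way. The short Python sketch below assumes that other tags use exactly the same URL layout as the “repository” and “visualization” examples; the function names are my own and are not part of any Diigo API.

```python
from urllib.parse import quote

# Account name taken from the example links in this post.
DIIGO_USER = "mountainsol"


def tag_feed_url(tag, user=DIIGO_USER):
    """Build the RSS feed URL for one tag (pattern assumed from the 'repository' link)."""
    return f"https://www.diigo.com/rss/user/{user}/{quote(tag)}?type=all&sort=created"


def tag_page_url(tag, user=DIIGO_USER):
    """Build the browsable collection URL for one tag (pattern assumed from the 'visualization' link)."""
    return f"https://www.diigo.com/user/{user}/{quote(tag)}"


if __name__ == "__main__":
    for tag in ("repository", "visualization"):
        print(tag)
        print("  feed:", tag_feed_url(tag))
        print("  page:", tag_page_url(tag))
```

Any feed reader should then be able to subscribe to the printed feed address, provided the tag exists in the collection.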
I should thank Ethan White for his “Most Interesting Man” image that I could not resist borrowing. See http://jabberwocky.weecology.org/2013/08/12/ignite-talk-big-data-in-ecology/ for Ethan’s take on “Big Data in Ecology.”
Just reading an article that might be a good idea to cite in the Figshare article.
“Big data and the future of ecology”
Authors include Amber Budden, Carly Strasser, John Porter, and other DataONE associates.
A key point presented is “Traditional ecology produces ‘dark data'” which is kind of the direction I was going with “Figshare as a DataONE member node for dark data” as a reason why DataONE would be interested.
I hope that linkage is apparent in the Figshare article – adding this citation might help solidify the linkage.
Environmental information gathered by modern computational methods has all the hallmarks of “Big Data” including volume, variety, and velocity. This includes information collected by remote sensing from earth observing satellites, information collected by terrestrial, freshwater, and marine sensor networks, and data collected with portable devices such as radio telemetry or global positioning satellite transponders. Even small animals can be paired with “passive integrated transponders” that transform an organism into a data point – and tissue samples and specimens can be taken back to the lab, herbarium, or other natural history collection, with further analysis and data generation and curation done at the molecular and genetic level.
The volume of data available to be collected from the environment approaches infinity, with complex edges like a fractal. In fact, the National Ecological Observatory Network states the defining characteristic of ecological data is complexity. The complex interactions between elements of the environment, such as distributions of species, interactions between species, and interactions between biotic and abiotic factors, generate a wealth of data points. A February 3, 2014 article entitled “Ecologists urged to avail themselves of big data in studies” suggests advances in data science and computational methods require a new generation of ecologists who can work with “big datasets at large scale” to mesh together both large and small datasets to better understand the complex dynamics of the natural world.
An adage in the wildlife biology field is to “know your stats and know your critters.” However, a 2012 paper by Strasser and Hampton concerning undergraduate training in data management for ecology students suggests that knowledge of data management tools and methods is lacking in the undergraduate curriculum. At the University of Tennessee, students of the Environmental Studies Program are encouraged to take an “Environmental Information Science” course available through the College of Communication’s undergraduate minor in Information Science and Technology. Graduate level training is available from the University of New Mexico’s Environmental Information Management Institute, and the University of Tennessee offers a course in “Environmental Informatics.”
These course offerings are needed. In addition to “knowing the stats” and “knowing the critters,” students of ecology are increasingly expected to work with large datasets. Further, working with data that has already been collected is in fact less costly than collecting new data. Collected data becomes increasingly valuable, and personnel who can effectively manage and safeguard the investment in research effort offer a new skillset to the field of ecology beyond skills in quantitative analysis or biology. In “The Age of Big Data,” Steve Lohr wrote that the U.S. would need just shy of 200,000 workers with skill in data analysis, and 1.5 million well versed in data management.
The need for data literacy in the environmental domain is underscored by the variety of applications, what the President’s Council of Advisors on Science and Technology in 2011 called “extreme heterogeneity of the data.” Along with remote-sensed data and ongoing field collection of data, museum collections hold millions of specimens. Many sites are in the process of digitizing their holdings for greater access. The U.S. Geological Survey’s Biodiversity Information Serving Our Nation Program provides federated access to over 256 museum and herbarium collections, for a total of over 100 million occurrence records. Taken together, these records can provide new opportunities for inquiry in biodiversity science: with increased data access come increased opportunities to conduct data-driven investigations.
The extreme volume, variety, and velocity of data necessitate a suite of tools to collect, curate, and visualize environmental information. Professional organizations such as the Earth Science Information Partners, the Organization of Fish and Wildlife Information Managers, and the DataONE Users Group offer continual training in data management. Tutorials and methods abound on the Web.
A basic problem for any data scientist working with ecological data might be to curate his or her own collection of tools. DataONE addressed this by creating an online database of tools and software, which contains several hundred examples of computer tools that range across the data life cycle, from planning, to collecting, assuring quality, describing, preserving, discovering, integrating, and analyzing. The database includes brief descriptions and a rudimentary controlled vocabulary. It is searchable. Unfortunately, the end user cannot modify the database and is forced to rely on the website manager to update the site when links expire; nor can the end user add new tools as they emerge.
Here, a solution is proposed by integrating the database into a collection of online bookmarks, and expanding on the collection with new tools and best practices. Categories closely follow the DataONE Tools and Software Database, and include the following: Biodiversity, Databases, GIS, Metadata, Repository, Visualization. As the material is available online, feeds are presented in Table 1.
Table 1. XML Feeds of Annotated Bibliography
Note: PDF version of Assignment 1 online: IS-592-ASSN1-JesselT
Related presentation in “Big Data in Ecology: Volume, Variety and Velocity in Environmental Information.”
Hello Dr. Tenopir,
As requested, I have some data on Twitter mentions of @DataONEorg for you to share with Dr. Budden.
First, I should clarify from our discussion earlier that I made an accounting error in how I tabulated results – I reported in our meeting there were 15,000 mentions of @DataONEorg.
That number is actually substantially less – there were in fact 1490 tweets across all time using the Twitter handle "@DataONEorg."
I did not perform a search for the text string "dataone" because I found it difficult to tease apart mentions of DataONE from mentions of dataone, the Indian IT services company, and dataone, the U.S.-based ISP.
Of these 1490 mentions of @DataONEorg, 350 originated from the user @DataONEorg itself.
This leaves 1140 Tweets across all time from users other than @DataONEorg.
This total does not account for Re-tweets of any one of the 1140 tweets.
For example, if one of the 1140 tweets is re-tweeted 100 times, my methods only count that tweet once. Therefore, my method produces an "under-estimate" of total reach.
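For anyone who wants to repeat the tally, the counting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the script I actually ran: the `user` and `text` field names and the three sample tweets are invented for the example.

```python
def tally_mentions(tweets, handle="@DataONEorg"):
    """Count tweets that mention `handle`, split into the account's own tweets and everyone else's.

    Each tweet in the list is counted once, so re-tweets of it are not
    reflected and the result under-estimates total reach.
    """
    needle = handle.lower()
    account = needle.lstrip("@")
    mentions = [t for t in tweets if needle in t["text"].lower()]
    own = sum(1 for t in mentions if t["user"].lower() == account)
    return {"total": len(mentions), "own": own, "others": len(mentions) - own}


# Invented sample records; field names are assumptions, not a real export format.
sample = [
    {"user": "DataONEorg", "text": "New webinar announced by @DataONEorg"},
    {"user": "someone", "text": "Reading the @dataoneorg best practices #openscience"},
    {"user": "other", "text": "A tweet that does not mention the handle"},
]
print(tally_mentions(sample))

# The same subtraction applied to the figures reported in this message.
total_mentions = 1490
own_mentions = 350
print(total_mentions - own_mentions)  # tweets from users other than @DataONEorg
```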
I do not presently have a method in mind to assess total "reach" or a network analysis of reach. Although I would like to obtain this data, I am not optimistic that I can obtain the data prior to February 6.
I am currently exploring options for understanding what text, links, and hashtags are being shared in conjunction with mentions of @DataONEorg (For example, mentions of your "Practices and Perceptions" paper) or @DataONEorg plus #citizenscience or #openscience.
I am attaching a spreadsheet with the raw data and data sources.
My methods are outlined in the DataONE Data Science open research notebook:
Graduate Research Assistant:
Data Observation Network for Earth (DataONE)
Center for Information and Communication Studies
The University of Tennessee
Mail: 1345 Circle Park Drive, Suite 420
Physical: Hoskins 5, Room 5-H