Extract, Transform and Load

A friend of mine in the program commented that about 90% of the time spent doing data science goes to obtaining and cleaning data.

This is where programming is incredibly useful.  I am in the second year of my Master’s program, and my programming skills are not yet at the level that I want them to be.

I recently started some work for my research assistantship concerning Twitter data for @DataONEorg.

I’m interested in the content of posts, and the relationships between the actors in the network.

In terms of content, I’d like to look at the hashtags and links.
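As a rough sketch of what I mean, the Python below pulls hashtags, links, and @-mentions out of tweet text with regular expressions, then counts the hashtags and builds a simple "who mentions whom" edge list for the network side. The sample tweets, the user names, and the `extract` helper are all made up for illustration; real data would come from a harvested export, and these simple patterns will not catch every edge case in real tweets.

```python
import re
from collections import Counter

# Hypothetical sample tweets (invented for illustration only).
tweets = [
    {"user": "alice",
     "text": "Great webinar from @DataONEorg on #datamanagement http://example.org/a"},
    {"user": "bob",
     "text": "RT @alice: Great webinar from @DataONEorg #datamanagement #opendata"},
]

# Simple patterns; an assumption that is good enough for a first pass.
LINK = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract(text):
    """Return the links, hashtags, and mentions found in one tweet."""
    links = LINK.findall(text)
    # Remove links first so a '#' or '@' inside a URL is not miscounted.
    stripped = LINK.sub(" ", text)
    return {
        "links": links,
        "hashtags": [h.lower() for h in HASHTAG.findall(stripped)],
        "mentions": [m.lower() for m in MENTION.findall(stripped)],
    }


hashtag_counts = Counter()
edges = []  # (author, mentioned account) pairs for the network
for tweet in tweets:
    parts = extract(tweet["text"])
    hashtag_counts.update(parts["hashtags"])
    edges.extend((tweet["user"], m) for m in parts["mentions"])

print(hashtag_counts.most_common())
print(edges)
```

The edge list could later be loaded into a network analysis tool to look at relationships between actors, while the hashtag counts give a first look at content.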

To illustrate how difficult it is to accomplish tasks “by hand,” I recently tried to harvest the Twitter data from a free site.  My efforts are documented here: <https://notebooks.dataone.org/data-science/harvesting-dataoneorg-twitter-mentions-via-topsy/>.

I’ve read that employers should not hire a “data scientist” if the so-called “scientist” does not have programming skills.  For this reason, I’m disappointed that the School of Information Science does not offer a programming course within the School itself.  (I’ve heard Dr. Potnis will offer a course in Fall 2014, a semester after my graduation.)

I enrolled in a programming course in the College of Engineering and Computer Science – Introduction to Programming for Scientists and Engineers.  The course focuses on the C++ language.  This is unfortunate, as Python is increasingly favored over C++.  That popularity means more ready-made programs are available and the user community is growing. Content management systems are even being built around Python.

A friend of mine who does genome science uses Python.  C++ is useful for taking advantage of parallelism, but the fact that my friend who works on supercomputers uses Python suggests to me that Python works as well.

Programming language popularity:

http://venturebeat.com/2014/02/03/the-most-popular-coding-languages-for-2014-are/

Further reading:

Python Displacing R As The Programming Language For Data Science by @mjasay http://readwr.it/c1ew  

http://www.scipy.org/

 

 


About Tanner Jessel

I am a recent M.S. in Information Science graduate from the University of Tennessee School of Information Science. I was formerly a graduate research assistant funded by DataONE (Data Observation Network for Earth). Prior, I worked for four years as a content lead and biodiversity scientist with the U.S. Geological Survey's Biodiversity Informatics Program. Building on my work experience in biodiversity and environmental informatics, my work with DataONE focused on exploring the nature of scientific collaborations necessary for scientific inquiry. I also conducted research concerning user experience and usability, and assisted in development of member nodes with an emphasis on spatial data and infrastructure. I assisted with research designed to understand sociocultural issues within collaborative research communities. Through August 1, 2014, I was based at the Center for Information and Communication Studies at the University of Tennessee School of Information Science in Knoxville, Tennessee.

Posted on February 20, 2014, in Big Data Analytics, Coursework.
