Extract, Transform, and Load
A friend of mine in the program commented that about 90% of the time spent doing data science goes to obtaining and cleaning data.
This is where programming is incredibly useful. I am in the second year of my Master's program, and my programming skills are not yet at the level that I want them to be.
I recently started some work for my research assistantship concerning Twitter data for @DataONEorg.
I’m interested in the content of posts, and the relationships between the actors in the network.
In terms of content, I’d like to look at the hashtags and links.
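As a rough sketch of what pulling hashtags, mentions, and links out of tweet text might look like in Python, the snippet below uses simple regular expressions. The sample tweets, the `extract_entities` helper, and the patterns themselves are illustrative assumptions, not the actual data or method used for the @DataONEorg work; real tweets would need more careful handling (for example, trailing punctuation on links).

```python
import re
from collections import Counter

# Hypothetical sample tweets; real data would come from a harvested file.
tweets = [
    "Great webinar from @DataONEorg on #datamanagement http://example.org/webinar",
    "RT @DataONEorg: New #opendata tools for #datamanagement https://example.org/tools",
]

# Simplified patterns (an assumption): a hashtag or mention is the marker
# character followed by word characters; a link runs to the next whitespace.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
LINK = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return the hashtags, mentions, and links found in one tweet's text."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
        "mentions": MENTION.findall(text),
        "links": LINK.findall(text),
    }


# Count how often each hashtag appears across all tweets.
hashtag_counts = Counter()
for tweet in tweets:
    hashtag_counts.update(extract_entities(tweet)["hashtags"])

print(hashtag_counts.most_common())
```

The mentions are kept alongside the hashtags and links because they are one simple way to see which accounts are connected to which, which is the starting point for looking at relationships between actors.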
To illustrate how difficult it is to accomplish tasks “by hand,” I recently tried to harvest the Twitter data from a free site. My efforts are documented here: <https://notebooks.dataone.org/data-science/harvesting-dataoneorg-twitter-mentions-via-topsy/>.
I’ve read that employers should not hire a “data scientist” if the so-called “scientist” does not have programming skills. For this reason, I’m disappointed that the School of Information Science does not offer a programming course within the School itself. (I’ve heard Dr. Potnis will offer a course in Fall 2014, a semester after my graduation).
I enrolled in a programming course in the College of Engineering and Computer Science: Introduction to Programming for Scientists and Engineers. The course focuses on the C++ language. This is unfortunate, as Python is increasingly favored over C++. That popularity means more ready-made programs are available and the user community is growing. Content management systems are even being built around Python.
A friend of mine who does genome science uses Python. C++ is useful for taking advantage of parallelism, but the fact that my friend who works on supercomputers uses Python suggests to me that Python works for that as well.
Python Displacing R As The Programming Language For Data Science by @mjasay http://readwr.it/c1ew