Advanced technology brings an enormous amount of knowledge and changes the conventional way we acquire it. The age when knowledge came only from eyeballing experimental phenomena has passed. Today, most advances in both industry and academia are derived from data observed in the real world, and people uncover the insights hidden in that data through analysis and visualization. A standard process for working with data includes 1. ETL (Extract, Transform and Load), 2. Analysis, and 3. Visualization. In fact, none of these steps is easy.
ETL is the first step and usually the most tedious. Your data may come from different places, be collected by different instruments, and even be stored on different systems, which makes it complicated to assemble the complete set of data you need to proceed with your work. There are dozens of computer operating systems (http://en.wikipedia.org/wiki/List_of_operating_systems), each with multiple versions. Unfortunately, there is no single file format that every system uses to save data, and the security settings of those systems will drive you crazy before you even sort out the file format issue.
Thankfully, you may get through that first hurdle and be ready to analyze the data. However, you will find that the data does not quite fit your analysis method, which might require aggregating one parameter subject to other restrictions, some of which may change dynamically. There is no choice but to transform the raw data into something easy to analyze. The question is how, because you don't want to do it manually on a table of one million rows by one hundred columns. Excel skills and SQL query knowledge are usually very helpful at this stage; a rough sketch of what such a transformation can look like is shown below.
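For example, here is a minimal sketch of that kind of transformation in Python with pandas. The file name and the column names ("station", "date", "precip_mm") are made up for illustration; this is not the exact pipeline I used.

```python
import pandas as pd

# Load the raw records; the file name and columns are placeholders.
raw = pd.read_csv("raw_precipitation.csv", parse_dates=["date"])

# Keep only rows that satisfy the restriction of interest, then
# aggregate one parameter per group -- here, daily totals per
# station -- instead of editing a million rows by hand.
daily = (
    raw[raw["precip_mm"] >= 0]  # drop flagged/negative readings
       .groupby(["station", pd.Grouper(key="date", freq="D")])["precip_mm"]
       .sum()
       .reset_index()
)

daily.to_csv("daily_precipitation.csv", index=False)
```

The same idea can be expressed as a SQL GROUP BY query; the point is to let the tool do the filtering and aggregation rather than doing it by hand.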
If you are lucky and skillful enough to move on to the analysis, don't think you are almost there. The data you received, whatever its source, may have many dimensions and be far too large for your brain to handle. Even if it is fewer than 30 numbers, some analysis is needed to deliver the idea it carries. Analysis explores and uncovers the knowledge in the data, and it is the core of the whole operation. It may draw on statistical analysis, machine learning, and usually expertise in one or more related fields. Don't just think of "average", "summation" or "product"; this is a brainstorm that forces your mind to work in a more creative way.
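As one small illustration of going beyond averages, here is a sketch of fitting a candidate distribution to a set of event durations and checking how well it describes the sample. The input file is a placeholder and the exponential model is just one plausible choice, not the method I actually applied.

```python
import numpy as np
from scipy import stats

# A 1-D array of event durations; the file name is a placeholder.
durations = np.loadtxt("event_durations.txt")

# Fit an exponential distribution (location fixed at zero), then run a
# Kolmogorov-Smirnov test against the fitted parameters.
loc, scale = stats.expon.fit(durations, floc=0)
ks_stat, p_value = stats.kstest(durations, "expon", args=(loc, scale))

print(f"fitted mean duration: {scale:.2f}")
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")
```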
Once you get the results from the analysis, you are one step away: visualization. This is the process that can fail you very easily if you are not good at it. Its role is to show the knowledge that comes out of the analysis. The problem is how to convey complex results through a simple plot that can be understood by audiences at different levels. Sometimes, even if you can draw it on paper within 5 minutes, it may take you 5 hours or more to plot it on screen. The difficulty usually scales with the complexity of your analysis results.
Here is a failed example from my analysis of precipitation events in Philadelphia, PA. The plot gives the histogram of all wet spell durations compared with the distribution curves of wet spell durations conditioned on different preceding wet spell lengths. The plot is too busy for people to follow; too many curves make it hard to read.
This plot is much better than the last one. It shows the relationship between dry spell duration and annual rainfall depth in Philadelphia, PA. The boxplots show the distribution of spell lengths for each year, while the dot plots on the bottom give the average and variance of each distribution. Their trends with increasing annual rainfall are reflected by the linear regression lines.
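For anyone curious how a figure like that could be put together, here is a rough sketch with matplotlib. The DataFrame columns ("year", "dry_spell_days", "annual_rain_mm") are assumptions, and this is not the original plotting code.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# One row per dry spell; the file name and columns are placeholders.
df = pd.read_csv("dry_spells.csv")

fig, (ax_box, ax_trend) = plt.subplots(2, 1, figsize=(8, 8))

# Top panel: one boxplot of dry spell lengths per year.
years = sorted(df["year"].unique())
ax_box.boxplot([df.loc[df["year"] == y, "dry_spell_days"] for y in years],
               labels=years)
ax_box.set_ylabel("dry spell length (days)")

# Bottom panel: per-year mean and variance of spell length against
# annual rainfall, each with a least-squares trend line.
summary = df.groupby("year").agg(
    rain=("annual_rain_mm", "first"),
    mean_len=("dry_spell_days", "mean"),
    var_len=("dry_spell_days", "var"),
)
for column, marker, label in [("mean_len", "o", "mean"),
                              ("var_len", "s", "variance")]:
    ax_trend.scatter(summary["rain"], summary[column], marker=marker, label=label)
    slope, intercept = np.polyfit(summary["rain"], summary[column], 1)
    xs = np.linspace(summary["rain"].min(), summary["rain"].max(), 100)
    ax_trend.plot(xs, slope * xs + intercept)
ax_trend.set_xlabel("annual rainfall (mm)")
ax_trend.legend()

plt.tight_layout()
plt.show()
```

Keeping it to two panels and two summary statistics is what makes this version readable compared with the earlier, busier plot.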
My name is Ziwen and I am a graduate student in Civil Engineering at Drexel University.