Ever since Amazon released S3 (Simple Storage Service) in 2006, it has revolutionized the way companies store and analyze data. However, there are a few ways a developer can get tripped up using this otherwise simple service.
One major perk of using S3 is that data can be transferred seamlessly to and from HDFS, which, when combined with Amazon's Elastic MapReduce (EMR), makes Big Data analytics extremely accessible. However, since S3 is not the typical POSIX filesystem that many developers have come to expect, it can lead to some unexpected behavior.
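One example of that non-POSIX behavior: S3 has no real directories. Object keys are flat strings, and "folders" are just a naming convention over key prefixes. The toy sketch below (plain Python, no actual AWS calls, with made-up key names) illustrates what an S3 prefix "listing" really amounts to:

```python
# S3 stores a flat namespace of keys; "directories" are only a convention.
# Toy illustration (not the AWS API): listing a "folder" is prefix filtering.
keys = [
    "logs/2013/11/01/events.gz",
    "logs/2013/11/02/events.gz",
    "data/census.csv",
]

def list_prefix(keys, prefix):
    """Mimic what an S3 LIST with a prefix returns: every key that starts
    with the prefix. There is no directory object to stat, rename, or chmod."""
    return [k for k in keys if k.startswith(prefix)]

print(list_prefix(keys, "logs/2013/11/"))
```

This is why operations that are cheap on a POSIX filesystem, like renaming a directory, turn into per-object copies on S3.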
I came across this summary of a talk given by Jonathan Corum titled “The Weight of Rain”. Not only did I find the narrative style enjoyable to follow, but I also learned a lot about how to make data visually informative and understandable. The sections on avoiding the common traps of infographics and visualizations were especially useful. This is without a doubt one of the best talks I’ve seen on the topic.
The Weight of Rain
If you’re anything like me and you’ve been looking for 2010 census data by zip code, you’ve had trouble finding it in a single package like a CSV file. Through the data.gov website you can get the data by state, which requires downloading 52 files (the 50 states plus DC and Puerto Rico) and combining them for each of the segments. Not wanting to do that by hand, I wrote a little Python script to collect and aggregate some of the pertinent info from the 2010 Census Summary File, and I’m making the result available. The file contains over 50 metrics about the US population from the 2010 census, aggregated at the zip-code level (there are hundreds if not thousands of available metrics).
Here’s the link to the data on GitHub.
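The aggregation step itself is straightforward. A minimal sketch of the combining part, assuming hypothetical per-state filenames like census_AK.csv (not the actual data.gov names), looks like this:

```python
import csv
import glob

def combine_state_files(pattern, out_path):
    """Concatenate per-state census CSVs (hypothetical filenames) into one
    file, writing the shared header row only once."""
    wrote_header = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if not wrote_header:
                    writer.writerow(header)
                    wrote_header = True
                writer.writerows(reader)

# e.g. combine_state_files("census_*.csv", "census_all.csv")
```

The real script also maps the Census Summary File's fixed-width segments onto named columns, but the concatenation logic is the same.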
There’s been a lot of talk around the world about the 2014 World Cup draw. In fact, I recently got into a fairly heated debate (at a data science workshop, no less) about whether Spain or the US had the toughest group. The consensus among the group at the time was that Spain had the toughest group. Well, The Guardian has compiled a fairly comprehensive analysis of the subject, with some interesting visualizations to boot.
Who has the hardest World Cup 2014 Group Draw
Attention all Philly area data junkies! Monetate will be hosting the November DataPhilly Meetup.
DataPhilly is a group for anyone interested in gaining insights from data. Topics include predictive analytics, applied machine learning, big data, data warehousing, and data science. The November meetup will focus on tools and techniques, with two presentations covering applications of Map Reduce and Scrapy. I’ve included the details below; hope to see you all there.
Monday, November 18, 2013
951 E. Hector Street, Conshohocken, PA
6:00 – 6:30: Networking, food
6:30 – 7:00: Map Reduce: Beyond Word Count by Jeff Patti
7:00 – 7:30: Collecting data with Scrapy by Patrick O’Brien
7:30 – 8:00: Lightning Talks
8:00 – Leave for bar
Map Reduce: Beyond Word Count by Jeff Patti
Have you ever wondered what map reduce can be used for beyond the word count example you see in all the introductory articles? Using Python and mrjob, this talk will cover a few simple map reduce algorithms that in part power Monetate’s information pipeline.
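To give a flavor of "beyond word count": one small step up is computing a per-key average, e.g. mean revenue per site. This is not Monetate's actual pipeline, just an illustrative mapper/reducer pair in plain Python, with the shuffle phase simulated; mrjob wraps the same two functions and runs them on Hadoop:

```python
from itertools import groupby
from operator import itemgetter

# Mapper/reducer pair for a per-key average -- one step beyond word count.
def mapper(record):
    site, revenue = record          # e.g. ("siteA", 10.0); toy record format
    yield site, revenue

def reducer(key, values):
    values = list(values)
    yield key, sum(values) / len(values)

def run(records):
    # Simulate the shuffle/sort phase Hadoop performs between the stages.
    mapped = sorted(kv for rec in records for kv in mapper(rec))
    return {
        k: out
        for k, group in groupby(mapped, key=itemgetter(0))
        for _, out in reducer(k, (v for _, v in group))
    }

print(run([("siteA", 10.0), ("siteA", 20.0), ("siteB", 5.0)]))
```

The same mapper/reducer shape extends to joins, top-N lists, and sessionization, which is where map reduce gets interesting in practice.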
Collecting data with Scrapy by Patrick O’Brien
Scrapy is a simple, fast web scraping library that enables the production of clean data from unstructured web information. Together we will dive into the architecture of this library and create our own crawler.
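The heart of any crawler is the parse step: turning raw HTML into clean records and new links to follow. That is what a Scrapy spider's parse() method does; Scrapy then adds scheduling, throttling, and item pipelines around it. As a rough sketch of just the extraction step, using only the standard library rather than Scrapy itself:

```python
from html.parser import HTMLParser

# Extract outgoing links from an HTML page -- the kind of extraction a
# Scrapy spider's parse() method performs before yielding new requests.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

extractor = LinkExtractor()
extractor.feed('<p>See <a href="/page2">next</a> and <a href="/about">about</a>.</p>')
print(extractor.links)
```

In Scrapy the scraped fields would be yielded as items and the discovered links re-queued as new requests, which is exactly the architecture the talk walks through.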
Data Philly November 2013 – Data in Practice
I’ve been using R to accomplish a lot of statistical and data mining tasks at work recently, and it has really brought me back to my *nix scripting roots. While the learning curve with R is somewhat steep compared to more graphical tools like SAS, Statistica, or SPSS, where it truly shines is in its ability to turn statistical computing tasks into applications. Much like UNIX, R forgoes the graphical environment for a component-based approach that more easily turns tasks into repeatable scripts. This can be particularly useful when tuning a data mining algorithm to a specific problem.
I had a data mining task at work for which, after some initial exploration, I decided to use the Random Forest classification algorithm: it can handle large numbers of input variables (thousands) without variable deletion, and it has particularly good methods for dealing with missing data while maintaining accuracy in an unbiased manner. The Random Forest algorithm has several tuning parameters in its R implementation that can affect the accuracy of the model, particularly the number of trees to build and the number of variables to consider in each tree. Typically it takes several iterations of executing the algorithm manually, each followed by an analysis of the results, to determine the ideal settings for these two options, and even after several iterations the best possible settings might remain unknown. There are some general rules for each of these parameters, but in practice they are highly data-set dependent. For the number of trees, more is generally better, but at what point do the diminishing returns of adding trees outweigh the added complexity and processing cost? In some cases adding more trees can actually increase the error rate. Choosing the number of variables to consider at each decision node is also highly data-set dependent. The general rule is to use the square root of the total number of input variables. In cases where there are many insignificant variables this number may need to be increased, but this must be done with caution: choosing too many variables reduces the randomness of the random forest model and can greatly reduce its overall accuracy.
Not wanting to leave anything to chance, I wrote a short script in R to do the work for me. The script executes the model with thousands of combinations of the two tuning parameters (number of trees and number of variables). After each execution the accuracy of that specific model is calculated and stored, along with the tuning parameters, for further review. Once the script has completed, it's a fairly simple task to sift through the output for the combination with the lowest error rate.
Here’s the code:
library(randomForest)
for (i in 10:100) {
  for (j in 1:50) {
    # Set tuning parameters
    mtryVal  <- i
    ntreeVal <- 500 + (j * 30)
    # Execute the random forest model (crs$dataset holds the training data,
    # following the rattle naming convention)
    crs$rf <- randomForest(Gold_1_Success ~ ., data = crs$dataset,
                           mtry = mtryVal, ntree = ntreeVal,
                           na.action = na.roughfix)
    # Display the tuning parameters and corresponding OOB error rate
    print(paste("number of trees:", ntreeVal, "number of variables:", mtryVal,
                "OOB error:", crs$rf$err.rate[ntreeVal, "OOB"]))
  }
}
The results of the iterations were somewhat variable, with accuracy ranging from 71% to 75%. Using this methodology not only saved hours of manual effort but also selected the most accurate model among the candidates tried. The final result was a little over 2% better than the “general rule” tuning parameters, which seems small, but depending on the business case that 2% difference could be worth thousands of dollars, or even be the difference between the model being accepted for production use or discarded.
John Elder and Elder Research are presenting a 2-day course, Tools for Discovering Patterns in Data: Extracting Value from Tables, Text, and Links, September 9th and 10th in Charlottesville, VA. I’ve had the pleasure of personally attending one of Dr. Elder’s courses in the past and I can’t recommend this one enough. Dr. Elder has not only breadth and depth of subject knowledge but also a very approachable teaching style.
Find the useful information hidden in your data! This course surveys computer-intensive methods for inductive classification and estimation, drawn from Statistics, Machine Learning, and Data Mining. Dr. Elder will describe the key inner workings of leading algorithms, compare their merits, and (briefly) demonstrate their relative effectiveness on practical applications. We’ll first review classical statistical techniques, both linear and nonparametric, then outline the ways in which these basic tools are modified and combined into powerful modern methods. The course emphasizes practical advice and focuses on the essential techniques of Resampling, Visualization, and Ensembles. Actual scientific and business examples will illustrate proven techniques employed by expert analysts. Along the way, major relative strengths and distinctive properties of the leading commercial software products for Data Mining will be discussed.
It’s often the case when beginning a data mining process that we have too many variables rather than too few. Simply using all of the available variables when creating a predictive model can lead to overfitting. The process of selecting which variables to include in a model can be difficult, and relying solely on intuition about which variables seem most important can cause us to overlook variables that would add precision to our model. This is where a decision tree algorithm can be useful in narrowing the search for important variables.
Aside from being a data mining technique in their own right, decision tree algorithms such as C&RT and CHAID go through a process of ‘pruning’ the tree, refining the model to produce the simplest possible tree given the constraints. In doing so, variables that have minimal predictive effect are pruned away, and the remaining variables can be ranked according to their importance. Below is an example of an importance plot generated with Statistica (most other data mining tools can create similar importance tables and charts).
Now, using this chart, I can begin to select the most relevant variables. However, just because a variable has high importance doesn’t mean it should be included in the final model: I’ve often found that some variables have high importance because they are overly specific and therefore act more like identifiers.
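Statistica's importance plot can't be reproduced here, but the underlying idea is simple: rank variables by how much splitting on them reduces impurity in the target. A minimal sketch of that ranking in plain Python, using entropy-based information gain on toy categorical data (the data and column layout are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, col):
    """Entropy reduction from splitting on one categorical variable --
    the quantity variable-importance rankings are built from."""
    total = entropy(labels)
    n = len(rows)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[col], []).append(label)
    return total - sum(len(subset) / n * entropy(subset)
                       for subset in by_value.values())

# Toy data: column 0 predicts the label perfectly, column 1 is pure noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]
ranked = sorted(range(2), key=lambda c: information_gain(rows, labels, c),
                reverse=True)
print(ranked)  # column 0 should rank first
```

Note that this sketch also shows the identifier trap from above: a column of unique IDs would get maximal gain while being useless for prediction, which is exactly why high importance alone doesn't justify inclusion.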
One of the key problems many analytics professionals face is achieving buy-in from upper management to invest resources in data mining projects. Not everyone has drunk the ‘Big Data Kool-Aid’, and data mining projects are often long and require significant investment in labor, software, and hardware, which keeps many smaller (and even some larger) businesses from pursuing them. This doesn’t need to be the case, as there are some great open source tools out there, such as Weka, RapidMiner, and R. R is the most widely used open source software in the data mining / statistical analysis world. As the graph below shows, it is also a desirable skill in the job market, and while Weka and RapidMiner are useful tools, they’re not nearly as widely used.
Continue reading Open Source Data Mining
At some point, if you deal with data long enough, you will come across the hierarchical data problem. A problem that is easily solved recursively in most programming languages becomes somewhat of a challenge to solve with SQL. It has always seemed odd to me that a problem I solved in my first computer science class could pose so many issues in a relational database. I’ve even read books that suggest a few ‘hacks’, disguised as best practices on the data design side, to avoid the hierarchical problem altogether. Denormalized designs such as nested sets are a commonly proposed solution, but they should really be treated as a last resort, as they expose your database to insert and update anomalies. There are, however, much better query-side solutions out there.
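The standard query-side solution, assuming the hierarchy is stored as a plain parent/child (adjacency list) table, is a recursive common table expression. A small self-contained sketch using SQLite (which supports WITH RECURSIVE as of version 3.8.3; the table and names are invented for illustration):

```python
import sqlite3

# A recursive CTE walks a parent/child table entirely inside the database,
# with no application-side recursion. Hypothetical org-chart data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', NULL),
        (2, 'Bob',   1),
        (3, 'Carol', 1),
        (4, 'Dave',  2);
""")

# Everyone in the tree, with their depth below the root.
rows = conn.execute("""
    WITH RECURSIVE reports(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, r.depth + 1
        FROM employees e JOIN reports r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports ORDER BY depth, name
""").fetchall()
print(rows)
```

The base case selects the root, the recursive case joins children onto rows found so far, and the database handles the iteration, so the adjacency list stays fully normalized.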
Continue reading Dealing With Hierarchical Data in SQL