04 Feb 2017
Neural networks are perhaps one of the most exciting recent developments in machine learning. Got a problem? Just throw a neural net at it. Want to make a self-driving car? Throw a neural net at it. Want to fly a helicopter? Throw a neural net at it. Curious about the digestive cycles of your sheep? Heck, throw a neural net at it. This extremely powerful algorithm holds much promise (but can also be a bit overhyped). In this article we’ll go through how a neural network actually works, and in a future article we’ll discuss some of the limitations of these seemingly magical tools.
Continue reading
24 Dec 2016
Perceptrons, Logistic Regression, and SVMs
In this post we’ll talk about one of the most fundamental machine learning algorithms: the perceptron algorithm. This algorithm forms the basis for many modern day ML algorithms, most notably neural networks. In addition, we’ll discuss the perceptron algorithm’s cousin, logistic regression. And then we’ll conclude with an introduction to SVMs, or support vector machines, which are perhaps one of the most flexible algorithms used today.
Continue reading
08 Dec 2016
One crucial element that all statistical learning algorithms need is the ability to handle a tremendous amount of data very quickly. People have used different frameworks for querying, or fetching, data. Among these include Hadoop’s MapReduce framework and the Apache Spark framework. SAP Hana Vora’s (HV) unique in-memory Hadoop query engine for MapReduce frameworks is a promising new tool for big data and performing analysis in a distributed fashion on large databases of information. We demonstrate HV’s potential as a powerful resource for ML by examining its performance on tasks such as stock prediction on market data. We also contributed some additional functionality to the SAP HV library in the process.
Continue reading
07 Dec 2016
Since our last demo day, the Grand Rounds team has been busy working on many aspects of the project. Grand Rounds provided our team a database of anonymous medicare information, including information about doctors visits. Associated with each visit were claim numbers, the total charge, the complication code (essentially the reason for the visit), and the NPI number (which indicates which physician was doing the procedure), as well as other information.
Continue reading
03 Dec 2016
Github doesn’t only make version control software; it’s tasked with the unique challenge of analyzing programming languages. With over 10 million repositories, Github has an incredible corpus of files written by over 3 million users… that’s a lot of files. Our team had the unique opportunity of identifying which programming language each file was written in. Our dataset has 50,000+ files with over 600 languages, a nontrivial multi-class problem. We’ve improved on Github’s existing classifier, Linguist, and, in the process, learned about what makes each programming language unique from one another.
Continue reading