Tuesday, January 21, 2014

Response to "Math is Not Necessary for Software Development"

Ross Hunter recently wrote a blog entry on Mutually Human arguing that math is not necessary for being a good software developer.  I agree with his thesis -- math isn't necessary.  However, Ross shouldn't then jump to the conclusion that math isn't useful for software development. Math may not be necessary but it can certainly be useful.

I'll start by addressing relevant points in his argument and try to clear up perceived misconceptions.

First Argument:

  • The skills that make a good mathematician are not the same as the skills that make for a good software developer.
  • Math is the process of breaking down complex problems into simpler problems, recognizing patterns, and applying known formulae.
Math is often taught in a way where students learn how to solve problems by identifying patterns.  Once the student identifies the pattern, they can solve the problem using the approach they memorized.  It's unfortunate that math is taught this way because people like Ross come away with a very incomplete and distorted picture of math.  I would call this "computation" rather than math.

The reason we have "known formulae" is precisely because of the practice of actual mathematics. In my mind, mathematics is the process of analyzing a formal system with logic. A mathematician starts by defining the basis of a formal system by specifying an initial set of rules by way of axioms, or statements which are held to be true without proof.  Next, the mathematician recursively applies logic to determine what the implications of the axioms are and whether any additional rules can then be defined.  As more and more rules are proven, the system becomes more powerful.

Often, mathematicians will be looking to see if a specific rule can be implied from the initial set of axioms.  If they find that this is not the case, the mathematicians may apply creative thinking to look for more specific cases where the rule does hold true or may change the axioms.  A good example is the complex number system.  When faced with the square roots of negative values, mathematicians had to define a new mathematical object (the imaginary number) to be able to reason about such results. This process can actually be quite creative.

Speaking from personal experience and comments made by others, a good math education can be a significant advantage.  Reasoning through complex arguments and formal systems has made me much more detail-oriented than I was before.  Math has improved my problem-solving skills.  It's also enabled me to reason formally about software, which can be very important when developing distributed systems, for example.

It's unfortunate that the way math is often taught fails our students.  Students are often taught the results found over thousands of years, but not the methodology for discovering the results.  Classes like algebra, calculus, and introductory statistics are examples which focus on results rather than methodology. Unfortunately, these are also the most popular math classes since they are required in most high schools and college science majors!

Ross points out that he loved his discrete math class.  Discrete math, along with courses such as geometry, graph theory, and combinatorics, is much better suited to teaching students methods rather than results.  All of the subject material can be derived from a few simple definitions and axioms, giving students the opportunity to learn the mathematical process.  Imagine the benefit for students if they were taught Real Analysis or Modern Algebra instead of calculus.  As Ross rightly argues, in many cases, a solid foundation in logical thinking can be more broadly applicable than calculus.

(I would also like to correct Ross's description of discrete math.  Ross implies that discrete math only consists of logic and boolean algebra.  This is, of course, wrong -- discrete math covers a range of topics such as set theory, combinatorics, and graph theory as well.)

Second Argument:
  • In Math, there is only one right answer, but in software development, there is rarely a singular right answer.
Ross assumes that if a student is trained in mathematics, they will not be able to deal with grey situations.  Maybe Ross assumes that people are only studying math?  Or maybe he assumes that people are not capable of learning new ways of thinking or analyzing situations critically? Either way, this argument doesn't hold water.

Like any skill or way of thinking we have developed, learning where and when to apply it is an important part of gaining experience.  Ideally, a student would also be exposed to the humanities or cutting-edge problems in the sciences where there are no clear answers. (Science education faces a similar problem -- a focus on results, not methods.)  Even if the student only studies math, it would be safe to assume that people can learn and adapt as they gain experience.  That is fundamentally part of being human.

There are also cases where math does not have a single correct answer.  A mathematician may have multiple ways of defining the initial axioms, each with its own trade-offs.  For example, there are variations on Euclidean geometry that change the initial axioms and end up with very different properties.


Math IS Useful:
Although math education is not necessary for software development, it is useful.  I've already described how math teaches good problem solving skills and critical thinking.  With the shift towards "internet scale" systems and big data, math is even more important than before.

Consider the case of evaluating and tuning a complex software system to squeeze out every last bit of performance.  A well-controlled experiment and appropriate use of statistics are necessary to accurately assess the response of the system under various conditions.  A software developer doesn't want to waste time performance tuning the areas of the system contributing least to the run-time -- they want to know what's eating up all the time so they can use their time efficiently.
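As a rough illustration of what "appropriate use of statistics" can mean here, the Python sketch below times a function over repeated trials and reports the mean run time with an approximate 95% confidence interval.  The function names (time_trials, summarize, intervals_overlap) are hypothetical, and the normal approximation is a simplifying assumption rather than a recommendation for every workload.

```python
import statistics
import time


def time_trials(func, trials=30):
    """Run func repeatedly and return the wall-clock time of each run in seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return the sample mean and an approximate 95% confidence half-width."""
    mean = statistics.mean(samples)
    # Standard error of the mean; 1.96 is the normal-approximation multiplier.
    half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width


def intervals_overlap(a, b):
    """True if two (mean, half_width) intervals overlap, so the gap may be noise."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    return abs(mean_a - mean_b) <= hw_a + hw_b
```

If the intervals for two versions of a routine overlap, the measured difference may just be noise, and it is probably worth collecting more trials before spending time tuning that routine.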

A better example would be the rise of machine learning and data mining.  Users leak data left and right, which is collected by nearly every internet company.  The data is then processed to predict what the user might like, in order to target ads or improve the user experience.  Machine learning is also used in the banking apps on our cell phones to read hand-written checks.  The popularity of machine learning is exploding as more and more uses are found.  I predict that many software developers will need to be proficient in machine learning techniques in the future.  Since machine learning is based on math and statistics, there may be a time when most software developers will need to know some linear algebra, statistics, and calculus.

All Knowledge is Useful:
Every subject offers an opportunity to apply our skills in a new way and train our brains to be even better.  One of the benefits of a liberal arts education is that students are expected to take a number of courses outside their major.  A programmer who decides to study math, literature, history, or art may find that they have developed a number of skills and tools that traditional Computer Scientists lack.


So is Experience:
Experience is a great teacher, especially in software development.  Spending hours debugging code is a great way to learn how a project works and to remember what caused the bug in the first place.  The next time you see a similar bug, you won't have to spend nearly as much time hunting its source down.

Experience also offers the benefit of knowing what works best in practice.  Ross points out that sometimes clever people will write code that is TOO clever.  They have sacrificed readability and effort for laziness or intellectual satisfaction.  This is not a problem of mathematicians, though.  This is a problem that comes from a lack of experience.

In the end, I agree that math is not necessary for software development (yet).  I also think that we could and should change the required math courses for computer science majors to courses that focus on logic and reasoning.  But, we shouldn't be attacking math or implying that it has no value.  For some of us, math education has been a valuable part of our training.


Exploring OpenStack Savanna, Parts II and III: Elastic Data Processing

In addition to provisioning Hadoop clusters, OpenStack Savanna can also be used to directly run Hadoop jobs using a feature called Elastic Data Processing (EDP).  One of the key advantages of this approach is that users do not need to deal with multiple environments -- everything can be run directly from the Savanna UI.  In this sense, Savanna is competing with Amazon's Elastic MapReduce service to offer "Analytics as a Service" (AaaS).

Part II: Exploring the Savanna Job Interface
There are three relevant tabs in the Savanna Dashboard plugin: "Job Binaries," "Jobs," and "Job Execution."  The "Job Binaries" tab allows you to upload the code you want to execute.  Options include Pig and Hive scripts as well as MapReduce Java jars.


After uploading the binaries, you can create a job using the Jobs tab:


At this stage, you need to select your job type: Pig, Hive, or a MapReduce Java jar.  If your code spans multiple files, you can use the "Libs" tab to select files to be included in the library path.

The final stage is executing the job.  The user is asked to choose the job, the cluster, the number of mappers and reducers, and any arguments and parameters.  If the chosen cluster is not already running, Savanna will start the cluster before the job is run and shut down the cluster when the job is finished.

Part III: Elastic Data Processing (EDP) Behind the Scenes
After the user provides the job descriptions and binaries, Savanna stores the data in a local database using the SQLAlchemy object-relational mapper.

When a job is executed, Savanna converts the data model into an XML workflow for Oozie, a high-level workflow manager for running Pig, Hive, and MapReduce jobs on Hadoop.  By default, a new HDFS directory is created for every job:

/user/$username/$jobname/$uuid

where $username is the name of the user, $jobname is the name given to the job, and $uuid is a randomly-generated identifier.  The workflow itself is stored in a file named, appropriately enough, "workflow.xml". The main executable is placed in the job directory while the libraries are placed in

/user/$username/$jobname/$uuid/libs
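As a rough illustration (this is not the code Savanna actually uses), the directory layout described above can be sketched in Python.  The function names job_dir and job_layout are hypothetical, and placing workflow.xml directly inside the job directory is my assumption.

```python
import posixpath
import uuid


def job_dir(username, jobname, job_uuid=None):
    """Build /user/$username/$jobname/$uuid, generating a random UUID if none is given."""
    job_uuid = job_uuid or str(uuid.uuid4())
    return posixpath.join("/user", username, jobname, job_uuid)


def job_layout(username, jobname, job_uuid=None):
    """Return the HDFS paths used for one job run (layout assumed from the description)."""
    base = job_dir(username, jobname, job_uuid)
    return {
        "job_dir": base,
        # Assumption: the workflow file sits next to the main executable.
        "workflow": posixpath.join(base, "workflow.xml"),
        "libs": posixpath.join(base, "libs"),
    }


if __name__ == "__main__":
    # Hypothetical user and job names, used only for illustration.
    for label, path in job_layout("alice", "wordcount").items():
        print(label, path)
```

Because the identifier is random, running the same job twice produces two separate directories, so the files from one run do not overwrite those of another.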

Savanna currently provides two options for handling data sources.  If the user has provided data through Swift (the object store), the XML Workflow for Oozie points to the appropriate Swift locations.  Otherwise, it is assumed the user has provided paths to the data in the command-line arguments given when executing the job.  For example, if you intend to use HDFS, you would be responsible for manually uploading the data to the Hadoop cluster.
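To give a feel for what such a workflow might look like, here is a minimal Python sketch that builds an Oozie workflow with a single Pig action and passes the data locations in as parameters.  This is not Savanna's actual generator: the element names follow the Oozie workflow schema as I understand it, while the function name, the action name "job-node", and the example paths are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_pig_workflow(name, script, params):
    """Return a minimal Oozie workflow (as an XML string) with one Pig action."""
    app = ET.Element("workflow-app", {"xmlns": "uri:oozie:workflow:0.2", "name": name})
    ET.SubElement(app, "start", {"to": "job-node"})

    action = ET.SubElement(app, "action", {"name": "job-node"})
    pig = ET.SubElement(action, "pig")
    # ${jobTracker} and ${nameNode} are resolved by Oozie from the job properties.
    ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
    ET.SubElement(pig, "name-node").text = "${nameNode}"
    ET.SubElement(pig, "script").text = script
    # Data locations (HDFS paths or object-store URLs) are passed as Pig parameters.
    for key, value in params.items():
        ET.SubElement(pig, "param").text = f"{key}={value}"
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Job failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical script name and HDFS paths, used only for illustration.
    print(build_pig_workflow("demo", "script.pig",
                             {"INPUT": "/data/in", "OUTPUT": "/data/out"}))
```

Whether the data lives in Swift or HDFS, the workflow only sees a location string, which is presumably why the two cases differ only in who supplies that string.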

Monday, January 20, 2014

Exploring OpenStack Savanna, Part I: Launching a Hadoop Cluster

I recently started playing with OpenStack Savanna, a service for provisioning Hadoop clusters on top of OpenStack.  Savanna makes it easy to create clusters, but users may be confused by the initial process.  Here, I go through the steps of creating a Hadoop cluster using Savanna based on the Quick Start guide provided by the Savanna developers.

Part I: Launching a Hadoop Cluster
To begin with, I wanted to look at the process for provisioning a cluster.  The first step is to register one of the images available in your OpenStack installation for use with Savanna.  By clicking on the "Image Registry" tab on the left-hand side, we get a list of all images (none in my case) registered with Savanna:

 
We can register an image by clicking the "Register Image" button on the right.  A dialog comes up like so:


Select the image you want to use and give it a name.  Click the "Done" button to save your changes.  You should now see the newly added image in the list:


After adding an image, we need to create templates for the master and worker nodes.  Node templates control which processes run on the nodes (e.g., job tracker, task tracker, name node, data node, etc.).  Start by clicking on the "Node Group Templates" tab on the left-hand side.


We'll create two templates: one for a master and one for a worker.  To begin, click on "Create Template."  The following dialog should appear:


Go with the defaults here and click "Create."  (I will ignore this dialog for the rest of the entry since you can always go with the defaults for this tutorial.) A second dialog will appear:



This dialog allows you to select options for the template.  As we're creating the master template, give the template the name "master," select the small flavor, and check "namenode" and "jobtracker."  Click "Create" to finish.

To create the worker, follow the same procedure but use a different name and select "datanode" and "tasktracker":


You should now see your templates in your template list:


You now need to create a cluster template, which defines how many nodes a cluster has and their types.  Click the "Cluster Templates" tab on the left-hand side:


To create a cluster template, click "Create Template."  The following dialog will appear:


On the first tab, you will want to provide a name such as "test-cluster-template."  Next, switch to the "Node Groups" tab.


Add one master and two workers to the cluster.  The rest of the parameters can be ignored, so simply press "Create."  Your cluster template should appear in the list:


Now, onto the fun part -- starting our cluster!  Click on the "Clusters" tab and press "Create Cluster":


Add a hostname, select a template, select the base image, and select a key pair.  When you press "Create," the cluster will be created and spawned.
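To recap how the pieces fit together, the templates built in this walkthrough can be modeled with a few lines of Python.  This is only an illustrative model, not Savanna's data model or API: the class names are hypothetical, and the "m1.small" flavor name (and using it for the worker as well) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroupTemplate:
    """Which Hadoop processes run on a node of this type."""
    name: str
    flavor: str
    processes: list


@dataclass
class ClusterTemplate:
    """How many nodes of each type a cluster has."""
    name: str
    groups: list = field(default_factory=list)  # (NodeGroupTemplate, count) pairs

    def add(self, template, count):
        self.groups.append((template, count))

    def node_count(self):
        return sum(count for _, count in self.groups)

    def process_counts(self):
        """Count how many instances of each process the cluster will run."""
        counts = {}
        for template, count in self.groups:
            for process in template.processes:
                counts[process] = counts.get(process, 0) + count
        return counts


# The two node group templates and the cluster template from the walkthrough.
master = NodeGroupTemplate("master", "m1.small", ["namenode", "jobtracker"])
worker = NodeGroupTemplate("worker", "m1.small", ["datanode", "tasktracker"])

cluster = ClusterTemplate("test-cluster-template")
cluster.add(master, 1)
cluster.add(worker, 2)
```

Calling cluster.process_counts() on this model makes it easy to check that the cluster has exactly one name node and one job tracker, which is what the one-master, two-worker layout is meant to guarantee.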