Monday, June 19, 2017

Why Are Pipes Hard?

I learned something amazing today.  I have done a few small multi-processing tasks in the past, some using Python and one using Java.  The most appropriate form of interprocess communication for these was pipes.  The experience was always very painful.  I had to figure out how to cobble together the right objects to produce a usable communication system.  I had to learn some special uses of objects to allow the parent process to continue executing, instead of waiting for the child to finish.  Working out the bugs was very difficult, because it was hard to tell if the problem was my communication protocol, if I had connected my objects wrong, or if maybe I had given an advanced function or method some incorrect parameter.  Of course, I expect this sort of difficulty.  I had heard from others just how hard interprocess communication is, especially with pipes.  That is just the way it is, I guess.

Today, a friend sent me a link to this.  That document discusses several types of interprocess communication.  There is one chapter on pipes, and the author starts the chapter, "There is no form of IPC that is simpler than pipes."  I found this rather amusing and slightly confusing, as this was not my experience.  Shortly following that, he adds that the document will not spend much time on pipes because they are so easy.  This was confusing.  By the end of the section, I knew two important things.  First, the author is right; pipes are incredibly easy.  Second, everyone else is doing pipes wrong.

In Python, you have to create and pass pipes into the function call that starts a new process.  I forget the exact details, but I seem to recall having to use a different function or at least put some obscure parameter into the regular one to get it to return before the child process was completed.  Then I had to use some method of the pipe object to interact with the child, and this was rather difficult to get right.  In one case I did have some issues with my communication protocol, but I thought it was yet another issue with how I was trying to use the pipe.  It took half an hour to figure out that the problem was some trivial thing with how I was formatting the data I was passing through the pipe.  In short, the system was completely opaque and very difficult to use.  Java was not much better (in fact, I think I gave up on communication while the child process was still running, and just made the program call multiple times).  The friend who sent me the link told me he had the same experience when he read that section, except with the Go programming language.

Now, that document describes how to use pipes in C.  First, you create a pipe.  This requires merely making a single function call, passing in an array of two integers.  The array is populated with a pair of file descriptors for the reading and writing ends of the pipe respectively.  Then, you treat those file descriptors like you would any regular file!  The main difference is that they are more like standard in and standard out than a file on disk, but any programmer worth his or her salt should know how to deal with that!  The document goes on to use read and write calls to send and receive data, but you could easily use any function that reads or writes files for this.  If there is anything complicated about pipes, it is only that they have limited buffer space, but something like 10kb is a pretty big limit (the exact capacity depends on the system; on modern Linux the default is 64kb).  Once you have a pipe, you just fork the process, which will provide the child process with copies of the file descriptors to the pipe.  Now the two processes can communicate with read and write calls (or other file I/O calls) on the pipe.  Of course, if you need two way communication, you will need two pipes, but that is trivial.
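
To make those steps concrete, here is a minimal sketch of the pipe, fork, read, and write sequence in C.  It is my own illustration, not code from that document, and the function name pipe_roundtrip is made up for the example.  The child writes a message into the pipe and the parent reads it back out.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for illustration: fork a child that writes `msg`
 * into a pipe, and read it back in the parent.  Stores a NUL-terminated
 * copy in `out` and returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t out_size)
{
    int fds[2];                      /* fds[0] = read end, fds[1] = write end */

    assert(msg != NULL && out != NULL && out_size > 0);

    if (pipe(fds) == -1)             /* the single call that creates the pipe */
        return -1;

    pid_t pid = fork();              /* child gets copies of both descriptors */
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                  /* child: the writer */
        close(fds[0]);               /* child does not read */
        if (write(fds[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(fds[1]);
        _exit(0);
    }

    /* parent: the reader */
    close(fds[1]);                   /* close our write end so read() can see EOF */

    size_t total = 0;
    ssize_t n;
    while (total < out_size - 1 &&
           (n = read(fds[0], out + total, out_size - 1 - total)) > 0)
        total += (size_t)n;

    close(fds[0]);
    out[total] = '\0';
    waitpid(pid, NULL, 0);           /* reap the child */
    return (ssize_t)total;
}
```

Calling pipe_roundtrip("hello", buf, sizeof buf) from main should leave "hello" in buf.  Most of the lines are closing the ends each process does not use.  The communication itself is one write and one read.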

The fact is, pipes are incredibly easy!  For some reason though, only low level languages can manage to get them right.  I don't often have something bad to say about Python, but this is one place where Python got it totally wrong!  I have plenty of bad things to say about Java, so it does not really surprise me that Java missed the mark on this one.  So why are pipes so hard in languages that are higher level than C or C++?  In fact, I am almost certain that pipes are easier in assembly language than in higher level languages (and I know what I am talking about here).  Clearly the operating system has it right.  Why can't high level languages get it right?  They are supposed to make programming easier if I am not mistaken, but when I start considering writing a C module for Python, because that would be so much easier than just using the built-in one, you know something is wrong.

The problem is not terribly difficult for me to identify, but much of the programming world is so in love with Object Oriented Programming that they don't want to hear anything bad about it, let alone admit that it might have a serious problem.  The problem here is that the OS gets it right in the first place, so it does not need complicated wrappers to make it easier.  At the OS level, pipes are about as simple as it gets.  C is wise to just leave it alone and let it do its thing.  Python, Java, and other languages, however, feel obligated to wrap everything in objects, because hey, object oriented.  The fact is pipes don't need objects.  They are perfectly fine just as they are.  Wrapping them in objects makes them absurdly more complicated and difficult to use, without adding anything of value.  I was actually stunned when I saw the C code in that document.  I had a hard time believing that it was correct, because my experience did not support that conclusion.  Using objects where they are unnecessary is bad design, and we really need to teach programmers to stop doing it.  There are good reasons to use objects, but there are also good reasons not to.

I have talked about this before, but the problem here is OOP, and that is just as absurd as Integer Oriented Programming (or Bit Oriented Programming, as a friend suggested).  When we decide to orient all of our programming around a single data structure (or metadata structure, in this case), we limit ourselves.  We destroy our ability to make good decisions about what tools will be the best for each problem.  Objects are a great tool, but we don't use a chain saw to screw in a screw or hammer in a nail.  Honestly, I feel like multi-processing in Python or Java is just like following some really complicated instructions for how to push a peg into a hole with a table saw without ever touching the peg or injuring yourself.  It would be really nice if those languages would just let me pick up the peg and push it into the hole with my finger, instead of trying to build a huge complex machine around such a simple process.

Thursday, June 1, 2017

Reusable Software Design

I just wrote about reusable software, and I figured reusable software design deserves its own article.  Perhaps the biggest problem I identified with reusable code is that if it has problems, the time required to learn the code just to fix it largely negates any value in using it in the first place.  If we could easily understand reusable code enough to consistently adapt it to our own use cases, it would be perfect, but that is not how code works.  Code is inherently hard to understand.  Often, we even have a hard time understanding our own code after a few weeks of separation.  Isn't there some way we can consistently benefit from existing work, without the risk of having to spend so much time learning how it works that the cost is greater than the benefits?  It turns out that there is, and in some ways it may even be able to mitigate the costs of choosing not to use reusable code.  The solution is reusable design!

Design patterns have formally existed since object oriented programming became a thing.  Most design patterns are built around OOP.  They don't have to be though.  Things like loops and coherent if statements are primitive design patterns.  Functions are a kind of design pattern as well.  We often don't recognize this, because these have become integral elements of programming, but it is true.  These are all examples of design patterns that don't rely on object orientation.  In fact, very few design patterns require OOP.  Around a year ago, I wrote an event handler that uses the Event Queue design pattern, in C, and I have used the Observer pattern in C as well.  Any programming style or language can use design patterns (even assembly language).  Design patterns are not limited to OOP languages.
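
As a rough illustration of a pattern working without objects, here is a minimal sketch of the Observer pattern in plain C.  This is not the code I actually wrote.  The names (subject, subject_attach, subject_notify, add_to_total) and the fixed limit of eight observers are all assumptions made up for the example.  Observers are just function pointers paired with a context pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBSERVERS 8   /* arbitrary limit chosen for this sketch */

/* An observer is a callback plus an opaque context pointer. */
typedef void (*observer_fn)(int event, void *context);

struct subject {
    observer_fn callbacks[MAX_OBSERVERS];
    void *contexts[MAX_OBSERVERS];
    size_t count;
};

void subject_init(struct subject *s)
{
    s->count = 0;
}

/* Register an observer.  Returns 0 on success, -1 if the table is full. */
int subject_attach(struct subject *s, observer_fn fn, void *context)
{
    assert(s != NULL && fn != NULL);
    if (s->count == MAX_OBSERVERS)
        return -1;
    s->callbacks[s->count] = fn;
    s->contexts[s->count] = context;
    s->count++;
    return 0;
}

/* Call every registered observer, in registration order. */
void subject_notify(const struct subject *s, int event)
{
    for (size_t i = 0; i < s->count; i++)
        s->callbacks[i](event, s->contexts[i]);
}

/* Example observer: adds the event value to the int that context points to. */
void add_to_total(int event, void *context)
{
    *(int *)context += event;
}
```

After attaching add_to_total twice with two different int variables as context, each call to subject_notify updates both totals.  The whole pattern is a struct, an array, and a loop.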

Design patterns are superior to concrete code for reusability, because they are more abstract and flexible.  Design patterns don't specify implementation.  They merely provide a template.  For example, the "for loop" design pattern does not specify the exact number of iterations, the data type of the index variable, or the iteration step.  We can use it to do simple counting, to step through a string one (or two, or five...) character at a time, or to traverse an array.  These are flexible implementation details.  The if statement may or may not have an else or else-if clause.  The expression used to generate the boolean value used as the conditional is not defined for us.  We can use if statements for all kinds of things, because implementation is not defined for us.  Design patterns are flexible enough that it is trivial to adapt them to a particular application.

Good design can save a lot of coding time.  A friend of mine likes to say, "Hours of coding can save minutes of design."  Reusable code is generally endorsed for how much time it can save developers, but it comes with potentially serious costs.  Good design may take longer than crummy design, but it always pays for itself many times over.  Good design can save more time than using reusable code can.  Reusable design (design patterns) can save significant amounts of design time.  This is a double win, because judicious use of reusable design comes with no additional costs, and time spent (or saved) on design is generally more valuable than time spent on implementation.  When we save design time, we also save significantly more coding time.

Good design can reduce the need for reusable code, by finding ways to make coding from scratch faster than learning and using an external library or framework.  When reusable code is the best option, good design can optimize its use to reduce the costs.

Reusable design is, in my opinion, of far greater value than reusable code.  Design patterns are generic enough to solve a large number of problems, and they are easy enough to understand to customize them to a particular application.  Because design time is more valuable than coding time, they also have the potential to save far more time than reusable code.  Reusable design is probably something we should spend significantly more time and effort on.

Reusable Software

Ever since the invention of modular programming, reusable software has been a huge deal.  Nearly every programming language comes with built-in reusable parts or some kind of standard library.  A great number of articles have been written, extolling the many virtues of reusable software components.  Enormous repositories of reusable parts have been created.  Most problem spaces have complex ecosystems of frameworks, libraries, and sometimes even just short snippets of code or individual reusable functions.  Anything this good has to come at a cost though, and the costs of reusable software are almost universally ignored.

First I want to say that reusable software is awesome.  I am not trying to bash it here.  Reusable software has many great benefits.  It saves a ton of programming time.  Popular reusable components tend to be fairly high quality, because they have already been tried, tested, fixed, and improved.  Reusable components are typically fairly easy to learn and use, because no one wants to use something that is more expensive to learn than writing it fresh would be, and no one wants to use something that takes more effort to use than it would take to roll fresh.  Reusable software can be used to make reasonably high quality software far faster than writing it from scratch most of the time.

Reusable software is a two-edged sword, and most programmers cannot see the edge that is facing them.  Reusable software is so good that it is often difficult to see the costs, and even when the costs are known, they are often ignored.  Most of the time, the costs of reusable software are acceptable, but occasionally they can cause serious problems.  For example, recently a large number of web sites stopped working because a trivial piece of reusable code was removed from a popular Node.js repository.  The time savings for using the code was probably between 1 and 5 minutes for a single developer.  The total cost in web site down time was probably in the hundreds of thousands or millions of dollars.  This is not a common scenario, but it is one that could easily have been avoided by spending a few minutes to write a trivial piece of code, instead of relying on an external source to remain available forever.

Perhaps the most obvious cost of reusable software is fitness for a particular purpose.  To be useful, reusable code must be usable for a variety of applications.  This means it must be generic.  One cost of making something generic is that it becomes less suitable for nearly every application, especially very specialized ones.  For flexible applications that are not critical, the application can generally be adapted to the reusable code.  This will almost certainly affect design, but it rarely affects the utility and usability of the application.  Sometimes, however, it does.  Some specialized applications have strict design requirements that a generic component would violate.  In these cases, reusable software just cannot be used.  Occasionally, reusable software will seriously limit the design choices of even more mundane software in ways that are unacceptable.  In these cases, reusable software should not be used.  If using a particular framework or library feels like trying to fit a square peg in a round hole, it may be time to consider alternatives, including writing the code from scratch.

Another fairly obvious but deliberately ignored cost of reusable software is performance.  Making something generic means making it suitable for a wide variety of use cases.  This means a lot of features and a lot of "just in case" elements.  These use memory and processing power, whether you actually need the features or not.  In many applications, this does not make a huge difference.  In some though, it can make an enormous difference.  In applications that are performance critical, reusable code in bottlenecks can be a major problem.  Even in applications where performance is not critical, bloated or slow reusable code can be a problem.  Modern computers never run just one application at a time.  The typical computer is running 10 to 50 processes at any given moment, and all of those processes have to share resources.  While it might seem fine to waste a few megabytes here and there, when there are 50 processes wasting a few megabytes each, memory use can become a serious problem.  Similarly, processor time also must be shared between processes, and a process that is wasteful can affect the performance of the entire machine.  For many kinds of applications, this is rarely a problem but for some (web pages, for example, where a user may have anywhere from 5 to 100 tabs open at a time), a little resource hogging can go a long way.  In general, it is a good idea to keep in mind that reusable code is consuming resources for all of its features, even those you are not using.  A good rule of thumb is that if you are not using more than one or two features of a framework, it might be time to consider looking for something lighter or writing the code from scratch.

Reusable software makes code less maintainable.  This is very counter-intuitive.  There is a general assumption that using a framework or library takes some of the maintenance burden off of the programmer, and this is true.  Reusable code can definitely reduce the burden of maintenance, but this is not about maintenance time spent or saved.  This is about being able to maintain the code in the first place.  Learning exactly how a piece of reusable code works takes enough time that it defeats the purpose of using it, so few programmers bother.  If the reusable code itself has a bug, however, the only options are to ditch the reusable code and home brew a replacement, pay to have someone spend hours, days, or weeks learning how the code works and fixing the bug, or report the bug to the project and wait for it to get fixed upstream.  The second option is so expensive that it is hardly ever an option.  The typical solution is to sacrifice quality, utility, and usability by working around the bug.  In addition to this, because reusable code is significantly harder to change (because the programmers using it are not intimately familiar with its code), valuable design changes to a program may be impossible.  As with bugs, if a valuable design change is needed that does not work with the reusable software, the options are to ditch it and home brew a replacement, or pay someone for a lot of labor to figure out how to adapt the reusable code.  Of course, once the reusable code has been altered, updates from the original source become invalid, and the reusable code must be maintained entirely in-house, at additional expense.  Again, this is not necessarily a problem that will come up a lot.  Major design changes after delivery are not exactly common.  Reusable code does tend to be better tested as well, so bugs in it will be rarer.  When they do come up though, they can be very expensive.

Overall, reusable software is very valuable, and nearly every program uses it in some way.  It also comes with some costs though, and understanding those costs is important to get the most out of it.  Sometimes it is better to write the code from scratch.  Code written to the specific application will always be superior to reusable code.  Reusable code can ultimately be more expensive than coding from scratch.  Cases where reusable code costs more than writing code from scratch are generally uncommon, but determining this should be part of the cost analysis of any project.  In addition, there are different levels of reusable software that should be considered.  Using something like the C standard library or other built-in functions and features of a language is generally a given.  For any popular language, these have been tested and optimized more than anything else.  Libraries for hardware interaction are often necessary as well, though there may be multiple options.  These tend to have more testing behind them as well.  Convenience and aesthetics libraries (jQuery, for example) are typically far less critical, so they tend to have less rigorous testing, and the benefits they provide are less critical.  Less essential libraries that have been around for less time deserve the most scrutiny.  Very small pieces of reusable code also deserve scrutiny, because they may end up taking more time and resources in the long run than just writing them from scratch.  Despite its high value, reusable code is not a silver bullet.  It comes with its own costs, and sometimes those costs can be much greater than any benefits.  Our industry needs to spend a bit more time considering the options and potential consequences than it currently does, and if it did, things would work a lot more smoothly for everyone.