
Friday, 24 April 2009

Open Source Google for Everyone

There's a real 'spike' of activity going on at the Apache Software Foundation at the moment. I wrote about CouchDB in an earlier post, but there are a number of very interesting projects currently under way. Probably the most significant is Hadoop. Hadoop was promoted to an Apache 'Top Level Project' a year ago, and it's now taking off in the Open Source community.

Hadoop is distributed computing middleware designed to process petabytes of data across thousands of commodity hardware nodes. It implements a computational approach called Map/Reduce on top of a distributed file system to deliver a highly fault tolerant compute platform for processing very large data sets in parallel. Hadoop is 'inspired' by the papers Google published on its MapReduce and Google File System technologies.

So how does it work?

There are two major components to Hadoop:
  • HDFS - a distributed file system that replicates data across many nodes
  • Map/Reduce - an execution middleware that distributes processing to nodes where the data resides
Files loaded onto HDFS are split into chunks, and these chunks are replicated across multiple nodes in the Hadoop cluster. System monitoring responds to hardware and processing failures by re-replicating data to other nodes, providing very high levels of fault tolerance.
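
To give a flavour of how this looks in code, here's a minimal sketch of loading a file onto HDFS through Hadoop's FileSystem API. Note the class name, path and file contents are purely illustrative, and the Configuration is assumed to pick up your cluster settings (such as the NameNode address) from the standard Hadoop config files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadSketch {
  public static void main(String[] args) throws Exception {
    // Reads the cluster settings (fs.default.name etc.) from the config files
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS transparently chunks and replicates the blocks
    FSDataOutputStream out = fs.create(new Path("/user/demo/text1.txt"));
    out.writeBytes("google is the best search engine\n");
    out.close();
  }
}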

In the Hadoop programming framework data is record-oriented. Input files are broken into records, lines or whatever sub-element is appropriate for the application logic. Each Hadoop process running on a node processes a subset of these records. Wherever possible, processes act on data held on the node's local hard disk and do not transfer data across the network. In other words, Hadoop's strategy is to move the computation to the data rather than the data to the computation. This is what gives Hadoop its performance.



The splitting and recombining of data and processing is handled using a Map/Reduce algorithm. Records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together by a second set of tasks called Reducers, where results from different Mappers can be merged together.



The clever aspect of Hadoop is that it takes pretty much all of the clustering and distributed processing concerns away from the Developer, leaving them free to focus on the application logic.

In my early programming career I worked on Apollo Domain Workstations, and I always remember that one of the coolest programming examples shipped with the operating system (AEGIS) was a Mandelbrot generator that computed elements of the set on different nodes in the network in parallel. That was my first experience of the power of distributed parallel computing. The problem with the program, though, was that all the inter-process and node communication was coded 'low level' through TCP socket programming and the like. If I remember rightly, most of the code was handling all of this IPC plumbing rather than generating the Mandelbrot sequences. This is exactly the problem Hadoop solves.

The architecture of Hadoop exhibits flat scalability. On a small cluster with small data sets the performance advantage is minimal, if it exists at all. But once your program is running on two nodes with 1 GB of data, it'll scale to thousands of nodes and petabytes of data without modification.

For an example Hadoop application, imagine you wanted to write a program that counted the occurrences of each unique word across multiple text files. Example text files would look like:
text1.txt: google is the best search engine

text2.txt: a9 is the better search engine

The output would look like:
a9 1
best 1
better 1
engine 2
google 1
is 2
search 2
the 2

Pseudo code for a Map/Reduce approach to solving this looks like:

mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)

Several instances of the mapper function are created on different machines in the cluster. Each instance receives a different input file (it is assumed that we have many such files). The mappers output (word, 1) pairs which are then forwarded to the reducers. Several instances of the reducer method are also instantiated on different machines. Each reducer is responsible for processing the list of values associated with a particular word. The list of values will be a list of 1s; the reducer sums those ones into a final count for that word. The reducer then emits the final (word, count) pair, which is written to an output file.

The Hadoop distribution ships with a sample Java program that does essentially this task. It's available in the download under src/examples/org/apache/hadoop/examples/WordCount.java and is partially reproduced below:

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Split the input line into words and emit a (word, 1) pair for each
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

/**
* A reducer class that just emits the sum of the input values.
*/
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    // Sum the 1s emitted by the mappers for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The final component of a Map/Reduce program is the Driver. The driver initialises the job, instructs the Hadoop platform to execute your code against a set of input files, and controls where the output files are placed.

public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}
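
To round the example off, a minimal sketch of a main() entry point that wires up the driver might look like the following. The argument handling here is illustrative only; the shipped example parses its command line rather more carefully:

public static void main(String[] args) throws Exception {
  // args[0] is the input directory on HDFS, args[1] the output directory
  new WordCount().run(args[0], args[1]);
}

You'd then package the classes into a jar and submit the job to the cluster using Hadoop's command line tools, pointing it at input and output directories on HDFS.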

The Apache Hadoop project also has a number of sub-projects that utilise or complement the core Hadoop middleware including:
  • HBase - a distributed database
  • Pig - a high-level data flow language to ease the development of parallel programs for Hadoop
  • Zookeeper - a coordination service for distributed applications
  • Hive - a data warehousing infrastructure
  • Mahout - machine learning libraries supporting a Map / Reduce processing model
So is anyone using Hadoop, and what for?

You bet. Probably the biggest names using Hadoop are Facebook, Amazon and Yahoo. Facebook is using Hadoop to perform analytics on its service, and Amazon is using it to produce the product search indices for its A9 search engine. Even Microsoft is getting in on the act via its acquisition of Powerset, a natural language processing (NLP) search engine. Yahoo uses Hadoop to fight spam.

The New York Times used a Hadoop-based solution, running on Amazon's EC2 and S3, to convert 11 million articles from scanned TIFF images into PDFs!

A company called Cloudera has started offering development, consulting and implementation services to clients wanting to build Hadoop solutions.

I believe the future for Hadoop looks good. It opens up large scale parallel computing to organisations and companies for whom it just wasn't available before without dedicated supercomputing capabilities. Couple Hadoop with on-demand Cloud computing services such as Amazon's EC2 and S3 and you have supercomputing for the masses.

Google's success was built on the foundations of BigTable and its Map/Reduce technology. Having such technology available as Open Source will, I believe, drive a whole new generation of Internet computing services and applications.

Thursday, 2 April 2009

Role of the Business Analyst in Agile Projects

When I started my career in software in the mid 1980s it was in the role of Analyst / Programmer; the roles of Software Engineer, Developer and Architect just didn't exist - well, not as official job descriptions.

Analyst / Programmer described the role pretty accurately: I was responsible for understanding the business process and information requirements, eliciting system specifications, and designing the system, as well as implementation and test. Come to think of it, I did a fair bit of the deployment and system admin type activities too!

You still see Analyst / Programmer role descriptions appearing on job sites, but a big proportion of organisations have separated business analysis from development and implementation. I believe this is just a reflection of the general trend in the industry towards role specialisation - hence Data Architects, Security Architects, ERP Module Consultants and so on.

I believe one of the problems with the Business Analyst role is that organisations sometimes do not clearly define what the role is and how it bridges the business / systems divide. In my experience, a lot of Analysts have strong business or domain backgrounds but very little systems development experience. Also, a lot of Analysts I've come across in projects do not have any formal systems analysis or method training, e.g. RUP / UML / SSADM / Yourdon. I'm not saying formal systems analysis is a silver bullet, but strong skills and experience in systems analysis help to translate a business problem into a system definition.

What can happen in delivery projects is that a 'gap' grows between the 'technically orientated' development team and the business analyst community. It can end up with Developers rejecting requirements as too poor and vague, and Business Analysts getting frustrated that the system is not meeting the customer need. A lack of implementation and design detail in the requirements usually results in Developers making design assumptions in the code, which often turn out to be incorrect. Business Analysts, in some cases, end up being no more than proxies for the stakeholders.

The IS industry is littered with myths, and one that particularly annoys me is the statement that "techies" can't / won't / don't talk to the business, customers and end users. I will admit that people who go into Software Development and Programming often do so because they are attracted by the creativity of software and its technical aspects, but I've not yet come across a Developer who can't face off to the business when given the chance. I believe this myth ends up becoming self-fulfilling, as Developers don't get the opportunity to be exposed to the business domain.

There's also the myth that end users cannot carry out any form of analysis themselves. Most people now have PCs and Broadband Internet at home. I repeatedly come across end users who, when faced with an IS problem and no immediate solution, turn to customising Microsoft Office with VBA. IS professionals will, of course, "scoff" at this, but some of the solutions I've come across turn out to be quite smart given the limitations of the technology available to their authors.

The kinds of issues I've repeatedly seen with Business Analysis include:
  • Lack of formal training and systems analysis skills amongst Analysts
  • Analysts lacking understanding of the capabilities of the technology and the limitations of the architecture
  • Non functional requirements not defined, as these tend to need some level of architecture understanding
  • Over analysis, or Analysis Paralysis, as it's often called
The Agile approach is all about avoiding these problems. At its core is the philosophy that getting working software in front of customers frequently is the goal, with iterative, spiral development life cycles that emphasise prototyping to elicit requirements rather than paper specs, and with the implementation team embedded in the customer domain as far as is feasible. So in an Agile project, what is the role of the Business Analyst?

I don't believe that the Analyst role is dead; it just needs a radical rethink in the light of modern systems development.

I believe the key to improving the Analyst role is twofold:

Firstly, get Analysts cross-trained in technical skills - not necessarily to become proficient Developers, but to gain an appreciation of current technologies and software development. Also ensure they have some level of formal systems analysis background to bring "systems thinking" to eliciting business requirements.

Secondly, re-position the Analyst role to be one focused more on business change, process improvement, training and acting as a "champion" for the solution being built, rather than on gathering and documenting requirements. I believe this role is key to getting a solution deployed into an organisation and benefits realised from it.

If you want to find out more on this subject then I'd recommend an article on Agile Analysis by Scott Ambler.

Thursday, 15 January 2009

Facts & Fallacies of Software Engineering

Most software developers know how systems really get built, and most will have come across organisations repeating the same old mistakes time and time again. And of course, there are those myths the industry perpetuates, such as the notion that all developers are equal in output and productivity.

Robert Glass's Facts and Fallacies of Software Engineering lays out these 'home truths' and 'urban myths' of the systems development process. The book draws upon Robert's pretty much unrivalled experience in the software field, dating back to the pioneering 1950s. There can't be many people still active in the industry with such an eminent and long career.

I feel an affinity with Robert's career, as he explains in his introduction to Chapter 1 (About Management) how he shunned career prospects in management to stay true to the technologist path. I too flirted with the vision of aiming for Senior Management positions in my early 30s, starting the ubiquitous MBA route to bolster my prospects. I tired of the MBA in the end, deducing that (i) most management theory was just plain common sense dressed up in Consultant-speak and (ii) you could pick up the same knowledge just by reading a few well-chosen management books and save yourself a shed load of cash in the process.

So, back to the book. Robert lays out 55 facts and fallacies across areas including management, the life cycle and quality. Pretty much all of them I recognise and agree with. There are a couple of odd-ball, controversial ones - "COBOL is a very bad language, but all the others are so much worse", for example.

The book simply presents these facts and fallacies grouped by domain and subject, provides rationale and examples for them, and supports their credibility by referencing other work. It can be a bit dry to read front to back, but the text's really meant for dipping in and out of when you're looking for inspiration to solve your project's issues.

The key facts and fallacies for me include:
  • The most important factor in software work is the quality of the programmers - it never ceases to amaze me how often this goes unrecognised. I have seen so many projects where developers, analysts and architects are treated as 'fully interchangeable' by management. I have seen lead architects swapped between programmes just before major go-live milestones! Management need to recognise that the knowledge, skill and experience of the technical team at the coal face of delivery are the greatest influence on whether a project succeeds or not.
  • Adding people to a late project only makes it later - when projects overrun there's always a temptation to 'throw' more resource at them. This invariably makes the situation worse, with more communication paths between team members and massively reduced productivity of your key technical staff as they spend time getting 'newbies' up to speed. Also, I believe that no matter how complex the architecture of a system, there is a limit to team size beyond which productivity falls. As teams grow, not only do you have the learning curve and communication problems, but you're also more likely to get team members who just don't get on with each other. I've also observed that in the panic to accelerate progress, the recruitment process can fall down, with less experienced and skilled people being brought on board.
  • Estimation usually occurs at the wrong time by the wrong people - when a new initiative is agreed it's usually given to a project manager who has possibly never delivered a project like it before and may be non-technical. Yet senior management will usually demand a schedule and budget forecast, possibly years ahead, and then hold the project manager to that schedule. Managers are usually reluctant to give senior stakeholders revised estimates as the project progresses, through fear of losing credibility.
  • For every 25 percent increase in problem complexity, there is a 100 percent increase in solution complexity - this is one of the least understood of Robert's facts, even amongst technical people. As a solution evolves and the business need is better understood by users and the delivery team, system features that appeared straightforward early in the life cycle suddenly start getting complex from a design and implementation viewpoint. Add on top of this the inevitable change in features and system behaviour that occurs as the project matures, and the team can suddenly hit a wall of rapidly expanding system complexity. If not contained, this can quite easily de-rail the delivery. Stakeholders often get frustrated when asking for what they see as simple feature requests, only for the delivery team to explain they can't be done without blowing the schedule or budget.
  • One of the most common causes of runaway projects is unstable requirements - see my article on Forget Requirements - Collaborate on a Solution Concept for a viewpoint on this one.
  • Software needs more methodologies - I have to admit to detesting most 'methodologies', by which I mean the likes of RUP, PRINCE2, DSDM etc. The content is usually valid - for example, RUP contains loads of good practice guidelines on use cases, OOAD and so on. It's just that they (i) tend to be seen as magic bullets and are over-promoted by Vendors and Consultants as the saviour to all your problems, (ii) are usually implemented prescriptively with a one-size-fits-all approach, and (iii) end up massively increasing the bureaucracy that was probably already present in your organisation - only now it's got a name!
So what can an organisation learn from this book?
  • The 'coal face' technical people - their knowledge, experience and skills - are the most important factor in delivery success
  • Move away from large, long term, waterfall-driven IT programmes with wildly optimistic schedules and budgets, to incremental, iterative solution development delivering smaller capabilities but significantly quicker
  • Manage stakeholder expectations on what can and cannot be realistically achieved with available technologies
  • Forget methods, and tools for that matter; even when well implemented these deliver only marginal improvements compared to the technical experience, skills and capabilities of your people.


The only other comment I'd add about this book is that I still haven't fathomed out why there's a picture of a Snowy Owl on the front. I must email Robert Glass and ask him.

Monday, 12 January 2009

Forget Requirements - Collaborate on a Solution Concept

Requirements in systems development have always been a difficult area. In the Standish Group Chaos Report, issues with requirements consistently appear among the top three reasons for project failure.

With this in mind, there tends to be a management emphasis on "getting the requirements right" before committing to any form of development or implementation. Yet I've experienced numerous projects where hundreds, if not thousands, of man hours have been devoted to requirements, and the solutions still have not met expectations. I suspect anyone reading this has experienced similar projects. So why is requirements management so often poorly executed?

You often hear people talk about traceability, configuration and change control, use cases, process models and so on. Management will throw Process Improvement, Quality Teams and frameworks such as CMMI at the problem.

For me there are some 'home truths' about requirements which make the task, if tackled in the 'traditional' way, near on impossible:
  • The majority of IT programmes are driven 'top down' with very scant definition of what's required, usually some vague goals - if you're lucky.
  • Stakeholders that will actually have to use the system are often not engaged until the end of the life cycle - if at all.
  • Stakeholders and sponsors usually change during the project life cycle, along with their expectations and, therefore, the requirements.
  • Users often cannot express their needs in terms that can be easily translated into a system specification.
  • Management and users usually have no understanding of the constraints or capabilities of the technologies. They ask for features that are infeasible or uneconomic to implement or, at the other end of the extreme, they don't ask for features which would be simple to deliver because they don't realise they can
  • Management ask for 'signed off' requirements documents, yet no-one ever reads them, let alone understands them.
  • Business processes, rules and taxonomy are 'fuzzy', ill-defined and not agreed upon by stakeholders
  • Stakeholders will keep changing their minds and usually come up with conflicting requirements
  • Users and management usually cannot see a business process working any differently to how it works now, resulting in lost opportunities for IT-driven improvement.
For a good example: I was working in an Investment Bank on an Asset Management system. I remember a workshop where we were trying to detail the business rules of a particular financial instrument. When we got to the real nitty gritty of how these rules worked, the guy who was the SME for this instrument said "...the system calculates all that". It turned out that very few people in the business understood the detail, as it had all been encoded in a Mainframe system that had been there longer than their time in the company! Cue the development team spending man months reverse engineering thousands of lines of ADABAS code!

I could go on, but you get the idea. Basically, the traditional approach encouraged by the Waterfall life cycle and heavyweight methods such as PRINCE2, SSADM and, to a certain extent, RUP doesn't deliver the goods in the majority of projects.

I believe a big part of the problem is that a requirement can end up being anything from a high level business objective, e.g. "the system shall reduce the claim process time by n%", to a specific system requirement, e.g. "all buttons shall be blue", and variations in between. In theory the requirements analysis process should weed these issues out. But it rarely does, due to the simple fact that requirements are being captured in what I call a 'solution architecture vacuum', i.e. they can't be validated against any form of system implementation view that sense-checks their feasibility. This can continue until your project is overflowing with requirement statements and process models and the whole thing ends up in Analysis Paralysis.

What's the solution? Well, there's a lot of talk in the industry about Agile - in fact, so much so that it's become an industry in itself, and possibly well on its way to becoming an oxymoron. I have seen very few organisations truly embrace an Agile approach, mainly due to management culture and vested interests, but that's another article.

In my view if organisations want to improve their approach to systems delivery then they really need to drop the idea of requirements management altogether, at least in the traditional sense of doorstop URDs, SRDs, Use Cases, endless Workshops and incomprehensible Process Models.

A fresh approach is required that focuses not on requirements but on the solution, right from the start of the project life cycle. An overview of the approach is outlined below.


The approach is, of course, Agile, but adds the concept of an Increment, or Micro-Increment, on top of an Iteration. Increments should be measured in days, yet still deliver some demo-able or executable software to stakeholders. Micro-Increments are important because they drive projects to meet short term goals focused on software delivery; even something as simple as a dumb HTML UI mock-up adds infinitely more value than lines of requirements text or use cases.

Inputs to the Solution Concept include:
  • Available Technology Components - ensure you base your architecture on components and technologies you're confident you can readily develop and deploy. Look for maximum reuse, both in the small, e.g. Java persistence frameworks, and in the large, e.g. packaged COTS modules such as ERP and CRM
  • Application Architecture Patterns - very few business systems are entirely new, in all probability elements of the solution you're trying to build have been built and proven. Don't waste time reinventing wheels, leverage these patterns
  • Legacy Systems - this may be both systems that your solution will replace and systems you'll need to interface to or extract data from. It also includes manual systems, paper forms and any 'home grown' end user solutions, usually based upon desktop tools such as Excel and Access. Don't dismiss these by the way, I've repeatedly come across some pretty impressive solutions built by keen amateurs!
  • Business Goals & Objectives - understand what the business is trying to achieve and what a successful system looks like. The more you can immerse yourself in the users' problem from their perspective, the better chance you have of building a great solution. More often than not, you'll uncover whole areas of requirements that users have not even thought about.
  • Programmatics & Risk - ensure the budget and desired time scales are baked into the solution design from the start. There's no point designing a solution that's going to take 2 years when the stakeholders need something now! On the point of schedules, it's my view that if the solution is going to take longer than 9 months to go live then you should either (i) reduce scope, (ii) break the solution up into smaller elements or (iii) forget it! In my experience, any information system that takes longer than 9 months to get deployed is likely to be pretty useless, as the organisation will have moved on. The rule here is: the faster you can deploy solutions to production, the better.
Once you start to lay your hands on these inputs, the Increments themselves are all about getting stuff built! Yes you need to maintain some documentation, but keep it lightweight and value add.
  • UI Prototypes - use cases are okay, but there's no substitute for putting what at least looks like a real solution in front of stakeholders. In my experience UI prototypes validate system requirements better than any process modelling or workshops could ever do.
  • Demo-able Solution - if it's feasible to build some form of functional prototype within the bounds of an Iteration, then you should. Focus on the most complex or least understood area of the solution first.
  • Architecture Prototype - sense-check your technology stack, runtime topology and non-functionals as early as you can. Often these issues constrain the functionality that can be implemented. For example, you may be able to do some fancy stuff in the Browser with a plug-in, but the Corporate firewalls block the port it uses. You want to find these issues out right at the start of the project, before you commit to the architecture.
  • Candidate Feature List - as you produce these prototypes and get feedback from stakeholders, you'll start to get 'real', useful requirements that are in the context of a system. I don't call these requirements, I call them system features, as they are tied to the architecture. The development team should unequivocally understand how each feature works, what good looks like, and potential approaches to implementation and test.
In my experience, once a project gets into a 'groove' of running Increments continuously throughout the life cycle of the delivery, the whole process becomes self-reinforcing through better defined features, improved prototypes and so on. In fact, the process doesn't really change from inception to the final go-live: prototypes gradually move to alphas, betas, pre-release versions and release candidates, and then a final decision is taken to promote a release candidate live.

So, stop talking about requirements, and start building Solution Concepts.

If you're looking for further useful info on this approach then I'd recommend:
  • Eclipse EPF - an Eclipse project focused on an Open Source lightweight development approach based on IBM's RUP but stripped to the core.
  • Feature Driven Development - promotes a project delivery approach called FDD based on features. There's also a book available on FDD.
  • Introduction to Features - definition from Scott Ambler as to what a good Feature looks like.
  • Agile Manifesto - and finally, keep this web page open in your browser at all times to remind you what your job is!

Friday, 19 December 2008

Patterns of the Agile Organisation

Of all the software, technical and management books I've purchased over the years, there's one that really stands out for me - Organizational Patterns of Agile Software Development by James Coplien and Neil Harrison.

Organizational Patterns goes straight to the core of what's required to design and develop great systems, on time and on budget. Agile, of course, has been a buzzword in software development circles for a number of years now. For me, though, this sort of defeats the object of Agile: people are writing endless books about Agile (543 according to amazon.co.uk as of December 2008) rather than rolling their sleeves up and developing software systems!

The difference with James Coplien's book is that he started working on this text well before Agile became a software industry term, let alone fashionable. James admits in the book's Introduction that they tacked the word Agile onto the title purely for 'marketing' purposes. The basic premise of the book is that software development can be broken into 'nuggets' of people behaviour and social relationships that make projects really work. Pretty much every one of the Patterns discussed I recognise in one shape or form.

The best bit about the book is the 'anti-patterns', i.e. ways of working that definitely do not work. Most techies reading these will recognise typical non-technical 'management' behaviours. For example, adding more developers to an already late project in the vain hope that doubling the workforce will magically halve the schedule. Inevitably, of course, we all know it ends up doubling it, and then some.

I have this book on my desk at all times. Amongst other things it keeps me sane and grounded in what really makes software projects work.

My favourite pattern in the book is the description of the role of the Wise Fool. The Wise Fool is the 'techie' in the organisation who really knows his (or her) stuff and what's going on at the coal face of projects and, more importantly, is willing to raise 'uncomfortable truths' and openly question the management. To quote:

"A Wise Fool, though known for lacking tact, is usually highly respected technically and may be (or become) a Legend Role. They usually eschew managerial opportunities and may even show disdain for management. An acquaintance of the author was onced honored with the following words 'In the face of management opposition, he charged ahead and did what was right'"

I try to live up to the role of the Wise Fool every day of my working life! If you're struggling with problem IT projects then I highly recommend this book.