DDIA - Reliable, Scalable, and Maintainable Applications

  1. Overview

  2. Reliability

    2.1 Hardware Faults

    2.2 Software Errors

    2.3 Human Errors

  3. Scalability

  4. Maintainability

1 - Overview

As more applications become data-intensive (in terms of the amount of data, the complexity of data, and the speed at which the data changes), it is important to figure out which tools and approaches are most appropriate for the task. It can also be hard to combine tools when a single tool is not enough.

The goal when building a data-intensive application is a data system that is reliable, scalable, and maintainable.

2 - Reliability

Reliability means that “the system should continue to work correctly even in the face of adversity (hardware or software faults, human errors)”.

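A fault is one component of the system deviating from its spec, whereas a failure is the system as a whole no longer providing the required service. It is usually best to design fault-tolerant (or resilient) mechanisms that prevent faults from causing failures.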
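As an illustration of keeping a fault from becoming a failure, here is a minimal sketch that reads from several replicas in turn, so that one faulty replica does not fail the whole request. The names `read_with_failover`, `broken_replica`, and `healthy_replica` are hypothetical and exist only for this example; real systems add timeouts, backoff, and health checks on top of this idea.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised only when every replica has faulted: the fault became a failure."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in order and return the first successful read."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(key)
        except Exception as exc:  # a fault in one component
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for key {key!r}")


# Hypothetical replicas: the first one is down, the second one works.
def broken_replica(key: str) -> str:
    raise ConnectionError("replica down")


def healthy_replica(key: str) -> str:
    return f"value-for-{key}"


result = read_with_failover([broken_replica, healthy_replica], "user:42")
print(result)
```

The caller sees a normal result even though one component faulted; only when all replicas fault does the request itself fail.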

2.1 Hardware Faults

  • The first response to hardware faults is usually to add redundancy to individual hardware components to reduce the failure rate of the system. However, as data volumes and computing demands increase, applications use more machines, and the rate of hardware faults increases proportionally.

  • There is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.
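To make the proportional increase concrete, here is a rough back-of-the-envelope sketch. It assumes disks fail independently at a constant rate, and uses the 10 to 50 year mean time to failure (MTTF) range for hard disks that DDIA reports. The function name `expected_failures_per_day` is made up for this example.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent failures
    at a constant rate of 1 / MTTF per disk."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years, 10,000 disks: {rate:.2f} failures/day")
```

Under these assumptions a cluster of 10,000 disks sees somewhere between about half a failure and nearly three failures per day, which is why hardware faults become routine events at scale.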

2.2 Software Errors

Compared with hardware faults, software errors are more often systematic errors within the system. They are harder to anticipate and, because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.

There is no quick solution to systematic faults in software. Many small things can help, such as thinking carefully about assumptions and testing thoroughly.

2.3 Human Errors

Best practices for reducing human errors:

  • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions make it easy to do the ‘right’ thing and discourage the ‘wrong’ thing.

  • Decouple the places where people make the most mistakes from the places where mistakes can cause failures.

  • Test thoroughly at all levels from unit tests to whole-system integration tests and manual tests.

  • Allow quick and easy recovery from human errors.

  • Set up detailed and clear monitoring.

  • Implement good management practices and training.

3 - Scalability

Scalability describes a system’s ability to cope with increased load.

  • Load is described by load parameters, and the best choice of parameters depends on the architecture of the system.

  • Once the load is described by its parameters, one can investigate how the performance of the system changes as the load increases.

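  • A performance metric such as response time is a distribution of values that one can measure, not a single number. Percentiles (for example the median, the 95th, and the 99th) are a good way to describe that distribution.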

  • Approaches to cope with load:

    1. Scaling up: move to a more powerful machine
    2. Scaling out: distribute the load across multiple smaller machines
    3. In reality, good architectures usually involve a pragmatic mixture of the above two
    4. Use elastic systems when the load is highly unpredictable
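As a small illustration of describing performance with percentiles, the sketch below computes them with the nearest-rank method over a list of response times. The sample data and the function name `percentile` are invented for this example, and nearest-rank is only one of several common percentile definitions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the measurements are less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up response times in milliseconds, including two slow outliers.
response_times_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]

mean_ms = sum(response_times_ms) / len(response_times_ms)
print(f"mean: {mean_ms:.1f} ms")
print(f"p50:  {percentile(response_times_ms, 50)} ms")
print(f"p95:  {percentile(response_times_ms, 95)} ms")
```

With this data the mean (126.1 ms) is dominated by the two outliers, while the median (14 ms) shows what a typical request experiences and the high percentile (900 ms) shows how bad the slowest requests are.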

Overall, the architecture of a system that operates at large scale is usually highly specific to the application; there is no generic, one-size-fits-all scalable architecture.

4 - Maintainability

The majority of the cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.

Design principles for software systems to minimize pain during maintenance:

  • Operability: make routine tasks easy, allowing the operations team to focus its efforts on high-value activities.

  • Simplicity: make it easy for new engineers to understand the system by removing as much complexity as possible from it. Abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand facade, and can be used to remove accidental complexity.

  • Evolvability: make it easy to change the system in the future as requirements change.

Besides the non-functional requirements above, other non-functional requirements such as security and compliance, along with functional requirements (what the application should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), are also needed to make a useful application.