The cost of software downtime depends on many factors, such as your revenue, your industry, the duration of the outage, and the number of people affected. The point remains: when your enterprise software unexpectedly crashes, the effects on your business can be profound.
Drawing on our deep product and engineering experience with data-intensive enterprise applications, we at Egen have created a checklist that we use to evaluate an existing or legacy platform. This checklist also helps us answer questions such as “Why has this application been crashing?”, “Why is this API endpoint slow?”, and “Why can’t certain apps scale?”
Here is that living and breathing checklist:
1. Data Ingestion
- How does data enter the platform? Is it only through the UI, or are there other backend flows as well?
- Do you have a data flow diagram of all the data entry and exit points?
- What’s the average payload size and frequency of data ingestion requests?
- Do these ingestion flows auto-scale with the load (more users or more sources)?
- For the backend ingestion flows, have you looked into streaming (async) frameworks such as Apache Kafka instead of relying only on synchronous web services?
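The difference between synchronous and asynchronous ingestion can be sketched with Python's standard-library `queue`, used here only as a stand-in for a broker such as Kafka. The function name `run_ingestion` and the `handle` callback are hypothetical; the point is that the producer hands payloads off quickly while background workers process them.

```python
import queue
import threading

def run_ingestion(payloads, handle, workers=2):
    """Buffer payloads in a queue and process them on background workers."""
    buffer = queue.Queue()      # in-process stand-in for a broker topic
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = buffer.get()
            if item is None:    # sentinel: no more work for this worker
                return
            out = handle(item)
            with lock:          # protect the shared results list
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for payload in payloads:
        buffer.put(payload)     # the producer does not wait for processing
    for _ in threads:
        buffer.put(None)        # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

With a real broker, the producer and the workers would be separate services that scale independently; the order of results is not guaranteed in either case.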
2. Data Processing
- Have you identified any module (block, function, class) that is running long data processing tasks?
These are often the code blocks with very high cyclomatic complexity (simply put, too many branches and loops).
- Are there modules that are processing the exact same data redundantly?
For example, every repeated API request recalculates a score on data that does not change over time. This kind of processing can be done once, at ingestion time, using streaming/messaging flows.
- Are you utilizing available CPUs properly?
Try parallel execution wherever you can.
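The redundant-processing example above can be sketched as follows: the score is computed once when a record is ingested, and API reads become plain lookups. `ScoreStore` and its method names are hypothetical, and an in-memory dict stands in for whatever storage the platform actually uses.

```python
class ScoreStore:
    """Compute a score once at ingestion time and serve it on every read."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._scores = {}
        self.computations = 0   # counts how often the expensive function ran

    def ingest(self, record_id, record):
        # The expensive work happens here, once per ingested record.
        self._scores[record_id] = self._score_fn(record)
        self.computations += 1

    def get_score(self, record_id):
        # API requests only read the precomputed value.
        return self._scores[record_id]
```

If a record changes, ingesting it again under the same `record_id` refreshes the stored score.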
3. Data Validation
- Is incoming data validated against a schema?
If you don’t know the schema of the data stored in your system, you don’t know what’s in there and how you can use it.
- Are inline and set-based validations in place before incoming data is written to the data storage systems?
- How do you handle issues with incoming data? Is bad data ignored or quarantined?
- Is monitoring in place to alert you whenever corrupt data is ingested?
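As a minimal sketch of schema validation with quarantine, assuming a hypothetical three-field schema, the code below checks each incoming record and routes failures to a quarantine list together with the reasons, instead of silently dropping them.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str, "amount": float}

def validate(record, schema=SCHEMA):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors

def ingest(records, schema=SCHEMA):
    """Split records into accepted ones and quarantined ones with reasons."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record, schema)
        if errors:
            quarantined.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, quarantined
```

The size of the quarantine list is also a natural metric to monitor and alert on.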
4. API Latency
- Do you have tools in place for Application Performance Management to monitor API performance?
For example, New Relic or AppDynamics.
- Is the UI code taking more time to process and render the response, or are the API endpoints slow?
- Have you identified specific API endpoints that are showing very high response latencies or timeouts?
- Have you identified specific API endpoints that are returning large response payloads?
These APIs should be modified to return only the requested properties of the response objects. Think of sending a filter like ?fields=id,name,company.name in the API request.
- Do API services scale with user load?
- Can some form of parallelism be used to properly utilize all available CPU cores?
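The `?fields=` filter mentioned above could be applied on the server roughly like this. It is a sketch, not a specific framework's feature: `filter_fields` is a hypothetical helper that takes the raw query-string value and supports dotted paths for nested objects.

```python
def filter_fields(obj, fields):
    """Return a new dict holding only the requested comma-separated paths."""
    result = {}
    for path in fields.split(","):
        keys = path.strip().split(".")
        value = obj
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break               # unknown path: skip it
            value = value[key]
        else:
            # The whole path resolved; rebuild the nesting in the result.
            dst = result
            for key in keys[:-1]:
                dst = dst.setdefault(key, {})
            dst[keys[-1]] = value
    return result
```

For a user object with `id`, `name`, `email`, and a nested `company`, the filter `id,name,company.name` keeps the two top-level properties and only the company name.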
5. Monolith vs Microservices
- Do you need to use the microservices pattern and break down the monolith?
There are pros and cons to both monoliths and microservices, so these patterns should be evaluated per platform. Microservices are not a silver bullet. You don’t want to run into distributed transaction issues with microservices; instead, make peace with eventual consistency.
- Have you identified modules that are candidates for microservices?
These modules should be broken into multiple microservices that can scale independently. For communication between microservices, a queue-based pattern should be preferred over synchronous web services.
6. Database Tuning
- Have you done slow-query-log analysis on the database?
Simplify those complex queries. Get help from a DBA.
- Is caching enabled between API service and database layer?
You may need to look into distributed caching frameworks if multiple services use the same database tables.
- Are there deadlocks in parallel query executions?
- What’s the ratio of reads to writes in the queries hitting the database? Are writes done in batches?
- Have you checked the feasibility of a clustered database?
A clustered database with separate read and write replicas helps with high availability (HA) as well as scaling.
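For the caching question above, a minimal in-process read-through cache with a time-to-live might look like the sketch below. `ReadThroughCache` and its `loader` callback (standing in for the database query) are hypothetical names, and a single-process dict is used only for illustration; services sharing the same tables would need a distributed cache instead.

```python
import time

class ReadThroughCache:
    """Serve reads from memory and call the loader only on a miss or expiry."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader       # e.g. a function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # fresh hit: no database call
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call this after a write so readers do not see stale data.
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be; invalidating on writes tightens that further.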
7. Memory Management
- Do you know the resource (CPU and memory) requirements of the apps?
- Have you identified apps that are going down with Out of Memory exceptions?
8. Backward Compatibility
- Can you release new versions of APIs without worrying about version compatibility issues?
With frequent releases, one of the issues that come to the forefront is the compatibility mismatch between various application components. For example, an existing API endpoint has been updated but the client side is not changed/released to handle that. In the case of mobile apps, even if a new version of the app is released, it’s not guaranteed that all users will upgrade the app once the changes on APIs are live. So, APIs have to be backward compatible. In short, create a new endpoint, mark the old endpoint as deprecated, wait for a couple of releases, and then delete the old endpoint.
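The "new endpoint, deprecate the old one" sequence can be sketched without any web framework. All names here (`get_user_v1`, `get_user_v2`, the `ROUTES` table, and the response shapes) are hypothetical; the old handler keeps its original response shape while delegating to the new one and flagging itself as deprecated.

```python
import warnings

def get_user_v2(user_id):
    """New endpoint: returns the new response shape."""
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}

def get_user_v1(user_id):
    """Old endpoint: still works for existing clients, but is flagged."""
    warnings.warn("get_user_v1 is deprecated; use get_user_v2",
                  DeprecationWarning, stacklevel=2)
    user = get_user_v2(user_id)
    # Translate back to the shape that old clients expect.
    return {"id": user["id"],
            "name": f'{user["first_name"]} {user["last_name"]}'}

# Both versions stay routable until the old one is finally removed.
ROUTES = {"/v1/users": get_user_v1, "/v2/users": get_user_v2}
```

Logging or counting the deprecation warnings shows when traffic to the old endpoint has dropped far enough to delete it.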
1. Automated Tests
- How much testing is automated vs manual?
- Do you have automated integration tests apart from unit tests?
- Do you have a test code coverage report?
- Are there regression tests that run on every release?
Better automated test coverage reduces human errors and misses when the application is released frequently.
2. Data Consistency Tests
- Do you have data consistency tests running on a regular schedule in prod?
Data consistency tests on the databases are among the most important tests to have. There should be a suite of scheduled, read-only data consistency tests that cover missing objects, garbage values, temporary records, access control lists, and so on.
- Do you have penetration and data integrity checks covering issues with access security?
Access security issues are among the hardest to catch in unit and integration tests. Access checks should be the first thing an API executes before any further processing of a request. Inappropriate access levels may corrupt data permanently, which can cause inconsistency in the application flows.
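A scheduled, read-only consistency suite can be as simple as a set of named queries that each count violations. The sketch below uses `sqlite3` and hypothetical `orders` and `customers` tables purely for illustration; in practice the same idea would run against a read replica of the production database.

```python
import sqlite3

# Each check is a read-only query that returns the number of violations.
# Table and column names are hypothetical.
CHECKS = {
    "orders_without_customer": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE c.id IS NULL
    """,
    "negative_order_totals": "SELECT COUNT(*) FROM orders WHERE total < 0",
}

def run_consistency_checks(conn):
    """Run every check and return {check_name: violation_count} for failures."""
    failures = {}
    for name, sql in CHECKS.items():
        (count,) = conn.execute(sql).fetchone()
        if count:
            failures[name] = count
    return failures
```

An empty result means every check passed; anything else is worth an alert.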
3. Performance Tests
- Is performance testing done routinely with every release?
- Does the perf environment closely replicate prod in terms of data load and users?
1. Container Orchestration
- If using containers, does your team have enough experience with the container orchestration framework being used?
Microservices are popular, but they need a lot of orchestration built around them to perform well, stay available, and scale individually. It takes many iterations for a dev team that has been living in the monolithic world for a while to adopt the microservices pattern. It’s a change in mindset, not just in the build and release process. Once that is settled, an orchestration framework such as Kubernetes or DC/OS becomes an absolute necessity.
2. Resource Monitoring
- Is resource monitoring in place to continuously measure the CPU, memory, disk, and network performance of the services?
- Are alerts configured to notify you whenever these numbers cross 80% of their limits?
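The 80% rule above boils down to a ratio check per resource. In this sketch, `resources_over_threshold` and the shape of the `usage` and `limits` dicts are assumptions; a real setup would express the same rule in the alerting configuration of the monitoring tool in use.

```python
def resources_over_threshold(usage, limits, threshold=0.8):
    """Return the names of resources at or above threshold * limit."""
    over = []
    for resource, used in usage.items():
        limit = limits[resource]
        if limit > 0 and used / limit >= threshold:
            over.append(resource)
    return over
```

For example, 3.4 of 4 CPU cores in use is 85% and triggers an alert, while 2 GB of an 8 GB memory limit does not.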
3. Auto Scaling
- Do services auto-scale with the load?
- Is an auto-scaler deployed that works with Resource Monitoring tools?
If using container orchestration, container autoscaling and node scaling are two different things, and both have to be provisioned separately.
4. Disaster Recovery
- Do you have a disaster recovery plan if production goes down because of issues related to the cloud provider, networking, or a security breach?
- How often do you test your DR plan? Does it even work within the SLAs?
- What’s your SLA for the availability of production systems?
In other words, how fast can you bring the production environment back to a minimal working condition, as required by the SLAs?
- What is the ratio of the number of different production apps/services to the number of members in the DevOps team?
A higher ratio means a good amount of automation is in place. Investing in automating the infrastructure, monitoring, and auto-scaling frameworks should be the highest priority for the DevOps team. This team should spend its time on new ideas to reduce infrastructure cost, improve high availability, solidify DR, and onboard new customers rather than on mundane tasks.
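To make the availability SLA question concrete, it helps to translate a percentage into a downtime budget. The helper below is a small illustrative calculation with a hypothetical name; the 30-day period is an assumption, so adjust it to whatever window your SLA uses.

```python
def downtime_budget_minutes(availability_percent, period_days=30):
    """Minutes of downtime allowed in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100
```

At 99.9% availability over 30 days, the budget is roughly 43 minutes, which is the time a tested DR plan has to fit into.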
Product and Engineering Alignment Issues
You can have the best architecture, applications, and infrastructure, but misalignment between users, product, and engineering teams can nullify all of it.
1. User Acceptance and Feedback Loop
- Have end users performed User Acceptance Testing on the mocks?
Not just on the developed apps.
- Is their feedback incorporated back into the product design?
Good A/B testing should be done with users before starting the actual development.
2. Team Composition
- Is the composition of the engineering and product teams right for the product being built?
For example, for a data ingestion/onboarding project, you cannot depend on front-end engineers. The same rule applies to the product team as well. You need a Product Owner who acts as the CEO of the product, not someone who just manages the project lifecycle.
- If following the Agile process, do you have a sufficient backlog for at least a couple of sprints at any point in the week?
Developers should be able to see what’s coming two sprints ahead and should be ready to reprioritize if required.
4. Technical Debt
- What measures are being taken to manage and reduce the technical debt over time?
Every project has some amount of technical debt (blame it on the product team, jk).
- Are teams encouraged to identify technical debt in the application and work on those issues in sprints alongside new features?
5. Throw Over the DevOps Wall
- Is the engineering team engaged with the entire release pipeline: build, test, deploy, repeat?
Apart from developing and testing, engineers should also understand how the code gets deployed to all environments. Engineers are the best people to judge the quality and stamina of their code. The release process should be transparent to the engineering team and not become a purely DevOps-centric process.
Don’t keep your business at risk. At Egen, we recognize the complexities of application architecture and infrastructure and help you identify and mitigate points of failure in all of their forms. Once you understand the vulnerabilities, it is much easier to take preventative measures to protect your business.
If you are unsure of how to go about this, or what the process consists of, we offer this comprehensive assessment as a part of our core service offering to clients.
Banner Image Credit: https://github.com/grafana/grafana