NUTRINTG Middleware Monitoring
This documentation is a guide for anyone who needs to monitor the Middleware services: CDS, Mule and LDS. Each service runs on a platform where users can view logs and actively maintain and troubleshoot it.
MuleSoft Services
Mule services are all monitored from https://anypoint.mulesoft.com/, whereas CDS and LDS are monitored on Kubernetes. The list NUTRINTG List of Middleware Services contains each service that needs to be checked to ensure proper operation. At times, a service can experience socket timeouts, in which case it should be restarted. Other errors, e.g. 400 Bad Request or 401 Unauthorized, are also common. They originate outside the service and can be ignored, as they do not reflect the health of the service.
Checking the logs of a Mule service
Within the CDP-PROD environment, a user can search for a service, e.g. cdp-mule4-lead-ads-service. The search matches partial words, e.g. “lead-ads”.
Clicking on the service directs you to a detailed view of a service as shown in the picture below.
From this view you can monitor different aspects of the service. The Logs tab on the left panel is the most important for checking a service’s health. It lists all messages received and generated by the service. Each service stores about 100 MB of log data, which could go back days or months depending on how frequently the service processes data.
Logs can be filtered by keyword, date-time and ID. For example, adding “Error” to the search bar presents a list of all errors logged so far. Filtering lets you find issues faster than scrolling through the logs, in particular the 24Hr filter, which narrows the data to the last day. Detailed logs can be viewed as seen below.
If a service logs an ERROR that reads or matches “socket timeout”, the service needs to be restarted (only authorized users can restart PROD services).
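The triage rule described in this section can be sketched as a small classifier. This is only an illustration and not part of Anypoint: the function name classify_log_line, the regular expressions and the sample log lines are assumptions, and the real log wording may differ.

```python
import re

# Assumed patterns; adjust them to the wording actually seen in the logs.
RESTART_PATTERN = re.compile(r"socket\s*time-?\s*out", re.IGNORECASE)
IGNORABLE_PATTERN = re.compile(r"\b(400|401)\b|bad request|unauthorized", re.IGNORECASE)


def classify_log_line(line: str) -> str:
    """Return 'restart', 'ignore', 'review' or 'ok' for a single log line."""
    if RESTART_PATTERN.search(line):
        # Socket timeouts mean the service should be restarted.
        return "restart"
    if IGNORABLE_PATTERN.search(line):
        # 400/401 responses are external to the service and can be ignored.
        return "ignore"
    if "error" in line.lower():
        # Any other error is left for a person to look at.
        return "review"
    return "ok"
```

For example, classify_log_line("ERROR socket timeout while calling upstream") returns "restart", while a line containing "401 Unauthorized" returns "ignore".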
Kubernetes Services
With access to Kubernetes, the CDS and LDS services can be monitored. With the right permissions, a user is able to view the relevant namespaces; in particular, the digital-core namespace is where CDS and LDS reside. In the picture below, the left panel has sections that can be used to monitor various aspects of CDS and LDS, in particular the Pods tab.
Navigating to “Pods” shows a list of all services in the digital-core namespace. In the picture below, you can see buttons labeled 1 and 2 in red. These are the search and logs buttons respectively. Clicking the search button lets you search for the CDS services listed in NUTRINTG List of Middleware Services. Next to each service there is a number that represents the number of instances running for that service; each service name also ends with a unique ID. All services and their instances should be checked.
Navigating to the log view (clicking the button labeled 2) presents the following view.
From here you can view the current state of a service. Typically, JSON data is logged, and information regarding validation and method calls can also be seen. A service experiencing issues logs a message that looks like “Trying to persist a connection to x CDS…”, which indicates that the service has failed to connect to one of the other running services. In this situation, the service needs to be restarted via TeamCity (https://teamcity.digital-rb.com/overview.html).
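When the log text of a pod has been copied out of the log view, the failure message quoted in this section can be searched for with a short script. This is a minimal sketch: the function name find_connection_failures and the sample log text are hypothetical, and the pattern is based only on the example wording given in this guide.

```python
import re
from typing import List

# Assumed wording, taken from the example message quoted in this guide.
CONNECTION_RETRY = re.compile(r"Trying to persist a connection to (\S+)", re.IGNORECASE)


def find_connection_failures(log_text: str) -> List[str]:
    """Return the targets named in 'Trying to persist a connection to ...' messages."""
    return [match.group(1) for match in CONNECTION_RETRY.finditer(log_text)]


# Hypothetical sample: one normal JSON line and one failure message.
sample_log = '{"event": "validated"}\nTrying to persist a connection to datarow CDS...\n'
failed_targets = find_connection_failures(sample_log)
```

A non-empty result suggests the service is a candidate for a restart via TeamCity.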
Each service should have a 7-day lifetime, which means every service should be restarted at least once every 7 days.
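The 7-day rule can be checked with simple date arithmetic once the start time of an instance is known, for instance from the pod details in Kubernetes. The sketch below assumes timezone-aware timestamps; the function name is_restart_due is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Maximum lifetime of a service instance, as stated in this guide.
MAX_LIFETIME = timedelta(days=7)


def is_restart_due(started_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when an instance has been running for 7 days or longer.

    Both timestamps must be timezone-aware (UTC is assumed here).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - started_at >= MAX_LIFETIME
```

Passing `now` explicitly makes the check easy to verify with fixed dates; in everyday use it can be omitted.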
Restarting CDS Services via TeamCity
Before restarting services via TeamCity, a change must be made to the following repository in Bitbucket: https://bitbucket.org/rbdigital/rb.digital.services.compound/src/develop/
A small update, for example to the logger in the following file, is enough to make TeamCity pull the entire project. This is needed because, without a new change, every restart reuses the same application build as before.
Without this crucial step, restarting services will cause errors during the build.
Once this PR has been approved and merged, restarting the services with TeamCity is simple. A drop-down menu will appear in the "Projects" area, which is marked in red in the image below.
The drop-down menu displays every project you have access to. Navigating to “Consumer Data Store” displays all CDS services.
Each service is listed below, with a drop-down menu that shows the service in detail.
Clicking a service will take you to a new page that lists all working environments (Regression, Production, Feature).
When you click the run options (three dots next to the run button), you'll be taken to a detailed panel where you may configure each instance before restarting the service.
The image shows the detailed panel view. The only requirement is to select an appropriate agent to run the build.
Selecting one of the EC2 agents, as indicated by the arrows, is enough; clicking run build should then start the procedure.
The regression environment should be the first to run; it runs the tests (~272). All tests must pass for the build process to continue to the production environment. If an error occurs during the regression build, simply retry with a different EC2 agent; this should fix the issue.
It is important to remember that the services need to be restarted in sequence, either one at a time or two at a time, alternating between the EC2 instances. The order is:
Integration -> 1
Datarow -> 2
JsonValidation -> 3
DomainSchema -> 4
Bucket -> 5
Audit -> 6
Web -> 7
Once each service is done, check Kubernetes to make sure each service works correctly.
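The restart sequence listed above can be written down as a simple plan that groups the services into batches and assigns an agent to each. This is a sketch of one reading of "two at a time, alternating between EC2 instances": the function name plan_restarts and the agent names are hypothetical, and the plan only describes the order; it does not trigger anything in TeamCity.

```python
from typing import List, Tuple

# Restart order as listed in this guide.
RESTART_ORDER = [
    "Integration", "Datarow", "JsonValidation",
    "DomainSchema", "Bucket", "Audit", "Web",
]


def plan_restarts(agents: List[str], batch_size: int = 1) -> List[List[Tuple[str, str]]]:
    """Group services into ordered batches and assign agents round-robin."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not agents:
        raise ValueError("at least one agent is required")
    plan = []
    for start in range(0, len(RESTART_ORDER), batch_size):
        batch = RESTART_ORDER[start:start + batch_size]
        # Alternate agents so services in the same batch use different ones.
        plan.append([
            (service, agents[(start + offset) % len(agents)])
            for offset, service in enumerate(batch)
        ])
    return plan


# Hypothetical agent names; use the EC2 agents shown in TeamCity.
two_at_a_time = plan_restarts(["ec2-agent-a", "ec2-agent-b"], batch_size=2)
```

Each batch should finish, and be verified in Kubernetes, before the next one is started.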