Log Management - My Practical Experience Notes

All the truth is in your logs.

My work involves operating and maintaining two internal financial systems at two companies. Both systems serve internal employees, at companies with more than 10,000 employees. The log volume is smaller than that of a typical Internet application, but each system runs on multiple servers and instances, so searching the logs to analyze a problem can be a headache. Here are a few lessons I have summarized for future log management work.

TL;DR

Use a log aggregation tool to aggregate logs from multiple nodes.
Use Logrotate.
Give every log entry a level: warn, info, debug.
Clearly record interactions with external systems.
Use TraceID to trace events.

Use a log aggregation tool

A black box, also known as a flight recorder, is an electronic recording device used on aircraft. It combines a flight data recorder and a cockpit voice recorder, with sensors connected to the aircraft's mechanical parts and electronic instruments. It records technical parameters and cockpit sounds for the half hour before the aircraft stops working or crashes, and the recording can be played back later for flight experiments and accident analysis.

Once the number of users reaches a certain level and the business system becomes critical, the application is generally run as multiple nodes on different servers for high availability. The logs then end up scattered across those servers, and finding a particular entry means searching each server in turn, which is tedious. A log aggregation tool solves this problem.
I currently use Graylog to aggregate logs. Graylog runs as a server, and applications send logs to it through its API, either via syslog or via graylog-collector-sidecar, a program provided by Graylog that reads log files. This way, existing business applications do not need any changes.
[Figure: Basic architecture of Graylog]

With Elasticsearch as the storage and search backend, log retrieval becomes convenient, and searching the logs is no longer a chore.
Besides Graylog, the ELK stack (Elasticsearch, Logstash, Kibana) is another option.
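
To illustrate the syslog route, here is a minimal Python sketch that forwards application logs to an aggregation server with the standard library's SysLogHandler. It is only an illustration under assumptions: the function name, the node label and any host or port you pass in are placeholders of mine, and it presumes a syslog UDP input is listening on the aggregation side at that address.

```python
import logging
import logging.handlers


def build_syslog_logger(name: str, host: str, port: int, node: str = "node-1") -> logging.Logger:
    """Return a logger that forwards its records to a syslog UDP endpoint.

    `node` is a hypothetical label for the server the application runs on.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not echo the records through the root logger

    # With a (host, port) tuple, SysLogHandler sends each record over UDP.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Put the node label in every message so entries from different servers
    # stay distinguishable after they are aggregated.
    handler.setFormatter(
        logging.Formatter(f"{name}[{node}]: %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

A call such as build_syslog_logger("finance-app", "graylog.example.internal", 1514) would then send every logger.info(...) line to that (made-up) address, while the application keeps using the ordinary logging API.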

Use Logrotate

Here, Logrotate refers both to the logrotate tool on Linux and to the practice of log rotation in general. Log rolling is easy to implement in the application itself: whenever the log file reaches a specified size, rename it and start a new one. Keep only a fixed number of log files on the application server and delete older ones automatically. On a Linux server, you can also use the logrotate command to compress rotated log files automatically.
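
As a rough sketch of in-application log rolling, the Python standard library's RotatingFileHandler renames the file once it would exceed a size limit and keeps only a fixed number of old files. The function name and the parameter values are assumptions chosen for illustration, not settings from my systems.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path: str, max_bytes: int, backups: int) -> logging.Logger:
    """Return a logger that writes to `path` and rolls the file over by size."""
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False

    # When the next record would push the file to max_bytes, the handler
    # renames app.log -> app.log.1 (shifting older files up) and starts a new
    # file. Files beyond `backups` are deleted.
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

For example, build_rotating_logger("app.log", max_bytes=10_000_000, backups=5) would keep at most six files on disk: app.log plus app.log.1 through app.log.5.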

Use log levels

When writing logs, give each event a level that reflects its importance. Once the logs are collected in Graylog, you can distinguish events by level and send email notifications for the ones that need attention. Levels also make it faster to locate entries during a search.
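
The idea of treating levels differently can be sketched in Python with a handler that only reacts to WARNING and above. AlertCollector is a hypothetical stand-in for a notification channel such as email; it simply stores the messages, and it is not how Graylog itself implements alerts.

```python
import logging


class AlertCollector(logging.Handler):
    """Hypothetical alert channel: keeps WARNING-and-above messages in a list."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)  # records below WARNING are skipped
        self.alerts: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.alerts.append(f"{record.levelname}: {record.getMessage()}")


def build_leveled_logger(name: str) -> tuple[logging.Logger, AlertCollector]:
    """Return a logger that accepts every level, plus its alert collector."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # the logger itself lets all levels through
    logger.propagate = False
    collector = AlertCollector()
    logger.addHandler(collector)
    return logger, collector
```

With this setup, debug and info calls are accepted by the logger but never reach the collector, while warning and error calls do.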

Record interactions with external systems

If the system interacts with external systems, each interface should log those interactions. It is important to record the content of the requests received and the content of the responses. This leaves a trail for later tracking and troubleshooting.
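
One possible way to make this systematic is a small wrapper around every outbound call, sketched below in Python. The names call_external, system and send are mine and purely illustrative; send stands for whatever function actually talks to the other system.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("external")


def call_external(system: str, send: Callable[[dict], dict], request: dict) -> dict:
    """Call send(request) and log the request, the response and the elapsed time."""
    logger.info("-> %s request=%s", system, json.dumps(request, sort_keys=True))
    started = time.monotonic()
    try:
        response = send(request)
    except Exception:
        # Log the failure with its traceback, then let the caller handle it.
        logger.exception("<- %s failed after %.3fs", system, time.monotonic() - started)
        raise
    logger.info(
        "<- %s response=%s elapsed=%.3fs",
        system,
        json.dumps(response, sort_keys=True),
        time.monotonic() - started,
    )
    return response
```

Because every call goes through the same function, the request and response entries always appear as a pair in the logs, and a failed call leaves a record too.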

Create a trace ID for maintenance

With the popularity of microservices architecture, a single event may be processed by several components of the system. So when an event occurs, create a unique trace ID and include it in every log entry related to that event. Once the logs are collected, the trace ID can be used to find how each component handled the event.
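
Within a single Python component, attaching the trace ID to every log line might look like the sketch below, which uses a context variable and a logging filter. All names here (start_trace, TraceIdFilter, build_traced_logger) are illustrative assumptions, and passing the ID on to other components, for example in a request header, is left out.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the trace ID of the event currently being processed; "-" means none.
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so the formatter can use it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _trace_id.get()
        return True


def start_trace(trace_id: Optional[str] = None) -> str:
    """Begin a trace: reuse an incoming ID if given, otherwise create a new one."""
    value = trace_id or uuid.uuid4().hex
    _trace_id.set(value)
    return value


def build_traced_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Return a logger whose output lines all carry trace=<id>."""
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("trace=%(trace_id)s %(name)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    return logger
```

After start_trace() is called at the entry point of an event, every line written through the logger carries the same trace= value, which becomes the search key in the aggregation tool.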

lizhimiao