<aside> 📌 I figured that many people would submit hardware analyses, and adding yet another one to the crowd would probably not be very useful. So I did some log analysis instead, specifically on the sampling-header time logs and their error rates. The data covers one day of logs.

</aside>

Source Code used for this analysis: https://github.com/aditya-manit/celestia-log-analysis

Introduction:

Analyzing logs is crucial for understanding the behavior and performance of any node - be it a validator node or a data-availability node. Log data provides detailed insight into system performance metrics, error occurrences, and other significant events. However, making sense of this raw data can be a complex task. That's where data visualization steps in, transforming raw log data into meaningful, digestible insights. In this post, we walk through our log analysis and explain how we used Python to visualize the data, uncovering key patterns and correlations.

Agenda:

The main objective of our analysis is to investigate two primary aspects of the node's performance - the time taken for sampling headers and the occurrence of errors. We want to understand how these metrics behave over time and whether there is any correlation between them. To do this, we visualize the data through different types of plots, each providing a unique perspective on the data.

Before we delve into the visualizations, let's understand how we extracted the necessary data from the log files:

Data Extraction from Log Files:

We started with raw log files, each line containing a wealth of information. The log line we were particularly interested in looked like this:

2023-05-13T00:00:12.575Z [INF] [headerfs] finished sampling headers, "from": 472391, "to": 472391, "errors": 0, "time_taken": 0.31604745

This line indicates the timestamp, the header range sampled ("from" and "to"), the number of errors that occurred during the operation, and the time taken for the operation.

We used Python's built-in re module to extract these fields with regular expressions, and stored the result in a structured CSV file ready for analysis (a sketch of this extraction step follows the sample below). Here's how the data looked post-transformation:

timestamp,from_height,to_height,errors,time_taken
2023-05-13T00:00:12.575Z,472391,472391,0,0.31604745
2023-05-13T00:00:22.960Z,280502,280601,0,4200.943984111
...
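For illustration, here is a minimal sketch of that extraction step, assuming log lines in the format shown above. The regular expression and the file names (node.log, sampling_times.csv) are our own illustrative choices, not taken from the repository.

```python
import csv
import re

# Pattern matching the "finished sampling headers" lines shown above.
LINE_RE = re.compile(
    r'(?P<timestamp>\S+) \[INF\] \[headerfs\] finished sampling headers, '
    r'"from": (?P<from_height>\d+), "to": (?P<to_height>\d+), '
    r'"errors": (?P<errors>\d+), "time_taken": (?P<time_taken>[\d.]+)'
)

with open("node.log") as logs, open("sampling_times.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "from_height", "to_height", "errors", "time_taken"])
    for line in logs:
        match = LINE_RE.search(line)
        if match:
            writer.writerow([
                match["timestamp"], match["from_height"], match["to_height"],
                match["errors"], match["time_taken"],
            ])
```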

Now, let's jump into the analysis and visualizations:

  1. Line Plot of Time Taken for Sampling Headers Over Time:

    The first visualization represents how the time taken for sampling headers changes over time. We plotted a line graph where the x-axis represents timestamps and the y-axis signifies the time taken for sampling headers. This visualization helps identify trends and spot potential irregularities in the system performance over time.

time-take-sampled-headers.png
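A minimal sketch of how such a line plot can be produced with pandas and matplotlib, reading the CSV generated earlier (file name and figure styling are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the extracted data and parse timestamps as datetimes.
df = pd.read_csv("sampling_times.csv", parse_dates=["timestamp"])

plt.figure(figsize=(12, 5))
plt.plot(df["timestamp"], df["time_taken"], linewidth=0.8)
plt.xlabel("Timestamp")
plt.ylabel("Time taken (s)")
plt.title("Time taken for sampling headers over time")
plt.tight_layout()
plt.savefig("time-take-sampled-headers.png")
```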

  2. Line Plot of Time Taken for Sampling Headers Over Time (Smoothed, Outliers Removed):

    While the first plot provides a general overview, it can be influenced by outliers and abrupt changes. Therefore, we created a smoothed version of the graph, minimizing the impact of noise and outliers to highlight the underlying trend better. We utilized the Savitzky-Golay filter from the SciPy library for smoothing the graph.

smooth.png
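A sketch of the smoothing step using scipy.signal.savgol_filter. The outlier cutoff, window length, and polynomial order below are illustrative choices, not necessarily the exact parameters used in the original analysis:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

df = pd.read_csv("sampling_times.csv", parse_dates=["timestamp"])

# Drop extreme outliers (here: above the 99th percentile) before smoothing.
cutoff = df["time_taken"].quantile(0.99)
clean = df[df["time_taken"] <= cutoff].reset_index(drop=True)

# Savitzky-Golay filter: odd window length, low-order polynomial.
smoothed = savgol_filter(clean["time_taken"], window_length=51, polyorder=3)

plt.figure(figsize=(12, 5))
plt.plot(clean["timestamp"], smoothed, color="tab:orange")
plt.xlabel("Timestamp")
plt.ylabel("Time taken (s, smoothed)")
plt.title("Smoothed time taken for sampling headers (outliers removed)")
plt.tight_layout()
plt.savefig("smooth.png")
```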

  3. Histogram of Errors:

    To understand the distribution of errors in our system, we plotted a histogram. This plot helps identify the most frequent number of errors and any unusual occurrences, providing insights into the error handling of the node.

issue-histogram.png

  4. Histogram of Time Taken:

    Similar to the error histogram, this visualization represents the distribution of time taken for operations. This plot aids in understanding the most common time durations and spotting any outliers, thus providing insights into the node's efficiency.

time-taken-histogram.png
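Both histograms above can be generated in the same way from the extracted CSV. Here is a minimal sketch that plots them side by side (bin counts and the output file name are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sampling_times.csv")

fig, (ax_err, ax_time) = plt.subplots(1, 2, figsize=(12, 5))

# Distribution of errors per sampling run.
ax_err.hist(df["errors"], bins=20)
ax_err.set_xlabel("Errors per sampling run")
ax_err.set_ylabel("Frequency")
ax_err.set_title("Distribution of errors")

# Distribution of time taken per sampling run.
ax_time.hist(df["time_taken"], bins=50)
ax_time.set_xlabel("Time taken (s)")
ax_time.set_ylabel("Frequency")
ax_time.set_title("Distribution of time taken")

fig.tight_layout()
fig.savefig("histograms.png")
```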