Shardeum Milestone: Restoring the Dynamically Sharded Network

Shardeum Milestone: Restoring the Dynamically Sharded Network

·

13 min read

On a crisp Monday morning, the digital cosmos witnessed a momentous event, reminiscent of a celestial ballet where precision meets audacity. Like a masterful symphony conductor, Shardeum orchestrated a performance that left spectators in awe, drawing parallels to the historical return of a spacecraft to its launch pad after a flawless maiden voyage.

In this captivating narrative, Shardeum stood at the precipice of a challenge, daring to tread where others had faltered. Yet, in a tale that defies conventional wisdom, it emerged not only unscathed but gloriously triumphant. With a resilience that seemed almost otherworldly, Shardeum etched its name into the annals of distributed ledger technology, becoming the first sharded network to self-restore.

Just as the journey of a spacecraft demands meticulous planning, precision engineering, and flawless execution of intricate maneuvers, Shardeum’s restoration of its Sphinx betanet, following a critical crash, demanded an equivalent level of technological finesse and groundbreaking innovation. The preservation of all data on a network, particularly one operating with dynamic state sharding, stands as a testament to the ingenuity and resilience of Shardeum’s architecture.

As we venture further into this exploration, we not only applaud Shardeum’s monumental achievement but also recognize it as a defining moment in the evolution of web 3 technology. This triumph represents a quantum leap forward, promising to redraw the boundaries of IT network resilience and data integrity, ushering in a new era of digital innovation and security.

Maintaining and restoring a dynamically sharded network like Shardeum involves handling a range of complex technical challenges, making it distinct from traditional blockchain networks such as Bitcoin or Ethereum. Unlike these traditional networks, where every full node holds the complete transaction history, Shardeum’s approach involves dividing the network into smaller partitions called shards, with each node responsible for managing only a subset of the entire network’s data.

In a dynamically state sharded environment with autoscaling, the network must continuously adjust and balance nodes and resources across different shards to optimize performance and scalability. This dynamic nature adds complexity to maintaining data consistency, ensuring network stability, and recovering from failures effectively.

To understand Shardeum’s response to node fluctuations, let’s compare it with Bitcoin. In Bitcoin, even if a minimal number of nodes are active, the network remains functional because each full node has the complete state and transaction history. However, in Shardeum, each active node only holds a portion of the data due to sharding. This lightweight nature of validator nodes presents both engineering opportunities and challenges.

To address the challenge of preserving data integrity when a node goes down, Shardeum employs two main strategies:

  1. Dynamic State Sharding: Shardeum partitions the overall address space according to the number of active nodes. Each node is responsible for managing its assigned partitions along with a certain radius around it and extra adjacent partitions. This approach ensures dynamic adaptability and robust data redundancy within the network. Even if a node fails, the network can continue functioning without data loss.

  2. Archiver Nodes: Shardeum utilizes archiver nodes to store the complete state of the entire network. Active nodes stream their partially stored states to these archivers for collection. This redundancy ensures that even if a node fails, the complete state of the network is preserved.

These design choices and optimizations enable Shardeum to restore the network in a novel way, ensuring features like autoscaling and linear scaling are maintained. Overall, Shardeum’s approach to maintaining and restoring a dynamically sharded network emphasizes adaptability, redundancy, and robustness, setting it apart in the realm of blockchain technology.

Now that we understand the basics of dynamic state sharding and that archiver nodes are somehow involved, let us first unravel some additional components more deeply and explain them. In order to understand the crash and the restoration of the Shardeum betanet, we must first understand a little about the following:

  • Archiver nodes

  • Lost archiver detection

  • Network modes

  • Recovery mode

Understanding the basics of each of these is necessary before we dive head first in the bugs, so let’s take a look!

1. Archiver Nodes: Interstellar Storage
In Shardeum, archiver nodes, also known as archivers, are like the ultimate record keepers of the network. Their main job is to store every single detail of what happens in Shardeum, like a giant library storing all the network’s data, from transactions to receipts. Unlike regular nodes that process transactions, archivers don’t make decisions; they just focus on storing data.

Because archivers are so important for keeping Shardeum running smoothly, the network has a special system to make sure they’re always working properly. If an archiver stops working, Shardeum needs to know right away so it can fix the problem and keep the network safe and reliable.

2. Lost Archiver Detection: Alien Technology Unveiled

In Shardeum, there’s a special way to find out if something’s gone wrong with the archiver nodes, and it’s called the lost archiver detection protocol. Think of it as a super advanced alien technology designed specifically for Shardeum. This protocol is like having a special radar that can detect when one or more archiver nodes stop working properly and are marked as lost.

Now, why is this important? Well, archiver nodes are super important because they store all the historical data of the network. If one of them stops working, it could cause big problems for Shardeum. That’s why it’s crucial to catch any issues with archivers right away, so the network can fix them and make sure everything keeps running smoothly.

Even though a crash might not happen because of a lost archiver, the way the lost archiver detection protocol interacts with a specific network mode could cause problems. So, it’s essential to understand how the network modes work in Shardeum to prevent any issues in the future

3. Network Modes On Shardeum: No NASA Needed

On Shardeum, there’s this cool thing called network modes that keeps everything running smoothly, and you don’t need to be a rocket scientist to understand it! It’s like having built-in plans for different scenarios to make sure the network keeps going, no matter what.

Imagine you’re playing a video game, and you have different levels with different challenges. Each level has its own set of rules and goals. That’s kind of how network modes work on Shardeum. They’re like built-in plans that tell the network what to do in different situations.

For example, let’s say something goes wrong with the network, like a crash or a problem with data. That’s where recovery mode comes in. It’s like activating a superpower to fix the problem and get everything back to normal as quickly as possible.

But there’s more to it than just recovery mode. Shardeum has several different network phases, like Forming, Processing, Safety, Restart, and Shutdown. Each phase is designed to handle different situations and emergencies.

So, while you don’t need to know all the nitty-gritty details of network modes to understand what went wrong, having a basic understanding can help you see how Shardeum keeps going strong, no matter what challenges it faces.

4. Reverse Engineering Recovery Mode: Roswell Revisited

Imagine Shardeum’s recovery mode as a top-secret operation, reminiscent of the mysteries surrounding the Roswell incident. This mode is like a hidden gem among the seven network modes, activated only when the network faces a critical challenge.

So, here’s the scoop: When the number of active nodes in the network drops below a certain threshold, let’s call it the danger zone, recovery mode kicks in. It’s like the network’s emergency alarm going off, signaling that something needs fixing.

During recovery mode, the network puts a temporary pause on processing transactions and syncing data. Instead, it focuses on a clever strategy: bringing in standby nodes to replace the inactive ones. The goal? To get the number of active nodes back to a healthy level, ideally even higher than before.

But here’s the catch: Shardeum doesn’t want to overwhelm itself by adding too many new nodes all at once. That’s why there’s a strict rule in recovery mode: the network can only grow by a maximum of 20% in each cycle, which lasts about 60 seconds. This slow and steady approach might seem like a snail’s pace, but it’s crucial for keeping the network stable and preventing chaos.

You see, each new node that joins the network needs resources and time to get in sync with everyone else. By limiting the growth to 20% per cycle, Shardeum ensures that it can handle the influx of new nodes without breaking a sweat. This cautious approach not only keeps the network running smoothly but also safeguards against any hiccups or errors that could occur during the syncing process.

So, there you have the answer — recovery mode in Shardeum is like a well-orchestrated dance, where each step is carefully planned to keep the network thriving and secure.

5. The Root Cause Of The Crash: Event Horizon

Let’s dive into the heart of the matter, where two elusive bugs lurked in the shadows, waiting to disrupt the tranquility of Shardeum’s betanet. The first culprit to emerge from the darkness was the neon library bug, casting its unpredictable shadow over our validators.

In version Sphinx 1.9.1, we introduced an update to a library utilizing Neon bindings, a cutting-edge technology bridging the worlds of Rust and TypeScript. This integration aimed to bolster the synergy between these languages, promising more seamless communication within Shardeum’s software architecture. However, this seemingly innocent enhancement birthed a glitch, causing nodes to vanish from the network at random intervals.

But the plot thickens with the emergence of a second adversary: the lost archiver detection protocol bug. In the recent incident that plunged the betanet into chaos, this bug played a leading role in the unfolding drama. Picture this, the lost archiver detection mechanism and the network recovery mode protocol, two stalwart defenders of Shardeum’s stability, colliding in an unforeseen cataclysm.

As fate would have it, the simultaneous activation of these mechanisms unleashed havoc upon the network. The lost archiver process, triggered alongside the network recovery mode, encountered a critical flaw — a bug refusing to acknowledge an empty active node list. This fatal flaw, previously untested and unforeseen, proved to be the catalyst for the network’s downfall, culminating in a catastrophic crash of the betanet.

In this tale of technological turbulence, the convergence of two seemingly unrelated subsystems created an event horizon, pulling Shardeum into the depths of chaos. But fear not, for with each challenge comes an opportunity for growth and resilience. As we unravel the mysteries of these bugs, Shardeum stands poised to emerge from the shadows, stronger and more vigilant than before.

The timeline of events surrounding the network crash and its resolution is as follows:

  1. Initial Vulnerability and Upgrade : The journey began with a vulnerability discovered during the 1.9.1 linting process within the neon npm library. An upgrade was swiftly implemented to address this issue. However, this seemingly innocuous upgrade inadvertently introduced an exception that remained hidden during local testing.

  2. Intermittent Library Exception Leading to Validator Shutdowns : The neon library began exhibiting sporadic exceptions, causing periodic shutdowns of network validators. Despite the network’s design resilience to backfill these validators, the timing of simultaneous failures among multiple validators triggered the network’s recovery mode.

  3. Triggering of Network Recovery Mode : As the network entered recovery mode, the protocol mandated the clearing and regeneration of the active node list. However, a concurrent bug in the lost archiver system, unable to accommodate an empty validator list, served as the primary catalyst for the network’s crash.

  4. Resolution and Network Restoration : With determination and ingenuity, the crash was rectified, and the network was meticulously restored using data stored on archivers. This historic achievement marked the first successful restoration of a crashed sharded layer 1 network, preserving all data intact — a milestone akin to the triumphant landing of a rocket in terms of network recovery.

  5. Completed Fixes : Following the restoration, a preliminary fix was swiftly implemented to address the library issue. Subsequently, version 1.9.5 was rolled out, introducing a crucial bugfix targeting another instance of the neon binding crash. This update allowed users on version 1.9.4 to either remain on their current version or opt for the upgrade to 1.9.5, based on their assessment of network performance and stability. Eventually, it was decided to mandate version 1.9.5 for validators to enhance network stability and address persistent neon binding issues.

Armed with this chronicle of events, we can now delve into the intricate workings behind the swift restoration of the network, a testament to the resilience and innovation within the Shardeum ecosystem.

At the forefront of Shardeum’s triumphant restoration was the deployment of it’s recovery mode — an essential component in the network’s arsenal. As described earlier, recovery mode springs into action when the network’s active node count drops below a critical threshold, facilitating a swift, controlled, and secure expansion to restore normalcy. It’s worth emphasizing that without the ingenious design and development of the network modes, Shardeum wouldn’t have bounced back from the crash with such ease, showcasing its innovative capabilities.

Moreover, Shardeum’s tech team played a pivotal role, springing into action with urgency and precision. Their initial response involved a meticulous analysis to pinpoint the root cause of the crash, ultimately tracing it to an anomaly in the interaction between the network’s lost archiver detection and recovery mode systems. Recognizing the complexity of the challenge, the team swiftly devised a comprehensive strategy to tackle both the immediate fallout and the underlying vulnerabilities head-on.

In response to the network crash, Shardeum’s tech team executed a multifaceted approach to swiftly address the issue. Technically, they first isolated the affected components to prevent any further degradation of the network’s performance. Concurrently, they deployed a patch to fix the bug in the lost archiver system, ensuring it could handle an empty validator list — a critical oversight that had contributed to the network’s failure. To fully restore the network’s operational capacity, the team activated and utilized the data preserved on archivers to reconstruct the network’s state prior to the crash, ensuring no loss of data occurred in the process.

Logistically, the team demonstrated seamless coordination across different time zones and disciplines. Leveraging cloud-based tools for real-time collaboration and monitoring, they ensured effective communication and synchronization of efforts. This coordinated approach facilitated the rapid development and deployment of fixes while ensuring all team members were aligned on the recovery process and next steps.

This incident served as a rigorous test of Shardeum’s crash management protocols, highlighting the importance of agile and innovative responses to unforeseen challenges. It underscored the team’s commitment to maintaining a resilient and secure network, prepared to address complex technical hurdles as they arise.

In conclusion, the successful restoration of Shardeum’s sharded network represents a significant advancement in network technology, signifying a milestone with profound implications for the industry. While currently underappreciated, innovations like network modes are poised to set new standards across the web 3 landscape.

I’ve long believed that Shardeum’s core innovations will shape future technological developments, inspiring new generations of ledger technologies. Witnessing Shardeum’s groundbreaking network restoration firsthand, I’m confident it will prompt a reassessment of industry standards, potentially leading to the adoption of more robust protocols and methodologies in network design and architecture.

This event not only highlights the technical prowess and innovation of Shardeum’s team but also heralds an era where decentralized networks are more resilient, adaptable, and equipped to handle unforeseen challenges through disaster recovery planning. Ultimately, Shardeum’s technology will usher in a new era of decentralization, propelling the industry towards greater innovation and efficiency.

In conclusion, this blog has shed light on the intricate journey of Shardeum’s network crash and subsequent restoration, revealing the resilience and innovation embedded within its technological framework. From the discovery of vulnerabilities to the deployment of multifaceted responses by the tech team, each step in the process exemplifies Shardeum’s commitment to maintaining a secure and robust network.

The restoration of Shardeum’s sharded network not only marks a significant milestone in network technology but also paves the way for future innovations in the industry. Through innovations like network modes and agile responses to unforeseen challenges, Shardeum has demonstrated its potential to influence the development of decentralized networks and set new industry standards.

Official Website -> https://shardeum.org/

Twitter -> https://twitter.com/shardeum

Telegram -> https://t.me/shardeum

Linkdein -> https://linkedin.com/company/shardeum

Discord -> https://discord.gg/shardeum