<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html" version="2.0">
  <!-- Source: https://medium.com/feed/@gayan.dissanayake -->
  <channel>
    <title><![CDATA[Stories by Gayan Dissanayake on Medium]]></title>
    <description><![CDATA[Stories by Gayan Dissanayake on Medium]]></description>
    <link>https://siftrss.com/f/9gnDlBdyw5</link>
    <image>
      <url>https://cdn-images-1.medium.com/fit/c/150/150/0*2DrhR4aq5368TthM</url>
      <title>Stories by Gayan Dissanayake on Medium</title>
      <link>https://medium.com/@gayan.dissanayake?source=rss-5016ec278afa------2</link>
    </image>
    <generator>Medium</generator>
    <lastBuildDate>Sat, 18 Apr 2026 01:00:45 GMT</lastBuildDate>
    <atom:link href="https://siftrss.com/f/9gnDlBdyw5" rel="self" type="application/rss+xml"/>
    <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
    <atom:link href="http://medium.superfeedr.com" rel="hub"/>
    <item>
      <title><![CDATA[Deploying MySQL Router on OpenShift with a 3-Node MySQL InnoDB Cluster on OpenShift Virtualization]]></title>
      <link>https://medium.com/@gayan.dissanayake/deploying-mysql-router-on-openshift-with-a-3-node-mysql-innodb-cluster-on-openshift-virtualization-52ee626a8b74?source=rss-5016ec278afa------2</link>
      <guid isPermaLink="false">https://medium.com/p/52ee626a8b74</guid>
      <dc:creator><![CDATA[Gayan Dissanayake]]></dc:creator>
      <pubDate>Fri, 06 Mar 2026 08:10:50 GMT</pubDate>
      <atom:updated>2026-03-06T08:10:50.690Z</atom:updated>
      <content:encoded><![CDATA[<h3>Bridging containerized workloads with VM-hosted databases using OVN-Kubernetes and MySQL Router</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*snFUgDOA4K61idAI4sz_-w.png" /></figure><h3>Introduction</h3><p>Running stateful database workloads on Kubernetes has always been a challenge. Many organizations running OpenShift choose to host their MySQL databases inside virtual machines managed by <strong>OpenShift Virtualization</strong> (formerly KubeVirt) — benefiting from the operational familiarity of VMs while staying within the Kubernetes ecosystem.</p><p>However, your application pods still need a reliable, topology-aware way to connect to that MySQL cluster. That’s where <strong>MySQL Router</strong> comes in.</p><p>In this article, we’ll walk through how to deploy MySQL Router as a Kubernetes Deployment on OpenShift, connecting it to a 3-node MySQL InnoDB Cluster (Group Replication) running on OpenShift Virtualization VMs. We&#39;ll also cover how to use a NetworkAttachmentDefinition with OVN-Kubernetes to give MySQL Router a dedicated L2 overlay network interface for reaching the database VMs — and how to validate everything is working correctly.</p><h3>Architecture Overview</h3><pre>┌─────────────────────────────────────────────────────┐<br>│              OpenShift Cluster                      │<br>│                                                     │<br>│  ┌──────────────────────┐                           │<br>│  │   App Pods           │                           │<br>│  │  (connects via SVC)  │                           │<br>│  └──────────┬───────────┘                           │<br>│             │ ClusterIP Service                     │<br>│             │ :6446 (RW) / :6447 (RO)               │<br>│  ┌──────────▼───────────┐                           │<br>│  │   MySQL Router Pod   │◄── L2 Overlay Network     │<br>│  │   (Deployment)       │    (OVN-Kubernetes)       │<br>│  
└──────────┬───────────┘                           │<br>│             │                                       │<br>│  ┌──────────▼────────────────────────────────────┐  │<br>│  │         OpenShift Virtualization              │  │<br>│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐       │  │<br>│  │  │ dbnodea  │ │ dbnodeb  │ │ dbnodec  │       │  │<br>│  │  │ PRIMARY  │ │SECONDARY │ │SECONDARY │       │  │<br>│  │  │ :3306    │ │ :3306    │ │ :3306    │       │  │<br>│  │  └──────────┘ └──────────┘ └──────────┘       │  │<br>│  └───────────────────────────────────────────────┘  │<br>└─────────────────────────────────────────────────────┘</pre><p>The key components are:</p><ul><li><strong>MySQL InnoDB Cluster</strong> — a 3-node Group Replication cluster running inside OpenShift Virtualization VMs</li><li><strong>MySQL Router</strong> — deployed as a Kubernetes Deployment with an init container that bootstraps the router configuration automatically</li><li><strong>OVN-Kubernetes L2 Overlay</strong> — a NetworkAttachmentDefinition that gives the router pod a secondary network interface on the same L2 segment as the database VMs</li><li><strong>Kubernetes Service</strong> — a ClusterIP service that exposes the router&#39;s RW/RO ports to application pods</li></ul><h3>Prerequisites</h3><p>Before diving in, make sure you have:</p><ul><li>An OpenShift cluster with the <strong>OpenShift Virtualization</strong> operator installed</li><li>A working <strong>3-node MySQL InnoDB Cluster</strong> (Group Replication) running on VMs managed by OpenShift Virtualization</li><li>The <strong>Multus CNI</strong> plugin enabled (it ships with OpenShift by default)</li><li>A NetworkAttachmentDefinition configured for L2 overlay networking (covered below)</li><li>A mysql-router-secret Kubernetes Secret containing the MySQL Router bootstrap password</li></ul><h3>Step 1: Setting Up the L2 Overlay Network</h3><p>MySQL Router needs to reach the database VMs on their internal IPs. 
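Once the overlay is attached, a quick way to confirm a pod can actually reach a VM is a bash-only probe (the same /dev/tcp check the init container in Step 3 performs before bootstrapping); the IP here is an example address from the overlay subnet used below:

```shell
# Minimal reachability probe; needs only bash and coreutils' timeout.
# 192.168.100.10 is an example DB VM IP, replace with your own.
probe() {
  timeout 5 bash -c "cat < /dev/null > /dev/tcp/${1}/${2}" 2>/dev/null
}

probe 192.168.100.10 3306 && echo "PORT REACHABLE" || echo "PORT NOT REACHABLE"
```

If the probe fails, fix the NAD or VM networking before touching MySQL Router at all.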
The cleanest way to achieve this on OpenShift with OVN-Kubernetes is to create a <strong>Layer 2 overlay network</strong> using a NetworkAttachmentDefinition.</p><pre>apiVersion: k8s.cni.cncf.io/v1<br>kind: NetworkAttachmentDefinition<br>metadata:<br>  name: abc-production-l2-overlay-nad<br>  namespace: abc-system-production<br>spec:<br>  config: |-<br>    {<br>        &quot;cniVersion&quot;: &quot;0.3.1&quot;,<br>        &quot;name&quot;: &quot;abc-production-l2-overlay-nad&quot;,<br>        &quot;type&quot;: &quot;ovn-k8s-cni-overlay&quot;,<br>        &quot;netAttachDefName&quot;: &quot;abc-system-production/abc-production-l2-overlay-nad&quot;,<br>        &quot;topology&quot;: &quot;layer2&quot;,<br>        &quot;subnets&quot;: &quot;192.168.100.0/24&quot;,<br>        &quot;excludeSubnets&quot;: &quot;192.168.100.10/32,192.168.100.11/32,192.168.100.12/32&quot;<br>    }</pre><p>The excludeSubnets field is important — it prevents OVN from auto-assigning IPs that are already in use by your MySQL VM nodes. With this configuration in place, you no longer need to hard-code a specific IP in the pod annotation. OVN will automatically assign a free IP from the subnet. The pod annotation becomes simply:</p><pre>annotations:<br>  k8s.v1.cni.cncf.io/networks: |<br>    [<br>      {<br>        &quot;name&quot;: &quot;abc-production-l2-overlay-nad&quot;,<br>        &quot;namespace&quot;: &quot;abc-system-production&quot;<br>      }<br>    ]</pre><p>This is much cleaner than pinning a specific IP, which would cause pod scheduling conflicts if the pod is rescheduled.</p><h3>Step 2: Creating the Secret</h3><p>MySQL Router needs credentials to bootstrap against the MySQL cluster. 
Store the password in a Kubernetes Secret:</p><pre>apiVersion: v1<br>kind: Secret<br>metadata:<br>  name: mysql-router-secret<br>  namespace: abc-system-production<br>type: Opaque<br>data:<br>  router-password: &lt;base64-encoded-password&gt;</pre><p>Generate the base64 value with:</p><pre>echo -n &#39;your_password_here&#39; | base64</pre><h3>Step 3: Deploying MySQL Router</h3><p>The Deployment has two key parts: an <strong>init container</strong> that bootstraps the router configuration, and a <strong>main container</strong> that runs the router itself. Both share an emptyDir volume mounted at /etc/mysqlrouter.</p><h4>Why an Init Container?</h4><p>MySQL Router’s bootstrap process (mysqlrouter --bootstrap) connects to the cluster, queries the metadata, and writes a complete mysqlrouter.conf including all member IPs. This cannot be done at image build time because the cluster topology is dynamic. The init container solves this elegantly — it runs once, writes the config to the shared volume, and exits. The main container then starts the router using that generated config.</p><h4>URL-Encoding the Password</h4><p>A subtle but important detail: MySQL Router’s bootstrap command embeds credentials in a connection URI. If your password contains special characters (like @, #, or /), the connection string will break unless the password is URL-encoded. 
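To make the behavior concrete, here is that encoder as a standalone function (the same logic the init container below runs):

```shell
# Pure-bash percent-encoder: unreserved characters pass through,
# everything else becomes %XX (uppercase hex).
urlencode() {
  local string="${1}"
  local encoded=""
  local pos c o
  for (( pos=0; pos<${#string}; pos++ )); do
    c=${string:$pos:1}
    case "$c" in
      [-_.~a-zA-Z0-9]) o="${c}" ;;
      *) printf -v o '%%%02X' "'$c" ;;
    esac
    encoded+="${o}"
  done
  echo "${encoded}"
}

urlencode 'p@ss/word#1'   # prints p%40ss%2Fword%231
```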
The init container script handles this with a pure-bash urlencode() function before passing the credentials to mysqlrouter --bootstrap.</p><p>Here is the complete Deployment:</p><pre>apiVersion: apps/v1<br>kind: Deployment<br>metadata:<br>  name: mysqlrouter-dep<br>  namespace: abc-system-production<br>  labels:<br>    app: mysqlrouter<br>spec:<br>  replicas: 1<br>  selector:<br>    matchLabels:<br>      app: mysqlrouter<br>  strategy:<br>    type: RollingUpdate<br>    rollingUpdate:<br>      maxUnavailable: 25%<br>      maxSurge: 25%<br>  template:<br>    metadata:<br>      labels:<br>        app: mysqlrouter<br>      annotations:<br>        k8s.v1.cni.cncf.io/networks: |<br>          [<br>            {<br>              &quot;name&quot;: &quot;abc-production-l2-overlay-nad&quot;,<br>              &quot;namespace&quot;: &quot;abc-system-production&quot;<br>            }<br>          ]<br>    spec:<br>      restartPolicy: Always<br>      terminationGracePeriodSeconds: 30<br>      securityContext:<br>        seLinuxOptions:<br>          level: &#39;s0:c37,c14&#39;<br>        fsGroup: 1001360000<br>        seccompProfile:<br>          type: RuntimeDefault<br>      tolerations:<br>        - key: node.kubernetes.io/not-ready<br>          operator: Exists<br>          effect: NoExecute<br>          tolerationSeconds: 300<br>        - key: node.kubernetes.io/unreachable<br>          operator: Exists<br>          effect: NoExecute<br>          tolerationSeconds: 300<br>      volumes:<br>        - name: mysqlrouter-config<br>          emptyDir: {}<br>      initContainers:<br>        - name: mysqlrouter-bootstrap<br>          image: mysql/mysql-router:8.0<br>          imagePullPolicy: IfNotPresent<br>          command:<br>            - /bin/bash<br>            - &#39;-c&#39;<br>            - |<br>              set -e<br>              urlencode() {<br>                local string=&quot;${1}&quot;<br>                local encoded=&quot;&quot;<br>                
local pos c o<br>                for (( pos=0; pos&lt;${#string}; pos++ )); do<br>                  c=${string:$pos:1}<br>                  case &quot;$c&quot; in<br>                    [-_.~a-zA-Z0-9]) o=&quot;${c}&quot; ;;<br>                    *) printf -v o &#39;%%%02X&#39; &quot;&#39;$c&quot; ;;<br>                  esac<br>                  encoded+=&quot;${o}&quot;<br>                done<br>                echo &quot;${encoded}&quot;<br>              }<br>              echo &quot;=== Port check ===&quot;<br>              timeout 5 bash -c &quot;cat &lt; /dev/null &gt; /dev/tcp/${MYSQL_HOST}/${MYSQL_PORT}&quot; \<br>                &amp;&amp; echo &quot;PORT REACHABLE&quot; \<br>                || { echo &quot;PORT NOT REACHABLE - aborting&quot;; exit 1; }<br>              echo &quot;=== Starting bootstrap ===&quot;<br>              ENCODED_PASSWORD=$(urlencode &quot;${MYSQL_ROUTER_PASSWORD}&quot;)<br>              mysqlrouter \<br>                --bootstrap ${MYSQL_USER}:${ENCODED_PASSWORD}@${MYSQL_HOST}:${MYSQL_PORT} \<br>                --directory /etc/mysqlrouter \<br>                --conf-bind-address 0.0.0.0 \<br>                --force \<br>                2&gt;&amp;1<br>              echo &quot;=== Bootstrap completed ===&quot;<br>          env:<br>            - name: MYSQL_HOST<br>              value: &quot;&lt;mysql-primary-node-ip&gt;&quot;<br>            - name: MYSQL_PORT<br>              value: &quot;3306&quot;<br>            - name: MYSQL_USER<br>              value: root<br>            - name: MYSQL_ROUTER_PASSWORD<br>              valueFrom:<br>                secretKeyRef:<br>                  name: mysql-router-secret<br>                  key: router-password<br>          securityContext:<br>            capabilities:<br>              drop: [ALL]<br>            runAsUser: 1001360000<br>            runAsNonRoot: true<br>            allowPrivilegeEscalation: false<br>          resources:<br>            limits:<br>              cpu: 
500m<br>              memory: 512Mi<br>            requests:<br>              cpu: 100m<br>              memory: 128Mi<br>          volumeMounts:<br>            - name: mysqlrouter-config<br>              mountPath: /etc/mysqlrouter<br>      containers:<br>        - name: mysqlrouter<br>          image: mysql/mysql-router:8.0<br>          imagePullPolicy: IfNotPresent<br>          command:<br>            - /bin/bash<br>            - &#39;-c&#39;<br>            - |<br>              set -e<br>              exec mysqlrouter -c /etc/mysqlrouter/mysqlrouter.conf<br>          ports:<br>            - name: rw-classic<br>              containerPort: 6446<br>              protocol: TCP<br>            - name: ro-classic<br>              containerPort: 6447<br>              protocol: TCP<br>            - name: rw-x<br>              containerPort: 6448<br>              protocol: TCP<br>            - name: ro-x<br>              containerPort: 6449<br>              protocol: TCP<br>            - name: http<br>              containerPort: 8443<br>              protocol: TCP<br>          livenessProbe:<br>            tcpSocket:<br>              port: 6446<br>            initialDelaySeconds: 10<br>            timeoutSeconds: 1<br>            periodSeconds: 15<br>            failureThreshold: 3<br>          readinessProbe:<br>            tcpSocket:<br>              port: 6446<br>            initialDelaySeconds: 5<br>            timeoutSeconds: 1<br>            periodSeconds: 10<br>            failureThreshold: 3<br>          securityContext:<br>            capabilities:<br>              drop: [ALL]<br>            runAsUser: 1001360000<br>            runAsNonRoot: true<br>            allowPrivilegeEscalation: false<br>          resources:<br>            limits:<br>              cpu: 500m<br>              memory: 512Mi<br>            requests:<br>              cpu: 100m<br>              memory: 128Mi<br>          volumeMounts:<br>            - name: mysqlrouter-config<br>              
mountPath: /etc/mysqlrouter</pre><h4>A Note on OpenShift Security Context Constraints (SCC)</h4><p>OpenShift enforces SCCs, which is why we explicitly set runAsUser: 1001360000 (a UID in the range assigned to the namespace), runAsNonRoot: true, allowPrivilegeEscalation: false, and drop all Linux capabilities. The seLinuxOptions and seccompProfile are similarly required for a restricted or restricted-v2 SCC. Without these, your pod will be rejected at admission.</p><h3>Step 4: Exposing MySQL Router via a Service</h3><p>Create a ClusterIP service so that application pods can connect to MySQL Router using a stable DNS name (mysql-router.abc-system-production.svc.cluster.local) rather than pod IPs:</p><pre>apiVersion: v1<br>kind: Service<br>metadata:<br>  name: mysql-router<br>  namespace: abc-system-production<br>  labels:<br>    app: mysqlrouter<br>spec:<br>  type: ClusterIP<br>  selector:<br>    app: mysqlrouter<br>  ports:<br>    - name: rw-classic<br>      protocol: TCP<br>      port: 6446<br>      targetPort: 6446<br>    - name: ro-classic<br>      protocol: TCP<br>      port: 6447<br>      targetPort: 6447<br>    - name: rw-x<br>      protocol: TCP<br>      port: 6448<br>      targetPort: 6448<br>    - name: ro-x<br>      protocol: TCP<br>      port: 6449<br>      targetPort: 6449</pre><p>Your application should send reads to port 6447 (round-robin across secondaries with fallback to primary) and writes to port 6446 (always routed to the current primary).</p><h3>Step 5: Validating the Deployment</h3><p>Use the following Job to run 5 parallel test connections and verify that routing is working correctly:</p><pre>apiVersion: batch/v1<br>kind: Job<br>metadata:<br>  name: mysql-router-test<br>  namespace: abc-system-production<br>spec:<br>  ttlSecondsAfterFinished: 120<br>  completions: 5<br>  parallelism: 5<br>  template:<br>    spec:<br>      restartPolicy: Never<br>      containers:<br>      - name: mysql-test<br>        image: mysql:8.0<br>      
  command:<br>        - bash<br>        - -c<br>        - |<br>          echo &quot;=== Testing R/W Port 6446 (Expected: PRIMARY) ===&quot;<br>          mysql -h mysql-router -P 6446 -u root -p&quot;${MYSQL_ROOT_PASSWORD}&quot; \<br>            -e &quot;SELECT @@hostname AS hostname, @@read_only AS read_only, @@port AS port;&quot;</pre><pre>          echo &quot;=== Testing Read-Only Port 6447 (Expected: REPLICA) ===&quot;<br>          mysql -h mysql-router -P 6447 -u root -p&quot;${MYSQL_ROOT_PASSWORD}&quot; \<br>            -e &quot;SELECT @@hostname AS hostname, @@read_only AS read_only, @@port AS port;&quot;</pre><pre>          echo &quot;=== Cluster Member Status ===&quot;<br>          mysql -h mysql-router -P 6446 -u root -p&quot;${MYSQL_ROOT_PASSWORD}&quot; \<br>            -e &quot;SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE<br>                FROM performance_schema.replication_group_members;&quot;<br>        env:<br>        - name: MYSQL_ROOT_PASSWORD<br>          valueFrom:<br>            secretKeyRef:<br>              name: mysql-router-secret<br>              key: router-password<br>        resources:<br>          requests:<br>            cpu: &quot;100m&quot;<br>            memory: &quot;128Mi&quot;<br>          limits:<br>            cpu: &quot;200m&quot;<br>            memory: &quot;256Mi&quot;</pre><p>A successful output will look like this:</p><pre>=== Testing R/W Port 6446 (Expected: PRIMARY) ===<br>hostname    read_only   port<br>dbnodea     0           3306<br><br>=== Testing Read-Only Port 6447 (Expected: REPLICA) ===<br>hostname    read_only   port<br>dbnodec     1           3306</pre><pre>=== Cluster Member Status ===<br>MEMBER_HOST         MEMBER_PORT   MEMBER_STATE   MEMBER_ROLE<br>192.168.100.10      3306          ONLINE         SECONDARY<br>192.168.100.11      3306          ONLINE         PRIMARY<br>192.168.100.12      3306          ONLINE         SECONDARY</pre><p>Port 6446 returns read_only=0, confirming it is the 
PRIMARY. Port 6447 returns read_only=1, confirming it is a SECONDARY. All three cluster members are ONLINE.</p><h3>How MySQL Router Stays in Sync</h3><p>You might wonder: what happens when the primary fails over? MySQL Router’s metadata cache is the key. The generated mysqlrouter.conf contains a [metadata_cache:mycluster] section that lists all three bootstrap server addresses and sets use_gr_notifications = 1. This means the router subscribes to Group Replication notifications and instantly knows when a new node becomes primary — without waiting for a TTL to expire.</p><p>The relevant section from the bootstrapped config:</p><pre>[metadata_cache:mycluster]<br>cluster_type = gr<br>bootstrap_server_addresses = IP1:3306,IP2:3306,IP3:3306<br>user = mysqlrouter<br>ttl = 0.5<br>use_gr_notifications = 1</pre><pre>[routing:primary]<br>bind_port = 6446<br>destinations = metadata-cache://mycluster/?role=PRIMARY<br>routing_strategy = first-available<br>protocol = classic</pre><pre>[routing:secondary]<br>bind_port = 6447<br>destinations = metadata-cache://mycluster/?role=SECONDARY<br>routing_strategy = round-robin-with-fallback<br>protocol = classic</pre><p>The round-robin-with-fallback strategy on the secondary port means that if all secondaries go offline, reads will automatically fall back to the primary — a sensible default for most workloads.</p><h3>Key Takeaways</h3><p>Deploying MySQL Router on OpenShift to front a VM-hosted InnoDB Cluster is well worth the setup effort. A few things worth remembering from this walkthrough:</p><p><strong>Use an init container for bootstrapping.</strong> The bootstrap process is stateful — it writes a config file based on the live cluster topology. An init container with a shared emptyDir volume is the idiomatic Kubernetes way to handle this.</p><p><strong>URL-encode your password.</strong> If your MySQL Router password contains special characters, the bootstrap URI will silently fail or connect to the wrong host. 
The pure-bash urlencode() function in the init container script solves this robustly.</p><p><strong>Use the subnets + excludeSubnets approach in your NAD.</strong> Hard-coding pod IPs creates operational headaches. Let OVN-Kubernetes assign IPs automatically while protecting the IPs reserved for your database VMs.</p><p><strong>Set OpenShift-compatible security contexts.</strong> Don’t fight the SCC — set the right runAsUser, drop all capabilities, and use seccompProfile: RuntimeDefault. It only takes a few lines and prevents hours of debugging.</p><p><strong>Validate with a parallel Job.</strong> Running 5 completions in parallel quickly verifies that both the RW and RO routing paths are working, and shows you the real cluster topology from the router’s perspective.</p><h3>Conclusion</h3><p>MySQL Router is a lightweight but powerful component that transforms a raw MySQL Group Replication cluster into a topology-aware, highly-available data tier. Running it as a Kubernetes Deployment on OpenShift — with an OVN-Kubernetes L2 overlay connecting it to your OpenShift Virtualization database VMs — gives you the best of both worlds: the simplicity of VM-based MySQL operations combined with cloud-native connectivity and observability.</p><p>The patterns shown here — init containers for bootstrapping, secondary network interfaces for DB connectivity, parallel validation Jobs — are broadly applicable beyond MySQL Router to any scenario where containerized workloads need to reach VM-hosted databases on OpenShift.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=52ee626a8b74" width="1" height="1" alt="">]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How We Brought a Dead MySQL InnoDB Cluster Back to Life]]></title>
      <link>https://medium.com/@gayan.dissanayake/how-we-brought-a-dead-mysql-innodb-cluster-back-to-life-c285d2aba919?source=rss-5016ec278afa------2</link>
      <guid isPermaLink="false">https://medium.com/p/c285d2aba919</guid>
      <dc:creator><![CDATA[Gayan Dissanayake]]></dc:creator>
      <pubDate>Fri, 20 Feb 2026 10:25:16 GMT</pubDate>
      <atom:updated>2026-02-20T10:25:16.518Z</atom:updated>
<content:encoded><![CDATA[<h3>A war story: complete outage, GTID chaos, duplicate UUIDs, and the steps that finally worked</h3><p>There’s a particular kind of dread that comes with staring at a database cluster where every node shows OFFLINE.</p><p>No reads. No writes. Just silence where your production data used to be.</p><p>That’s exactly where we found ourselves with a MySQL InnoDB Cluster — three nodes, all down, all stubbornly refusing to cooperate. This is the complete story of how we diagnosed the failure, untangled a mess of GTID inconsistencies and duplicate UUIDs, and brought the cluster fully back online.</p><p>If you’re facing something similar, this guide is for you.</p><h3>The Setup</h3><p>Our cluster — ABCCLUSTER — runs three nodes:</p><ul><li><strong>Node A</strong> — &lt;IP 1&gt; (intended PRIMARY)</li><li><strong>Node B</strong> — &lt;IP 2&gt; (SECONDARY)</li><li><strong>Node C</strong> — &lt;IP 3&gt; (SECONDARY)</li></ul><p>Group Replication is configured with communication_stack=MYSQL, meaning it communicates over port <strong>3306</strong> — not the commonly assumed 33061. This turned out to be a critical early discovery.</p><h3>Phase 1 — Diagnosis: Understanding What Actually Broke</h3><p>The first instinct when a cluster goes dark is to start restarting services. <strong>Resist that urge.</strong> Diagnose first.</p><h3>Step 1 — Check cluster member visibility</h3><pre>SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE <br>FROM performance_schema.replication_group_members;</pre><p>Only Node A was visible — and it showed OFFLINE. Nodes B and C were completely invisible to the group.</p><h3>Step 2 — Read the error logs</h3><pre>tail -100 /var/log/mysqld.log</pre><p>The logs revealed connection failures between nodes on port <strong>3306</strong>, not 33061. 
This matters.</p><h3>Step 3 — Test port connectivity</h3><pre>nc -zv abc-prod-mysql-db-cluster-data-node-a 33061<br>nc -zv abc-prod-mysql-db-cluster-data-node-b 33061<br>nc -zv abc-prod-mysql-db-cluster-data-node-c 33061</pre><p>Port 33061 was blocked — but as the next step revealed, that wasn’t our actual problem.</p><h3>Step 4 — Check persisted Group Replication variables</h3><pre>SELECT * FROM performance_schema.persisted_variables <br>WHERE VARIABLE_NAME LIKE &#39;group_replication%&#39;;</pre><p>This confirmed communication_stack=MYSQL. Our cluster communicates on port <strong>3306</strong>, not 33061. The blocked port was a red herring.</p><h3>Step 5 — Note the Group Name</h3><pre>SHOW VARIABLES LIKE &#39;group_replication_group_name&#39;;</pre><p>Result: 91527b32-58f3-11ee-b8bf-fa163ec0acaf — we&#39;ll need this later.</p><h3>Phase 2 — Fixing Privileges Before Rebooting</h3><p>This phase trips up a lot of people. MySQL Shell’s dba.rebootClusterFromCompleteOutage() requires very specific privileges — and they need WITH GRANT OPTION.</p><h3>Step 6 — Launch MySQL Shell</h3><pre>mysqlsh root@localhost</pre><h3>Step 7 — Attempt cluster reboot (it failed)</h3><pre>dba.rebootClusterFromCompleteOutage(&#39;ABCCLUSTER&#39;);</pre><p><strong>Error:</strong> root@&#39;%&#39; was missing dynamic privileges with WITH GRANT OPTION.</p><h3>Step 8 — Verify what’s missing</h3><pre>SHOW GRANTS FOR &#39;root&#39;@&#39;%&#39;;</pre><p>Several dynamic privileges were present but without WITH GRANT OPTION. 
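A small helper makes the gap easy to spot. This is a hypothetical sketch, assuming you have saved the SHOW GRANTS output to a file; it checks a representative subset of the dynamic privileges granted in Step 10:

```shell
# Hypothetical helper: flag required privileges that do not appear on a
# grant line carrying WITH GRANT OPTION in a saved SHOW GRANTS dump.
required="CLONE_ADMIN CONNECTION_ADMIN GROUP_REPLICATION_ADMIN \
PERSIST_RO_VARIABLES_ADMIN REPLICATION_APPLIER REPLICATION_SLAVE_ADMIN \
ROLE_ADMIN SYSTEM_VARIABLES_ADMIN"

missing_privs() {
  local grants_file="$1" priv
  for priv in $required; do
    grep -q "${priv}.*WITH GRANT OPTION" "$grants_file" || echo "missing: ${priv}"
  done
}

# Example: ROLE_ADMIN is delegable here, CLONE_ADMIN is not.
cat > /tmp/grants.txt <<'EOF'
GRANT ROLE_ADMIN, SYSTEM_VARIABLES_ADMIN ON *.* TO `root`@`%` WITH GRANT OPTION
GRANT CLONE_ADMIN, GROUP_REPLICATION_ADMIN ON *.* TO `root`@`%`
EOF
missing_privs /tmp/grants.txt
```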
MySQL Shell needs to be able to <em>delegate</em> these grants to recovery accounts.</p><h3>Step 9 — Disable read-only mode on all 3 nodes</h3><p>Run this on <strong>each node</strong> before making any changes:</p><pre>SET GLOBAL super_read_only = OFF;<br>SET GLOBAL read_only = OFF;</pre><h3>Step 10 — Grant the full required privilege set on all 3 nodes</h3><pre>GRANT CLONE_ADMIN, CONNECTION_ADMIN, CREATE USER, EXECUTE, FILE, <br>GROUP_REPLICATION_ADMIN, PERSIST_RO_VARIABLES_ADMIN, PROCESS, RELOAD, <br>REPLICATION CLIENT, REPLICATION SLAVE, REPLICATION_APPLIER, <br>REPLICATION_SLAVE_ADMIN, ROLE_ADMIN, SELECT, SHUTDOWN, <br>SYSTEM_VARIABLES_ADMIN ON *.* TO &#39;root&#39;@&#39;%&#39; WITH GRANT OPTION;</pre><pre>FLUSH PRIVILEGES;</pre><blockquote><strong><em>Why all three nodes?</em></strong><em> During recovery, MySQL Shell connects to every member. If any node is missing privileges, the reboot will fail partway through.</em></blockquote><h3>Phase 3 — Rebooting the Cluster</h3><p>With privileges fixed, we can attempt the reboot — but first we need to know <em>which node has the most recent data</em> to elect it as PRIMARY.</p><h3>Step 11 — Check GTID position on all nodes</h3><pre>SELECT @@global.gtid_executed;</pre><p>Results:</p><ul><li><strong>Node A:</strong> b3ce0ea4:1-372 ← most transactions</li><li><strong>Node B:</strong> b3ce0ea4:1-369</li><li><strong>Node C:</strong> b3ce0ea4:1-369</li></ul><p>Node A is 3 transactions ahead. It <strong>must</strong> be the PRIMARY.</p><h3>Step 12 — Reboot with force, nominating Node A</h3><pre>dba.rebootClusterFromCompleteOutage(&#39;ABCCLUSTER&#39;, {<br>    force: true,<br>    primary: &#39;abc-prod-mysql-db-cluster-data-node-a:3306&#39;<br>});</pre><p><strong>Node A came online as PRIMARY. ✅</strong></p><p>However, Nodes B and C could not rejoin — they had GTID sets incompatible with what Node A had already purged. 
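The divergence is visible directly in the numbers from Step 11. As a toy illustration for single-UUID, single-interval GTID sets like these (real sets can hold many UUIDs and intervals, where MySQL's GTID_SUBSET() function is the reliable check):

```shell
# Toy comparison for GTID values of the form "uuid:1-N": extract the
# upper bound and pick the node with the most executed transactions.
gtid_upper() {
  echo "${1##*-}"   # "b3ce0ea4:1-372" -> "372"
}

node_a="b3ce0ea4:1-372"
node_b="b3ce0ea4:1-369"

if [ "$(gtid_upper "$node_a")" -gt "$(gtid_upper "$node_b")" ]; then
  echo "Node A is ahead and must be nominated as PRIMARY"
fi
```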
This is a common outcome when one node is ahead of others and binary logs have been purged.</p><h3>Phase 4 — Recovering Node B</h3><p>Because Node B’s GTID history conflicted with the cluster’s current state, a standard rejoin wasn’t possible. We needed to wipe its state and clone fresh data from Node A.</p><h3>Step 13 — Reset GTID state on Node B</h3><pre>SET GLOBAL super_read_only = OFF;<br>SET GLOBAL read_only = OFF;<br>STOP GROUP_REPLICATION;<br>RESET MASTER;</pre><h3>Step 14 — Attempt rejoin (failed — purged transactions)</h3><pre>cluster.rejoinInstance(&#39;root@abc-prod-mysql-db-cluster-data-node-b:3306&#39;);</pre><p><strong>Error:</strong> Transactions required for recovery have been purged from all cluster members. We need a full clone.</p><h3>Step 15 — Attempt add with clone (failed — duplicate UUID)</h3><pre>cluster.addInstance(&#39;root@abc-prod-mysql-db-cluster-data-node-b:3306&#39;, <br>    {recoveryMethod: &#39;clone&#39;});</pre><p><strong>Error:</strong> Duplicate server UUID 8a6e08dc-a1f5-11ed-8464-fa163ee55914</p><p>This happens when a node was previously cloned from another, or when auto.cnf was copied between servers.</p><h3>Step 16 — Fix the duplicate UUID</h3><pre>systemctl stop mysql<br>rm /data/mysql/auto.cnf<br>systemctl start mysql</pre><p>MySQL regenerates a fresh UUID on startup when auto.cnf is absent.</p><h3>Step 17 — Attempt add again (failed — duplicate server_id)</h3><pre>cluster.addInstance(&#39;root@abc-prod-mysql-db-cluster-data-node-b:3306&#39;, <br>    {recoveryMethod: &#39;clone&#39;});</pre><p><strong>Error:</strong> server_id=2 already used by an active cluster member.</p><h3>Step 18 — Fix the duplicate server_id</h3><pre>SET GLOBAL super_read_only = OFF;<br>SET GLOBAL read_only = OFF;<br>SET PERSIST server_id = 22;</pre><p>Then update the config file so it survives restarts:</p><pre>vi /etc/my.cnf<br># Change: server-id=2<br># To:     server-id=22<br>systemctl restart mysql</pre><h3>Step 19 — Successfully add Node B via 
clone</h3><pre>cluster.addInstance(&#39;root@abc-prod-mysql-db-cluster-data-node-b:3306&#39;, <br>    {recoveryMethod: &#39;clone&#39;});</pre><p>Clone initiated. Node A’s data was transferred to Node B. <strong>✅</strong></p><h3>Step 20 — Rescan to update cluster metadata</h3><pre>cluster.rescan();</pre><p>Node B’s new UUID was registered in the metadata. Node C appeared as unavailable — we answered <strong>N</strong> when prompted to remove it from metadata, since we planned to recover it next.</p><h3>Phase 5 — Recovering Node C</h3><p>The same pattern as Node B, applied consistently.</p><h3>Step 21 — Prepare Node C</h3><pre>ssh root@abc-prod-mysql-db-cluster-data-node-c<br>systemctl stop mysql<br>rm /data/mysql/auto.cnf<br>systemctl start mysql</pre><pre>SET GLOBAL super_read_only = OFF;<br>SET GLOBAL read_only = OFF;<br>SET PERSIST server_id = 33;</pre><pre>vi /etc/my.cnf<br># Change: server-id=3<br># To:     server-id=33<br>systemctl restart mysql</pre><h3>Step 22 — Add Node C via clone</h3><pre>cluster.addInstance(&#39;root@abc-prod-mysql-db-cluster-data-node-c:3306&#39;, <br>    {recoveryMethod: &#39;clone&#39;});</pre><h3>Step 23 — Final status verification</h3><pre>var cluster = dba.getCluster(&#39;ABCCLUSTER&#39;);<br>cluster.status();</pre><p><strong>All three nodes online. Cluster fully operational. ✅</strong></p><h3>The Root Cause Fix: Hardening my.cnf</h3><p>This outage happened partly because Group Replication settings weren’t persisted to disk. After a server restart, GR didn’t start automatically. 
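One way to catch this drift early in the future is to audit my.cnf for the hardening keys listed next; a hypothetical sketch:

```shell
# Hypothetical audit: report which Group Replication hardening keys are
# absent from a my.cnf copy (non-zero return means at least one missing).
check_gr_config() {
  local cnf="$1" key rc=0
  for key in gtid_mode enforce_gtid_consistency plugin-load-add \
             group_replication_group_name group_replication_start_on_boot; do
    grep -q "^${key}" "$cnf" || { echo "missing: ${key}"; rc=1; }
  done
  return $rc
}

# Example against a partially hardened config:
printf 'gtid_mode=ON\nenforce_gtid_consistency=ON\n' > /tmp/my.cnf.test
check_gr_config /tmp/my.cnf.test || true
```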
To prevent that, add the following to /etc/my.cnf on <strong>all three nodes</strong>:</p><pre>gtid_mode=ON<br>enforce_gtid_consistency=ON<br>binlog_checksum=NONE<br>plugin-load-add=group_replication.so<br>group_replication=FORCE_PLUS_PERMANENT<br>group_replication_group_name=&quot;91527b32-58f3-11ee-b8bf-fa163ec0acaf&quot;<br>group_replication_start_on_boot=ON<br>group_replication_recovery_get_public_key=ON</pre><p>The key lines to understand:</p><ul><li>group_replication_start_on_boot=ON — GR starts automatically after mysqld restarts</li><li>group_replication=FORCE_PLUS_PERMANENT — prevents MySQL from starting without GR loaded</li><li>group_replication_recovery_get_public_key=ON — avoids authentication failures during recovery</li></ul><h3>5 Lessons from This Outage</h3><p><strong>1. Know your communication stack.</strong> communication_stack=MYSQL means port 3306, not 33061. Chasing firewall rules for the wrong port wastes critical time during an outage.</p><p><strong>2. GTID position determines your PRIMARY.</strong> Always query @@global.gtid_executed on every node before rebooting. Picking the wrong primary can cause data loss.</p><p><strong>3. Duplicate UUIDs and server_ids are silent killers.</strong> If nodes were ever cloned from each other, check both auto.cnf and server_id before attempting to add them to a cluster.</p><p><strong>4. Clone is your friend after GTID divergence.</strong> When purged transactions make incremental recovery impossible, recoveryMethod: &#39;clone&#39; bypasses the GTID history entirely and gives you a clean slate.</p><p><strong>5. 
Persist your config.</strong> SET PERSIST keeps variables across restarts in mysqld-auto.cnf, but your my.cnf should reflect the same values for transparency and manual recovery scenarios.</p><h3>The Complete Recovery Path at a Glance</h3><table><thead><tr><th>Phase</th><th>What We Did</th><th>Outcome</th></tr></thead><tbody><tr><td><strong>Diagnosis</strong></td><td>Checked status, logs, ports, GR variables</td><td>Identified communication stack and root errors</td></tr><tr><td><strong>Privileges</strong></td><td>Granted full dynamic privilege set with GRANT OPTION</td><td>MySQL Shell able to proceed</td></tr><tr><td><strong>Reboot</strong></td><td>Force-rebooted with Node A as PRIMARY</td><td>Node A online; B/C GTID incompatible</td></tr><tr><td><strong>Node B</strong></td><td>Cleared auto.cnf, fixed server_id, cloned from A</td><td>Node B rejoined as SECONDARY</td></tr><tr><td><strong>Node C</strong></td><td>Same process as Node B</td><td>Node C rejoined as SECONDARY</td></tr><tr><td><strong>Hardening</strong></td><td>Updated my.cnf on all nodes</td><td>GR will survive future restarts</td></tr></tbody></table><p>Database outages are stressful — but they’re survivable with a methodical approach. The key is resisting the urge to brute-force restarts and instead following the evidence from logs, GTID positions, and error messages. Each failure in this recovery pointed directly at the next step.</p><p>Hopefully this walkthrough saves someone else a few hours of pain.</p><p><em>Have questions or a different approach that worked for you? Drop a comment below.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c285d2aba919" width="1" height="1" alt="">]]></content:encoded>
    </item>
  </channel>
</rss>
