Friday, October 18, 2024

This is a continuation of a series of articles on IaC shortcomings and resolutions. In this section, we discuss ways to transfer data from one Azure Managed Instance for Apache Cassandra in a virtual network to another in a different network. The network separation is not specific to the Cassandra resource type; it serves only to spell out the steps needed to generalize the data transfer.

Data in a Cassandra cluster is organized into keyspaces and tables. The first approach is the direct one: use a command-line client such as cqlsh to interact with the clusters, exporting the tables from the source as CSV files and importing them into the destination server.

Example:

Step 1. At source server:

USE <keyspace>;

COPY <keyspace>.<table_name> TO 'path/to/file.csv' WITH HEADER = true;

Step 2. At destination server:

USE <keyspace>;

CREATE TABLE <table_name> (
    column1 datatype1,
    column2 datatype2,
    ...
    PRIMARY KEY (column1)
);

COPY <keyspace>.<table_name> (column1, column2, ...) FROM 'path/to/file.csv' WITH HEADER = true;

The second option is to read the data from the source and, without creating any local artifact, write it directly to the destination. One way to do this is to run a copy activity in a Databricks notebook using Apache Spark.

Example:

from pyspark.sql import SparkSession

# Initialize the Spark session. This assumes the Spark Cassandra Connector
# (e.g., the com.datastax.spark:spark-cassandra-connector package) is
# available on the cluster; Azure Managed Instance for Apache Cassandra may
# also require "spark.cassandra.connection.ssl.enabled" to be set to "true".
spark = SparkSession.builder \
    .appName("Copy Cassandra Data") \
    .config("spark.cassandra.connection.host", "<source-cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<source-username>") \
    .config("spark.cassandra.auth.password", "<source-password>") \
    .getOrCreate()

# List of keyspaces and tables to copy. The nested loop below copies every
# listed table from every listed keyspace; adjust the pairing if the
# keyspaces do not share the same set of tables.
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

for keyspace in keyspaces:
    for table in tables:
        # Read data from the source Cassandra cluster
        df = spark.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

        # Write data to the target Cassandra cluster; the table must already
        # exist there. The dotted connector settings are not valid Python
        # keyword arguments, so they are passed as a dict.
        df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(**{
                "keyspace": keyspace,
                "table": table,
                "spark.cassandra.connection.host": "<target-cassandra-host>",
                "spark.cassandra.connection.port": "9042",
                "spark.cassandra.auth.username": "<target-username>",
                "spark.cassandra.auth.password": "<target-password>",
            }) \
            .mode("append") \
            .save()

# Stop the Spark session
spark.stop()

Note, however, that we started out with the source and the destination in different networks. If the Databricks workspace is attached to the same network as one of the servers, it will not be able to reach the other. One way around that is to peer the two networks, but peering usually affects other resources and is not always possible. Another option is to add private endpoints, but the source and destination may be connected to a delegated subnet, which rules that option out. Consequently, we must add a step through a third location that both networks can access, such as a storage account reachable over public IP networking, and use it as the intermediary for the transfer.

The export half of that transfer would look as follows:

from pyspark.sql import SparkSession

# Set up the Spark session (against the source cluster)
spark = SparkSession.builder \
    .appName("Export Cassandra to Azure Storage") \
    .config("spark.cassandra.connection.host", "<cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<username>") \
    .config("spark.cassandra.auth.password", "<password>") \
    .getOrCreate()

# Define the Azure Storage account details
storage_account_name = "<storage-account-name>"
storage_account_key = "<storage-account-key>"
container_name = "<container-name>"

# Configure access to the storage account
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

# Define keyspaces and tables to export
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

# Export each table to CSV in Azure Storage
for keyspace in keyspaces:
    for table in tables:
        # Read data from Cassandra
        df = spark.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

        # Define the output path; Spark writes a directory of part files
        # under this path, not a single CSV file
        output_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{keyspace}/{table}.csv"

        # Write data to CSV
        df.write \
            .csv(output_path, header=True, mode="overwrite")

# Stop the Spark session
spark.stop()
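For completeness, here is the import half: reading the staged CSVs from the storage account and writing them into the destination cluster, run from a notebook that can reach the destination network. This is a minimal sketch that reuses the placeholders from the export example and assumes the destination tables already exist; CSV round-tripping does not preserve Cassandra type fidelity (for example, collections and timestamps), so inferSchema is only a convenience and the schema should be validated against the target table.

from pyspark.sql import SparkSession

# Set up the Spark session (against the destination cluster)
spark = SparkSession.builder \
    .appName("Import Azure Storage to Cassandra") \
    .config("spark.cassandra.connection.host", "<target-cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<target-username>") \
    .config("spark.cassandra.auth.password", "<target-password>") \
    .getOrCreate()

# Same storage account and layout as in the export example
storage_account_name = "<storage-account-name>"
storage_account_key = "<storage-account-key>"
container_name = "<container-name>"
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

for keyspace in keyspaces:
    for table in tables:
        # Read the staged CSV back from Azure Storage
        input_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{keyspace}/{table}.csv"
        df = spark.read.csv(input_path, header=True, inferSchema=True)

        # Append into the destination table, which must already exist
        df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .mode("append") \
            .save()

# Stop the Spark session
spark.stop()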

Lastly, while it does not matter for correctness whether an agent copies the data directly or stages it in an intermediary stash, the size and the number of tables do matter for the reliability of the transfer, especially when the connection or the execution can be interrupted. Whichever option we choose, the copying logic must be made robust, for example by retrying failed tables and tracking the ones that have already completed.
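A minimal sketch of such hardening, assuming a hypothetical copy_table(keyspace, table) helper that wraps either of the transfer options above:

import time

def copy_with_retries(keyspaces, tables, copy_table, max_attempts=3, backoff_seconds=30):
    """Run copy_table for every (keyspace, table) pair, retrying failures.

    Returns the pairs that still failed after all attempts so a rerun can
    skip the tables that already succeeded.
    """
    failed = []
    for keyspace in keyspaces:
        for table in tables:
            for attempt in range(1, max_attempts + 1):
                try:
                    copy_table(keyspace, table)
                    break
                except Exception as exc:
                    print(f"{keyspace}.{table} attempt {attempt} failed: {exc}")
                    if attempt == max_attempts:
                        failed.append((keyspace, table))
                    else:
                        time.sleep(backoff_seconds * attempt)
    return failed

Tables left in the failed list can be retried in a later run without repeating the ones that completed, which keeps an interrupted transfer resumable.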

