Changes from 23 commits
62 commits
1f5568f
do not auto add default qos class (#720)
geoffrey1330 Nov 5, 2025
3af633b
Update env_var (#721)
geoffrey1330 Nov 5, 2025
ad546ca
Enable ndcs and npcs when creating lvol (#729)
Hamdy-khader Nov 11, 2025
5f6382b
Fix sfam-2450 cluster update issues (#726)
Hamdy-khader Nov 11, 2025
4a6a4d7
Update Dockerfile_base (#730)
geoffrey1330 Nov 11, 2025
bf56cb6
inherit default cluster mode in new cluster (#733)
geoffrey1330 Nov 12, 2025
0e72282
Update environment variables for Simply Block (#737)
Hamdy-khader Nov 12, 2025
25e3dd2
Main lvol sync delete (#734)
Hamdy-khader Nov 13, 2025
cd68c60
added fdb multi AZ support (#736)
geoffrey1330 Nov 13, 2025
1c38b6e
increased k8s fdb memory limit (#740)
geoffrey1330 Nov 14, 2025
5d9e0a4
Added MIT License (#742)
noctarius Nov 14, 2025
ee8d460
Update constants.py (#744)
schmidt-scaled Nov 15, 2025
2b14491
set size of lvstore cluster in constants (as ratio to distrib page size)
schmidt-scaled Nov 15, 2025
cfd14f7
Merge remote-tracking branch 'origin/main'
schmidt-scaled Nov 15, 2025
314c4cf
Update sc name (#746)
geoffrey1330 Nov 17, 2025
ce6ae0f
updated to distributed provisioning (#748)
geoffrey1330 Nov 17, 2025
5596c11
Update Dockerfile_base (#750)
geoffrey1330 Nov 17, 2025
aaa9b42
sleep after openshift core isolation until reboot (#753)
geoffrey1330 Nov 18, 2025
b60925d
added try and except to patch_prometheus_configmap func (#756)
geoffrey1330 Nov 19, 2025
bb90c60
added hostNetwork true to simplyblock controlplane services (#771)
geoffrey1330 Nov 23, 2025
43c97a5
Set cluster_id optional on SNodeAPI docker version (#777)
Hamdy-khader Nov 25, 2025
33ee3e4
add cluster_id param for spdk_process_is_up (#779)
geoffrey1330 Nov 26, 2025
3fc94cb
Implements sfam-2459
Hamdy-khader Nov 26, 2025
b2cd483
adds service python header
Hamdy-khader Nov 26, 2025
4b21d27
WIP
Hamdy-khader Nov 26, 2025
2531483
updated images for openshift preflight check (#741)
geoffrey1330 Nov 27, 2025
74d34f7
WIP 2
Hamdy-khader Nov 27, 2025
36f45b9
added graylog env GRAYLOG_MESSAGE_JOURNAL_MAX_SIZE (#782)
geoffrey1330 Nov 27, 2025
f412121
Create partitions and alcemls on node add in parallel (#763) (#785)
Hamdy-khader Dec 1, 2025
3c60a2c
Remove stats from fdb and get it from Prometheus (#762) (#786)
Hamdy-khader Dec 1, 2025
6ddfd0b
Increase jc comp resume retry on node not online (#690)
Hamdy-khader Dec 1, 2025
e07eb62
Merge remote-tracking branch 'origin/main' into main-lvol-scheduler
Hamdy-khader Dec 2, 2025
6c363b9
wip
Hamdy-khader Dec 2, 2025
8e3fe70
Adds missing services to k8s mgmt (#788)
Hamdy-khader Dec 2, 2025
14b4282
Merge branch 'main' into main-lvol-scheduler
Hamdy-khader Dec 2, 2025
c621a9a
continue lvol new sch impl
Hamdy-khader Dec 2, 2025
a77c5e4
fix sfam-2507 (#791)
geoffrey1330 Dec 3, 2025
db2ca62
Update cluster.py (#793)
geoffrey1330 Dec 3, 2025
67455ed
Update mgmt_node_ops.py (#795)
geoffrey1330 Dec 3, 2025
20068bf
remove function get_node_name_by_ip (#794)
geoffrey1330 Dec 3, 2025
e91dbc0
Fix /cluster/create_first response (#798)
Hamdy-khader Dec 3, 2025
bc79957
Remove user creation and switch (#799)
Hamdy-khader Dec 4, 2025
43a4cae
Fix apiv2 pool add response to return pool dict (#800)
Hamdy-khader Dec 4, 2025
ce479eb
Update mgmt_node_ops.py (#796)
geoffrey1330 Dec 4, 2025
ec07572
Fix add-node apiv2 to remove unused param "full_page_unmap" (#801)
Hamdy-khader Dec 4, 2025
f48c839
Main fix add node apiv2 (#802)
Hamdy-khader Dec 4, 2025
f673f0a
Update cluster_ops.py (#797)
geoffrey1330 Dec 4, 2025
d2ad973
Fix node-add apiv2 response (#803)
Hamdy-khader Dec 4, 2025
1b0d7aa
Main fix node list apiv2 response (#804)
Hamdy-khader Dec 4, 2025
e0cd5ab
Update storage_deploy_spdk.yaml.j2 (#805)
geoffrey1330 Dec 4, 2025
7d8bb86
Adding quick outage case, changes to ssh utils (#806)
RaunakJalan Dec 5, 2025
24d6ced
Fix sn list apiv2 response _2 (#807)
Hamdy-khader Dec 5, 2025
d1163e3
Update cluster.py (#808)
geoffrey1330 Dec 8, 2025
d163662
Fdb health check (#809)
geoffrey1330 Dec 8, 2025
e344c93
Fix sfam-2515
Hamdy-khader Dec 8, 2025
e18c09a
Merge pull request #811 from simplyblock/main-sfam-2515
alirezamhd2024 Dec 8, 2025
9f299f5
Merge remote-tracking branch 'origin/main' into main-lvol-scheduler
Hamdy-khader Dec 10, 2025
a0005ff
wip
Hamdy-khader Dec 10, 2025
f69ae95
Adds unit tests for lvol scheduler
Hamdy-khader Dec 10, 2025
c94282b
use 1m test run
Hamdy-khader Dec 10, 2025
497f415
use 100k test run
Hamdy-khader Dec 10, 2025
f6b697d
fix linter
Hamdy-khader Dec 10, 2025
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023-2025 simplyblock GmbH

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
1 change: 1 addition & 0 deletions docker/Dockerfile_base
@@ -38,3 +38,4 @@ RUN pip3 install setuptools --upgrade
COPY requirements.txt requirements.txt

RUN pip3 install -r requirements.txt

14 changes: 0 additions & 14 deletions docs/talos.md
@@ -19,26 +19,12 @@ kubectl label namespace simplyblock \
--overwrite
```


-Patch the host machine so that OpenEBS could work
-
Create a machine config patch with the contents below and save as patch.yaml
```
cat > patch.yaml <<'EOF'
machine:
  sysctls:
    vm.nr_hugepages: "1024"
-  nodeLabels:
-    openebs.io/engine: mayastor
-  kubelet:
-    extraMounts:
-      - destination: /var/openebs/local
-        type: bind
-        source: /var/openebs/local
-        options:
-          - rbind
-          - rshared
-          - rw
EOF

talosctl -e <endpoint ip/hostname> -n <node ip/hostname> patch mc -p @patch.yaml
83 changes: 62 additions & 21 deletions simplyblock_core/cluster_ops.py
@@ -371,8 +371,6 @@ def create_cluster(blk_size, page_size_in_blocks, cli_pass,

cluster.write_to_db(db_controller.kv_store)

-qos_controller.add_class("Default", 100, cluster.get_id())
-
cluster_events.cluster_create(cluster)

mgmt_node_ops.add_mgmt_node(dev_ip, mode, cluster.uuid)
@@ -459,6 +457,7 @@ def add_cluster(blk_size, page_size_in_blocks, cap_warn, cap_crit, prov_cap_warn
cluster.strict_node_anti_affinity = strict_node_anti_affinity

default_cluster = clusters[0]
+cluster.mode = default_cluster.mode
cluster.db_connection = default_cluster.db_connection
cluster.grafana_secret = monitoring_secret if default_cluster.mode == "kubernetes" else default_cluster.grafana_secret
cluster.grafana_endpoint = default_cluster.grafana_endpoint
@@ -1176,37 +1175,52 @@ def update_cluster(cluster_id, mgmt_only=False, restart=False, spdk_image=None,
for service in cluster_docker.services.list():
if image_parts in service.attrs['Spec']['Labels']['com.docker.stack.image'] or \
"simplyblock" in service.attrs['Spec']['Labels']['com.docker.stack.image']:
-logger.info(f"Updating service {service.name}")
-service.update(image=service_image, force_update=True)
-service_names.append(service.attrs['Spec']['Name'])
+if service.name == "app_CachingNodeMonitor":
+logger.info(f"Removing service {service.name}")
+service.remove()
+else:
+logger.info(f"Updating service {service.name}")
+service.update(image=service_image, force_update=True)
+service_names.append(service.attrs['Spec']['Name'])

if "app_SnapshotMonitor" not in service_names:
-logger.info("Creating snapshot monitor service")
-cluster_docker.services.create(
-image=service_image,
-command="python simplyblock_core/services/snapshot_monitor.py",
-name="app_SnapshotMonitor",
-mounts=["/etc/foundationdb:/etc/foundationdb"],
-env=["SIMPLYBLOCK_LOG_LEVEL=DEBUG"],
-networks=["host"],
-constraints=["node.role == manager"]
-)
+utils.create_docker_service(
+cluster_docker=cluster_docker,
+service_name="app_SnapshotMonitor",
+service_file="python simplyblock_core/services/snapshot_monitor.py",
+service_image=service_image)

+if "app_TasksRunnerLVolSyncDelete" not in service_names:
+utils.create_docker_service(
+cluster_docker=cluster_docker,
+service_name="app_TasksRunnerLVolSyncDelete",
+service_file="python simplyblock_core/services/tasks_runner_sync_lvol_del.py",
+service_image=service_image)
+
+if "app_LVolScheduler" not in service_names:
+utils.create_docker_service(
+cluster_docker=cluster_docker,
+service_name="app_LVolScheduler",
+service_file="python simplyblock_core/services/lvol_scheduler.py",
+service_image=service_image)
+
logger.info("Done updating mgmt cluster")

elif cluster.mode == "kubernetes":
utils.load_kube_config_with_fallback()
apps_v1 = k8s_client.AppsV1Api()

+namespace = constants.K8S_NAMESPACE
image_without_tag = constants.SIMPLY_BLOCK_DOCKER_IMAGE.split(":")[0]
image_parts = "/".join(image_without_tag.split("/")[-2:])
service_image = mgmt_image or constants.SIMPLY_BLOCK_DOCKER_IMAGE

+deployment_names = []
# Update Deployments
-deployments = apps_v1.list_namespaced_deployment(namespace=constants.K8S_NAMESPACE)
+deployments = apps_v1.list_namespaced_deployment(namespace=namespace)
for deploy in deployments.items:
if deploy.metadata.name == constants.ADMIN_DEPLOY_NAME:
logger.info(f"Skipping deployment {deploy.metadata.name}")
continue
+deployment_names.append(deploy.metadata.name)
for c in deploy.spec.template.spec.containers:
if image_parts in c.image:
logger.info(f"Updating deployment {deploy.metadata.name} image to {service_image}")
@@ -1216,12 +1230,39 @@ def update_cluster(cluster_id, mgmt_only=False, restart=False, spdk_image=None,
deploy.spec.template.metadata.annotations = annotations
apps_v1.patch_namespaced_deployment(
name=deploy.metadata.name,
-namespace=constants.K8S_NAMESPACE,
+namespace=namespace,
body={"spec": {"template": deploy.spec.template}}
)

+if f"{namespace}-tasks-runner-sync-lvol-del" not in deployment_names:
+utils.create_k8s_service(
+k8s_apps_client=apps_v1,
+namespace=namespace,
+deployment_name=f"{namespace}-tasks-runner-sync-lvol-del",
+container_name="tasks-runner-sync-lvol-del",
+service_file="simplyblock_core/services/tasks_runner_sync_lvol_del.py",
+container_image=service_image)
+
+if f"{namespace}-snapshot-monitor" not in deployment_names:
+utils.create_k8s_service(
+k8s_apps_client=apps_v1,
+namespace=namespace,
+deployment_name=f"{namespace}-snapshot-monitor",
+container_name="snapshot-monitor",
+service_file="simplyblock_core/services/snapshot_monitor.py",
+container_image=service_image)
+
+if f"{namespace}-lvol-scheduler" not in deployment_names:
+utils.create_k8s_service(
+k8s_apps_client=apps_v1,
+namespace=namespace,
+deployment_name=f"{namespace}-lvol-scheduler",
+container_name="lvol-scheduler",
+service_file="simplyblock_core/services/lvol_scheduler.py",
+container_image=service_image)
+
# Update DaemonSets
-daemonsets = apps_v1.list_namespaced_daemon_set(namespace=constants.K8S_NAMESPACE)
+daemonsets = apps_v1.list_namespaced_daemon_set(namespace=namespace)
for ds in daemonsets.items:
for c in ds.spec.template.spec.containers:
if image_parts in c.image:
@@ -1232,7 +1273,7 @@ def update_cluster(cluster_id, mgmt_only=False, restart=False, spdk_image=None,
ds.spec.template.metadata.annotations = annotations
apps_v1.patch_namespaced_daemon_set(
name=ds.metadata.name,
-namespace=constants.K8S_NAMESPACE,
+namespace=namespace,
body={"spec": {"template": ds.spec.template}}
)

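Review note: the `update_cluster` changes above create each management service only when its name is absent from the names collected during the update loop. A minimal sketch of that check follows. The service names and script paths are taken from the diff; `REQUIRED_SERVICES` and `missing_services` are hypothetical helpers for illustration, not part of the codebase, and the real code calls `utils.create_docker_service` instead of printing.

```python
# Hypothetical sketch of the "create only if missing" check.
# Service names and commands come from the diff; the helper names are assumptions.
REQUIRED_SERVICES = {
    "app_SnapshotMonitor": "python simplyblock_core/services/snapshot_monitor.py",
    "app_TasksRunnerLVolSyncDelete": "python simplyblock_core/services/tasks_runner_sync_lvol_del.py",
    "app_LVolScheduler": "python simplyblock_core/services/lvol_scheduler.py",
}


def missing_services(existing_names, required=REQUIRED_SERVICES):
    """Return (name, command) pairs for required services that do not exist yet."""
    existing = set(existing_names)
    return [(name, cmd) for name, cmd in required.items() if name not in existing]


# Example: only the snapshot monitor is already running.
for name, cmd in missing_services(["app_SnapshotMonitor"]):
    print(f"would create {name}: {cmd}")
```

Keeping the required services in one table would let the Docker and Kubernetes branches share the same list, at the cost of one more indirection.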
6 changes: 4 additions & 2 deletions simplyblock_core/constants.py
@@ -49,6 +49,7 @@ def get_config_var(name, default=None):
SSD_VENDOR_WHITE_LIST = ["1d0f:cd01", "1d0f:cd00"]
CACHED_LVOL_STAT_COLLECTOR_INTERVAL_SEC = 5
DEV_DISCOVERY_INTERVAL_SEC = 60
+LVOL_SCHEDULER_INTERVAL_SEC = 60*15

PMEM_DIR = '/tmp/pmem'

@@ -133,7 +134,8 @@ def get_config_var(name, default=None):
LVOL_NVME_CONNECT_NR_IO_QUEUES=3
LVOL_NVME_KEEP_ALIVE_TO=10
LVOL_NVME_KEEP_ALIVE_TO_TCP=7
-LVOL_NVMF_PORT_START=int(os.getenv('LVOL_NVMF_PORT_START', 9100))
+LVOL_NVMF_PORT_ENV = os.getenv("LVOL_NVMF_PORT_START", "")
+LVOL_NVMF_PORT_START = int(LVOL_NVMF_PORT_ENV) if LVOL_NVMF_PORT_ENV else 9100
QPAIR_COUNT=32
CLIENT_QPAIR_COUNT=3
NVME_TIMEOUT_US=8000000
@@ -224,4 +226,4 @@ def get_config_var(name, default=None):

qos_class_meta_and_migration_weight_percent = 25

-MIG_PARALLEL_JOBS = 16
+MIG_PARALLEL_JOBS = 64
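Review note: the `LVOL_NVMF_PORT_START` change matters when the variable is set but empty. `os.getenv(name, default)` only falls back to the default when the variable is unset, so the old form ends up calling `int("")`, which raises `ValueError`. The sketch below reproduces both behaviours; `int_from_env` is a hypothetical helper name used for illustration only.

```python
import os


def int_from_env(name, default):
    """Return the variable as an int, or `default` when it is unset or empty."""
    raw = os.getenv(name, "")
    return int(raw) if raw else default


os.environ["LVOL_NVMF_PORT_START"] = ""  # set, but empty

try:
    # Old form: the default is ignored because the variable exists.
    int(os.getenv("LVOL_NVMF_PORT_START", 9100))
except ValueError as exc:
    print("old form fails:", exc)

# New form: the empty value falls back to the default.
print(int_from_env("LVOL_NVMF_PORT_START", 9100))
```

A value that is non-empty but not numeric still raises `ValueError` in both forms, which is probably the desired outcome for a misconfigured port.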
4 changes: 2 additions & 2 deletions simplyblock_core/controllers/health_controller.py
@@ -128,11 +128,11 @@ def _check_node_api(ip):
return False


-def _check_spdk_process_up(ip, rpc_port):
+def _check_spdk_process_up(ip, rpc_port, cluster_id):
try:
snode_api = SNodeClient(f"{ip}:5000", timeout=10, retry=2)
logger.debug(f"Node API={ip}:5000")
-is_up, _ = snode_api.spdk_process_is_up(rpc_port)
+is_up, _ = snode_api.spdk_process_is_up(rpc_port, cluster_id)
logger.debug(f"SPDK is {is_up}")
return is_up
except Exception as e:
23 changes: 23 additions & 0 deletions simplyblock_core/controllers/tasks_controller.py
@@ -70,6 +70,11 @@ def _add_task(function_name, cluster_id, node_id, device_id,
if task_id:
logger.info(f"Task found, skip adding new task: {task_id}")
return False
+elif function_name == JobSchedule.FN_LVOL_SYNC_DEL:
+task_id = get_lvol_sync_del_task(cluster_id, node_id, function_params['lvol_bdev_name'])
+if task_id:
+logger.info(f"Task found, skip adding new task: {task_id}")
+return False

task_obj = JobSchedule()
task_obj.uuid = str(uuid.uuid4())
@@ -386,3 +391,21 @@ def get_jc_comp_task(cluster_id, node_id, jm_vuid=0):
if jm_vuid and "jm_vuid" in task.function_params and task.function_params["jm_vuid"] == jm_vuid:
return task.uuid
return False
+
+
+def add_lvol_sync_del_task(cluster_id, node_id, lvol_bdev_name):
+return _add_task(JobSchedule.FN_LVOL_SYNC_DEL, cluster_id, node_id, "",
+function_params={"lvol_bdev_name": lvol_bdev_name}, max_retry=10)
+
+def get_lvol_sync_del_task(cluster_id, node_id, lvol_bdev_name=None):
+tasks = db.get_job_tasks(cluster_id)
+for task in tasks:
+if task.function_name == JobSchedule.FN_LVOL_SYNC_DEL and task.node_id == node_id :
+if task.status != JobSchedule.STATUS_DONE and task.canceled is False:
+if lvol_bdev_name:
+if "lvol_bdev_name" in task.function_params and task.function_params["lvol_bdev_name"] == lvol_bdev_name:
+return task.uuid
+else:
+return task.uuid
+return False
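
Review note: `get_lvol_sync_del_task` acts as a duplicate guard, so a second sync-delete task for the same node and bdev is not queued while one is still pending. The sketch below restates that lookup against an in-memory list. The `Task` dataclass, the `"done"` status string, and `find_pending_sync_del` are assumptions made for illustration; the real code reads `JobSchedule` objects from the database.

```python
from dataclasses import dataclass, field

FN_LVOL_SYNC_DEL = "lvol_sync_del"  # value taken from the diff
STATUS_DONE = "done"                # assumed value of JobSchedule.STATUS_DONE


@dataclass
class Task:
    """Simplified stand-in for JobSchedule (hypothetical)."""
    uuid: str
    function_name: str
    node_id: str
    status: str = "new"
    canceled: bool = False
    function_params: dict = field(default_factory=dict)


def find_pending_sync_del(tasks, node_id, lvol_bdev_name=None):
    """Return the uuid of a pending sync-delete task for the node, or False."""
    for task in tasks:
        if task.function_name != FN_LVOL_SYNC_DEL or task.node_id != node_id:
            continue
        if task.status == STATUS_DONE or task.canceled:
            continue
        # Without a bdev name, any pending task on the node matches.
        if not lvol_bdev_name or task.function_params.get("lvol_bdev_name") == lvol_bdev_name:
            return task.uuid
    return False


tasks = [
    Task("t1", FN_LVOL_SYNC_DEL, "node-a", status=STATUS_DONE,
         function_params={"lvol_bdev_name": "lvs/lvol1"}),
    Task("t2", FN_LVOL_SYNC_DEL, "node-a",
         function_params={"lvol_bdev_name": "lvs/lvol2"}),
]
print(find_pending_sync_del(tasks, "node-a", "lvs/lvol2"))
```

The early `continue` statements flatten the nested conditionals of the original without changing which task is returned.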

4 changes: 2 additions & 2 deletions simplyblock_core/env_var
@@ -1,6 +1,6 @@
SIMPLY_BLOCK_COMMAND_NAME=sbcli-dev
-SIMPLY_BLOCK_VERSION=19.2.23
+SIMPLY_BLOCK_VERSION=19.2.27

-SIMPLY_BLOCK_DOCKER_IMAGE=public.ecr.aws/simply-block/simplyblock:main
+SIMPLY_BLOCK_DOCKER_IMAGE=public.ecr.aws/simply-block/simplyblock:main-lvol-scheduler
SIMPLY_BLOCK_SPDK_ULTRA_IMAGE=public.ecr.aws/simply-block/ultra:main-latest

2 changes: 1 addition & 1 deletion simplyblock_core/models/cluster.py
@@ -45,7 +45,7 @@ class Cluster(BaseModel):
distr_npcs: int = 0
enable_node_affinity: bool = False
grafana_endpoint: str = ""
-mode: str = ""
+mode: str = "docker"
grafana_secret: str = ""
contact_point: str = ""
ha_type: str = "single"
1 change: 1 addition & 0 deletions simplyblock_core/models/job_schedule.py
@@ -22,6 +22,7 @@ class JobSchedule(BaseModel):
FN_BALANCING_AFTER_DEV_REMOVE = "balancing_on_dev_rem"
FN_BALANCING_AFTER_DEV_EXPANSION = "balancing_on_dev_add"
FN_JC_COMP_RESUME = "jc_comp_resume"
+FN_LVOL_SYNC_DEL = "lvol_sync_del"

canceled: bool = False
cluster_id: str = ""
1 change: 0 additions & 1 deletion simplyblock_core/models/storage_node.py
@@ -102,7 +102,6 @@ class StorageNode(BaseNodeObject):
hublvol: HubLVol = None # type: ignore[assignment]
active_tcp: bool = True
active_rdma: bool = False
-lvol_sync_del_queue: List[str] = []

def rpc_client(self, **kwargs):
"""Return rpc client to this node
14 changes: 7 additions & 7 deletions simplyblock_core/rpc_client.py
@@ -379,11 +379,11 @@ def create_lvol(self, name, size_in_mib, lvs_name, lvol_priority_class=0, ndcs=0
"clear_method": "unmap",
"lvol_priority_class": lvol_priority_class,
}
-# if ndcs or npcs:
-# params.update({
-# 'ndcs' : ndcs,
-# 'npcs' : npcs,
-# })
+if ndcs or npcs:
+params.update({
+'ndcs' : ndcs,
+'npcs' : npcs,
+})
return self._request("bdev_lvol_create", params)

def delete_lvol(self, name, del_async=False):
@@ -922,7 +922,7 @@ def distr_migration_status(self, name):
params = {"name": name}
return self._request("distr_migration_status", params)

-def distr_migration_failure_start(self, name, storage_ID, qos_high_priority=False, job_size=1024, jobs=4):
+def distr_migration_failure_start(self, name, storage_ID, qos_high_priority=False, job_size=64, jobs=64):
params = {
"name": name,
"storage_ID": storage_ID,
@@ -935,7 +935,7 @@ def distr_migration_failure_start(self, name, storage_ID, qos_high_priority=Fals
params["jobs"] = jobs
return self._request("distr_migration_failure_start", params)

-def distr_migration_expansion_start(self, name, qos_high_priority=False, job_size=1024, jobs=4):
+def distr_migration_expansion_start(self, name, qos_high_priority=False, job_size=64, jobs=64):
params = {
"name": name,
}
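Review note: with the `create_lvol` block uncommented, `ndcs` and `npcs` are sent to `bdev_lvol_create` only when at least one of them is non-zero, so callers that leave both at their default of 0 produce the same request as before. The sketch below shows just that conditional. `build_lvol_params` is a hypothetical helper, and only the two dictionary keys visible in the diff are included; the real request carries more fields.

```python
def build_lvol_params(lvol_priority_class=0, ndcs=0, npcs=0):
    """Build a reduced parameter dict; ndcs/npcs are added only when one is set."""
    params = {
        "clear_method": "unmap",
        "lvol_priority_class": lvol_priority_class,
    }
    if ndcs or npcs:
        # Both keys are sent together, even if one of them is still 0.
        params.update({"ndcs": ndcs, "npcs": npcs})
    return params


print(build_lvol_params())
print(build_lvol_params(ndcs=2, npcs=1))
```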
5 changes: 0 additions & 5 deletions simplyblock_core/scripts/charts/Chart.yaml
@@ -26,11 +26,6 @@ dependencies:
version: "25.18.0"
repository: "https://prometheus-community.github.io/helm-charts"
condition: monitoring.enabled
-- name: openebs
-version: 3.9.0
-repository: https://openebs.github.io/charts
-alias: openebs
-condition: openebs.enabled
- name: ingress-nginx
version: 4.10.1
repository: "https://kubernetes.github.io/ingress-nginx"