diff --git a/docs/en/14-reference/03-taos-sql/22-function.md b/docs/en/14-reference/03-taos-sql/22-function.md index 335b3b2f6873..0cd239d03ade 100644 --- a/docs/en/14-reference/03-taos-sql/22-function.md +++ b/docs/en/14-reference/03-taos-sql/22-function.md @@ -865,6 +865,49 @@ LTRIM(expr) **Applicable to**: Tables and supertables. +#### REGEXP_EXTRACT + +```sql +REGEXP_EXTRACT(expr, pattern [, group_idx]) +``` + +**Function Description**: Applies the POSIX extended regular expression `pattern` to `expr` and returns the substring matched by capture group `group_idx`. Returns NULL when there is no match or when `expr` or `pattern` is NULL. + +**Return Type**: Same as `expr` (VARCHAR or NCHAR). + +**Applicable Data Types**: `expr`: VARCHAR, NCHAR. `pattern`: VARCHAR, NCHAR. + +**Nested Subquery Support**: Applicable to both inner and outer queries. + +**Applicable to**: Tables and supertables. + +**Usage**: + +- If omitted, `group_idx` defaults to `1`. +- If provided as a non-`NULL` value, `group_idx` must be a non-negative integer constant. `0` returns the entire match; `1` returns the first capture group, `2` the second, and so on. The maximum value is 512. +- If `group_idx` is SQL `NULL`, the function returns `NULL`. +- Returns NULL if `group_idx` exceeds the number of capture groups in `pattern`, or if the addressed group did not participate in the match. +- `pattern` must be provided as a constant literal or parameter placeholder; it cannot reference a column or be computed from other expressions. 
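The `group_idx` rules listed above can be modelled in a few lines of Python. This is only a sketch of the documented behaviour, not TDengine code: `regexp_extract_model` is a hypothetical name, and Python's `re` module stands in for POSIX ERE. That substitution is an assumption that holds for simple patterns like the ones used here but not in general (for example, the two engines can choose different alternatives in an alternation).

```python
import re

MAX_GROUP_IDX = 512  # mirrors the documented maximum for group_idx


def regexp_extract_model(expr, pattern, group_idx=1):
    """Hypothetical model of the documented REGEXP_EXTRACT semantics."""
    # A non-NULL group_idx must be a non-negative integer no larger than 512.
    if group_idx is not None and not 0 <= group_idx <= MAX_GROUP_IDX:
        raise ValueError("group_idx must be between 0 and 512")
    # NULL in any argument propagates to a NULL (None) result.
    if expr is None or pattern is None or group_idx is None:
        return None
    match = re.search(pattern, expr)  # leftmost match; `re` is a stand-in for ERE
    if match is None:
        return None
    # An index beyond the pattern's capture groups yields NULL, not an error.
    if group_idx > match.re.groups:
        return None
    # group() returns None for a group that did not participate in the match.
    return match.group(group_idx)


date_pattern = r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
print(regexp_extract_model("2026-04-22", date_pattern))     # default group 1
print(regexp_extract_model("2026-04-22", date_pattern, 0))  # whole match
print(regexp_extract_model("b", "(a)|(b)", 1))              # group 1 did not participate
```

The model separates the one error case (an out-of-range index) from the cases that return NULL, which is the distinction the usage notes draw.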
+ +**Example**: + +```sql +taos> SELECT REGEXP_EXTRACT('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 1); + regexp_extract('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 1) | +======================================================================= + 2026 | + +taos> SELECT REGEXP_EXTRACT('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 0); + regexp_extract('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 0) | +======================================================================= + 2026-04-22 | + +taos> SELECT REGEXP_EXTRACT('no-digits-here', '[0-9]+', 1); + regexp_extract('no-digits-here', '[0-9]+', 1) | +=============================================== + NULL | +``` + #### REGEXP_IN_SET ```sql diff --git a/docs/zh/14-reference/03-taos-sql/22-function.md b/docs/zh/14-reference/03-taos-sql/22-function.md index 83872b2af85a..1fd1ccb7dc5d 100644 --- a/docs/zh/14-reference/03-taos-sql/22-function.md +++ b/docs/zh/14-reference/03-taos-sql/22-function.md @@ -1044,6 +1044,47 @@ taos> select position('d' in 'cba'); 0 | ``` +#### REGEXP_EXTRACT + +```sql +REGEXP_EXTRACT(expr, pattern [, group_idx]) +``` + +**功能说明**:对 `expr` 应用 POSIX 扩展正则表达式 `pattern`,返回第 `group_idx` 个捕获组匹配的子串。无匹配、`expr` 或 `pattern` 为 NULL 时返回 NULL。 + +**返回结果类型**:与 `expr` 相同(VARCHAR 或 NCHAR)。 + +**适用数据类型**:`expr`:VARCHAR、NCHAR;`pattern`:VARCHAR、NCHAR。 + +**嵌套子查询支持**:适用于内层查询和外层查询。 + +**适用于**:表和超级表。 + +**使用说明**: + +- `group_idx` 通常为非负整数常量,默认为 `1`。`0` 返回整个匹配串,`1` 返回第一个捕获组,`2` 返回第二个,以此类推,最大值为 512。若 `group_idx` 为 SQL `NULL`,则返回 `NULL`。 +- 若 `group_idx` 超过 `pattern` 中的捕获组数量,或对应捕获组未参与匹配,返回 NULL。 +- `pattern` 必须为常量(字面量或预处理占位符),不可引用列;不支持 `concat('a','b')` 这类常量表达式。 + +**举例**: + +```sql +taos> SELECT REGEXP_EXTRACT('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 1); + regexp_extract('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 1) | +======================================================================= + 2026 | + +taos> SELECT REGEXP_EXTRACT('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 0); + 
regexp_extract('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 0) | +======================================================================= + 2026-04-22 | + +taos> SELECT REGEXP_EXTRACT('no-digits-here', '[0-9]+', 1); + regexp_extract('no-digits-here', '[0-9]+', 1) | +=============================================== + NULL | +``` + #### REGEXP_IN_SET ```sql @@ -2369,11 +2410,11 @@ LAG(expr, offset[, default_val]) **使用说明**: - `offset` 必须为大于 0 的整数。 -- `default_val` 可选;当目标行不存在时返回该值,未指定时返回 `NULL`。 -- `default_val` 需要与 `expr` 类型兼容。 -- `LAG` 按输入结果集的行序计算;可以结合 `ORDER BY` 改变计算顺序。 -- 支持与 `_rowts`、`tbname`、标签列等一起查询,也支持在子查询和 `PARTITION BY` 场景中使用。 -- 与窗口一起使用时,`LAG` 仅在当前窗口内部按窗口内结果顺序计算,不会跨窗口继承上一窗口的状态。 +- `default_val` 可选;当目标行不存在时返回该值,未指定时返回 `NULL`。 +- `default_val` 需要与 `expr` 类型兼容。 +- `LAG` 按输入结果集的行序计算;可以结合 `ORDER BY` 改变计算顺序。 +- 支持与 `_rowts`、`tbname`、标签列等一起查询,也支持在子查询和 `PARTITION BY` 场景中使用。 +- 与窗口一起使用时,`LAG` 仅在当前窗口内部按窗口内结果顺序计算,不会跨窗口继承上一窗口的状态。 #### LEAD @@ -2392,11 +2433,11 @@ LEAD(expr, offset[, default_val]) **使用说明**: - `offset` 必须为大于 0 的整数。 -- `default_val` 可选;当目标行不存在时返回该值,未指定时返回 `NULL`。 -- `default_val` 需要与 `expr` 类型兼容。 -- `LEAD` 按输入结果集的行序计算;可以结合 `ORDER BY` 改变计算顺序。 -- 支持与 `_rowts`、`tbname`、标签列等一起查询,也支持在子查询和 `PARTITION BY` 场景中使用。 -- 与窗口一起使用时,`LEAD` 仅在当前窗口内部按窗口内结果顺序计算,不会跨窗口读取下一窗口的数据。 +- `default_val` 可选;当目标行不存在时返回该值,未指定时返回 `NULL`。 +- `default_val` 需要与 `expr` 类型兼容。 +- `LEAD` 按输入结果集的行序计算;可以结合 `ORDER BY` 改变计算顺序。 +- 支持与 `_rowts`、`tbname`、标签列等一起查询,也支持在子查询和 `PARTITION BY` 场景中使用。 +- 与窗口一起使用时,`LEAD` 仅在当前窗口内部按窗口内结果顺序计算,不会跨窗口读取下一窗口的数据。 #### MAX @@ -3155,11 +3196,11 @@ MAVG(expr, k) **适用于**:表和超级表。 -**使用说明**: - -- 不支持 +、-、*、/ 运算,如 mavg(col1, k1) + mavg(col2, k1); -- 只能与普通列,选择(Selection)、投影(Projection)函数一起使用,不能与聚合(Aggregation)函数一起使用; -- 与窗口一起使用时,`MAVG` 仅在当前窗口内部按样本顺序计算,不会跨窗口延续上一窗口的样本状态。 +**使用说明**: + +- 不支持 +、-、*、/ 运算,如 mavg(col1, k1) + mavg(col2, k1); +- 只能与普通列,选择(Selection)、投影(Projection)函数一起使用,不能与聚合(Aggregation)函数一起使用; +- 与窗口一起使用时,`MAVG` 仅在当前窗口内部按样本顺序计算,不会跨窗口延续上一窗口的样本状态。 #### STATECOUNT @@ 
-3182,9 +3223,9 @@ STATECOUNT(expr, oper, val) **适用于**:表和超级表。 -**使用说明**: - -- 与窗口一起使用时,`STATECOUNT` 仅统计当前窗口内部的连续记录,不会跨窗口累计。 +**使用说明**: + +- 与窗口一起使用时,`STATECOUNT` 仅统计当前窗口内部的连续记录,不会跨窗口累计。 #### STATEDURATION @@ -3208,9 +3249,9 @@ STATEDURATION(expr, oper, val, unit) **适用于**:表和超级表。 -**使用说明**: - -- 与窗口一起使用时,`STATEDURATION` 仅统计当前窗口内部满足条件的连续时长,不会跨窗口累计。 +**使用说明**: + +- 与窗口一起使用时,`STATEDURATION` 仅统计当前窗口内部满足条件的连续时长,不会跨窗口累计。 ### 时间加权统计 diff --git a/include/libs/function/functionMgt.h b/include/libs/function/functionMgt.h index 2b569a9e6ba6..7e4ae7b609dc 100644 --- a/include/libs/function/functionMgt.h +++ b/include/libs/function/functionMgt.h @@ -141,6 +141,7 @@ typedef enum EFunctionType { FUNCTION_TYPE_AES_DECRYPT, FUNCTION_TYPE_SM4_ENCRYPT, FUNCTION_TYPE_SM4_DECRYPT, + FUNCTION_TYPE_REGEXP_EXTRACT, // conversion function FUNCTION_TYPE_CAST = 2000, diff --git a/include/libs/scalar/scalar.h b/include/libs/scalar/scalar.h index 518b13da7b32..807351cc7d07 100644 --- a/include/libs/scalar/scalar.h +++ b/include/libs/scalar/scalar.h @@ -132,6 +132,11 @@ int32_t crc32Function(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOut int32_t findInSetFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput); int32_t likeInSetFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput); int32_t regexpInSetFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput); +int32_t regexpExtractFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput); + +// Maximum capture-group index accepted by regexp_extract() — shared between +// translate-time validation (builtins.c) and runtime validation (sclfunc.c). 
+#define REGEXP_EXTRACT_MAX_GROUP_IDX 512 int32_t generateTotpSecretFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput); int32_t generateTotpCodeFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput); diff --git a/source/libs/executor/src/externalwindowoperator.c b/source/libs/executor/src/externalwindowoperator.c index 758e08303d9a..79b433a80d05 100644 --- a/source/libs/executor/src/externalwindowoperator.c +++ b/source/libs/executor/src/externalwindowoperator.c @@ -2493,8 +2493,8 @@ static int32_t extWinApplyAggPostProjection(SOperatorInfo* pOperator, SExternalW SSDataBlock* pSlice = pExtW->pProjTmpBlock; TAOS_CHECK_EXIT(projectApplyFunctions(pExtW->projSupp.pExprInfo, pSlice, pSlice, pExtW->projSupp.pCtx, - pExtW->projSupp.numOfExprs, NULL, GET_STM_RTINFO(pOperator->pTaskInfo), - pOperator->pTaskInfo)); + pExtW->projSupp.numOfExprs, NULL, + GET_STM_RTINFO(pOperator->pTaskInfo), pOperator->pTaskInfo)); int32_t numOfCols = taosArrayGetSize(pBlock->pDataBlock); // TODO(perf): only copy back the slots actually written by projSupp, not all columns. diff --git a/source/libs/function/src/builtins.c b/source/libs/function/src/builtins.c index 67096a51b5a6..5e8e72de2060 100644 --- a/source/libs/function/src/builtins.c +++ b/source/libs/function/src/builtins.c @@ -1105,6 +1105,117 @@ static int32_t translateRand(SFunctionNode* pFunc, char* pErrBuf, int32_t len) { static int32_t translateSleep(SFunctionNode* pFunc, char* pErrBuf, int32_t len) { FUNC_ERR_RET(validateParam(pFunc, pErrBuf, len)); pFunc->node.resType = (SDataType){.bytes = tDataTypes[TSDB_DATA_TYPE_INT].bytes, .type = TSDB_DATA_TYPE_INT}; + + return TSDB_CODE_SUCCESS; +} + +static int32_t translateRegexpExtract(SFunctionNode* pFunc, char* pErrBuf, int32_t len) { + FUNC_ERR_RET(validateParam(pFunc, pErrBuf, len)); + int32_t numOfParams = LIST_LENGTH(pFunc->pParameterList); + + // param[1]: pattern must be a literal/parameter constant VALUE node. 
+ // Constant expressions are not accepted here because regexp_extract + // currently validates only VALUE nodes. + SNode* pPatNode = nodesListGetNode(pFunc->pParameterList, 1); + if (QUERY_NODE_VALUE != nodeType(pPatNode)) { + return invaildFuncParaTypeErrMsg( + pErrBuf, len, "regexp_extract: pattern must be a literal or parameter constant"); + } + + // Validate the regex pattern compiles as POSIX ERE. + // For prepared-statement placeholders, literal may contain the placeholder + // token (for example "?") instead of the bound pattern. Prefer the + // materialized datum when available, and otherwise defer validation to + // runtime for placeholders. For NCHAR patterns datum.p holds UCS-4 vardata; + // convert it to UTF-8 to match the runtime path in regexpExtractFunction. + SValueNode* pPatVal = (SValueNode*)pPatNode; + { + const char* regPattern = NULL; + char* utf8Pat = NULL; + bool freeUtf8Pat = false; + bool deferValidation = (pPatVal->placeholderNo != 0 && pPatVal->datum.p == NULL); + + if (!deferValidation) { + if (pPatVal->node.resType.type == TSDB_DATA_TYPE_NCHAR && pPatVal->datum.p != NULL) { + int32_t ncharBytes = varDataLen(pPatVal->datum.p); + utf8Pat = taosMemoryCalloc(ncharBytes + 1, 1); + if (utf8Pat == NULL) return terrno; + int32_t utf8Len = taosUcs4ToMbs((TdUcs4*)varDataVal(pPatVal->datum.p), ncharBytes, + utf8Pat, pPatVal->charsetCxt); + if (utf8Len < 0) { + taosMemoryFree(utf8Pat); + return buildFuncErrMsg(pErrBuf, len, TSDB_CODE_PAR_REGULAR_EXPRESSION_ERROR, + "regexp_extract: failed to convert NCHAR pattern to UTF-8"); + } + utf8Pat[utf8Len] = '\0'; + regPattern = utf8Pat; + freeUtf8Pat = true; + } else if (pPatVal->datum.p != NULL) { + // datum.p is a length-prefixed vardata buffer — not NUL-terminated. + // Build a NUL-terminated copy for regcomp(). 
+ int32_t patBytes = varDataLen(pPatVal->datum.p); + utf8Pat = taosMemoryMalloc(patBytes + 1); + if (utf8Pat == NULL) return terrno; + (void)memcpy(utf8Pat, varDataVal(pPatVal->datum.p), patBytes); + utf8Pat[patBytes] = '\0'; + regPattern = utf8Pat; + freeUtf8Pat = true; + } else { + regPattern = pPatVal->literal; + } + } + + if (regPattern != NULL) { + regex_t re; + int ret = regcomp(&re, regPattern, REG_EXTENDED); + if (ret != 0) { + char msgbuf[256] = {0}; + (void)regerror(ret, NULL, msgbuf, sizeof(msgbuf)); + // do not call regfree — regcomp failed, re is partially initialised (POSIX) + if (freeUtf8Pat) taosMemoryFree(utf8Pat); + return buildFuncErrMsg(pErrBuf, len, TSDB_CODE_PAR_REGULAR_EXPRESSION_ERROR, + "Invalid regex pattern for regexp_extract: %s", msgbuf); + } + regfree(&re); // only reached when regcomp succeeded + } + if (freeUtf8Pat) taosMemoryFree(utf8Pat); + } + + // param[2]: group_idx (optional) must be a non-negative integer constant. + // NULL is also allowed by the builtin signature and should propagate like + // other scalar functions, so accept NULL-typed value nodes here and rely + // on runtime to return a NULL result. + if (numOfParams == 3) { + SNode* pIdxNode = nodesListGetNode(pFunc->pParameterList, 2); + if (QUERY_NODE_VALUE != nodeType(pIdxNode)) { + return invaildFuncParaTypeErrMsg(pErrBuf, len, "regexp_extract: group_idx must be a constant integer"); + } + + SValueNode* pIdxVal = (SValueNode*)pIdxNode; + int32_t idxType = pIdxVal->node.resType.type; + + if (TSDB_DATA_TYPE_NULL != idxType) { + if (!IS_INTEGER_TYPE(idxType)) { + return invaildFuncParaTypeErrMsg(pErrBuf, len, "regexp_extract: group_idx must be an integer"); + } + // Skip range validation for prepared-statement placeholders — the bound value + // is not yet known; the runtime check in regexpExtractFunction applies instead. 
+ if (pIdxVal->placeholderNo == 0) { + int64_t groupIdx = pIdxVal->datum.i; + if (groupIdx < 0 || groupIdx > REGEXP_EXTRACT_MAX_GROUP_IDX) { + char errmsg[64]; + (void)snprintf(errmsg, sizeof(errmsg), + "regexp_extract: group_idx must be between 0 and %d", + REGEXP_EXTRACT_MAX_GROUP_IDX); + return invaildFuncParaValueErrMsg(pErrBuf, len, errmsg); + } + } + } + } + + // Return type matches str (param[0]): same VARCHAR/NCHAR type and byte width + pFunc->node.resType = *getSDataTypeFromNode(nodesListGetNode(pFunc->pParameterList, 0)); + return TSDB_CODE_SUCCESS; } @@ -7441,6 +7552,41 @@ const SBuiltinFuncDefinition funcMgtBuiltins[] = { .sprocessFunc = sleepFunction, .finalizeFunc = NULL }, + { + .name = "regexp_extract", + .type = FUNCTION_TYPE_REGEXP_EXTRACT, + .classification = FUNC_MGT_SCALAR_FUNC | FUNC_MGT_STRING_FUNC, + .parameters = {.minParamNum = 2, + .maxParamNum = 3, + .paramInfoPattern = 1, + .inputParaInfo[0][0] = {.isLastParam = false, + .startParam = 1, + .endParam = 1, + .validDataType = FUNC_PARAM_SUPPORT_VARCHAR_TYPE | FUNC_PARAM_SUPPORT_NCHAR_TYPE | FUNC_PARAM_SUPPORT_NULL_TYPE, + .validNodeType = FUNC_PARAM_SUPPORT_EXPR_NODE, + .paramAttribute = FUNC_PARAM_NO_SPECIFIC_ATTRIBUTE, + .valueRangeFlag = FUNC_PARAM_NO_SPECIFIC_VALUE,}, + .inputParaInfo[0][1] = {.isLastParam = false, + .startParam = 2, + .endParam = 2, + .validDataType = FUNC_PARAM_SUPPORT_VARCHAR_TYPE | FUNC_PARAM_SUPPORT_NCHAR_TYPE | FUNC_PARAM_SUPPORT_NULL_TYPE, + .validNodeType = FUNC_PARAM_SUPPORT_VALUE_NODE, + .paramAttribute = FUNC_PARAM_NO_SPECIFIC_ATTRIBUTE, + .valueRangeFlag = FUNC_PARAM_NO_SPECIFIC_VALUE,}, + .inputParaInfo[0][2] = {.isLastParam = true, + .startParam = 3, + .endParam = 3, + .validDataType = FUNC_PARAM_SUPPORT_INTEGER_TYPE | FUNC_PARAM_SUPPORT_NULL_TYPE, + .validNodeType = FUNC_PARAM_SUPPORT_VALUE_NODE, + .paramAttribute = FUNC_PARAM_NO_SPECIFIC_ATTRIBUTE, + .valueRangeFlag = FUNC_PARAM_NO_SPECIFIC_VALUE,}, + .outputParaInfo = {.validDataType = 
FUNC_PARAM_SUPPORT_VARCHAR_TYPE | FUNC_PARAM_SUPPORT_NCHAR_TYPE}}, + .translateFunc = translateRegexpExtract, + .getEnvFunc = NULL, + .initFunc = NULL, + .sprocessFunc = regexpExtractFunction, + .finalizeFunc = NULL, + }, }; // clang-format on diff --git a/source/libs/scalar/src/sclfunc.c b/source/libs/scalar/src/sclfunc.c index d0d0f787f69b..ea14b3c06b65 100644 --- a/source/libs/scalar/src/sclfunc.c +++ b/source/libs/scalar/src/sclfunc.c @@ -1817,6 +1817,228 @@ static int32_t base32Encode(const uint8_t *in, int32_t inLen, char *out) { return outLen; } +int32_t regexpExtractFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput) { + int32_t code = TSDB_CODE_SUCCESS; + + int32_t numOfRows = pInput[0].numOfRows; + SColumnInfoData *pStrData = pInput[0].columnData; + SColumnInfoData *pPatData = pInput[1].columnData; + SColumnInfoData *pOutputData = pOutput->columnData; + + if (numOfRows == 0) { + pOutput->numOfRows = 0; + return TSDB_CODE_SUCCESS; + } + + if (IS_NULL_TYPE(GET_PARAM_TYPE(&pInput[0])) || IS_NULL_TYPE(GET_PARAM_TYPE(&pInput[1]))) { + colDataSetNNULL(pOutputData, 0, numOfRows); + pOutput->numOfRows = numOfRows; + return TSDB_CODE_SUCCESS; + } + + if (colDataIsNull_s(pPatData, 0)) { + colDataSetNNULL(pOutputData, 0, numOfRows); + pOutput->numOfRows = numOfRows; + return TSDB_CODE_SUCCESS; + } + + // Get group_idx (default 1; param[2] is an optional integer constant). + // Read into int64_t first to avoid silent truncation/wrap for BIGINT/UBIGINT + // placeholder values before the range check, then cast after validation. + // An explicit SQL NULL group_idx propagates NULL to all output rows. 
+ int64_t groupIdxRaw = 1; + if (inputNum == 3) { + if (IS_NULL_TYPE(GET_PARAM_TYPE(&pInput[2])) || colDataIsNull_s(pInput[2].columnData, 0)) { + colDataSetNNULL(pOutputData, 0, numOfRows); + pOutput->numOfRows = numOfRows; + return TSDB_CODE_SUCCESS; + } + GET_TYPED_DATA(groupIdxRaw, int64_t, GET_PARAM_TYPE(&pInput[2]), + colDataGetData(pInput[2].columnData, 0), + typeGetTypeModFromColInfo(&pInput[2].columnData->info)); + } + if (groupIdxRaw < 0 || groupIdxRaw > REGEXP_EXTRACT_MAX_GROUP_IDX) { + pOutput->numOfRows = numOfRows; + SCL_ERR_RET(TSDB_CODE_FUNC_FUNTION_PARA_VALUE); + } + int32_t groupIdx = (int32_t)groupIdxRaw; + + // Build null-terminated UTF-8 pattern string (pattern is a constant, always 1 row) + char patBuf[512]; + char *patStr = patBuf; + int32_t patLen = 0; + bool needFreePat = false; + { + char *rawPat = varDataVal(colDataGetData(pPatData, 0)); + int32_t rawPatLen = varDataLen(colDataGetData(pPatData, 0)); + if (GET_PARAM_TYPE(&pInput[1]) == TSDB_DATA_TYPE_NCHAR) { + if (rawPatLen == 0) { + patLen = 0; + patStr = patBuf; + patStr[0] = '\0'; + } else { + patStr = NULL; // ensure convNcharToVarchar always mallocs a fresh heap buffer + code = convNcharToVarchar(rawPat, &patStr, rawPatLen, &patLen, pInput[1].charsetCxt); + if (code != TSDB_CODE_SUCCESS) goto _exit; + needFreePat = true; + // convNcharToVarchar allocates rawPatLen bytes (no +1 for NUL); when the + // UTF-8 output fills the buffer entirely there is no room for a terminator. + // threadGetRegComp requires a NUL-terminated string — grow by one byte. 
+ char *tmp = taosMemoryRealloc(patStr, patLen + 1); + if (tmp == NULL) { + taosMemoryFree(patStr); + needFreePat = false; + code = terrno; + goto _exit; + } + patStr = tmp; + patStr[patLen] = '\0'; + } + } else { + patLen = rawPatLen; + if (patLen >= (int32_t)sizeof(patBuf)) { + patStr = taosMemoryMalloc(patLen + 1); + if (patStr == NULL) { + code = terrno; + goto _exit; + } + needFreePat = true; + } + (void)memcpy(patStr, rawPat, patLen); + patStr[patLen] = '\0'; + } + } + + // Compile (or retrieve cached) regex; pattern is constant so cache hits every row + regex_t *regex = NULL; + code = threadGetRegComp(&regex, patStr); + if (code != 0) { + terrno = code; + goto _exit; + } + + // regmatch_t array: index 0 = whole match, 1..groupIdx = capture groups. + // Initialize all entries to -1 so any submatch slots not written by regexec + // (for example when groupIdx exceeds regex->re_nsub) remain deterministic. + int32_t nmatch = groupIdx + 1; + regmatch_t *pmatch = taosMemoryMalloc(nmatch * sizeof(regmatch_t)); + if (pmatch == NULL) { + code = terrno; + goto _exit; + } + (void)memset(pmatch, 0xFF, nmatch * sizeof(regmatch_t)); + + // Each output cell is a VarData value, and for var-length types info.bytes + // already includes the VARSTR_HEADER_SIZE length prefix plus payload space. 
+ int32_t outBufLen = pStrData->info.bytes; + char *outBuf = taosMemoryMalloc(outBufLen); + if (outBuf == NULL) { + taosMemoryFree(pmatch); + code = terrno; + goto _exit; + } + + int32_t strType = GET_PARAM_TYPE(&pInput[0]); + bool isNchar = (strType == TSDB_DATA_TYPE_NCHAR); + + // Null-termination buffer shared across rows — grown via realloc only when needed + char *strNt = NULL; + int32_t strNtCap = 0; + + for (int32_t i = 0; i < numOfRows; i++) { + if (colDataIsNull_s(pStrData, i)) { + colDataSetNULL(pOutputData, i); + continue; + } + + char *strRaw = colDataGetData(pStrData, i); + char *strVal = varDataVal(strRaw); + int32_t strLen = varDataLen(strRaw); + + // Grow the null-termination buffer only when the current row needs more space. + // For NCHAR: UTF-8 output is at most strLen bytes (UCS-4 byte count >= UTF-8 byte count), + // so strLen + 1 is a safe upper bound for both NCHAR and VARCHAR paths. + if (strLen + 1 > strNtCap) { + char *tmp = taosMemoryRealloc(strNt, strLen + 1); + if (tmp == NULL) { + code = terrno; + break; + } + strNt = tmp; + strNtCap = strLen + 1; + } + + // Convert input into the NUL-terminated UTF-8 scratch buffer. + // For NCHAR: convert UCS-4 directly into strNt — avoids per-row malloc/free. + // For VARCHAR: data is already UTF-8, just copy it. 
+ int32_t strUtf8Len; + if (isNchar) { + strUtf8Len = taosUcs4ToMbs((TdUcs4 *)strVal, strLen, strNt, pInput[0].charsetCxt); + if (strUtf8Len < 0) { + code = TSDB_CODE_SCALAR_CONVERT_ERROR; + terrno = code; + break; + } + } else { + (void)memcpy(strNt, strVal, strLen); + strUtf8Len = strLen; + } + strNt[strUtf8Len] = '\0'; + + int ret = regexec(regex, strNt, nmatch, pmatch, 0); + if (ret == REG_NOMATCH || (ret == 0 && pmatch[groupIdx].rm_so == -1)) { + // no match, or the requested capture group did not participate + colDataSetNULL(pOutputData, i); + } else if (ret != 0) { + // real regex execution error — capture the reason for production debugging + char msgbuf[256] = {0}; + (void)regerror(ret, regex, msgbuf, sizeof(msgbuf)); + qDebug("REGEXP_EXTRACT: regexec failed for pattern '%s', reason: %s", patStr, msgbuf); + code = TSDB_CODE_PAR_REGULAR_EXPRESSION_ERROR; + terrno = code; + break; + } else { + int32_t matchStart = pmatch[groupIdx].rm_so; + int32_t matchLen = pmatch[groupIdx].rm_eo - pmatch[groupIdx].rm_so; + + if (isNchar) { + // Convert matched UTF-8 bytes back to NCHAR (UCS-4) directly into outBuf + // to avoid a per-row malloc/free cycle. + // outBuf data capacity (outBufLen - VARSTR_HEADER_SIZE) >= N*TSDB_NCHAR_SIZE + // which is always >= matchedCodepoints*TSDB_NCHAR_SIZE. 
+ int32_t matchedNcharLen = 0; + bool ok = taosMbsToUcs4(strNt + matchStart, matchLen, + (TdUcs4 *)(outBuf + VARSTR_HEADER_SIZE), + outBufLen - VARSTR_HEADER_SIZE, + &matchedNcharLen, pInput[0].charsetCxt); + if (!ok) { + code = TSDB_CODE_SCALAR_CONVERT_ERROR; + terrno = code; + break; + } + *(VarDataLenT *)outBuf = matchedNcharLen; + code = colDataSetVal(pOutputData, i, outBuf, false); + if (code != TSDB_CODE_SUCCESS) terrno = code; + } else { + *(VarDataLenT *)outBuf = matchLen; + (void)memcpy(outBuf + VARSTR_HEADER_SIZE, strNt + matchStart, matchLen); + code = colDataSetVal(pOutputData, i, outBuf, false); + if (code != TSDB_CODE_SUCCESS) terrno = code; + } + } + + if (code != TSDB_CODE_SUCCESS) break; + } + + taosMemoryFree(strNt); + taosMemoryFree(outBuf); + taosMemoryFree(pmatch); +_exit: + if (needFreePat) taosMemoryFree(patStr); + pOutput->numOfRows = numOfRows; + return code; +} + int32_t generateTotpSecretFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput) { SColumnInfoData *pInputData = pInput->columnData; SColumnInfoData *pOutputData = pOutput->columnData; diff --git a/test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py b/test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py new file mode 100644 index 000000000000..5eed7e1e4f7c --- /dev/null +++ b/test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py @@ -0,0 +1,345 @@ +from new_test_framework.utils import tdLog, tdSql +import datetime + + +class TestFunRegexpExtract: + + def setup_class(cls): + cls.replicaVar = 1 + tdLog.debug(f"start to execute {__file__}") + + # ------------------------------------------------------------------ + # Helpers + # ------------------------------------------------------------------ + + def _create_tb(self, dbname="db"): + tdSql.execute(f"""CREATE STABLE {dbname}.st ( + ts TIMESTAMP, vc VARCHAR(128), nc NCHAR(64), iv INT + ) TAGS (t INT)""") + tdSql.execute(f"CREATE TABLE {dbname}.ct1 USING {dbname}.st TAGS(1)") + 
tdSql.execute(f"CREATE TABLE {dbname}.ct2 USING {dbname}.st TAGS(2)") + tdSql.execute(f"""CREATE TABLE {dbname}.nt ( + ts TIMESTAMP, vc VARCHAR(128), nc NCHAR(64), iv INT + )""") + + def _insert_data(self, dbname="db"): + now = int(datetime.datetime.timestamp(datetime.datetime.now()) * 1000) + # ct1: log-style rows + one NULL row + ct1_rows = [ + (now - 4000, "'code=42,type=DISK_FULL'", "'code=42,type=DISK_FULL'", 42), + (now - 3000, "'code=7,type=LOW_MEM'", "'code=7,type=LOW_MEM'", 7), + (now - 2000, "'code=0,type=OK'", "'code=0,type=OK'", 0), + (now - 1000, "NULL", "NULL", "NULL"), + ] + for ts, vc, nc, iv in ct1_rows: + tdSql.execute(f"INSERT INTO {dbname}.ct1 VALUES({ts}, {vc}, {nc}, {iv})") + # ct2: URL-style rows + ct2_rows = [ + (now - 3000, "'https://example.com'", "'https://example.com'", 1), + (now - 2000, "'http://api.example.org'", "'http://api.example.org'", 2), + (now - 1000, "'ftp://files.example.net'", "'ftp://files.example.net'", 3), + ] + for ts, vc, nc, iv in ct2_rows: + tdSql.execute(f"INSERT INTO {dbname}.ct2 VALUES({ts}, {vc}, {nc}, {iv})") + # nt: same as ct1 + for ts, vc, nc, iv in ct1_rows: + tdSql.execute(f"INSERT INTO {dbname}.nt VALUES({ts}, {vc}, {nc}, {iv})") + + def _check_basic(self, dbname="db"): + # ----------------------------------------------------------------- + # §1 Default group_idx=1 — no-table queries + # ----------------------------------------------------------------- + # RXE-BASIC-001: single capture group → group 1 + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'b') + + # RXE-BASIC-002: multiple capture groups, default → group 1 only + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)(c)')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'b') + + # RXE-BASIC-003: no capture group, default group_idx=1 → NULL + tdSql.query("SELECT REGEXP_EXTRACT('abc', 'b')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # 
----------------------------------------------------------------- + # §2 group_idx=0 whole match + # ----------------------------------------------------------------- + # RXE-GRP0-001: no capture group, group_idx=0 → whole match + tdSql.query("SELECT REGEXP_EXTRACT('abc', 'b', 0)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'b') + + # RXE-GRP0-002: with capture group, group_idx=0 → whole match ≠ group 1 + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)c', 0)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'bc') + + # RXE-GRP0-003: no match, group_idx=0 → NULL + tdSql.query("SELECT REGEXP_EXTRACT('abc', 'x+', 0)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # ----------------------------------------------------------------- + # §3 Multiple capture groups by explicit index + # ----------------------------------------------------------------- + # RXE-GRP-001: explicit group_idx=1 → group 1 + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)(c)', 1)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'b') + + # RXE-GRP-002: explicit group_idx=2 → group 2 + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)(c)', 2)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'c') + + # RXE-GRP-003: group_idx out of range → NULL, no error + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)(c)', 3)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # ----------------------------------------------------------------- + # §4 NULL and no-match + # ----------------------------------------------------------------- + # RXE-NULL-001: str=NULL → NULL + tdSql.query("SELECT REGEXP_EXTRACT(NULL, '(a+)')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # RXE-NULL-002: no match → NULL + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(x+)')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # RXE-NULL-003: multiple matches, only first (leftmost) returned + tdSql.query("SELECT REGEXP_EXTRACT('a1b2', '([0-9])')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, '1') + + # RXE-NULL-004: str=NULL 
with group_idx=0 → NULL + tdSql.query("SELECT REGEXP_EXTRACT(NULL, 'a+', 0)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # RXE-NULL-005: explicit NULL group_idx → NULL + tdSql.query("SELECT REGEXP_EXTRACT('abc', '(b)', NULL)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # RXE-NULL-006: non-participating group in alternation → NULL + # pattern '(a)|(b)' matches 'b' via group 2; group 1 did not participate + tdSql.query("SELECT REGEXP_EXTRACT('b', '(a)|(b)', 1)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # RXE-NULL-007: participating group 2 returns matched content + tdSql.query("SELECT REGEXP_EXTRACT('b', '(a)|(b)', 2)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, 'b') + + # RXE-NULL-008: pattern=NULL → NULL + tdSql.query("SELECT REGEXP_EXTRACT('abc', NULL)") + tdSql.checkRows(1) + tdSql.checkData(0, 0, None) + + # ----------------------------------------------------------------- + # §5 Empty string scenarios + # ----------------------------------------------------------------- + # RXE-EMPTY-001: capture group matches empty string → '' (not NULL) + tdSql.query("SELECT REGEXP_EXTRACT('ac', '(b?)')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, '') + + # RXE-EMPTY-002: empty input str with zero-length match → '' + tdSql.query("SELECT REGEXP_EXTRACT('', '(a*)')") + tdSql.checkRows(1) + tdSql.checkData(0, 0, '') + + # ----------------------------------------------------------------- + # §6 Table queries — per-row scalar behavior + # ----------------------------------------------------------------- + # RXE-TBL-001: extract numeric code — multiple rows match 'code=([0-9]+)'; + # verify row-by-row extraction for 42, 7, 0, and NULL propagation + tdSql.query(f"SELECT REGEXP_EXTRACT(vc, 'code=([0-9]+)') FROM {dbname}.ct1 ORDER BY ts") + tdSql.checkRows(4) + tdSql.checkData(0, 0, '42') + tdSql.checkData(1, 0, '7') + tdSql.checkData(2, 0, '0') + tdSql.checkData(3, 0, None) # NULL row → NULL + + # RXE-TBL-002: NULL column row → NULL; non-NULL 
rows → extracted value
+        tdSql.query(f"SELECT REGEXP_EXTRACT(vc, 'type=([A-Z_]+)') FROM {dbname}.ct1 ORDER BY ts")
+        tdSql.checkRows(4)
+        tdSql.checkData(0, 0, 'DISK_FULL')
+        tdSql.checkData(1, 0, 'LOW_MEM')
+        tdSql.checkData(2, 0, 'OK')
+        tdSql.checkData(3, 0, None)
+
+        # RXE-TBL-003: empty table → 0 rows, no error
+        tdSql.execute(f"CREATE TABLE IF NOT EXISTS {dbname}.empty_t (ts TIMESTAMP, vc VARCHAR(64))")
+        tdSql.query(f"SELECT REGEXP_EXTRACT(vc, '([0-9]+)') FROM {dbname}.empty_t")
+        tdSql.checkRows(0)
+
+        # -----------------------------------------------------------------
+        # §7 WHERE clause
+        # -----------------------------------------------------------------
+        # RXE-WHERE-001: IS NOT NULL filters to rows with a match
+        tdSql.query(f"SELECT vc FROM {dbname}.ct1 "
+                    "WHERE REGEXP_EXTRACT(vc, 'code=([4-9][0-9]+)') IS NOT NULL "
+                    "ORDER BY ts")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, 'code=42,type=DISK_FULL')
+
+        # RXE-WHERE-002: equality on extracted scheme value
+        tdSql.query(f"SELECT vc FROM {dbname}.ct2 "
+                    "WHERE REGEXP_EXTRACT(vc, '(https?)://') = 'https' "
+                    "ORDER BY ts")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, 'https://example.com')
+
+        # -----------------------------------------------------------------
+        # §8 NCHAR column: extraction result equals VARCHAR equivalent
+        # -----------------------------------------------------------------
+        # RXE-NCHAR-001: NCHAR input yields same extracted value as VARCHAR
+        tdSql.query(f"SELECT REGEXP_EXTRACT(nc, 'code=([0-9]+)') FROM {dbname}.ct1 ORDER BY ts")
+        tdSql.checkRows(4)
+        tdSql.checkData(0, 0, '42')
+        tdSql.checkData(1, 0, '7')
+        tdSql.checkData(2, 0, '0')
+        tdSql.checkData(3, 0, None)
+
+        # -----------------------------------------------------------------
+        # §9 Subquery with GROUP BY
+        # -----------------------------------------------------------------
+        # RXE-SUB-001: group by extracted URL scheme
+        tdSql.query(f"""SELECT scheme, COUNT(*) AS cnt
+                        FROM (SELECT REGEXP_EXTRACT(vc, '(https?)://') AS scheme FROM {dbname}.ct2) t
+                        WHERE scheme IS NOT NULL
+                        GROUP BY scheme
+                        ORDER BY scheme""")
+        tdSql.checkRows(2)
+        tdSql.checkData(0, 0, 'http')
+        tdSql.checkData(0, 1, 1)
+        tdSql.checkData(1, 0, 'https')
+        tdSql.checkData(1, 1, 1)
+
+        # -----------------------------------------------------------------
+        # §10 ERE features (character class, anchors, case sensitivity)
+        # -----------------------------------------------------------------
+        # RXE-RE-001: character class extracts decimal number
+        tdSql.query("SELECT REGEXP_EXTRACT('v=3.14', '([0-9]+\\.[0-9]+)')")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, '3.14')
+
+        # RXE-RE-002a: anchor ^ matches at start → 'a'
+        tdSql.query("SELECT REGEXP_EXTRACT('abc', '^(a)')")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, 'a')
+
+        # RXE-RE-002b: anchor ^ requires position 0; 'x' blocks match → NULL
+        tdSql.query("SELECT REGEXP_EXTRACT('xabc', '^(a)')")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, None)
+
+        # RXE-RE-003a: case-sensitive by default → NULL
+        tdSql.query("SELECT REGEXP_EXTRACT('ABC', '(abc)')")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, None)
+
+        # RXE-RE-003b: LOWER() enables case-insensitive extraction → 'abc'
+        tdSql.query("SELECT REGEXP_EXTRACT(LOWER('ABC'), '(abc)')")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, 'abc')
+
+    def _check_error(self, dbname="db"):
+        # -----------------------------------------------------------------
+        # §11 Error cases
+        # -----------------------------------------------------------------
+        # RXE-ERR-001: too few arguments (1)
+        tdSql.error("SELECT REGEXP_EXTRACT('abc')")
+
+        # RXE-ERR-002: too many arguments (4)
+        tdSql.error("SELECT REGEXP_EXTRACT('abc', '(b)', 1, 0)")
+
+        # RXE-ERR-003: str is non-string type (INT column)
+        tdSql.error(f"SELECT REGEXP_EXTRACT(iv, '([0-9]+)') FROM {dbname}.ct1")
+
+        # RXE-ERR-004: pattern is a column reference (not a constant)
+        tdSql.error(f"SELECT REGEXP_EXTRACT(vc, vc) FROM {dbname}.ct1")
+
+        # RXE-ERR-005: negative group_idx → translation-phase error
+        tdSql.error("SELECT REGEXP_EXTRACT('abc', '(b)', -1)")
+
+        # RXE-ERR-006: invalid regex (unmatched parenthesis)
+        tdSql.error("SELECT REGEXP_EXTRACT('abc', '(b', 1)")
+
+        # RXE-ERR-007: group_idx exceeds maximum (512)
+        tdSql.error("SELECT REGEXP_EXTRACT('abc', '(b)', 513)")
+
+    def _check_doc_examples(self):
+        # -----------------------------------------------------------------
+        # §12 Doc examples — verify the three queries from the user manual
+        # -----------------------------------------------------------------
+        # RXE-DOC-001: date string, group 1 → year
+        tdSql.query("SELECT REGEXP_EXTRACT('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 1)")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, '2026')
+
+        # RXE-DOC-002: date string, group 0 → whole match
+        tdSql.query("SELECT REGEXP_EXTRACT('2026-04-22', '([0-9]{4})-([0-9]{2})-([0-9]{2})', 0)")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, '2026-04-22')
+
+        # RXE-DOC-003: no match → NULL
+        tdSql.query("SELECT REGEXP_EXTRACT('no-digits-here', '[0-9]+', 1)")
+        tdSql.checkRows(1)
+        tdSql.checkData(0, 0, None)
+
+    def all_test(self, dbname="db"):
+        self._check_basic(dbname)
+        self._check_error(dbname)
+        self._check_doc_examples()
+
+    def test_fun_sca_regexp_extract(self):
+        """Fun: regexp_extract()
+
+        1. regexp_extract default group_idx=1 returns first capture group
+        2. regexp_extract group_idx=0 returns whole match substring
+        3. regexp_extract with explicit group index (1, 2, out-of-range)
+        4. regexp_extract NULL input (str, pattern, group_idx) and no-match return NULL
+        5. regexp_extract capture group matching empty string returns ''
+        6. regexp_extract on table columns with per-row scalar semantics
+        7. regexp_extract in WHERE clause for row filtering
+        8. regexp_extract on NCHAR column (return type NCHAR)
+        9. regexp_extract in subquery with GROUP BY
+        10. regexp_extract POSIX ERE features: character class, anchors, case sensitivity
+        11. regexp_extract invalid parameter error cases (including group_idx > 512)
+        12. regexp_extract user-manual doc examples
+
+        Since: v3.4.2.0
+
+        Labels: common,ci
+
+        Jira: None
+
+        History:
+            - 2026-04-20 Stephen Created
+        """
+        dbname = "db"
+        tdSql.prepare()
+
+        tdLog.printNoPrefix("==========step1:create table")
+        self._create_tb(dbname)
+
+        tdLog.printNoPrefix("==========step2:insert data")
+        self._insert_data(dbname)
+
+        tdLog.printNoPrefix("==========step3:all check")
+        self.all_test(dbname)
+
+        tdSql.execute(f"flush database {dbname}")
+
+        tdLog.printNoPrefix("==========step4:after wal, all check again")
+        self.all_test(dbname)
diff --git a/test/ci/cases.task b/test/ci/cases.task
index 7b415ae1e8a3..a2217d5e0c01 100644
--- a/test/ci/cases.task
+++ b/test/ci/cases.task
@@ -455,6 +455,7 @@
 ,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_to_iso8601.py
 ,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_to_timestamp.py
 ,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_to_unixtimestamp.py
+,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py
 ,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_today.py
 ,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_upper.py
 ,,y,.,./ci/pytest.sh pytest cases/11-Functions/01-Scalar/test_fun_sca_cast_blob.py