In order to accelerate expression evaluation, we recently refactored its framework. This tutorial will show you how to use the new computational framework to rewrite or add a built-in function in TiDB.
Table of Content
The overall process
-
Select any function to your interest from the expression directory, assuming the function is named XX.
-
Override the
XXFunctionClass.getFunction()
method:This method refers to MySQL rules, inferring the return value type according to the parameter of the built-in function.
Different function signatures will be generated based on the number & type of the parameters, and the return value type of the function.
See detailed description of the function signature in the appendix at the end of this article.
-
Implement the
evalYY()
method on all the function signatures corresponding to the built-in function. YY represents the return value type of the function signature. -
Add tests
In the expression directory, refine tests about the implementation of the given function in the
TestXX()
method.In the executor directory, add tests at the SQL level.
-
Run
make dev
and ensure that all the test cases pass.
Example:
Let’s look at the PR of overriding the LENGTH ()
function for detailed explanation:
Take a look at expression/builtin_string.go
:
First, let’s take a look at the expression/builtin_string.go
file:
-
Implement the
lengthFunctionClass.getFunction()
method. This method mainly accomplishes two tasks:-
Infer the return value type of the
LEGNTH
function according to MySQL rules. -
Generate function signature based on the number & type of parameters, and return value type of the
LENGTH
function. Because theLENGTH
function only has one number & type of parameters, and return value type, we don’t need to define a type for the new function signature. Instead, we modified the existingbuiltinLengthSig
, so that it could be composite withbaseIntBuiltinFunc
, which means that the return value type in the given function signature is int.
type builtinLengthSig struct { baseIntBuiltinFunc } func (c *lengthFunctionClass) getFunction(args []Expression, ctx context.Context) (builtinFunc, error) { //Infer the return value type of `LEGNTH` function according to MySQL rules tp := types.NewFieldType(mysql.TypeLonglong) tp.Flen = 10 types.SetBinChsClnFlag(tp) //Generate function signature based on the number & type of parameters, and return value type. Note that after refactoring, instead of the `newBaseBuiltinFunc` method, the `newBaseBuiltinFuncWithTp` method is used here. //In the `newBaseBuiltinFuncWithTp` function declaration, `args` represents the function's parameters, `tp` represents the return value type of the function, and `argsTp` represents the correct types of all parameters in the function signature. // The number of parameters for `LENGTH` is 1, the parameter type is string, and the return value type is int. Therefore, `tp` here stands for the return value type of the function and `tpString` is used to identify the correct type of parameter. For a function with multiple parameters, when calling `newBaseBuiltinFuncWithTp`, we need to input the correct types of all parameters. bf, err := newBaseBuiltinFuncWithTp(args, tp, ctx, tpString) if err != nil { return nil, errors.Trace(err) } sig := &builtinLengthSig{baseIntBuiltinFunc{bf}} return sig.setSelf(sig), errors.Trace(c.verifyArgs(args)) }
-
-
Implement the
builtinLengthSig.evalInt()
method:func (b *builtinLengthSig) evalInt(row []types.Datum) (int64, bool, error) { // For the `builtinLengthSig` function signature, the parameter type is decided as string, so we can directly call the `b.args[0].EvalString()` method to calculate the parameter: val, isNull, err := b.args[0].EvalString(row, b.ctx.GetSessionVars().StmtCtx) if isNull || err != nil { return 0, isNull, errors.Trace(err) } return int64(len([]byte(val))), false, nil }
Refine the existing TestLength()
method:
Moving on to expression/builtin_string_test.go
, let’s refine the existing TestLength()
method:
func (s *testEvaluatorSuite) TestLength(c *C) {
defer testleak.AfterTest(c)() // This line is used to monitor goroutine leak.
// You can use the following cases to test the `LENGTH` function
// Note: in addition to normal cases, it is best to add some abnormal cases, such as the input args is nil, or the function has different types of parameters.
cases := []struct {
args interface{}
expected int64
isNil bool
getErr bool
}{
{"abc", 3, false, false},
{"你好", 6, false, false},
{1, 1, false, false},
...
}
for _, t := range cases {
f, err := newFunctionForTest(s.ctx, ast.Length, primitiveValsToConstants([]interface{}{t.args})...)
c.Assert(err, IsNil)
// The following lines test the return value type of the `LENGTH` function:
tp := f.GetType()
c.Assert(tp.Tp, Equals, mysql.TypeLonglong)
c.Assert(tp.Charset, Equals, charset.CharsetBin)
c.Assert(tp.Collate, Equals, charset.CollationBin)
c.Assert(tp.Flag, Equals, uint(mysql.BinaryFlag))
c.Assert(tp.Flen, Equals, 10)
// The following lines test the evaluation result of LENGTH function:
d, err := f.Eval(nil)
if t.getErr {
c.Assert(err, NotNil)
} else {
c.Assert(err, IsNil)
if t.isNil {
c.Assert(d.Kind(), Equals, types.KindNull)
} else {
c.Assert(d.GetInt64(), Equals, t.expected)
}
}
}
// The following lines test whether the function has determinacy:
f, err := funcs[ast.Length].getFunction([]Expression{Zero}, s.ctx)
c.Assert(err, IsNil)
c.Assert(f.isDeterministic(), IsTrue)
}
Test the implementation of LENGTH
at the SQL level
Finally let’s look at executor/executor_test.go
and test the implementation of LENGTH
at the SQL level:
// Tests for string built-in functions can be added in the following method:
func (s *testSuite) TestStringBuiltin(c *C) {
defer func() {
s.cleanEnv(c)
testleak.AfterTest(c)()
}()
tk := testkit.NewTestKit(c, s.store)
tk.MustExec("use test")
// for length
// It's best that these tests can also cover different scenarios:
tk.MustExec("drop table if exists t")
tk.MustExec("create table t(a int, b double, c datetime, d time, e char(20), f bit(10))")
tk.MustExec(`insert into t values(1, 1.1, "2017-01-01 12:01:01", "12:01:01", "abcdef", 0b10101)`)
result := tk.MustQuery("select length(a), length(b), length(c), length(d), length(e), length(f), length(null) from t")
result.Check(testkit.Rows("1 3 19 8 6 2 <nil>"))
}
Before refactoring…
TiDB abstracts the expression through the Expression interface (defined in the expression/expression.go file) and defines the eval
method to evaluate the expression:
type Expression interface{
...
eval(row []types.Datum) (types.Datum, error)
...
}
Expressions that implement the Expression interface include:
- Scalar Function
- Column
- Constant
The case below shows the framework of expression evaluation before refactoring:
create table t (
c1 int,
c2 varchar(20),
c3 double
)
select * from t where c1 + CONCAT( c2, c3 < "1.1" )
About the where
condition in the select
statement shown above:
In the compiling phase. TiDB will build an expression tree as shown in the graph below:
In the executing phase, the eval
method of the root node is called, and the expression is evaluated by the following traversal expression tree.
The evaluate the <
expression, take the types of the two parameters into account, and convert the values of the two parameters into required data types according to certain rules. In the expression tree above, the parameter types are double and varchar. According to the evaluation rules of MySQL, these two parameters need to be compared using the float type. Therefore, “1.1” should be converted to double type, and then be evaluated.
Similarly, for the CONCAT
expression in the expression tree above, the parameters should be converted to string type before evaluation. For the expression ‘+’, the parameters should be converted to double before evaluation.
Therefore, before refactoring, the framework of expression evaluation needs to determine the data type of the parameter on each branch repeatedly for every group of data involved. If the parameter type does not meet the evaluation rules of the expression, you need to convert it to the corresponding data type.
Moreover, from the definition of the Expression.eval ()
method, we can see that when evaluating, we must continually wrap and unwrap intermediate results through the Datum structure, which also increases time and capacity cost.
In order to solve these two problems, we refactored the expression evaluation framework.
After refactoring…
The refactored framework has two advantages:
- In the compiling phase, we use the existing information on expression types to generate the expression with parameter types that match the evaluation rules. This way, no extra branch judgment about the parameter types are needed in the executing phase.
- Only the original data types are involved in the evaluation, thus avoiding the time and capacity cost by Datum.
Let’s go back to the previous example, in the compiling phase, the generated expression tree is shown in the following graph. For expressions that do not match the function parameters types, we add the cast
function for type conversion:
In this way, in the executing phase, for every ScalarFunction
, it is guaranteed that all of its parameter types match the data types in the given expression evaluation, and we don’t need to check and convert parameter types repeatedly.
Appendix
-
For a built-in function, multiple function signatures may be created to handle different scenarios, depending on the number & type of parameters, and the type of return value. For most built-in functions, each parameter type and return value type are determined, so only a function signature is needed.
-
For referring rules of more complicated return value type, you can refer to the implementation and test of the
CONCAT
function. You can use theMySQLWorkbench
tool to run the queryselect funcName (arg0, arg1, ...)
to observe the field type of the return value in MySQL’s built-in functions when inputting different parameters. -
In the evaluation process of TiDB expression, only the following six evaluation types (currently we are implementing the JSON type) are supported:
- int (int64)
- real (float64)
- decimal
- string
- Time
- Duration
The
WrapWithCastAsXX ()
method can convert an expression to the corresponding type. -
For a function signature, its return value type has been determined, so when defining, you need to combine it with the corresponding
baseXXBuiltinFunc
and implement theevalXX ()
method. Note that XX should only be one of the six types listed above.
TiDB Cloud Dedicated
TiDB Cloudのエンタープライズ版。
専用VPC上に構築された専有DBaaSでAWSとGoogle Cloudで利用可能。
TiDB Cloud Serverless
TiDB Cloudのライト版。
TiDBの機能をフルマネージド環境で使用でき無料かつお客様の裁量で利用開始。