Cherry-pick Resgroup related code from GreenPlum [Mar 2, 2022 - Feb 7, 2023] #448
FATAL "writer segworker group shared snapshot collision" happens when gp_vmem_idle_time is reached: the QD cleans up the idle writer and reader gangs and closes the connections to the QEs, the QEs quit asynchronously, and the QD process remains. If a QE cannot quit before the QD starts a new command, it finds the same session id in the shared snapshot and a collision occurs. QE session exit may take time due to ProcArrayLock contention. Hence, this commit only cleans up reader gangs, not the writer gang, during the idle-session cleanup timeout. This way there is no need to remove and re-add the shared snapshot slot on the QEs, which avoids the possibility of a collision. (cherry picked from commit cc58ac6afec2587ae7afb489f59fc7c1d1949325)
These changes are back-ported from the 6X_STABLE branch. Other than refining code and wording, the names of the UDFs are changed:

```
pg_catalog.gp_endpoints()         -> pg_catalog.gp_get_endpoints()
pg_catalog.gp_segment_endpoints() -> pg_catalog.gp_get_segment_endpoints()
pg_catalog.gp_session_endpoints() -> pg_catalog.gp_get_session_endpoints()
```

And views are created for convenience:

```
CREATE VIEW pg_catalog.gp_endpoints AS
    SELECT * FROM pg_catalog.gp_get_endpoints();

CREATE VIEW pg_catalog.gp_segment_endpoints AS
    SELECT * FROM pg_catalog.gp_get_segment_endpoints();

CREATE VIEW pg_catalog.gp_session_endpoints AS
    SELECT * FROM pg_catalog.gp_get_session_endpoints();
```

Co-Authored-By: Jian Guo <gjian@vmware.com>
Co-Authored-By: Xuejing Zhao <zxuejing@vmware.com>
In GPDB, we do not allow users to run PREPARE TRANSACTION in regular and utility-mode connections, to prevent conflicts with GPDB's distributed transaction manager, which relies heavily on two-phase commit. As part of the Postgres 10 merge into GPDB, a regression was introduced that allowed PREPARE TRANSACTION to be run in utility-mode connections. The error check was being bypassed because the TransactionStmt was not being properly obtained: an upstream Postgres refactor introduced RawStmt, which wraps the TransactionStmt, so the typecast was being done on the wrong parse node (it needs to be done on RawStmt->stmt). Added a simple regression test to make sure this regression does not recur in future Postgres merges. Also disabled some recovery TAP tests which use PREPARE TRANSACTION in utility-mode connections. Postgres commit reference (RawStmt refactor): postgres/postgres@ab1f0c8
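A minimal sketch of the kind of regression check described above; the table name is hypothetical and the exact error wording may differ by version:

```
-- In a utility-mode connection, PREPARE TRANSACTION must be rejected.
BEGIN;
CREATE TABLE prep_txn_check (id int);
PREPARE TRANSACTION 'should_be_rejected';  -- expected: ERROR (not permitted in GPDB)
ROLLBACK;
```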
Use optimizer_enable_nljoin to disable all xforms that produce nestloop join alternatives. Co-authored-by: Orhan Kislal <okislal@vmware.com>
It's disabled by default.
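A small illustration of toggling the optimizer_enable_nljoin GUC described above; the tables and join are hypothetical:

```
-- Hypothetical tables; with the GUC off, ORCA should not consider
-- nested loop join alternatives for this join.
CREATE TABLE customers (id int, name text) DISTRIBUTED BY (id);
CREATE TABLE orders (id int, cust_id int) DISTRIBUTED BY (id);

SET optimizer = on;                  -- plan with ORCA
SET optimizer_enable_nljoin = off;   -- suppress nestloop-join xforms
EXPLAIN SELECT *
FROM orders o JOIN customers c ON o.cust_id = c.id;
```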
- Assert that interconnect_address is always set, in order to get rid of conditional code
- Remove the AI_PASSIVE flag from the socket setup functions (it was being ignored anyway, as we always pass a unicast address to getaddrinfo)
- Comment cleanup
- Added a regress test for checking motion socket creation

Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Add a new state and corresponding error message for RESET, and let FTS ignore it when it detects the primary is down. Detailed rationale behind the change: this RESET period is when a primary has crashed but has not yet started recovery. Normally this is a short period, but we've seen cases where the primary's postmaster waits a long time (40 to 50 seconds) for backends to exit. Previously the postmaster would send an "in recovery" response to FTS during that time, and since FTS sensed no recovery progress, it would panic and issue a failover. Now we just let FTS ignore that state. We could add a new FTS timeout to guard against the primary being stuck waiting in that state, but we think this should be very rare, so we aren't doing that until we see a need. There is also a 5-second timeout `SIGKILL_CHILDREN_AFTER_SECS` on the postmaster side, after which it sends `SIGKILL` to its children. Also make the new mode respected by certain retry mechanisms, such as the isolation2 framework and segment_failure_due_to_recovery().
In the current resource group implementation, query_mem in the plan tree is calculated using the QD's system memory and number of primary segments, not the QE's own system memory and number of primary segments. This can result in the wrong amount of memory being allocated at the execution stage, which can lead to various problems such as OOM or underutilization of QE resources. With resource groups enabled, query_mem is linear in the system memory and the number of primary segments; the approximate formula is:

query_mem = (total_memory * gp_resource_group_memory_limit * memory_limit / nsegments) * memory_spill_ratio / concurrency

Only total_memory and nsegments differ between QD and QE, so we can dispatch these two parameters to the QE and then recalculate the QE's own query_mem proportionally. The GUC gp_resource_group_enable_recalculate_query_mem lets the client decide whether to recalculate query_mem proportionally on the QE and repopulate operatorMemKB in the plan tree according to this value.
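A worked example of the formula above, using assumed cluster numbers (the host sizes, limits, and segment counts below are illustrative only):

```
-- Assumed QD host: 64 GB RAM, 4 primary segments,
-- gp_resource_group_memory_limit = 0.9, memory_limit = 20%,
-- memory_spill_ratio = 20%, concurrency = 10.
--   query_mem(QD) = (64 GB * 0.9 * 0.2 / 4) * 0.2 / 10 ≈ 59 MB
--
-- Assumed QE host: 32 GB RAM, 4 primary segments (other settings identical).
--   query_mem(QE) = (32 GB * 0.9 * 0.2 / 4) * 0.2 / 10 ≈ 29 MB
--
-- Enable proportional recalculation on the QEs:
SET gp_resource_group_enable_recalculate_query_mem = on;
```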
...to help with debugging and introspection. This will allow us to pull information about the active segments during execution, and it will form the basis of the gp_backend_info() function.
To debug the master backend for a given Postgres session, you can SELECT pg_backend_pid() and attach a debugger to the resulting process ID. We currently have no corresponding function for the segment backends, however -- developers have to read the output of `ps` and try to correlate their connected session with the correct backends. This is error-prone, especially if there are many sessions in flight. gp_backend_info() is an attempt to fill this gap. Running SELECT * FROM gp_backend_info(); will return a table of the following format:

 id | type | content |   host    | port  |  pid
----+------+---------+-----------+-------+-------
 -1 | Q    |      -1 | pchampion | 25431 | 50430
  0 | w    |       0 | pchampion | 25432 | 50431
  1 | w    |       1 | pchampion | 25433 | 50432
  2 | w    |       2 | pchampion | 25434 | 50433

This allows developers to jump directly to the correct host and PID for a given backend. This patch supports backends for writer gangs (type 'w' in the table), reader gangs ('r'), the master QD backend ('Q'), and master singleton readers ('R').

Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Co-authored-by: Divyesh Vanjare <vanjared@vmware.com>
The global variable host_segments is **only** used on QEs under resource group mode, and its value is dispatched to the QEs from the QD. Previously, in getCdbComponentInfo(), the QD built a hashtable and counted host_segments grouped by IP address. This is not correct: a typical Greenplum deployment may have different IP addresses pointing to the same machine. Using the IP address as the hash key can produce a wrong host_segments count and give a segment a higher memory limit than the user intended. This commit uses the hostname as a machine's unique identifier to fix the issue. Also renames a few things to better convey their meaning.
… tainted replicated (#13177) Previously, CPhysicalJoin derived the outer distribution when it was tainted replicated. It checked only for strict replicated and universal replicated, and returned the inner distribution in those cases (in which it satisfies random). Tainted replicated wasn't considered, causing an undercount: the JOIN derived tainted replicated instead of random, so the number of columns was undercounted, because it was wrongly assumed that one segment contained all output columns. Co-authored-by: Daniel Hoffman <hoffmand@vmware.com>
In the planner, if a SegmentGeneral path contains volatile expressions, it cannot be treated as General, and we try to make it SingleQE by adding a motion (if this motion turns out to be unnecessary, it is removed later). A corner case is that if the path references outer Params, it cannot be motioned. This commit fixes the issue by not trying to bring a SegmentGeneral path that references outer Params to SingleQE. See Github Issue 13532 for details.
The 5-digit date string was invalid and would be rejected on GPDB5. But upstream PG then modified the date parsing logic, which makes such a string parse as YYYMMMDD. Since this is not a standard time format and the change makes GPDB6+ behave differently from previous versions, this commit makes GPDB reject it by default. If the PG-like date parsing is required, the GUC gp_allow_date_field_width_5digits can be set to true.
…rt (#12694) As reported in "PolicyEagerFreeAssignOperatorMemoryKB makes query end without calling mppExecutorCleanup" (#12690), the code path in `standard_ExecutorStart` did not handle exceptions raised from the `PolicyAutoAssignOperatorMemoryKB` and `PolicyEagerFreeAssignOperatorMemoryKB` calls. As a result, an OOM exception might not be handled in `standard_ExecutorStart` but instead be thrown up to `PortalStart`; `PortalStart` has its own exception handling, but `mppExecutorCleanup` is not called there because `portal->queryDesc` is `NULL` in certain transaction states. This commit fixes that.
790c7ba changed our address binding strategy to use a unicast address (the segment's gp_segment_configuration.address) instead of the wildcard address, to reduce port usage on segment hosts and to ensure that we don't inadvertently use a slower network interface for interconnect traffic. In some cases, inter-segment communication using that unicast address may not be possible. One example is when the source segment's address field and the destination segment's address field are on different subnets and/or existing routing rules don't allow such communication. In these cases, using a wildcard address for address binding is the only available fallback, enabling the use of any network interface compliant with the routing rules. Thus, this commit introduces the gp_interconnect_address_type GUC to support both kinds of address binding. We pick "unicast" as the default, as that is the only reasonable way to ensure that the segment's address field is used for fast interconnect communication and to keep port usage manageable on large clusters with highly concurrent workloads.

Testing notes. VM setup: one coordinator node and two segment nodes, all connected through three networks. GP segment config: the coordinator node has one coordinator; each segment node has two primaries; no mirrors. The coordinator uses a dedicated network, and the two primaries on a segment node each use one of the other two networks. With 'unicast', we fail to send packets due to the network structure:

WARNING: interconnect may encountered a network error, please check your network

Falling back to 'wildcard', we see that packets can be sent successfully across motions.

Co-authored-by: Huansong Fu <fuhuansong@gmail.com>
`ResGroupActivated = true` is set at the end of InitPostgres() by InitResManager(). If, inside InitPostgres(), some code before the InitResManager() call fails in palloc(), the call trace is:

gp_failed_to_alloc()
  -> VmemTracker_GetAvailableVmemMB()
  -> VmemTracker_GetNonNegativeAvailableVmemChunks()
  -> VmemTracker_GetVmemLimitChunks()

which triggers:

VmemTracker_GetVmemLimitChunks()
{
    AssertImply(vmemTrackerInited && IsResGroupEnabled(), IsResGroupActivated());
}

Like commit c1cdb99 does, remove the AssertImply and add a TODO comment.
Removed a meaningless line of code in resgroup_helper.c.
This adds a GUC, optimizer_enable_replicated_table, which controls whether DML operations on a replicated table fall back to the Postgres planner. optimizer_enable_replicated_table is on by default. Co-authored-by: Daniel Hoffman <hoffmand@vmware.com>
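A hedged illustration, assuming (following ORCA's usual optimizer_enable_* convention) that turning the GUC off forces the fallback; the table and statement are hypothetical:

```
-- Hypothetical replicated table.
CREATE TABLE ref_codes (code int, label text) DISTRIBUTED REPLICATED;

SET optimizer = on;                            -- plan with ORCA
SET optimizer_enable_replicated_table = off;   -- assumed: off => fall back for replicated-table DML
EXPLAIN UPDATE ref_codes SET label = 'x' WHERE code = 1;
-- The resulting plan is expected to come from the Postgres planner rather than ORCA.
```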
The GUC replacement_sort_tuples was introduced in PostgreSQL 9.6 to indicate the threshold for using replacement selection rather than quicksort. In PG 12 the GUC was removed along with all code related to replacement selection sort, and it does not appear in GPDB7. However, GPDB7 still has one line referencing replacement_sort_tuples in sync_guc_name.h (without any other related code). This should be treated as a mistake. The fix is to simply remove the line; it does not impact any existing behavior.
…r 100 (#13668) Setting runaway_detector_activation_percent to 0 or 100 means disabling runaway detection, and this should apply to both the Vmem Tracker and resource groups. However, in the current implementation, with resource groups enabled we still invoke IsGroupInRedZone() even when runaway_detector_activation_percent is set to 0 or 100. IsGroupInRedZone() reads several shared variables, and RedZoneHandler_IsVmemRedZone() is a very frequently called function, so this wastes a lot of CPU. When the Red-Zone Handler is initialized, redZoneChunks is set to INT32_MAX if runaway detection is disabled, so we can use it to quickly decide whether we are in the red zone. No more tests are needed, since the existing unit tests already cover this situation.
…pipeline tests (#13974) The resource group pipeline uses ORCA as the optimizer by default. But for a resource management tool, it is unimportant which optimizer is used, so use the Postgres query optimizer instead of ORCA to run the resource group pipeline tests. After that, we can remove the files resgroup_bypass_optimizer.source and resgroup_bypass_optimizer_1.source.
[7X] Feat: Identify backends with suboverflowed txs

Subtransaction overflow is a chronic problem for Postgres and Greenplum which arises when a backend creates more than PGPROC_MAX_CACHED_SUBXIDS (64) subtransactions. This is often caused by the use of plpgsql EXCEPTION blocks, SAVEPOINTs, etc. Overflow implies that pg_subtrans needs to be consulted and the in-memory XidCache is no longer sufficient. The lookup cost is felt particularly when there are long running transactions in the system in addition to the suboverflowed backends. Long running transactions hold back the xmin boundary, leading to more lookups, especially of older pages in pg_subtrans. Looking up older pages while constantly generating new pg_subtrans pages (with the suboverflowed backend(s)) leads to pg_subtrans LRU misses, exacerbating the slowdown in overall system query performance. Terminating the backends with suboverflow, or the backends with long running transactions, can help alleviate the potential performance problems. This commit provides an extension and a view which can help DBAs identify suboverflowed backends, which they can subsequently terminate. Please note that backends should be terminated from the master (which will automatically terminate the corresponding backends on the segments).
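A sketch of how such an overflow can be reproduced and inspected; the view name `gp_suboverflowed_backends` below is hypothetical, since the commit message does not name the extension's objects:

```
-- Each BEGIN ... EXCEPTION block in plpgsql opens a subtransaction, so looping
-- more than PGPROC_MAX_CACHED_SUBXIDS (64) times overflows the backend's cache.
CREATE TABLE subxact_demo (i int);

CREATE OR REPLACE FUNCTION spawn_subxacts(n int) RETURNS void AS $$
BEGIN
    FOR i IN 1..n LOOP
        BEGIN
            INSERT INTO subxact_demo VALUES (i);
        EXCEPTION WHEN OTHERS THEN
            NULL;
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

BEGIN;
SELECT spawn_subxacts(100);
-- From another session, while the transaction above is still open:
--   SELECT * FROM gp_suboverflowed_backends;   -- hypothetical view name
COMMIT;
```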
…ent (#14019) We might want to also consider adding a log message to print the query string that caused the overflow. This is important as only 1 statement out of thousands executed in a backend may trigger the overflow, or the backend can come out of the overflow state before it is inspected with our view/UDF. Logging the statement will ensure that customers can pinpoint the offending statements.
Note that lc_monetary and lc_time affect how output data is formatted, and formatting functions are pushed down to the QEs in some common cases. So, to keep the output consistent with the locale set on the QD, we need to sync these GUCs between the QD and the QEs. Co-authored-by: wuchengwen <wcw190496@alibaba-inc.com>
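A small illustration of why the sync matters; the table, column, and locale value are hypothetical:

```
-- to_char() with the 'L' (currency symbol) pattern is governed by lc_monetary.
-- If the expression is pushed down, each QE must see the same locale as the QD
-- for the output to be consistent.
CREATE TABLE payments (amount numeric);
SET lc_monetary = 'en_US.UTF-8';
SELECT to_char(amount, 'L99G999D99') FROM payments;
```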
Add a GUC gp_print_create_gang_time to control whether to print information about gang creation time. We print the create gang time for both DDL and DML. If all the segDescs of a gang come from the cached pool, we regard the gang as reused. We only display the shortest and longest connection-establishment times and their segindexes for a gang. For a 1-gang, the info for the shortest and the longest establish conn time is the same.

DDL:
```
create table t(tc1 int);
INFO:  The shortest establish conn time: 4.48 ms, segindex: 2, The longest establish conn time: 8.13 ms, segindex: 1

set optimizer=off;
INFO:  (Gang) is reused
```

DML: we can use DML or explain analyze to get the create gang time.
```
select * from t_create_gang_time t1, t_create_gang_time t2 where t1.tc1=2;
INFO:  (Slice1) is reused
INFO:  (Slice2) The shortest establish conn time: 4.80 ms, segindex: 0, The longest establish conn time: 4.80 ms, segindex: 0
 tc1 | tc2 | tc1 | tc2
-----+-----+-----+-----
(0 rows)

explain analyze select * from t_create_gang_time t1, t_create_gang_time t2 where t1.tc1=2;
INFO:  (Slice1) is reused
INFO:  (Slice2) is reused
                QUERY PLAN
......
```
Items in these two files should be ordered.
Currently, when the resource group is enabled, the postmaster process is added to the parent cgroup, and all the auxiliary processes, such as BgWriter and SysLogger, are added to the user.slice cgroup. We cannot control the resource usage of the user.slice cgroup, and it is difficult to calculate the proportion between the resource usage of the parent group and a child group; the Linux cgroup documentation does not explain it either. So this PR creates a new control group, named "system_group", to control the resource usage of the postmaster process and all the other auxiliary processes. This PR relies on the following principle: when a process forks a child process, the new process is born into the cgroup that the forking process belongs to at the time of the operation; after exit, a process stays associated with the cgroup that it belonged to at the time of exit until it is reaped.
Added a new view to the resource manager tooling in gp_toolkit for a frequently needed piece of information: gp_toolkit.gp_resgroup_role shows the resource group assigned to each role.
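A brief usage sketch; the role name is hypothetical and the view's exact column names are not given in the commit message:

```
-- Assign a resource group to a (hypothetical) role, then inspect the mapping.
CREATE ROLE analyst LOGIN;
ALTER ROLE analyst RESOURCE GROUP default_group;
SELECT * FROM gp_toolkit.gp_resgroup_role;
```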
Fix the failed pipeline due to #13880
…rface (#14343) This PR is the second step in refactoring the resource group; the first one is #14256. In this PR we do not change any behavior of the resource group, and we do not change the interface exposed to the resource manager. It just abstracts all the fundamental functions into the struct CGroupOpsRoutine and uses this abstract handler to manipulate the underlying Linux cgroup files. There are two purposes for this: 1. make the code more readable; 2. provide the base interface for Linux cgroup v2. The second one is our main motivation. Of course, this is a relatively large change, so it is not all done, and more details still need to be fixed.
NEW SYNTAX of resource group cpuset for different master and segment settings. A syntax like cpuset="1;3-4" gives the master and the segments different cpusets, separated by a semicolon. With cpuset="1;3-4", the master is assigned the first CPU core while the segments are assigned the third and fourth cores. The semicolon separates the master part from the segment part: the first half applies to the master and the second half to the segments.
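A hedged example of the syntax described above; the group name and other attributes are illustrative:

```
-- Master gets core 1; every segment gets cores 3-4.
CREATE RESOURCE GROUP rg_pinned WITH (concurrency=5, cpuset='1;3-4');
```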
Fix link problems on macOS and Windows introduced by #14343.
Fix the dev pipeline failure from the previous PR #14332.
My linker complains that there are multiple definitions of cgroupOpsRoutine and cgroupSystemInfo. We should declare the variables in the header file with an extern tag and initialize them in one of the .c files. Since cgroupSystemInfo and cgroupOpsRoutine are required on multiple platforms, I initialize them in resgroup.c.
Simplify and refactor some of the code for the RG cpuset separated by coordinator/segment. This commit enhances the previous PR https://github.com/greenplum-db/gpdb/pull/14332. authored-by: chaotian <chaotian@vmware.com>
After #14343, it's time to remove all the code and test cases related to the resource group memory manager.

1. What this PR has done

This PR does two important things. First, and most important, it removes all the code and test cases for the resource group memory model, which includes functions, variables, GUCs, etc., and it adds the new semantics to resource groups while removing the memory model. Since pg_resgroupcapability.reslimittype is consistent with the enumerated type ResGroupLimitType, deleting `RESGROUP_LIMIT_TYPE_MEMORY` and other content would leave "holes"; in order to avoid more PRs and review work, this PR deletes the memory model and adds the new semantics in one go.

The GUCs this PR removes: gp_resource_group_memory_limit, gp_resgroup_memory_policy, memory_spill_ratio, gp_log_resgroup_memory, gp_resgroup_memory_policy_auto_fixed_mem, gp_resource_group_cpu_ceiling_enforcement, gp_resgroup_print_operator_memory_limits, gp_resource_group_enable_recalculate_query_mem.

2. New Resource Group Attributes and Limits

The new resource group attributes and limits:
- concurrency. The maximum number of concurrent transactions, including active and idle transactions, that are permitted in the resource group.
- cpu_hard_quota_limit. The hard limit, as a percentage of CPU resources, for this resource group. This value indicates the maximum CPU ratio the group can use.
- cpu_soft_priority. The group's CPU priority: the larger the value, the higher the priority and the more likely the group is to be scheduled by the CPU. The default value is 100.
- cpuset. The CPU cores to reserve for this resource group.

First, let's take a look at the new resource management view of the resource group:

postgres=# select * from gp_toolkit.gp_resgroup_config;
 groupid |   groupname   | concurrency | cpu_hard_quota_limit | cpu_soft_priority | cpuset
---------+---------------+-------------+----------------------+-------------------+--------
    6437 | default_group |          20 |                   20 |               100 | -1
    6438 | admin_group   |          10 |                   -1 |               300 | -1
    6441 | system_group  |           0 |                   10 |               100 | -1
(3 rows)

2.1 What's the meaning of cpu_hard_quota_limit

As can be seen, cpu_rate_limit is removed and replaced by cpu_hard_quota_limit, which indicates the upper limit of CPU resources the group can use. This is a percentage; taking 20 as an example, it means the CPU resources used by the group cannot exceed 20% of the host's total CPU resources. The sum of cpu_hard_quota_limit over all groups can exceed 100, and the range of this value is [1, 100] or -1, where 100 and -1 both mean that all CPU resources can be used and no CPU limit is imposed. When we change the value of cpu_hard_quota_limit, we write cpu.cfs_period_us * ncores * cpu_hard_quota_limit / 100 to the file cpu.cfs_quota_us.

2.2 What's the meaning of cpu_soft_priority

We added the cpu_soft_priority field to indicate the CPU priority of the group, corresponding to the dynamic running load weight in the Linux CFS. The larger the value, the greater the weight of the group and the more preferentially it is scheduled by the Linux scheduler. The value range is [1, +∞); currently, the value cannot exceed 2^64 - 1. The default value is 100. When we change the value of cpu_soft_priority, we write (int64)(cpu_soft_priority * 1024 / 100) to the file cpu.shares.
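A hedged sketch of the new attributes in use; the group name and values are illustrative, and the exact CREATE syntax may differ from the final implementation:

```
-- Cap the group at 20% of host CPU, with a CFS weight twice the default.
CREATE RESOURCE GROUP rg_batch WITH (
    concurrency          = 10,
    cpu_hard_quota_limit = 20,
    cpu_soft_priority    = 200
);
-- With cpu.cfs_period_us = 100000 on an assumed 16-core host, the hard quota becomes
--   100000 * 16 * 20 / 100 = 320000 written to cpu.cfs_quota_us,
-- and the soft priority becomes 200 * 1024 / 100 = 2048 written to cpu.shares.
```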
In DistributedLog_AdvanceOldestXmin() we advance the DLOG's idea of the oldestXmin to the "globalxmin" value, and also truncate all DLOG segments that only hold xids older than the oldestXmin. The oldestXmin can be xmax, i.e. "latestCompletedXid" + 1, when e.g. there are no other concurrent running transactions. However, during postmaster restart we initialize the oldestXmin to be only up to latestCompletedXid. As a result, when we try to advance it again, we could try to access the segment that holds latestCompletedXid, which had been truncated before the restart. Fix it by initializing oldestXmin properly. Add a test for the same; the test file had to be moved to isolation/input in order to import the regress.so for the test_consume_xids() function we need.
Since #14562 removed some GUCs and the resource group memory model, there are some legacy test cases and unused code left in the project; this PR removes those files and code. No more tests are needed, it's a pure cleanup.
Currently, we create a distributed snapshot in GetSnapshotData() if we are the QD, and we iterate over procArray again to get the global xmin/xmax/xip. But if the current query can be dispatched to a single segment directly, i.e. it is a direct dispatch, there is no need to create a distributed snapshot; the local snapshot is enough.
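A hedged example of the kind of query that qualifies; the table and distribution key are hypothetical:

```
-- A predicate that pins the query to one segment via the distribution key
-- makes it a candidate for direct dispatch, so no distributed snapshot is needed.
CREATE TABLE accounts (id int, balance numeric) DISTRIBUTED BY (id);
EXPLAIN SELECT balance FROM accounts WHERE id = 42;
```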
Change totalQueued and totalExecuted from int to int64 to avoid overflow after the system has been running for a long time. Co-authored-by: huaxi.shx <huaxi.shx@alibaba-inc.com>
When gpdb calls InitResGroups to init a postgres backend, readStr is called to read the cpuset assigned to gpdb. However, the data buffer in readStr is too small, so the cpuset string read by gpdb gets truncated. This commit changes the buffer size from MAX_INT_STRING_LEN (20) to MAX_CGROUP_CONTENTLEN (1024) to fix the resource group init error when there are many cores in cpuset.cpus. Co-authored-by: huaxi.shx <huaxi.shx@alibaba-inc.com>
A GUC's name must be added to `sync_guc_name.h` if its value needs to be synced between the QD and the QEs. The QD dispatches its current synced GUC values (as startup options) when it creates QEs; otherwise, the settings will not take effect on a newly created QE. An example of GUC inconsistency between QD and QE:
```
CREATE OR REPLACE FUNCTION cleanupAllGangs() RETURNS BOOL
AS '@abs_builddir@/../regress/regress.so', 'cleanupAllGangs' LANGUAGE C;

CREATE OR REPLACE FUNCTION public.segment_setting(guc text)
RETURNS SETOF text EXECUTE ON ALL SEGMENTS AS $$
BEGIN RETURN NEXT pg_catalog.current_setting(guc); END
$$ LANGUAGE plpgsql;

postgres=# show allow_segment_DML;
 allow_segment_DML
-------------------
 off
(1 row)

postgres=# set allow_segment_DML = on;
SET
postgres=# show allow_segment_DML;
 allow_segment_DML
-------------------
 on
(1 row)

postgres=# select public.segment_setting('allow_segment_DML');
 segment_setting
-----------------
 on
 on
 on
(3 rows)

postgres=# select cleanupAllGangs();
 cleanupallgangs
-----------------
 t
(1 row)

postgres=# show allow_segment_DML;
 allow_segment_DML
-------------------
 on
(1 row)

postgres=# select public.segment_setting('allow_segment_DML');
 segment_setting
-----------------
 off
 off
 off
(3 rows)
```
- Move the GUCs `application_name` and `vacuum_cost_limit` back to `unsync_guc_name.h` to fix a pipeline failure. Pipeline link: https://prod.ci.gpdb.pivotal.io/teams/main/pipelines/gpdb_main_without_asserts/jobs/gpconfig_rocky8/builds/19
- Remove several deprecated GUCs.
1. Support CREATE/ALTER RESOURCE GROUP with memory_limit, and add back the removed GUCs that are used to enforce memory limits.
2. Support acquiring the amount of memory reserved for a query in resource group mode.
3. Add a GUC gp_resgroup_memory_query_fixed_mem to allow users to set the memory limit for a query.
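A hedged sketch of the features listed above; the group name, limit values, and the value format accepted by the GUC are assumptions:

```
-- Re-introduced memory limit on a resource group (value is illustrative).
CREATE RESOURCE GROUP rg_reporting WITH (concurrency=10, memory_limit=20);

-- Per-query fixed memory override; the accepted unit/format is an assumption here.
SET gp_resgroup_memory_query_fixed_mem = '512MB';
SELECT count(*) FROM pg_class;
```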
Normal Conflict:
Code file address changed:
Skipped: