FAQ

This topic answers frequently asked questions about using Kudu.

Where can I find the Kudu log files?
Which partitioning methods does Kudu support?
How do I access the Kudu web UI?
The Kudu client fails to connect with NonRecoverableException
Where can I find the community FAQ?
Error: Bad status: Network error: Could not obtain a remote proxy to the peer.: unable to resolve address for <hostname>: Name or service not known
Error: Bad status: I/O error: Failed to load Fs layout: could not verify integrity of files: <directory>, <number> data directories provided, but expected <number>
Error: pthread_create failed: Resource temporarily unavailable (error 11)
What do I do if Kudu fails to start?
Error: Service unavailable: RunTabletServer() failed: Cannot initialize clock: timed out waiting for clock synchronisation: Error reading clock. Clock considered unsynchronized
Error: Rejecting Write request: Soft memory limit exceeded

Where can I find the Kudu log files?

Kudu log files are stored in the /mnt/disk1/log/kudu directory.

Which partitioning methods does Kudu support?

Kudu supports range partitioning and hash partitioning, and the two methods can be nested. For details, see Apache Kudu Schema Design.

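How do I access the Kudu web UI?

The Kudu web UI is not yet integrated with Knox, so you cannot view it through Knox. Access the Kudu web UI through an SSH tunnel instead. For details, see Access the web UIs of open source components through an SSH tunnel.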
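
As a rough sketch of the tunnel approach, the helper below only prints an ssh local port forwarding command so that you can review it before running it. The function name kudu_webui_tunnel_cmd, the root login user, the key path, and the node address are hypothetical. Port 8051 is the default web UI port of the Kudu master in open source Kudu and is assumed here; your cluster may use a different port. The sketch also assumes that you log on to the Master node itself, so the forwarding target is localhost on that node.

```shell
#!/bin/bash
# Print (do not run) an ssh command that forwards a local port to the
# Kudu master web UI on the node you log on to.
# Arguments: KEY_FILE NODE_ADDRESS [LOCAL_PORT] [REMOTE_PORT]
# The ports default to 8051, which is an assumption for this sketch.
kudu_webui_tunnel_cmd() {
  key="$1"
  node="$2"
  local_port="${3:-8051}"
  remote_port="${4:-8051}"
  printf 'ssh -i %s -N -L %s:localhost:%s root@%s\n' \
    "$key" "$local_port" "$remote_port" "$node"
}
```

After you run the printed command, the web UI would be reachable at http://localhost:8051 on your local machine for as long as the ssh session stays open.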

The Kudu client fails to connect with NonRecoverableException

Error details

The client reports the following error.

org.apache.kudu.client.NonRecoverableException: Could not connect to a leader master. Client configured with 1 master(s) (192.168.0.10:7051) but cluster indicates it expects 3 master(s) (192.168.0.36:7051,192.168.0.11:7051,192.168.0.10:7051)

Cause
The client was configured with the address of only one Master node, so it cannot locate the leader Master.
Solution
Configure the client with the addresses of all Master nodes.
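
The error message itself lists the Master addresses that the cluster expects. The following sketch extracts that list so that it can be used as the comma-separated Master address list in the client configuration. The function name expected_masters is hypothetical, and the pattern assumes the message format shown above.

```shell
#!/bin/bash
# Print the comma-separated Master list that the cluster expects, taken
# from a NonRecoverableException message passed as the first argument.
# Prints nothing if the message does not match the assumed format.
expected_masters() {
  printf '%s\n' "$1" |
    sed -n 's/.*expects [0-9]* master(s) (\([^)]*\)).*/\1/p'
}
```

For the message above, the function prints 192.168.0.36:7051,192.168.0.11:7051,192.168.0.10:7051.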

Where can I find the community FAQ?

See Apache Kudu Troubleshooting in the community documentation.

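Error: Bad status: Network error: Could not obtain a remote proxy to the peer.: unable to resolve address for <hostname>: Name or service not known

Cause: <hostname> cannot be resolved. As a result, the Raft server of the Kudu tablet cannot obtain information about its peer machine and cannot tell whether the peer is alive, so it aborts.
Solution:
1. Manually add an entry that resolves <hostname> to /etc/hosts.
2. If the machine that <hostname> refers to has been released, map <hostname> in /etc/hosts to any IP address, whether or not that address is reachable. The Kudu tserver then goes through the replica recovery process.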
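
Before editing /etc/hosts, it can help to check whether an entry for the hostname already exists. The following is a minimal sketch; the function name has_hosts_entry and the sample hostname in the usage comment are hypothetical.

```shell
#!/bin/bash
# Return 0 if HOSTNAME appears as a name in a hosts file, 1 otherwise.
# Arguments: HOSTNAME [HOSTS_FILE], where HOSTS_FILE defaults to /etc/hosts.
has_hosts_entry() {
  awk -v h="$1" '
    {
      sub(/#.*/, "")                 # drop comments
      for (i = 2; i <= NF; i++)      # field 1 is the IP address
        if ($i == h) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "${2:-/etc/hosts}"
}

# Usage with a hypothetical hostname:
#   has_hosts_entry emr-worker-3 || echo "no /etc/hosts entry for emr-worker-3"
```

An entry in /etc/hosts has the form of an IP address followed by one or more hostnames, for example 192.168.0.36 emr-worker-3.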

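Error: Bad status: I/O error: Failed to load Fs layout: could not verify integrity of files: <directory>, <number> data directories provided, but expected <number>

This error occurs when the number of disks specified by -fs_data_dirs does not match the number recorded in the metadata under -fs_metadata_dir. To fix it, change -fs_data_dirs so that the number of disks matches the metadata.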
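
To compare against the "expected" number in the error message, you can count the directories in the -fs_data_dirs value. The following is a minimal sketch that assumes the value is a comma-separated list of directories; the function name count_data_dirs and the sample paths are hypothetical.

```shell
#!/bin/bash
# Print the number of non-empty, comma-separated entries in the given
# -fs_data_dirs value (passed as the first argument).
count_data_dirs() {
  printf '%s\n' "$1" |
    awk -F',' '{ n = 0; for (i = 1; i <= NF; i++) if ($i != "") n++; print n }'
}

# Usage with hypothetical paths:
#   count_data_dirs "/mnt/disk1/kudu/data,/mnt/disk2/kudu/data"
```

If the printed count differs from the expected number in the error message, the directory list is what needs to change.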

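Error: pthread_create failed: Resource temporarily unavailable (error 11)

Thread creation failed because the system ran out of resources. The possible causes are described below.

Insufficient resources
Run ulimit -a and check whether the value of max user processes is small. If it is, raise it by modifying /etc/security/limits.conf or by adding a file named /etc/security/limits.d/kudu.conf that sets a larger value for max user processes.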
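
As an illustration, the following sketch prints entries in the limits.conf format (domain, type, item, value), in which max user processes corresponds to the nproc item. The function name kudu_limits_conf, the kudu user, and the value 65535 in the usage comment are assumptions; use the account that runs Kudu on your cluster and a value that suits your workload.

```shell
#!/bin/bash
# Print soft and hard nproc entries for one user in limits.conf format.
# Arguments: USER VALUE
kudu_limits_conf() {
  printf '%s soft nproc %s\n' "$1" "$2"
  printf '%s hard nproc %s\n' "$1" "$2"
}

# Usage as root, with an assumed user and value:
#   kudu_limits_conf kudu 65535 > /etc/security/limits.d/kudu.conf
```

New limits normally apply to sessions started after the change, so check ulimit -a again in a fresh session.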
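Kudu client 0.8 is used in a co-located deployment
According to KUDU-1453, a Spark executor that uses Kudu client 0.8 may leak threads. First check whether upgrading to version 0.9 or later resolves the problem.
Thread leaks
- Problem caused by Trino
  When Trino exits, its shutdown hook waits for the take method of a blocking queue to return, and the shutdown hook thread is not interrupted. Meanwhile, the EMR control plane keeps sending SIGTERM, and each signal creates a new SIGTERM handler thread, which eventually exhausts the available threads.
  This must be fixed on the Trino side. Alternatively, run kill -9 on the process directly.
- Problem caused by Jindo SDK
  When Spark runs a write job, it uses the JindoOssCommitter class. This class creates a JindoOssMagicCommitter, which in turn creates a thread pool named oss-committer-pool. The thread pool is not static and is never shut down explicitly. As JindoOssMagicCommitter instances keep being created, new thread pools keep being created while the old ones are not released, so an excessive number of threads is used. If you use Spark Streaming or Structured Streaming, this can exhaust system resources.
  Set the following parameters to resolve the problem.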
```
spark.sql.hive.outputCommitterClass=org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
spark.sql.sources.outputCommitterClass=org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
```
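- Troubleshooting tool
  Run the following threads_monitor.sh script to find the process that uses the most threads in the system, and then resolve the problem for that process.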
```
#!/bin/bash
# Report the total thread count and the process that uses the most threads.

total_threads=0
max_pid=-1
max_threads=-1

# Every numeric directory under /proc is a process ID.
for status in /proc/[0-9]*/status
do
  pid=${status#/proc/}
  pid=${pid%/status}
  # A process may exit while the loop runs, so ignore read errors.
  num_threads=$(awk '/^Threads:/ {print $2}' "${status}" 2>/dev/null)
  if [[ -n ${num_threads} ]]; then
    ((total_threads+=num_threads))
    if [[ ${max_threads} -lt ${num_threads} ]]; then
      max_pid=${pid}
      max_threads=${num_threads}
    fi
  fi
done

echo "Total threads: ${total_threads}"
echo "Max threads: ${max_threads}, pid is ${max_pid}"
# Show the command line of exactly that process.
ps -fp "${max_pid}"
```

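What do I do if Kudu fails to start?

Kudu is started, kept running, and automatically restarted after a failure by bigboot monitor, which is provided by bigboot. bigboot 3.5.0 has a defect: if Kudu exits abnormally, the service information in the database cannot be deleted, so the service cannot be brought up again afterwards. In this case, stop the service and then start it.

Note

Perform these operations on the machine itself. Because the service has already terminated, the console may not execute the stop operation.

Run the following commands on a Worker node. If you run them on a Header node, replace kudu-tserver with kudu-master.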

/usr/lib/b2monitor-current/bin/monictrl -stop kudu-tserver
/usr/lib/b2monitor-current/bin/monictrl -start kudu-tserver

Error: Service unavailable: RunTabletServer() failed: Cannot initialize clock: timed out waiting for clock synchronisation: Error reading clock. Clock considered unsynchronized

Error details

The log may contain the following error message.

E1010 10:37:54.165313 29920 system_ntp.cc:104] /sbin/ntptime
------------------------------------------
stdout:
ntp_gettime() returns code 5 (ERROR)
  time e6ee0402.2a452c4c  Mon, Oct 10 2022 10:37:54.165, (.165118697),
  maximum error 16000000 us, estimated error 16000000 us, TAI offset 0
ntp_adjtime() returns code 5 (ERROR)
  modes 0x0 (),
  offset 0.000 us, frequency 187.830 ppm, interval 1 s,
  maximum error 16000000 us, estimated error 16000000 us,
  status 0x2041 (PLL,UNSYNC,NANO),
  time constant 6, precision 0.001 us, tolerance 500 ppm,

Cause: The ntpd on the machine cannot connect to the configured NTP server.
Solution: Try a restart to resolve the issue.
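
To check whether the clock is still reported as unsynchronized, for example after a restart, you can look for the UNSYNC flag in the status line of the ntptime output, as in the log above. The following is a minimal sketch; the function name clock_unsynced is hypothetical.

```shell
#!/bin/bash
# Read ntptime output on standard input and return 0 if its status line
# contains the UNSYNC flag, 1 otherwise.
clock_unsynced() {
  grep -Eq 'status .*UNSYNC'
}

# Usage:
#   /sbin/ntptime | clock_unsynced && echo "clock is not synchronized"
```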

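Error: Rejecting Write request: Soft memory limit exceeded

Cause: The soft memory limit was exceeded.
Solution
You can do the following:
1. Set memory_limit_hard_bytes to raise the overall amount of memory that can be used. The default value is 0, which means that the limit is set automatically based on the system. A value of -1 means that no limit is applied.
2. Set memory_limit_soft_percentage, which is the percentage of the memory that can be used. The default value is 80.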
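
To see how the two parameters interact, the following sketch computes the soft limit in bytes. It assumes that the soft limit equals memory_limit_soft_percentage percent of an explicit, positive memory_limit_hard_bytes value, which is how the open source Kudu flag descriptions read; it does not cover the automatic (0) or unlimited (-1) settings. The function name soft_limit_bytes is hypothetical.

```shell
#!/bin/bash
# Print the soft memory limit in bytes.
# Arguments: HARD_LIMIT_BYTES SOFT_PERCENTAGE
soft_limit_bytes() {
  echo $(( $1 * $2 / 100 ))
}

# Usage: a 10 GiB hard limit with the default percentage of 80
#   soft_limit_bytes 10737418240 80
```

Under that assumption, a 10 GiB hard limit at 80 percent gives a soft limit of 8589934592 bytes (8 GiB), so raising either parameter raises the point at which writes are rejected.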