This topic provides answers to some frequently asked questions about Hudi.
What do I do if duplicate data is returned when I use Spark to query data in a Hudi table?
What do I do if duplicate data is returned when I use Hive to query data in a Hudi table?
What do I do if partition pruning does not take effect when I use Spark to query data in a Hudi table?
What do I do if the error message "xxx is only supported with v2 tables" appears when I execute the ALTER TABLE statement in Spark?
What do I do if duplicate data is returned when I use Spark to query data in a Hudi table?
Cause: Spark converts the Hudi table into its built-in Parquet data source and reads the underlying Parquet files directly instead of going through the Hudi input format. As a result, records from multiple file versions are returned and duplicates appear.
Solution: Add
spark.sql.hive.convertMetastoreParquet=false
to the command that is used to query data in a Hudi table.
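For example, in a spark-sql session you can set the configuration before running the query. The table name hudi_tbl below is a hypothetical placeholder:

```sql
-- Disable Spark's built-in Parquet conversion so the Hudi input format is used.
set spark.sql.hive.convertMetastoreParquet=false;

-- hudi_tbl is a hypothetical Hudi table name used for illustration.
select * from hudi_tbl;
```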
What do I do if duplicate data is returned when I use Hive to query data in a Hudi table?
Cause: By default, Hive uses HiveCombineInputFormat. This input format class cannot call an input format that is customized for a table, such as the Hudi input format.
Solution: Add
set hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
to the command that is used to query data in a Hudi table.
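For example, in a Hive session you can switch the input format before running the query. The table name hudi_tbl below is a hypothetical placeholder:

```sql
-- Use Hudi's combine input format for this session so Hudi files are read correctly.
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

-- hudi_tbl is a hypothetical Hudi table name used for illustration.
select * from hudi_tbl;
```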
What do I do if partition pruning does not take effect when I use Spark to query data in a Hudi table?
Cause: If the name of a partition field contains a forward slash (/), the number of partition fields detected during the query is inconsistent with the actual number of partition levels. As a result, partition pruning does not take effect.
Solution: Add
hoodie.datasource.write.partitionpath.urlencode=true
to the command that is used to write data to a Hudi table by using the DataFrame API of Spark.
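If the table is created through Spark SQL rather than the DataFrame API, the same option can be supplied as a table property. This is a minimal sketch; the table name, columns, and partition field are hypothetical:

```sql
-- Hypothetical Hudi table whose partition values may contain a slash (/).
-- URL-encoding the partition path keeps the partition level count consistent.
create table hudi_tbl (
  id bigint,
  name string,
  dt string
) using hudi
partitioned by (dt)
tblproperties (
  'hoodie.datasource.write.partitionpath.urlencode' = 'true'
);
```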
What do I do if the error message "xxx is only supported with v2 tables" appears when I execute the ALTER TABLE statement in Spark?
Cause: The hoodie.schema.on.read.enable configuration item for Hudi is not set to true when you use the Hudi-Spark schema evolution feature.
Solution: Add
set hoodie.schema.on.read.enable=true
to the ALTER TABLE statement that is executed for a Hudi table. For more information, see SparkSQL Schema Evolution and Syntax Description of Apache Hudi.
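For example, in a spark-sql session you can enable schema-on-read before altering the table. The table name hudi_tbl and the added column are hypothetical placeholders:

```sql
-- Enable schema-on-read so ALTER TABLE schema changes are supported on the Hudi table.
set hoodie.schema.on.read.enable=true;

-- hudi_tbl and the remark column are hypothetical examples.
alter table hudi_tbl add columns (remark string);
```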