Description
The esql-datasource-parquet plugin bundles hadoop-client-api (~20MB) and
hadoop-client-runtime (~31MB) because Parquet-MR's default CodecFactory calls
into Hadoop's Configuration class when decompressing column chunks compressed
with Snappy, GZIP, ZSTD, or LZ4. This is the only remaining Hadoop touchpoint in the
Parquet read path — PlainParquetConfiguration already replaces Hadoop for config.
Without hadoop-client-runtime, reading any compressed Parquet file causes a fatal
NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
because Hadoop's Configuration static initializer loads shaded Woodstox XML classes.
Parquet-MR exposes ParquetReadOptions.Builder.withCodecFactory(CompressionCodecFactory)
to inject a custom codec factory. All required codec libraries are already on the
classpath (snappy-java, zstd-jni via esql-datasource-compression-libs; lz4-java
via server; GZIP via JDK). A ~150-line pure-Java/JNI CompressionCodecFactory would
replace the Hadoop-backed default and allow removing both Hadoop JARs.
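Of the four codecs, GZIP needs no third-party library at all. A minimal sketch of what the GZIP leg of such a factory could look like, using only java.util.zip (class and method names here are illustrative, not taken from the plugin or from Parquet-MR):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: the GZIP branch of a pure-Java codec factory.
// A real CompressionCodecFactory would dispatch to one such decompressor
// per CompressionCodecName; Parquet supplies the uncompressed size.
public final class GzipDecompressSketch {

    // Decompress a GZIP-compressed column chunk of known uncompressed size.
    static byte[] decompress(byte[] compressed, int uncompressedSize) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream(uncompressedSize);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compress helper used only for the round-trip demo below.
    static byte[] compress(byte[] raw) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "parquet column chunk bytes".getBytes();
        byte[] roundTripped = decompress(compress(raw), raw.length);
        System.out.println(java.util.Arrays.equals(raw, roundTripped)); // prints "true"
    }
}
```

The Snappy, ZSTD, and LZ4 branches would follow the same shape, delegating to snappy-java, zstd-jni, and lz4-java respectively; none of them needs Hadoop's Configuration.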
Note: the ORC plugin cannot follow the same path — OrcFile.createReader structurally
requires Hadoop Configuration, Path, and FileSystem with no abstraction layer.
Upstream tracking: apache/parquet-java#2818
(open since Sep 2023, no fix version).