
[ES|QL] Parquet plugin ships ~50MB of Hadoop JARs solely for codec decompression #146716

@costin

Description


The esql-datasource-parquet plugin bundles hadoop-client-api (~20MB) and
hadoop-client-runtime (~31MB) because Parquet-MR's default CodecFactory calls
into Hadoop's Configuration class when decompressing compressed column chunks
(Snappy, GZIP, ZSTD, LZ4). This is the only remaining Hadoop touchpoint in the
Parquet read path — PlainParquetConfiguration already replaces Hadoop for config.

Without hadoop-client-runtime, reading any compressed Parquet file causes a fatal
NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
because Hadoop's Configuration static initializer loads shaded Woodstox XML classes.

Parquet-MR exposes ParquetReadOptions.Builder.withCodecFactory(CompressionCodecFactory)
to inject a custom codec factory. All required codec libraries are already on the
classpath (snappy-java, zstd-jni via esql-datasource-compression-libs; lz4-java
via server; GZIP via JDK). A ~150-line pure-Java/JNI CompressionCodecFactory would
replace the Hadoop-backed default and allow removing both Hadoop JARs.
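The codec dispatch such a factory would need can be sketched with JDK classes alone. This is a minimal illustration, not the actual implementation: the class and method names (`PureJavaDecompress`, `decompress`) are hypothetical, and only the GZIP path is fleshed out. A real replacement would implement `org.apache.parquet.compression.CompressionCodecFactory` and route SNAPPY, ZSTD, and LZ4 to snappy-java, zstd-jni, and lz4-java.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch of the per-codec dispatch behind a pure-Java/JNI
// CompressionCodecFactory. Only the JDK-provided GZIP path is shown here.
public class PureJavaDecompress {

    static byte[] decompress(String codecName, byte[] compressed) throws IOException {
        switch (codecName) {
            case "UNCOMPRESSED":
                return compressed;
            case "GZIP": {
                // GZIP ships with the JDK; no Hadoop Configuration involved.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (GZIPInputStream in =
                        new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                    in.transferTo(out);
                }
                return out.toByteArray();
            }
            // SNAPPY -> snappy-java, ZSTD -> zstd-jni, LZ4 -> lz4-java (not shown)
            default:
                throw new IOException("Unsupported codec: " + codecName);
        }
    }
}
```

The real factory would wrap this dispatch in Parquet's `BytesInputDecompressor` interface and be injected via `ParquetReadOptions.Builder.withCodecFactory(...)` when opening the file.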

Note: the ORC plugin cannot follow the same path — OrcFile.createReader structurally
requires Hadoop Configuration, Path, and FileSystem with no abstraction layer.

Upstream tracking: apache/parquet-java#2818
(open since Sep 2023, no fix version).
