python-snappy appears to be installed, yet Dask raises a ValueError.
Helm config for the jupyter and worker pods:
env:
- name: EXTRA_CONDA_PACKAGES
value: numba xarray s3fs python-snappy pyarrow ruamel.yaml -c conda-forge
- name: EXTRA_PIP_PACKAGES
value: dask-ml --upgrade
The containers show python-snappy as installed (via conda list).
The dataframe is loaded from a multi-part parquet file generated by Apache Drill:
import s3fs
import dask.dataframe as dd

fs = s3fs.S3FileSystem()  # credentials/options elided
files = ['s3://{}'.format(f) for f in fs.glob(path='{}/*.parquet'.format(filename))]
df = dd.read_parquet(files)
Running len(df) on the dataframe returns:
distributed.utils - ERROR - Data is compressed as snappy but we don't have this installed
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
yield
File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 921, in _handle_report
six.reraise(*clean_exception(**msg))
File "/opt/conda/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 203, in read
msg = yield from_frames(frames, deserialize=self.deserialize)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
return
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 315, in wrapper
future.set_result(_value_from_stopiteration(e))
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 75, in from_frames
res = _from_frames()
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 61, in _from_frames
return protocol.loads(frames, deserialize=deserialize)
File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 96, in loads
msg = loads_msgpack(small_header, small_payload)
File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 171, in loads_msgpack
" installed" % str(header['compression']))
ValueError: Data is compressed as snappy but we don't have this installed
Can anyone please suggest a correct configuration here or remediation steps?
This error actually isn't coming from reading your parquet files; it's coming from how Dask compresses data between machines. You can probably resolve this by installing (or not installing) python-snappy consistently on all of your client/scheduler/worker pods.
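A quick way to confirm which pods can import python-snappy is to run a check everywhere from the client. This is a rough diagnostic sketch, not part of the original answer: the scheduler address is a placeholder and has_snappy is a helper invented for illustration, though Client.run and Client.run_on_scheduler are real distributed APIs.

from distributed import Client

client = Client('scheduler-address:8786')  # placeholder: your scheduler's address

def has_snappy():
    try:
        import snappy  # python-snappy installs the `snappy` module
        return True
    except ImportError:
        return False

print(client.run(has_snappy))               # one result per worker
print(client.run_on_scheduler(has_snappy))  # result on the scheduler

If the scheduler reports False while the workers report True (or vice versa), you have exactly the inconsistency described above.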
You should do either of the following:
- Remove python-snappy from your jupyter and worker pods. If you're using pyarrow then this is unnecessary; I believe that Arrow includes snappy at the C++ level.
- Add python-snappy to your scheduler pod.
FWIW, I personally recommend lz4 for inter-machine compression over snappy.
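For example, to follow that recommendation with the same Helm env-variable mechanism shown in the question (a sketch; adjust the package list to your own needs, and note python-snappy is dropped here per the first option above), add lz4 to the conda packages in the jupyter, scheduler, and worker sections alike:

env:
  - name: EXTRA_CONDA_PACKAGES
    value: numba xarray s3fs pyarrow ruamel.yaml lz4 -c conda-forge

With lz4 importable on every pod, Dask's compression auto-detection can use it consistently; recent versions of distributed also let you pin the codec explicitly through the distributed.comm.compression configuration key.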