OpenBEATs is a general-purpose audio encoder pre-trained on speech, music, environmental sound, and bioacoustics. This package runs it on audio and returns patch-level embeddings, plus class probabilities when a fine-tuned checkpoint is used.
pip install openbeats

This adds two commands, openbeats-infer and openbeats-download. The
dependencies are kept light (torch, torchaudio, numpy, huggingface-hub, pyyaml,
soundfile), and torch is pinned loosely so an existing build is not replaced. To
avoid touching an existing environment, install the package in its own isolated
environment with uv or pipx:

uv tool install openbeats # or: pipx install openbeats

Handy for a quick look:
openbeats-infer --checkpoint espnet/OpenBEATS-Large-i2-as20k \
    --audio audio.wav --out embeddings.npz

--checkpoint takes a Hugging Face repo id (downloaded automatically), a local
directory, or a checkpoint file. The .npz holds patch_embeddings
(num_patches, 1024), plus logits and probs when the checkpoint has a
classifier. Other options: --device cuda, --max-layer N, and
--chunk-seconds 10 for long recordings.
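To check what a run actually wrote, the saved file can be opened with NumPy. The sketch below relies only on the array names listed above (patch_embeddings, plus logits and probs when a classifier is present). The helper names describe_npz and has_classifier_outputs are hypothetical and not part of the package.

```python
import numpy as np


def describe_npz(path):
    """Return {array_name: shape} for every array stored in an .npz file.

    Hypothetical helper for illustration; it is not part of openbeats.
    """
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


def has_classifier_outputs(path):
    """Return True if the file contains both "logits" and "probs" arrays.

    Assumes the array names given in the text above.
    """
    with np.load(path) as data:
        return "logits" in data.files and "probs" in data.files
```

Calling describe_npz("embeddings.npz") on the file from the command above should list patch_embeddings with a shape of (num_patches, 1024), and has_classifier_outputs tells you whether the checkpoint also produced class scores.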
The same model can be used from Python:

from openbeats.model import OpenBeats
from openbeats.utils import load_audio

# load the model
model = OpenBeats.from_pretrained("espnet/OpenBEATS-Large-i2-as20k", device="cuda")

# encode directly from a file with any sample rate
out = model.encode_file("audio.wav")  # pass chunk_seconds=10 for long audio

# or load the waveform as a 16 kHz mono array with values in [-1, 1]
wav, sr = load_audio("audio.wav")
# and pass it to the encoder
out = model.encode(wav, sr)

print(out["patch_embeddings"].shape)  # (num_patches, 1024)

The variants (Base and Large, plus AudioSet and bioacoustics fine-tunes) live in the espnet OpenBEATs collection.
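A common next step with patch-level embeddings is to collapse them into one vector per clip. The sketch below mean-pools over the patch axis. It assumes patch_embeddings has shape (num_patches, 1024) as stated above and can be converted with np.asarray; a torch tensor on a GPU would first need to be detached and moved to the CPU. The function name mean_pool is hypothetical, and mean pooling is a generic choice, not something the package prescribes.

```python
import numpy as np


def mean_pool(patch_embeddings):
    """Average over the patch axis (axis 0) to get one clip-level vector.

    Hypothetical helper for illustration; it is not part of openbeats.
    Expects an array-like of shape (num_patches, dim).
    """
    arr = np.asarray(patch_embeddings, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError(f"expected (num_patches, dim), got shape {arr.shape}")
    return arr.mean(axis=0)
```

Under these assumptions, mean_pool(out["patch_embeddings"]) would return a single vector of length 1024 that can be fed to a simple downstream classifier or used for similarity search.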
If you use OpenBEATs, please cite:
@INPROCEEDINGS{11230965,
author={Bharadwaj, Shikhar and Cornell, Samuele and Choi, Kwanghee and Fukayama, Satoru and Shim, Hye-Jin and Deshmukh, Soham and Watanabe, Shinji},
booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
title={OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Training;Representation learning;Codes;Conferences;Pipelines;Signal processing;Cognition;Robustness;Reproducibility of results;Question answering (information retrieval)},
doi={10.1109/WASPAA66052.2025.11230965}}

If you use the checkpoints trained for our ICME 2025 Audio Encoder Challenge submission, please also cite:
@article{bharadwaj2026cmu,
title={The CMU-AIST submission for the ICME 2025 Audio Encoder Challenge},
author={Bharadwaj, Shikhar and Cornell, Samuele and Choi, Kwanghee and Shim, Hye-jin and Deshmukh, Soham and Fukayama, Satoru and Watanabe, Shinji},
journal={arXiv preprint arXiv:2601.16273},
year={2026}
}