Semantic visual filtering via mask distillation, with no online segmentation during RL training or deployment.
Anonymous Institution(s)
Humans possess a remarkable ability to filter out irrelevant sensory clutter, extracting only the information needed to anticipate and act within dynamic environments. Visual reinforcement learning agents lack this ability: policies trained on clean observations often degrade sharply under distractors and background shifts. Prior attempts to mitigate this through augmentation and masking strategies have improved robustness, but remain limited by computational overhead, weak semantic grounding, or instability in actor-critic training.
Inspired by how language guides human perception, we introduce TaLaS (Task-Relevant Language-conditioned Segmentation), a framework that leverages language-conditioned segmentation to impose semantic structure on visual observations. TaLaS uses a two-phase design: in Phase I, a lightweight masker is pretrained to imitate language-guided masks; in Phase II, a student masker is trained under strong augmentations to remain consistent with the teacher's clean-frame masks. This yields a task-relevant feature extractor that stabilizes policy learning and removes the need for online segmentation at inference time. To address the distribution shift the actor faces at deployment, we employ asymmetric actor-critic training.
TaLaS improves robustness to distractors and achieves particularly strong performance under challenging visual shifts on RL-ViGen, while remaining competitive in easier settings. The benchmark includes challenging variants of the DeepMind Control Suite, Quadruped Locomotion, and Dexterous Manipulation tasks.
An overview of the TaLaS framework: two-phase mask distillation and asymmetric actor-critic training.
TaLaS decouples semantic grounding from online computation through a two-phase training pipeline, followed by asymmetric RL optimization on masked observations.
Natural language prompts drive a frozen segmentation backbone (SAMWISE) on clean observations to produce task-relevant masks. A lightweight convolutional masker ψm is trained to imitate these targets via binary cross-entropy, amortizing the expensive segmentation backbone into a compact network so that no segmentation model has to run online.
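To make Phase I concrete, here is a minimal PyTorch sketch. It assumes the language-guided masks have been precomputed offline by the frozen backbone; the architecture and the names `LightweightMasker` and `phase1_distillation_step` are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightMasker(nn.Module):
    """Illustrative compact convolutional masker (psi_m); the paper's
    exact architecture may differ."""
    def __init__(self, in_channels: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),  # per-pixel mask logits
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # (B, 1, H, W) logits

def phase1_distillation_step(masker, optimizer, clean_obs, teacher_mask):
    """One Phase-I step: imitate a precomputed language-guided mask
    (binary, shape (B, 1, H, W)) via binary cross-entropy."""
    loss = F.binary_cross_entropy_with_logits(masker(clean_obs), teacher_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone only produces targets once, Phase I can run over a fixed dataset of (observation, mask) pairs, keeping the heavy segmentation model entirely out of the training loop.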
The pretrained teacher masker is frozen. A student masker ψ★m receives strongly augmented observations (image overlays sampled from Places365) and is trained to match the teacher's clean-frame masks, achieving augmentation-invariant semantic filtering without any online segmentation model.
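A matching sketch of the Phase II consistency objective follows, reusing the masker class above. The `overlay_augment` helper and its blending factor are illustrative stand-ins for the paper's Places365 overlay augmentation.

```python
import torch
import torch.nn.functional as F

def overlay_augment(obs: torch.Tensor, distractor: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Overlay augmentation: blend a random background image
    (e.g. sampled from Places365) into the observation."""
    return alpha * obs + (1.0 - alpha) * distractor

def phase2_consistency_step(student, teacher, optimizer, clean_obs, distractor):
    """One Phase-II step: the student (psi*_m) sees only the augmented
    frame but must reproduce the frozen teacher's clean-frame mask."""
    with torch.no_grad():
        target = torch.sigmoid(teacher(clean_obs))  # soft teacher mask
    aug_obs = overlay_augment(clean_obs, distractor)
    loss = F.binary_cross_entropy_with_logits(student(aug_obs), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the teacher keeps the target distribution stationary, so the student only has to learn invariance to the overlay rather than chase a moving mask definition.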
The frozen student masker filters all observations before the encoder. To handle mask imperfections at deployment, we use an asymmetric strategy: the actor is conditioned on both clean and augmented masked embeddings; the critic evaluating actor actions uses only the clean masked view. Bootstrap targets are computed from clean views only, maintaining low-variance value estimates while regularizing the policy against residual mask noise.
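To illustrate how this asymmetric scheme fits a standard actor-critic update, here is a minimal PyTorch sketch. All names (`masked_embed`, `asymmetric_update`, the batch layout) are ours; for brevity it assumes an actor callable that returns an action and omits the entropy terms, twin critics, and target-network updates of a full SAC-style agent.

```python
import torch
import torch.nn.functional as F

def masked_embed(encoder, masker, obs):
    """Filter an observation with a frozen masker, then encode it."""
    with torch.no_grad():
        mask = torch.sigmoid(masker(obs))
    return encoder(obs * mask)

def asymmetric_update(actor, critic, target_critic, encoder,
                      teacher_masker, student_masker, batch,
                      actor_opt, critic_opt, discount=0.99):
    """Sketch of one asymmetric update; actor_opt holds only actor
    parameters, critic_opt the critic and encoder parameters."""
    clean_obs, aug_obs, action, reward, next_clean_obs = batch

    z_clean = masked_embed(encoder, teacher_masker, clean_obs)
    z_aug = masked_embed(encoder, student_masker, aug_obs)

    # Bootstrap target computed from the clean masked view only,
    # keeping value estimates low-variance.
    with torch.no_grad():
        z_next = masked_embed(encoder, teacher_masker, next_clean_obs)
        target_q = reward + discount * target_critic(z_next, actor(z_next))

    # Critic regression on both the clean and augmented masked views.
    critic_loss = (F.mse_loss(critic(z_clean, action), target_q)
                   + F.mse_loss(critic(z_aug, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # The actor proposes actions from both views, but the critic scores
    # them on the clean masked view, regularizing the policy against
    # residual mask noise in the augmented branch.
    z_clean_d, z_aug_d = z_clean.detach(), z_aug.detach()
    actor_loss = -(critic(z_clean_d, actor(z_clean_d))
                   + critic(z_clean_d, actor(z_aug_d))).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

The key asymmetry is in the last step: gradients from the augmented branch flow into the actor, but the value signal always comes from the clean masked view.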
Figure 1. Depiction of the masker training phases of TaLaS. Phase I (Language-conditioned distillation): a text prompt drives the frozen SAMWISE backbone on unaugmented frames; the compact masker ψm is pretrained to reproduce the resulting masks via binary cross-entropy. Phase II (Augmentation-consistent student): an augmented frame is fed to the noisy student ψ★m, which is optimized to match teacher targets on the clean frame.
Figure 2. Integrating the masking strategy with the RL backbone for robust policy learning. The frozen masker ψm filters unaugmented inputs; the student masker ψ★m filters augmented inputs. Critics θq are trained on both views; the target critic bootstraps from the clean view only.
Table 1. DeepMind Control, video-hard setting.

| Task | SAC | DrQ | SVEA | SRM | PIE-G | SGQN | MaDi | CNSN | TaLaS |
|---|---|---|---|---|---|---|---|---|---|
| Cartpole Swingup | 158 | 138 | 393 | 475 | 323 | 488 | **619** | 309 | 395 |
| Walker Walk | 122 | 104 | 377 | 535 | 641 | 655 | 504 | 669 | **787** |
| Walker Stand | 231 | 289 | 834 | 863 | 852 | 851 | 824 | 856 | **940** |
| Ball in Cup Catch | 101 | 100 | 403 | 566 | 773 | 782 | 758 | 721 | **864** |
| Finger Spin | 13 | 91 | 335 | 419 | 762 | 554 | 358 | 556 | **788** |
| Cheetah Run | 10 | 32 | 105 | 115 | 154 | 144 | 170 | 162 | **198** |
| Average | 106 | 126 | 408 | 496 | 584 | 579 | 539 | 545 | **662** |
Table 2. DeepMind Control, video-easy setting.

| Task | SAC | DrQ | SVEA | SRM | PIE-G | SGQN | MaDi | CNSN | TaLaS |
|---|---|---|---|---|---|---|---|---|---|
| Cartpole Swingup | 398 | 485 | 782 | 724 | 482 | 717 | **848** | 353 | 531 |
| Walker Walk | 245 | 682 | 819 | 854 | 871 | 860 | 895 | **923** | 878 |
| Walker Stand | 389 | 873 | 961 | 963 | 957 | 955 | **967** | 956 | 961 |
| Ball in Cup Catch | 192 | 318 | 871 | **924** | 910 | 761 | 807 | 892 | 856 |
| Finger Spin | 206 | 533 | 808 | **853** | 837 | 609 | 679 | 683 | 850 |
| Cheetah Run | 87 | 102 | 249 | 257 | 287 | 269 | 294 | **347** | 219 |
| Average | 253 | 499 | 757 | **763** | 724 | 697 | 748 | 692 | 716 |
Mean episode return over 5 seeds per task; bold marks the best method per task in each setting.
Figure 4. t-SNE embeddings of 10 states augmented with 40 unseen backgrounds. TaLaS (score: 878) yields the tightest clusters, compared with SRM (535), PIE-G (641), and SGQN (655), indicating stronger domain-invariant representations.
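For readers who want to reproduce this kind of analysis, a minimal sketch follows. It assumes a trained `encoder`, the `masked_embed` and `overlay_augment` helpers from the sketches above, and state/background tensors of shape (1, C, H, W); the cluster scores in the figure are reported by the paper, not computed by this snippet.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE

def tsne_of_augmented_states(encoder, masker, states, backgrounds):
    """Embed every (state, background) overlay pair and project the encoder
    features to 2-D with t-SNE; tight per-state clusters indicate
    background-invariant representations."""
    feats, labels = [], []
    with torch.no_grad():
        for i, s in enumerate(states):        # e.g. 10 held-out states
            for bg in backgrounds:            # e.g. 40 unseen backgrounds
                z = masked_embed(encoder, masker, overlay_augment(s, bg))
                feats.append(z.flatten().cpu().numpy())
                labels.append(i)
    xy = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(feats))
    return xy, np.asarray(labels)
```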
| Task | Setting | DrQ | DrQ-v2 | CURL | SVEA | SRM | PIE-G | SGQN | TaLaS |
|---|---|---|---|---|---|---|---|---|---|
| Unitree Walk | VE | 67 | 98 | 75 | 98 | 98 | 140 | 152 | 189 |
| Unitree Stand | VE | 341 | 375 | 431 | 587 | 553 | 380 | 447 | 325 |
| Average | VE | 204 | 236 | 253 | 343 | 326 | 260 | 332 | 257 |
| Unitree Walk | VH | 40 | 83 | 61 | 74 | 72 | 204 | 123 | 206 |
| Unitree Stand | VH | 66 | 96 | 99 | 279 | 300 | 202 | 140 | 289 |
| Average | VH | 53 | 89 | 80 | 177 | 186 | 203 | 131 | 247 |
TaLaS achieves a 22% gain in VH average return over the next-best method (PIE-G).
Figure 3. Performance on Adroit tasks (Pen, Door, Hammer) under video-easy (VE) and video-hard (VH) settings, averaged over 3 seeds. Error bars indicate standard deviation.
@article{talas2025,
title = {Task-Relevant Language-conditioned Segmentation for
Robust Generalization in Reinforcement Learning},
author = {Anonymous Authors},
journal = {Transactions on Machine Learning Research},
year = {2025},
note = {Under review},
url = {https://talas-rl.github.io}
}