Under Review · TMLR

Task-Relevant Language-conditioned Segmentation for Robust Generalization in Reinforcement Learning

Semantic visual filtering via mask distillation — no online segmentation at training or deployment.

Anonymous Authors  ·  Paper under double-blind review

Anonymous Institution(s)

Semantic Filtering for Visual RL

Humans possess a remarkable ability to filter out irrelevant sensory clutter, extracting only the information needed to anticipate and act within dynamic environments. Visual RL agents lack this ability: policies trained from pixels latch onto task-irrelevant features and degrade sharply under visual distribution shift. Prior attempts to mitigate this through augmentation and masking strategies have improved robustness, but remain limited by computational overhead, weak semantic grounding, or instability in actor-critic training.


Inspired by how language guides human perception, we introduce TaLaS (Task-Relevant Language-conditioned Segmentation), a framework that leverages language-conditioned segmentation to impose semantic structure on visual observations. TaLaS follows a two-phase design: in Phase I, a lightweight masker is pretrained to imitate language-guided masks from a frozen segmentation backbone; in Phase II, a student masker is trained under strong augmentations to match the frozen teacher, enforcing augmentation consistency. This yields a task-relevant feature extractor that improves policy stability and removes the need for online segmentation at inference time. To address the distribution shift the actor faces at deployment, we employ asymmetric actor-critic training.


TaLaS improves robustness to distractors and achieves particularly strong performance under challenging visual shifts on RL-ViGen, while remaining competitive in easier settings. The benchmark includes challenging variants of the DeepMind Control Suite, Quadruped Locomotion, and Dexterous Manipulation tasks.

Method Overview Video

TaLaS — Method Overview

An overview of the TaLaS framework: two-phase mask distillation and asymmetric actor-critic training.

Rollout Comparisons

Video placeholder
Walker Walk — Video Hard
TaLaS vs baselines under heavy background distraction
Video placeholder
Hammer — Adroit Manipulation
Dexterous manipulation with sparse rewards
Video placeholder
Unitree Quadruped Walk
Locomotion under visual distribution shift
Video placeholder
Mask Visualization
Language-guided semantic masks across tasks

How TaLaS Works

TaLaS decouples semantic grounding from online computation through a two-phase training pipeline, followed by asymmetric RL optimization on masked observations.

🎯
Phase I

Language-Guided Mask Distillation

Natural language prompts drive a frozen segmentation backbone (SAMWISE) on clean observations to produce task-relevant masks. A lightweight convolutional masker ψm is trained to imitate these targets via binary cross-entropy, amortizing expensive segmentation into a compact network that requires no online inference.
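The Phase I objective can be sketched as a minimal NumPy toy in which per-pixel logits stand in for the convolutional masker ψm and a random binary map stands in for a SAMWISE target mask (both are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

def bce_loss(logits, targets):
    """Per-pixel binary cross-entropy against teacher masks in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-8
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

# Stand-in teacher targets (in TaLaS these come from SAMWISE on clean frames).
rng = np.random.default_rng(0)
teacher_mask = (rng.random((8, 8)) > 0.5).astype(float)

# "Distill" by gradient descent; d(mean BCE)/d(logit) is proportional to
# (sigmoid(logit) - target), so each step pushes logits toward the teacher mask.
logits = np.zeros((8, 8))
loss_before = bce_loss(logits, teacher_mask)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-logits))
    logits -= 5.0 * (p - teacher_mask)
loss_after = bce_loss(logits, teacher_mask)
```

After a few hundred steps the distilled logits reproduce the teacher mask; in TaLaS the same loss trains a compact conv net once, so no segmentation backbone is needed afterward.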

🔄
Phase II

Augmentation-Consistent Student

The pretrained teacher masker is frozen. A student masker ψm receives strongly augmented observations (image overlay from Places365) and is trained to match the teacher's clean-frame masks — achieving augmentation-invariant semantic filtering without any online segmentation model.
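The Phase II consistency target can be sketched as follows, assuming a simple alpha-blend overlay as the strong augmentation (the `overlay` helper and the 0.5 blend weight are illustrative; the paper draws distractor images from Places365):

```python
import numpy as np

def overlay(obs, distractor, alpha=0.5):
    """Image-overlay augmentation: blend a distractor image into the clean frame."""
    return (1.0 - alpha) * obs + alpha * distractor

rng = np.random.default_rng(1)
clean = rng.random((3, 16, 16))       # clean observation (C, H, W), values in [0, 1]
distractor = rng.random((3, 16, 16))  # stand-in for a Places365 background image
aug = overlay(clean, distractor)

# Phase II objective (schematic): the student sees `aug`, but its regression
# target is the frozen teacher's mask computed on `clean`:
#   L = BCE(student_masker(aug), teacher_masker(clean))   # teacher is not updated
```

Because the target is always the clean-frame mask, the student learns to ignore the injected distractor content, which is what makes the filtering augmentation-invariant.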

⚖️
RL Backbone

Asymmetric Actor-Critic on Masked Observations

The frozen student masker filters all observations before the encoder. To handle mask imperfections at deployment, we use an asymmetric strategy: the actor is conditioned on both clean and augmented masked embeddings; the critic evaluating actor actions uses only the clean masked view. Bootstrap targets are computed from clean views only, maintaining low-variance value estimates while regularizing the policy against residual mask noise.
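The asymmetric critic update can be sketched as follows (a schematic with scalar Q-values; the function names are illustrative): bootstrap targets use only the clean masked view, while the critic is regressed toward that target from both views:

```python
import numpy as np

def td_target(reward, q_next_clean, done, gamma=0.99):
    """Bootstrap target computed from the CLEAN masked view only (low variance)."""
    return reward + gamma * (1.0 - done) * q_next_clean

def critic_loss(q_clean, q_aug, target):
    """Regress Q on both the clean and the augmented masked embedding toward
    the same clean-view target (no gradient flows through the target)."""
    return 0.5 * np.mean((q_clean - target) ** 2) + 0.5 * np.mean((q_aug - target) ** 2)

# Tiny example: one transition.
target = td_target(reward=1.0, q_next_clean=2.0, done=0.0)  # 1 + 0.99 * 2 = 2.98
loss = critic_loss(np.array([2.98]), np.array([3.50]), target)
```

Keeping the augmented view out of the bootstrap target avoids injecting augmentation noise into value estimates, while the extra regression term still regularizes the policy against residual mask errors.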

Figure 1 — TaLaS masker training phases (placeholder)

Figure 1. Depiction of the masker training phases of TaLaS. Phase I (Language-conditioned distillation): a text prompt drives a frozen SAMWISE backbone on unaugmented frames; the compact masker ψm is pretrained to reconstruct the resulting masks. Phase II (Augmentation-consistent student): an augmented frame is fed to the noisy student, which is optimized to match the teacher's targets on the clean frame.

Figure 2 — RL backbone with asymmetric actor-critic (placeholder)

Figure 2. Integrating the masking strategy with the RL backbone for robust policy learning. The frozen masker ψm filters unaugmented inputs; the student masker ψm filters augmented inputs. Critics θq are trained on both views; the target critic bootstraps from the clean view only.

Quantitative Results

+13.4%
gain over next-best (PIE-G) on DMC Video-Hard
7 / 12
DMC environments where TaLaS ranks first
7.5%
performance drop VE→VH (vs 35% for SRM, 19% for PIE-G)
▲ Video-Hard — 100 background videos
Task              | SAC | DrQ | SVEA | SRM | PIE-G | SGQN | MaDi | CNSN | TaLaS
Cartpole Swingup  | 158 | 138 |  393 | 475 |   323 |  488 |  619 |  309 |   395
Walker Walk       | 122 | 104 |  377 | 535 |   641 |  655 |  504 |  669 |   787
Walker Stand      | 231 | 289 |  834 | 863 |   852 |  851 |  824 |  856 |   940
Ball in Cup Catch | 101 | 100 |  403 | 566 |   773 |  782 |  758 |  721 |   864
Finger Spin       |  13 |  91 |  335 | 419 |   762 |  554 |  358 |  556 |   788
Cheetah Run       |  10 |  32 |  105 | 115 |   154 |  144 |  170 |  162 |   198
Average           | 106 | 126 |  408 | 496 |   584 |  579 |  539 |  545 |   662
● Video-Easy — 10 background videos
Task              | SAC | DrQ | SVEA | SRM | PIE-G | SGQN | MaDi | CNSN | TaLaS
Cartpole Swingup  | 398 | 485 |  782 | 724 |   482 |  717 |  848 |  353 |   531
Walker Walk       | 245 | 682 |  819 | 854 |   871 |  860 |  895 |  923 |   878
Walker Stand      | 389 | 873 |  961 | 963 |   957 |  955 |  967 |  956 |   961
Ball in Cup Catch | 192 | 318 |  871 | 924 |   910 |  761 |  807 |  892 |   856
Finger Spin       | 206 | 533 |  808 | 853 |   837 |  609 |  679 |  683 |   850
Cheetah Run       |  87 | 102 |  249 | 257 |   287 |  269 |  294 |  347 |   219
Average           | 253 | 499 |  757 | 763 |   724 |  697 |  748 |  692 |   716

Mean episode return, 5 seeds per task. Bold = best in setting.

Figure 4 — t-SNE cluster visualization (placeholder)

Figure 4. t-SNE embeddings of 10 states augmented with 40 unseen backgrounds. TaLaS (score: 878) yields the tightest clusters compared to SRM (535), PIEG (641), and SGQN (655), indicating stronger domain-invariant representations.

Task          | Setting | DrQ | DrQ-v2 | CURL | SVEA | SRM | PIE-G | SGQN | TaLaS
Unitree Walk  | VE      |  67 |     98 |   75 |   98 |  98 |   140 |  152 |   189
Unitree Stand | VE      | 341 |    375 |  431 |  587 | 553 |   380 |  447 |   325
Average       | VE      | 204 |    236 |  253 |  343 | 326 |   260 |  332 |   257
Unitree Walk  | VH      |  40 |     83 |   61 |   74 |  72 |   204 |  123 |   206
Unitree Stand | VH      |  66 |     96 |   99 |  279 | 300 |   202 |  140 |   289
Average       | VH      |  53 |     89 |   80 |  177 | 186 |   203 |  131 |   247

TaLaS achieves a 22% gain in VH average return over the next-best method (PIE-G).

Figure 3 — Adroit task bar chart (placeholder)

Figure 3. Performance on Adroit tasks (Pen, Door, Hammer) under video-easy (VE) and video-hard (VH) settings, averaged over 3 seeds. Error bars indicate standard deviation.

+48%
over PIE-G on Adroit VE average
+51%
over top-3 baselines avg on Adroit VH
7.2s
per 100 env steps vs ~52s for online SAMWISE

BibTeX

@article{talas2025,
  title   = {Task-Relevant Language-conditioned Segmentation for
             Robust Generalization in Reinforcement Learning},
  author  = {Anonymous Authors},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  note    = {Under review},
  url     = {https://talas-rl.github.io}
}