Semantic visual filtering via mask distillation, with no online segmentation during RL training or deployment.
Anonymous Institution(s)
Humans possess a remarkable ability to filter out irrelevant sensory clutter, extracting only the information needed to anticipate and act within dynamic environments. Visual reinforcement learning agents lack this ability: policies trained on clean observations often degrade sharply under distractors and background shifts. Prior attempts to mitigate this through augmentation and masking strategies have improved robustness, but remain limited by computational overhead, weak semantic grounding, or instability in actor-critic training.
Inspired by how language guides human perception, we introduce TaLaS (Task-Relevant Language-conditioned Segmentation), a framework that leverages language-conditioned segmentation to impose semantic structure on visual observations. TaLaS uses a two-phase design: in Phase I, a lightweight masker is pretrained to imitate language-guided masks; in Phase II, a student masker is trained under strong augmentations to remain consistent with the teacher's clean-frame masks. This yields a task-relevant feature extractor that stabilizes policy learning and removes the need for online segmentation at inference time. To address the distribution shift the actor faces at deployment, we employ asymmetric actor-critic training.
TaLaS improves robustness to distractors and achieves particularly strong performance under challenging visual shifts on RL-ViGen, while remaining competitive in easier settings. The benchmark includes challenging variants of the DeepMind Control Suite, Quadruped Locomotion, and Dexterous Manipulation tasks.
An overview of the TaLaS framework: two-phase mask distillation and asymmetric actor-critic training.
TaLaS decouples semantic grounding from online computation through a two-phase training pipeline, followed by asymmetric RL optimization on masked observations.
Natural language prompts drive a frozen segmentation backbone (SAMWISE) on clean observations to produce task-relevant masks. A lightweight convolutional masker ψm is trained to imitate these targets via binary cross-entropy, amortizing the expensive segmentation backbone into a compact network so that no segmentation model has to run online.
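To make Phase I concrete, here is a minimal PyTorch sketch. It assumes the language-guided masks have been precomputed offline by the frozen backbone; the architecture and the names `LightweightMasker` and `phase1_distillation_step` are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightMasker(nn.Module):
    """Illustrative compact convolutional masker (psi_m); the paper's
    exact architecture may differ."""
    def __init__(self, in_channels: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),  # per-pixel mask logits
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # (B, 1, H, W) logits

def phase1_distillation_step(masker, optimizer, clean_obs, teacher_mask):
    """One Phase-I step: imitate a precomputed language-guided mask
    (binary, shape (B, 1, H, W)) via binary cross-entropy."""
    loss = F.binary_cross_entropy_with_logits(masker(clean_obs), teacher_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone only produces targets once, Phase I can run over a fixed dataset of (observation, mask) pairs, keeping the heavy segmentation model entirely out of the training loop.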
The pretrained teacher masker is frozen. A student masker ψ★m receives strongly augmented observations (image overlays sampled from Places365) and is trained to match the teacher's clean-frame masks, achieving augmentation-invariant semantic filtering without any online segmentation model.
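A matching sketch of the Phase II consistency objective follows, reusing the masker class above. The `overlay_augment` helper and its blending factor are illustrative stand-ins for the paper's Places365 overlay augmentation.

```python
import torch
import torch.nn.functional as F

def overlay_augment(obs: torch.Tensor, distractor: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Overlay augmentation: blend a random background image
    (e.g. sampled from Places365) into the observation."""
    return alpha * obs + (1.0 - alpha) * distractor

def phase2_consistency_step(student, teacher, optimizer, clean_obs, distractor):
    """One Phase-II step: the student (psi*_m) sees only the augmented
    frame but must reproduce the frozen teacher's clean-frame mask."""
    with torch.no_grad():
        target = torch.sigmoid(teacher(clean_obs))  # soft teacher mask
    aug_obs = overlay_augment(clean_obs, distractor)
    loss = F.binary_cross_entropy_with_logits(student(aug_obs), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the teacher keeps the target distribution stationary, so the student only has to learn invariance to the overlay rather than chase a moving mask definition.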
The frozen student masker filters all observations before the encoder. To handle mask imperfections at deployment, we use an asymmetric strategy: the actor is conditioned on both clean and augmented masked embeddings; the critic evaluating actor actions uses only the clean masked view. Bootstrap targets are computed from clean views only, maintaining low-variance value estimates while regularizing the policy against residual mask noise.
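To illustrate how this asymmetric scheme fits a standard actor-critic update, here is a minimal PyTorch sketch. All names (`masked_embed`, `asymmetric_update`, the batch layout) are ours; for brevity it assumes an actor callable that returns an action and omits the entropy terms, twin critics, and target-network updates of a full SAC-style agent.

```python
import torch
import torch.nn.functional as F

def masked_embed(encoder, masker, obs):
    """Filter an observation with a frozen masker, then encode it."""
    with torch.no_grad():
        mask = torch.sigmoid(masker(obs))
    return encoder(obs * mask)

def asymmetric_update(actor, critic, target_critic, encoder,
                      teacher_masker, student_masker, batch,
                      actor_opt, critic_opt, discount=0.99):
    """Sketch of one asymmetric update; actor_opt holds only actor
    parameters, critic_opt the critic and encoder parameters."""
    clean_obs, aug_obs, action, reward, next_clean_obs = batch

    z_clean = masked_embed(encoder, teacher_masker, clean_obs)
    z_aug = masked_embed(encoder, student_masker, aug_obs)

    # Bootstrap target computed from the clean masked view only,
    # keeping value estimates low-variance.
    with torch.no_grad():
        z_next = masked_embed(encoder, teacher_masker, next_clean_obs)
        target_q = reward + discount * target_critic(z_next, actor(z_next))

    # Critic regression on both the clean and augmented masked views.
    critic_loss = (F.mse_loss(critic(z_clean, action), target_q)
                   + F.mse_loss(critic(z_aug, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # The actor proposes actions from both views, but the critic scores
    # them on the clean masked view, regularizing the policy against
    # residual mask noise in the augmented branch.
    z_clean_d, z_aug_d = z_clean.detach(), z_aug.detach()
    actor_loss = -(critic(z_clean_d, actor(z_clean_d))
                   + critic(z_clean_d, actor(z_aug_d))).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

The key asymmetry is in the last step: gradients from the augmented branch flow into the actor, but the value signal always comes from the clean masked view.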
Figure 1. Depiction of the masker training phases of TaLaS. Phase I (Language-conditioned distillation): a text prompt drives the frozen SAMWISE backbone on unaugmented frames; the compact masker ψm is pretrained to reproduce the resulting masks via binary cross-entropy. Phase II (Augmentation-consistent student): an augmented frame is fed to the noisy student ψ★m, which is optimized to match teacher targets on the clean frame.
Figure 2. Integrating the masking strategy with the RL backbone for robust policy learning. The frozen masker ψm filters unaugmented inputs; the student masker ψ★m filters augmented inputs. Critics θq are trained on both views; the target critic bootstraps from the clean view only.
Table 1. DeepMind Control, video-hard setting.

| Task | SAC | DrQ | SVEA | SRM | PIE-G | SGQN | MaDi | CNSN | TaLaS |
|---|---|---|---|---|---|---|---|---|---|
| Cartpole Swingup | 158 | 138 | 393 | 475 | 323 | 488 | **619** | 309 | 395 |
| Walker Walk | 122 | 104 | 377 | 535 | 641 | 655 | 504 | 669 | **787** |
| Walker Stand | 231 | 289 | 834 | 863 | 852 | 851 | 824 | 856 | **940** |
| Ball in Cup Catch | 101 | 100 | 403 | 566 | 773 | 782 | 758 | 721 | **864** |
| Finger Spin | 13 | 91 | 335 | 419 | 762 | 554 | 358 | 556 | **788** |
| Cheetah Run | 10 | 32 | 105 | 115 | 154 | 144 | 170 | 162 | **198** |
| Average | 106 | 126 | 408 | 496 | 584 | 579 | 539 | 545 | **662** |
Table 2. DeepMind Control, video-easy setting.

| Task | SAC | DrQ | SVEA | SRM | PIE-G | SGQN | MaDi | CNSN | TaLaS |
|---|---|---|---|---|---|---|---|---|---|
| Cartpole Swingup | 398 | 485 | 782 | 724 | 482 | 717 | **848** | 353 | 531 |
| Walker Walk | 245 | 682 | 819 | 854 | 871 | 860 | 895 | **923** | 878 |
| Walker Stand | 389 | 873 | 961 | 963 | 957 | 955 | **967** | 956 | 961 |
| Ball in Cup Catch | 192 | 318 | 871 | **924** | 910 | 761 | 807 | 892 | 856 |
| Finger Spin | 206 | 533 | 808 | **853** | 837 | 609 | 679 | 683 | 850 |
| Cheetah Run | 87 | 102 | 249 | 257 | 287 | 269 | 294 | **347** | 219 |
| Average | 253 | 499 | 757 | **763** | 724 | 697 | 748 | 692 | 716 |
Mean episode return over 5 seeds per task; bold marks the best method per task in each setting.
Figure 4. t-SNE embeddings of 10 states augmented with 40 unseen backgrounds. TaLaS (score: 878) yields the tightest clusters, compared with SRM (535), PIE-G (641), and SGQN (655), indicating stronger domain-invariant representations.
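For readers who want to reproduce this kind of analysis, a minimal sketch follows. It assumes a trained `encoder`, the `masked_embed` and `overlay_augment` helpers from the sketches above, and state/background tensors of shape (1, C, H, W); the cluster scores in the figure are reported by the paper, not computed by this snippet.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE

def tsne_of_augmented_states(encoder, masker, states, backgrounds):
    """Embed every (state, background) overlay pair and project the encoder
    features to 2-D with t-SNE; tight per-state clusters indicate
    background-invariant representations."""
    feats, labels = [], []
    with torch.no_grad():
        for i, s in enumerate(states):        # e.g. 10 held-out states
            for bg in backgrounds:            # e.g. 40 unseen backgrounds
                z = masked_embed(encoder, masker, overlay_augment(s, bg))
                feats.append(z.flatten().cpu().numpy())
                labels.append(i)
    xy = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(feats))
    return xy, np.asarray(labels)
```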
| Task | Setting | DrQ | DrQ-v2 | CURL | SVEA | SRM | PIE-G | SGQN | TaLaS |
|---|---|---|---|---|---|---|---|---|---|
| Unitree Walk | VE | 67 | 98 | 75 | 98 | 98 | 140 | 152 | 189 |
| Unitree Stand | VE | 341 | 375 | 431 | 587 | 553 | 380 | 447 | 325 |
| Average | VE | 204 | 236 | 253 | 343 | 326 | 260 | 332 | 257 |
| Unitree Walk | VH | 40 | 83 | 61 | 74 | 72 | 204 | 123 | 206 |
| Unitree Stand | VH | 66 | 96 | 99 | 279 | 300 | 202 | 140 | 289 |
| Average | VH | 53 | 89 | 80 | 177 | 186 | 203 | 131 | 247 |
TaLaS achieves a 22% gain in VH average return over the next-best method (PIE-G).
Figure 3. Performance on Adroit tasks (Pen, Door, Hammer) under video-easy (VE) and video-hard (VH) settings, averaged over 3 seeds. Error bars indicate standard deviation.
@article{talas2025,
title = {Task-Relevant Language-conditioned Segmentation for
Robust Generalization in Reinforcement Learning},
author = {Anonymous Authors},
journal = {Transactions on Machine Learning Research},
year = {2025},
note = {Under review},
url = {https://talas-rl.github.io}
}