PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model

Abstract

Affordance understanding, the task of identifying actionable regions on 3D objects, is critical for enabling robotic systems to interact with the physical world. Although Vision-Language Models (VLMs) have excelled in high-level reasoning and long-horizon planning for robotic manipulation, they still fall short in grasping the nuanced physical properties required for effective human-robot interaction. In this paper, we introduce PAVLM (Point cloud Affordance Vision-Language Model), a novel framework that leverages the rich multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds. PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics. On the language side, we prompt Llama-3.1 to generate refined, context-aware text, augmenting the instructional input with deeper semantic cues. Experimental results on the 3D AffordanceNet benchmark demonstrate that PAVLM outperforms baseline methods on both full and partial point clouds, particularly excelling in its generalization to novel, open-world affordance tasks on 3D objects.

Key Features

  • Integration of VLMs and LLMs
  • Geometric-guided propagation module
  • Effective on full and partial point clouds
  • Superior open-world performance

Methodology

PAVLM combines the strengths of visual and language models to enhance 3D affordance understanding. Specifically, our geometric-guided propagation module enriches visual semantics, while Llama-3.1 generates refined, context-aware text instructions to better guide the robot's actions. The following sections detail the architecture of the key components, starting with the Geometric-guided Point Encoder and Decoder. The diagram below illustrates the overall architecture of PAVLM, highlighting the interaction between the visual and language models.

Figure: Overall architecture of PAVLM.
Our PAVLM pipeline is designed to directly encode point cloud data. We first devise a geometric-guided propagation module to extract point features. Meanwhile, Llama-3.1 is tasked with generating richer and more detailed textual instructions. Both visual and textual embeddings are aligned through the 3D Image-Bind approach, with the combined features fed into the multi-modal LLM (Llama-2) for mask label generation. Finally, the per-point feature embeddings are multiplied by the <mask label> token and input into a 3D affordance decoder to generate the final affordance map. The following figure provides a detailed view of the PAVLM pipeline, illustrating each stage, from point cloud encoding to affordance map generation.
Figure: The PAVLM pipeline, from point cloud encoding to affordance map generation.
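To make this data flow concrete, here is a minimal sketch of the forward pass in PyTorch-style code. All module and tensor names (PAVLMSketch, point_encoder, mask_head, and so on) are illustrative assumptions, and the heavy components (the geometric-guided encoder, the 3D Image-Bind alignment, and the multi-modal LLM) are replaced by lightweight stand-ins; this is not the released implementation.

```python
# Minimal sketch of the PAVLM forward pass (assumed module names; heavy
# components are replaced by lightweight stand-ins so the flow is runnable).
import torch
import torch.nn as nn


class PAVLMSketch(nn.Module):
    def __init__(self, feat_dim=384, llm_dim=4096, num_affordances=18):
        super().__init__()
        # Stand-in for the geometric-guided point encoder (per-point features).
        self.point_encoder = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                           nn.Linear(feat_dim, feat_dim))
        # Stand-in for the 3D Image-Bind style projection that aligns point
        # features with the language model's embedding space.
        self.vision_proj = nn.Linear(feat_dim, llm_dim)
        # Stand-in for the multi-modal LLM head that emits a <mask label> embedding.
        self.mask_head = nn.Linear(llm_dim, feat_dim)
        # Stand-in for the 3D affordance decoder (per-point affordance scores).
        self.decoder = nn.Linear(feat_dim, num_affordances)

    def forward(self, points, text_embedding):
        # points: (B, N, 3) point cloud; text_embedding: (B, llm_dim) from the
        # instruction generated by Llama-3.1 (assumed to be pre-computed).
        point_feat = self.point_encoder(points)              # (B, N, feat_dim)
        fused = self.vision_proj(point_feat).mean(dim=1)     # (B, llm_dim)
        mask_label = self.mask_head(fused + text_embedding)  # (B, feat_dim)
        # Per-point features modulated by the <mask label> token, then decoded.
        modulated = point_feat * mask_label.unsqueeze(1)     # (B, N, feat_dim)
        return torch.sigmoid(self.decoder(modulated))        # (B, N, num_affordances)


if __name__ == "__main__":
    model = PAVLMSketch()
    pts = torch.randn(2, 2048, 3)
    txt = torch.randn(2, 4096)
    print(model(pts, txt).shape)  # torch.Size([2, 2048, 18])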
The diagram illustrates the architecture of the proposed Geometric-guided Point Encoder and Decoder within the PAVLM pipeline. It highlights the two core components: the geometric information extraction module and the feature propagation module. As shown in the figure, the point cloud data is divided into patches and processed through a series of transformer blocks to extract refined geometric features, while the decoder handles feature propagation and ensures that point-wise affordance information is accurately distributed across the point cloud.
Figure: Architecture of the Geometric-guided Point Encoder and Decoder.
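A hedged sketch of this encoder/decoder idea is given below: patch centers are sampled, k-nearest-neighbor grouping forms patches, a transformer stack refines per-patch tokens, and the decoder propagates patch features back to every point by inverse-distance interpolation (a PointNet++-style feature propagation step). Random center sampling stands in for farthest point sampling, and all hyperparameters are illustrative choices rather than the paper's settings.

```python
# Sketch of a geometric-guided point encoder/decoder (illustrative choices).
import torch
import torch.nn as nn


def group_patches(points, num_patches=64, k=32):
    """Split a point cloud (B, N, 3) into center-normalized patches."""
    B, N, _ = points.shape
    # Random sampling as a simple stand-in for farthest point sampling (FPS).
    idx = torch.randint(0, N, (B, num_patches), device=points.device)
    centers = torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
    dists = torch.cdist(centers, points)                     # (B, P, N)
    knn_idx = dists.topk(k, largest=False).indices           # (B, P, k)
    neighbors = torch.gather(
        points.unsqueeze(1).expand(-1, num_patches, -1, -1),
        2, knn_idx.unsqueeze(-1).expand(-1, -1, -1, 3))      # (B, P, k, 3)
    return centers, neighbors - centers.unsqueeze(2)


class GeometricEncoderDecoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, points):
        centers, patches = group_patches(points)               # (B, P, 3), (B, P, k, 3)
        tokens = self.patch_embed(patches).max(dim=2).values   # (B, P, dim) patch tokens
        tokens = self.blocks(tokens)                            # refined geometric features
        # Decoder: propagate patch features to every point via inverse-distance
        # interpolation over the 3 nearest patch centers (feature propagation).
        d = torch.cdist(points, centers).clamp(min=1e-8)        # (B, N, P)
        knn = d.topk(3, largest=False)
        w = 1.0 / knn.values
        w = w / w.sum(dim=-1, keepdim=True)                      # (B, N, 3) weights
        gathered = torch.gather(
            tokens.unsqueeze(1).expand(-1, points.shape[1], -1, -1),
            2, knn.indices.unsqueeze(-1).expand(-1, -1, -1, tokens.shape[-1]))
        return (w.unsqueeze(-1) * gathered).sum(dim=2)           # (B, N, dim) per-point features


if __name__ == "__main__":
    enc_dec = GeometricEncoderDecoder()
    print(enc_dec(torch.randn(2, 2048, 3)).shape)  # torch.Size([2, 2048, 256])
```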
We use an augmentation prompt template to enhance the diversity of generated question-answer pairs, improving the model's contextual understanding. This strategy is particularly important for ensuring flexibility in robotic interactions. As demonstrated in the accompanying diagram, the system generates multiple versions of question-answer pairs based on a seed input.
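As a minimal illustration, the snippet below shows how such an augmentation prompt could be assembled from a seed question-answer pair. The template wording is an assumption made for this sketch, not the exact prompt used by PAVLM, and the resulting string would then be passed to Llama-3.1 to obtain the paraphrased pairs.

```python
# Illustrative augmentation prompt builder (the template text is an assumption,
# not the paper's exact prompt). The returned string would be sent to Llama-3.1
# to obtain diversified question-answer pairs.
AUGMENT_TEMPLATE = (
    "You are helping to diversify robot instruction data.\n"
    "Given the seed question and answer below, rewrite them into {n} new "
    "question-answer pairs that keep the same affordance meaning but vary the "
    "phrasing and level of detail.\n\n"
    "Seed question: {question}\n"
    "Seed answer: {answer}\n"
)


def build_augmentation_prompt(question: str, answer: str, n: int = 5) -> str:
    return AUGMENT_TEMPLATE.format(n=n, question=question, answer=answer)


if __name__ == "__main__":
    prompt = build_augmentation_prompt(
        question="Which part of the mug should the robot grasp?",
        answer="The handle region affords grasping.",
    )
    print(prompt)
```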

Experimental Results

We conducted extensive experiments to evaluate PAVLM's performance. Our ablation studies examined the effects of different text prompts and vision encoders, while comparisons with state-of-the-art methods demonstrated PAVLM's superiority in both seen and unseen object categories.
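The tables below report mAP, AUC, aIOU, and MSE over per-point affordance predictions. The sketch that follows shows one way such metrics can be computed with scikit-learn for a single affordance channel; the binarized ground truth and the aIOU threshold sweep are common choices assumed here and may differ from the benchmark's official evaluation protocol.

```python
# Sketch of the per-point evaluation metrics (mAP, AUC, aIOU, MSE) for one
# affordance channel. The aIOU thresholds are an assumption; consult the
# 3D AffordanceNet protocol for the exact settings.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def affordance_metrics(pred, gt, iou_thresholds=np.arange(0.05, 1.0, 0.05)):
    """pred: (N,) predicted scores in [0, 1]; gt: (N,) binary ground truth."""
    ap = average_precision_score(gt, pred)
    auc = roc_auc_score(gt, pred)
    mse = float(np.mean((pred - gt) ** 2))
    ious = []
    for t in iou_thresholds:
        hard = pred >= t
        inter = np.logical_and(hard, gt.astype(bool)).sum()
        union = np.logical_or(hard, gt.astype(bool)).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return {"mAP": ap, "AUC": auc, "aIOU": float(np.mean(ious)), "MSE": mse}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = (rng.random(2048) > 0.8).astype(float)
    pred = np.clip(gt * 0.7 + rng.random(2048) * 0.3, 0, 1)
    print(affordance_metrics(pred, gt))
```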

Ablation Study: Text Prompts

Table 1: Ablation study of different text prompts.
Point cloud        Text Prompt     mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape (Aug.)  Ours            45.7   86.9   16.8    0.45
Full-shape         Ours            46.0   87.1   17.1    0.44
Full-shape         Object, Action  44.5   85.6   16.0    0.67
Full-shape         Action          45.4   87.2   16.7    0.45
Full-shape         Hi              11.4   51.9    0.4    0.90

Ablation Study: Vision Encoders

Table 2: Ablation study of different vision encoders.
Point cloud    Vision encoder   mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape     PointNet++       47.1   86.2   18.1    0.36
Full-shape     DGCNN            31.0   72.9    1.0    0.68
Full-shape     Ours             48.5   86.8   17.7    0.36
Partial-shape  PointNet++       25.3   69.8    3.4    0.73
Partial-shape  DGCNN            26.8   69.7    0.7    0.67
Partial-shape  Ours             40.2   82.2   11.7    0.39

Comparison: Seen Categories

Table 3: Comparison results with baselines on the seen split of the dataset.
Point cloud    Method         mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape     PointCLIP       7.6   49.9    0.9    0.80
Full-shape     PointCLIP V2    7.6   50.0    0.8    0.70
Full-shape     Ours (frozen)  46.3   86.9   17.1    0.44
Full-shape     Ours           48.5   86.8   17.7    0.36
Partial-shape  PointCLIP       9.1   50.1    1.1    0.78
Partial-shape  PointCLIP V2    9.2   49.9    1.4    0.66
Partial-shape  Ours (frozen)  37.2   64.2   11.4    0.36
Partial-shape  Ours           40.2   82.2   11.7    0.39

Comparison: Unseen Categories

Table 4: Comparison results with baselines on the unseen split of the dataset.
Point cloud    Method         mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape     PointCLIP       3.7   49.9   0.5     1.58
Full-shape     PointCLIP V2    3.7   49.3   0.4     0.91
Full-shape     Ours (frozen)  11.8   60.3   1.5     0.78
Full-shape     Ours           12.1   53.9   2.6     0.51
Partial-shape  PointCLIP       4.7   50.2   0.6     1.14
Partial-shape  PointCLIP V2    4.7   50.3   0.38    0.93
Partial-shape  Ours (frozen)  13.8   60.3   2.0     0.60
Partial-shape  Ours           22.3   63.2   2.1     0.82

Our results show that PAVLM consistently outperforms baseline methods across various metrics. The ablation studies reveal that comprehensive question prompts and our proposed geometric-guided point encoder contribute significantly to the model's performance. Furthermore, PAVLM demonstrates strong generalization capabilities, particularly in handling unseen object categories.

Visualization of Results

Visualization of affordance prediction results on multiple unseen objects and categories, showing that our model predicts accurate affordances on the original partial point clouds when guided by different instructions.