For a categorized collection of survey papers from past years, see here ↘️ CV-Surveys (under construction)
- PostureHMR: Posture Transformation for 3D Human Mesh Recovery
- WANDR: Intention-guided Human Motion Generation
📺video - Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
🏠project - SimDA: Simple Diffusion Adapter for Efficient Video Generation
⭐code
🏠project - Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
🏠project - DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
🏠project - PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
🏠project - On the Scalability of Diffusion-based Text-to-Image Generation
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
⭐code - Shadow Generation for Composite Image Using Diffusion Model
⭐code - FreeU: Free Lunch in Diffusion U-Net
⭐code
🏠project - AnyDoor: Zero-shot Object-level Image Customization
🏠project (image generation) - Unlocking Pre-trained Image Backbones for Semantic Image Synthesis
- HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
🏠project - GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
🏠project (3D scene synthesis) - Learning from Observer Gaze: Zero-shot Attention Prediction Oriented by Human-Object Interaction Recognition (human-computer interaction)
- CG-HOI: Contact-Guided 3D Human-Object Interaction Generation
🏠project - Denoising Point Cloud in Latent Space via Graph Convolution and Invertible Neural Network (point cloud denoising)
- Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation
- Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Semantic Segmentation
- BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation
⭐code - ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
- APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation
- SANeRF-HQ: Segment Anything for NeRF in High Quality
🏠project - Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
- PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation
- COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
⭐code - Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection
- Language-driven All-in-one Adverse Weather Removal (adverse weather removal)
- HomoFormer: Homogenized Transformer for Image Shadow Removal
- Language-guided Image Reflection Separation (image reflection separation)
- Image Restoration by Denoising Diffusion Models With Iteratively Preconditioned Guidance
⭐code - Efficient Meshflow and Optical Flow Estimation from Event Cameras (optical flow estimation)
- LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis
⭐code
🏠project (NVS) - Federated Generalized Category Discovery (image classification)
- Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification
🏠project - Detours for Navigating Instructional Videos (instructional video navigation)
- TransLoc4D: Transformer-based 4D-Radar Place Recognition (place recognition)
- Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation
- Driving Everywhere with Large Language Model Policy Adaptation
🏠project (LLM) - Towards Transferable Targeted 3D Adversarial Attack in the Physical World (adversarial attack)
- Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion (image fusion)
- FaceLift: Semi-supervised 3D Facial Landmark Localization
- 3D Facial Expressions through Analysis-by-Neural-Synthesis
🏠project - Aligning Logits Generatively for Principled Black-Box Knowledge Distillation (knowledge distillation)
- Mean-Shift Feature Transformer
- Motion Diversification Networks
- Domain Gap Embeddings for Generative Dataset Augmentation
- Absolute Pose from One or Two Scaled and Oriented Features
- Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance
- From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers
- GART: Gaussian Articulated Template Models
🏠project - Towards Learning a Generalist Model for Embodied Navigation
- Language-aware Query Mask Transformer for Referring Image Segmentation
- pix2gestalt: Amodal Segmentation by Synthesizing Wholes
⭐code
🏠project (amodal segmentation) - Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors
- Relational Matching for Weakly Semi-Supervised Oriented Object Detection (weakly semi-supervised oriented object detection)
- Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection
- JointSQ: Joint Sparsification-Quantization for Distributed Learning (quantization)
- NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning
⭐code (incremental learning) - Learning for Transductive Threshold Calibration in Open-World Recognition (GCN)
- LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation (indoor lighting estimation)
- SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
⭐code
🏠project (semantic matching) - TextCraftor: Your Text Encoder Can be Image Quality Controller (image quality)
- Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
🏠project (text-to-image generation) - AAMDM: Accelerated Auto-regressive Motion Diffusion Model (motion diffusion model)
- TexOct: Generating Textures of 3D Models with Octree-based Diffusion (3D models)
- Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
🏠project - OTE: Exploring Accurate Scene Text Recognition Using One Token (scene text recognition)
- LiDAR-based Person Re-identification
- Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification (visible-infrared)
- Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation (3D point cloud instance segmentation)
- Neural Spline Fields for Burst Image Fusion and Layer Separation
🏠project (image fusion) - Deep Video Inverse Tone Mapping Based on Temporal Clues
- Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models (adversarial attack)
- Seeing the Unseen: Visual Common Sense for Semantic Placement
- Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching
⭐code (data condensation) - L2B: Learning to Bootstrap Robust Models for Combating Label Noise
⭐code - Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse
- Latent Modulated Function for Computational Optimal Continuous Image Representation
- UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
⭐code - GPT4Point: A Unified Framework for Point-Language Understanding and Generation
- AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
- [Sharingan: A Transformer Architecture for Multi-Person Gaze Following]
- Sharingan: A Transformer-based Architecture for Gaze Following (gaze following)
- What Sketch Explainability Really Means for Downstream Tasks
- SketchINR: A First Look into Sketches as Implicit Neural Representations
- CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention
- Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
⭐code
- ScanFormer: Referring Expression Comprehension by Iteratively Scanning
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
⭐code (zero-shot referring expression comprehension)
- Dexterous Grasp Transformer
- Mean-Shift Feature Transformer
- Dual-scale Transformer for Large-scale Single-Pixel Imaging
- Solving Masked Jigsaw Puzzles with Diffusion Transformers
- Instance-Aware Group Quantization for Vision Transformers
🏠project - Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
⭐code - Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
- Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers
- Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- On the Faithfulness of Vision Transformer Explanations
- Learning Correlation Structures for Vision Transformers
- Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression
- Point Transformer V3: Simpler, Faster, Stronger
⭐code - A General and Efficient Training for Transformer via Token Expansion
⭐code - HEAL-SWIN: A Vision Transformer On The Sphere
⭐code - SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design (vision)
- TransNeXt: Robust Foveal Visual Perception for Vision Transformers
⭐code - Making Vision Transformers Truly Shift-Equivariant
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
⭐code
- Time-Efficient Light-Field Acquisition Using Coded Aperture and Events
- Event-based Light Field Project
🏠project - Continuous Pose for Monocular Cameras in Neural Implicit Representation
⭐code - Camera pose
- Snapshot compressive imaging
- Z∗: Zero-shot Style Transfer via Attention Rearrangement
- MoST: Motion Style Transformer between Diverse Action Contents
⭐code - ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation
⭐code
🏠project - Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model
- Zero-shot text-driven motion transfer
- Segment Every Out-of-Distribution Object
- Label-Efficient Group Robustness via Out-of-Distribution Concept Curation
- Enhancing the Power of OOD Detection via Sample-Aware Model Selection (OOD)
- Discriminability-Driven Channel Selection for Out-of-Distribution Detection
- CORES: Convolutional Response-based Score for Out-of-distribution Detection
- Learning Transferable Negative Prompts for Out-of-Distribution Detection
⭐code - A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?
⭐code - Anomaly detection
- Datasets
- Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
- Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks
- MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying
- HardMo: A Large-Scale Hardcase Dataset for Motion Capture
- The STVchrono Dataset: Towards Continuous Change Recognition in Time
- Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
- LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising
- On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm
- Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods
- Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
- FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding (fine-grained action understanding)
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
🏠project - Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset
- Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network
🌻dataset - JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
🏠project - TULIP: A Multi-camera 3D Dataset for Precision Assessment of Parkinson's Disease
- JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments
- OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
🏠project - SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
- RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method
🏠project - MatSynth: A Modern PBR Materials Dataset
🏠project - RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception
⭐code - Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
⭐code - MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception
🌻dataset - HouseCat6D -- A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios
- HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative
🌻dataset - DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
🌻dataset - EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
⭐code
🏠project (dataset) - LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images
- MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors
⭐code - FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
🏠project - TUMTraf V2X Cooperative Perception Dataset
🏠project - MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures
🌻dataset
- Benchmarks
- When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach
- THRONE: A Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models
- M3-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection
⭐code - DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos
- RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
⭐code - ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
⭐code - UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement
- DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling
🏠project - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
⭐code - VBench: Comprehensive Benchmark Suite for Video Generative Models
⭐code
🏠project - MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark
- CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs
🏠project - How to Train Neural Field Representations: A Comprehensive Study and Benchmark
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
- Weakly supervised learning
- Partial label learning
- Semi-supervised learning
- Targeted Representation Alignment for Open-World Semi-Supervised Learning
- CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning
- BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning
- Positive-unlabeled learning
- Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation (positive-unlabeled learning is an important branch of semi-supervised learning)
- Self-supervised learning
- SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras
- An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing
- Self-supervised debiasing using low rank regularization
- CNC-Net: Self-Supervised Learning for CNC Machining Operations
- Unsupervised learning
- Efficient Multitask Dense Predictor via Binarization (dense prediction)
- Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models
- Exploiting Diffusion Prior for Generalizable Dense Prediction
🏠project - ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
⭐code
👍Baidu proposes ViT-CoMer, a new vision backbone that sets a new SOTA on dense prediction tasks - Multi-Task Dense Prediction via Mixture of Low-Rank Experts
⭐code
- Anomaly detection
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
- Prompt-enhanced Multiple Instance Learning for Weakly Supervised Anomaly Detection (weakly supervised anomaly detection)
- Long-Tailed Anomaly Detection with Learnable Class Names
🏠project - RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection
⭐code - Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts
- PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection
- Film removal
- Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach
- Building Optimal Neural Architectures using Interpretable Knowledge
⭐code - AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search
- Network architecture search
- Equivariant Multi-Modality Image Fusion (image fusion)
- Task-Customized Mixture of Adapters for General Image Fusion
⭐code - Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
⭐code - Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion
- Neural Spline Fields for Burst Image Fusion and Layer Separation
🏠project
- Language-only Training of Zero-shot Composed Image Retrieval
- Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
- On Train-Test Class Overlap and Detection for Image Retrieval
⭐code - Language-only Efficient Training of Zero-shot Composed Image Retrieval
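⭐code - Fine-grained image retrieval
- Cross-modal retrieval
- Sketch-based retrieval
- Video retrieval
- Text-video retrieval
- Video-text retrieval
- Composed image retrieval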
- GNN
- GCN
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning (SGG)
- Multi-Level Neural Scene Graphs for Dynamic Urban Environments
⭐code - HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
⭐code
⭐code - DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
⭐code
🏠project - From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
- EGTR: Extracting Graph from Transformer for Scene Graph Generation
⭐code - LLM4SGG: Large Language Model for Weakly Supervised Scene Graph Generation (SGG)
- Programmable Motion Generation for Open-set Motion Control Tasks
- Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
- AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents
- Towards Variable and Coordinated Holistic Co-Speech Motion Generation
⭐code - Generating Human Motion in 3D Scenes from Text Descriptions
- NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
🏠project (human motion synthesis) - OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers
🏠project - WANDR: Intention-guided Human Motion Generation
📺video - Animal motion
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
🏠project (VQA) - Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
- Video-QA
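- Chart question answering
- Visual text question answering
- Chemical structure recognition
- Document chromaticity detection
- Text detection
- Scene text recognition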
- OTE: Exploring Accurate Scene Text Recognition Using One Token
- An Empirical Study of Scaling Law for Scene Text Recognition
⭐code (scene text recognition; also titled "An Empirical Study of Scaling Law for OCR") - Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
⭐code - Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction
- Document understanding
- Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding
- PanoContext-Former: Panoramic Total Scene Understanding with a Transformer
- GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
- GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding
⭐code - A Category Agnostic Model for Visual Rearrangement
👍VILP: scene change detection and scene change matching - 360+x: A Panoptic Multi-modal Scene Understanding Dataset
⭐code - Text-driven 3D scene generation
- Scaling Up Dynamic Human-Scene Interaction Modeling
⭐code
🏠project - ReGenNet: Towards Human Action-Reaction Synthesis
⭐code - DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
⭐code (interaction) - Towards Open-Vocabulary HOI Detection via Conditional Multi-level Decoding and Fine-grained Semantic Enhancement
- Human motion tracking
- Novel motion synthesis
- Hand interaction
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
⭐code - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
⭐code
🏠project (hand-object interaction) - TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding
🏠project - Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction
⭐code - G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
⭐code
- Human-object interaction
- Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
- LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
⭐code
🏠project - Disentangled Pre-training for Human-Object Interaction Detection
⭐code - GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation
🏠project - Learning from Observer Gaze: Zero-shot Attention Prediction Oriented by Human-Object Interaction Recognition (human-computer interaction)
- 3D human-object interaction
- How Far Can We Compress Instant NGP-Based NeRF?
- IReNe: Instant Recoloring of Neural Radiance Fields
- PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF
- LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes
- NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation
- PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based Reference (NeRF)
- Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes
- Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling
⭐code - BANF: Band-limited Neural Fields for Levels of Detail Reconstruction
⭐code
🏠project - NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation
- MuRF: Multi-Baseline Radiance Fields
🏠project
🏠project - InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields
- NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors
⭐code
🏠project - Neural Fields as Distributions: Signal Processing Beyond Euclidean Space
🏠project - CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs
⭐code - DaReNeRF: Direction-aware Representation for Dynamic Scenes
- Geometry Transfer for Stylizing Radiance Fields
🏠project - S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes
⭐code - SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream
⭐code - Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes
⭐code - Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates
⭐code - LAENeRF: Local Appearance Editing for Neural Radiance Fields
⭐code
🏠project - TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video
- NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation
⭐code - Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency
⭐code - Grounding and Enhancing Grid-based Models for Neural Fields
🏠project - Mitigating Motion Blur in Neural Radiance Fields with Events and Frames
- OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos
- Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects
⭐code - Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields
🏠project - Dynamic LiDAR Re-simulation using Compositional Neural Fields
🏠project - SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields
🏠project - ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization
- NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows
- Novel view synthesis
- G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images
- Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- 3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis
- Generalizable Novel-View Synthesis using a Stereo Camera
🏠project - DART: Implicit Doppler Tomography for Radar Novel View Synthesis
🏠project - XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold
- Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
⭐code
🏠project - NViST: In the Wild New View Synthesis from a Single Image with Transformers
⭐code
🏠project - ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
🏠project - SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
⭐code
🏠project - GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
⭐code
🏠project (novel view) - DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
⭐code
🏠project - LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis
⭐code
🏠project - Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
⭐code
🏠project - CoPoNeRF: Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs
⭐code
🏠project - EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
⭐code
🏠project - Free3D: Consistent Novel View Synthesis without 3D Representation
⭐code
🏠project - Novel View Synthesis with View-Dependent Effects from a Single Image
🏠project
- Rendering
- NeRF Director: Revisiting View Selection in Neural Volume Rendering
- Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance (rendering)
- Perceptual Assessment and Optimization of High Dynamic Range Image Rendering
- Global Latent Neural Rendering
- Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination
📺video - Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields
🏠project - Dr.Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering
🏠project - HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
👍HiFi4G: high-fidelity human performance rendering via compact Gaussians - ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
🏠project - SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild
🏠project (neural rendering) - HashPoint: Accelerated Point Searching and Sampling for Neural Rendering
🏠project - HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
🏠project - DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling
⭐code
🏠project - Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras
🏠project - ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis
🏠project
- Object reconstruction
- Entity recognition
- Prompt learning
- Foundation models
- MuGE: Multiple Granularity Edge Detection
- RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses
⭐code
- Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception
- Pedestrian detection
- Crowd counting
- Pedestrian attribute detection
- Re-identification
- SEAS: ShapE-Aligned Supervision for Person Re-Identification
- View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network
⭐code - CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification
- Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability
- Radar-based Re-ID
- Visible-infrared person re-identification
- Text-image re-identification
- MC
- KD
- Small Scale Data-Free Knowledge Distillation
- Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation
- C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation
- CrossKD: Cross-Head Knowledge Distillation for Dense Object Detection
- Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
- FreeKD: Knowledge Distillation via Semantic Frequency Prompt
- Logit Standardization in Knowledge Distillation
- $V_kD$: Improving Knowledge Distillation using Orthogonal Projections
⭐code - Scale Decoupled Distillation
⭐code - NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation
⭐code - De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
⭐code
🏠project
👍Chinese-language explainer
- Pruning
- Resource-Efficient Transformer Pruning for Finetuning of Large Models
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment
- MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
⭐code
- Quantization
- Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization
⭐code - Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
⭐code - Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery
🏠project - Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
🏠project - WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification
⭐code - Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels
⭐code - Remote sensing
- SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
- 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
⭐code - Poly Kernel Inception Network for Remote Sensing Detection
- Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening
⭐code
- Aerial image segmentation
- Reference-based super-resolution
- Masking Clusters in Vision-language Pretraining
- Beyond Average: Individualized Visual Scanpath Prediction
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models
⭐code - Seeing the Unseen: Visual Common Sense for Semantic Placement
- EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
⭐code
🏠project - SPIN: Simultaneous Perception, Interaction and Navigation
- DePT: Decoupled Prompt Tuning
⭐code - Osprey: Pixel Understanding with Visual Instruction Tuning
⭐code - FairCLIP: Harnessing Fairness in Vision-Language Learning
🏠project - Efficient Test-Time Adaptation of Vision-Language Models
⭐code - BioCLIP: A Vision Foundation Model for the Tree of Life
⭐code - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
⭐code - Anchor-based Robust Finetuning of Vision-Language Models
- Multi-Modal Hallucination Control by Visual Information Grounding
- Do Vision and Language Encoders Represent the World Similarly?
- Dual-View Visual Contextualization for Web Navigation
- Any-Shift Prompting for Generalization over Distributions
- Non-autoregressive Sequence-to-Sequence Vision-Language Models
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
⭐code - SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
⭐code - RegionGPT: Towards Region Understanding Vision Language Model
- Enhancing Vision-Language Pre-training with Rich Supervisions
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
⭐code - Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
⭐code - Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding
⭐code (vision-language compositional understanding) - Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
⭐code - A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
⭐code - Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
⭐code - SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining (vision-language)
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
⭐code - Iterated Learning Improves Compositionality in Large Vision-Language Models
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era
⭐code - Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
⭐code - Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
🏠project - Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
🏠project - HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
⭐code - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
- Learning Vision from Models Rivals Learning Vision from Data
⭐code - Probing the 3D Awareness of Visual Foundation Models
⭐code - Visual Understanding
- LLM
- PixelLM: Pixel Reasoning with Large Multimodal Model
🏠project - Driving Everywhere with Large Language Model Policy Adaptation
🏠project - Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
🏠project - Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
🏠project - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
⭐code - Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- Pixel Aligned Language Models
🏠project - SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
- UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- ModaVerse: Efficiently Transforming Modalities with LLMs
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models
⭐code
🏠project - mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
🏠project (large language models) - RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
⭐code - Prompt Highlighter: Interactive Control for Multi-Modal LLMs
⭐code
🏠project - Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
🏠project - General Object Foundation Model for Images and Videos at Scale
⭐code
🏠project
👍GLEE: HUST and ByteDance team up to build an all-round object perception foundation model - Link-Context Learning for Multimodal LLMs
⭐code (LLMs) - Cloud-Device Collaborative Learning for Multimodal Large Language Models
- LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model
⭐code
👍Results at a glance | CVPR 2024 fine-grained visual perception multimodal large models Pink and LocLLM - Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
⭐code
👍Results at a glance | CVPR 2024 fine-grained visual perception multimodal large models Pink and LocLLM - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
⭐code
🏠project (MLLMs) - GSVA: Generalized Segmentation via Multimodal Large Language Models
- VLN
- Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation
⭐code
👍VILP - Volumetric Environment Representation for Vision-Language Navigation
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
- Video-Language
- Visual Grounding
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- Viewpoint-Aware Visual Grounding in 3D Scenes
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
🏠project - Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency (visual grounding)
- CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion
⭐code - Image Steganography
- Intellectual Property Protection
- Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models
- MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection
⭐code - CPR: Retrieval Augmented Generation for Copyright Protection
- VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
⭐code - Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models
- IP Protection
- 3D Feature Tracking via Event Camera
- Projecting Trackable Thermal Patterns for Dynamic Computer Vision
- ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe
- DPHMs: Diffusion Parametric Head Models for Depth-based Tracking
🏠project - NetTrack: Tracking Highly Dynamic Objects with a Net
⭐code - RTracker: Recoverable Tracking via PN Tree Structured Memory
- Context-Aware Integration of Language and Visual References for Natural Language Tracking
- CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras
- SpatialTracker: Tracking Any 2D Pixels in 3D Space
⭐code - Learning Tracking Representations from Single Point Annotations
- Visual Object Tracking
- Multi-Object Tracking
- Multi-Object Tracking in the Dark
- ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association
- Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
⭐code - Self-Supervised Multi-Object Tracking with Path Consistency
- DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
⭐code
🏠project - iKUN: Speak to Trackers without Retraining
⭐code
- Video Object Tracking
- Point Tracking
- Adversarial
- Infrared Adversarial Car Stickers
- Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training
- MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model (adversarial perturbation)
- Towards Transferable Targeted 3D Adversarial Attack in the Physical World
- Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack (attack)
- Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models
- Re-thinking Data Availability Attacks Against Deep Neural Networks (attack)
- NAPGuard: Towards Detecting Naturalistic Adversarial Patches
- Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers (backdoor attack)
- Nearest Is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks
- Semantic-Aware Multi-Label Adversarial Attacks (adversarial attack)
- Language-Driven Anchors for Zero-Shot Adversarial Robustness (zero-shot adversarial)
- Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training
- On The Vulnerability of Efficient Vision Transformers to Adversarial Computation Attacks (adversarial computation attack)
- Learning to Transform Dynamically for Better Adversarial Transferability
- Boosting Adversarial Transferability by Block Shuffle and Rotation
⭐code (adversarial transferability) - MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models
- Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
👍VILP - Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights
- PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
- Revisiting Adversarial Training under Long-Tailed Distributions
⭐code - Towards Fairness-Aware Adversarial Learning
- Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement
- Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM
- LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning
⭐code - Boosting Adversarial Training via Fisher-Rao Norm-based Regularization
⭐code - A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning (attack)
- Backdoor Attacks
- Continual Learning
- RCL: Reliable Continual Learning for Unified Failure Detection
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
⭐code - Enhancing Visual Continual Learning with Language-Guided Supervision
- Convolutional Prompting meets Language Models for Continual Learning
- Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation
- InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
- Learning Equi-angular Representations for Online Continual Learning
- BrainWash: A Poisoning Attack to Forget in Continual Learning
- Class-Incremental Learning
- Dual-consistency Model Inversion for Non-exemplar Class Incremental Learning
- Gradient Reweighting: Towards Imbalanced Class-Incremental Learning
- NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning
⭐code - Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning
⭐code - Text-Enhanced Data-free Approach for Federated Class-Incremental Learning
⭐code - Generative Multi-modal Models are Good Class-Incremental Learners
⭐code
- Multi-Task
- A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
- Masked AutoDecoder is Effective Multi-Task Vision Generalist
- Task-conditioned adaptation of visual features in multi-task policy learning
- DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data
⭐code - FedHCA2: Towards Hetero-Client Federated Multi-Task Learning
⭐code - MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning
- Joint-Task Regularization for Partially Labeled Multi-Task Learning
- Multi-View Learning
- Meta-Learning
- Federated Learning
- An Aggregation-Free Federated Learning for Tackling Data Heterogeneity
- Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data (federated learning)
- FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning
- FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning
- Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity
⭐code - PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees
⭐code - Relaxed Contrastive Learning for Federated Learning
- DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
- FedAS: Bridging Inconsistency in Personalized Federated Learning
⭐code - Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning
- Data Valuation and Detections in Federated Learning
⭐code - An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning
⭐code - FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning
- Communication-Efficient Federated Learning with Accelerated Client Gradient
- Reinforcement Learning
- Multimodal Machine Learning
- Transfer Learning
- Hearing Anything Anywhere
- Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
- AV-RIR: Audio-Visual Room Impulse Response Estimation
📺video - DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
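⭐code - Audio-Visual Dialogue
- Audio-Visual Navigation
- Audio-Visual Segmentation
- Speech Recognition
- Speech Localization
- Audio-Visual Speech Representation Learning
- Text-Driven Speech Localization
- Music Synthesis from Image and Language Prompts
- Binaural Audio Generation and Localization
- Video-Audio Synchronization
- Audio-Visual Representation Learning
- Speaker Detection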
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
- Preserving Fairness Generalization in Deepfake Detection
⭐code - Exploiting Style Latent Flows for Generalizing Deepfake Video Detection
🏠project - Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection
⭐code
- LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection
- LAA-Net: Localized Artifact Attention Network for High-Quality Deepfakes Detection
- Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
- Image Tampering Detection
- DiffForensics: Leveraging Diffusion Prior to Image Forgery Detection and Localization (forged image detection)
- EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection
⭐code - UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization
- Synthetic Image Detection
- DG
- A2XP: Towards Private Domain Generalization
⭐code - PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization
- Towards Generalizing to Unseen Domains with Few Labels
- Rethinking the Evaluation Protocol of Domain Generalization
- Rethinking Multi-domain Generalization with A General Learning Objective
- Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
⭐code
- DA
- Parameter Efficient Self-Supervised Geospatial Domain Adaptation
- Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation
- Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer
- LEAD: Learning Decomposition for Source-free Universal Domain Adaptation
⭐code - A2XP: Towards Private Domain Generalization
⭐code
🏠project - Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
⭐code - Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
⭐code - Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias
- Unified Language-driven Zero-shot Domain Adaptation
🏠project
- FSL
- DeIL: Direct and Inverse CLIP for Open-World Few-Shot Learning
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning
- Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning
⭐code - Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning
- Few-shot Learner Parameterization by Diffusion Time-steps
⭐code
- ZSL
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning
👍Boosting generative zero-shot learning with a visual-augmented dynamic semantic prototype method - Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning
- Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning
⭐code
- Efficient Meshflow and Optical Flow Estimation from Event Cameras
- UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model
- ADFactory: An Effective Framework for Generalizing Optical Flow with NeRF
- Dense Optical Tracking: Connecting the Dots
⭐code
🏠project (optical flow) - MemFlow: Optical Flow Estimation and Prediction with Memory
⭐code - OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation
- Scene Flow
- 3D Scene Flow Estimation
- Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval
- DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses
⭐code - Object Pose Estimation
- 6DoF
- Towards Co-Evaluation of Cameras, HDR, and Algorithms for Industrial-Grade 6DoF Pose Estimation
- Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
- SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
⭐code - FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation
⭐code
🏠project - MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
🏠project - GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects
- Open-vocabulary object 6D pose estimation
🏠project - SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation
⭐code - A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization
- Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation
- MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation
- Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge
- Re-Identification
- Counting
- Instance Tracking in 3D Scenes from Egocentric Videos
- VPR
- Navigation
- Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation
👍VILP - Detours for Navigating Instructional Videos (video navigation)
- MemoNav: Working Memory Model for Visual Navigation
- DiaLoc: An Iterative Approach to Embodied Dialog Localization
- F$^3$Loc: Fusion and Filtering for Floorplan Localization
- An Interactive Navigation Method with Effect-oriented Affordance (interactive navigation)
- SLAM
- SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System
⭐code - Gaussian Splatting SLAM
⭐code
🏠project - SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
🏠project - NARUTO: Neural Active Reconstruction from Uncertain Target Observations
- Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM
- Implicit Event-RGBD Neural SLAM
⭐code
🏠project - Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras
- Dense Neural SLAM with Loop Closures
🏠project
- Robotics
- Learning to navigate efficiently and precisely in real environments
- CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation
🏠project - Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
⭐code - Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
⭐code
🏠project - SUGAR: Pre-training 3D Visual Representations for Robotics
⭐code
- Avatar (Virtual Modeling)
- GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
⭐code - Stratified Avatar Generation from Sparse Observations (https://zerg-overmind.github.io/)
📺video - Real-Time Simulated Avatar from Head-Mounted Sensors
🏠project - Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
⭐code
🏠project - NECA: Neural Customizable Human Avatar
⭐code - Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
⭐code - GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
⭐code
🏠project - Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
⭐code
🏠project - UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
⭐code
🏠project - GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
🏠project - 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
🏠project (3D animation) - AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing (3D human avatar generation)
- Human Gaussian Splatting: Real-time Rendering of Animatable Avatars
- GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
⭐code - DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation
🏠project - PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
🏠project - FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
🏠project - MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
🏠project (human image animation) - Relightable Gaussian Codec Avatars
🏠project
- Hair Modeling
- Virtual Try-On
- Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
- CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model
- StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
⭐code - PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns
🏠project
- Grasping
- Cartoon Characters
- SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
- Autonomous Driving
- Accurate Training Data for Occupancy Map Prediction in Automated Driving using Evidence Theory
- DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
- Uncertainty-Driven Continual Learning for Autonomous Driving
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
⭐code - Generalized Predictive Model for Autonomous Driving
- LMDrive: Closed-Loop End-to-End Driving with Large Language Models
⭐code
🏠project - On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving
- Visual Point Cloud Forecasting enables Scalable Autonomous Driving
⭐code - Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving
⭐code - CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow
- Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving
- NeuRAD: Neural Rendering for Autonomous Driving
⭐code
🏠project - Generalized Predictive Model for Autonomous Driving
⭐code - Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
⭐code
🏠project - Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
⭐code
🏠project - 3D LiDAR Mapping in Dynamic Environments Using a 4D Implicit Neural Representation
⭐code - SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving (defogging for autonomous driving)
- Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving
⭐code
- Trajectory Prediction
- Pose-Transformed Equivariant Network for 3D Point Trajectory Prediction
- GigaTraj: Predicting Long-term Trajectories of Hundreds of Pedestrians in Gigapixel Complex Scenes
- ERMVP: Communication-Efficient and Collaboration-Robust Multi-Vehicle Perception in Challenging Environments
- HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention
⭐code
👍VILP - Adapting to Length Shift: FlexiLength Network for Trajectory Prediction
- OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising
⭐code - SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction (pedestrian trajectory prediction)
- T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory
⭐code - Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations
- SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction
⭐code - Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
- SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model
⭐code - Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
⭐code - Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture
⭐code
- Lane Detection
- In-Vehicle Gaze Estimation
- 3D Occupancy Prediction
- Single-View Scene Point Cloud Human Grasp Generation
- LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling
- Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform
- Multiway Point Cloud Mosaicking with Diffusion and Global Optimization
📺video - Point Cloud Pre-training with Diffusion Models
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis
⭐code - Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis
⭐code - Unsupervised Template-assisted Point Cloud Shape Correspondence Network
- GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds
- Object Dynamics Modeling with Hierarchical Point Cloud-based Representations
- Point Cloud Registration
- 3D Point Clouds
- Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds
⭐code - Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds
⭐code - Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching
⭐code
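- Point Cloud Recognition
- Point Cloud Upsampling
- Point Cloud Segmentation
- Point Cloud Analysis
- Point Cloud Understanding
- Point Cloud Generation
- Point Cloud Denoising
- Point Cloud Quality Assessment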
- Unsupervised Salient Instance Detection
- Neural Exposure Fusion for High-Dynamic Range Object Detection
- SFOD: Spiking Fusion Object Detector
⭐code - Exploring Orthogonality in Open World Object Detection
⭐code - SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
- Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement
⭐code - Theoretically Achieving Continuous Representation of Oriented Bounding Boxes
⭐code - RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
- Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation
⭐code - CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation
🏠project - DETRs Beat YOLOs on Real-time Object Detection
🏠project - Hyperbolic Learning with Synthetic Captions for Open-World Detection
- What, How, and When Should Object Detectors Update in Continually Changing Test Domains
- SAR Object Detection
- 3D Object Detection
- Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection
- Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection
- CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection
- An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains
- SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects
⭐code - UniMODE: Unified Monocular 3D Object Detection
- Learning Occupancy for Monocular 3D Object Detection
⭐code - CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images
⭐code - VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection
⭐code - Improving Distant 3D Object Detection Using 2D Box Supervision
- SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection
⭐code - Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors
⭐code - IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
⭐code - RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection
⭐code - Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection
- SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects
⭐code - MonoCD: Monocular 3D Object Detection with Complementary Depths
⭐code
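- Salient Object Detection
- Oriented Object Detection
- Domain-Adaptive Object Detection
- Open-World Object Detection
- Semi-Supervised Object Detection
- End-to-End Object Detection
- Open-Vocabulary Object Detection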
- Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection
- Retrieval-Augmented Open-Vocabulary Object Detection
⭐code - Taming Self-Training for Open-Vocabulary Object Detection
⭐code - Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding
🏠project
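- Video Camouflaged Object Detection
- Event-based Object Detection
- Small Object Detection
- Object Recognition
- Object Discovery
- Object Localization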
- STMixer: A One-Stage Sparse Action Detector
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence
⭐code - Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition
⭐code
🏠project - X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
⭐code - LLMs are Good Action Recognizers
- Action Detection via an Image Diffusion Process
- Language Model Guided Interpretable Video Action Reasoning
⭐code - SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
🏠project - TIM: A Time Interval Machine for Audio-Visual Action Recognition
⭐code - VicTR: Video-conditioned Text Representations for Activity Recognition
- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
🏠project - Event-based Action Recognition
- Zero-Shot Action Recognition
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition
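- Fine-Grained Action Recognition
- Temporal Action Localization
- Temporal Action Detection
- Group Activity Recognition
- Action Anticipation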
- CLOAF: CoLlisiOn-Aware Human Flow
- Meta-Point Learning and Refining for Category-Agnostic Pose Estimation
- SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering
⭐code - GALA: Generating Animatable Layered Assets from a Single Scan
⭐code
🏠project - ShapeMatcher: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation
- Hands
- URHand: Universal Relightable Hands
🏠project - OHTA: One-shot Hand Avatar via Data-driven Implicit Priors
⭐code
🏠project - BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
⭐code
🏠project - 3D Hand Pose Estimation
- Hand Mesh Reconstruction
- Hand Mesh Recovery
- Hand Pose Tracking
- Hand Texture Reconstruction
- Gesture Synthesis
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
🏠project - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
🏠project - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
🏠project - EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
🏠project
- Human Body
- LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment
- RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control
⭐code - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation
⭐code - Multi-Person Pose Estimation
- 3D Human Body
- TexVocab: Texture Vocabulary-conditioned Human Avatars
🏠project - MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models
- SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation
⭐code - Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting
- Score-Guided Diffusion for 3D Human Recovery
⭐code - A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation
- KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation
⭐code - Multiple View Geometry Transformers for 3D Human Pose Estimation
⭐code - Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling
- Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning
⭐code - EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams
🏠project - 3D Human Pose Perception from Egocentric Stereo Videos
⭐code
🏠project - Forecasting of 3D Whole-body Human Poses with Grasping Objects (3D whole-body human pose)
- BodyMAP -- Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed
⭐code
🏠project - Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
⭐code
👍 Making video pose Transformers blazing fast: Peking University proposes HoT, an efficient 3D human pose estimation framework - Optimizing Diffusion Noise Can Serve As Universal Motion Priors
⭐code
🏠project
- Human Mesh Recovery/Reconstruction
- Motion Capture
- ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning
🏠project (motion capture) - Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket
⭐code - Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera
⭐code
🏠project
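- 3D Human Generation
- Speech-Driven Human Animation
- Text-Prompted Human Animation
- Sign Language Translation
- 3D Pose Transfer
- Human Reconstruction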
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption
⭐code - Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer
⭐code - HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High-and Low-Frequency Information of Parametric Models
- ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image
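- Category-Agnostic Pose Estimation
- Human Dynamics Estimation from Video
- Human Pose Regression
- 3D Human Models
- Human Generation
- Human Motion Understanding
- Human Shape
- Dance Generation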
- Deep Video Inverse Tone Mapping Based on Temporal Clues
- VideoMosaic: Connecting the Temporal Dots in Long Videos for LLMs
- vid-TLDR: Training Free Token merging for Light-weight Video Transformer
⭐code - Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video
⭐code - Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
- Understanding Video Transformers via Universal Concept Discovery
- Video Understanding
- Compositional Video Understanding with Spatiotemporal Structure-based Transformers
- Koala: Key frame-conditioned long video-LLM
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
⭐code - Abductive Ego-View Accident Video Understanding for Safe Driving Perception
🏠project - OmniVid: A Generative Framework for Universal Video Understanding
⭐code - A Unified Framework for Human-centric Point Cloud Video Understanding
- Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
🏠project - TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
⭐code - Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
⭐code
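- Video Summarization
- Video Reconstruction
- Video Representation
- Video Interpretation
- Movie Description
- MICap: A Unified Model for Identity-aware Movie Descriptions
- Video Harmonization
- Conversational Music Recommendation for Video
- Video Anomaly Detection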
- Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning
- Harnessing Large Language Models for Training-free Video Anomaly Detection
⭐code - Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline
⭐code - MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
- PREGO: online mistake detection in PRocedural EGOcentric videos
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
⭐code - Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
- GlitchBench: Can large multimodal models detect video game glitches?
🏠project
- Video Scene Detection
- Automatic Movie Trailer Generation
- Video Prediction
- Video Relighting
- Sleep Monitoring
- Video Paragraph Grounding
- Video Grounding
- Video Stabilization
- Video Frame Interpolation
- Video Frame Interpolation via Direct Synthesis with the Event-based Reference
- TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation
- Sparse Global Matching for Video Frame Interpolation with Large Motion
⭐code - Perception-Oriented Video Frame Interpolation via Asymmetric Blending
⭐code - SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation
🏠project - EVS-assisted joint Deblurring, Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling
- Video Recognition
- Video Subject Swapping
- Video Dialogue
- Rapid 3D Model Generation with Intuitive 3D Input
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
- LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction
- TexOct: Generating Textures of 3D Models with Octree-based Diffusion
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
⭐code - CAGE: Controllable Articulation GEneration
⭐code
🏠project (3D) - Sparse views, Near light: A practical paradigm for uncalibrated point-light photometric stereo
- Dispersed Structured Light for Hyperspectral 3D Imaging
- UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence
⭐code (garment manipulation) - GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo
⭐code
⭐code - Digital Life Project: Autonomous 3D Characters with Social Intelligence
🏠project - Image Sculpting: Precise Object Editing with 3D Geometry Control
🏠project - TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations (3D)
- Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception
- GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
⭐code
🏠project - Differentiable Display Photometric Stereo
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion
⭐code
🏠project - Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps
- REACTO: Reconstructing Articulated Objects from a Single Video
⭐code - Low-Latency Neural Stereo Streaming
- Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes
- Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation
🏠project - 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos
🏠project - Wired Perspectives: Multi-View Wire Art Embraces Generative AI
⭐code
🏠project - Memory-based Adapters for Online 3D Scene Perception
⭐code - FastMAC: Stochastic Spectral Sampling of Correspondence Graph
⭐code - One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
⭐code
🏠project - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
🏠project - CityDreamer: Compositional Generative Model of Unbounded 3D Cities
🏠project - Towards 3D Vision with Low-Cost Single-Photon Cameras
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
⭐code - Mosaic-SDF for 3D Generative Models
🏠project - 3D Reconstruction
- 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces
🏠project
📺video - Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
🏠project - PlatoNeRF: 3D Reconstruction in Plato’s Cave via Single-View Two-Bounce Lidar
🏠project - WonderJourney: Going from Anywhere to Everywhere
🏠project - Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
⭐code
🏠project - DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans
- IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images
⭐code - Splatter Image: Ultra-Fast Single-View 3D Reconstruction
⭐code
🏠project - Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
⭐code
🏠project
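👍 CVPR 2024 perfect-score paper: Zhejiang University proposes a new method for high-quality monocular dynamic reconstruction based on deformable 3D Gaussians - PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar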
⭐code
🏠project - VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction
⭐code - MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections
⭐code - ZeroShape: Regression-based Zero-shot Shape Reconstruction
⭐code
🏠project - DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction
- G3DR: Generative 3D Reconstruction in ImageNet
⭐code
🏠project - 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface
⭐code - Bayesian Diffusion Models for 3D Shape Reconstruction
- RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction
- ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining
🏠project (360° view reconstruction)
- Surface Reconstruction
- MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video
⭐code
🏠project - UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and UnFavOrable Sets
⭐code
- 3D Shape
- GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
⭐code - Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
🏠project - ShapeWalk: Compositional Shape Editing through Language-Guided Chains
⭐code
🏠project - Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation
- Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
🏠project - FSC: Few-point Shape Completion
- Category-Level Multi-Part Multi-Joint 3D Shape Assembly
- Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation
- Stereo Matching
- Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching
⭐code - Robust Synthetic-to-Real Transfer for Stereo Matching
- Neural Markov Random Field for Stereo Matching
⭐code - Reusable Architecture Growth for Continual Stereo Matching
- MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
⭐code
🏠project - Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching
- Surface Normal Estimation
- Feature Matching
- 3D Retrieval
- Depth Completion
- Flexible Depth Completion for Sparse and Varying Point Densities
- Improving Depth Completion via Depth Feature Upsampling
- Test-Time Adaptation for Depth Completion
- Bilateral Propagation Network for Depth Completion
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
- Tri-Perspective View Decomposition for Geometry-Aware Depth Completion
⭐code
- Depth Estimation
- Cross-spectral Gated-RGB Stereo Depth Estimation
- Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation
- Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion
⭐code - On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
🏠project - Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
🏠project - Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion
- ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
⭐code - From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior
🏠project
- UniDepth: Universal Monocular Metric Depth Estimation
⭐code - WorDepth: Variational Language Prior for Monocular Depth Estimation
- Panoramic Localization
- 3D Keypoint Detection
- Layout Reconstruction
- CAD Reconstruction
- Shape Matching
- 3DGS
- Scene Reconstruction
- Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses
- SuperPrimitive: Scene Reconstruction at a Primitive Level
🏠project - Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction
⭐code - OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees
- 3D Scene Synthesis
- 3D Scene Graphs
- 3D Scene Editing
- GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
🏠project - Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
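👍 Precise 3D scene editing from text or image prompts: Meitu, the Institute of Information Engineering (CAS), Beihang University, and Sun Yat-sen University jointly propose the 3D editing method CustomNeRF - PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI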
⭐code - Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes
🏠project (3D scenes)
- Semantic Matching
- Indoor Lighting Estimation
- Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology
⭐code - Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling
- Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning
⭐code - MindBridge: A Cross-Subject Brain Decoding Framework
⭐code
⭐code - MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
- Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images
⭐code - PairAug: What Can Augmented Image-Text Pairs Do for Radiology?
⭐code - CT
- Whole Slide Image Classification
- Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction
- Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis
⭐code - Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification
🏠project
- Tumor Synthesis
- Pathology Detection
- Genetic Testing
- Cancer Detection
- Medical Image Registration
- Medical Image Segmentation
- One-Prompt to Segment All Medical Images
- Diversified and Personalized Multi-rater Medical Image Segmentation
⭐code - Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
⭐code - Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation
- Tyche: Stochastic In-Context Learning for Medical Image Segmentation
- Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention
🏠project - Clustering Propagation for Universal Medical Image Segmentation
- Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling (unsupervised semantic segmentation)
- MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
⭐code (echocardiography video segmentation) - Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation (medical image segmentation)
- Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation
⭐code - Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation
⭐code (nuclei segmentation) - PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness
⭐code (semi-supervised breast lesion segmentation) - PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation (panoramic renal pathology segmentation)
- Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation
⭐code
- X-ray
- MRI
- Anomaly Detection
- Brain Activity
- Survival Prediction
- Computational Pathology
- Histopathology
- SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology
- CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment
- Prompting Vision Foundation Models for Pathology Image Analysis
- Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
🏠project
- Medical Super-Resolution
- 3D Medical Imaging
- Radiology Report Generation
- Medical Foundation Models
- ToonerGAN: Reinforcing GANs for Obfuscating Automated Facial Indexing
- Neural Implicit Morphing of Face Images
🏠project - Anatomically Constrained Implicit Face Models
- Face2Diffusion for Fast and Editable Face Personalization
⭐code
⭐code - LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example
⭐code
🏠project - Self-Supervised Facial Representation Learning with Facial Region Awareness
- Facial Expression
- Face Attribute Classification
- Face Anti-Spoofing
- One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning
- Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing
⭐code - Gradient Alignment for Cross-Domain Face Anti-Spoofing
⭐code - Test-Time Domain Generalization for Face Anti-Spoofing
- Talking Head Synthesis
- Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation
- SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
⭐code
🏠project - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
⭐code - FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
⭐code
🏠project
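- Face Recognition
- Face Synthesis
- Face Reconstruction
- Face Reenactment
- Face Restoration
- Portrait Editing
- Face De-identification
- Face Makeup
- Hair Reconstruction
- Face Landmarks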
- Generalizable Face Landmarking Guided by Conditional Face Warping
⭐code
🏠project - FaceLift: Semi-supervised 3D Facial Landmark Localization
- Defense Against Face Editing Abuse
- Facial Action Units
- Face Anonymization
- 3D Face
- 4D Avatar Synthesis
- L-MAGIC: Language Model Assisted Generation of Images with Consistency
- CoDi-2: Interleaved and In-Context Any-to-Any Generation
⭐code
🏠project - IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
- TexTile: A Differentiable Metric for Texture Tileability
🏠project - SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
⭐code
🏠project - MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
⭐code - AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
⭐code - It's All About Your Sketch: Democratising Sketch Control in Diffusion Models
⭐code - Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling
- ProMark: Proactive Diffusion Watermarking for Causal Attribution
- Diversity-aware Channel Pruning for StyleGAN Compression
⭐code - DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception
- GAN
- StyLitGAN: Image-based Relighting via Latent Control
⭐code
🏠project - What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
🏠project
- Diffusion
- Self-correcting LLM-controlled Diffusion Models
- Image Neural Field Diffusion Models
- Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
⭐code - DiffLoc: Diffusion Model for Outdoor LiDAR Localization
- Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples
- AAMDM: Accelerated Auto-regressive Motion Diffusion Model
- Diffusion Model Alignment Using Direct Preference Optimization
- Residual Denoising Diffusion Models
- [Residual Learning in Diffusion Models]
- FreeU: Free Lunch in Diffusion U-Net
⭐code
🏠project - Shadow Generation for Composite Image Using Diffusion Model
⭐code - Alchemist: Parametric Control of Material Properties with Diffusion Models
- Orthogonal Adaptation for Modular Customization of Diffusion Models
🏠project (diffusion models) - Observation-Guided Diffusion Probabilistic Models
⭐code - TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
⭐code - Visual Anagrams: Synthesizing Multi-View Optical Illusions with Diffusion Models
- SPAD: Spatially Aware Multiview Diffusers
🏠project - Structure-Guided Adversarial Training of Diffusion Models
- One-step Diffusion with Distribution Matching Distillation
🏠project - Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
⭐code - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models
⭐code
🏠project - X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
🏠project - Readout Guidance: Learning Control from Diffusion Features
🏠project - PointInfinity: Resolution-Invariant Point Diffusion Models
🏠project - Unsupervised Keypoints from Pretrained Diffusion Models
⭐code - Amodal Completion via Progressive Mixed Context Diffusion
⭐code
🏠project - SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
🏠project - DREAM: Diffusion Rectification and Estimation-Adaptive Models
- Towards Memorization-Free Diffusion Models
- Efficient Dataset Distillation via Minimax Diffusion
⭐code - MatFuse: Controllable Material Generation with Diffusion Models
⭐code
🏠project - Accelerating Diffusion Sampling with Optimized Time Steps
- Boosting Diffusion Models with Moving Average Sampling in Frequency Domain
- One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
⭐code
🏠project - Balancing Act: Distribution-Guided Debiasing in Diffusion Models
⭐code - MACE: Mass Concept Erasure in Diffusion Models
⭐code - DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
🏠project
🏠project
⭐code - Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
⭐code
🏠project - DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
⭐code
🏠project - SVGDreamer: Text Guided SVG Generation with Diffusion Model
⭐code
🏠project
👍 SVGDreamer: Beihang University and HKU release a new text-guided differentiable rendering method for vector graphics - Relation Rectification in Diffusion Model
⭐code
🏠project
- Image Synthesis/Generation
- Image Synthesis
- One-Shot Structure-Aware Stylized Image Synthesis
- ViewFusion: Towards Multi-View Consistency via Interpolated Denoising
⭐code - Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
- Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
- Unlocking Pre-trained Image Backbones for Semantic Image Synthesis
- Image Generation
- ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation
🏠project - AnyDoor: Zero-shot Object-level Image Customization
🏠project (image generation) - Taming Stable Diffusion for Text to 360° Panorama Image Generation
⭐code
⭐code - Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
- Generative Image Dynamics
🏠project - Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- UniGS: Unified Representation for Image Generation and Segmentation
⭐code (image generation) - Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
- LAKE-RED: Camouflaged images generation by latent background knowledge retrieval-augmented diffusion
- FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
⭐code
🏠project - Adversarial Text to Continuous Image Generation
- Style Aligned Image Generation via Shared Attention
🏠project - CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- InstanceDiffusion: Instance-level Control for Image Generation
⭐code
🏠project - DemoFusion: Democratising High-Resolution Image Generation With No $$$
⭐code
🏠project - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
⭐code
🏠project
⭐code - When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
⭐code
🏠project - Correcting Diffusion Generation through Resampling
⭐code - Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
- Condition-Aware Neural Network for Controlled Image Generation
- A Unified and Interpretable Emotion Representation and Expression Generation
⭐code - Subject-Driven Image Generation
- Text-to-Image
- Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis
- Learning Multi-dimensional Human Preference for Text-to-Image Generation
- Personalized Residuals for Concept-Driven Text-to-Image Generation
- Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- Customization Assistant for Text-to-image Generation
- Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
🏠project - DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
🏠project - PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
🏠project - On the Scalability of Diffusion-based Text-to-Image Generation
- Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
🏠project - EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
⭐code - Grounded Text-to-Image Synthesis with Attention Refocusing
🏠project - OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
⭐code - Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
⭐code - CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models
- InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
⭐code - Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
- Cross Initialization for Face Personalization of Text-to-Image Models (text-to-image; also listed as "Cross Initialization for Personalized Text-to-Image Generation")
- CosmicMan: A Text-to-Image Foundation Model for Humans
⭐code - Dynamic Prompt Optimizing for Text-to-Image Generation
⭐code - WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
- Attention Calibration for Disentangled Text-to-Image Personalization
- RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
⭐code - InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
⭐code
🏠project - Learning Continuous 3D Words for Text-to-Image Generation
⭐code
🏠project - NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
⭐code - HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
🏠project - Discriminative Probing and Tuning for Text-to-Image Generation
⭐code
🏠project - Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
- ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
⭐code
🏠project - FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
⭐code
- Video Synthesis/Generation
- Video Generation
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
⭐code
🏠project - On the Content Bias in Fréchet Video Distance
⭐code - 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
- SimDA: Simple Diffusion Adapter for Efficient Video Generation
⭐code
🏠project - FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
🏠project
⭐code - Vlogger: Make Your Dream A Vlog
⭐code
🏠project - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
🏠project
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
🏠project - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
⭐code - BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
⭐code
🏠project (video synthesis) - DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
🏠project
- Text-to-Video
- Mind the Time: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
- Grid Diffusion Models for Text-to-Video Generation
- Breathing Life Into Sketches Using Text-to-Video Priors
⭐code
🏠project - Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
🏠project - A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
🏠project - TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
⭐code - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
🏠project
- Image-to-Video
- Video-to-Video
- Texture Generation/Synthesis
- Text-to-3D
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
- Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior
⭐code - LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
⭐code - Taming Mode Collapse in Score Distillation for Text-to-3D Generation
🏠project - Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
🏠project - DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
⭐code - VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation
⭐code - GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
⭐code
🏠project - Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences
- DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors
🏠project - HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation
- Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
🏠project - HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
🏠project
- Image-to-3D
- Text-to-4D
- Semantic Scene Generation
- Semantic Scene Completion
- Image-to-Image Translation
- StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation
- Image Detection
- Image Editing
- Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
⭐code - DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
⭐code - Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
🏠project - [Inversion-Free Image Editing with Language-Guided Diffusion Models]
- Inversion-Free Image Editing with Natural Language
⭐code
🏠project - TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
⭐code - Edit One for All: Interactive Batch Image Editing
⭐code
🏠project - SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
🏠project - On Exact Inversion of DPM-Solvers
⭐code
🏠project - Doubly Abductive Counterfactual Inference for Text-based Image Editing
⭐code (text-based image editing) - Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
- ZONE: Zero-Shot Instruction-Guided Local Editing
- HIVE: Harnessing Human Feedback for Instructional Visual Editing
- FreeDrag: Feature Dragging for Reliable Point-based Image Editing
- Layout Generation
- Text-to-Vector
- Vector Fonts
- Handwritten Mathematical Expressions
- NeRF-to-NeRF
- GenN2N: Generative NeRF2NeRF Translation
- Camouflaged Image Generation
- Scene Generation
- Interactive Editing
- Video Editing
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
🏠project
📺video - MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
⭐code
🏠project - A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
🏠project - Video-P2P: Video Editing with Cross-attention Control
🏠project - VidToMe: Video Token Merging for Zero-Shot Video Editing
🏠project - Video Interpolation with Diffusion Models
⭐code - MotionEditor: Editing Video Motion via Content-Aware Diffusion
- CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video Editing
⭐code (Text-Driven Video Editing)
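- Comic Generation
- Text-Driven 3D Stylization
- Image Warping
- Image Reconstruction
- Image Stitching
- Pose-Guided Human Image Synthesis
- Text-Guided Human Image Synthesis
- Text-Image Alignment
- Deghosting
- Shadow Removal
- Deblurring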
- Unsupervised Blind Image Deblurring Based on Self-Enhancement
- ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation
- Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains
⭐code - AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
⭐code
⭐code - A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning
⭐code
- Dehazing
- Denoising
- Real-World Mobile Image Denoising Dataset with Efficient Baselines
- Robust Image Denoising through Adversarial Frequency Mixup
- Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios
- Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation
- Transfer CLIP for Generalizable Image Denoising
- Residual Denoising Diffusion Models
⭐code - Equivariant plug-and-play image reconstruction
⭐code - Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching
- Hyper-MD: Mesh Denoising with Customized Parameters Aware of Noise Intensity and Geometric Characteristics
- Deraining
- Reflection Removal
- Retouching
- Image Enhancement
- Color Shift Estimation-and-Correction for Image Enhancement
- Zero-Reference Low-Light Enhancement via Physical Quadruple Priors
⭐code - Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach
⭐code - Specularity Factorization for Low-Light Enhancement
- Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving
- Image Restoration
- Learning Diffusion Texture Priors for Image Restoration
- Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration
- Image Restoration by Denoising Diffusion Models With Iteratively Preconditioned Guidance
⭐code - Deep Equilibrium Diffusion Restoration with Parallel Sampling
⭐code - Distilling Semantic Priors from SAM to Efficient Image Restoration Models
- Boosting Image Restoration via Priors from Pre-trained Models
- Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model
⭐code - Restoration by Generation with Constrained Priors
🏠project - Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration
- Improving Image Restoration through Removing Degradations in Textual Representations
⭐code
- Image Inpainting
- Brush2Prompt: Contextual Prompt Generator for Object Inpainting
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting
- Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
⭐code
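- Image Quality
- Adverse Weather Removal
- Atmospheric Turbulence Removal
- Image Portrait Relighting
- Image Downscaling
- Image Rectification
- Image Colorization
- Motion (De)blurring
- Video Inpainting
- Video Dehazing
- Video Deblurring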
- Frequency-aware Event-based Video Deblurring for Real-World Motion Blur
- Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring
⭐code
🏠project - FMA-Net: Flow Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring
⭐code
🏠project - DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video
🏠project
- Video Enhancement
- Video Quality Assessment
- Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
⭐code
🏠project - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
⭐code - MeaCap: Memory-Augmented Zero-shot Image Captioning
⭐code - [Sieve: Multimodal Dataset Pruning using Image-Captioning Models]
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models
- Video Captioning
- Streaming Dense Video Captioning
⭐code
⭐code - Video ReCap: Recursive Captioning of Hour-Long Videos
⭐code
🏠project
🌻dataset - Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- VideoCon: Robust Video-Language Alignment via Contrast Captions
⭐code
🏠project - Retrieval-Augmented Egocentric Video Captioning
- Dense Captioning
- Video Compression
- Image Compression
- A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution (Image Super-Resolution)
- Continuous Optical Zooming: A Benchmark for Arbitrary-Scale Image Super-Resolution in Real World
- Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary
⭐code - SinSR: Diffusion-Based Image Super-Resolution in a Single Step
⭐code - CAMixerSR: Only Details Need More "Attention"
- Text-guided Explorable Image Super-resolution
- CFAT: Unleashing Triangular Windows for Image Super-resolution
- SeD: Semantic-Aware Discriminator for Image Super-Resolution
- SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
- Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts
- Boosting Flow-based Generative Super-Resolution Models via Learned Prior
⭐code
- Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss
⭐code - AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution
⭐code - Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer
- DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF (Super-Resolution)
- Neural Super-Resolution for Real-time Rendering with Radiance Demodulation
- Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
⭐code - VSR
- Fair-VPT: Fair Visual Prompt Tuning for Image Classification
- Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
- MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes
- SURE: SUrvey REcipes for building reliable and robust deep networks
⭐code - A Bayesian Approach to OOD Robustness in Image Classification
- Hyperspherical Classification with Dynamic Label-to-Prototype Assignment
⭐code - Discover and Mitigate Multiple Biased Subgroups in Image Classifiers
⭐code - Deep Imbalanced Regression via Hierarchical Classification Adjustment
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification
⭐code - Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
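- Domain-Generalized Image Classification
- Long-Tailed Recognition
- Few-Shot Image Classification
- Zero-Shot Classification
- Fine-Grained
- Open-Set Classification
- Few-Shot Recognition
- GCD (Generalized Category Discovery)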
- Matching Anything by Segmenting Anything
- MESA: Matching Everything by Segmenting Anything
- CoralSCOP: Segment any COral Image on this Planet (Segmentation)
- SANeRF-HQ: Segment Anything for NeRF in High Quality
🏠project - ASAM: Boosting Segment Anything Model with Adversarial Tuning
- Universal Segmentation at Arbitrary Granularity with Language Instruction (Universal Segmentation)
- Segment and Caption Anything
🏠project - COCONut: Modernizing COCO Segmentation
⭐code - Multi-view Aggregation Network for Dichotomous Image Segmentation
⭐code - OMG-Seg: Is One Model Good Enough For All Segmentation?
🏠project - Unsegment Anything by Simulating Deformation
- BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
⭐code - VRP-SAM: SAM with Visual Reference Prompt
- PEM: Prototype-based Efficient MaskFormer for Image Segmentation
- Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM
⭐code - CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
🏠project - Benchmarking Segmentation Models with Mask-Preserved Attribute Editing
⭐code - CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers
- Continual Segmentation with Disentangled Objectness Learning and Class Recognition
⭐code - Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms
⭐code - Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
⭐code
🏠project - A Simple Recipe for Language-guided Domain Generalized Segmentation
🏠project - Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts
⭐code - Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation
⭐code
👍 Does the Segment Anything Model (SAM) generalize poorly? A domain adaptation strategy addresses it - Open-Vocabulary Segmentation
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
- OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
- Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
⭐code - Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models
- Video Segmentation
- UniVS: Unified and Universal Video Segmentation with Prompts as Queries
⭐code - Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence
🏠project (Video Segmentation) - Learning to Segment Referred Objects from Narrated Egocentric Videos
- Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
⭐code
- Semantic Segmentation
- Open Set Domain Adaptation for Semantic Segmentation
- TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation
- ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
- HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation
- Frequency-Adaptive Dilated Convolution for Semantic Segmentation
⭐code - GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation
- Improving Bird's Eye View Semantic Segmentation by Task Decomposition
⭐code - UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather
- Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
- Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball
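- 3D Semantic Segmentation
- Point Cloud Semantic Segmentation
- Unsupervised Semantic Segmentation
- Few-Shot Semantic Segmentation
- Semi-Supervised Semantic Segmentation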
- Training Vision Transformers for Semi-Supervised Semantic Segmentation
- AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation
⭐code - CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation
⭐code - Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation
⭐code
- Weakly Supervised Semantic Segmentation
- Class Tokens Infusion for Weakly Supervised Semantic Segmentation
- Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation
⭐code - DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation
⭐code - Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation
- Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation
⭐code - PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation
- Domain-Generalized Semantic Segmentation
- Collaborating Foundation models for Domain Generalized Semantic Segmentation
⭐code - Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning
⭐code - Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
⭐code
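- Text-Supervised Semantic Segmentation
- Open-World Semantic Segmentation
- Open-Vocabulary Semantic Segmentation
- Panoptic Segmentation
- Instance Segmentation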
- Mudslide: A Universal Nuclear Instance Segmentation Method
- DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data
- FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures
⭐code - Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Anthropic Prior Knowledge
- 3D Instance Segmentation
- Scene Segmentation
- Action Segmentation
- Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos
- Coherent Temporal Synthesis for Incremental Action Segmentation
- Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment
- Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
- FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Fully-Supervised Action Segmentation
⭐code (Fully Supervised Action Segmentation)
- Referring Image Segmentation
- Referring Expression Segmentation
- [Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation]
- Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
⭐code
- VOS
- VSS
- Matting
- Few-Shot Segmentation
- Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation
- Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation
⭐code - Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach
- Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation
- Crack Segmentation
- Interactive Segmentation
- Amodal Segmentation
- Implicit Motion Function
- Ungeneralizable Examples
- Generalized Event Cameras
- Event-based Structure-from-Orbit
- Seeing the World through Your Eyes
- ProMotion: Prototypes As Motion Learners
- Move Anything with Layered Scene Diffusion
- GLACE: Global Local Accelerated Coordinate Encoding
- Quantifying Task Priority for Multi-Task Optimization
- Model Adaptation for Time Constrained Embodied Control
- A theory of volumetric representations for opaque solids
- DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting
- EvDiG: Event-guided Direct and Global Components Separation
- Efficient Model Stealing Defense with Noise Transition Matrix
- OpenStreetView-5M: The Many Roads to Global Visual Geolocation
- WaveMo: Learning Wavefront Modulations to See Through Scattering
- All Rivers Run to the Sea: Private Learning with Asymmetric Flows
- HDQMF: Holographic Feature Decomposition Using Quantum Algorithms
- Cross-dimension Affinity Distillation for 3D EM Neuron Segmentation
- Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement
- Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
- QUADify: Extracting Meshes with Pixel-level Details and Materials from Images
- Multimodal autoregressive learning for time-aligned and contextual modalities
- E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator
- Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping
- Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata
- Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion
- Partial-to-Partial Shape Matching with Geometric Consistency
- LCD: Towards Hierarchical Embeddings with Localizability, Composability, and Decomposability Learned from Anatomy
- Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding
- EASE-DETR: Easing the Competition among Object Queries
- Making Visual Sense of Oracle Bones for You and Me
- 2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images (Multi-View Image Reconstruction)
- Multimodal Representation Learning by Alternating Unimodal Adaptation (Multimodal)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
- Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering (Multimodal)
- Efficient Hyperparameter Optimization with Adaptive Fidelity Identification
- Practical Measurements of Translucent Materials with Inter-Pixel Translucency Prior
- Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation
⭐code - Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling
- Seeing Motion at Nighttime with an Event Camera
⭐code - Batch Normalization Alleviates the Spectral Bias in Coordinate Networks
- Affine Equivariant Networks Based on Differential Invariants
- NC-TTT: A Noise Contrastive Approach for Test-Time Training
- Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation
- Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains
- Pre-training Vision Models with Mandelbulb Variations
- Noisy One-point Homographies are Surprisingly Good
- Revisiting Global Translation Estimation with Feature Tracks
🏠project - Efficient Scene Recovery Using Luminous Flux Prior
- MR-VNet: Media Restoration using Volterra Networks
- LEAD: Exploring Logit Space Evolution for Model Selection
- EventPS: Real-Time Photometric Stereo Using an Event Camera
- Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning
- Adapters Strike Back
- A Theory of Joint Light and Heat Transport for Lambertian Scenes
- A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction
- MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation
- Animating General Image with Large Visual Motion Model
- Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration
- Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning
- Motion Diversification Networks
- Domain Gap Embeddings for Generative Dataset Augmentation
- Absolute Pose from One or Two Scaled and Oriented Features
- Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance
- From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers
- Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse
- Latent Modulated Function for Computational Optimal Continuous Image Representation
- Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching
⭐code (Data Condensation) - L2B: Learning to Bootstrap Robust Models for Combating Label Noise
⭐code - GART: Gaussian Articulated Template Models
🏠project - Towards Learning a Generalist Model for Embodied Navigation
- Revisiting Sampson Approximations for Geometric Estimation Problems
- Real-Time Neural BRDF with Spherically Distributed Primitives
- PIGEON: Predicting Image Geolocations (Image Geolocation)
- Uncertainty Visualization via Low-Dimensional Posterior Projections
- Eclipse: Disambiguating Illumination and Materials using Unintended Shadows
🏠project - Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
⭐code - CLiC: Concept Learning in Context
- Pick-or-Mix: Dynamic Channel Sampling for ConvNets
- AutoAD III: The Prequel -- Back to the Pixels
🏠project - Training-free Pretrained Model Merging
⭐code - Overcoming Generic Knowledge Loss with Selective Parameter Update
- Selective nonlinearities removal from digital signals
- Memory-Scalable and Simplified Functional Map Learning
- Fully Exploiting Every Real Sample: Super-Pixel Sample Gradient Model Stealing
- Hierarchical Correlation Clustering and Tree Preserving Embedding
- GLID: Pre-training a Generalist Encoder-Decoder Vision Model
- SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement
- MS-DETR: Efficient DETR Training with Mixed Supervision
- PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
- Unified Entropy Optimization for Open-Set Test-Time Adaptation
- IIRP-Net: Iterative Inference Residual Pyramid Network for Enhanced Image Registration (Image Registration)
- Generative Unlearning for Any Identity
⭐code - Error Detection in Egocentric Procedural Task Videos
- [Enhancing Multimodal Cooperation via Sample-level Modality Valuation]
- Enhancing Multimodal Cooperation via Fine-grained Modality Valuation
⭐code - Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships
- Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning
- SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction
⭐code (Single-Image Tree Reconstruction) - Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair (Debiasing)
- BoQ: A Place is Worth a Bag of learnable Queries
- Distilled Datamodel with Reverse Gradient Matching
- Towards Calibrated Multi-label Deep Neural Networks
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
- MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation
- Gradient-based Parameter Selection for Efficient Fine-Tuning
- In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging
- ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images
- Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
⭐code
🏠project - SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective
- Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
- Epistemic Uncertainty Quantification For Pre-trained Neural Network
- Fooling Polarization-based Vision using Locally Controllable Polarizing Projection
- TEA: Test-time Energy Adaptation
⭐code - Would Deep Generative Models Amplify Bias in Future Models?
- A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion
- Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation
⭐code - DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning
⭐code - Explaining CLIP's performance disparities on data from blind/low vision users
- Implicit Assimilation of Sparse In Situ Data for Dense & Global Storm Surge Forecasting
- Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models
- CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition
- Bayesian Differentiable Physics for Cloth Digitalization
⭐code - DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models
- FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions
- Physical Property Understanding from Language-Embedded Feature Fields
⭐code - Clustering for Protein Representation Learning
- Learning Triangular Distribution in Visual World
- InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
- NeISF: Neural Incident Stokes Field for Geometry and Material Estimation
- Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis
- Robust Depth Enhancement via Polarization Prompt Fusion Tuning
⭐code - Dual-Scale Transformer for Large-Scale Single-Pixel Imaging
⭐code - Posterior Distillation Sampling
🏠project - Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo
⭐code - Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
⭐code - MaxQ: Multi-Axis Query for N:M Sparsity Network
⭐code - AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation
⭐code - Can Biases in ImageNet Models Explain Generalization?
⭐code - LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
🏠project - From Activation to Initialization: Scaling Insights for Optimizing Neural Fields
- Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation
⭐code - An N-Point Linear Solver for Line and Motion Estimation with Event Cameras
- UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
⭐code - Deep Generative Data Assimilation in Multimodal Setting
⭐code - PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks
- Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo
- MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
⭐code - Prompt Learning via Meta-Regularization
⭐code - Scalable 3D Registration via Truncated Entry-wise Absolute Residuals
- CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
⭐code
🏠project - Generative Quanta Color Imaging
⭐code - Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
- MedBN: Robust Test-Time Adaptation against Malicious Test Samples
- Material Palette: Extraction of Materials from a Single Image
⭐code
🏠project - Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks
- Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training
⭐code - Riemannian Multinomial Logistics Regression for SPD Neural Networks
⭐code - A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network
⭐code - Backpropagation-free Network for 3D Test-time Adaptation
⭐code - Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning
⭐code - ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
⭐code - Region-Based Representations Revisited
- Neural Clustering based Visual Representation Learning
⭐code - Efficient Stitchable Task Adaptation
⭐code - Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery
- Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis
- PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution
- Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture
⭐code - LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels
⭐code - Continual Forgetting for Pre-trained Vision Models
⭐code - EarthLoc: Astronaut Photography Localization by Indexing Earth from Space
⭐code - SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks
- SurfaceAug: Closing the Gap in Multimodal Ground Truth Sampling
- Controllable Safety-Critical Closed-loop Traffic Simulation via Guided Diffusion
- Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
- AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis
- Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing
- Misalignment-Robust Frequency Distribution Loss for Image Transformation
⭐code - Boosting Neural Representations for Videos with a Conditional Decoder
- SeMoLi: What Moves Together Belongs Together
- VideoMAC: Video Masked Autoencoders Meet ConvNets
- WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts
- Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning
- Neural Redshift: Random Networks are not Random Functions
- LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking
- HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction
- Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
- Desigen: A Pipeline for Controllable Design Template Generation
⭐code - S^2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering
⭐code - Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer
- Rewrite the Stars
⭐code - Neural Refinement for Absolute Pose Regression with Feature Synthesis
⭐code
🏠project