Episodes

  • ArXiv Computer Vision research for Thursday, June 13, 2024.

    00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data

    01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth

    03:08: GGHead: Fast and Generalizable 3D Gaussian Heads

    04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

    06:34: Towards Vision-Language Geo-Foundation Model: A Survey

    08:11: SimGen: Simulator-conditioned Driving Scene Generation

    09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

    11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

    12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

    13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image

    15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis

    16:29: Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

    17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

    19:39: Real-Time Deepfake Detection in the Real-World

    21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

    23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant

    24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

    26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

    28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

    29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

    31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

    33:16: Towards Evaluating the Robustness of Visual State Space Models

    34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

    36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

    37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

    40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    41:40: Explore the Limits of Omni-modal Pretraining at Scale

    42:46: Interpreting the Weight Space of Customized Diffusion Models

    43:58: Depth Anything V2

    45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

    46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

    48:11: Rethinking Score Distillation as a Bridge Between Image Distributions

    49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

  • ArXiv Computer Vision research for Thursday, June 13, 2024.

    00:21: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

    02:11: Large-Scale Evaluation of Open-Set Image Classification Techniques

    03:43: PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

    05:00: MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

    06:41: Auto-Vocabulary Segmentation for LiDAR Points

    07:30: AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring

    08:43: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

    10:23: Fine-Grained Domain Generalization with Feature Structuralization

    12:03: SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

    14:13: ReMI: A Dataset for Reasoning with Multiple Images

    15:41: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection

    17:26: Thoracic Surgery Video Analysis for Surgical Phase Recognition

    18:58: Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

    20:40: Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

    22:26: CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification

    24:22: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

    25:21: Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

    26:30: WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

    27:44: MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction

    29:28: Comparison Visual Instruction Tuning

    30:51: MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    32:14: Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV

    33:10: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    34:33: Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

    36:04: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

    37:30: Parameter-Efficient Active Learning for Foundational models

    38:31: Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

    40:22: Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    42:38: Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans

    44:36: Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis

    46:19: Instance-level quantitative saliency in multiple sclerosis lesion segmentation

    48:37: CMC-Bench: Towards a New Paradigm of Visual Signal Compression

    50:05: Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

    52:05: CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

  • ArXiv Computer Vision research for Thursday, June 13, 2024.

    00:21: FouRA: Fourier Low Rank Adaptation

    01:41: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

    03:18: Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

    04:57: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

    06:46: ToSA: Token Selective Attention for Efficient Vision Transformers

    08:00: Computer vision-based model for detecting turning lane features on Florida's public roadways

    09:08: Improving Adversarial Robustness via Feature Pattern Consistency Constraint

    10:52: Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

    12:10: NeRF Director: Revisiting View Selection in Neural Volume Rendering

    13:36: Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

    15:03: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

    16:40: COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

    18:16: Fusion of regional and sparse attention in Vision Transformers

    19:26: Zoom and Shift are All You Need

    20:17: EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

    21:49: The Penalized Inverse Probability Measure for Conformal Classification

    23:24: OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction

    24:47: Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

    26:30: Computer Vision Approaches for Automated Bee Counting Application

    27:17: Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

    28:16: A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras

    29:43: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

    31:25: Neural NeRF Compression

    32:29: Preserving Identity with Variational Score for General-purpose 3D Editing

    33:50: AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings

    34:51: Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

    36:10: Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

    37:34: AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring

    38:49: Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

    40:45: A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding

    42:02: Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

    43:28: FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

    45:11: How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models

    47:08: Suitability of KANs for Computer Vision: A preliminary investigation

  • ArXiv Computer Vision research for Wednesday, June 12, 2024.

    00:20: From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

    02:09: APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation

    03:57: 2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction

    05:47: DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

    06:58: Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

    08:02: LaneCPP: Continuous 3D Lane Detection using Physical Priors

    09:23: FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

    11:10: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

    12:46: MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

    14:39: OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    16:49: AWGUNET: Attention-Aided Wavelet Guided U-Net for Nuclei Segmentation in Histopathology Images

    18:15: Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

    19:58: Coherent Optical Modems for Full-Wavefield Lidar

    21:32: Transformation-Dependent Adversarial Attacks

    22:45: PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement

    24:10: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    25:57: ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery

    27:26: Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement

    28:51: Real2Code: Reconstruct Articulated Objects via Code Generation

    30:02: Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

    31:42: RMem: Restricted Memory Banks Improve Video Object Segmentation

    33:12: What If We Recaption Billions of Web Images with LLaMA-3?

    34:42: Real3D: Scaling Up Large Reconstruction Models with Real-World Images

    36:07: Enhancing End-to-End Autonomous Driving with Latent World Model

    37:12: Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation

    38:43: On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models

    40:16: Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

    42:15: ICE-G: Image Conditional Editing of 3D Gaussian Splats

  • ArXiv Computer Vision research for Wednesday, June 12, 2024.

    00:21: From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization

    01:44: Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

    03:20: Adversarial Patch for 3D Local Feature Extractor

    04:00: Valeo4Cast: A Modular Approach to End-to-End Forecasting

    05:38: The impact of deep learning aid on the workload and interpretation accuracy of radiologists on chest computed tomography: a cross-over reader study

    08:50: Universal Scale Laws for Colors and Patterns in Imagery

    10:11: CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer

    11:44: ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

    13:25: Continuous fake media detection: adapting deepfake detectors to new generative techniques

    15:18: Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment

    16:23: One-Step Effective Diffusion Network for Real-World Image Super-Resolution

    18:12: 2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

    19:22: Diffusion-Promoted HDR Video Reconstruction

    21:09: Runtime Freezing: Dynamic Class Loss for Multi-Organ 3D Segmentation

    21:52: A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion

    23:54: DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

    25:28: Using Deep Convolutional Neural Networks to Detect Rendered Glitches in Video Games

    26:39: OpenCOLE: Towards Reproducible Automatic Graphic Design Generation

    27:23: Dataset Enhancement with Instance-Level Augmentations

    28:33: Interpretable Representation Learning of Cardiac MRI via Attribute Regularization

    29:33: A New Class Biorthogonal Spline Wavelet for Image Edge Detection

    30:48: Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata

    32:10: Vessel Re-identification and Activity Detection in Thermal Domain for Maritime Surveillance

    33:32: AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer

    35:09: From Chaos to Clarity: 3DGS in the Dark

    36:32: LaMOT: Language-Guided Multi-Object Tracking

    38:07: UDON: Universal Dynamic Online distillatioN for generic image representations

    39:49: WMAdapter: Adding WaterMark Control to Latent Diffusion Models

    40:48: Blind Image Deblurring using FFT-ReLU with Deep Learning Pipeline Integration

    42:06: DocSynthv2: A Practical Autoregressive Modeling for Document Generation

  • ArXiv Computer Vision research for Wednesday, June 12, 2024.

    00:20: FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image

    01:21: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

    02:49: Unveiling the Power of Wavelets: A Wavelet-based Kolmogorov-Arnold Network for Hyperspectral Image Classification

    04:26: Flexible Music-Conditioned Dance Generation with Style Description Prompts

    05:52: Robust 3D Face Alignment with Multi-Path Neural Architecture Search

    07:00: Small Scale Data-Free Knowledge Distillation

    08:48: KernelWarehouse: Rethinking the Design of Dynamic Convolution

    10:31: A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects

    12:34: Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

    14:02: IFTD: Image Feature Triangle Descriptor for Loop Detection in Driving Scenes

    14:54: Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection

    16:30: DemosaicFormer: Coarse-to-Fine Demosaicing Network for HybridEVS Camera

    18:10: Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation

    20:07: Accurate Explanation Model for Image Classifiers using Class Association Embedding

    21:55: Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding Network

    23:11: SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation

    24:06: Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

    25:34: OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding

    26:58: Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model

    28:26: Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

    29:52: Deep Learning for Slum Mapping in Remote Sensing Images: A Meta-analysis and Review

    31:49: LVBench: An Extreme Long Video Understanding Benchmark

    33:14: Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

    34:48: A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR

    36:23: 3D CBCT Challenge 2024: Improved Cone Beam CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement

    37:29: MWIRSTD: A MWIR Small Target Detection Dataset

    38:34: CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

    40:27: A²-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

    42:35: Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

    44:26: Identification of Conversation Partners from Egocentric Video

  • ArXiv Computer Vision research for Tuesday, June 11, 2024.

    00:21: DERM12345: A Large, Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses

    01:44: Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration

    02:49: Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

    04:04: OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

    06:01: 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

    07:24: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    08:58: Image Neural Field Diffusion Models

    10:11: Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery

    12:29: GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

    14:26: ReduceFormer: Attention with Tensor Reduction by Summation

    15:23: Trim 3D Gaussian Splatting for Accurate Geometry Representation

    16:44: SPIN: Spacecraft Imagery for Navigation

    18:24: Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

    20:00: Understanding Visual Concepts Across Models

    21:12: Instant 3D Human Avatar Generation using Image Diffusion Models

    22:47: Neural Gaffer: Relighting Any Object via Diffusion

    24:19: Autoregressive Pretraining with Mamba in Vision

    25:51: Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

    27:19: Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

    28:50: Situational Awareness Matters in 3D Vision Language Reasoning

    30:10: Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

    31:46: Zero-shot Image Editing with Reference Imitation

    33:08: Image and Video Tokenization with Binary Spherical Quantization

    34:18: An Image is Worth 32 Tokens for Reconstruction and Generation

    36:28: Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

  • ArXiv Computer Vision research for Tuesday, June 11, 2024.

    00:21: NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images

    01:27: Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

    03:14: T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text

    04:45: Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images

    06:23: FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

    07:52: RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation

    09:15: VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation

    10:51: RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection

    12:05: RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker

    13:52: MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD

    15:15: Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation

    16:56: MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

    18:20: Open-World Human-Object Interaction Detection via Multi-modal Prompts

    20:03: Which Country Is This? Automatic Country Ranking of Street View Photos

    20:44: Needle In A Multimodal Haystack

    22:10: Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models

    23:24: Towards Realistic Data Generation for Real-World Super-Resolution

    24:37: Unsupervised Object Detection with Theoretical Guarantees

    25:43: Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs

    27:45: A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation

    29:01: Cinematic Gaussians: Real-Time HDR Radiance Fields with Depth of Field

    30:24: Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach

    32:09: Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection

    33:52: Deep Implicit Optimization for Robust and Flexible Image Registration

    35:28: Visual Representation Learning with Stochastic Frame Prediction

  • ArXiv Computer Vision research for Tuesday, June 11, 2024.

    00:20: Explaining Representation Learning with Perceptual Components

    01:28: Optimal Matrix-Mimetic Tensor Algebras via Variable Projection

    03:03: Sparse Bayesian Networks: Efficient Uncertainty Quantification in Medical Image Analysis

    04:24: Neural Visibility Field for Uncertainty-Driven Active Mapping

    05:21: Triple-domain Feature Learning with Frequency-aware Memory Enhancement for Moving Infrared Small Target Detection

    06:55: Stepwise Regression and Pre-trained Edge for Robust Stereo Matching

    08:38: Evolving from Single-modal to Multi-modal Facial Deepfake Detection: A Survey

    10:08: Dual Thinking and Perceptual Analysis of Deep Learning Models using Human Adversarial Examples

    11:10: Generative Lifting of Multiview to 3D from Unknown Pose: Wrapping NeRF inside Diffusion

    12:34: RWKV-CLIP: A Robust Vision-Language Representation Learner

    14:01: Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    15:03: Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

    16:40: MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results

    18:34: Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models

    19:38: LiSD: An Efficient Multi-Task Learning Framework for LiDAR Segmentation and Detection

    21:04: RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks

    22:49: PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving

    24:15: EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network

    26:25: 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

    27:16: DualMamba: A Lightweight Spectral-Spatial Mamba-Convolution Network for Hyperspectral Image Classification

    29:09: Triage of 3D pathology data via 2.5D multiple-instance learning to guide pathologist assessments

    31:08: Unified Modeling Enhanced Multimodal Learning for Precision Neuro-Oncology

    32:23: CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

    33:54: RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents

    35:17: AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

  • ArXiv Computer Vision research for Monday, June 10, 2024.

    00:20: ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery

    01:59: Diving into Underwater: Segment Anything Model Guided Underwater Salient Instance Segmentation and A Large-scale Dataset

    03:44: Vript: A Video Is Worth Thousands of Words

    05:38: FRAG: Frequency Adapting Group for Diffusion Video Editing

    06:50: Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training

    08:38: Robust Latent Representation Tuning for Image-text Classification

    09:46: Generalizable Human Gaussians from Single-View Image

    11:05: ProcessPainter: Learn Painting Process from Sequence Data

    12:29: PointABM: Integrating Bidirectional State Space Model with Multi-Head Self-Attention for Point Cloud Analysis

    13:41: Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control

    15:00: Latent Representation Matters: Human-like Sketches in One-shot Drawing Tasks

    16:14: GAIA: Rethinking Action Quality Assessment for AI-Generated Videos

    17:54: Texture Re-scalable Universal Adversarial Perturbation

    19:44: W-Net: One-Shot Arbitrary-Style Chinese Character Generation with Deep Neural Networks

    20:46: ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models

    22:04: DiffInject: Revisiting Debias via Synthetic Data Generation using Diffusion-based Style Injection

    23:13: A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis

    25:15: Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation

    26:36: Generalized Nested Latent Variable Models for Lossy Coding applied to Wind Turbine Scenarios

    27:48: Black carbon plumes from gas flaring in North Africa identified from multi-spectral imagery with deep learning

    28:58: An Effective-Efficient Approach for Dense Multi-Label Action Detection

    30:42: 2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval

    31:49: iMotion-LLM: Motion Prediction Instruction Tuning

    33:05: Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

    34:57: Data Augmentation in Earth Observation: A Diffusion Model Approach

    36:22: UEMM-Air: A Synthetic Multi-modal Dataset for Unmanned Aerial Vehicle Object Detection

    37:49: UnSupDLA: Towards Unsupervised Document Layout Analysis

    39:11: I-MPN: Inductive Message Passing Network for Effective and Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data

    40:46: Tuning-Free Visual Customization via View Iterative Self-Attention Control

  • ArXiv Computer Vision research for Monday, June 10, 2024.

    00:20: DualAD: Disentangling the Dynamic and Static World for End-to-End Driving

    01:41: NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks

    03:22: Vehicle Vectors and Traffic Patterns from Planet Imagery

    04:15: A Guide to Stochastic Optimisation for Large-Scale Inverse Problems

    05:37: Cascading Unknown Detection with Known Classification for Open Set Recognition

    06:42: Latent Directions: A Simple Pathway to Bias Mitigation in Generative AI

    07:57: MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

    09:32: UMAD: Unsupervised Mask-Level Anomaly Detection for Autonomous Driving

    10:15: Improving Deep Learning-based Automatic Cranial Defect Reconstruction by Heavy Data Augmentation: From Image Registration to Latent Diffusion Models

    11:47: Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization

    13:12: Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled Representations

    15:01: FPN-IAIA-BL: A Multi-Scale Interpretable Deep Learning Model for Classification of Mass Margins in Digital Mammography

    16:18: STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics

    17:53: Hybrid Video Anomaly Detection for Anomalous Scenarios in Autonomous Driving

    18:35: Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

    20:24: SYM3D: Learning Symmetric Triplanes for Better 3D-Awareness of GANs

    21:48: Spatiotemporal Graph Neural Network Modelling Perfusion MRI

    22:57: VCR: Visual Caption Restoration

    24:37: AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

    26:29: NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

    28:09: Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

    30:12: Merlin: A Vision Language Foundation Model for 3D Computed Tomography

    32:58: Genomics-guided Representation Learning for Pathologic Pan-cancer Tumor Microenvironment Subtype Prediction

    34:26: PGSR: Planar-based Gaussian Splatting for Efficient and High-Fidelity Surface Reconstruction

    36:04: NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing

    37:28: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    39:08: GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation

    40:52: IllumiNeRF: 3D Relighting without Inverse Rendering

  • ArXiv Computer Vision research for Sunday, June 09, 2024.

    00:20: ControlLoc: Physical-World Hijacking Attack on Visual Perception in Autonomous Driving

    02:23: Unified Text-to-Image Generation and Retrieval

    03:51: F-LMM: Grounding Frozen Large Multimodal Models

    05:34: Multi-Stain Multi-Level Convolutional Network for Multi-Tissue Breast Cancer Image Segmentation

    07:43: BOSC: A toolbox for aerial imagery mapping

    08:27: Mamba YOLO: SSMs-Based YOLO For Object Detection

    10:12: Solution for CVPR 2024 UG2+ Challenge Track on All Weather Semantic Segmentation

    11:02: Scaling Graph Convolutions for Mobile Vision

    12:59: RefGaussian: Disentangling Reflections from 3D Gaussian Splatting for Realistic Rendering

    14:28: Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

    15:45: Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers

    16:40: OmniControlNet: Dual-stage Integration for Conditional Image Generation

    17:51: GCtx-UNet: Efficient Network for Medical Image Segmentation

    19:14: InfoGaussian: Structure-Aware Dynamic Gaussians through Lightweight Information Shaping

    20:40: BD-SAT: High-resolution Land Use Land Cover Dataset & Benchmark Results for Developing Division: Dhaka, BD

    22:19: Bits-to-Photon: End-to-End Learned Scalable Point Cloud Compression for Direct Rendering

    23:28: MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification

    24:38: Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

    26:12: CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

    29:32: Inter-slice Super-resolution of Magnetic Resonance Images by Pre-training and Self-supervised Fine-tuning

    31:04: Causality-inspired Latent Feature Augmentation for Single Domain Generalization

    32:41: MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

    34:13: FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

  • ArXiv Computer Vision research for Sunday, June 09, 2024.

    00:20: PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction

    01:47: Anomaly Multi-classification in Industrial Scenarios: Transferring Few-shot Learning to a New Task

    02:51: GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

    04:51: Visual Prompt Tuning in Null Space for Continual Learning

    06:20: SRC-Net: Bi-Temporal Spatial Relationship Concerned Network for Change Detection

    08:00: Evolution-aware VAriance (EVA) Coreset Selection for Medical Image Classification

    09:29: Diverse 3D Human Pose Generation in Scenes based on Decoupled Structure

    10:30: HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model

    12:17: Hierarchical Features Matter: A Deep Exploration of GAN Priors for Improved Dataset Distillation

    13:37: ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition

    15:05: Binarized Diffusion Model for Image Super-Resolution

    16:43: Region of Interest Loss for Anonymizing Learned Image Compression

    18:15: A DeNoising FPN With Transformer R-CNN for Tiny Object Detection

    20:09: Vision Mamba: Cutting-Edge Classification of Alzheimer's Disease with 3D MRI Scans

    21:59: MLCM: Multistep Consistency Distillation of Latent Diffusion Model

    24:02: CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

    25:42: VCR-GauS: View Consistent Depth-Normal Regularizer for Gaussian Surface Reconstruction

    27:09: Utilizing Grounded SAM for self-supervised frugal camouflaged human detection

    28:28: Learning to utilize gradient information for crisp edge detection

    29:57: A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions

    31:29: Convolution and Attention-Free Mamba-based Cardiac Image Segmentation

    32:51: OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer

    34:18: SlowPerception: Physical-World Latency Attack against Visual Perception in Autonomous Driving

    36:11: SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention

  • ArXiv Computer Vision research for Saturday, June 08, 2024.

    00:20: Blurry-Consistency Segmentation Framework with Selective Stacking on Differential Interference Contrast 3D Breast Cancer Spheroid

    01:31: 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation

    03:01: Metric Convolutions: A Unifying Theory to Adaptive Convolutions

    04:13: Layered Image Vectorization via Semantic Simplification

    05:18: Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes

    06:31: 3D MRI Synthesis with Slice-Based Latent Diffusion Models: Improving Tumor Segmentation Tasks in Data-Scarce Regimes

    07:51: Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models

    09:42: Unsupervised learning of Data-driven Facial Expression Coding System (DFECS) using keypoint tracking

    11:36: HDRT: Infrared Capture for HDR Imaging

    13:14: Attri-Net: A Globally and Locally Inherently Interpretable Model for Multi-Label Classification Using Class-Specific Counterfactuals

    14:49: Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

    16:18: Training-Free Robust Interactive Video Object Segmentation

    17:49: One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

    19:50: A Two-Stage Adverse Weather Semantic Segmentation Method for WeatherProof Challenge CVPR 2024 Workshop UG2+

    21:04: PAPR in Motion: Seamless Point-level 3D Scene Interpolation

    22:25: VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification

    23:38: Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

    25:24: Aligning Human Knowledge with Visual Concepts Towards Explainable Medical Image Classification

    26:50: Understanding Inhibition Through Maximally Tense Images

    27:52: Can Prompt Modifiers Control Bias? A Comparative Analysis of Text-to-Image Generative Models

    29:19: Deep Learning to Predict Glaucoma Progression using Structural Changes in the Eye

    30:58: Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision

    32:32: Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

    34:11: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

    35:35: Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

  • ArXiv Computer Vision research for Friday, June 07, 2024.

    00:21: RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection

    01:52: AGBD: A Global-scale Biomass Dataset

    03:30: MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

    04:52: Faster Than Lies: Real-time Deepfake Detection using Binary Neural Networks

    06:03: Leveraging Activations for Superpixel Explanations

    07:02: Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement

    08:28: Nacala-Roof-Material: Drone Imagery for Roof Detection, Classification, and Segmentation to Support Mosquito-borne Disease Risk Assessment

    10:10: Multi-style Neural Radiance Field with AdaIN

    10:52: Multiplane Prior Guided Few-Shot Aerial Scene Rendering

    12:15: Semantic Segmentation on VSPW Dataset through Masked Video Consistency

    13:24: CityCraft: A Real Crafter for 3D City Generation

    15:21: ProMotion: Prototypes As Motion Learners

    16:57: AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

    18:00: Clarifying Myths About the Relationship Between Shape Bias, Accuracy, and Robustness

    19:50: GANetic Loss for Generative Adversarial Networks with a Focus on Medical Applications

    21:35: Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

    23:28: Bootstrapping Referring Multi-Object Tracking

    24:50: Prototype Correlation Matching and Class-Relation Reasoning for Few-Shot Medical Image Segmentation

    26:48: GenHeld: Generating and Editing Handheld Objects

    27:57: Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations

    29:11: Hibou: A Family of Foundational Vision Transformers for Pathology

    30:41: Diving Deep into the Motion Representation of Video-Text Models

    31:46: CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion

    33:18: A Novel Time Series-to-Image Encoding Approach for Weather Phenomena Classification

    34:48: LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

    36:06: Contextual fusion enhances robustness to image blurring

    37:01: Energy Propagation in Scattering Convolution Networks Can Be Arbitrarily Slow

    38:12: Towards Semantic Equivalence of Tokenization in Multimodal LLM

    39:33: PatchSVD: A Non-uniform SVD-based Image Compression Algorithm

    40:29: DVOS: Self-Supervised Dense-Pattern Video Object Segmentation

    42:16: 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs

  • ArXiv Computer Vision research for Friday, June 07, 2024.

    00:20: Image Processing Based Forest Fire Detection

    01:08: STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

    03:05: UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection

    04:47: UCDNet: Multi-UAV Collaborative 3D Object Detection Network by Reliable Feature Mapping

    06:14: SMART: Scene-motion-aware human action recognition framework for mental disorder group

    08:12: LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

    09:34: Evaluating and Mitigating IP Infringement in Visual Generative AI

    11:01: MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

    12:20: OVMR: Open-Vocabulary Recognition with Multi-Modal References

    13:57: ACE Metric: Advection and Convection Evaluation for Accurate Weather Forecasting

    15:11: XctDiff: Reconstruction of CT Images with Consistent Anatomical Structures from a Single Radiographic Projection Image

    16:22: MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome

    17:58: CDeFuse: Continuous Decomposition for Infrared and Visible Image Fusion

    19:41: MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

    21:24: PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction

    22:58: Interpretable Multimodal Out-of-context Detection with Soft Logic Regularization

    24:24: SMC++: Masked Learning of Unsupervised Video Semantic Compression

    26:19: Diffusion-based Generative Image Outpainting for Recovery of FOV-Truncated CT Images

    27:09: MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

    28:35: Predictive Dynamic Fusion

    29:43: Online Continual Learning of Video Diffusion Models From a Single Video Stream

    30:40: A short review on graphonometric evaluation tools in children

    31:49: Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors

    33:04: EGOR: Efficient Generated Objects Replay for incremental object detection

    34:37: 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation

    36:02: Multi-Granularity Language-Guided Multi-Object Tracking

    37:56: Normal-guided Detail-Preserving Neural Implicit Functions for High-Fidelity 3D Surface Reconstruction

    39:52: Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior

    41:48: 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

    43:54: Seeing the Unseen: Visual Metaphor Captioning for Videos

    45:09: Zero-Shot Video Editing through Adaptive Sliding Score Distillation

    46:28: Labeled Data Selection for Category Discovery

  • ArXiv Computer Vision research for Thursday, June 06, 2024.

    00:20: M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data

    02:34: Understanding Information Storage and Transfer in Multi-modal Large Language Models

    04:27: Conv-INR: Convolutional Implicit Neural Representation for Multimodal Visual Signals

    06:01: Localized Gaussian Point Management

    07:59: A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation

    09:25: GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions

    11:07: MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

    13:02: ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

    14:39: VideoTetris: Towards Compositional Text-to-Video Generation

    16:00: SpectralZoom: Efficient Segmentation with an Adaptive Hyperspectral Camera

    17:04: Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

    18:51: Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

    20:05: Vision-LSTM: xLSTM as Generic Vision Backbone

    21:01: ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation

    22:03: ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

    23:43: Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

    25:32: Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

    27:23: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

    28:33: DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

    30:24: SF-V: Single Forward Video Generation Model

    31:51: ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    34:06: Parameter-Inverted Image Pyramid Networks

    35:50: Coarse-To-Fine Tensor Trains for Compact Visual Representations

    37:23: BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

    38:37: DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

    40:24: Coherent Zero-Shot Visual Instruction Generation

    41:17: Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

    42:58: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

    44:56: GLACE: Global Local Accelerated Coordinate Encoding

    46:43: Interpreting the Second-Order Effects of Neurons in CLIP

    48:03: Learning 1D Causal Visual Representation with De-focus Attention Networks

    49:41: Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

    51:14: Stereo-Depth Fusion through Virtual Pattern Projection

  • ArXiv Computer Vision research for Thursday, June 06, 2024.

    00:20: ReDistill: Residual Encoded Distillation for Peak Memory Reduction

    01:58: Instance Segmentation and Teeth Classification in Panoramic X-rays

    03:34: Enhanced Semantic Segmentation Pipeline for WeatherProof Dataset Challenge

    04:44: Amortized Equation Discovery in Hybrid Dynamical Systems

    05:57: Monocular Localization with Semantics Map for Autonomous Vehicles

    07:22: From operculum and body tail movements to different coupling of physical activity and respiratory frequency in farmed gilthead sea bream and European sea bass. Insights on aquaculture biosensing

    09:36: Semantic Similarity Score for Measuring Visual Similarity at Semantic Level

    11:32: LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model

    13:12: Polyp and Surgical Instrument Segmentation with Double Encoder-Decoder Networks

    13:52: C^2RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction

    15:19: Data-Centric Label Smoothing for Explainable Glaucoma Screening from Eye Fundus Images

    16:39: Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following

    18:03: Frequency-based Matcher for Long-tailed Semantic Segmentation

    19:28: LDM-RSIC: Exploring Distortion Prior with Latent Diffusion Models for Remote Sensing Image Compression

    21:18: LNQ Challenge 2023: Learning Mediastinal Lymph Node Segmentation with a Probabilistic Lymph Node Atlas

    22:45: 3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

    23:30: Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

    25:10: Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

    26:03: Shaping History: Advanced Machine Learning Techniques for the Analysis and Dating of Cuneiform Tablets over Three Millennia

    28:01: Semmeldetector: Application of Machine Learning in Commercial Bakeries

    29:08: Class-Aware Cartilage Segmentation for Autonomous US-CT Registration in Robotic Intercostal Ultrasound Imaging

    30:45: How Far Can We Compress Instant-NGP-Based NeRF?

    32:11: UrbanSARFloods: Sentinel-1 SLC-Based Benchmark Dataset for Urban and Open-Area Flood Mapping

    34:01: Global Parameterization-based Texture Space Optimization

    34:52: LenslessFace: An End-to-End Optimized Lensless System for Privacy-Preserving Face Verification

    36:22: The 3D-PC: a benchmark for visual perspective taking in humans and machines

    38:29: Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization

    40:08: Sparse Multi-baseline SAR Cross-modal 3D Reconstruction of Vehicle Targets

    41:50: A Voxel-based Approach for Simulating Microbial Decomposition in Soil: Comparison with LBM and Improvement of Morphological Models

    43:25: Encoding Semantic Priors into the Weights of Implicit Neural Representation

    45:04: Diffusion-based image inpainting with internal learning

    45:58: CDMamba: Remote Sensing Image Change Detection with Mamba

    47:36: Matching Anything by Segmenting Anything

  • ArXiv Computer Vision research for Wednesday, June 05, 2024.

    00:20: Image Copy-Move Forgery Detection and Localization Scheme: How to Avoid Missed Detection and False Alarm

    01:52: VWise: A novel benchmark for evaluating scene classification for vehicular applications

    03:03: Text-to-Image Rectified Flow as Plug-and-Play Priors

    04:25: L-PR: Exploiting LiDAR Fiducial Marker for Unordered Low Overlap Multiview Point Cloud Registration

    06:17: Learning Visual Prompts for Guiding the Attention of Vision Transformers

    07:25: Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation

    08:51: EngineBench: Flow Reconstruction in the Transparent Combustion Chamber III Optical Engine

    10:37: A Flexible Recursive Network for Video Stereo Matching Based on Residual Estimation

    12:05: SuperFormer: Volumetric Transformer Architectures for MRI Super-Resolution

    13:20: SelfReDepth: Self-Supervised Real-Time Depth Restoration for Consumer-Grade Sensors

    15:01: Gaussian Representation for Deformable Image Registration

    16:37: Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

    18:01: UnWave-Net: Unrolled Wavelet Network for Compton Tomography Image Reconstruction

    19:42: CoFie: Learning Compact Neural Surface Representations with Coordinate Fields

    21:04: Post-hoc Part-prototype Networks

    22:19: Computation-Efficient Era: A Comprehensive Survey of State Space Models in Medical Image Analysis

    24:26: CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark

    25:51: Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input

    27:18: FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

    28:38: LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

    29:58: Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts

    31:36: AD-H: Autonomous Driving with Hierarchical Agents

    33:39: Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review

  • ArXiv Computer Vision research for Wednesday, June 05, 2024.

    00:20: Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

    02:03: A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

    03:42: Exploiting LMM-based knowledge for image classification tasks

    04:37: EgoSurgery-Tool: A Dataset of Surgical Tool and Hand Detection from Egocentric Open Surgery Videos

    06:09: EpidermaQuant: Unsupervised detection and quantification of epidermal differentiation markers on H-DAB-stained images of reconstructed human epidermis

    08:15: Enhancing 3D Lane Detection and Topology Reasoning with 2D Lane Priors

    09:24: VQUNet: Vector Quantization U-Net for Defending Adversarial Attacks by Regularizing Unwanted Noise

    10:36: Enhanced Automotive Object Detection via RGB-D Fusion in a DiffusionDet Framework

    11:42: ZeroPur: Succinct Training-Free Adversarial Purification

    13:23: Tiny models from tiny data: Textual and null-text inversion for few-shot distillation

    15:10: Multi-Task Multi-Scale Contrastive Knowledge Distillation for Efficient Medical Image Segmentation

    16:44: Dynamic 3D Gaussian Fields for Urban Areas

    18:10: MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection

    20:02: FAPNet: An Effective Frequency Adaptive Point-based Eye Tracker

    21:52: Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

    23:14: Situation Monitor: Diversity-Driven Zero-Shot Out-of-Distribution Detection using Budding Ensemble Architecture for Object Detection

    24:28: Writing Order Recovery in Complex and Long Static Handwriting

    25:50: Identification of Stone Deterioration Patterns with Large Multimodal Models

    26:58: Searching Priors Makes Text-to-Video Synthesis Better

    28:32: Interactive Image Selection and Training for Brain Tumor Segmentation Network

    29:35: Global Clipper: Enhancing Safety and Reliability of Transformer-based Object Detection Models

    30:53: Generative Diffusion Models for Fast Simulations of Particle Collisions at CERN

    31:52: Prompt-based Visual Alignment for Zero-shot Policy Transfer

    33:33: ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection