Episodes
-
ArXiv Computer Vision research for Thursday, June 13, 2024.
00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data
01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth
03:08: GGHead: Fast and Generalizable 3D Gaussian Heads
04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
06:34: Towards Vision-Language Geo-Foundation Model: A Survey
08:11: SimGen: Simulator-conditioned Driving Scene Generation
09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior
12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living
13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image
15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis
16:29: Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA
17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
19:39: Real-Time Deepfake Detection in the Real-World
21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant
24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion
28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing
31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
33:16: Towards Evaluating the Robustness of Visual State Space Models
34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images
36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras
37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach
40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
41:40: Explore the Limits of Omni-modal Pretraining at Scale
42:46: Interpreting the Weight Space of Customized Diffusion Models
43:58: Depth Anything V2
45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
48:11: Rethinking Score Distillation as a Bridge Between Image Distributions
49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
-
ArXiv Computer Vision research for Thursday, June 13, 2024.
00:21: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
02:11: Large-Scale Evaluation of Open-Set Image Classification Techniques
03:43: PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation
05:00: MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era
06:41: Auto-Vocabulary Segmentation for LiDAR Points
07:30: AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
08:43: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
10:23: Fine-Grained Domain Generalization with Feature Structuralization
12:03: SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution
14:13: ReMI: A Dataset for Reasoning with Multiple Images
15:41: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection
17:26: Thoracic Surgery Video Analysis for Surgical Phase Recognition
18:58: Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval
20:40: Adaptive Slot Attention: Object Discovery with Dynamic Slot Number
22:26: CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification
24:22: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024
25:21: Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns
26:30: WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals
27:44: MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction
29:28: Comparison Visual Instruction Tuning
30:51: MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
32:14: Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV
33:10: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
34:33: Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
36:04: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
37:30: Parameter-Efficient Active Learning for Foundational models
38:31: Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation
40:22: Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
42:38: Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans
44:36: Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis
46:19: Instance-level quantitative saliency in multiple sclerosis lesion segmentation
48:37: CMC-Bench: Towards a New Paradigm of Visual Signal Compression
50:05: Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs
52:05: CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models
-
ArXiv Computer Vision research for Thursday, June 13, 2024.
00:21: FouRA: Fourier Low Rank Adaptation
01:41: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
03:18: Few-Shot Anomaly Detection via Category-Agnostic Registration Learning
04:57: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting
06:46: ToSA: Token Selective Attention for Efficient Vision Transformers
08:00: Computer vision-based model for detecting turning lane features on Florida's public roadways
09:08: Improving Adversarial Robustness via Feature Pattern Consistency Constraint
10:52: Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network
12:10: NeRF Director: Revisiting View Selection in Neural Volume Rendering
13:36: Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
15:03: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
16:40: COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
18:16: Fusion of regional and sparse attention in Vision Transformers
19:26: Zoom and Shift are All You Need
20:17: EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding
21:49: The Penalized Inverse Probability Measure for Conformal Classification
23:24: OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction
24:47: Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation
26:30: Computer Vision Approaches for Automated Bee Counting Application
27:17: Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
28:16: A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras
29:43: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer
31:25: Neural NeRF Compression
32:29: Preserving Identity with Variational Score for General-purpose 3D Editing
33:50: AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings
34:51: Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition
36:10: Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation
37:34: AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring
38:49: Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark
40:45: A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding
42:02: Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?
43:28: FacEnhance: Facial Expression Enhancing with Recurrent DDPMs
45:11: How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models
47:08: Suitability of KANs for Computer Vision: A preliminary investigation
-
ArXiv Computer Vision research for Wednesday, June 12, 2024.
00:20: From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition
02:09: APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation
03:57: 2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction
05:47: DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor
06:58: Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze
08:02: LaneCPP: Continuous 3D Lane Detection using Physical Priors
09:23: FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
11:10: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
12:46: MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
14:39: OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
16:49: AWGUNET: Attention-Aided Wavelet Guided U-Net for Nuclei Segmentation in Histopathology Images
18:15: Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
19:58: Coherent Optical Modems for Full-Wavefield Lidar
21:32: Transformation-Dependent Adversarial Attacks
22:45: PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement
24:10: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
25:57: ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery
27:26: Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement
28:51: Real2Code: Reconstruct Articulated Objects via Code Generation
30:02: Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models
31:42: RMem: Restricted Memory Banks Improve Video Object Segmentation
33:12: What If We Recaption Billions of Web Images with LLaMA-3?
34:42: Real3D: Scaling Up Large Reconstruction Models with Real-World Images
36:07: Enhancing End-to-End Autonomous Driving with Latent World Model
37:12: Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation
38:43: On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models
40:16: Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
42:15: ICE-G: Image Conditional Editing of 3D Gaussian Splats
-
ArXiv Computer Vision research for Wednesday, June 12, 2024.
00:21: From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization
01:44: Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement
03:20: Adversarial Patch for 3D Local Feature Extractor
04:00: Valeo4Cast: A Modular Approach to End-to-End Forecasting
05:38: The impact of deep learning aid on the workload and interpretation accuracy of radiologists on chest computed tomography: a cross-over reader study
08:50: Universal Scale Laws for Colors and Patterns in Imagery
10:11: CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer
11:44: ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
13:25: Continuous fake media detection: adapting deepfake detectors to new generative techniques
15:18: Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment
16:23: One-Step Effective Diffusion Network for Real-World Image Super-Resolution
18:12: 2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation
19:22: Diffusion-Promoted HDR Video Reconstruction
21:09: Runtime Freezing: Dynamic Class Loss for Multi-Organ 3D Segmentation
21:52: A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion
23:54: DistilDoc: Knowledge Distillation for Visually-Rich Document Applications
25:28: Using Deep Convolutional Neural Networks to Detect Rendered Glitches in Video Games
26:39: OpenCOLE: Towards Reproducible Automatic Graphic Design Generation
27:23: Dataset Enhancement with Instance-Level Augmentations
28:33: Interpretable Representation Learning of Cardiac MRI via Attribute Regularization
29:33: A New Class Biorthogonal Spline Wavelet for Image Edge Detection
30:48: Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata
32:10: Vessel Re-identification and Activity Detection in Thermal Domain for Maritime Surveillance
33:32: AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
35:09: From Chaos to Clarity: 3DGS in the Dark
36:32: LaMOT: Language-Guided Multi-Object Tracking
38:07: UDON: Universal Dynamic Online distillatioN for generic image representations
39:49: WMAdapter: Adding WaterMark Control to Latent Diffusion Models
40:48: Blind Image Deblurring using FFT-ReLU with Deep Learning Pipeline Integration
42:06: DocSynthv2: A Practical Autoregressive Modeling for Document Generation
-
ArXiv Computer Vision research for Wednesday, June 12, 2024.
00:20: FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image
01:21: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
02:49: Unveiling the Power of Wavelets: A Wavelet-based Kolmogorov-Arnold Network for Hyperspectral Image Classification
04:26: Flexible Music-Conditioned Dance Generation with Style Description Prompts
05:52: Robust 3D Face Alignment with Multi-Path Neural Architecture Search
07:00: Small Scale Data-Free Knowledge Distillation
08:48: KernelWarehouse: Rethinking the Design of Dynamic Convolution
10:31: A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects
12:34: Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation
14:02: IFTD: Image Feature Triangle Descriptor for Loop Detection in Driving Scenes
14:54: Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection
16:30: DemosaicFormer: Coarse-to-Fine Demosaicing Network for HybridEVS Camera
18:10: Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation
20:07: Accurate Explanation Model for Image Classifiers using Class Association Embedding
21:55: Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding Network
23:11: SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation
24:06: Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization
25:34: OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding
26:58: Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model
28:26: Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
29:52: Deep Learning for Slum Mapping in Remote Sensing Images: A Meta-analysis and Review
31:49: LVBench: An Extreme Long Video Understanding Benchmark
33:14: Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking
34:48: A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR
36:23: 3D CBCT Challenge 2024: Improved Cone Beam CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement
37:29: MWIRSTD: A MWIR Small Target Detection Dataset
38:34: CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models
40:27: A²-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder
42:35: Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
44:26: Identification of Conversation Partners from Egocentric Video
-
ArXiv Computer Vision research for Tuesday, June 11, 2024.
00:21: DERM12345: A Large, Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses
01:44: Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration
02:49: Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning
04:04: OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
06:01: 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
07:24: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
08:58: Image Neural Field Diffusion Models
10:11: Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery
12:29: GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection
14:26: ReduceFormer: Attention with Tensor Reduction by Summation
15:23: Trim 3D Gaussian Splatting for Accurate Geometry Representation
16:44: SPIN: Spacecraft Imagery for Navigation
18:24: Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
20:00: Understanding Visual Concepts Across Models
21:12: Instant 3D Human Avatar Generation using Image Diffusion Models
22:47: Neural Gaffer: Relighting Any Object via Diffusion
24:19: Autoregressive Pretraining with Mamba in Vision
25:51: Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
27:19: Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
28:50: Situational Awareness Matters in 3D Vision Language Reasoning
30:10: Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
31:46: Zero-shot Image Editing with Reference Imitation
33:08: Image and Video Tokenization with Binary Spherical Quantization
34:18: An Image is Worth 32 Tokens for Reconstruction and Generation
36:28: Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring
-
ArXiv Computer Vision research for Tuesday, June 11, 2024.
00:21: NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images
01:27: Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph
03:14: T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text
04:45: Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images
06:23: FaceGPT: Self-supervised Learning to Chat about 3D Human Faces
07:52: RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation
09:15: VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation
10:51: RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection
12:05: RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker
13:52: MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD
15:15: Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation
16:56: MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance
18:20: Open-World Human-Object Interaction Detection via Multi-modal Prompts
20:03: Which Country Is This? Automatic Country Ranking of Street View Photos
20:44: Needle In A Multimodal Haystack
22:10: Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models
23:24: Towards Realistic Data Generation for Real-World Super-Resolution
24:37: Unsupervised Object Detection with Theoretical Guarantees
25:43: Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs
27:45: A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
29:01: Cinematic Gaussians: Real-Time HDR Radiance Fields with Depth of Field
30:24: Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach
32:09: Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection
33:52: Deep Implicit Optimization for Robust and Flexible Image Registration
35:28: Visual Representation Learning with Stochastic Frame Prediction
-
ArXiv Computer Vision research for Tuesday, June 11, 2024.
00:20: Explaining Representation Learning with Perceptual Components
01:28: Optimal Matrix-Mimetic Tensor Algebras via Variable Projection
03:03: Sparse Bayesian Networks: Efficient Uncertainty Quantification in Medical Image Analysis
04:24: Neural Visibility Field for Uncertainty-Driven Active Mapping
05:21: Triple-domain Feature Learning with Frequency-aware Memory Enhancement for Moving Infrared Small Target Detection
06:55: Stepwise Regression and Pre-trained Edge for Robust Stereo Matching
08:38: Evolving from Single-modal to Multi-modal Facial Deepfake Detection: A Survey
10:08: Dual Thinking and Perceptual Analysis of Deep Learning Models using Human Adversarial Examples
11:10: Generative Lifting of Multiview to 3D from Unknown Pose: Wrapping NeRF inside Diffusion
12:34: RWKV-CLIP: A Robust Vision-Language Representation Learner
14:01: Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
15:03: Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection
16:40: MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results
18:34: Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models
19:38: LiSD: An Efficient Multi-Task Learning Framework for LiDAR Segmentation and Detection
21:04: RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks
22:49: PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving
24:15: EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network
26:25: 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation
27:16: DualMamba: A Lightweight Spectral-Spatial Mamba-Convolution Network for Hyperspectral Image Classification
29:09: Triage of 3D pathology data via 2.5D multiple-instance learning to guide pathologist assessments
31:08: Unified Modeling Enhanced Multimodal Learning for Precision Neuro-Oncology
32:23: CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation
33:54: RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents
35:17: AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding
-
ArXiv Computer Vision research for Monday, June 10, 2024.
00:20: ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery
01:59: Diving into Underwater: Segment Anything Model Guided Underwater Salient Instance Segmentation and A Large-scale Dataset
03:44: Vript: A Video Is Worth Thousands of Words
05:38: FRAG: Frequency Adapting Group for Diffusion Video Editing
06:50: Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training
08:38: Robust Latent Representation Tuning for Image-text Classification
09:46: Generalizable Human Gaussians from Single-View Image
11:05: ProcessPainter: Learn Painting Process from Sequence Data
12:29: PointABM: Integrating Bidirectional State Space Model with Multi-Head Self-Attention for Point Cloud Analysis
13:41: Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control
15:00: Latent Representation Matters: Human-like Sketches in One-shot Drawing Tasks
16:14: GAIA: Rethinking Action Quality Assessment for AI-Generated Videos
17:54: Texture Re-scalable Universal Adversarial Perturbation
19:44: W-Net: One-Shot Arbitrary-Style Chinese Character Generation with Deep Neural Networks
20:46: ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
22:04: DiffInject: Revisiting Debias via Synthetic Data Generation using Diffusion-based Style Injection
23:13: A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis
25:15: Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation
26:36: Generalized Nested Latent Variable Models for Lossy Coding applied to Wind Turbine Scenarios
27:48: Black carbon plumes from gas flaring in North Africa identified from multi-spectral imagery with deep learning
28:58: An Effective-Efficient Approach for Dense Multi-Label Action Detection
30:42: 2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval
31:49: iMotion-LLM: Motion Prediction Instruction Tuning
33:05: Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis
34:57: Data Augmentation in Earth Observation: A Diffusion Model Approach
36:22: UEMM-Air: A Synthetic Multi-modal Dataset for Unmanned Aerial Vehicle Object Detection
37:49: UnSupDLA: Towards Unsupervised Document Layout Analysis
39:11: I-MPN: Inductive Message Passing Network for Effective and Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data
40:46: Tuning-Free Visual Customization via View Iterative Self-Attention Control
-
ArXiv Computer Vision research for Monday, June 10, 2024.
00:20: DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
01:41: NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks
03:22: Vehicle Vectors and Traffic Patterns from Planet Imagery
04:15: A Guide to Stochastic Optimisation for Large-Scale Inverse Problems
05:37: Cascading Unknown Detection with Known Classification for Open Set Recognition
06:42: Latent Directions: A Simple Pathway to Bias Mitigation in Generative AI
07:57: MVGamba: Unify 3D Content Generation as State Space Sequence Modeling
09:32: UMAD: Unsupervised Mask-Level Anomaly Detection for Autonomous Driving
10:15: Improving Deep Learning-based Automatic Cranial Defect Reconstruction by Heavy Data Augmentation: From Image Registration to Latent Diffusion Models
11:47: Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization
13:12: Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled Representations
15:01: FPN-IAIA-BL: A Multi-Scale Interpretable Deep Learning Model for Classification of Mass Margins in Digital Mammography
16:18: STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
17:53: Hybrid Video Anomaly Detection for Anomalous Scenarios in Autonomous Driving
18:35: Margin-aware Preference Optimization for Aligning Diffusion Models without Reference
20:24: SYM3D: Learning Symmetric Triplanes for Better 3D-Awareness of GANs
21:48: Spatiotemporal Graph Neural Network Modelling Perfusion MRI
22:57: VCR: Visual Caption Restoration
24:37: AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
26:29: NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
28:09: Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer
30:12: Merlin: A Vision Language Foundation Model for 3D Computed Tomography
32:58: Genomics-guided Representation Learning for Pathologic Pan-cancer Tumor Microenvironment Subtype Prediction
34:26: PGSR: Planar-based Gaussian Splatting for Efficient and High-Fidelity Surface Reconstruction
36:04: NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
37:28: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
39:08: GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation
40:52: IllumiNeRF: 3D Relighting without Inverse Rendering
-
ArXiv Computer Vision research for Sunday, June 09, 2024.
00:20: ControlLoc: Physical-World Hijacking Attack on Visual Perception in Autonomous Driving
02:23: Unified Text-to-Image Generation and Retrieval
03:51: F-LMM: Grounding Frozen Large Multimodal Models
05:34: Multi-Stain Multi-Level Convolutional Network for Multi-Tissue Breast Cancer Image Segmentation
07:43: BOSC: A toolbox for aerial imagery mapping
08:27: Mamba YOLO: SSMs-Based YOLO For Object Detection
10:12: Solution for CVPR 2024 UG2+ Challenge Track on All Weather Semantic Segmentation
11:02: Scaling Graph Convolutions for Mobile Vision
12:59: RefGaussian: Disentangling Reflections from 3D Gaussian Splatting for Realistic Rendering
14:28: Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks
15:45: Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers
16:40: OmniControlNet: Dual-stage Integration for Conditional Image Generation
17:51: GCtx-UNet: Efficient Network for Medical Image Segmentation
19:14: InfoGaussian: Structure-Aware Dynamic Gaussians through Lightweight Information Shaping
20:40: BD-SAT: High-resolution Land Use Land Cover Dataset & Benchmark Results for Developing Division: Dhaka, BD
22:19: Bits-to-Photon: End-to-End Learned Scalable Point Cloud Compression for Direct Rendering
23:28: MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification
24:38: Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024
26:12: CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
29:32: Inter-slice Super-resolution of Magnetic Resonance Images by Pre-training and Self-supervised Fine-tuning
31:04: Causality-inspired Latent Feature Augmentation for Single Domain Generalization
32:41: MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba
34:13: FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model
-
ArXiv Computer Vision research for Sunday, June 09, 2024.
00:20: PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction
01:47: Anomaly Multi-classification in Industrial Scenarios: Transferring Few-shot Learning to a New Task
02:51: GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement
04:51: Visual Prompt Tuning in Null Space for Continual Learning
06:20: SRC-Net: Bi-Temporal Spatial Relationship Concerned Network for Change Detection
08:00: Evolution-aware VAriance (EVA) Coreset Selection for Medical Image Classification
09:29: Diverse 3D Human Pose Generation in Scenes based on Decoupled Structure
10:30: HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model
12:17: Hierarchical Features Matter: A Deep Exploration of GAN Priors for Improved Dataset Distillation
13:37: ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition
15:05: Binarized Diffusion Model for Image Super-Resolution
16:43: Region of Interest Loss for Anonymizing Learned Image Compression
18:15: A DeNoising FPN With Transformer R-CNN for Tiny Object Detection
20:09: Vision Mamba: Cutting-Edge Classification of Alzheimer's Disease with 3D MRI Scans
21:59: MLCM: Multistep Consistency Distillation of Latent Diffusion Model
24:02: CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder
25:42: VCR-GauS: View Consistent Depth-Normal Regularizer for Gaussian Surface Reconstruction
27:09: Utilizing Grounded SAM for self-supervised frugal camouflaged human detection
28:28: Learning to utilize gradient information for crisp edge detection
29:57: A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
31:29: Convolution and Attention-Free Mamba-based Cardiac Image Segmentation
32:51: OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer
34:18: SlowPerception: Physical-World Latency Attack against Visual Perception in Autonomous Driving
36:11: SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention
-
ArXiv Computer Vision research for Saturday, June 08, 2024.
00:20: Blurry-Consistency Segmentation Framework with Selective Stacking on Differential Interference Contrast 3D Breast Cancer Spheroid
01:31: 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation
03:01: Metric Convolutions: A Unifying Theory to Adaptive Convolutions
04:13: Layered Image Vectorization via Semantic Simplification
05:18: Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes
06:31: 3D MRI Synthesis with Slice-Based Latent Diffusion Models: Improving Tumor Segmentation Tasks in Data-Scarce Regimes
07:51: Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models
09:42: Unsupervised learning of Data-driven Facial Expression Coding System (DFECS) using keypoint tracking
11:36: HDRT: Infrared Capture for HDR Imaging
13:14: Attri-Net: A Globally and Locally Inherently Interpretable Model for Multi-Label Classification Using Class-Specific Counterfactuals
14:49: Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis
16:18: Training-Free Robust Interactive Video Object Segmentation
17:49: One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
19:50: A Two-Stage Adverse Weather Semantic Segmentation Method for WeatherProof Challenge CVPR 2024 Workshop UG2+
21:04: PAPR in Motion: Seamless Point-level 3D Scene Interpolation
22:25: VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification
23:38: Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
25:24: Aligning Human Knowledge with Visual Concepts Towards Explainable Medical Image Classification
26:50: Understanding Inhibition Through Maximally Tense Images
27:52: Can Prompt Modifiers Control Bias? A Comparative Analysis of Text-to-Image Generative Models
29:19: Deep Learning to Predict Glaucoma Progression using Structural Changes in the Eye
30:58: Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision
32:32: Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval
34:11: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
35:35: Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion
-
ArXiv Computer Vision research for Friday, June 07, 2024.
00:21: RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
01:52: AGBD: A Global-scale Biomass Dataset
03:30: MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
04:52: Faster Than Lies: Real-time Deepfake Detection using Binary Neural Networks
06:03: Leveraging Activations for Superpixel Explanations
07:02: Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement
08:28: Nacala-Roof-Material: Drone Imagery for Roof Detection, Classification, and Segmentation to Support Mosquito-borne Disease Risk Assessment
10:10: Multi-style Neural Radiance Field with AdaIN
10:52: Multiplane Prior Guided Few-Shot Aerial Scene Rendering
12:15: Semantic Segmentation on VSPW Dataset through Masked Video Consistency
13:24: CityCraft: A Real Crafter for 3D City Generation
15:21: ProMotion: Prototypes As Motion Learners
16:57: AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation
18:00: Clarifying Myths About the Relationship Between Shape Bias, Accuracy, and Robustness
19:50: GANetic Loss for Generative Adversarial Networks with a Focus on Medical Applications
21:35: Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs
23:28: Bootstrapping Referring Multi-Object Tracking
24:50: Prototype Correlation Matching and Class-Relation Reasoning for Few-Shot Medical Image Segmentation
26:48: GenHeld: Generating and Editing Handheld Objects
27:57: Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations
29:11: Hibou: A Family of Foundational Vision Transformers for Pathology
30:41: Diving Deep into the Motion Representation of Video-Text Models
31:46: CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion
33:18: A Novel Time Series-to-Image Encoding Approach for Weather Phenomena Classification
34:48: LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment
36:06: Contextual fusion enhances robustness to image blurring
37:01: Energy Propagation in Scattering Convolution Networks Can Be Arbitrarily Slow
38:12: Towards Semantic Equivalence of Tokenization in Multimodal LLM
39:33: PatchSVD: A Non-uniform SVD-based Image Compression Algorithm
40:29: DVOS: Self-Supervised Dense-Pattern Video Object Segmentation
42:16: 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs
-
ArXiv Computer Vision research for Friday, June 07, 2024.
00:20: Image Processing Based Forest Fire Detection
01:08: STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting
03:05: UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection
04:47: UCDNet: Multi-UAV Collaborative 3D Object Detection Network by Reliable Feature Mapping
06:14: SMART: Scene-motion-aware human action recognition framework for mental disorder group
08:12: LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model
09:34: Evaluating and Mitigating IP Infringement in Visual Generative AI
11:01: MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
12:20: OVMR: Open-Vocabulary Recognition with Multi-Modal References
13:57: ACE Metric: Advection and Convection Evaluation for Accurate Weather Forecasting
15:11: XctDiff: Reconstruction of CT Images with Consistent Anatomical Structures from a Single Radiographic Projection Image
16:22: MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome
17:58: CDeFuse: Continuous Decomposition for Infrared and Visible Image Fusion
19:41: MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description
21:24: PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
22:58: Interpretable Multimodal Out-of-context Detection with Soft Logic Regularization
24:24: SMC++: Masked Learning of Unsupervised Video Semantic Compression
26:19: Diffusion-based Generative Image Outpainting for Recovery of FOV-Truncated CT Images
27:09: MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
28:35: Predictive Dynamic Fusion
29:43: Online Continual Learning of Video Diffusion Models From a Single Video Stream
30:40: A short review on graphonometric evaluation tools in children
31:49: Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors
33:04: EGOR: Efficient Generated Objects Replay for incremental object detection
34:37: 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation
36:02: Multi-Granularity Language-Guided Multi-Object Tracking
37:56: Normal-guided Detail-Preserving Neural Implicit Functions for High-Fidelity 3D Surface Reconstruction
39:52: Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior
41:48: 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
43:54: Seeing the Unseen: Visual Metaphor Captioning for Videos
45:09: Zero-Shot Video Editing through Adaptive Sliding Score Distillation
46:28: Labeled Data Selection for Category Discovery
-
ArXiv Computer Vision research for Thursday, June 06, 2024.
00:20: M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data
02:34: Understanding Information Storage and Transfer in Multi-modal Large Language Models
04:27: Conv-INR: Convolutional Implicit Neural Representation for Multimodal Visual Signals
06:01: Localized Gaussian Point Management
07:59: A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation
09:25: GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions
11:07: MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding
13:02: ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling
14:39: VideoTetris: Towards Compositional Text-to-Video Generation
16:00: SpectralZoom: Efficient Segmentation with an Adaptive Hyperspectral Camera
17:04: Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
18:51: Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry
20:05: Vision-LSTM: xLSTM as Generic Vision Backbone
21:01: ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation
22:03: ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization
23:43: Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
25:32: Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
27:23: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
28:33: DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
30:24: SF-V: Single Forward Video Generation Model
31:51: ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
34:06: Parameter-Inverted Image Pyramid Networks
35:50: Coarse-To-Fine Tensor Trains for Compact Visual Representations
37:23: BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
38:37: DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
40:24: Coherent Zero-Shot Visual Instruction Generation
41:17: Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
42:58: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation
44:56: GLACE: Global Local Accelerated Coordinate Encoding
46:43: Interpreting the Second-Order Effects of Neurons in CLIP
48:03: Learning 1D Causal Visual Representation with De-focus Attention Networks
49:41: Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image
51:14: Stereo-Depth Fusion through Virtual Pattern Projection
-
ArXiv Computer Vision research for Thursday, June 06, 2024.
00:20: ReDistill: Residual Encoded Distillation for Peak Memory Reduction
01:58: Instance Segmentation and Teeth Classification in Panoramic X-rays
03:34: Enhanced Semantic Segmentation Pipeline for WeatherProof Dataset Challenge
04:44: Amortized Equation Discovery in Hybrid Dynamical Systems
05:57: Monocular Localization with Semantics Map for Autonomous Vehicles
07:22: From operculum and body tail movements to different coupling of physical activity and respiratory frequency in farmed gilthead sea bream and European sea bass. Insights on aquaculture biosensing
09:36: Semantic Similarity Score for Measuring Visual Similarity at Semantic Level
11:32: LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model
13:12: Polyp and Surgical Instrument Segmentation with Double Encoder-Decoder Networks
13:52: C^2RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction
15:19: Data-Centric Label Smoothing for Explainable Glaucoma Screening from Eye Fundus Images
16:39: Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
18:03: Frequency-based Matcher for Long-tailed Semantic Segmentation
19:28: LDM-RSIC: Exploring Distortion Prior with Latent Diffusion Models for Remote Sensing Image Compression
21:18: LNQ Challenge 2023: Learning Mediastinal Lymph Node Segmentation with a Probabilistic Lymph Node Atlas
22:45: 3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation
23:30: Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
25:10: Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis
26:03: Shaping History: Advanced Machine Learning Techniques for the Analysis and Dating of Cuneiform Tablets over Three Millennia
28:01: Semmeldetector: Application of Machine Learning in Commercial Bakeries
29:08: Class-Aware Cartilage Segmentation for Autonomous US-CT Registration in Robotic Intercostal Ultrasound Imaging
30:45: How Far Can We Compress Instant-NGP-Based NeRF?
32:11: UrbanSARFloods: Sentinel-1 SLC-Based Benchmark Dataset for Urban and Open-Area Flood Mapping
34:01: Global Parameterization-based Texture Space Optimization
34:52: LenslessFace: An End-to-End Optimized Lensless System for Privacy-Preserving Face Verification
36:22: The 3D-PC: a benchmark for visual perspective taking in humans and machines
38:29: Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization
40:08: Sparse Multi-baseline SAR Cross-modal 3D Reconstruction of Vehicle Targets
41:50: A Voxel-based Approach for Simulating Microbial Decomposition in Soil: Comparison with LBM and Improvement of Morphological Models
43:25: Encoding Semantic Priors into the Weights of Implicit Neural Representation
45:04: Diffusion-based image inpainting with internal learning
45:58: CDMamba: Remote Sensing Image Change Detection with Mamba
47:36: Matching Anything by Segmenting Anything
-
ArXiv Computer Vision research for Wednesday, June 05, 2024.
00:20: Image Copy-Move Forgery Detection and Localization Scheme: How to Avoid Missed Detection and False Alarm
01:52: VWise: A novel benchmark for evaluating scene classification for vehicular applications
03:03: Text-to-Image Rectified Flow as Plug-and-Play Priors
04:25: L-PR: Exploiting LiDAR Fiducial Marker for Unordered Low Overlap Multiview Point Cloud Registration
06:17: Learning Visual Prompts for Guiding the Attention of Vision Transformers
07:25: Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation
08:51: EngineBench: Flow Reconstruction in the Transparent Combustion Chamber III Optical Engine
10:37: A Flexible Recursive Network for Video Stereo Matching Based on Residual Estimation
12:05: SuperFormer: Volumetric Transformer Architectures for MRI Super-Resolution
13:20: SelfReDepth: Self-Supervised Real-Time Depth Restoration for Consumer-Grade Sensors
15:01: Gaussian Representation for Deformable Image Registration
16:37: Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach
18:01: UnWave-Net: Unrolled Wavelet Network for Compton Tomography Image Reconstruction
19:42: CoFie: Learning Compact Neural Surface Representations with Coordinate Fields
21:04: Post-hoc Part-prototype Networks
22:19: Computation-Efficient Era: A Comprehensive Survey of State Space Models in Medical Image Analysis
24:26: CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark
25:51: Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input
27:18: FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
28:38: LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
29:58: Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts
31:36: AD-H: Autonomous Driving with Hierarchical Agents
33:39: Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review
-
ArXiv Computer Vision research for Wednesday, June 05, 2024.
00:20: Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision
02:03: A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
03:42: Exploiting LMM-based knowledge for image classification tasks
04:37: EgoSurgery-Tool: A Dataset of Surgical Tool and Hand Detection from Egocentric Open Surgery Videos
06:09: EpidermaQuant: Unsupervised detection and quantification of epidermal differentiation markers on H-DAB-stained images of reconstructed human epidermis
08:15: Enhancing 3D Lane Detection and Topology Reasoning with 2D Lane Priors
09:24: VQUNet: Vector Quantization U-Net for Defending Adversarial Attacks by Regularizing Unwanted Noise
10:36: Enhanced Automotive Object Detection via RGB-D Fusion in a DiffusionDet Framework
11:42: ZeroPur: Succinct Training-Free Adversarial Purification
13:23: Tiny models from tiny data: Textual and null-text inversion for few-shot distillation
15:10: Multi-Task Multi-Scale Contrastive Knowledge Distillation for Efficient Medical Image Segmentation
16:44: Dynamic 3D Gaussian Fields for Urban Areas
18:10: MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection
20:02: FAPNet: An Effective Frequency Adaptive Point-based Eye Tracker
21:52: Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
23:14: Situation Monitor: Diversity-Driven Zero-Shot Out-of-Distribution Detection using Budding Ensemble Architecture for Object Detection
24:28: Writing Order Recovery in Complex and Long Static Handwriting
25:50: Identification of Stone Deterioration Patterns with Large Multimodal Models
26:58: Searching Priors Makes Text-to-Video Synthesis Better
28:32: Interactive Image Selection and Training for Brain Tumor Segmentation Network
29:35: Global Clipper: Enhancing Safety and Reliability of Transformer-based Object Detection Models
30:53: Generative Diffusion Models for Fast Simulations of Particle Collisions at CERN
31:52: Prompt-based Visual Alignment for Zero-shot Policy Transfer
33:33: ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection