The BIOSCAN-5M dataset is a multi-modal collection covering over 5 million arthropod specimens, 98% of which are insects. It combines specimen images, DNA barcodes, and taxonomic classifications to support automated species identification and discovery. The dataset builds on the earlier BIOSCAN-1M dataset, offering more data, greater diversity, and more thorough cleaning of taxonomic labels. Combining these modalities lets researchers study insect biodiversity across data types rather than from any single one. The data can be explored, visualized, and analyzed with tools such as FiftyOne, using embeddings from models such as BioCLIP for specimen images and BarcodeBERT for DNA barcode sequences. Together, these resources support work in biodiversity science, conservation, and AI-driven ecological monitoring.
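To make the three modalities concrete, here is a minimal sketch of how one specimen record might be represented and sanity-checked in plain Python. The `SpecimenRecord` class, its field names, the sample values, and the set of accepted barcode symbols are all illustrative assumptions, not the dataset's actual schema or loading API.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record layout: field names are illustrative only and do not
# reflect the dataset's real schema.
@dataclass
class SpecimenRecord:
    specimen_id: str
    image_path: str   # location of the specimen image
    dna_barcode: str  # nucleotide sequence of the DNA barcode
    taxonomy: dict    # taxonomic rank -> label, e.g. {"class": "Insecta"}


# Assumption: barcodes use A/C/G/T plus N (unknown base) and - (alignment gap).
VALID_SYMBOLS = set("ACGTN-")


def is_valid_barcode(seq: str) -> bool:
    """Return True if the sequence is non-empty and uses only expected symbols."""
    return bool(seq) and set(seq.upper()) <= VALID_SYMBOLS


def count_by_rank(records, rank: str) -> Counter:
    """Count records per label at one rank; missing labels count as 'unlabeled'."""
    return Counter(r.taxonomy.get(rank, "unlabeled") for r in records)


# Made-up sample records, for illustration only.
records = [
    SpecimenRecord("s1", "images/s1.jpg", "ACGTTGCA",
                   {"class": "Insecta", "order": "Diptera"}),
    SpecimenRecord("s2", "images/s2.jpg", "ACGNNT-A",
                   {"class": "Insecta"}),
    SpecimenRecord("s3", "images/s3.jpg", "ACGX",
                   {"class": "Arachnida", "order": "Araneae"}),
]

# Keep only records whose barcode passes the symbol check.
valid = [r for r in records if is_valid_barcode(r.dna_barcode)]

print(count_by_rank(records, "class"))
print(count_by_rank(valid, "order"))
```

Keeping the taxonomy as a rank-to-label mapping is one way to tolerate specimens that are labeled only at coarser ranks; a real pipeline would read these fields from the dataset's own metadata instead of hard-coded values.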