| Title | Author(s) | Date | Word count | # |
| --- | --- | --- | --- | --- |
| Introducing Baseten Self-hosted | Anupreet Walia, Rachel Rapp | Aug 08, 2024 | 670 | - |
| How to benchmark image generation models like Stable Diffusion XL | Philip Kiely | Jan 31, 2024 | 1374 | - |
| Comparing tokens per second across LLMs | Philip Kiely | May 09, 2024 | 769 | - |
| How latent consistency models work | Rachel Rapp | Jun 04, 2024 | 1140 | - |
| Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT | Pankaj Gupta, Philip Kiely | Feb 06, 2024 | 1623 | - |
| New in February 2024 | Baseten | Feb 29, 2024 | 634 | - |
| How to serve 10,000 fine-tuned LLMs from a single GPU | Pankaj Gupta, Philip Kiely | Jul 23, 2024 | 1895 | - |
| Streaming real-time text to speech with XTTS V2 | Het Trivedi, Philip Kiely | Apr 18, 2024 | 1318 | - |
| Continuous vs dynamic batching for AI inference | Matt Howard, Philip Kiely | Apr 05, 2024 | 1350 | - |
| High performance ML inference with NVIDIA TensorRT | Justin Yi, Philip Kiely | Mar 12, 2024 | 1076 | - |
| FP8: Efficient model inference with 8-bit floating point numbers | Pankaj Gupta, Philip Kiely | Mar 07, 2024 | 1021 | 2 |
| The best open source large language model | Philip Kiely | Feb 09, 2024 | 1920 | - |
| New in January 2024 | Baseten | Jan 31, 2024 | 580 | - |
| Using fractional H100 GPUs for efficient model serving | Matt Howard, Vlad Shulman, Pankaj Gupta, Philip Kiely | Mar 28, 2024 | 1086 | - |
| 40% faster Stable Diffusion XL inference with NVIDIA TensorRT | Pankaj Gupta, Justin Yi, Philip Kiely | Feb 22, 2024 | 2403 | - |
| Ten reasons to join Baseten | Dustin Michaels, Philip Kiely | Jul 25, 2024 | 1230 | - |
| Why GPU utilization matters for model inference | Marius Killinger, Philip Kiely | Feb 20, 2024 | 816 | - |
| New in March 2024 | Baseten | Mar 28, 2024 | 553 | - |
| Compound AI systems explained | Rachel Rapp | Aug 06, 2024 | 1338 | - |
| What I learned as a forward-deployed engineer working at an AI startup | Het Trivedi | May 31, 2024 | 1353 | - |
| Introducing Baseten Chains | Bola Malek, Marius Killinger, Sid Shanker, Rachel Rapp, Mike Bilodeau | Jun 27, 2024 | 1132 | 9 |
| The benefits of globally distributed infrastructure for model serving | Phil Howes, Philip Kiely | Mar 01, 2024 | 603 | - |
| 33% faster LLM inference with FP8 quantization | Pankaj Gupta, Philip Kiely | Mar 14, 2024 | 1876 | - |
| Using asynchronous inference in production | Samiksha Pal, Helen Yang, Rachel Rapp | Jul 11, 2024 | 950 | - |
| Introduction to quantizing ML models | Abu Qader, Philip Kiely | Jan 31, 2024 | 1679 | 1 |
| New in April 2024 | Baseten | May 01, 2024 | 552 | - |
| Benchmarking fast Mistral 7B inference | Abu Qader, Pankaj Gupta, Justin Yi, Philip Kiely | Mar 14, 2024 | 1571 | - |
| SPC hackathon winners build with Llama 3.1 on Baseten | Philip Kiely | Aug 16, 2024 | 615 | - |
| Understanding performance benchmarks for LLM inference | Philip Kiely | Jan 12, 2024 | 1459 | - |
| Comparing few-step image generation models | Rachel Rapp | Jun 14, 2024 | 1087 | - |
| Introducing automatic LLM optimization with TensorRT-LLM Engine Builder | Abu Qader, Philip Kiely | Aug 01, 2024 | 939 | 2 |
| Deploying custom ComfyUI workflows as APIs | Het Trivedi, Rachel Rapp | Jul 25, 2024 | 1144 | 1 |
| New in May 2024 | Baseten | Jun 03, 2024 | 598 | - |
| CI/CD for AI model deployments | Vlad Shulman, Samiksha Pal, Sid Shanker, Philip Kiely | Apr 30, 2024 | 914 | - |
| Announcing our Series B | Tuhin Srivastava | Mar 04, 2024 | 629 | 2 |
| Control plane vs workload plane in model serving infrastructure | Colin McGrath, Matt Howard, Philip Kiely | May 29, 2024 | 870 | - |
| Baseten Chains explained: building multi-component AI workflows at scale | Marius Killinger, Rachel Rapp | Jul 02, 2024 | 2424 | - |
| How to double tokens per second for Llama 3 with Medusa | Abu Qader, Philip Kiely | Aug 20, 2024 | 1462 | 2 |
| The best open-source image generation model | Philip Kiely | Aug 29, 2024 | 1409 | - |
| How to build function calling and JSON mode for open-source and fine-tuned LLMs | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 1339 | 1 |
| Introducing function calling and structured output for open-source and fine-tuned LLMs | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 604 | - |
| Building high-performance compound AI applications with MongoDB Atlas and Baseten | Philip Kiely | Sep 17, 2024 | 1425 | - |
| Introducing Baseten Hybrid: control and flexibility in your cloud and ours | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 633 | - |
| Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 688 | - |
| Export your model inference metrics to your favorite observability tool | Helen Yang, Nicolas Gere-lamaysouette, Philip Kiely | Oct 05, 2024 | 493 | - |
| Evaluating NVIDIA H200 GPUs for LLM inference | Pankaj Gupta, Philip Kiely | Oct 23, 2024 | 1294 | - |
| Introducing canary deployments on Baseten | Sid Shanker, Jonathan Rochette, Raymond Cano, Rachel Rapp | Nov 01, 2024 | 932 | - |
| Create custom environments for deployments on Baseten | Samiksha Pal, Raymond Cano, Sid Shanker, Rachel Rapp | Nov 15, 2024 | 621 | - |
| Introducing Custom Servers: Deploy production-ready model servers from Docker images | Tianshu Cheng, Bola Malek, Rachel Rapp | Dec 09, 2024 | 807 | - |
| Generally Available: The fastest, most accurate, and cost-efficient Whisper transcription | William Gao, Derrick Yang, Tianshu Cheng, Rachel Rapp | Dec 12, 2024 | 1145 | - |
| A quick introduction to speculative decoding | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 1139 | - |
| Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference | Justin Yi, Abu Qader, Bryce Dubayah, Rachel Rapp | Dec 20, 2024 | 904 | - |
| How we built production-ready speculative decoding with TensorRT-LLM | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 2729 | - |
| New observability features: activity logging, LLM metrics, and metrics dashboard customization | Suren Atoyan, Aaron Relph, Marius Killinger, Sid Shanker, Rachel Rapp | Dec 23, 2024 | 540 | - |