Hi, I currently work on a GenAI platform for one of the largest local industrial companies. My daily work mostly involves building inference infrastructure on top of 48 H200 GPUs, Kubernetes, and vLLM. I'd say it's about 80% SRE and 20% software engineering, the latter mainly when building request routing and internal control planes.
My background is in backend engineering rather than ML research or low-level GPU programming, and I am trying to understand what I need to learn to become a proper SWE, not just someone who knows how to deploy and serve LLMs. How did you get into AI infrastructure, and which skills made the biggest difference? I'm especially curious about things you underestimated at first (distributed systems, for example), resources that helped along the way, and skills that are hard to acquire outside large companies.
Any answers or advice would be much appreciated.
Comments URL: [https://news.ycombinator.com/item?id=48609457](https://news.ycombinator.com/item?id=48609457)