arXiv, the widely used open-access platform for scientific preprints hosted by Cornell University, is shifting its entire operation from university-hosted virtual machines to Google Cloud Platform (GCP).
The move anchors a multi-year technical refresh project named “arXiv CE” (Cloud Edition), designed to bolster the platform’s capacity and stability as it grapples with increasing usage and seeks to shed legacy code.
This shift comes as arXiv, which hosts over 2.6 million papers and serves around five million users monthly, navigates both internal technical debt and external financial pressures on its host institution, Cornell. The initiative is supported by the Simons Foundation, with strategic guidance from Invest in Open Infrastructure as part of a planning effort under way since at least early 2023.
Modernizing a Foundational Platform
For many researchers, particularly in physics and math, arXiv is a daily resource. “Everybody in math and physics uses it,” computer scientist Scott Aaronson told WIRED in March. “I scan it every night.”
Founded by Paul Ginsparg in 1991 while at Los Alamos National Laboratory, arXiv bypassed traditional, slow peer-review journal timelines, allowing rapid sharing of preprints.
Its initial form used shell scripts running on Ginsparg’s NeXT machine before moving to email/FTP and later the web. Its success demonstrated, according to physicist Paul Fendley, “that you could divorce the actual transmission of your results from the process of refereeing.”
However, the platform’s technical underpinnings have aged. The arXiv CE project directly targets this legacy infrastructure. A core objective detailed on arXiv’s careers page is the replacement of remaining Perl and PHP backend components, standardizing on Python.
The plan involves re-architecting article processing to be fully asynchronous and containerizing services. Containerization packages applications for consistent deployment, and arXiv plans to use technologies like Kubernetes (an open-source system for automating container management) or Google Cloud Run (a managed serverless container platform).
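To make "fully asynchronous" concrete, here is a minimal sketch in Python, the language arXiv says it is standardizing on. It is an illustration only, not arXiv's actual code: the function names (`process_article`, `worker`, `run_pipeline`), the sample identifiers, and the simulated delay are all assumptions made for the example. The idea is that submissions go onto a queue and a pool of workers handles them concurrently, so one slow document does not hold up the rest.

```python
import asyncio


async def process_article(article_id: str) -> str:
    """Hypothetical stand-in for a slow step, such as compiling a submission."""
    await asyncio.sleep(0.01)  # simulate waiting on I/O-bound work
    return f"{article_id}: processed"


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Pull article IDs off the queue until the task is cancelled."""
    while True:
        article_id = await queue.get()
        try:
            results.append(await process_article(article_id))
        finally:
            queue.task_done()  # lets queue.join() know this item is finished


async def run_pipeline(article_ids, num_workers: int = 3) -> list:
    """Process every article ID concurrently using a small pool of workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for article_id in article_ids:
        queue.put_nowait(article_id)

    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(num_workers)
    ]
    await queue.join()  # block until every queued item has been handled

    # The workers loop forever, so cancel them once the queue is drained.
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(results)


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    print(asyncio.run(run_pipeline(["sample-001", "sample-002", "sample-003"])))
```

In a containerized deployment of the kind the article describes, a loop like `worker` would typically run inside its own container, with a managed message queue taking the place of the in-memory `asyncio.Queue` used here.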
Improved monitoring, logging, and a Continuous Integration/Continuous Deployment (CI/CD) pipeline—automating code updates—are also key technical goals. These efforts supplement existing infrastructure choices, like using the Fastly content delivery network (CDN).
Strategic Overhaul and Future Goals
The move to GCP is presented as a necessary step for broader service improvements. arXiv aims to expand into new subject areas more easily, enhance metadata collection (including funder IDs and addressing author ambiguity), and improve accessibility and overall usability for its global research community.
This aligns with a strategic planning effort underway since at least early 2023, supported by the Simons Foundation and involving guidance from Invest in Open Infrastructure (IOI). Ivan Oransky of the Simons Foundation noted IOI’s “extensive experience in the open infrastructure space and their expertise in sustainability and governance will help arXiv chart its course for decades to come.”
Community Reaction and Cornell’s Context
News of the move to GCP has sparked discussion within the technical community, notably on forums like Hacker News. Commenters raised concerns about potential long-term cost increases with cloud operational expenditures versus on-premises capital costs, the risks of vendor lock-in, and potential access restrictions for users in certain regions, such as Iran, due to platform policies. One user expressed skepticism, anticipating “goodbye simplicity and stability, hello exorbitant monthly costs for the same/less service quality.”
Others pointed to the growing demands on arXiv, particularly the increased load from AI crawlers accessing its repository, which makes greater scalability necessary. A user claiming close ties to the project stated the platform’s current “stability is just due to the exceptional amount of effort they take to keep it going.”
The use of established cloud services was seen by some as a practical way to manage scaling and technical debt. With Google already listed as a Gold Sponsor, speculation arose about whether cloud credits influenced the choice. The timing also coincides with financial challenges at Cornell University. A recent NPR report detailed a $1 billion freeze on the university’s federal funding by the Trump administration.
This followed a university-wide hiring freeze announced in March citing financial uncertainty. While arXiv hasn’t officially linked the GCP move to these budget issues, this context adds to the discussion around the migration’s motivations.
A Long-Running Platform Evolves
Since its inception, arXiv has become central to scientific communication. The migration to GCP is the latest step in adapting the platform, which processes documents often written in LaTeX (a standard document preparation system in many scientific fields), to modern technical demands.
The arXiv CE project, announced in 2023 via a blog post seeking developers, represents a substantial commitment to overhauling the system. Ginsparg, who once described arXiv as “a child I sent off to college but who keeps coming back to camp out in my living room, behaving badly,” is now less involved day-to-day. Under new leadership and with recent foundation support, the platform is undertaking this shift to ensure its continued service to the research world.