Home

Donate
Perspective

Cities Need a New Model for Incentivizing Responsible AI

Emily Royall / Sep 24, 2025

The views expressed here are the author's own and do not represent those of their employer.

Is This Even Real II by Elise Racine / Better Images of AI / CC by 4.0

For the past seven years, I have led a team at the City of San Antonio responsible for evaluating emerging technologies, with a particular focus on artificial intelligence for public sector use. Our structured pilot program partnered with vendors on time-bound initiatives, enabling us to rigorously test outcomes and deliver evidence-based recommendations to city leadership. Nearly 30 pilots later—spanning AI-powered cameras, resident-facing chatbots, and Generative AI tools for procurement—we had a clear view of both the promises and pitfalls of AI in government.

What we discovered was sobering. Many AI systems simply were not designed for the realities of municipal environments. Classification algorithms trained on closed campuses in distant midwestern cities often failed to perform when deployed in the field in San Antonio. APIs were poorly documented and insecure, explainability was limited, and accountability risks were high. Worse still, procurement practices favored large vendors, leaving little room for competition and offering few incentives for responsible AI development.

Most American cities do not build AI systems—they buy them. The market for AI applications in government and public services is expected to grow 17% annually to over $50B by 2030, with companies like Anthropic and OpenAI already introducing their first GenAI models tailored for government use in the last year. In a world where AI is used to shape decision-making that has material impacts on communities, like determining eligibility for benefits, evaluating roadway improvements, or identifying criminal behavior, cities cannot rely on vendor promises or fragmented oversight.

The US has no single, overarching federal law that specifically governs AI development or use. Instead, AI is regulated indirectly through existing laws and sector-specific frameworks. States continue to enact AI laws amid a federal regulatory void. Earlier this year, a sweeping 10-year moratorium on state AI regulation passed the House as part of the budget bill, known as the “One Big Beautiful Bill,” but faced significant opposition in the Senate and was not included in the final legislation. Given the fragmented policy landscape and lack of federal baseline rules, public agencies need a new model: one that establishes universal performance benchmarks for government use of AI, embeds them into public procurement, and uses scale to shift market behavior toward transparency, accountability, and competition.

The missing piece: Performance benchmarking for public sector AI

Cities already rely on third parties for high-risk goods and services, such as emergency equipment, water treatment chemicals, cloud services, and road construction. To manage risk, governments require compliance with federal and state regulations, as well as technical certifications from governmental and non-governmental institutions e.g., NIST, ANSI, SOC II, FedRAMP), and contractual guardrails, including Service Level Agreements (SLAs) or bond measures. In each case, performance standards are clear, testing is independent, and oversight is structured.

All these goods and services have clear performance requirements that private companies are held accountable to in order to sell to the government. Typically, standards creation is led by federal agencies or recognized standards bodies, testing is carried out by independent accredited labs, and federal oversight serves as an intermediary to guarantee products meet safety and effectiveness benchmarks before governments buy them.

In the AI industry, there is no such system. While frameworks like NIST’s AI RMF, ISO 42001, and the EU AI Act offer guidance, the independent lab ecosystem and certification processes that underpin procurement in other sectors do not yet exist for AI. As it stands, AI governance standards are largely voluntary, unenforceable industry commitments. Audit frameworks are varied and unstandardized, while practitioners struggle to keep pace with a fast-growing industry. The result is a patchwork of vendor claims and ad hoc oversight.

For public authorities, performance benchmarks are not a luxury—they are essential. Benchmarks help governments define their risk appetite, negotiate contracts with objective evidence, and establish shared language for what “good AI” looks like in the public sector, accuracy thresholds, error rates, explainability, and transparency. Of the many parameters that matter in AI governance, including bias, interpretability, and interoperability, three stand out as especially urgent for municipal adoption today: accuracy, security, and transparency.

Accuracy

In increasingly popular, high-risk use cases like using Generative AI to automate police report drafting, domain-specific accuracy standards grounded in municipal data are critical. Promising advances are emerging, like Stanford’s Center for Research on Foundation Models (CRFM)’s HELM Leaderboards for foundational GenAI models. These leaderboards track the performance of AI across transparent criteria through rigorous testing and capabilities assessments. Similarly, TruthfulQA is a benchmark established through collaborations between Oxford and OpenAI that measures whether a language model is truthful when responding to questions across several categories, including health, law, and finance. However, while research in these areas is rapidly accelerating, the unique performance requirements necessary for public sector use cases don’t yet exist, largely due to a lack of market incentives and regulatory pressures.

Security

Large language models are vulnerable to prompt injection and jailbreaks, creating risks for resident-facing chatbots and tools with elevated privileges like Microsoft Copilot. Researchers at the University of Pennsylvania and ETH Zurich introduced a repository of adversarial prompts, a standardized evaluation framework, and an LLM leaderboard, collectively called the “Jailbreak Bench,” to support security assessments of these tools. Additionally, NIST’s AISI work (AI 600-1 RMF & ARIA scenarios) offers guidance that incorporates red-teaming into risk management and provides useful scaffolding for procurement language. These evaluation frameworks offer a foundation for procurement safeguards, but adoption is limited and uncoordinated across public authorities.

Transparency

Training data provenance, testing results, and safety disclosures are still missing from most government procurements of AI. While industry recognizes the tradeoffs between transparency and security of open AI systems, legal scholars are beginning to unpack “open spectrum AI”, acknowledging that transparency must be unbundled across key components of AI development, including data, source code, compute, model weights, operational controls, and labor. Similarly, a movement is growing that calls for transparent reporting of the environmental impacts of AI systems. Such transparency is necessary not just to test that algorithms are fit for purpose, but also to develop internal security and incident response protocols.

Last year, Stanford CRFM researchers Rishi Bommasani and Kevin Klyman developed the Foundation Model Transparency Index (FMTI), which provides a 100-indicator scorecard across AI developers evaluating them for transparency criteria, including training data disclosures, safety, and governance. In another recent effort to streamline reporting, Margaret Mitchell et al. recommend that newly released models be accompanied by documentation detailing their performance characteristics using a framework called “model cards.” Both these efforts are candidate standards for public procurement of AI, but the capabilities needed to integrate them into governance processes are extremely limited due to a lack of knowledge and training among public servants.

Scaling benchmarks for AI with cooperative purchasing

Benchmarks alone do not change markets; adoption at scale does. When my San Antonio team introduced responsible AI requirements into enterprise procurements based on our testing, we quickly realized we lacked market leverage to be successful, highlighting the need for a coordinated approach to setting these standards across governments. One of the most powerful, yet underutilized, tools cities already have to set and adopt standards at scale is cooperative purchasing.

Cooperative purchasing enables multiple governments to purchase goods and services from a single competitively awarded contract, thereby leveraging volume, reducing administrative costs, and standardizing terms. Mechanisms such as joint solicitations and piggybacking are decades-old, legally grounded, and widely used. From NASPO ValuePoint’s multi-state agreements to Sourcewell and BuyBoard’s cooperative models, governments already pool demand to shape markets in sectors ranging from staplers to cloud services.

Applying this mechanism to AI is both feasible and urgent. By embedding vetted performance standards and benchmarks into cooperative contracts, cities can:

  • Scale performance standards: Bake vetted performance standards for specific use cases into cooperative RFPs and contracts shared across cities.
  • Incentivize the market: Vendors meeting established benchmarks gain access to multiple cities at once.
  • Increase competition: Smaller or ethical vendors who meet standards can break into markets traditionally dominated by large incumbents.

Early steps towards a coalition model for AI governance

Procurement has always been a tool for acquisition. But what if it also served as a tool for governance? In the absence of federal or state regulations governing how AI technology companies build their products, a coalition-driven approach to AI contracting has the potential to flip the script – where public authorities set the terms and use their purchasing power to demand performance standards and quality control.

The GovAI Coalition’s Procurement Committee is beginning to explore this path. Coalition members are pooling transparency data from cooperating vendors, drafting shareable AI contract templates, and identifying candidate use cases for cooperative solicitations. The Committee is surfacing AI contracts across the country, making them accessible for analysis and sharing. These are modest yet essential first steps toward a model that could fundamentally reshape AI governance across the country.

The stakes could not be higher. If AI tools fail to deliver accuracy, safety, and accountability for cities, governments will eventually stop buying them—leaving technology companies with slumping sales and stalled markets. But when AI is tested, proven, and benchmarked, both public agencies and vendors benefit. Cities gain systems that actually solve problems, while companies build credibility, grow demand, and expand their market in a competitive process.

Through collective action, coalition-driven performance benchmarks scaled through cooperative purchasing offer a rare chance for governments to lead the market rather than chase it. Done right, this approach creates a virtuous cycle: safe, effective AI strengthens public trust, expands adoption, and sustains growth for the companies that build it. Together, we can accelerate the development of responsible, safe, and effective AI in American cities.

Authors

Emily Royall
Emily’s career has focused on advancing public interest technology through strategic procurement and governance. Her professional experience spans academic, non-profit, and government organizations, including City Form Labs in Singapore, the Massachusetts Office of Information Technology (MassIT), a...

Related

Perspective
Want Accountable AI in Government? Start with ProcurementJuly 15, 2025

Topics