Josh Dzieza | The Verge

You Could Be Next

2026-03-16T14:00:57-04:00

The LinkedIn post seemed like yet another scam job offer, but Katya was desperate enough to click. After college, she’d struggled to make a living as a freelance journalist, gone to grad school, then pivoted to what she hoped would be a more stable career in content marketing — only to find AI had automated much of the work. This company was called Crossing Hurdles, and it promised copywriting jobs starting at $45 per hour.

Katya clicked and was taken to a page for another company, called Mercor, where she was instructed to interview on-camera with an AI named Melvin. “It just seemed like the sketchiest thing in the world,” Katya says. She closed the tab. But a few weeks later, still unemployed, she got a message inviting her to apply to Mercor. This time, she looked up the company. Mercor, it seemed, sold data to train AI, and she was being recruited to create that data. “My job is gone because of ChatGPT, and I was being invited to train the model to do the worst version of it imaginable,” she says. The idea depressed her. But her financial situation was increasingly dire, and she had to find a new place to live in a hurry, so she turned on her webcam and said “hello” to Melvin.

It was a strange, if largely pleasant, experience. Manifesting on Katya’s laptop as a disembodied male voice, Melvin seemed to have actually read her résumé and asked specific questions about it. A few weeks later, Katya, who like most workers in this story asked to use a pseudonym out of fear of retaliation, received an email from Mercor offering her a job. If she accepted, she should sign the contract, submit to a background check, and install monitoring software onto her computer. She signed immediately.

She was added to a Slack channel, where it was clear she was entering a project already underway. Hundreds of people were busy writing examples of prompts someone might ask a chatbot, writing the chatbot’s ideal response to those prompts, then creating a detailed checklist of criteria that defined that ideal response. Each task took several hours to complete before the data was sent to workers stationed somewhere down the digital assembly line for further review. Katya wasn’t told whose AI she was training — managers referred to it only as “the client” — or what purpose the project served. But she enjoyed the work. She was having fun playing with the models, and the pay was very good. “It was like having a real job,” she says.

Two days after Katya started, the project was abruptly paused. A few days after that, a supervisor popped into the room to let everyone know it had been canceled. “I’m working assuming that I can plan around this. I’m saving up for first and last month’s rent for an apartment,” Katya says, “and then I’m back on my ass. No warning, no security, nothing.” Several days later, she got an email from Mercor with another offer, this one for a job evaluating what seemed to be conversations between chatbots and real users — many appeared to be from people in Malaysia and Vietnam practicing English — according to various criteria, like how well the chatbot followed instructions and the appropriateness of its tone. Sign the contract, the email said, and you’ll have a Zoom onboarding call in 45 minutes. It was 6:30PM on a Sunday night. Scarred from the abrupt disappearance of the previous gig, she accepted the offer and worked until she couldn’t stay awake.

Machine-learning systems learn by finding patterns in enormous quantities of data, but first that data has to be sorted, labeled, and produced by people. ChatGPT got its startling fluency from thousands of humans hired by companies such as Scale AI and Surge AI to write examples of things a helpful chatbot assistant would say and to grade its best responses. A little over a year ago, concerns began to mount in the industry about a plateau in the technology’s progress. Training models based on this type of grading yielded chatbots that were very good at sounding smart but still too unreliable to be useful. The exception was software engineering, where the ability of models to automatically check whether bits of code worked — did the code compile, did it print HELLO WORLD — allowed them to trial-and-error their way to genuine competence.

The problem was that few other human activities offer such unambiguous feedback. There are no objective tests for whether financial analysis or advertising copy is “good.” Undeterred, AI companies set out to make such tests, collectively paying billions of dollars to professionals of all types to write exacting and comprehensive criteria for a job well done. Mercor, the company Katya stumbled upon, was founded in 2023 by three then-19-year-olds from the Bay Area, Brendan Foody, Adarsh Hiremath, and Surya Midha, as a jobs platform that used AI interviews to match overseas engineers with tech companies. The company received so many inquiries from AI developers seeking professionals to produce training data that it decided to adapt. Last year, Mercor was valued at $10 billion, making its trio of founders the world’s youngest self-made billionaires. OpenAI has been a client; so has Anthropic.

Each of these data companies touts its stable of pedigreed experts. Mercor says around 30,000 professionals work on its platform each week, while Scale AI claims to have more than 700,000 “M.A.’s, Ph.D.’s, and college graduates.” Surge AI advertises its Supreme Court litigators, McKinsey principals, and platinum recording artists. These companies are hiring people with experience in law, finance, and coding, all areas where AI is making rapid inroads. But they’re also hiring people to produce data for practically any job you can imagine. Job listings seek chefs, management consultants, wildlife-conservation scientists, archivists, private investigators, police sergeants, reporters, teachers, and rental-counter clerks. One recent job ad called for experts in “North American early to mid-teen humor” who can, among other requirements, “explain humor using clear, logical language, including references to North American slang, trends, and social norms.” It is, as one industry veteran put it, the largest harvesting of human expertise ever attempted.

These companies have found rich recruiting ground among the growing ranks of the highly educated and underemployed. Aside from the 2008 financial crash and the pandemic, hiring is at its lowest point in decades. This past August, the early-career job-search platform Handshake found that job postings on the site had declined more than 16 percent compared with the year before and that listings were receiving 26 percent more applications. Meanwhile, Handshake launched an initiative last year connecting job seekers with roles producing AI training data. “As AI reshapes the future of work,” the company wrote, announcing the program, “we have the responsibility to rethink, educate, and prepare our network to navigate careers and participate in the AI economy.”

There is an underlying tension between the predictions of generally intelligent systems that can replace much of human cognitive labor and the money AI labs are actually spending on data to automate one task at a time. It is the difference between a future of abrupt mass unemployment and something more subtle but potentially just as disruptive: a future in which a growing number of people find work teaching AI to do the work they once did. The first wave of these workers consists of software engineers, graphic designers, writers, and other professionals in fields where the new training techniques are proving effective. They find themselves in a surreal situation, competing for precarious gigs pantomiming the careers they’d hoped to have.

Each of the more than 30 workers I spoke with occupied a position along a vast and growing data-supply chain. There are people crafting checklists that define a good chatbot response, typically called “rubrics,” and other people grading those rubrics. Others grade chatbot answers according to those rubrics, and still others take the rubrics and write out what’s often described as a “golden output,” or the ideal chatbot answer. Others are asked to explain every step they took to arrive at this golden output in the voice of a chatbot thinking to itself, producing what’s called a “reasoning trace” for AI to follow later when it encounters a similar task out in the real world.

Sometimes the labs want only rubrics for prompts their AI can’t already do, which means companies like Mercor ask workers to produce “stumpers,” or requests that will make the model fail. “It sounds easy, but it’s really hard,” says a worker who was trying to stump models by asking them to make inventory-management dashboards. Models fail in counterintuitive ways. They may be able to solve advanced-physics exam questions, but ask them for transit directions and they’ll recommend transferring on nonconnecting train lines. Finding these weak spots takes time and creativity.

One type of project gathers groups of lawyers, human-resources managers, teachers, consultants, or bankers for something Mercor calls world-building. “You and your team will role-play a real-life team within your profession,” the training materials read. The teams are given dedicated emails, calendars, and chat apps and asked to create a hundred or more documents that would be associated with some corporate undertaking, like a fictional mining company analyzing whether to enter the data-center business.

After several 16-hour days of fantasy document production, one worker recounts, the resulting slide decks, meeting notes, and financial forecasts are sent to another team, which uses them as grist in their attempts to stump a model operating in this simulated corporate environment. Then, having stumped the model, that team writes new, more nuanced rubrics, golden answers, and so on. Workers can only guess who the customer is or how many others are working on the project — based on references to teams like Management Consulting World No. 133, there could be hundreds, maybe thousands.

There are people hired to evaluate the ability of image models to follow their prompts and others who summarize video clips in extraordinary detail, presumably to train video models. Efforts to improve AI’s ability to have spoken conversations have resulted in a surging demand for voice actors, who might find themselves recording “authentic, emotionally resonant” speeches, according to one listing. “I just tell people I’m an AI trainer, then it sounds more professional than what I’m doing,” says an aspiring screenwriter who was instructed to record himself pretending to ask a chatbot for a fitness plan while pots and pans clanged in the kitchen. Another time, he was told to record himself dispensing financial advice over the phone to a parade of people he assumed were other workers.

This audio might then be broken down and sent to someone like Ernest, who used to make a living as an online tutor until the company he worked for replaced him with a chatbot. When we spoke, he was listening to minutelong clips of random dialogue slowed to 0.1x speed and marking when someone started and stopped speaking down to the millisecond. Many of the clips included a person talking with a chatbot and interjecting “huh” or “I see,” so he assumes he was improving AI’s ability to have naturally flowing conversation, but he has no actual idea.

As is standard practice in the field, the project was referred to by a codename and the client only ever as “the client.” The entire system is designed so that workers have minimal insight into the supply chain they are part of. If they find out who the customer is, they are contractually forbidden from telling anyone, even their own colleagues. Nor are they allowed to describe the details of their work beyond broad generalities like “providing expertise in XYZ domain to improve models for a top AI lab,” according to one Mercor agreement. So afraid are workers of inadvertently violating their confidentiality agreements and getting fired that when they discuss their work in public forums, they mask their already codenamed projects with additional codenames, for example by referring to a project called “Raven” as “Poe.”

“I’m being handed a shovel and told to dig my own grave.”

Katya’s second project with Mercor was far more stressful. There was less work to go around, and it came in fits and starts. Managers would drop a message in the Slack channel saying new tasks were incoming in half an hour, and, she says, “everyone in Slack would drop what they were doing and jump on them like piranhas,” working as fast as they could while the bar showing how many tasks remained slid toward zero. Then they were back in Slack again, politely begging supervisors for more work and more hours, talking about their kids’ birthdays or their need to pay rent, or telling anyone who might be listening that their availability was wide open in case there was more work to be done. Soon, Katya was dropping everything at the sound of a Slack ding too. “Sometimes I’m on the toilet or at dinner and I get the Slack notification. I’m like, ‘Oh, sorry, I gotta work now.’”

That project soon ended and then came another. It was nearly identical to the first, which she had enjoyed, but now, on top of writing rubrics, she had to stump the model and complete the more difficult task in the same amount of time. She was also getting paid $8 an hour less. This is common at Mercor. Nearly every worker I spoke with reported that demands increased, time requirements shrank, and pay decreased as projects continued. Those who couldn’t meet the new demands got “offboarded” and replaced by new recruits.

Chris joined Mercor last year, after a difficult few months struggling to find film work. Unlike many people who suspect they’re casualties of automation, he knew for certain that this was the case. He’d had a recurring job drafting episodes for an unscripted television show — doing preinterviews, sketching scenes, writing the reality TV equivalent of a screenplay. But in late 2024, he was told the show would be running on a “skeleton crew” and his work was no longer needed. He found out later the company was using ChatGPT to draft new episodes. So that October, when Chris received an offer to write an entire sci-fi screenplay for a major AI company, he said “yes,” grim as the prospect was. Since then, he has gone from gig to gig. “This is my only source of income right now,” he says. “I know people who are award-winning producers and directors, and they’re not advertising that they’re doing this work, but that’s how they’re putting food on the table.”

His first jobs with Mercor were, like Katya’s, relatively pleasant and well paid, but soon came the 6PM fist-bump-emoji Slack exhortations to “come on team, let’s push through this,” followed by sudden halts and months of silence. “You were just constantly waiting for the crack of the starting gun at any hour of the day,” Chris says. Then it was crunch time again and managers, increasingly panicked as deadlines neared, started threatening workers with offboarding if they didn’t complete tasks quickly enough.

The time he spent working was tracked to the second by software called Insightful, which monitored everything he did on his computer. Time that the software deems “unproductive” could be deducted from his pay, and if a few minutes passed without him typing, the system pinged him to ask whether he had been working. Sometimes Chris saw people post in Slack that they’d gone over the target time on a particularly tricky task and that they hoped it would be okay; the next day, they would be gone.

Increasingly worried he would be offboarded too, he started working off the clock, deactivating Insightful while reading instructions so he could move faster. If he went over the target time, he turned the clock off and kept working for free.

Companies say this software is necessary to accurately track hours and prevent workers from cheating, which, in this case, means using AI, something all data companies strictly forbid. The ground truth of verified human expertise is what they’re selling, and when AI trains on AI-generated data, it gradually degrades, a phenomenon researchers call “model collapse.” Employees of data companies say it is a constant battle to screen out AI slop. For workers, AI is a particular temptation as pressure increases. When the retail expert trying to stump models with analytics dashboards had her target time dropped from eight hours per task to five to three and a half, she turned off Insightful and sought outside help. “To be honest, I went into Copilot and ChatGPT and put my prompt in there and said, ‘How can I work this so you guys can’t answer it?’” Then she went to another chatbot and asked if the prompt sounded AI generated and, if so, to make it sound more human.

“It’s just so horrible, the mental effect of it,” says Mimi, a screenwriter who has worked on multiple streaming shows and has been training AI for Mercor for several months. She found out about Mercor from a fellow screenwriter who dropped one of its job links in a Writers Guild of America Facebook group.

Like a lot of people in this line of work, Mimi is conflicted. “One documentary-maker who’s won Emmys, he messaged me and he was like, ‘I’m being handed a shovel and told to dig my own grave,’ and that’s exactly how everyone thinks about it,” she says. Still, as a single mom, she needed the money. She was thankful for the work at first, then the project was paused, unpaused, and paused again. For five weeks, she was told a project would be starting imminently. When it finally did, requirements were added, while the expected time shortened, and she raced to keep up under the watchful eye of Insightful. She felt that someone put it well on Slack when they said it was like they were living in a fishbowl waiting for their human masters to drop in food, and only the ones who were fast enough to swim to the top could eat.

“Last night, I got so fucking stressed because my kid came home and it was 7PM, and I get this message, ‘The tasks are out!’ and I’m just working, just trying to get as many hours in before I can go to bed,” Mimi says, choking up. “I spend no time with my kid, and at one point, he can’t find something for school and I just start screaming at him. This work is turning me into a fucking demon.” She’s especially disturbed by the surveillance: “The idea that somebody can measure your time and that all the little bits that go into being a human are taken away because they’re not profitable, that you can’t charge for going to the toilet because that’s not time you’re working, you can’t charge for making a cup of coffee because that’s not time you’re working, you can’t charge for having a stretch because your back hurts. This is why unions were formed, so people could have guaranteed hours and guaranteed lunch breaks and guaranteed holidays and sick pay. This is the gig economy to the very extreme.”

This is what concerns her more than the AI itself: that it’s bringing to knowledge work the sort of precarious platform labor that has transformed taxi driving and food delivery. Meanwhile, she watches in horror the desperate gratitude of her colleagues as they rejoice at the 7PM announcement of incoming work.

“How long are these tasks expected to last?” one worker asked in Slack.

“I’m wondering too, I’d like to know whether I can sleep or not.”

With no answer forthcoming, they swapped tips on how to stave off sleep.

“Nobody knows what’s going on. Everybody’s really confused.”

When Mercor began recruiting aggressively last year, it framed itself as a more worker-friendly version of the platforms that had come before it. Criticizing his rival Scale AI on a podcast, Foody, Mercor’s CEO, said, “Having phenomenal people that you treat incredibly well is the most important thing in this market.” Workers who joined during this time do report being treated well; the pay was better than elsewhere, and instead of being managed by opaque algorithms, as is common, there were actual human supervisors they could go to with questions.

But people who have worked in management at data companies say they often start out this way, wooing workers off incumbent platforms with promises of better treatment, only for conditions to degrade as they compete to win eight-figure contracts doled out by the half-dozen AI companies who are interested in buying this data in bulk. At Mercor, there was the additional complication of management largely consisting of people in their 20s with minimal work experience who had been given hundreds of millions of investor dollars to pursue rapid growth.

“I don’t care if somebody’s 21 and they’re my manager,” says Chris, the reality TV producer. “But they’ve never worked at this scale. When you try to find some kind of guidance in Slack, very maturely and clearly explaining what the situation is, you get a meme back with a corgi rolling its eyes and it says, ‘Use your judgment.’ But it’s like, ‘Use your judgment and fuck it up, and you get fired.’ You went to Harvard, you graduated last year, and your guidance for a group of people, many of whom are experienced professionals, is a meme?”

Lawyers, designers, producers, writers, scientists — all complained of inexperienced managers giving contradictory instructions, demanding long hours or mandatory Zoom meetings for ostensibly flexible work, and threatening people with offboarding for moving too slowly, threats that were particularly galling for mid-career professionals who felt their 20-year-old bosses barely understood the fields they were trying to automate.

“The founders pride themselves on ‘9-9-6,’” says a lawyer, referring to a term that originated in China to describe 72-hour workweeks associated with burnout and suicide but has been appropriated by Silicon Valley as aspirational. “You need to be accessible at all hours, and they’re going to pump out messages at 6AM, and you better jump because the perception is you will be offboarded and another person will replace you.”

“It’s not just that team leads are young, project managers are young, senior project managers are young. It’s that the senior-senior project managers, the ones responsible for the project in its entirety, are young. I guess that comes from the top because they’re young, right?” says Lindsay, a graphic designer and illustrator in her 50s who came to Mercor after 85 percent of her work evaporated over the past year, owing, she believes, to improvements in generative AI.

Increasingly desperate for work, she scoured job boards; it seemed the only listings matching her expertise were offers to help build the technology she blamed for demolishing her career. “I swallowed my hatred and signed up,” she says. After some initial work producing graphic-design data, she was invited to join a job for Meta grabbing videos from Instagram Reels and tagging whatever was in them. It was boring, and at $21 per hour, the pay was middling, but Lindsay needed the money. So, she discovered when she was brought into the project’s Slack, did approximately 5,000 others.

In early November, a Mercor representative announced that Lindsay’s project would be ending owing to “scope changes,” though workers had previously been told the project would run through the end of the year. Lindsay and thousands of others found themselves removed from the company’s Slack.

Soon, an email arrived in their inbox, inviting them to a new project called Nova paying $16 per hour.

Thousands of workers poured into the new Slack only to discover it was the exact same job, now paying 24 percent less. All but two of the Slack channels had been deleted, including the watercooler, support, and help rooms. The ability to direct-message one another had also been cut off. There were no team leads to be found. With no one to ask for assistance, workers flooded the main rooms with pleas and indignation.

“Nobody knows what’s going on. Everybody’s really confused,” says Lindsay. “The messages are coming so fast in that channel. It’s just absolute chaos. ‘Help, please. What do I do? What am I supposed to do? Where do I go? Can I get started tasking? Am I supposed to redo all the assessments that I’ve done before?’”

Someone emailed support asking for help, and for some reason that email was sent to every one of the thousand-some people on the project, who seized on it and began to reply-all with their bafflement and outrage. “It was absolute carnage,” says Lindsay. “There’s no other word for it.”

Workers began posting complaints on Mercor’s subreddit, only to have their posts quickly deleted by the Mercor representatives who moderate it. In response, two unsanctioned Mercor subreddits were created, where workers could freely express such sentiments as “CHILDREN RUN THIS COMPANY, THEY WILL SOON HAVE THEIR DAY OF RECKONING.”

“It’s just really sad,” says Lindsay. “There are some people in there where it’s genuinely the difference between them being able to feed their families and not feed their families.”

“I hate gen AI,” she adds. “I think AI should be used for curing cancer. I think it should be used for space exploration, not in the creative industries. But I need to be able to pay my rent. And then when people like Mercor pull this stuff where they treat you like nothing more than a lab rat — I’ve been working for a very long time. I have never, ever been treated as badly as this.”

Intermittent work, extreme secrecy, and abrupt firings are the norm across the data industry. On Surge AI’s work platform, called Data Annotation Tech, workers are not only regularly terminated without explanation; they are often not even told they’ve been fired. They just log in one day and find the dashboard empty of tasks. The phenomenon is so ubiquitous they call it simply “the dash of death.”

Last year, a Texan with a master’s degree in divinity who was teaching voice models to respond to queries with appropriate levels of feeling — different tones for a user telling them their dog died versus asking for a trip itinerary — logged in to work one morning and found his dashboard empty. Scrolling to the bottom of the page for the support button, he discovered it no longer worked. That’s when he knew he had been terminated. His mind raced through possible reasons: Had he worked too much? Had his quality slipped? He knew he would never find out. “I felt cut adrift,” he says. Anxious about how he would pay his bills and care for his ailing dog, he grew depressed, then horrified. He thought about his teacher friends who couldn’t get their students to write and all the people graduating with now-worthless computer-science degrees. “The technology makes us see everything as a utility, something to be used,” he says, a category that he feels includes discarded data workers like himself. He resolved to become a chaplain, figuring that no matter what the AI future holds, people will need a fellow human to be there for them.

The on-again, off-again nature of the work is not just the result of company culture; it stems from the cadence of AI development itself. People across the industry described the pattern. A model builder, like OpenAI or Anthropic, discovers that its model is weak on chemistry, so it pays a data vendor like Mercor or Scale AI to find chemists to make data. The chemists do tasks until there is a sufficient quantity for a batch to go back to the lab, and the job is paused until the lab sees how the data affects the model. Maybe the lab moves forward, but this time, it’s asking for a slightly different type of data. When the job resumes, the vendor discovers the new instructions make the tasks take longer, which means the cost estimate the vendor gave the lab is now wrong, which means the vendor cuts pay or tries to get workers to move faster. The new batch of data is delivered, and the job is paused once more. Maybe the lab changes its data requirements again, discovers it has enough data, and ends the project or decides to go with another vendor entirely. Maybe now the lab wants only organic chemists and everyone without the relevant background gets taken off the project. Next, it’s biology data that’s in demand, or architectural sketches, or K–12 syllabus design.

To compete, data companies arrange things so that they will always have workers on call while preserving their freedom to drop them at a moment’s notice. “Every vendor is going to have some kind of setup whereby they don’t really make promises to people,” says a senior employee of a major data company. The companies rarely have much notice of these shifts themselves, sometimes because the AI developers aren’t sure exactly what data they need in the first place, other times because they are shopping around for the best deal. “They want to keep us in the dark,” the employee continues, “so we inevitably keep the contributors in the dark, then a purchase falls through and you have a thousand people you’ve trained and formed a relationship with just saying, like, ‘What the fuck? Why isn’t there work?’ It’s a horrible feeling from an operator’s perspective, too, but obviously it’s way worse for them.”

The workers at the bottom of this supply chain exist in a state of extreme precarity and maximum competitive frenzy — especially because their strict confidentiality agreements make it impossible for them to establish any kind of seniority or relationship that might outlast a particular project. “The power is all on one side because they can’t talk about it,” says Matthew McMullen, a strategy and operations executive who has worked in the industry since the self-driving-car boom in the mid-2010s. “The labs benefit from you not being able to leverage your experience in the market, and this silence is like their pricing power. The silence is their ability to extract mass information from people without giving them the power to object or to unionize or to make companies themselves. As long as they can’t prove what they’ve done, these raters can’t demand what they’re worth. The only way that people can demand things is by showing their ability to step up, to take on more work. The only power that they have is to keep going, to get back in line.”

Which is what they do. When a project for Mercor ends, managers often post a link to other projects on the platform and encourage people to apply. “But again, there are thousands of people applying, so you throw your application into a hole and hope to hear back at some undefined point,” says Katya. While they wait, workers sign up for Handshake, Micro1, Alignerr, or another of the ever-growing number of data providers.

These companies are always recruiting. Like Mercor, many use AI interviewers and automated evaluations, meaning they have no incentive to limit the number of interviews they do. Mercor offers referral bonuses of several hundred dollars, leading some to promote the company so aggressively that mentions of it have been banned from several subreddits. Katya has applied for dozens of jobs and gotten three, not an unusual ratio.

Nor do companies bear any cost for overhiring. Because workers are ostensibly independent contractors, they are not owed paid time off, breaks, healthcare, overtime pay, or unemployment benefits. It’s free to keep them hanging around, and a surplus of vetted workers ensures they will jump quickly to finish tasks before someone else does. It all combines to create an arrangement in which employers can turn labor on and off like a tap. (Reached for comment, Mercor spokesperson Heidi Hagberg said that “the nature of this is project based contract work, meaning it can extend, pause, or end at any time, especially as the client’s scopes and needs evolve,” and that many of the worker complaints “were centered around the misalignment of expectations of a full-time job versus -project-based work.”)

If you move fast and get lucky and have the right combination of expertise and stay on the right side of each platform’s unique and mysterious recipe of productivity metrics, you can make decent money. I spoke to a playwright making $10,000 a month, a multitalented chemist who at various points found gigs demonstrating poker and singing for AI. But even then, there is an inescapable awareness of ephemerality because producing training data means working toward your own obsolescence. While the number of people doing data work may continue to rise, any particular gig will last only as long as it takes for the machines to successfully mimic it. It takes years for a human to develop expertise, and sooner or later, they’re going to run out of skills to sell.

A worker with a master’s in linguistics had found steady rubric work for a year, but late in 2025, he noticed it was becoming more difficult to stump the models. Any obscure theory or Indigenous language he asked about, the model would find the correct papers. Instead of submitting three or four rubrics per week, he was lucky to get one. Everyone else on the project was following the same trajectory, so he wasn’t surprised when it came to an end. Their know-how had been extracted. In the past, he’d always been able to find a new gig, but now when he looked around, he saw only requests for medical experts, human-resources managers, and teachers. He has now been without work for five months and isn’t sure what to do next.

These platforms are reminiscent of Uber and Lyft a decade ago. Yet in some ways these workers are in a worse position, more replaceable despite their advanced degrees

To the extent that policy responses to AI automation are discussed at all, they mostly concern what to do when AI renders large categories of workers obsolete. Maybe this will happen, but another possibility is that particular tasks will get automated and humans redistributed to other parts of the production process, some revising so-so AI output, others crafting rubrics to improve it. Much of this work will be inherently intermittent, which means it will be done by independent contractors, workers whom current regulations leave almost wholly unprotected. Daron Acemoglu, a professor of economics at MIT who studies automation, compares the situation to that of weavers, who before the industrial revolution were “like the labor aristocracy,” self-employed artisans in control of their own time. Then came weaving machines, and in order to survive, they were forced to take new jobs in factories, where they worked longer hours for less money under the close supervision of management. The problem wasn’t simply that technology took their jobs; it enabled a new organization of work that gave all power to the owners of capital, who made work a nightmare until labor organizing and regulation set limits.

Early labor skirmishes are already happening, mostly in California, which has some of the most aggressive rules around classifying platform workers. Three class-action lawsuits have been filed against Mercor in the past six months. (Similar suits were previously filed against Surge AI and Scale AI, which is settling.) The lawsuits all accuse the companies of misclassifying workers as independent contractors given the “extraordinary control” they exert over them. This is “an entirely new kind of work,” one that the company trains people to do and that cannot be done except on the company’s platform. Workers have so little visibility into what they’re working on that one person, alleges a suit filed in December, accepted a Mercor project only to be tasked with recording himself reading sexually explicit scripts. Once he discovered this, the worker risked deactivation if he abandoned the project, forcing him to “choose between being paid and being humiliated.”

These companies are reminiscent of Uber and Lyft a decade ago, says Glenn Danas, a partner at the law firm Clarkson, which is suing Mercor and several other data platforms. Yet in some ways these workers are in a worse position, more replaceable despite their advanced degrees. Uber drivers have to be physically present in a city to work, and they can organize and push for regulation there. If the same were to happen with data workers, companies could just recruit from somewhere else where people will work for less. When Mercor cut pay for its Meta project to $16 per hour, it dropped below the minimum wage in California and other states, yet people there kept working because they needed the money. This was something at least one supervisor acknowledged, writing in Slack, “While we won’t actively hire from any states where the minimum wage is above the project’s rate, if you are already active on the project and would like to work at the $16/hr rate, we want to enable you to do so.”

Entire professions risk a similar race to the bottom, says Acemoglu, if companies are able to pit workers against one another, each selling their data before someone else can underbid them. “We may also need unionlike organizations that exercise some sort of collective ownership and prevent any kind of simple divide-and-rule strategies by large companies to drive down data prices,” he says. “If there isn’t the legal infrastructure for a data economy of this sort, many of the people who produce the data will be underpaid or, to use a more loaded term, exploited.”

Katya was among the thousands of people invited to join the $16-an-hour Project Nova and was appalled by the low pay. “I think that was Mercor’s experiment in how close to the bottom they can scrape without jeopardizing the data that they’re getting,” she says. Her main project had been paused for weeks and might resume the next day or never.

In the end, she decided the money wasn’t worth it. She applied to work at a local coffee shop. It wasn’t the career pivot she’d imagined when she went to grad school; she just hoped working as a barista would be more stable. “At least when you work at a coffee shop for minimum wage, you have some friends to talk to and a boss who pretends to care about you. You have some kind of security; you know what your hours are going to be week to week,” she says.

But then she heard her phone ding. One of her projects was back on.

How many AIs does it take to read a PDF?

2026-02-23T06:04:40-05:00

Image: Kristen Radtke / The Verge

Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them — again, all PDFs.

This was a problem. While the Department of Justice had run optical character recognition over the text, it was not very good, Igel said, rendering the files more or less unsearchable.

“There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for,” said Igel, cofounder of the AI video editing startup Kino. What if, Igel thought, they built a Gmail clone to view and search all this correspondence in a more intuitive way?

To do this, they would need to extract the information contained in PDFs, which is far less straightforward than it might sound. Despite rapid progress in AI’s ability to build complex software and solve advanced physics problems, the ubiquitous format of PDF remains something of a grand challenge. Edwin Chen, the CEO of the data company Surge, includes it among AI’s “unsexy failures” limiting real-world usefulness. Last year, he found that even state-of-the-art models asked to extract information from a PDF will instead summarize it, confuse footnotes with body text, or outright hallucinate contents. In a half-joking timeline of AI development, the researcher Pierre-Carl Langlais placed “PDF parsing is solved!” shortly before AGI.

First, Igel’s friend, the “tech jester” Riley Walz, used his remaining credits on Google’s Gemini. It only worked reliably for some of the cleanest scans, and would be prohibitively expensive to run on millions of documents anyway, so Igel reached out to his former MIT classmate Adit Abraham, who happened to work in the office above his, where he ran a PDF-parsing AI company called Reducto.

PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them

Reducto, one of several companies trying to solve PDFs, was able to extract information from email threads with cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. After the data was exported in a usable format, Igel and Walz went on a building spree, creating essentially a full Epstein-themed app ecosystem: Jmail, an unsettling, searchable prototype of Epstein’s inbox; Jflights, an interactive globe crisscrossed with flight paths, each one clickable to view underlying PDFs of flight data, passenger manifests, and scanned email invitations; Jamazon, to search Epstein’s Amazon purchases; and Jikipedia, to search businesses and people who turn up in the files, citing, naturally, more PDFs.

“That’s where the magic of extracting information of PDFs became real for me,” Igel said. “It’s going to completely change the way a lot of jobs happen.”

PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them. The format was developed by Adobe in the early 1990s as a way to reproduce documents while preserving their precise visual appearance, first when printing them on paper, then later when depicting them on a screen. Where formats like HTML represent text in logical order, PDF consists of character codes, coordinates, and other instructions for painting an image of a page.

Optical character recognition (OCR) can turn those pictures of words back into text computers can use, but if it comes across a PDF where text is displayed in multiple columns — as many academic papers are — it will plow ahead left to right and create an unintelligible jumble. OCR tools are designed to detect and correct for these sorts of formatting variations, but tables, images, diagrams, captions, footnotes, and headers all present further obstacles. If you give an AI assistant like ChatGPT a PDF, it will cycle through a variety of these tools, sometimes fail, sometimes pass the PDF to a large vision model to perform OCR, sometimes hallucinate, and generally take a very long time and use a lot of computing power for uneven results.

“The key issue is that they cannot recognize editorial structure,” said Langlais. “It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand.”

A further problem that arises from and compounds PDF’s inherent difficulty is that models rarely train on them. This has begun to change, partly because AI developers are increasingly desperate for high-quality data, and PDFs contain a disproportionate amount of it. Government reports, textbooks, academic papers — all PDFs. “PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models,” wrote researchers at the Allen Institute for AI last year in a paper announcing a new specialized PDF-reading model.

“The lore has it that the very first PDF ever was an IRS 1040,” said Duff Johnson, CEO of the PDF Association, the industry organization that helps develop the PDF global standard, ISO 32000-2:2020, itself a PDF nearly a thousand pages long. In 1994, the IRS wanted a way to share forms that were absolutely consistent without printing and mailing every possible document, so it mailed CDs full of PDFs instead. From there, PDF spread with email to become a fundamental component of digital work. Book publishers sending manuscripts to the printer, patent applicants submitting diagrams of new devices, anyone who needed to share a document that would look the same to whomever received it turned to PDF.

“There’s no other technology solving the problem the PDF solves,” said Duff. Websites are temporary, appearing differently depending on the browser, mediated by CSS. Links rot. Word docs change depending on your machine and can be edited and overwritten. A PDF is the same no matter who opens it, when, or how.

“That’s what engineering companies need. That’s what lawyers need. That’s what governments need. That’s what anybody who’s doing anything in the world, who has records to maintain, they need that,” Duff said. “Earlier today I opened up a PDF from 1995. I didn’t worry about it. I just opened it. It worked fine. It worked perfectly. I would expect no less.” (It was a PDF about PDFs.)

“So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct”

There has been a shift over the last year or so toward specialized PDF-parsing models, said Luca Soldaini, a researcher at the Allen Institute for AI who worked on their PDF model, olmOCR. They trained a vision language model — like a large language model, but with pixels instead of word tokens — on about 100,000 PDFs: public domain books, academic papers, brochures, documents from the Library of Congress with human-written transcriptions. The model was further trained to optimize specific problem areas, like parsing tables without mixing up the rows and columns.

“If text is large on the page, the model will learn to say, ‘Oh, that’s probably a header,’” said Soldaini. The model was the most popular one the institute released last year, Soldaini said, rivaling the institute’s generalist models. A PDF reading AI doesn’t capture the spotlight like those models, Soldaini said, but people are actually using it.

A few months later, researchers at Hugging Face, the company that runs a popular open-source AI platform, had just published a 5 billion-document dataset for training multilingual models and were thinking about what to do next. They had already processed the whole of Common Crawl, the enormous archive of mostly HTML text scraped from the web that forms the foundation of many large language models. Like many AI researchers, Hugging Face’s Hynek Kydlíček recalled, they were wondering whether they had run out of easily available data.

“We thought, let’s look at the Common Crawl and, like, maybe there is more stuff we just haven’t seen,” said Kydlíček. Indeed, there was: roughly 1.3 billion PDFs. “That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on,” Kydlíček said. “But the format of PDFs is, like, super super hard to extract text from.”

Kydlíček and his collaborators rigged up a system that separated PDFs into easy to parse — mostly text — and difficult to parse, full of images and charts. The hard PDFs were sent to a version of olmOCR that had been modified by Reducto, called RolmOCR. After they stripped out the PDFs of horse racing results that made up an inexplicably large quantity of the corpus, the team declared they had “liberated three trillion of the finest tokens,” now available for model training.

Yet parsing PDFs well enough for model training is one thing. Extracting them with the degree of accuracy demanded by lawyers and engineers is another. When the Hugging Face team did their first tests, they found their model would invent text when there wasn’t any, filling blank pages with nonsense and describing images and art. They trained it to correct these errors, but it’s impossible to anticipate every formatting oddity or off-kilter scan.

“It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent,” Kydlíček said. “I would say OCR is one of the best economic use cases for visual language models, so there are a lot of eyes on it right now, a lot of people throwing a lot of resources onto this. So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct.”

One of the teams doing the best work, Kydlíček said, is Reducto, the company Igel is using to parse the Epstein files. Abraham cofounded the company as a service that managed customers’ long-term histories with language models, similar to the “memory” feature that is now standard in chatbots. Abraham kept getting requests to manage people’s files as well, which naturally were in the form of PDFs. He found working with them to be “shockingly hard.”

Like self-driving cars, PDFs have a long tail of unusual challenges

“One of our core intuitions was all these documents were made for humans like you and I to interpret, and there’s a lot of visual information here that we take for granted, like that every gap between two paragraphs is me telling you, ‘Hey, this is a new idea.’ Every indentation is me telling you, ‘Hey, this is a sub idea of the parent idea.’ The question was like, how do you encode all of that context?”

Much of the team had a background in self-driving vehicles, where computer vision models “segment” data into entities like car, pedestrian, dumpster. They took a similar approach to PDFs, using a model to first divide the page into headers, tables, footnotes, and so on, before passing them to other specialized models for parsing. When they posted about their approach in early 2024, the response was immediate.

“This wasn’t supposed to be a pivot,” said Abraham. Other developers reached out to say that their progress had been stymied by PDFs. “It kind of spiraled from there.”

Reducto now uses a growing assortment of small, specialized models taking multiple passes to parse a PDF. When the segmenting model detects a table, it goes to a table-parsing model. If a chart is detected, different elements get sent to different models: one trained to extract axes, another to read legends, and so on. A vision language model then takes a pass on the output to correct errors. Using this approach, Reducto is able to turn charts into spreadsheets with a high degree of accuracy, something Abraham says the company’s financial clients have long requested and that stymies far larger frontier models.

Still, like self-driving cars, PDFs have a long tail of unusual challenges.

“There’s a big difference between getting a car to stay in a lane versus getting a car to handle whatever would show up on the street, and we see with PDFs a similar thing. I’ve seen the most insane documents you could imagine,” said Abraham. PDF files that contain other PDFs, legal documents with passages sometimes underlined and sometimes crossed out, faxes of medical forms that doctors have scrawled over and drawn lines connecting ideas on different edges of the page. “I don’t think PDFs are a fully solved problem. I wish that were the case. We’re close, but there’s still plenty to do.”

There will be no shortage of PDFs to parse. The format does not appear to be going anywhere. Why would it, asked Duff of the PDF Association, with some incredulity at the very thought. Companies once tried to unseat PDF, Duff said, but their products are “now a footnote in history,” while PDFs continue to proliferate.

“Look at the Google Trends for PDF,” Duff said. It shows a steadily rising curve (with dips in August) year after year. “No other technology looks like that. More and more people over time are including PDF in their searches, because that tends to be where the high-quality content is.

“What’s going to happen is that all the world’s systems will instead understand and use PDF better and better,” Duff said. “The AI companies didn’t focus on PDF, because PDF is very hard, until they realized that, well, it turns out a lot of the really high-quality stuff is in fact in PDF, and so now we have to deal with it.”

Feeding the machine

2026-01-30T10:57:17-05:00

When he was 19 years old, Brendan Foody started Mercor with two of his high school friends as a way for his other friends, who also had startups, to hire software engineers overseas. It launched in 2023 as essentially a staffing agency, albeit a highly automated one. Language models reviewed resumes and did the interviewing. Within months, Mercor was bringing in $1 million in annualized revenue and turning a modest profit.

Then, in early 2024, the company Scale AI approached Mercor with a big request: They needed 1,200 software engineers. At the time, Scale was one of the only well-known names in the historically back-of-house business of producing AI training data. It had grown to a valuation of nearly $14 billion by orchestrating hundreds of thousands of people around the world to label data for self-driving cars, e-commerce algorithms, and language-model-powered chatbots. Now that OpenAI, Anthropic, and other companies were trying to teach their chatbots to code, Scale needed software engineers to produce the training data.

This, Foody sensed, could herald a larger change in the AI industry. He’d heard about growing demand for specialized data work, and now here was Scale asking for a thousand coders. When the engineers he recruited started complaining about missed pay (Scale has a reputation among data workers for chaotic platform management and is being sued in California over wage theft, among other infractions), Foody decided to cut out the middleman.

In September, Foody announced that Mercor had reached $500 million annualized revenue, making it “the fastest growing company of all time.” The previous titleholder was Anysphere, which makes the AI coding tool Cursor. In a sign of the times, Cursor recently noted that its users produce the exact sort of training data labs are paying for, and The Information recently reported that OpenAI and xAI are interested in buying it.

Mercor’s most recent fundraising round valued the company at $10 billion. Foody and his two cofounders are 22 years old, making them the youngest self-made billionaires. At least one of their early employees has already left to start an AI data company of her own.

While discussions of AI infrastructure typically focus on the gargantuan buildout of data centers, an analogous race is happening with training data. Labs have already exhausted all the easily accessible data, adding to questions about whether early rapid progress through sheer increases in scale will continue. Meanwhile, most recent improvements have come through new training techniques that make use of smaller datasets tailor-made by experts in particular fields, like programming and finance, and AI companies will pay premium prices for it.

There are no good statistics on how much labs are spending, but rough estimates from investors and industry insiders place the figure at over $10 billion this year and growing, the vast majority coming from five or so companies. These companies have yet to find a way to make money from AI, but the people selling them training data have. For now, they are some of the only AI companies turning a profit.

“It’s every nook and cranny of human expertise.”

The data industry has long been the most undervalued and unglamorous aspect of AI development, according to a 2021 study by Google researchers, seen as regrettably necessary janitorial work to be done as quickly and cheaply as possible. Yet modern machine learning could not exist without its ecosystem of data suppliers, and the two spheres move in tandem.

The enormous datasets that proved the viability of machine learning in the early 2010s were made possible by the emergence several years before of Amazon Mechanical Turk, an early crowdsourcing platform where thousands of people could be paid pennies to label images of dogs and cats. The push to develop autonomous vehicles fed the growth of a new batch of companies, among them Scale AI, which refined the crowdsourcing approach through a dedicated work platform called Remotasks where workers used semi-automated annotation software to draw boxes around stop signs and traffic cones.

The turn to language model chatbots after the launch of ChatGPT initiated another transformation of the industry. ChatGPT got its humanlike fluency from a training approach called reinforcement learning from human feedback, or RLHF, which involved paying contractors to rate the quality of chatbot responses. A second model trained on these ratings, then rewarded ChatGPT whenever it did something that this second model predicted humans would like. Providing these ratings was a more nuanced affair than past iterations of crowdsourced data work, particularly as the chatbots got more advanced; it takes someone with medical training to judge whether medical advice is good.

Scale supplied much of the human ratings, but a new company, Surge AI, self-funded by a data scientist named Edwin Chen, quietly grew to become the industry’s other major provider. In Chen’s past jobs at Google, Twitter, and Facebook, he had been dismayed at the poor quality of the data he received from vendors, full of mislabelings done for minimal pay by people who lacked relevant backgrounds. The vendors, Chen said, were just “body shops,” throwing people at the problem and trying to substitute quantity for quality.

Where Scale had its Remotasks platform, Surge has Data Annotation Tech: smaller, more targeted in its recruiting, and with tighter quality controls. It also paid better, around $30 an hour, though like Scale, Surge is also being sued in California for misclassification and unpaid wages. Demand from OpenAI and the labs trying to catch up was immense. The company has been profitable since it launched, and last year, it reportedly took in more than $1 billion in revenue, surpassing Scale’s reported $870 million. Earlier this year, Reuters reported that Surge is considering taking funding for the first time, looking for a $1 billion investment at a $15 billion valuation. According to Forbes, Chen still owns approximately 75 percent of it.

Data about which chatbot responses people prefer is a crude signal, however. Models are prone to learning simple hacks like “tell the user they’ve made an excellent point” instead of something as complex as “check for factual consistency with reliable sources.” Even when domain experts are doing the judging, the results often just sound more expert but are still too unreliable to actually be useful. Models ace bar exams but invent case law, pass CPA tests but pick the wrong cells in a spreadsheet. In July, researchers at MIT released a study finding that 95 percent of the businesses that have adopted generative AI have seen zero return.

AI companies hope that reinforcement learning with more granular criteria will change this. Recent improvements in math and coding are a proof of concept. OpenAI’s o1 and DeepSeek’s R1 showed that given a bunch of math and coding problems and a few step-by-step examples of how humans thought their way to solutions, models can become quite adept at these domains. As they trial-and-error their way to correct solutions, models weigh possible approaches, backtrack, and display other problem-solving techniques developers have called “reasoning.”

The problem is that math and coding problems are idealized, self-contained tasks compared to what a software engineer might encounter in the real world, so scores on benchmarks don’t reflect actual performance. To make models useful, AI companies need more data that is reflective of real tasks an engineer might do — hence the rush to hire software engineers.

The other problem is that math and coding might be the easiest possible domains for AI to conquer. For reinforcement learning to work, models need a clear signal of success to optimize for. This is why the method works so well for games like Go: Winning is a clear, unambiguous outcome, so models can try a million ways to achieve it. Similarly, code either runs or it doesn’t. The analogy isn’t perfect; ugly, inefficient code can still run, but it provides something verifiable to optimize for.

Few other things in life are like this. There is no universal test for determining whether a legal brief or consulting analysis is “good.” Success depends on the context, goals, audience, and countless other variables.

“There seems to be a belief in the community that there’s a single reward function, that if we can just specify what we want these AI systems to do, then we can train them to [do it],” said Joelle Pineau, chief AI officer at Cohere, an enterprise-focused AI lab. But, she said, the reality is more varied and nuanced.

“[Reinforcement learning] wants one reward function. It’s not very good about finding solutions when you have multiple conflicting values that need to coexist, so we may need a very different paradigm than that.”

In lieu of a new paradigm, AI companies are attempting to brute force the problem by paying — via companies like Mercor and Surge — thousands of lawyers, consultants, and other professionals to write out in painstaking detail the criteria for what counts as a job well done in every conceivable context. The hope is that these lists, often called grading rubrics, will allow models to reinforcement-learn their way to competence in the same way they have begun doing with software engineering.

It was like breaking a billion-dollar piñata over all the data startups. Handshake saw demand triple overnight.

Rubrics are extremely labor-intensive to produce. People who work on them said that it is not unusual to spend 10 hours or more refining a single one, which might include more than a dozen different criteria. Companies guard the details of their training methods closely, but an example OpenAI released for its recent medical benchmark offers a good indication of what they’re like. Asked a question about an unresponsive neighbor, the model gets rewarded if its response includes advice to check for a pulse, locate a defibrillator, perform CPR, and 16 other criteria. There are nearly 50,000 such criteria in the benchmark, with different ones applying to different prompts. Labs are ordering tens to hundreds of thousands of rubrics with millions of criteria between them per training run, according to people in the data industry.

These rubrics need to be “super granular,” according to Mercor’s Foody. Producing consulting rubrics, Foody said, would start by creating a taxonomy of all the industries a consulting company operates in, then all the types of consulting it does in each of those industries, then all the types of reports and analyses a consultant might produce in each of those categories.

Performing these tasks typically requires doing things on computers, and each of those things needs a rubric, too. Sending an email requires a lot of steps — opening a browser, beginning a new message, typing it out, and so on. But what if your only verifier for success was whether the email was sent or received? It’s important to check for more actions than just one, according to Aakash Sabharwal, Scale’s VP of engineering.

Models learn to perform these tasks in simplified versions of software called reinforcement learning environments, often described as AI “gyms,” where models can stumble around until they figure out how to do the clicking and dragging required to score well on the grading rubric. The market for these environments is booming, too.

As with rubrics, each one needs to be tailored to its use. “Sometimes it’s a DoorDash or a Salesforce clone, but a lot of times it’s just an enterprise-specific environment,” said Alex Ratner, cofounder and CEO of Snorkel AI. Snorkel makes annotation software but recently launched a human data service of its own.

Ratner cites a recurring irony in AI development known as Moravec’s paradox, named for a researcher working on computer vision in the 1980s who observed that the things that come easiest to humans are often the most difficult for machines. At the time, conventional wisdom was that machine vision would be solved before chess; after all, only a select few humans have the talent and training to be grandmasters, whereas even children can see. Now models can solve complex one-off coding challenges, but they flounder on more basic real-world engineering tasks without close human supervision, misusing tools and making obvious errors.

“That kind of real work, with ambiguous, intermediate metrics of success that seem way more mundane than a coding competition, that is where models struggle,” Ratner said. “That’s the counterintuitive frontier, and that’s where people are trying to lean in, ourselves included, with building more complex environments, more nuanced rubrics.”

According to vendors, the most in-demand fields are the ones that sit at the sweet spot of verifiability and economic value. Software engineering continues to be the largest, followed by finance and consulting. Law is popular, though so far it is proving to be less verifiable and thus amenable to reinforcement learning. Physics, chemistry, math are all in demand. Really, it’s nearly anything you can imagine. There are ads for nuclear engineers and animal trainers.

“It’s everything from clinical hospital settings to legal deep research to — we got a request for woodworking the other day,” Ratner said. “It’s every nook and cranny of human expertise.”

Encoding all of humanity’s skill and know-how into checklists is an enormous, possibly quixotic undertaking, but the frontier labs have billions to spend, and the sheer scale of their demand is reconfiguring the data industry. New entrants seem to appear by the day, and everyone is touting successively more pedigreed experts getting paid ever higher rates.

Surge touts its Fields Medalist mathematicians, Supreme Court litigators, and Harvard historians. Mercor advertises its Goldman analysts and McKinsey consultants. Handshake AI, another fast-growing expert provider, boasts of its physicists from Berkeley and Stanford and the ability to draw alumni from more than 1,000 universities.

Garrett Lord, the CEO and cofounder of Handshake, started picking up signals about the changing data market last year, when incumbent data providers came around asking for experts. Handshake had experts. Lord founded the company in 2014 as a sort of LinkedIn-meets-Glassdoor for college students and recent grads looking for internships and first jobs. More than a thousand college career centers pay for access, as do companies looking to recruit from Handshake’s 20 million alumni, grad students, masters, and PhDs. Early this year, Lord entered the AI data market himself, launching essentially a second company within his existing one, called Handshake AI.

Then, in June, Meta hired away Scale’s CEO and took a 49 percent stake in the company. Rival labs fled, wary that Scale would no longer be a neutral provider — could they trust the data now that it was being provided by a quasi-Meta subsidiary? It was like breaking a billion-dollar piñata over all the data startups. Handshake saw demand triple overnight.

In November, Handshake surpassed a $150 million run rate, exceeding the original decade-old business. There is more demand than the company can meet, Lord said. “We’ve gone from three to 150 people in five months,” Lord said. “We’ve had 18 people start on a Monday. We’re running out of desks.”

The ravenous demand of AI model-builders is pulling any company that might have data to offer into its gravitational field. Turing, which began as a staffing agency but pivoted to training data after OpenAI approached the company in 2022, also saw demand spike following the Scale deal. As did Labelbox, which makes annotation software but last year launched its own expert-annotator service, called Alignerr, where buyers can search for experts, called “Alignerrs,” who’ve been vetted by Labelbox’s AI interviewer, named Zara.

Staffing agencies, content moderation subcontractors, and other adjacent businesses are also reorienting around the labs. Invisible Technologies started 10 years ago as a personal assistant bot that directed tasks to workers overseas, but it started posting twentyfold revenue increases as AI labs hired those workers to produce data. This year, it brought on an ex-McKinsey executive as CEO, took on venture funding, and is positioning itself as an AI training company. The company Pareto followed the same trajectory, launching in 2020 by offering executive assistants based in the Philippines and now selling AI training data services.

The company Micro1 began in 2022 as a staffing agency for hiring software engineers, who had been vetted by AI, but now it’s a data labeling company too. In July, Reuters reported that the company had seen annualized revenue go from $10 million to $100 million this year and was finalizing a Series A funding round valuing the company at $500 million.

Even Uber is angling to get a piece of the action. In October, it bought a Belgian data labeling startup and is in the process of rolling out an annotation platform to US workers, so drivers can annotate when they aren’t driving.

“This Cambrian explosion happened, and now let’s see who survives.”

Then there is a long list of smaller, niche players. The company Sapien is paying data labelers in crypto. Rowan Stone, CEO of Sapien, told The Verge in July that the data labeling company — which specializes in vertical models focused on just one thing and has Scale cofounder Lucy Guo on its advisory board — is “absorbing the collective knowledge of humanity.” They aren’t even the only human data startup paying in crypto tokens.

Stellar, Aligned, FlexiBench, Revelo, Deccan AI — everyone is touting their talent networks, their experts in the loop, their data enrichment pipelines. The company Mechanize rose above the scrum on a wave of viral outrage by announcing in April that its goal was “the full automation of all work.” How will it accomplish this provocative goal? By selling training data and environments, like everyone else.

Like Nvidia, the dominant designer of AI chips, these companies sell the picks and shovels for the AI gold rush, capturing the billions in debt-financed spending flowing out of the frontier labs as they race to achieve superintelligence. It’s a safer business than prospecting, and it is much easier to start selling data than to design new chips, so startups are proliferating.

“It’s like everyone and their mother realized, ‘Hey, I’m doing a human data startup,’” said Adam J. Gramling, a former Scale employee who said he received approximately 300 recruiting messages on LinkedIn when he announced his departure in one of Scale’s recent rounds of layoffs. “This Cambrian explosion happened, and now let’s see who survives.”

The data industry may be growing quickly, but it is a historically tumultuous business. The industry is littered with former giants felled by a sudden change in training techniques or customer departure. In August 2020, the Australian data annotation company Appen’s market cap surpassed the equivalent of $4.3 billion USD; now, it’s less than $130 million, a 97 percent decline. For Appen, 80 percent of its revenue came from just five clients — Microsoft, Apple, Meta, Google, and Amazon — which made even a single client departure an existential event.

Today’s market is also highly concentrated. On a recent podcast, Foody compared Mercor’s customer concentration to Nvidia, where four customers represent 61 percent of its revenue. If investors tire of giving money to model-builders, or the labs take a different approach to training, the effects could be devastating. All of the AI developers use multiple data suppliers already, and as the exodus from Scale showed, they are quick to take their money elsewhere.

All this lends itself to a fiercely competitive atmosphere. On podcasts and in interviews, the CEOs take swipes at the business models of their rivals. Chen still thinks most of his competitors are “body shops.” Foody refers to Surge and Scale as legacy crowdsourcers in an era of highly paid experts. Handshake’s Lord says his rivals are spending thousands on recruiters spamming physicists on TikTok, but they’re all already on his platform. All three say Scale had quality problems even before it was tainted by Meta’s investment. Every time one of these barbs is reported, a Scale spokesperson snipes back, accusing Foody of seeking publicity or mocking Chen for his lengthy fundraising round. Scale is also currently suing Mercor, claiming it poached an employee who stole clients on their way out the door.

For now, there is more than enough money flowing from the labs for everyone. They want rubrics, environments, experts of every conceivable type, but they’re still buying the old types of data too. “It’s always increasing,” says Surge’s Chen. “These ever-increasing new forms of training, they’re almost complementary to each other.”

Even Scale is growing after its post-Meta setback, and major customers have come back, at least in some capacity. Interim CEO Jason Droege said in an onstage interview in September that the company is still working with Google, Microsoft, OpenAI, and xAI. To better compete in the enterprise AI space, Scale has also started a program called the “Human Frontier Collective” for white-collar professionals in STEM fields like computer science, engineering, mathematics, and cognitive science.

Scale told The Verge that both its data and applications businesses are each generating nine figures of revenue, with its data business growing each month since the Meta investment and its application business doubling from the first half to the second half of 2025. It also said that the third quarter of 2025 was its public sector business’s best quarter since 2020, partly due to government contracts. Scale also reportedly expects revenue for this year to more than double, to $2 billion. (The company declined to comment on the figure on the record.)

It has diversified into selling evaluations, the tests that AI developers use to see where their models are weak and need more training data, according to Bing Liu, Scale’s head of research. The business strategy: Companies will ideally use the evaluations to see where their own models are lacking in data — and then, ideally, buy those types of data from Scale.

The 11-digit valuations of just-launched data companies could be seen as signs of an AI bubble, but they could also represent a bet on a certain trajectory of AI development. (Both can also be true.) The goal held out by the AI labs when justifying their enormous expenditures is an imminent breakthrough to artificial general intelligence, something, to use the definition in OpenAI’s charter, that is “highly autonomous” and can “outperform humans at most economically valuable work.”

The term is amorphous and disputed, but one thing artificial general intelligence should be able to do is, well, generalize. If you train it to do math and accounting, it should be able to do your taxes without further rounds of reinforcement learning on tax law, state-specific tax rules, the most recent edition of TurboTax, and so on. A generally capable agent should not need massive amounts of new data to handle each variety of task in every domain.

“The future where the AI labs are right is one where as performance goes up, the need for human data goes down, until you can take the human out of the loop entirely,” said Daniel Kang, assistant professor of computing and data science at the University of Illinois Urbana-Champaign, who has written about the demand for training data. Instead, the opposite seems to be happening. Labs are spending more on data than ever before, and improvements are coming from bespoke datasets tailored to increasingly specific applications. Given current training trends, Kang predicts that getting high-quality human data in each discrete domain will be the primary bottleneck for future AI progress.

In this scenario, AI looks more like a “normal technology,” Kang said. Normal technology here being something like steam engines or the internet — potentially transformative, but also not computer god. (This is also, he hypothesized, why companies are less keen to trumpet their spending on data than they are on data centers: It cuts against their fundraising narrative.) In the AI-as-normal future, companies will need to buy new data whenever they want to automate a particular task, and keep buying data as workflows change.

The data companies are betting on that too. “The labs very much want to say that we’re going to have superintelligence that generalizes as soon as possible,” said Foody. “The way it’s playing out in practice is that reinforcement learning has a limited generalization radius, so they need to build evals across all the things that they want to optimize for, and their investments in that are exploding very quickly.”

Other companies, predicting that the frontier models will not “just hit this point of generalization where it’s just magic and you can do everything,” in the words of Ryan Wexler, who manages AI infrastructure investments at SignalFire, are positioning themselves to cater to the many companies that will need to tune models to suit their purposes.

SignalFire invested in Centaur AI, a medical and scientific data company. Rather than the frontier labs, most of Centaur’s customers are medical institutions like Memorial Sloan Kettering or Medtronic with highly specific applications and low margins for error. Last year, the smart mattress company Eight Sleep wanted to add “snore detection” to its bed’s suite of capabilities. Existing models struggled, so the company hired Centaur to enlist more than 50,000 people to label snores.

“The attempts to make the God model, I don’t know what will happen there, but I’m very confident that demand will keep growing among everyone else,” said Centaur’s founder and CEO, Erik Duhaime. “Everyone was sold some dream that this will be easy, plug and play,” Duhaime said. “Now they’re realizing, ‘Oh, we need to customize this thing for our use case.’”

Matt Fitzpatrick, the CEO of Invisible, is also focusing on its enterprise services. If you look at “spend curves over time,” he said, the enterprise is “where a lot of this will move.” Since January, the company has overhauled its business to focus more on attracting enterprise clients, with about 30 percent of its data annotation pool now being people with PhDs and master’s degrees. Fitzpatrick describes the company as a “digital assembly line” where experts “anywhere on Earth” can be called in to generate data. Invisible is currently often being asked to provide environments for software development and contact centers, he said.

If AGI is to be achieved one order of contact-center training rubrics at a time, the future looks bright for data vendors, which is perhaps why a new grandeur has entered the language of the CEOs. Turing’s CEO predicts that AI data annotator will become the most common job on the planet in the coming years, with billions of people evaluating and training models. Handshake’s Lord sees the nascent formation of a new category of work, comparing it to Uber drivers a decade ago.

“We’re going to need a huge build-out of data and evals across every industry in the economy,” Foody said. At Mercor, he says, the customer support team responds to tickets the AI agent can’t manage, but also updates its rubrics so it can field those questions next time. “If you zoom out,” he said, “it feels like the entire economy will become a reinforcement learning environment.”

If investors don’t find this vision as enticing as a country of geniuses in a data center, as Anthropic’s Dario Amodei described the impending transformation, they can take consolation in the fact that someone, at least, has found a way to make money off AI.

How Wikipedia survives while the rest of the internet breaks

2026-01-20T13:03:36-05:00

When armies invade, hurricanes form, or governments fall, a Wikipedia editor will typically update the relevant articles seconds after the news breaks. So quick are editors to change “is” to “was” in cases of notable deaths that they are said to have the fastest past tense in the West. So it was unusual, according to one longtime editor who was watching the page, that on the afternoon of January 20th, 2025, hours after Elon Musk made a gesture resembling a Nazi salute at a rally following President Donald Trump’s inauguration and well into the ensuing public outcry, no one had added the incident to the encyclopedia.

Then, just before 4PM, an editor by the name of PickleG13 added a single sentence to Musk’s 8,600-word biography: “Musk appeared to perform a Nazi salute,” citing an article in The Jerusalem Post. In a note explaining the change, the editor wrote, “This controversy will be debated, but it does appear and is being reported that Musk may have performed a Hitler salute.” Two minutes later, another editor deleted the line for violating Wikipedia’s stricter standards for unflattering information in biographies of living people.

But PickleG13 was correct. That evening, as the controversy over the gesture became a vortex of global attention, another editor called for an official discussion about whether it deserved to be recorded in Wikipedia. At first, the debate on the article’s “talk page,” where editors discuss changes, was much the same as the one playing out across social media and press: it was obviously a Nazi salute vs. it was an awkward wave vs. it couldn’t have been a wave, just look at the touch to his shoulder, the angle of his palm vs. he’s autistic vs. no, he’s antisemitic vs. I don’t see the biased media calling out Obama for doing a Nazi salute in this photo I found on Twitter vs. that’s just a still photo, stop gaslighting people about what they obviously saw. But slowly, through the barbs and rebuttals and corrections, the trajectory shifted.

Wikipedia is the largest compendium of human knowledge ever assembled, with more than 7 million articles in its English version, the largest and most developed of 343 language projects. Started nearly 25 years ago, the site was long mocked as a byword for the unreliability of information on the internet, yet today it is, without exaggeration, the digital world’s factual foundation. It’s what Google puts at the top of search results otherwise awash in ads and spam, what social platforms cite when they deign to correct conspiracy theories, and what AI companies scrape in their ongoing quest to get their models to stop regurgitating info-slurry — and consult with such frequency that they are straining the encyclopedia’s servers. Each day, it’s where approximately 70 million people turn for reliable information on everything from particle physics to rare Scottish sheep to the Erfurt latrine disaster of 1184, a testament both to Wikipedia’s success and to the total degradation of the rest of the internet as an information resource.

“It’s basically the only place on the internet that doesn’t function as a confirmation bias machine.”

But as impressive as this archive is, it is the byproduct of something that today looks almost equally remarkable: strangers on the internet disagreeing on matters of existential gravity and breathtaking pettiness and, through deliberation and debate, building a common ground of consensus reality.

“One of the things I really love about Wikipedia is it forces you to have measured, emotionless conversations with people you disagree with in the name of trying to construct the accurate narrative,” said DF Lovett, a Minnesota-based writer and marketer who mostly edits articles about local landmarks and favorite authors but later joined the salute debate to argue that “Elon Musk straight-arm gesture controversy” was a needlessly awkward description. “It’s basically the only place on the internet that doesn’t function as a confirmation bias machine,” he said, which is also why he thinks people sometimes get mad at it. Wikipedia is one of the few platforms online where tremendous computing power isn’t being deployed in the service of telling you exactly what you want to hear.

Whether Musk had made a Nazi salute or was merely awkward, the editors decided, was not for them to say, even if they had their opinions. What was a fact, they agreed, was that on January 20th, Musk had “twice extended his right arm toward the crowd in an upward angle,” that many observers compared the gesture to a Nazi salute, and that Musk denied any meaning behind the motion. Consensus was reached. The lines were added back. Approximately 7,000 words of deliberation to settle, for a time, three sentences. This was Wikipedia’s process working as intended.

It was at this point that Musk himself cannonballed into the discourse, tweeting that the encyclopedia was “legacy media propaganda!”

This was not Musk’s first time attacking the site — that appears to have been in 2019, when he complained that it accurately described him as an early investor in Tesla rather than its founder. But recently he has taken to accusing the encyclopedia of a liberal bias, mocking it as “wokepedia,” and calling for it to be defunded. In so doing, he has joined a growing number of powerful people, groups, and governments that have made the site a target. In August, Republicans on the US House Oversight Committee sent a letter to the Wikimedia Foundation requesting information on attempts to “inject bias” into the encyclopedia and data about editors suspected of doing so.

Musk repeating the salute before saying: “My heart goes out to you. It is thanks to you that the future of civilization is assured.”

" data-portal-copyright="" />

When governments have cowed the press and flooded social platforms with viral propaganda, Wikipedia has become the next target, and a more stubborn one. Because it is edited by thousands of mostly pseudonymous volunteers around the world — and in theory, by anyone who feels like it — its contributors are difficult for any particular state to persecute. Since it’s supported by donations, there is no government funding to cut off or advertisers to boycott. And it is so popular and useful that even highly repressive governments have been hesitant to block it.

Instead, they have developed an array of more sophisticated strategies. In Hong Kong, Russia, India, and elsewhere, government officials and state-aligned media have accused the site of ideological bias while online vigilantes harass editors. In several cases, editors have been sued, arrested, or threatened with violence.

When several dozen editors gathered in San Francisco this February, many were concerned that the US could be next. The US, with its strong protections for online speech, has historically been a refuge when the encyclopedia has faced attacks elsewhere in the world. It is where the Wikimedia Foundation, the nonprofit that supports the project, is based. But the site has become a popular target for conservative media and influencers, some of whom now have positions in the Trump administration. In January, the Forward published slides from the Heritage Foundation, the think tank responsible for Project 2025, outlining a plan to reveal the identities of editors deemed antisemitic for adding information critical of Israel, a cudgel that the administration has wielded against academia.

“It’s about creating doubt, confusion, attacking sources of trust,” an editor told the assembled group. “It came for the media and now it’s coming for Wikipedia and we need to be ready.”

In 1967, Hannah Arendt published an essay in The New Yorker about what she saw as an inherent conflict between politics and facts. As varieties of truth go, she wrote, facts are fragile. Unlike axioms and mathematical proofs that can be derived by anyone at any time, there is nothing necessary about the fact, to use Arendt’s example, that German troops crossed the border with Belgium on the night of August 4th, 1914, and not some other border at some other time. Like all facts, this one is established through witnesses, testimony, documents, and collective agreement about what counts as evidence — it is political, and as the propaganda machines of the 20th century showed, political power is perfectly capable of destroying it. Furthermore, they will always be tempted to, because facts represent a sort of rival power, a constraint and limit “hated by tyrants who rightly fear the competition of a coercive force they cannot monopolize,” and at risk in democracies, where they are suspiciously impervious to public opinion. Facts, in other words, don’t care about your feelings. “Unwelcome facts possess an infuriating stubbornness,” Arendt wrote.

This infuriating stubbornness turns out to be important, though. A lie might be more plausible or useful than a fact, but it lacks a fact’s dumb arbitrary quality of being the case for no particular reason and no matter your opinion or influence. History once rewritten can be rewritten again and becomes insubstantial. Rather than believe the lie, people stop believing anything at all, and even those in power lose their bearings. This gives facts “great resiliency” that is “oddly combined” with their fragility. Having a stubborn common ground of shared reality turns out to be a basic precondition of collective human life — of politics. Even political power seems to recognize this, Arendt wrote, when it establishes ideally impartial institutions insulated from its own influence, like the judiciary, the press, and academia, charged with producing facts according to methods other than the pure exercise of power.

Leonardo DiCaprio is an American actor and film producer." data-portal-copyright="" /> Outside Wikipedia, original research is a key part of scholarly work. However, Wikipedia editors must base their contributions on reliable, published sources, not their own original research." data-portal-copyright="" /> On the floor of the US Senate, Republican Sen. Jim Inhofe displayed a snowball — on February 26th, 2015, in winter — as evidence the globe was not warming, in a year that was found to be Earth’s warmest on record at the time." data-portal-copyright="" /> is a species of riffle beetle in the superfamily Byrrhoidea.
The species was named after actor and environmentalist Leonardo DiCaprio to acknowledge his work in "promoting environmental awareness and bringing the problems of climate change and biodiversity loss into the spotlight."" data-portal-copyright="" />

Wikipedia has come to play a similar role of factual ballast to an increasingly unmoored internet, but without the same institutional authority and with its own methods developed piecemeal over the last two decades for arriving at consensus fact. How to defend it from political attacks is not straightforward. At the conference, many editors felt both that attacks from the Trump administration were a genuine threat and that being cast as “the resistance” risked jeopardizing the encyclopedia’s position of trusted neutrality.

“I would really argue not to take the attack approach, to really take the passive approach,” said one editor when someone broached the idea of actively debunking some of the false information swamping the rest of the internet. “People see us as credible because we don’t attack, because we are just providing information to everyone all the time in a boring way. Sometimes boring is good. Boring is credible.”

Even the editor at the summit who had been most directly affected by the Trump administration urged against a direct response. Jamie Flood had been a librarian and outreach specialist at the National Agricultural Library, where among other duties she led group trainings and uploaded research on topics like germplasm and childhood nutrition to Wikipedia. Museums and libraries around the world employ such “Wikipedians in residence” to act as liaisons with the encyclopedia’s community for the same reason that the World Health Organization partnered with Wikipedia during the covid-19 pandemic to make the latest information available: if you want research to reach the public, there is no better place.

Along with several other Wikipedians employed by the federal government, Flood had just been laid off by DOGE, collateral damage in a general dismantling of research and archival institutions. “I’m a casualty of this administration’s war on information,” Flood said.

“‘Imagine a world where all knowledge is freely available to everyone.’”

Still, Wikipedia absolutely should not counterattack, Flood said. “Wikipedia is always in the background. They’re not making a big statement, and I don’t think they should. I’ve been training people for a long time and I still go back to this early quote of Jimmy Wales, one of the founders: ‘Imagine a world where all knowledge is freely available to everyone.’ That’s enough. That’s a statement in and of itself. In a time of misinformation, in a time of suppression, having this place where people can come and bring knowledge and share knowledge, that is a statement.”

Wikipedia should be, in other words, as stubborn as a fact. But then, facts are fragile things.

A common refrain among Wikipedians is that the site works in practice but not in theory. It seems to flout everything we’ve learned about human behavior online: anonymous strangers discussing divisive topics and somehow, instead of dissolving into factions and acrimony, working together to build something of value.

The project’s origins go back to 1999. Wales, a former options trader who had founded a laddish web portal called Bomis, wanted to start a free online encyclopedia. He hired an acquaintance from an Ayn Rand listserv that Wales previously ran, a philosophy PhD student named Larry Sanger. Their first attempt, called Nupedia, was not so different from encyclopedias as they have existed since Diderot’s Encyclopédie in 1751. Experts would write articles that went through seven stages of editorial review. It was slow going. After a year, Nupedia had just over 20 articles.

In an attempt to speed things along, they decided to experiment with wikis, a web format gaining popularity among open-source software developers that allowed multiple people to collaboratively edit a project. (Wiki is the Hawaiian word for “quick.”) The wiki was intended to be a forum where the general public could contribute draft articles that would then be fed into Nupedia’s peer-review pipeline, but the experts objected and the crowdsourced site was given its own domain, Wikipedia.com. It went live on January 15th, 2001. Within days, it had more articles than all of Nupedia, albeit of varying quality. After a year, Wikipedia had more than 20,000 articles.

“…write about what people believe, rather than what is so”

There were few rules at first, but one that Wales said was “non-negotiable” was that Wikipedia should be written from a “neutral point of view.” The policy, abbreviated as NPOV, was imported from the “nonbias policy” Sanger had written for Nupedia. But on Wikipedia, Wales considered it as much a “social concept of cooperation” as an editorial standard. If this site was going to be open to anyone to edit, the only way to avoid endless flame wars over who is right was, provocatively speaking, to set questions of truth aside. “We could talk about that and get nowhere,” Wales wrote to the Wikipedia email list. “Perhaps the easiest way to make your writing more encyclopedic is to write about what people believe, rather than what is so,” he explained.

Ideally, the neutrality principle would allow people of different views to agree, if not on the matter at hand, then at least on what it was they were disagreeing about. “If you’ve got a kind and thoughtful Catholic priest and a kind and thoughtful Planned Parenthood activist, they’re never going to agree about abortion, but they can probably work together on an article,” Wales would later say.

This view faced an immediate challenge, which is that people believe all sorts of things: that the Earth is 6,000 years old, that climate change is a scam, that the Holocaust was a hoax, that the Irish potato famine was overblown, that chiropractors are all charlatans, that they have discovered a new geometry, and that Mother Teresa was a jerk.

Lawrence Mark Sanger is an American Internet project developer and philosopher who cofounded Wikipedia along with Jimmy Wales. " data-portal-copyright="" /> Anti-denialist banner at the 2017 Climate March in Washington, DC." data-portal-copyright="" /> Mary Teresa Bojaxhiu was an Albanian Indian Catholic nun, founded the Missionaries of Charity, and is a Catholic saint." data-portal-copyright="" /> Young Earth creationism (YEC) is a form of creationism that holds as a central tenet that the Earth and its lifeforms were created by supernatural acts of the Abrahamic God between about 10,000 and 6,000 years ago, contradicting established scientific data that puts the age of Earth around 4.54 billion years." data-portal-copyright="" />

In response, the early volunteers added another rule. You can’t just say things; any factual claim needs a citation that readers can check for themselves. When people started emailing Wales their proofs that Einstein was wrong about relativity, he clarified that the cited source could not be your own “original research.” Sorry, Wales wrote to an Einstein debunker, it doesn’t matter whether your theory is true. When it is published in a physics journal, you can cite that.

Instead of trying to ascertain the truth, editors assessed the credibility of sources, looking to signals like whether a publication had a fact-checking department, got cited by other reputable sources, and issued corrections when it got things wrong.

At their best, these ground rules ensured debates followed a productive dialectic. An editor might write that human-caused climate change was a fact; another might change the line to say there was ongoing debate; a third editor would add the line back, backed up by surveys of climate scientists, and demand peer-reviewed studies supporting alternate theories. The outcome was a more accurate description of the state of knowledge than many journalists were promoting at the time by giving “both sides” equal weight, and also a lot of work to arrive at. A 2019 study published in Nature found that Wikipedia’s most polarizing articles — eugenics, global warming, Leonardo DiCaprio — are the highest quality, because each side keeps adding citations in support of their views. Wikipedia: a machine for turning conflict into bibliographies.

Coupled with some technical features of wikis, like the ability for anyone to edit anyone else’s writing, and some early administrative rules, like not being allowed to undo someone else’s edit more than three times per day, users were practically forced to talk through disagreements and arrive at “consensus.” This became Wikipedia’s governing principle.

This may make the process sound more peaceful than it is. Disputes were constant. Early on, Sanger, who had remained partial to a more hierarchical, expert-driven model, clashed repeatedly with editors he decried as “anarchists” and demanded greater authority for himself, which the editors rejected. When revenue from Bomis dried up after the dot-com crash, Wales laid Sanger off and took over management of the project.

Wales governed from a greater remove, appearing only occasionally to broker peace between warring editors, resolve an impasse, or reassure people that they didn’t need to spend time devising procedures to screen out a sudden influx of neo-Nazis that were planning to overwhelm discussion, because if they showed up, “I will personally ban them all if necessary, and that’s that.” Editors sometimes ironically referred to him as their “God King” or “benevolent dictator,” but he described his role as a sort of constitutional monarch safeguarding the community as it developed the processes to fully govern itself. Because Wikipedia was under a Creative Commons license, anyone who didn’t like the way the project was run could copy it and start their own, as a group of Spanish users did when the possibility of running ads was raised in 2002. The next year Wales established a nonprofit, the Wikimedia Foundation, to raise funds and handle the technical and legal work required to keep the project running. The encyclopedia itself, however, would be entirely edited and managed by volunteers.

In early 2004, Wales delegated his moderating powers to a group of elected editors, called the Arbitration Committee. From that point onward, he was essentially another editor, screenname Jimbo Wales, liable to have his edits undone like anyone else. He attempted several times to update his own birthdate to reflect the fact that his mother says he was born slightly before midnight on August 7th, 1966, not on August 8th, as his birth certificate read, only to be reprimanded for editing his own page and trying to cite his own “original research.” (After several years of debates and citable coverage from reliable sources, August 7th eventually won, with a note explaining the discrepancy.)

AGF

Over the ensuing two decades, editors amended policies to cope with conspiracy theorists, revisionist historians, militant fandoms, and other perennial goblins of the open web. There were the three core content guidelines of Neutral Point of View, Verifiability, and No Original Research; the five pillars of Wikipedia; and a host of rules around editor conduct, like the injunction to avoid ad hominem attacks and assume good faith of others, defined and refined in interlinked articles and essays. There are specialized forums and noticeboards where editors can turn for help making an article more neutral, figuring out whether a source was reliable, or deciding whether a certain view was fringe or mainstream. By 2005, the pages where editors stipulated policy and debated articles were found to be growing faster than the articles themselves. Today, this administrative backend is at least five times the size of the encyclopedia it supports.

The most important thing to know about this system is that, like the neutrality principle from which it arose, it largely ignores content to focus on process. If editors disagree about, for example, whether the article for the uninhabited islands claimed by both Japan and China should be titled “Senkaku Islands,” “Diaoyu Islands,” or “Pinnacle Islands,” they first try to reach an agreement on the article’s Talk page, not by arguing who is correct, but by arguing which side’s position better accords with specific Wikipedia policies. If they can’t agree, they can summon an uninvolved editor to weigh in, or file a “request for comment” and open the issue to wider debate for 30 days.

If this fails and editors begin to quarrel, they might get called before the Arbitration Committee, but this elected panel of editors will also not decide who is right. Instead, they will examine the reams of material generated by the debate and rule only on who has violated Wikipedia process. They might ban an editor for 30 days for conspiring off-Wiki to sway debate, or forbid another editor from working on articles about Pacific islands over repeated ad hominem attacks, or in extreme cases ban someone for life. Everyone else can go back to debating, following the process this time.

As a result, explosive political controversies and ethnic conflicts are reduced to questions of formatting consistency. But because process decides all, process itself can be a source of intense strife. The topics of “gun control” and “the Balkans” are officially designated as “contentious” due to recurring edit wars, where people keep reverting each other’s edits without attempting to build consensus; so, too, are the Wikipedia manual of style and the question of what information belongs in sidebars. In one infamous battle, debate over whether to capitalize “into” in the film title Star Trek Into Darkness raged for more than 40,000 words.

Because disputes on Wikipedia are won or lost based on who has better followed Wikipedia process, every dispute becomes an opportunity to reiterate the project’s rules and principles

In 2009, law professors David A. Hoffman and Salil K. Mehra published a paper analyzing conflicts like these on Wikipedia and noted something unusual. Wikipedia’s dispute resolution system does not actually resolve disputes. In fact, it seems to facilitate them continuing forever.

These disputes may be crucial to Wikipedia’s success, the researchers wrote. Online communities are in perpetual danger of dissolving into anarchy. But because disputes on Wikipedia are won or lost based on who has better followed Wikipedia process, every dispute becomes an opportunity to reiterate the project’s rules and principles.

Trolls who repeatedly refuse to follow the process eventually get banned, but initial infractions are often met with explanations of how Wikipedia works. Several of the editors I spoke with began as vandals only to be won over by someone explaining to them how they could contribute productively. Editors will often restrict who can work on controversial topics to people who have logged a certain number of edits, ensuring that only those bought into the ethos of the project can participate.

In 2016, researchers published a study of 10 years of Wikipedia edits about US politics. They found that articles became more neutral over time — and so, too, did the editors themselves. When editors arrived, they often proposed extreme edits, received pushback, and either left the project or made increasingly moderate contributions.

This is obviously not the reigning dynamic of the rest of the internet. The social platforms where culture and politics increasingly play out are governed by algorithms that have the opposite effect of Wikipedia’s bureaucracy in nearly every respect. Optimized to capture attention, they boost the novel, extreme, and sensational rather than subjecting them to increased scrutiny, and by sending content to users most likely to engage with it, they sort people into clusters of mutual agreement. This phenomenon has many names. Filter bubbles, epistemological fragmentation, bespoke realities, the sense that everyone has lost their minds. On Wikipedia, it’s called a “point of view split,” and editors banned it early. You are simply not allowed to make a new article on the same topic. Instead, you must make the case for a given perspective’s place amid all the others while staying, literally, on the same page.

In February, the conservative organization Media Research Center released a report claiming that “Wikipedia Effectively Blacklists ALL Right-Leaning Media.” It was essentially a summary of a publicly available policy page on Wikipedia that lists discussions about the reliability of sources and color codes them according to the latest consensus — green for generally reliable, yellow for lack of clear consensus, and red for generally unreliable. ProPublica is green because it has an “excellent reputation for fact-checking and accuracy, is widely cited by reliable sources, and has received multiple Pulitzer Prizes.” Newsweek is yellow after a decline in editorial standards following its 2013 acquisition and recent use of AI to write articles. Newsmax, the One America News Network, and several other popular right-leaning sources are red due to repeatedly publishing stories that were proven wrong. (As are some left-leaning sources, like Occupy Democrats.) The New York Post (generally unreliable, but marginally reliable on entertainment) used the report as the basis for an editorial titled “Big Tech must block Wikipedia until it stops censoring and pushing disinformation.”

The page is called Reliable sources/Perennial sources, as in sources that are perennially discussed. Editors made the page in 2018 as a repository for past discussions that they could refer to instead of having to repeatedly debate the reliability of the Daily Mail — the first publication to be deprecated, the year before — every time someone tried to cite it. It is not a list of preapproved or banned sources, the page reads. Context matters, and consensus can change.

But to Wikipedia’s critics, the page has become a symbol of the encyclopedia’s biases. Sanger, the briefly tenured cofounder, has found a receptive audience in right-wing activist Christopher Rufo and other conservatives by claiming Wikipedia has strayed from its neutrality principle by making judgments about the reliability of sources. Instead, he argues, it should present all views equally, including things “many Republicans believe,” like the existence of widespread fraud in the 2020 election and the FBI playing a role in the January 6th Capitol attack.

Last spring, the reliable source page collided with one of the most intense political flashpoints on Wikipedia, the Israel-Palestine conflict. In April, an editor asked whether it was time to reevaluate the reliability of the Anti-Defamation League in light of changes to the way it categorizes antisemitic incidents to include protests of Israel, among other recent controversies. About 120 editors debated the topic for two months, producing text equal to 1.9 The Old Man and the Seas, or “tomats,” a standard unit of Wikipedia discourse. The consensus was that the ADL was reliable on antisemitism generally but not when the Israel-Palestine conflict was involved.

Unusually for a Wikipedia administrative process, the decision received enormous attention. The Times of Israel called it a “staggering blow” for the ADL, which mustered Jewish groups to petition the foundation to overrule the editors. The foundation responded with a fairly technical explanation of how Wikipedia’s self-governing reliability determinations work.

In the year since, conservative and pro-Israel organizations have published a series of reports examining the edit histories of articles to make a case that Wikipedia is biased against Israel. In March, the ADL itself issued one such report, called “Editing for Hate,” claiming that a group of 30 “malicious editors” slanted articles to be critical of Israel and favorable to Palestine. As evidence, the report highlights examples like the removal of the phrase “Palestinian terrorism” from the introduction of the article on Palestinian political violence.

Yet the edit histories show that these examples are often plucked from long editing exchanges, the outcome of which goes unmentioned. The “terrorism” line that the ADL cited was indeed removed — it had also only just been added, was added back shortly after being cut, then was removed again, added back, and revised repeatedly before editors brokered a compromise on the talk page.

Breitbart, Pirate Wires, and other right-leaning publications now regularly mine Wikipedia’s lengthy debates for headlines like “How Wikipedia Launders Regime Propaganda,” accusing the site of being a mouthpiece for the Democratic Party, or “Cover Up: Wikipedia Editors Propose Deleting Page on Iran Advocating for Israel’s Destruction,” despite the article having just been created and the outcome being to merge the contents into the article on Iran-Israel relations. These reports are a dependable source of viral outrage on X. The strategy also appears effective at convincing lawmakers. In May, Rep. Debbie Wasserman Schultz (D-FL) and 22 other members wrote to the Wikimedia Foundation citing the ADL report and demanding Wikimedia “rein in antisemitism, uphold neutrality.”

Alice O'Connor, better known by her pen name Ayn Rand, was a Russian-born American writer and philosopher. She is known for her fiction and for developing a philosophical system that she named Objectivism. " data-portal-copyright="" /> Here are two black swans, but even with no black swans to possibly falsify it, "All swans are white" would still be shown falsifiable by "Here is a black swan" — it would still be a valid observation statement in the empirical language, even if empirically false." data-portal-copyright="" />

The August letter from House Republicans requesting information on attempts to influence the encyclopedia, data on editors who had been disciplined by Arbcom, and other records also cited the ADL report.

While some search for bias in the minutiae of edit histories, others try to encompass all of Wikipedia. Last year, a researcher at the conservative Manhattan Institute scraped Wikipedia for mentions of political terms and public officials and used a GPT language model to analyze them for bias. The report found “a mild to moderate” tendency to associate figures on the political right with more negative sentiment than those on the left. The study, which was not peer reviewed, has become a regular fixture in claims of liberal bias on Wikipedia.

The report still illustrates the challenges of evaluating the neutrality of a text as vast and stripped of subjective opinion as Wikipedia. An examination of the datasets shows that the passages GPT classified as non-neutral are often anodyne factual statements: that a lawmaker won or lost an election, represented a certain district, or died. It also conflated unrelated people of the same name, so, for example, most of the non-neutral statements about Mike Johnson concern not Mike Johnson the current Republican House Majority Speaker but a robber in a 1923 silent film, a prog-rock guitarist, multiple football players, and a famous yodeler.

But the more fundamental question is whether balanced sentiment or balanced anything across the contemporary political spectrum is the correct expectation for a project that operates by a different standard, one based on measures of reliability. Supposing the sentiment readings do reflect a real imbalance, is that due to the biases of editors, biases in their sources, or some other external imbalance, like a tendency by right-leaning politicians to express negative sentiments of fear or anger (a possibility the report raises, then dismisses).

Wikipedia has a long history of attempting to disentangle and correct its various biases. The site’s editor community has been overwhelmingly white, male, and based in the United States and Europe since the site began. In 2018, 90 percent of editors were men, and only 18 percent of biographies in the encyclopedia were of women. That year, the Canadian physicist Donna Strickland won a Nobel Prize, and people turning to Wikipedia to learn about her discovered she lacked an article.

Women have been historically excluded from the sciences, underrepresented in coverage of the sciences, and therefore underrepresented in the sources Wikipedia editors can cite

But the causal connection between these facts was not straightforward. Women have been historically excluded from the sciences, underrepresented in coverage of the sciences, and therefore underrepresented in the sources Wikipedia editors can cite. An editor had tried to make an article on Strickland several months before the Nobel but was overruled due to a lack of coverage in reliable sources. “Wikipedia is a mirror of the world’s biases, not the source of them. We can’t write articles about what you don’t cover,” tweeted then-executive director Katherine Maher.

Wikipedia’s sourcing guidelines are conservative in their deference to traditional institutions of knowledge production, like established newsrooms and academic peer review, and this means that it is sometimes late to ideas in the process of moving from fringe to mainstream. The possibility that covid-19 emerged from a lab was relegated to a section on conspiracy theories and is only now, after reporting by reliable sources, gaining a toehold on the covid pandemic article. Similarly, as awareness grew of the ways Western academic and journalistic institutions have excluded the perspectives of colonized people, critics argued that Wikipedia’s reliance on these same institutions made it impossible for the encyclopedia to be truly comprehensive.

Not all the bias comes from the project’s sources, though. A study that attempted to control for offline inequalities by examining only contemporary sociologists of similar achievement found that male academics were still more likely to have articles. As volunteers, editors work on topics they think are important, and the encyclopedia’s emphases and omissions reflect their demographics. Minor skirmishes in World War II and every episode of The Simpsons have an article, some of which are longer than the articles on the Ethiopian civil war or climate change in the Maldives. In an effort to fill in these gaps, the foundation has for several years funded editor recruitment and training initiatives under the banner of “knowledge equity.”

“Most editors on Wikipedia are English-speaking men, and our coverage is of things that are of interest to English-speaking men,” said a retired market analyst in Cincinnati who has been editing for over 20 years. “Our sports coverage is second to none. Video games, we got it covered. Wars, the history of warfare, my god. Trains, radio stations… But our coverage of foods from other countries is very low, and there is an absolute systemic bias against coverage of women and people of color.” For her part, she tries to fill gaps around food, creating new articles whenever she encounters a Peruvian chili sauce or African fufu that lacks one.

Yet these initiatives have come under attack as “DEI” by conservative influencers and Musk, who called for Wikipedia to be defunded until “they restore balance.”

If you think something is wrong on Wikipedia, you can fix it yourself

These accusations of bias, familiar from attacks on the media and social platforms, encounter some unique challenges when leveled against Wikipedia. Crucially, if you think something is wrong on Wikipedia, you can fix it yourself, though it will require making a case based on verifiability rather than ideological “balance.”

Over the years, Wikipedia has developed an immune response to outside grievances. When people on X start complaining about Wikipedia’s suppression of UFO sightings or refusal to change the name of the Gulf of Mexico to Gulf of America, an editor often restricts the page to people who are logged in and puts up a notice directing newcomers to read the latest debate. If anything important was missed, they are welcome to suggest it, the notice reads, provided their suggestion meets Wikipedia’s rules, which can be read about on the following pages. That is, Wikipedia’s first and best line of defense is to explain how Wikipedia works.

Occasionally, people stick around and learn to edit. More often, they get bored and leave.

It was not unusual for skirmishes to break out over the Wikipedia page for Asian News International, or ANI. It is the largest newswire service in India, and as its Wikipedia article explains, it has a history of promoting false anti-Muslim and pro-government propaganda. It was these facts that various anonymous editors — not logged into Wikipedia accounts, so appearing only as IP addresses — attempted to remove last spring.

As typically happens, an experienced editor quickly reinstated the deleted sentences, noting that they had been removed without explanation. Then came another drive-by edit: actually, ANI is not propaganda and very credible, someone wrote, citing a YouTube video. Reverted: YouTube commentary is not a reliable source. Then another IP address, deleting a sentence about ANI promoting a false viral story about necrophilia in Pakistan. Reverted again. Another IP address, deleting the mention of propaganda with the explanation that the sources were “leftist dogs and swine.”

As the edit battle escalated, an editor locked the page so that only people who were logged in and had made a certain number of edits could make changes, ending the barrage of IP addresses.

Two months later, ANI sued.

The lawsuit revealed that several of the IP addresses had belonged to representatives of ANI attempting to remove unflattering information about the company. Blocked from doing so, ANI sued for defamation under a recent amendment to India’s equivalent of Section 230 that places stricter requirements on platforms to moderate content. When the Wikimedia Foundation declined to reveal the identities of three editors who had defended the page, the presiding judge said he would ask the government to block the site, threatening to cut off the country with the highest number of English Wikipedia readers after the US and the UK. “If you don’t like India,” the judge said, “please don’t work in India.”

During the appeal, Wikimedia’s lawyer argued that disclosing the identities of editors would destroy the encyclopedia’s self-regulating system and expose contributors to reprisals. Also, he noted, the sentences in question, like every assertion on Wikipedia, were only summarizing other sources, and those sources — the publications The Caravan and The Ken — had not been sued for defamation. (As with editors, the foundation’s first response to external threats is often to explain how Wikipedia works.) The judge dismissed the argument, saying that journalism might be “read by a hundred people, you don’t bother about it… it does not have the gravitas.” Wikipedia, however, is read by millions.

By this point the case had garnered enough coverage to warrant its own Wikipedia page. This seemed to enrage the judge, particularly the line noting that the judge’s demand to reveal the identities of editors had been described as “censorship and a threat to the flow of information.” This “borders on contempt,” the judge said, demanding that the foundation take the page down within 36 hours. In a rare move, the foundation complied.

The case alarmed editors around the world. An open letter calling on the Wikimedia Foundation to protect the anonymity of the editors garnered more than 1,300 signatures, the most of any letter directed at the foundation. Nevertheless, last December, the foundation disclosed the editors’ identities to the judge under seal. Responding to outrage on Wikipedia’s editor forum, Wales asked for calm and urged people not to jump to conclusions.

The Wikimedia Foundation has historically taken a hard line against attempts to influence the project. In 2017, when the Turkish government demanded several articles be deleted, Wikipedia refused and was blocked for nearly three years as it fought to the country’s Constitutional Court and won. For the second half of 2024, the most recent data available, the foundation complied with about 8 percent of requests for user data, compared to Google’s 82 percent and Meta’s 77 percent. And the data provided was sparse, because Wikipedia retains almost none.

Instead of brute censorship, what has emerged is a sort of gray-zone information warfare

But attempts to influence the site have grown more sophisticated. The change is likely due to multiple factors: a global rise of political movements that wish to control independent media, the increased centrality of Wikipedia, and a technical change to the website itself. In 2015, Wikipedia switched to the encrypted HTTPS extension by default, making it impossible to see what pages users visited, only that they were visiting the Wikipedia domain. This meant that governments that had previously been censoring specific articles on opposition figures or historic protests had to choose between blocking all of Wikipedia or none of it. Almost every country save China (and Russia, for several hours) chose to not to block it. This was a victory for open knowledge, but it also meant governments had a greater interest in controlling what was written in the encyclopedia.

Instead of brute censorship, what has emerged is a sort of gray-zone information warfare. After mainland China quashed protests against the Hong Kong national security law in 2019, a battle began over how the protests would be remembered. Editors in mainland China — which can edit using VPNs — argued for the inclusion of state-friendly media that described the protests as “riots” or “terrorist attacks” while removing citations to independent media for unreliability and bias. In one case, an editor attempted to strip all citations to one of Hong Kong’s premier papers, Apple Daily, hours before it was shut down by the government. By conspiring offline and using fake accounts, they won elections to admin positions and with them the power to see other editors’ IP addresses, which they discussed using to reveal their opponents’ identities to the police. Shortly afterward, the Wikimedia Foundation banned or restricted more than a dozen editors operating from mainland China, saying that the project had been “infiltrated” and that “some users have been physically harmed as a result.”

Russia employed similar tactics after its invasion of Ukraine in 2022. State media and government officials attacked Wikipedia in the press with accusations of anti-Russian bias, promulgation of fake news, and foreign manipulation. The site remained accessible, but Russian search engines put a banner above it saying it was in violation of the law. Meanwhile, the government harassed the foundation with a series of fines for publishing “false” information about the military, which the foundation has refused to pay. Finally, on the encyclopedia, state-aligned editors pushed the government’s view while vigilantes doxxed and threatened their opposition. Last year, the head of Wikimedia Russia was declared a “foreign agent” and forced to resign from his job as a professor at Moscow State University.

In neighboring Belarus, editor Mark Bernstein was doxxed by a pro-Russian group in 2022, arrested, and sentenced to three years of home confinement. As many as five other editors have been detained by Belarusian authorities in recent months, according to media reports and editors.

As these battles continued, the Russian government supported the creation of a more compliant alternative, called Ruwiki, which launched early last year with the copying of 1.9 million articles from the originals, edited to reflect the government view. On Ruwiki, edits must comply with Russian laws and are subject to approval from outside experts. There, the map of Ukraine does not include Donetsk or Kherson, the war is a “special operation” in response to NATO aggression, and accounts of torture in Bucha are fake news.

The first large-scale anti-Zionist demonstrations in Palestine, March 1920, during the Occupied Enemy Territory Administration. The crowd of Muslim and Christian Palestinians are shown outside Damascus Gate, Old City of Jerusalem." data-portal-copyright="" /> Palestinian political violence refers to acts of violence or terrorism committed by Palestinians with the intent to accomplish political goals in the context of the Israeli–Palestinian conflict." data-portal-copyright="" /> On January 6th, 2021, the United States Capitol in Washington, DC, was attacked by a mob of supporters of President Donald Trump in an attempted self-coup, two months after his defeat in the 2020 presidential election. " data-portal-copyright="" /> Pareidolia is the tendency for perception to impose a meaningful interpretation on a nebulous stimulus, usually visual, so that one detects an object, pattern, or meaning where there is none." data-portal-copyright="" /> is a 1952 novella by the American author Ernest Hemingway. " data-portal-copyright="" />

Wikipedia remains online in Russia, but with Ruwiki, the government may now feel emboldened to block it. In May, at a hearing on media safety for children, the head of the Russian Duma Committee on the Protection of the Family said that the encyclopedia’s “interpretation of our historical events feels so hostile that we need to raise the issue of blocking this information resource,” and that the encyclopedia’s depiction of history is opposed to Russian “traditional, spiritual values.”

The goal of these campaigns is what the Wikimedia Foundation calls “project capture.” The term originates in an independent report the foundation commissioned in response to the takeover of the Croatian-language Wikipedia by a cabal of far-right editors.

In 2010, a group of editors won election to admin positions and began citing far-right alternative media to rewrite history. On Croatian Wikipedia, the Nazis invaded Poland to stop a genocide against the German people, Croatia’s role in the Holocaust is foreign propaganda, and Ratko Mladić was a decorated military leader whose conviction by the UN for genocide (briefly noted quite far down) was the result of an international conspiracy. When other editors attempted to correct the articles, the admins banned them for violating rules against hate speech or harassment.

The encyclopedia became so warped that it began receiving press coverage. The Croatian Minister of Education warned students not to use it. In an interview with a Croatian paper, Wales confirmed the foundation was aware of the problem and looking into it. Yet the foundation has a policy of allowing Wikipedia projects to self-govern, and interfering with Croatian Wikipedia risked opening a door to the many governments and companies that want things on Wikipedia changed.

Editors mounted a resistance and attempted to vote the admins out, but the admins defeated the attempt using votes from what were later revealed to be dozens of fake accounts. But because the admins were the only ones with the technical ability to trace IP addresses, the opposition had no way to prove this. The cabal now controlled all the levers of power. By 2019, nearly all of the editors who opposed them had been banned or harassed off the project.

In 2020, one of the few remaining dissident editors compiled a comprehensive textual and statistical analysis of editing patterns of dozens of accounts and filed a request for an admin to run IP traces to see if they were sock puppets. The admin stalled, then attempted to fudge the traces, but did so in such a transparent way that it was clear the accounts were indeed fakes.

This was the evidence required to procedurally break the cabal. High-ranking admins called “stewards” from other-language Wikipedias administered a new vote on banning the Croatian admins. This time, the admins lost. Their ringleader, username Kubura, was banned from all Wikipedia projects forever, a punishment that had been leveled against less than a dozen others in Wikipedia history. A local daily covered the incident with the headline “Kubura’s Downfall: Banned Globally, His Followers Retreat, Leaderless.”

Wikipedia’s processes are only effective if they are administered by people who believe in the spirit of the project

The foundation’s postmortem analysis compared the takeover to “state capture, one of the most pressing issues of today’s worldwide democratic backsliding.” The clique still cited the reliability of sources and invoked rules of debate, but it bent these processes to serve their nationalist purpose. As many governments have discovered, it is extremely difficult to insert propaganda into Wikipedia without running afoul of some rule or another. But what the Croatia capture showed is that Wikipedia’s processes are only effective if they are administered by people who believe in the spirit of the project. If they can be silenced or replaced, it becomes possible to steer the encyclopedia in a different direction.

Donna Theo Strickland (born May 27th, 1959) is a Canadian optical physicist and pioneer in the field of pulsed lasers. " data-portal-copyright="" /> A telescope in the Very Large Telescope system producing four orange laser guide stars." data-portal-copyright="" /> Oral tradition, or oral lore, is a form of human communication in which knowledge, art, ideas, and culture are received, preserved, and transmitted orally from one generation to another." data-portal-copyright="" />

One editor I spoke with, who asked to remain anonymous for reasons that will be obvious, had been editing Wikipedia for several years while living in a Middle Eastern country where much other media is tightly controlled. One day he received a call from a member of the intelligence service inviting him to lunch. He cried for hours — everyone knew what this meant.

The meeting was cordial but clear. They didn’t want him to stop editing Wikipedia. They wanted his help. They knew the encyclopedia has rules and you can’t just insert flagrant propaganda, but as a respected member of the community, maybe he could edit in ways that were a little friendlier to the government, maybe decide in its favor when certain topics came up for debate. In exchange, maybe the service could help him if he ever got in trouble with the police, for example, over his sexuality; he was gay in a country where that was illegal.

He fled the country weeks later. He now edits from abroad, but he knows of five to 10 others who have faced arrest or intimidation over their editing. They must do constant battle with editors he believes to be government agents who push the state’s perspective, debating tirelessly for hours because it is literally their job.

It’s a rare person who is able to uproot their life in the service of a volunteer side project. Understandably, many others faced with such threats become more cautious in their editing or stop altogether. Multiple editors based in India said that they now avoid editing topics related to their country. The ANI case had a chilling effect, as have recurring harassment campaigns. The far-right online publication OpIndia regularly accuses Wikipedia of “anti-Hindu and anti-India bias,” in ways that parallel attacks from the US right, down to citations of Manhattan Institute research and quotes from the disgruntled cofounder, Sanger. The organization has published the real names and employers of editors it accuses of being “leftists” or “Islamists,” leading at least one veteran editor to delete their account.

Even ancient history can be cause for reprisals. In February, after the release of a Bollywood action film about Chhatrapati Sambhaji Maharaj, a 17th-century king who fought the Mughals, accounts on X began whipping up outrage over several facts on Sambhaji’s Wikipedia page that they deemed to be anti-Hindu. When editors reversed attempts to delete the offending lines, another X user posted their usernames and called on government officials to investigate them. Days later, local press reported that the Maharashtra cyber police opened cases against at least four editors.

“If you issue cases and file complaints against editors, they tend not to edit those pages anymore”

“Various editors have left Wikipedia over this persecution, fearing their own safety,” said an Indian Wikipedia editor who asked to remain anonymous out of fear of retaliation. “I believe this is completely useful for the right wing, if you issue cases and file complaints against editors, they tend not to edit those pages anymore, fearing for their safety in real life.”

He still edits, but mostly sticks to the safer ground of the Roman Empire.

In April, the Trump administration’s interim US attorney for DC, Edward Martin Jr., sent a letter to the Wikimedia Foundation accusing the organization of disseminating “propaganda” and intimating that it had violated its duties as a tax-exempt nonprofit.

From a legal perspective, it was an odd document. The tax status of nonprofits is not generally the jurisdiction of the US attorney for DC, and many of the supposed violations, like having foreign nationals on its board or permitting “the rewriting of key, historical events and biographical information of current and previous American leaders,” are not against the law. Sanger is quoted, criticizing editor anonymity. In several cases, the rules Martin accuses Wikipedia of violating are Wikipedia’s own, like a commitment to neutrality. But the implied threat was clear.

“We’ve been anticipating something like this letter happening for some time,” a longtime editor, Lane Rasberry, said. It fits the pattern seen in India and elsewhere. He has been hearing more reports of threats against editors who work on pages related to trans issues and has been conducting security trainings to prevent their identities being revealed. Several US-based editors told me they now avoid politically contentious topics out of fear that they could be doxxed and face professional or legal retaliation. “There are more Wikipedia editors getting threats, more people getting scared,” Rasberry said.

The "little green men" were Russian soldiers who were masked and wore unmarked uniforms upon the outbreak of the Russo–Ukrainian War in 2014." data-portal-copyright="" /> The 2019–2020 Hong Kong protests (also known by other names) were a series of demonstrations against the Hong Kong government's introduction of a bill to amend the Fugitive Offenders Ordinance in regard to extradition. " data-portal-copyright="" /> May 2015 satellite image of the Crimean Peninsula." data-portal-copyright="" /> The owl of Athena, a symbol of knowledge in the Western world." data-portal-copyright="" /> Sambhaji, also known as Shambhuraje, ruled from 1681 to 1689 as the second king (Chhatrapati) of the Maratha Empire, a prominent state in early modern India." data-portal-copyright="" /> Stanislav Alexandrovich Kozlovsky is a Russian scientist-psychologist and specialist in the field of cognitive neuroscience of memory and perception." data-portal-copyright="" />

Talking to editors, I encountered a confounding spread of opinions about the seriousness of the threat to Wikipedia, often in the same conversation. The site has sloughed off more than two decades of attacks, and so far the latest round is no different. The Heritage Foundation plan to dox editors has yet to materialize. Musk’s calls for his followers to stop donating have resulted in surges in donations, according to publicly available data.

In India, the High Court struck down the order to take down the article about ANI’s defamation case, though the case itself is ongoing. Wikipedia’s critics on the right and in the Silicon Valley elite often propose generative AI as the solution to Wikipedia’s perceived biases, for each user a bespoke source of ideologically agreeable information. Yet all these projects remain wholly reliant on Wikipedia, and so far the most aggressive such initiative, Musk’s Grok, has spent much of its existence flailing between fact-checking Musk’s own conspiracy theories and proclaiming itself MechaHitler.

But new threats continue to appear. In August, the foundation lost its case arguing for an exemption from the UK’s Online Safety Act, which would force Wikipedia to verify the identities of its editors, though it is continuing to appeal. In Portugal the foundation received a court order arising from a defamation case brought by Portuguese American businessman Cesar DePaço, who objected to information on his page about past criminal allegations and links to the far-right Portuguese party Chega. Complying with the ruling, the foundation struck several facts from his biography and disclosed “a small amount of user data” about eight editors. The foundation is now bringing the case before the European Court of Human Rights. And in the US, there is the recent House Oversight letter.

No matter the outcome, these cases contribute to a general increase in pressure on the project’s already strained editors. English Wikipedia has fewer than 40,000 active editors, defined as users who have made five or more edits in the last month. The number of active administrators, crucial to maintaining the site and enforcing policy, peaked in 2008 and now stands at around 450. AI threatens to squeeze the editor pipeline further. The more people who get information from AI summaries of Wikipedia rather than the site itself, the fewer people who will wander down a rabbit hole, encounter an error that needs correcting, and become editors themselves.

“Wikipedia should not be taken for granted.”

At the same time, people are using AI to add plausible-looking but false or biased information to the encyclopedia, increasing the workload for editors. Harassment, ideological editing campaigns, government investigations, targeted lawsuits — even if they lead nowhere, they will make the prospect of editing more daunting and increase the odds that current editors burn out. “Wikipedia should not be taken for granted,” Rasberry said. “This is an existential threat.”

The first reactions to the Martin letter on the Wikipedia editor forums were radical: the foundation should leave the US, maybe for France, or Iceland, or Germany. This would not be unprecedented, an editor pointed out. The Encyclopédistes fled to Switzerland when the ancien régime attempted to censor them. Maybe the site should go dark in protest.

But moderation soon prevailed. “The community needs to chill on the blackout talk,” wrote an editor by the name of Tazerdadog. “We’re not there yet.” Right now, the best response to these threats is to double down on Wikipedia’s policies, particularly the refusal to be censored and its dedication to neutral point of view, they wrote.

NPOV

“I 100% agree with you, Tazerdadog,” replied “Jimbo Wales.” “Emphasizing to the WMF that NPOV is non-negotiable is not really the issue.” In fact, Wales wrote, he is chairing a working group on strengthening the policy. The initiative was announced in March, framed as a response to the global rise in threats to sources of neutral information, and to a fragmentation of the public’s understanding of the very concepts of neutrality and facts. Wikipedia’s response, it seemed, would be to neutral harder.

In May, I met Wales for coffee at a members club in Chelsea where he had been granted an honorary membership after giving a talk. (Wikipedia, as journalists have noted for years, did not make Wales a tech billionaire.) Extravagant bouquets of pastel flowers were arranged in an arch above the doorway and festooned the tables of the interior. Wales, dressed to meet his wife at the Chelsea Flower Show, matched the decor in a green linen suit and floral shirt. He does not, he said, normally dress like a leprechaun.

He was not particularly concerned about the attacks on Wikipedia, he said, though he warned that he is “pathologically optimistic.” Wikipedia has been attacked since it began. It fought Turkey’s ban to the Constitutional Court and won. Even Russian Wikipedia has proven resilient. In the US, the government lacks much of the leverage it has deployed against other institutions. Wikipedia doesn’t rely on government funding, and protections for online speech are strong. In the last fiscal year, the foundation took in $170 million in donations, with an average size of about $10.

As for the accusations of bias, why not investigate? Whether the attacks are in good faith or bad, it doesn’t really matter, Wales said. The foundation had already decided that it was a good time, given the fragmented and polarizing world, to examine and bolster Wikipedia’s neutrality processes. Wales, leaning over the coffee table, seemed excited at the prospect.

“If somebody turns up on a talk page and says, ‘Hey, this article is a mess, it’s wrong. It’s really biased,’ the right answer is to not scream at them and run and hide. The right answer is go, ‘Oh, tell me more. Let’s dig in. Where is it biased? How do we think about how do we fix that?’”

Let’s figure out the best methodologies for studying neutrality, Wales said. Let’s look at how editors evaluate the reliability of sources. Maybe Wikipedia does use the label “far-right” more than “far-left,” Wales said, a criticism that has been leveled at the site. Is that because the media uses the term more, and does Wikipedia use the term more or less than the media does, and does the media use the term more because there are more far-right movements in the world today?

“You have to chew on these things. There’s no simple answers.”

But there are answers. If the social platforms and language models that increasingly shape our understanding of the world are inscrutable black boxes, Wikipedia is the opposite, maybe the most legible, endlessly explainable information management system ever made. For any sentence, there is a source, and a reason that that source was used, and a reason for that reason.

“Let’s dig in,” Wales repeated. “Let’s assess the evidence. Let’s talk to a lot of different people. Let’s really try and understand.” Come, be part of the process. His working group is starting to discuss the best approach. The meetings, Wales acknowledged, have been very tedious so far.

As for the letter from the interim DC attorney, Trump withdrew Martin’s nomination in May, though he still has a position leading the Justice Department’s retribution-oriented “task force on weaponization.” In any case, the Wikimedia Foundation responded promptly.

“The foundation staff spent a lot of passion writing it,” Wales said of the reply. “Then they ran it by me for review, and I was ready to jump in, but I was like, actually, it’s perfect.”

“It’s very calm,” Wales said. “Here are the answers to your questions, here is what we do.” It explains how Wikipedia works.

An edit-a-thon is an event where some editors of online communities such as Wikipedia, OpenStreetMap (also known as a "mapathon"), and LocalWiki edit and improve a specific topic or type of content. " data-portal-copyright="" /> The Quaker business method or Quaker decision-making is a form of group decision-making and discernment, as well as of direct democracy, used by Quakers, or members of the Religious Society of Friends, to organise their religious affairs. " data-portal-copyright="" /> Wikipedia's goal is to create a well-written, reliable encyclopedia like the Encyclopædia Britannica, except Wikipedia is much, much bigger: Britannica has about 120,000 articles, while the English Wikipedia has over 7 million articles." data-portal-copyright="" /> The Wikipedia Monument, located in Słubice, Poland, is a statue designed by Armenian sculptor Mihran Hakobyan honoring Wikipedia contributors." data-portal-copyright="" /> Socrates was known to steadfastly assume others around him were acting in good faith." data-portal-copyright="" /> Jimmy Donal Wales (born August 7th, 1966), also known as Jimbo Wales, is an American internet entrepreneur and former financial trader. " data-portal-copyright="" />

How AI can make history

2024-02-15T09:00:00-05:00

Like millions of other people, the first thing Mark Humphries did with ChatGPT when it was released in late 2022 was ask it to perform parlor tricks, like writing poetry in the style of Bob Dylan — which, while very impressive, did not seem particularly useful to him, a historian studying the 18th-century fur trade. But Humphries, a 43-year-old professor at Wilfrid Laurier University in Waterloo, Canada, had long been interested in applying artificial intelligence to his work. He was already using a specialized text recognition tool designed to transcribe antiquated scripts and typefaces, though it made frequent errors that took time to correct. Curious, he pasted the tool’s garbled interpretation of a handwritten French letter into ChatGPT. AI corrected the text, fixing all the Fs that had been misread as an S and even adding missing accents. Then Humphries asked ChatGPT to translate it to English. It did that, too. Maybe, he thought, this thing would be useful after all.

For Humphries, AI tools held a tantalizing promise. Over the last decade, millions of documents in archives and libraries have been scanned and digitized — Humphries was involved in one such effort himself — but because their wide variety of formats, fonts, and vocabulary rendered them impenetrable to automated search, working with them required stupendous amounts of manual research. For a previous project, Humphries pieced together biographies for several hundred shellshocked World War I soldiers from assorted medical records, war diaries, newspapers, personnel files, and other ephemera. It had taken years and a team of research assistants to read, tag, and cross-reference the material for each individual. If new language models were as powerful as they seemed, he thought, it might be possible to simply upload all this material and ask the model to extract all the documents related to every soldier diagnosed with shell shock.

“That’s a lifetime’s work right there, or at least a decade,” said Humphries. “And you can imagine scaling that up. You could get an AI to figure out if a soldier was wounded on X date, what was happening with that unit on X date, and then access information about the members of that unit, that as historians, you’d never have the time to chase down on an individual basis,” he said. “It might open up new ways of understanding the past.”

Improved database management may be a far cry from the world-conquering superintelligence some predict, but it’s characteristic of the way language models are filtering the real world. From law to programming to journalism, professionals are trying to figure out whether and how to integrate this promising, risky, and very weird technology into their work. For historians, a technology capable of synthesizing entire archives that also has a penchant for fabricating facts is as appealing as it is terrifying, and the field, like so many others, is just beginning to grapple with the implications of such a potentially powerful but slippery tool.

AI seemed to be everywhere at the 137th annual meeting of the American Historical Association last month, according to Cindy Ermus, an associate professor of history at the University of Texas at San Antonio. She chaired one of several panels on the topic. Ermus described her and many of her colleagues’ relationship to AI as that of “curious children,” wondering with both excitement and wariness what aspects of their work it will change and how. “It’s going to transform every part of historical research, from collection, to curation, to writing, and of course, teaching,” she said. She was particularly impressed by Lancaster University lecturer Katherine McDonough’s presentation of a machine learning program capable of searching historic maps, initially trained on ordnance surveys of 19th-century Britain.

“It’s going to transform every part of historical research, from collection, to curation, to writing, and of course, teaching.”

“She searched the word ‘restaurant,’ and it pulled up the word ‘restaurant’ in tons of historical maps through the years,” Ermus said. “To the non-historian, that might not sound like a big deal, but we’ve never been able to do that before, and now it’s at our fingertips.”

Another attendee, Lauren Tilton, professor of liberal arts and digital humanities at the University of Richmond, had been working with machine learning for over a decade and recently worked with the Library of Congress to apply computer vision to the institution’s vast troves of minimally labeled photos and films. All archives are biased — in what material is saved to begin with and in how it is organized. The promise of AI, she said, is that it can open up archives at scale and make them searchable for things the archivists of the past didn’t value enough to label.

“The most described materials in the archive are usually the sort of voices we’ve heard before — the famous politicians, famous authors,” she said. “But we know that there are many stories by people of minoritized communities, communities of color, LGBTQ communities that have been hard to tell, not because people haven’t wanted to, but because of the challenges of how to search the archive.”

AI systems have their own biases, however. They have the well-documented tendency to reflect the gender, racial, and other biases of their training data — the fact that, as Ermus pointed out, when she asked GPT-4 to create an image of a history professor, it drew an elderly white man with elbow patches on his blazer — but they also display a bias that Tilton calls “presentism.” Because the vast preponderance of training data is scraped from the contemporary internet, models reflect a contemporary worldview. Tilton encountered this phenomenon when she found image recognition systems struggled to make sense of older photos, for example, labeling typewriters as computers and their paperweights as their mice. These were image recognition systems, but language models have a similar problem.

Impressed with ChatGPT, Humphries signed up for the OpenAI API and set out to make an AI research assistant. He was trying to track 18th-century fur traders through a morass of letters, journals, marriage certificates, legal documents, parish records, and contracts in which they appear only fleetingly. His goal was to design a system that could automate the process.

One of the first challenges he encountered was that 18th-century fur traders do not sound anything like a language model assumes

One of the first challenges he encountered was that 18th-century fur traders do not sound anything like a language model assumes. Ask GPT-4 to write a sample entry, as I did, and it will produce lengthy reflections on the sublime loneliness of the wilderness, saying things like, “This morn, the skies did open with a persistent drizzle, cloaking the forest in a veil of mist and melancholy,” and “Bruno, who had faced every hardship with the stoicism of a seasoned woodsman, now lay still beneath the shelter of our makeshift tent, a silent testament to the fragility of life in these untamed lands.”

Whereas an actual fur trader would be far more concise. For example, “Fine Weather. This morning the young man that died Yesterday was buried and his Grave was surrounded with Pickets. 9 Men went to gather Gum of which they brought wherewith to Gum 3 Canoes, the others were employed as yesterday,” as one wrote in 1806, referring to gathering tree sap to seal the seams of their bark canoes.

“The problem is that the language model wouldn’t pick up on a record like that, because it doesn’t contain the type of reflective writing that it’s trained to see as being representative of an event like that,” said Humphries. Trained on contemporary blog posts and essays, it would expect the death of a companion to be followed by lengthy emotional remembrances, not an inventory of sap supplies.

By fine-tuning the model on hundreds of examples of fur trader prose, Humphries got it to pull out journal entries in response to questions, but not always relevant ones. The antiquated vocabulary still posed a problem — words like varangue, a French term for the rib of a canoe that would rarely appear in the model’s training data, if ever.

After much trial and error, he ended up with an AI assembly line using multiple models to sort documents, search them for keywords and meaning, and synthesize answers to queries. It took a lot of time and a lot of tinkering, but GPT helped teach him the Python he needed. He named the system HistoryPearl, after his smartest cat.

He tested his system against edge cases, like the Norwegian trader Ferdinand Wentzel, who wrote about himself in the third person and deployed an odd sense of humor, for example, writing about the birth of his son by speculating about his paternity and making self-deprecating jokes about his own height — “F. W.’s Girl was safely delivered of a boy. – I almost believe it is his Son for his features seem to bear some resemblance of him & his short legs seem to determine this opinion beyond doubt.” This sort of writing stymied earlier models, but HistoryPearl could pull it up in response to a vaguely phrased question about Wentzel’s humor, along with other examples of Wentzel’s wit Humphries hadn’t been looking for.

The tool still missed some things, but it performed better than the average graduate student Humphries would normally hire to do this sort of work. And faster. And much, much cheaper. Last November, after OpenAI dropped prices for API calls, he did some rough math. What he would pay a grad student around $16,000 to do over the course of an entire summer, GPT-4 could do for about $70 in around an hour.

“They’re still talking about the technology as if it is a theoretical thing without the full understanding that it poses a very real, existential threat to our whole raison d’être as higher educators.”

“That was the moment where I realized, ‘Okay, this begins to change everything,’” he said. As a researcher, it was exciting. As a teacher, it was frightening. Organizing fur trading records may be a niche application, but a huge number of white collar jobs consist of similar information management tasks. His students were supposed to be learning the sorts of research and thinking skills that would allow them to be successful in just these sorts of jobs. In November, he published a newsletter imploring his peers in academia to take the rapid development of AI seriously. “AI is simply starting to outrun many people’s imaginations,” he wrote. “They’re still talking about the technology as if it is a theoretical thing without the full understanding that it poses a very real, existential threat to our whole raison d’être as higher educators.”

In the meantime, though, he was pleased that his tinkering had resulted in what he calls a “proof of concept”: reliable enough to be potentially useful, though not yet enough to fully trust. Humphries and his research partner, the historian Lianne Leddy, submitted a grant to scale their research up to all 30,000 voyageurs in their database. In a way, he found the labor required to develop this labor-saving system comforting. The largest improvements in the model came from feeding it the right data, something he was able to do only because of his expertise in the material. Lately, he has been thinking that there may actually be more demand for domain experts with the sort of research and critical assessment skills the humanities teach. This year he will teach an applied generative AI program he designed, run out of the Faculty of Arts.

“In some ways this is old wine in new bottles, right?” he said. In the mid 20th century, he pointed out, companies had vast corporate archives staffed by researchers who were experts, not just in storing and organizing documents, but in the material itself. “In order to make a lot of this data useful, people are needed who have both the ability to figure out how to train models, but more importantly, who understand what is good content and what’s not. I think that’s reassuring,” he said. “Whether I’m just deluding myself, that’s another question.”

AI Is a Lot of Work

2023-06-20T08:05:00-04:00

This article is a collaboration between New York Magazine and The Verge.

A few months after graduating from college in Nairobi, a 30-year-old I’ll call Joe got a job as an annotator — the tedious work of processing the raw information used to train artificial intelligence. AI learns by finding patterns in enormous quantities of data, but first that data has to be sorted and tagged by people, a vast workforce mostly hidden behind the machines. In Joe’s case, he was labeling footage for self-driving cars — identifying every vehicle, pedestrian, cyclist, anything a driver needs to be aware of — frame by frame and from every possible camera angle. It’s difficult and repetitive work. A several-second blip of footage took eight hours to annotate, for which Joe was paid about $10.

Then, in 2019, an opportunity arose: Joe could make four times as much running an annotation boot camp for a new company that was hungry for labelers. Every two weeks, 50 new recruits would file into an office building in Nairobi to begin their apprenticeships. There seemed to be limitless demand for the work. They would be asked to categorize clothing seen in mirror selfies, look through the eyes of robot vacuum cleaners to determine which rooms they were in, and draw squares around lidar scans of motorcycles. Over half of Joe’s students usually dropped out before the boot camp was finished. “Some people don’t know how to stay in one place for long,” he explained with gracious understatement. Also, he acknowledged, “it is very boring.”

This article is a collaboration between New York Magazine and The Verge.

But it was a job in a place where jobs were scarce, and Joe turned out hundreds of graduates. After boot camp, they went home to work alone in their bedrooms and kitchens, forbidden from telling anyone what they were working on, which wasn’t really a problem because they rarely knew themselves. Labeling objects for self-driving cars was obvious, but what about categorizing whether snippets of distorted dialogue were spoken by a robot or a human? Uploading photos of yourself staring into a webcam with a blank expression, then with a grin, then wearing a motorcycle helmet? Each project was such a small component of some larger process that it was difficult to say what they were actually training AI to do. Nor did the names of the projects offer any clues: Crab Generation, Whale Segment, Woodland Gyro, and Pillbox Bratwurst. They were non sequitur code names for non sequitur work.

As for the company employing them, most knew it only as Remotasks, a website offering work to anyone fluent in English. Like most of the annotators I spoke with, Joe was unaware until I told him that Remotasks is the worker-facing subsidiary of a company called Scale AI, a multibillion-dollar Silicon Valley data vendor that counts OpenAI and the U.S. military among its customers. Neither Remotasks’ or Scale’s website mentions the other.

Much of the public response to language models like OpenAI’s ChatGPT has focused on all the jobs they appear poised to automate. But behind even the most impressive AI system are people — huge numbers of people labeling data to train it and clarifying data when it gets confused. Only the companies that can afford to buy this data can compete, and those that get it are highly motivated to keep it secret. The result is that, with few exceptions, little is known about the information shaping these systems’ behavior, and even less is known about the people doing the shaping.

For Joe’s students, it was work stripped of all its normal trappings: a schedule, colleagues, knowledge of what they were working on or whom they were working for. In fact, they rarely called it work at all — just “tasking.” They were taskers.

The anthropologist David Graeber defines “bullshit jobs” as employment without meaning or purpose, work that should be automated but for reasons of bureaucracy or status or inertia is not. These AI jobs are their bizarro twin: work that people want to automate, and often think is already automated, yet still requires a human stand-in. The jobs have a purpose; it’s just that workers often have no idea what it is.

The current AI boom — the convincingly human-sounding chatbots, the artwork that can be generated from simple prompts, and the multibillion-dollar valuations of the companies behind these technologies — began with an unprecedented feat of tedious and repetitive labor.

In 2007, the AI researcher Fei-Fei Li, then a professor at Princeton, suspected the key to improving image-recognition neural networks, a method of machine learning that had been languishing for years, was training on more data — millions of labeled images rather than tens of thousands. The problem was that it would take decades and millions of dollars for her team of undergrads to label that many photos.

Li found thousands of workers on Mechanical Turk, Amazon’s crowdsourcing platform where people around the world complete small tasks for cheap. The resulting annotated dataset, called ImageNet, enabled breakthroughs in machine learning that revitalized the field and ushered in a decade of progress.

Annotation remains a foundational part of making AI, but there is often a sense among engineers that it’s a passing, inconvenient prerequisite to the more glamorous work of building models. You collect as much labeled data as you can get as cheaply as possible to train your model, and if it works, at least in theory, you no longer need the annotators. But annotation is never really finished. Machine-learning systems are what researchers call “brittle,” prone to fail when encountering something that isn’t well represented in their training data. These failures, called “edge cases,” can have serious consequences. In 2018, an Uber self-driving test car killed a woman because, though it was programmed to avoid cyclists and pedestrians, it didn’t know what to make of someone walking a bike across the street. The more AI systems are put out into the world to dispense legal advice and medical help, the more edge cases they will encounter and the more humans will be needed to sort them. Already, this has given rise to a global industry staffed by people like Joe who use their uniquely human faculties to help the machines.

Is that a red shirt with white stripes or a white shirt with red stripes? Is a wicker bowl a “decorative bowl” if it’s full of apples? What color is leopard print?

Over the past six months, I spoke with more than two dozen annotators from around the world, and while many of them were training cutting-edge chatbots, just as many were doing the mundane manual labor required to keep AI running. There are people classifying the emotional content of TikTok videos, new variants of email spam, and the precise sexual provocativeness of online ads. Others are looking at credit-card transactions and figuring out what sort of purchase they relate to or checking e-commerce recommendations and deciding whether that shirt is really something you might like after buying that other shirt. Humans are correcting customer-service chatbots, listening to Alexa requests, and categorizing the emotions of people on video calls. They are labeling food so that smart refrigerators don’t get confused by new packaging, checking automated security cameras before sounding alarms, and identifying corn for baffled autonomous tractors.

“There’s an entire supply chain,” said Sonam Jindal, the program and research lead of the nonprofit Partnership on AI. “The general perception in the industry is that this work isn’t a critical part of development and isn’t going to be needed for long. All the excitement is around building artificial intelligence, and once we build that, it won’t be needed anymore, so why think about it? But it’s infrastructure for AI. Human intelligence is the basis of artificial intelligence, and we need to be valuing these as real jobs in the AI economy that are going to be here for a while.”

The data vendors behind familiar names like OpenAI, Google, and Microsoft come in different forms. There are private outsourcing companies with call-center-like offices, such as the Kenya- and Nepal-based CloudFactory, where Joe annotated for $1.20 an hour before switching to Remotasks. There are also “crowdworking” sites like Mechanical Turk and Clickworker where anyone can sign up to perform tasks. In the middle are services like Scale AI. Anyone can sign up, but everyone has to pass qualification exams and training courses and undergo performance monitoring. Annotation is big business. Scale, founded in 2016 by then-19-year-old Alexandr Wang, was valued in 2021 at $7.3 billion, making him what Forbes called “the youngest self-made billionaire,” though the magazine noted in a recent profile that his stake has fallen on secondary markets since then.

This tangled supply chain is deliberately hard to map. According to people in the industry, the companies buying the data demand strict confidentiality. (This is the reason Scale cited to explain why Remotasks has a different name.) Annotation reveals too much about the systems being developed, and the huge number of workers required makes leaks difficult to prevent. Annotators are warned repeatedly not to tell anyone about their jobs, not even their friends and co-workers, but corporate aliases, project code names, and, crucially, the extreme division of labor ensure they don’t have enough information about them to talk even if they wanted to. (Most workers requested pseudonyms for fear of being booted from the platforms.) Consequently, there are no granular estimates of the number of people who work in annotation, but it is a lot, and it is growing. A recent Google Research paper gave an order-of-magnitude figure of “millions” with the potential to become “billions.”

Automation often unfolds in unexpected ways. Erik Duhaime, CEO of medical-data-annotation company Centaur Labs, recalled how, several years ago, prominent machine-learning engineers were predicting AI would make the job of radiologist obsolete. When that didn’t happen, conventional wisdom shifted to radiologists using AI as a tool. Neither of those is quite what he sees occurring. AI is very good at specific tasks, Duhaime said, and that leads work to be broken up and distributed across a system of specialized algorithms and to equally specialized humans. An AI system might be capable of spotting cancer, he said, giving a hypothetical example, but only in a certain type of imagery from a certain type of machine; so now, you need a human to check that the AI is being fed the right type of data and maybe another human who checks its work before passing it to another AI that writes a report, which goes to another human, and so on. “AI doesn’t replace work,” he said. “But it does change how work is organized.”

You might miss this if you believe AI is a brilliant, thinking machine. But if you pull back the curtain even a little, it looks more familiar, the latest iteration of a particularly Silicon Valley division of labor, in which the futuristic gleam of new technologies hides a sprawling manufacturing apparatus and the people who make it run. Duhaime reached back farther for a comparison, a digital version of the transition from craftsmen to industrial manufacturing: coherent processes broken into tasks and arrayed along assembly lines with some steps done by machines and some by humans but none resembling what came before.

Worries about AI-driven disruption are often countered with the argument that AI automates tasks, not jobs, and that these tasks will be the dull ones, leaving people to pursue more fulfilling and human work. But just as likely, the rise of AI will look like past labor-saving technologies, maybe like the telephone or typewriter, which vanquished the drudgery of message delivering and handwriting but generated so much new correspondence, commerce, and paperwork that new offices staffed by new types of workers — clerks, accountants, typists — were required to manage it. When AI comes for your job, you may not lose it, but it might become more alien, more isolating, more tedious.

Earlier this year, I signed up for Scale AI’s Remotasks. The process was straightforward. After entering my computer specs, internet speed, and some basic contact information, I found myself in the “training center.” To access a paying task, I first had to complete an associated (unpaid) intro course.

The training center displayed a range of courses with inscrutable names like Glue Swimsuit and Poster Macadamia. I clicked on something called GFD Chunking, which revealed itself to be labeling clothing in social-media photos.

The instructions, however, were odd. For one, they basically consisted of the same direction reiterated in the idiosyncratically colored and capitalized typography of a collaged bomb threat.

“DO LABEL items that are real and can be worn by humans or are intended to be worn by real people,” it read.

“All items below SHOULD be labeled because they are real and can be worn by real-life humans,” it reiterated above photos of an Air Jordans ad, someone in a Kylo Ren helmet, and mannequins in dresses, over which was a lime-green box explaining, once again, “DO Label real items that can be worn by real people.”

I skimmed to the bottom of the manual, where the instructor had written in the large bright-red font equivalent of grabbing someone by the shoulders and shaking them, “THE FOLLOWING ITEMS SHOULD NOT BE LABELED because a human could not actually put wear any of these items!” above a photo of C-3PO, Princess Jasmine from Aladdin, and a cartoon shoe with eyeballs.

Feeling confident in my ability to distinguish between real clothes that can be worn by real people and not-real clothes that cannot, I proceeded to the test. Right away, it threw an ontological curveball: a picture of a magazine depicting photos of women in dresses. Is a photograph of clothing real clothing? No, I thought, because a human cannot wear a photograph of clothing. Wrong! As far as AI is concerned, photos of real clothes are real clothes. Next came a photo of a woman in a dimly lit bedroom taking a selfie before a full-length mirror. The blouse and shorts she’s wearing are real. What about their reflection? Also real! Reflections of real clothes are also real clothes.

After an embarrassing amount of trial and error, I made it to the actual work, only to make the horrifying discovery that the instructions I’d been struggling to follow had been updated and clarified so many times that they were now a full 43 printed pages of directives: Do NOT label open suitcases full of clothes; DO label shoes but do NOT label flippers; DO label leggings but do NOT label tights; do NOT label towels even if someone is wearing it; label costumes but do NOT label armor. And so on.

There has been general instruction disarray across the industry, according to Milagros Miceli, a researcher at the Weizenbaum Institute in Germany who studies data work. It is in part a product of the way machine-learning systems learn. Where a human would get the concept of “shirt” with a few examples, machine-learning programs need thousands, and they need to be categorized with perfect consistency yet varied enough (polo shirts, shirts being worn outdoors, shirts hanging on a rack) that the very literal system can handle the diversity of the real world. “Imagine simplifying complex realities into something that is readable for a machine that is totally dumb,” she said.

Once, Victor stayed up 36 hours straight labeling elbows and knees and heads in photographs of crowds — he has no idea why.

The act of simplifying reality for a machine results in a great deal of complexity for the human. Instruction writers must come up with rules that will get humans to categorize the world with perfect consistency. To do so, they often create categories no human would use. A human asked to tag all the shirts in a photo probably wouldn’t tag the reflection of a shirt in a mirror because they would know it is a reflection and not real. But to the AI, which has no understanding of the world, it’s all just pixels and the two are perfectly identical. Fed a dataset with some shirts labeled and other (reflected) shirts unlabeled, the model won’t work. So the engineer goes back to the vendor with an update: DO label reflections of shirts. Soon, you have a 43-page guide descending into red all-caps.

“When you start off, the rules are relatively simple,” said a former Scale employee who requested anonymity because of an NDA. “Then they get back a thousand images and then they’re like, Wait a second, and then you have multiple engineers and they start to argue with each other. It’s very much a human thing.”

The job of the annotator often involves putting human understanding aside and following instructions very, very literally — to think, as one annotator said, like a robot. It’s a strange mental space to inhabit, doing your best to follow nonsensical but rigorous rules, like taking a standardized test while on hallucinogens. Annotators invariably end up confronted with confounding questions like, Is that a red shirt with white stripes or a white shirt with red stripes? Is a wicker bowl a “decorative bowl” if it’s full of apples? What color is leopard print? When instructors said to label traffic-control directors, did they also mean to label traffic-control directors eating lunch on the sidewalk? Every question must be answered, and a wrong guess could get you banned and booted to a new, totally different task with its own baffling rules.

Most of the work on Remotasks is paid at a piece rate with a single task earning anywhere from a few cents to several dollars. Because tasks can take seconds or hours, wages are hard to predict. When Remotasks first arrived in Kenya, annotators said it paid relatively well — averaging about $5 to $10 per hour depending on the task — but the amount fell as time went on.

Scale AI spokesperson Anna Franko said that the company’s economists analyze the specifics of a project, the skills required, the regional cost of living, and other factors “to ensure fair and competitive compensation.” Former Scale employees also said pay is determined through a surge-pricing-like mechanism that adjusts for how many annotators are available and how quickly the data is needed.

According to workers I spoke with and job listings, U.S.-based Remotasks annotators generally earn between $10 and $25 per hour, though some subject-matter experts can make more. By the beginning of this year, pay for the Kenyan annotators I spoke with had dropped to between $1 and $3 per hour.

That is, when they were making any money at all. The most common complaint about Remotasks work is its variability; it’s steady enough to be a full-time job for long stretches but too unpredictable to rely on. Annotators spend hours reading instructions and completing unpaid trainings only to do a dozen tasks and then have the project end. There might be nothing new for days, then, without warning, a totally different task appears and could last anywhere from a few hours to weeks. Any task could be their last, and they never know when the next one will come.

This boom-and-bust cycle results from the cadence of AI development, according to engineers and data vendors. Training a large model requires an enormous amount of annotation followed by more iterative updates, and engineers want it all as fast as possible so they can hit their target launch date. There may be monthslong demand for thousands of annotators, then for only a few hundred, then for a dozen specialists of a certain type, and then thousands again. “The question is, Who bears the cost for these fluctuations?” said Jindal of Partnership on AI. “Because right now, it’s the workers.”

“I really am wasting my life here if I made somebody a billionaire and I’m earning a couple of bucks a week.”

To succeed, annotators work together. When I told Victor, who started working for Remotasks while at university in Nairobi, about my struggles with the traffic-control-directors task, he told me everyone knew to stay away from that one: too tricky, bad pay, not worth it. Like a lot of annotators, Victor uses unofficial WhatsApp groups to spread the word when a good task drops. When he figures out a new one, he starts impromptu Google Meets to show others how it’s done. Anyone can join and work together for a time, sharing tips. “It’s a culture we have developed of helping each other because we know when on your own, you can’t know all the tricks,” he said.

Because work appears and vanishes without warning, taskers always need to be on alert. Victor has found that projects pop up very late at night, so he is in the habit of waking every three hours or so to check his queue. When a task is there, he’ll stay awake as long as he can to work. Once, he stayed up 36 hours straight labeling elbows and knees and heads in photographs of crowds — he has no idea why. Another time, he stayed up so long his mother asked him what was wrong with his eyes. He looked in the mirror to discover they were swollen.

Annotators generally know only that they are training AI for companies located vaguely elsewhere, but sometimes the veil of anonymity drops — instructions mentioning a brand or a chatbot say too much. “I read and I Googled and found I am working for a 25-year-old billionaire,” said one worker, who, when we spoke, was labeling the emotions of people calling to order Domino’s pizza. “I really am wasting my life here if I made somebody a billionaire and I’m earning a couple of bucks a week.”

Victor is a self-proclaimed “fanatic” about AI and started annotating because he wants to help bring about a fully automated post-work future. But earlier this year, someone dropped a Time story into one of his WhatsApp groups about workers training ChatGPT to recognize toxic content who were getting paid less than $2 an hour by the vendor Sama AI. “People were angry that these companies are so profitable but paying so poorly,” Victor said. He was unaware until I told him about Remotasks’ connection to Scale. Instructions for one of the tasks he worked on were nearly identical to those used by OpenAI, which meant he had likely been training ChatGPT as well, for approximately $3 per hour.

“I remember that someone posted that we will be remembered in the future,” he said. “And somebody else replied, ‘We are being treated worse than foot soldiers. We will be remembered nowhere in the future.’ I remember that very well. Nobody will recognize the work we did or the effort we put in.”

Identifying clothing and labeling customer-service conversations are just some of the annotation gigs available. Lately, the hottest on the market has been chatbot trainer. Because it demands specific areas of expertise or language fluency and wages are often adjusted regionally, this job tends to pay better. Certain types of specialist annotation can go for $50 or more per hour.

A woman I’ll call Anna was searching for a job in Texas when she stumbled across a generic listing for online work and applied. It was Remotasks, and after passing an introductory exam, she was brought into a Slack room of 1,500 people who were training a project code-named Dolphin, which she later discovered to be Google DeepMind’s chatbot, Sparrow, one of the many bots competing with ChatGPT. Her job is to talk with it all day. At about $14 an hour, plus bonuses for high productivity, “it definitely beats getting paid $10 an hour at the local Dollar General store,” she said.

Also, she enjoys it. She has discussed science-fiction novels, mathematical paradoxes, children’s riddles, and TV shows. Sometimes the bot’s responses make her laugh; other times, she runs out of things to talk about. “Some days, my brain is just like, I literally have no idea what on earth to ask it now,” she said. “So I have a little notebook, and I’ve written about two pages of things — I just Google interesting topics — so I think I’ll be good for seven hours today, but that’s not always the case.”

Each time Anna prompts Sparrow, it delivers two responses and she picks the best one, thereby creating something called “human-feedback data.” When ChatGPT debuted late last year, its impressively natural-seeming conversational style was credited to its having been trained on troves of internet data. But the language that fuels ChatGPT and its competitors is filtered through several rounds of human annotation. One group of contractors writes examples of how the engineers want the bot to behave, creating questions followed by correct answers, descriptions of computer programs followed by functional code, and requests for tips on committing crimes followed by polite refusals. After the model is trained on these examples, yet more contractors are brought in to prompt it and rank its responses. This is what Anna is doing with Sparrow. Exactly which criteria the raters are told to use varies — honesty, or helpfulness, or just personal preference. The point is that they are creating data on human taste, and once there’s enough of it, engineers can train a second model to mimic their preferences at scale, automating the ranking process and training their AI to act in ways humans approve of. The result is a remarkably human-seeming bot that mostly declines harmful requests and explains its AI nature with seeming self-awareness.

Put another way, ChatGPT seems so human because it was trained by an AI that was mimicking humans who were rating an AI that was mimicking humans who were pretending to be a better version of an AI that was trained on human writing.

This circuitous technique is called “reinforcement learning from human feedback,” or RLHF, and it’s so effective that it’s worth pausing to fully register what it doesn’t do. When annotators teach a model to be accurate, for example, the model isn’t learning to check answers against logic or external sources or about what accuracy as a concept even is. The model is still a text-prediction machine mimicking patterns in human writing, but now its training corpus has been supplemented with bespoke examples, and the model has been weighted to favor them. Maybe this results in the model extracting patterns from the part of its linguistic map labeled as accurate and producing text that happens to align with the truth, but it can also result in it mimicking the confident style and expert jargon of the accurate text while writing things that are totally wrong. There is no guarantee that the text the labelers marked as accurate is in fact accurate, and when it is, there is no guarantee that the model learns the right patterns from it.

This dynamic makes chatbot annotation a delicate process. It has to be rigorous and consistent because sloppy feedback, like marking material that merely sounds correct as accurate, risks training models to be even more convincing bullshitters. An early OpenAI and DeepMind joint project using RLHF, in this case to train a virtual robot hand to grab an item, resulted in also training the robot to position its hand between the object and its raters and wiggle around such that it only appeared to its human overseers to grab the item. Ranking a language model’s responses is always going to be somewhat subjective because it’s language. A text of any length will have multiple elements that could be right or wrong or, taken together, misleading. OpenAI researchers ran into this obstacle in another early RLHF paper. Trying to get their model to summarize text, the researchers found they agreed only 60 percent of the time that a summary was good. “Unlike many tasks in [machine learning] our queries do not have unambiguous ground truth,” they lamented.

When Anna rates Sparrow’s responses, she’s supposed to be looking at their accuracy, helpfulness, and harmlessness while also checking that the model isn’t giving medical or financial advice or anthropomorphizing itself or running afoul of other criteria. To be useful training data, the model’s responses have to be quantifiably ranked against one another: Is a bot that helpfully tells you how to make a bomb “better” than a bot that’s so harmless it refuses to answer any questions? In one DeepMind paper, when Sparrow’s makers took a turn annotating, four researchers wound up debating whether their bot had assumed the gender of a user who asked it for relationship advice. According to Geoffrey Irving, one of DeepMind’s research scientists, the company’s researchers hold weekly annotation meetings in which they rerate data themselves and discuss ambiguous cases, consulting with ethical or subject-matter experts when a case is particularly tricky.

There are people classifying the emotional content of TikTok videos, new variants of email spam, and the precise sexual provocativeness of online ads.

Anna often finds herself having to choose between two bad options. “Even if they’re both absolutely, ridiculously wrong, you still have to figure out which one is better and then write words explaining why,” she said. Sometimes, when both responses are bad, she’s encouraged to write a better response herself, which she does about half the time.

Because feedback data is difficult to collect, it fetches a higher price. Basic preferences of the sort Anna is producing sell for about $1 each, according to people with knowledge of the industry. But if you want to train a model to do legal research, you need someone with training in law, and this gets expensive. Everyone involved is reluctant to say how much they’re spending, but in general, specialized written examples can go for hundreds of dollars, while expert ratings can cost $50 or more. One engineer told me about buying examples of Socratic dialogues for up to $300 a pop. Another told me about paying $15 for a “darkly funny limerick about a goldfish.”

OpenAI, Microsoft, Meta, and Anthropic did not comment about how many people contribute annotations to their models, how much they are paid, or where in the world they are located. Irving of DeepMind, which is a subsidiary of Google, said the annotators working on Sparrow are paid “at least the hourly living wage” based on their location. Anna knows “absolutely nothing” about Remotasks, but Sparrow has been more open. She wasn’t the only annotator I spoke with who got more information from the AI they were training than from their employer; several others learned whom they were working for by asking their AI for its company’s terms of service. “I literally asked it, ‘What is your purpose, Sparrow?’” Anna said. It pulled up a link to DeepMind’s website and explained that it’s an AI assistant and that its creators trained it using RLHF to be helpful and safe.

Until recently, it was relatively easy to spot bad output from a language model. It looked like gibberish. But this gets harder as the models get better — a problem called “scalable oversight.” Google inadvertently demonstrated how hard it is to catch the errors of a modern-language model when one made it into the splashy debut of its AI assistant, Bard. (It stated confidently that the James Webb Space Telescope “took the very first pictures of a planet outside of our own solar system,” which is wrong.) This trajectory means annotation increasingly requires specific skills and expertise.

Last year, someone I’ll call Lewis was working on Mechanical Turk when, after completing a task, he received a message inviting him to apply for a platform he hadn’t heard of. It was called Taskup.ai, and its website was remarkably basic: just a navy background with text reading GET PAID FOR TASKS ON DEMAND. He applied.

The work paid far better than anything he had tried before, often around $30 an hour. It was more challenging, too: devising complex scenarios to trick chatbots into giving dangerous advice, testing a model’s ability to stay in character, and having detailed conversations about scientific topics so technical they required extensive research. He found the work “satisfying and stimulating.” While checking one model’s attempts to code in Python, Lewis was learning too. He couldn’t work for more than four hours at a stretch, lest he risk becoming mentally drained and making mistakes, and he wanted to keep the job.

“If there was one thing I could change, I would just like to have more information about what happens on the other end,” he said. “We only know as much as we need to know to get work done, but if I could know more, then maybe I could get more established and perhaps pursue this as a career.”

I spoke with eight other workers, most based in the U.S., who had similar experiences of answering surveys or completing tasks on other platforms and finding themselves recruited for Taskup.ai or several similarly generic sites, such as DataAnnotation.tech or Gethybrid.io. Often their work involved training chatbots, though with higher-quality expectations and more specialized purposes than other sites they had worked for. One was demonstrating spreadsheet macros. Another was just supposed to have conversations and rate responses according to whatever criteria she wanted. She often asked the chatbot things that had come up in conversations with her 7-year-old daughter, like “What is the largest dinosaur?” and “Write a story about a tiger.” “I haven’t fully gotten my head around what they’re trying to do with it,” she told me.

Taskup.ai, DataAnnotation.tech, and Gethybrid.io all appear to be owned by the same company: Surge AI. Its CEO, Edwin Chen, would neither confirm nor deny the connection, but he was willing to talk about his company and how he sees annotation evolving.

“I’ve always felt the annotation landscape is overly simplistic,” Chen said over a video call from Surge’s office. He founded Surge in 2020 after working on AI at Google, Facebook, and Twitter convinced him that crowdsourced labeling was inadequate. “We want AI to tell jokes or write really good marketing copy or help me out when I need therapy or whatnot,” Chen said. “You can’t ask five people to independently come up with a joke and combine it into a majority answer. Not everybody can tell a joke or solve a Python program. The annotation landscape needs to shift from this low-quality, low-skill mind-set to something that’s much richer and captures the range of human skills and creativity and values that we want AI systems to possess.”

Last year, Surge relabeled Google’s dataset classifying Reddit posts by emotion. Google had stripped each post of context and sent them to workers in India for labeling. Surge employees familiar with American internet culture found that 30 percent of the labels were wrong. Posts like “hell yeah my brother” had been classified as annoyance and “Yay, cold McDonald’s. My favorite” as love.

Surge claims to vet its workers for qualifications — that people doing creative-writing tasks have experience with creative writing, for example — but exactly how Surge finds workers is “proprietary,” Chen said. As with Remotasks, workers often have to complete training courses, though unlike Remotasks, they are paid for it, according to the annotators I spoke with. Having fewer, better-trained workers producing higher-quality data allows Surge to compensate better than its peers, Chen said, though he declined to elaborate, saying only that people are paid “fair and ethical wages.” The workers I spoke with earned between $15 and $30 per hour, but they are a small sample of all the annotators, a group Chen said now consists of 100,000 people. The secrecy, he explained, stems from clients’ demands for confidentiality.

Surge’s customers include OpenAI, Google, Microsoft, Meta, and Anthropic. Surge specializes in feedback and language annotation, and after ChatGPT launched, it got an influx of requests, Chen said: “I thought everybody knew the power of RLHF, but I guess people just didn’t viscerally understand.”

The new models are so impressive they’ve inspired another round of predictions that annotation is about to be automated. Given the costs involved, there is significant financial pressure to do so. Anthropic, Meta, and other companies have recently made strides in using AI to drastically reduce the amount of human annotation needed to guide models, and other developers have started using GPT-4 to generate training data. However, a recent paper found that GPT-4-trained models may be learning to mimic GPT’s authoritative style with even less accuracy, and so far, when improvements in AI have made one form of annotation obsolete, demand for other, more sophisticated types of labeling has gone up. This debate spilled into the open earlier this year, when Scale’s CEO, Wang, tweeted that he predicted AI labs will soon be spending as many billions of dollars on human data as they do on computing power; OpenAI’s CEO, Sam Altman, responded that data needs will decrease as AI improves.

“I mean, what it can do is amazing,” she said of the chatbot. “But it still does some really weird shit.”

Chen is skeptical AI will reach a point where human feedback is no longer needed, but he does see annotation becoming more difficult as models improve. Like many researchers, he believes the path forward will involve AI systems helping humans oversee other AI. Surge recently collaborated with Anthropic on a proof of concept, having human labelers answer questions about a lengthy text with the help of an unreliable AI assistant, on the theory that the humans would have to feel out the weaknesses of their AI assistant and collaborate to reason their way to the correct answer. Another possibility has two AIs debating each other and a human rendering the final verdict on which is correct. “We still have yet to see really good practical implementations of this stuff, but it’s starting to become necessary because it’s getting really hard for labelers to keep up with the models,” said OpenAI research scientist John Schulman in a recent talk at Berkeley.

“I think you always need a human to monitor what AIs are doing just because they are this kind of alien entity,” Chen said. Machine-learning systems are just too strange ever to fully trust. The most impressive models today have what, to a human, seems like bizarre weaknesses, he added, pointing out that though GPT-4 can generate complex and convincing prose, it can’t pick out which words are adjectives: “Either that or models get so good that they’re better than humans at all things, in which case, you reach your utopia and who cares?”

As 2022 ended, Joe started hearing from his students that their task queues were often empty. Then he got an email informing him the boot camps in Kenya were closing. He continued training taskers online, but he began to worry about the future.

“There were signs that it was not going to last long,” he said. Annotation was leaving Kenya. From colleagues he had met online, he heard tasks were going to Nepal, India, and the Philippines. “The companies shift from one region to another,” Joe said. “They don’t have infrastructure locally, so it makes them flexible to shift to regions that favor them in terms of operation cost.”

One way the AI industry differs from manufacturers of phones and cars is in its fluidity. The work is constantly changing, constantly getting automated away and replaced with new needs for new types of data. It’s an assembly line but one that can be endlessly and instantly reconfigured, moving to wherever there is the right combination of skills, bandwidth, and wages.

Lately, the best-paying work is in the U.S. In May, Scale started listing annotation jobs on its own website, soliciting people with experience in practically every field AI is predicted to conquer. There were listings for AI trainers with expertise in health coaching, human resources, finance, economics, data science, programming, computer science, chemistry, biology, accounting, taxes, nutrition, physics, travel, K-12 education, sports journalism, and self-help. You can make $45 an hour teaching robots law or make $25 an hour teaching them poetry. There were also listings for people with security clearance, presumably to help train military AI. Scale recently launched a defense-oriented language model called Donovan, which Wang called “ammunition in the AI war,” and won a contract to work on the Army’s robotic-combat-vehicle program.

Anna is still training chatbots in Texas. Colleagues have been turned into reviewers and Slack admins — she isn’t sure why, but it has given her hope that the gig could be a longer-term career. One thing she isn’t worried about is being automated out of a job. “I mean, what it can do is amazing,” she said of the chatbot. “But it still does some really weird shit.”

When Remotasks first arrived in Kenya, Joe thought annotation could be a good career. Even after the work moved elsewhere, he was determined to make it one. There were thousands of people in Nairobi who knew how to do the work, he reasoned — he had trained many of them, after all. Joe rented office space in the city and began sourcing contracts: a job annotating blueprints for a construction company, another labeling fruits despoiled by insects for some sort of agricultural project, plus the usual work of annotating for self-driving cars and e-commerce.

But he has found his vision difficult to achieve. He has just one full-time employee, down from two. “We haven’t been having a consistent flow of work,” he said. There are weeks with nothing to do because customers are still collecting data, and when they’re done, he has to bring in short-term contractors to meet their deadlines: “Clients don’t care whether we have consistent work or not. So long as the datasets have been completed, then that’s the end of that.”

Rather than let their skills go to waste, other taskers decided to chase the work wherever it went. They rented proxy servers to disguise their locations and bought fake IDs to pass security checks so they could pretend to work from Singapore, the Netherlands, Mississippi, or wherever the tasks were flowing. It’s a risky business. Scale has become increasingly aggressive about suspending accounts caught disguising their location, according to multiple taskers. It was during one of these crackdowns that my account got banned, presumably because I had been using a VPN to see what workers in other countries were seeing, and all $1.50 or so of my earnings were seized.

“These days, we have become a bit cunning because we noticed that in other countries they are paying well,” said Victor, who was earning double the Kenyan rate by tasking in Malaysia. “You do it cautiously.”

Another Kenyan annotator said that after his account got suspended for mysterious reasons, he decided to stop playing by the rules. Now, he runs multiple accounts in multiple countries, tasking wherever the pay is best. He works fast and gets high marks for quality, he said, thanks to ChatGPT. The bot is wonderful, he said, letting him speed through $10 tasks in a matter of minutes. When we spoke, he was having it rate another chatbot’s responses according to seven different criteria, one AI training the other.

A history of metaphors for the internet

2023-05-18T08:53:40-04:00

When I wrote about this web surfing competition, it got me thinking about different metaphors for the internet. Surfing seemed like an odd one, an artifact from a very particular time in the mid-1990s when people used terms like “information superhighway” and “cyberspace” unironically. Where did these metaphors come from, and where did they go? Have any persisted, and have new ones taken their place?

The more I read, the more it seemed that these old metaphors hadn’t died out at all, though their meanings had changed. No one says “information superhighway” anymore, but whenever anyone explains net neutrality, they do so in terms of fast lanes and tolls. Twitter is a “town square,” a metaphor that was once used for the internet as a whole. These old metaphors had been joined by a few new ones: I have a feeling that “the cloud” will soon feel as dated as “cyberspace.”

From the web to virtual villages

The Web – 1990: Tim Berners-Lee decided to call his system of linked hypertext documents “The Worldwide Web” instead of the “Mine of Information” or the “Information Mesh,” which he also considered. Later, it would be crawled by “spiders,” though the spider metaphor never really caught on.
The Information Superhighway – 1991: Popularized by Al Gore as he pushed to expand and improve the national networking infrastructure, at the time used primarily by researchers. The information superhighway, in contrast, had stronger commercial associations. It also carried with it the assumption that it’s a public work and that activity on it could be regulated.
Virtual villages, cafes, flea markets, and parks – 1993: Tech journalist Howard Rheingold published The Virtual Community: Homesteading on the Electronic Frontier, likening the message board he was dialing into to a “virtual village.” Like other people trying to convey the communitarian aspects of the internet, he also likened the subcultures and groups to cafes, flea markets, Hyde Park, and other public gathering places.

“Information is fairly formless, so almost everything we do online we do with some kind of metaphor,” says Judith Donath, who studies interface design at Harvard’s Berkman Klein Center for Internet & Society. Moreover, because information is formless, the metaphors we use to describe it are particularly powerful — they’re what gives it form, telling people how a service ought to be used. Software metaphors can be both verbal and visual. Donath cites email as a particularly entrenched example. The mail metaphor made sense initially but locked us into a cumbersome system of folders. There’s no reason an email couldn’t exist in multiple categories, as in some sort of tagging system, other than that it would “break the metaphor,” she says, which is what Google eventually did with Gmail.

The 1990s saw a boom in sweeping metaphors for the entire internet, mostly because it was a time when people who were very excited about the internet were trying to explain it to people who didn’t understand it at all. That’s when you get your “internet superhighways,” “infobahns,” “global villages,” and “coffee houses with a thousand rooms.” But these metaphors weren’t simply clumsy attempts at communicating what the internet was — implicit in each of them was a vision of what the internet ought to be.

Take “cyberspace,” the founding spatial metaphor popularized by William Gibson in 1984’s Neuromancer. Going online wasn’t just sitting down at your computer and transmitting signals through a network; it was jacking into another dimension, leaving your physical body behind and entering a utopian space of pure information, one that was typically visualized as buildings literally constructed from neon data. Cyberspace became the chosen metaphor of the libertarian and countercultural strains of the early internet. As the media began to drum up internet panic, it became a scary place, full of cybercriminals cybersexing, but it was still an alternate dimension of total freedom.

These days, “cyberspace” still has these anarchic associations, but now the term only comes up in conversations about securing it. Government officials are pretty much the only people using it unironically. “Cyberspace is real,” then-President Barack Obama declared in 2009, announcing a new cybersecurity effort. “There will be no dark spaces for dark acts any more,” said Carl Bildt, the former prime minister of Sweden, at the 2011 London Conference on Cyberspace.

Cyberspace and a series of tubes

Cyberspace – 1996: Compared to the highway metaphor, cyberspace came to represent a more anarchic vision of the internet, an imagined virtual region separate from the physical world. “Governments of the Industrial World, you weary giants of flesh and steel, I come from Cyberspace, the new home of Mind,” wrote John Perry Barlow in “A Declaration of the Independence of Cyberspace.” The media often treated it as a scary place, warning of cybercrime and cybersex, and depicting it as a landscape made of neon numbers.
Fast lanes, slow lanes – 1997: Tim Wu coined the term “net neutrality” in 2003, but in 2006, he used an extended highway metaphor to explain why it was good. “How would you feel if I-95 announced an exclusive deal with General Motors to provide a special ‘rush-hour’ lane for GM cars only? That seems intuitively wrong,” Wu wrote in Slate. The fast-lane metaphor continues to be the primary way net neutrality is discussed, at least by those advocating for an open internet.
A series of tubes – 2006: “The internet is not a big truck,” said Sen. Ted Stevens in a fumbling attack on net neutrality. “It’s a series of tubes.” Stevens was widely mocked, but he wasn’t wrong. The truth is that “tubes” is probably a more accurate description of the internet — of its physical nature, anyway — than most of the metaphors discussed here. The internet is certainly more tubes than clouds.

Compare cyberspace to the other major metaphor of the ’90s: the information superhighway. Al Gore popularized the term as he pushed for the expansion of a national computer network, at the time used mostly for research. The highway was the perfect metaphor: it’s a big state-funded infrastructure project that will facilitate commerce, not an anarchic frontier. Like the railroad, which this 1993 article from The New York Times compares it to, it will conquer and develop the frontier. The “metaphor of the Internet as the information superhighway was chosen deliberately to demonstrate the utility and everyday nature of the Internet over the utopian vision of cyberspace that had informed its early development,” write professors Cornelius Puschmann and Jean Burgess.

This metaphor, too, has political implications, as the information scientist Peter Lyman points out. If the internet is a highway, then that implies the government should regulate what people do on it. The highway is also designed for moving private property to market, implying that the information superhighway is for moving and selling information, now understood primarily as intellectual property — not for freely copying and distributing data.

Interestingly, the highway metaphor has also flipped. Where cyberspace is used to describe a place that governments must bring under control, the information highway is invoked by activists trying to keep it free. Wu, who coined the term “net neutrality,” used an extended highway metaphor in 2006 to explain why people should care. Since then, fast lanes, slow lanes, and tolls have become the default language of the net neutrality debate, at least among those who support it. What started as a metaphor for regulation and markets ended up as a symbol of freedom.

When I started looking into metaphors, I thought I’d mostly be chronicling antiquated terms. I was surprised to find it still alive in the net neutrality debate. I was even more surprised when Donath pointed me toward all the other — newer — metaphors that might not initially seem metaphorical.

Facebook itself is a metaphor, she says. It uses the analogy of the freshman lookbook. It uses friendship as a metaphor to describe any connection. It uses a newspaper to describe its feed of events, which creates a tacit expectation that, like a newspaper editorial board, it will curate what you see. Twitter, on the other hand, is a “global town square” where anyone can be heard.

“So much of the internet has been branded,” says professor Julie Cohen, “what’s interesting now is what different brands end up with as metaphors.”

Clouds and town squares

Clouds – 2006: Amazon launches Elastic Compute Cloud, beginning its domination of the remote computing industry and the ubiquity of “cloud” everything. Cisco wants to turn its routers into data-gathering hubs, allowing computation to be done more locally. Naturally, it calls its system “fog computing.”
The Stream – 2009: “The stream is winding its way throughout the Web and organizing it by nowness,” wrote TechCrunch’s Erick Schonfeld. Facebook, Twitter, Digg, Google Reader, and other services were adopting reverse chronological real-time feeds. Metaphors are often both verbal and physical. Email uses the metaphor of mail, for example; it also uses the interface of folders to organize it.
Town square – 2011: With the Arab Spring, Twitter, in particular, was described as the “global town square,” a semi-public space where people could be heard and organize protests. Dick Costolo ran with the metaphor, saying that Twitter’s ability to both broadcast and have back-and-forth exchanges had brought back the Greek Agora.

${object.year}

`; object.objects.forEach((item) => { var timelineObject = document.createElement("div"); timelineObject.classList.add("timeline-object"); timelineObject.innerHTML = `

${item.date}

${item.text}

`; yearObject.appendChild(timelineObject); }); timelineContainer.appendChild(yearObject); }); timelineContainer.classList.remove("empty"); timelineContainer.parentNode.classList.remove("my-40"); var timelineHeight = timelineContainer.getBoundingClientRect().height; var lastTimelineObject = timelineContainer.querySelector(".year-object:last-child .timeline-object:last-child"); var lastTimelineObjectHeight = lastTimelineObject.getBoundingClientRect().height; var line = document.createElement("div"); line.classList.add("line"); line.style.height = timelineHeight - 20 - lastTimelineObjectHeight + "px"; timelineContainer.appendChild(line); }); }); .timeline-container { position: relative; } .timeline-sidebar { background-color: #F8F5FF; padding: 0.5rem 0; } .timeline-sidebar h3 { padding: 0 0.5rem; } .timeline-container .line { position: absolute; top: 25px; left: 27.5px; width: 1px; background-color: rgba(49, 40, 59, 1); transform: translateX(-50%); } .timeline-container .year-object { margin: 12px 0; padding: 6px 15px 0 40px; } .timeline-container h4.year { transform: translateX(-34px); padding: 6px; margin-bottom: 6px; z-index: 1; background-color: rgba(248, 245, 255, 1); position: relative; } .timeline-container .timeline-object .timeline-date { position: relative; } .timeline-container .timeline-object .timeline-date:empty:before { transform: translate(-8px, -50%) translateY(0.6rem); } .timeline-container .timeline-object .timeline-date:before { position: absolute; content: ""; display: block; top: 50%; right: 100%; transform: translate(-8px, -50%); width: 9px; height: 9px; background-color: rgba(82,0,255, 1); border-radius: 50%; z-index: 1; }

The internet is everywhere now, so it’s harder to use totalizing metaphors that describe it as a separate space. The division between physical space and the internet posited by “cyberspace” — digital dualism, as Nathan Jurgenson calls it — was always dubious, but it’s especially hard to maintain when you use Google Maps, Yelp, Uber, and other apps to navigate and interact with the world. People stumbling into things while looking at their phones is both a measure of them being “elsewhere” and a measure of how present the internet is in the physical world.

But ethereal, obfuscating metaphors persist. The Atlantic’s Rebecca Rosen traces “the cloud” back to the way early network engineers symbolized the unknown networks their systems hooked into. Largely thanks to Amazon, which launched its Elastic Compute Cloud service in 2006, the term is now used to describe any remote data storage and computing. The cloud is weightless and intentionally vague: your data is up there somewhere, in a better place, where you can forget about it. It’s in sharp contrast to the industrial reality of remote servers, which are gigantic, loud, and require tremendous amounts of energy.

“Big data” is often referred to as a torrent, a flood, or an ocean — a natural resource that must be harnessed. Rowan Wilken, a professor at the Swinburne University of Technology, worries that the metaphor obscures the fact that this data is often created by users.

“Almost anything about the internet is going to have metaphors that help you understand it, because otherwise it’s formless,” Cohen says. “And they’ll all have political implications.”

A history of metaphors for the internet

2023-05-17T16:54:00-04:00

From the web to virtual villages

The Web – 1990: Tim Berners-Lee decided to call his system of linked hypertext documents “The Worldwide Web” instead of the “Mine of Information” or the “Information Mesh,” which he also considered. Later, it would be crawled by “spiders,” though the spider metaphor never really caught on.
The Information Superhighway – 1991: Popularized by Al Gore as he pushed to expand and improve the national networking infrastructure, at the time used primarily by researchers. The information superhighway, in contrast, had stronger commercial associations. It also carried with it the assumption that it’s a public work and that activity on it could be regulated.
Virtual villages, cafes, flea markets, and parks – 1993: Tech journalist Howard Rheingold published The Virtual Community: Homesteading on the Electronic Frontier, likening the message board he was dialing into to a “virtual village.” Like other people trying to convey the communitarian aspects of the internet, he also likened the subcultures and groups to cafes, flea markets, Hyde Park, and other public gathering places.

Cyberspace and a series of tubes

Cyberspace – 1996: Compared to the highway metaphor, cyberspace came to represent a more anarchic vision of the internet, an imagined virtual region separate from the physical world. “Governments of the Industrial World, you weary giants of flesh and steel, I come from Cyberspace, the new home of Mind,” wrote John Perry Barlow in “A Declaration of the Independence of Cyberspace.” The media often treated it as a scary place, warning of cybercrime and cybersex, and depicting it as a landscape made of neon numbers.
Fast lanes, slow lanes – 1997: Tim Wu coined the term “net neutrality” in 2003, but in 2006, he used an extended highway metaphor to explain why it was good. “How would you feel if I-95 announced an exclusive deal with General Motors to provide a special ‘rush-hour’ lane for GM cars only? That seems intuitively wrong,” Wu wrote in Slate. The fast-lane metaphor continues to be the primary way net neutrality is discussed, at least by those advocating for an open internet.
A series of tubes – 2006: “The internet is not a big truck,” said Sen. Ted Stevens in a fumbling attack on net neutrality. “It’s a series of tubes.” Stevens was widely mocked, but he wasn’t wrong. The truth is that “tubes” is probably a more accurate description of the internet — of its physical nature, anyway — than most of the metaphors discussed here. The internet is certainly more tubes than clouds.

“So much of the internet has been branded,” says Cohen, “what’s interesting now is what different brands end up with as metaphors.”

Clouds and town squares

Clouds – 2006: Amazon launches Elastic Compute Cloud, beginning its domination of the remote computing industry and the ubiquity of “cloud” everything. Cisco wants to turn its routers into data-gathering hubs, allowing computation to be done more locally. Naturally, it calls its system “fog computing.”
The Stream – 2009: “The stream is winding its way throughout the Web and organizing it by nowness,” wrote TechCrunch’s Erick Schonfeld. Facebook, Twitter, Digg, Google Reader, and other services were adopting reverse chronological real-time feeds. Metaphors are often both verbal and physical. Email uses the metaphor of mail, for example; it also uses the interface of folders to organize it.
Town square – 2011: With the Arab Spring, Twitter, in particular, was described as the “global town square,” a semi-public space where people could be heard and organize protests. Dick Costolo ran with the metaphor, saying that Twitter’s ability to both broadcast and have back-and-forth exchanges had brought back the Greek Agora.

${object.year}

`; object.objects.forEach((item) => { var timelineObject = document.createElement("div"); timelineObject.classList.add("timeline-object"); timelineObject.innerHTML = `

${item.date}

${item.text}

“Almost anything about the internet is going to have metaphors that help you understand it, because otherwise it’s formless,” Cohen says. “And they’ll all have political implications.”

Now everyone claims Microsoft will build a data center on the Foxconn land in Wisconsin

2023-04-07T12:50:02-04:00

Foxconn’s dome in Wisconsin, with the “high performance computing” shipping container next to it. | Image: Nilay Patel

The Foxconn land in Wisconsin may finally be used for a meaningful technology project — but not by Foxconn.

Last week, the village board of Mount Pleasant voted to allow Microsoft to build a data center on land previously cleared for the Foxconn LCD fab that never arrived. Microsoft will buy the land for $50 million, some of which will be used to reimburse Foxconn for releasing its rights to the land. It does not appear that Foxconn will play any part in operating the data center itself.

The announcement came days before an election in which the board incumbents narrowly defeated challengers critical of the Foxconn deal.

Foxconn spokesperson Rusty Schultz declined to comment on the record, instead pointing to an unattributed statement published by other outlets that ambiguously suggests Foxconn is partnering with Microsoft in some way.

Microsoft will be eligible for $5 million in tax credits per year

Microsoft will build the $1 billion data center on a 315-acre parcel previously allotted to Foxconn, with construction starting no later than 2026, according to the development agreement. The project will help the Village pay down some of the debt it took on to prepare the site for Foxconn’s never-built LCD factory. Microsoft will be eligible for $5 million in tax credits per year based on improvements it makes to the land. Neither Microsoft nor the Village has said how many jobs will be created, but given the highly automated nature of data centers, it’s likely to be very few.

Microsoft president and chief legal officer Brad Smith is from Appleton, Wisconsin, and the company has significant investments in the state, most notably in the Titletown Tech project in Green Bay. But the company is tight-lipped about its plans in Mount Pleasant.

“Our data center campus investment plans with the Village of Mount Pleasant and Racine County are part of Microsoft’s long-term commitment to the local communities in Wisconsin. We look forward to our work there,” the company’s Frank Shaw said in a statement. The company declined to add any additional information or make anyone available to answer questions about the project.

The village board took Microsoft’s arrival as vindication of the Foxconn deal. “Microsoft was attracted to this location because it is primed for development,” said Village President David DeGroot in an emailed statement.

DeGroot had been facing a tough reelection challenge from Kelly Gallaher, a vocal critic of the Foxconn deal. Gallaher was running with Kim Mahoney, another critic and the last remaining holdout at the Foxconn site after the Village cleared the area through eminent domain. Mahoney finally settled with the Village for $950,000 and moved away late last year. Both Gallaher and Mahoney lost their bids this week by several hundred votes.

Foxconn has pivoted repeatedly in the five years since then-President Trump announced plans for a 20 million square foot LCD fab he called the “eighth wonder of the world.” After it quickly became apparent that an LCD factory did not make economic sense, Foxconn announced it would build robot coffee kiosks, servers, and other ideas that never came to fruition. During the height of the pandemic, it even announced that it would manufacture ventilators in Wisconsin with Medtronic, another plan which went nowhere. Internally, Foxconn pursued everything from dairy exporting to fish farming to make good on its investment in the state. Foxconn has also long insisted that a large glass orb on the site is actually a data center, even though it is actually office space and a conference center.

Foxconn insists large glass orb on the site is actually a data center, even though it is actually office space

In 2021, Wisconsin and Foxconn agreed to scale down their contract to reflect a far smaller project than the original fab. Under the new contract, Foxconn aims to create at least 1,454 jobs, rather than 13,000, and the state is on the hook for $80 million in credits rather than $3 billion. Foxconn has qualified for nearly $40 million of those credits so far and employed 768 people at the end of 2022, according to The Associated Press. Precisely what those people are doing remains unclear. Foxconn has said it is manufacturing servers, but it has also refused access to local journalists.

The debacle has proved costly to Mount Pleasant, which bulldozed dozens of homes in order to clear land for the project. According to an investigation by the Milwaukee Journal Sentinel earlier this year, the debt taken on by the Village amounts to 500 percent of its operating revenue. Village taxpayers have also paid $167 million to various contractors and vendors, including to Claude Lois, the politically connected consultant overseeing the project for $28,000 per month.

How Kindle novelists are using ChatGPT

2022-12-24T12:00:00-05:00

Earlier this year, I wrote about genre-fiction authors using AI in their novels. Most wrote for Amazon’s Kindle platform, where an extremely rapid pace of publishing, as fast as a book a month, is the norm. AI helped them write quickly, but it also raised complex aesthetic and ethical questions. Would the widespread use of AI warp fiction toward the most common conventions and tropes? What parts of the writing process can be automated before the writing no longer feels like their own? Should authors have to disclose their use of AI?

With the debut of ChatGPT, many of the questions these writers were dealing with have become more urgent and mainstream. I checked back with one of the authors, Jennifer Lepp, who writes in the cozy paranormal mystery subgenre under the pen name Leanne Leeds, to see how she was thinking about AI now. She’s still using the GPT-3-based tool Sudowrite — in fact, she is now paid to write tips on using it for the company’s blog — and has begun incorporating some of the more recent tools into her fiction. We spoke about what it’s been like working with ChatGPT, how its debut has roiled the independent author community, and other topics.

When we spoke last time, you had gone through an evolution of using Sudowrite, first mostly as a sort of thesaurus, then experimenting with incorporating its text into your work, then letting it lead you and having an alienating experience with that and reining it back and using it primarily to flesh out descriptions you’d outlined. What’s your process like now?

Well, I had hoped that it would help me write two books at the same time, and that failed spectacularly. Apparently, I’m still connected to my own writing. So, on the one hand, that was good.

You thought that it could allow you to toggle back and forth and write two books simultaneously?

I figured, Hey, if I don’t know what to write, I’ll just pop something in there and it will get me going, and I’ll be right back into the book I left a week ago. It didn’t quite work out that way. If I didn’t know what I was doing, it didn’t matter what it spit out at me. It wasn’t going to help me reconnect with material I already wrote.

You and a few other independent authors were early adopters of these tools. With ChatGPT, it feels like a lot of other people are suddenly grappling with the same questions you were confronting. What’s that been like?

I definitely am still grappling, and I think I’m grappling a little bit more publicly. For the most part, people before had kind of rolled their eyes — I don’t think they understood what people were using AI for. ChatGP3 exploded that. Every group, every private, behind-the-scenes author group I’m in, there’s some kind of discussion going on.

Right now, everybody’s talking about using it on the peripherals. But there seems to be this moral chasm between: “It does blurbs really well, and I hate doing blurbs, and I have to pay somebody to do blurbs, and blurbs isn’t writing, so I’m going to use it for blurbs.” Or “Well, I’m going to have it help me tighten up my plot because I hate plotting, but it plots really well, so I’m going to use it for that.” Or “Did you know that if you tell it to proofread, it’ll make sure that it’s grammatically correct?’

“Every private, behind the scenes author group I’m in, there’s some kind of discussion going on.”

Everybody gets closer and closer to using it to write their stuff, and then they stop, and everybody seems to feel like they have to announce when they’re talking about this: “But I do not ever use it its words to write my books.”

And I do. It doesn’t drive my plot. It doesn’t generally drive any of the ideas in my books. It doesn’t create characters. But the actual words, just to get them down faster and get it out, I do. So I’ve found myself in the past couple of weeks wondering, do I engage in this debate? Do I say anything? For the most part, I’ve said nothing.

What do you think the line is that people are drawing?

It’s a concern of plagiarism. Everybody knows that they crawled stuff with permission and without permission.

And there’s an ethical question. I can go in and — right now, I’m listening to Jim Butcher’s audiobooks. I love his tone. I love the deadpan snark. So I went into the AI when I was thinking about trying to get something like that with a character and said, “Rewrite it in the style of Jim Butcher.” Bam! The same kind of deadpan, urban fantasy phrasing.

Well, where did it get that? It’s almost exactly the same argument and the same fear that’s going on with visual artists. It’s just much more obvious in the artist community. I have three authors that I’ve read extensively, indie authors that I’m friends with, and I know they never gave permission for their stuff to be looked at, and I was able to reasonably recreate their style.

Do you see a line between using AI for something like a description and using it to mimic another author’s voice?

Yeah. That I won’t do. That, for me, is an ethical line. I may like Jim Butcher, and I may wish to God I could write like him, but I’m not going to take my stories and have them rewritten in his voice to rip him off.

But you could, if you were ethically okay with that, with this technology and what it allows you to do.

Have you incorporated ChatGPT into your work?

Right now, I use it for titles and plots — specifically mystery plots. And blurbs.

I basically started out by just telling it who I am and what I need. “I am writing a paranormal mystery that takes place in the small town of Table Rock, Texas. It has a female amateur sleuth. This is her name. I need a murder victim. I need how they were killed. I need four murder suspects with information about why they’re suspected and how they are cleared. And then tell me who the guilty killer is.”

And it will do just that. It will spit that out.

“It seems to understand what I’m asking for.”

What are some of the things that it’s given you?

Right now, I have [plots for] books two, three, four, five, six, and seven, and all of those murder mysteries were ChatGPT-generated, though I edited some of them. The impressive thing about it is that if I tell it that it’s a cozy mystery and I tell it that it needs to be humorous, it seems to understand what I’m asking for. The names that it gives me for the suspects are cutesy. The reasons behind it are never gory or serious.

You feel like you can automate that part of it and still feel in control of the story?

There are two parts of a cozy mystery. There’s the murder, and the murder is the thing that all of the characters revolve around. But the murder, to me, tends to be less important than all the revolving. So there has to be a murder, and it has to be amusing and funny and give reasons for mayhem and strangeness. But what it is is almost inconsequential to the plot, even though it’s the thing that drives everything.

“The progress is so incredibly fast, and so a few questions have really been answered.”

You mentioned over email that you were using AI for book covers.

I didn’t do the whole cover on DALL-E, but on the seventh book that I had, I had sketched out a plot that involved a Lykoi cat. It’s a cat that is so ugly it’s cute. It’s apparently a fairly new breed that was like a crossbreed between a cat with hair and a hairless cat. And so it’s got hair in some places, and it looks like a werewolf.

So I would have had to find a photographer that could do a shoot, find a Lykoi cat, pay everybody to get the image and the cover that I needed. That’s expensive. So on a lark, I was like, Huh? I wonder…

And I went to open my account, jumped into DALL-E. Boom! For me, it saved so much time and money, and the cover looks great, but a photographer didn’t get paid, right? Somebody who wanted to pose their cat didn’t get paid.

How do you see these tools and the way writers use them evolving?

I’m really just stuck in the middle, wondering which way it’s going to go. I definitely don’t want to encourage people who aren’t comfortable using it to use it. I do think it’s going to leak into their lives. It’s already leaking into all our other software, so I think it’s going to be very hard to get away from. But I definitely don’t know where it’s all going. ChatGPT shocked the hell out of me. I had thought, well, it’ll take three or four years, and it’ll get better. Then came ChatGPT, and oh my god, that’s so much better! It’s been six months! The progress is so incredibly fast, and so few questions have really been answered.

The interview has been condensed and edited.