AI models don’t need publishers’ data



Sam Altman, CEO of OpenAI, attends the 54th annual meeting of the World Economic Forum in Davos, Switzerland, on Jan. 18, 2024.

Denis Balibouse | Reuters

DAVOS, Switzerland — Sam Altman said he was “surprised” by The New York Times’ lawsuit against his company, OpenAI, saying its artificial intelligence models didn’t need to train on the publisher’s data.

Describing the legal action as a “strange thing,” Altman said OpenAI had been in “productive negotiations” with the Times before news of the lawsuit came out. According to Altman, OpenAI wanted to pay the outlet “a lot of money to display their content” in ChatGPT, the firm’s popular AI chatbot.

“We were as surprised as anybody else to read that they were suing us in The New York Times. That was sort of a strange thing,” the OpenAI leader said on stage at the World Economic Forum in Davos on Thursday.

He added that he isn’t that worried by the Times’ lawsuit, and that a resolution with the publisher isn’t a top priority for OpenAI.

“We are open to training [AI] on The New York Times, but it’s not our priority,” Altman said in front of a packed Davos crowd.

“We actually don’t need to train on their data,” he added. “I think this is something that people don’t understand. Any one particular training source, it doesn’t move the needle for us that much.”

The Times sued both Microsoft and OpenAI late last year, accusing the companies of copyright infringement through the use of its articles as training data for their AI models.

The news outlet seeks to hold Microsoft and OpenAI accountable for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.”

In the suit, the Times showed examples in which ChatGPT spewed out near-identical versions of the publisher’s stories. OpenAI has disputed the Times’ allegations.

Ian Crosby, a partner at Susman Godfrey who’s representing The New York Times as lead counsel, said Altman’s commentary about the lawsuit shows OpenAI is admitting to using copyrighted content to train its models and effectively “free riding” on the paper’s investments in journalism.

“OpenAI is acknowledging that they have trained their models on The Times’ copyrighted works in the past and admitting that they will continue to copy those works when they scrape the Internet to train models in the future,” Crosby said in a statement emailed to CNBC on Thursday.

He called that practice “the opposite of fair use.”

The legal action has ignited worries that more media publishers could go after OpenAI with similar claims. Other outlets are looking to partner with the firm to license their own content, rather than battle it out in court. Axel Springer, for instance, has a deal under which it licenses its content to OpenAI.

OpenAI responded to the Times’ lawsuit earlier this year, saying in a statement that “regurgitation,” or spitting out entire “memorized” parts of specific pieces of content or articles, “is a rare bug that we are working to drive to zero.”

In that same statement, the AI developer said that it works to collaborate with news organizations and create new revenue and monetization opportunities for the industry. “Training is fair use, but we provide an opt-out because it’s the right thing to do,” the company said.

Altman’s comments echo remarks the AI leader made at an event organized by Bloomberg in Davos earlier this week. At that event, Altman said that he wasn’t that worried about the Times’ lawsuit, disputed the publisher’s allegations and said there would be plenty of ways to monetize news content in the future.

“There’s all the negatives of these people being like … don’t do this, but the positives are, I think there’s going to be great new ways to consume and monetize news and other published content,” Altman said.

“And for every one New York Times situation, we have many more super productive things about people that are excited to build the future and not do the theatrics.”

Altman added there were ways that OpenAI could tweak the company’s GPT models so that they don’t regurgitate any stories or features posted online word for word.

“We don’t want to regurgitate someone else’s content,” he said. “But the problem is not as easy as it sounds in a vacuum. I think we can get that number down and down and down, quite low. And that seems like a super reasonable thing to evaluate us on.”

