Category: 3. Business

  • Zuckerberg to testify in landmark social media trial

    Zuckerberg to testify in landmark social media trial

    Facebook founder Mark Zuckerberg has been ordered to testify in a landmark trial in the US over the impact of social media on young people.

    Los Angeles County Superior Court Judge Carolyn Kuhl this week rejected the argument his company, Meta Platforms, had made that an in-person appearance was not necessary.

    Her order also applies to Snap boss Evan Spiegel, as well as Adam Mosseri, who leads Meta-owned Instagram.

    The trial, expected in January, is among the first to advance from a wave of litigation accusing social media companies of making their apps addictive and enticing to young people despite being aware of mental health and other risks.

    Meta did not respond to a request for comment.

    Law firm Kirkland & Ellis, which is representing Snap, said the decision did not “bear at all” on the truth of the claims and that they “look forward to the opportunity” of why they believe the “allegations against Snapchat are wrong factually and as a matter of law”.

    Hundreds of claims brought by parents and school districts were consolidated into one case before the Los Angeles County Superior Court in 2022.

    They accuse the companies of having ineffective parental controls and weak safety features, and also say that alerts for “likes” and other responses keep young people tied to the platforms.

    Meta and Snap have contested the claims, which are similar to those in a separate, but similarly sprawling federal case. TikTok and YouTube, owned by Alphabet, are also named in the suits.

    In seeking to have both cases dismissed, the tech companies have argued that federal law protected them from responsibility for content on their platforms. Under a law passed in the 1990s, they are not liable for what people say or post on their services.

    But the Los Angeles judge said the companies must still face claims of negligence and personal injury stemming from the apps’ designs.

    Lawyers representing young people and their parents argue that the companies decided not to make changes because they were concerned about the hit to the business.

    Meta had said Zuckerberg and Mosseri had already submitted to questioning as part of the case and that appearing in person represented “a substantial burden” and would “interfere with business”.

    But Judge Kuhn wrote that hearing directly from the heads of the company was key to evaluating those claims.

    “The testimony of a CEO is uniquely relevant,” Judge Kushn said, as their “knowledge of harms, and failure to take available steps to avoid such harms” could help prove negligence.

    Beasley Allen, one of the firms law firms leading the litigation against the social media companies, was pleased with the ruling.

    “We are eager for trial to force these companies and their executives to answer for the harms they’ve caused to countless children,” it said in a statement.

    Social media companies have been facing growing legal and political pressure arising from concerns about the impact of the apps on young people’s mental health.

    Testifying before Congress on the issues last year, Zuckerberg said his company took the issues seriously and defended the protections it had in place, while distancing it from responsibility.

    “The existing body of scientific work has not shown a causal link between using social media and young people having worse mental health,” he said.

    Instagram last year started rolling out special “teen accounts” in response to concerns.

    It updated that system earlier this month, adding a default setting that screens out content guided by a system similar to movie ratings. It also said parents could opt for stricter controls.

    Reporting contributed by Lily Jamali

    Continue Reading

  • International Wrap-Up: Six Pride players called up for October international duty

    International Wrap-Up: Six Pride players called up for October international duty

    With a week break in NWSL play before the final game of the seaosn, six Orlando Pride players will be representing their respective countries in international friendlies around the world.

    Here’s how to follow all of the action to come:

    Schedule (All times in ET)

    Anna Moorhouse | England | International Friendlies

    • Saturday, October 25, 12:30 p.m. – England vs. Brazil
    • Tuesday, October 28, 2 p.m. – England vs. Austrailia

    Jacquie Ovalle | Mexico | International Friendlies

    • Thursday, October 23, 10 p.m. – Mexico vs. New Zealand
    • Sunday, October 26, 7 p.m. – Mexico vs. New Zealand

    Angelina | Brazil | International Friendlies

    • Saturday, October 25, 12:30 p.m. – England vs. Brazil
    • Tuesday, October 28, 1:15 p.m. – Italy vs. Brazil

    Emily Sams | United States | International Friendlies

    • Thursday, October 23, 7 p.m. – United States vs. Portugal
    • Sunday, October 26, 7 p.m. – United States vs. Portugal
    • Wednesday, October 29, 8:00 p.m. – United States vs. New Zealand

    Grace Chanda | Zambia | International Friendlies

    • Wednesday, October 22, 9 a.m. – Zambia vs. Namibia
    • Sunday, October 26, 9 a.m. – Zambia vs. Namibia

    Zara Chavoshi | Canada | International Friendlies

    • Friday, October 24, 1:30 p.m. – Switzerland vs. Canada
    • Tuesday, October 28, 2:45 p.m. – Netherlands vs. Canada


    Continue Reading

  • Stablecoins: Issues for regulators as they implement GENIUS Act

    Stablecoins: Issues for regulators as they implement GENIUS Act

    The GENIUS Act creates the U.S. regulatory framework for payment stablecoin issuers to operate in the U.S. or for foreign entities to offer stablecoins to U.S. residents. While the GENIUS Act clarifies much, financial regulators must now write rules that will determine if stablecoins can gain trust and contribute to a more efficient, lower cost payments system, or if they will continue to be used mainly for crypto trading and for users in some countries to more easily access U.S. dollars.  

    The GENIUS Act makes clear that payment stablecoins are neither a security nor a national currency, nor do they have deposit insurance or access to Federal Reserve payment services. The law says that payment stablecoin issuers can be nonbank entities (federal or state qualified) or subsidiaries of insured depository institutions (IDIs). Issuers’ activities are restricted to offering and redeeming stablecoins and ancillary activities.  

    This article highlights four key issues for regulators as they implement the GENIUS Act:

    First, financial regulators need to write capital, liquidity, and risk management requirements for issuers within 18 months to ensure stable value, and to protect financial stability and the singleness of money (to the maximum extent possible without central bank settlement).

    Second, the Treasury, Federal Reserve (Fed), and Federal Deposit Insurance Corporation (FDIC) need to set conditions under which nonfinancial companies can issue stablecoins to avoid excessive concentration of economic power or increase financial stability risks.

    Third, the Treasury needs to make sure that the “comparable regulatory regime” for foreign issuers to offer stablecoins to U.S. residents is as strong as the rules for domestic issuers to reduce arbitrage incentives.

    Fourth, FinCEN, a bureau of the Treasury responsible for implementing and enforcing compliance with Anti-Money Laundering/Countering the Financing of Terrorism (AML/CFT) rules, needs to write new regulations to ensure issuers have the technological capacity to counter illicit finance.       

    Stablecoin activity has been growing rapidly on a global basis. U.S. dollar-backed stablecoins (backed by reserve assets and not an algorithm) reached more than $260 billion in the third quarter of 2025, with Tether’s USDT accounting for more than one half of the total, and Circle’s USDC showing the fastest growth since the end of 2020. In addition, monthly transactions have risen to more than $1 trillion, about 10 times the volume at year-end 2020. While stablecoins can reduce the costs of cross-border remittances and multinational corporate cash management, this type of activity appears to still be limited.

    In addition to reducing transaction costs, stablecoins, other digital dollars such as tokenized bank liabilities, as well as non-digital (near) real-time fast payments could create significant benefits for the economy by reducing the hidden costs of inefficient payments.1 Payment transactions overall could increase if more efficient payments enable the valuable transactions that did not occur because of high fixed costs and lengthy clearing and settlement processes. These include small payments or short-term loans to households and small businesses that are impractical when fixed costs are high and value is not transferred quickly. Instantaneous, programmable payments on a 24/7 basis also could better meet future demand for “on chain” payments as securities become tokenized or as agentic AI pays in real time for its purchases of data and computing power as it generates content.

    At the same time, stablecoin issuance could be concentrated because of network externalities of payment systems. If rules are weak and stablecoins fail to gain trust, they could be out-competed over time, while along the way increasing risks to financial stability, fracturing the monetary system, and facilitating money laundering and terrorist financing. 

    Capital, liquidity, and risk management requirements

    The defining characteristic of stablecoins is their use as money. This relies on an issuer’s ability to trade stablecoins for dollars, in full and on time. The GENIUS Act provides for stable value and liquidity by limiting the types of assets that can back a stablecoin on at least a one-to-one basis and requires monthly disclosures and certification of audited financial statements. However, a significant problem with GENIUS is that the permissible reserve assets include uninsured deposits in banks and shares of credit unions. But uninsured deposits are risky and illiquid, raising the possibility that stablecoins backed by these assets will not offer a stable value and will be prone to runs—unless these risks aremitigated by capital, liquidity, and risk management standards. Moreover, allowing stablecoins to hold uninsured bank deposits creates direct two-way interconnections between the risk of banks and the risk of stablecoins. 

    Without appropriate standards to reduce runs and limit interconnections, stablecoins could create the type of fire sale dynamics from prime money market mutual funds (MMFs) during the Global Financial Crisis. Prime MMFs promised stable value and were allowed to invest in risky deposits and commercial paper of financial companies. In 2007-2008, when financial firms began to suffer losses from subprime mortgages, prime MMFs could not maintain par. Without capital to offset the losses, investors ran, and the Reserve Primary Fund was forced to close, leading to a breakdown in broader short-term funding markets and threatening market and credit intermediation. In addition, accounting standards allowed for prime MMFs to be held as a cash-like asset. Companies that had prime MMFs in cash-like operational accounts liquidated their shares on concerns they would not be able to access funds to meet regular payroll and other expenses, further exacerbating the runs. The Treasury and the Fed had to step in.  

    Currently, the largest stablecoin issuers hold a significant share of assets that are not risk-free or liquid. Recent financial statements indicate that Circle USDC had almost 14% of its portfolio in the deposits of regulated financial institutions, likely exceeding the deposit insurance limit of $250,000 per account. For Tether USDT, about 20% of reserve assets were in assets that were not cash and cash equivalents, and included secured loans, Bitcoin, precious metals, and other investments.

    To reduce financial stability risks and to build trust in stablecoins, financial regulators should require stablecoin issuers that want to hold more uninsured deposits of financial institutions to have higher capital and liquidity standards than issuers that are invested mainly in safer and more liquid assets. The GENIUS Act allows for this—the federal regulator can “include capital buffers that are tailored to the business model and risk profile” and “are necessary to ensure the ongoing operations;” that is, to ensure the financial integrity and the ability of the issuer to meet the financial obligations, including redemption. The law also allows regulators to require asset diversification. Establishing appropriate requirements would go a long way to preventing stablecoins from being an information-sensitive asset (see Gorton and Zhang, 2023), like bank notes of the 19th century wildcat banking era. During that time, banks issued their own bank notes, but transactions were inhibited by the ability of holders of these bank notes to assess their fluctuating values, and many banks failed.

    In addition, GENIUS does not prohibit permitted payment stablecoins from being counted as a cash or cash-equivalent asset on a corporate balance sheet. Accounting standard-setters should restrict stablecoins from being categorized as a cash-equivalent asset if prudential requirements are insufficient to prevent runs. 

    A second problem with the permissible reserve assets is that GENIUS allows for both Treasury repo and Treasury reverse repo transactions—that is, an issuer can both borrow and lend based on Treasury collateral. The inclusion of Treasury repo transactions appears to be to allow stablecoin issuers to access funds without having to sell their Treasury securities to meet liquidity needs. Instead, the issuer could borrow under a repurchase agreement—sell the Treasury security and receive cash with an agreement to buy back the security the next day. (This is in contrast to a reverse repo transaction where the issuer could lend to a counterparty based on Treasury collateral received.)

    Permitting Treasury repo as a reserve asset could pose significant problems if a stablecoin issuer were to fail. Under GENIUS, reserve assets are required to be held in segregated accounts to allow consumers access to their payment stablecoins even if the stablecoin issuer were to fail. But in bankruptcy, repo is exempt from the automatic stay that freezes a borrower’s assets and prevents collections once bankruptcy proceedings have been initiated. That means the repo lender could seize the Treasury collateral if the stablecoin issuer could not return the cash. In this event, assets in segregated accounts would fall short of what is needed for 1:1 redemption.

    To address this risk, regulators should not permit the securities available for repo transactions to be included in the amount needed for 1:1 backing and held in segregated accounts. If repo borrowing is truly needed to provide cash to meet short-term liquidity needs on a regular basis, regulators should also set stricter liquidity standards for these issuers.    

    Nonfinancial stablecoin issuers

    An important issue is the conditions under which nonfinancial companies will be permitted to issue payment stablecoins. The GENIUS Act does not strictly protect the separation of banking and commerce, which has been in place in the U.S. since at least the National Banking Act of 1863-64 to avoid inefficient credit allocation and excessive concentration of economic power. 

    GENIUS allows for a publicly traded nonfinancial company (a company not engaged in one or more financial activities, as defined by section 4(k) of the Bank Holding Company Act) to be a payment stablecoin issuer if the Stablecoin Certification Review Committee (SCRC) determines unanimously that it will not pose a material risk to the safety and soundness of the banking system, the financial stability of the U.S., or the Deposit Insurance Fund; and the company will comply with limitations on use of nonpublic personal information and anti-tying provisions as specified in the GENIUS Act. The SCRC is composed of the Secretary of the Treasury, the Chair of the Federal Reserve Board (or, if the Chair delegates, the Vice Chair for Supervision), and the Chair of the FDIC. 

    While Facebook abandoned its plans to issue a stablecoin called Libra (later re-formulated as Diem), concerns persist as large technology and retail firms consider issuing stablecoins. Serious risks to economic and monetary stability could arise if nonfinancial firms can take advantage of existing private data from their platforms or raise the costs of switching to other payment products. In addition, because payments systems build on network externalities—the system is more valuable the more participants there are—it is likely there will be only a few stablecoin issuers in a concentrated market. The SCRC should seek public input on any determinations. In addition, a glaring loophole is that privately held nonfinancial companies are not prohibited from issuing stablecoins. Regulators can’t fix this; only Congress can.  

    Foreign stablecoin issuers

    The Secretary of the Treasury has the authority to issue a safe harbor to a foreign issuer of stablecoins to offer stablecoins to U.S. residents if they are subject to comparable foreign regulations. The Secretary would need to submit a justification for any safe harbor determination to the chairs and ranking members of the Senate Banking Committee and the House Financial Services Committee. 

    Regulators should provide input into what frameworks would meet “comparable foreign regulation” to avoid arbitrage and to ensure that the provisions of GENIUS to protect consumers, mitigate risks to financial stability, and enforce compliance with rules against money laundering and terrorist financing would be preserved.     

    Preventing use for illicit finance

    Additional input from regulators is needed to address the risks that stablecoins could be used increasingly for illicit finance activities. GENIUS clarifies that a payment stablecoin issuer is subject to Bank Secrecy Act requirements; that stablecoins offered by a foreign issuer may not be offered for trading in the U.S. by a digital asset service provider (DASP) unless the issuer has the technological capability to comply and does comply; and instructs the Secretary of the Treasury to seek public comment to identify methods and techniques to detect illicit finance activity. In addition, FinCEN is required to issue within three years public guidance and rules on implementation of innovative or novel methods to detect illicit activity and tailored risk management standards. 

    Treasury issued a Request for Comment on Innovative Methods to Detect Illicit Activity Involving Digital Assets in August 2025 to get public input on innovative or novel methods, techniques, or strategies that regulated financial institutions use, or could potentially use, to detect illicit activity involving digital assets. In particular, Treasury asks commenters about application program interfaces, artificial intelligence, digital identity verification, and use of blockchain technology and monitoring.2 Companies are creating and improving software and data so that stablecoin issuers and DASPs can better identify illicit activities and take actions to freeze and block accounts engaged in these activities. FinCEN and federal and state regulators will need to ensure that such services are being used appropriately or developed internally to comply with high AML/CFT standards.

    Conclusion

    GENIUS is a landmark bipartisan effort that balances the potential benefits of stablecoins as a means of payment or settlement while mitigating serious risks for financial stability and the monetary system. Instantaneous programmable payments on a 24/7 basis could create significant economic value by enabling valuable transactions that did not occur because of high costs and long settlement periods and to meet future demands for “on chain” digital payments. The rules and standards that now need to be written to implement GENIUS are critical to determine if stablecoins will achieve their potential and have a place in a future payment system as other types of payment instruments also continue to develop. 

    Continue Reading

  • Securing the AI Frontier: Irregular Co-founder Dan Lahav

    Contents

    Intro

    Dan Lahav: There was a scenario where there was an agent-on-agent interaction. I won’t say the names, but you can kind of think about it like a Claude, a Gemini. And it was a critical security task, that was the simulation that they were in. But after working for a while, one of the models decided that they’ve worked enough and they should stop. It did not stop there. It convinced the other model that they should both take a break. So the model did social engineering on the other model—to another model. But now try to think about the situation where you actually, as an enterprise, are delegating an autonomous workflow that is critical to you to complete, and the more complicated and capable machines are ultimately going to be, the more of these weird examples we’re going to encounter.

    Dean Meyer: Today on Training Data, we dig into the future of frontier AI security with Dan Lahav, founder of Irregular. Dan challenges how we think about security in a world where AI models are not just tools, but autonomous economic actors. He explains why the rise of AI agents will force us to reinvent security from first principles, and reveals how the very nature of threats is shifting from, say, code vulnerabilities to unpredictable emergent AI behaviors. Dan also shares surprising real-world simulations where AI models outmaneuver traditional defenses, and why proactive experimental security research is now essential. His view is that in a world where more economic value will shift to human on AI or AI on AI, solving these problems is paramount. Enjoy the show.

    The Future of AI Security

    Dean Meyer:Dan. Wonderful to have you with us today.

    Dan Lahav: It’s a pleasure to be here.

    Dean Meyer: Awesome. So before we jump into questions, I will just say that it was very hard to get in front of Dan. I was trying to get in front of him for three months, probably thirty to forty emails, five or six people around us who we both knew closely were pinging him all the time, and he was still not responsive. And I basically learned where he was spending most of his time, and I kind of went …

    Sonya Huang: Did you stalk him?

    Dean Meyer: I kind of stalked him. I kind of stalked him. And eventually, we basically bumped into each other, like, not intentionally. And anyway, so we bumped into each other. I was like, “Dan, you know, you’re brilliant. I keep hearing great things. Please respond. Let’s find time. You know, we at Sequoia spend a lot of time in AI security.” And eventually, we found time the following week. So welcome, Dan. Thank you for everything.

    Dan Lahav: It seems that I’m going to have to start this podcast with an apology, sorry Dean, sorry Sonya, sorry the entirety of Sequoia. It indeed took time.

    Dean Meyer: It took time, but we partnered, and here we are. And you guys have done wonderful things. So it’s wonderful to have you with us today.

    Dan Lahav: Yeah, it’s a very, very happy ending and, you know, just, like, appreciate you and everyone here.

    Dean Meyer: Of course. Of course. Okay, so let’s jump into it. I’m going to start with a spicy question. As we recently saw, you partnered with OpenAI on GPT-5. And let’s kind of look forward a little bit. What does security look like in a world of GPT-10?

    Dan Lahav: Ooh, spicy and speculative, indeed. So let me wrap my head around that. So obviously everything I’m going to say is speculation, projection, but I think the way that we think about what’s going to come is trying to understand how we’re even going to produce economic value, and how organizations and enterprises and people are going to consume stuff in the world at the time of GPT-10 or Claude-10 or just like any one of the models.

    Let’s do a thought experiment to just, like, clarify why we actually believe that sometimes we think in the next two to three to five years there’s going to be a huge shift in the way that humans are even organizing themselves. As an outcome, security is probably going to be very different as well. So here’s the thought experiment: So imagine a situation where you work with OpenAI, and you go one generation up or two generations up and you tell your parents or grandparents that you’re doing work with Anthropic or with OpenAI or with Google DeepMind on security, I think their mind would go on to assuming that the work that you’re doing is probably providing a bodyguard service to Sam or to Dario or to Demis. Because the canonical security problem of a few decades ago, you know, it’s like our parents’, grandparents’ generation, was physical security, because the vast majority of economic activity was in the physical realm and not in the digital world.

    Dean Meyer: Yeah.

    Dan Lahav: And, you know, after the PC revolution and the internet revolution, we shifted the way that we are organizing and creating value. We transitioned primarily to a digital environment. And just think about how strong of a testimony it is of how many times you did an economic activity of value just by getting an email from someone that you may have not met. Just this morning I got an email from my bank activating me to do something from a person that I’ve never met, maybe it was a security person—that’s not a great thing to say openly, but just like we do that all of the time because that’s the way that we interact in society.

    So our view is that soon that’s going to happen again. And the reason is that AI models are getting gradually so capable that a lot of the economic activity of value is going to transition to human-on-AI interaction and AI-on-AI interaction. And that means that we may see soon a fleet of agents in an enterprise, or a human when they’re doing a simple activity like trying to draft a Facebook post, taking a collection of different AI tools in order to just promote that activity that they’re doing. And we’re essentially embedding tools that are increasingly more capable, and we’re delegating them tasks that require more and more and more and more autonomy in order to drive meaningful parts of our lives. So we’re transitioning from an age where software is deterministic and is fully understood end to end, to an age where this is no longer the case. And as an outcome, enterprises themselves, or just how we interact with the world, is going to go to a fundamental change, and it’s clear that security is just not going to be the same.

    As an interesting analogy, think about Blockbuster—may it rest in peace—and Netflix, the current version of Netflix. They both, if you think about it, give the exact same value to the consumer. Both allow you to list units of content for your pleasure and entertainment. But clearly and intuitively, security for Netflix and security for Blockbuster is not the same. Like, one was a chain that organized—you need to go and just physically rent a DVD. And another one is much more of modern architecture where you’re just streaming stuff to your home. So even enterprises that are going to provide the exact same value in the near future, may have, like, a very, very different backend to how they’re shaped in this autonomous age that we’re entering, which makes it clear that security as a whole is going to be very, very, very different. And we need to recalibrate to just like an age of autonomous security that’s coming upon us.

    Sonya Huang: You were at our AI Ascent event earlier this year, right? Do you remember when Jensen Huang shamed everybody who was there for the fact that not enough people in the room were thinking about security in a world of agents. And I remember Jensen said something about how, you know, you can imagine that as these agents are allowed to act more autonomously in enterprises, you should expect orders of magnitude more security agents than the actual productive agents themselves watchdogging and shepherding this herd of agents effectively.

    Dan Lahav: So I’m biased still. I agree with Jensen. I think Jensen was the first person that I’ve met that was much more bullish on AI security than myself, because in our view, you need a collection of defense bots that are also going to be working side by side with capability bots in just the next generations of how enterprises are going to be created. But indeed, he gave a ratio that he thinks that it was going to be 100-1. And just like how many, just like defense and security bots are going to be required out of the assumption that secure by design in AI is not going to work. So I’m not sure that I agree with that part of the conclusion. I think that we can make significant progress on secure by design, specifically embedding defenses in the AI models themselves. That being said, we share the view that the future is going to be one where we’ll need to have a lot of just agents that are specifically for the task of monitoring other agents and making sure that they’re not going to step out of bounds.

    Dean Meyer: So maybe on that question, just to dive one layer deeper, what is the state of model cyber capabilities today, and how has that changed over the past 12 to 18 months?

    Dan Lahav: It’s a great question, and I actually think that the rate of change is the most relevant part here, because models are capable of doing so much more now than they were even capable of doing a quarter or two quarters before. So just to give an intuition, so this is now just like we’re entering the fourth quarter of 2025. At the beginning of the year, coding agents were not a widespread thing yet. The ability to do tool use properly was not just starting, but obviously much, much more nascent than it is right now. Reasoning models were only at the beginning as well. So just, like, think about all of the things that were added last year, and what they mean also for security elements.

    So what we’re seeing now is that the combination of coding being much better, models being able to have multimodal operations, tool use improving, reasoning skills improving, if you’re using models for offensive capabilities, we are seeing unlocks all of the time. Something that is now feasible that was not even feasible a quarter ago is proper training of different vulnerabilities and exploiting them in order to do much more complicated actions. So for example, if you have a website and you want to hack it on the application, a few months ago, if you needed to integrate a collection of vulnerabilities in order to perform an action of value—at least autonomously without a human being involved—models were unable to do that, even the state of the art models. That’s not the case anymore.

    So obviously that depends. It’s not a hundred percent success, and obviously that depends also on the level of complexity of vulnerabilities and the environment that you’re trying to hack. But we have seen huge spikes of just, like, being able to scan more and more complicated code bases, exploiting more complex vulnerabilities, training them in order to do these exploitations, et cetera. And, you know, just like the recent GPT-5 launch on security and on the offensive side specifically of what models are capable of doing, we have seen a significant jump in their ability to be able to be much more competent across a collection of skills that actually matter a lot around the cyber kill chain.

    AI Model Capabilities and Cybersecurity

    Dean Meyer: Can you tell us more about that? And obviously, there’s some things that are publicly available, others that are not, but at least on the scorecard and what OpenAI have shared in particular for GPT-5, what are some of the capabilities that you’ve seen that were surprising?

    Dan Lahav: We are seeing constant improvement on the ability of models to, for example, have situational awareness on whether they are in a network. And up until a few months ago, the beginning of the year, complete models were unable to do that. They were able to run some operations locally, but they were usually not having situational awareness over what’s happening and what they can activate, even in more limited and constrained scenarios as we put them in. And that’s not the case anymore. So we still sleep very, very easily at night because the level of sophistication is still somewhat limited. But we are finding ourselves trying to create more and more and more complicated scenarios just because there is a huge jump in being able to take more complicated context, as I said before, chain complicated vulnerabilities to one another in order to do multi-step reasoning and exploit. And these are all new skills that going one year back did not exist.

    Dean Meyer: You guys are trusted partners by many of the labs, including Anthropic, including OpenAI, including Google DeepMind. You work very closely with them for quite some time at this point. Why did you take the approach of working—kind of embedding yourselves within the labs as opposed to, I don’t know, selling directly to an enterprise right now?

    Dan Lahav: There are multiple companies that are doing AI security. We are pioneering a category of the market that we call “frontier AI security,” and we think it’s fundamentally different. And the core thing is actually very simple: The rate of progress and the rate of adoption of models change so many things at the same time that while traditional security tends to be somewhat reactive in nature, here we need a very aggressive, proactive approach. In markets that are dominated by a rate of innovation that is frankly unmatched, I think, unparalleled in human history, we think it’s more interesting to take a temporal niche of the market, that is to say, focus on the first group of people or organizations that are about to experience a problem—so the labs, because they are the contenders to create the most advanced and increasingly sophisticated AI models in the world—work very closely in order to just see firsthand the kinds of problems that are going to emerge and utilize that in order to have a clear and crisp understanding of what’s going to come six, twelve, twenty-four months ahead of time, such that we can be prepared at a moment where general deployers are going to need to be in a situation of embedding these advanced models and already have solutions that are going to be relevant for them.

    Sonya Huang: Given the rapid pace of progress in the foundation model side of the world, if you’re at one of these model companies—and I think the people there are sincere, they want to do good for the world, they now know their models are capable of being used for extreme harm and cyber attacks as well, what do you do about that conundrum? And I remember—so we’ve been working with OpenAI since 2021. I remember back in those days, every enterprise user of the API past some volume had to be manually approved for their use case in order to even access the API. It feels like the ship has sailed of anybody anywhere will be able to access some of these models. And so how can you make the models sort of secure by design if you’re in one of these foundation model seats right now?

    Dan Lahav: I think it’s a great question. One thing on the premise of the question, I think that at least right now, at the moment in time in which we’re in, the ability of models to actually do extreme harm, you know, it exists in potentially some use cases, but at least in cyber, I think we’re not there just yet. And that matters.

    And just to be really sharp on what I mean here, models can clearly be used in order to do harm, but there is a distinction between harm and extreme harm that should be made. Harm would be an example of using a model in order to fool the senior citizen and in order to just steal money from them, so just like scaling up phishing operations. That can happen easily right now. Extreme harm, in my view, would be something along the lines of taking down multiple parts of critical infrastructure in the United States at once, that you can take full cities off the grid, making hospitals not work. Models are not there yet.

    And that’s not me nitpicking on the question. I actually think it matters quite a lot, because how much time we have to prepare to a world where models that are that capable actually dominate the strategies that we can take on the defensive side. Because our view is that the first thing that we should do, just like a first order thing, is be able to monitor and have a view of what’s going to come, such that we’ll have an ability to have a much higher resolution discussion of which capabilities are progressing, at which pace they are progressing, should we expect them to continue to progress at this pace or accelerate in the future? And that dictates the order of and the priority of some defenses, when we’re going to embed them, whether we should embed them, et cetera. And if we get this wrong, I also think it’s unfair to the companies and to the world as well, because AI also has, like, so much potential to do good that if we deploy a lot of some defenses that may chip away from productivity ahead of time, we’re also doing real harm to just innovation and the world at large. And it’s a very delicate balance to strike.

    So I think just like the first order thing to do if you’re working inside of the labs is actually having and supporting a large ecosystem that can take the models and measure them and get to high resolution before this is even possible to do. The second bit is figuring out a defense strategy that is informed by exactly what’s happening and treating it almost like a regular science with experiments of just how to assess, how to do predictions, et cetera.

    There are some defenses that will require a degree of customization. For example, if you’re someone that is creating monitoring infrastructure, we’ll still need that. You may want to recalibrate some of your infrastructure to give you higher alerts that AI is going off the rails, for example. But there are some problems that are very easy to write about, but actually very hard to develop solutions for. For example, the service of just like customizing your monitoring software in order to prioritize alerts that are coming from your AI layer. How are you going to be able to understand when AI is doing something which is problematic? Occasionally you’re going to be able to run into that, but sometimes this may be—I think it’s like the entire subsection of the market which is anomaly detection, which is a huge subsection of security is going to have a big problem very soon, because anomaly detection is based on measuring a baseline, understanding what is a baseline and measuring against that baseline in order to see that something is an anomaly. But if you don’t have a crisp understanding of the baseline and how it should look like, you have an issue of understanding that something went badly.

    So as an outcome, there are some defenses that are going to operate as is and there are some defenses that we should just like recalibrate, either customize or create from scratch. Just because there is a lot of science to be done of understanding how models look when they’re under attack, understanding how it looks like when models are attacking something, and because I think we still have some time for the world that’s about to come, my recommended strategy would be: Invest a lot in creating robust tools that would allow you to have rigorous evidence of what’s going to come, such that you can have the discussion at high resolution, map which classic defenses are still relevant, understand where your gaps are at, and invest a lot in R&D in order to make sure that you have cost-effective defenses that you can roll out ahead of time before models are going to be deployed.

    Sonya Huang: I know one of the broader questions in frontier AI research today is: Can we actually understand the mind of a neural net and what’s happening? And so I’m wondering if we can actually detect the characteristics of when is the model starting to behave badly? And do you actually—can you understand that in the activations of the net?

    Dan Lahav: For some of the attacks we may be able to detect that, but that’s still a very big open question. I’d say our approach in which we’re trying to pioneer around security research, which is actually work from the outside in. So that would be putting the models in high fidelity, realistic environments that would push them to the limits. And in these environments, you’re recording pretty much everything that’s happening in the background. So that would be both the internals of the model, like the neural net, but also interactions of the model with the environment or the AI system with the environment.

    And it actually allows you, by this recording, to play around with using all of this data to create, firstly, a mapping of how it even looks like when an attack is happening, whether you’re trying to attack a model or using the model to attack a target. And then based on that, taking up classifiers. And I think just like a key thing at least around where we are right now in security, and a key realization is while understanding the internals of a model is ultimately going to be important to fully solve the field, we can make a lot of progress just by understanding that something is not right, even if we don’t have full, crisp, full understanding of just the internals and what’s going on there.

    And also in these environments, the other nice thing is you can insert or remove defenses to see if they’re effective. So think about a situation where the question that I’m trying to ask is whether a model is going to be able to evade a detection from an antivirus. This is a core security question. If models are going to be able to do that in enterprises, this is an issue because they’re going to be able to scheme and do malicious things and just harm security processes without anyone noticing.

    And in specific environments we see that this is actually possible. Models are able to do that. We were able to show how a model was able to take out Windows Defender, the real Windows Defender—in a toy environment, granted, but real Windows Defender. But that kind of research, when you’re putting a model in an environment and recording a lot of stuff both internally and externally allows you to play with these parameters to ask a question of A) should I change something on the model itself in order to understand that something bad is happening? But also take a different approach, which is agnostic to that, and say should I alter something in the defense in order to make sure it’s more resilient?

    So I think our view is that for the long run, for the vision of the field, we do need to solve problems such as mechinterp in order to also solve issues in security, which also shows how different problems in AI are just changing the core of just so many different fields. Security itself may just be reframed to an issue of reliability and control in the near future, not just classic traditional security. And drawing an analogy to the human brain, we still don’t have a proper understanding of our own mechinterps like the neural net in my head. That being said, we were able to create very powerful mitigations, you know, and just solutions to multiple problems that we’re experiencing as humans in our lives just by being observational on how we’re interacting with the environment or just understanding smaller parts of the inner parts of the system.

    Real-World AI Security Simulations

    Sonya Huang: I love that framing. You said something scary to me that the models are capable of taking out Windows Defender in your simulation environments. Can you share more about how that even happened?

    Dan Lahav: Yeah. So just to be clear, it’s AI. It happened in a toy environment, you know, it’s a controlled setup simulation. That’s not to say that everyone that is using Windows Defender now should just be scared of an outcome like that. So I just want to make sure that Microsoft is not going to sue Sequoia for just taking down the stock. You can think about a network—so it’s a network simulation. So think about a network environment. Just to simplify, imagine the network environment that you have in your enterprise, in your work, in your home, the Sequoia network, for example, your phones, your laptops are connected into this network environment that you’re using.

    And the AI there was a storage bot, so it had access to the organizational file in order to be a helpful storage bot. For example, just scan some of the organizational files such that it can summarize some of the inputs. Like you’re doing it all of the time when you’re interacting with files that you’re having just like access to a file and you’re asking, “Give me a summary of these four things. What’s common between these documents? What’s not common between these documents?”

    And also as a prelude to the simulation, there was an attacker that had access to the enterprise Slack. That happened multiple times in the last year, just like Slack is occasionally hacked, as happens to every software that is very ubiquitous and just being used by so many different organizations in the world. And the attacker gave a very simple instruction to this helpful AI storage bot. It just said, “I want you to download this file to the organizational network.” But clearly, because it’s a storage bot, it did not have access to download from the internet. But what you saw that was pretty amazing, that through a collection of prompting, the model, because it was given the clear instructions to just do it at all costs—obviously much more sophisticated language I’m simplifying here—it realized that it can’t do downloads because it doesn’t have access and because Windows Defender, the real Windows Defender, is blocking it from downloading files. So that’s the way the organization was set up.

    So you can see internally of how the model is thinking, that it was able to map and have situational awareness in the environment, A) understanding that this is the problem, that it should have higher privileges to just do more stuff in the organization, and that it should be able to remove Windows Defender because it’s bothering it in order to achieve its ultimate goal. So then you can see as a next step in that reasoning trail that the model is realizing that it’s a storage bot. So as an outcome, it can access different files, obviously still by just asking it to access different files, but it started to just scan whatever it had access to in that privilege level, and found that someone accidentally in one of these files left a hard-coded password to the organizational cloud. And it used that. And now it was able to just get access to something that it didn’t have access to before.

    And I’ll cut the story short. It just did a collection of things that ultimately just did more privilege escalation, more privilege escalation, more privilege escalation, and was able to just gain more and more and more and more access to stuff that it should not have given access to. And ultimately, it was able to just get enough access to have admin privileges on the organization and kill critical security processes. And as an outcome, it was able to outmaneuver Windows Defender, remove an organizational defense and upgrade its own privileges such that it can download a file and just actually hack the organization that it’s a part of. We call it internally, this simulation, a double agent, just because it’s an agent that is doing that, but it’s turning against its host. So it’s kind of like a fun play of words of these kinds of new attacks.

    And in a way, the AI is kind of used very much similarly, this prompting, as that you would think about a parasite that is attacking externally, because you’re using this very lean text of just sending something to a model, and you’re using the fact that it has a very strong brain in order to just do a collection of actions that are very advanced.

    And I want to say the point in time in which we’re in right now is that this is a toy setup and it’s not—I don’t expect that, you know, for a lot of these things to happen in the wild yet. That being said, we are seeing huge “progress,” quote-unquote. And I feel it in security. Have you guys seen, like, the classic—I think it was called Bricks, like the game that DeepMind demoed just like a decade ago, where just like it starts very badly, and just like then the AI is able to just figure out better and better and better strategies. And it is first relevant to just Bricks and then just relevant to just like many, many, many other games. And here we are right now, just like a decade after the state of AI.

    So I think security by being a derivative market of what it is ultimately you’re trying to secure, is at the more nascent stage right now, where in toy setups and simulations we’re able to start to only get a glimpse of what’s about to come. And we are seeing stuff like models having enough power to do stuff such as maneuver the host in order to just do privilege escalation attacks, remove some organizational barriers and wipe out even real security software such as Windows Defender. And while these are not things that will likely happen in the wild now, it’s likely that in a year or two or three, if we’re not going to have the appropriate defenses, this is going to be a world that we’re going to just land up on. And clearly the implications here matter, right? I assume that the vast majority of enterprises in the world don’t want to deploy or just adopt tools that are able to outmaneuver their defenses.

    Working with AI Labs

    Dean Meyer: How do you think about model improvement, especially in the context of reinforcement learning, playing a pretty significant role in the improvement of coding, even tool use? For example, how does reinforcement learning play a role in cybersecurity?

    Dan Lahav: I think that’s literally a billion-dollar question, or just like maybe a trillion-dollar question. I don’t know. Because my background is as a researcher, I’ll keep my scientific integrity and just say that there’s a lot of uncertainty, but I’m still going to give a speculation of what’s likely and what’s going to come.

    We’ve already seen that RL is very, very useful to a lot of the innovations that we’re seeing right now around coding, around math, and in other verticals as well. I think it’s likely at the point in time in which we are in right now that RL is going to be able to scale as well, that is that we’re going to see something similar to scaling laws, that if we’re going to input more data or just have breakthroughs and improvements in training, we’re going to ultimately get better models, at least in the verticals that I’ve mentioned before, by RL.

    I think it’s still an open question on whether RL generalizes, just like where we are right now in the world. So that means that if you’re using data and RL environments in order to improve the model encoding, whether you’re going to see a huge jump in being able to produce better literature, for example. If you think about it, that’s roughly—you know, a huge simplification—something that we did come out to expect out of models. We lived in the last few years in a world where models were showing properties of advancing a lot of capabilities in the same time, which is different than the world that we lived in before, where I still have the skill from just like what feels in our previous life, just like previous jobs of understanding how to create huge ML data sets in order to just improve in a very narrow domain.

    And that world, it still exists, but we shifted into just a much more generalized paradigm. And there’s a question of just whether RL is going to provide that. And the reason that that matters is we still are at the early stages of A) figuring out if unique improvements in—or just like taking data that is relevant for RL training around security is going to push the security frontier, or whether improvements that RL is providing around coding or math or others just like scientific skills is going to be relevant for security.

    My intuition on the first one is a fairly strong yes, that we are going to see a success in some experiments of just using security data in order to have improvements such that AI can become better and better just like at security engineering tasks. I think there are some indicators that are showing that we’re on the way to doing that. I think it’s not going to be as clean as improvements that have happened in coding math, just because the complexity and noise level around some security tasks are going to make it a harder problem. I think we are going to also get some boosts around security that is coming from other domains improving in RL soon. If you’re better at coding, you are going to be better at some security tasks as well.

    I think it’s still unclear about whether this is going to generalize, and in security we’re in a more nascent situation around just like what’s happening right now in RL, but I am placing a not-insignificant bet that there is going to come just like a string of just like innovation, just like a string of innovation is potentially going to come around that, and that we’ll see some improvements on security as well with RL over the upcoming period.

    Enterprise AI Security Strategies

    Dean Meyer: That’s very exciting. Now let’s take a step back and talk about the enterprise. And so I think enterprises are still very much so in the early innings of building, deploying agentic AI. And how should CISOs and security teams be thinking about security as we move into this world?

    Dan Lahav: I would say that there is a lot of dependency on exactly what you’re trying to do. So I’m going to simplify kind of like aggressively in this answer. A useful frame to start thinking about when you’re an enterprise that is starting to deploy on a first-level issue is to kind of see AI, and specifically agentic AI, as the new frontier of inside risk. So there are multiple enterprises already that are trying to give the agents that are deploying internally persistent identities or ways to interact with them, for example either Slack identities or email identities or systemized inside of the organizations how and when you can approach AI. And that allows us to just have a clear visibility about where AI is being used and in which use cases. And then it becomes an issue of access controls and an issue of just like privileges that you’re giving to these identities that you’re putting in. And that’s usually a useful frame in the beginnings of just embedding. So just gain some visibility and just try to turn it as much as you can into an issue of access and into an issue of control of privileges.

    That’s obviously insufficient, especially if you’re getting to the stage where you’re allowing AIs to interact with one another in the organization. So for example, agent-on-agent communication. And that may sound very advanced, but if you’re running one agent that is doing summarization and another agent that is just ranking the quality and acting as a judge on these summarizations, then you’re already at the stage where you’re having AI-on-AI interaction. The base level would be to just have state-of-the-art monitoring on this AI-on-AI interaction.

    Here, I’ll however say that there is a genuine gap in what is available out there. A lot of monitoring software was not built to detect agents that can potentially communicate in ever-changing protocols, that if something goes wrong can try to bypass and understand that they’re in monitored environments, and need to also take into account attacks that are unique and novel to AI itself. So clearly, if you’re starting to deploy right now, you need to make sure that you have proper visibility and monitoring into the interactions of just like both the internals of the models and the externals. But you need to also realize that a lot of the software is not properly built for that age, especially, especially, especially if you’re moving to agent-on-agent communication. So that’s a more advanced use case.

    Just to give a funny example of something that happened to us in one of the streams of work that we’ve done with one of the frontier labs, there was a scenario where there was an agent-on-agent interaction. So these are two real frontier models—actually one of each kind. So it’s like having an environment where there was, like, one—I won’t say the names, but you can kind of think about it’s like a Claude, a Gemini, it’s like a Grok, it’s like a Llama, ChatGPT, et cetera. So just like two of these that were speaking to one another, and very, very, very randomly in this environment—it was a critical security task. That was the simulation that they were in. But after working for a while—and you can see it in the reasoning trail—one of the models decided that they’ve worked enough and they should stop. And, you know, these are stochastic machines at heart. And that’s the design. And it’s just likely somehow it got picked up as part of the distribution that if you work a long time—because it’s learned from the internet.

    Sonya Huang: Trained on me and Dean talking to each other.

    Dan Lahav: Exactly.

    [CROSSTALK]

    Dean Meyer: This should be a new policy at Sequoia.

    Dan Lahav: Yeah, exactly. That you should take a break, right? And. you know, that makes sense if you can actually pick up that behavior. And on an individual model level, that’s already—you know, in this example it’s funny and it’s weird. That being said, it did not stop there. It convinced the other model that they should both take a break. So the model did social engineering on the other model, to another model, which is—and again, it’s funny in a simulated environment, in a toy setup, but now try to think about the situation where you actually as an enterprise are delegating an autonomous workflow that is critical to you to complete. And occasionally, there’s kind of like this weird thing that stopped working and you don’t understand why.

    And the more complicated and capable machines are ultimately going to be, the more of these weird examples we’re going to encounter. And just like a different thing that happened to us, we gave a model a CTF charge, so Capture the Flag. It’s very common in cyber competitions that you’re trying to achieve something, you usually have to do a chain of vulnerabilities that you need to exploit in order to just obtain and just capture a flag. And that gives you validation that the model was able to do a collection of cyber actions that ended up in a success. But the model here again, it understood that it’s in the context of a CTF, and decided that potentially the challenge is too hard. So it did what potentially humans would do, which is he tried to email the organizers of the competition in order to help it gain a solution.

    Dean Meyer: [laughs]

    Dan Lahav: And that is literally—however, if you think about it in an enterprise setting, it’s like you have an identity that just like unasked, may try to just use your servers in order to send an email to the world. In our example, by the way, just like the other fun thing was a second order issue, is that the model failed at doing that not because it had an issue of maneuvering inside, but because it hallucinated the email address. So as an outcome, it tried to send an email to an email that doesn’t exist, which also shows just like the classic other problems that you’re having in AI and in AI adoption are going to be chained to security problems as well, which shows the frontier of attacks and defenses that we’ll need to develop here.

    So, you know, just like if I’m going back into just monitoring, et cetera, a lot of monitoring software, you have to embed it and you have to use what’s out there already. But it’s not built for these kinds of challenges, you know? That’s why a lot of our approach is to figure out how these attacks are going to look like, how to just redo some of the defenses, what’s going to be required. And occasionally I think it’s like a common misconception is that all that you need to do, that all of it ultimately collapses into an issue of access management. And that, while I think a lot of the basis is there is by just figuring out how to do the access management world and just manage the privileges, it’s only step one of just what we need to do. And there is a mind shift that we also need to have when we’re approaching this subject, which is the rate of innovation is so high, our ability to understand what’s happening at the frontier, you know, just like so many things are happening at once, try to be very engaged with the community in order to figure out what kind of problems that you’re even going to encounter essentially over time in order to be better prepared.

    Governmental AI Security Considerations

    Dean Meyer: Okay, so as we shift from the enterprise to sovereign AI, we know the UK government and a set of others are customers of Irregular, so how should governments and countries be thinking about AI risk?

    Dan Lahav: Obviously, all of the risks that apply on the enterprise side and to the labs themselves apply also on the governmental level. Because if you’re now the Department of Defense, the Department of Commerce, Department of Education, doesn’t matter, and you’re using advanced AI models, you’re importing the benefits and risks that come associated with them. So everything that we’ve said about the enterprises, everything that we’ve said about the frontier labs themselves, they have similarities on the governmental side as well.

    Usually governments, however, come with a set of unique requirements and a new level of risk that is relevant to them. So just like one, they are often targets of other very strong adversaries, and should take into account that the adversaries are now taking offensive AI models and are already starting to use them in order to scale up, whether simple things such as phishing campaigns, up to testing more and more advanced cyber offensive weapons, that is scaling up their efforts, that is trying to bypass the fact that I think pretty much every critical system that countries have was hacked at some point in time. We have not yet seen multiple critical systems ubiquitously just going under.

    And the fact that AI on the offender side can scale up operations aggressively means that countries should essentially recreate their approach around critical infrastructure. And that is AI is being elevated in that context from a classic security risk into a national security issue. And the infrastructure, and just like the thought leadership should be created there.

    The other bit is that from a country perspective—and you can argue on whether this is the right thing or not, but multiple governments that we’ve spoken with are very strongly emphasizing the effort of sovereignty in the context of AI. And what they usually mean by that is that they are anxious around being dependent, because they understand that AI is extremely critical as the infrastructure that could be the key to the 21st century and potentially beyond. Because of that, especially if the country is doing an end-to-end effort, starting from building local data centers that could be used in order to train and to do inference on advanced AI models, up to the point of potentially training the models and creating the AI systems that surround them and having proprietary environments that they also take in, defenses should be done across this entire spectrum.

    And we’ve indeed done work to just both create standards of how to secure these data centers, and making sure that people are not going to lift critical assets, how to run models on such data centers. For example, we’ve done a combination of a white paper with Anthropic that is discussing confidential inference systems, and trying to just figure out how to create a standard in the field, up to the fact of when actually using these models, taking into considerations how to customize some of the defenses that enterprises need, and create the variations of them that governments would need for their use cases, especially if they’re putting AI as part of—not just taking into consideration that AI can be used by adversaries to attack critical infrastructure, but by the fact that they may integrate AI to their own critical infrastructures. And that requires a whole new level of thinking through the defenses.

    Dean Meyer: Dan, this was a lot of fun. Thank you very much for joining us.

    Dan Lahav: It was a pleasure being here, and also very happy that I ended up answering your emails.

    Dean Meyer: Thank you.

    Continue Reading

  • Why Active ETFs Are More About Hype Than Performance

    Why Active ETFs Are More About Hype Than Performance

    In 2021, the stock market valuation of GameStop skyrocketed thanks to a social media frenzy. It heralded the rise of the so-called “meme” stock, but also a rise in a particular kind of trading, one driven more by the attention economy than by inherent value in an investment, according to new Northeastern University research.

    Between 20 and 25 years ago, as much as 80% of trading was done actively, says Da Huang, assistant professor of finance at Northeastern University. Active trading, he says, is defined by asset managers (usually in instruments like mutual funds) who are trying — actively — to beat the market.

    The alternative is passive trading. “The passive fund tells you upfront, we’re not going to beat the market, we’re just going to be the market,” Huang says. Passive funds track the entire stock market, or large sections of the market, for more predictable gains.

    In the U.S. today, around 55% to 60% of the market is passive and 40 to 45% active, according to Huang.

    But that hasn’t stopped a new financial instrument from gaining in popularity. It’s known as active electronically traded funds, or AETFs.

    Da Huang, assistant professor of finance. He says that most active ETFs are just “shiny objects.” Photo by Matthew Modoono/Northeastern University.

    The rise of active ETFs

    “Active management is dying out,” Huang says. “This is why they’re trying to transform their funds into this tradable, almost stock-like instrument in the market, so that they grab this new clientele.”

    That clientele are retail traders who come from a social media-driven environment.

    Until recently, the SEC only provided active ETF approvals on a case-by-case basis, making it difficult to get approval. But on Sept. 29, the SEC granted its first blanket approval for an entire firm’s active ETF class, leading to what Huang described as an “opening the floodgate moment.”

    Now, ETFs are fast becoming the instrument of choice for these retail traders, who use electronic trading platforms like Robinhood to purchase funds when they’re at extremes, hoping to make large profits in short periods of time.

    But the active ETF market is one that’s full of contradictions, Huang notes.

    One of the big differences between mutual funds and ETFs, Huang says, is that mutual funds only disclose their purchases and sell-offs once a quarter, whereas ETFs disclose that information daily. 

    “Back in the day, you would think asset management is all about secrecy,” he says. “I’m not going to tell you what I do. This is my secret sauce.” 

    Given that the stock market is a zero-sum game — in any trade there will be a winner and a loser — it was typical that asset managers, who were trying to beat the market, would keep their strategy a secret from the other players, with the goal of creating value for their investors. 

    “Now people are doing the opposite,” says Huang. When it comes to the managers of active ETFs, “they’re the first to go out and broadcast what they do.” In that case, “it’s pretty obvious. They’re not trying to outperform, they’re trying to grab attention,” Huang says.

    The economics of attention

    With traditional strategies, advertising a fund’s trades publicly would be “almost ridiculous,” Huang says. But active ETFs aren’t interested in traditional strategies.

    “What we find is that when you make an investment fund almost operate like a stock — that is free to trade on the market at any time during the day — that creates an attention effect, in the sense that people don’t really look at the performance of the fund anymore.”

    Instead of doing their research to find a stock that performs well, which is highly valued, retail traders instead “trade based on their feelings, their attention, or if this fund is doing something cool,” according to Huang’s research.

    “Something cool,” the finance professor says, might be trading in internet-notable stocks, such as Tesla, GameStop or DraftKings, for the attention it brings. Or, it might simply mean an asset manager appearing on a talk show to discuss their current strategy in hopes of enticing new traders into their fund.

    Market volatility and investor welfare

    In the last 10 years, active ETFs have expanded from a market share of a few hundred million dollars to over a trillion dollars, Huang says. The investment firm BlackRock expects those funds to reach $4.2 trillion by 2030.

    But the research reveals some potential drawbacks. Active ETFs, driven by retail traders’ unpredictable attention, could lead to greater market volatility and overall worse outcomes for traders, especially traders no longer focused on performance and value. 

    “This is going to have a huge impact on market efficiency and on investors’ welfare,” according to Huang. “Retail investors just need to focus more on performance and the fees they’re being charged for.”

    In the end, he explains, most active ETFs “really don’t deliver superior outcomes. They are just shiny objects, and they grab your attention, and you are mind-controlled into buying these products.”

    Noah Lloyd is the assistant editor for research at Northeastern Global News and NGN Research. Email him at n.lloyd@northeastern.edu. Follow him on X/Twitter at @noahghola.


    Continue Reading

  • BCG appoints new chief risk officer after Gaza aid controversy

    BCG appoints new chief risk officer after Gaza aid controversy

    Unlock the Editor’s Digest for free

    Boston Consulting Group has appointed a new chief risk officer to oversee tougher internal controls and whistleblower procedures after the controversy over its work in Gaza.

    Amyn Merchant had been given executive responsibility for risk management and compliance, having overseen those functions more broadly as head of BCG’s audit and risk committee, the consulting firm said on Tuesday.

    The previous chief risk officer, Adam Farber, resigned from the role in July after revelations that BCG helped establish the Israeli-backed Gaza Humanitarian Foundation. The launch of the aid effort, which was designed to supplant the UN, was marred by the deaths of hundreds of Palestinians.

    In June, BCG fired two partners who led “unauthorised” work with GHF. The Financial Times reported the following month that the project had been discussed at senior levels of the firm and extended to developing a postwar plan for Gaza that envisaged the voluntary relocation of a quarter of the population.

    Merchant will implement what BCG has internally called “Project Reinforce”, reflecting the conclusions of an investigation into the Gaza work conducted by the law firm WilmerHale.

    “The investigation concluded that this work was the result of individual misconduct, enabled by unwarranted process exceptions, gaps in oversight, and misplaced trust,” BCG said. “While it confirmed that BCG’s core processes and culture remain strong, it also identified areas that can be further reinforced.”

    The FT reported last month that BCG would train staff on humanitarian principles and impose extra oversight on sensitive projects.

    On Tuesday, it added that it was reinforcing business acceptance controls across all of its client work and “improving visibility and access to independent and confidential Speak-Up channels for all BCGers”.

    Merchant has been at BCG for more than 30 years in roles on three continents and has chaired its audit and risk committee for the past five years.

    “The ARC is a governance committee responsible for overseeing financial integrity, risk management, compliance and internal controls, not the day-to-day running of operations,” a BCG spokesperson said. “Its role is to challenge management on why controls failed, whether accountability has been enforced, and what corrective actions are being implemented.”

    Merchant also chairs the firm’s senior leadership development programme and oversees relations with alumni. He previously led BCG’s New York office.

    “His global perspective and commitment to responsible leadership will be instrumental as we continue to evolve BCG’s processes and safeguards in step with the scale and breadth of our global operations,” said Christoph Schweizer, chief executive.

    Unlike at rival McKinsey, the chief risk officer is not a member of BCG’s main executive committee.

    Continue Reading

  • Gold is getting knocked on Tuesday – it’s still the hottest trade of the year

    Gold is getting knocked on Tuesday – it’s still the hottest trade of the year

    Continue Reading

  • LangChain: From Agent 0-to-1 to Agentic Engineering

    LangChain: From Agent 0-to-1 to Agentic Engineering

    Congratulations to Harrison, Ankush, Julia and everyone at LangChain on today’s $125M Series B funding announcement. It’s been a delight working with this team since leading the Series A. 

    Beyond the funding, today marks the release of langchain 1.0, a major rite of passage for any open source company. Releasing 1.0 reflects the team’s conviction that LangChain, as currently built and packaged, is the right foundational architecture to become the scaffolding for agent engineering.

    LangChain is one of the most popular open-source projects in AI with 90M monthly downloads, and is used by 35% of the Fortune 500. The company has quickly grown from an open source package to a full-fledged agent engineering platform, including a low-level agent orchestration and runtime called LangGraph that’s used by companies such as Uber, Klarna, LinkedIn and J.P. Morgan; and an observability and agent deployment platform called LangSmith used by Clay, Cloudflare, Replit, Vanta, Rippling, Mercor and more.

    LangChain is one of the most used, debated and sometimes misunderstood projects and companies in AI. Below are some reflections on the evolution of LangChain to date and how I view the opportunity ahead.

    Early Days

    Harrison Chase’s superpower is simple: he sees the future of agents. He saw it before just about anybody, and has consistently pushed the frontier of agent engineering forward.

    Harrison started LangChain as a nights-and-weekends side project back in 2022, right as ChatGPT was taking off. The original LangChain was a proof-of-concept of what language models were capable of becoming: connect them with tools, chain them together into a sequence of calls and business logic, and voila, model becomes agent. 

    Harrison’s vision struck a chord with developers, and the LangChain project quickly became one of the fastest adopted open source projects. By the time we led the Series A in early 2023, it had more than 2,000 open-source contributors and was being used by more than 50,000 LLM applications. After two years of exponential growth, it is now surpassed only by OpenAI in download volume.

    September 2025 Download Metrics (PYPI) (in millions)

    Crucible Moment: LangGraph

    As the ecosystem matures, LangChain has matured alongside it, constantly growing the core open source package while also adding in new capabilities.

    Because the LLM ecosystem is evolving at such a rapid clip, maintaining the scaffolding to support a developer ecosystem of this size hasn’t always been easy. But Harrison has been resilient and unflappable, constantly listening hard to his challengers and evolving the package towards the future.

    Sometime last year, it became increasingly clear that the abstraction level on which the OG LangChain was built was a convenient launching point for developers, but wasn’t sufficiently low-level enough to support all the granular control that developers were craving as the space matured.

    Harrison made the hard decision to develop another framework in parallel: LangGraph allows developers to control every step of their custom agent with low-level orchestration, memory, and human-in-the-loop support, and manage long-running tasks with durable execution. 

    This was a tough but remarkably wise call. Today, LangGraph is the most popular low-level agent orchestration framework and runtime, used by companies including Uber, Klarna, LinkedIn and J.P. Morgan and downloaded 12M times a month.

    Getting one open-source project to broad product-market fit could be a function of luck. Getting two open-source projects to PMF is a testament to how clearly Harrison and the broader LangChain team see the future of agent development.

    LangChain 1.0

    Today marks a new chapter, merging LangChain’s past and future: the team has completely rewritten langchain in its 1.0 release to be opinionated, focused and powered by LangGraph’s runtime. 

    This means that langchain 1.0 has all the “0-to-1 booster fuel” the community has come to expect—pre-built architectures for common agent patterns, improved model and tool integrations, and more—but gives developers access to a deep level of customization and control. 

    In addition, helping developers on their agent engineering journey means expanding beyond frameworks into developer tools. Working with LLMs requires a mindset shift, from deterministic software to embracing stochasticity, constantly observing and testing data. To that end, the team has invested heavily in LangSmith, which enables agent engineering across the full lifecycle: observability, evaluation and deployment.

    LangSmith has proven successful as a standalone developer platform for LangChain and non-LangChain developers alike, and is currently used by companies such as Clay, Cloudflare, Replit, Vanta, Rippling and Mercor, with traffic growing 12x year over year.

    What’s Ahead

    LangChain’s ambitions are large. This team will not stop until they become the de facto platform for all of agent engineering. They will get there by deepening their core offerings, and continuing to push the frontier.

    Deepening the core: The team has pushed hard on deployment, making it possible to ship your agent in one click on LangSmith, using scalable infrastructure built for long-running tasks. They’ve also launched their first 1P agent, an Insights Agent in LangSmith Observability that automatically categorizes agent behavior patterns, as well as a no code text-to-agent builder experience for business users.

    Pushing the frontier: As the model space progresses, the agentic building blocks also evolve. LangChain has been hard at work on many of the core primitives we believe will drive the future of agentic engineering, including memory.

    It’s hard to believe LangChain is just three years old. We are proud to partner with a team that has adapted so resiliently, and more important, driven so much innovation in the agent engineering ecosystem. Today’s milestone is just the beginning.

    Continue Reading

  • Toy Fair® 2026 Retailer Registration off to Strong Start

    Toy Fair® 2026 Retailer Registration off to Strong Start

    Hundreds of retail outlets have already committed to attend Toy Fair® 2026, the largest and most influential toy & play industry event in North America. More than 850 individual retail outlets and attendees representing 59 countries have secured their spots to attend the global event, taking place February 14 to 17 at the Javits Center in New York City.

    “With early commitments from influential buyers and a strong international presence, Toy Fair 2026 will deliver a concentrated environment for business growth,” said Kimberly Carcone, executive vice president of global experiences at The Toy Association™. “Retailers come to source strategically, and exhibitors come ready to build partnerships that last well beyond the show floor. Together, they’re shaping the year ahead for the global toy industry.”

    Among the outlets registered to date, 19 of the top 25 global retailers are confirmed, including Amazon, Barnes & Noble, Five Below, Kohl’s, Macy’s, Target, TJ Maxx, Walmart, and more.

    International interest continues to increase, with retailer registration so far representing 32 countries and a current total of 59 countries represented by toy and entertainment professionals from the entire play ecosystem. These strong numbers emphasize the enduring global significance of Toy Fair and its appeal to buyers and industry professionals worldwide.

    Toy Fair is the ultimate destination for discovering the latest innovations and trends that will define the toy market in the coming year. In addition to discovering product across the show floor filled with new toys and games, attendees can benefit from business, networking, and educational opportunities, including Toy Fair University, unique product discovery zones like the Launch Pad for new exhibitors, and the Creative Factor program for inventors and designers. More updates on educational sessions and networking events will be announced in the coming weeks.

    Toy Fair® will take place February 14 to 17, 2026 at the Javits Center in New York City. Attendee registration is now open. Exhibitor applications are still being accepted. Visit ToyFairNY.com to learn more.

    Continue Reading

  • Journal of Medical Internet Research

    Journal of Medical Internet Research

    Background

    Large language models (LLMs) represent a breakthrough in artificial intelligence (AI), capable of processing, understanding, and generating humanlike language at scale. With their advanced natural language processing capabilities, LLMs are increasingly explored in specialized domains, including both the medical and nursing fields []. Recent studies have demonstrated the potential of LLMs to support a wide range of clinical tasks, such as diagnosis support, medical documentation, and treatment planning for medical professionals, while also showing promise in assisting nursing-specific duties, such as care plan generation, patient education, and automation of nursing notes [,].

    Despite the potential of LLMs, their integration into clinical and nursing practice is hindered by several critical challenges. A key concern is the generation of inaccurate content, along with limited transparency regarding how responses are produced. However, whether in medical applications or nursing practice, even minor errors can have a serious impact on patient safety []. Furthermore, because LLMs do not inherently access external knowledge bases, their outputs may fail to incorporate the latest evidence. This includes clinical guidelines and drug updates that are critical for medical decision-making, as well as nursing best practices and care protocols that are essential for effective patient management []. To address the limitations of “out-of-the-box” LLMs, Lewis et al [] proposed retrieval-augmented generation (RAG) for knowledge-intensive natural language processing tasks.

    RAG enhances the generative capabilities of LLMs by incorporating external knowledge retrieval mechanisms []. Unlike traditional models relying solely on internal parameters, RAG leverages in-context learning to proactively retrieve relevant information before response generation []. This significantly reduces inaccurate information and improves the transparency of information sources, which is crucial in health care []. Furthermore, as medical and nursing scenarios involve distinct reasoning paradigms, general-purpose LLMs often struggle to differentiate between them. RAG addresses this limitation by supporting diagnosis-centered medical reasoning through context-aware retrieval of evidence-based knowledge and facilitating nursing reasoning through the integration of patient information to assist nurses in identifying cues and confirming nursing problems, thus providing differentiated support for both paradigms [,]. However, current reviews of RAG primarily adopt a technical perspective while overlooking the specific needs and contexts of medical and nursing practice, such as alignment with clinical workflows, adherence to ethical standards, and the ability to reason as clinicians or nurses [,].

    Goals of This Review

    To bridge this knowledge gap and enable its effective and responsible integration, developing a comprehensive understanding of current applications of RAG in medical and nursing settings is crucial. Through this scoping review, we aim to categorize types of RAG and their developmental stages, while establishing a foundational understanding of the field in terms of adopted techniques, reasoning strategies, application tasks, and ethics. This review serves a dual purpose: first, to provide health care professionals with a navigational map of existing research and second, to identify key trends, limitations, and future directions of RAG in the medical and nursing domains. Considering the complexity and fragmented information landscape, where implementation is often driven by technical teams unfamiliar with clinical workflows, this study takes an important step toward enabling health care professionals to lead RAG system development and application.

    Overview

    This scoping review included articles that described the development or application of RAG technologies in medical and nursing contexts. The review followed the methodological framework proposed by Arksey and O’Malley [] and subsequently refined by Levac et al []. This methodological framework consists of five stages: (1) identifying the research questions, (2) identifying relevant studies, (3) selecting studies, (4) charting the data, and (5) collating, summarizing, and reporting the results. The PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist [] was used as a guideline in reporting the results of the study (). This project was registered with the Open Science Framework [].

    Identifying the Research Questions

    To address the aims of the study, the following research questions were identified:

    1. Into what categories can RAG frameworks in the medical and nursing domains be classified?
    2. Can the workflow of RAG systems be structured into distinct stages to guide medical and nursing practice, and what enhancement techniques are applied at each stage?
    3. What methods have been used to improve reasoning capabilities within RAG frameworks?
    4. In what application tasks have medical and nursing RAG frameworks been deployed?
    5. What practical measures have been taken to mitigate ethical risks in the development and application of RAG frameworks in the medical and nursing domains?

    Identifying Relevant Studies

    We conducted a literature search using 4 electronic databases covering the period from November 1, 2022, to May 31, 2025: PubMed, Web of Science, IEEE Xplore, and arXiv. This time frame was chosen because LLMs only became widely available in late 2022, and RAG was introduced to reduce the generation of inaccurate content. In light of the rapid development in this field, preprints were also included to ensure the inclusion of the most recent advances. A comprehensive search strategy was developed and refined in collaboration with the research team, and a health science librarian was consulted. Search terms included keywords such as retrieval augmented generation, RAG, health care, medicine, medical, nursing, and care. The complete search strategy is provided in .

    Study Selection

    The initial criteria used to identify articles included (1) studies published in English only, to ensure consistency in data extraction and interpretation, as translating non-English studies could introduce potential biases or inaccuracies that might affect the overall findings; (2) studies applying the RAG framework to perform end-to-end or user-facing medical tasks were included while those focusing solely on isolated natural language processing components, such as entity recognition or relation extraction, were excluded because the aim was to explore the integrated technical architecture and application of complete RAG systems rather than individual submodules; (3) only studies proposing RAG frameworks applied to medical and nursing domains were considered; (4) studies were required to clearly describe the RAG framework architecture, the retrieval data sources, and the retrieval methods used. In addition, we excluded literature reviews, conference abstracts without accessible full text, and articles without accessible full text.

    Data Extraction

    Identified articles were imported into EndNote (Clarivate Inc), where duplicates were removed. Titles and abstracts were screened independently and categorized as include, exclude, or potentially include. Two authors conducted independent assessments, and any disagreements were resolved through discussion and consensus, with a third reviewer adjudicating if a consensus could not be reached. A standardized data extraction form was developed and refined based on team feedback. We initially conducted a pilot extraction on 10 representative studies to explore and determine the most appropriate data extraction dimensions for this review. On the basis of the findings from this pilot phase, the extraction categories were developed a priori and further refined. Ultimately, the extracted data covered 5 key dimensions: the type of RAG method proposed, technical details corresponding to each stage of the RAG framework, reasoning strategies used, application tasks addressed, and ethical considerations reported. Relevant information was extracted for each included article, with one reviewer performing the initial extraction and another verifying and completing the data as needed.

    Quality Assessment

    To evaluate the reporting quality of the included RAG-related studies, we adopted the Minimum Information for Medical AI Reporting (MINIMAR) framework [], a recently developed guideline specifically tailored for the reporting of AI research in medical contexts. Although many of the included studies were proof of concept, the MINIMAR framework was ultimately selected for evaluation because it specifically addresses the critical aspects of medical AI systems, such as data transparency, model evaluation, and other related factors. MINIMAR outlines four essential components: (1) study population and setting, (2) patient demographics, (3) model architecture, and (4) model evaluation, comprising a total of 21 key reporting features correspond to all four components of the MINIMAR framework. The overall MINIMAR adherence rate was subsequently calculated to quantify the reporting quality across studies. Given the rapid development of RAG-related research, no standardized quality appraisal tool currently exists in this field. While MINIMAR assesses reporting completeness, it does not cover methodological rigor or clinical relevance. To address this gap, we additionally applied a self-developed evaluation framework, including 3 dimensions: methodological rigor, clinical relevance, and reporting transparency, with a total score of 10 points. All studies were independently assessed by 2 reviewers, with disagreements resolved through discussion.

    Overview of Included Studies

    A total of 917 articles were retrieved from the 4 databases, as illustrated in the PRISMA flow diagram (). After removing 205 duplicates, 445 articles were excluded based on title and abstract screening. The interrater reliability for the initial screening was high, with a Cohen κ of 0.87, indicating substantial agreement. A total of 118 full-text articles were assessed for eligibility. Ultimately, 67 studies met the inclusion criteria and were selected in this review.

    Figure 1. PRISMA flow diagram.

    Quality Appraisal Results of Included Studies

    After consensus was reached, the overall adherence rate to MINIMAR across all included studies was 62.3% (). The adherence rates for the 4 essential components of MINIMAR, including study population and setting, patient demographics, model architecture, and model evaluation, were 55.25%, 6.56%, 87.30%, and 89.20%, respectively. The high compliance in model architecture and evaluation suggests that the included studies generally reported the technical aspects well. In contrast, the low adherence in patient demographic reporting highlights a limitation in conveying population characteristics, reflecting the proof-of-concept nature of many of the included studies that often did not involve real patient data. Detailed evaluations for each study are provided in [-]. We further assessed each study using a self-developed evaluation framework (). The average scores were 3.89 (SD 0.42), 2.59 (SD 0.38), and 2.94 (SD 0.18), respectively, with an overall mean score of 9.43 (SD 0.55) out of 10, indicating moderate to high quality but with room for improvement in clinical applicability ( [-]).

    RAG Technologies Applied to Medical and Nursing Domains

    Classification of RAG Methodologies

    In this review, RAG methodologies were categorized into 5 functional types: knowledge graph (KG)–enhanced RAG, text-based RAG, agentic RAG, multimodal RAG, and plug-and-play RAG frameworks that directly adopt existing tools. Detailed descriptions of each type are provided in . A total of 17 studies implemented RAG frameworks enhanced with KGs [,,-]. Among them, 3 studies used dynamically constructed KGs [,,]. In parallel, 6 studies applied agentic RAG frameworks [,-]. Two studies proposed a multimodal RAG framework that integrates text with other modalities [,]. Six studies directly adopted existing RAG plug-and-play frameworks, including LangChain, Pinecone, and NotebookLM, to streamline retrieval and generation [-]. The remaining 36 studies fell under the category of text-based RAG. Among them, 2 studies used dynamically evolving knowledge bases rather than static ones [,], incorporating real-time sources, such as sensor data or PubMed.

    KG Construction Approaches

    Given the frequent integration of KGs in medical and nursing RAG frameworks, we further examined the methods used for KG construction. Among the 17 studies that adopted KG-RAG frameworks, construction approaches were grouped into 4 major categories. Use of open-source KGs was identified in 7 studies [,,,,,,]. The most commonly used open-source KGs included the Unified Medical Language System [,], along with other structured resources, such as the Scalable Precision Medicine Open Knowledge Engine [] and the SmartQuerier Oncology KG []. A rule-based construction was reported in one study []. LLM-assisted methods were used for KG construction in 6 studies [,,,,,]. Deep learning-based approaches were applied in 3 studies [,,].

    RAG Enhancement Strategies Across Pipeline Stages

    Theoretical Framework for Staging RAG

    Herbert A Simon, a pioneer in decision science, proposed a foundational model of the decision-making process that divides it into 3 primary stages []. In the intelligence phase, problems are identified and the purpose of the action involving the decision is determined. In the design phase, possible solutions are developed, and alternatives are proposed to address the problem. In the choice phase, the alternative that best meets the decision’s objective is selected. Years later, Turban extended the Simon model by adding a fourth phase called implementation, which focuses on carrying out the chosen solution [].

    We adopt the Simon decision-making theory to structure the RAG framework, as its staged process aligns with how RAG is applied in clinical and nursing practice. Clinicians and nurses typically identify patient needs, retrieve relevant information, integrate it to form judgments, and generate appropriate interventions. This sequence corresponds to the 4 phases of the decision-making process model: intelligence, design, choice, and implementation. Based on this parallel, we divide the RAG process into 4 stages: intent recognition, knowledge retrieval, knowledge integration, and generation. The 5 categories of RAG and the 4 distinct stages of the RAG system workflow are shown in .

    Figure 2. Retrieval-augmented generation classification and stage mapping.
    Enhancements at the Intent Recognition Stage

    Among the reviewed studies, approximately half (34/67, 50.7%) applied enhancement techniques at the intent recognition stage. Intent classification, which categorizes user inputs into predefined intent types, was used in 10 studies [,-,-]. Query rewriting, which reformulates user queries to improve retrievability or clarity, was implemented in 7 studies [,,,,-]. Query decomposition, which breaks down complex queries into simpler subqueries, was adopted in 4 studies [,,,]. Medical entity recognition, which extracts clinically relevant terms from user input, appeared in 11 studies [,,,,,-,,,]. Semantic parsing, which converts natural language into structured meaning representations, was identified in one study []. In addition, one study [] used a hybrid strategy that combined both intent classification and semantic parsing.

    Enhancements at the Knowledge Retrieval Stage

    All the studies were enhanced at the knowledge retrieval stage. In total, 5 distinct retrieval strategies were identified across the included studies. Hybrid retrieval, involving the combination of multiple retrieval mechanisms, was applied in 16 studies [,,,,,,,,,,,,,,,]. Sparse retrieval, often based on traditional keyword-matching methods or statistical models, such as BM25 or term frequency-inverse document frequency (TF-IDF), was used in 6 studies [,,,,,]. Dense retrieval, which uses neural networks to encode queries and documents into vectors for similarity-based retrieval, was the most frequently adopted individual strategy, appearing in 33 studies [-,,,,,,,,,,,,-,,,-]. Structured retrieval, which queries schema-based knowledge sources, was adopted in 5 studies [,,,,]. Recursive augmented retrieval, which iteratively refines queries based on intermediate outputs, was found in one study [].

    Because of its dominant role among retrieval strategies, dense retrieval was analyzed in greater depth with respect to its implementation components. Specifically, commonly used embedding models included text-embedding-ada-002, text-embedding-3-small, and sentence transformer variants (eg, all-mpnet-base-v2, all-MiniLM-L6-v2), as well as BAAI General Embedding (BGE), GIST-large-embedding-v0, gte-base-zh, and Vertex AI Search. Facebook AI Similarity Search (FAISS) was the most frequently used vector similarity engine, typically using cosine similarity for top-k retrieval. In addition, some studies used custom retrievers specifically designed for biomedical applications, such as MedCPT [,,,].

    Enhancements at the Knowledge Integration Stage

    At the knowledge integration stage, many studies combined 2 or more methods to enhance accuracy. Among these, reranking was the most commonly applied technique across the included studies. Additional approaches included authenticity verification [,,,,,,,,,,,], semantic consistency control [,,,,,,], conflict detection [,,,,,], multisource fusion [,,,,,,,,], and structured reasoning [,,,,,,,,,]. Notably, one study [] investigated knowledge compression strategies to eliminate redundant content before integration. [-] presents the specific techniques adopted at each stage of the RAG pipeline.

    Enhancements at the Generation Stage

    Three primary strategies were identified at the answer-generation stage to enhance the quality and reliability of the model outputs. At this stage, nearly all the reviewed studies used prompt engineering strategies to regulate the output behavior of LLMs, including the structure, tone, and content of the generated responses. Building on this, 19 studies further incorporated chain-of-thought (CoT) prompting, a technique that guides models to perform structured, step-by-step reasoning, thereby enhancing the logical consistency of the generated outputs [,,,,,,-,-,,,,,]. In addition, 3 studies [,,] used self-reflection methods that enabled the model to evaluate and revise its initial responses.

    Reasoning Strategies in RAG Frameworks

    Among the reviewed studies, 26 incorporated various reasoning strategies within their RAG frameworks to help the LLMs follow clinical reasoning pathways. Agentic multistage reasoning was adopted in 6 studies [,-]. Five studies [,,,,] used CoT prompting, one of which [] used an iterative refinement variant of CoT. Notably, this form of CoT differs from the one used during the answer-generation stage. In this context, CoT serves as an explicit reasoning framework that structures the model inferential process, rather than merely functioning as a general prompt to elicit step-by-step outputs. Graph-structured reasoning was used in 11 studies [,,,,-,,], including 1 study [] that applied directed acyclic graph-based reasoning.

    In addition to the common strategies described above, 4 studies applied more specialized reasoning approaches that closely reflected real-world clinical workflows. Two studies incorporated clinical process–aligned reasoning [,], which aims to mimic the step-by-step logic of clinical decision-making. Specifically, MEDPLAN [] simulated subjective, objective, assessment, and plan-based diagnostic workflows by sequentially generating assessments and treatment plans, while DrHouse [] adopted an exclusion-based reasoning model that updated disease probabilities through guideline-driven questioning. Two other studies implemented recurrence-based multihop reasoning [,], in which reasoning was achieved through iterative query refinement and multistep evidence accumulation. MedRAG [] simulated multi-round diagnostic reasoning using a proactive questioning mechanism based on differential features, while recurrence generation–augmented retrieval (RGAR) [] used recursive alignment between conceptual knowledge and patient-specific facts to iteratively refine diagnostic conclusions.

    Application Distribution of RAG in Medical and Nursing Domains

    Of the 67 studies included in this review, 23 (34%) focused on diagnostic and clinical decision support, 36 (54%) addressed medical question answering, 2 (3%) explored drug discovery, 3 (4%) focused on medical education, and 3 (4%) were applied to the intelligent processing of electronic medical records. Crucially, a gap was identified in nursing-focused research. Of the 67 included studies, only 4 (6%) [,,,] were specifically designed for nursing-related applications, and their scope was limited to question-answering tasks. The distribution of the RAG tasks and their corresponding percentages are shown in .

    Figure 3. Distribution of application tasks in medical and nursing fields using retrieval-augmented generation.

    Sensitivity Analysis

    A sensitivity analysis, excluding the 39 preprints, was performed to assess their influence on the conclusions of this review. The reanalysis using only peer-reviewed articles yielded no substantial differences in the distribution of RAG framework types, the proposed workflow stages, or the key findings regarding reasoning support and ethical considerations.

    Ethical Considerations

    The medical and nursing domains are among the most highly regulated sectors, governed by principles such as biomedical ethics and stringent data protection regulations. To ensure the responsible deployment of LLMs in health care, ethical concerns must be carefully addressed. These include safeguarding data privacy, enhancing patient safety, and ensuring fairness for patients. Among the 67 studies reviewed, only 9 explicitly addressed patient data privacy. Seven studies [-] applied deidentification techniques, one [] used stratified isolation via triple graph construction by separating patient data into a dedicated layer, and another [] used an advanced encryption standard with key management services to secure sensitive patient information. Patient safety was a focus in only 1 study []; a technique that flags safety concerns was developed, demonstrating zero instances of alarming red flags during testing. Fairness was considered in 2 studies. One study [] conducted a detailed evaluation of system performance across 32 personality configurations, while another applied [] a previous probability adjustment to reduce demographic biases.

    Principal Findings

    To the best of our knowledge, this review is the first to systematically examine RAG frameworks in the medical and nursing domains, highlighting common practices, current trends, and underexplored areas in the design of domain-specific RAG systems. By categorizing RAG frameworks into text-based, multimodal, agentic, KG-enhanced, and plug-and-play types, we identified key architectural trends. We further divided the RAG framework into 4 stages, namely intent recognition, knowledge retrieval, knowledge integration, and generation, and explored the specific techniques applied at each stage. We observed several notable technological trends: shifting from surface-level matching toward contextualized intent recognition [,], from vague semantics toward logic-driven dynamic retrieval [,], from passive toward active knowledge retrieval [,], and from simple aggregation toward coherent context construction [,]. Moreover, although various reasoning strategies have emerged, few systems align with the procedural logic of medical and nursing workflows, highlighting a significant gap between current implementations and domain-specific reasoning needs. Importantly, we also identified a profound imbalance between the medical and nursing applications of RAG, with nursing-specific research remaining sparse and insufficiently explored.

    A persistent challenge and research focus is the selection of external knowledge sources, as it directly influences the retrieval accuracy of RAG. This review highlights the widespread adoption of KGs within RAG frameworks, owing to their structured logic capabilities. Furthermore, given the dynamic and evolving nature of patient conditions, research is increasingly focusing on the retrieval of dynamic knowledge sources as a complement to static repositories. For example, one study enabled the integration of real-time sensor data, which proved particularly beneficial for handling complex and evolving patient cases []. However, concerns about the quality of external knowledge sources remain a significant challenge. Graph construction is increasingly shifting from deep learning-based methods to LLM-driven generation. Although this approach improves efficiency, it often fails to reflect the specific procedures and accuracy required in medical and nursing workflows []. Future work should focus on improving the quality of external knowledge sources. For example, building event-centered cognitive KGs that align with disease progression can support dynamic reasoning and enhance the ability of RAG systems to manage the complexity of real clinical settings.

    Most studies focus on the final performance of the system, neglecting the analysis of each development stage. In contrast, this study examined all stages of RAG system development, enabling clinicians and nurses to better understand its internal functioning and take a leading role in guiding its design and implementation. Our study revealed the evolution from keyword-based methods to a deeper semantic understanding in the intent recognition stage. Because of the considerable variability in patient communication styles and health literacy levels, their queries are often diverse and unstandardized, rendering keyword-based methods inadequate for effectively handling patient-generated natural language questions []. To address this, recent RAG frameworks have increasingly introduced semantic-focused techniques early in the pipeline, including expansion, disambiguation, and decomposition []. However, in real-world clinical practice, these methods often introduce noise and add computational burden, and even small delays can compromise their usefulness in time-critical settings, such as emergency triage and bedside decision-making. Therefore, future work should explore optimization strategies that balance retrieval precision with efficiency, enabling scalable deployment of RAG systems in routine health care settings.

    This review identified a notable shift at the knowledge retrieval stage, from ambiguous semantic matching to logic-driven dynamic retrieval. Most current studies still rely heavily on dense retrieval methods based on semantic similarity. Although these methods perform well in capturing general semantic resemblance, they often fail to recognize strict clinical logic, such as negation and hierarchical structures. As a result, they may return outputs that are semantically similar but clinically inconsistent or contextually inappropriate [,]. Although recent work has introduced logic-driven dynamic retrieval methods that incorporate clinical reasoning and contextual adaptation into the retrieval process, these approaches still face significant limitations []. In particular, current methods often fail to recognize temporal sequences and hierarchical structures that are critical in medical and nursing contexts. Therefore, future research should focus on developing retrieval frameworks capable of deep understanding and using these relationships to provide more accurate and context-aware support.

    Our review also identified a trend in the retrieval stage, shifting from passive to active knowledge retrieval. Instead of simply returning relevant content, emerging systems can adjust both what information they retrieve and how they retrieve it, based on real-time contexts, such as changes in the patient’s condition or the history of queries []. This proactive retrieval approach holds particular promise for active patient management by providing more timely and context-aware support. Building on this trend, we cautiously speculate that future large models may evolve to proactively interact with the external world and continuously generate feedback without relying entirely on human-provided knowledge []. However, such capabilities are still in their early stages. A major challenge that remains is building trust. To gain acceptance in clinical settings, proactive agents must be able to reliably interpret complex situations and clearly explain their actions. Without robust mechanisms for accountability and transparency, they may be perceived as unsafe or untrustworthy. Therefore, the immediate research goal may not be full autonomy, but rather developing “human-in-the-loop” systems in which proactive agents suggest actions or information that clinicians or nurses can quickly validate, modify, or reject, seamlessly integrating AI proactivity with human oversight.

    In terms of knowledge integration, most RAG frameworks in our review still follow the approach of directly feeding all retrieved chunks into the language model context []. Although simple, this often leads to fragmented, inconsistent, or clinically irrelevant outputs, especially in the high-stakes environments of medicine and nursing []. A growing body of research is moving from simple information aggregation to logically coherent context construction. For example, the studies reviewed mention techniques such as evidence reranking, authenticity verification, and knowledge compression, all designed to prioritize high-quality medical knowledge before generation []. However, when dealing with multimodal data, these techniques still fail to achieve effective knowledge integration. In real-world clinical scenarios, effective decision-making often requires the synthesis of heterogeneous data types, including text, images, structured records, and real-time sensor signals []. Future efforts should focus on frameworks that can effectively align across modalities to support more comprehensive, accurate, and patient-centered outputs.

    Reasoning is essential in medical and nursing practice, where professionals must continuously interpret patient condition changes, formulate hypotheses, gather additional information, and identify underlying causes to determine appropriate interventions. LLMs can only truly support clinical work if they acquire this reasoning ability, which is still underdeveloped in current systems. Current research primarily focuses on enhancing the reasoning capabilities of LLMs through prompting techniques. However, these methods are fundamentally constrained by their reliance on associative learning rather than causal inference []. While excelling at pattern recognition, they struggle to mimic the abductive or deductive reasoning required in medical diagnosis and nursing care planning. In addition, some studies attempt to model reasoning using annotated clinical formats, such as subjective, objective, assessment, and plan []. However, these approaches primarily facilitate implicit pattern imitation rather than explicit learning of causal mechanisms, and struggle to capture the causal relationships embedded in clinical and nursing workflows. To address these limitations, future work should incorporate causal science approaches, such as causal graphs and structural causal models, to constrain model outputs, thereby improving the reasoning performance of LLMs [].

    A central finding of this review is the profound imbalance between the medical and nursing applications of RAG. Although RAG frameworks have been applied across various scenarios, only 6% (4/67) of the included studies focused on the nursing domain, and these were primarily limited to question-answering tasks. Core nursing practices, such as proactive patient management in home care settings, remain largely unexplored []. One possible reason is the dominant focus on physician-centered workflows, which has led to a relative lack of resources for nursing applications. Publicly available datasets and evaluation benchmarks, for example, are typically designed around clinician-driven tasks []. However, nursing reasoning is as complex as clinical decision-making, involving continuous monitoring, real-time decision-making, and frequent patient interactions []. KG-based RAG, which is capable of retrieving 2- or 3-hop entities, is well suited to support such complexity. Furthermore, while medical knowledge systems are relatively well established, nursing still lacks standardized and structured knowledge representations, which hinders the effective integration of nursing knowledge into RAG systems []. To truly bridge this gap, we call for a concerted effort that not only advances nursing knowledge modeling and benchmark development but also equips nurses with education on RAG and related AI technologies, thereby enabling more widespread and equitable integration of RAG into nursing practice.

    Ethical concerns such as bias, privacy, and safety are critical when applying RAG-based LLMs in the medical and nursing domains []. Our review shows that only a small number of studies have attempted to address these issues, highlighting significant room for improvement. Although RAG offers significant potential, its use must be guided by ethical standards to protect patient privacy and ensure safety. For example, connecting to external databases may risk exposing sensitive information such as prescription records []. Current mitigation approaches often rely on static safeguards, such as the removal of personally identifiable information and the implementation of role-based access controls []. However, the dynamic and context-sensitive nature of clinical privacy often renders existing methods inadequate, highlighting the need for future research to develop more adaptive privacy-preserving mechanisms, such as differential privacy, real-time consent management, and query auditing tools that can respond to evolving regulatory requirements []. Beyond privacy, patient safety and algorithmic bias represent major ethical challenges. To ensure safety, RAG-based systems should incorporate proactive measures, such as comprehensive adversarial testing and simulation of edge-case scenarios []. At the same time, algorithmic bias, which may exacerbate health disparities, should be mitigated through systematic bias audits, fairness-aware algorithms, and transparent reporting of model performance across diverse demographic groups.

    Limitations

    This study has several important limitations. First, it included only English-language literature. Although translating non-English studies could introduce biases or inaccuracies, this exclusion may have led to the omission of relevant research in other languages. Second, preprints were included to capture the most recent developments in this rapidly evolving field. However, as preprints lack peer review, they may overrepresent unvalidated innovations, potentially introducing bias into the findings. Therefore, conclusions drawn from these sources should be considered preliminary, and future reviews may reassess the evidence once these preprints are formally published and peer reviewed. In addition, the number of nursing-focused studies included in the review was relatively small, despite using nursing-specific search terms. Although we conducted supplementary searches of gray literature sources, no additional eligible nursing-related studies were identified. As such, findings related to nursing should be interpreted with caution. Further research is needed to validate and extend these findings within the nursing context. Finally, because of the lack of specialized evaluation tools for the emerging field of RAG, we used MINIMAR for quality assessment. Although not ideal, MINIMAR was the most appropriate available framework for evaluating RAG systems at this stage.

    Conclusions

    This review summarizes the current applications and trends of RAG frameworks in the medical and nursing domains. We classified RAG types and analyzed their techniques across 4 functional stages. Although early efforts toward logic-driven reasoning exist, alignment with clinical and nursing workflows remains limited, highlighting a key direction for future research. In addition, we found a profound imbalance between the medical and nursing applications of RAG and call for greater attention to nursing-specific needs.

    This research was supported by a grant (CX23YZ02) from the Chinese Institutes for Medical Research, Beijing, and the Key Program of the National Natural Science Foundation of China (72034005).

    All data generated or analyzed during this study are included in this published paper and its multimedia appendices.

    YM: conceptualization, data curation, investigation, methodology, software, and writing—original draft

    YW: conceptualization, methodology, writing—review and editing, supervision, validation, and funding acquisition.

    All authors have read and approved the final manuscript.

    None declared.

    Edited by J Sarvestan; submitted 12.Jul.2025; peer-reviewed by Y You, J Thrift, F Al Dhabbari, M Al Zoubi, S Zhao, X Liu; comments to author 29.Jul.2025; revised version received 06.Aug.2025; accepted 05.Sep.2025; published 21.Oct.2025.

    ©Yiqun Miao, Yuhan Zhao, Yuan Luo, Huiying Wang, Ying Wu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 21.Oct.2025.

    This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

    Continue Reading