In the rapidly evolving landscape of artificial intelligence, OpenAI's Assistant has emerged as a powerful tool for many applications, including document analysis and question answering. However, as with any cutting-edge technology, users are discovering both its potential and its limitations. This blog post delves into the real-world challenges our team encountered when using OpenAI Assistant to analyze complex regulatory documents in the world of sanctions, including enforcement actions from the Office of Foreign Assets Control (OFAC) and consent orders from the Financial Crimes Enforcement Network (FinCEN), the New York Department of Financial Services (NYDFS), and other regulatory agencies.
Setting the Stage: OFAC enforcement actions and the need for an assistant
OFAC enforcement actions are public documents that contain a reasonably detailed analysis of why a fine was issued against a given company to settle a sanctions violation. They go into the details of the regulatory violation, the actors involved, the actions the organization may have taken, and how to remediate the gaps the regulators found. By analyzing enforcement actions, sanctions operations teams can assess the potential monetary risk associated with gaps in their own programs. The actions can also serve as a blueprint for remediation where such gaps exist.
While OFAC is the most active agency issuing fines for sanctions violations, other regulatory agencies can work jointly with OFAC and expand the scope of the findings where they overlap with other functions and compliance programs. FinCEN, the NYDFS, and the Department of Justice (DOJ) have all issued consent orders that include violations of the sanctions regime.
OFAC enforcement actions aren't extensive documents; they usually run from 4 to 10 pages depending on the scope of the analysis, and OFAC issues about a dozen per year. Consent orders, on the other hand, can be very lengthy, spanning tens of pages.
Therefore, a sanctions operations team that wants to use the information in these actions to assess risk and research remediation and mitigation measures may find the work daunting and time-consuming. It is not too dissimilar to the role of a paralegal in the legal world.
Enter the AI assistant
AI technology is evolving rapidly, and significant investment is being poured into solving this type of problem. OpenAI, one of the main players in the AI world, released an initial version of such assistants. Before diving into the challenges, it's important to understand what OpenAI Assistant is and its intended use. OpenAI Assistant is an AI-powered tool designed to assist with a wide range of tasks, from answering questions to generating content and analyzing data. It's built on OpenAI's large language models (LLMs), the same models that power ChatGPT, and can be customized for specific use cases. In our case we used one of the three available tools, file search; the other two, code interpreter and function calling, were less applicable to our use case.
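To make this concrete, below is a minimal sketch of how an Assistant can be pointed at a set of enforcement actions through the file search tool. It follows the publicly documented Assistants API flow; the file name, instructions, and model choice are placeholders rather than our production configuration.

```python
# Sketch: wiring an OpenAI Assistant to regulatory documents via the file_search tool.
# Assumes an OpenAI Python SDK version that exposes the Assistants API beta namespace;
# the file name, instructions, and model are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# 1. Create an assistant with the file_search tool enabled.
assistant = client.beta.assistants.create(
    name="Sanctions Enforcement Analyst",
    instructions=(
        "Answer questions using ONLY the attached enforcement actions and consent orders. "
        "If the answer is not in the documents, say so explicitly."
    ),
    model="gpt-4o",
    tools=[{"type": "file_search"}],
)

# 2. Upload the documents into a vector store and attach it to the assistant.
vector_store = client.beta.vector_stores.create(name="OFAC enforcement actions")
with open("ofac_enforcement_action_example.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=[f]
    )
assistant = client.beta.assistants.update(
    assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# 3. Ask a question in a thread and run the assistant against it.
thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "What compliance gaps led to the settlement?"}]
)
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
answer = client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
print(answer.data[0].content[0].text.value)
```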
In theory, this makes an AI assistant an ideal candidate for parsing dense regulatory documents and providing insights across them.
The Reality Check: Challenges in the Trenches
Our team's experience with OpenAI Assistant revealed several significant challenges that impacted its effectiveness in handling regulatory document analysis:
Hallucinations: The Mirage of Accuracy
One of the most pressing issues we encountered was the Assistant's tendency to generate plausible-sounding but entirely fabricated information. This phenomenon, known as "hallucination" in AI circles, led to a significant waste of time as team members found themselves constantly double-checking every answer against the source documents.
This issue isn't unique to our experience. The widely cited "Stochastic Parrots" paper, presented at the ACM Conference on Fairness, Accountability, and Transparency, raised similar concerns, arguing that large language models produce fluent, plausible-sounding text that is not grounded in fact or communicative intent (Bender et al., 2021).
Source Adherence: The Wandering AI
Another significant challenge was getting the Assistant to reliably answer questions based solely on specified source documents. Despite explicit instructions, the model often ignored the provided documents, instead drawing from its broader training data or other unspecified sources.
This lack of source adherence is particularly problematic in regulatory contexts, where precision and accuracy are paramount. The inability to constrain the AI's responses to authorized sources severely limits its utility in sensitive applications.
Cross-Document Analysis: The Missing Link
Our team found that the Assistant struggled with tasks requiring analysis across multiple documents. For instance, it couldn't reliably sum up penalty amounts from different documents or categorize documents based on specific criteria.
This limitation significantly hampers the Assistant's ability to provide comprehensive insights across a set of regulatory documents, a crucial requirement for many compliance and legal applications.
Memory Inconsistencies: The Forgetful Assistant
The Assistant's ability to remember and utilize information from previous interactions was unreliable. This inconsistency in maintaining context across a conversation made it challenging to build upon previous answers or conduct multi-step analyses.
Document Length Sensitivity: The Scale Problem
We observed that the Assistant's performance degraded noticeably with longer documents. While it handled shorter documents reasonably well, its reliability decreased as document length increased.
This limitation poses significant challenges for analyzing comprehensive regulatory documents, which often span dozens or hundreds of pages.
Our experiences are not isolated. Many organizations and researchers have reported similar challenges with AI language models in specialized domains. For instance, a report by the AI Now Institute at New York University warns about the risks of using AI in high-stakes decision-making processes, particularly in regulatory and legal contexts (Crawford et al., 2019).
Charting a Path Forward
While these challenges are significant, they're not insurmountable. Here are some of the strategies we implemented inside our core Yanez AI to address these limitations:
Human in the Center: Ensured that the technology enhances human capabilities, fosters meaningful collaboration, and aligns with the values and needs of the people it serves.
Rigorous Fact-Checking Protocols: Built a systematic process for verifying AI-generated information against source documents. This involved developing custom verification tools, integrating human oversight more closely into the workflow, and implementing evaluation frameworks such as the RAG Triad proposed by TruEra (now part of Snowflake); a scoring sketch appears after this list.
Domain-Specific Fine-Tuning: Worked with AI developers to create more specialized models trained specifically on regulatory documents. This improved source adherence and reduced hallucinations in this context.
Enhanced Prompt Engineering: Developed more sophisticated prompts that better constrain the AI's responses to the relevant sources and improve cross-document analysis; a representative template appears after this list.
RAG Pipelines: Evaluated several retrieval-augmented generation (RAG) frameworks with different techniques and capabilities until converging on the one that proved most reliable based on expert human feedback; a minimal pipeline sketch appears after this list.
Metadata: Extracted characteristics and features from each document, coalesced them into a higher-level document, and fed that document in as an additional data source; see the roll-up sketch after this list.
Continuous Evaluation and Feedback: Established a system for ongoing evaluation of the AI's performance and fed these insights back into the development process to drive continuous improvement.
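To illustrate the fact-checking and evaluation items, here is a simplified sketch of how the three RAG Triad dimensions (context relevance, groundedness, and answer relevance) can be scored with an LLM acting as a judge. The prompts, scale, and function names are our own illustrative assumptions, not TruEra's implementation.

```python
# Sketch: scoring the RAG Triad (context relevance, groundedness, answer relevance)
# with an LLM-as-judge. Prompt wording, scale, and names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def judge(criterion: str, question: str, context: str, answer: str) -> float:
    """Ask the model to rate one dimension on a 0-10 scale; return it normalized to 0-1."""
    prompt = (
        f"Rate the following on a 0-10 scale for {criterion}.\n"
        f"Question: {question}\nRetrieved context: {context}\nAnswer: {answer}\n"
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the model complies with "number only"; production code would validate this.
    return float(response.choices[0].message.content.strip()) / 10.0

def rag_triad(question: str, context: str, answer: str) -> dict:
    return {
        "context_relevance": judge(
            "how relevant the retrieved context is to the question", question, context, answer),
        "groundedness": judge(
            "how well the answer is supported by the retrieved context", question, context, answer),
        "answer_relevance": judge(
            "how directly the answer addresses the question", question, context, answer),
    }
```

In a workflow like ours, a low groundedness score would be the natural trigger for routing an answer to human review rather than accepting it automatically.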
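The prompt-engineering item is easiest to show as a template. The wording below is representative of the kind of constraints we layered on, not the exact production prompt.

```python
# Sketch: a source-constrained prompt template of the kind described above.
# The wording is representative, not our production prompt.
ANALYST_PROMPT = """You are a sanctions compliance analyst.

Rules:
1. Use ONLY the documents provided between the <documents> tags. Do not use prior knowledge.
2. Cite the document name and section for every factual claim.
3. If the documents do not contain the answer, reply exactly: "Not found in the provided documents."
4. When a question spans several documents, answer per document first, then aggregate.

<documents>
{retrieved_chunks}
</documents>

Question: {question}
"""
```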
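For the RAG pipeline item, the sketch below shows the basic retrieve-then-generate shape. The chunking, embedding model, and brute-force similarity search are deliberately simplified; the pipeline we converged on adds reranking and the evaluation loop described above on top of this skeleton.

```python
# Sketch: a minimal retrieve-then-generate (RAG) pipeline. Chunking, embedding model,
# and brute-force cosine similarity are simplified placeholders for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def chunk(text: str, size: int = 1200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents: dict[str, str]):
    """documents maps a document name to its full text."""
    chunks = [(name, piece) for name, text in documents.items() for piece in chunk(text)]
    vectors = embed([piece for _, piece in chunks])
    return chunks, vectors

def answer(question: str, chunks, vectors, top_k: int = 5) -> str:
    query = embed([question])[0]
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    best = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]
    context = "\n\n".join(f"[{name}] {piece}" for name, piece in best)
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content":
            "Answer using only the excerpts below and cite the document names.\n\n"
            f"{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content
```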
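Finally, the metadata item: the idea is to distill each enforcement action into a small structured record and then coalesce those records into a single roll-up document that is indexed alongside the originals. The field names below are illustrative assumptions, not our production schema.

```python
# Sketch: extracting per-document metadata and coalescing it into a higher-level
# "roll-up" document used as an additional data source. Field names are illustrative.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["agency", "respondent", "penalty_amount_usd", "violation_summary", "remediation_steps"]

def extract_metadata(name: str, text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            f"Extract the fields {FIELDS} from this enforcement action as a JSON object.\n\n{text}"}],
    )
    record = json.loads(response.choices[0].message.content)
    record["source_document"] = name
    return record

def build_rollup(documents: dict[str, str]) -> str:
    # One JSON record per document; indexing this roll-up next to the originals makes
    # cross-document questions (totals, counts, comparisons) far more tractable.
    return "\n".join(json.dumps(extract_metadata(name, text)) for name, text in documents.items())
```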
Conclusion
OpenAI's base Assistant, and likely other general-purpose assistants, will struggle with documents and data sources that are highly domain-specific and that share the characteristics of regulatory documents, where accuracy and correctness are fundamental. However, by acknowledging these limitations and working toward solutions, we were able to harness the potential of AI while mitigating its risks.
Our data shows that Yanez AI performs close to three times better than the OpenAI Assistant when focused on sanctions-related regulatory documents. Complementing general-purpose LLMs with the proper framework, agents, and tools therefore yields a significant improvement in correctness and accuracy.
As we continue to explore the integration of AI in regulatory analysis, it's crucial to maintain a balanced perspective—embracing the technology's potential while remaining acutely aware of its current limitations. Only through this nuanced approach can we continue to develop AI systems that truly enhance our ability to navigate the complex world of regulatory compliance.
You can explore Yanez AI's work on regulatory compliance by interacting with the Yanez AI Interactive Infographic.
---
References:
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
Crawford, K., Dobbe, R., Dryer, T., Fried, G., Green, B., Kaziunas, E., ... & Whittaker, M. (2019). AI Now 2019 Report. AI Now Institute, New York University.