Tutorial: GPT Quorum Powered NEISS Analysis Method
Overcoming Subjective Hurdles With AI-Powered Injury Classification
The AI Advantage in Injury Categorization
We present an uncomplicated GPT Quorum technique for categorizing injuries, mainly when applied to narrative accounts from the NEISS database.
The National Electronic Injury Surveillance System (NEISS) is a database that tracks injuries treated in hospital emergency departments across the United States. In the realm of NEISS injury categorization, the application of modern Large Language Models (LLMs) introduces precision, pace, and practicality, notably sidestepping most of the subjective stumbling blocks inherent in human analysis. The pitfalls of subjectivity often taint the clarity and consistency necessary for accurate data interpretation:
Inconsistent Interpretations: Personal perspectives, steeped in individual experiences and emotions, can dramatically diverge, disrupting the establishment of consistent standards in data categorization.
Biased Judgments: Unseen prejudices or overt biases can skew subjective analysis, potentially resulting in decisions that lean on stereotypes or baseless assumptions rather than unbiased logic.
Potential for Misunderstanding: Without clear criteria or standards, subjective judgments can lead to misunderstandings or misinterpretations, as two individuals might interpret the same situation or data differently.
Elusive Accountability: The cloak of subjective reasoning can shield arbitrary decisions, hindering the process of holding individuals responsible for their interpretative actions.
Illustratively, consider a NEISS narrative:
Suppose you were tasked with deciding whether this narrative depicts the ingestion of a Spherical Rare Earth Magnet (SREM). Would you dub it an "In-Scope" injury linked to a likely SREM swallow? Depending on the definition of an SREM, determinations could differ dramatically. And given the frequent vagueness of NEISS narratives, two reviewers might tackle the same text and arrive at opposing outcomes, even with identically explicit instructions.
Enter AI's role. For researchers sifting through injury data, the shift to becoming prompt engineers for AI means crafting clear, concise instructions, easily digestible by both human and machine. This approach is not only efficient and economical but also enhances verification ease for third parties wanting to confirm adherence to described definitions, boosting transparency and trust in the epidemiologist's output.
As an observer scrutinizing another's injury analysis, two critical queries arise: "Is the injury count executed as asserted?" and "Do the categorizations align logically with their intended purpose?" Here, the GPT-4 Quorum test becomes a valuable tool for the former question, providing a structured method for validation.
Example Case of Magnet Metrics - GPT-4 Quorum in Action
It is inherently impossible to eliminate subjectivity from a naturally subjective task. The aim, then, isn't to eliminate subjectivity but to harness consistent, replicable methods via AI, particularly the advanced GPT-4. While lower-tier LLMs like GPT-3.5 and Claude 2 can cater to your needs, GPT-4 is the most capable system currently standing.
Advantages of AI involvement include swift, repeatable processes and an admirable adherence to detailed directives for compact tasks. However, like humans, AI isn't entirely exempt from error or inconsistency. Even GPT-4, while currently reigning as the most sophisticated LLM, occasionally errs. To enhance its efficacy, certain strategies are instrumental:
GPT-4 Quorum Configuration: Deploying three GPT-4 instances with identical prompts cultivates a 'quorum,' mitigating minority viewpoints and revealing the range of variation in responses.
Complex Prompt Hacks: Prompting GPT-4 to think "step by step", "explain your thinking", and "take a deep breath" dramatically increases accuracy, and leaves a trail of thought that can be retrospectively analyzed.
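The quorum configuration above can be sketched in a few lines of Python. This is a minimal illustration, not the actual tooling used in this tutorial; the function name and response strings are hypothetical:

```python
from collections import Counter

def quorum_verdict(responses):
    """Return the majority verdict from several independent model runs,
    plus a flag for whether any instance dissented."""
    tally = Counter(responses)
    verdict, votes = tally.most_common(1)[0]
    return verdict, votes < len(responses)

# Three hypothetical GPT-4 runs on the same NEISS narrative:
runs = ["1", "1", "Excluded: magnet ingestion not confirmed"]
verdict, dissent = quorum_verdict(runs)
print(verdict, dissent)  # prints: 1 True
```

With three instances, a 2-to-1 split still yields a verdict while the dissent flag marks the incident for optional human review.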
Delving into a practical application provides clarity on these abstract concepts. As a case study, we navigate the initial stages of the Consumer Product Safety Commission's (CPSC) procedure for classifying 'In-Scope' injuries in their 2022 Magnet Set Rulemaking (see Pg 172). Steps 1 and 2 exemplify the transition from a non-subjective to a subjective classification, and are presented as a tutorial for ease of reproduction.
By dissecting the CPSC's methodology for injury classification, we consider the potential for AI integration. Our focus isn't merely on the technology's ability to follow orders but on its capacity to do so with reduced error margins and increased transparency. Through this lens, we can appreciate the impact of AI, particularly the GPT-4 model, in refining the reliability of data categorization in regulatory environments.
Step 1: Creating "Master Set"
Step one of the classification process is a non-subjective phase, a straightforward scheme that doesn't necessitate the nuanced capabilities of AI. Here, the CPSC delineates the construction of a 'master set' of NEISS incidents, anchored by specific, unambiguous criteria:
Timeframe encompassing 01/01/2010 to 12/31/2021
Ingestions, which are signified by NEISS Diagnosis = 41
Narratives that contain "magnet" or other keywords.
The precision of these parameters allows for effortless execution through the NEISS interface and simple spreadsheet computation. Filters for date and diagnosis are entered directly into the NEISS query builder (https://www.cpsc.gov/cgibin/neissquery/home.aspx), and a basic formula efficiently identifies narratives containing 'MAGNET'. While the guidelines hint at hunting for 'other keywords', their actual impact is minimal, evidenced by a lone exception related to a science kit, which ultimately falls outside the 'In-Scope' designation.
If you're following these steps yourself, verify your results: there should be 1,945 incidents from 2010-2021, and summing the weights should yield a national estimate of 40,041. Copy these incidents into a separate sheet for Step 2.
Here is a prepopulated spreadsheet, expanded to include all years 2003-2021: https://docs.google.com/spreadsheets/d/1QcJuIGLgibjR0SQiUXopBR0YmKEEkn-dFtbw9LzjbNQ/edit?pli=1#gid=1020768321
In this demo sheet, column AA flags incidents whose narrative contains the text "MAGNET" and whose diagnosis is ingestion, using the formula:
=IF(AND(ISNUMBER(FIND("MAGNET",$W2)), OR($J2=41, $M2=41)), 1, 0)
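For readers working outside Google Sheets, the same Step 1 filter can be reproduced in pandas. This is a sketch with made-up rows; the column names (`Diagnosis`, `Other_Diagnosis`, `Narrative`) are assumptions and may differ from your actual NEISS export:

```python
import pandas as pd

# Toy stand-in for a NEISS export; column names are hypothetical.
df = pd.DataFrame({
    "Diagnosis": [41, 62, 41],
    "Other_Diagnosis": [None, 41, None],
    "Narrative": [
        "4YOM SWALLOWED A MAGNET",
        "2YOF INGESTED SMALL MAGNETS FROM A TOY",
        "5YOM FELL OFF A SWING",
    ],
})

# Mirror the spreadsheet formula: diagnosis 41 in either column,
# plus a "MAGNET" text match in the narrative.
is_ingestion = (df["Diagnosis"] == 41) | (df["Other_Diagnosis"] == 41)
mentions_magnet = df["Narrative"].str.contains("MAGNET", case=False)
master_set = df[is_ingestion & mentions_magnet]
print(len(master_set))  # prints: 2
```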
Step 2: Confirmed Magnet Ingestions Only
The CPSC's second step trims unconfirmed magnet ingestions, and is an excellent example of well written instructions that are well suited to AI interpretation with minimal modification. The classification criteria are concise, clear-cut, and even include a brief example for prompt inclusion. The concepts revolving around the resolution of uncertainty or ambiguity would be a task tricky to tackle with traditional spreadsheet rules, but ripe for GPT-4.
There are many tools that allow for GPT-4 integration into spreadsheets. Our method pairs Google Sheets with GPT Workspace (https://app.gpt.space/pricing). At $29/month, this GPT-4 tier grants unlimited API access.
From our 'master set' of 1,945 magnet mentions from Step 1, here is a GPT Workspace formula that merges the filtering criteria with the narrative content from cell W2:
The output of this command will be "1" if the incident is not excluded, otherwise a brief explanation of why the ingestion was excluded. Processing took a trifling 30 minutes for the entire 'master set'. For those treading this trail, we recommend duplicating the results or deploying GPT Workspace's 'Freeze' function to avoid accidental repeat processing.
After our first GPT-4 run, 1,853 confirmed magnet ingestions remained of the initial 1,945 "master set" incidents, a trim of 92 credited to ambiguous or uncertain magnet mentions. However, as ChatGPT veterans know, AI assertions aren't always accurate or consistent.
To test the tool's trustworthiness, we implement a quorum: running the classification thrice and tallying majority votes across the three GPT-4 runs. The consensus in our instance was 1,854 cases, with 91 pruned. Intriguingly, near-identical incident counts across runs didn't necessarily reflect identical incidents among the three AI judges. Of the 1,945 'master set' incidents, there were 27 instances of dissent, one GPT-4 against the other two. Still, our quorum concurred completely 98.6% of the time, a consistency challenging for humans to mimic.
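The dataset-level tallying described above can be sketched as follows. This is a toy example with four incidents and three verdicts each; the `keep`/`drop` labels stand in for the "1" or explanation strings GPT-4 actually returns:

```python
from collections import Counter

def tally_quorum(votes_per_incident):
    """For each incident, take the majority of the runs and
    count how many incidents had at least one dissenting run."""
    consensus, dissents = [], 0
    for votes in votes_per_incident:
        winner, count = Counter(votes).most_common(1)[0]
        consensus.append(winner)
        if count < len(votes):
            dissents += 1
    return consensus, dissents

# Hypothetical verdicts: 4 incidents, three GPT-4 runs each.
runs = [("keep", "keep", "keep"),
        ("keep", "drop", "keep"),   # one dissenter
        ("drop", "drop", "drop"),
        ("keep", "keep", "keep")]
consensus, dissents = tally_quorum(runs)
print(consensus.count("keep"), dissents)  # prints: 3 1
```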
Comparing the GPT-4 quorum to the CPSC's claims post-Step 2 unveils unexpected disparities. Using the same NEISS data and following the CPSC's own instructions, the 1,854 incidents project a national estimate of 37,719 injuries from 2010-2021, yet the CPSC posits approximately 26,600, stemming from 1,184 NEISS incidents.
Clearly, magnet ingestion mishaps are more prevalent than portrayed by the CPSC. Without even addressing the question of whether the CPSC's mid-process classification for all confirmed magnet ingestions makes sense, it's crystal clear: the criteria the CPSC publicized and what they practiced behind closed doors are worlds apart.
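The national estimates quoted throughout come from summing the per-incident NEISS sampling weights over the retained incidents. A minimal sketch with made-up weights:

```python
# Each NEISS record carries a sampling weight; summing the weights of
# the incidents kept by the quorum yields the national estimate.
# These values are illustrative, not real NEISS weights.
weights = [15.3, 80.7, 4.0]   # hypothetical per-incident weights
kept = [True, True, False]    # quorum verdict per incident

national_estimate = sum(w for w, k in zip(weights, kept) if k)
print(round(national_estimate, 1))  # prints: 96.0
```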
Note that the trick of asking GPT-4 to think "step by step" and to explain its reasoning was not used in this case study, since the results were already very consistent. An example of the application of this prompting technique can be seen in the 2022 "In Scope" Comparison report.
Step 3: Identifying In-Scope Ingestions
In the CPSC's concluding chapter, the focus is on "In-Scope" injuries: the avoidable injuries that would presumably be prevented if the "In-Scope" magnets were barred from the market. For readers following this tutorial, put your pencils down; Step 3 signals a pause in our spreadsheet saga. Yet Step 3 serves as a stark showcase of the 'not-to-dos' when drafting definite, concise classifications for either GPT-4 or human discernment.
Pages 173-175 of the Final Rule briefing (CPSC Link, Mirror) unravel the CPSC's "In-Scope" sorting, starting with subcategories like "Magnet Set," "Science Kit," and "Unidentified," ultimately lumped into broader bins: "In Scope" or "Exclusions."
For hypothetical consideration of GPT-4's potential role, the CPSC’s Step 3 exemplifies the need for crisp, compact classification guides for seamless AI integration. Take, for instance, the "Magnet Sets" criteria:
Despite comprising a mere fraction of Step 3, the "Magnet Set" segment spotlights several pitfalls that would need untangling before GPT-4 deployment:
Overly Verbose - Lengthy prompts are likely to result in inconsistent verdicts, whether classified by human intelligence or artificial intelligence. Brevity breeds clarity.
Vague - Phrases like "referred to as a magnet set" demand definition. We know it doesn't imply just a text match for "MAGNET SET", as there are only 3 instances of this after 2010. Explicit exemplars are essential.
Internally Contradictory - Discrepancies dwell between descriptors like "loose-as-received ingestible magnets for..." and subsequent classification criteria. Any contradictory cues, irrelevant to actual criteria, require removal before AI analysis.
External Dependencies - Inferring "magnet sets through product name" is likely to require outside insight beyond GPT-4's 2021 training cutoff. This may be unavoidable due to Section 6(b) disclosure rules, but when possible, additional descriptive conditions should be listed.
Hypothetically, Step 3 could be completed by AI if we knew the manner in which the CPSC actually categorized magnet ingestions. Each "individual magnet category" described could be interpreted by a GPT-4 quorum, after being trimmed, summarized, and specified.
Unfortunately, Step 3 is not auditable, since the CPSC has not released the NEISS identification list of the 1,015 In-Scope Magnet Ingestions, nor of the 1,184 Confirmed Magnet Ingestions. Thus the public is prevented from scrutinizing whether Step 3 was actually performed according to the CPSC's own claimed instructions. Furthermore, it would be impossible to reach the same numbers as the CPSC's Table 1 even if the CPSC had diligently followed its own instructions, due to the discrepancies in the input NEISS data from Step 2. It is not in the scope of this tutorial to consider whether it is reasonable to blame all unidentified magnet ingestions on high powered magnet sets in an effort to ban adults from purchasing them.
You can find our demo spreadsheet with GPT-4 outputs and spreadsheet formulas visible here: https://docs.google.com/spreadsheets/d/1QcJuIGLgibjR0SQiUXopBR0YmKEEkn-dFtbw9LzjbNQ/edit?pli=1#gid=350948132
The sheet titled "41 MAGNET ONLY" contains all years from 2003-2022. All GPT classifier outputs, quorum consensus, and dissents are in columns AB-AF. The "step by step" and "explain your thinking" prompting strategy was not necessary in this example due to the short and straightforward prompt, but an example can be seen in the 2022 MagnetSafety.org NEISS Analysis / MAGN sheet, with the prompt in the header note. To see the plain-language thought process of each GPT-4 instance, make a copy and expand the columns between AS-AY.
Good thing we live in the age of AI: what used to be an impossible burden of manual classification is now a relatively cheap automated endeavor. And a grand applause to the transparency-minded CPSC of 40 years ago for the democratized design of the NEISS database; without publicly available NEISS data, it would be impossible to double-check the CPSC's work.
Comments or suggestions are welcomed. Email: outreach (@) magnetsafety.org
Sign up for notifications here.