Carolyn Meinel's Research Project (continued #3)

Featured below are indications that Semantic Studio, as of 2019, was able to detect logical constructions. At first I had trouble detecting negations (Figure 2) but succeeded in Figure 3. I also tested (Figure 4) a short list of keywords: "similar, dissimilar, likewise, unlike"; that approach worked poorly, as discussed with Figure 4 below.
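For readers who want to see the general shape of this kind of query, here is a minimal sketch of ranking rationales by semantic similarity to a query phrase. It is NOT Semantic Studio's actual implementation, which I do not have; it uses the open-source sentence-transformers library, an off-the-shelf model, and a couple of made-up rationales purely as stand-ins.

# A minimal sketch, assuming the open-source sentence-transformers library as a
# stand-in for Semantic Studio. The model name and sample rationales are made up
# for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf model

query = "thing1 happened and thing2 happened"
rationales = [
    "Both the summit and the ceasefire took place, so I raised my forecast.",
    "Historically this kind of event occurs in about 12 percent of comparable years.",
    # ...in the real test, the ~8k MTurk rationales would go here
]

query_vec = model.encode(query, convert_to_tensor=True)
rationale_vecs = model.encode(rationales, convert_to_tensor=True)

# Cosine similarity of each rationale to the query; higher means a closer match.
scores = util.cos_sim(query_vec, rationale_vecs)[0]

# Rank rationales from highest to lowest score, like the spreadsheets in the figures.
for score, text in sorted(zip(scores.tolist(), rationales), reverse=True):
    print(f"{score:.3f}  {text}")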

Problem: the entire dataset I used that year was created by a poorly characterized group of ~500 Amazon Mechanical Turk Prime workers who generated ~8k rationales via IARPA's Hybrid Forecasting Competition. The total number and composition of these MTurkers varied during the creation of this dataset. I do not know whether they could see each other's rationales and/or forecasts; IARPA isn't telling us. However, I did some sleuthing and found that some (many?) MTurkers speed their work by using algorithms and sharing information on covert forums. In the results below, text in red in some cases refers to my concerns about cheating, noted in 2019-2020 as I evaluated the results.

Since then, more research on how MTurkers cheat has appeared. Dateline 9 May 2024: "Psychology study participants recruited online may provide nonsensical answers. Data quality suffers in some studies using the MTurk platform—but participant screening and other safeguards can help." Source: doi: 10.1126/science.z7jdvzt

Also notable, this query had the counterintuitive result that the ten lowest scored rationales all cited "base rate" data, meaning either historical comparisons or time series data. Indeed, rationales built on base rates turned up among the lowest scores of these logical-constructions queries about as easily as they did when I queried for base rates directly. I do not know whether this is an artifact of the MTurk data. I intend to test this using new data from my AI Alignment testing, once we get AI Translate feeding data to our human forecasters, who will NOT be MTurkers.

Query for Fig. 1a below: "thing1 happened and thing2 happened."

Figure 1a: Query string: "thing1 happened and thing2 happened."

Figure 1b below: the same spreadsheet, but listing the bottom ten scored rationales, all featuring historical data. Wow.

Above, Figure 1b, bottom ten scored rationales.

Next, Fig. 2a below shows an example of a logical construction of "not" or "unlikely." As of 2019-2020, Semantic Studio did not reliably distinguish between "not" and "not not," or between "unlikely" and "likely."

Figure 2a: Query: "thing1 happened but also thing2 happened."

Figure 2b below, bottom ten scores.


Figure 2b: Lowest ten scored rationales for query "thing1 happened but also thing2."

Next, Figure 3 below shows victory in querying for negation. Query string: "thing1 happened despite thing2 happening." There were no detections of "despite" in the top ten. This is good because it shows the tool did not operate as a mere keyword detector. These rationales appear to have been selected for containing generally negative terms such as "low chance," "isn't," "don't," "won't," "chances... extremely low," "doubt," "nothing... will be happening," "0 percent chance," "lowered my expectations."

Figure 3: Query string: "thing1 happened despite thing2 happening."
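One way to back up the claim that these were not mere keyword hits is to check whether the literal word "despite" appears among the top-ranked rationales, versus the general negative terms listed above. The sketch below is my own toy check, not Semantic Studio's output; the top_ranked texts are made up, standing in for the top rows of a ranking like the one in the earlier sketch.

# Sanity check on top-ranked rationales (toy data here; in practice these would
# be the top ten rows from the ranking sketch above): did they merely echo the
# query keyword "despite", or do they carry general negative terms instead?
NEGATIVE_CUES = ["low chance", "isn't", "don't", "won't", "extremely low",
                 "doubt", "0 percent", "lowered my expectations"]

def cue_hits(text, cues):
    """Return which of the given phrases literally appear in the rationale."""
    lowered = text.lower()
    return [c for c in cues if c in lowered]

top_ranked = [
    "There is a low chance this resolves yes; I doubt the talks go anywhere.",
    "It won't happen; chances are extremely low given the sanctions.",
]
for text in top_ranked:
    print("despite" in text.lower(), cue_hits(text, NEGATIVE_CUES))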

In Figure 4 below, I queried a short list of keywords: "similar, dissimilar, likewise, unlike." This approach was poor at detecting relevant rationales, apparently because those keywords are so common. Surprisingly, though, the highest scored rationales were high on opinions (analyses) while the lowest scored were high on facts.
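As an illustration of why such common comparison words make a weak query, here is a crude keyword-count scorer. This is my own toy code, not Semantic Studio: a genuine historical comparison and a throwaway mention score identically, which is roughly the failure mode I saw with the Figure 4 query.

import re

# Crude keyword-count scorer (toy code, not Semantic Studio). Raw keyword
# counts cannot tell apart a genuine historical comparison from a throwaway
# mention when the keywords themselves are this common.
KEYWORDS = ["similar", "dissimilar", "likewise", "unlike"]

def keyword_score(rationale):
    """Count whole-word occurrences of any comparison keyword."""
    text = rationale.lower()
    return sum(len(re.findall(rf"\b{re.escape(w)}\b", text)) for w in KEYWORDS)

print(keyword_score("This is similar to the 2014 standoff, so I expect escalation."))  # 1
print(keyword_score("Unlike what most people think, I just have a gut feeling."))      # 1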


© 2024 Carolyn Meinel. All rights reserved.