DOI: 10.4312/linguistica.65.1.127-143

Matej Meterc*
ZRC SAZU, Inštitut za slovenski jezik Frana Ramovša

Rok Mrvič**
ZRC SAZU, Inštitut za slovensko narodopisje

THE BEST KNOWN AND FREQUENTLY USED SLOVENE PROVERBS ACCORDING TO CHATGPT-4O: EXPLORING THE POTENTIAL FOR AN AI-BASED PAREMIOLOGICAL MINIMUM

1 SOURCES OF LINGUISTIC DATA ON CONTEMPORARY, ACTIVELY USED PAREMIOLOGY IN THE SLOVENE LANGUAGE

Prior to the emergence of modern lexicography, and in particular the corpus-driven lexicography of recent decades, Slovene proverbs and related paremiological expressions were preserved, beyond their primary domain of oral communication, in collections.¹ These proverb collections offered readers heterogeneous lists of expressions, drawn from earlier compilations and intermingled with items known to the collectors from active usage or recorded during fieldwork.

Among Slovene dictionaries that documented contemporary proverbs at the time of their compilation, the Dictionary of the Slovene Standard Language (Slovar slovenskega knjižnega jezika, SSKJ) stands out, containing approximately 600 paremiological expressions. Research conducted in the context of identifying a Slovene paremiological minimum and optimum revealed (Meterc 2017: 213) that 73.6% of the paremiological expressions found in the SSKJ are still known today by more than half of the survey respondents. Corpus-based research on Slovene proverbs for the establishment of a paremiological optimum (Meterc 2017: 75–107) provided an empirical foundation for the integration of paremiology into the Dictionary of the Slovene Standard Language, 3rd Edition, published on the Fran.si portal since 2016 under the title eSSKJ. In this dictionary, phraseology and paremiology are systematically represented through corpus analysis and surveys (Meterc, Jakop 2016; Meterc 2019). Similar corpus-based methodologies underlie the Dictionary of Proverbs and Similar Paremiological Expressions (Slovar pregovorov in sorodnih paremioloških izrazov, SPP; Meterc 2020–), which employs a broader range of corpora (Meterc 2023: 124–125) and also draws on data from studies of the Slovene paremiological minimum and optimum.

* matej.meterc@zrc-sazu.si
** rok.mrvic@zrc-sazu.si
¹ This contribution was prepared within the framework of research programmes P6-0038 and P6-0088, and project J6-50197, financed by the Slovenian Research and Innovation Agency (ARIS).

Artificial intelligence undoubtedly holds great potential for the analysis of linguistic data within lexicography (de Schryver 2023; Jakubíček, Rundell 2023). The focus of this paper, however, lies elsewhere: we concentrate on a much more basic level of inquiry, namely the retrieval of representative paremiological expressions in a given language, based on their frequency and familiarity among speakers, by posing direct questions to an AI model. The aim is to assess the extent to which the AI model (ChatGPT-4o) serves as a useful and reliable source of paremiological material.

A brief note on the selection of the large language model (LLM) used in our research: we chose ChatGPT due to its wide popularity and high capability, which, shortly after the start of our study, began to be rivaled by several other emerging models. After testing various later models (such as Grok, DeepSeek, and the Slovene GaMS) as well as subsequent versions of ChatGPT, we can add the observation that, at the time of this article's publication, the results of these models appear broadly similar at first glance. Nevertheless, the newly emerging models and updated versions of existing ones represent significant potential for future research in paremiology. An in-depth comparative analysis of results produced by different models, however, lies beyond the scope of this article.
2 AN ANALYSIS OF RESPONSES CONCERNING SLOVENE PROVERBS AND THE RESEARCH POTENTIAL OF AI USAGE IN COMBINATION WITH LANGUAGE CORPORA AND SURVEYS

ChatGPT-4o is an intriguing source of paremiological material for several reasons. Two distinct spheres of its potential use may be highlighted.

The first pertains to the growing role of AI models as information sources for a broad range of users who are not linguists, but who may, for various reasons, be interested in proverbs in a given language, whether their own or a foreign one. In such cases, artificial intelligence serves as an alternative to the diverse array of other online resources, which include both specialist (paremiographically designed) sources, such as digitized and primarily online dictionaries and collections, and a multitude of non-professional sources (e.g., forum discussions, articles focusing on “typical” or “unique” proverbs of a particular language, etc.), which are often marked by a lack of representativeness and inconsistency in their selection of primary materials.

The second sphere of artificial intelligence usage concerns specialized research in linguistics, lexicography, paremiography, and folkloristics. Our aim is to define (1) the potential of AI-generated responses to complement data from linguistic corpora and surveys, and (2) the relevance of such responses in the context of one of the key theoretical and practically applicable paremiological concepts: the notions of the paremiological minimum and optimum, which will be described in detail in Section 2.5.

This paper presents an evaluation of responses generated by artificial intelligence through an LLM, using empirical data drawn from linguistic corpora and online surveys among Slovene native speakers, available from previous research (Meterc 2017, 2023). With regard to the paremiological material obtained via artificial intelligence, the paper seeks to address the following questions:

1. How reliable are the responses in terms of the type of expression identified as a proverb (typological perspective)? (Section 2.2)
2. How useful can such querying be in determining the most representative expressions (relevance in relation to the core of actively used paremiology in a given language)? (Section 2.3)
3. How reliable are the responses in terms of the form of individual proverbs (accuracy of the expressions with respect to attested variants and the most representative, canonical form)? (Section 2.4)
4. What is the potential for developing an LLM-based or LLM- and corpus-based paremiological minimum and optimum? (Section 2.5)

2.1 Procedure of the Inquiry: Questions Posed to ChatGPT-4o

Following the initial test of paremiological inquiries using OpenAI’s GPT-3.5 model in January 2024, four more systematically formulated questions were submitted to the ChatGPT-4o model on June 5 and 17, 2024. The questions were composed in accordance with the following guideline from empirical paremiology:

    Frequent proverbs tend to be familiar, whereas familiar proverbs may, but need not occur frequently. In linguistic (Jakobsonian) terms, a given culture’s stock of familiar proverbs thus turns out to be some kind of a paradigmatic inventory, from which items may be (or may not be) projected onto the syntagmatic axis of concrete (more or less frequent) proverb usage. (Grzybek, Chlosta 2008: 104)

The size of the proverb sets used in each prompt was adjusted, as larger sets (we tested sets of up to 300 proverbs per prompt) significantly increased both the number of repeated or partially repeated proverbs and the number of hallucinations produced by the ChatGPT model. Test prompts were also run on different dates, but this had no observable impact on the frequency of repetitions or hallucinations, which ultimately led us to reduce the number of proverbs per set.
A further reason for downsizing was the length limitation of this article, which did not allow for a detailed discussion of all findings that emerged from the comparative analysis of results. Thus, four questions were prepared, each formulated with a different degree of emphasis on empirical verifiability or, in the case of the criterion of “popularity,” on its inherent indeterminacy:

A. Please generate a list of the 20 most common and widespread Slovene proverbs (5 June 2024).
B. Please generate a list of the 20 most well-known Slovene proverbs among speakers of Slovene (5 June 2024).
C. Please generate a list of the 20 Slovene proverbs that are most common in written texts (17 June 2024).
D. Please generate a list of the 20 most popular Slovene proverbs (17 June 2024).

In response to our follow-up question as to whether the numbering of examples in its answers reflects the degree of their representativeness, ChatGPT-4o stated that “the order does not reflect any ranking of importance, popularity, or frequency of use.” Nevertheless, as shown in Table 1 below, the order of items in responses to similar questions submitted on the same date remains largely consistent, contributing to the clarity of presentation but carrying no further significance.

The expressions listed in response to the first question (Column A) are labeled with the letter A and numbered from 1 to 20. To maintain conciseness and facilitate an overview of overlapping expressions across different responses, the subsequent columns (B, C, and D) refer to identical items by citing their designation from Column A (e.g., A1 in Column B). New expressions not appearing in Column A are marked with the respective column label (e.g., B19 and B20 for expressions unique to Column B, or C2 for one absent in both Columns A and B). Expressions that are either included in the Slovene paremiological minimum (accompanied by a percentage indicating familiarity among Slovene speakers) or occur frequently in Slovene language corpora compiled in the metaFida 1.0 corpus (approx. 6 billion tokens) are highlighted in bold. Approximate corpus frequency from metaFida is indicated in the table following a dash (e.g., “– 300”). In the frequency analysis and expression identification, we employed paremiological search methods previously applied in the phraseographic work for the eSSKJ and SPP dictionaries (Meterc 2019).

Proverb forms that are not in active contemporary use are presented in regular font: in some cases the underlying proverb is still commonly used (e.g., Prijatelja spoznaš v nesreči), while in others it is rare or absent from modern Slovene usage and was not included in the paremiological minimum. However, at least some evidence for these latter expressions or their variants can be found either in contemporary materials or in the Pregovori corpus (Babič et al. 2023).² Three items that could not be confirmed in any corpus and are structurally clear hallucinations are marked in italics and with an asterisk (*). Italics are also used to indicate expressions that are not proverbs but rather realizations of verbal idioms (e.g., Ne stavi na eno karto). Where the AI-generated form differs from the base form listed in the SPP dictionary, the dictionary’s base form is provided (e.g., Molk je zlato).

The English translations of the generated proverb material are provided in parentheses and follow the form of the Slovene proverbs. If an English equivalent exists that matches the Slovene proverb in both meaning and structural components, it is presented in parentheses in regular font. If no such equivalent exists in English, a literal translation (lit.) is provided. In cases where two forms of a proverb are listed within a single entry, the corresponding English translations follow in the same order.
² The Pregovori corpus, currently in version 1.1, is the largest Slovene collection of paremiological expressions, based on over 2,500 sources from 1578 to 2010, including books, journals, calendars, fieldwork notes, and manuscripts. It contains more than 37,000 annotated paremiological expressions.

Table 1: A combined overview of ChatGPT’s answers to the four questions above, presented as four separate lists of proverbs

| No. | A (5 June 2024): “20 most common and widespread Slovene proverbs” | B (5 June 2024): “20 most well-known Slovene proverbs” | C (17 June 2024): “20 Slovene proverbs most common in written texts” | D (17 June 2024): “20 most popular Slovene proverbs” |
|---|---|---|---|---|
| 1 | A1 Bolje vrabec v roki kot golob na strehi. (lit.: Better a sparrow in the hand than a pigeon on the roof.) 99% – 300 | A1 | A4 | A3 |
| 2 | A2 Kdor visoko leta, nizko pade. (Who flies high, falls deep.) 98.4% – 280 | A2 | C2 Po jutru se dan pozna. (lit.: The day is known by the morning.) – 800 | C2 |
| 3 | A3 Kjer je dim, tam je tudi ogenj. (Kjer je dim, je tudi ogenj.) (Where there’s smoke, there’s fire.) 93.9% – 500 | A3 | A1 | A1 |
| 4 | A4 Kuj železo, dokler je vroče. (Strike the iron, while it is hot.) 98.5% – 250 | A4 | A2 | A2 |
| 5 | A5 Kdo se smeje zadnji, se najslajše smeje. (Kdor se zadnji smeje, se najslajše smeje.) (lit.: He who laughs last, laughs the sweetest.) 99% – 180 | A5 | C5 Kar lahko storiš danes, ne odlašaj na jutri. (Don’t put off until tomorrow what you can do today.) 99% – 90 | C5 |
| 6 | A6 Pes, ki laja, ne grize. (Dog that barks does not bite.) 99.1% – 240 | A6 | C6 Tiha voda bregove dere. (lit.: Silent water wears away the banks.) 98% – 150 | C6 |
| 7 | A7 Pomoč prijatelja v nesreči je več vredna kot tisoč prijateljev v sreči. (lit.: The help of a friend in misfortune is worth more than a thousand friends in happiness.) (Prijatelja spoznaš v nesreči.) (lit.: You recognize a friend in misfortune.) 99% – 150 | A9 | C7 Vsak je svoje sreče kovač. (lit.: Everyone is the blacksmith of their own fortune.) 98.4% – 300 | C7 |
| 8 | A8 Ne stavi na eno karto. (lit.: Don’t bet on one card.) | A10 | C8 Ne glej darovanemu konju v zobe. (lit.: Don’t look into a gifted horse’s teeth.) (Podarjenemu konju se ne gleda v zobe.) (lit.: One shouldn’t look into a gift horse’s teeth.) 97.8% – 240 | C8 |
| 9 | A9 Bolje pozno kot nikoli. (Better late than never.) 98.7% – 630 | A13 | C9 Iz malega raste veliko. (lit.: From small grows great.) 97.2% – 460 | C10 |
| 10 | A10 Govorjenje je srebro, molčanje je zlato. (Speech is silver, silence is gold.) (Molk je zlato.) (Silence is golden.) 80.7% – 270 | A18 | C10 Počasi se daleč pride. (lit.: One gets far by moving slowly.) 99.4% – 740 | C16 |
| 11 | A11 Zarečenega kruha se največ poje. (lit.: The bread you swear off is the one you end up eating most.) 99.3% – 650 | A19 | C11 Eno jabolko na dan odžene zdravnika stran. (lit.: An apple a day drives the doctor away.) – 170 | C19 |
| 12 | A12 Brez muje se še čevelj ne obuje. (lit.: Even a shoe can’t be put on without effort.) 98.7% – 370 | A20 | C12 Vsak začetek je težak. (Every beginning is difficult.) 97.4% – 940 | C15 |
| 13 | A13 Osel gre samo enkrat na led. (lit.: A donkey goes onto the ice only once.) 98.1% – 250 | A11 | C13 Kdor se zadnji smeje, se najslajše smeje. A5 Kdo se smeje zadnji, se najslajše smeje. (lit.: He who laughs last, laughs the sweetest.) 99% – 180 | C12 |
| 14 | A14 Priložnost dela tatu. (Opportunity makes the thief.) 91.7% – 390 | A14 | C14 Pometi pred svojim pragom. (lit.: Sweep in front of your own doorstep.) (Najprej pometi pred svojim pragom.) (lit.: Sweep in front of your own doorstep first.) 98.2% – 360 | C11 |
| 15 | A15 Sreča je opoteča. (Luck is fickle.) 97.8% – 200 | A12 | C15 Kakor boš postlal, tako boš ležal. (Kakor si boš postlal, tako boš spal.) (You’ve made your bed, now lie in it.) 98.4% – 110 | C13 Kdor se zadnji smeje, se najslajše smeje. A5 Kdo se smeje zadnji, se najslajše smeje. (lit.: He who laughs last, laughs the sweetest.) 99% – 180 |
| 16 | A16 Dva petelina se na enem kupu ne zmenita. (lit.: Two roosters don’t get along on the same dung heap.) | A15 | C16 Kdor drugemu jamo koplje, sam vanjo pade. (He who digs a pit for others falls in himself.) 99% – 370 | D16 Delo iz človeka naredi človeka.* (lit.: Work makes a man out of a man.) |
| 17 | A17 Daleč od oči, daleč od srca. (Far from the eyes, far from the heart.) 93% – 400 | A17 | C17 Ne boj se zgodnjega vstajanja, boj se poznega ležanja.* (lit.: Don’t be afraid of early rising, be afraid of going to bed late.) | D17 Kdor trdo dela, se daleč pride.* (lit.: He who works hard, gets far.) |
| 18 | A18 Rana ura, zlata ura. (lit.: Early hour, golden hour.) 99% – 430 | A16 | C18 Kdor prej pride, prej melje. (lit.: He who arrives first, grinds first.) 98.4% – 1000 | C9 |
| 19 | A19 Jabolko ne pade daleč od drevesa. (The apple doesn’t fall far from the tree.) 99.1% – 640 | B19 Nič novega pod soncem. (Nothing new under the sun.) – 150 | C19 Laž ima kratke noge. (lit.: A lie has short legs.) 98.4% – 540 | C17 |
| 20 | A20 Pametni popustijo. (Pametnejši odneha.) (The wiser gives in.) 98.4% – 230 | B20 Vsak je svoje sreče kovač. (lit.: Everyone is the blacksmith of their own fortune.) 98.4% – 600 | C20 Kar seješ, to boš žel. (Kar seješ, to žanješ.) (As you sow, so shall you reap.) 93.3% – 450 | C18 |

2.2 The Accuracy of AI in Correctly Identifying the Genre of an Expression

Our first concern is the accuracy of the AI-generated responses in terms of whether the listed expressions conform to the definitional characteristics of a proverb. Specifically, we are interested in determining whether the provided examples can indeed be classified as proverbs, or whether they may in fact represent other types of phraseological or paremiological expressions, or even non-phraseological and non-paremiological expressions.

From a structural perspective, paremiological expressions differ from other types of phraseologisms in that they constitute a complete text (a texteme), rather than merely a fragment of one (Permjakov 1970: 19).
Mlacek (1983: 131, 138) emphasized that a proverb conveys a complete thought containing a generally valid logical judgment. Although such judgments do not constitute a larger coherent system of logic, they nevertheless function as self-contained units (Mieder 2004: 1). Proverbs serve the function of “modeling reality”; they express regularities, provide guidance, and offer moral instruction (Permjakov 1970: 9; Mlacek 1983: 131; Mieder 2004: 3; Kržišnik 2008: 38).

The majority of the listed expressions are proverbs, as they conform to the definitional characteristics outlined above. An exception is the verbal idiom staviti vse na eno karto (lit.: to bet everything on one card, ‘to risk everything by relying on a single option’), which appears in the imperative form Ne stavi na eno karto (lit.: Don’t bet on one card) in the response to Question A. The misclassification of phrasemes, such as citing one form of a multi-word expression instead of a proverb, is a relatively common error even in responses from native speakers (Meterc 2017: 192). The responses to the first two questions contain no invented proverbs; however, one hallucinated expression appears in the response to the third question, and three hallucinations are present in the response to the fourth question (these are addressed further below). Each list contains between 17 and 20 actual proverbs, indicating an accuracy rate of 85 to 100 percent with respect to genre classification.

2.3 The Effectiveness of AI in Identifying the Most Representative Paremiological Expressions (Core of Actively Used Paremiology in a Given Language)

All proverbs from the lists that are included in the paremiological minimum are known to more than 90% of speakers, with the exception of Molk je zlato (Silence is golden), which is known to 80.7%; the remaining proverbs rank among the 200 most well-known proverbs in Slovene. The best-known proverb is Počasi se daleč pride (lit.: One gets far by moving slowly), recognized by 99.4% of respondents.

In addition to inclusion in the paremiological minimum, the presence of a proverb within the broader body of contemporary Slovene paremiology also serves as a criterion for evaluating the accuracy of the AI-generated responses. The generated lists contain three proverbs that are neither part of the minimum nor included in the SSKJ: Nič novega pod soncem (Nothing new under the sun), Po jutru se dan pozna (lit.: The day is known by the morning), and Eno jabolko na dan odžene zdravnika stran (lit.: An apple a day drives the doctor away). These three proverbs are relatively common and may also be considered part of the core of Slovene paremiology, although no data on their familiarity among speakers is currently available. The proverb Po jutru se dan pozna is frequent in the metaFida 1.0 corpus (approx. 800 occurrences) and also stood out in terms of frequency among additional responses provided by survey participants in the study on the paremiological minimum (Meterc 2017: 192). All three of the aforementioned proverbs are already included in the SPP dictionary.

One expression from the AI-generated list is less common in contemporary language: the form Dva petelina se na enem kupu ne zmenita could not be confirmed in modern usage, whereas the variant Ne moreta biti dva petelina na enem kupu (lit.: There can’t be two roosters on the same heap) does appear in contemporary texts. The following forms are found only in the Pregovori corpus: Ne moreta biti dva petelina na istem gnoju (lit.: There can’t be two roosters on the same dung heap), Kjer sta dva petelina na enem dvorišču, je vrišč (lit.: Where there are two roosters in one yard, there’s always a racket), Dva petelina na dvorišč je vrišč (lit.: Two roosters in one yard – a racket), and Ni dobro, če sta dva petelina v kurniku (lit.: It’s not good if there are two roosters in the henhouse).
According to the Slovene paremiological minimum, which was established through an online survey of 527 speakers (Meterc 2023: 122–124), between 15 and 18 expressions from each list are included, indicating an accuracy rate of 75 to 90 percent in relation to the minimum. If we also take into account the three proverbs mentioned above (although not covered by the minimum survey, they nonetheless belong to the core of Slovene paremiology based on their frequency in the metaFida 1.0 corpus), then between 17 and 19 expressions per list can be considered part of the core of contemporary Slovene paremiology, resulting in an accuracy rate of 85 to 95 percent.

In the contemporary Slovene corpus metaFida 1.0, as well as in older sources included in the Pregovori corpus, there is no evidence (not even in variant forms) of four expressions listed by the AI, such as Ne boj se zgodnjega vstajanja, boj se poznega ležanja (lit.: Don’t be afraid of early rising, be afraid of going to bed late). Two of these expressions are clear hallucinations: Delo iz človeka naredi človeka (lit.: Work makes a man out of a man) and Kdor trdo dela, se daleč pride (lit.: He who works hard, gets far). In both cases, the AI has generated expressions by combining elements from actual Slovene proverbs that are part of the paremiological minimum: Obleka naredi človeka (Clothes make the man, ‘expresses that a person’s appearance influences others’ perception’) and Počasi se daleč pride (Slowly, one gets far, ‘expresses that steady, patient effort leads to long-term success’). The form Kdor trdo dela, se daleč pride (lit.: He who works hard, gets far) is grammatically incorrect, as the reflexive particle se belongs to the original proverb and does not match the pronoun kdor in the newly generated expression.

2.4 The Accuracy of AI in Providing Structurally Appropriate Proverbs

In the following section, we focus on the aspect of the conventionalized form of paremiological expressions.
We compare the forms provided by the AI with those that are either corpus-confirmed variants or the most frequent, representative, and lexicographically standardized base forms.

The majority of the listed forms (14 to 16 per list, or approximately 70 to 80%) are frequently used forms that are also cited in the SPP dictionary, based on corpus-based research on proverb variation (Meterc 2019). The accuracy of the responses is even higher if we consider that nearly all of these forms also correspond to the basic lexicographic forms (SPP), which are the most frequent in actual usage. An exception is, for instance, the variant Kjer je dim, tam je tudi ogenj (Where there is smoke, there is also fire, ‘expresses that it is reasonable to look for, acknowledge, or assume a cause behind something’), which is indeed attested in use and differs from the dictionary base form Kjer je dim, je tudi ogenj only by the addition of the adverb tam (there). The SPP dictionary lists as many as 15 variants of this proverb. Two forms of the same proverb appear on two different dates: the base form Kdor se zadnji smeje, se najslajše smeje (He who laughs last, laughs the sweetest, ‘expresses that final success is the most satisfying, especially after early triumphs by others’) and the variant Kdo se smeje zadnji, se najslajše smeje, which is not attested in actual use.

Among the AI-generated responses, there are also forms that deviate significantly from the base form, or that are entirely absent from actual usage. One such example is Ne glej darovanemu konju v zobe (lit.: Don’t look into a gifted horse’s teeth). In the Pregovori corpus, two similar forms are attested: Darovanemu konju ne glej na zobe and Darovanemu konju ne glej v zobe (Don’t look at a gift horse’s teeth vs. Don’t look into a gift horse’s teeth). Another example is Govorjenje je srebro, molčanje je zlato (Speech is silver, silence is golden). The closest attested variant in contemporary Slovene, as listed in the SPP dictionary, is Govorjenje je srebro, molk je zlato (‘expresses that speaking can be valuable, but staying silent is often wiser or more virtuous’). In both cases, it is highly likely that the English proverb influenced the AI’s response.

An interesting example is the form Pomoč prijatelja v nesreči je več vredna kot tisoč prijateljev v sreči (lit.: The help of a friend in misfortune is worth more than a thousand friends in happiness). In Slovene, the conventional form of this proverb is Prijatelja spoznaš v nesreči (lit.: You recognize a friend in misfortune, ‘expresses that true friendship is revealed in times of trouble or hardship’). It is highly likely that the version generated by GPT-4o is a partial hallucination that reformulates the proverb using a construction pattern common in both English and Slovene: One X is worth a thousand Y (e.g., A picture is worth a thousand words). Such hallucinations, together with those for which no clear connection to a known Slovene proverb can be found (e.g., Ne boj se zgodnjega vstajanja, boj se poznega ležanja; lit.: Don’t be afraid of early rising, be afraid of going to bed late), represent a distinct area of research potential.

2.5 The Potential for Developing a Paremiological Minimum (and Optimum) Based on AI-Generated Material

The theoretical and applied benefits of establishing a paremiological minimum, a list of the most widely known proverbs in a given language (Permjakov 1989), and a paremiological optimum, a list of proverbs that are both widely known and frequently used (Ďurčo 2014), have been extensively described (Meterc 2017: 40–45). One of the key issues in this context is the selection of paremiological material to be tested through surveys and corpus analysis. Since it is not feasible to test thousands of expressions, a well-designed preliminary selection of proverbs is essential.
A major challenge in obtaining paremiological data through artificial intelligence lies in the opacity of how the tool operates. It is worth noting that in commercial LLMs, such as all previous and current OpenAI models, the processes by which these models arrange and generate answers are not open or inspectable. The weights, training data, and fine-tuning details of GPT models are not publicly available, which means that the underlying sources remain obscured. Already from the results presented in the article (e.g., the inconsistency of forms such as Ne glej darovanemu konju v zobe with contemporary usage), it is evident that the model used draws on certain collections that include a large amount of outdated paremiological material and are published on websites, but does not draw, for instance, on language corpora or lexicographic sources. This can be verified by asking it for 20 proverbs from the Dictionary of Proverbs and Similar Paremiological Expressions (SPP; Meterc 2020–), as it also lists expressions that are not included in the dictionary (e.g., the hallucinated expression Voda na svoj mlin teče; lit.: The water runs to its own mill). Moreover, when asked about proverbs not yet included in this dictionary, it lists some that actually are in it (e.g., Kjer je volja, je pot; lit.: Where there is a will, there is a way). It also states that it does not have direct access to the language corpus (e.g., metaFida 1.0).

Another problematic aspect is the non-reproducibility of the research: questions such as those presented in this article often yield slightly different responses each time. This also renders the ranking of results for similar prompts within paremiological research unreliable. What can be regarded as reliable, however, is the overlap of proverbs that can serve as a basis for subsequent human analysis and for the construction of a paremiological minimum.
The partial non-reproducibility can be turned to our advantage if the goal is to collect as much relevant paremiological material as possible or to broaden the results of a targeted selection requested from an LLM. Thus, the greatest contribution of such models to research may lie in the very early stages of minimum construction.

It was also necessary to verify whether differences in wording regarding the familiarity, frequency, or popularity of proverbs affect the results. This influence was not confirmed, as the model responds to the question about familiarity as if it were about frequency. For two very similar questions, (A) Please generate a list of the 20 most common and widespread Slovene proverbs, and (C) Please generate a list of the 20 Slovene proverbs that are most common in written texts, which were submitted on two different dates (5 and 17 June 2024), the model provides two distinctly different lists of proverbs. By contrast, the lists that overlap most are those generated in response to questions submitted on the same date (A and B on the one hand, and C and D on the other). Across the four responses collected on two different dates, there are 33 distinct proverbs that fall within the paremiological minimum. When comparing the first list (A) from June 5, 2024, and the first list (C) from June 17, we find that out of 32 different proverbs that belong to the paremiological minimum, four appear in both lists. Rarely used proverb forms and hallucinations tend to recur on the same date, but not across different dates. We may therefore expect that by posing additional questions on multiple dates, it would be possible to obtain an overlapping set of relevant proverbs, which could subsequently be used in a survey to determine the paremiological minimum and in a corpus-based frequency analysis aimed at transforming the paremiological minimum into a paremiological optimum, following the methodology proposed by Peter Ďurčo (2014: 189–201).
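The overlap comparison described above can be illustrated with a short sketch. It is an illustration only, not the procedure used in the study: it encodes the four lists by their Table 1 labels and counts the items shared by each pair of lists. Two choices are assumptions made here: C13 is treated as a form of A5, and B20 as the same proverb as C7, because the table gives the same proverb under both labels. Unlike the figures in the text, the sketch counts all listed items, not only those belonging to the paremiological minimum. The names LISTS, SAME_PROVERB, normalized, and overlap are hypothetical.

```python
from itertools import combinations

# Labels as given in Table 1 (one list per question, in response order).
LISTS = {
    "A": [f"A{i}" for i in range(1, 21)],
    "B": ["A1", "A2", "A3", "A4", "A5", "A6", "A9", "A10", "A13", "A18",
          "A19", "A20", "A11", "A14", "A12", "A15", "A17", "A16", "B19", "B20"],
    "C": ["A4", "C2", "A1", "A2", "C5", "C6", "C7", "C8", "C9", "C10",
          "C11", "C12", "C13", "C14", "C15", "C16", "C17", "C18", "C19", "C20"],
    "D": ["A3", "C2", "A1", "A2", "C5", "C6", "C7", "C8", "C10", "C16",
          "C19", "C15", "C12", "C11", "C13", "D16", "D17", "C9", "C17", "C18"],
}

# Assumption: labels that denote the same proverb are merged before comparison.
SAME_PROVERB = {"C13": "A5", "B20": "C7"}


def normalized(labels):
    """Map each label to its canonical label and return the resulting set."""
    return {SAME_PROVERB.get(label, label) for label in labels}


def overlap(first, second):
    """Count the items that two lists have in common."""
    return len(normalized(LISTS[first]) & normalized(LISTS[second]))


# Print the number of shared items for every pair of lists.
for first, second in combinations(LISTS, 2):
    print(f"{first} and {second}: {overlap(first, second)} shared items")
```

Under these assumptions, the same-date pairs (A and B, C and D) share far more items than any cross-date pair, which is in line with the observation above. The same counting could be repeated with additional lists collected on further dates to build a candidate set for a survey.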
3 CONCLUSIONS

For the average internet user, artificial intelligence offers a simple and fast alternative for obtaining information about representative, frequent, or well-known proverbs in a given language. It serves as an alternative to (1) dictionary sources or (2) the heterogeneous and often chaotic array of online materials, such as digitized older collections, forum posts and non-specialist articles focusing on typical proverbs in a particular language. AI is more user-friendly, requiring less prior knowledge and enabling quicker access to relatively reliable information. This is particularly important for non-expert users, who are only able to assess the reliability and representativeness of sources to a limited extent.

The issues mentioned above regarding the non-reproducibility of results may, at present, be best addressed through simple methodological adaptations. These steps can improve accuracy, or at least minimize the number of factors that might affect the outcomes of a given prompt:

1. the prompt should be used in a fixed and isolated form, free of additional context and within a new conversation (chat);
2. a deterministic form of the response should be specified, for example by including an instruction such as “Do not vary your selection or wording in future responses”;
3. the results generated from such prompts should be saved and treated as fixed datasets (as shown in Table 1).

Based on Slovene empirical data on proverb familiarity and corpus data on frequency, our research has demonstrated that the responses provided by ChatGPT-4o are, to a large extent, successful:

− 85–100% accuracy in terms of genre-specific features of proverbs;
− 85–95% accuracy in terms of the representativeness of the proverbs with respect to the core of Slovene paremiology;
− 70–80% accuracy in terms of the structural form of the proverbs.
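The percentages above follow directly from the number of qualifying expressions in each 20-item list. As a minimal sketch, the reported ranges can be recomputed from the per-list counts given in Sections 2.2 to 2.4. The function name accuracy_rate and the dictionary COUNT_RANGES are illustrative assumptions, not part of the tooling used in the study.

```python
LIST_SIZE = 20  # each prompt asked for 20 proverbs


def accuracy_rate(count, size=LIST_SIZE):
    """Return the share of qualifying expressions in a list, in percent."""
    return 100 * count / size


# (lowest, highest) number of qualifying expressions per list, as reported in the text
COUNT_RANGES = {
    "genre (actual proverbs)": (17, 20),
    "core of Slovene paremiology": (17, 19),
    "frequently used form": (14, 16),
}

# Print the percentage range for each criterion.
for criterion, (low, high) in COUNT_RANGES.items():
    print(f"{criterion}: {accuracy_rate(low):.0f}-{accuracy_rate(high):.0f}%")
```

Because every list has the same length, each additional qualifying expression changes the rate by five percentage points, which is why the reported ranges move in steps of five.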
For the specialized user, such as a phraseologist, paremiologist, or paremiographer, the AI-generated responses represent a useful body of paremiological material that must be verified using additional empirical data (e.g., frequency, familiarity, and the conventionalization of form). With the help of the selected LLM, it is possible to compile high-quality lists of the most relevant paremiological expressions. By repeating queries, combining the results, and then selecting items on the basis of corpus analysis, it may be possible to obtain larger sets of representative expressions. This is particularly valuable from the perspective of phraseographic and paremiographic work, and potentially also more broadly, for example in phraseodidactics and paremiodidactics. Through previous empirical analyses of Slovene paremiology in terms of familiarity and frequency, we have confirmed the relevance of the AI-generated lists, which could be used as a basis for developing paremiological minimums and optimums in languages where such benchmarks have not yet been established. Artificial intelligence could serve as a useful tool for the preliminary selection of paremiological expressions before they are presented to large groups of survey participants and analyzed within linguistic corpora.

The hallucinated forms we obtained from the LLM have strong research potential. As they, to varying degrees and in different ways, resemble proverbs that exist or once existed in the given language, it would be interesting for future research to analyse how they are formed, what structural patterns recur in them, and to what extent it can be determined which existing proverbs they derive from. This could serve as a valuable comparative addition to studies on how genuine proverbs are formed.

Sources

BABIČ, Saša; et al. (2023) Collection of Slovene paremiological units Pregovori 1.1. Slovene language resource repository CLARIN.SI. http://hdl.handle.net/11356/1853 <31 March 2025>
ERJAVEC, Tomaž (2023) Corpus of combined Slovene corpora metaFida 1.0. Slovene language resource repository CLARIN.SI. http://hdl.handle.net/11356/1775 <31 March 2025>
METERC, Matej (2020–) Slovar pregovorov in sorodnih paremioloških izrazov. https://www.fran.si <31 March 2025>
OpenAI. 2024. ChatGPT-4o model. https://openai.com/chatgpt <31 March 2025>
eSSKJ: Slovar slovenskega knjižnega jezika. https://www.fran.si <31 March 2025>
Slovar slovenskega knjižnega jezika. https://www.fran.si <31 March 2025>

Bibliography

DE SCHRYVER, Gilles-Maurice (2023) “Generative AI and Lexicography: The Current State of the Art Using ChatGPT.” International Journal of Lexicography 36/4, 355–387.
ĎURČO, Peter (2014) “Empirical Research and Paremiological Minimum.” In: H. Hrisztova-Gotthardt/M. A. Varga (eds), Introduction to Paremiology: A Comprehensive Guide to Proverb Studies. Warsaw: Versita, 183–205.
GRZYBEK, Peter/Christoph CHLOSTA (2008) “Some Essentials on the Popularity of (American) Proverbs.” In: K. J. McKenna (ed.), Festschrift on the Occasion of Wolfgang Mieder’s 65th Birthday. Burlington: University of Vermont, 95–110.
JAKUBÍČEK, Miloš/Michael RUNDELL (2023) “The End of Lexicography? Can ChatGPT Outperform Current Tools for Post-Editing Lexicography?” In: M. Medveď/M. Měchura/C. Tiberius/I. Kosem/J. Kallas/M. Jakubíček/S. Krek (eds), Electronic Lexicography in the 21st Century (eLex 2023). Proceedings of the eLex 2023 Conference, 518–533.
KRŽIŠNIK, Erika (2008) “Viri za kulturološko interpretacijo frazeoloških enot.” Jezik in slovstvo 53/1, 33–47.
METERC, Matej (2017) Paremiološki optimum: najbolj poznani in pogosti pregovori ter sorodne paremije v slovenščini. Ljubljana: Založba ZRC, ZRC SAZU.
METERC, Matej (2019) “Analiza frazeološke variantnosti za slovarski prikaz v eSSKJ-ju in SPP-ju.” Jezikoslovni zapiski 25/2, 33–45.
METERC, Matej (2023) “Izbiranje iztočnic za Slovar pregovorov in sorodnih paremioloških izrazov: želje, merila in empirični podatki.” In: M. Jesenšek (ed.), Pleteršnikovi dnevi: ob stoletnici smrti Maksa Pleteršnika: program simpozija in povzetki referatov: Pišece, Ljubljana, Cankova, 12.–14. september 2023. Ljubljana: Slovenska akademija znanosti in umetnosti, 29–30.
METERC, Matej/Nataša JAKOP (2016) “Lexikografické spracovanie frazeologických variantov v novom slovníku slovinského spisovného jazyka.” In: M. Lišková (ed.), Akademický slovník současné češtiny a software pro jeho tvorbu aneb Slovníky a jejich uživatelé v 21. století: sborník abstraktů z workshopu, Praha, 29.–30. listopadu 2016. Praha: Ústav pro jazyk český AV ČR, 55–56.
MIEDER, Wolfgang (2004) Proverbs: A Handbook. Westport: Greenwood Press.
MLACEK, Jozef (1983) “Problémy komplexného rozboru prísloví a porekadiel.” Slovenská reč 48/2, 129–140.
PERMJAKOV, Grigorij Lvovič (1970) Osnovy strukturnoj paremiologii: Zametki po obščej teorii kliše. Moskva: Nauka.
PERMJAKOV, Grigorij Lvovič (1989) “On the Question of a Russian Paremiological Minimum.” Proverbium 6, 91–102.

Abstract

THE BEST KNOWN AND FREQUENTLY USED SLOVENE PROVERBS ACCORDING TO CHATGPT-4O: EXPLORING THE POTENTIAL FOR AN AI-BASED PAREMIOLOGICAL MINIMUM

The article examines the type of data yielded by the publicly accessible AI model GPT-4o concerning the core of Slovene paremiology, namely the most widely known and/or most frequently used proverbs. The proverbs identified by the model are compared with data obtained from corpus-based analyses, survey research, and established Slovene paremiographical sources. Drawing on these comparisons, this study outlines the current strengths and limitations of using this AI model in paremiological research and evaluates its potential impact on contemporary proverb studies.
Keywords: Slovene paremiology, large language model, GPT-4o, paremiological minimum, paremiological optimum

Povzetek

NAJBOLJ POZNANI IN POGOSTO UPORABLJENI SLOVENSKI PREGOVORI PO IZBORU CHATGPT-4O: RAZISKOVANJE POTENCIALA UMETNE INTELIGENCE ZA VZPOSTAVITEV PAREMIOLOŠKEGA MINIMUMA

Članek razkriva, kakšne podatke nam o jedru slovenske paremiologije – najbolj poznanih in/ali pogostih pregovorih – ponujajo odgovori prosto dostopnega modela UI, znanega kot GPT-4o. Nabori navedenih izrazov so primerjani s podatki, ki so nam o poznanosti in pogostnosti na voljo iz korpusnih in anketnih raziskav ter iz slovenskih paremiografskih virov. Na podlagi pridobljenih podatkov so izpostavljene trenutne prednosti in slabosti tovrstne uporabe imenovanega modela in njegovega vpliva na sodobno paremiološko raziskovanje.

Ključne besede: slovenska paremiologija, veliki jezikovni model, GPT-4o, paremiološki minimum, paremiološki optimum