As academic research begins to explore the use of large language models in place of human participants in market research, this article suggests a set of questions to guide academic reviewers in examining such submissions thoroughly.
Several academic pre-prints explore topics such as the “potential of generative language models to substitute for human participants in perceptual analysis” [1], the “possibility that language models can be studied as effective proxies for specific human sub-populations in social science research” [2], and “using GPT for market research” [3]. Because social researchers approach the use of large language models in place of human participants with a sense of novelty and excitement, reviewers for academic publications should balance that enthusiasm with diligent scientific scepticism.
Below are suggested questions for reviewers of academic articles that deal with synthetic data for marketing research.
1. How robust are study findings to wording changes in prompts?
Specifically, if the authors try to restructure sentences while preserving all the information in them, will the study results be supported by the new synthetic data?
For example, if the original prompt used in a paper was:
You are _ years old male/female. You live in the state of _. You come from a low/middle/high-class family. You have [no] children. Please answer the following questions just like a survey respondent would. In giving answers you will strictly follow the required format and will not add any additional words.
Request that authors compare the results generated with the prompt to a modified version of the same prompt, such as:
As a _-year-old man/woman from a low/middle/high-class family in _ with [no] children, respond to these survey questions. You must strictly follow the required format, adding no extra words.
A robust instrument should not produce unpredictably different results in response to minor changes in how information is passed to it. For a paper’s methodology to generalise to practical applications, it must withstand reasonable adjustment by practitioners. With LLMs, however, seemingly inconsequential changes in prompt wording can produce significantly different outputs. Re-running the generation with reworded prompts therefore checks the robustness of the method and helps rule out “motivated thinking” in the authors’ choice of prompts.
Normally, requesting a new data collection at the review stage is cost-prohibitive. But with synthetic data, that is not the case. Generating a new dataset with an adjusted prompt structure should be both easy and relatively cheap.
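Because the check is cheap, a reviewer can even describe it concretely. The Python sketch below illustrates one way to frame the request: regenerate answers under the two prompt wordings and test whether the resulting answer distributions differ. The `generate_synthetic_response` wrapper, the persona fields, and the chi-square test are illustrative assumptions on my part, not the procedure of any of the cited papers.

```python
# Sketch of a prompt-robustness check: generate synthetic survey answers under two
# semantically equivalent prompt templates and test whether the answer distributions
# differ. `generate_synthetic_response` is a hypothetical stand-in for whatever LLM
# call the paper under review actually uses.

from collections import Counter
from scipy.stats import chi2_contingency

PROMPT_A = (
    "You are {age} years old {gender}. You live in the state of {state}. "
    "You come from a {income}-class family. You have {children}. "
    "Please answer the following question just like a survey respondent would."
)
PROMPT_B = (
    "As a {age}-year-old {gender} from a {income}-class family in {state} "
    "with {children}, respond to this survey question."
)


def generate_synthetic_response(prompt: str, question: str) -> str:
    """Hypothetical wrapper around the paper's generation code; returns one answer."""
    raise NotImplementedError("replace with the authors' actual LLM call")


def answer_distribution(template: str, personas: list[dict], question: str) -> Counter:
    """Count the answers produced for every persona under one prompt template."""
    return Counter(
        generate_synthetic_response(template.format(**p), question) for p in personas
    )


def robustness_p_value(personas: list[dict], question: str, options: list[str]) -> float:
    """Chi-square test of independence between prompt wording and answer choice."""
    dist_a = answer_distribution(PROMPT_A, personas, question)
    dist_b = answer_distribution(PROMPT_B, personas, question)
    table = [
        [dist_a.get(o, 0) for o in options],
        [dist_b.get(o, 0) for o in options],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value  # a small p-value suggests the rewording alone shifted the results
```

A significant difference between the two distributions is a red flag for the method’s robustness; a non-significant one is reassuring but, of course, not a proof.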
2. How was the paper’s subject matter reflected in the pre-training data of the model used to generate synthetic data?
This question helps address three related concerns:
First, since large language models are known to reproduce parts of their pre-training data [4], one must ask whether there was contamination: is the generated synthetic data merely a regurgitation of the original data used in pre-training? (A simple heuristic check is sketched at the end of this section.)
Second, can we reliably establish the domain of generalisability of findings made in a paper?
Third, does the paper add novel evidence, or does it merely extract common or shared knowledge from synthetic data, much as a literature review summarises extant knowledge, thereby reinforcing the ideas and biases of past studies?
For most commercially available or even open-weights LLMs (including the GPT series and LLaMA), comprehensive datasheets for the training datasets are not released, often out of fear of copyright lawsuits, which makes a full examination impossible. But the fact that such an examination is often impossible does not make these concerns unnecessary to address.
Leaving them unaddressed is akin to testing how well querying the secretive Grandma Lupicia performs against a market research study. Grandma Lupicia may be very knowledgeable about people’s behaviour in a particular country, but without knowing where her knowledge comes from, can we confidently claim that she is not simply feeding us what she remembers from past research reports, and that her predictions will not fail once we step outside a certain boundary?
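One partial workaround, when a reviewer has access to a corpus of plausibly memorised material (for example, publicly available survey write-ups on the paper’s topic), is a crude verbatim-overlap heuristic that flags synthetic responses sharing long word sequences with that corpus. The sketch below is only an illustration under that assumption: the n-gram length and the reference corpus are arbitrary choices, and a clean result says nothing conclusive about the unreleased pre-training data.

```python
# Crude regurgitation heuristic: flag synthetic responses that share long verbatim
# word n-grams with a reviewer-supplied reference corpus. This is a proxy check only;
# without datasheets for the pre-training data, no overlap test can be conclusive.

def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All lower-cased word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_verbatim_overlap(synthetic_responses: list[str],
                          reference_corpus: list[str],
                          n: int = 8) -> list[int]:
    """Return indices of synthetic responses sharing an n-gram with the corpus."""
    reference_grams = set().union(*(word_ngrams(doc, n) for doc in reference_corpus))
    return [
        i for i, response in enumerate(synthetic_responses)
        if word_ngrams(response, n) & reference_grams
    ]
```

Flagged responses are not proof of contamination, but they give authors and reviewers something concrete to inspect.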
3. Did the study explore a topic of common knowledge, or did it take on a new topic?
Companies rarely spend research budgets on primary data collection about topics they already understand well. Most commercial market research is therefore performed to understand how people would react to a new stimulus or to collect previously unknown information (i.e. where no past primary data collection has been done).
To claim that synthetic data is “useful for market research”, the paper must show that synthetic data helped validly explore a new topic.
4. Is it ethical to use LLM-generated synthetic data for research?
There have been reports that OpenAI’s models have been trained on open-access social media posts and forums such as Reddit [5] without users’ informed consent, knowledge, or compensation. Millions of people are thus unwittingly placed in the research participant’s seat, in violation of the Nuremberg Code, a set of core principles of research ethics still widely applied in social research today.
Authors who perform research using synthetic survey responses need to justify to the ethics committees of their universities and to their prospective publications why they are using data obtained without subjects’ consent.
References
[1] Language Models for Automated Market Research: A New Way to Generate Perceptual Maps.
[2] Out of One, Many: Using Language Models to Simulate Human Samples. arXiv:2209.06899.
[3] Using GPT for Market Research.
[4] Quantifying Memorization Across Neural Language Models. arXiv:2202.07646.
[5] Reddit ends its role as a free AI training data goldmine.