Bing Search and Microsoft are dedicated to supporting the research community and regularly provide it with information and data in a variety of ways.
Bing Search already provides researchers and the public with access to MS MARCO, a collection of datasets focused on deep learning in search that are derived from Bing Search queries and related data. Research organizations can access the MS MARCO datasets immediately via the MS MARCO homepage. The MS MARCO dataset has been cited in numerous research papers since its release and has been used to study a range of research questions, including questions related to misinformation and disinformation. Because the dataset is provided as open source, the extent to which it has been used for disinformation-related research cannot easily be ascertained.
In 2020, Bing Search also shared a search dataset for Coronavirus Intent, composed of queries from all over the world that had an intent related to the coronavirus or COVID-19 (e.g., searches for “Coronavirus updates Seattle” or “Shelter in place”), for use by researchers and the public. This data, which can be broken down by country, is particularly relevant to misinformation research on public health issues and the COVID-19 pandemic, as it provides insights into how users sought information related to the coronavirus during the pandemic. The dataset was also posted to Azure Open datasets for Machine Learning, Tensorflow.org, and Kaggle. See additional information on the dataset at Extracting Covid-19 insights from Bing Search data | Bing Search Blog.
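As an illustration of how a researcher might break query data of this kind down by country, the following is a minimal Python sketch. The column names (Date, Query, Country), the dates, and the third sample row are assumptions made for illustration only; the column layout of the published files should be checked before adapting it.

```python
import csv
import io
from collections import defaultdict

# Inline placeholder rows standing in for a downloaded file.
# Column names, dates, and the third query are illustrative assumptions,
# not values taken from the published dataset.
SAMPLE = """Date,Query,Country
2020-03-20,Coronavirus updates Seattle,United States
2020-03-20,Shelter in place,United States
2020-03-21,coronavirus symptoms,Germany
"""


def queries_by_country(csv_text):
    """Group query strings by the value of the (assumed) Country column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Country"]].append(row["Query"])
    return dict(grouped)


by_country = queries_by_country(SAMPLE)
for country, queries in sorted(by_country.items()):
    print(country, len(queries), queries)
```

For a real download, the same function could be fed the file contents read from disk rather than the inline sample.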
In 2024, Microsoft publicly released a new information-rich dataset, the MS MARCO Web Search dataset, which leverages Bing search data. This dataset closely mimics real-world web document and query distributions, provides rich information for various kinds of downstream tasks, and encourages research in various areas. It also contains rich information from the web pages, such as the visual representation rendered by web browsers, raw HTML structure, clean text, semantic annotations, and language and topic tags labeled by industry document understanding systems. As its query set, MS MARCO Web Search contains 10 million unique queries in 93 languages, with millions of relevant labeled query-document pairs collected from the search log of the Microsoft Bing search engine.
Additionally, researchers who are registered webmasters may use Bing Search’s Keyword Tools and Backlinks Webmaster Tools to gain insights into search usage and keywords. Bing is also working on ways to provide deeper research access to these tools across the research community and hopes to provide updates in its next report.
Bing Search also offers use of Bing APIs to the public, which include Bing Image Search, Bing News Search, Bing Video Search, Bing Visual Search, Bing Web Search, Bing Entity Search, Bing Autosuggest, and Bing Spell Check. Bing Search provides free access to these APIs for up to 1,000 transactions per month, which may be leveraged by the research community.
In addition to the above datasets, Microsoft Research maintains a public portal of code, APIs, software development kits, and datasets that are available to the research community at Researcher tools: code & datasets - Microsoft Research. These public research tools can be accessed and downloaded by researchers immediately, without formal applications or login credentials.
Bing launched a Qualified Researcher Program to enable EU researchers to easily request access to publicly accessible Bing data from a single landing page. However, because these datasets are already available as open source, we expect some researchers may elect to obtain them via the means described above to avoid the burden of an application and credentialing process.
Bing compiled a specialized dataset of European Parliament election-related queries in different EU languages for use by the research community and to support transparency; researchers can apply using the form found here. Additionally, Bing has engaged with European researchers to discuss the types of data that will be most useful to the research community.
Microsoft is also a leader in research in Responsible AI and provides a range of tools and resources dedicated to promoting responsible use of artificial intelligence, allowing practitioners and researchers to maximize the benefits of AI systems while mitigating harms.
Lastly, given the open nature of the Bing Search index and the public nature of search results, researchers can use Bing Search or Bing’s generative AI experiences to run specific queries and analyze the results. This is unlike social media, which may require private accounts or connections between users to access certain materials.