
SharePoint Search – Crawl and Content Configuration




Presentation Transcript


  1. SharePoint Search – Crawl and Content Configuration Steve Peschka Sr. Principal Architect Microsoft Corporation

  2. Crawl and Content Configuration Connectors Crawling and Content Sources Query Throttling Result Sources Improvements in Document Parsing Entity Extraction Schema Management

  3. Connectors
  • The following connectors are available out of the box in SharePoint:
  • SharePoint
  • HTTP
  • File Share
  • BDC – also includes these other connectors, which are built on the BDC framework:
  • Exchange Public Folders
  • Lotus Notes
  • Documentum Connector
  • Taxonomy Connector – requires the Term Store to be provisioned for crawling, so it requires SharePoint Server
  • People Profile Connector – requires the profile store to be deployed and populated; the profile store is only part of SharePoint Server

  4. Crawling and Content Sources
  • There are improvements to the crawling feature itself:
  • For HTTP sites, the crawler supports a new access type – anonymous
  • Crawling also works with certain out-of-the-box web part content that is rendered asynchronously on the client
  • The crawler gets a “classic” rendering of pages that contain the new asynchronous web parts in order to index them

  5. Crawling “Continuously”
  • “Continuous crawling” is a new crawl feature in SharePoint 2013 – it applies only to SharePoint content sources
  • When you crawl continuously, every 15 minutes (by default) the crawler picks up changes and pushes them to content processing
  • You can change the interval using Set-SPEnterpriseSearchCrawlContentSource
  • Because of changes in how the index is created and stored, a document can appear in the index within seconds of going through the content processing component – you no longer have to wait for long index merges before it shows up in results
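As a rough sketch of the step above, the following management-shell commands enable continuous crawls on an existing SharePoint content source. The content source name is the default one and may differ in your farm; treat this as illustrative, not a tested recipe.

```powershell
# Assumes the SharePoint 2013 Management Shell on a farm server.
$ssa = Get-SPEnterpriseSearchServiceApplication
$cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Local SharePoint Sites"

# Turn on continuous crawls for this SharePoint content source
Set-SPEnterpriseSearchCrawlContentSource -Identity $cs `
    -EnableContinuousCrawls $true
```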

  6. Continuous Crawl vs. Incremental Crawl
  • Both continuous and incremental crawls are supported in SharePoint 2013 – to use both, split the start addresses into two content sources
  • Continuous crawl has these advantages:
  • It starts working even while the first full crawl is ongoing, so you don’t have to wait for the full crawl to complete before content becomes searchable
  • Continuous crawls run in parallel, so one long crawl does not block a new one from starting
  • Continuous crawls mark errors for recrawl later and continue instead of using retry logic; this lets them complete much more quickly when there are issues
  • Incremental crawl has these advantages:
  • You control the schedule, which helps if you don’t have sufficient hardware to support continuous crawls
  • It has extensive retry logic built in for when errors occur
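Splitting start addresses into two content sources, as described above, might look like this. The names and URLs are placeholders for your own sites.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Content source for sites that will be crawled continuously
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Name "Collab Sites (Continuous)" -Type SharePoint `
    -StartAddresses "http://collab.contoso.com"

# Content source for sites that will use scheduled incremental crawls
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Name "Archive Sites (Incremental)" -Type SharePoint `
    -StartAddresses "http://archive.contoso.com"
```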

  7. Query Throttling
  • Every client that issues a query specifies a ClientType
  • Each ClientType has an associated priority – High, Medium or Low
  • Every app should specify a ClientType in the Query object it creates, so that you can configure which tier it belongs to
  • If you don’t specify one, you are automatically assigned Low priority
  • If query latencies in a higher tier exceed a threshold, queries from lower tiers start being throttled
  • Query throttling is turned off by default for on-premises farms
  • ClientType and priority are managed in the search service admin pages
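One way to pass a ClientType is on a query issued through the SharePoint 2013 search REST endpoint, sketched below. The site URL and the ClientType value `MyCustomApp` are hypothetical; a custom ClientType must also be registered in the search service admin pages to get a non-default priority.

```powershell
# Sketch: issue a search query over the REST API, tagging it with a
# ClientType so it is throttled in the right tier.
$url = "http://portal.contoso.com/_api/search/query" +
       "?querytext='sharepoint'&clienttype='MyCustomApp'"
Invoke-RestMethod -Uri $url -UseDefaultCredentials `
    -Headers @{ Accept = "application/json;odata=verbose" }
```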

  8. Result Sources – FKA Scopes and Federated Search
  • Scopes and Federated Search from SharePoint 2010 are now known as “Result Sources” in SharePoint 2013
  • Result Sources also support a “Remote SharePoint Index”
  • This is for scenarios where you have multiple SharePoint farms but don’t want to create a central farm that crawls them all
  • It also simplifies the problem of passing the current user’s credentials around (i.e. Kerberos, etc.). It does this with:
  • An OAuth trust between the farms’ search applications
  • Passing the current user’s identity claim to the remote farm when making the search request – the remote farm “rehydrates” the user’s claims
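A result source that uses the Remote SharePoint Index could be created roughly as follows. This is a sketch, not a tested recipe: `$remoteProviderId` stands in for the id of the “Remote SharePoint Provider” in your farm (which you would look up first), and the remote URL is a placeholder.

```powershell
$ssa   = Get-SPEnterpriseSearchServiceApplication
$owner = Get-SPEnterpriseSearchOwner -Level Ssa

# Federate queries to a remote SharePoint farm's index
New-SPEnterpriseSearchResultSource -SearchApplication $ssa -Owner $owner `
    -Name "Remote Farm Results" -ProviderId $remoteProviderId `
    -RemoteUrl "https://search.fabrikam.com"
```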

  9. Hybrid Integration with Office 365
  • Remote SharePoint Index is how search can be federated between on-premises and Office 365 in hybrid farms
  • It provides support for query only – not crawl
  • It requires you to:
  • Set up an OAuth trust between your on-premises farm and your Office 365 tenancy
  • Create a result source for the remote farm (i.e. Office 365 if you’re on-premises, and vice versa)
  • Expose an externally addressable endpoint for the on-premises farm that can be reached by the Office 365 sites
  • You can configure certificate authentication on that endpoint, for example with a reverse proxy
  • You can either query the result source directly, or create a query rule that also issues user queries to the remote farm when desired

  10. Hybrid Integration
  • A token is created for the user and security-trimmed results are returned
  • Requires two-way AD sync between on-premises and Office 365
  • By using a query rule you can integrate the results from both farms (“Results from the Cloud” and “Results from On Prem”) into a single display for users
  • A whitepaper is now available that describes the configuration in more detail: http://aka.ms/oht1dx

  11. Improvements in Result Sources
  • Some of the key functional improvements in Result Sources over Federated Search include:
  • Site and site collection admins can manage and configure result sources for their site collection
  • This reduces requests for SSA admins to centrally create and manage federated sources
  • It empowers lower-level admins to create and manage federated sources that meet their specific requirements
  • Exchange is now a data source for a result source
  • You can apply query transformations to a result source
  • For example, adding criteria that will be appended to each query, e.g. author=“Our CEO”, etc.
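A query transformation like the one mentioned above can be attached when the result source is created. Sketch only: `$localProviderId` is a placeholder for the provider id you would look up, and the author value is just an example; `{searchTerms}` is the token for the user’s original query text.

```powershell
$ssa   = Get-SPEnterpriseSearchServiceApplication
$owner = Get-SPEnterpriseSearchOwner -Level Ssa

# Every query against this source gets the author criterion appended
New-SPEnterpriseSearchResultSource -SearchApplication $ssa -Owner $owner `
    -Name "CEO Documents" -ProviderId $localProviderId `
    -QueryTransform '{searchTerms} author:"Our CEO"'
```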

  12. Demo Content Sources and Result Sources

  13. Improvements in Document Parsing
  • SharePoint 2013 introduces new parsing features – Format Handlers
  • Automatic file format detection: parsing no longer relies on the file extension
  • Deep link extraction for Word and PowerPoint formats
  • Visual metadata extraction: titles, authors and dates
  • High-performance format handlers for HTML, DOCX, PPTX, TXT, image, XML and PDF formats
  • New Montage, Visio and OneNote filters
  • The IFilter API continues to be supported as a means of extending the supported set of file formats

  14. Entity Extraction for Companies
  • Custom refiners were introduced into SharePoint with FAST Search for SharePoint 2010, where company extraction was managed in search admin via a web part
  • In SharePoint 2013 the experience is consolidated with other term management by moving much of it into the term store
  • You can manage term lists for entity extraction like any other term set (with a few exceptions); however, you cannot add additional term sets for extraction
  • You can also do custom entity extraction in SharePoint 2013 using cmdlets and CSV files, similar to how it was done in FAST Search for SharePoint 2010
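The cmdlet-and-CSV route mentioned above looks roughly like this. The file path is a placeholder, and the dictionary name shown is one of the fixed, pre-defined custom dictionary names – you cannot invent your own.

```powershell
# Sketch: import a custom entity-extraction dictionary from a CSV file
$ssa = Get-SPEnterpriseSearchServiceApplication
Import-SPEnterpriseSearchCustomExtractionDictionary -SearchApplication $ssa `
    -DictionaryName Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1 `
    -DictionaryPath \\server\share\custom-entities.csv
```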

  15. Schema Management
  • In SharePoint 2013 site collection admins get much more flexibility to work with managed properties
  • The farm search admin can define managed properties when the schema needs to be extended
  • Site collection admins have similar but more limited powers, because their schema changes apply only to their own site collection
  • Site collection admins can pick up new crawled properties for custom metadata in their sites and create managed properties from them
  • There are also managed properties available out of the box that crawled properties can be mapped to – RefinableString, RefinableDouble, etc.; these give site collection admins the ability to create fully refinable and sortable managed properties
  • A full crawl is not needed to create crawled and managed properties – use site columns and “Reindex List” or “Reindex Site”
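At the farm level, defining a managed property and mapping a crawled property to it can be sketched as below. The property names are hypothetical examples; type 1 corresponds to Text in the search object model.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Create a new Text managed property
$mp = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa `
        -Name "ProjectCode" -Type 1

# Map an existing crawled property (e.g. from a site column) to it
$cp = Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa `
        -Name "ows_ProjectCode"
New-SPEnterpriseSearchMetadataMapping -SearchApplication $ssa `
    -ManagedProperty $mp -CrawledProperty $cp
```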

  16. © 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
