Baseline your servers and optimize your applications with the Site24x7 SQL monitoring tool.
When your SQL database grows to millions of rows, poorly performing queries can grind your application to a halt. Slow response times frustrate end users, reduce productivity, and drive up infrastructure costs. The difference between a query that takes 10 seconds and one that completes in milliseconds often comes down to a few targeted optimizations.
Learning how to improve SQL query performance is essential for any developer or DBA working with large datasets. This hands-on guide walks you through proven techniques to identify bottlenecks, optimize query structure, apply indexing strategies, and monitor performance over time. Whether you are troubleshooting a specific slow query or building a long-term performance tuning strategy, these methods will help you get measurable results.
To improve SQL query performance on large tables, you need a systematic approach: create a test dataset, identify slow-running queries, analyze execution plans, and then apply targeted optimizations.
To complete this tutorial, ensure you have the following:
Although this tutorial uses Azure Data Studio to execute all the SQL query examples, the steps are the same for using SQL Server Management Studio on Windows.
First, you need a large dataset to explore all the techniques in this article. Using the SQL query below, create a sales table with one million rows of randomly generated data:
CREATE TABLE sales (
id INT IDENTITY(1,1),
customer_name VARCHAR(100),
product_name VARCHAR(100),
sale_amount DECIMAL(10,2),
sale_date DATE
)
SET NOCOUNT ON
DECLARE @i INT = 1
WHILE @i <= 1000000
BEGIN
INSERT INTO sales (customer_name, product_name, sale_amount, sale_date)
VALUES (CONCAT('Customer', @i), CONCAT('Product', FLOOR(RAND()*(10-1+1)+1)),
FLOOR(RAND()*(1000-100+1)+100), DATEADD(day, -FLOOR(RAND()*(365-1+1)+1), GETDATE()))
SET @i = @i + 1
END
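If you want to experiment with these techniques outside a SQL Server instance, an equivalent test table can be generated with Python's sqlite3 module. This is only a convenience sketch: the schema mirrors the T-SQL above, and a smaller row count keeps it quick. A batched executemany load is also much faster than a row-by-row loop:

```python
import random
import sqlite3
from datetime import date, timedelta

# In-memory database standing in for the SQL Server `sales` table
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_name TEXT,
        product_name  TEXT,
        sale_amount   REAL,
        sale_date     TEXT
    )
""")

# Generate random rows matching the T-SQL logic, then load them in one batch
today = date.today()
rows = [
    (
        f"Customer{i}",
        f"Product{random.randint(1, 10)}",
        random.randint(100, 1000),
        (today - timedelta(days=random.randint(1, 365))).isoformat(),
    )
    for i in range(1, 100_001)
]
conn.executemany(
    "INSERT INTO sales (customer_name, product_name, sale_amount, sale_date) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 100000
```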
Once you have the dataset, the first step toward improving SQL query performance is finding which queries run slowly. SQL Server provides several built-in tools for this purpose.
You can use the sys.dm_exec_requests system dynamic management view (DMV) to find slow queries in real time. This view provides information about running queries, including status, CPU time, and total elapsed time.
For example, examine the following query that retrieves customer names and their total sales amounts for the year 2023 from the sales table:
SELECT customer_name, SUM(sale_amount) AS total_sales
FROM sales
WHERE YEAR(sale_date) = 2023
GROUP BY customer_name
While the query is running, open a second query window and use the sys.dm_exec_requests DMV to view the query's progress:
SELECT r.session_id, r.status, r.total_elapsed_time, r.cpu_time, r.wait_time, r.command, qt.text AS query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) qt
WHERE r.status != 'sleeping'
This action returns a list of active queries on the server with their total elapsed time and CPU time. Evaluate those stats to determine if your query is running slowly:
Fig. 1: Identifying slow-running SQL queries
The example query above returns stats such as a total_elapsed_time of 1,178 milliseconds and a cpu_time of 1,736 milliseconds; cpu_time can exceed total_elapsed_time when the query runs in parallel across multiple cores. If these times seem unusually long for your specific application, the query is a candidate for optimization.
While sys.dm_exec_requests shows currently running queries, you can also analyze historical query performance using sys.dm_exec_query_stats. This DMV provides cumulative statistics for cached query plans, making it easier to find your most resource-intensive queries over time:
SELECT TOP 10
SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(qt.TEXT)
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2)+1) AS query_text,
qs.execution_count,
qs.total_logical_reads,
qs.total_worker_time,
qs.total_elapsed_time/1000000 AS total_elapsed_time_seconds
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY qs.total_logical_reads DESC
This query ranks your top 10 most resource-heavy queries by logical reads. You can change the ORDER BY clause to sort by total_worker_time (CPU) or total_elapsed_time (duration) depending on which bottleneck you are investigating.
SQL queries fall into two categories: waiting and running. Understanding their differences is crucial for optimizing query performance.
A running query actively uses system resources, including CPU, memory, and disk I/O. A waiting query is paused, waiting for another query to release a lock or for system resources to become available.
Both types impact your database’s performance. Waiting queries can block other queries, creating chain reactions that slow down response times across the entire application. Identifying and resolving both running bottlenecks and wait-related delays is key to improving SQL Server performance on large datasets.
Execution plans are one of the most powerful tools for understanding why a query runs slowly. They show you exactly how SQL Server processes your query, including which operations consume the most resources.
To generate an execution plan, press CTRL + M on Windows or CMD + M on Mac in Azure Data Studio or SQL Server Management Studio before running your query. The execution plan appears in the results pane after the query completes.
Key things to look for in an execution plan:
Now that you have identified and analyzed slow-running queries, the next step is to apply specific techniques to improve their performance.
To optimize SQL queries when working with large datasets, consider limiting the data these queries return via window functions or pagination.
Window functions can group, aggregate, and limit data from large datasets. For example, a sales table may have a million rows, and you want to show 10 records per page. Use the ROW_NUMBER() window function to assign row numbers and filter based on the desired page:
DECLARE @page_number AS INT;
SET @page_number = 2;
WITH numbered_sales AS (
SELECT *, ROW_NUMBER() OVER(ORDER BY sale_date) AS row_num
FROM sales
)
SELECT * FROM numbered_sales
WHERE row_num > 10 * (@page_number - 1) AND row_num <= 10 * @page_number
This code returns 10 records on page two:
Fig. 2: Limiting data using a window function
Alternatively, pagination with OFFSET and FETCH divides results into smaller sets, improving response times. The code below returns only 10 rows at a time:
DECLARE @page_number AS INT;
SET @page_number = 2;
SELECT *
FROM sales
ORDER BY sale_date
OFFSET (@page_number - 1) * 10 ROWS
FETCH NEXT 10 ROWS ONLY
These data-limiting techniques can significantly improve SQL query performance when working with large datasets by reducing the volume of data the server needs to process and return.
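Both pagination patterns are easy to verify on any engine that supports window functions. The sketch below uses SQLite (3.25+) through Python's sqlite3 module with a small, made-up table; SQLite spells OFFSET/FETCH as LIMIT/OFFSET, but the logic is identical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_name TEXT, sale_date TEXT)")
conn.executemany(
    "INSERT INTO sales (customer_name, sale_date) VALUES (?, ?)",
    [(f"Customer{i}", f"2023-01-{i:02d}") for i in range(1, 32)],  # 31 rows
)

page_number, page_size = 2, 10

# ROW_NUMBER()-based pagination, as in the CTE example above
windowed = conn.execute("""
    WITH numbered_sales AS (
        SELECT *, ROW_NUMBER() OVER (ORDER BY sale_date) AS row_num
        FROM sales
    )
    SELECT customer_name FROM numbered_sales
    WHERE row_num > ? * (? - 1) AND row_num <= ? * ?
""", (page_size, page_number, page_size, page_number)).fetchall()

# OFFSET-based pagination (SQLite's LIMIT/OFFSET plays the role of OFFSET/FETCH)
offset = conn.execute(
    "SELECT customer_name FROM sales ORDER BY sale_date LIMIT ? OFFSET ?",
    (page_size, (page_number - 1) * page_size),
).fetchall()

assert windowed == offset   # both approaches return the same page
print(windowed)             # rows 11-20 of the ordered result set
```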
Indexing is one of the most effective ways to improve SQL query performance on large tables. An index creates a data structure that allows the database engine to locate rows quickly without scanning the entire table.
However, creating indexes on every column is not the right approach. Too many indexes slow down INSERT, UPDATE, and DELETE operations because each index must be maintained. The goal is to create targeted indexes on columns that appear frequently in WHERE, JOIN, and ORDER BY clauses.
Use the execution plan to identify missing indexes. For example, suppose you want to retrieve a single customer name from your sales table. Without an index, SQL Server might scan all one million rows to find “Customer10098”:
SELECT *
FROM sales
WHERE customer_name = 'Customer10098'
After running the query, the execution plan shows that the Table Scan operator accounts for 95% of the overall query cost and that the query read all 1,000,000 rows:
Fig. 3: Performing a table scan operation
The execution plan shows a Missing Index warning and suggests how to create the index:
Fig. 4: Identifying the missing index
Click the Missing Index warning to view the example code for creating the index:
Fig. 5: The missing index suggestion
SQL Server supports two primary index types: clustered indexes, which sort and store the table's data rows in key order (one per table), and non-clustered indexes, which are separate structures containing key values and pointers back to the data rows.
To create the suggested non-clustered index, replace MyDB with your database’s name:
USE [MyDB]
GO
CREATE NONCLUSTERED INDEX [ix_customer_name]
ON [dbo].[sales] ([customer_name])
GO
After creating the index, rerun the query and compare the statistics:
SELECT *
FROM sales
WHERE customer_name = 'Customer10098'
The results now show the query read just 1 row instead of 1,000,000:
Fig. 6: Displaying the statistics
The number of reads, CPU cost, and I/O cost all decreased dramatically. This is the kind of improvement that targeted indexing delivers when you need to improve SQL query performance on large tables.
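The same seek-versus-scan effect can be reproduced outside SQL Server. The sketch below uses SQLite's EXPLAIN QUERY PLAN (its lightweight analogue of an execution plan) via Python's sqlite3 module; the index name mirrors the one created above but is otherwise illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_name TEXT)")
conn.executemany(
    "INSERT INTO sales (customer_name) VALUES (?)",
    [(f"Customer{i}",) for i in range(1, 10_001)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's counterpart to a SQL Server execution plan
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM sales WHERE customer_name = 'Customer5000'"

before = plan(query)  # full table scan: every row is examined
conn.execute("CREATE INDEX ix_customer_name ON sales (customer_name)")
after = plan(query)   # index search: the engine jumps straight to the row

print(before)  # typically a SCAN of the sales table
print(after)   # a SEARCH using ix_customer_name
```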
While missing indexes hurt read performance, unused indexes hurt write performance. Every INSERT, UPDATE, or DELETE operation must update all indexes on the affected table. Use the sys.dm_db_index_usage_stats DMV to identify indexes that are being maintained but never used in queries, and consider dropping them to reduce overhead.
Parallel execution improves the performance of SQL Server queries by running large queries on multiple CPUs, dividing the work among processors to handle large amounts of data more efficiently.
The MAXDOP (Maximum Degree of Parallelism) value determines the maximum number of logical processors for the query. It’s 0 by default, meaning SQL Server decides. To explicitly control parallel execution, set the MAXDOP option.
For example, the code below limits the query to a single processor by setting MAXDOP to 1:
-- Disable parallel query execution
SELECT customer_name, SUM(sale_amount) AS total_sales
FROM sales
WHERE YEAR(sale_date) = 2023
GROUP BY customer_name
OPTION (MAXDOP 1)
Fig. 7: Disabling parallel execution
With parallelism disabled, the stats show a noticeably higher total_elapsed_time, since all the work runs on a single processor.
To optimize this query, allow more processors. Set the MAXDOP option to a value higher than 1:
-- Enable parallel query execution
SELECT customer_name, SUM(sale_amount) AS total_sales
FROM sales
WHERE YEAR(sale_date) = 2023
GROUP BY customer_name
OPTION (MAXDOP 4)
With MAXDOP set to 4, SQL Server uses up to 4 processors to execute the query. Queries that involve sorting, grouping, or joining operations benefit most from parallelization.
Combine parallel execution with proper indexing, simplified query structure, and appropriate data types to achieve the best results.
Beyond server-level settings and indexing, the way you write your SQL queries has a direct impact on performance. Following these best practices helps you avoid common anti-patterns that cause unnecessary load on the database engine.
Using SELECT * forces the database to retrieve every column in the table, even those your application does not use. This increases I/O, memory consumption, and network bandwidth. Always specify the exact columns you need:
-- Avoid this
SELECT * FROM sales WHERE sale_date = '2023-06-15'
-- Use this instead
SELECT customer_name, sale_amount FROM sales WHERE sale_date = '2023-06-15'
This practice also enables the query optimizer to use covering indexes more effectively, where the index contains all the columns the query needs without accessing the base table.
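The covering-index effect is visible in SQLite's EXPLAIN QUERY PLAN output as well. In this sketch (via Python's sqlite3, with a hypothetical composite index), the narrow column list is answered entirely from the index, while SELECT * must still visit the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        customer_name TEXT,
        product_name TEXT,
        sale_amount REAL,
        sale_date TEXT
    )
""")
# Hypothetical composite index containing every column the narrow query touches
conn.execute("CREATE INDEX ix_date_cust_amount ON sales (sale_date, customer_name, sale_amount)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Only indexed columns requested: the index alone can answer the query
narrow = plan("SELECT customer_name, sale_amount FROM sales WHERE sale_date = '2023-06-15'")
# SELECT * also needs product_name, so the base table must still be visited
wide = plan("SELECT * FROM sales WHERE sale_date = '2023-06-15'")

print(narrow)  # reports a covering index
print(wide)    # reports a plain index search
```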
Joining tables with a comma-separated FROM clause and a WHERE filter is error-prone: if the join condition is accidentally omitted or mistyped, the database produces a full Cartesian product (cross join), which is extremely expensive on large datasets. An explicit INNER JOIN states the relationship between tables directly, making the query easier to read and the join logic harder to get wrong:
-- Avoid this (implicit join)
SELECT s.customer_name, o.order_date
FROM sales s, orders o
WHERE s.id = o.sale_id
-- Use this instead (explicit INNER JOIN)
SELECT s.customer_name, o.order_date
FROM sales s
INNER JOIN orders o ON s.id = o.sale_id
WHERE filters rows before grouping, while HAVING filters after the aggregation step. When you can apply a filter condition before the GROUP BY operation, always use WHERE to reduce the dataset size before aggregation occurs:
-- Less efficient: groups every customer, then filters
SELECT customer_name, SUM(sale_amount) AS total
FROM sales
GROUP BY customer_name
HAVING customer_name LIKE 'Customer1%'
-- More efficient: filters before grouping
SELECT customer_name, SUM(sale_amount) AS total
FROM sales
WHERE customer_name LIKE 'Customer1%'
GROUP BY customer_name
Reserve HAVING for conditions that must operate on aggregated values, such as HAVING SUM(sale_amount) > 5000.
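When the filter references a grouped column rather than an aggregate, this rewrite is mechanical and safe. A quick check with SQLite via Python's sqlite3 (tiny, made-up dataset) confirms both forms return identical results:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_name TEXT, sale_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("Alice", 100), ("Alice", 250), ("Bob", 50), ("Carol", 400), ("Carol", 10),
])

# Filtering a grouped column with HAVING: every row is grouped first
having = conn.execute("""
    SELECT customer_name, SUM(sale_amount) AS total
    FROM sales
    GROUP BY customer_name
    HAVING customer_name <> 'Bob'
    ORDER BY customer_name
""").fetchall()

# The same filter pushed into WHERE: 'Bob' rows are discarded before grouping
where = conn.execute("""
    SELECT customer_name, SUM(sale_amount) AS total
    FROM sales
    WHERE customer_name <> 'Bob'
    GROUP BY customer_name
    ORDER BY customer_name
""").fetchall()

assert having == where   # identical results, but WHERE aggregates fewer rows
print(where)             # [('Alice', 350.0), ('Carol', 410.0)]
```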
Wrapping a column in a function prevents the query optimizer from using an index on that column. This forces a full table scan, severely degrading performance on large tables:
-- Avoid: YEAR() prevents index usage on sale_date
SELECT * FROM sales WHERE YEAR(sale_date) = 2023
-- Better: use a date range to allow index seeks
SELECT * FROM sales
WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01'
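The sargability principle is engine-agnostic. In this SQLite sketch (via Python's sqlite3, with strftime() standing in for T-SQL's YEAR()), the function-wrapped predicate forces a scan while the bare range predicate uses the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_name TEXT, sale_date TEXT)")
conn.execute("CREATE INDEX ix_sale_date ON sales (sale_date)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in a function hides it from the index
wrapped = plan("SELECT * FROM sales WHERE strftime('%Y', sale_date) = '2023'")
# A range predicate on the bare column remains sargable
ranged = plan("SELECT * FROM sales WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01'")

print(wrapped)  # full table scan, no index mentioned
print(ranged)   # search using ix_sale_date
```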
When checking if rows exist in another table, EXISTS typically performs better than IN because EXISTS stops processing as soon as it finds the first match, while IN may evaluate the entire subquery result set:
-- Less efficient with large subquery results
SELECT customer_name FROM sales
WHERE customer_name IN (SELECT customer_name FROM vip_customers)
-- More efficient
SELECT s.customer_name FROM sales s
WHERE EXISTS (SELECT 1 FROM vip_customers v WHERE v.customer_name = s.customer_name)
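Since IN and EXISTS are logically equivalent here (the subquery column is non-nullable), the rewrite cannot change results. A quick sanity check with SQLite via Python's sqlite3 on a tiny, made-up dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (customer_name TEXT);
    CREATE TABLE vip_customers (customer_name TEXT);
    INSERT INTO sales VALUES ('Alice'), ('Bob'), ('Carol'), ('Alice');
    INSERT INTO vip_customers VALUES ('Alice'), ('Carol');
""")

with_in = conn.execute("""
    SELECT customer_name FROM sales
    WHERE customer_name IN (SELECT customer_name FROM vip_customers)
    ORDER BY customer_name
""").fetchall()

with_exists = conn.execute("""
    SELECT s.customer_name FROM sales s
    WHERE EXISTS (SELECT 1 FROM vip_customers v WHERE v.customer_name = s.customer_name)
    ORDER BY s.customer_name
""").fetchall()

assert with_in == with_exists   # the rewrite preserves results
print(with_exists)              # [('Alice',), ('Alice',), ('Carol',)]
```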
Leading wildcards in LIKE patterns (e.g., '%Smith') prevent index usage and force the database to scan every row. Place wildcards at the end of the search phrase whenever possible to allow index seeks:
-- Avoid: leading wildcard causes full scan
SELECT * FROM sales WHERE customer_name LIKE '%Smith'
-- Better: trailing wildcard allows index seek
SELECT * FROM sales WHERE customer_name LIKE 'Smith%'
Common Table Expressions (CTEs) improve readability and can help the optimizer process complex queries more efficiently by breaking them into logical steps. They also reduce the temptation to nest multiple subqueries, which can lead to poor execution plans:
WITH monthly_sales AS (
SELECT customer_name,
MONTH(sale_date) AS sale_month,
SUM(sale_amount) AS monthly_total
FROM sales
WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01'
GROUP BY customer_name, MONTH(sale_date)
)
SELECT customer_name, AVG(monthly_total) AS avg_monthly_sales
FROM monthly_sales
GROUP BY customer_name
ORDER BY avg_monthly_sales DESC
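The same CTE pattern runs on any engine with WITH support. This sketch uses SQLite via Python's sqlite3 on a tiny, made-up dataset, with strftime('%m', ...) standing in for T-SQL's MONTH():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_name TEXT, sale_amount REAL, sale_date TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Alice", 100, "2023-01-05"),
    ("Alice", 300, "2023-01-20"),
    ("Alice", 200, "2023-02-10"),
    ("Bob",   500, "2023-03-15"),
])

# CTE step 1: monthly totals per customer; step 2: average of those totals
rows = conn.execute("""
    WITH monthly_sales AS (
        SELECT customer_name,
               strftime('%m', sale_date) AS sale_month,
               SUM(sale_amount) AS monthly_total
        FROM sales
        WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01'
        GROUP BY customer_name, sale_month
    )
    SELECT customer_name, AVG(monthly_total) AS avg_monthly_sales
    FROM monthly_sales
    GROUP BY customer_name
    ORDER BY avg_monthly_sales DESC
""").fetchall()

print(rows)  # [('Bob', 500.0), ('Alice', 300.0)]
```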
Optimizing queries is only half the battle. To maintain database health over time, you need continuous monitoring that alerts you when performance degrades. This is especially important for production environments where new data, schema changes, or traffic spikes can turn a fast query into a bottleneck overnight.
Application Performance Monitoring (APM) tools provide deep visibility into how your application interacts with the database. Unlike manual DMV queries, APM solutions automatically capture slow SQL statements, correlate them with application transactions, and track performance trends over time.
Site24x7 APM Insight supports Java, .NET, Ruby on Rails, PHP, Node.js, and Python applications. It automatically traces database calls, captures the exact SQL statements that take the longest to execute, and displays them with execution time and call counts. You can drill down from a slow web transaction directly to the specific database query causing the delay.
Beyond application-level tracing, monitoring the database server itself provides additional context for diagnosing performance issues. Site24x7 offers dedicated monitoring for major databases including MySQL, PostgreSQL, SQL Server, and Oracle. These integrations track server-level health metrics such as buffer pool usage, connection counts, lock waits, and I/O throughput.
For MySQL environments, Site24x7’s log management can parse MySQL slow query logs, categorizing them by host, query time, lock time, rows sent, and rows examined. The dashboard surfaces the top 10 slow queries by different parameters, giving you an immediate view of where to focus your optimization efforts.
Establish performance baselines for your most critical queries so that you can detect regressions early. Key metrics to track include:
Configure alerts on these metrics so your team is notified before slow queries impact end users.
Improving SQL query performance for large datasets requires a combination of diagnostic techniques and targeted optimizations. Start by identifying slow queries using DMVs and execution plans. Then apply indexing strategies, limit the data your queries return, and enable parallel execution where appropriate.
Equally important are the query writing practices you follow every day: selecting only necessary columns, using explicit JOINs, filtering early with WHERE clauses, and avoiding functions on indexed columns. These habits prevent performance issues before they occur.
Finally, invest in continuous monitoring with tools like Site24x7 APM Insight to detect regressions early and maintain database health as your data grows. When your queries are optimized and monitored, your end users get the fast, reliable experience they expect.
Site24x7 APM Insight agents (for Java, .NET, Ruby on Rails, PHP, Node.js, and Python) automatically trace database calls. They capture the exact SQL statements that take the longest to execute and display them with execution time and call counts.
Yes, Site24x7 provides end-to-end transaction tracing. You can drill down from a slow web transaction directly to the specific database query causing the delay.
Site24x7 offers dedicated monitoring for major databases like MySQL, PostgreSQL, SQL Server, and Oracle to track server-level health, buffer usage, and connection stats alongside query performance.
Common causes include missing or unused indexes, using SELECT * instead of specific columns, inefficient JOIN operations, full table scans on large datasets, poorly written WHERE clauses, and lack of query execution plan analysis.
You can improve performance by adding appropriate indexes, enabling parallel execution, updating table statistics, increasing server resources, partitioning large tables, and using database caching mechanisms.
You can use SQL Server Dynamic Management Views (DMVs), execution plans in SQL Server Management Studio or Azure Data Studio, SQL Server Extended Events, and APM solutions like Site24x7 APM Insight that automatically trace and identify slow database operations.