Sunday, August 30, 2009

Inside LINQ queries

This time we will dig a little bit deeper inside LINQ queries. We will see what C# 3.0 compiler is doing behind the scenes when it meets LINQ query in code and how we can use it. Lets look at simple LINQ query:

var q =
    from i in new[] {1, 2, 3, 4, 5, 6}
    where i < 5
    select i.ToString("C");

Compiler handles LINQ query in two phases. Firstly it translates the query into a chain of method calls, each operator in LINQ syntax like where, select, ... is translated respectively into the methods Where, Select, ... .

var q =
    new[] {1, 2, 3, 4, 5, 6}.
    Where(i => i < 5).
    Select(i => i.ToString("C"));

So operator where was translated into Where method and select into Select method. As you can see lambda expressions have been also generated and put as a methods' parameters. The second phase is a discovery phase. It's a phase of looking for the marching methods. In our sample the type "array of int" does not have any instance method named Where so compiler is analyzing suitable extensions methods. The closest extension method with matching parameters is defined in System.Linq.Enumerable class so this is the one that will be used. Where method is a generic method and in our case it returns IEnumerable<int> type. The implementation of Where method is quite simple and looks like this:

public static class Enumerable
{
    public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source,
        Func<TSource, bool> predicate)
    {
        foreach (var i in source)
            if(predicate(i))
                yield return i;
    }        
}

Now the compiler is looking for Select method. Once again type IEnumerable<int> does not contain instance Select method but but an Enumerable class contains matching one. Finally, our query returns type IEnumerable<string> and the code builds succeeded because all necessary methods has been found. If you are interested how more complicated queries are translated and you are using ReSharper, just point your mouse over a LINQ query, press Alt + Enter and choose "Convert LINQ to methods chain" option. ReSharper will automatically generate method chain for you.

Now we can see that LINQ query is just a chain of method calls, nothing more. The real power of LINQ is the implementation of these methods. .Net Framework 3.5 gives us Enumerable which contains about 50 methods with many different overloads such as Where, Select, Join, GoupBy, OrderBy, etc. Almost all methods extend IEnumerable<T> type so they can be used with all types implementing this interface, almost each collection in .Net Framework implementing this interface, which applies to almost all collections in the .NET Framework. I truly encourage you to familiarize with those methods because they are really powerful and only few of them are used in simple from ... where ... select LINQ queries.

Once we know how LINQ queries are handled by C# compiler we are ready to do some tricky stuff.

Lets say we would like to use the second overload of the Where method from Enumerable class inside LINQ query. This overload allows us to use the index of the iterated item and the the implementation looks like this:

public static class Enumerable
{
    public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source,
        Func<TSource, int, bool> predicate)
    {
        int index = 0;
        foreach (var i in source)
            if (predicate(i, index++))
                yield return i;
    }    
}

It means that we can write a condition based on the index of the items stored in queried sequence. Although we cannot extend syntax of the LINQ query but we can extend the IEnumerable<T> type with our own Where method implementation changing the second parameter's type.

static class EnhancedEnumerable
{
    public static IEnumerable<T> Where<T>(this IEnumerable<T> collection,
        Func<T, Func<int, bool>> superPredicate)
    {
        int index = 0;
        foreach (var item in collection)
            if (superPredicate(item).Invoke(index++))
                yield return item;
    }
}

Now we are able to write a query based on the index of an item.

var q =
    from i in new[] { 1, 2, 3, 4, 5, 6 }
    where index  => 
        {
            Console.WriteLine("{0}. {1}", index, i);
            return (index % 2 == 0) && (i < 5);
        }
    select i.ToString("C");

Another interesting thing we can do with LINQ query is that we can provide a set of objects describing some domain with appropriate "LINQ methods" called Where, Select, OrderBy and so on, so that we could write LINQ query over them. This approach give us very strong control over LINQ query syntax because just during compile time we can chose witch LINQ operators we support and which we don't. Lets look at the example.

Lets assume that we have a company and we have a simple application storing data about our employees. The data are stored in files on the disk, each employee is stored in separate file. Additionally files are grouped in folders by some common feature. For example employees working in particular department are placed in the same folder. Lets assume also that the main types describing our domain look like this:

Company

Lets start with a simple LINQ query:

Company company= new Company();
var q =
    from e in company
    select e.Name;

This code will not compile because compiler cannot find matching Select method so lets provide it.

public class Company
{
    public IEnumerable<T> Select<T>(Func<Employee, T> p) { ... }
}

Is it worth to mention that variable e in our query is an instance of Employee type because matched Select method takes an Employee type as a first parameter of lambda expression. Now lets provider support for where operator.

var q =
    from e in company
    where e.Salary > 100
    select e.Name;

public class Company
{
    public Employee Where(Func<Employee, bool> p)) 
}

public abstract class Employee<T>
{
    public IEnumerable<T2> Select<T2>(Func<T, T2> p) { ... }
}

public sealed class Employee : Employee<Employee> { }

As you can see this is a very powerful mechanism which gives us ability to provide only those operators that are supported in our "query language". We could say that it is an alternative mechanism to writing a custom LINQ provider when we want to integrate LINQ queries with our specific domain. LINQ provider is much more flexible during process of analyzing and execution of queries but this approach has one very important feature. We can choose which operators we provide and what parameters the particular operator can take. Our query is validated at compile time, which means that if we don't support for instance sort operator and some query contains that operator, compilation will fail with error "SortBy method can not be found". When writing custom LINQ provider we have to validate LINQ query at runtime ourselves. To make this clear see next queries.

var q =
    from Manager m in company
    select m.ManagedProjects;
    
// this query is translated into: q = company.Cast<Manager>().Select(m => m.ManagedProjects);

public class Company
{
    public T Cast<T>() where T : ICast { ... }
}

public interface ICast
{ }
    
public sealed class Developer : Employee<Developer>, ICast { }
public sealed class Manager : Employee<Manager>, ICast { }

We provide Cast operator for the Company type so we can use Developer or Manager type in from clause (if fact each type implementing ICast). Please notice that the type of variable m is Manager not Employee in our query. Next query will show even more interesting feature.

var q =
   from e in company
   where e.Salary > 100     
   where e.IsManager
   where e.ManagedProjects > 1
   select new { e.Name, e.ManagedProjects };
   
public sealed class Employee
{
    public Manager IsManager { get { return null; } }

    public Manager Where(Func<Employee, Manager> p) { ... }
}  

Here the type of variable e in the first part of the query is Employee, but after line "where e.IsManager" the type of e is Manager. As we saw we may provide support for any subset of LINQ operators but that's not all. We can also provide the particular operator only for specified type, for example "only managers can be sorted" or "we can group employees only by Department or Localization":

var q =
    from Manager m in company                  
    orderby m.ManagedProjects descending, m.Name
    select new { m.Name, m.ManagedProjects };

// q = company.Cast<Manager>().
//    OrderByDescending(m => m.ManagedProjects).ThenBy(m => m.Name).
//    Select(m => new {m.Name, m.ManagedProjects});

public sealed class Manager
{
    public Manager OrderBy<T>(Func<Manager, T> p) { ... }
    public Manager OrderByDescending<T>(Func<Manager, T> p) { ... }
    public Manager ThenBy<T>(Func<Manager, T> p) { ... }
    public Manager ThenByDescending<T>(Func<Manager, T> p) { ... }
}


var q =
    from Developer e in company
    group e by e.Location;
    
// q = company.Cast<Developer>().GroupBy(e => e.Location);

public abstract class Employee<T>    
{
    public Location Location { get; protected set; }
    public Department Department { get; protected set; }
    
    public IEnumerable<Group<Location,T>> GroupBy(Func<T, Location> p) { ... }
    public IEnumerable<Group<Department, T>> GroupBy(Func<T, Department> p) { ... }
}  

public class Group<TKey, T>
{
    public TKey Key { get; private set;}
    public IEnumerable<T> Items { get; private set;}
    
    internal Group(TKey key, IEnumerable<T> items)
    {
        Key = key;
        Items = items;
    }
}

Finally I'd like to show you how smart the C# 3.0 compiler is. When compiler meets a LINQ query with a simple select operator like this:

var q =
    from i in new[] {1, 2, 3, 4, 5, 6}
    where i < 5
    select i;

The small small optimization takes place. Because our query after translation looks like this:

var q =
    new[] {1, 2, 3, 4, 5, 6}.
    Where(i => i < 5).
    Select(i => i);

so probably calling Select method does not change anything (converting i into i) so this call is not generated and the final code looks like this:

var q =
    new[] {1, 2, 3, 4, 5, 6}.
    Where(i => i < 5);  

And of course in this particular case (LINQ to objects) it is true. But if we provide our own Select method returning DateTime type and if we place EnhancedEnumerable class "closer" to the above query (for instance in the same namespace) then our method will be used instead of standard LINQ operator (System.Linq.Enumerable.Where).

static class EnhancedEnumerable
{
    public static DateTime Select<TSource, TResult>(this IEnumerable<TSource> source,
        Func<TSource, TResult> predicate)
    {
        return DateTime.Now;
    }
}

In such case will the optimization take place or not ? Yes, it will. The optimization is performed in the first phase of query translation when we don't know yet which Select method will be used or even if any exits. There are more such optimizations so we should be careful when writing more advanced queries.

Treat this post as a prerequisite for next few posts where we will be talking more about LINQ internals.

The whole "queryable" solution with classes: Company, Employee, Developer and so on can be downloaded from here.

No comments: