Difference between revisions of "MongoDB Aggregate Pipeline"

From mi-linux
Jump to navigationJump to search
(Created page with " == Aggregation Pipeline == So far '''find()''' either returns all the elements of an array, if one element matches the search criteria, or '''$elematch''' returns the first...")
 
 
(39 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
[[Main Page]] >> [[MongoDB|MongoDB]] >>[[MongoDB_Workbook|MongoDB Workbook]] >> Aggregation Pipeline
  
 
== Aggregation Pipeline ==
 
== Aggregation Pipeline ==
 
So far '''find()''' either returns all the elements of an array, if one element matches the search criteria, or '''$elematch''' returns the first one found only. The latter is fine if there is only one to be found, but not so good if several items in the array should match the search criteria.
 
  
 
The aggregation pipeline is a framework for data aggregation modelled on the concept of data processing pipelines. What this means, is documents enter a multi-stage pipeline that transforms the documents into aggregated results.
 
The aggregation pipeline is a framework for data aggregation modelled on the concept of data processing pipelines. What this means, is documents enter a multi-stage pipeline that transforms the documents into aggregated results.
  
 
This is similar to using GROUP BY in SQL, where you might aggregate the average grades of all students taking a module.
 
This is similar to using GROUP BY in SQL, where you might aggregate the average grades of all students taking a module.
 +
 +
The power of the aggregation pipeline is to do processing on the data.
  
 
The MongoDB aggregation pipeline consists of stages and each stage transforms the documents as they pass through the pipeline. A stage can generate new documents or filter out documents. A stage can also appear several times in the pipeline.
 
The MongoDB aggregation pipeline consists of stages and each stage transforms the documents as they pass through the pipeline. A stage can generate new documents or filter out documents. A stage can also appear several times in the pipeline.
 +
 +
The ''aggregate()'' method is used for aggregation and each pipeline stage appear in an array. Documents pass through the stages in sequence.
  
 
The syntax is:
 
The syntax is:
  
<pre style="color:blue">
+
<pre style="color:purple">
 
db.collectionName.aggregate( [ { <stage> }, ... ] )
 
db.collectionName.aggregate( [ { <stage> }, ... ] )
 
</pre>
 
</pre>
Line 18: Line 21:
 
The pipeline for instance, could:
 
The pipeline for instance, could:
  
* project out certain details from each document, such as the employees;
+
* project out certain details from each document, such as ''sal'' and ''comm'';
* group the projected details by a certain fields and then using an aggregate function, such as group by the deptno and then counting the number of occurrences;
+
* group the projected details by a certain field and then use an aggregate function. For example, group by the deptno and then count the number of occurrences;
* sorting the results in order;
+
* sort the results in order;
* limiting the results to a certain number, such as the first 10;
+
* limit the results to a certain number, such as the first 10;
  
 
These are represented by the following operators: '''$project''','''$group''', '''$sort''' or '''$limit'''.  
 
These are represented by the following operators: '''$project''','''$group''', '''$sort''' or '''$limit'''.  
Line 29: Line 32:
 
https://docs.mongodb.com/manual/reference/operator/aggregation/
 
https://docs.mongodb.com/manual/reference/operator/aggregation/
  
 +
=== $group ===
  
=== $filter ===
+
''$group'' will take a set of input documents, group them by a specified key and then apply an aggregate function to each group.
  
In the aggregation pipeline, array has various operators and the one we are interested in is [https://docs.mongodb.com/manual/reference/operator/aggregation/filter/#exp._S_filter $filter]
+
For example, to sum the salaries found in the ''emp'' collection:
  
This returns a subset of the array with only the elements that match the filter condition.
+
db.emp.aggregate ( [
 +
{ $group:
 +
        { _id: "$deptno", total: {$sum: "$sal"} } }
 +
])
  
$filter has the following syntax:
+
This is similar to the SQL command:
  
<pre style="color:blue">
+
<pre style="color:purple">
{ $filter: {
+
SELECT deptno, sum(sal) AS total FROM emp
input: <array>,   /* expression for the array */
+
GROUP BY deptno;
as: <string>,   /* variable name for the element */ 
 
cond: <expression> /* filter condition */
 
} }
 
 
</pre>
 
</pre>
  
'''$filter''' is one of the stages of the pipeline and can not be used by itself. It is used with an aggregation framework operator, such as '''$project'''.
+
=== $lookup ===
  
 +
''$lookup'' can be used to provide a left outer join between two collections, such as ''dept'' and ''emp''.
  
We can use the pipeline to ''gather'' elements of our employees array to get the employees matching the query criteria only, rather than one, or everyone. This example has only one stage:
+
The $lookup stage does an equality match between a column from the input documents with a column from the joined collection.
  
  db.deptCollection.aggregate([ {
+
In this example, the ''dept'' collection provides the input documents and the lookup will be performed in the ''emp'' collection. ''deptno'' in both documents is used to join the collections:
       $project: {
+
 
        empSet: {
+
  db.dept.aggregate([ {
            $filter: {
+
       $lookup: {
              input: "$employees",
+
          from: "emp",
              as: "employee",
+
          localField: "deptno",
              cond: { $gte: [ "$$employee.sal", 2000 ] }
+
          foreignField: "deptno",
            }
+
          as: "employees"
 +
        }
 +
  }
 +
]).pretty()
 +
 
 +
This is the equivalent of the SQL statement:
 +
 
 +
<pre style="color:purple">
 +
SELECT *
 +
FROM dept
 +
LEFT OUTER JOIN emp
 +
ON dept.deptno = emp.deptno;
 +
</pre>
 +
 
 +
The ''mgr'' field holds the ''empno'' of each employee's manager. To provide a list of employees the who they manage involved joining a collection to itself:
 +
 
 +
  db.emp.aggregate([ {
 +
      $lookup: {
 +
          from: "emp",
 +
          localField: "empno",
 +
          foreignField: "mgr",
 +
          as: "manages"
 
         }
 
         }
 +
    }
 +
  ]).pretty()
 +
 +
An empty set [] will be returned for the ''manages'' field for any employees who are not managers.
 +
 +
=== Complex query ===
 +
 +
A more complex aggregate pipeline groups the ''emp'' collection by deptno and carries out several calculations:
 +
 +
db.emp.aggregate(
 +
  [
 +
      {$group : {
 +
          _id : { _id : "$deptno"},
 +
          averageSalary: { $avg: "$sal" },
 +
          totalSal: { $sum: { $sum: "$sal"} },
 +
          totalAnnualSal: { $sum: { $multiply: [ "$sal", 12 ] } },
 +
          count: { $sum: 1 }
 +
        }
 +
      }
 +
  ]
 +
)
 +
 +
The calculations are:
 +
* averageSalary returns the average salary
 +
* totalSal returns the sum of the salary by department
 +
* $multiply will multiple the sal by 12 to give the total annual salary by department
 +
* $sum: 1 can be used to aggregate a value of one for each document processed, so here will produce count of employees by department
 +
 +
=== $match ===
 +
 +
In a larger collection you may not want all the documents processed in the query. ''$match'' can be used to filter the documents required and passes them to the next stage in the pipeline.
 +
 +
The following query is a repeat of above, but this time only processes documents where the salary is £3000 or above:
 +
 +
db.emp.aggregate(
 +
  [
 +
  {$match: {sal : {$gte:3000}}},
 +
  {$group : {
 +
        _id : { _id : "$deptno"},
 +
        averageSalary: { $avg: "$sal" },
 +
        totalSal: { $sum: { $sum: "$sal"} },
 +
        totalAnnualSal: { $sum: { $multiply: [ "$sal", 12 ] } },
 +
        count: { $sum: 1 }
 +
        }
 
       }
 
       }
   }
+
   ]
  ]).pretty()
+
)
 +
 
 +
=== $sort ===
 +
 
 +
The order of the results returned can depend on how the data was stored. When dealing with a large set of results you typically would want them sorted in some way. This can be achieved using the ''$sort'' command.
 +
 
 +
The syntax is:
 +
 
 +
<pre style="color:purple">
 +
{$sort: {field1:<sort order>, field2: <sort order> .... }}
 +
</pre>
 +
 
 +
Where <sort order> can have one of the following values:
 +
* 1 to specify ascending order
 +
* -1 to specify descending order
 +
 
 +
The previous query can be extended to sort by the _id, which represented the deptno:
 +
 
 +
db.emp.aggregate(
 +
  [
 +
  {$match: {sal : {$gte:3000}}},
 +
  {$group : {
 +
        _id : { _id : "$deptno"},
 +
        averageSalary: { $avg: "$sal" },
 +
        totalSal: { $sum: { $sum: "$sal"} },
 +
        totalAnnualSal: { $sum: { $multiply: [ "$sal", 12 ] } },
 +
        count: { $sum: 1 }
 +
        }
 +
      },
 +
    {$sort: {_id: 1}}
 +
  ]
 +
)
 +
 
 +
The above is also an example of a pipeline with multiple stages: $match, then $group and finally $sort.
 +
 
 +
$sort can be used with a simpler query just to sort the results. For example, sort the employee details by deptno, then ename:
 +
 
 +
db.emp.aggregate(  [   
 +
  {$sort: {deptno:1, ename: 1}} 
 +
  ] ).pretty()
  
Now the system should '''only''' retrieve the employees with salary > 2000.
+
== Other Functions ==
  
 
=== Count ===
 
=== Count ===
  
The power of the aggregation pipeline is to do processing on the data.
+
Documents can also be counted using the ''count'' function.
 +
 
 +
Lets count how many employees are in department 10:
 +
 
 +
db.emp.count({deptno: 10})
 +
 
 +
 
 +
You can also add count() to a find query to count the records returned, instead of listing them:
 +
 
 +
db.dept.find({dname:"SALES"}).count()
 +
 
 +
=== Distinct ===
 +
 
 +
Sometimes you want to find the distinct values for a specified column (similar to distinct in SQL):
 +
 
 +
db.emp.distinct("deptno")
 +
 
 +
 
 +
== Exercise 2.3 ==
 +
 
 +
2.3.1 Try the ''$lookup'' function with the collections swapped:
  
Lets count how many employees each department has:
+
db.emp.aggregate([ {
 +
      $lookup: {
 +
          from: "dept",
 +
          localField: "deptno",
 +
          foreignField: "deptno",
 +
          as: "dept"
 +
        }
 +
    }
 +
  ])
  
db.deptCollection.aggregate({
+
Explain how the results differ to the previous example.
  "$project": {
 
    "deptno": 1,
 
    "Count": { "$size": { "$ifNull": [ "$employees", [] ] }
 
      }
 
    }})
 
  
The ''$ifNull'' operator is needed, since department 40 has no employees - you will get an error message if left out!
+
2.3.2 List all the documents in the ''dept'' collection, sorted by ''loc''.
  
 
== Next Step ==
 
== Next Step ==
  
 
[[MongoDB_Update|Updating]] the collection
 
[[MongoDB_Update|Updating]] the collection

Latest revision as of 10:46, 11 February 2021

Main Page >> MongoDB >>MongoDB Workbook >> Aggregation Pipeline

Aggregation Pipeline

The aggregation pipeline is a framework for data aggregation modelled on the concept of data processing pipelines. What this means, is documents enter a multi-stage pipeline that transforms the documents into aggregated results.

This is similar to using GROUP BY in SQL, where you might aggregate the average grades of all students taking a module.

The power of the aggregation pipeline is to do processing on the data.

The MongoDB aggregation pipeline consists of stages and each stage transforms the documents as they pass through the pipeline. A stage can generate new documents or filter out documents. A stage can also appear several times in the pipeline.

The aggregate() method is used for aggregation and each pipeline stage appear in an array. Documents pass through the stages in sequence.

The syntax is:

db.collectionName.aggregate( [ { <stage> }, ... ] )

The pipeline for instance, could:

  • project out certain details from each document, such as sal and comm;
  • group the projected details by a certain field and then use an aggregate function. For example, group by the deptno and then count the number of occurrences;
  • sort the results in order;
  • limit the results to a certain number, such as the first 10;

These are represented by the following operators: $project,$group, $sort or $limit.

A number of operations exist for the aggregation pipeline, details of which can be found in the MongoDB manual:

https://docs.mongodb.com/manual/reference/operator/aggregation/

$group

$group will take a set of input documents, group them by a specified key and then apply an aggregate function to each group.

For example, to sum the salaries found in the emp collection:

db.emp.aggregate ( [
{ 	$group: 
       { _id: "$deptno", total: {$sum: "$sal"} } }
])

This is similar to the SQL command:

SELECT deptno, sum(sal) AS total FROM emp
GROUP BY deptno;

$lookup

$lookup can be used to provide a left outer join between two collections, such as dept and emp.

The $lookup stage does an equality match between a column from the input documents with a column from the joined collection.

In this example, the dept collection provides the input documents and the lookup will be performed in the emp collection. deptno in both documents is used to join the collections:

db.dept.aggregate([ {
     $lookup:  {
         from: "emp",
         localField: "deptno",
         foreignField: "deptno",
         as: "employees"
       }
  }
]).pretty()

This is the equivalent of the SQL statement:

SELECT * 
FROM dept 
LEFT OUTER JOIN emp
ON dept.deptno = emp.deptno;

The mgr field holds the empno of each employee's manager. To provide a list of employees the who they manage involved joining a collection to itself:

 db.emp.aggregate([ {
      $lookup:  {
          from: "emp",
          localField: "empno",
          foreignField: "mgr",
          as: "manages"
        }
   }
 ]).pretty()

An empty set [] will be returned for the manages field for any employees who are not managers.

Complex query

A more complex aggregate pipeline groups the emp collection by deptno and carries out several calculations:

db.emp.aggregate(
  [
     {$group : {
         _id : { _id : "$deptno"},
         averageSalary: { $avg: "$sal" },
         totalSal: { $sum: { $sum: "$sal"} },
         totalAnnualSal: { $sum: { $multiply: [ "$sal", 12 ] } },
         count: { $sum: 1 }
       }
     }
  ]
)

The calculations are:

  • averageSalary returns the average salary
  • totalSal returns the sum of the salary by department
  • $multiply will multiple the sal by 12 to give the total annual salary by department
  • $sum: 1 can be used to aggregate a value of one for each document processed, so here will produce count of employees by department

$match

In a larger collection you may not want all the documents processed in the query. $match can be used to filter the documents required and passes them to the next stage in the pipeline.

The following query is a repeat of above, but this time only processes documents where the salary is £3000 or above:

db.emp.aggregate(
  [
  {$match: {sal : {$gte:3000}}},
  {$group : {
       _id : { _id : "$deptno"},
       averageSalary: { $avg: "$sal" },
       totalSal: { $sum: { $sum: "$sal"} },
       totalAnnualSal: { $sum: { $multiply: [ "$sal", 12 ] } },
       count: { $sum: 1 }
       }
     }
  ]
)

$sort

The order of the results returned can depend on how the data was stored. When dealing with a large set of results you typically would want them sorted in some way. This can be achieved using the $sort command.

The syntax is:

{$sort: {field1:<sort order>, field2: <sort order> .... }}

Where <sort order> can have one of the following values:

  • 1 to specify ascending order
  • -1 to specify descending order

The previous query can be extended to sort by the _id, which represented the deptno:

db.emp.aggregate(
  [
  {$match: {sal : {$gte:3000}}},
  {$group : {
       _id : { _id : "$deptno"},
       averageSalary: { $avg: "$sal" },
       totalSal: { $sum: { $sum: "$sal"} },
       totalAnnualSal: { $sum: { $multiply: [ "$sal", 12 ] } },
       count: { $sum: 1 }
       }
     },
    {$sort: {_id: 1}}
  ]
)

The above is also an example of a pipeline with multiple stages: $match, then $group and finally $sort.

$sort can be used with a simpler query just to sort the results. For example, sort the employee details by deptno, then ename:

db.emp.aggregate(   [     
  {$sort: {deptno:1, ename: 1}}   
] ).pretty()

Other Functions

Count

Documents can also be counted using the count function.

Lets count how many employees are in department 10:

db.emp.count({deptno: 10})


You can also add count() to a find query to count the records returned, instead of listing them:

db.dept.find({dname:"SALES"}).count()

Distinct

Sometimes you want to find the distinct values for a specified column (similar to distinct in SQL):

db.emp.distinct("deptno")


Exercise 2.3

2.3.1 Try the $lookup function with the collections swapped:

db.emp.aggregate([ {
      $lookup:  {
          from: "dept",
          localField: "deptno",
          foreignField: "deptno",
          as: "dept"
        }
   }
 ])

Explain how the results differ to the previous example.

2.3.2 List all the documents in the dept collection, sorted by loc.

Next Step

Updating the collection