Indexing Metrics Guide

This guide describes the indexing metric providers available in Scava platform. It provides the mapping so that users can query the index.

org.eclipse.scava.metricprovider.indexing.bugs 

Short name: bug indexing metric
Friendly name: Bugs tracking system indexer

This metric prepares and indexes analyses documents relating to bug tracking system.

Mapping Information:

bug.comment

Properties	Type
comment_Id	`keyword`
body	`text`
emotional_dimension *	`keyword`
sentiment	`keyword`
plain_text	`text`
request_reply_classification	`keyword`
content_class	`keyword`
contains_code	`boolean`
bug_Id	`keyword`
project_name	`keyword`
creator	`keyword`
created_at	`date`
uid	`keyword`
referring_to bugs * commits *	`Object` `text` `text`

bug.post

Properties	Type
created_at	`date`
bug_summary	`text`
severity	`keyword`
bug_id	`keyword`
project_name	`text`
creator	`keyword`
uid	`keyword`
migration_issue found problematic_changes change * type	`Object` `boolean` `Object` `text` `double`

Additional Note:

Asterisk (*) is used to indicate entries that are stored as an array.
Type Object is used to indicate an entry with a nested JSON object passed as its value. The depth of nesting is shown with Indentation at each node/level.

org.eclipse.scava.metricprovider.indexing.commits 

Short name: metricprovider.indexing.commits
Friendly name: Commits indexer

This metric prepares and indexes documents related to commits.

Mapping Information:

commits

Properties	Type
created_at	`date`
project_name	`keyword`
uid	`text`
repository	`keyword`
revision	`keyword`
author	`keyword`
author_email	`keyword`
body	`text`
plain_text	`text`
commits_references *	`text`
bugs_references *	`text`

Additional Note:

Asterisk (*) is used to indicate entries that are stored as an array.

org.eclipse.scava.metricprovider.indexing.communicationchannels 

Short name: communication channels indexing metric
Friendly name: communication channels indexer

This metric prepares and indexes documents relating to communication channels.

Mapping Information:

article

Properties	Type
article_Id	`date`
communication_channel_id	`keyword`
uid	`keyword`
thread_id *	`keyword`
project_name	`keyword`
message_body	`text`
subject	`text`
creator	`keyword`
created_at	`date`
emotional_dimension *	`keyword`
sentiment	`keyword`
plain_text	`text`
request_reply_classification	`keyword`
content_class	`keyword`
contains_code	`boolean`

thread

Properties	Type
communication_channel_id	`keyword`
uid	`keyword`
thread_id	`keyword`
project_name	`keyword`
subject	`text`
migration_issue found problematic_changes change * matching_score	`Object` `boolean` `Object` `text` `double`

Additional Note:

Asterisk (*) is used to indicate entries that are stored as an array.
Type Object is used to indicate an entry with a nested JSON object passed as its value. The depth of nesting is shown with Indentation at each node/level.

org.eclipse.scava.metricprovider.indexing.documentation 

Short name: communication channels indexing metric
Friendly name: communication channels indexer

This metric prepares and indexes documents relating to communication channels.

Mapping Information:

documentation_entry

Properties	Type
created_at	`date`
project_name	`keyword`
uid	`keyword`
documentation_id	`keyword`
documentation_entry_id	`keyword`
communication_channel_id	`keyword`
body	`text`
original_format_mime	`keyword`
original_format_name	`text`
plain_text	`text`
sentiment	`keyword`
readability	`keyword`
documentation_type *	`keyword`
licence_found	`boolean`
licence group name header_found score is_deprecated_license is_fsf_libre is_osi_approved license_comments	`Object` `keyword` `keyword` `boolean` `double` `boolean` `boolean` `boolean` `text`

documentation

Properties	Type
last_update	`date`
project_name	`keyword`
uid	`text`
documentation_id	`keyword`
documentation_entries *	`keyword`

Additional Note:

Asterisk (*) is used to indicate entries that are stored as an array.
Type Object is used to indicate an entry with a nested JSON object passed as its value. The depth of nesting is shown with Indentation at each node/level.

Query Examples

Here we present a series of examples of how you can use ElasticSearch queries for finding specific data or doing some aggregations.

More information about ElasticSearch queries can be found at https://www.elastic.co/guide/en/elasticsearch/reference/6.3/search.html

Example 1

Retrieve the documentation entries that have been determined containing a Getting Started guide.

GET documentation.documentation.entry.nlp/_search
{
  "query": {
    "match" : {
      "documentation_types": "Started"

    }
  }
}

As documentation_types is an array, we need to use a match within the query.

Example 2

Retrieve the documentation entries that have at least one of the following characteristics: - Are a Getting Started guide - Are a Development guide - Have a neutral sentiment

GET documentation.documentation.entry.nlp/_search
{
  "query": {
    "bool": {
      "should": [
          {
            "match" : {
              "documentation_types": "Started"
            }
          },
          {
            "match" : {
              "documentation_types": "Development"
            }
          },
          {
            "term":{
              "sentiment" : "__label__neutral"
            }
          }
      ]
    }
  }
}

In this case, the field sentiment is not an array, thus, for looking inside of it we should use a term query, while a match query for fields that are arrays. Furthermore, as we have multiple queries that are linked through OR, we need to declare a query of type bool and should. If the query would be to find the documentation entries that have a Getting Started, a Development guide and have a neutral sentiment, instead of using a should operation, it need to be used a must.

GET documentation.documentation.entry.nlp/_search
{
  "query": {
    "bool": {
      "must": [
          {
            "match" : {
              "documentation_types": "Started"
            }
          },
          {
            "match" : {
              "documentation_types": "Development"
            }
          },
          {
            "term":{
              "sentiment" : "__label__neutral"
            }
          }
      ]
    }
  }
}

Example 3

Searching for GitHub Issues comments containing the word issue in the field plain_text:

GET github.bug.comment.nlp/_search
{
  "query": {
    "match": {
      "plain_text": "issue"
    }
  }
}

Example 4

Searching in all the indexes the string Astérix:

GET _search
{
 "query": {
   "query_string": {
     "query": "Astérix"
   }
 }
}

In this case, as we are not quering any particular field, we use the query_string.

If we would like to search either for Astérix or Obélix, we can use boolean operators:

GET _search
{
 "query": {
   "query_string": {
     "query": "Astérix OR Obélix"
   }
 }
}

If both names must appear, instead of using OR, we can use AND.

GET _search
{
 "query": {
   "query_string": {
     "query": "Astérix AND Obélix"
   }
 }
}

Query strings such as Astérix Obélix will be considered to have an implicit OR, i.e. Astéric OR Obélix.

GET _search
{
 "query": {
   "query_string": {
     "query": "Astérix Obélix"
   }
 }
}

If the we are looking for the exact combination of words, such as Astérix and Obélix, we can use parenthesis to indicate ElasticSearch that the words should be split.

GET _search
{
 "query": {
   "query_string": {
     "query": "(Astérix and Obélix) OR (Felix the Cat)"
   }
 }
}

Example 5

ElasticSearch can do some statistics regarding the output of queries. In this example, we want extended statistics about the field readability in the documentation entries:

GET documentation.documentation.entry.nlp/_search
{
  "size": 0,
  "aggs" : {
        "readability_stats" : {
            "extended_stats" : {
                "field" : "readability"
            }
        }
    }
}

The option size=0, indicates that only the aggregation with the statistics must be returned, otherwise the query will return the elements that also match the searching query.

If instead of wanting some statistics, we want to autogenerate histograms, then we can use the following query:

POST documentation.documentation.entry.nlp/_search
{
  "size": 0,
  "aggs": {
    "histogram_analysis": {
      "histogram" : {
                "field" : "readability",
                "interval" : 2
      }
    }
  }
}

Example 6

With the indexes it is possible to do aggregations and how many times a field value defined as a keyword, such as documentation_id, appears in an index.

POST documentation.documentation.entry.nlp/_search
{
  "size": 0, 
  "aggs": {
    "doc_id": {
      "terms": {
        "field": "documentation_id",
        "size": 100
      }
    }
  }
}

Keys	Action
`?`	Open this help
`←`	Previous page
`→`	Next page
`s`	Search

Indexing Metrics Guide

org.eclipse.scava.metricprovider.indexing.bugs

org.eclipse.scava.metricprovider.indexing.commits

org.eclipse.scava.metricprovider.indexing.communicationchannels

org.eclipse.scava.metricprovider.indexing.documentation

Query Examples

Example 1

Example 2

Example 3

Example 4

Example 5

Example 6

org.eclipse.scava.metricprovider.indexing.bugs 

org.eclipse.scava.metricprovider.indexing.commits 

org.eclipse.scava.metricprovider.indexing.communicationchannels 

org.eclipse.scava.metricprovider.indexing.documentation 