docfreq

Bags of features, extracted in July 2018 from 7.8 million distinct files in the Public Git Archive (PGA), taking only the HEAD commit of each repository. Extraction used every extractor implemented in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and every language parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency here refers to the frequency of each feature across all documents; only features that appeared at least 5 times were kept.

Example:

from sourced.ml.models import OrderedDocumentFrequencies
df = OrderedDocumentFrequencies().load("55215392-36fc-43e5-b277-500f5b68d0c6")
print("Number of documents:", len(df))
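
To make the notion of document frequency concrete, here is a minimal plain-Python sketch. It is not the sourced.ml implementation: the function name `document_frequencies`, the `min_df` parameter, and the feature strings are hypothetical, and it assumes the threshold applies to the number of documents a feature occurs in.

```python
from collections import Counter


def document_frequencies(docs, min_df=5):
    """Count, for each feature, the number of documents that contain it.

    `docs` is an iterable of feature lists, one list per document (file).
    Features seen in fewer than `min_df` documents are dropped.
    """
    counts = Counter()
    for features in docs:
        # A feature counts once per document, however often it repeats inside it.
        counts.update(set(features))
    return {feature: n for feature, n in counts.items() if n >= min_df}


# Hypothetical toy corpus of three "files" with made-up feature names.
docs = [
    ["i.foo", "i.bar", "i.foo"],
    ["i.foo", "l.42"],
    ["i.bar", "i.foo"],
]

# A small threshold is used here only so the toy corpus keeps some features.
for feature, n in sorted(document_frequencies(docs, min_df=2).items()):
    print(feature, n)
```

With the model above, the same idea is applied to 7,873,334 files with a threshold of 5.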

References

ID: 55215392-36fc-43e5-b277-500f5b68d0c6
Uploaded: 2018-06-20 14:51:45.469503
Version: 1.0.0
File: https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf
Size: 69.9 MB
Data collection date: July 2018
Number of distinct documents (files): 7,873,334
Number of distinct features: 6,194,874
License