Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the HEAD commit), using all extractors implemented in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency here refers to the frequency of each feature across all documents; only features that appeared at least 5 times were kept.
Example:
from sourced.ml.models import OrderedDocumentFrequencies
df = OrderedDocumentFrequencies().load("55215392-36fc-43e5-b277-500f5b68d0c6")
print("Number of distinct features:", len(df))
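
To make the pruning rule in the description concrete, here is a minimal standalone sketch of how document frequencies might be computed and thresholded. It does not use sourced.ml. The function name `document_frequencies`, the `min_df` parameter, and the toy feature names are hypothetical. It also assumes the threshold applies to the number of documents containing a feature, which the description does not state explicitly.

```python
from collections import Counter


def document_frequencies(documents, min_df=5):
    """Count how many documents contain each feature, then drop rare features.

    Assumption: `min_df` is compared against the per-feature document count.
    """
    df = Counter()
    for features in documents:
        # A feature counts once per document, however often it repeats there.
        df.update(set(features))
    return {feature: count for feature, count in df.items() if count >= min_df}


# Hypothetical toy bags of features; the names are illustrative only.
docs = [
    ["i.foo", "i.bar", "i.foo"],
    ["i.foo", "i.baz"],
    ["i.foo", "i.bar"],
]

# A threshold of 2 is used so the toy data shows pruning; the model used 5.
pruned = document_frequencies(docs, min_df=2)
print(sorted(pruned.items()))  # [('i.bar', 2), ('i.foo', 3)]
```

With the toy input, `i.baz` occurs in a single document and is dropped, while the repeated `i.foo` in the first document still counts only once for that document.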
| ID | 55215392-36fc-43e5-b277-500f5b68d0c6 |
| Uploaded | 2018-06-20 14:51:45.469503 |
| Version | 1.0.0 |
| File | https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf |
| Size | 69.9 MB |
| Data collection date | July 2018 |
| Number of distinct documents (files) | 7,873,334 |
| Number of distinct features | 6,194,874 |
| License |