# datafu

**Repository Path**: mirrors_cloudera/datafu

## Basic Information

- **Project Name**: datafu
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: cdh4-0.0.4_4.1.4
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-08
- **Last Updated**: 2025-12-20

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# DataFu

DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig.

This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics. It is used at LinkedIn in many of our offline workflows for data-derived products like "People You May Know" and "Skills". It contains functions for:

* PageRank
* Quantiles (median), variance, etc.
* Sessionization
* Convenience bag functions (e.g., set operations, enumerating bags, etc.)
* Convenience utility functions (e.g., assertions, easier writing of EvalFuncs)
* and [more](http://sna-projects.com/datafu/javadoc/0.0.1/)...

Each function is unit tested, and code coverage is tracked for the entire library. It has been tested against Pig 0.9.

[http://sna-projects.com/datafu/](http://sna-projects.com/datafu/)

## What can you do with it?

Here's a taste of what you can do in Pig.

### Statistics

Compute the [median](http://en.wikipedia.org/wiki/Median) of a sorted bag of values:

```pig
define Median datafu.pig.stats.Median();

-- input: 3,5,4,1,2
input = LOAD 'input' AS (val:int);

grouped = GROUP input ALL;

-- produces median of 3
medians = FOREACH grouped {
  sorted = ORDER input BY val;
  GENERATE Median(sorted);
}
```

Similarly, compute any arbitrary [quantiles](http://en.wikipedia.org/wiki/Quantile):

```pig
define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');

-- input: 9,10,2,3,5,8,1,4,6,7
input = LOAD 'input' AS (val:int);

grouped = GROUP input ALL;

-- produces: (1,5.5,10)
quantiles = FOREACH grouped {
  sorted = ORDER input BY val;
  GENERATE Quantile(sorted);
}
```

### Set Operations

Treat sorted bags as sets and compute their intersection:

```pig
define SetIntersect datafu.pig.bags.sets.SetIntersect();

-- input: ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});

-- produces: ({(1),(4),(5)})
intersected = FOREACH input {
  sorted_b1 = ORDER B1 BY val;
  sorted_b2 = ORDER B2 BY val;
  GENERATE SetIntersect(sorted_b1,sorted_b2);
}
```

Compute the set union:

```pig
define SetUnion datafu.pig.bags.sets.SetUnion();

-- input: ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});

-- produces: ({(3),(4),(1),(2),(7),(5),(6),(0),(10)})
unioned = FOREACH input GENERATE SetUnion(B1,B2);
```

You can even operate on several bags at once:

```pig
unioned = FOREACH input GENERATE SetUnion(B1,B2,B3);
```

### Bag operations

Concatenate two or more bags:

```pig
define BagConcat datafu.pig.bags.BagConcat();

-- input: ({(1),(2),(3)},{(4),(5)},{(6),(7)})
input = LOAD 'input' AS (B1: bag{T: tuple(v:INT)}, B2: bag{T: tuple(v:INT)}, B3: bag{T: tuple(v:INT)});

-- produces: ({(1),(2),(3),(4),(5),(6),(7)})
output = FOREACH input GENERATE BagConcat(B1,B2,B3);
```

Append a tuple to a bag:

```pig
define AppendToBag datafu.pig.bags.AppendToBag();

-- input: ({(1),(2),(3)},(4))
input = LOAD 'input' AS (B: bag{T: tuple(v:INT)}, T: tuple(v:INT));

-- produces: ({(1),(2),(3),(4)})
output = FOREACH input GENERATE AppendToBag(B,T);
```
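### Sessionization

Sessionization appears in the feature list above but has no example in this README. The following is a minimal sketch only, assuming the `datafu.pig.sessions.Sessionize` UDF with a session timeout passed to its constructor and a time-sorted bag whose first field is an ISO8601 timestamp; the field names and data here are hypothetical, so consult the javadoc for the exact signature and output schema:

```pig
-- Sketch, not a tested recipe: assumes datafu.pig.sessions.Sessionize takes a
-- session timeout ('10m') and a bag sorted by time, where the first field of
-- each tuple is an ISO8601 timestamp, and that it appends a session id.
define Sessionize datafu.pig.sessions.Sessionize('10m');

-- hypothetical page-view data: (iso_time, member_id, url)
pv = LOAD 'pageviews' AS (iso_time:chararray, member_id:int, url:chararray);

pv_by_member = GROUP pv BY member_id;

-- starts a new session whenever a member is idle for more than 10 minutes
sessionized = FOREACH pv_by_member {
  ordered = ORDER pv BY iso_time;
  GENERATE FLATTEN(Sessionize(ordered)) AS (iso_time, member_id, url, session_id);
}
```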
### PageRank

Run PageRank on a large number of independent graphs:

```pig
define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');

topic_edges = LOAD 'input_edges' AS (topic:INT,source:INT,dest:INT,weight:DOUBLE);

topic_edges_grouped = GROUP topic_edges BY (topic, source);
topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
  group.topic AS topic,
  group.source AS source,
  topic_edges.(dest,weight) AS edges;

topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic;

topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
  group AS topic,
  FLATTEN(PageRank(topic_edges_grouped.(source,edges))) AS (source,rank);

skill_ranks = FOREACH topic_ranks GENERATE
  topic, source, rank;
```

This implementation stores the nodes and edges (mostly) in memory. It is therefore best suited to computing PageRank on many reasonably sized graphs in parallel.

## How To

### Build the JAR

```
ant jar
```

### Run all tests

```
ant test
```

### Run specific tests

Override `testclasses.pattern`, which defaults to `**/*.class`. For example, to run all tests defined in `QuantileTests`:

```
ant test -Dtestclasses.pattern=**/QuantileTests.class
```

### Compute code coverage

```
ant coverage
```

## Contribute

The source code is available under the Apache 2.0 license.

For help, please see the [discussion group](http://groups.google.com/group/datafu). Bugs and feature requests can be filed [here](http://linkedin.jira.com/browse/DATAFU).