Improving Android App Development's Efficiency and Quality through Machine Learning Techniques

刘世泽 Lau Shyh Tzer, David. Visiting Student, IIIS, Tsinghua University, Summer 2013. BSc. Computer Science, The Chinese University of Hong Kong.

Presentation Transcript


  1. Improving Android App Development's Efficiency and Quality through Machine Learning Techniques. 刘世泽 Lau Shyh Tzer, David. Visiting Student, IIIS, Tsinghua University, Summer 2013. BSc. Computer Science, The Chinese University of Hong Kong

  2. Background
     • Problem: the Android API keeps growing
     • API levels 1-18, over 20k API methods
     • It is difficult for developers, especially inexperienced ones, to master the usage of the Android API
     • Aim: adapt machine learning and reverse engineering techniques to analyze API usage patterns
     • Possibly develop a helper tool that suggests or fixes Android API usage during the development stage

  3. Workflow
     1. Adapt reverse engineering techniques to retrieve the needed data from packaged Android apps (.apk)
     2. Perform data mining on the resulting raw data to dig out interesting API usage patterns and relationships

  4. Reverse Engineering on Android App (workflow step 1)
     • We need to retrieve information from the package (.apk) file, not from source code
     • Perform static analysis, not dynamic analysis
     • There are basically three options:
       1. Decompile: .apk -> .jar (.class) bytecode. Becomes a Java problem; lots of analysis tools.
       2. Disassemble: .apk -> smali code. Good for hacking; takes time to become familiar with smali code.
       3. Retrieve the .dex file: .apk -> Dalvik bytecode. Low level; few analysis tools.

  5. Reverse Engineering on Android App: option 1, decompile .apk -> .jar (.class) bytecode
     • Use the open source tool dex2jar: https://code.google.com/p/dex2jar/
     • It supports decompiling an .apk directly to a .jar file:
         Linux:   sh dex2jar/d2j-dex2jar.sh someApk.apk
         Windows: dex2jar\d2j-dex2jar.bat someApk.apk
     • We can then treat the task as a Java problem and focus on static analysis of Java bytecode

  6. Reverse Engineering on Android App
     • To understand the usage patterns of the Android API, we have to know the structure of the code
     • An easy, abstract way to understand the structure of the code is to look at its Abstract Syntax Tree (AST)

  7. Generate AST from Java bytecode
     • Parsing Java source code into an Abstract Syntax Tree is straightforward, but parsing bytecode is not
     • Bytecode is a set of instructions that the JVM interprets as stack operations to run the program
     • The stack execution of bytecode differs from the program flow that we observe in source code

  8. Example: [Figure: the bytecode of an example method next to its AST, which has an assignment node (i3 = i1 * i2) and a return node (return i3 * 2)]

  9. Bytecode Outline Plugin for Eclipse http://asm.ow2.org/eclipse/index.html

  10. Generate AST from Java bytecode
      • Intuitively, the JVM interprets bytecode as stack execution, so we can 'recover' the code structure and construct the AST by simulating the JVM stack operations
      • Example: variable assignment. The source statement int i = 1; compiles to ICONST_1, ISTORE_0.
      • [Figure: the thread stack holding the constant 1, and the resulting Abstract Syntax Tree: an '=' node with children i and 1]

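  11. Example 2: the method from the previous bytecode example
      Source:
          public int method(int i1, int i2) {
              int i3 = i1 * i2;
              return i3 * 2;
          }
      Bytecode: ILOAD_1 ILOAD_2 IMUL ISTORE_3 ILOAD_3 ICONST_2 IMUL IRETURN
      [Figure: the thread stack contents after each instruction, and the resulting Abstract Syntax Tree: '=' with children i3 and '*' (i1, i2); 'return' with child '*' (i3, 2)]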
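The stack simulation described on slides 10 and 11 can be sketched in a few lines. This is a minimal illustration, not the author's ASTGenerator: the class name `StackSimulationSketch`, the textual mnemonics, and the S-expression output are all assumptions made for the example, and only the five instruction kinds from the slide are handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: rebuilds statement trees by simulating the JVM operand stack.
class StackSimulationSketch {

    // Takes textual mnemonics such as "ILOAD_1" and returns one S-expression per statement.
    static List<String> buildStatements(List<String> bytecode) {
        Deque<String> stack = new ArrayDeque<>();   // simulated operand stack; entries are subtrees
        List<String> statements = new ArrayList<>();
        for (String insn : bytecode) {
            if (insn.startsWith("ILOAD_")) {
                // Loading a local pushes a reference to that variable.
                stack.push("i" + insn.substring("ILOAD_".length()));
            } else if (insn.startsWith("ICONST_")) {
                // A constant pushes a leaf node.
                stack.push(insn.substring("ICONST_".length()));
            } else if (insn.equals("IMUL")) {
                // A binary operator pops two subtrees and pushes the combined one.
                String right = stack.pop();
                String left = stack.pop();
                stack.push("(* " + left + " " + right + ")");
            } else if (insn.startsWith("ISTORE_")) {
                // A store pops the value subtree and closes an assignment statement.
                String target = "i" + insn.substring("ISTORE_".length());
                statements.add("(= " + target + " " + stack.pop() + ")");
            } else if (insn.equals("IRETURN")) {
                statements.add("(return " + stack.pop() + ")");
            } else {
                throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> code = List.of("ILOAD_1", "ILOAD_2", "IMUL", "ISTORE_3",
                                    "ILOAD_3", "ICONST_2", "IMUL", "IRETURN");
        // Prints (= i3 (* i1 i2)) and then (return (* i3 2))
        buildStatements(code).forEach(System.out::println);
    }
}
```

Each stack entry is a subtree rather than a runtime value, which is why popping operands and pushing the combined node reproduces the tree shape shown on the slide.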

  12. Generate AST from Java bytecode
      • There are various kinds of AST structures, such as conditional statements, goto statements and compound statements, but they can all be 'recovered' from bytecode by simulating the stack execution as in the previous examples
      • However, reading the .class file directly yields binary data that is useless for our parsing
      • So we need a systematic way to parse the bytecode

  13. ASM - Bytecode Engineering Library
      • ASM is an all-purpose Java bytecode manipulation and analysis framework: http://asm.ow2.org/
      • It provides two powerful APIs: the Core API and the Tree API
      • The Core API provides a visitor interface for walking the bytecode
      • The Tree API parses the bytecode into objects
      • Refer to http://download.forge.objectweb.org/asm/asm4-guide.pdf for complete usage

  14. ASM - Bytecode Engineering Library
      • The Tree API is particularly useful for generating the AST from bytecode
      • It provides two important classes, ClassNode and MethodNode, which let the developer access the bytecode information directly
      • The bytecode (opcodes) of each method is stored in its InsnList instructions field

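  15. ASM - Bytecode Engineering Library: usage
      Access class information:
          import java.io.FileInputStream;
          import java.util.List;
          import org.objectweb.asm.ClassReader;
          import org.objectweb.asm.tree.*;

          // ClassReader(String) expects a class name, so a file is read through a stream
          ClassReader cr = new ClassReader(new FileInputStream("App1.class"));
          ClassNode cn = new ClassNode();
          cr.accept(cn, 0);   // ASM parses the whole class and stores the resulting objects in cn
          String className = cn.name;                         // the class name
          List<FieldNode> fields = cn.fields;                 // the class field variables
          List<InnerClassNode> innerClasses = cn.innerClasses; // the inner classes
      All the information is stored in the properties of the object, so simply read it from them.
      Access method information:
          // Each class contains several methods, hence the List<MethodNode> type
          List<MethodNode> mnList = cn.methods;
          for (MethodNode mn : mnList) {
              String name = mn.name;               // method name
              String desc = mn.desc;               // method descriptor (parameter and return value types)
              InsnList insns = mn.instructions;    // the method's opcodes
          }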

  16. ASM - Bytecode Engineering Library: access bytecode instructions
          InsnList insn = mn.instructions;
          Iterator<AbstractInsnNode> itr = insn.iterator();
          while (itr.hasNext()) {
              AbstractInsnNode ain = itr.next();
              int opcode = ain.getOpcode();
              int type = ain.getType();
              // simulate the stack execution here to generate the AST
          }
      • AbstractInsnNode is an abstract class that wraps the instructions; ASM separates instructions into 16 different kinds
      • getOpcode(): each bytecode instruction is defined as an integer constant in ASM, for example 3 = ICONST_0, 21 = ILOAD, 54 = ISTORE
      • getType(): the instruction types are also defined as integer constants, for example 4 = FIELD_INSN, 5 = METHOD_INSN
      • Details of the instruction, such as which local variable it stores to, the field variable ID, or the invoked method's signature, are stored in this object as properties

  17. My Design of AST Generator
      • With the support of the ASM Tree API, we can parse the bytecode and simulate each stack operation to construct the corresponding Abstract Syntax Tree systematically
      • To fit the data needed for data mining, I designed a customized Abstract Syntax Tree structure
      • ASTNode is the abstract parent class; a specifically designed class inherits from it for each code structure: ASTArithmeticNode, ASTArrayNode, ASTArrayValueNode, ASTCastNode, ASTClassNode, ASTConstantNode, ASTFieldNode, ASTFunctionNode, ASTJumpNode, ASTLabelNode, ASTLocalVariableNode, ASTReturnNode, ASTMethodNode, ASTObjectNode, ASTSwitchNode

  18. My Design of AST Generator: ASTNode
      • Methods: getASTKind; setName / getName; setSignature / getSignature; setCallBy / getCallBy; setUsedBy / getUsedBy; setUsedAsObject / getUsedAsObject
      • CallBy, UsedBy and UsedAsObject are stored as ArrayList<ASTNode> to handle multiple connections
      • An object and its methods are doubly connected to allow bidirectional traversal
      • Example:
          StringBuilder sb = new StringBuilder();
          String text = "Sample";
          sb.append(text);
          String result = sb.toString();
      • [Figure: the nodes sb, append, text, "Sample", toString and result, linked by setUsedBy, setCallBy and setUsedAsObject edges]

  19. My Design of AST Generator: ASTMethodNode, ASTLocalVariableNode, ASTFieldNode
      • ASTLocalVariableNode: setIndex / getIndex; setVariableType / getVariableType; setVariableValue / getVariableValue
      • ASTFieldNode: setFieldValue / getFieldValue
      • ASTMethodNode: addParameter / getPara; parameters are stored as ArrayList<ASTNode> to handle multiple connections
      • Trick: local variables are stored by index in the local variable array of each JVM stack frame. To track changes in local variable assignment (one variable can be assigned and used multiple times), create a hash table that holds references to the local variable nodes, so the assignment can be updated easily while parsing the bytecode
      • Trick: the same applies to field variables: create a hash table that holds references to the field variable nodes. Be careful that the local variable hash table is cleared on entering each method, whereas field values are tracked through the whole class
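The two hash-table tricks on slide 19 can be illustrated roughly as follows. This is a sketch under stated assumptions: `LocalVariableTableSketch` and its method names are hypothetical, and subtrees are represented as plain strings instead of the talk's ASTNode objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-method local tracking and per-class field tracking.
class LocalVariableTableSketch {
    // Local slot index -> subtree most recently stored into that slot.
    private final Map<Integer, String> locals = new HashMap<>();
    // Field name -> subtree most recently stored into that field.
    private final Map<String, String> fields = new HashMap<>();

    // Locals belong to one method, so the table is reset on entering each method.
    void enterMethod() {
        locals.clear();
    }

    // Called when a store instruction assigns a new value to a local slot.
    void store(int index, String valueTree) {
        locals.put(index, valueTree);
    }

    // Called when a load instruction reads a slot; unassigned slots (for example
    // method parameters) fall back to a plain variable name.
    String load(int index) {
        return locals.getOrDefault(index, "i" + index);
    }

    // Field values outlive a single method, so this table is never cleared here.
    void putField(String name, String valueTree) {
        fields.put(name, valueTree);
    }

    String getField(String name) {
        return fields.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        LocalVariableTableSketch table = new LocalVariableTableSketch();
        table.enterMethod();
        table.store(3, "(* i1 i2)");
        table.putField("count", "1");
        System.out.println(table.load(3));
        table.enterMethod();                        // next method: locals reset, fields kept
        System.out.println(table.load(3));
        System.out.println(table.getField("count"));
    }
}
```

Keeping the two tables separate makes the different lifetimes explicit: one is scoped to a method, the other to the class being parsed.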

  20. Data Flow Analysis on the AST
      • A complete Abstract Syntax Tree of an Android app provides very useful detail for various kinds of static analysis
      • My research focuses mainly on data flow analysis. Example: ?? = Builder.fromParts(String scheme, String ssp, String fragment);
        - What kind of object calls this method?
        - Where do the arguments come from?
        - Where does the return value go?

  21. Data Flow Analysis on the AST
      • Performed a depth-first search on the AST to trace the return value path, the argument path and the call-by path
      • Trick: use a hash table or list to record the path visited so far, to avoid infinite loops within the AST
      • [Figure: the AST of the QQ Android app, ASTClassNode -> ASTMethodNode; data is collected from each Android API invocation]
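A depth-first trace with a visited set, as described on slide 21, might look like the sketch below. The `Node` class is a hypothetical stand-in for the talk's ASTNode, and only the usedBy links are followed; the call-by and argument paths would be traced the same way over their own link lists.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: depth-first trace over usedBy links, guarded against cycles.
class DataFlowDfsSketch {

    // Minimal stand-in for an AST node with outgoing "used by" connections.
    static final class Node {
        final String name;
        final List<Node> usedBy = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Returns node names in the order the depth-first search first reaches them.
    static List<String> traceUsedBy(Node start) {
        List<String> order = new ArrayList<>();
        visit(start, new HashSet<>(), order);
        return order;
    }

    private static void visit(Node node, Set<Node> visited, List<String> order) {
        // Set.add returns false when the node was seen before, which breaks cycles.
        if (!visited.add(node)) {
            return;
        }
        order.add(node.name);
        for (Node next : node.usedBy) {
            visit(next, visited, order);
        }
    }

    public static void main(String[] args) {
        Node text = new Node("text");
        Node append = new Node("append");
        Node sb = new Node("sb");
        text.usedBy.add(append);
        append.usedBy.add(sb);
        sb.usedBy.add(append);                      // doubly connected, so the graph has a cycle
        System.out.println(traceUsedBy(text));      // prints [text, append, sb]
    }
}
```

The recursion is fine for a sketch; on very deep graphs an explicit stack would avoid exhausting the call stack.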

  22. Mining on the Analysis Data (workflow step 2)

  23. Preparing Mining Raw Data
      • Coded a web crawler to download free Android apps from the open Android market http://apk.gfan.com/
      • Successfully grabbed 10,266 valid Android apps and generated their ASTs and analysis data with the self-developed ASTGenerator, running on an Amazon Web Services High Memory cluster
      • Result data structure, entries shown on the slide: android/net/Uri$Builder appendQueryParameter; java/lang/StringBuilder append; java/lang/StringBuilder <init>; android/net/Uri buildUpon; android/net/Uri$Builder build; java/lang/String substring; android/net/Uri toString; java/lang/String indexOf; appname-android/net/Uri buildUpon-0; appname-android/net/Uri parse-0

  24. Mining on the Analysis Data
      • Used Weka 3 to perform the data mining task: http://www.cs.waikato.ac.nz/ml/weka/
      • It is convenient to convert the raw analysis data into Weka's ARFF input format, especially because ARFF supports a sparse matrix format
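To make the sparse ARFF point concrete, here is a small sketch that writes a count matrix in that format. In sparse ARFF, each data row lists only its non-zero cells as index/value pairs inside braces, with attribute indices starting at 0. The class name, relation name and sample matrix are assumptions for illustration; the talk does not show its actual converter.

```java
import java.util.StringJoiner;

// Hypothetical sketch: serializes an invocation-count matrix as sparse ARFF text.
class SparseArffSketch {

    // Emits only the non-zero cells of one row, e.g. {1 2, 3 1}.
    static String toSparseRow(int[] counts) {
        StringJoiner row = new StringJoiner(", ", "{", "}");
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] != 0) {
                row.add(i + " " + counts[i]);
            }
        }
        return row.toString();
    }

    // Builds the header (one numeric attribute per related API) followed by the data rows.
    static String toArff(String relation, String[] attributes, int[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String attribute : attributes) {
            // Quoted because API names such as "android/net/Uri parse" contain spaces.
            sb.append("@attribute '").append(attribute).append("' numeric\n");
        }
        sb.append("\n@data\n");
        for (int[] row : rows) {
            sb.append(toSparseRow(row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] apis = {"android/net/Uri parse", "java/lang/String indexOf",
                         "java/lang/StringBuilder append"};
        int[][] counts = {{0, 2, 0}, {1, 0, 3}};
        System.out.print(toArff("api_usage", apis, counts));
    }
}
```

Since most APIs are unrelated to most others, nearly every cell is zero, so the sparse form keeps the file far smaller than a dense matrix would.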

  25. Mining on the Analysis Data
      • Applied Hierarchical Agglomerative Clustering to the result matrix to discover the relationships between APIs and their usage patterns

  26. Mining on the Analysis Data
      • The analysis result data from 10,266 apps is huge (~50 GB of text), so mining it directly is time-consuming and unnecessary
      • Designed a MapReduce task and ran it on Hadoop to group the result data method by method and compute statistics on the invocation counts
      • Then perform hierarchical clustering only on methods that have enough data to reveal a meaningful pattern (for example, the number of invocations reaches a threshold)
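The grouping and thresholding step can be sketched in plain Java without Hadoop. This only mirrors the map (extract the method key) and reduce (sum the counts) phases in memory; the tab-separated record layout, the class name and the threshold value are assumptions, since the talk does not specify them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the count-then-filter step; not an actual Hadoop job.
class InvocationCountSketch {

    // Each record is assumed to look like "appname<TAB>class method".
    static Map<String, Long> countByMethod(List<String> records) {
        Map<String, Long> counts = new TreeMap<>();
        for (String record : records) {
            // "Map" phase: keep only the method key after the tab.
            String method = record.substring(record.indexOf('\t') + 1);
            // "Reduce" phase: sum one occurrence per record under that key.
            counts.merge(method, 1L, Long::sum);
        }
        return counts;
    }

    // Keeps the methods invoked at least `threshold` times as clustering candidates.
    static List<String> atLeast(Map<String, Long> counts, long threshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "app1\tandroid/net/Uri parse",
                "app2\tandroid/net/Uri parse",
                "app2\tandroid/location/Geocoder <init>");
        Map<String, Long> counts = countByMethod(records);
        System.out.println(counts);
        System.out.println(atLeast(counts, 2));     // prints [android/net/Uri parse]
    }
}
```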

  27. Mining on the Analysis Data
      • In total, 19,250 Android API methods were found to be invoked at least once among the 10,266 apps
      • A high invocation count does not by itself make a method more likely to yield a meaningful pattern; such methods tend to be very common usage, such as UI elements and Log
      • Methods with an average invocation count, especially those tied to a specific feature such as geo-location, network or database, might be the better targets for data mining

  28. Clustered Result: android/net/Uri buildUpon
      • Weka outputs the hierarchical clustering result in Newick format; software such as Dendroscope can visualize it: http://ab.inf.uni-tuebingen.de/software/dendroscope/
      • Some clustering results have obvious clusters, which may imply an obvious usage pattern for this method

  29. Clustered Result: android/location/Location <init>
      • Trick: some analysis results show many redundant column keys (related APIs); filter out those with no obvious relation (such as APIs related only once) before sending the data for clustering
      • It may take several trials to find the best cut-off branch

  30. Clustered Result: android/location/Geocoder <init>
      • The resulting Newick output can be parsed back to get the list of related APIs for each cluster
      • There may be many unique usage patterns like this one, which are probably either error patterns or special usages
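Parsing a Newick string back into per-cluster API lists could be done roughly as below. This sketch cuts the tree directly under the root and returns the leaf labels of each branch; the class name is hypothetical, quoted labels are not handled, and the exact labels Weka emits are not shown in the talk, so the sample input is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a tiny recursive-descent reader for unquoted Newick strings.
class NewickLeavesSketch {
    private final String text;
    private int pos = 0;

    private NewickLeavesSketch(String text) {
        this.text = text;
    }

    // Returns one list of leaf labels per child of the root node.
    static List<List<String>> topLevelClusters(String newick) {
        NewickLeavesSketch parser = new NewickLeavesSketch(newick.trim());
        List<List<String>> clusters = new ArrayList<>();
        if (!parser.consumeIf('(')) {
            clusters.add(parser.parseSubtree());    // a tree that is a single leaf
            return clusters;
        }
        do {
            clusters.add(parser.parseSubtree());
        } while (parser.consumeIf(','));
        if (!parser.consumeIf(')')) {
            throw new IllegalArgumentException("missing ')' at position " + parser.pos);
        }
        return clusters;
    }

    // Parses one subtree and returns every leaf label below it.
    private List<String> parseSubtree() {
        List<String> leaves = new ArrayList<>();
        if (consumeIf('(')) {
            do {
                leaves.addAll(parseSubtree());
            } while (consumeIf(','));
            if (!consumeIf(')')) {
                throw new IllegalArgumentException("missing ')' at position " + pos);
            }
            readLabel();                            // optional internal node label, ignored
        } else {
            leaves.add(readLabel());
        }
        if (consumeIf(':')) {
            readLabel();                            // branch length, ignored
        }
        return leaves;
    }

    // Reads characters up to the next Newick delimiter.
    private String readLabel() {
        int start = pos;
        while (pos < text.length() && "(),:;".indexOf(text.charAt(pos)) < 0) {
            pos++;
        }
        return text.substring(start, pos).trim();
    }

    private boolean consumeIf(char expected) {
        if (pos < text.length() && text.charAt(pos) == expected) {
            pos++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String tree = "((android/net/Uri parse:0.1,java/lang/String indexOf:0.2):0.3,"
                + "java/lang/StringBuilder append:0.4);";
        System.out.println(topLevelClusters(tree));
    }
}
```

Cutting at the root is the simplest choice; cutting at a chosen branch length instead would correspond to the cut-off trials mentioned on slide 29.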

  31. Mining on the Analysis Data
      • By tracing back the resulting Newick tree, we get the related APIs for each method
      • Interesting results:
        - Package Analysis: the clustering result shows that the classes of the android/location package are strongly interconnected. Classes like GpsSatellite, Location and LocationManager rely heavily on each other.
        - Complete APIs Relationship: perform clustering on all the useful data and retrieve the API relationships from each cluster after identifying its usage pattern. This can eventually yield a suggestion list of 'good' usage patterns to apply when the respective API methods are called.
        - Identify Good and Error Usage Pattern: with the app's name as identifier, we can trace back to information about the nature of the app (its download count, rating, popularity) to determine the features and the possible good or bad patterns of the resulting clusters.

  32. Mining on the Analysis Data
      Android API vs Java Library
      • Many Android APIs have data flow relations with Java library methods; by digging into the methods' clusters, we can discover some obvious usage patterns that combine Android APIs and Java library methods
      App's Permissions vs Method Invocation
      • Android permissions are stored in AndroidManifest.xml, which is a binary-encoded XML file inside the .apk package. It can be decoded and read with a tool such as APK-tool (https://code.google.com/p/android-apktool/)
      • Android permissions are declared statically at compile time and cannot be granted dynamically (at run time)
      • So with the AST generated from the bytecode, we can traverse the AST to check the correctness of an app's declared permissions
      • For example, an app may declare BroadcastReceiver permissions although no related methods or functions are found in the AST
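The permission check on slide 32 could be sketched as a comparison between the declared permissions and the API methods found in the AST. Everything here is illustrative: the class name is hypothetical, and the permission-to-package mapping is a made-up example rather than an authoritative table of which APIs require which permission.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: flags declared permissions with no matching API invocation.
class PermissionCheckSketch {

    // permissionToPrefix maps a permission name to an API name prefix that would justify it.
    static List<String> unusedPermissions(List<String> declared,
                                          Set<String> invokedApis,
                                          Map<String, String> permissionToPrefix) {
        List<String> unused = new ArrayList<>();
        for (String permission : declared) {
            String prefix = permissionToPrefix.get(permission);
            if (prefix == null) {
                continue;                           // no mapping known, so no verdict
            }
            boolean used = false;
            for (String api : invokedApis) {
                if (api.startsWith(prefix)) {
                    used = true;
                    break;
                }
            }
            if (!used) {
                unused.add(permission);
            }
        }
        return unused;
    }

    public static void main(String[] args) {
        // Illustrative mapping only; a real tool would need a verified permission map.
        Map<String, String> mapping = Map.of(
                "android.permission.ACCESS_FINE_LOCATION", "android/location/",
                "android.permission.INTERNET", "java/net/");
        List<String> declared = List.of(
                "android.permission.ACCESS_FINE_LOCATION",
                "android.permission.INTERNET");
        Set<String> invoked = Set.of("android/location/LocationManager getLastKnownLocation");
        System.out.println(unusedPermissions(declared, invoked, mapping));
    }
}
```

A permission reported here is only a candidate for review, because reflection or native code can invoke APIs that a static traversal of the AST would not see.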

  33. Thank you very much
